corpora.dictionary – Construct word<->id mappings

This module implements the concept of Dictionary – a mapping between words and their integer ids.

Dictionaries can be created from a corpus and later pruned according to document frequency (removing very rare or very common words via the Dictionary.filterExtremes() method), and saved to or loaded from disk via the Dictionary.save() and Dictionary.load() methods.

class gensim.corpora.dictionary.Dictionary

Dictionary encapsulates mappings between normalized words and their integer ids.

The main function is doc2bow, which converts a collection of words to its bag-of-words representation, optionally also updating the dictionary mapping with new words and their ids.

doc2bow(document, allowUpdate=False)

Convert document (a list of words) into the bag-of-words format: a list of (tokenId, tokenCount) 2-tuples. Each word is assumed to be a tokenized and normalized utf-8 encoded string.

If allowUpdate is set, also update the dictionary in the process by creating ids for new words. At the same time, update document frequencies: for each word appearing in this document, increase its self.docFreq by one.

If allowUpdate is not set, this function is const, i.e. read-only.
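The behaviour described above can be sketched in plain Python. This is only an illustration, not gensim's implementation: the ToyDictionary class and its token2id attribute are hypothetical names, keying docFreq by word id is an assumption, and so is the choice to silently skip unknown words when allowUpdate is not set.

```python
from collections import Counter


class ToyDictionary:
    """Illustrative stand-in for the mapping logic; not gensim's Dictionary."""

    def __init__(self):
        self.token2id = {}  # word -> integer id (hypothetical attribute name)
        self.docFreq = {}   # word id -> number of documents containing the word (assumed keying)

    def doc2bow(self, document, allowUpdate=False):
        counts = Counter(document)  # word -> occurrences within this document
        bow = []
        for word, count in counts.items():
            if word not in self.token2id:
                if not allowUpdate:
                    continue  # read-only mode: unknown words are skipped (assumption)
                self.token2id[word] = len(self.token2id)  # next free id
            bow.append((self.token2id[word], count))
        if allowUpdate:
            # Each distinct word in the document raises its document frequency by one.
            for word in counts:
                word_id = self.token2id[word]
                self.docFreq[word_id] = self.docFreq.get(word_id, 0) + 1
        return sorted(bow)


toy = ToyDictionary()
bow_first = toy.doc2bow("máma mele maso".split(), allowUpdate=True)
bow_second = toy.doc2bow("ema má máma máma".split())  # read-only call
```

In this sketch the second call leaves the mapping untouched and only reports the one word it already knows, counted twice.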

filterExtremes(noBelow=5, noAbove=0.5)

Filter out tokens that appear in

  1. fewer than noBelow documents (absolute number) or
  2. more than noAbove documents (fraction of total corpus size, not absolute number).

After pruning, shrink the resulting gaps in word ids.

Note: The same word may have a different word id before and after the call to this function!
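The two thresholds and the id shift mentioned in the note can be illustrated with a small standalone function. This is a sketch under assumptions, not the library's code: filter_extremes_sketch, its arguments token2id, doc_freq and num_docs, and the rule of renumbering survivors in order of their old ids are all hypothetical.

```python
def filter_extremes_sketch(token2id, doc_freq, num_docs, noBelow=5, noAbove=0.5):
    """Keep tokens whose document frequency df satisfies
    noBelow <= df <= noAbove * num_docs, then renumber ids without gaps."""
    kept = sorted(
        (old_id, word)
        for word, old_id in token2id.items()
        if noBelow <= doc_freq[old_id] <= noAbove * num_docs
    )
    new_token2id = {}
    new_doc_freq = {}
    for new_id, (old_id, word) in enumerate(kept):
        new_token2id[word] = new_id              # ids are reassigned from zero
        new_doc_freq[new_id] = doc_freq[old_id]  # frequencies follow the new ids
    return new_token2id, new_doc_freq


# Toy corpus statistics: 10 documents in total (made-up numbers).
token2id = {"the": 0, "cat": 1, "sat": 2, "xylophone": 3}
doc_freq = {0: 10, 1: 4, 2: 3, 3: 1}
new_token2id, new_doc_freq = filter_extremes_sketch(
    token2id, doc_freq, num_docs=10, noBelow=2, noAbove=0.5
)
```

Here "the" is dropped as too common and "xylophone" as too rare, so "cat" moves from id 1 to id 0. Any bag-of-words vectors built before the call would therefore need to be recomputed.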

filterTokens(badIds)

Remove the selected tokens from all dictionary mappings.

badIds is a collection of word ids to be removed.

static fromDocuments(documents)

Build a dictionary from a collection of documents. Each document is a list of tokens (i.e. tokenized and normalized utf-8 encoded strings).

This is only a convenience wrapper for calling doc2bow on each document with allowUpdate=True.

>>> print(Dictionary.fromDocuments(["máma mele maso".split(), "ema má máma".split()]))
Dictionary(5 unique tokens)

classmethod load(fname)

Load a previously saved object from file (also see save).

rebuildDictionary()

Assign new word ids to all words.

This makes the ids more compact, e.g. after some tokens have been removed via filterTokens(), leaving gaps in the id series. Calling this method removes the gaps.
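How gaps arise and how they are closed can be shown on a bare word-to-id dict. The helpers filter_tokens_sketch and rebuild_sketch are hypothetical, and preserving the relative order of the surviving ids is an assumption; the library may assign the new ids in a different order.

```python
def filter_tokens_sketch(token2id, badIds):
    """Drop every word whose id is listed in badIds; remaining ids are unchanged."""
    bad = set(badIds)
    return {word: word_id for word, word_id in token2id.items() if word_id not in bad}


def rebuild_sketch(token2id):
    """Reassign ids 0..n-1, keeping the relative order of the old ids (assumption)."""
    ordered = sorted(token2id.items(), key=lambda item: item[1])
    return {word: new_id for new_id, (word, _old_id) in enumerate(ordered)}


token2id = {"ema": 0, "má": 1, "máma": 2, "mele": 3, "maso": 4}
with_gaps = filter_tokens_sketch(token2id, badIds=[1, 3])  # ids 1 and 3 are now missing
compact = rebuild_sketch(with_gaps)                        # ids run 0, 1, 2 again
```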

save(fname)

Save the object to file via pickling (also see load).
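Since saving is described as pickling, the save/load round trip can be approximated with the standard pickle module. This is a minimal sketch, not the library's code: save_sketch and load_sketch are hypothetical helpers, a plain dict stands in for the Dictionary object, and the file name is made up.

```python
import os
import pickle
import tempfile


def save_sketch(obj, fname):
    """Serialize obj to fname using pickle."""
    with open(fname, "wb") as fout:
        pickle.dump(obj, fout, protocol=pickle.HIGHEST_PROTOCOL)


def load_sketch(fname):
    """Read back an object previously written by save_sketch."""
    with open(fname, "rb") as fin:
        return pickle.load(fin)


mapping = {"máma": 0, "mele": 1, "maso": 2}  # stand-in for a Dictionary
with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, "toy.dict")  # hypothetical file name
    save_sketch(mapping, fname)
    restored = load_sketch(fname)
```

As with any pickle-based format, files should only be loaded from trusted sources, because unpickling can execute arbitrary code.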
