utils – Various utility functions

This module contains various general utility functions.

class gensim.utils.SaveLoad

Objects which inherit from this class have save/load functions, which un/pickle them to disk.

This uses cPickle for de/serializing, so objects must not contain unpicklable attributes, such as lambda functions.

classmethod load(fname)
Load a previously saved object from file (also see save).
save(fname)
Save the object to file via pickling (also see load).
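
The save/load round trip described above can be sketched with the standard pickle module. This is a minimal illustration, not gensim's own code: the names PickleSaveLoad, WordCounts and roundtrip are hypothetical, and the sketch assumes every attribute of the object is picklable.

```python
import os
import pickle
import tempfile


class PickleSaveLoad:
    """Hypothetical stand-in for a SaveLoad-style base class (not gensim's code)."""

    def save(self, fname):
        # Serialize the whole object to disk with pickle.
        with open(fname, "wb") as fout:
            pickle.dump(self, fout, protocol=pickle.HIGHEST_PROTOCOL)

    @classmethod
    def load(cls, fname):
        # Read the pickled object back from disk.
        with open(fname, "rb") as fin:
            return pickle.load(fin)


class WordCounts(PickleSaveLoad):
    """Hypothetical subclass holding only picklable attributes."""

    def __init__(self, counts):
        self.counts = dict(counts)


def roundtrip(obj):
    # Save to a temporary file, load it back, then remove the file.
    fd, fname = tempfile.mkstemp(suffix=".pkl")
    os.close(fd)
    try:
        obj.save(fname)
        return type(obj).load(fname)
    finally:
        os.remove(fname)
```

Because load is a classmethod, it is called on the class rather than on an instance, for example WordCounts.load(fname).
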
gensim.utils.deaccent(text)

Remove accentuation from the given string.

Input text is either a unicode string or a utf8-encoded bytestring. Returns the input string with accents removed, as unicode.

>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
u'Sef chomutovskych komunistu dostal postou bily prasek'
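
One common way to remove accents is Unicode decomposition followed by dropping the combining marks. The sketch below uses only the standard unicodedata module; the name deaccent_sketch is hypothetical, and whether gensim's deaccent works exactly this way is an assumption.

```python
import unicodedata


def deaccent_sketch(text, encoding="utf8"):
    """Strip accents via Unicode decomposition (illustrative, not gensim's code)."""
    if isinstance(text, bytes):
        # Byte strings are assumed to be utf8-encoded, as the description states.
        text = text.decode(encoding)
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn"), keeping the base letters.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever remains into canonical form.
    return unicodedata.normalize("NFC", stripped)
```
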
gensim.utils.dictFromCorpus(corpus)

Scan corpus for all word ids that appear in it, then construct and return a mapping which maps each wordId -> str(wordId).

This function is used whenever words need to be displayed (as opposed to just their ids) but no wordId->word mapping was provided. The resulting mapping only covers words actually used in the corpus, up to the highest wordId found.
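
A minimal sketch of such a placeholder mapping follows. It assumes a corpus is an iterable of documents, each a sequence of (wordId, weight) pairs, and that the mapping includes every id from 0 through the highest id found, even ids that never occur; the description above leaves that last point open. The name dict_from_corpus_sketch is hypothetical.

```python
def dict_from_corpus_sketch(corpus):
    """Map every id from 0 to the highest id seen onto its string form."""
    max_id = -1
    for document in corpus:
        for word_id, _weight in document:
            max_id = max(max_id, word_id)
    # Assumption: ids below the maximum are included even if they never occur.
    return {word_id: str(word_id) for word_id in range(max_id + 1)}
```
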

gensim.utils.getMaxId(corpus)

Return highest feature id that appears in the corpus.

For empty corpora (no features at all), return -1.
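
The behaviour described above, including the -1 result for empty corpora, can be sketched in a single expression. The sketch assumes documents are sequences of (featureId, value) pairs; get_max_id_sketch is a hypothetical name, not the library function.

```python
def get_max_id_sketch(corpus):
    """Return the highest feature id in the corpus, or -1 if there is none."""
    feature_ids = (
        feature_id
        for document in corpus
        for feature_id, _value in document
    )
    # default=-1 covers corpora with no documents or only empty documents.
    return max(feature_ids, default=-1)
```
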

gensim.utils.isCorpus(obj)

Check whether obj is a corpus.

NOTE: When called on an empty corpus (no documents), will return False.
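
One way such a check could work is to peek at the first document and test whether its first entry looks like an (id, value) pair. The sketch below assumes that shape for a corpus, which the text above does not spell out, and is_corpus_sketch is a hypothetical name. Note that peeking consumes the first item if obj is a one-shot iterator.

```python
def is_corpus_sketch(obj):
    """Heuristically check that obj yields documents of (id, value) pairs."""
    try:
        first_doc = next(iter(obj))
    except (TypeError, StopIteration):
        # Not iterable, or no documents at all: report False.
        return False
    try:
        for entry in first_doc:
            feature_id, value = entry
            # Both parts must be convertible to numbers.
            int(feature_id)
            float(value)
            break  # Inspecting the first entry is enough for this sketch.
    except (TypeError, ValueError):
        return False
    return True
```
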

gensim.utils.tokenize(text, lowercase=False, deacc=False, errors='strict', toLower=False, lower=False)

Iteratively yield tokens as unicode strings, optionally also lowercasing them and removing accent marks.

Input text may be either unicode or utf8-encoded byte string.

The tokens on output are maximal contiguous sequences of alphabetic characters (no digits!).

>>> list(tokenize('Nic nemůže letět rychlostí vyšší, než 300 tisíc kilometrů za sekundu!', deacc = True))
[u'Nic', u'nemuze', u'letet', u'rychlosti', u'vyssi', u'nez', u'tisic', u'kilometru', u'za', u'sekundu']
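
The tokenization rule above (maximal runs of alphabetic characters, digits excluded) can be sketched with a regular expression. This is an illustration using only the standard library, not gensim's implementation: tokenize_sketch and ALPHABETIC_RUN are hypothetical names, accent removal is done by Unicode decomposition, and passing errors to the utf8 decoding step is an assumption.

```python
import re
import unicodedata

# Maximal runs of letters: word characters that are neither digits nor "_".
ALPHABETIC_RUN = re.compile(r"[^\W\d_]+", re.UNICODE)


def tokenize_sketch(text, lowercase=False, deacc=False, errors="strict"):
    """Lazily yield alphabetic tokens from text (illustrative sketch)."""
    if isinstance(text, bytes):
        # Assumption: `errors` controls how undecodable bytes are handled.
        text = text.decode("utf8", errors)
    if lowercase:
        text = text.lower()
    if deacc:
        # Decompose, drop combining marks, then recompose.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
        text = unicodedata.normalize("NFC", text)
    for match in ALPHABETIC_RUN.finditer(text):
        yield match.group()
```

Since the function is a generator, wrap the call in list() to see all tokens at once, as in the example above.
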
