Sampling and Text Processing from Online Libraries (decomposer.py)


With the decomposer module, one can sample random documents (books, etc.) from Project Gutenberg and Archive.org and rearrange their texts using Markov chain algorithms, the cut-up technique, or by swapping instances of a part of speech between two texts.


generativepoetry.decomposer.cutup(input, min_cutout_words=3, max_cutout_words=7) → List[str]

Simulates William S. Burroughs’ and Brion Gysin’s cut-up technique by separating an input text into non-whitespace blocks of text and then randomly grouping those blocks into cut-outs whose length in words falls between the given minimum and maximum.

Arguments:

input (str) – input string to be cut up
min_cutout_words (int) – minimum number of words in a cut-out chunk
max_cutout_words (int) – maximum number of words in a cut-out chunk
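The idea can be sketched with the standard library alone. This is a minimal illustration of the cut-up technique, not the library's implementation, and the helper name `cutup_sketch` is invented here:

```python
import random

def cutup_sketch(text, min_cutout_words=3, max_cutout_words=7):
    """Split the text on whitespace, then group the words into
    randomly sized chunks (cut-outs) between the two bounds."""
    words = text.split()  # non-whitespace blocks
    cutouts = []
    i = 0
    while i < len(words):
        size = random.randint(min_cutout_words, max_cutout_words)
        cutouts.append(" ".join(words[i:i + size]))
        i += size
    return cutouts

chunks = cutup_sketch(
    "the quick brown fox jumps over the lazy dog again and again", 2, 4)
```

Every chunk except possibly the last contains between `min_cutout_words` and `max_cutout_words` words, and shuffling `chunks` before rejoining yields the rearranged text.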

generativepoetry.decomposer.get_gutenberg_document(url) → str

Downloads a document (book, etc.) from Project Gutenberg and returns it as a string.

generativepoetry.decomposer.get_internet_archive_document(url) → str

Downloads a document (book, etc.) from Internet Archive and returns it as a string. The linked document must have a text version; PDF text extraction is not supported at this time.

generativepoetry.decomposer.markov(input: input_type, ngram_size=1, num_output_sentences=5) → List[str]

Markov chain text generation using the markovify library; supports a custom n-gram size.

Arguments:

input (str or list) – input text or list of input texts
ngram_size (int) – n-gram size (state size) of the Markov model
num_output_sentences (int) – number of sentences to generate
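The underlying technique can be sketched as a word-level Markov chain. This is an illustrative toy, not markovify's implementation (which works at the sentence level); the helper name `markov_sketch` is invented here:

```python
import random
from collections import defaultdict

def markov_sketch(text, ngram_size=1, num_output_words=12):
    """Build a word-level Markov model: each n-gram maps to the list
    of words observed to follow it, then walk the chain."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - ngram_size):
        key = tuple(words[i:i + ngram_size])
        model[key].append(words[i + ngram_size])
    state = tuple(words[:ngram_size])
    output = list(state)
    for _ in range(num_output_words):
        choices = model.get(state)
        if not choices:
            break  # dead end: this n-gram only occurs at the text's end
        output.append(random.choice(choices))
        state = tuple(output[-ngram_size:])
    return " ".join(output)
```

A larger `ngram_size` makes the output cling more closely to the source phrasing, at the cost of variety; markovify's `state_size` parameter plays the same role.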

generativepoetry.decomposer.random_gutenberg_document(language_filter='en') → str

Downloads a random document (book, etc.) from Project Gutenberg and returns it as a string.

Keyword arguments:

language_filter (str) – restrict the random document to a particular language (default: English)

generativepoetry.decomposer.reconcile_replacement_word(original_word_with_ws, original_word_tag, replacement_word, replacement_word_tag) → str

Modifies the replacement word if needed to fix subject/verb agreement, and preserves the whitespace, or lack thereof, before and after the original word.

Arguments:

original_word_with_ws (str) – original word with surrounding whitespace
original_word_tag (str) – part-of-speech tag of the original word
replacement_word (str) – word that is replacing the original word
replacement_word_tag (str) – part-of-speech tag of the replacement word
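The whitespace-preservation step can be sketched on its own (subject/verb agreement is omitted; the helper name `preserve_whitespace_sketch` is invented here, and this is not the library's implementation):

```python
def preserve_whitespace_sketch(original_word_with_ws, replacement_word):
    """Copy the leading and trailing whitespace of the original token
    onto the replacement word."""
    s = original_word_with_ws
    leading = s[:len(s) - len(s.lstrip())]
    # Guard against an all-whitespace token, where lstrip/rstrip both
    # return "" and the whitespace would otherwise be duplicated.
    trailing = s[len(s.rstrip()):] if s.rstrip() else ""
    return leading + replacement_word + trailing
```

For example, replacing `" cat "` with `"dog"` yields `" dog "`, so the swapped word slots back into the surrounding text without disturbing spacing or line breaks.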

generativepoetry.decomposer.swap_parts_of_speech(text1, text2, parts_of_speech=['ADJ', 'NOUN']) → (str, str)

Swap all the words of certain parts of speech from one text with those (with the same part of speech) from another text.

Keyword arguments:

parts_of_speech (list) – list of part-of-speech tags to swap out. Must be from the list provided by spaCy:

https://spacy.io/api/annotation#pos-tagging
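The swap can be sketched without spaCy by supplying the part-of-speech tags as a plain lookup table. This is an illustrative stand-in (the real function tags words with spaCy), and the names `swap_pos_sketch` and `pos_lookup` are invented here:

```python
def swap_pos_sketch(text1, text2, pos_lookup, parts_of_speech=("NOUN",)):
    """Walk both texts, find the positions of words whose tag is in
    parts_of_speech, and exchange them pairwise between the texts."""
    words1, words2 = text1.split(), text2.split()
    idx1 = [i for i, w in enumerate(words1) if pos_lookup.get(w) in parts_of_speech]
    idx2 = [i for i, w in enumerate(words2) if pos_lookup.get(w) in parts_of_speech]
    for i, j in zip(idx1, idx2):
        words1[i], words2[j] = words2[j], words1[i]
    return " ".join(words1), " ".join(words2)

pos = {"cat": "NOUN", "mat": "NOUN", "dog": "NOUN", "bone": "NOUN"}
a, b = swap_pos_sketch("the cat sat on the mat", "a dog chewed a bone", pos)
# a and b now have their nouns exchanged
```

Because `zip` stops at the shorter index list, leftover words in the text with more matches simply stay put.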

generativepoetry.decomposer.validate_url(url, expected_netloc='')

Validates that the provided string is a URL from the anticipated source.

Keyword arguments:

expected_netloc (str) – the expected site the URL should be from, e.g. archive.org or gutenberg.org
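A validation of this shape can be sketched with the standard library's urllib.parse; this is an assumption about the check's behavior, not the library's implementation, and `validate_url_sketch` is an invented name:

```python
from urllib.parse import urlparse

def validate_url_sketch(url, expected_netloc=""):
    """The string must parse with a scheme and a network location, and
    the network location must contain the expected site if given."""
    parsed = urlparse(url)
    if not (parsed.scheme and parsed.netloc):
        raise ValueError(f"{url!r} is not a valid URL")
    if expected_netloc and expected_netloc not in parsed.netloc:
        raise ValueError(f"{url!r} is not a {expected_netloc} URL")
```

For example, `validate_url_sketch("https://www.gutenberg.org/ebooks/1342", "gutenberg.org")` passes, while a bare string or a URL from another site raises `ValueError`.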

The ParsedText class has several methods for random sampling.

class generativepoetry.decomposer.ParsedText(text)

random_paragraph(minimum_sentences=3) → str

Returns a random paragraph from the text.

Keyword arguments:

minimum_sentences (int) – minimum number of sentences the sampled paragraph must have

random_sentence(minimum_tokens=1) → str

Returns a random sentence from the text.

Keyword arguments:

minimum_tokens (int) – minimum number of NLP tokens the sampled sentence must have

random_sentences(num=5, minimum_tokens=1) → list

Returns a list of random sentences from the text.

Keyword arguments:

num (int) – number of sentences to return
minimum_tokens (int) – minimum number of NLP tokens each sampled sentence must have
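The sampling behind these methods can be sketched as a filter-then-choose step. This is an illustration of the idea only (the library counts NLP tokens via a parser, whereas the sketch counts whitespace tokens), and `random_sentence_sketch` is an invented name:

```python
import random

def random_sentence_sketch(sentences, minimum_tokens=1):
    """Keep only sentences with at least minimum_tokens tokens,
    then pick one of the survivors uniformly at random."""
    eligible = [s for s in sentences if len(s.split()) >= minimum_tokens]
    return random.choice(eligible)

sents = ["Hi.", "The rain fell all night.", "Stop."]
pick = random_sentence_sketch(sents, minimum_tokens=4)
```

Raising `minimum_tokens` filters out fragments and very short sentences before sampling; `random_sentences` repeats the same draw `num` times.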