Sampling and Text Processing from Online Libraries (decomposer.py)
With the decomposer module, one can sample random documents (books, etc.) from Project Gutenberg and Archive.org and recombine their texts using Markov chain generation, the cut-up technique, or by swapping words of a given part of speech between two texts.
- generativepoetry.decomposer.cutup(input, min_cutout_words=3, max_cutout_words=7) → List[str]
  Simulates William S. Burroughs’ and Brion Gysin’s cut-up technique by separating an input text into non-whitespace blocks of text and then randomly grouping those into cut-outs containing between the minimum and maximum number of words.
  Arguments:
  - input (str) – input string to be cut up
  - min_cutout_words (int) – minimum number of words in a cut-out chunk
  - max_cutout_words (int) – maximum number of words in a cut-out chunk
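For example, a minimal sketch (the input string is arbitrary and the grouping is random):

>>> from generativepoetry.decomposer import cutup
>>> text = "A lone gull wheeled above the harbor while the tide crept in over the stones."
>>> for chunk in cutup(text, min_cutout_words=2, max_cutout_words=4):
...     print(chunk)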
- generativepoetry.decomposer.get_gutenberg_document(url) → str
  Downloads a document (book, etc.) from Project Gutenberg and returns it as a string.
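A minimal sketch; the Project Gutenberg URL below is illustrative, and any valid Gutenberg document URL should work the same way:

>>> from generativepoetry.decomposer import get_gutenberg_document
>>> text = get_gutenberg_document('https://www.gutenberg.org/ebooks/11')
>>> print(text[:200])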
- generativepoetry.decomposer.get_internet_archive_document(url) → str
  Downloads a document (book, etc.) from the Internet Archive and returns it as a string. The linked document must have a text version; PDF text extraction is not supported at this time.
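A minimal sketch; the URL below is a placeholder, so substitute an Internet Archive item that actually has a text version:

>>> from generativepoetry.decomposer import get_internet_archive_document
>>> text = get_internet_archive_document('https://archive.org/details/<item-identifier>')
>>> print(text[:200])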
- generativepoetry.decomposer.markov(input: input_type, ngram_size=1, num_output_sentences=5) → List[str]
  Markov chain text generation using the markovify library; supports a custom n-gram size.
  Arguments:
  - input – input text to train the Markov model on
  - ngram_size (int) – n-gram size of the Markov model (default: 1)
  - num_output_sentences (int) – number of sentences to generate (default: 5)
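For example, a sketch that trains on a downloaded document (the parameter values are illustrative):

>>> from generativepoetry.decomposer import markov, random_gutenberg_document
>>> source_text = random_gutenberg_document()
>>> for sentence in markov(source_text, ngram_size=2, num_output_sentences=3):
...     print(sentence)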
- generativepoetry.decomposer.random_gutenberg_document(language_filter='en') → str
  Downloads a random document (book, etc.) from Project Gutenberg and returns it as a string.
  Keyword arguments:
  - language_filter (str) – restrict the random document to a particular language (default: English)
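For example:

>>> from generativepoetry.decomposer import random_gutenberg_document
>>> document = random_gutenberg_document(language_filter='en')
>>> print(document[:200])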
- generativepoetry.decomposer.reconcile_replacement_word(original_word_with_ws, original_word_tag, replacement_word, replacement_word_tag) → str
  Modifies the replacement word if needed to fix subject/verb agreement, and preserves the whitespace, or lack thereof, before and after the original word.
  Arguments:
  - original_word_with_ws (str) – original word with surrounding whitespace
  - original_word_tag (str) – part-of-speech tag of the original word
  - replacement_word (str) – word that is replacing the original word
  - replacement_word_tag (str) – part-of-speech tag of the replacement word
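A brief sketch, assuming part-of-speech tags of the kind spaCy produces; the words, tags, and expected adjustment here are illustrative rather than taken from the library's documentation:

>>> from generativepoetry.decomposer import reconcile_replacement_word
>>> # A plural original and a singular replacement; the replacement may be adjusted
>>> # to agree with the original word, and its surrounding whitespace is preserved.
>>> reconcile_replacement_word(' cats ', 'NNS', 'dog', 'NN')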
- generativepoetry.decomposer.swap_parts_of_speech(text1, text2, parts_of_speech=['ADJ', 'NOUN']) → (str, str)
  Swaps all the words of certain parts of speech in one text with those of the same part of speech in another text.
  Keyword arguments:
  - parts_of_speech (list) – list of part-of-speech tags to swap out; must be valid spaCy part-of-speech tags
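For example, using the default parts of speech (adjectives and nouns); the two input strings are arbitrary:

>>> from generativepoetry.decomposer import swap_parts_of_speech
>>> text1 = 'The red fox jumps over the lazy dog.'
>>> text2 = 'A silver moon hangs above the quiet sea.'
>>> swapped1, swapped2 = swap_parts_of_speech(text1, text2)
>>> print(swapped1)
>>> print(swapped2)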
- generativepoetry.decomposer.validate_url(url, expected_netloc='')
  Validates that the provided string is indeed a URL from the anticipated source.
  Keyword arguments:
  - expected_netloc (str) – the expected site the URL should be from, e.g. archive.org or gutenberg.org
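A quick sketch; note that how a failed validation is reported (return value versus exception) is not specified above:

>>> from generativepoetry.decomposer import validate_url
>>> validate_url('https://www.gutenberg.org/ebooks/11', expected_netloc='gutenberg.org')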
The ParsedText class has several methods for random sampling.
- class generativepoetry.decomposer.ParsedText(text)
  - random_paragraph(minimum_sentences=3) → str
    Returns a random paragraph from the text.
    Keyword arguments:
    - minimum_sentences (int) – minimum number of sentences the returned paragraph must contain
  - random_sentence(minimum_tokens=1) → str
    Returns a random sentence from the text.
    Keyword arguments:
    - minimum_tokens (int) – minimum number of NLP tokens the returned sentence must contain
  - random_sentences(num=5, minimum_tokens=1) → list
    Returns a list of random sentences from the text.
    Keyword arguments:
    - num (int) – number of sentences to return (default: 5)
    - minimum_tokens (int) – minimum number of NLP tokens each returned sentence must contain
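A short end-to-end sketch combining the downloaders above with ParsedText; the sampling is random, so results vary between runs:

>>> from generativepoetry.decomposer import ParsedText, random_gutenberg_document
>>> parsed = ParsedText(random_gutenberg_document())
>>> print(parsed.random_sentence(minimum_tokens=8))
>>> print(parsed.random_sentences(num=3, minimum_tokens=5))
>>> print(parsed.random_paragraph(minimum_sentences=2))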