nicbet/essence
{ "createdAt": "2016-08-29T02:07:36Z", "defaultBranch": "master", "description": "Essence is a library for Natural Language Processing and Text Summarization in Elixir.", "fullName": "nicbet/essence", "homepage": null, "language": "Elixir", "name": "essence", "pushedAt": "2021-04-09T17:54:17Z", "stargazersCount": 64, "topics": [], "updatedAt": "2024-07-08T13:04:06Z", "url": "https://github.com/nicbet/essence"}
Essence
Essence is a Natural Language Processing (NLP) and Text Summarization library for Elixir. The work is currently at a very early stage.
- Tokenization (Basic, done)
- Sentence Detection and Chunking (Basic, done)
- Vocabulary (Basic, done)
- Documents (Draft, done)
- Concordance (done)
- Readability (ARI done, SMOG done, FC todo, GF done, DC done, CL done)
- Reading Time estimates (how long would it take somebody to read the given text, useful for blog posts / articles)
- Speaking Time estimates (how long would it take somebody to present the given content, useful for speeches, presentations)
- Text Corpora
- Bi-Grams
- Tri-Grams
- n-Grams
- Stopwords for English
- Common Names in English (male, female, ambiguous)
- Dictionary words in English
- Dale-Chall’s dictionary of easy English words
- Frequency Measures: TF, TF/IDF, …
- Time-Series Documents
- Dispersion
- Similarity Measures
- Part of Speech Tagging
- Sentiment Analysis
- Classification
- Summarization
- Document Hierarchies
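The bi-gram, tri-gram and n-gram items in the list above all come down to sliding a fixed-size window over a list of tokens. The following is a minimal sketch in plain Elixir, assuming the tokens are already available as a list. The module name `NGramSketch` and its functions are hypothetical illustrations and are not part of Essence.

```elixir
defmodule NGramSketch do
  # Hypothetical illustration only; this is not Essence's API.

  # Slides a window of size n over the tokens, advancing one token at a time,
  # and discards trailing windows that contain fewer than n tokens.
  def ngrams(tokens, n) when is_integer(n) and n > 0 do
    Enum.chunk_every(tokens, n, 1, :discard)
  end

  # Bi-grams and tri-grams are the n = 2 and n = 3 special cases.
  def bigrams(tokens), do: ngrams(tokens, 2)
  def trigrams(tokens), do: ngrams(tokens, 3)
end

# Whitespace splitting stands in for a real tokenizer here.
tokens = String.split("In the beginning God created the heaven and the earth")
IO.inspect(NGramSketch.bigrams(tokens) |> Enum.take(3))
```

Representing each n-gram as a list keeps the sketch short; tuples would work equally well if the n-grams are meant to be used as map keys.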
Installation
If available in Hex, the package can be installed as follows:
- Add `essence` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [{:essence, "~> 0.2.0"}]
end
```

Examples
In the following examples we will use `test/genesis.txt`, which is a copy of
the Book of Genesis from the King James Bible
(http://www.gutenberg.org/ebooks/8001.txt.utf-8).
We provide a convenience function, `Essence.genesis/1`, for reading the plain
text of the Book of Genesis into Essence.
Let’s first create a document from the text:

    iex> document = Essence.Document.from_text Essence.genesis

We can see that the text contains 1,533 paragraphs, 1,663 sentences and 44,741 tokens.

    iex> document |> Essence.Document.enumerate_tokens |> Enum.count
    iex> document |> Essence.Document.paragraphs |> Enum.count
    iex> document |> Essence.Document.sentences |> Enum.count

What might the first sentence of Genesis be?

    iex> Essence.Document.sentence document, 0

Now let’s compute the frequency distribution for tokens in the Book of Genesis:

    iex> fd = Essence.Vocabulary.freq_dist document

What is the vocabulary of this text?

    iex> vocabulary = Essence.Vocabulary.vocabulary document

Alternatively, we can obtain the same vocabulary from the keys of the frequency distribution:

    iex> vocabulary = Map.keys fd

What might the top 10 most frequent tokens be?
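
    iex> vocabulary |> Enum.sort_by( fn(x) -> Map.get(fd, x) end, &>=/2 ) |> Enum.slice(1, 10)
    ["and", "the", "of", ".", "And", ":", "his", "he", "to", ";"]

Note that `Enum.slice(1, 10)` starts at index 1, so the single most frequent token is not included in this list.

Next, we can compute the lexical richness of the text: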

    iex> Essence.Vocabulary.lexical_richness document
    16.74438622754491

Let’s get a concordance view on ‘Adam’:

    iex> Essence.Document.concordance(document, "Adam")
    nd brought them unto Adam to see what he would
    hem : and whatsoever Adam called every living c
    e name thereof . And Adam gave names to all cat
     the field ; but for Adam there was not found a
    p sleep to fall upon Adam , and he slept : and
    r unto the man . And Adam said , This is now bo
    ool of the day : and Adam and his wife hid them
    LORD God called unto Adam , and said unto him ,
    over thee . And unto Adam he said , Because tho
    lt thou return . And Adam called his wife's nam
    of all living . Unto Adam also and to his wife
    e tree of life . And Adam knew Eve his wife ; a
     and sevenfold . And Adam knew his wife again ;
    f the generations of Adam . In the day that God
    nd called their name Adam , in the day when the
    y were created . And Adam lived an hundred and
    th : And the days of Adam after he had begotten
    nd all the days that Adam lived were nine hundr
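
For readers curious about what these calls compute, here is a minimal sketch in plain Elixir that does not use Essence at all. The module name `ExampleText` and its functions are hypothetical, and the whitespace tokenizer is a simplification. Treating lexical richness as the number of tokens divided by the number of distinct tokens is an assumption, not a confirmed description of Essence's definition. The concordance here works on whole tokens, whereas the view above is cut to a fixed character width.

```elixir
defmodule ExampleText do
  # Hypothetical helper module for illustration; not part of Essence.

  # Naive tokenizer: splits on whitespace only (Essence's tokenizer may differ).
  def tokens(text), do: String.split(text)

  # Frequency distribution: a map from each token to its occurrence count.
  def freq_dist(tokens), do: Enum.frequencies(tokens)

  # The n most frequent tokens, most frequent first.
  def top(fd, n) do
    fd
    |> Enum.sort_by(fn {_token, count} -> count end, :desc)
    |> Enum.take(n)
    |> Enum.map(fn {token, _count} -> token end)
  end

  # Assumed definition: number of tokens divided by number of distinct tokens.
  def lexical_richness(tokens) do
    length(tokens) / (tokens |> Enum.uniq() |> length())
  end

  # Keyword-in-context view: every occurrence of `word`, joined with up to
  # `width` tokens of context on either side.
  def concordance(tokens, word, width \\ 3) do
    tokens
    |> Enum.with_index()
    |> Enum.filter(fn {token, _index} -> token == word end)
    |> Enum.map(fn {_token, index} ->
      start = max(index - width, 0)
      tokens |> Enum.slice(start, index - start + width + 1) |> Enum.join(" ")
    end)
  end
end

tokens = ExampleText.tokens("And Adam called his wife's name Eve ; and Adam knew Eve his wife")
fd = ExampleText.freq_dist(tokens)
IO.inspect(ExampleText.top(fd, 3))
IO.inspect(ExampleText.lexical_richness(tokens))
Enum.each(ExampleText.concordance(tokens, "Adam", 2), &IO.puts/1)
```

`Enum.frequencies/1` and the `:desc` sorter require Elixir 1.10 or later. Tokens with equal counts come out in no particular order, so the result of `top/2` is only fully determined when the counts differ.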