nicbet/essence

Essence is a library for Natural Language Processing and Text Summarization in Elixir.

{
  "createdAt": "2016-08-29T02:07:36Z",
  "defaultBranch": "master",
  "description": "Essence is a library for Natural Language Processing and Text Summarization in Elixir.",
  "fullName": "nicbet/essence",
  "homepage": null,
  "language": "Elixir",
  "name": "essence",
  "pushedAt": "2021-04-09T17:54:17Z",
  "stargazersCount": 64,
  "topics": [],
  "updatedAt": "2024-07-08T13:04:06Z",
  "url": "https://github.com/nicbet/essence"
}


Essence

Essence is a Natural Language Processing (NLP) and Text Summarization library for Elixir. The work is currently in very early stages.

ToDo

Installation

If available in Hex, the package can be installed by adding essence to your list of dependencies in mix.exs:

```elixir
def deps do
  [{:essence, "~> 0.2.0"}]
end
```

Examples

In the following examples we will use test/genesis.txt, a copy of the Book of Genesis from the King James Bible (http://www.gutenberg.org/ebooks/8001.txt.utf-8).

For convenience, the function Essence.genesis/1 reads the plain text of the Book of Genesis into Essence.

Let’s first create a document from the text:

iex> document = Essence.Document.from_text Essence.genesis

We can see that the text contains 44,741 tokens, 1,533 paragraphs and 1,663 sentences:

iex> document |> Essence.Document.enumerate_tokens |> Enum.count
iex> document |> Essence.Document.paragraphs |> Enum.count
iex> document |> Essence.Document.sentences |> Enum.count

What might the first sentence of Genesis be?

iex> Essence.Document.sentence document, 0

Now let’s compute the frequency distribution for tokens in the Book of Genesis:

iex> fd = Essence.Vocabulary.freq_dist document

What is the vocabulary of this text?

iex> vocabulary = Essence.Vocabulary.vocabulary document

Alternatively, since the keys of the frequency distribution are the distinct tokens, the equivalent expression is:

iex> vocabulary = Map.keys fd
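
To make the idea concrete, here is a minimal sketch of a frequency distribution in plain Elixir. It is not the Essence implementation: the module FreqSketch and the sample token list are made up for illustration, and it assumes the distribution is a map from token to count, which is how fd is used with Map.keys above.

```elixir
defmodule FreqSketch do
  # Hypothetical helper, not part of Essence:
  # counts how often each token occurs in a list of tokens.
  def freq_dist(tokens) do
    Enum.reduce(tokens, %{}, fn token, acc ->
      Map.update(acc, token, 1, &(&1 + 1))
    end)
  end

  # The vocabulary is the set of distinct tokens, i.e. the keys of the map.
  def vocabulary(tokens) do
    tokens |> freq_dist() |> Map.keys()
  end
end

tokens = ["In", "the", "beginning", "God", "created", "the", "heaven", "and", "the", "earth", "."]
fd = FreqSketch.freq_dist(tokens)

IO.inspect(Map.get(fd, "the"), label: "count of \"the\"")
IO.inspect(length(FreqSketch.vocabulary(tokens)), label: "vocabulary size")
```

Building the map once and reading the vocabulary from its keys avoids a second pass over the tokens.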

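What might the most frequent tokens be? Note that Enum.slice(1, 10) starts at index 1, so the list below holds ranks 2 through 11 and omits the single most frequent token; use Enum.take(10) for the strict top 10.

iex> vocabulary |> Enum.sort_by(fn x -> Map.get(fd, x) end, &>=/2) |> Enum.slice(1, 10)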
["and", "the", "of", ".", "And", ":", "his", "he", "to", ";"]
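
As a hedged alternative, the same ranking can be done directly on the frequency map. The sketch below assumes fd is a plain map from token to count and uses the :desc sorter of Enum.sort_by/3, which requires Elixir 1.10 or later. TopTokens and the sample counts are hypothetical and not part of Essence.

```elixir
defmodule TopTokens do
  # Hypothetical helper: returns the n most frequent tokens of a
  # token => count map, most frequent first.
  def top(fd, n) do
    fd
    |> Enum.sort_by(fn {_token, count} -> count end, :desc)
    |> Enum.take(n)
    |> Enum.map(fn {token, _count} -> token end)
  end
end

# Made-up counts, chosen without ties so the order is unambiguous.
fd = %{"and" => 5, "the" => 7, "of" => 3, "." => 1}

IO.inspect(TopTokens.top(fd, 2))
# prints ["the", "and"]
```

Enum.take(n) starts at the first element, so no rank is skipped.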

Next, we can compute the lexical richness of the text:

iex> Essence.Vocabulary.lexical_richness document
16.74438622754491
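
The value above is consistent with lexical richness being the number of tokens divided by the number of distinct tokens: 44,741 tokens at a ratio of about 16.744 implies roughly 2,672 distinct tokens. That definition is inferred from the numbers, not confirmed from the Essence source, so treat the following as a sketch under that assumption. RichnessSketch is a hypothetical module.

```elixir
defmodule RichnessSketch do
  # Assumed definition (not confirmed against Essence):
  # total number of tokens divided by the number of distinct tokens.
  def lexical_richness(tokens) do
    length(tokens) / length(Enum.uniq(tokens))
  end
end

# 8 tokens, 5 of them distinct.
tokens = ~w(the cat and the dog and the bird)

IO.inspect(RichnessSketch.lexical_richness(tokens))
# prints 1.6
```

Under this reading, a higher value means each distinct token is repeated more often on average.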

Let’s get a concordance view on ‘Adam’:

iex> Essence.Document.concordance(document, "Adam")

nd brought them unto Adam to see what he would
hem : and whatsoever Adam called every living c
e name thereof . And Adam gave names to all cat
 the field ; but for Adam there was not found a
p sleep to fall upon Adam , and he slept : and
r unto the man . And Adam said , This is now bo
ool of the day : and Adam and his wife hid them
LORD God called unto Adam , and said unto him ,
over thee . And unto Adam he said , Because tho
lt thou return . And Adam called his wife's nam
of all living . Unto Adam also and to his wife
e tree of life . And Adam knew Eve his wife ; a
 and sevenfold . And Adam knew his wife again ;
f the generations of Adam . In the day that God
nd called their name Adam , in the day when the
y were created . And Adam lived an hundred and
th : And the days of Adam after he had begotten
nd all the days that Adam lived were nine hundr
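
A concordance of this kind is often called a keyword-in-context view: every occurrence of the word is shown with a fixed number of characters on either side. The sketch below shows one way to build such a view from a token list in plain Elixir. ConcordanceSketch, its width parameter, and the sample tokens are assumptions for illustration; this is not how Essence.Document.concordance is implemented.

```elixir
defmodule ConcordanceSketch do
  # Hypothetical keyword-in-context helper; not the Essence implementation.
  # Returns one line per occurrence of `word`, with up to `width`
  # characters of context on each side.
  def concordance(tokens, word, width \\ 20) do
    tokens
    |> Enum.with_index()
    |> Enum.filter(fn {token, _index} -> token == word end)
    |> Enum.map(fn {_token, index} ->
      left = tokens |> Enum.take(index) |> Enum.join(" ")
      right = tokens |> Enum.drop(index + 1) |> Enum.join(" ")

      # Keep only the last `width` characters of the left context and
      # right-align them so the keyword lines up in every row.
      start = max(String.length(left) - width, 0)
      left_context = left |> String.slice(start, width) |> String.pad_leading(width)
      right_context = String.slice(right, 0, width)

      left_context <> " " <> word <> " " <> right_context
    end)
  end
end

tokens = ~w(And Adam gave names to all cattle , and Adam slept)

tokens
|> ConcordanceSketch.concordance("Adam", 12)
|> Enum.each(&IO.puts/1)
```

Rebuilding the left and right context for every match keeps the code short but is quadratic in the worst case; for a long text, a single pass over the joined string would be a better fit.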