L15 - NLP Flashcards
Define Bayesian
A statistical approach in which probabilities represent degrees of belief about events and are updated with new evidence using Bayes' theorem.
What is NLP?
Natural Language Processing is the subset of AI tasked with analysing and interpreting human language.
What are some examples of how NLP is used?
Text summarisation
Sentiment analysis
Topic extraction
Question answering
Spam detection
What is the Bayesian spam detection formula?
P(spam | message) = P(message | spam) P(spam) / P(message)
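A minimal worked example of applying the formula, using made-up probabilities purely for illustration:

```python
# Hypothetical figures, for illustration only.
p_spam = 0.4              # P(spam): fraction of training emails that are spam
p_msg_given_spam = 0.05   # P(message | spam): likelihood of this message among spam
p_msg = 0.03              # P(message): overall likelihood of this message

# Bayes' rule: P(spam | message) = P(message | spam) * P(spam) / P(message)
p_spam_given_msg = p_msg_given_spam * p_spam / p_msg
print(p_spam_given_msg)   # ~0.67 -> the message is probably spam
```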
How is NLP used for spam detection?
Historical data of both spam and non-spam emails is fed into an NLP model.
This creates a (message x word) matrix where each row is a message, each column is a word, and each value is the count of that word in the message.
From this matrix, classifications can be made regarding whether a message containing certain words is spam or not. This is the NLP classification model.
A new message can then be run through the model to determine whether it's spam (see the sketch below).
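A minimal sketch of that workflow, assuming scikit-learn is available and using a tiny made-up training set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up historic data: 1 = spam, 0 = not spam.
train_messages = [
    "win a free prize now",
    "free money click here",
    "meeting moved to 3pm",
    "see you at lunch tomorrow",
]
labels = [1, 1, 0, 0]

# Build the (message x word) count matrix described above.
vectoriser = CountVectorizer()
X = vectoriser.fit_transform(train_messages)

# Fit a naive Bayes classifier on the word counts.
model = MultinomialNB()
model.fit(X, labels)

# Run a new message through the model.
new_message = vectoriser.transform(["claim your free prize"])
print(model.predict(new_message))  # e.g. [1] -> classified as spam
```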
Explain how NLP can be used to determine whether a block of text is lyrics from a rock song…
- Download a large set of rock song lyrics.
- Break lyrics down into a large set of words.
- Perform Stemming or Lemmatisation on each word.
- Count all occurrences of each word.
- These counts can then be used to predict whether other lyrics are from a rock song or not (see the sketch below).
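A minimal sketch of those steps, assuming NLTK is installed and using a tiny made-up lyrics corpus:

```python
from collections import Counter
from nltk.stem import PorterStemmer

# Placeholder corpus; in practice this would be a large set of downloaded lyrics.
rock_lyrics = [
    "we will rock you",
    "born to be wild",
]

stemmer = PorterStemmer()
counts = Counter()
for song in rock_lyrics:
    for word in song.lower().split():    # break lyrics into words
        counts[stemmer.stem(word)] += 1  # stem, then count occurrences

# These counts form the profile used to score new lyrics.
print(counts.most_common(5))
```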
Define Stemming…
Process of reducing a word to its root form by removing prefixes or suffixes.
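A minimal stemming sketch, assuming NLTK's Porter stemmer is available:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("studies"))   # 'studi' -- not a real word
print(stemmer.stem("studying"))  # 'studi'
```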
Define Lemmatisation…
Reducing words to their lemma (base form), which is the contextual root of the word.
For example, the lemma of studies is study, whereas the stem of studies is studi (not a real word).
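A minimal lemmatisation sketch, assuming NLTK and its WordNet data are installed:

```python
from nltk.stem import WordNetLemmatizer

lemmatiser = WordNetLemmatizer()
print(lemmatiser.lemmatize("studies"))  # 'study' -- a real word, unlike the stem 'studi'
```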
Why do we need stemming, lemmatisation or tokenisation?
They group words or phrases with similar meanings, reducing the dimensionality of the model's vector space and improving computational efficiency.
What is an issue with stemming that lemmatisation solves?
Stemming can produce non-words, e.g. studies -> studi.
In contrast, the lemma of studies is study.
What issue do stemming and lemmatisation solve?
Many variants of similar words lead to high dimensionality in the vector space, which degrades computational efficiency.
Stemming and lemmatisation reduce that dimensionality by mapping similar words to the same token.
Define tokenisation…
Break a sentence or paragraph into individual words.
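A minimal tokenisation sketch, assuming NLTK and its punkt tokeniser data are installed:

```python
from nltk.tokenize import word_tokenize

print(word_tokenize("NLP breaks sentences into words."))
# ['NLP', 'breaks', 'sentences', 'into', 'words', '.']
```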
Give the Lemma of the following: was, changing, better, worse
be
change
good
bad
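A sketch reproducing these lemmas with NLTK's WordNet lemmatiser, assuming its WordNet data is installed; the part-of-speech tag has to be supplied for these irregular forms:

```python
from nltk.stem import WordNetLemmatizer

lemmatiser = WordNetLemmatizer()
print(lemmatiser.lemmatize("was", pos="v"))       # 'be'
print(lemmatiser.lemmatize("changing", pos="v"))  # 'change'
print(lemmatiser.lemmatize("better", pos="a"))    # 'good'
print(lemmatiser.lemmatize("worse", pos="a"))     # 'bad'
```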
Why can tokenisation in languages other than English cause issues?
Because other languages may not have the same sentence structure, and may use several words to convey one English word.
What is bag-of-words representation?
A vector-based way of representing text. It creates a word matrix containing the counts of each word in the text.
Each element in the vector represents the count of that word.
Columns contain all words in the text.
Each row is a sentence.
Can lead to sparse vectors.
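A minimal bag-of-words sketch in plain Python, using two made-up sentences:

```python
sentences = ["the cat sat", "the dog sat on the mat"]

# Columns: one per distinct word across the text.
vocab = sorted({word for s in sentences for word in s.split()})

# Rows: one per sentence; each element is the count of that word.
matrix = [[s.split().count(word) for word in vocab] for s in sentences]

print(vocab)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)  # [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```

With a realistic vocabulary, most entries in each row are zero, which is why these vectors tend to be sparse.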