09. Text Analytics Flashcards
What is lemmatisation?
Lemmatisation reduces inflected words to their correct dictionary base form, which is always a valid word in the language.
In lemmatisation, this base form is called the lemma.
For example, runs, running and ran are all forms of the word run, so run is the lemma of all these words.
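A minimal sketch of the idea, assuming a tiny hand-made lookup table (LEMMAS and lemmatise are hypothetical names; real lemmatisers rely on a full dictionary and part-of-speech information):

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMAS = {"runs": "run", "running": "run", "ran": "run"}

def lemmatise(word: str) -> str:
    """Return the dictionary base form if known, else the lower-cased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

lemmas = [lemmatise(w) for w in ["runs", "Running", "ran", "cat"]]
```

All three forms of run map to the same lemma, including the irregular form ran, which simple suffix stripping would not catch.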
What is stemming?
Stemming reduces inflected words to a common stem, mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.
The stem (root) is the part of the word to which you add affixes, i.e. suffixes such as -ed, -ize and -s, or prefixes such as de- and mis-.
Stems are created by removing the suffixes or prefixes used with a word, which is called suffix/prefix stripping.
A widely used suffix-stripping algorithm is the Porter stemmer, so stemming is sometimes loosely called Porter stemming.
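Suffix stripping can be sketched roughly as follows. This is a toy illustration with a made-up suffix list (SUFFIXES and toy_stem are hypothetical), not the Porter algorithm, which applies a much larger set of ordered rules:

```python
# Toy suffix stripper: an illustration only, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical list, longer suffixes first

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["jumps", "jumped", "jumping", "studies"]]
# stems == ["jump", "jump", "jump", "studi"]
```

Note that "studi" is not a valid word, which is acceptable for a stem, and that an irregular form such as "ran" is left unchanged.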
What is dimensionality in text analysis?
It is the number of unique terms in the document or corpus. Various methods (e.g. case folding, stemming and stop-word removal) try to reduce this dimensionality to make the analysis simpler.
What is case folding?
It means you ignore the difference between capital and lower-case letters, usually by converting all text to lower case.
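As a small illustration, Python's built-in str.casefold() can be used to fold case (the word list is made up for the example):

```python
words = ["Apple", "APPLE", "apple"]

# After folding, the three spellings collapse into a single term.
folded = {w.casefold() for w in words}
# folded == {"apple"}
```

Three distinct surface forms become one term, which also reduces the dimensionality of the text.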
What does tokenizing do?
Tokenization is the task of separating words (tokens) from the body of text. After tokenization, raw text has been converted into a collection of tokens, where each token is generally a word.
You need to define how you wish to break the text apart, e.g. on whitespace and punctuation.
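One possible way to tokenize, sketched with the standard re module. The splitting rule used here (break on anything that is not a letter, digit or apostrophe) is an assumption chosen for the example, and tokenize is a hypothetical helper:

```python
import re

def tokenize(text: str) -> list[str]:
    """Split on any run of characters that is not a letter, digit or apostrophe."""
    return [tok for tok in re.split(r"[^A-Za-z0-9']+", text) if tok]

tokens = tokenize("Don't panic: text mining is fun, isn't it?")
# tokens == ["Don't", "panic", "text", "mining", "is", "fun", "isn't", "it"]
```

Keeping the apostrophe inside tokens is a design choice; a different rule would split "Don't" into two tokens.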
What does parsing mean?
Parsing means reading unstructured text and converting it into a structured format. This normally involves adding structure to the data.
In text analysis, what is meant by search and retrieval?
Search and retrieval means finding specific words or phrases, topics, or entities such as the names of people and organisations in the documents of a corpus.
What is text mining?
Text mining involves applying analysis methods to discover relationships and patterns in large text collections.
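What is topic modelling?
A topic consists of a cluster of words that frequently occur together and share the same theme, e.g. fluffy, meow, purr, paw = the topic of cats. Topic modelling is usually unsupervised and discovers these clusters from the corpus itself; a corpus with pre-labelled topics is needed only if you want to train a supervised topic classifier.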
What does RSS mean?
Really Simple Syndication
What are regular expressions?
A notation for defining text patterns to search for and match, widely used in text mining, e.g. $ is the symbol that matches the end of a text string.
What is Zipf’s Law?
It holds only approximately: the frequency of the i-th most common word is proportional to 1/i.
Relative to the most frequent word: 1st ranked = 1/1, 2nd ranked = 1/2, 3rd ranked = 1/3.
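A worked example of the 1/i rule, assuming a made-up count of 1200 occurrences for the top-ranked word (zipf_expected is a hypothetical helper):

```python
def zipf_expected(top_count: int, ranks: int) -> list[float]:
    """Expected counts for ranks 1..ranks if frequency is proportional to 1/rank."""
    return [top_count / rank for rank in range(1, ranks + 1)]

expected = zipf_expected(1200, 4)
# expected == [1200.0, 600.0, 400.0, 300.0]
```

Real corpora follow this only roughly, so observed counts will deviate from these idealised values.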
What is case folding?
It ignores the distinction between upper-case and lower-case letters in the text, so that "The" and "the" count as the same term.
What is the information content of words?
“Stop” words (e.g. the, and) carry basically no information content; they should be removed to improve text analysis.
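Stop-word removal can be sketched as a simple filter. The stop list below is a tiny illustrative set, not a standard one, and remove_stop_words is a hypothetical helper:

```python
# Tiny illustrative stop list; real lists contain a few hundred words.
STOP_WORDS = {"the", "and", "a", "of", "is", "on"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only tokens whose lower-cased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

kept = remove_stop_words(["The", "cat", "sat", "on", "the", "mat"])
# kept == ["cat", "sat", "mat"]
```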
What is TF?
Term frequency
TF1(t, d) = SUM over i = 1..n of f(t, ti), where t1..tn are the terms in document d and f(t, ti) = 1 if t = ti, otherwise 0
It is a count of the number of times that term t appears in document d
What is the IDF?
Inverse Document Frequency
The document frequency DF(t) is the number of documents in the corpus that contain the term t.
The inverse document frequency is the inverse of that, usually log-scaled: IDF(t) = log(N / DF(t)), where N is the number of documents in the corpus.
What is the TFIDF?
TFIDF(t, d) = TF(t, d) x IDF(t)
The higher the better: a high score means that the term is important in that document, i.e. frequent in the document but rare across the corpus.
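A minimal sketch of the calculation in plain Python. It assumes the common log-scaled form IDF(t) = log(N / DF(t)), which is one of several variants in use, and a made-up three-document corpus; the function and variable names are illustrative only:

```python
import math
from collections import Counter

# Made-up corpus: document id -> list of tokens.
corpus = {
    "d1": ["cat", "purr", "cat"],
    "d2": ["dog", "bark"],
    "d3": ["cat", "dog"],
}

def tf(term: str, doc_tokens: list[str]) -> int:
    """Raw count of the term within one document."""
    return Counter(doc_tokens)[term]

def idf(term: str, docs: dict[str, list[str]]) -> float:
    """log(N / DF); returns 0.0 if the term occurs in no document (assumed convention)."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(term: str, doc_id: str, docs: dict[str, list[str]]) -> float:
    return tf(term, docs[doc_id]) * idf(term, docs)

score_cat = tfidf("cat", "d1", corpus)    # 2 * log(3/2)
score_purr = tfidf("purr", "d1", corpus)  # 1 * log(3/1)
```

In d1, "purr" scores higher than "cat" even though it appears less often, because it occurs in only one document of the corpus.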
What is sentiment analysis?
Looking for opinions in text. It often uses classifiers (e.g. naive Bayes) and often has a binary result (positive or negative).
What is a word cloud?
An image of the words found in a document, with the more common words drawn bigger. Stop words are removed first.
What is part-of-speech (POS) tagging?
Labelling each word with its corresponding part of speech (noun, verb, etc.)
List some common regular expression symbols
| means or
* matches zero or more instances of the preceding character
+ matches one or more instances of the preceding character
{2,4} matches two to four instances of the preceding character
^ means starts with
$ means ends with
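The symbols above can be tried out with Python's standard re module. The patterns and test strings here are made up for illustration:

```python
import re

checks = {
    "or":          bool(re.search(r"cat|dog", "hot dog")),        # either alternative
    "star":        bool(re.fullmatch(r"ab*", "a")),               # zero b's is allowed
    "plus":        bool(re.fullmatch(r"ab+", "a")),               # needs at least one b
    "range":       bool(re.fullmatch(r"a{2,4}", "aaa")),          # two to four a's
    "starts_with": bool(re.search(r"^text", "text mining")),      # anchored at the start
    "ends_with":   bool(re.search(r"mining$", "text mining")),    # anchored at the end
}
```

Every check is True except "plus", since "a" contains no b and + requires at least one.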
What does bag of words mean?
The text is represented as all the words it contains (usually with their counts), but the order of the words is not preserved.
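A short sketch using collections.Counter as the bag, with two made-up sentences that contain the same words in a different order:

```python
from collections import Counter

bag_a = Counter("the dog bit the man".split())
bag_b = Counter("the man bit the dog".split())

same_bag = bag_a == bag_b   # word order is lost, so the bags are identical
unique_terms = len(bag_a)   # number of distinct terms in the bag
```

The two sentences mean different things but produce the same bag, which is the main limitation of this representation. The number of distinct terms (4 here) is the dimensionality of the text.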