09. Text Analytics Flashcards
What is Lemmatisation
Lemmatisation: reduces inflected words to their correct dictionary base form, which is always a valid word in the language.
In lemmatisation, this base form is called the lemma.
For example, runs, running, ran are all forms of the word run, therefore run is the lemma of all these words.
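A minimal sketch of dictionary-based lemmatisation, assuming a tiny hand-written lookup table. The names LEMMA_TABLE and lemmatise are illustrative and not part of any library; a real lemmatiser relies on a full dictionary and often on the part of speech.

```python
# Toy lookup table mapping inflected forms to their lemma (illustrative only).
LEMMA_TABLE = {
    "runs": "run",
    "running": "run",
    "ran": "run",
}


def lemmatise(word):
    """Return the lemma of `word`, or the lower-cased word if it is not in the table."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


lemmas = [lemmatise(w) for w in ["runs", "running", "ran", "run"]]
print(lemmas)  # ['run', 'run', 'run', 'run']
```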
What is stemming
Stemming is the process of reducing inflected words to a common root form (stem), mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.
The stem (root) is the part of the word to which affixes are added, such as the suffixes -ed, -ize and -s or the prefixes de- and mis-.
Stems are created by removing the suffixes or prefixes used with a word, which is called suffix/prefix stripping.
A widely used stemming algorithm is the Porter stemmer (Porter's stemming algorithm).
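Suffix stripping can be sketched roughly as follows. This is a deliberately naive illustration and not the Porter algorithm; the suffix list, the minimum stem length and the name naive_stem are assumptions made for the example.

```python
# Suffixes to strip, tried in this order (illustrative list, not Porter's rules).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["studies", "studied", "studying", "runs"]:
    print(w, "->", naive_stem(w))
```

Note that "studies" and "studied" both map to "studi", which is not a valid English word; that is acceptable for a stem.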
What is the dimensionality in text analysis
It is the number of unique terms (the vocabulary size) in the text being analysed. Various methods try to reduce this dimensionality to make the analysis simpler.
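As a rough illustration, assuming two made-up sentences and simple whitespace splitting, the dimensionality can be counted as the size of the set of unique terms. Lower-casing is used here as one example of a step that reduces it.

```python
docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
]

# Unique terms exactly as written ("The" and "the" count separately).
raw_vocabulary = set()
# Unique terms after lower-casing.
folded_vocabulary = set()
for doc in docs:
    raw_vocabulary.update(doc.split())
    folded_vocabulary.update(doc.lower().split())

print(len(raw_vocabulary))     # 8
print(len(folded_vocabulary))  # 7
```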
What is case folding
It means ignoring the difference between upper-case and lower-case letters, typically by converting all text to lower case.
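A short sketch using Python's built-in str.casefold(), which is a more aggressive variant of lower() intended for caseless comparison. The word list is made up for the example.

```python
words = ["Apple", "APPLE", "apple"]

# After folding, all three spellings become the same term.
folded = [w.casefold() for w in words]

print(folded)            # ['apple', 'apple', 'apple']
print(len(set(words)))   # 3
print(len(set(folded)))  # 1
```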
What does tokenizing do
Tokenization (tokenizing) is the task of separating words from the body of text. It converts raw text into a collection of tokens, where each token is generally a word.
You need to define how the text should be broken apart, e.g. on whitespace or on punctuation.
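The choice of splitting rule changes the tokens you get. The sketch below compares two possible rules on a made-up sentence; the regular expression is one reasonable assumption, not a standard definition of a token.

```python
import re

text = "Don't panic: tokenizing isn't hard, is it?"

# Rule 1: split on whitespace only; punctuation stays attached to the words.
whitespace_tokens = text.split()

# Rule 2: keep runs of letters/digits (allowing inner apostrophes) and drop punctuation.
word_tokens = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)

print(whitespace_tokens)
print(word_tokens)
```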
What does parsing mean
Parsing: reading unstructured text and converting it into a structured format. This normally involves imposing a structure on the data.
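As one possible illustration, the sketch below turns a raw, made-up message into a dictionary with named fields. The message layout (header lines, a blank line, then the body) and the function name parse_message are assumptions for the example.

```python
raw = """From: alice@example.com
Subject: Quarterly report

Please find the report attached."""


def parse_message(text):
    """Split header lines from the body and return the fields as a dict."""
    header_part, _, body = text.partition("\n\n")
    record = {"body": body.strip()}
    for line in header_part.splitlines():
        name, _, value = line.partition(":")
        record[name.strip().lower()] = value.strip()
    return record


record = parse_message(raw)
print(record["from"])
print(record["subject"])
```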
In text analysis what is meant by search and retrieval
Search and retrieval: finding specific words/phrases, topics or entities such as the names of people and organisations in the documents of a corpus.
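A common way to support word search over a corpus is an inverted index that maps each term to the documents containing it. The sketch below is a minimal version under simple assumptions (made-up documents, whitespace tokens, lower-casing); the names corpus, index and search are illustrative.

```python
from collections import defaultdict

corpus = {
    "doc1": "Alice joined Acme in May",
    "doc2": "Bob left Acme last year",
    "doc3": "Alice met Bob in Paris",
}

# Inverted index: term -> set of ids of the documents that contain it.
index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in text.lower().split():
        index[term].add(doc_id)


def search(term):
    """Return the sorted ids of the documents containing `term` (case-insensitive)."""
    return sorted(index.get(term.lower(), set()))


print(search("Acme"))   # ['doc1', 'doc2']
print(search("alice"))  # ['doc1', 'doc3']
```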
What is text mining
Text mining: this involves applying analysis methods to discover relationships and patterns in large text collections
What is topic modelling
A topic consists of a cluster of words that frequently occur together and share the same theme, e.g. fluffy, meow, purr, paw = the topic of cats. Topic modelling discovers such clusters from a corpus of documents; common methods such as LDA are unsupervised and do not need pre-labelled topics.
What does RSS mean
Really Simple Syndication
What are Regular Expressions
A pattern language for describing the text you want to match, used in text mining to search for and extract strings, e.g. $ is the symbol that matches the end of a text string.
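A small sketch of the $ anchor using Python's standard re module, with the ^ anchor (start of string) shown for contrast. The example strings are made up.

```python
import re

lines = ["text mining", "mining text", "context"]

# `$` anchors the pattern at the end of the string.
ends_with_text = [s for s in lines if re.search(r"text$", s)]

# `^` anchors the pattern at the start of the string.
starts_with_text = [s for s in lines if re.search(r"^text", s)]

print(ends_with_text)    # ['mining text', 'context']
print(starts_with_text)  # ['text mining']
```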
What is Zipf’s Law
An empirical law that holds only approximately: the frequency of the i-th most common word is proportional to 1/i.
Relative to the top-ranked word: 1st ranked = 1/1, 2nd ranked = 1/2, 3rd ranked = 1/3
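To check the law against real text you would build a rank-frequency table and compare it with the 1/rank prediction. The sketch below shows both pieces; the function names are illustrative, and the sample sentence is far too short to follow the law, so it only demonstrates the mechanics.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, count) triples, most frequent term first."""
    counts = Counter(tokens).most_common()
    return [(rank, term, count) for rank, (term, count) in enumerate(counts, start=1)]


def zipf_prediction(top_count, rank):
    """Count predicted for a given rank if the top-ranked word occurs `top_count` times."""
    return top_count / rank


tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)
print(table[:2])                 # [(1, 'the', 3), (2, 'and', 2)]
print(zipf_prediction(1200, 3))  # 400.0
```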
What is case folding
It ignores the distinction between upper-case and lower-case letters in the text.
What is information content of words
“Stop” words carry almost no information content (e.g. the, and); they are usually removed to improve text analysis.
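Stop-word removal can be sketched as a simple filter. The STOP_WORDS set below is a tiny made-up list for illustration; practical lists are much longer and depend on the language and the task.

```python
# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "and", "a", "of", "is"}


def remove_stop_words(tokens):
    """Keep only the tokens whose lower-cased form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "The cat and the dog".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['cat', 'dog']
```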
What is TF
Term frequency
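TF_1(t, d) = \sum_{i=1}^{n} f(t, t_i), where t_1, ..., t_n are the tokens of document d and f(t, t_i) = 1 if t = t_i, otherwise 0.
It is a count of the number of times the term t appears in document d.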
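The count can be sketched directly from the sum, assuming the text has already been split into tokens and that f(t, t_i) is 1 when the token equals the term and 0 otherwise. The function name term_frequency and the sample sentence are illustrative.

```python
def term_frequency(term, document_tokens):
    """Sum f(t, t_i) over the tokens t_i, where f is 1 on an exact match and 0 otherwise."""
    return sum(1 for token in document_tokens if token == term)


doc = "the cat sat on the mat".split()
print(term_frequency("the", doc))  # 2
print(term_frequency("dog", doc))  # 0
```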