09. Text Analytics Flashcards

Question 1

Q

What is Lemmatisation

Answer

A

Lemmatisation reduces inflected words to their correct dictionary base form, which is always a valid word in the language.
In lemmatisation, this base form is called the lemma.
For example, runs, running and ran are all forms of the word run, so run is the lemma of all these words.

Question 2

Q

What is stemming

Answer

A

Stemming is the process of reducing inflected words to a common root form (the stem), mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.
The stem (root) is the part of the word to which affixes such as -ed, -ize, -s, de- and mis- are added.
Stems are created by removing the suffixes or prefixes used with a word, which is called suffix/prefix stripping.

A well-known stemming algorithm is the Porter stemmer.
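
A minimal sketch of suffix stripping, assuming a hand-picked suffix list and a hypothetical helper named naive_stem. It is not the Porter algorithm, only an illustration of how a stem can be a non-word and how irregular forms are missed:

```python
# Toy suffix stripper (illustrative only; the suffix list is an assumption
# and this is NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["runs", "running", "ran"]]
# "running" becomes "runn" (not a valid word) and "ran" is left unchanged,
# whereas a lemmatiser would map all three words to "run".
```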

Question 3

Q

What is the dimensionality in text analysis

Answer

A

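It is the number of unique terms in the document or corpus, because each unique term becomes one feature (dimension). Various methods, such as case folding, stemming and stop word removal, try to reduce this dimensionality to make the analysis simpler.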
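
As a rough illustration, the dimensionality can be counted as the size of the vocabulary. The two example sentences below are made up, and splitting on whitespace is a simplifying assumption:

```python
# Two made-up documents; whitespace splitting stands in for real tokenization.
docs = ["the cat sat on the mat", "the dog sat on the log"]

# The vocabulary is the set of unique terms across all documents.
vocabulary = sorted({term for doc in docs for term in doc.split()})

# Each unique term is one dimension of the term space.
dimensionality = len(vocabulary)
```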

Question 4

Q

What is case folding

Answer

A

It means ignoring the difference between upper-case and lower-case letters, usually by converting all text to lower case.
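
A short sketch of case folding using Python's built-in str.casefold (the word list is made up):

```python
# Three spellings of the same word that differ only in case.
words = ["Apple", "APPLE", "apple"]

# casefold() is like lower() but also handles some non-English characters.
folded = {w.casefold() for w in words}
# After folding, all three collapse into a single term.
```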

Question 5

Q

What does tokenizing do

Answer

A

Tokenization (also called tokenizing) is the task of separating words from the body of text. Raw text is converted into a collection of tokens, where each token is generally a word.

You need to define how you wish to break the text apart, e.g. on whitespace or punctuation.
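
One possible tokenizer, sketched with the standard re module. Treating every run of letters and digits as a token is an assumption; other rules (keeping apostrophes or hyphens, for instance) are equally valid:

```python
import re

def tokenize(text: str) -> list[str]:
    """Return runs of letters/digits, dropping punctuation and whitespace."""
    return re.findall(r"[A-Za-z0-9]+", text)

tokens = tokenize("Raw text, once tokenized, becomes tokens!")
```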

Question 6

Q

What does parsing mean

Answer

A

Parsing: reading unstructured text and converting it into structured data. This normally involves adding structure to the data.
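
A small sketch of parsing, assuming a made-up sentence and made-up field names (order_id, city, date). A regular expression with named groups turns the free text into a structured record:

```python
import re

# Hypothetical free-text line; the field names below are illustrative.
text = "Order 1234 shipped to Leeds on 2024-01-05"

PATTERN = re.compile(
    r"Order (?P<order_id>\d+) shipped to (?P<city>\w+) "
    r"on (?P<date>\d{4}-\d{2}-\d{2})"
)

match = PATTERN.search(text)
# groupdict() returns the named groups as a dictionary (the structured form).
record = match.groupdict() if match else {}
```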

Question 7

Q

In text analysis what is meant by search and retrieval

Answer

A

Search and retrieval: searching for specific words/phrases, topics or entities, like names of people and organisations, in the documents of a corpus.

Question 8

Q

What is text mining

Answer

A

Text mining: this involves applying analysis methods to discover relationships and patterns in large text collections

Question 9

Q

What is topic modelling

Answer

A

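Topic modelling identifies the topics present in a collection of documents. A topic consists of a cluster of words that frequently occur together and share the same theme, e.g. fluffy, meow, purr, paw = the topic of cats. The topics are usually discovered from the corpus itself (unsupervised) rather than taken from pre-labelled topics.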

Question 10

Q

What does RSS mean

Answer

A

Really Simple Syndication

Question 11

Q

What are Regular Expressions

Answer

A

A notation for defining search patterns that match text, used in text mining, e.g. $ is the symbol used to indicate the end of a text string.

Question 12

Q

What is Zipf’s Law

Answer

A

An empirical law that holds approximately: the frequency of the ith most common word is proportional to 1/i.

Relative to the most common word: 1st ranked = 1/1, 2nd ranked = 1/2, 3rd ranked = 1/3
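
The 1/rank pattern can be sketched as follows. The count of 1200 for the top word is a made-up number, and real text only follows the law approximately:

```python
def zipf_expected(top_count: int, max_rank: int) -> list[float]:
    """Counts predicted by an idealised Zipf's law: top_count / rank."""
    return [top_count / rank for rank in range(1, max_rank + 1)]

# If the most common word occurred 1200 times (hypothetical), the 2nd and 3rd
# ranked words would be expected to occur about 600 and 400 times.
expected = zipf_expected(1200, 3)
```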

Question 13

Q

What is case folding

Answer

A

It ignores the distinction between upper-case and lower-case letters in the text.

Question 14

Q

What is information content of words

Answer

A

“Stop” words (e.g. the, and) have basically no information content. These should be removed to improve text analysis.
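
A minimal sketch of stop word removal. The stop list here is a tiny made-up sample; real stop lists are much longer:

```python
# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"the", "and", "a", "of", "on"}

tokens = "the cat and the dog sat on a mat".split()

# Keep only the tokens that carry information content.
content_words = [t for t in tokens if t not in STOP_WORDS]
```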

Question 15

Q

What is TF

Answer

A

Term frequency
TF1(t,d) = Σ f(t, ti), summed over i = 1..n, where t1..tn are the terms in document d and f(t, ti) = 1 if t = ti, otherwise 0
It is a count of the number of times the term t appears in the document d
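
A sketch of a raw term count for one document, using collections.Counter. The function name term_frequency and the example sentence are illustrative:

```python
from collections import Counter

def term_frequency(term: str, document_tokens: list[str]) -> int:
    """Number of times `term` occurs in one tokenized document."""
    return Counter(document_tokens)[term]

doc = "the cat sat on the mat".split()
tf_the = term_frequency("the", doc)
tf_dog = term_frequency("dog", doc)  # absent terms count as 0
```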

Question 16

Q

What is the IDF

Answer

A

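Inverse Document Frequency

The document frequency DF(t) is the number of documents in the corpus that contain the term t.

The inverse document frequency is based on the inverse of that, usually scaled by the number of documents N in the corpus and log-damped: IDF(t) = log(N / DF(t))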
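
A sketch of IDF, assuming the common log-scaled form IDF(t) = log(N / DF(t)), where N is the number of documents in the corpus. The three-document corpus is made up:

```python
import math

def inverse_document_frequency(term: str, corpus: list[set[str]]) -> float:
    """log(N / DF), where DF is the number of documents containing `term`."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / df)

# Each document is represented by its set of terms.
corpus = [{"cat", "sat"}, {"dog", "sat"}, {"sat", "log"}]

idf_sat = inverse_document_frequency("sat", corpus)  # in every document
idf_cat = inverse_document_frequency("cat", corpus)  # in one document only
```

A term found in every document gets an IDF of zero, while a rare term gets a higher value.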

Question 17

Q

What is the TFIDF

Answer

A

TFIDF(t,d) = TF(t,d) x IDF(t)

The higher the score, the more important the term is to that document: it occurs often in the document but in few other documents of the corpus.
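
Putting the two parts together, a rough sketch that assumes a raw count for TF and log(N / DF) for IDF. The corpus and the helper name tf_idf are made up:

```python
import math
from collections import Counter

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """Raw term count in `doc` times log(N / DF) over `corpus`."""
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0  # term never occurs, so it carries no weight
    return tf * math.log(len(corpus) / df)

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird flew away".split(),
]

score_the = tf_idf("the", corpus[0], corpus)  # frequent, but in every document
score_cat = tf_idf("cat", corpus[0], corpus)  # occurs in one document only
```

Here "the" scores zero despite occurring twice, because it appears in every document, while "cat" scores higher.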

Question 18

Q

What is sentiment analysis

Answer

A

Looking for opinions expressed in text. It often uses classifiers (e.g. naive Bayes) and often has a binary result (positive or negative).
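
A toy naive Bayes sentiment classifier, sketched in plain Python. The four training sentences are invented, add-one smoothing is an assumption, and a real system would need far more data:

```python
import math
from collections import Counter

# Tiny hand-made training set (hypothetical, for illustration only).
TRAIN = [
    ("great film loved it", "pos"),
    ("wonderful acting great story", "pos"),
    ("terrible film hated it", "neg"),
    ("boring story awful acting", "neg"),
]

def train(examples):
    """Count words per label and documents per label."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities.
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
prediction = classify("great story", model)
```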

Question 19

Q

What is a word cloud

Answer

A

An image of the words found in a document, with the more frequent words drawn bigger. Stop words are removed first.

Question 20

Q

What is part of speech tagging (POS)

Answer

A

Labelling each word with its corresponding part of speech (noun, verb, adjective etc.)

Question 21

Q

List some common regular expressions

Answer

A

| means or
* matches zero or more instances of the previous character
+ matches one or more instances of the previous character
{2,4} matches two to four instances of the previous character
^ means starts with
$ means ends with
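
These operators can be tried with Python's standard re module. The test strings are made up, and | is taken to be the "or" operator:

```python
import re

# | : matches "cat" or "dog"
either = bool(re.fullmatch(r"cat|dog", "dog"))
# * : zero or more "b", so a lone "a" matches
zero_or_more = bool(re.fullmatch(r"ab*", "a"))
# + : one or more "b", so a lone "a" does not match
one_or_more = bool(re.fullmatch(r"ab+", "a"))
# {2,4} : between two and four "a" characters
two_to_four = bool(re.fullmatch(r"a{2,4}", "aaa"))
# ^ : the string starts with "Text"
starts_with = bool(re.search(r"^Text", "Text mining"))
# $ : the string ends with "mining"
ends_with = bool(re.search(r"mining$", "Text mining"))
```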

Question 22

Q

What does bag of words mean

Answer

A

All the words in the text are kept, usually with their counts, but the order of the words is not preserved.
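
A short sketch using collections.Counter as the bag. The two sentences are made up and show that word order is lost:

```python
from collections import Counter

# Same words, different order, very different meaning.
bag_a = Counter("the dog bit the man".split())
bag_b = Counter("the man bit the dog".split())

# The bags are identical because only the word counts are stored.
same_bag = bag_a == bag_b
```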

(22 cards)