Week 7: Text Mining Flashcards
Language Models
Key components include: characters, strings, language (set of strings), corpus (collection of one or more texts)
n-gram
A contiguous sequence of n items, e.g. characters or words
n-gram Character Model
A probability distribution over n-character sequences, modelled with a Markov chain of order n-1. For example, a trigram model (n = 3) over a text c_{1:N} factorises as
P(c_{1:N}) = \prod_{i=1}^{N} P(c_i \mid c_{i-2}, c_{i-1})
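A minimal sketch of estimating such a model from raw character counts (the function name and the toy corpus below are illustrative, not from the lecture):

```python
from collections import Counter, defaultdict

def char_ngram_model(text, n=3):
    """Estimate P(c_i | previous n-1 characters) from raw counts."""
    context_counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, char = text[i:i + n - 1], text[i + n - 1]
        context_counts[context][char] += 1
    # Normalise the counts for each context into a conditional distribution.
    model = {}
    for context, chars in context_counts.items():
        total = sum(chars.values())
        model[context] = {c: count / total for c, count in chars.items()}
    return model

model = char_ngram_model("the theatre is there", n=3)
print(model["th"])  # {'e': 1.0}
```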
Language Identification Problem
Identify the language that a text (corpus) is written in.
One approach is to build a character 3-gram model for each candidate language and find the most probable language L^* such that
L^* = \arg \underset{L}{\max} \left\{ P(L \mid c_{1:N}) \right\}
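Assuming one character trigram model per language (e.g. built with the sketch above), identification reduces to picking the model under which the text is most probable; summing log-probabilities, flooring unseen trigrams, and a uniform prior over languages are simplifying assumptions here:

```python
import math

def log_prob(text, model, n=3, floor=1e-8):
    """Sum of log P(c_i | context) under one language's character model."""
    total = 0.0
    for i in range(len(text) - n + 1):
        context, char = text[i:i + n - 1], text[i + n - 1]
        total += math.log(model.get(context, {}).get(char, floor))
    return total

def identify_language(text, models):
    """models maps a language name to its character n-gram model."""
    return max(models, key=lambda lang: log_prob(text, models[lang]))
```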
Spelling Correction
Identify and fix errors in word spelling. Trigrams are often used for this.
Genre Classification
Identify category of text (e.g., news, novel, blogs, etc.). Trigrams are often used.
Smoothing
The process of assigning a non-zero probability to sequences that don’t appear in the training corpus.
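One common choice (add-one / Laplace smoothing, used here only as an illustration since the card does not name a specific method) adds one to every count, so unseen sequences keep a small non-zero probability:

```python
def laplace_prob(count, context_total, vocab_size):
    """Add-one smoothed estimate of P(char | context)."""
    return (count + 1) / (context_total + vocab_size)

# A character never seen in this context still gets non-zero probability:
print(laplace_prob(0, context_total=10, vocab_size=26))  # ~0.028
```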
Linear Interpolation Smoothing
Combining unigram, bigram, and trigram estimates using linear interpolation, with weights \lambda_1 + \lambda_2 + \lambda_3 = 1:
\hat{P}(c_i \mid c_{i-2}, c_{i-1}) = \lambda_1 P(c_i) + \lambda_2 P(c_i \mid c_{i-1}) + \lambda_3 P(c_i \mid c_{i-2}, c_{i-1})
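A sketch of the interpolated estimate, assuming the unigram, bigram, and trigram tables come from counting as above; the weights shown are illustrative and must sum to 1:

```python
def interpolated_prob(c, prev1, prev2, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """lambda1*P(c) + lambda2*P(c | prev1) + lambda3*P(c | prev2, prev1).

    uni[c] = P(c); bi[prev1][c] = P(c | prev1);
    tri[(prev2, prev1)][c] = P(c | prev2, prev1).
    """
    l1, l2, l3 = lambdas
    return (l1 * uni.get(c, 0.0)
            + l2 * bi.get(prev1, {}).get(c, 0.0)
            + l3 * tri.get((prev2, prev1), {}).get(c, 0.0))
```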
Word
Sequence of characters
Vocabulary
Set of valid words in a language.
Out-of-vocabulary Words
Words built from the language’s characters that are not in the vocabulary.
N-gram Word Models
They model sequences of words rather than characters.
As n increases, n-gram models tend to predict better, but the training overhead becomes more significant.
Bayesian Approach
Using spam detection as an example, a new message can be classified using Bayes’ rule:
\underset{c \in \left\{ \text{spam}, \neg\text{spam} \right\}}{\arg \max} P(c \mid \text{message}) = \underset{c \in \left\{ \text{spam}, \neg\text{spam} \right\}}{\arg \max} P(\text{message} \mid c)\, P(c)
The prior is estimated from the relative sizes of the two corpora:
P(\text{spam}) = \frac{\left\lvert \text{spam} \right\rvert}{\left\lvert \text{spam} \right\rvert + \left\lvert \neg\text{spam} \right\rvert}
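A minimal naive Bayes sketch of this rule over word counts; the toy corpora, the add-one smoothing, and the word-independence assumption are illustrative choices, not from the card:

```python
import math
from collections import Counter

def train_counts(messages):
    """Word counts over a list of messages (each a string)."""
    counts = Counter()
    for m in messages:
        counts.update(m.lower().split())
    return counts

def classify(msg, spam_msgs, ham_msgs):
    """Return the class maximising P(msg | c) P(c) under a naive Bayes model."""
    classes = {"spam": spam_msgs, "not spam": ham_msgs}
    total = sum(len(v) for v in classes.values())
    counts = {c: train_counts(v) for c, v in classes.items()}
    vocab = set().union(*counts.values())
    scores = {}
    for c, msgs in classes.items():
        score = math.log(len(msgs) / total)      # log prior P(c)
        n_c = sum(counts[c].values())
        for w in msg.lower().split():            # log likelihood, add-one smoothed
            score += math.log((counts[c][w] + 1) / (n_c + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("win a free prize now",
               spam_msgs=["free prize inside", "win money now"],
               ham_msgs=["meeting at noon", "see you at lunch"]))  # spam
```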
Term-document Matrix
A 2-D matrix of term frequencies across a collection of documents. Conventionally, rows correspond to terms and columns to documents; the transposed layout (rows as documents) is called a document-term matrix.
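A small sketch of building such a matrix in plain Python; the toy documents and the whitespace tokenisation are assumptions:

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat", "cats chase dogs"]
tokenised = [d.split() for d in docs]
terms = sorted({t for doc in tokenised for t in doc})

# One row per term, one column per document (term-document orientation).
matrix = [[Counter(doc)[term] for doc in tokenised] for term in terms]
for term, row in zip(terms, matrix):
    print(f"{term:>6} {row}")
```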
Topic Analysis
Identify topics in text.
Sentiment Analysis
Identify the polarity of text (i.e. positive, negative, or neutral). Can also classify by emotion (e.g. anger, disgust, joy).
Information Extraction
The process of acquiring knowledge by skimming texts and looking for special types of objects and the relationships between them.
The pipeline is as follows:
1. Tokenisation
2. Complex words
3. Basic groups
4. Complex phrases
5. Structure merging
Tokenisation
Segment the text into tokens.
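A minimal regex-based tokeniser sketch; the pattern is a simplification (real tokenisers handle abbreviations, contractions, and punctuation more carefully):

```python
import re

def tokenise(text):
    """Split text into word and punctuation tokens with a simple regex."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(tokenise("They set up a joint venture in Great Britain."))
# ['They', 'set', 'up', 'a', 'joint', 'venture', 'in', 'Great', 'Britain', '.']
```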
Complex Words
Recognise complex words such as collocations (e.g. “set up”, “joint venture”) or names (“Great Britain”, “Harry Potter”) that appear as combinations of lexical entries.
For example, company names can be recognised with the regex: Capitalised Word + ("Co" | "Inc" | "Ltd")
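The card’s pattern written as an actual regular expression; the suffix list and the example sentence are illustrative:

```python
import re

# One or more capitalised words followed by a company suffix.
COMPANY = re.compile(r"\b(?:[A-Z][a-z]+\s+)+(?:Co|Inc|Ltd)\b\.?")

text = "Shares of Acme Widgets Inc. rose while Foo Ltd stayed flat."
print(COMPANY.findall(text))  # ['Acme Widgets Inc.', 'Foo Ltd']
```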
Basic Groups
Chunk the text into units as follows:
NG = Noun phrase or group
VG = Verb phrase or group
PR = Preposition
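A chunking sketch using NLTK’s RegexpParser, assuming NLTK is installed; the chunk grammar and the hand-tagged example sentence are illustrative (a real pipeline would run a POS tagger first):

```python
import nltk

# Chunk grammar for noun groups (NG), verb groups (VG) and prepositions (PR).
grammar = r"""
  NG: {<DT>?<JJ>*<NN.*>+}
  VG: {<MD>?<VB.*>+}
  PR: {<IN>}
"""
parser = nltk.RegexpParser(grammar)

# Pre-tagged sentence, so no tagger model needs to be downloaded.
tagged = [("The", "DT"), ("joint", "JJ"), ("venture", "NN"),
          ("was", "VBD"), ("set", "VBN"), ("up", "RP"),
          ("in", "IN"), ("Great", "NNP"), ("Britain", "NNP")]
print(parser.parse(tagged))
# Tree with chunks such as (NG The/DT joint/JJ venture/NN), (VG was/VBD set/VBN),
# (PR in/IN) and (NG Great/NNP Britain/NNP).
```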