Week 7: Text Mining Flashcards
Language Models
Key components include: characters, strings, language (set of strings), corpus (collection of one or more texts)
n-gram
A contiguous sequence of n items, e.g. characters
n-gram Character Model
A probability distribution of n-character sequences using a Markov chain of order n-1.
For a trigram model (n = 3), a sequence of N characters has probability
P(c_1,…,c_N) = \prod_{i=1}^{N} P(c_i \mid c_{i-2}, c_{i-1})
Language Identification Problem
Identify the language that a text (corpus) is written in.
One approach is to build a 3-gram model for each candidate language and find the most probable language L^* such that
L^* = \arg \underset{L}{\max} \left\{ P(L \mid c_1,…,c_n) \right\}
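A minimal sketch of this approach; the toy corpora, the add-one smoothing, and the alphabet_size value are illustrative assumptions, not from the notes:
```python
import math
from collections import Counter

def train_trigram_model(corpus):
    """Count character trigrams and their two-character contexts."""
    tri = Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))
    bi = Counter(corpus[i:i + 2] for i in range(len(corpus) - 2))
    return tri, bi

def log_likelihood(text, model, alphabet_size=30):
    """Sum of log P(c_i | c_{i-2}, c_{i-1}) with add-one smoothing (assumed)."""
    tri, bi = model
    return sum(
        math.log((tri[text[i:i + 3]] + 1) / (bi[text[i:i + 2]] + alphabet_size))
        for i in range(len(text) - 2)
    )

# Toy corpora; a real system would train on large monolingual texts.
models = {
    "en": train_trigram_model("the quick brown fox jumps over the lazy dog"),
    "de": train_trigram_model("der schnelle braune fuchs springt ueber den faulen hund"),
}
text = "the lazy fox"
print(max(models, key=lambda L: log_likelihood(text, models[L])))  # -> en
```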
Spelling Correction
Identify and fix errors in word spelling. Trigrams are often used for this.
Genre Classification
Identify category of text (e.g., news, novel, blogs, etc.). Trigrams are often used.
Smoothing
The process of assigning a non-zero probability to sequences that don’t appear in the training corpus.
Linear Interpolation Smoothing
Combining unigram, bigram, and trigram estimates by linear interpolation, with \lambda_1 + \lambda_2 + \lambda_3 = 1:
\hat{P}(c_i \mid c_{i-2}, c_{i-1}) = \lambda_1 P(c_i) + \lambda_2 P(c_i \mid c_{i-1}) + \lambda_3 P(c_i \mid c_{i-2}, c_{i-1})
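A minimal sketch of the interpolated estimate, assuming the counts are kept in Counters of 1-, 2-, and 3-character strings; the λ values below are illustrative and would normally be tuned on held-out data:
```python
from collections import Counter

def interpolated_prob(c, c1, c2, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """P_hat(c | c2, c1) = l1*P(c) + l2*P(c | c1) + l3*P(c | c2, c1).
    uni, bi, tri are Counters; the lambdas (assumed values) must sum to 1."""
    l1, l2, l3 = lambdas
    n = sum(uni.values())
    p1 = uni[c] / n if n else 0.0
    p2 = bi[c1 + c] / uni[c1] if uni[c1] else 0.0
    p3 = tri[c2 + c1 + c] / bi[c2 + c1] if bi[c2 + c1] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3
```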
Word
Sequence of characters
Vocabulary
Set of valid words in a language.
Out-of-vocabulary Words
Words formed from the language’s characters that are not in the vocabulary.
N-gram Word Models
They model sequences of words rather than characters.
As n increases, n-gram models tend to predict better, but the number of parameters, and the training data needed to estimate them, grows rapidly.
Bayesian Approach
Using spam filtering as an example, a new message msg can be classified using Bayes’ rule:
\underset{c \in \left\{ \text{spam},\, \neg\text{spam} \right\}}{\arg \max} P(c \mid msg) = \underset{c \in \left\{ \text{spam},\, \neg\text{spam} \right\}}{\arg \max} P(msg \mid c)\, P(c)
The prior is estimated from the class frequencies in the training data:
P(\text{spam}) = \frac{\left\lvert \text{spam} \right\rvert}{\left\lvert \text{spam} \right\rvert + \left\lvert \neg\text{spam} \right\rvert}
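A minimal sketch of such a classifier, assuming (beyond what the cards state) a naive bag-of-words likelihood P(msg \mid c) = \prod_t P(t \mid c) with add-one smoothing:
```python
import math
from collections import Counter

def train(msgs):
    """msgs: list of (token_list, label) pairs, label in {'spam', 'ham'}."""
    priors = Counter(label for _, label in msgs)
    words = {"spam": Counter(), "ham": Counter()}  # ham = not spam
    for tokens, label in msgs:
        words[label].update(tokens)
    return priors, words

def classify(tokens, priors, words, vocab_size=10_000):
    """argmax_c P(msg | c) P(c), computed in log space."""
    def log_score(c):
        total = sum(words[c].values())
        log_prior = math.log(priors[c] / sum(priors.values()))
        log_lik = sum(math.log((words[c][t] + 1) / (total + vocab_size))
                      for t in tokens)
        return log_prior + log_lik
    return max(("spam", "ham"), key=log_score)

priors, words = train([
    (["cheap", "pills", "now"], "spam"),
    (["meeting", "at", "noon"], "ham"),
])
print(classify(["cheap", "meds", "now"], priors, words))  # -> spam
```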
Term-document Matrix
A 2-D matrix with frequencies of words in a collection of documents. Rows correspond to documents, and columns correspond to terms.
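For illustration, the toy documents d1 = “the cat sat” and d2 = “the dog sat on the mat” give the matrix:

      the  cat  sat  dog  on  mat
d1     1    1    1    0    0   0
d2     2    0    1    1    1   1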
Topic Analysis
Identify topics in text.
Sentiment Analysis
Identify the polarity of a text (i.e. positive, negative, or neutral). Can also classify by emotion (e.g. anger, disgust, joy).
Information Extraction
The process of acquiring knowledge by skimming texts and looking for special types of objects as well as relationships between them.
The pipeline is as follows:
1. Tokenisation
2. Complex words
3. Basic groups
4. Complex phrases
5. Structure merging
Tokenisation
Segment the text into tokens.
Complex Words
Recognise complex words such as collocations (e.g. “set up”, “joint venture”) or names (e.g. “Great Britain”, “Harry Potter”) appearing as combinations of lexical entries.
For example, company names can be recognised with the regex: Capitalised Word + (“Co”|”Inc”|”Ltd”)
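A sketch of that rule in Python; the exact pattern and the example sentence are illustrative assumptions:
```python
import re

# Rough version of the rule: one or more capitalised words followed by a suffix.
company = re.compile(r"\b(?:[A-Z][a-z]+\s+)+(?:Co|Inc|Ltd)\b")

text = "Bridgestone Sports Co said it set up a joint venture with Acme Ltd."
print(company.findall(text))  # -> ['Bridgestone Sports Co', 'Acme Ltd']
```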
Basic Groups
Chunk the text into units as follows:
NG = Noun phrase or group
VG = Verb phrase or group
PR = Preposition
Complex Phrases
Define rules for complex relationships. For example, the following rule describes the creation of a joint venture:
Company + SetUpJointVenture((Company)*)
Structure Merging
Merges terms referring to the same entity.
Regular Expression (Regex)
Defines a search pattern; a regex R is associated with a set L(R) of strings (the strings R matches).
Definitions:
^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.
. Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches “abc”, etc., but [a.c] matches only “a”, “.”, or “c”.
[ ] A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches “a”, “b”, or “c”. [a-z] specifies a range which matches any lowercase letter from “a” to “z”. These forms can be mixed: [abcx-z] matches “a”, “b”, “c”, “x”, “y”, or “z”, as does [a-cx-z].
The - character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets: [abc-], [-abc]. Backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^) character: []abc].
[^ ] Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than “a”, “b”, or “c”. [^a-z] matches any single character that is not a lowercase letter from “a” to “z”. Likewise, literal characters and ranges can be mixed.
$ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
( ) Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group. BRE mode requires \( \).
\n Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is defined in the POSIX standard. Some tools allow referencing more than nine capturing groups. Also known as a back-reference, this feature is supported in BRE mode.
* Matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc. [xyz]* matches “”, “x”, “y”, “z”, “zx”, “zyx”, “xyzzy”, and so on. (ab)* matches “”, “ab”, “abab”, “ababab”, and so on.
{m,n} Matches the preceding element at least m and not more than n times. For example, a{3,5} matches only “aaa”, “aaaa”, and “aaaaa”. This is not found in a few older instances of regexes. BRE mode requires \{m,n\}.
Examples:
.at
matches any three-character string ending with “at”, including “hat”, “cat”, “bat”, “4at”, “#at” and “ at” (starting with a space).
[hc]at
matches “hat” and “cat”.
[^b]at matches all strings matched by .at except “bat”.
[^hc]at
matches all strings matched by .at other than “hat” and “cat”.
^[hc]at
matches “hat” and “cat”, but only at the beginning of the string or line.
[hc]at$
matches “hat” and “cat”, but only at the end of the string or line.
\[.\] matches any single character surrounded by “[” and “]” since the brackets are escaped, for example: “[a]”, “[b]”, “[7]”, “[@]”, “[]]”, and “[ ]” (bracket space bracket).
s.* matches s followed by zero or more characters, for example: “s”, “saw”, “seed”, “s3w96.7”, and “s6#h%(»>m n mQ”.
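These examples can be verified directly, for instance with Python’s re module:
```python
import re

assert re.fullmatch(r".at", "hat")          # .at matches any char + "at"
assert re.fullmatch(r"[hc]at", "cat")       # character class
assert not re.fullmatch(r"[^hc]at", "cat")  # negated class rejects "cat"
assert re.fullmatch(r"[^hc]at", "bat")
assert re.search(r"^[hc]at", "cat nap")     # anchored at the start
assert re.search(r"[hc]at$", "the cat")     # anchored at the end
assert re.fullmatch(r"s.*", "s3w96.7")      # s followed by anything
assert re.fullmatch(r"\[.\]", "[7]")        # escaped brackets are literal
print("all patterns behave as described")
```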
Dimensionality
This is a major challenge in probabilistic modelling. As the number of dimensions grows, the number of parameters needed to specify the joint distribution grows exponentially, making models harder to build.
Factorisation
This method divides a probability distribution model into smaller components, allowing the computation of simpler models of multivariate data.
Factorisation using Independence
Joint probability distribution of n independent variables:
P(x_1,x_2,…x_n) = \prod_{j=1}^n P(x_j)
Factorisation into a Sequence of Conditional Distributions
Joint probability distribution of n potentially dependent variables:
P(x_1,…,x_n) = P(x_1) \prod_{j=2}^n P(x_j \mid x_1, x_2,…,x_{j-1})
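For example, for n = 3 this chain rule gives:
P(x_1, x_2, x_3) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2)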
First-Order Markov Model
P(x_j \mid x_{j-1},…,x_1) = P(x_j \mid x_{j-1})
Markov Assumption
When predicting the future, the past doesn’t matter, only the present. All information relevant to x_j is contained in the immediately preceding variable x_{j-1}
k-th Order Markov Models
x_j depends on the k previous variables.
P(x_1,…,x_n) = P(x_1,…,x_k) \prod_{j=k+1}^n P(x_j \mid x_{j-1},…,x_{j-k})
Hidden Markov Model
This is a generalisation of Markov models that includes hidden variables. y_1,…,y_n: observations in a sequence, e.g. words. x_1,…,x_n: hidden states in a sequence, e.g. the tags.
Observation y_j only depends on state x_j. The hidden state x_j depends on the previous state x_{j-1} and may also depend on the previous word y_{j-1}.
P(y_1,…,y_n, x_1,…,x_n) = P(x_1)P(y_1 \mid x_1) \prod_{j=2}^n P(y_j \mid x_j) P(x_j \mid x_{j-1})
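A minimal sketch evaluating this joint probability for a toy tagging example; all states, words, and probability values below are assumptions for illustration:
```python
import math

def hmm_log_joint(states, obs, start, trans, emit):
    """log P(y_1..y_n, x_1..x_n) = log P(x_1) + log P(y_1 | x_1)
       + sum_{j>=2} [ log P(x_j | x_{j-1}) + log P(y_j | x_j) ]."""
    lp = math.log(start[states[0]]) + math.log(emit[states[0]][obs[0]])
    for j in range(1, len(states)):
        lp += math.log(trans[states[j - 1]][states[j]])
        lp += math.log(emit[states[j]][obs[j]])
    return lp

start = {"N": 0.6, "V": 0.4}                                    # P(x_1)
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}  # P(x_j | x_{j-1})
emit = {"N": {"dogs": 0.5, "bark": 0.1},                        # P(y_j | x_j)
        "V": {"dogs": 0.1, "bark": 0.6}}
print(hmm_log_joint(["N", "V"], ["dogs", "bark"], start, trans, emit))
```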
Text Representation
Representing texts so that documents relevant to a user’s query can be found: a corpus of documents is searched and a result set of documents matching the query is returned.
Term-Document Matrix
Each document is represented by a vector \boldsymbol{u} = (u_1,u_2,…,u_n), where u_j is the weight of term t_j in the document.
u_j can be boolean or a real value associated with the term frequency.
Cosine Distance
d(\boldsymbol{u}, \boldsymbol{v}) = \frac{\sum_{i=1}^n u_i v_i}{\sqrt{\sum_{i=1}^n u_i^2}\sqrt{\sum_{i=1}^n v_i^2}}. The smaller the angle between the vectors, the higher the similarity measure.
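A minimal sketch of this measure over plain Python lists:
```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 2, 0], [1, 1, 1]))  # ≈ 0.775
```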
Semantic Context
The meaning of the texts with respect to their context. Preserving semantic meaning is a big problem in NLP.
Polysemy
When a single word can have different meanings.
Synonym
When different words have the same meaning.
Information Retrieval
In the context of NLP, similarity is preferred over exact matching (equality or inequality). Methods tend to be interactive, with user feedback about the relevance of results.
Retrieval by Content
Find the k objects most similar to a given query object. Queries can be represented as term vectors.
Case Folding
Convert terms to lowercase.
Stemming
Reducing all words to their stem.
Word Embeddings
Use similar representations for words with similar meanings.
Metadata Parsing
Exploit data outside the documents.
Latent Semantic Indexing (LSI)
LSI assumes the existence of a hidden semantic structure in documents. It applies PCA to the m \times n document-term matrix to compute an m \times k representation of lower dimension (k < n): it finds the k principal component directions, so that terms which frequently appear together are combined into a single principal component.
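A minimal sketch using the SVD, the standard way to compute these directions (closely related to the PCA view above); the toy matrix and k are assumptions:
```python
import numpy as np

# Toy m x n document-term matrix: two "sport"-like and two "finance"-like docs.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
docs_k = U[:, :k] * s[:k]   # m x k document representations
print(docs_k)               # co-occurring terms collapse onto shared components
```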
PageRank Algorithm
This is Google’s algorithm for web search. It not only finds pages relevant to a query but also ranks them: pages with many in-links are ranked higher, and links from high-ranked pages are weighted more heavily.
Initialise each page p with R(p) = 1. Iteratively update the page ranks until convergence:
R(p) = \frac{1-d}{n} + d \sum_{i=1}^l \frac{R(in_i)}{C(in_i)}
where d is a damping factor, n is the total number of pages, in_1,…,in_l are the pages linking to p, and C(in_i) is the number of out-links of page in_i.
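A minimal sketch of this iteration; the toy link graph and the damping value are assumptions (and every page is assumed to have at least one out-link):
```python
def pagerank(links, d=0.85, iters=50):
    """links: dict page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 for p in pages}  # initialise R(p) = 1
    for _ in range(iters):
        ranks = {
            p: (1 - d) / n
               + d * sum(ranks[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return ranks

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```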
Precision
The proportion of documents in the result set that are actually relevant.
True Positive / (True Positive + False Positive)
Note that false positives are the retrieved documents that aren’t relevant.
Recall
The proportion of relevant documents appearing in the result set.
True Positive / (True Positive + False Negative)
Precision-recall Curve
Plotting precision against recall. The curve usually slopes downwards, reflecting the trade-off: raising recall tends to lower precision.
F1 Score
\frac{2 \cdot precision \cdot recall}{precision+recall}
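The three measures as a small helper, with assumed counts for illustration:
```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)
```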
TF-IDF Score
TF-IDF gives high weight to terms that occur often in a document but are rare across the corpus. The score of a document for a query with terms t_1,…,t_k is:
TF\text{-}IDF = \sum_{j=1}^k TF(t_j) \cdot IDF(t_j)
where TF(t_j) is the number of times term t_j appears in the document.
IDF
IDF(t_j) = \log \left( \frac{1+n}{1+n_j} \right) + 1
where n is the number of documents in the corpus and n_j is the number of documents containing t_j.
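A minimal sketch combining the two formulas above; documents are assumed to be plain token lists:
```python
import math

def idf(term, docs):
    """IDF(t) = log((1 + n) / (1 + n_t)) + 1, with n documents in total
    and n_t documents containing the term."""
    n = len(docs)
    n_t = sum(term in doc for doc in docs)
    return math.log((1 + n) / (1 + n_t)) + 1

def tfidf_score(query_terms, doc, docs):
    """Score one document for a query: sum of TF(t) * IDF(t)."""
    return sum(doc.count(t) * idf(t, docs) for t in query_terms)

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "rare", "word"]]
print(tfidf_score(["rare", "the"], docs[2], docs))  # the rare term dominates
```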