Week 7: Text Mining Flashcards
Language Models
Key components include: characters, strings, language (set of strings), corpus (collection of one or more texts)
n-gram
A contiguous sequence of n items, e.g. characters
n-gram Character Model
A probability distribution of n-character sequences using a Markov chain of order n-1.
For a trigram model (n = 3), a sequence of N characters has probability
P(c_1,…,c_N) = \prod_{i=1}^{N} P(c_i \mid c_{i-2}, c_{i-1})
Language Identification Problem
Identify the language that a text (corpus) is written in.
One approach is to build a 3-gram model for each candidate language and find the most probable language L^* such that
L^* = \arg \underset{L}{\max} \left\{ P(L \mid c_1,…,c_n) \right\}
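A minimal sketch of this approach; the toy corpora, the add-one smoothing, and the alphabet_size value are illustrative assumptions, not from the notes:
```python
import math
from collections import Counter

def train_trigram_model(corpus):
    """Count character trigrams and their two-character contexts."""
    tri = Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))
    bi = Counter(corpus[i:i + 2] for i in range(len(corpus) - 2))
    return tri, bi

def log_likelihood(text, model, alphabet_size=30):
    """Sum of log P(c_i | c_{i-2}, c_{i-1}) with add-one smoothing (assumed)."""
    tri, bi = model
    return sum(
        math.log((tri[text[i:i + 3]] + 1) / (bi[text[i:i + 2]] + alphabet_size))
        for i in range(len(text) - 2)
    )

# Toy corpora; a real system would train on large monolingual texts.
models = {
    "en": train_trigram_model("the quick brown fox jumps over the lazy dog"),
    "de": train_trigram_model("der schnelle braune fuchs springt ueber den faulen hund"),
}
text = "the lazy fox"
print(max(models, key=lambda L: log_likelihood(text, models[L])))  # -> en
```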
Spelling Correction
Identify and fix errors in word spelling. Trigrams are often used for this.
Genre Classification
Identify category of text (e.g., news, novel, blogs, etc.). Trigrams are often used.
Smoothing
The process of assigning a non-zero probability to sequences that don’t appear in the training corpus.
Linear Interpolation Smoothing
Combining unigram, bigram, and trigram estimates by linear interpolation, with \lambda_1 + \lambda_2 + \lambda_3 = 1:
\hat{P}(c_i \mid c_{i-2}, c_{i-1}) = \lambda_1 P(c_i) + \lambda_2 P(c_i \mid c_{i-1}) + \lambda_3 P(c_i \mid c_{i-2}, c_{i-1})
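A minimal sketch of the interpolated estimate, assuming the counts are kept in Counters of 1-, 2-, and 3-character strings; the λ values below are illustrative and would normally be tuned on held-out data:
```python
from collections import Counter

def interpolated_prob(c, c1, c2, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """P_hat(c | c2, c1) = l1*P(c) + l2*P(c | c1) + l3*P(c | c2, c1).
    uni, bi, tri are Counters; the lambdas (assumed values) must sum to 1."""
    l1, l2, l3 = lambdas
    n = sum(uni.values())
    p1 = uni[c] / n if n else 0.0
    p2 = bi[c1 + c] / uni[c1] if uni[c1] else 0.0
    p3 = tri[c2 + c1 + c] / bi[c2 + c1] if bi[c2 + c1] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3
```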
Word
Sequence of characters
Vocabulary
Set of valid words in a language.
Out-of-vocabulary Words
Words formed from the language’s characters that are not in the vocabulary.
N-gram Word Models
They model sequences of words rather than characters.
As n increases, n-gram models tend to predict better, but the number of parameters, and the training data needed to estimate them, grows rapidly.
Bayesian Approach
Using spam filtering as an example, a new message msg can be classified using Bayes’ rule:
\underset{c \in \left\{ \text{spam},\, \neg\text{spam} \right\}}{\arg \max} P(c \mid msg) = \underset{c \in \left\{ \text{spam},\, \neg\text{spam} \right\}}{\arg \max} P(msg \mid c)\, P(c)
The prior is estimated from the class frequencies in the training data:
P(\text{spam}) = \frac{\left\lvert \text{spam} \right\rvert}{\left\lvert \text{spam} \right\rvert + \left\lvert \neg\text{spam} \right\rvert}
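A minimal sketch of such a classifier, assuming (beyond what the cards state) a naive bag-of-words likelihood P(msg \mid c) = \prod_t P(t \mid c) with add-one smoothing:
```python
import math
from collections import Counter

def train(msgs):
    """msgs: list of (token_list, label) pairs, label in {'spam', 'ham'}."""
    priors = Counter(label for _, label in msgs)
    words = {"spam": Counter(), "ham": Counter()}  # ham = not spam
    for tokens, label in msgs:
        words[label].update(tokens)
    return priors, words

def classify(tokens, priors, words, vocab_size=10_000):
    """argmax_c P(msg | c) P(c), computed in log space."""
    def log_score(c):
        total = sum(words[c].values())
        log_prior = math.log(priors[c] / sum(priors.values()))
        log_lik = sum(math.log((words[c][t] + 1) / (total + vocab_size))
                      for t in tokens)
        return log_prior + log_lik
    return max(("spam", "ham"), key=log_score)

priors, words = train([
    (["cheap", "pills", "now"], "spam"),
    (["meeting", "at", "noon"], "ham"),
])
print(classify(["cheap", "meds", "now"], priors, words))  # -> spam
```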
Term-document Matrix
A 2-D matrix with frequencies of words in a collection of documents. Rows correspond to documents, and columns correspond to terms.
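For illustration, the toy documents d1 = “the cat sat” and d2 = “the dog sat on the mat” give the matrix:

      the  cat  sat  dog  on  mat
d1     1    1    1    0    0   0
d2     2    0    1    1    1   1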
Topic Analysis
Identify topics in text.
Sentiment Analysis
Identify the polarity of a text (i.e. positive, negative, or neutral). Can also classify by emotion (e.g. anger, disgust, joy).
Information Extraction
The process of acquiring knowledge by skimming texts and looking for special types of objects as well as relationships between them.
The pipeline is as follows:
1. Tokenisation
2. Complex words
3. Basic groups
4. Complex phrases
5. Structure merging
Tokenisation
Segment the text into tokens.
Complex Words
Recognise complex words such as collocations (e.g. “set up”, “joint venture”) or names (e.g. “Great Britain”, “Harry Potter”) appearing as combinations of lexical entries.
For example, company names can be recognised with the regex: Capitalised Word + (“Co”|”Inc”|”Ltd”)
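A sketch of that rule in Python; the exact pattern and the example sentence are illustrative assumptions:
```python
import re

# Rough version of the rule: one or more capitalised words followed by a suffix.
company = re.compile(r"\b(?:[A-Z][a-z]+\s+)+(?:Co|Inc|Ltd)\b")

text = "Bridgestone Sports Co said it set up a joint venture with Acme Ltd."
print(company.findall(text))  # -> ['Bridgestone Sports Co', 'Acme Ltd']
```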
Basic Groups
Chunk the text into units as follows:
NG = Noun phrase or group
VG = Verb phrase or group
PR = Preposition
Complex Phrases
Define rules for complex relationships. For example, the following rule describes the creation of a joint venture:
Company + SetUpJointVenture((Company)*)
Structure Merging
Merges terms referring to the same entity.
Regular Expression (Regex)
Defines a search pattern; a regex R is associated with a set L(R) of strings (the strings R matches).
Definitions:
^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.
. Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches “abc”, etc., but [a.c] matches only “a”, “.”, or “c”.
[ ] A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches “a”, “b”, or “c”. [a-z] specifies a range which matches any lowercase letter from “a” to “z”. These forms can be mixed: [abcx-z] matches “a”, “b”, “c”, “x”, “y”, or “z”, as does [a-cx-z].
The - character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets: [abc-], [-abc]. Backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^) character: []abc].
[^ ] Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than “a”, “b”, or “c”. [^a-z] matches any single character that is not a lowercase letter from “a” to “z”. Likewise, literal characters and ranges can be mixed.
$ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
( ) Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group. BRE mode requires \( \).
\n Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is defined in the POSIX standard. Some tools allow referencing more than nine capturing groups. Also known as a back-reference, this feature is supported in BRE mode.
* Matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc. [xyz]* matches “”, “x”, “y”, “z”, “zx”, “zyx”, “xyzzy”, and so on. (ab)* matches “”, “ab”, “abab”, “ababab”, and so on.
{m,n} Matches the preceding element at least m and not more than n times. For example, a{3,5} matches only “aaa”, “aaaa”, and “aaaaa”. This is not found in a few older instances of regexes. BRE mode requires \{m,n\}.
Examples:
.at
matches any three-character string ending with “at”, including “hat”, “cat”, “bat”, “4at”, “#at” and “ at” (starting with a space).
[hc]at
matches “hat” and “cat”.
[^b]at matches all strings matched by .at except “bat”.
[^hc]at
matches all strings matched by .at other than “hat” and “cat”.
^[hc]at
matches “hat” and “cat”, but only at the beginning of the string or line.
[hc]at$
matches “hat” and “cat”, but only at the end of the string or line.
\[.\] matches any single character surrounded by “[” and “]” since the brackets are escaped, for example: “[a]”, “[b]”, “[7]”, “[@]”, “[]]”, and “[ ]” (bracket space bracket).
s.* matches s followed by zero or more characters, for example: “s”, “saw”, “seed”, “s3w96.7”, and “s6#h%(»>m n mQ”.
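These examples can be verified directly, for instance with Python’s re module:
```python
import re

assert re.fullmatch(r".at", "hat")          # .at matches any char + "at"
assert re.fullmatch(r"[hc]at", "cat")       # character class
assert not re.fullmatch(r"[^hc]at", "cat")  # negated class rejects "cat"
assert re.fullmatch(r"[^hc]at", "bat")
assert re.search(r"^[hc]at", "cat nap")     # anchored at the start
assert re.search(r"[hc]at$", "the cat")     # anchored at the end
assert re.fullmatch(r"s.*", "s3w96.7")      # s followed by anything
assert re.fullmatch(r"\[.\]", "[7]")        # escaped brackets are literal
print("all patterns behave as described")
```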
Dimensionality
This is a major challenge in probabilistic modelling. As the number of dimensions grows, the number of parameters needed to specify the joint distribution grows exponentially, making models harder to build.
Factorisation
This method divides a probability distribution model into smaller components, allowing the computation of simpler models of multivariate data.
Factorisation using Independence
Joint probability distribution of n independent variables:
P(x_1,x_2,…x_n) = \prod_{j=1}^n P(x_j)
Factorisation into a Sequence of Conditional Distributions
Joint probability distribution of n potentially dependent variables:
P(x_1,…,x_n) = P(x_1) \prod_{j=2}^n P(x_j \mid x_1, x_2,…,x_{j-1})
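For example, for n = 3 this chain rule gives:
P(x_1, x_2, x_3) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2)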
First-Order Markov Model
P(x_j \mid x_{j-1},…,x_1) = P(x_j \mid x_{j-1})
Markov Assumption
When predicting the future, the past doesn’t matter, only the present. All information relevant to x_j is contained in the immediately preceding variable x_{j-1}
k-th Order Markov Models
x_j depends on the k previous variables.
P(x_1,…,x_n) = P(x_1,…,x_k) \prod_{j=k+1}^n P(x_j \mid x_{j-1},…,x_{j-k})
Hidden Markov Model
This is a generalisation of Markov models that includes hidden variables. y_1,…,y_n: observations in a sequence, e.g. words. x_1,…,x_n: hidden states in a sequence, e.g. the tags.
Observation y_j only depends on state x_j. The hidden state x_j depends on the previous state x_{j-1} and may also depend on the previous word y_{j-1}.
P(y_1,…,y_n, x_1,…,x_n) = P(x_1)P(y_1 \mid x_1) \prod_{j=2}^n P(y_j \mid x_j) P(x_j \mid x_{j-1})
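A minimal sketch evaluating this joint probability for a toy tagging example; all states, words, and probability values below are assumptions for illustration:
```python
import math

def hmm_log_joint(states, obs, start, trans, emit):
    """log P(y_1..y_n, x_1..x_n) = log P(x_1) + log P(y_1 | x_1)
       + sum_{j>=2} [ log P(x_j | x_{j-1}) + log P(y_j | x_j) ]."""
    lp = math.log(start[states[0]]) + math.log(emit[states[0]][obs[0]])
    for j in range(1, len(states)):
        lp += math.log(trans[states[j - 1]][states[j]])
        lp += math.log(emit[states[j]][obs[j]])
    return lp

start = {"N": 0.6, "V": 0.4}                                    # P(x_1)
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}  # P(x_j | x_{j-1})
emit = {"N": {"dogs": 0.5, "bark": 0.1},                        # P(y_j | x_j)
        "V": {"dogs": 0.1, "bark": 0.6}}
print(hmm_log_joint(["N", "V"], ["dogs", "bark"], start, trans, emit))
```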
Text Representation
Representing texts so that documents relevant to a user’s query can be found: a corpus of documents is searched and a result set of documents matching the query is returned.
Term-Document Matrix
Each document is represented by a vector \boldsymbol{u} = (u_1,u_2,…,u_n), where u_j is the weight of term t_j in the document.
u_j can be boolean or a real value associated with the term frequency.
Cosine Distance
d(\boldsymbol{u}, \boldsymbol{v}) = \frac{\sum_{i=1}^n u_i v_i}{\sqrt{\sum_{i=1}^n u_i^2}\sqrt{\sum_{i=1}^n v_i^2}}. The smaller the angle between the vectors, the higher the similarity measure.
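A minimal sketch of this measure over plain Python lists:
```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 2, 0], [1, 1, 1]))  # ≈ 0.775
```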
Semantic Context
The meaning of the texts with respect to their context. Preserving semantic meaning is a big problem in NLP.
Polysemy
When a single word can have different meanings.
Synonym
When different words have the same meaning.
Information Retrieval
In the context of NLP, similarity is preferred over exact matching (equality or inequality). Methods tend to be interactive, with user feedback about the relevance of results.
Retrieval by Content
Find the k objects most similar to a given query object. Queries can be represented as term vectors.
Case Folding
Convert terms to lowercase.
Stemming
Reducing all words to their stem.
Word Embeddings
Use similar representations for words with similar meanings.
Metadata Parsing
Exploit data outside the documents.
Latent Semantic Indexing (LSI)
LSI assumes the existence of a hidden semantic structure in documents. It applies PCA to the m \times n document-term matrix to compute an m \times k representation of lower dimension (k < n): it finds the k principal component directions, so that terms which frequently appear together are combined into a single principal component.
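A minimal sketch using the SVD, the standard way to compute these directions (closely related to the PCA view above); the toy matrix and k are assumptions:
```python
import numpy as np

# Toy m x n document-term matrix: two "sport"-like and two "finance"-like docs.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
docs_k = U[:, :k] * s[:k]   # m x k document representations
print(docs_k)               # co-occurring terms collapse onto shared components
```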
PageRank Algorithm
This is Google’s algorithm for web search. It not only finds pages relevant to a query but also ranks them: pages with many in-links are ranked higher, and links from high-ranked pages are weighted more heavily.
Initialise each page p with R(p) = 1. Iteratively update the page ranks until convergence:
R(p) = \frac{1-d}{n} + d \sum_{i=1}^l \frac{R(in_i)}{C(in_i)}
where d is a damping factor, n is the total number of pages, in_1,…,in_l are the pages linking to p, and C(in_i) is the number of out-links of page in_i.
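A minimal sketch of this iteration; the toy link graph and the damping value are assumptions (and every page is assumed to have at least one out-link):
```python
def pagerank(links, d=0.85, iters=50):
    """links: dict page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 for p in pages}  # initialise R(p) = 1
    for _ in range(iters):
        ranks = {
            p: (1 - d) / n
               + d * sum(ranks[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return ranks

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```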
Precision
The proportion of documents in the result set that are actually relevant.
True Positive / (True Positive + False Positive)
Note that false positives are the retrieved documents that aren’t relevant.
Recall
The proportion of relevant documents appearing in the result set.
True Positive / (True Positive + False Negative)
Precision-recall Curve
Plotting precision against recall. The curve usually slopes downwards, reflecting the trade-off: raising recall tends to lower precision.
F1 Score
\frac{2 \cdot precision \cdot recall}{precision+recall}
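The three measures as a small helper, with assumed counts for illustration:
```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)
```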
TF-IDF Score
TF-IDF gives high weight to terms that occur often in a document but are rare across the corpus. The score of a document for a query with terms t_1,…,t_k is:
TF\text{-}IDF = \sum_{j=1}^k TF(t_j) \cdot IDF(t_j)
where TF(t_j) is the number of times term t_j appears in the document.
IDF
IDF(t_j) = \log \left( \frac{1+n}{1+n_j} \right) + 1
where n is the number of documents in the corpus and n_j is the number of documents containing t_j.
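A minimal sketch combining the two formulas above; documents are assumed to be plain token lists:
```python
import math

def idf(term, docs):
    """IDF(t) = log((1 + n) / (1 + n_t)) + 1, with n documents in total
    and n_t documents containing the term."""
    n = len(docs)
    n_t = sum(term in doc for doc in docs)
    return math.log((1 + n) / (1 + n_t)) + 1

def tfidf_score(query_terms, doc, docs):
    """Score one document for a query: sum of TF(t) * IDF(t)."""
    return sum(doc.count(t) * idf(t, docs) for t in query_terms)

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "rare", "word"]]
print(tfidf_score(["rare", "the"], docs[2], docs))  # the rare term dominates
```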