Week 7: Text Mining Flashcards

1
Q

Language Models

A

Key components include: characters, strings, language (set of strings), corpus (collection of one or more texts)

2
Q

n-gram

A

A contiguous sequence of n items, e.g. characters

3
Q

n-gram Character Model

A

A probability distribution over n-character sequences, modelled as a Markov chain of order n-1. For example, a trigram model (n = 3) conditions each character on the two preceding ones:

P(c_1,…,c_N) = \prod_{i=1}^N P(c_i \mid c_{i-2}, c_{i-1})
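
A minimal count-based sketch of such a model (my own illustration, not from the lecture), using maximum-likelihood estimates for the trigram case:

from collections import Counter

def train_char_trigrams(text):
    # Count trigrams and the bigram contexts that begin them (MLE counts).
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    contexts = Counter(text[i:i + 2] for i in range(len(text) - 2))
    return trigrams, contexts

def p_char(c, context, trigrams, contexts):
    # P(c | c_{i-2}, c_{i-1}) as a relative frequency; 0 if the context is unseen.
    if contexts[context] == 0:
        return 0.0
    return trigrams[context + c] / contexts[context]

trigrams, contexts = train_char_trigrams("the theatre then thinned")
print(p_char("e", "th", trigrams, contexts))  # estimate of P('e' | 't', 'h')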

4
Q

Language Identification Problem

A

Identify the language that a text (corpus) is written in.

One approach is to build a 3-gram model and find the most probable language L^* such that

L^* = \underset{L}{\arg \max}\left\{ P(L \mid c_1,…,c_n)\right\}
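
A hedged sketch of this arg-max, assuming per-language trigram probability tables have already been trained (the tiny tables below are invented for illustration):

import math

# Hypothetical character-trigram probabilities per language.
models = {
    "en": {"the": 0.02, "he ": 0.01},
    "de": {"der": 0.02, "ie ": 0.01},
}

def log_prob(text, model, floor=1e-8):
    # Sum of log P(c_i | c_{i-2}, c_{i-1}) over the text's trigrams;
    # unseen trigrams get a small floor probability.
    return sum(math.log(model.get(text[i:i + 3], floor))
               for i in range(len(text) - 2))

def identify_language(text):
    # L* = argmax_L P(L | c_1,...,c_n), assuming uniform priors over languages.
    return max(models, key=lambda lang: log_prob(text, models[lang]))

print(identify_language("the weather"))  # 'en' with these toy tables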

5
Q

Spelling Correction

A

Identify and fix errors in word spelling. Trigrams are often used for this.

6
Q

Genre Classification

A

Identify category of text (e.g., news, novel, blogs, etc.). Trigrams are often used.

7
Q

Smoothing

A

The process of assigning a non-zero probability to sequences that don’t appear in the training corpus.

8
Q

Linear Interpolation Smoothing

A

Combining unigrams, bigrams, and trigrams using linear interpolation:

\hat{P}(c_i \mid c_{i-2}, c_{i-1}) = \lambda_1 P(c_i) + \lambda_2 P(c_i \mid c_{i-1}) + \lambda_3 P(c_i \mid c_{i-2}, c_{i-1})
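
A minimal sketch of the interpolation step, assuming the three component estimates are already available and the weights sum to 1:

def interpolated_prob(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    # P-hat(c_i | c_{i-2}, c_{i-1}) = l1*P(c_i) + l2*P(c_i|c_{i-1}) + l3*P(c_i|c_{i-2},c_{i-1})
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "weights must sum to 1"
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# An unseen trigram (p_tri = 0) still receives non-zero probability.
print(interpolated_prob(0.05, 0.10, 0.00))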

9
Q

Word

A

Sequence of characters

10
Q

Vocabulary

A

Set of valid words in a language.

11
Q

Out-of-vocabulary Words

A

Words formed from the language’s characters that are not in the vocabulary.

12
Q

N-gram Word Models

A

They model sequences of words rather than characters.

As n increases, n-gram models tend to predict better, but the training overhead becomes more significant.

13
Q

Bayesian Approach

A

Using spam as an example, a new message msg can be classified using Bayes’ rule:

\underset{c \in \left\{ \text{spam}, \neg\text{spam} \right\}}{\arg \max} P(c \mid msg) = \underset{c \in \left\{ \text{spam}, \neg\text{spam} \right\}}{\arg \max} P(msg \mid c)\, P(c)

The prior is estimated from the class frequencies:
P(\text{spam}) = \frac{\left\lvert \text{spam} \right\rvert}{\left\lvert \text{spam} \right\rvert + \left\lvert \neg\text{spam} \right\rvert}
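
A hedged naive-Bayes sketch of this rule with Laplace smoothing; the word counts, class sizes, and the label "ham" (standing in for ¬spam) are all invented for illustration:

import math
from collections import Counter

# Hypothetical training counts: word frequencies and message counts per class.
word_counts = {
    "spam": Counter({"free": 20, "win": 15, "meeting": 1}),
    "ham":  Counter({"free": 2, "win": 1, "meeting": 30}),
}
msg_counts = {"spam": 40, "ham": 60}
vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(msg, c):
    # log P(c) + sum_w log P(w | c), with add-one smoothing over the vocabulary.
    prior = msg_counts[c] / sum(msg_counts.values())   # |c| / (|spam| + |not spam|)
    total = sum(word_counts[c].values())
    lp = math.log(prior)
    for w in msg.split():
        lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return lp

def classify(msg):
    return max(word_counts, key=lambda c: log_posterior(msg, c))

print(classify("win a free prize"))  # 'spam' with these toy counts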

14
Q

Term-document Matrix

A

A 2-D matrix with frequencies of words in a collection of documents. Rows correspond to documents, and columns correspond to terms.

15
Q

Topic Analysis

A

Identify topics in text.

16
Q

Sentiment Analysis

A

Identify the polarity of text (i.e. positive, negative, or neutral). Can also classify based on emotion (e.g. anger, disgust, joy).

17
Q

Information Extraction

A

The process of acquiring knowledge by skimming texts and looking for special types of objects as well as relationships between them.

The pipeline is as follows:
1. Tokenisation
2. Complex words
3. Basic groups
4. Complex phrases
5. Structure merging

18
Q

Tokenisation

A

Segment the text into tokens.

19
Q

Complex Words

A

Recognise complex words such as collocations (e.g. “set up”, “joint venture”) or names (“Great Britain”, “Harry Potter”) appearing as combinations of lexical entries.

For example, company names can be recognised with the regex: Capitalised Word + (“Co”|”Inc”|”Ltd”)
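
A hedged illustration of that rule with Python's re module (the exact pattern used in the lecture may differ):

import re

# One capitalised word followed by a company suffix, e.g. "Acme Inc".
company = re.compile(r"\b[A-Z][a-z]+\s+(?:Co|Inc|Ltd)\b")

text = "Acme Inc and Foobar Ltd set up a joint venture."
print(company.findall(text))  # ['Acme Inc', 'Foobar Ltd']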

20
Q

Basic Groups

A

Chunk the text into units as follows:

NG = Noun phrase or group

VG = Verb phrase or group

PR = Preposition

21
Q

Complex Phrases

A

Define rules for complex relationships. For example, the following rule describes the creation of a joint venture:

Company + SetUpJointVenture((Company)*)

22
Q

Structure Merging

A

Merges terms referring to the same entity.

23
Q

Regular Expression (Regex)

A

Defines a search pattern; a regex R is associated with a set L(R) of strings.

Definitions:
^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.
. Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches “abc”, etc., but [a.c] matches only “a”, “.”, or “c”.
[ ] A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches “a”, “b”, or “c”. [a-z] specifies a range which matches any lowercase letter from “a” to “z”. These forms can be mixed: [abcx-z] matches “a”, “b”, “c”, “x”, “y”, or “z”, as does [a-cx-z].

The - character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets: [abc-], [-abc]. Backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^) character: []abc].
[^ ] Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than “a”, “b”, or “c”. [^a-z] matches any single character that is not a lowercase letter from “a” to “z”. Likewise, literal characters and ranges can be mixed.
$ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
( ) Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group. BRE mode requires \( \).
\n Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is defined in the POSIX standard. Some tools allow referencing more than nine capturing groups. Also known as a back-reference, this feature is supported in BRE mode.
* Matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbc”, “abbbc”, etc. [xyz]* matches “”, “x”, “y”, “z”, “zx”, “zyx”, “xyzzy”, and so on. (ab)* matches “”, “ab”, “abab”, “ababab”, and so on.
{m,n} Matches the preceding element at least m and not more than n times. For example, a{3,5} matches only “aaa”, “aaaa”, and “aaaaa”. This is not found in a few older instances of regexes. BRE mode requires \{m,n\}.

Examples:
.at
matches any three-character string ending with “at”, including “hat”, “cat”, “bat”, “4at”, “#at” and “ at” (starting with a space).

[hc]at
matches “hat” and “cat”.

[^b]at matches all strings matched by .at except “bat”.

[^hc]at
matches all strings matched by .at other than “hat” and “cat”.

^[hc]at
matches “hat” and “cat”, but only at the beginning of the string or line.

[hc]at$
matches “hat” and “cat”, but only at the end of the string or line.

\[.\] matches any single character surrounded by “[” and “]” since the brackets are escaped, for example: “[a]”, “[b]”, “[7]”, “[@]”, “[]]”, and “[ ]” (bracket space bracket).

s.* matches s followed by zero or more characters, for example: “s”, “saw”, “seed”, “s3w96.7”, and “s6#h%(»>m n mQ”.
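
A few of the examples above, checked with Python's re module:

import re

print(re.findall(r".at", "the cat sat on a hat"))      # ['cat', 'sat', 'hat']
print(re.findall(r"[hc]at", "cat hat bat"))            # ['cat', 'hat']
print(re.findall(r"[^b]at", "cat hat bat"))            # ['cat', 'hat']
print(bool(re.search(r"^[hc]at", "hatstand")))         # True: match at the start
print(bool(re.search(r"[hc]at$", "tomcat")))           # True: match at the end
print(re.findall(r"a{3,5}", "aaaaaaa"))                # ['aaaaa']: greedy, non-overlapping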

24
Q

Dimensionality

A

This is a major challenge in probabilistic modelling. As the number of dimensions grows, building accurate models becomes harder.

25
Q

Factorisation

A

This method divides a probability distribution model into smaller components, allowing the computation of simpler models of multivariate data.

26
Q

Factorisation using Independence

A

Joint probability distribution of n independent variables:

P(x_1,x_2,…x_n) = \prod_{j=1}^n P(x_j)

27
Q

Factorisation into a Sequence of Conditional Distributions

A

Joint probability distribution of n potentially dependent variables:

P(x_1,…,x_n) = P(x_1) \prod_{j=2}^n P(x_j \mid x_1, x_2,…,x_{j-1})

28
Q

First-Order Markov Model

A

P(x_j \mid x_{j-1},…,x_1) = P(x_j \mid x_{j-1})

29
Q

Markov Assumption

A

When predicting the future, the past doesn’t matter, only the present. All information relevant to x_j is contained in the immediately preceding variable x_{j-1}

30
Q

k-th Order Markov Models

A

x_j depends on the k previous measurements.

P(x_1,…,x_n) = P(x_1,…,x_k) \prod_{j=k+1}^n P(x_j \mid x_{j-1},…,x_{j-k})

31
Q

Hidden Markov Model

A

This is a generalisation of Markov models including hidden variables. y_1,…,y_n: observations in a sequence, e.g. words. x_1,…,x_n: hidden states in a sequence, e.g. the tags.
Observation y_j only depends on state x_j. The hidden state x_j depends on the previous state x_{j-1} and may also depend on the previous word y_{j-1}.

P(y_1,…,y_n, x_1,…,x_n) = P(x_1)P(y_1 \mid x_1) \prod_{j=2}^n P(y_j \mid x_j) P(x_j \mid x_{j-1})
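
A minimal sketch that evaluates this joint probability for a toy tagging model; the probability tables are invented for illustration:

# Toy HMM: hidden states are tags, observations are words.
P_init  = {"N": 0.6, "V": 0.4}                       # P(x_1)
P_trans = {("N", "V"): 0.7, ("N", "N"): 0.3,         # P(x_j | x_{j-1})
           ("V", "N"): 0.8, ("V", "V"): 0.2}
P_emit  = {("N", "dogs"): 0.2, ("V", "bark"): 0.3,   # P(y_j | x_j)
           ("N", "bark"): 0.05, ("V", "dogs"): 0.01}

def joint(words, tags):
    # P(y_1..y_n, x_1..x_n) = P(x_1) P(y_1|x_1) * prod_j P(y_j|x_j) P(x_j|x_{j-1})
    p = P_init[tags[0]] * P_emit[(tags[0], words[0])]
    for j in range(1, len(words)):
        p *= P_trans[(tags[j - 1], tags[j])] * P_emit[(tags[j], words[j])]
    return p

print(joint(["dogs", "bark"], ["N", "V"]))  # 0.6 * 0.2 * 0.7 * 0.3 = 0.0252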

32
Q

Text Representation

A

The task of finding texts relevant to a user’s query. May involve finding a result set of documents matching the user’s query after searching a corpus of documents.

33
Q

Term-Document Matrix

A

Each document is represented by a vector \boldsymbol{u} = (u_1,u_2,…,u_n), where u_j is the weight of term t_j in document \boldsymbol{u}.

u_j can be Boolean or a real value associated with the term frequency.

34
Q

Cosine Distance

A

d(\boldsymbol{u}, \boldsymbol{v}) = \frac{\sum_{i=1}^n u_i v_i}{\sqrt{\sum_{i=1}^n u_i^2}\sqrt{\sum_{i=1}^n v_i^2}}. The smaller the angle, the higher the similarity measure.
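
A small sketch of the formula on two term-frequency vectors (my own toy example):

import math

def cosine_similarity(u, v):
    # sum(u_i * v_i) / (||u|| * ||v||); closer to 1 means more similar documents.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two documents over the term order (data, mining, text).
print(round(cosine_similarity([3, 1, 0], [2, 1, 1]), 3))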

35
Q

Semantic Context

A

The meaning of the texts with respect to their context. Preserving semantic meaning is a big problem in NLP.

36
Q

Polysemy

A

When a single word can have different meanings.

37
Q

Synonym

A

When different words have the same meaning.

38
Q

Information Retrieval

A

In the context of NLP, similarity is preferred over precision (equality or inequality). Methods tend to be interactive, with user feedback about the relevance of results.

39
Q

Retrieval by Content

A

Find k objects that are most similar to an object query. Queries can be represented as term vectors.

40
Q

Case Folding

A

Convert terms to lowercase.

41
Q

Stemming

A

Reducing all words to their stem.

42
Q

Word Embeddings

A

Use similar representations for words with similar meanings.

43
Q

Metadata Parsing

A

Exploit data outside the documents.

44
Q

Latent Semantic Indexing (LSI)

A

LSI assumes the existence of a hidden semantic structure in documents. It uses PCA on the m \times n document-term matrix to compute an m \times k matrix of lower dimension. It computes the k principal component directions in the m \times n space, finding terms that appear frequently together and forming them into a single principal component.
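
A rough sketch of the dimensionality-reduction step using a truncated SVD on a tiny document-term matrix; plain numpy is used here for illustration, whereas a real system would use a dedicated library:

import numpy as np

# m x n document-term matrix (rows = documents, columns = terms).
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]], dtype=float)

k = 2                                    # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] * s[:k]                   # m x k representation of the documents

print(X_k.round(2))                      # documents in the reduced latent space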

45
Q

PageRank Algorithm

A

This is Google’s algorithm for web search. It finds pages relevant to a query, but also ranks them. Pages with many in-links are ranked higher, and links from high-ranked pages are weighted more heavily.

Initialise each page p with R(p) = 1. Iteratively update the page ranks as follows until convergence:

R(p) = \frac{1-d}{n} + d \sum_{i=1}^l \frac{R(in_i)}{C(in_i)}

where n is the total number of pages, in_1,…,in_l are the pages linking to p, C(in_i) is the number of out-links of page in_i, and d is a damping factor.
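
A hedged sketch of this iteration on a tiny link graph; the graph and the damping value d = 0.85 are made up for illustration:

# Directed link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
n = len(pages)
d = 0.85                                   # damping factor

rank = {p: 1.0 for p in pages}             # initialise R(p) = 1
for _ in range(50):                        # iterate until (approximately) converged
    new_rank = {}
    for p in pages:
        incoming = [q for q in pages if p in links[q]]
        new_rank[p] = (1 - d) / n + d * sum(rank[q] / len(links[q]) for q in incoming)
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})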

46
Q

Precision

A

The proportion of documents in the result set that are actually relevant.

True Positive / (True Positive + False Positive)

Note that false positives are the retrieved documents that aren’t relevant.

47
Q

Recall

A

The proportion of relevant documents appearing in the result set.

True Positive / (True Positive + False Negative)

48
Q

Precision-recall Curve

A

Plotting precision against recall. It’s usually a negatively sloped curve.

49
Q

F1 Score

A

\frac{2 \cdot precision \cdot recall}{precision+recall}
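
A small worked example covering precision, recall, and F1 for one query, with made-up result and relevance sets:

retrieved = {"d1", "d2", "d3", "d4"}           # result set returned by the system
relevant  = {"d1", "d3", "d5"}                 # documents that are actually relevant

tp = len(retrieved & relevant)                 # retrieved and relevant
fp = len(retrieved - relevant)                 # retrieved but not relevant
fn = len(relevant - retrieved)                 # relevant but missed

precision = tp / (tp + fp)                     # 2 / 4 = 0.5
recall    = tp / (tp + fn)                     # 2 / 3, about 0.667
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 3))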

50
Q

TF-IDF Score

A

TF-IDF highlights rare query terms.

TF-IDF = \sum_{j=1}^k TF(t_j) \cdot IDF(t_j)

TF(t_j) is the number of times term t_j appears in the document.

51
Q

IDF

A

IDF(t_j) = \log \left( \frac{1+n}{1+n_j} + 1 \right)

where n is the total number of documents and n_j is the number of documents containing term t_j.
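
A sketch that scores one document against a query using the TF and smoothed IDF definitions above; the toy corpus is invented, and preprocessing such as case folding is skipped:

import math

docs = [
    "text mining finds patterns in text",
    "data mining finds patterns in data",
    "the cat sat on the mat",
]
n = len(docs)

def tf(term, doc):
    # TF(t_j): number of times term t_j appears in the document.
    return doc.split().count(term)

def idf(term):
    # IDF(t_j) = log((1 + n) / (1 + n_j) + 1), n_j = number of documents containing t_j.
    n_j = sum(term in doc.split() for doc in docs)
    return math.log((1 + n) / (1 + n_j) + 1)

def tfidf_score(query, doc):
    # Sum of TF(t_j) * IDF(t_j) over the query terms t_1..t_k.
    return sum(tf(t, doc) * idf(t) for t in query.split())

print(round(tfidf_score("text patterns", docs[0]), 3))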