Week 8: Clustering and Text Mining Flashcards

1
Q

What are other types of embeddings in R, after word2vec?

A

GloVe, fastText

2
Q

What type of Gaussian mixture model is k-means?

A

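A Gaussian mixture model with all prior class proportions equal to 1/K, the EII covariance model (spherical clusters of equal volume), and all posteriors either 0 or 1 (hard assignment)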

3
Q

What are embeddings?

A

Short, dense vectors that can be used to represent words

4
Q

What is the intuition of skip-gram (word2vec)?

A
  1. Treat the target word and a neighbouring context word as positive examples
  2. Randomly sample other words in the lexicon to get negative samples
  3. Use logistic regression to train a classifier to distinguish those two cases
  4. Use the learned weights as the embeddings
5
Q

What is case folding?

A

A kind of normalisation that maps everything to lowercase

6
Q

What is the purpose of tf-idf?

A

Raw frequency is not the best measure of association between words because words like “the” and “good” occur frequently and aren’t informative. tf-idf gives more weight to words that appear in fewer documents

7
Q

What are some popular lexicons for sentiment analysis? (3)

A

AFINN, NRC, bing

8
Q

Code to turn a document term matrix into a dataframe in R?

A

tidy(dtm) #tidy() method from the tidytext package; dtm is the document term matrix

9
Q

How do you calculate tf, idf and tf-idf?

A

term frequency: the number of times term t occurs in document d, divided by the total number of terms in d
tf(t,d) = count(t,d) / (total number of terms in d)

inverse document frequency: the log of the total number of documents N divided by the number of documents df(t) that contain term t
idf(t) = log10(N / df(t))

tf-idf(t,d) = tf(t,d) * idf(t)
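
The formulas above can be checked by hand on a toy corpus. This is a minimal base R sketch; the three documents and the helper names tf, idf and tf_idf are made up for illustration and are not from any package.

```r
# Toy corpus: three tiny tokenised "documents" (hypothetical data)
docs <- list(
  d1 = c("the", "cat", "sat"),
  d2 = c("the", "dog", "sat"),
  d3 = c("the", "cat", "ran", "fast")
)

# tf(t, d) = count(t, d) / total number of terms in d
tf <- function(term, doc) sum(doc == term) / length(doc)

# idf(t) = log10(N / df(t)), where df(t) = number of documents containing t
idf <- function(term, docs) {
  df <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  log10(length(docs) / df)
}

# tf-idf(t, d) = tf(t, d) * idf(t)
tf_idf <- function(term, doc, docs) tf(term, doc) * idf(term, docs)

tf_idf("the", docs$d1, docs)  # 0, because "the" occurs in every document
tf_idf("ran", docs$d3, docs)  # (1/4) * log10(3/1)
```

A word that appears in every document gets idf = log10(1) = 0, which is how tf-idf discounts uninformative words.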

10
Q

How do you create a vocabulary of words in R?

A
it = itoken(wordvector) #create an iterator over the tokens (text2vec package)
full_vocab = create_vocabulary(it) #create full vocabulary
11
Q

What is stemming?

A

A naive version of morphological analysis that consists of chopping off word-final affixes

12
Q

What is a term-document matrix?

A
  • each row represents a word in the vocabulary
  • each column represents a document
  • each cell represents the number of times a particular word occurs in a particular document
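
A term-document matrix as described above can be sketched in base R with table(). The two documents are hypothetical toy data; real workflows would more likely use a package such as tm or tidytext.

```r
# Two tokenised toy documents (hypothetical data)
docs <- list(
  d1 = c("the", "cat", "sat"),
  d2 = c("the", "dog", "sat", "sat")
)

tokens  <- unlist(docs, use.names = FALSE)   # all tokens in order
doc_ids <- rep(names(docs), lengths(docs))   # document id of each token

# Rows = words in the vocabulary, columns = documents, cells = counts
tdm <- table(term = tokens, document = doc_ids)

tdm["sat", "d2"]  # 2
tdm["cat", "d2"]  # 0

# Transposing gives the document-term matrix (rows = documents)
dtm <- t(tdm)
```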
13
Q

What assumption does skip-gram make?

A

all context words are independent

14
Q

How do you create a token co-occurrence matrix in R?

A

create_tcm(it, vectorizer, skip_grams_window = 5)

15
Q

How does the skip-gram algorithm adjust during training?

A

Adjust the initial embeddings to maximise the similarity (dot product) of the (w, c_pos) pairs drawn from the positive examples and minimise the similarity (dot product) of the (w, c_neg) pairs from the negative examples

16
Q

What are other names for LCA and LPA?

A
LCA = binomial mixture model 
LPA = gaussian mixture model
17
Q

What does the likelihood tell us?

A

How well the model fits the data

18
Q

Code to create a corpus in R?

A

Corpus(textsource)

19
Q

How do you use grep to match and count matches from data? (3 ways)

A

grep("expression", data) #returns the indices of the elements of data that match
grep("expression", data, value = TRUE) #returns the matching elements themselves (the whole string, not just the matching part)
length(grep("expression", data)) #returns the number of elements of data that match
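
A quick base R illustration of the three calls, using a made-up character vector:

```r
data <- c("apple pie", "banana", "pineapple", "cherry")  # toy data

idx     <- grep("apple", data)                # 1 3
matched <- grep("apple", data, value = TRUE)  # "apple pie" "pineapple"
n_match <- length(grep("apple", data))        # 2
```

Note that the count is the number of matching elements, not the number of individual matches inside each string.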

20
Q

What is the assumption of gaussian mixture modelling?

A

The data within each cluster is normally distributed

21
Q

What are vector semantics for words?

A

represent a word as a point in multidimensional semantic space that is derived from the distributions of word neighbours

22
Q

What are the 2 things that ^ represents in RE?

A

Caret ^ matches the start of a line
or negates the contents of [] if it is [^…]

23
Q

How do you use regexpr and gregexpr and regmatches to match data?

A

r = regexpr("expression", data) #gives the position and length of the first match in each element of data
r = gregexpr("expression", data) #gives the position and length of every match in each element of data
regmatches(data, r) #returns the actual matched substrings found by regexpr or gregexpr
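
A small base R sketch of the difference between the two, on a made-up vector:

```r
data <- c("a1b22", "no digits", "333")  # toy data

# First run of digits in each element; elements without a match are dropped
r_first <- regexpr("[0-9]+", data)
first   <- regmatches(data, r_first)   # "1" "333"

# Every run of digits; the result is a list with one entry per element
r_all <- gregexpr("[0-9]+", data)
all_m <- regmatches(data, r_all)       # list(c("1", "22"), character(0), "333")
```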

24
Q

What is lemmatisation?

A

the task of determining that two words have the same root, despite their surface differences

25
Q

What are the terms of tf-idf?

A
tf = term frequency 
idf = inverse document frequency
26
Q

What are some forms of text representation? (4)

A
  • Bag-of-words: word count or word proportion for each word
  • Time-series: label each token and put in order
  • Tf-idf, embeddings
27
Q

What is a document-term matrix (DTM)?

A

Each row is a document, each column is a word

28
Q

What are 3 types of text normalisation?

A
  • tokenising (segmenting) words
  • normalising word formats
  • segmenting sentences
29
Q

What is the distributional hypothesis?

A

words that occur in similar contexts tend to have similar meanings

30
Q

What are the Kleene * and Kleene + in RE?

A

* zero or more occurrences of the immediately previous character or regular expression
+ one or more occurrences of the immediately previous character or regular expression

31
Q

Code to create a document term matrix in R?

A

DocumentTermMatrix(corpus)

32
Q

What is a clitic?

A

A clitic is a part of a word that can’t stand on its own and can only occur attached to another word, e.g. the ’re in we’re

33
Q

What is gaussian mixture modelling trying to find?

A

Hidden groups within the data that are not recorded

34
Q

What is Zipf’s law?

A

the frequency that a word appears is inversely proportional to its rank

35
Q

What is information retrieval?

A

the task of finding the document d from the D documents in some collection that best matches a query q

36
Q

What does the value of the cosine metric represent?

A

The cosine value ranges from 1 for vectors pointing in the same direction, through 0 for orthogonal vectors, to -1 for opposite vectors

37
Q

How do you specify a range in RE?

A

A hyphen - inside square brackets, e.g. [2-5]

38
Q

Code for implementing mclust in R?

A

Mclust(data, G = 2, modelNames = "E")

39
Q

What is a regular expression?

A

An algebraic notation for characterising a set of strings

40
Q

What type of vector representations come from a term-term matrix?

A

Sparse vectors, because most entries are 0

41
Q

What is word normalisation?

A

the task of putting words/tokens in a standard format

42
Q

How do you calculate the cosine similarity in R?

A

sim2(matrix1, matrix2, method = "cosine", norm = "l2") #sim2() is from the text2vec package

43
Q

What are \b and \B in RE?

A

\b matches word boundary
\B matches non-word boundary

44
Q

What is the precedence hierarchy in RE?

A
  1. Parentheses ()
  2. Counters * + ? {}
  3. Sequences and anchors ^ $
  4. Disjunction |
45
Q

What do {n} and {m,n} mean in RE?

A

{n}: exactly n occurrences of the previous char or expression
{m,n}: from m to n occurrences of the previous char or expression

46
Q

What is the classifier to train for skip-grams?

A

Train a classifier such that, given a tuple (w,c) of a target word w paired with a candidate context word c, it returns the probability that c is a real context word, P(+|w,c)

47
Q

What is a token learner and a token segmenter?

A
  • A token learner takes a raw training corpus and induces a vocabulary, a set of tokens
  • A token segmenter takes a raw test sentence and segments it into the tokens in the vocabulary
48
Q

What does Penn Treebank tokenisation do?

A

It separates out clitics (doesn’t becomes does n’t), keeps hyphenated words together, and separates out all punctuation

49
Q

What are two properties of word embeddings?

A
  • Analogies: analogies show up as offsets in vector space, e.g. king - man + woman ≈ queen
  • Bias: semantics derived automatically from language corpora contain human-like biases
50
Q

In multivariate model based clustering with 2 observed features, what do the mean and standard deviation become?

A

Mean becomes a vector of 2 means
Standard deviation becomes a 2x2 variance covariance matrix determining the shape of the cluster

51
Q

What is the likelihood, defined by the statistical model and the assumption?

A

p(data|parameters) = p(y|theta)

52
Q

When you have fitted an Mclust model to fit_mclust in R, what information can you obtain from the result?

A

fit_mclust$parameters #gives means, variances, proportions, modelName
fit_mclust$BIC #gives the BIC for each candidate model and shows the top models. mclust defines BIC as 2*loglik - m*log(n), the negative of the usual BIC, so higher is better there
fit_mclust$loglik #gives the log likelihood, which can be used to calculate the BIC manually
fit_mclust$classification #gives the cluster classification vector
fit_mclust$uncertainty #gives the uncertainty of each point

plot(fit_mclust, what = "density") #gives density plot

53
Q

What is ? in RE?

A

the preceding character or nothing
e.g. colou?r

54
Q

What is inflectional stemming and stemming to root?

A
  • Inflectional stemming: remove plurals, normalise verb tenses, remove other affixes
  • Stemming to root: reduce word to most basic element
55
Q

How can you tokenise text in R?

A

library(tidytext)
unnest_tokens(data, outputcolumn, inputcolumn) #returns one token per row; by default it lowercases the text and removes punctuation

56
Q

What is an advantage of word2vec/skip grams?

A

Self-supervised learning. Don’t need labels

57
Q

What is the most sophisticated way of lemmatisation?

A

complete morphological parsing of the word
ie. takes cats and parses it into the two morphemes cat and s

58
Q

What is the period . in RE?

A

A wildcard expression that matches any single character

59
Q

What is the Bayesian information criterion (BIC) formula?

A

BIC = -2.log(l) + m.log(n)

l = likelihood = p(data|theta) 
-2.log(l) = deviance = reconstruction loss = fit 
m = number of parameters 
n = number of observations/examples 
m.log(n) = file size = complexity
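
The BIC formula above can be computed by hand in base R. This sketch assumes the simplest possible model, a single normal distribution (m = 2 parameters: mean and sd) fitted by maximum likelihood to simulated data; the data and variable names are illustrative only.

```r
set.seed(1)
y <- rnorm(50, mean = 10, sd = 2)   # simulated toy data
n <- length(y)                      # number of observations

# Maximum-likelihood estimates for one normal distribution
mu_hat <- mean(y)
sd_hat <- sqrt(mean((y - mu_hat)^2))  # ML estimate divides by n, not n - 1

# log(l): log likelihood of the data given the estimated parameters
loglik <- sum(dnorm(y, mean = mu_hat, sd = sd_hat, log = TRUE))

m <- 2                               # number of parameters (mean and sd)
bic <- -2 * loglik + m * log(n)      # fit + complexity penalty

# Cross-check against R's built-in BIC() for an intercept-only linear model,
# which is the same single-Gaussian model
bic_builtin <- BIC(lm(y ~ 1))
```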
60
Q

What are the stringr options for matching strings?

A

str_view("string", "expression") #displays the string with the match highlighted (older stringr versions show only the first match)
str_detect("string", "expression") #returns TRUE/FALSE depending on whether the string matches the expression
str_extract("string", "expression") #extracts the first match
str_extract_all("string", "expression") #extracts all matches (returns a list with one character vector per input string)
str_match_all("string", "expression") #similar to str_extract_all except the output is a matrix with a column for the full match and one for each capture group

61
Q

What are the problems with sentiment analysis? (4)

A
  • Not great with longer tasks
  • negation
  • context-dependency
  • You need partly labelled data
62
Q

What do LCA and LPA stand for? What are they types of?

A
latent class analysis
latent profile analysis
types of model based clustering
63
Q

What is morphology?

A

Morphology is how words are built up from stems (the central morpheme of the word) and affixes

64
Q

What is a corpus and a lexicon?

A

Corpus = a collection of documents (our whole dataset)
Lexicon: set of all unique words in a corpus

65
Q

How does the byte-pair encoding algorithm work for tokenisation?

A
  • it is a token learner
  • starts with vocabulary of all characters
  • chooses the two symbols that are most frequently adjacent, adds a new merged symbol to the vocabulary and replaces every adjacent pair in the corpus with the new merged symbol
  • continues to count and merge, creating new longer and longer character strings, until k merges have been done creating k novel tokens
66
Q

What are the steps of the expectation maximisation (EM) algorithm?

A
  1. Guess the parameters
  2. Work out the posterior probability of each observation being M/F, assuming normality (E step)
  3. Update the parameters (M step)
    * repeat steps 2 and 3 until the parameters stop changing
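
The EM steps can be sketched in base R for a two-component, one-dimensional Gaussian mixture. The simulated "heights", the starting values and the convergence tolerance are all assumptions made for illustration; in practice mclust does this for you.

```r
set.seed(42)
# Simulated toy data: two overlapping groups (e.g. heights in cm)
x <- c(rnorm(200, mean = 165, sd = 6), rnorm(200, mean = 180, sd = 7))

# 1. Guess the parameters
p1    <- 0.5           # proportion of class 1
mu    <- c(160, 185)   # class means
sigma <- c(10, 10)     # class standard deviations

for (iter in 1:200) {
  # 2. E step: posterior probability that each point belongs to class 1
  d1    <- p1 * dnorm(x, mu[1], sigma[1])
  d2    <- (1 - p1) * dnorm(x, mu[2], sigma[2])
  post1 <- d1 / (d1 + d2)
  post2 <- 1 - post1

  # 3. M step: update parameters using the posteriors as weights
  p1_new    <- mean(post1)
  mu_new    <- c(sum(post1 * x) / sum(post1),
                 sum(post2 * x) / sum(post2))
  sigma_new <- c(sqrt(sum(post1 * (x - mu_new[1])^2) / sum(post1)),
                 sqrt(sum(post2 * (x - mu_new[2])^2) / sum(post2)))

  # Stop when the parameters have (almost) stopped changing
  change <- max(abs(c(p1_new - p1, mu_new - mu, sigma_new - sigma)))
  p1 <- p1_new; mu <- mu_new; sigma <- sigma_new
  if (change < 1e-8) break
}
```

After the loop, p1, mu and sigma hold the estimated proportion, means and standard deviations, and post1 holds the posterior class-1 probability of each observation.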
67
Q

What are \d and \D in RE?

A

\d is any digit
\D is any non-digit

68
Q

What is the probability that c is a context word P(+|w,c)?

A

P(+|w,c) = σ(c.w) = 1/(1 + exp(-c.w))
where . is the dot product and σ is the logistic (sigmoid) function
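
A minimal base R sketch of this probability. The three-dimensional vectors w, c_pos and c_neg are made-up numbers, not learned embeddings.

```r
# Logistic (sigmoid) function
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical 3-dimensional embeddings
w     <- c(0.5, -1.0, 2.0)    # target word
c_pos <- c(0.4, -0.8, 1.5)    # a context word pointing in a similar direction
c_neg <- c(-0.3, 0.9, -1.2)   # a noise word pointing roughly the opposite way

# P(+ | w, c) = sigmoid of the dot product c . w
p_pos <- sigmoid(sum(c_pos * w))  # dot product is 4, so close to 1
p_neg <- sigmoid(sum(c_neg * w))  # dot product is -3.45, so close to 0
```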

69
Q

What is $ in RE?

A

matches the end of a line

70
Q

What is the overall sentiment in sentiment analysis?

A

Sentiment = total positive words - total negative words
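
A lexicon-based score of this kind can be sketched in base R. The four-word lexicon and the token vector are hypothetical; real analyses would use a lexicon such as AFINN, NRC or bing.

```r
# Hypothetical mini-lexicon: +1 for positive words, -1 for negative words
lexicon <- c(good = 1, great = 1, bad = -1, awful = -1)

# A tokenised toy review
tokens <- c("the", "food", "was", "good", "but",
            "the", "service", "was", "awful", "awful")

# Look up each token; words not in the lexicon give NA
scores <- lexicon[tokens]

n_pos <- sum(scores == 1,  na.rm = TRUE)   # 1
n_neg <- sum(scores == -1, na.rm = TRUE)   # 2

sentiment <- n_pos - n_neg                 # -1: negative overall
```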

71
Q

What is the basic workflow for text analysis? (5 steps)

A
  1. Get some text
  2. Organise text into corpus
  3. Pre-process
  4. Create representation
  5. Perform analysis as usual
72
Q

What is sentiment analysis?

A

task of classifying the polarity of a given text (ie is it a good/bad review)

73
Q

When will two column vectors be similar in a term-document matrix?

A

When the documents contain similar words

74
Q

What is tokenisation?

A

Tokenisation is the task of segmenting running text into words

75
Q

What are some typical steps in text pre-processing? (8)

A
  • Stemming (running -> run)
  • Lemmatisation (were -> be)
  • lower casing
  • Stop word removal
  • Punctuation removal
  • Number removal
  • Spell correction
  • Tokenisation
76
Q

What is the formula for the cosine similarity measure?

A

cosine(v,w) = (v.w)/(|v||w|) (where . is dot product)
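
The formula translates directly into base R. The function name cosine and the example vectors are chosen for illustration.

```r
# cosine(v, w) = (v . w) / (|v| |w|)
cosine <- function(v, w) {
  sum(v * w) / (sqrt(sum(v^2)) * sqrt(sum(w^2)))
}

same_dir   <- cosine(c(1, 2, 3), c(2, 4, 6))   # 1 (up to rounding): same direction
orthogonal <- cosine(c(1, 0), c(0, 1))         # 0: orthogonal
opposite   <- cosine(c(1, 2), c(-1, -2))       # -1 (up to rounding): opposite direction
```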

77
Q

How can you remove stop words in R?

A

anti_join(data, stop_words) #removes stop words on an unnest_tokens object

78
Q

Where does Zipf’s law often deviate?

A

There are often deviations at high ranks, as a corpus often contains fewer rare words than predicted by a single power law

79
Q

What are \n and \t in RE?

A

\n is newline
\t is tab

80
Q

What does [] mean in RE?

A

Any one of the characters within the brackets can match, e.g. gr[ae]y matches gray or grey

81
Q

When will two row vectors be similar in a term-document matrix?

A

Similar rows mean that the words are similar because they occur in similar documents

82
Q

How does mclust select a model in R?

A

mclust fits all the models with up to specified number of clusters, computes the BIC of each model and chooses the model with the best BIC

83
Q

what does data %>% pull(variable) do in R?

A

same as data$variable

84
Q

How does latent class analysis work?

A

LCA lets the variables follow any distribution, as long as they are unrelated to each other (independent) within classes.

85
Q

How do you get the dot product of 2 vectors?

A

v.w = v1w1 + v2w2 + … + vNwN

86
Q

What is the number of parameters in a multivariate Gaussian mixture model?

A

pi(k,X): number of classes -1 : K-1
mu(k): K*p (p is number of features)
var: K*p (or just p when variances are equal over classes)
covariances: K*p*(p-1)/2 (or p*(p-1)/2 when covariances are equal over classes) (or 0 when variables are uncorrelated, spherical clusters)

m = (K-1) + Kp + Kp + Kp(p-1)/2
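
The count for the fully unconstrained model can be wrapped in a small base R helper. The function name gmm_n_params is hypothetical, and the sketch covers only the case where variances and covariances differ across classes.

```r
# Number of parameters m for a K-class, p-feature Gaussian mixture
# with class-specific variances and covariances
gmm_n_params <- function(K, p) {
  proportions <- K - 1                 # class proportions sum to 1
  means       <- K * p
  variances   <- K * p
  covariances <- K * p * (p - 1) / 2
  proportions + means + variances + covariances
}

m_2x2 <- gmm_n_params(K = 2, p = 2)  # 1 + 4 + 4 + 2 = 11
m_3x1 <- gmm_n_params(K = 3, p = 1)  # 2 + 3 + 3 + 0 = 8
```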

87
Q

What do sub and gsub do?

A

replace matching expression with replacement

sub("expression", "replacement", data) (only the first match in each element)
gsub("expression", "replacement", data) (all matches)
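
A short base R comparison on a made-up vector:

```r
data <- c("a-b-c", "x-y")  # toy data

first_only <- sub("-", "+", data)    # "a+b-c" "x+y"
all_of     <- gsub("-", "+", data)   # "a+b+c" "x+y"
```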

88
Q

What is the overall probability curve made up of?

A

The weighted sum of two normal curves

89
Q

How do you map words to indices in R? ie create a vectorizer

A

vectorizer = vocab_vectorizer(small_vocab)

90
Q

What is sentence segmentation?

A

segmenting a text into sentences

91
Q

What is a term-term matrix or word co-occurrence matrix?

A

Columns and rows both represent words

  • each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus
  • the context could be a document or smaller such as a window around the word
92
Q

What are all the parameters learned in skip-gram?

A

two matrices W and C each containing an embedding for every one of the |V| words in the vocabulary V

93
Q

How do you do an escape in R?

A

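\ (backslash); inside an R string the backslash itself must be escaped, e.g. "\\." matches a literal period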

94
Q

How do you get the tf-idf in R?

A

bind_tf_idf(data, term, document, n) #data is a tidy text dataset with one row per token per document; term, document and n name the columns holding the tokens, the document ids and the counts

95
Q

What are 3 alternatives to the BIC?

A

AIC: Akaike information criterion - same as BIC but the penalty is 2m
AIC3: same as AIC but the penalty is 3m
ICL: integrated completed likelihood - same as BIC but the reconstruction loss includes the assigned clusters

96
Q

How can you say or in RE?

A

pipe |

97
Q

What is the gaussian mixture model formula?

A
p(x) = pi(1,x) Normal(mean1, var1) + (1-pi(1,x)) Normal(mean2, var2) 
where pi(1,x) is the probability that variable x takes on value 1 (e.g. probability that person is man) 
ie proportion of values expected in each cluster
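
The two-component formula can be written as a base R function. The function name and the example parameters are assumptions for illustration; note that dnorm() takes a standard deviation rather than a variance.

```r
# p(x) = pi1 * Normal(mean1, sd1) + (1 - pi1) * Normal(mean2, sd2)
mixture_density <- function(x, pi1, mean1, sd1, mean2, sd2) {
  pi1 * dnorm(x, mean1, sd1) + (1 - pi1) * dnorm(x, mean2, sd2)
}

# Hypothetical parameters: 40% of the data in cluster 1
f <- function(x) mixture_density(x, pi1 = 0.4,
                                 mean1 = 165, sd1 = 6,
                                 mean2 = 180, sd2 = 7)

# A weighted sum of two densities is itself a density: it integrates to about 1
total_area <- integrate(f, lower = 100, upper = 250)$value
```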
98
Q

What are the identifiers for each parameterisation of a GMM and what do they measure?

A

E for equal, V for variable, I for identity matrix

  • Volume (size of clusters in data space)
  • Shape (circle or ellipse)
  • Orientation (the angle of the ellipse)

E.g. VVE model has variable volume, variable shape, equal orientation

99
Q

What does reducing the error rate of an RE involve?

A
  • Increasing precision: minimising false positives (strings that were incorrectly matched)
  • Increasing recall: minimising false negatives (strings that were incorrectly missed)
100
Q

How do you get part of a string and reference back to that part in an RE?

A

Use capture group to store the expression in memory
the (.*)er they were, the \1er they will be
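
The backreference pattern above can be tried in base R. Backslashes are doubled because the pattern is written inside an R string; the test sentences are made up.

```r
pattern <- "the (.*)er they were, the \\1er they will be"

# \\1 must repeat whatever the first capture group matched
same_word <- grepl(pattern, "the bigger they were, the bigger they will be")  # TRUE
diff_word <- grepl(pattern, "the bigger they were, the faster they will be")  # FALSE

# Capture groups can also be referenced in a replacement
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")  # "world hello"
```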

101
Q

How can you identify clusters that are not ellipses using the GMM?

A

start with the usual Gaussian mixture solution, merge similar components to create non-Gaussian clusters

102
Q

How is the posterior probability calculated based on the current estimates of mean and sd?

A

pi(man,x) = height of the weighted male curve at x / height of the total (mixture) curve at x

103
Q

What is the aim of the BIC?

A

trade-off between fit (reconstruction loss) and complexity (file size)
the lower the better

104
Q

What is the code for merging components of clusters in R?

A

clustCombi(data=x)