Week 8: Clustering and Text Mining Flashcards

Question

What are the terms of tf-idf?

Answer 1

```
tf = term frequency
idf = inverse document frequency
```

Answer 2

- Bag-of-words: word count or word proportion for each word
- Time-series: label each token and put in order
- Tf-idf, embeddings

Answer 3

Each row is a document, each column is a word
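
A minimal base R sketch of such a matrix, assuming a hypothetical two-document toy corpus that is lower-cased and split on whitespace (the names `docs`, `tokens`, `vocab` and `dtm` are illustrative only):

```r
# Hypothetical toy corpus: two short documents
docs <- c(d1 = "the cat sat", d2 = "the dog sat on the cat")

# Lower-case and split on whitespace to get tokens per document
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per word, cells hold counts
dtm <- matrix(0L, nrow = length(tokens), ncol = length(vocab),
              dimnames = list(names(tokens), vocab))
for (d in names(tokens)) {
  counts <- table(tokens[[d]])
  dtm[d, names(counts)] <- as.integer(counts)
}
print(dtm)
```

Most cells in a realistic corpus would be 0, which is why such matrices are usually stored in a sparse format.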

Answer 4

- tokenising (segmenting) words
- normalising word formats
- segmenting sentences

Answer 5

words that occur in similar contexts tend to have similar meanings

Answer 6

\*: zero or more occurrences of the immediately previous character or regular expression
+: one or more occurrences of the immediately previous character or regular expression

Answer 7

DocumentTermMatrix(corpus)

Answer 8

A clitic is a part of a word that can’t stand on its own and can only occur attached to another word, e.g. the ’re in we’re

Answer 9

Hidden groups within the data that are not recorded

Answer 10

the frequency that a word appears is inversely proportional to its rank

Answer 11

the task of finding the document d from the D documents in some collection that best matches a query q

Answer 12

The cosine value ranges from 1 for vectors pointing in the same direction, through 0 for orthogonal vectors, to -1 for opposite vectors

Answer 13

"-" e.g. [2-5]

Answer 14

Mclust(data, G = 2, modelNames = "E")

Answer 15

an algebraic notation for characterising a set of strings

Answer 16

Sparse because most are 0

Answer 17

the task of putting words/tokens in a standard format

Answer 18

sim2(matrix1, matrix2, method = "cosine", norm = "l2")

Answer 19

\b matches a word boundary
\B matches a non-word boundary

Answer 20

1. Parenthesis ()
2. Counters \* + ? {}
3. Sequences and anchors ^ $
4. Disjunction |

Answer 21

{n}: exactly n occurrences of the previous char or expression
{n,m}: from n to m occurrences of the previous char or expression

Answer 22

train a classifier such that, given a tuple (w,c) of a target word w paired with a candidate context word c, it returns the probability that c is a real context word, P(+|w,c)

Answer 23

- A token learner takes a raw training corpus and induces a vocabulary, a set of tokens
- A token segmenter takes a raw test sentence and segments it into the tokens in the vocabulary

Answer 24

It separates out clitics (doesn’t becomes does n’t), keeps hyphenated words together, and separates out all punctuation

Answer 25

- Analogies: look at the analogies in vector space, e.g. king - man + woman = queen
- Bias: semantics derived automatically from language corpora contain human-like biases

Answer 26

Mean becomes a vector of 2 means
Standard deviation becomes a 2x2 variance-covariance matrix determining the shape of the cluster

Answer 27

p(data|parameters) = p(y|theta)

Answer 28

fit\_mclust$parameters #gives means, variances, proportions, modelName
fit\_mclust$BIC #gives the BIC for each model option. mclust reports BIC with the opposite sign to the usual definition, so values are typically negative; flip the sign to compare with the usual BIC, where lowest is best
fit\_mclust$loglik #gives the log likelihood, which can be used to calculate BIC manually
fit\_mclust$classification #gives the cluster classification vector
fit\_mclust$uncertainty #gives the uncertainty of each point
plot(fit\_mclust, "density") #gives density plot

Answer 29

the preceding character or nothing e.g. colou?r
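
A small base R check of the optional-character counter, assuming a hypothetical vector of spellings and anchoring the pattern with ^ and $ so the whole string must match:

```r
# Hypothetical test strings
words <- c("color", "colour", "colouur", "colr")

# "u?" means zero or one "u" between "colo" and "r"
matches_optional_u <- grepl("^colou?r$", words)
print(matches_optional_u)
```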

Answer 30

- Inflectional stemming: remove plurals, normalise verb tenses, remove other affixes
- Stemming to root: reduce word to most basic element

Answer 31

library(tidytext)
unnest\_tokens(data, outputcolumn, inputcolumn) #puts one token per row and automatically removes punctuation

Answer 32

Self-supervised learning. Don't need labels

Answer 33

complete morphological parsing of the word, i.e. it takes cats and parses it into the two morphemes cat and s

Answer 34

A wildcard expression that matches any single character

Answer 35

BIC = -2.log(l) + m.log(n)
```
l = likelihood = p(data|theta)
-2.log(l) = deviance = reconstruction loss = fit
m = number of parameters
n = number of observations/examples
m.log(n) = file size = complexity
```

Answer 36

str\_view("string", "expression") #shows only the first match
str\_detect("string", "expression") #returns true/false depending on whether the string matches the expression
str\_extract("string", "expression") #extracts the first match
str\_extract\_all("string", "expression") #extracts all matches into a vector
str\_match\_all("string", "expression") #similar to str\_extract\_all except output is a matrix with a column for each capture group

Answer 37

- Not great with longer tasks
- negation
- context-dependency
- You need partly labelled data

Answer 38

```
latent class analysis
latent profile analysis
types of model based clustering
```

Answer 39

Morphology is how words are built up from stems (the central morpheme of the word) and affixes

Answer 40

Corpus: a collection of documents (our whole dataset)
Lexicon: set of all unique words in a corpus

Answer 41

- it is a token learner
- starts with a vocabulary of all characters
- chooses the two symbols that are most frequently adjacent, adds a new merged symbol to the vocabulary and replaces every adjacent pair in the corpus with the new merged symbol
- continues to count and merge, creating longer and longer character strings, until k merges have been done, creating k novel tokens
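
The steps above can be sketched in base R. This is a simplified illustration, not a reference implementation: it assumes the input words contain no spaces, ignores end-of-word markers and word frequencies, and breaks ties by table order. The function name `bpe_learn` is hypothetical.

```r
# Simplified token learner: merge the most frequent adjacent pair k times
bpe_learn <- function(words, k) {
  corpus <- strsplit(words, "")            # start from single characters
  vocab <- sort(unique(unlist(corpus)))
  merges <- character(0)
  for (i in seq_len(k)) {
    # Count every adjacent symbol pair across the corpus
    pairs <- unlist(lapply(corpus, function(s) {
      if (length(s) < 2) return(character(0))
      paste(s[-length(s)], s[-1])
    }))
    if (length(pairs) == 0) break
    counts <- table(pairs)
    best <- names(counts)[which.max(counts)]
    parts <- strsplit(best, " ", fixed = TRUE)[[1]]
    merged <- paste0(parts[1], parts[2])
    # Replace each occurrence of the pair with the merged symbol
    corpus <- lapply(corpus, function(s) {
      out <- character(0)
      j <- 1
      while (j <= length(s)) {
        if (j < length(s) && s[j] == parts[1] && s[j + 1] == parts[2]) {
          out <- c(out, merged)
          j <- j + 2
        } else {
          out <- c(out, s[j])
          j <- j + 1
        }
      }
      out
    })
    vocab <- c(vocab, merged)
    merges <- c(merges, merged)
  }
  list(vocab = vocab, merges = merges, corpus = corpus)
}

# Hypothetical toy corpus
result <- bpe_learn(c("abab", "abab", "abc"), k = 2)
print(result$merges)
```

In this toy corpus the pair (a, b) is the most frequent, so it is merged first; the second merge then joins two adjacent "ab" symbols.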

Answer 42

0. Guess the parameters
1. Work out the posterior probability of being M/F assuming normality (E step)
2. Update the parameters (M step)
3. Repeat steps 1 and 2 until the parameters stop changing
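
A minimal base R sketch of these steps for a mixture of two univariate normal distributions. The function name `em_two_normals`, the starting values, the stopping tolerance and the simulated height data are all assumptions made for illustration; packages such as mclust handle this far more robustly.

```r
em_two_normals <- function(x, mu = range(x), sigma = rep(sd(x), 2), prop = 0.5,
                           max_iter = 200, tol = 1e-8) {
  for (iter in seq_len(max_iter)) {
    # E step: posterior probability that each point belongs to component 1
    d1 <- prop * dnorm(x, mu[1], sigma[1])
    d2 <- (1 - prop) * dnorm(x, mu[2], sigma[2])
    post1 <- d1 / (d1 + d2)

    # M step: update the proportion, means and standard deviations
    new_prop <- mean(post1)
    new_mu <- c(sum(post1 * x) / sum(post1),
                sum((1 - post1) * x) / sum(1 - post1))
    new_sigma <- c(sqrt(sum(post1 * (x - new_mu[1])^2) / sum(post1)),
                   sqrt(sum((1 - post1) * (x - new_mu[2])^2) / sum(1 - post1)))

    # Stop once the parameters have effectively stopped changing
    change <- max(abs(c(new_mu - mu, new_sigma - sigma, new_prop - prop)))
    mu <- new_mu
    sigma <- new_sigma
    prop <- new_prop
    if (change < tol) break
  }
  list(mu = mu, sigma = sigma, prop = prop, posterior = post1)
}

# Simulated data: two groups with different mean heights (made-up values)
set.seed(1)
heights <- c(rnorm(200, 160, 5), rnorm(200, 185, 5))
fit <- em_two_normals(heights)
print(fit$mu)
```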

Answer 43

\d is any digit
\D is any non-digit

Answer 44

P(+|w,c) = sigma(c.w) = 1/(1+exp(-c.w)) where . is the dot product and sigma is the logistic (sigmoid) function
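
A short base R sketch of this probability, reading the formula as the logistic sigmoid applied to the dot product of the two embeddings. The embedding values and the names `p_context`, `w` and `ctx` are made up for illustration.

```r
# Logistic (sigmoid) function
sigmoid <- function(x) 1 / (1 + exp(-x))

# Probability that ctx is a real context word for target w
p_context <- function(w, ctx) sigmoid(sum(w * ctx))

w <- c(0.5, -1.0, 2.0)    # hypothetical target-word embedding
ctx <- c(1.0, 0.0, 0.5)   # hypothetical context-word embedding
p_pos <- p_context(w, ctx)
print(p_pos)
```

A dot product of 0 gives a probability of 0.5; larger dot products (more similar vectors) push the probability towards 1.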

Answer 45

matches the end of a line

Answer 46

Sentiment = total positive words - total negative words
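
A minimal base R sketch of this count-based score, assuming a tiny made-up lexicon; real analyses would use an established sentiment lexicon rather than these hypothetical word lists.

```r
# Hypothetical mini-lexicon, for illustration only
positive_words <- c("good", "great", "love")
negative_words <- c("bad", "awful", "hate")

sentiment_score <- function(text) {
  # Lower-case and split on anything that is not a letter or apostrophe
  tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
  sum(tokens %in% positive_words) - sum(tokens %in% negative_words)
}

score <- sentiment_score("Great plot, good acting, but awful ending")
print(score)
```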

Answer 47

1. Get some text
2. Organise text into corpus
3. Pre-process
4. Create representation
5. Perform analysis as usual

Answer 48

task of classifying the polarity of a given text (i.e. is it a good/bad review)

Answer 49

When the documents contain similar words

Answer 50

Tokenisation is the task of segmenting running text into words

Answer 51

- Stemming (running -\> run)
- Lemmatisation (were -\> be)
- Lower casing
- Stop word removal
- Punctuation removal
- Number removal
- Spell correction
- Tokenisation

Answer 52

cosine(v,w) = (v.w)/(|v||w|) (where . is dot product)
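
The formula can be checked with a few lines of base R. The helper name `cosine_similarity` and the example vectors are illustrative assumptions.

```r
# Cosine similarity: dot product divided by the product of vector lengths
cosine_similarity <- function(v, w) {
  sum(v * w) / (sqrt(sum(v^2)) * sqrt(sum(w^2)))
}

same <- cosine_similarity(c(1, 2), c(2, 4))          # same direction
orthogonal <- cosine_similarity(c(1, 0), c(0, 1))    # at right angles
opposite <- cosine_similarity(c(1, 2), c(-1, -2))    # opposite direction
print(c(same, orthogonal, opposite))
```

Because word counts are never negative, cosine similarity between raw count vectors falls between 0 and 1.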

Answer 53

anti\_join(data, stop\_words) #removes stop words on an unnest\_tokens object

Answer 54

There are often deviations at high ranks, as a corpus often contains fewer rare words than predicted by a single power law

Answer 55

\n is newline
\t is tab

Answer 56

any single character listed within the brackets can match, e.g. gr[ae]y matches gray or grey

Answer 57

Similar rows mean that the words are similar because they occur in similar documents

Answer 58

mclust fits all the models with up to specified number of clusters, computes the BIC of each model and chooses the model with the best BIC

Answer 59

same as data$variable

Answer 60

LCA lets the variables follow any distribution, as long as they are unrelated to each other (independent) within classes.

Answer 61

v.w = v1w1 + v2w2 + … + vNwN

Answer 62

pi(k,X): number of classes - 1: K-1
mu(k): K\*p (p is number of features)
var: K\*p (or just p when variances are equal over classes)
covariances: K\*p\*(p-1)/2 (or p\*(p-1)/2 when covariances are equal over classes) (or 0 when variables are uncorrelated, spherical clusters)
m = (K-1) + Kp + Kp + Kp(p-1)/2
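
A short base R sketch of the parameter count for the unconstrained case (all variances and covariances free per class), together with the usual BIC formula BIC = -2.log(l) + m.log(n). The function names and the example log likelihood are hypothetical.

```r
# Number of free parameters: proportions + means + variances + covariances
n_parameters <- function(K, p) {
  (K - 1) + K * p + K * p + K * p * (p - 1) / 2
}

# BIC on the "lower is better" scale
bic <- function(loglik, m, n) -2 * loglik + m * log(n)

m <- n_parameters(K = 2, p = 2)   # two classes, two features
example_bic <- bic(loglik = -100, m = m, n = 100)  # made-up log likelihood
print(c(m = m, bic = example_bic))
```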

Answer 63

#replace matching expression with replacement
sub("expression", "replacement", data) #only the first match in each element
gsub("expression", "replacement", data) #all matches

Answer 64

The weighted sum of two normal curves

Answer 65

vectorizer = vocab\_vectorizer(small\_vocab)

Answer 66

segmenting a text into sentences

Answer 67

Columns and rows both represent words - each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus - the context could be a document or smaller such as a window around the word

Answer 68

two matrices W and C each containing an embedding for every one of the |V| words in the vocabulary V

Answer 69

bind\_tf\_idf(tidytext dataset, tokens, documents, counts) #takes a tidytext dataset as input with one row per token per document
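
To make the calculation concrete, here is a base R sketch under the common definitions: tf as a term's share of its document's tokens, and idf as the natural log of the number of documents divided by the number of documents containing the term. The toy counts and column names are made up for illustration.

```r
# Hypothetical input: one row per token per document, with counts
counts <- data.frame(
  document = c("d1", "d1", "d2", "d2"),
  term = c("cat", "the", "dog", "the"),
  n = c(1, 3, 2, 2),
  stringsAsFactors = FALSE
)

# tf: term count divided by the total number of tokens in that document
doc_totals <- tapply(counts$n, counts$document, sum)
counts$tf <- counts$n / as.numeric(doc_totals[counts$document])

# idf: log of (number of documents / number of documents containing the term)
n_docs <- length(unique(counts$document))
docs_with_term <- tapply(counts$document, counts$term,
                         function(d) length(unique(d)))
counts$idf <- log(n_docs / as.numeric(docs_with_term[counts$term]))

counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

A word such as "the" that appears in every document gets an idf of 0, so its tf-idf is 0 however often it occurs.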

Answer 70

AIC: Akaike information criterion - same as BIC but penalty is 2m
AIC3: same as AIC but penalty is 3m
ICL: Integrated completed likelihood - same as BIC but reconstruction loss includes the assigned clusters

Answer 71

```
p(x) = pi(1,x) Normal(mean1, var1) + (1-pi(1,x)) Normal(mean2, var2)
where pi(1,x) is the probability that variable x takes on value 1 (e.g. probability that person is man)
ie proportion of values expected in each cluster
```
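
A small base R sketch of this two-component density. Note that `dnorm` takes standard deviations rather than variances; the function name `mixture_density` and the parameter values are made up for illustration.

```r
# Weighted sum of two normal densities (dnorm uses sd, not variance)
mixture_density <- function(x, prop, mean1, sd1, mean2, sd2) {
  prop * dnorm(x, mean1, sd1) + (1 - prop) * dnorm(x, mean2, sd2)
}

# The mixture is itself a density, so it should integrate to about 1
area <- integrate(mixture_density, -Inf, Inf,
                  prop = 0.4, mean1 = 0, sd1 = 1, mean2 = 4, sd2 = 1.5)$value
print(area)
```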

Answer 72

E for equal, V for variable, I for identity matrix
- Volume (size of clusters in data space)
- Shape (circle or ellipse)
- Orientation (the angle of the ellipse)

E.g. VVE model has variable volume, variable shape, equal orientation

Answer 73

- Increasing precision: minimising false positives (strings that were incorrectly matched)
- Increasing recall: minimising false negatives (strings that were incorrectly missed)

Answer 74

Use a capture group to store the expression in memory, e.g.
the (.\*)er they were, the \1er they will be
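
A base R sketch of the backreference in action, assuming two made-up test sentences. Backslashes are doubled inside R string literals, and `perl = TRUE` is used here for Perl-compatible matching.

```r
# \\1 must repeat exactly what the first capture group matched
pattern <- "the (.*)er they were, the \\1er they will be"

same_word <- grepl(pattern, "the bigger they were, the bigger they will be",
                   perl = TRUE)
different_word <- grepl(pattern, "the bigger they were, the faster they will be",
                        perl = TRUE)

# Capture groups can also be reused in a replacement
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world", perl = TRUE)
print(c(same_word, different_word))
print(swapped)
```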

Answer 75

start with the usual Gaussian mixture solution, merge similar components to create non-Gaussian clusters

Answer 76

pi(man,x) = point on male curve/point on total curve

Answer 77

a trade-off between fit (reconstruction loss) and complexity (file size); the lower the better

Answer 78

clustCombi(data=x)


(104 cards)