Week 8: Clustering and Text Mining Flashcards by Annie Clarnette

Stemming (running -> run)
Lemmatisation (were -> be)
lower casing
Stop word removal
Punctuation removal
Number removal
Spell correction
Tokenisation

What are some typical steps in text pre-processing? (8)

the task of finding the document d from the D documents in some collection that best matches a query q

What is information retrieval?

Short, dense vectors that can be used to represent words

What are embeddings?

it is a token learner
starts with vocabulary of all characters
chooses the two symbols that are most frequently adjacent, adds a new merged symbol to the vocabulary and replaces every adjacent pair in the corpus with the new merged symbol
continues to count and merge, creating new longer and longer character strings, until k merges have been done creating k novel tokens

How does the byte-pair encoding algorithm work for tokenisation?
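
The merge loop described on this card can be sketched in base R. This is a minimal illustration, not a production tokeniser: the function name `bpe_learn`, the toy word list and `k = 3` are assumptions made for the example, and ties between equally frequent pairs are broken arbitrarily.

```r
# Minimal byte-pair encoding token learner (illustrative sketch, base R only).
# `bpe_learn`, the toy words and k are hypothetical choices for this example.
bpe_learn <- function(words, k) {
  # Start from a vocabulary of single characters: each word is a symbol vector.
  corpus <- strsplit(words, "")
  merges <- character(0)
  for (i in seq_len(k)) {
    # Count every pair of adjacent symbols in the corpus.
    pairs <- unlist(lapply(corpus, function(s) {
      if (length(s) < 2) return(character(0))
      paste(s[-length(s)], s[-1])
    }))
    if (length(pairs) == 0) break
    counts <- table(pairs)
    best <- names(counts)[which.max(counts)]
    parts <- strsplit(best, " ")[[1]]
    merged <- paste0(parts[1], parts[2])
    merges <- c(merges, merged)
    # Replace each occurrence of the most frequent pair with the merged symbol.
    corpus <- lapply(corpus, function(s) {
      out <- character(0)
      j <- 1
      while (j <= length(s)) {
        if (j < length(s) && s[j] == parts[1] && s[j + 1] == parts[2]) {
          out <- c(out, merged)
          j <- j + 2
        } else {
          out <- c(out, s[j])
          j <- j + 1
        }
      }
      out
    })
  }
  list(merges = merges, corpus = corpus)
}

res <- bpe_learn(c("low", "low", "lower", "newest", "newest", "widest"), k = 3)
res$merges   # the k new tokens, in the order they were learned
res$corpus   # each word segmented with the learned tokens
```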

it = itoken(wordvector) #create index tokens
full_vocab = create_vocabulary(it) #create full vocabulary

How do you create a vocabulary of words in R?

train a classifier such that, given a tuple (w,c) of a target word w paired with a candidate context word c, it returns the probability that c is a real context word, P(+|w,c)

What is the classifier to train for skip-grams?

DocumentTermMatrix(corpus)

Code to create a document term matrix in R?

tidy()

Code to turn a document term matrix into a dataframe in R?

Similar rows mean that the words are similar because they occur in similar documents

When will two row vectors be similar in a term-document matrix?

Sentiment = total positive words - total negative words

What is the overall sentiment in sentiment analysis?
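
As a rough sketch of this count-based score in base R: the two word lists below are a hypothetical mini-lexicon invented for the example (a real analysis would use a lexicon such as AFINN, NRC or bing), and `sentiment_score` is an illustrative name.

```r
# Hypothetical mini-lexicon for illustration only.
positive <- c("good", "great", "love")
negative <- c("bad", "awful", "hate")

sentiment_score <- function(text) {
  # Lower-case, then split on anything that is not a letter or apostrophe.
  tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # Sentiment = total positive words - total negative words.
  sum(tokens %in% positive) - sum(tokens %in% negative)
}

sentiment_score("Great plot and good acting, but an awful ending")  # 2 - 1 = 1
```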

Each row is a document, each column is a word

What is a document-term matrix (DTM)?

Hidden groups within the data that are not recorded

What is gaussian mixture modelling trying to find?

LCA = latent class analysis
LPA = latent profile analysis
Both are types of model-based clustering

What do LCA and LPA stand for? What are they types of?

Corpus(textsource)

Code to create a corpus in R?

It separates out clitics (doesn’t becomes does n’t), keeps hyphenated words together, and separates out all punctuation

What does Penn Treebank tokenisation do?

tokenising (segmenting) words
normalising word formats
segmenting sentences

What are 3 types of text normalisation?

create_tcm(it, vectorizer, skip_grams_window = 5)

How do you create a token co-occurrence matrix in R?

Use a capture group ( ) to store the matched text in memory, then refer back to it with \1
the (.*)er they were, the \1er they will be

How do you get part of a string and reference back to that part in an RE?
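
A small base R check of the back-reference pattern on this card. The two test sentences are made up for the example; note that the backslash has to be doubled inside an R string.

```r
# Two hypothetical test strings: only the first repeats the captured text.
x <- c("the bigger they were, the bigger they will be",
       "the bigger they were, the faster they will be")

# (.*) captures some text; \\1 requires exactly the same text again.
pattern <- "the (.*)er they were, the \\1er they will be"
grepl(pattern, x)  # TRUE FALSE
```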

adjust the initial embeddings to maximise the similarity (dot product) of the (w, c_pos) pairs drawn from the positive examples and minimise the similarity (dot product) of the (w, c_neg) pairs from the negative examples

How does the skip-gram algorithm adjust during training?

words that occur in similar contexts tend to have similar meanings

What is the distributional hypothesis?

clustCombi(data=x)

What is the code for merging components of clusters in R?

two matrices W and C each containing an embedding for every one of the |V| words in the vocabulary V

What are all the parameters learned in skip-gram?

\n is newline
\t is tab

What are \n and \t in RE?

AFINN, NRC, bing

What are some popular lexicons for sentiment analysis? (3)

complete morphological parsing of the word, i.e. parsing cats into the two morphemes cat and s

What is the most sophisticated way of lemmatisation?

all prior class proportions are 1/K, the EII model is used, and all posteriors are either 0 or 1

What type of Gaussian mixture model is k-means?

```
tf = term frequency
idf = inverse document frequency
```

What are the terms of tf-idf?

```
p(x) = pi(1,X) * Normal(mean1, var1) + (1 - pi(1,X)) * Normal(mean2, var2)
where pi(1,X) is the probability that the class variable X takes on value 1 (e.g. the probability that a person is a man), i.e. the proportion of values expected in that cluster
```

What is the gaussian mixture model formula?

the task of putting words/tokens in a standard format

What is word normalisation?

represent a word as a point in multidimensional semantic space that is derived from the distributions of word neighbours

What are vector semantics for words?

an algebraic notation for characterising a set of strings

What is a regular expression?

P(+|w,c) = sigma(c.w) = 1/(1 + exp(-c.w)), where . is the dot product and sigma is the logistic (sigmoid) function

What is the probability that c is a context word P(+|w,c)?
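
A minimal numeric sketch of this probability in base R. The two three-dimensional embeddings are hypothetical values picked for the example, not learned vectors.

```r
# Logistic (sigmoid) function.
sigmoid <- function(z) 1 / (1 + exp(-z))

w     <- c(0.5, -1.0, 2.0)   # hypothetical target-word embedding
c_vec <- c(1.0, 0.5, 0.25)   # hypothetical context-word embedding

# P(+|w,c): sigmoid of the dot product c.w (here c.w = 0.5).
p_pos <- sigmoid(sum(c_vec * w))
p_pos
```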

the frequency with which a word appears is inversely proportional to its rank

What is Zipf's law?

segmenting a text into sentences

What is sentence segmentation?

same as data$variable

What does data %>% pull(variable) do in R?

The data within each cluster are normally distributed

What is the assumption of gaussian mixture modelling?

{n}: exactly n occurrences of the previous char or expression
{n,m}: from n to m occurrences of the previous char or expression

What do {n} and {n,m} mean in RE?
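
A quick base R illustration of the two counters, using a made-up vector of test strings. The anchors ^ and $ are added so that the whole string has to match.

```r
x <- c("a", "aa", "aaa", "aaaa")   # hypothetical test strings

grepl("^a{2}$", x)     # exactly two a's:    FALSE TRUE FALSE FALSE
grepl("^a{2,3}$", x)   # two to three a's:   FALSE TRUE TRUE  FALSE
```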

AIC: Akaike information criterion - same as BIC but the penalty is 2m instead of m.log(n)
AIC3: same as AIC but the penalty is 3m
ICL: integrated completed likelihood - same as BIC but the reconstruction loss also includes the assigned clusters

What are 3 alternatives to the BIC?

```
LCA = binomial mixture model
LPA = gaussian mixture model
```

What are other names for LCA and LPA?

cosine(v,w) = (v.w)/(|v||w|) (where . is the dot product)

What is the formula for the cosine similarity measure?
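
The formula can be written directly in base R. The helper name `cosine` and the small vectors are assumptions for this sketch; they illustrate the three reference values 1, 0 and -1.

```r
# cosine(v, w) = (v . w) / (|v| |w|)
cosine <- function(v, w) sum(v * w) / (sqrt(sum(v^2)) * sqrt(sum(w^2)))

cosine(c(1, 2, 3), c(2, 4, 6))   # same direction:     1
cosine(c(1, 0), c(0, 1))         # orthogonal:         0
cosine(c(1, 2), c(-1, -2))       # opposite direction: -1
```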

A clitic is a part of a word that can’t stand on its own and can only occur attached to another word, e.g. the ’re in we’re

What is a clitic?

Corpus: a collection of documents (our whole dataset)
Lexicon: the set of all unique words in a corpus

What is a corpus and a lexicon?

v.w = v1w1 + v2w2 + … + vNwN

How do you get the dot product of 2 vectors?

Mclust(data, G = 2, modelNames = "E")

Code for implementing mclust in R?

library(tidytext)
unnest_tokens(data, outputcolumn, inputcolumn) #gives one token per row and automatically removes punctuation

How can you tokenise text in R?

Columns and rows both represent words. Each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. The context could be a document or something smaller, such as a window around the word.

What is a term-term matrix or word co-occurrence matrix?

bind_tf_idf(tidytext dataset, tokens, documents, counts) #takes a tidytext dataset as input with one row per token per document

How do you get the tf-idf in R?

#replace matching expression with replacement
sub("expression", "replacement", data) #only the first match in each element
gsub("expression", "replacement", data) #all matches

What do sub and gsub do?
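
The difference is easiest to see on a small made-up vector (base R; the strings are hypothetical):

```r
x <- c("a-b-c", "no dash", "x-y")   # hypothetical input

sub("-", "+", x)    # first match per element: "a+b-c" "no dash" "x+y"
gsub("-", "+", x)   # every match:             "a+b+c" "no dash" "x+y"
```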

LCA lets the variables follow any distribution, as long as they are unrelated to each other (independent) within classes.

How does latent class analysis work?

- Increasing precision: minimising false positives (strings that were incorrectly matched)
- Increasing recall: minimising false negatives (strings that were incorrectly missed)

What does reducing the error rate of an RE involve?

\b matches a word boundary
\B matches a non-word boundary

What are \b and \B in RE?

There are often deviations at high ranks, as a corpus often contains fewer rare words than predicted by a single power law

Where does Zipf's law often deviate?

Tokenisation is the task of segmenting running text into words

What is tokenisation?

Kleene *: zero or more occurrences of the immediately previous character or regular expression
Kleene +: one or more occurrences of the immediately previous character or regular expression

What are the Kleene * and Kleene + in RE?

grep("expression", data) #returns the indices of all elements of data that match
grep("expression", data, value = TRUE) #returns the matching elements themselves (the whole string, not just the matching part)
length(grep("expression", data)) #returns the number of elements of data that match

How do you use grep to match and count matches from data? (3 ways)
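
The three uses side by side on a hypothetical character vector (base R):

```r
x <- c("apple", "banana", "cherry", "grape")   # hypothetical data

grep("an|ap", x)                  # indices of matching elements: 1 2 4
grep("an|ap", x, value = TRUE)    # the elements: "apple" "banana" "grape"
length(grep("an|ap", x))          # number of matching elements: 3
```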

- each row represents a word in the vocabulary
- each column represents a document
- each cell represents the number of times a particular word occurs in a particular document

What is a term-document matrix?
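
A tiny term-document matrix can be built in base R with `table()`. This is only a sketch of the layout: the two toy documents are invented for the example, and a real workflow would more likely use a text-mining package.

```r
# Hypothetical, already-tokenised documents.
docs <- list(d1 = c("cat", "sat", "cat"),
             d2 = c("dog", "sat"))

tokens <- unlist(docs, use.names = FALSE)
doc_id <- rep(names(docs), lengths(docs))

# Rows are terms, columns are documents, cells are counts.
tdm <- table(term = tokens, document = doc_id)
tdm
tdm["cat", "d1"]   # "cat" occurs twice in d1
```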

with a hyphen "-" inside brackets, e.g. [2-5]

How do you specify a range in RE?

1. Get some text
2. Organise text into corpus
3. Pre-process
4. Create representation
5. Perform analysis as usual

What is the basic workflow for text analysis? (5 steps)

- A token learner takes a raw training corpus and induces a vocabulary, a set of tokens
- A token segmenter takes a raw test sentence and segments it into the tokens in the vocabulary

What is a token learner and a token segmenter?

1. Treat the target word and a neighbouring context word as positive examples
2. Randomly sample other words in the lexicon to get negative samples
3. Use logistic regression to train a classifier to distinguish those two cases
4. Use the learned weights as the embeddings

What is the intuition of skip-gram/ word2vec?

matches any one of the characters within the brackets, e.g. [cC]olour means colour or Colour

What does [] mean in RE?

r = regexpr("expression", data) #gives the position and length of the first match in each element
r = gregexpr("expression", data) #gives the position and length of all matches in each element
regmatches(data, r) #returns the actual matches obtained from regexpr or gregexpr

How do you use regexpr and gregexpr and regmatches to match data?
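
A short base R run-through on a hypothetical vector, extracting runs of digits:

```r
x <- c("a1b22", "none", "333")   # hypothetical data

# First match per element: position (-1 means no match) and match length.
r1 <- regexpr("[0-9]+", x)
as.integer(r1)        # 2 -1 1
regmatches(x, r1)     # "1" "333" (elements without a match are dropped)

# All matches per element, returned as a list.
rall <- gregexpr("[0-9]+", x)
regmatches(x, rall)   # list("1" "22"; character(0); "333")
```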

- Analogies: look at the analogies in vector space, e.g. king - man + woman = queen
- Bias: semantics derived automatically from language corpora contain human-like biases

What are two properties of word embeddings?

1. Parenthesis ()
2. Counters * + ? {}
3. Sequences and anchors ^ $
4. Disjunction |

What is the precedence hierarchy in RE?

pi(k,X): number of classes - 1 = K-1
mu(k): K*p (p is the number of features)
var: K*p (or just p when variances are equal over classes)
covariances: K*p*(p-1)/2 (or p*(p-1)/2 when covariances are equal over classes, or 0 when variables are uncorrelated, spherical clusters)
m = (K-1) + Kp + Kp + Kp(p-1)/2

What are the number of parameters in a multivariate Gaussian mixture model?
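
The count for the fully unconstrained case can be checked with a one-line base R function. The name `gmm_n_params` is made up for this sketch, and it covers only the general formula m = (K-1) + Kp + Kp + Kp(p-1)/2, not the constrained variants.

```r
# Number of free parameters in an unconstrained K-component, p-feature GMM.
gmm_n_params <- function(K, p) {
  (K - 1) +              # class proportions
    K * p +              # means
    K * p +              # variances
    K * p * (p - 1) / 2  # covariances
}

gmm_n_params(K = 2, p = 1)   # 1 + 2 + 2 + 0 = 5
gmm_n_params(K = 3, p = 2)   # 2 + 6 + 6 + 3 = 17
```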

str_view("string", "expression") #shows only the first match
str_detect("string", "expression") #returns TRUE/FALSE depending on whether the string matches the expression
str_extract("string", "expression") #extracts the first match
str_extract_all("string", "expression") #extracts all matches into a vector
str_match_all("string", "expression") #similar to str_extract_all except the output is a matrix with a column for each capture group

What are the stringr options for matching strings?

When the documents contain similar words

When will two column vectors be similar in a term-document matrix?

pipe |

How can you say or in RE?

Raw frequency is not the best measure of association between words because words like “the” and “good” occur frequently and aren’t informative. tf-idf gives weight to words that appear in fewer documents

What is the purpose of tf-idf?

the task of classifying the polarity of a given text (i.e. is it a good/bad review)

What is sentiment analysis?

\\ (a double backslash), e.g. "\\." matches a literal period

How do you do an escape in R?

all context words are independent

What assumption does skip-gram make?

It is self-supervised learning, so no hand-labelled data is needed

What is an advantage of word2vec/skip grams?

matches the end of a line

What is $ in RE?

a kind of normalisation that maps everything to lowercase

What is case folding?

- Inflectional stemming: remove plurals, normalise verb tenses, remove other affixes
- Stemming to root: reduce word to most basic element

What is inflectional stemming and stemming to root?

anti_join(data, stop_words) #removes stop words from the output of unnest_tokens

How can you remove stop words in R?

How well the model fits the data

What does the likelihood tell us?

start with the usual Gaussian mixture solution, then merge similar components to create non-Gaussian clusters

How can you identify clusters that are not ellipses using the GMM?

mclust fits all the models with up to the specified number of clusters, computes the BIC of each model and chooses the model with the best BIC

How does mclust select a model in R?

\d is any digit
\D is any non-digit

What are \d and \D in RE?

a naive version of morphological analysis that consists of chopping off word-final affixes

What is stemming?

the task of determining that two words have the same root, despite their surface differences

What is lemmatisation?

vectorizer = vocab_vectorizer(small_vocab)

How do you map words to indices in R? ie create a vectorizer

term frequency: the count of the word t in document d / the total number of terms in the document
tf(t,d) = count(t,d)/total number of terms in the document
inverse document frequency: log of (number of documents / number of documents the term occurs in)
idf(t) = log10(N/df(t))
tf-idf = tf*idf

How do you calculate tf, idf and tf-idf?
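
These definitions can be worked through by hand in base R. The three toy documents and the helper names `tf`, `idf` and `tf_idf` are assumptions for this sketch; it follows the log10 definition on the card, which may differ from the log base a given package uses.

```r
# Hypothetical, already-tokenised documents.
docs <- list(d1 = c("the", "cat", "sat"),
             d2 = c("the", "dog", "sat", "down"),
             d3 = c("the", "cat", "ran"))

# tf(t, d) = count(t, d) / total number of terms in d
tf <- function(term, doc) sum(doc == term) / length(doc)

# idf(t) = log10(N / df(t)), with df(t) = number of documents containing t
idf <- function(term, docs) {
  df <- sum(sapply(docs, function(d) term %in% d))
  log10(length(docs) / df)
}

tf_idf <- function(term, doc, docs) tf(term, doc) * idf(term, docs)

tf_idf("the", docs$d1, docs)   # in every document, so idf = 0 and tf-idf = 0
tf_idf("dog", docs$d2, docs)   # (1/4) * log10(3/1)
```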

GloVe, fastText

What are other types of embeddings in R, after word2vec?

Sparse vectors, because most entries are 0

What type of vector representations come from a term-term matrix?

p(data|parameters) = p(y|theta)

What is the likelihood, defined by the statistical model and the assumption?

- Not great with longer tasks
- Negation
- Context-dependency
- You need partly labelled data

What are the problems with sentiment analysis? (4)

matches the preceding character or nothing (i.e. makes it optional), e.g. colou?r matches color or colour

What is ? in RE?

a trade-off between fit (reconstruction loss) and complexity (file size); the lower the better

What is the aim of the BIC?

E for equal, V for variable, I for identity matrix
- Volume (size of clusters in data space)
- Shape (circle or ellipse)
- Orientation (the angle of the ellipse)
E.g. the VVE model has variable volume, variable shape, equal orientation

What are the identifiers for each parameterisation of a GMM and what do they measure?

- Bag-of-words: word count or word proportion for each word
- Time-series: label each token and put in order
- Tf-idf
- Embeddings

What are some forms of text representation? (4)

The weighted sum of two normal curves

What is the overall probability curve made up of?

Morphology is how words are built up from stems (the central morpheme of the word) and affixes

What is morphology?

sim2(matrix1, matrix2, method = "cosine", norm = "l2")

How do you calculate the cosine similarity in R?

The cosine value ranges from 1 for vectors pointing in the same direction, through 0 for orthogonal vectors, to -1 for opposite vectors

What does the value of the cosine metric represent?

pi(man,x) = point on male curve/point on total curve

How is the posterior probability calculated based on the current estimates of mean and sd?

A wildcard expression that matches any single character

What is the period . in RE?

0. Guess the parameters
1. Work out the posterior probability of being M/F assuming normality (E step)
2. Update the parameters (M step)
Repeat steps 1 and 2 until the parameters stop changing

What are the steps of the expectation maximisation (EM) algorithm?
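
The steps above can be sketched in base R for a two-component, one-dimensional mixture. Everything concrete here is an assumption for illustration: the simulated "height" data, the starting guesses, the iteration cap and the stopping tolerance. In practice `Mclust()` does this work.

```r
set.seed(1)
# Simulated heights (hypothetical data): two groups with different means.
y <- c(rnorm(200, mean = 160, sd = 5), rnorm(200, mean = 185, sd = 5))

# Step 0: guess the parameters.
pi1 <- 0.5
mu <- c(150, 190)
sigma <- c(5, 5)

for (iter in 1:200) {
  # E step: posterior probability of component 1
  # (point on component-1 curve / point on total curve).
  d1 <- pi1 * dnorm(y, mu[1], sigma[1])
  d2 <- (1 - pi1) * dnorm(y, mu[2], sigma[2])
  post1 <- d1 / (d1 + d2)

  # M step: update proportion, means and standard deviations.
  new_pi1 <- mean(post1)
  new_mu <- c(sum(post1 * y) / sum(post1),
              sum((1 - post1) * y) / sum(1 - post1))
  new_sigma <- c(sqrt(sum(post1 * (y - new_mu[1])^2) / sum(post1)),
                 sqrt(sum((1 - post1) * (y - new_mu[2])^2) / sum(1 - post1)))

  # Repeat until the parameters stop changing.
  change <- max(abs(c(new_pi1 - pi1, new_mu - mu, new_sigma - sigma)))
  pi1 <- new_pi1
  mu <- new_mu
  sigma <- new_sigma
  if (change < 1e-6) break
}

c(pi1 = pi1, mu1 = mu[1], mu2 = mu[2], sd1 = sigma[1], sd2 = sigma[2])
```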

Caret ^ matches the start of a line, or negates the contents of [] if it is [^...]

What are the 2 things that ^ represents in RE?

Mean becomes a vector of 2 means
Standard deviation becomes a 2x2 variance-covariance matrix determining the shape of the cluster

In multivariate model based clustering with 2 observed features, what do the mean and standard deviation become?

fit_mclust$parameters #gives means, variances, proportions, modelName
fit_mclust$BIC #gives the BIC for each option and shows the best models (fit_mclust$bic gives only the best value). mclust reports the BIC with the opposite sign, 2.log(l) - m.log(n), so multiply by -1 to compare with the usual formula
fit_mclust$loglik #gives log likelihood used to calculate bic manually
fit_mclust$classification #gives the cluster classification vector
fit_mclust$uncertainty #gives uncertainty of each point
plot(fit_mclust, "density") #gives density plot

When you have fitted an Mclust model to fit_mclust in R, what information can you obtain from the result?

BIC = -2.log(l) + m.log(n)
```
l = likelihood = p(data|theta)
-2.log(l) = deviance = reconstruction loss = fit
m = number of parameters
n = number of observations/examples
m.log(n) = file size = complexity
```

What is the Bayesian information criterion (BIC) formula?
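
As a worked check of the formula in base R: fit a single normal distribution by maximum likelihood and compute the BIC by hand. The simulated data are hypothetical. The result is compared with `BIC()` on an intercept-only `lm`, which uses the same likelihood and the same two parameters.

```r
set.seed(42)
y <- rnorm(50, mean = 10, sd = 2)   # hypothetical data
n <- length(y)

# Maximum-likelihood estimates for one normal distribution.
mu_hat <- mean(y)
sd_hat <- sqrt(mean((y - mu_hat)^2))

loglik <- sum(dnorm(y, mu_hat, sd_hat, log = TRUE))   # log(l)
m <- 2                                                # mean and variance

bic <- -2 * loglik + m * log(n)   # fit + complexity; lower is better
bic

# Should agree with R's built-in BIC for an intercept-only linear model.
all.equal(bic, BIC(lm(y ~ 1)))
```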

Week 8: Clustering and Text Mining Flashcards

(104 cards)