Week 8: Clustering and Text Mining Flashcards
- Stemming (running -> run)
- Lemmatisation (were -> be)
- Lower casing
- Stop word removal
- Punctuation removal
- Number removal
- Spell correction
- Tokenisation
What are some typical steps in text pre-processing? (8)
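A minimal sketch of a few of these steps chained together, using only Python's standard library. The stop word list and the function name `preprocess` are made up for illustration; stemming, lemmatisation and spell correction are left out because they need a dictionary or a dedicated library.

```python
import re

# Hypothetical toy stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "in"}

def preprocess(text):
    """Lower-case, strip numbers and punctuation, tokenise, drop stop words."""
    text = text.lower()                    # lower casing
    text = re.sub(r"\d+", " ", text)       # number removal
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation removal
    tokens = text.split()                  # whitespace tokenisation
    return [t for t in tokens if t not in STOP_WORDS]  # stop word removal

print(preprocess("The 3 cats are running in the garden!"))
```

The order matters: lower casing first means the stop word lookup does not have to handle capitalised forms.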
the task of finding the document d from the D documents in some collection that best matches a query q
What is information retrieval?
Short, dense vectors that can be used to represent words
What are embeddings?
- it is a token learner
- starts with vocabulary of all characters
- chooses the two symbols that are most frequently adjacent, adds a new merged symbol to the vocabulary and replaces every adjacent pair in the corpus with the new merged symbol
- continues to count and merge, creating new longer and longer character strings, until k merges have been done creating k novel tokens
How does the byte-pair encoding algorithm work for tokenisation?
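The token learner described above can be sketched in a few lines of Python. This is a toy version under simplifying assumptions: words are given as a plain list (repeats stand in for frequencies), there is no end-of-word marker, and ties between equally frequent pairs are broken arbitrarily. The name `learn_bpe` is hypothetical.

```python
from collections import Counter

def learn_bpe(words, k):
    """Learn up to k merges from a list of words (toy BPE token learner)."""
    # Each word is a tuple of symbols; start from single characters.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    for _ in range(k):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b
        merges.append((a, b))
        vocab.add(merged)
        # Replace every adjacent (a, b) in the corpus with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, vocab

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe(words, 3)
print(merges)
```

With this toy corpus the frequent suffix "est" ends up as a single token after two merges, which is the kind of subword unit BPE is meant to discover.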
it = itoken(wordvector) # create index tokens
full_vocab = create_vocabulary(it) # create full vocabulary
How do you create a vocabulary of words in R?
train a classifier such that, given a tuple (w,c) of a target word w paired with a candidate context word c, it returns the probability that c is a real context word, P(+|w,c)
What is the classifier to train for skip-grams?
Code to create a document term matrix in R?
Code to turn a document term matrix into a dataframe in R?
Similar rows mean that the words are similar because they occur in similar documents
When will two row vectors be similar in a term-document matrix?
Sentiment = total positive words - total negative words
What is the overall sentiment in sentiment analysis?
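A minimal sketch of this count-based score. The two word sets are a made-up toy lexicon, not taken from AFINN, NRC or bing, and `sentiment` is a hypothetical function name.

```python
# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def sentiment(tokens):
    """Overall sentiment = total positive words - total negative words."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg

tokens = "i love this great film but the ending was bad".split()
print(sentiment(tokens))  # 2 positive words, 1 negative word
```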
Each row is a document, each column is a word
What is a document-term matrix (DTM)?
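As an illustration, a tiny document-term matrix can be built by hand with raw word counts. This sketch assumes whitespace tokenisation and uses a hypothetical helper `document_term_matrix`; it is not how the R packages build a DTM internally.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (vocab, rows): one row per document, one column per word."""
    vocab = sorted({w for d in docs for w in d.split()})
    rows = []
    for d in docs:
        counts = Counter(d.split())
        rows.append([counts[w] for w in vocab])  # Counter gives 0 for absent words
    return vocab, rows

docs = ["the cat sat", "the dog sat down", "the cat saw the dog"]
vocab, rows = document_term_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

Transposing the rows gives the term-document matrix, where each row is a word and each column is a document.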
Hidden groups within the data that are not recorded
What is gaussian mixture modelling trying to find?
LCA = latent class analysis
LPA = latent profile analysis
Both are types of model-based clustering
What do LCA and LPA stand for? What are they types of?
Code to create a corpus in R?
It separates out clitics (doesn’t becomes does n’t), keeps hyphenated words together, and separates out all punctuation
What does Penn Treebank tokenisation do?
- tokenising (segmenting) words
- normalising word formats
- segmenting sentences
What are 3 types of text normalisation?
create_tcm(it, vectorizer, skip_grams_window = 5)
How do you create a token co-occurrence matrix in R?
Use a capture group (parentheses) to store the matched text in memory, then refer back to it with \1
the (.*)er they were, the \1er they will be
How do you get part of a string and reference back to that part in an RE?
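The same pattern can be tried out with Python's `re` module, which supports `\1` backreferences. The helper name `same_adjective` is made up for this sketch.

```python
import re

# (.*) captures the adjective stem; \1 requires the same text to appear again.
PATTERN = re.compile(r"the (.*)er they were, the \1er they will be")

def same_adjective(sentence):
    """Return the captured stem if both halves use the same adjective, else None."""
    m = PATTERN.search(sentence)
    return m.group(1) if m else None

print(same_adjective("the bigger they were, the bigger they will be"))
print(same_adjective("the bigger they were, the faster they will be"))
```

The first call returns the captured stem and the second returns None, because `\1` only matches the exact text captured by the group, not any text the group's pattern could match.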
adjust initial embeddings to maximise the similarity (dot product) of the (w, cpos) pairs drawn from the positive examples and minimise the similarity (dot product) of the (w, cneg) pairs from the negative examples
How does the skip-gram algorithm adjust during training?
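One way to picture this adjustment is a single gradient-descent step for one target word, one positive context word and one negative sample. This sketch assumes the usual negative-sampling loss, L = -log sigma(cpos.w) - log sigma(-cneg.w); the vectors, the learning rate and the name `sgns_step` are invented for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgns_step(w, c_pos, c_neg, lr=0.1):
    """One SGD step on L = -log sigma(c_pos.w) - log sigma(-c_neg.w)."""
    g_pos = sigmoid(dot(c_pos, w)) - 1.0   # scalar factor of dL/dc_pos
    g_neg = sigmoid(dot(c_neg, w))         # scalar factor of dL/dc_neg
    # Gradients use the old values of all three vectors.
    grad_w = [g_pos * p + g_neg * n for p, n in zip(c_pos, c_neg)]
    new_c_pos = [p - lr * g_pos * wi for p, wi in zip(c_pos, w)]
    new_c_neg = [n - lr * g_neg * wi for n, wi in zip(c_neg, w)]
    new_w = [wi - lr * g for wi, g in zip(w, grad_w)]
    return new_w, new_c_pos, new_c_neg

w, c_pos, c_neg = [0.1, 0.2], [0.3, -0.1], [0.2, 0.4]
new_w, new_c_pos, new_c_neg = sgns_step(w, c_pos, c_neg)
print(dot(w, c_pos), "->", dot(new_w, new_c_pos))  # similarity to the positive example
print(dot(w, c_neg), "->", dot(new_w, new_c_neg))  # similarity to the negative example
```

For these toy vectors the dot product with the positive example goes up and the dot product with the negative example goes down after the step.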
words that occur in similar contexts tend to have similar meanings
What is the distributional hypothesis?
What is the code for merging components of clusters in R?
two matrices W and C each containing an embedding for every one of the |V| words in the vocabulary V
What are all the parameters learned in skip-gram?
\n is newline
\t is tab
What are \n and \t in RE?
AFINN, NRC, bing
What are some popular lexicons for sentiment analysis? (3)
complete morphological parsing of the word
i.e. it takes cats and parses it into the two morphemes cat and s
What is the most sophisticated way of lemmatisation?
all prior class proportions 1/K, EII model, all posteriors are either 0 or 1
What type of Gaussian mixture model is k-means?
tf = term frequency
idf = inverse document frequency
What are the terms of tf-idf?
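To see how the two terms combine, here is a small sketch that multiplies them. There are several tf-idf weighting schemes; this one assumes a raw count for tf and log10(N / df) for idf, which may differ from the variant used in the course. The function name `tf_idf` is hypothetical.

```python
import math

def tf_idf(term, doc, docs):
    """tf-idf with tf = raw count in doc, idf = log10(N / document frequency)."""
    tf = doc.count(term)                      # term frequency in this document
    df = sum(1 for d in docs if term in d)    # number of documents containing the term
    if df == 0:
        return 0.0                            # term never occurs in the collection
    idf = math.log10(len(docs) / df)
    return tf * idf

docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat", "saw", "the", "dog"]]
print(tf_idf("the", docs[2], docs))  # occurs in every document, so idf is 0
print(tf_idf("saw", docs[2], docs))  # occurs in one document only, so idf is high
```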
p(x) = pi(1,x) Normal(mean1, var1) + (1 - pi(1,x)) Normal(mean2, var2)
where pi(1,x) is the probability that variable x takes on value 1 (e.g. the probability that a person is a man), i.e. the proportion of values expected in each cluster
What is the gaussian mixture model formula?
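The two-component density can be evaluated directly. This sketch treats the mixing proportion as a single number `pi1` (an assumption; the card writes it as pi(1,x)), and the function names are invented for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of Normal(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, pi1, mean1, var1, mean2, var2):
    """p(x) = pi1 * Normal(mean1, var1) + (1 - pi1) * Normal(mean2, var2)."""
    return pi1 * normal_pdf(x, mean1, var1) + (1 - pi1) * normal_pdf(x, mean2, var2)

# Hypothetical example: two groups with different average heights (cm).
print(mixture_pdf(170.0, pi1=0.5, mean1=178.0, var1=49.0, mean2=165.0, var2=36.0))
```

Setting `pi1` to 1 recovers a single Gaussian, which is a quick sanity check on the formula.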
the task of putting words/tokens in a standard format
What is word normalisation?
represent a word as a point in multidimensional semantic space that is derived from the distributions of word neighbours
What are vector semantics for words?
an algebraic notation for characterising a set of strings
What is a regular expression?
P(+|w,c) = sigma(c.w) = 1/(1 + exp(-c.w))
where . is the dot product and sigma is the logistic (sigmoid) function
What is the probability that c is a context word P(+|w,c)?
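In code, this probability is the logistic sigmoid applied to the dot product of the two embeddings. A minimal sketch with made-up two-dimensional vectors; `p_context` is a hypothetical name.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_context(w, c):
    """P(+|w,c): probability that c is a real context word for target w."""
    return sigmoid(dot(c, w))

w = [1.0, 2.0]
c = [3.0, 4.0]
print(p_context(w, c))        # large dot product, probability close to 1
print(1.0 - p_context(w, c))  # P(-|w,c), the probability that c is not a context word
```

The sigmoid is needed because a raw dot product can be any real number, while a probability has to lie between 0 and 1.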
the frequency with which a word appears is inversely proportional to its frequency rank
What is Zipf’s law?
segmenting a text into sentences
What is sentence segmentation?
extracts a single column as a vector; same as data$variable
What does data %>% pull(variable) do in R?
The data within each cluster is normally distributed
What is the assumption of gaussian mixture modelling?
{n}: exactly n occurrences of the previous char or expression
{n,m}: from n to m occurrences of the previous char or expression
What do {n} and {n,m} mean in RE?
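A quick check of both counters with Python's `re` module, where a single number in braces means an exact count and a pair means a range with the lower bound first. The helper `is_full_match` is a made-up name.

```python
import re

def is_full_match(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(is_full_match(r"a{3}", "aaa"))      # True: exactly three a's
print(is_full_match(r"a{3}", "aaaa"))     # False: four is too many
print(is_full_match(r"a{2,4}", "aa"))     # True: between two and four a's
print(is_full_match(r"a{2,4}", "aaaaa"))  # False: five is outside the range
```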
AIC: Akaike information criterion - same as BIC but penalty is m
AIC3: same as AIC but penalty is 3m/2
ICL: Integrated completed likelihood - same as BIC but reconstruction loss includes the assigned clusters
What are 3 alternatives to the BIC?
LCA = binomial mixture model
LPA = gaussian mixture model
What are other names for LCA and LPA?
cosine(v,w) = (v.w)/(|v||w|) (where . is dot product)
What is the formula for the cosine similarity measure?
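The formula translates directly into code. A minimal sketch with plain Python lists; it assumes neither vector is all zeros, since the norm would then be 0.

```python
import math

def cosine(v, w):
    """cosine(v, w) = (v . w) / (|v| |w|); assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

print(cosine([1, 2, 3], [2, 4, 6]))  # same direction
print(cosine([1, 0], [0, 1]))        # orthogonal vectors
```

Because the dot product is divided by both lengths, a long vector (a very frequent word) is not automatically more similar to everything else.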
A clitic is a part of a word that can’t stand on its own and can only occur attached to another word, e.g. the ’re in we’re
What is a clitic?