Week 8: Clustering and Text Mining Flashcards
RevERSED
- Stemming (running -> run)
- Lemmatisation (were -> be)
- lower casing
- Stop word removal
- Punctuation removal
- Number removal
- Spell correction
- Tokenisation
What are some typical steps in text pre-processing? (8)
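A minimal base R sketch of a few of these steps (lower casing, punctuation removal, number removal, tokenisation, stop word removal). The example sentence and the two-word stop list are made up for illustration; stemming, lemmatisation and spell correction need extra packages and are not shown.

```r
# Hypothetical example text and stop-word list (not from the course material).
text <- "The 3 cats were RUNNING, quickly!"
stop_words <- c("the", "were")

clean <- tolower(text)                            # lower casing
clean <- gsub("[[:punct:]]", "", clean)           # punctuation removal
clean <- gsub("[[:digit:]]", "", clean)           # number removal
tokens <- strsplit(trimws(clean), "\\s+")[[1]]    # tokenisation on whitespace
tokens <- tokens[!tokens %in% stop_words]         # stop word removal

print(tokens)
```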
RevERSED
the task of finding the document d from the D documents in some collection that best matches a query q
What is information retrieval?
RevERSED
Short, dense vectors that can be used to represent words
What are embeddings?
RevERSED
- it is a token learner
- starts with vocabulary of all characters
- chooses the two symbols that are most frequently adjacent, adds a new merged symbol to the vocabulary and replaces every adjacent pair in the corpus with the new merged symbol
- continues to count and merge, creating new longer and longer character strings, until k merges have been done creating k novel tokens
How does the byte-pair encoding algorithm work for tokenisation?
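The count-and-merge loop can be sketched in base R as below. This is a simplified illustration, not a reference implementation: the toy corpus and its word counts are invented, ties are broken by whichever pair was seen first, and no end-of-word symbol is used.

```r
# Toy corpus: each word is pre-split into characters; counts are hypothetical.
words  <- list(c("l","o","w"), c("l","o","w","e","r"),
               c("n","e","w"), c("n","e","w","e","r"))
counts <- c(5, 2, 6, 3)

# Count adjacent symbol pairs, weighted by how often each word occurs.
count_pairs <- function(words, counts) {
  pair_counts <- numeric(0)
  for (i in seq_along(words)) {
    w <- words[[i]]
    if (length(w) < 2) next
    for (j in seq_len(length(w) - 1)) {
      key <- paste(w[j], w[j + 1])
      old <- if (key %in% names(pair_counts)) pair_counts[[key]] else 0
      pair_counts[key] <- old + counts[i]
    }
  }
  pair_counts
}

# Replace every adjacent occurrence of (a, b) in one word with the merged symbol.
merge_pair <- function(w, a, b) {
  out <- character(0)
  j <- 1
  while (j <= length(w)) {
    if (j < length(w) && w[j] == a && w[j + 1] == b) {
      out <- c(out, paste0(a, b))
      j <- j + 2
    } else {
      out <- c(out, w[j])
      j <- j + 1
    }
  }
  out
}

# Run k merges, starting from a vocabulary of all characters.
bpe <- function(words, counts, k) {
  vocab  <- unique(unlist(words))
  merges <- character(0)
  for (step in seq_len(k)) {
    pc <- count_pairs(words, counts)
    if (length(pc) == 0) break
    best  <- names(pc)[which.max(pc)]          # most frequent adjacent pair
    parts <- strsplit(best, " ")[[1]]
    words <- lapply(words, merge_pair, a = parts[1], b = parts[2])
    vocab <- c(vocab, paste0(parts[1], parts[2]))
    merges <- c(merges, best)
  }
  list(vocab = vocab, merges = merges, words = words)
}

result <- bpe(words, counts, k = 2)
print(result$merges)
print(result$vocab)
```

With this toy corpus the first merge joins "n" and "e", and the second joins "ne" and "w", so "new" becomes a single token.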
RevERSED
it = itoken(wordvector) #create index tokens
full_vocab = create_vocabulary(it) #create full vocabulary
How do you create a vocabulary of words in R?
RevERSED
train a classifier such that, given a tuple (w,c) of a target word w paired with a candidate context word c, it returns the probability that c is a real context word, P(+|w,c)
What is the classifier to train for skip-grams?
RevERSED
DocumentTermMatrix(corpus)
Code to create a document term matrix in R?
RevERSED
tidy()
Code to turn a document term matrix into a dataframe in R?
RevERSED
Similar rows mean that the words are similar because they occur in similar documents
When will two row vectors be similar in a term-document matrix?
RevERSED
Sentiment = total positive words - total negative words
What is the overall sentiment in sentiment analysis?
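A small base R sketch of this count, assuming a made-up mini-lexicon and an already tokenised sentence; a real analysis would take the word lists from a lexicon such as bing.

```r
# Hypothetical mini-lexicon and tokens, for illustration only.
positive <- c("good", "great", "happy")
negative <- c("bad", "sad", "awful")
tokens <- c("the", "film", "was", "great", "but", "the",
            "ending", "was", "sad", "and", "awful")

# Sentiment = total positive words - total negative words
sentiment <- sum(tokens %in% positive) - sum(tokens %in% negative)
print(sentiment)
```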
RevERSED
Each row is a document, each column is a word
What is a document-term matrix (DTM)?
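To make the layout concrete, here is a hand-built document-term matrix in base R for two invented documents. It is only a sketch of the structure; in practice the matrix would come from a function such as DocumentTermMatrix(corpus).

```r
# Two hypothetical documents.
docs <- c(d1 = "cat sat mat", d2 = "cat cat dog")

# Vocabulary: every distinct word, sorted alphabetically.
vocab <- sort(unique(unlist(strsplit(docs, " "), use.names = FALSE)))

# One row per document, one column per word, filled with counts.
dtm <- matrix(0, nrow = length(docs), ncol = length(vocab),
              dimnames = list(names(docs), vocab))
for (d in names(docs)) {
  for (term in strsplit(docs[[d]], " ")[[1]]) {
    dtm[d, term] <- dtm[d, term] + 1
  }
}
print(dtm)
```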
RevERSED
Hidden groups within the data that are not recorded
What is Gaussian mixture modelling trying to find?
RevERSED
LCA = latent class analysis
LPA = latent profile analysis
Both are types of model-based clustering
What do LCA and LPA stand for? What are they types of?
RevERSED
Corpus(textsource)
Code to create a corpus in R?
RevERSED
It separates out clitics (doesn’t becomes does n’t), keeps hyphenated words together, and separates out all punctuation
What does Penn Treebank tokenisation do?
RevERSED
- tokenising (segmenting) words
- normalising word formats
- segmenting sentences
What are 3 types of text normalisation?
RevERSED
create_tcm(it, vectorizer, skip_grams_window = 5)
How do you create a token co-occurrence matrix in R?
RevERSED
Use capture group to store the expression in memory
the (.*)er they were, the \1er they will be
How do you get part of a string and reference back to that part in an RE?
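The same pattern can be tried in R. In an R string the backslash has to be doubled, so \1 is written "\\1". The test sentences are invented, and perl = TRUE is just one choice of regex engine.

```r
pattern <- "the (.*)er they were, the \\1er they will be"

same      <- "the bigger they were, the bigger they will be"
different <- "the bigger they were, the faster they will be"

# \\1 must repeat exactly what the capture group (.*) matched.
match_same      <- grepl(pattern, same, perl = TRUE)
match_different <- grepl(pattern, different, perl = TRUE)

# The captured text can also be pulled out with sub().
captured <- sub(pattern, "\\1", same, perl = TRUE)

print(c(match_same, match_different))
print(captured)
```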
RevERSED
adjust initial embeddings to maximise the similarity (dot product) of the (w, c_pos) pairs drawn from the positive examples and minimise the similarity (dot product) of the (w, c_neg) pairs from the negative examples
How does the skip-gram algorithm adjust during training?
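One gradient step of this adjustment can be sketched in base R. The sketch assumes the usual negative-sampling loss with a single negative example, L = -log sigmoid(c_pos.w) - log sigmoid(-c_neg.w); the starting vectors and the learning rate are arbitrary illustrative values.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# Hypothetical 2-dimensional embeddings and learning rate.
w     <- c(1, 0)   # target word
c_pos <- c(0, 1)   # real context word
c_neg <- c(1, 1)   # sampled noise word
lr    <- 0.1

# Gradients of the assumed loss with respect to each vector.
g_pos <- (sigmoid(sum(c_pos * w)) - 1) * w
g_neg <- sigmoid(sum(c_neg * w)) * w
g_w   <- (sigmoid(sum(c_pos * w)) - 1) * c_pos +
         sigmoid(sum(c_neg * w)) * c_neg

# Gradient descent step.
c_pos_new <- c_pos - lr * g_pos
c_neg_new <- c_neg - lr * g_neg
w_new     <- w - lr * g_w

# Similarity with the positive example goes up, with the negative example down.
print(c(before = sum(c_pos * w), after = sum(c_pos_new * w_new)))
print(c(before = sum(c_neg * w), after = sum(c_neg_new * w_new)))
```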
RevERSED
words that occur in similar contexts tend to have similar meanings
What is the distributional hypothesis?
RevERSED
clustCombi(data=x)
What is the code for merging components of clusters in R?
RevERSED
two matrices W and C each containing an embedding for every one of the |V| words in the vocabulary V
What are all the parameters learned in skip-gram?
RevERSED
\n is newline
\t is tab
What are \n and \t in RE?
RevERSED
AFINN, NRC, bing
What are some popular lexicons for sentiment analysis? (3)
RevERSED
complete morphological parsing of the word
e.g. takes cats and parses it into the two morphemes cat and s
What is the most sophisticated way of lemmatisation?
RevERSED
all prior class proportions 1/K, EII model, all posteriors are either 0 or 1
What type of Gaussian mixture model is k-means?
RevERSED
tf = term frequency
idf = inverse document frequency
What are the terms of tf-idf?
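A base R sketch of how the two terms combine. It assumes one common weighting, raw counts for tf and log(N / df) for idf; other variants exist and the course may use a different one. The small count matrix is invented.

```r
# Hypothetical document-term counts: rows are documents, columns are words.
dtm <- matrix(c(2, 1, 0,
                1, 0, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("d1", "d2"), c("the", "cat", "dog")))

N   <- nrow(dtm)              # number of documents
df  <- colSums(dtm > 0)       # number of documents containing each word
idf <- log(N / df)            # assumed idf variant: log(N / df)

# tf-idf: multiply each column of counts by that word's idf.
tfidf <- sweep(dtm, 2, idf, `*`)
print(tfidf)
```

A word that appears in every document ("the" here) gets idf = 0, so its tf-idf weight is 0 however often it occurs.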
RevERSED
p(x) = pi_1 * Normal(x; mean1, var1) + (1 - pi_1) * Normal(x; mean2, var2)
where pi_1 is the probability that the latent class variable takes the value 1 (e.g. the probability that a person is a man), i.e. the proportion of observations expected in cluster 1
What is the Gaussian mixture model formula?
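The two-cluster density can be written as a small base R function. The function and argument names are made up for this sketch; note that dnorm() takes a standard deviation, not a variance.

```r
# Two-component Gaussian mixture density (hypothetical helper).
# pi1 is the proportion of observations expected in cluster 1.
mixture_density <- function(x, pi1, mean1, sd1, mean2, sd2) {
  pi1 * dnorm(x, mean1, sd1) + (1 - pi1) * dnorm(x, mean2, sd2)
}

# Example: 40% of observations from N(0, 1), 60% from N(5, 2^2).
x_grid <- c(-1, 0, 2.5, 5)
print(mixture_density(x_grid, pi1 = 0.4, mean1 = 0, sd1 = 1, mean2 = 5, sd2 = 2))
```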
RevERSED
the task of putting words/tokens in a standard format
What is word normalisation?
RevERSED
represent a word as a point in multidimensional semantic space that is derived from the distributions of word neighbours
What are vector semantics for words?
RevERSED
an algebraic notation for characterising a set of strings
What is a regular expression?
RevERSED
P(+|w,c) = sigmoid(c.w) = 1/(1+exp(-c.w))
where . is the dot product
What is the probability that c is a context word P(+|w,c)?
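In base R this probability is the logistic sigmoid applied to the dot product of the two embeddings. The function names and example vectors below are illustrative only.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# P(+ | w, c): probability that c_vec is a real context word for w.
p_context <- function(w, c_vec) sigmoid(sum(w * c_vec))

# Orthogonal vectors have dot product 0, so the probability is 0.5.
p_orthogonal <- p_context(c(1, 0), c(0, 1))

# A larger dot product gives a probability closer to 1.
p_similar <- p_context(c(1, 1), c(1, 1))

print(c(p_orthogonal, p_similar))
```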
RevERSED
the frequency of a word is inversely proportional to its rank in the frequency table
What is Zipf’s law?
RevERSED
segmenting a text into sentences
What is sentence segmentation?
RevERSED
same as data$variable
What does data %>% pull(variable) do in R?
RevERSED
The data within each cluster are normally distributed
What is the assumption of Gaussian mixture modelling?
RevERSED
{n}: exactly n occurrences of the previous char or expression
{n,m}: from n to m occurrences of the previous char or expression
What do {n} and {n,m} mean in RE?
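A quick check of both quantifiers with grepl() in base R. The anchors ^ and $ are added so that the whole string has to match; the test strings are invented.

```r
# {3}: exactly three "a"s
exactly_three <- grepl("^a{3}$", c("aa", "aaa", "aaaa"))

# {2,3}: between two and three "a"s
two_to_three <- grepl("^a{2,3}$", c("a", "aa", "aaa", "aaaa"))

print(exactly_three)
print(two_to_three)
```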
RevERSED
AIC: Akaike information criterion - same as BIC but penalty is m
AIC3: same as AIC but penalty is 3m/2
ICL: Integrated completed likelihood - same as BIC but reconstruction loss includes the assigned clusters
What are 3 alternatives to the BIC?
RevERSED
LCA = binomial mixture model
LPA = Gaussian mixture model
What are other names for LCA and LPA?
RevERSED
cosine(v,w) = (v.w)/(|v||w|) (where . is dot product)
What is the formula for the cosine similarity measure?
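The formula translates directly into base R. The function name and the example vectors are illustrative.

```r
# Cosine similarity: dot product divided by the product of the vector lengths.
cosine <- function(v, w) {
  sum(v * w) / (sqrt(sum(v^2)) * sqrt(sum(w^2)))
}

same_direction <- cosine(c(1, 2, 3), c(2, 4, 6))   # parallel vectors
orthogonal     <- cosine(c(1, 0), c(0, 1))         # no shared dimensions

print(c(same_direction, orthogonal))
```

Because the result depends only on direction, a vector and a scaled copy of it have cosine similarity 1.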
RevERSED
A clitic is a part of a word that can’t stand on its own and can only occur attached to another word, e.g. the ’re in we’re
What is a clitic?