Week 8: Clustering and Text Mining Flashcards
What are other types of word embeddings (besides word2vec) available in R?
GloVe, fastText
What type of Gaussian mixture model is k-means?
A Gaussian mixture model with all prior class proportions fixed at 1/K, an EII (spherical, equal-volume) covariance model, and all posteriors forced to be either 0 or 1 (hard assignment)
What are embeddings?
Short, dense vectors used to represent words
What is the intuition of skip-gram/ word2vec?
- Treat the target word and a neighbouring context word as positive examples
- Randomly sample other words in the lexicon to get negative samples
- Use logistic regression to train a classifier to distinguish those two cases
- Use the learned weights as the embeddings
What is case folding?
a kind of normalisation that maps everything to lowercase
What is the purpose of tf-idf?
Raw frequency is not the best measure of association between words, because frequent words like “the” and “good” are not informative. tf-idf gives higher weight to words that appear in fewer documents
What are some popular lexicons for sentiment analysis? (3)
AFINN, NRC, bing
Code to turn a document term matrix into a dataframe in R?
tidy()
How do you calculate tf, idf and tf-idf?
term frequency: the frequency of the word t in document d / the total number of terms in the document
tf(t,d)= count(t,d)/total number of terms in the document
inverse document frequency: the log of the number of documents divided by the number of documents the term occurs in
idf = log10(N/df(t))
tf-idf = tf*idf
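The formulas above can be checked with a small base-R sketch (the three-document term counts are made up for illustration):

```r
# Toy corpus: counts of a hypothetical term "cat" in three documents
count_t <- c(doc1 = 4, doc2 = 0, doc3 = 1)      # count(t, d)
doc_len <- c(doc1 = 100, doc2 = 80, doc3 = 50)  # total terms in each document

tf  <- count_t / doc_len   # term frequency per document
N   <- length(count_t)     # number of documents
df  <- sum(count_t > 0)    # number of documents containing the term
idf <- log10(N / df)       # inverse document frequency
tf_idf <- tf * idf
tf_idf
```

Note that the term appears in 2 of the 3 documents, so idf = log10(3/2); a term appearing in every document would get idf = log10(1) = 0 and thus zero weight.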
How do you create a vocabulary of words in R?
it = itoken(wordvector) # create index tokens
full_vocab = create_vocabulary(it) # create full vocabulary
What is stemming?
a naive version of morphological analysis that consists of chopping off word-final affixes
What is a term-document matrix?
- each row represents a word in the vocabulary
- each column represents a document
- each cell represents the number of times a particular word occurs in a particular document
What assumption does skip-gram make?
all context words are independent of one another given the target word
How do you create a token co-occurrence matrix in R?
create_tcm(it, vectorizer, skip_grams_window = 5)
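Putting this together with the vocabulary card above, a minimal text2vec sketch (the two-document corpus is made up for illustration; assumes the text2vec package is installed):

```r
library(text2vec)  # itoken(), create_vocabulary(), create_tcm()

docs <- c("the cat sat on the mat", "the dog sat on the log")  # toy corpus
it <- itoken(docs, tokenizer = word_tokenizer)  # create index tokens
vocab <- create_vocabulary(it)                  # full vocabulary
vectorizer <- vocab_vectorizer(vocab)           # maps tokens to vocab indices
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)  # token co-occurrence matrix
```

The resulting tcm is a sparse matrix of weighted co-occurrence counts, which can then be fed to embedding methods such as GloVe.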
How does the skip-gram algorithm adjust during training?
adjust initial embeddings to maximise the similarity (dot product) of the (w, cpos) pairs drawn from the positive examples and minimise the similarity (dot product) of the (w, cneg) pairs from the negative examples
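In symbols, this is the standard skip-gram with negative sampling objective: for a target word w, one positive context word c_pos, and k negative samples, training minimises the loss (σ is the sigmoid):

```latex
L = -\left[ \log \sigma(c_{pos} \cdot w) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i} \cdot w) \right]
```

Minimising L raises the dot product with the positive context word and lowers it with the negative samples, exactly as the card describes.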
What are other names for LCA and LPA
LCA = binomial mixture model
LPA = Gaussian mixture model
What does the likelihood tell us?
How well the model fits the data
Code to create a corpus in R?
Corpus(textsource)
How do you use grep to match and count matches from data? (3 ways)
grep("expression", data) # returns the indices of all elements of data that match
grep("expression", data, value = TRUE) # returns the matching elements themselves; gives the whole string, not just the matching part
length(grep("expression", data)) # returns the number of matching elements
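A runnable base-R sketch of the three patterns above (the word vector is made up for illustration):

```r
words <- c("clustering", "cluster", "text", "mining")

grep("clust", words)                # indices of matching elements: 1 2
grep("clust", words, value = TRUE)  # whole matching strings: "clustering" "cluster"
length(grep("clust", words))        # number of matches: 2
```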
What is the assumption of gaussian mixture modelling?
The data within each cluster is normally distributed
What are vector semantics for words?
represent a word as a point in multidimensional semantic space that is derived from the distributions of word neighbours
What are the 2 things that ^ represents in RE?
Caret ^ matches the start of a line
or negates the contents of [] if it is [^…]
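Both uses can be demonstrated in base R (the example strings are made up for illustration):

```r
x <- c("cat", "concatenate", "dog")
grepl("^cat", x)       # ^ as start anchor: TRUE FALSE FALSE
grepl("[^0-9]", "42")  # ^ as negation in []: FALSE ("42" has no non-digit characters)
```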
How do you use regexpr and gregexpr and regmatches to match data?
r = regexpr("expression", data) # gives the index and length of the first match in each element of data
r = gregexpr("expression", data) # gives the index and length of every match in each element of data
regmatches(data, r) # returns the actual matched text from regexpr or gregexpr
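A runnable base-R sketch of the difference between the two (the strings are made up for illustration):

```r
txt <- c("aa bb aa", "cc aa")

r1 <- regexpr("aa", txt)   # first match per element: positions 1 and 4
regmatches(txt, r1)        # matched text: "aa" "aa"

r2 <- gregexpr("aa", txt)  # all matches per element
regmatches(txt, r2)        # a list: c("aa", "aa") for the first string, "aa" for the second
```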
What is lemmatisation?
the task of determining that two words have the same root, despite their surface differences
What are the terms of tf-idf?
tf = term frequency
idf = inverse document frequency
What are some forms of text representation? (4)
- Bag-of-words: word count or word proportion for each word
- Time-series: label each token and put in order
- Tf-idf, embeddings
What is a document-term matrix (DTM)?
Each row is a document, each column is a word
What are 3 types of text normalisation?
- tokenising (segmenting) words
- normalising word formats
- segmenting sentences
What is the distributional hypothesis?
words that occur in similar contexts tend to have similar meanings
What is the Kleene * and Kleene + in RE?
* zero or more occurrences of the immediately previous character or regular expression
+ one or more occurrences of the immediately previous character or regular expression
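The difference shows up with zero repetitions; a base-R sketch (the test strings are made up for illustration):

```r
tests <- c("ac", "abc", "abbc")
grepl("ab*c", tests)  # * allows zero b's: TRUE TRUE TRUE
grepl("ab+c", tests)  # + requires at least one b: FALSE TRUE TRUE
```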
Code to create a document term matrix in R?
DocumentTermMatrix(corpus)
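A minimal sketch tying the corpus, DTM, and tidy() cards together (the documents are made up for illustration; assumes the tm and tidytext packages are installed):

```r
library(tm)        # Corpus(), VectorSource(), DocumentTermMatrix()
library(tidytext)  # tidy()

docs <- c("the cat sat on the mat", "the dog sat on the log")
corpus <- Corpus(VectorSource(docs))  # wrap a character vector as a text source
dtm <- DocumentTermMatrix(corpus)     # rows = documents, columns = terms
tidy(dtm)                             # dataframe with one row per (document, term, count)
```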
What is a clitic?
A clitic is a part of a word that can’t stand on its own and can only occur attached to another word, e.g. the ’re in we’re
What is gaussian mixture modelling trying to find?
Hidden groups within the data that are not recorded
What is Zipf’s law?
the frequency that a word appears is inversely proportional to its rank
What is information retrieval?
the task of finding the document d from the D documents in some collection that best matches a query q
What does the value of the cosine metric represent?
The cosine value ranges from 1 for vectors pointing in the same direction, through 0 for orthogonal vectors, to -1 for opposite vectors
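The three reference cases can be verified with a small base-R cosine function (the vectors are made up for illustration):

```r
# cosine similarity: dot product divided by the product of vector lengths
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(c(1, 0), c(2, 0))   # same direction: 1
cosine(c(1, 0), c(0, 3))   # orthogonal: 0
cosine(c(1, 0), c(-1, 0))  # opposite: -1
```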
How do you specify a range in RE?
"-" e.g. [2-5]
Code for implementing mclust in R?
Mclust(data, G = 2, modelNames = "E")
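A runnable sketch, assuming the mclust package is installed (the two hidden groups are simulated for illustration):

```r
library(mclust)  # Mclust()

set.seed(1)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))  # data with two hidden groups

fit <- Mclust(x, G = 2, modelNames = "E")  # univariate, equal-variance model
summary(fit)        # mixing proportions, means, log-likelihood
fit$classification  # hard cluster assignments from the posterior probabilities
```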
What is a regular expression?
an algebraic notation for characterising a set of strings
What type of vector representations come from a term-term matrix?
Sparse vectors, because most cells are 0
What is word normalisation?
the task of putting words/tokens in a standard format