Week 8: Clustering and Text Mining Flashcards
RevERSED
- Stemming (running -> run)
- Lemmatisation (were -> be)
- Lowercasing
- Stop word removal
- Punctuation removal
- Number removal
- Spell correction
- Tokenisation
What are some typical steps in text pre-processing? (8)
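A few of these steps can be sketched in base R. This is a minimal illustration only: the sentence and the stop-word list are made up, and stemming, lemmatisation and spell correction are left out because they need a dedicated package.

```r
# Toy pre-processing pipeline in base R (illustrative stop-word list, not a real lexicon).
text <- "The 3 cats were running, quickly!"
stop_words <- c("the", "were", "a", "an", "is")

clean <- tolower(text)                           # lowercasing
clean <- gsub("[[:punct:]]", "", clean)          # punctuation removal
clean <- gsub("[[:digit:]]", "", clean)          # number removal
tokens <- strsplit(trimws(clean), "\\s+")[[1]]   # tokenisation on whitespace
tokens <- tokens[!tokens %in% stop_words]        # stop word removal

print(tokens)
```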
RevERSED
the task of finding the document d from the D documents in some collection that best matches a query q
What is information retrieval?
RevERSED
Short, dense vectors that can be used to represent words
What are embeddings?
RevERSED
- it is a token learner
- it starts with a vocabulary of all individual characters
- it chooses the two symbols that are most frequently adjacent, adds a new merged symbol to the vocabulary, and replaces every adjacent occurrence of that pair in the corpus with the new merged symbol
- it continues to count and merge, creating longer and longer character strings, until k merges have been done, creating k novel tokens
How does the byte-pair encoding algorithm work for tokenisation?
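The count-and-merge loop can be sketched in base R as follows. This is a simplified illustration, not a production tokeniser: the toy corpus, the function names (`adjacent_pairs`, `merge_pair`, `learn_bpe`) and the choice to break ties by taking the first maximum are all assumptions, and no end-of-word marker is used.

```r
# Toy corpus: each word is a vector of single-character symbols, repeated by frequency.
corpus <- c(
  rep(list(c("h", "u", "g")), 10),
  rep(list(c("p", "u", "g")), 5),
  rep(list(c("p", "u", "n")), 12),
  rep(list(c("b", "u", "n")), 4),
  rep(list(c("h", "u", "g", "s")), 5)
)

# List every adjacent symbol pair in the corpus as "a b" strings.
adjacent_pairs <- function(words) {
  unlist(lapply(words, function(w) {
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))
  }))
}

# Replace every adjacent occurrence of (a, b) in one word with the merged symbol "ab".
merge_pair <- function(w, a, b) {
  out <- character(0)
  i <- 1
  while (i <= length(w)) {
    if (i < length(w) && w[i] == a && w[i + 1] == b) {
      out <- c(out, paste0(a, b))
      i <- i + 2
    } else {
      out <- c(out, w[i])
      i <- i + 1
    }
  }
  out
}

# Learn k merges, starting from the vocabulary of all characters.
learn_bpe <- function(words, k) {
  vocab <- sort(unique(unlist(words)))
  merges <- character(0)
  for (step in seq_len(k)) {
    pairs <- adjacent_pairs(words)
    if (length(pairs) == 0) break
    counts <- table(pairs)
    best <- names(counts)[which.max(counts)]   # most frequent pair (first one on ties)
    parts <- strsplit(best, " ", fixed = TRUE)[[1]]
    words <- lapply(words, merge_pair, a = parts[1], b = parts[2])
    vocab <- c(vocab, paste0(parts[1], parts[2]))
    merges <- c(merges, best)
  }
  list(vocab = vocab, merges = merges, words = words)
}

result <- learn_bpe(corpus, k = 3)
print(result$merges)
print(result$vocab)
```

Each merge adds exactly one new symbol, so after k merges the vocabulary holds the original characters plus k learned tokens.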
RevERSED
it = itoken(wordvector)             # create an iterator over the tokens
full_vocab = create_vocabulary(it)  # create the full vocabulary
How do you create a vocabulary of words in R?
RevERSED
train a classifier such that, given a tuple (w, c) of a target word w paired with a candidate context word c, it returns the probability that c is a real context word, P(+|w,c)
What is the classifier to train for skip-grams?
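As a rough sketch, this probability is usually modelled as the logistic (sigmoid) function of the dot product of the two embeddings. The formulation below assumes that standard form; the 3-dimensional embedding values and the helper name `p_context` are made up for illustration.

```r
# Skip-gram classifier probability, assuming the usual sigmoid-of-dot-product form.
sigmoid <- function(x) 1 / (1 + exp(-x))

# Hypothetical embeddings (invented numbers, not trained values).
w     <- c(0.5, -0.2, 0.1)    # target word embedding
c_pos <- c(0.4, -0.1, 0.3)    # embedding of a plausible context word
c_neg <- c(-0.6, 0.3, -0.2)   # embedding of an unlikely context word

# P(+ | w, c): probability that ctx is a real context word of w.
p_context <- function(w, ctx) sigmoid(sum(w * ctx))

p_pos <- p_context(w, c_pos)
p_neg <- p_context(w, c_neg)
print(c(p_pos, p_neg))
```

The more similar the two vectors (larger dot product), the closer the probability is to 1.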
RevERSED
DocumentTermMatrix(corpus)
Code to create a document term matrix in R?
RevERSED
tidy()
Code to turn a document term matrix into a dataframe in R?
RevERSED
Similar rows mean that the words are similar because they occur in similar documents
When will two row vectors be similar in a term-document matrix?
RevERSED
Sentiment = total positive words - total negative words
What is the overall sentiment in sentiment analysis?
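A minimal base R sketch of this count-based score. The two word lists are invented stand-ins for a real lexicon, and the tokens are assumed to be already pre-processed.

```r
# Toy lexicon-based sentiment score (made-up word lists, not a real lexicon).
positive <- c("good", "great", "happy", "love")
negative <- c("bad", "awful", "sad", "hate")

tokens <- c("i", "love", "this", "great", "film", "but", "the", "ending", "was", "sad")

n_pos <- sum(tokens %in% positive)   # total positive words
n_neg <- sum(tokens %in% negative)   # total negative words
sentiment <- n_pos - n_neg

print(sentiment)
```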
RevERSED
Each row is a document, each column is a word
What is a document-term matrix (DTM)?
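To make the layout concrete, here is a small count-based DTM built with base R's `table()`. The two toy documents are made up, and this is only an illustration of the shape, not a replacement for `DocumentTermMatrix()`.

```r
# Two hypothetical, already-tokenised documents.
docs <- list(
  doc1 = c("cats", "chase", "mice"),
  doc2 = c("dogs", "chase", "cats", "cats")
)

doc_id <- rep(names(docs), lengths(docs))   # one document label per token
term   <- unlist(docs, use.names = FALSE)   # all tokens in order

# Rows are documents, columns are words, cells are counts.
dtm <- table(doc = doc_id, term = term)
print(dtm)

# Transposing gives the term-document layout (rows are words).
tdm <- t(dtm)
```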
RevERSED
Hidden groups within the data that are not recorded
What is Gaussian mixture modelling trying to find?
RevERSED
- LCA: latent class analysis
- LPA: latent profile analysis
- both are types of model-based clustering
What do LCA and LPA stand for? What are they types of?
RevERSED
Corpus(textsource)
Code to create a corpus in R?
RevERSED
It separates out clitics (doesn’t becomes does n’t), keeps hyphenated words together, and separates out all punctuation
What does Penn Treebank tokenisation do?
RevERSED
- tokenising (segmenting) words
- normalising word formats
- segmenting sentences
What are 3 types of text normalisation?
RevERSED
create_tcm(it, vectorizer, skip_grams_window = 5)
How do you create a token co-occurrence matrix in R?
RevERSED
Use a capture group (parentheses) to store the matched text in memory, then refer back to it with \1
the (.*)er they were, the \1er they will be
How do you get part of a string and reference back to that part in an RE?
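The pattern above can be tried in R, where the backslash has to be doubled inside a string literal. The test sentences are made up, and `perl = TRUE` is used here simply as one engine known to support backreferences.

```r
# The capture group (.*) must match the same text in both places.
pattern <- "the (.*)er they were, the \\1er they will be"

same_word <- grepl(pattern, "the bigger they were, the bigger they will be", perl = TRUE)
diff_word <- grepl(pattern, "the bigger they were, the faster they will be", perl = TRUE)
print(c(same_word, diff_word))

# The captured text can also be reused in a replacement string.
stem <- sub("^the (.*)er they were.*$", "\\1",
            "the bigger they were, the bigger they will be", perl = TRUE)
print(stem)
```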
RevERSED
adjust the initial embeddings to maximise the similarity (dot product) of the (w, c_pos) pairs drawn from the positive examples and minimise the similarity (dot product) of the (w, c_neg) pairs from the negative examples
How does the skip-gram algorithm adjust during training?
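One gradient-descent step of this adjustment can be sketched in base R. The sketch assumes the usual negative-sampling loss, -[log σ(c_pos·w) + log σ(-c_neg·w)], with a single negative example; the embedding values and the learning rate are made up.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# Hypothetical starting embeddings and learning rate.
w     <- c(0.5, -0.2, 0.1)    # target word
c_pos <- c(0.4, -0.1, 0.3)    # positive (real) context word
c_neg <- c(-0.6, 0.3, -0.2)   # negative (sampled noise) word
lr    <- 0.1

s_pos <- sigmoid(sum(c_pos * w))
s_neg <- sigmoid(sum(c_neg * w))

# Gradients of the loss with respect to each embedding.
grad_c_pos <- (s_pos - 1) * w
grad_c_neg <- s_neg * w
grad_w     <- (s_pos - 1) * c_pos + s_neg * c_neg

# Gradient-descent update.
w_new     <- w - lr * grad_w
c_pos_new <- c_pos - lr * grad_c_pos
c_neg_new <- c_neg - lr * grad_c_neg

# The positive pair becomes more similar, the negative pair less similar.
print(c(before = sum(w * c_pos), after = sum(w_new * c_pos_new)))
print(c(before = sum(w * c_neg), after = sum(w_new * c_neg_new)))
```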
RevERSED
words that occur in similar contexts tend to have similar meanings
What is the distributional hypothesis?
RevERSED
clustCombi(data=x)
What is the code for merging components of clusters in R?
RevERSED
two matrices W (target embeddings) and C (context embeddings), each containing an embedding for every one of the |V| words in the vocabulary V
What are all the parameters learned in skip-gram?
RevERSED
\n is newline
\t is tab
What are \n and \t in RE?
RevERSED
AFINN, NRC, bing
What are some popular lexicons for sentiment analysis? (3)