LDA2VEC Flashcards
What is Latent Dirichlet Allocation (LDA)?
LDA is a probabilistic topic model that treats each document as a bag-of-words: a document is modeled as a mixture of topics, and each topic as a distribution over words.
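A minimal sketch of fitting LDA with gensim's LdaModel; the toy corpus and parameter values are illustrative assumptions, not part of the flashcards:

```python
# Minimal LDA sketch with gensim (toy corpus; order of words is ignored).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["cat", "dog", "pet", "dog"],
    ["stock", "market", "trade", "stock"],
    ["dog", "pet", "vet"],
    ["market", "trade", "price"],
]

dictionary = Dictionary(docs)                       # token <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # (word_id, count) pairs per document

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Each document is a mixture of topics; each topic is a distribution over words.
print(lda.get_document_topics(corpus[0]))
print(lda.show_topic(0))
```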
What is Latent Dirichlet Allocation (LDA) to vector (lda2vec)?
lda2vec builds document representations on top of word embeddings.
What is topic modeling?
Topic models often assume that word usage is correlated with topic occurrence. They divide documents into clusters (topics) according to word usage.
What is Bag-of-words?
A document is represented as a vector whose dimension equals the vocabulary size; each dimension of this vector corresponds to the count or occurrence of a word in the document.
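A minimal bag-of-words sketch using scikit-learn's CountVectorizer; the example documents are illustrative:

```python
# Bag-of-words sketch: each document becomes a vector of word counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # shape: (n_docs, vocabulary_size)

print(vectorizer.get_feature_names_out())  # the vocabulary (one dimension per word)
print(X.toarray())                         # word counts per document
```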
How is LDA a general Machine Learning (ML) technique?
Used in unsupervised ML problems where the input is a collection of fixed-length vectors and the goal is to explore the structure of this data.
What is the biggest disadvantage of LDA?
The LDA model learns a document vector that predicts words inside that document while disregarding any structure, i.e. how these words interact on a local level.
When does LDA produce value?
[1] A good estimate of the number of topics is available.
[2] A distinct name/‘topic’ is manually assigned to each topic vector.
[3] Under [1] and [2], the topic vectors will be interpretable.
What is the problem with the bag-of-words representation?
It is hard to figure out which dimensions of the document vectors are semantically related. (Solution: word embeddings.)
What are word embeddings?
Dense vector representations of words: words that occur in the same context are represented by vectors in close proximity to each other.
How do you visualize the word embedding space?
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction method that can be used to visualize high-dimensional data in 2D.
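A minimal t-SNE sketch using scikit-learn; the random vectors stand in for real word embeddings:

```python
# t-SNE sketch: project high-dimensional vectors down to 2D for plotting.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 50))  # stand-in for 100 word embeddings of dim 50

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
coords = tsne.fit_transform(vectors)  # shape: (100, 2), ready for a scatter plot

print(coords.shape)
```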
What is CBOW in word2vec embedding?
In the continuous bag-of-words (CBOW) architecture, the pivot word is predicted from a set of surrounding context words (e.g. given ‘thank’, ‘such’, ‘you’, ‘top’, the model has to predict ‘awesome’).
What is skip-gram in word2vec embedding?
In the skip-gram architecture, the pivot word is used to predict the surrounding context words (e.g. given ‘awesome’, predict ‘thank’, ‘such’, ‘you’, ‘top’).
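A minimal word2vec sketch using gensim 4.x, where the sg flag selects CBOW (sg=0) or skip-gram (sg=1); the toy sentences and parameters are illustrative:

```python
# word2vec sketch with gensim: the sg flag switches between CBOW and skip-gram.
from gensim.models import Word2Vec

sentences = [
    ["thank", "you", "for", "such", "an", "awesome", "top"],
    ["thank", "you", "so", "much"],
    ["such", "an", "awesome", "day"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # skip-gram

print(skipgram.wv["awesome"].shape)         # the learned word vector
print(skipgram.wv.most_similar("awesome"))  # nearest words in embedding space
```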
What is bad about word2vec?
The word2vec model learns a word vector that predicts context words across different documents. As a result, document-specific information is mixed together in the word embeddings.
What is lda2vec?
Inspired by Latent Dirichlet Allocation (LDA), the word2vec model is expanded to simultaneously learn word, document, and topic vectors.
What is the lda2vec process?
The pivot word vector and a document vector are summed to obtain a context vector. This context vector is then used to predict the surrounding context words.
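A minimal numpy sketch of this step, assuming (as in the lda2vec paper) that the document vector is a softmax-weighted mixture of topic vectors and is added to the pivot word vector; all dimensions and values below are illustrative:

```python
# lda2vec context-vector sketch in numpy (dimensions and weights are illustrative).
import numpy as np

embed_dim, n_topics = 50, 5
rng = np.random.default_rng(0)

topic_matrix = rng.normal(size=(n_topics, embed_dim))  # one vector per topic
doc_weights = rng.normal(size=n_topics)                # unnormalized per-document topic weights
pivot_vec = rng.normal(size=embed_dim)                 # embedding of the pivot word

# Document vector: a mixture of topic vectors (softmax over the document weights).
proportions = np.exp(doc_weights) / np.exp(doc_weights).sum()
doc_vec = proportions @ topic_matrix

# Context vector = pivot word vector + document vector;
# it is used to predict the surrounding context words.
context_vec = pivot_vec + doc_vec
print(context_vec.shape)
```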