L16 - Topic Modelling Flashcards
Define the field of Text Mining…
The process of extracting useful information from unstructured text data.
What type of modelling approach is Topic Modelling?
A statistical modelling approach that uses unsupervised machine learning.
Give a high level explanation of what Topic Modelling does. Give an example…
Analyses unstructured text data and groups words that frequently occur together into clusters, which are the topics.
E.g. analysing a text identifies the topics within it, which lets the model predict the type of text, such as an invoice, a rock song, a literature review or a spam email.
What is the input and output of Topic Modelling?
Input: Corpus (collection of texts) represented as a bag of words.
Output: Topics - clusters of words which are used to make predictions.
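A minimal sketch of the bag-of-words input, assuming a toy two-document corpus and simple whitespace tokenisation (both are illustrative assumptions, not part of the flashcards):

```python
from collections import Counter

# Hypothetical toy corpus: each string is one document.
corpus = [
    "invoice payment due invoice",
    "guitar solo guitar chorus",
]

def bag_of_words(document):
    """Count how often each word occurs, ignoring word order."""
    return Counter(document.lower().split())

bags = [bag_of_words(doc) for doc in corpus]
print(bags[0]["invoice"])  # 2
print(bags[1]["guitar"])   # 2
```

Word order is discarded on purpose: only the counts per document are passed on to the topic model.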
What are the 2 topic modelling techniques called?
Latent Semantic Analysis (LSA)
Latent Dirichlet Allocation (LDA)
What is Latent Semantic Analysis? Give the step by step process
- A technique to establish the relationship between documents and the words they contain.
- Based on the assumption that words with similar meanings will appear in similar documents.
- Generate a word x document matrix where each row is a word, each column is a document, and each cell is the count of that word in that document.
- Perform Singular Value Decomposition on the matrix and keep only the largest singular values, reducing dimensionality whilst retaining important features.
- Use Cosine Similarity to establish document similarities
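The steps above can be sketched roughly as follows. This is an illustration only, assuming NumPy is available, a hypothetical four-document corpus, and k = 2 retained dimensions; real pipelines usually add weighting such as TF-IDF and proper tokenisation.

```python
import numpy as np

# Hypothetical toy corpus: two billing documents, two music documents.
docs = [
    "invoice payment due",
    "payment invoice overdue",
    "guitar solo chorus",
    "chorus guitar riff",
]

# Step 1: word x document count matrix (rows = words, columns = documents).
vocab = sorted({word for doc in docs for word in doc.split()})
counts = np.array(
    [[doc.split().count(word) for doc in docs] for word in vocab],
    dtype=float,
)

# Step 2: SVD of the whole matrix, keeping the k largest singular values.
k = 2  # assumed number of latent dimensions
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document

# Step 3: cosine similarity between the reduced document vectors.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_same = cosine(doc_vectors[0], doc_vectors[1])  # billing vs billing
sim_diff = cosine(doc_vectors[0], doc_vectors[2])  # billing vs music
print(round(sim_same, 3), round(sim_diff, 3))
```

On this toy corpus the two billing documents score close to 1 and the billing/music pair scores close to 0, even though the billing documents do not use exactly the same words.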
Explain Latent Dirichlet Allocation…
A generative statistical model that assumes each document is a mixture of topics and each topic is a distribution over words, so the words in a document enable its topics to be deduced.
Maps a document to a list of relevant topics.
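To make "generative" concrete, here is a small sketch of the process LDA assumes produced each document. The two topics, their word probabilities and the Dirichlet parameter `alpha` are all made-up values for illustration; a fitted model would learn them from the corpus.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    "billing": {"invoice": 0.5, "payment": 0.4, "refund": 0.1},
    "music": {"guitar": 0.5, "chorus": 0.3, "riff": 0.2},
}

def sample_dirichlet(alpha):
    """Draw a probability vector from a Dirichlet distribution via gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=(0.5, 0.5)):
    """Generate one document the way LDA assumes documents arise."""
    names = list(topics)
    mixture = sample_dirichlet(alpha)  # per-document topic proportions
    words = []
    for _ in range(n_words):
        # Pick a topic for this word, then pick a word from that topic.
        topic = random.choices(names, weights=mixture)[0]
        dist = topics[topic]
        words.append(random.choices(list(dist), weights=list(dist.values()))[0])
    return dict(zip(names, mixture)), words

mixture, words = generate_document(10)
print(mixture)
print(words)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers the topic mixture of each document and the word distribution of each topic.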
What is Cosine Similarity?
Vector-based method for finding document similarity. If the cosine of the angle between 2 document vectors is close to 1, the documents are similar; if it is close to 0, they are unrelated.
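A minimal sketch of the calculation, assuming each document has already been turned into a vector of word counts (the three vectors are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word-count vectors over the same three-word vocabulary.
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same word proportions as doc_a, just a longer document
doc_c = [0, 0, 3]  # shares no words with doc_a

print(cosine_similarity(doc_a, doc_b))  # approximately 1: same direction
print(cosine_similarity(doc_a, doc_c))  # 0.0: no words in common
```

Because only the angle matters, a long document and a short one with the same word proportions are still treated as similar.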
Give an example of Topic Modelling in use…
Customer Service Tickets - Based on the content of a customer's query, topic modelling can allocate the ticket to the correct team and tag it appropriately.
What are the advantages and disadvantages of Topic Modelling?
Advantages:
- Simple input (term-document matrix)
- Quick and simple topic breakdown by percentage.
Disadvantages:
- Prone to overfitting
- Ineffective on short texts
What is the main con of topic modelling? What is the solution to this?
Ineffective on short texts
Solution: Word Embedding