Module 3 Flashcards
What are the 5 main types of tasks performed with text?
classification
clustering
sentiment analysis
conversation simulation
text generation
What is sentiment analysis?
The process of identifying subjective information (e.g., opinions and attitudes) about an object from text
How can we perform classification or clustering on raw texts?
Tokenization
Stop-word removal
Stemming
Text normalization
Define tokenization
A token is a unit for processing text, e.g., a sentence or a word. Tokenization is the process of splitting a text corpus into a set of such tokens (segments), which serve as useful units for linguistic processing.
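The idea can be sketched in a few lines of Python. This regex-based splitter is illustrative (the `tokenize` name and the pattern are my own, not from the module):

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (simple regex-based sketch)."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Tokenization splits a corpus into units.")
# tokens -> ['tokenization', 'splits', 'a', 'corpus', 'into', 'units']
```

Real tokenizers handle punctuation, contractions, and sentence boundaries with more care, but the core operation is the same segmentation step.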
What are stop words?
Examples include and, you, is, at, the, in, and a. In particular: articles, prepositions, conjunctions, and sometimes pronouns.
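Stop-word removal is a simple filter over the token list. A minimal sketch, using the example words above as the (hand-picked, illustrative) stop-word set:

```python
# Hand-picked stop-word set from the card above (not a standard list).
STOP_WORDS = {"and", "you", "is", "at", "the", "in", "a"}

def remove_stop_words(tokens):
    """Drop tokens that carry little standalone meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

filtered = remove_stop_words(["the", "cat", "is", "in", "a", "box"])
# filtered -> ['cat', 'box']
```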
What is stemming?
The process of cutting off the end or beginning of a word to reduce it to its root form.
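A crude suffix-stripping sketch shows the mechanism (real stemmers such as Porter's use ordered rule sets; the `stem` function and its suffix list here are made up for illustration):

```python
def stem(word):
    """Strip a common suffix if the remaining stem is long enough (crude sketch)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# stem("jumped") -> 'jump'; stem("cats") -> 'cat'
# Note the crudeness: stem("running") -> 'runn', not the true root 'run'.
```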
What is a widely used text-normalization method?
Term frequency/inverse document frequency (TF-IDF), a vector-space-representation-based normalization. It evaluates how important a word is to a document within a corpus: TF measures how frequently a term appears in a document, and IDF measures how rare the term is across the documents of the corpus.
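A minimal worked computation, assuming the common TF = count/length and IDF = log(N/df) formulation (other variants exist; the function name is mine):

```python
import math

def tf_idf(term, doc, corpus):
    """TF weighs frequency in the document; IDF down-weights terms
    that appear in many documents of the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)   # document frequency
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [["the", "cat"], ["the", "dog"], ["the", "cat", "cat"]]
# "the" occurs in every document, so its IDF (and hence TF-IDF) is 0;
# "cat" is rarer across the corpus, so it gets a positive score.
score_the = tf_idf("the", corpus[2], corpus)   # -> 0.0
score_cat = tf_idf("cat", corpus[2], corpus)   # > 0
```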
What does bag of words do?
This method treats the tokenized document as a list of words and ignores their positions in the sentence. It transforms the tokens into a list of words and their frequencies. Useful in text-clustering algorithms.
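In Python this is essentially a frequency count; a sketch using the standard library's `Counter`:

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each token to its frequency; word order is discarded."""
    return Counter(tokens)

bow = bag_of_words(["to", "be", "or", "not", "to", "be"])
# bow -> Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```

Note that "be to or not be to" would produce the identical representation, which is exactly the structural information bag of words throws away.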
What does N-grams do?
Similar to bag of words, but it groups each word with its adjacent words, so it takes word positioning into account. N refers to how many consecutive words are captured; N = 2 or 3 usually suffices.
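N-gram extraction is a sliding window over the token list; a minimal sketch (function name is mine):

```python
def ngrams(tokens, n=2):
    """Slide a window of size n over the tokens, preserving word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["new", "york", "city"], n=2)
# bigrams -> [('new', 'york'), ('york', 'city')]
```

Unlike bag of words, the pair ('new', 'york') survives as a unit, which is why n-grams capture some local structure at the cost of a much larger feature space.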
What does part of speech tagging do?
A method that identifies and annotates the grammatical role each word plays in a sentence (e.g., noun, verb, article).
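A toy lookup-based tagger illustrates the input/output shape (real taggers use trained statistical models; this lexicon and tag set are invented for the example):

```python
# Tiny hand-made lexicon (illustrative only).
LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB"}

def pos_tag(tokens):
    """Pair each token with a part-of-speech tag from the lexicon."""
    return [(t, LEXICON.get(t, "UNK")) for t in tokens]

tagged = pos_tag(["the", "dog", "barks"])
# tagged -> [('the', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]
```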
What is an example of a word embedding algorithm?
Word2Vec. It identifies the role of a word based on its context. Using a neural network, it converts the given corpus into word vectors, and it establishes associations between words by comparing those vectors (e.g., via Euclidean distance).
What two major tasks does Word2Vec perform?
Using the context to predict a word, i.e., Continuous Bag-of-Words (CBOW), or using the word to predict its context (skip-gram).
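The two tasks differ only in which side of the (context, word) pair is the input. This sketch generates the training pairs for both objectives, not the neural network itself (the function name and window size are my own choices):

```python
def training_pairs(tokens, window=1):
    """CBOW pairs (context words -> target word) and
    skip-gram pairs (target word -> each context word)."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

cbow, sg = training_pairs(["i", "like", "nlp"])
# cbow[1] -> (['i', 'nlp'], 'like')   # context predicts the word
# sg contains ('like', 'i') and ('like', 'nlp')  # word predicts the context
```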
What is lemmatization?
Similar to stemming, but it also considers the role of the word in the sentence.
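A tiny lookup-based sketch shows why the word's role matters (real lemmatizers use dictionaries plus POS information; this `LEMMAS` table is invented for illustration):

```python
# (word, part of speech) -> lemma; the same surface form can map differently.
LEMMAS = {("meeting", "VERB"): "meet",
          ("meeting", "NOUN"): "meeting",
          ("better", "ADJ"): "good"}

def lemmatize(word, pos):
    """Return the dictionary lemma for a word given its sentence role."""
    return LEMMAS.get((word, pos), word)

# lemmatize("meeting", "VERB") -> 'meet', but as a NOUN it stays 'meeting' --
# a distinction a pure suffix-stripping stemmer cannot make.
```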
What are the downsides of bag of words and n-grams
Bag of words fails to capture the structure of a sentence, and n-grams are inefficient in terms of resource use.
What is sentiment analysis, and what three types of info must be automatically identified?
Also called opinion mining; the process that enables machines to automatically understand opinions.
The three types of info are:
- Polarity (pos/neg)
- Subject (topic being discussed)
- Opinion holder (who is expressing the opinion)
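The polarity component can be sketched with a lexicon-based scorer (the lexicon and thresholds here are made up; real systems use large sentiment lexicons or trained classifiers, and identifying the subject and opinion holder requires deeper analysis):

```python
# Toy sentiment lexicon (illustrative only).
POLARITY = {"good": 1, "great": 1, "bad": -1, "terrible": -1}

def polarity(tokens):
    """Sum per-word polarity scores and map the total to a label."""
    score = sum(POLARITY.get(t, 0) for t in tokens)
    return "pos" if score > 0 else "neg" if score < 0 else "neutral"

# polarity(["the", "movie", "was", "great"]) -> 'pos'
# polarity(["the", "plot", "was", "terrible"]) -> 'neg'
```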