NLP Concepts Flashcards
What is NLP?
NLP (Natural Language Processing) is the field of teaching computers to understand and communicate in human language, the same way we communicate with each other.
Typical applications include language translation, chatbots, assistants like Siri and Alexa, sentiment analysis, text generation, and text classification.
What is an NLP pipeline?
It is a series of processing tasks used to transform raw text data into a structured format suitable for ML.
Name the steps in an NLP pipeline
Data acquisition
Text pre-processing
Feature extraction
Modeling
Evaluation
Deployment
Monitoring
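A minimal end-to-end sketch of these steps, assuming scikit-learn and toy data; the dataset, vectorizer, and model are illustrative, and deployment/monitoring are only noted in comments:

```python
# Illustrative NLP pipeline sketch (assumes scikit-learn); toy data stands in
# for real data acquisition.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# 1. Data acquisition (toy labeled examples)
texts = ["I loved this movie", "Terrible plot and acting",
         "What a great film", "I hated every minute"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# 2-3. Pre-processing + feature extraction, 4. Modeling
pipe = Pipeline([
    ("features", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("model", LogisticRegression()),
])
pipe.fit(texts, labels)

# 5. Evaluation (on the training data here, just to show the step)
print(accuracy_score(labels, pipe.predict(texts)))

# 6-7. Deployment and monitoring would wrap pipe.predict() behind a service
# and track prediction quality over time.
```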
What is tokenization?
It is the process of breaking down a sentence into words or words into sub-words or characters.
These are called tokens. They are the building blocks of NLP models.
They also help build a vocabulary.
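A minimal sketch using only the Python standard library; real projects typically use a library tokenizer (NLTK, spaCy, or a subword tokenizer):

```python
# Simple tokenization sketch: split a sentence into tokens and build a vocabulary.
import re

sentence = "Tokenization breaks a sentence into words, sub-words, or characters."
print(sentence.split())                        # naive whitespace split keeps punctuation attached
tokens = re.findall(r"\w+|[^\w\s]", sentence)  # words and punctuation as separate tokens
print(tokens)
vocab = sorted(set(tokens))                    # the tokens also define a vocabulary
print(vocab)
```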
Why do we tokenize the data in an NLP pipeline?
Tokens are the basic building blocks of NLP models.
Tokenization converts unstructured text into a structured format.
It makes pre-processing easier.
Each token can serve as a feature.
What is stemming?
It is the process of reducing a word to its root form (stem). The root may or may not be an actual word in the language's vocabulary.
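A small sketch with NLTK's PorterStemmer (assuming NLTK is installed); note that some stems, like "studi", are not real English words:

```python
# Stemming sketch (assumes nltk is installed); Porter stemming can yield
# roots that are not dictionary words, e.g. "studies" -> "studi".
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "studies", "happily", "ate"]:
    print(word, "->", stemmer.stem(word))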
What is lemmatization?
It is a smarter version of stemming: the derived root word (lemma) is always a valid word in the language's vocabulary.
Ex: ate, eaten, and eating share the lemma eat.
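A sketch with NLTK's WordNetLemmatizer (assumes NLTK plus its WordNet data); passing the part of speech ("v" for verb) is what maps ate/eaten/eating to eat:

```python
# Lemmatization sketch (assumes nltk plus its WordNet corpus data).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["ate", "eaten", "eating"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # all map to "eat"
```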
What is a corpus?
It is the entire body of documents we use in the context of our NLP app.
More broadly, it is a collection of texts or writings that linguists use to study how language works.
What does a count vectorizer do?
NLP technique used to transform a collection of text documents into a matrix of token counts (a feature matrix).
Used to implement BoW
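A minimal sketch, assuming scikit-learn's CountVectorizer; the toy documents are illustrative:

```python
# CountVectorizer sketch (assumes scikit-learn): text documents -> count matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate the cat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term matrix
print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray())                           # per-document word counts (BoW)
```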
What is Bag of Words (BoW)?
NLP technique for text analysis such as text classification, sentiment analysis, etc.
Unordered collection of words.
Does not retain word order or semantic meaning.
Retains word frequency of each document.
Characterized by sparse vectors.
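To see the "unordered" point concretely, a short sketch (again assuming scikit-learn) where two sentences with the same words in a different order get identical BoW vectors:

```python
# BoW ignores word order: permuted sentences map to the same count vector.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog bites man", "man bites dog"]
X = CountVectorizer().fit_transform(docs).toarray()
print(X)                       # both rows are identical
print((X[0] == X[1]).all())    # True
```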
What is TF-IDF?
It is a statistic used to evaluate the importance of a term (word or phrase) within a document relative to a collection of documents, or corpus.
How is TF calculated?
Number of times the term occurs in the document / total number of terms in that document.
How is IDF calculated?
log (# of documents in a corpus / # of documents in which the term appears).
TF-IDF is the product of TF and IDF.
The log base is a matter of convention; natural log (base e) is commonly used.
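A tiny worked sketch of the plain formulas above in Python (natural log); the toy documents and the chosen term are illustrative only:

```python
# Plain TF-IDF calculation following the formulas above (natural log).
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]
term = "dog"
doc = docs[1]

tf = doc.count(term) / len(doc)          # 1/3 of the terms in this document
df = sum(1 for d in docs if term in d)   # term appears in 2 of 3 documents
idf = math.log(len(docs) / df)           # ln(3/2)
print(tf * idf)                          # ~0.135
```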
What are word embeddings?
Dense, real-valued vector representations of words in a continuous vector space (much lower-dimensional than sparse one-hot or BoW vectors).
Designed to capture the semantic meaning and relationships between words.
What is Word2Vec?
It is a method for transforming words into numerical vectors that capture the meaning and context of words based on their co-occurrence in large text datasets.
Generates word embeddings - converts words into vectors using neural networks.
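A minimal training sketch, assuming the gensim library; the tiny corpus and hyperparameters are only for illustration (real embeddings need much larger datasets):

```python
# Word2Vec sketch (assumes gensim); a real model needs a much larger corpus.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"],
             ["cats", "and", "dogs", "are", "pets"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["cat"][:5])                   # first few dimensions of the embedding
print(model.wv.most_similar("cat", topn=3))  # nearest words by cosine similarity
```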