NLP Concepts Flashcards
What is NLP?
NLP (Natural Language Processing) is the field of teaching computers to understand and communicate in human language, the same way we communicate with each other.
Typical applications include language translation, chatbots, assistants like Siri and Alexa, sentiment analysis, text generation, and text classification.
What is an NLP pipeline?
It is a series of processing tasks used to transform raw text data into a structured format suitable for ML.
Name the steps in an NLP pipeline
Data acquisition
Text pre-processing
Feature extraction
Modeling
Evaluation
Deployment
Monitoring
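A minimal end-to-end sketch of these steps, assuming scikit-learn and toy data; the dataset, vectorizer, and model are illustrative, and deployment/monitoring are only noted in comments:

```python
# Illustrative NLP pipeline sketch (assumes scikit-learn); toy data stands in
# for real data acquisition.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# 1. Data acquisition (toy labeled examples)
texts = ["I loved this movie", "Terrible plot and acting",
         "What a great film", "I hated every minute"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# 2-3. Pre-processing + feature extraction, 4. Modeling
pipe = Pipeline([
    ("features", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("model", LogisticRegression()),
])
pipe.fit(texts, labels)

# 5. Evaluation (on the training data here, just to show the step)
print(accuracy_score(labels, pipe.predict(texts)))

# 6-7. Deployment and monitoring would wrap pipe.predict() behind a service
# and track prediction quality over time.
```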
What is tokenization?
It is the process of breaking down a sentence into words or words into sub-words or characters.
These are called tokens. They are the building blocks of NLP models.
They also help build a vocabulary.
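A minimal sketch using only the Python standard library; real projects typically use a library tokenizer (NLTK, spaCy, or a subword tokenizer):

```python
# Simple tokenization sketch: split a sentence into tokens and build a vocabulary.
import re

sentence = "Tokenization breaks a sentence into words, sub-words, or characters."
print(sentence.split())                        # naive whitespace split keeps punctuation attached
tokens = re.findall(r"\w+|[^\w\s]", sentence)  # words and punctuation as separate tokens
print(tokens)
vocab = sorted(set(tokens))                    # the tokens also define a vocabulary
print(vocab)
```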
Why do we tokenize the data in an NLP pipeline?
Tokens are the basic building blocks of NLP models.
Tokenization converts unstructured text into a structured format.
It makes pre-processing easier.
Each token can serve as a feature.
What is stemming?
It is the process of reducing a word to its root form (stem). The root may or may not be an actual word in the language's vocabulary.
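A small sketch with NLTK's PorterStemmer (assuming NLTK is installed); note that some stems, like "studi", are not real English words:

```python
# Stemming sketch (assumes nltk is installed); Porter stemming can yield
# roots that are not dictionary words, e.g. "studies" -> "studi".
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "studies", "happily", "ate"]:
    print(word, "->", stemmer.stem(word))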
What is lemmatization?
It is a smarter version of stemming: the derived root word (lemma) is always a valid word in the language's vocabulary.
Ex: ate, eaten, and eating share the lemma eat.
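A sketch with NLTK's WordNetLemmatizer (assumes NLTK plus its WordNet data); passing the part of speech ("v" for verb) is what maps ate/eaten/eating to eat:

```python
# Lemmatization sketch (assumes nltk plus its WordNet corpus data).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["ate", "eaten", "eating"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # all map to "eat"
```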
What is a corpus?
It is the entire body of documents we use in the context of our NLP app.
More broadly, it is a collection of texts or writings that linguists use to study how language works.
What does a count vectorizer do?
NLP technique used to transform a collection of text documents into a matrix of token counts (a feature matrix).
Used to implement BoW
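A minimal sketch, assuming scikit-learn's CountVectorizer; the toy documents are illustrative:

```python
# CountVectorizer sketch (assumes scikit-learn): text documents -> count matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate the cat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term matrix
print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray())                           # per-document word counts (BoW)
```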
What is Bag of Words (BoW)?
NLP technique for text analysis such as text classification, sentiment analysis, etc.
Unordered collection of words.
Does not retain word order or semantic meaning.
Retains word frequency of each document.
Characterized by sparse vectors.
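To see the "unordered" point concretely, a short sketch (again assuming scikit-learn) where two sentences with the same words in a different order get identical BoW vectors:

```python
# BoW ignores word order: permuted sentences map to the same count vector.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog bites man", "man bites dog"]
X = CountVectorizer().fit_transform(docs).toarray()
print(X)                       # both rows are identical
print((X[0] == X[1]).all())    # True
```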
What is TF-IDF?
It is a statistic used to evaluate the importance of a term (word or phrase) within a document relative to a collection of documents, or corpus.
How is TF calculated?
Number of times the term occurs in the document / total number of terms in that document.
How is IDF calculated?
log (# of documents in a corpus / # of documents in which the term appears).
TF-IDF is the product of TF and IDF.
The log base is a matter of convention; natural log (base e) is commonly used.
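A tiny worked sketch of the plain formulas above in Python (natural log); the toy documents and the chosen term are illustrative only:

```python
# Plain TF-IDF calculation following the formulas above (natural log).
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]
term = "dog"
doc = docs[1]

tf = doc.count(term) / len(doc)          # 1/3 of the terms in this document
df = sum(1 for d in docs if term in d)   # term appears in 2 of 3 documents
idf = math.log(len(docs) / df)           # ln(3/2)
print(tf * idf)                          # ~0.135
```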
What are word embeddings?
Dense, real-valued vector representations of words in a continuous vector space (much lower-dimensional than sparse one-hot or BoW vectors).
Designed to capture the semantic meaning and relationships between words.
What is Word2Vec?
It is a method for transforming words into numerical vectors that capture the meaning and context of words based on their co-occurrence in large text datasets.
Generates word embeddings - converts words into vectors using neural networks.
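A minimal training sketch, assuming the gensim library; the tiny corpus and hyperparameters are only for illustration (real embeddings need much larger datasets):

```python
# Word2Vec sketch (assumes gensim); a real model needs a much larger corpus.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"],
             ["cats", "and", "dogs", "are", "pets"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["cat"][:5])                   # first few dimensions of the embedding
print(model.wv.most_similar("cat", topn=3))  # nearest words by cosine similarity
```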