NLP basics Flashcards

1
Q

Why is NLP needed?

A

Bridging Human & Machine Communication – Computers process structured data, but human language is complex, ambiguous, and unstructured. NLP helps machines understand, interpret, and generate human language.

2
Q

What are Real-World Applications of NLP?

A

Search Engines (Google, Bing)

Chatbots & Virtual Assistants (Siri, Alexa)

Machine Translation (Google Translate)

Sentiment Analysis (Social Media Monitoring)

Spam Detection (Gmail filters)

3
Q

What is the main aim of BoW?

A

The main aim of Bag of Words (BoW) in NLP is to convert text data into numerical form so that machine learning models can process it.

BoW ignores grammar and word order but focuses on word frequency in a document to represent its content.

4
Q

What is an example of a BoW representation?

A

How it Works
Tokenization – Break the text into individual words (tokens).

Vocabulary Creation – Collect all unique words from the dataset.

Vectorization – Represent each document as a vector where each unique word is a feature (column) and the value is the word count (or sometimes binary presence).

Example
Text Data:
1️⃣ “I love NLP and Machine Learning”
2️⃣ “Machine Learning is amazing”

Vocabulary:
{I, love, NLP, and, Machine, Learning, is, amazing}

BoW representation of the above sentences:
1️⃣ 1 1 1 1 1 1 0 0
2️⃣ 0 0 0 0 1 1 1 1
This numeric representation can now be used for training models!
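
For reference, a minimal sketch of the same idea with scikit-learn's CountVectorizer (assuming scikit-learn is available). Note that its default tokenizer lowercases text and drops single-character tokens such as "I", so the learned vocabulary differs slightly from the hand-built one above.

```python
# Minimal BoW sketch with scikit-learn (illustrative, not the only way).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love NLP and Machine Learning",
    "Machine Learning is amazing",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)       # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # vocabulary (one column per word)
print(bow.toarray())                       # word counts per document
```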

5
Q

What are the Limitations of BoW?

A

Ignores Word Order – “NLP is great” and “Great is NLP” have the same representation.

Sparse Representation – Large vocabularies create huge matrices with many zeros.

Fails to Capture Meaning – Doesn’t consider context or semantics.

Sensitive to Common Words – Frequently occurring words dominate, even if they don’t add much meaning.

6
Q

How is BoW sensitive to common words?

A

BoW is sensitive to common words because it relies purely on word frequency. This means:

Frequent words dominate – Words like “the,” “is,” and “and” appear very often but don’t carry much meaning. However, since BoW is based on raw counts, these words will have high importance.

Important but rare words get overshadowed – Words that are actually meaningful (like “artificial” or “neural”) might appear only a few times but can be lost in the noise of frequent words.

For example, if you have these two sentences:
“The AI model is powerful.”
“It is changing the world.”
A BoW representation may give high importance to “is” and “the” rather than “AI” or “powerful,” which are more relevant.

💡 Solution? This is where TF-IDF comes in! It reduces the impact of common words and gives more weight to rare but meaningful ones.

7
Q

Why TF-IDF?

A

TF-IDF (Term Frequency - Inverse Document Frequency):

In NLP, we need to extract important words from text while ignoring common words (like “the,” “is,” “and”).

BoW treats all words equally, which is a problem because frequent words dominate.

TF-IDF solves this by reducing the importance of common words and highlighting rare, meaningful words

8
Q

What is TF-IDF?

A

Definition of TF-IDF
TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical statistic that reflects how important a word is in a document relative to a collection of documents (corpus).

It consists of two parts:

Term Frequency (TF) → How frequently a word appears in a document.

Inverse Document Frequency (IDF) → How rare the word is across all documents.
TF-IDF = TF × IDF

This helps assign higher scores to important words and lower scores to commonly occurring words in NLP tasks like text classification, search engines, and keyword extraction.

9
Q

TF Formula

A

TF Measures how often a word appears in a document.
Term Frequency (TF) = (Number of times the word appears in the document) / (Total number of words in the document)

Higher TF → More frequent word in the document.

But it doesn’t consider how common the word is across all documents.
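
A tiny worked example of this formula in plain Python (purely illustrative):

```python
# TF for one word in one document: count / total number of words.
doc = "the cat sits on the mat".split()

word = "the"
tf = doc.count(word) / len(doc)  # 2 occurrences / 6 words
print(tf)                        # ≈ 0.33
```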

10
Q

IDF Formula

A

IDF Reduces the importance of common words by assigning lower weights.

IDF = log(Total number of documents / Number of documents containing the word)

If a word appears in many documents, IDF becomes small → lowers its importance.

If a word is rare, IDF becomes large → boosts its importance.
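
A tiny worked example of the IDF formula over a toy corpus (purely illustrative):

```python
# IDF: log(total documents / documents containing the word).
import math

corpus = [
    "the cat sits on the mat",
    "the dog barks",
    "neural networks learn representations",
]

def idf(word):
    docs_with_word = sum(1 for doc in corpus if word in doc.split())
    return math.log(len(corpus) / docs_with_word)

print(idf("the"))     # common word: log(3/2) ≈ 0.41 → low weight
print(idf("neural"))  # rare word:   log(3/1) ≈ 1.10 → high weight
```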

11
Q

TF-IDF Formula

A

Final TF-IDF Formula:

TF-IDF = TF × IDF

High TF-IDF score → Important word in a specific document but rare across all documents.

Low TF-IDF score → Either a very common word or an irrelevant word.
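
A minimal sketch of TF-IDF in practice with scikit-learn's TfidfVectorizer (assuming scikit-learn is available). Note that scikit-learn applies a smoothed IDF and L2-normalizes each document vector, so the numbers differ slightly from the plain TF × IDF formula above.

```python
# TF-IDF sketch: common words like "is" and "the" receive lower weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The AI model is powerful",
    "It is changing the world",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```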

12
Q

What are the disadvantages of BoW and TF-IDF ?

A

1️⃣ They ignore meaning – “king” and “queen” are just different words, no relation.
2️⃣ They don’t capture context – “bank” (riverbank) and “bank” (finance) are treated the same.
3️⃣ Sparse & Huge – BoW/TF-IDF create massive vectors with many zeroes.

The solution is word embeddings

13
Q

What are Word Embeddings?

A

Word Embeddings:
Word embeddings are a way to represent words as dense, real-valued vectors in a continuous vector space (typically a few hundred dimensions, far smaller than sparse BoW/TF-IDF vectors).
They capture semantic meaning, meaning similar words have similar vector representations.
Example: “King” and “Queen” will have vectors close to each other, unlike “King” and “Apple.”

14
Q

What is Word2Vec?

A

Word2Vec (A Method to Create Word Embeddings)
Word2Vec is a technique to generate word embeddings.
It is a neural network-based model that learns vector representations of words from a large corpus of text.
It has two main approaches: CBOW and Skip-gram.

15
Q

Explain how CBOW works, with an example.

A

Step 1: Input Your Book
You provide the entire book (or a dataset) as input to the CBOW model.
The model reads all the words and their contexts.

Step 2: Learn Word Embeddings
CBOW will analyze each word’s context and learn embeddings.
Example:
Sentence: “The cat sits on the mat.”
Training Example: “The cat sits on ___.” → Predict “mat”
It learns that “cat”, “dog”, and “pet” often appear in similar contexts → so they get similar embeddings.

Step 3: Generate the Vocabulary & Word Vectors
The model creates a vocabulary set from the book (all unique words).
Each word gets a word vector (set of numbers representing the word in multi-dimensional space).

Step 4: Represent Dataset Numerically
Now, your dataset (book content) is converted into numerical representations using these learned embeddings.

This makes it ready for machine learning models (e.g., for spam detection, sentiment analysis, etc.)
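
A minimal sketch of training CBOW embeddings with gensim's Word2Vec (assuming gensim is installed; a real model needs a far larger corpus than these toy sentences to learn useful vectors).

```python
# CBOW training sketch with gensim (sg=0 selects CBOW).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sits", "on", "the", "rug"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the word vectors
    window=2,        # context window size
    min_count=1,     # keep even rare words (toy corpus)
    sg=0,            # 0 = CBOW, 1 = Skip-gram
)

print(model.wv["cat"][:5])           # first few dimensions of "cat"'s vector
print(model.wv.most_similar("cat"))  # nearest words in the toy embedding space
```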

16
Q

What are the limitations of CBOW?

A

CBOW learns context only from the dataset you provide (a book, a collection of documents, etc.). It does not have any built-in knowledge—it purely learns from the words and their relationships in your given data.

So, if your dataset is small or lacks diversity, CBOW might not learn good embeddings. That’s why large datasets or pretrained embeddings (like Google’s Word2Vec) are often used.

🔹 Limitations:
❌ Struggles with rare words.
❌ Requires a lot of data for good embeddings.

🔹 Solution for Rare Words?
Skip-gram (better for rare words).
Pretrained embeddings (e.g., Google Word2Vec, FastText).

17
Q

How does Skip-gram work?

A

Skip-gram: A Word2Vec model that learns word embeddings by predicting the context words given a single target word.

Skip-gram learns word embeddings only from the given dataset. It does not have predefined knowledge; it builds relationships based on the words present in the input data.

CBOW predicts the target word from surrounding words, so it relies on context words.

Rare words don’t have enough context, so their embeddings don’t become meaningful.

Skip-gram flips the approach: Instead of predicting a target word, it predicts context words from a given word.

Skip-gram Example
Consider the sentence:
“The cat sits on the mat.”

How CBOW Works:
🔹 Input: [“The”, “sits”, “on”, “the”]
🔹 Predict: “cat”

How Skip-gram Works:
🔹 Input: “cat”
🔹 Predict: [“The”, “sits”, “on”, “the”]

Key Difference:

CBOW: Predicts the target word from context.
Skip-gram: Predicts context from a single word.
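
The same gensim call trains Skip-gram instead of CBOW by setting sg=1 (a minimal sketch, assuming gensim is installed):

```python
# Skip-gram training sketch: each target word predicts its context words.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sits", "on", "the", "rug"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv.most_similar("cat"))
```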

18
Q

Compare CBOW and Skip-gram.

A

CBOW (Continuous Bag of Words)
Predicts target word from surrounding context words
Faster training (processes multiple words at once)
Struggles with rare words (averaging loses information)
Works well for large datasets with common words
Less accurate word embeddings

Skip-gram
Predicts surrounding context words from a target word
Slower training (processes one word at a time)
Performs well on rare words (learns from individual word pairs)
Works better for small datasets and detailed relationships
More accurate word embeddings

19
Q

What is GloVe?

A

GloVe (Global Vectors for Word Representation) is a word embedding technique that learns word meanings by analyzing word co-occurrence in a large corpus. Unlike Word2Vec (which learns embeddings from local context windows), GloVe captures both local and global context to generate better word representations.

GloVe creates word embeddings using a co-occurrence matrix, which captures how often words appear together in a large dataset.

It finds relationships between words by factorizing this matrix, resulting in meaningful word vectors (words used in similar contexts have similar vectors).

Unlike Word2Vec (which only learns from local context windows), GloVe learns from the entire corpus at once.
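
In practice, GloVe vectors are usually loaded pretrained rather than trained from scratch. A minimal sketch using gensim's downloader (assuming gensim is installed and the "glove-wiki-gigaword-50" pretrained model is available to download):

```python
# Loading pretrained 50-dimensional GloVe vectors via gensim-data.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")

print(glove.most_similar("king", topn=3))  # semantically related words
print(glove.similarity("king", "queen"))   # high similarity
print(glove.similarity("king", "apple"))   # much lower similarity
```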

20
Q

What are the different ways to convert text into numerical vectors?

A

Traditional Methods
1️⃣ BoW (Bag of Words) → Counts word occurrences (loses order & meaning).
2️⃣ TF-IDF → Weighs words based on importance (still loses context).

Advanced Word Embeddings
3️⃣ Word2Vec (CBOW & Skip-gram) → Learns context-based word vectors.
4️⃣ GloVe → Learns embeddings from a co-occurrence matrix (global context).
5️⃣ FastText → Like Word2Vec but considers subwords (better for rare words).

State-of-the-Art Models
6️⃣ BERT (Transformers) → Learns deep contextualized word embeddings.
7️⃣ GPT (Transformer-based) → Generates text with context-aware embeddings.