Week 4 Flashcards
What is the main difference between NLP and NLU?
A. NLP focuses on speech only, while NLU handles images
B. NLP processes language, NLU interprets meaning and intent
C. NLP uses syntax, while NLU uses translation
D. NLU is for structured data, NLP for unstructured
✅ Correct Answer: B
Which of the following best describes tokenization?
A. Removing noise from text
B. Converting structured data to unstructured format
C. Splitting text into smaller units like words or phrases
D. Translating text into another language
✅ Correct Answer: C
What is the key limitation of stemming?
A. It increases model accuracy
B. It can create non-dictionary words
C. It requires part-of-speech tags
D. It is only useful for speech data
✅ Correct Answer: B
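To see this limitation in practice, here is a minimal sketch using NLTK's PorterStemmer (NLTK is an assumption here, used purely for illustration; any rule-based stemmer shows the same effect):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["studies", "university", "easily", "relational"]
print([stemmer.stem(w) for w in words])
# Likely output: ['studi', 'univers', 'easili', 'relat'], none of which are dictionary words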
Which NLP task helps identify names like ‘David’, ‘Apple Store’, or ‘Nigeria’?
A. Tokenization
B. Text Summarization
C. Named Entity Recognition
D. Sentiment Analysis
✅ Correct Answer: C
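For illustration, a minimal spaCy sketch of NER (the example sentence is made up for this card, and the exact labels depend on the model):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("David visited the Apple Store before flying to Nigeria.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Likely output: 'David' PERSON, 'Apple Store' ORG (label may vary), 'Nigeria' GPE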
Lemmatization is a more refined version of stemming.
✅ True
Corpus refers to a single sentence in a dataset.
❌ False — it’s a collection of text documents.
N-grams are useful for analyzing patterns in text classification.
✅ True
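As a quick illustration of n-grams as classification features, here is a minimal scikit-learn sketch (scikit-learn and the two toy sentences are assumptions for this example):
from sklearn.feature_extraction.text import CountVectorizer
texts = ["the network is down", "the bill is too high"]
# ngram_range=(1, 2) extracts both unigrams and bigrams as features
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(texts)
print(vectorizer.get_feature_names_out())
# Expected to include bigrams such as 'network is', 'is down', 'bill is', 'too high'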
Text normalization increases the number of unique tokens.
❌ False — it reduces them.
Q1: Write a Python snippet using spaCy to tokenize the sentence: “I love playing football in Port-Harcourt.”
import spacy
nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("I love playing football in Port-Harcourt.")
tokens = [token.text for token in doc]
print(tokens)  # Output: ['I', 'love', 'playing', 'football', 'in', 'Port', '-', 'Harcourt', '.']
Q2: Using spaCy, write code that performs lemmatization on the sentence: “The kids are playing outside.”
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The kids are playing outside.")
lemmas = [token.lemma_ for token in doc]
print(lemmas)  # Output: ['the', 'kid', 'be', 'play', 'outside', '.']
How do stemming and lemmatization contribute to improving text classification models?
✅ Sample Answer:
Both processes reduce words to their root form, helping reduce redundancy and dimensionality in text data. This allows models to generalize better across word variants (e.g., “run”, “runs”, “ran”), improving accuracy and reducing overfitting. However, stemming may create invalid roots, while lemmatization is more accurate.
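A minimal sketch that makes the contrast concrete (pairing NLTK's PorterStemmer with spaCy lemmatization is an assumption for illustration, not a course requirement):
import spacy
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["runs", "ran", "running"]])
# Likely: ['run', 'ran', 'run'], the irregular 'ran' is left unchanged

nlp = spacy.load("en_core_web_sm")
doc = nlp("She runs daily, he ran yesterday, and they are running now.")
print([token.lemma_ for token in doc if token.pos_ == "VERB"])
# Likely: ['run', 'run', 'run'], all variants map to the same lemma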
In what real-world scenarios would tokenization and named entity recognition (NER) be critical?
✅ Sample Answer:
Tokenization is foundational for any NLP task—search engines, spam filters, chatbots. NER is critical in information extraction: legal document analysis, medical diagnosis systems, customer service systems (e.g., identifying names, dates, product names from complaints).
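A short sketch of the customer-service case (the complaint text below is invented, and which entities are found depends on the spaCy model used):
import spacy
nlp = spacy.load("en_core_web_sm")
complaint = "On 3 March, Ngozi was charged $40 twice for a MiFi router in Lagos."
doc = nlp(complaint)
print([(ent.text, ent.label_) for ent in doc.ents])
# Likely entities include '3 March' DATE, 'Ngozi' PERSON, '$40' MONEY, 'Lagos' GPE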
A telecommunications company wants to use NLP to categorize customer complaints automatically.
Q: Design a step-by-step solution using NLP techniques.
✅ Solution (a code sketch follows the steps):
1. Data Collection: Gather customer messages from email/chat logs.
2. Tokenization: Split each message into tokens (words or phrases).
3. Text Normalization: Apply lemmatization to standardize words.
4. Keyword Extraction: Use TF-IDF or spaCy’s noun_chunks to extract important phrases.
5. Text Classification: Train a supervised model (e.g., Logistic Regression or Naive Bayes) with complaint category labels (Billing, Network, Customer Service).
6. Evaluation: Measure model accuracy, precision, and recall.
7. Deployment: Use a pipeline that receives input, processes text, and returns a department classification.
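A minimal end-to-end sketch of steps 2 through 7 (the toy complaints, category labels, and scikit-learn choices below are illustrative assumptions, not the company's actual data or stack):
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled complaints
complaints = [
    "I was charged twice on my last bill",
    "No signal in my area since yesterday",
    "The support agent never called me back",
    "My data bundle expired but I was still billed",
]
labels = ["Billing", "Network", "Customer Service", "Billing"]

# TF-IDF handles tokenization and weighting; Logistic Regression does the classification
model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(complaints, labels)

print(model.predict(["I keep losing connection at home"]))
# With such a tiny training set the prediction is illustrative only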