Week 2 Flashcards
What is a Lexicon in NLP?
A lexicon is a collection of words, often with definitions, parts of speech, and other linguistic information used to analyze language.
What are Stopwords?
Stopwords are common words (e.g., “the”, “and”, “is”) that are often removed in NLP tasks because they add little value to the meaning of a text.
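A minimal sketch of stopword removal. The STOPWORDS set here is a tiny hypothetical list chosen for illustration; real stopword lists are much longer and depend on the task.

```python
# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "and", "is", "a", "of", "on"}

def remove_stopwords(tokens):
    """Return the tokens whose lowercase form is not in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cat", "is", "on", "the", "mat"]))
# prints ['cat', 'mat']
```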
Define WordNet in NLP.
WordNet is a lexical database of English that groups words into synsets (synonym sets), each with a definition and usage examples, and links synsets through semantic relations such as hypernymy.
What is a Synset in WordNet?
A synset is a set of synonyms that represent one concept in WordNet, providing a richer structure than traditional dictionaries.
Describe the difference between Stemming and Lemmatization.
Stemming strips affixes with heuristic rules to produce a stem that may not be a real word (e.g., “studies” to “studi”), while lemmatization reduces a word to its dictionary form, or lemma (e.g., “running” to “run”), using its part of speech and a vocabulary.
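A toy contrast between the two approaches, assuming a hypothetical suffix list and a hypothetical lemma table. Neither function is a real stemmer or lemmatizer; the point is only that rule-based stripping can yield non-words while a dictionary lookup returns a real word.

```python
# Hypothetical suffix rules and lemma table, for illustration only.
SUFFIXES = ("ing", "es", "ed", "s")
LEMMAS = {"jumping": "jump", "studies": "study", "better": "good"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["jumping", "studies", "better"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# prints:
# jumping jump jump
# studies studi study
# better better good
```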
What is Tokenization?
Tokenization is the process of breaking text into individual tokens (words or punctuation), which simplifies text analysis.
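A minimal sketch of a regex tokenizer, assuming that splitting into word runs and single punctuation marks is sufficient. Practical tokenizers handle contractions, abbreviations, and URLs more carefully.

```python
import re

def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))
# prints ['Hello', ',', 'world', '!']
```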
Why is Named Entity Recognition (NER) important in NLP?
NER identifies and classifies entities in text, like names, locations, and dates, helping to extract specific information.
What are N-Grams in language modeling?
N-Grams are contiguous sequences of N words used to capture word patterns in text, such as bigrams (2 words) and trigrams (3 words).
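A short sketch of n-gram extraction with a sliding window over a token list. The function name ngrams is chosen for this example.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "cat", "sat", "down"]
print(ngrams(tokens, 2))
# prints [('the', 'cat'), ('cat', 'sat'), ('sat', 'down')]
print(ngrams(tokens, 3))
# prints [('the', 'cat', 'sat'), ('cat', 'sat', 'down')]
```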
What is the Bag of Words (BoW) model?
BoW is a representation that treats text as an unordered collection of words, ignoring grammar and word order and keeping only how often each word occurs.
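A minimal sketch of a bag of words built with collections.Counter, assuming the text is already tokenized. Two texts with the same words in a different order get the same representation.

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each lowercased token to the number of times it occurs."""
    return Counter(t.lower() for t in tokens)

a = bag_of_words(["the", "dog", "bit", "the", "man"])
b = bag_of_words(["the", "man", "bit", "the", "dog"])
print(a["the"])  # prints 2
print(a == b)    # prints True, because word order is discarded
```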
Describe TF-IDF and its purpose.
TF-IDF (Term Frequency-Inverse Document Frequency) highlights important words in a document by balancing term frequency with rarity across all documents.
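A sketch of one common TF-IDF variant, assuming tf = count / document length and idf = log(N / document frequency). Libraries often use smoothed or normalized variants, so their values will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "purred"]]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "the" appears in every document, so its idf is log(1) = 0 and its score is zero, while "sat" (one document) scores higher than "cat" (two documents).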
What is Text Classification?
Text classification is the task of assigning text to predefined categories, as in spam detection or sentiment analysis.
Why are Decision Trees used in NLP?
Decision trees are used for text classification tasks due to their simplicity and interpretability, breaking down decisions based on word features.
What is the Naïve Bayes algorithm’s role in NLP?
Naïve Bayes is a probabilistic algorithm used for text classification; it assumes features are conditionally independent given the class, which simplifies the probability calculations.
Explain Zero Counts and Smoothing in Naïve Bayes.
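A zero count occurs when a word never appears with a given class in the training data, so its estimated probability is zero and the whole product for that class becomes zero. Smoothing (e.g., add-one or Laplace smoothing) adds a small count to every word, preventing zero probabilities in calculations.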
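A compact sketch of multinomial Naïve Bayes with add-alpha (Laplace) smoothing, working in log space. The function names, the tiny training set, and the choice to skip words outside the training vocabulary are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs, alpha=1.0):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, tokens):
    """Return the class with the highest smoothed log-probability."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # assumption: ignore words never seen in training
            # Add-alpha smoothing keeps this ratio above zero even when
            # the word never occurred with this class.
            score += math.log((word_counts[label][t] + alpha)
                              / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [
    (["win", "money", "now"], "spam"),
    (["free", "money"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "at", "noon"], "ham"),
]
model = train_nb(train)
print(predict_nb(model, ["free", "money", "now"]))  # prints spam
print(predict_nb(model, ["meeting", "at", "noon"]))  # prints ham
```

Here "free" never occurs in a ham message, so without smoothing its ham probability would be zero; with alpha = 1 it becomes 1/14 instead.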
What is Model Evaluation in text classification?
Model evaluation measures how well a model performs on data it was not trained on, often using metrics like accuracy, precision, recall, and F1-score.
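A small sketch of precision, recall, and F1 for one positive class, computed from true and predicted labels. Returning 0.0 when a denominator is zero is a convention assumed for this example.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
print(precision_recall_f1(y_true, y_pred, "spam"))
```

In this example there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 2/3.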