w2 information retrieval (also tokenization flashcards)
What is stemming?
Stemming is a text normalization technique that reduces words to their base or root form by stripping affixes. It may not result in real words.
Example of stemming
“running” → “run”
“better” → “bet” (with an aggressive stemmer; more conservative stemmers such as Porter leave “better” unchanged)
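A minimal sketch with NLTK’s stemmers (assuming the nltk package is installed); the exact output depends on which stemmer you choose:

```python
# A sketch using NLTK's stemmers; outputs depend on the stemmer chosen.
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["running", "studies", "better"]:
    # Porter is conservative (e.g., it leaves "better" unchanged);
    # Lancaster is more aggressive and may cut words down further.
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
```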
What is the main disadvantage of stemming?
Stemming can result in non-dictionary words or incomplete forms that may lose their meaning, e.g., “studies” → “studi”.
What is lemmatization in NLP?
Lemmatization is a text normalization technique that reduces words to their dictionary base form (lemma), considering context and grammar (like part of speech).
Example of lemmatization
“running” → “run”
“better” → “good”
“studies” → “study”
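A minimal sketch with NLTK’s WordNet lemmatizer (assumes the nltk package and the WordNet data have been downloaded):

```python
# A sketch using NLTK's WordNet lemmatizer (requires the WordNet data:
# nltk.download("wordnet")). The pos argument gives the part of speech.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("studies", pos="n"))  # study
```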
What is the key difference between stemming and lemmatization?
Stemming cuts off affixes to get the root form without regard for correctness, while lemmatization considers grammatical context and returns the dictionary form.
Which method is more accurate: stemming or lemmatization?
Lemmatization is more accurate because it results in meaningful words based on linguistic analysis, while stemming is faster but less precise.
Which NLP task might prefer stemming over lemmatization?
Stemming might be preferred in tasks where speed is more critical than precision, like real-time search engines.
Why does lemmatization need part-of-speech (POS) tagging?
POS tagging helps lemmatizers to identify the correct base form of a word because the lemma depends on whether the word is a noun, verb, etc.
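A small illustration, again with NLTK’s WordNet lemmatizer, of how the lemma changes with the POS tag:

```python
# The same surface form can have different lemmas depending on its POS tag.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("leaves", pos="n"))  # leaf  (noun reading)
print(lemmatizer.lemmatize("leaves", pos="v"))  # leave (verb reading)
```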
What is tokenization in NLP?
Tokenization is the process of splitting text into smaller units called tokens, which could be words, subwords, or characters, for easier processing by models.
What type of tokenization do Pre-Trained Language Models often use?
Pre-Trained Language Models often use subword tokenization (e.g., BPE, WordPiece) to break words into smaller meaningful units like subwords or characters, optimizing for unseen words and reducing vocabulary size.
Why do Pre-Trained Language Models prefer subword tokenization?
Subword tokenization allows these models to handle out-of-vocabulary words efficiently by breaking rare or unknown words into smaller, known subword units.
What is the difference between word-level tokenization and subword tokenization?
Word-level splits text by words but struggles with out-of-vocabulary words.
Subword-level splits words into smaller components to better handle rare or unseen words, such as “unbelievable” → [“un”, “believ”, “able”].
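A sketch using the Hugging Face transformers library and the bert-base-uncased WordPiece vocabulary (the exact subword pieces you get depend on the learned vocabulary):

```python
# A sketch using a Hugging Face WordPiece tokenizer (assuming the
# transformers library and the bert-base-uncased vocabulary).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Rare or unseen words are split into subword pieces that are in the
# vocabulary; the exact pieces depend on that learned vocabulary.
print(tokenizer.tokenize("unbelievable tokenizations"))
```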
What kind of tokenization do Large Language Models (LLMs) like GPT typically use?
Large Language Models (LLMs) use byte pair encoding (BPE) or similar subword techniques to handle diverse text inputs efficiently, including rare or complex words.
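A small sketch using the tiktoken library, which provides BPE encodings used by GPT-style models (assuming tiktoken is installed):

```python
# A sketch of GPT-style BPE tokenization via the tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Large language models tokenize text into subword units.")
print(ids)              # the BPE token ids
print(enc.decode(ids))  # round-trips back to the original string
```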
Why do Large Language Models (LLMs) need more sophisticated tokenization than smaller Pre-Trained Models?
LLMs need sophisticated tokenization methods because they deal with larger vocabularies, diverse inputs, and multilingual corpora, requiring more nuanced token splitting to ensure efficient processing and understanding.
What role does tokenization play in Large Language Models (LLMs)?
Tokenization in LLMs transforms raw text into tokens that represent both frequent and rare words, allowing these models to generalize across a vast range of vocabulary and improve text generation and understanding.
How does tokenization in LLMs differ from tokenization in traditional Pre-Trained Language Models?
LLMs often use more complex tokenization methods like BPE or SentencePiece, which balance model performance and computational efficiency across large text corpora.
Traditional pre-trained models may use simpler tokenization schemes focused on task-specific datasets.
What is Byte Pair Encoding (BPE) in tokenization?
BPE is a subword tokenization technique that starts with individual characters and merges the most frequent character pairs to form subwords, optimizing for both common and rare words.
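A toy sketch of the BPE learning loop in plain Python (made-up word frequencies, not a production implementation):

```python
# Toy BPE vocabulary learning: repeatedly merge the most frequent
# adjacent symbol pair across the corpus.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Words start out as sequences of characters (frequencies are made up).
vocab = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged", best)
```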
What is the purpose of using subword tokenization in both Pre-Trained Models and LLMs?
Subword tokenization strikes a balance between handling a large vocabulary, minimizing out-of-vocabulary words, and reducing the size of the tokenized text input for computational efficiency.
What is a web crawler?
A web crawler is a program that recursively downloads webpages, following the links on each page to discover new ones.
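A minimal crawler sketch (assuming the requests and beautifulsoup4 packages); a real crawler also needs politeness rules such as robots.txt handling, rate limiting, and deduplication:

```python
# A minimal breadth-first crawler sketch, not production-ready.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10):
    seen, queue = set(), deque([seed_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=5).text      # download the page
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            queue.append(urljoin(url, link["href"]))  # follow links recursively
    return seen
```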
What is a search index, and what is the difference between a forward index and an inverted (reverse) index?
A forward index sorts by document, then term: for each document it lists the terms that document contains.
An inverted (reverse) index sorts by term, then document: for each term it lists the documents that contain it, which makes query-time lookup fast.
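A toy sketch of both index types over a tiny made-up collection:

```python
# Forward index vs inverted index over a toy collection.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat",
}

# Forward index: document -> terms it contains.
forward_index = {doc_id: set(text.split()) for doc_id, text in docs.items()}

# Inverted (reverse) index: term -> documents containing it.
inverted_index = {}
for doc_id, terms in forward_index.items():
    for term in terms:
        inverted_index.setdefault(term, set()).add(doc_id)

print(forward_index["d1"])    # terms in d1
print(inverted_index["sat"])  # {"d1", "d2"}
```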
What is a document?
Any unit of text indexed in the system and available for retrieval.
It can be any piece of text, big or small.
What is a collection?
A set of documents that may satisfy the user’s request.
What is a term?
An item (word or phrase) that occurs in a collection and helps the algorithm find relevant documents in that collection.