w2 information retrieval (also tokenization flashcards)
What is stemming?
Stemming is a text normalization technique that reduces words to their base or root form by stripping affixes. It may not result in real words.
Example of stemming
“running” → “run”
“better” → “bet” (with an aggressive stemmer; more conservative stemmers such as Porter leave “better” unchanged)
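A minimal sketch with NLTK’s stemmers (assuming the nltk package is installed); the exact output depends on which stemmer you choose:

```python
# A sketch using NLTK's stemmers; outputs depend on the stemmer chosen.
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["running", "studies", "better"]:
    # Porter is conservative (e.g., it leaves "better" unchanged);
    # Lancaster is more aggressive and may cut words down further.
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
```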
What is the main disadvantage of stemming?
Stemming can result in non-dictionary words or incomplete forms that may lose their meaning, e.g., “studies” → “studi”.
What is lemmatization in NLP?
Lemmatization is a text normalization technique that reduces words to their dictionary base form (lemma), considering context and grammar (like part of speech).
Example of lemmatization
“running” → “run”
“better” → “good”
“studies” → “study”
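A minimal sketch with NLTK’s WordNet lemmatizer (assumes the nltk package and the WordNet data have been downloaded):

```python
# A sketch using NLTK's WordNet lemmatizer (requires the WordNet data:
# nltk.download("wordnet")). The pos argument gives the part of speech.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("studies", pos="n"))  # study
```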
What is the key difference between stemming and lemmatization?
Stemming cuts off affixes to get the root form without regard for correctness, while lemmatization considers grammatical context and returns the dictionary form.
Which method is more accurate: stemming or lemmatization?
Lemmatization is more accurate because it results in meaningful words based on linguistic analysis, while stemming is faster but less precise.
Which NLP task might prefer stemming over lemmatization?
Stemming might be preferred in tasks where speed is more critical than precision, like real-time search engines.
Why does lemmatization need part-of-speech (POS) tagging?
POS tagging helps lemmatizers to identify the correct base form of a word because the lemma depends on whether the word is a noun, verb, etc.
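A small illustration, again with NLTK’s WordNet lemmatizer, of how the lemma changes with the POS tag:

```python
# The same surface form can have different lemmas depending on its POS tag.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("leaves", pos="n"))  # leaf  (noun reading)
print(lemmatizer.lemmatize("leaves", pos="v"))  # leave (verb reading)
```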
What is tokenization in NLP?
Tokenization is the process of splitting text into smaller units called tokens, which could be words, subwords, or characters, for easier processing by models.
What type of tokenization do Pre-Trained Language Models often use?
Pre-Trained Language Models often use subword tokenization (e.g., BPE, WordPiece) to break words into smaller meaningful units like subwords or characters, optimizing for unseen words and reducing vocabulary size.
Why do Pre-Trained Language Models prefer subword tokenization?
Subword tokenization allows these models to handle out-of-vocabulary words efficiently by breaking rare or unknown words into smaller, known subword units.
What is the difference between word-level tokenization and subword tokenization?
Word-level splits text by words but struggles with out-of-vocabulary words.
Subword-level splits words into smaller components to better handle rare or unseen words, such as “unbelievable” → [“un”, “believ”, “able”].
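A sketch using the Hugging Face transformers library and the bert-base-uncased WordPiece vocabulary (the exact subword pieces you get depend on the learned vocabulary):

```python
# A sketch using a Hugging Face WordPiece tokenizer (assuming the
# transformers library and the bert-base-uncased vocabulary).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Rare or unseen words are split into subword pieces that are in the
# vocabulary; the exact pieces depend on that learned vocabulary.
print(tokenizer.tokenize("unbelievable tokenizations"))
```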
What kind of tokenization do Large Language Models (LLMs) like GPT typically use?
Large Language Models (LLMs) use byte pair encoding (BPE) or similar subword techniques to handle diverse text inputs efficiently, including rare or complex words.
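A small sketch using the tiktoken library, which provides BPE encodings used by GPT-style models (assuming tiktoken is installed):

```python
# A sketch of GPT-style BPE tokenization via the tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Large language models tokenize text into subword units.")
print(ids)              # the BPE token ids
print(enc.decode(ids))  # round-trips back to the original string
```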
Why do Large Language Models (LLMs) need more sophisticated tokenization than smaller Pre-Trained Models?
LLMs need sophisticated tokenization methods because they deal with larger vocabularies, diverse inputs, and multilingual corpora, requiring more nuanced token splitting to ensure efficient processing and understanding.
What role does tokenization play in Large Language Models (LLMs)?
Tokenization in LLMs transforms raw text into tokens that represent both frequent and rare words, allowing these models to generalize across a vast range of vocabulary and improve text generation and understanding.
How does tokenization in LLMs differ from tokenization in traditional Pre-Trained Language Models?
LLMs often use more complex tokenization methods like BPE or SentencePiece, which balance model performance and computational efficiency across large text corpora.
Traditional pre-trained models may use simpler tokenization schemes focused on task-specific datasets.
What is Byte Pair Encoding (BPE) in tokenization?
BPE is a subword tokenization technique that starts with individual characters and merges the most frequent character pairs to form subwords, optimizing for both common and rare words.
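A toy sketch of the BPE learning loop in plain Python (made-up word frequencies, not a production implementation):

```python
# Toy BPE vocabulary learning: repeatedly merge the most frequent
# adjacent symbol pair across the corpus.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Words start out as sequences of characters (frequencies are made up).
vocab = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged", best)
```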
What is the purpose of using subword tokenization in both Pre-Trained Models and LLMs?
Subword tokenization strikes a balance between handling a large vocabulary, minimizing out-of-vocabulary words, and reducing the size of the tokenized text input for computational efficiency.
What is a web crawler?
A web crawler is a program that recursively downloads webpages, following the links on each page to discover new ones.
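A minimal crawler sketch (assuming the requests and beautifulsoup4 packages); a real crawler also needs politeness rules such as robots.txt handling, rate limiting, and deduplication:

```python
# A minimal breadth-first crawler sketch, not production-ready.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10):
    seen, queue = set(), deque([seed_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=5).text      # download the page
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            queue.append(urljoin(url, link["href"]))  # follow links recursively
    return seen
```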
What is a search index, and what is the difference between a forward index and an inverted (reverse) index?
A forward index sorts by document, then term: for each document it lists the terms that document contains.
An inverted (reverse) index sorts by term, then document: for each term it lists the documents that contain it, which makes query-time lookup fast.
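A toy sketch of both index types over a tiny made-up collection:

```python
# Forward index vs inverted index over a toy collection.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat",
}

# Forward index: document -> terms it contains.
forward_index = {doc_id: set(text.split()) for doc_id, text in docs.items()}

# Inverted (reverse) index: term -> documents containing it.
inverted_index = {}
for doc_id, terms in forward_index.items():
    for term in terms:
        inverted_index.setdefault(term, set()).add(doc_id)

print(forward_index["d1"])    # terms in d1
print(inverted_index["sat"])  # {"d1", "d2"}
```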
What is a document?
Any unit of text indexed in the system and available for retrieval.
It can be any piece of text, big or small.
What is a collection?
A set of documents that may satisfy the user’s request.
What is a term?
An item (word or phrase) that occurs in a collection and helps the algorithm find relevant documents in that collection.