w2 information retrieval (also tokenization flashcards)

1
Q

what is stemming?

A

Stemming is a text normalization technique that reduces words to their base or root form by stripping affixes. It may not result in real words.

2
Q

Example of stemming

A

“running” → “run”
“better” → “bet”
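
A minimal stemming sketch (assuming NLTK is installed; the exact outputs depend on which stemmer is used, and more aggressive stemmers cut even harder):

    # Stemming with NLTK's PorterStemmer: affixes are stripped by rule,
    # so the result may not be a dictionary word.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["running", "studies", "flies"]:
        print(word, "->", stemmer.stem(word))
    # running -> run, studies -> studi, flies -> fli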

3
Q

What is the main disadvantage of stemming?

A

Stemming can result in non-dictionary words or incomplete forms that may lose their meaning, e.g., “studies” → “studi”.

4
Q

What is lemmatization in NLP?

A

Lemmatization is a text normalization technique that reduces words to their dictionary base form (lemma), considering context and grammar (like part of speech).

5
Q

Example of lemmatization

A

“running” → “run”
“better” → “good”
“studies” → “study”
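
A minimal lemmatization sketch (assuming NLTK and its WordNet data are available; the pos argument tells the lemmatizer which part of speech to assume):

    # Lemmatization with NLTK's WordNetLemmatizer
    # (requires the WordNet corpus, e.g. nltk.download("wordnet")).
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("studies", pos="n"))  # -> study
    print(lemmatizer.lemmatize("running", pos="v"))  # -> run
    print(lemmatizer.lemmatize("better", pos="a"))   # -> good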

6
Q

What is the key difference between stemming and lemmatization?

A

Stemming cuts off affixes to get the root form without regard for correctness, while lemmatization considers grammatical context and returns the dictionary form.

7
Q

Which method is more accurate: stemming or lemmatization?

A

Lemmatization is more accurate because it results in meaningful words based on linguistic analysis, while stemming is faster but less precise.

8
Q

Which NLP task might prefer stemming over lemmatization?

A

Stemming might be preferred in tasks where speed is more critical than precision, like real-time search engines.

9
Q

Why does lemmatization need part-of-speech (POS) tagging?

A

POS tagging helps lemmatizers to identify the correct base form of a word because the lemma depends on whether the word is a noun, verb, etc.
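
A small illustration of why the part of speech matters, using the same NLTK lemmatizer as above ("leaves" is ambiguous between a noun and a verb):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("leaves", pos="n"))  # -> leaf  (noun reading)
    print(lemmatizer.lemmatize("leaves", pos="v"))  # -> leave (verb reading)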

10
Q

What is tokenization in NLP?

A

Tokenization is the process of splitting text into smaller units called tokens, which could be words, subwords, or characters, for easier processing by models.

11
Q

What type of tokenization do Pre-Trained Language Models often use?

A

Pre-Trained Language Models often use subword tokenization (e.g., BPE, WordPiece) to break words into smaller meaningful units like subwords or characters, optimizing for unseen words and reducing vocabulary size

12
Q

Why do Pre-Trained Language Models prefer subword tokenization?

A

Subword tokenization allows these models to handle out-of-vocabulary words efficiently by breaking rare or unknown words into smaller, known subword units.

13
Q

What is the difference between word-level tokenization and subword tokenization?

A

Word-level splits text by words but struggles with out-of-vocabulary words.

Subword-level splits words into smaller components to better handle rare or unseen words, such as “unbelievable” → [“un”, “believ”, “able”].
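
A hedged sketch of subword tokenization using the Hugging Face transformers library; the checkpoint name is only an illustrative choice, and the exact split depends on the tokenizer's learned vocabulary:

    # WordPiece subword tokenization (assumes transformers is installed;
    # "bert-base-uncased" is just an example checkpoint).
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.tokenize("unbelievable tokenization"))
    # Words missing from the vocabulary are split into known pieces;
    # continuation pieces carry a "##" prefix.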

14
Q

What kind of tokenization do Large Language Models (LLMs) like GPT typically use?

A

Large Language Models (LLMs) use byte pair encoding (BPE) or similar subword techniques to handle diverse text inputs efficiently, including rare or complex words.

15
Q

Why do Large Language Models (LLMs) need more sophisticated tokenization than smaller Pre-Trained Models?

A

LLMs need sophisticated tokenization methods because they deal with larger vocabularies, diverse inputs, and multilingual corpora, requiring more nuanced token splitting to ensure efficient processing and understanding.

15
Q

What role does tokenization play in Large Language Models (LLMs)?

A

Tokenization in LLMs transforms raw text into tokens that represent both frequent and rare words, allowing these models to generalize across a vast range of vocabulary and improve text generation and understanding.

16
Q

How does tokenization in LLMs differ from tokenization in traditional Pre-Trained Language Models?

A

LLMs often use more complex tokenization methods like BPE or SentencePiece, which balance model performance and computational efficiency across large text corpora.

Traditional pre-trained models may use simpler tokenization schemes focused on task-specific datasets.

17
Q

What is Byte Pair Encoding (BPE) in tokenization?

A

BPE is a subword tokenization technique that starts with individual characters and merges the most frequent character pairs to form subwords, optimizing for both common and rare words.
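
A toy sketch of one BPE merge step, just to show the counting-and-merging idea; the word frequencies are made up, and real implementations add end-of-word markers and repeat the merge many times:

    from collections import Counter

    # toy corpus: words as tuples of symbols, with their frequencies
    vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w", "e", "s", "t"): 6}

    def most_frequent_pair(vocab):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        return pairs.most_common(1)[0][0]

    def merge_pair(vocab, pair):
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        return merged

    pair = most_frequent_pair(vocab)   # ("w", "e") is the most frequent pair here
    vocab = merge_pair(vocab, pair)    # "we" now becomes a single subword symbol
    print(pair, vocab)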

18
Q

What is the purpose of using subword tokenization in both Pre-Trained Models and LLMs?

A

Subword tokenization strikes a balance between handling a large vocabulary, minimizing out-of-vocabulary words, and reducing the size of the tokenized text input for computational efficiency.

19
Q

what is a web crawler?

A

A web crawler is a program that recursively downloads webpages.
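
A toy crawler sketch using only the standard library; real crawlers also respect robots.txt and rate limits, deduplicate URLs carefully, and parse HTML properly rather than with a regex:

    import re
    import urllib.request
    from collections import deque

    def crawl(start_url, max_pages=10):
        # breadth-first: download a page, queue its outgoing links, repeat
        seen, frontier = set(), deque([start_url])
        while frontier and len(seen) < max_pages:
            url = frontier.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
            except Exception:
                continue
            frontier.extend(re.findall(r'href="(https?://[^"]+)"', html))
        return seen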

20
Q

what is a search index, and what is the difference between a forward and an inverted (reverse) index?

A

a search index is the data structure that maps between documents and the terms they contain, so that relevant documents can be found quickly.

a forward index is sorted by document, then words: for each document, it lists the words that document contains.

an inverted (reverse) index is sorted by words, then documents: for each word, it lists the documents that contain it.
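
A minimal sketch of both index types over an invented two-document collection:

    docs = {1: "cats wear hats", 2: "dogs chase cats"}

    # forward index: document -> the terms it contains
    forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

    # inverted (reverse) index: term -> the documents that contain it
    inverted_index = {}
    for doc_id, terms in forward_index.items():
        for term in terms:
            inverted_index.setdefault(term, set()).add(doc_id)

    print(forward_index[1])        # ['cats', 'wear', 'hats']
    print(inverted_index["cats"])  # {1, 2}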

21
Q

what is a document?

A

any unit of text indexed in the system and available for retrieval

it can be any piece of text, big or small

22
Q

what is a collection?

A

a set of documents that may satisfy the user’s request

23
Q

what is a term?

A

an item (a word or phrase) that occurs in a collection and helps the algorithm find relevant documents in that collection

24
Q

what is a query?

A

the request the user gives to the search algorithm, represented as a set of search terms

25
Q

are all words equal?

A

no, some words are just fluff; frequency analysis is used to identify them

26
Q

what is Zipf’s law and what does it relate to?

A

Zipf’s law gives the normalized frequency of the element of rank k:

f(k; s, N) = (1/k^s) / (sum over n = 1..N of 1/n^s)

where s is the exponent characterizing the distribution (s ≈ 1 for natural language) and N is the number of elements. It relates to frequency analysis: it shows that the most frequent words are filler words, so stop words can be dropped.
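
A tiny numeric sketch of the formula above (s and N are free parameters; s ≈ 1 is the usual choice for natural language):

    def zipf_frequency(k, s=1.0, N=1000):
        # normalized frequency of the element of rank k
        norm = sum(1 / n**s for n in range(1, N + 1))
        return (1 / k**s) / norm

    print(zipf_frequency(1))    # the top-ranked word takes the largest share
    print(zipf_frequency(100))  # with s = 1 this is exactly 100x smaller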

27
Q

why do we prefer cosine similarity over Euclidean distance for measuring the similarity of two documents?

A

Euclidean distance depends on vector length: longer documents have longer vectors, which does not translate into higher relevance. Cosine similarity measures the angle between two vectors, so it is consistent regardless of document size (it acts as document length normalization).
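
A small sketch comparing the two measures on term-count vectors; the counts are invented, and the second vector is just the first document repeated ten times:

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    doc = [2, 1, 0]          # term counts of a short document
    longer = [20, 10, 0]     # the same document repeated ten times
    print(cosine(doc, longer))     # 1.0: same direction, length is ignored
    print(euclidean(doc, longer))  # ~20.1: distance grows with document length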

28
Q

recall the importance of term frequency

A

The more often a term occurs in a document, the more relevant that document is for the term. But there is an issue when all your documents share the same main keywords: if you search for “cats with stripes and wearing blue hats” in a collection about cats, “cats” is so common it behaves almost like a stop word, so “wearing blue hats” should be emphasized in the search.
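
A quick term-frequency count illustrating the problem; the example document is invented:

    from collections import Counter

    doc = "cats cats cats with stripes cats wearing blue hats cats"
    tf = Counter(doc.split())
    print(tf.most_common(3))
    # "cats" dominates every document in a cat collection, so it barely
    # discriminates between them; the rarer query terms carry the useful signal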

29
Q

daily reminder to watch a video about PageRank

A

PageRank is an algorithm for ranking webpages, i.e. roughly how high a website shows up in the search results.

30
Q

what should we measure to determine the success of our product?

A

whether it shows the most relevant results first, i.e. precision

31
Q

what is precision at k?

A

Because the algorithm scores documents by cosine similarity, you can order them by their similarity and introduce a cut-off point k. Precision at k is the fraction of the top k results that are relevant; e.g. for k = 3 with results (relevant, relevant, not relevant), the precision at cut-off 3 is 2/3.

32
Q

what is mean precision at k?

A

the average precision at k over all queries

33
Q

what is mean reciprocal rank?

A

It measures how high, on average, the algorithm places the first relevant document that it returns; roughly, how often will you be satisfied with the first result?

34
Q

how do you define reciprocal rank?

A

RR = 1 / (rank of the first relevant document in the ordered result list)
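
A small sketch of both metrics from a list of relevance labels in ranked order (1 = relevant, 0 = not relevant); the labels are made up:

    def precision_at_k(relevant, k):
        return sum(relevant[:k]) / k

    def reciprocal_rank(relevant):
        for rank, rel in enumerate(relevant, start=1):
            if rel:
                return 1 / rank
        return 0.0

    ranked = [1, 1, 0, 1]
    print(precision_at_k(ranked, 3))  # 2/3, as in the example above
    print(reciprocal_rank(ranked))    # 1.0: the first result is already relevant

    # mean precision at k / mean reciprocal rank: average these values over all queries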

35
Q

what are the regular expressions that I should know?

A

Special characters: . ^ $ * ? {m} \ [] | ?:
Sequences: \b \B \d \D \s \w \W
Flags: re.I, re.IGNORECASE
Functions: re.compile, re.search, re.match, re.split, re.sub
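
A short sketch exercising the listed features with Python's re module:

    import re

    pattern = re.compile(r"\bcat\w*", re.IGNORECASE)  # \b word boundary, \w word characters
    print(pattern.search("Black Cats sleep."))        # match object for "Cats"
    print(re.match(r"\d+", "42 documents"))           # digits at the start of the string
    print(re.split(r"\s+", "split   on whitespace"))  # ['split', 'on', 'whitespace']
    print(re.sub(r"\d", "#", "room 101"))             # 'room ###'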