w2 L1 information retrieval Flashcards

1
Q

Why do LLMs rely on statistical methods for tokenization rather than morphological ones?

A

Different languages have different morphology, so a morphological approach is not scalable or robust across languages.

2
Q

What is a web crawler?

A

A program that recursively downloads web pages, following the links on each page to find more pages.

The first web search engine was Wandex.
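
To make "recursively downloads" concrete, here is a minimal sketch of the crawl loop. It makes no network requests: `FAKE_WEB` and the `fetch_links` callback are hypothetical stand-ins for downloading a page and extracting its links. It is written with a queue rather than literal recursion, which is one common way to do it, not the only one.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs that the page links to.
FAKE_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "a.example"],
    "c.example": ["d.example"],
    "d.example": [],
}


def crawl(start, fetch_links):
    """Visit every page reachable from `start`, each exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # stands in for "download the page"
        for link in fetch_links(url):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("a.example", lambda url: FAKE_WEB.get(url, []))
print(visited)
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other.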

3
Q

What is a search index?

A

A data structure used by search engines to store and retrieve information efficiently. It acts as a digital catalog that organizes and maps data, allowing fast retrieval of relevant documents, web pages, or records based on user queries.

4
Q

What is the difference between a forward index and an inverted (reverse) index?

A

A forward index maps document -> the words in that document.

An inverted (reverse) index maps word -> the documents that contain it.
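
The two mappings can be sketched with plain dictionaries. The two tiny documents and their IDs below are made up for illustration, and splitting on whitespace is a simplifying assumption rather than real tokenization.

```python
from collections import defaultdict

# Hypothetical toy collection: document ID -> text.
docs = {
    "doc1": "management meeting notes",
    "doc2": "meeting room booking",
}

# Forward index: document -> words in that document.
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: word -> set of documents containing that word.
inverted_index = defaultdict(set)
for doc_id, words in forward_index.items():
    for word in words:
        inverted_index[word].add(doc_id)

print(forward_index["doc1"])
print(sorted(inverted_index["meeting"]))
```

Answering a query such as "meeting" is a single lookup in the inverted index, whereas the forward index would have to be scanned document by document.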

5
Q

What is a document?

A

Any unit of text in the system that can be retrieved.

6
Q

What is a collection?

A

A set of documents that may satisfy users' requests.

7
Q

What is a term?

A

An item (a word or phrase) in a collection that helps the algorithm find relevant documents.

8
Q

What is a query?

A

A representation of a user's information need, expressed as a set of search terms.

9
Q

What is lemmatization?

A

Reducing a word to its base (dictionary) form, the lemma.

It combines look-up tables with rule sets, and the mapping can also be inferred using ML algorithms.
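
A toy sketch of the "look-up table plus rules" idea follows. The `IRREGULAR` table and the two suffix `RULES` are invented for illustration and cover only a handful of English words. Real lemmatizers use far larger resources, so treat this as an illustration of the mechanism only.

```python
# Hypothetical look-up table for irregular forms.
IRREGULAR = {"went": "go", "better": "good", "mice": "mouse", "was": "be"}

# Hypothetical suffix rules, tried in order: (suffix, replacement).
RULES = [("ies", "y"), ("s", "")]


def lemmatize(word):
    """Return a base form: table first, then rules, else the word itself."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in RULES:
        # Require a few remaining letters so short words like "bus" survive.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


for w in ["went", "Mice", "studies", "meetings", "bus"]:
    print(w, "->", lemmatize(w))
```

Both the table and the rules are specific to one language, which is where the per-language cost of lemmatization comes from.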

10
Q

What are the pros and cons of lemmatization?

A

Pro: the base form is understandable to humans.

Con: it is expensive, as look-up tables are needed for each language.

11
Q

What is stemming?

A

Cutting off the end of a word so that only the start, the stem, remains.
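
A minimal suffix-stripping sketch, assuming a made-up suffix list and a minimum stem length of three letters. This is not the Porter stemmer or any other published algorithm. It only illustrates the mechanism and the way it can cut too much.

```python
# Hypothetical suffix list, tried in order (not a published rule set).
SUFFIXES = ["ing", "ed", "ly", "es", "s"]


def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Related words collapse to one stem without any dictionary.
print([stem(w) for w in ["managed", "manages", "managing"]])

# The same rules can remove too much: "news" and "string" lose real letters.
print(stem("news"), stem("string"))
```

The stem "manag" is not a readable word, and "news" becoming "new" shows how a stemmer can merge unrelated words.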

12
Q

What are the pros and cons of stemming?

A

Pros: it can establish links between related words without needing a dictionary, so it is efficient and scalable.

Cons: the results are not as readable as lemmas, and stemmers may cut off too much of a word.

13
Q

How would you classify documents based on topic?

A

Use vector-based representations.

Suppose doc 1 contains "management" 3 times and "meeting" 5 times -> (3, 5)

and doc 2 contains "management" 4 times and "meeting" once -> (4, 1).

Put them in the same vector space as the query.

Use Euclidean distance to measure similarity.

The most relevant doc is the one with the smallest distance to the query:

d(a, b) = sqrt((a1 - b1)^2 + (a2 - b2)^2)
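
As a worked example, the two document vectors from this card can be ranked against a query. The card does not give a query, so the vector `(1, 1)` below (one "management", one "meeting") is an assumption made purely for illustration.

```python
import math

# Counts of ("management", "meeting") per document, from the card.
docs = {"doc1": (3, 5), "doc2": (4, 1)}

# Assumed query vector; not part of the original card.
query = (1, 1)

# Euclidean distance from each document to the query.
distances = {name: math.dist(vec, query) for name, vec in docs.items()}

# The smallest distance is treated as the most relevant document.
best = min(distances, key=distances.get)

print(distances)
print(best)
```

With this query, doc 1 is at distance sqrt(20) and doc 2 at distance 3, so doc 2 ranks first.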

14
Q

Though Euclidean distance works, what should we use for a better result?

A

Words have a higher chance of appearing in longer documents, so longer documents get longer vectors, but that does not guarantee more relevance.

So we use cosine similarity, because it is a length-normalized metric.

Remember the cosine formula: cos(a, b) = (a · b) / (|a| |b|)
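
A short sketch of why length normalization matters: cosine similarity divides the dot product of two vectors by the product of their lengths, so scaling a document's counts leaves the score unchanged. The query `(1, 1)` and the ten-times-longer `long_doc` are assumptions added for illustration.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


query = (1, 1)        # assumed query vector
short_doc = (3, 5)    # doc 1 from the previous card
long_doc = (30, 50)   # hypothetical: same proportions, ten times longer

# Same direction, so the same cosine score despite the length difference.
print(cosine_similarity(short_doc, query), cosine_similarity(long_doc, query))

# Euclidean distance, by contrast, penalizes the longer document heavily.
print(math.dist(short_doc, query), math.dist(long_doc, query))
```

Under the same assumed query, cosine similarity prefers (3, 5) over (4, 1), while Euclidean distance prefers (4, 1), so the two metrics can rank documents differently.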
