W2 L1 Information Retrieval Flashcards
Why do LLMs rely on statistical methods for tokenization as opposed to morphological ones?
Different languages have different morphologies, so a morphological tokenizer is not scalable or robust across languages.
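A minimal sketch of the statistical approach, using byte-pair encoding (BPE): merges are learned purely from pair frequencies, so nothing in the code is language-specific. The toy corpus and merge count are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (pre-split into characters) -> frequency.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("s", "l", "o", "w"): 3}
for _ in range(3):  # learn three merges from frequency statistics alone
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(vocab, pair)
    print("merged", pair, "->", list(vocab))
```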
What is a web crawler?
A program that recursively downloads webpages.
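A minimal sketch of that recursive download loop, assuming https://example.com as a placeholder seed; a real crawler would also respect robots.txt, rate-limit requests, and canonicalize URLs.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="(http[^"]+)"')  # crude link extraction for the sketch

def crawl(url, visited, depth=2):
    """Download a page, then recursively follow its links up to a fixed depth."""
    if depth == 0 or url in visited:
        return
    visited.add(url)
    try:
        html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
    except OSError:
        return  # skip pages that fail to download
    print("downloaded", url)
    for link in LINK_RE.findall(html):
        crawl(urljoin(url, link), visited, depth - 1)

crawl("https://example.com", set())
```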
The first web search engine was Wandex.
What is a search index?
A data structure used by search engines to store and retrieve information efficiently. It acts as a digital catalog that organizes and maps data, allowing fast retrieval of relevant documents, webpages, or records based on user queries.
What is the difference between a forward index and an inverted index?
A forward index maps document -> words in the document.
An inverted index maps word -> documents it appears in.
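A small sketch contrasting the two directions, using two made-up documents; the inverted index is what makes query-time lookup fast.

```python
docs = {
    "doc1": "the meeting about management",
    "doc2": "management attended the meeting",
}

# Forward index: document -> words in document
forward = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: word -> documents it appears in
inverted = {}
for doc_id, words in forward.items():
    for word in words:
        inverted.setdefault(word, set()).add(doc_id)

print(forward["doc1"])      # ['the', 'meeting', 'about', 'management']
print(inverted["meeting"])  # {'doc1', 'doc2'}
```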
What is a document?
Any unit of text in the system that can be retrieved.
What is a collection?
A set of documents that may satisfy a user's request.
Term
Refers to an item/word/phrase in a collection that helps the algorithm find relevant documents.
Query
Represents a user's information need expressed as a set of search terms.
What is lemmatization?
Reducing a word to its base (dictionary) form, the lemma.
It combines look-up tables with rule sets, and the mapping can also be learned with ML algorithms.
What are the pros and cons of lemmatization?
Pro: the base form is understandable to humans.
Con: it is expensive, as look-up tables are needed for each language.
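A toy sketch of the look-up-table-plus-rules approach described above; the table and suffix rules here are made up and far smaller than what a real lemmatizer ships per language.

```python
LOOKUP = {"was": "be", "mice": "mouse", "better": "good"}   # irregular forms
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]  # (suffix, replacement)

def lemmatize(word):
    if word in LOOKUP:          # the table handles irregular words
        return LOOKUP[word]
    for suffix, repl in RULES:  # the rules handle regular inflection
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

print([lemmatize(w) for w in ["meetings", "was", "studies", "walked"]])
# ['meeting', 'be', 'study', 'walk']
```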
What is stemming?
Cutting off the end of a word so only the beginning, the stem, remains.
What are the pros and cons of stemming?
Pros: it can establish links between related words without needing to define a dictionary, so it is efficient and scalable.
Cons: the results are not as readable as lemmas, and stemmers may rip off too much of a word.
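A quick sketch using NLTK's Porter stemmer (assuming nltk is installed) that shows both the link between related words and the readability cost.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["meeting", "meetings", "management", "universal", "university"]:
    print(word, "->", stemmer.stem(word))
# "meeting" and "meetings" share the stem "meet", but outputs like
# "manag" and "univers" are no longer readable words, and "universal"
# and "university" collapse to the same stem despite different meanings.
```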
How would you classify documents based on topic?
Use vector-based representations.
Suppose doc 1 contains 3 "management" and 5 "meeting" -> (3, 5),
and doc 2 contains 4 "management" and 1 "meeting" -> (4, 1).
Put them in the same space as the query
and use Euclidean distance to represent similarity:
the most relevant doc is the one with the closest distance to the query.
sqrt((a1 - b1)^2 + (a2 - b2)^2)
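A small sketch of this ranking, assuming a hypothetical query with counts (1, 2).

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

doc1 = (3, 5)   # 3 x "management", 5 x "meeting"
doc2 = (4, 1)
query = (1, 2)  # hypothetical query term counts

for name, doc in [("doc1", doc1), ("doc2", doc2)]:
    print(name, round(euclidean(query, doc), 3))
# doc2 (~3.162) is closer than doc1 (~3.606), so it ranks as more relevant
```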
Though Euclidean distance works, what should we use for a better result?
Words in longer documents have a higher chance of appearing, so longer documents get longer vectors, but that doesn't guarantee more relevance.
So we use cosine similarity, because it is a length-normalized metric.
Remember the cosine formula: cos(a, b) = (a · b) / (|a| |b|)
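A minimal sketch of cosine similarity on the same vectors, with the same hypothetical query (1, 2): dividing the dot product by both vector lengths normalizes away document length.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query, doc1, doc2 = (1, 2), (3, 5), (4, 1)
print(round(cosine(query, doc1), 3))  # ~0.997: nearly the same direction as the query
print(round(cosine(query, doc2), 3))  # ~0.651: a different balance of terms
```

Note that the ranking flips compared to Euclidean distance: doc1 has the same term proportions as the query, so it wins once length is normalized away.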