W2 L1 Information Retrieval Flashcards
Why do LLMs rely on statistical methods for tokenization as opposed to morphological ones?
Different languages have different morphologies, so a morphological tokenizer is not scalable or robust across languages.
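A minimal sketch of the statistical approach, using byte-pair encoding (BPE): merges are learned purely from pair frequencies, so nothing in the code is language-specific. The toy corpus and merge count are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (pre-split into characters) -> frequency.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("s", "l", "o", "w"): 3}
for _ in range(3):  # learn three merges from frequency statistics alone
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(vocab, pair)
    print("merged", pair, "->", list(vocab))
```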
What is a web crawler?
A program that recursively downloads webpages.
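A minimal sketch of that recursive download loop, assuming https://example.com as a placeholder seed; a real crawler would also respect robots.txt, rate-limit requests, and canonicalize URLs.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="(http[^"]+)"')  # crude link extraction for the sketch

def crawl(url, visited, depth=2):
    """Download a page, then recursively follow its links up to a fixed depth."""
    if depth == 0 or url in visited:
        return
    visited.add(url)
    try:
        html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
    except OSError:
        return  # skip pages that fail to download
    print("downloaded", url)
    for link in LINK_RE.findall(html):
        crawl(urljoin(url, link), visited, depth - 1)

crawl("https://example.com", set())
```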
The first web search engine was Wandex.
What is a search index?
A data structure used by search engines to store and retrieve information efficiently. It acts as a digital catalog that organizes and maps data, allowing fast retrieval of relevant documents, webpages, or records based on user queries.
What is the difference between a forward index and an inverted index?
A forward index maps document -> words in the document.
An inverted index maps word -> documents it appears in.
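A small sketch contrasting the two directions, using two made-up documents; the inverted index is what makes query-time lookup fast.

```python
docs = {
    "doc1": "the meeting about management",
    "doc2": "management attended the meeting",
}

# Forward index: document -> words in document
forward = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: word -> documents it appears in
inverted = {}
for doc_id, words in forward.items():
    for word in words:
        inverted.setdefault(word, set()).add(doc_id)

print(forward["doc1"])      # ['the', 'meeting', 'about', 'management']
print(inverted["meeting"])  # {'doc1', 'doc2'}
```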
What is a document?
Any unit of text in the system that can be retrieved.
What is a collection?
A set of documents that may satisfy a user's request.
Term
Refers to an item/word/phrase in a collection that helps the algorithm find relevant documents.
Query
Represents a user's information need expressed as a set of search terms.
What is lemmatization?
Reducing a word to its base (dictionary) form, the lemma.
It combines look-up tables with rule sets, and the mapping can also be learned with ML algorithms.
What are the pros and cons of lemmatization?
Pro: the base form is understandable to humans.
Con: it is expensive, as look-up tables are needed for each language.
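A toy sketch of the look-up-table-plus-rules approach described above; the table and suffix rules here are made up and far smaller than what a real lemmatizer ships per language.

```python
LOOKUP = {"was": "be", "mice": "mouse", "better": "good"}   # irregular forms
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]  # (suffix, replacement)

def lemmatize(word):
    if word in LOOKUP:          # the table handles irregular words
        return LOOKUP[word]
    for suffix, repl in RULES:  # the rules handle regular inflection
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

print([lemmatize(w) for w in ["meetings", "was", "studies", "walked"]])
# ['meeting', 'be', 'study', 'walk']
```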
What is stemming?
Cutting off the end of a word so only the beginning, the stem, remains.
What are the pros and cons of stemming?
Pros: it can establish links between related words without needing to define a dictionary, so it is efficient and scalable.
Cons: the results are not as readable as lemmas, and stemmers may rip off too much of a word.
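A quick sketch using NLTK's Porter stemmer (assuming nltk is installed) that shows both the link between related words and the readability cost.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["meeting", "meetings", "management", "universal", "university"]:
    print(word, "->", stemmer.stem(word))
# "meeting" and "meetings" share the stem "meet", but outputs like
# "manag" and "univers" are no longer readable words, and "universal"
# and "university" collapse to the same stem despite different meanings.
```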
How would you classify documents based on topic?
Use vector-based representations.
Suppose doc 1 contains 3 "management" and 5 "meeting" -> (3, 5),
and doc 2 contains 4 "management" and 1 "meeting" -> (4, 1).
Put them in the same space as the query
and use Euclidean distance to represent similarity:
the most relevant doc is the one with the closest distance to the query.
sqrt((a1 - b1)^2 + (a2 - b2)^2)
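A small sketch of this ranking, assuming a hypothetical query with counts (1, 2).

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

doc1 = (3, 5)   # 3 x "management", 5 x "meeting"
doc2 = (4, 1)
query = (1, 2)  # hypothetical query term counts

for name, doc in [("doc1", doc1), ("doc2", doc2)]:
    print(name, round(euclidean(query, doc), 3))
# doc2 (~3.162) is closer than doc1 (~3.606), so it ranks as more relevant
```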
Though Euclidean distance works, what should we use for a better result?
Words in longer documents have a higher chance of appearing, so longer documents get longer vectors, but that doesn't guarantee more relevance.
So we use cosine similarity, because it is a length-normalized metric.
Remember the cosine formula: cos(a, b) = (a · b) / (|a| |b|)
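A minimal sketch of cosine similarity on the same vectors, with the same hypothetical query (1, 2): dividing the dot product by both vector lengths normalizes away document length.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query, doc1, doc2 = (1, 2), (3, 5), (4, 1)
print(round(cosine(query, doc1), 3))  # ~0.997: nearly the same direction as the query
print(round(cosine(query, doc2), 3))  # ~0.651: a different balance of terms
```

Note that the ranking flips compared to Euclidean distance: doc1 has the same term proportions as the query, so it wins once length is normalized away.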