IR-TFIDF 6 Flashcards
What is Information Retrieval?
Information retrieval is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
E.g. A search engine (Google, Bing, Baidu, DuckDuckGo, …)
What is Ranked Information Retrieval?
Return an ordering of the documents in the collection with respect to the query
How would you rank documents in ranked Information Retrieval?
- Assign a score – say in [0, 1] – to each document
* This score measures how well document and query “match”
How does the Jaccard Coefficient work?
View the query and each document as a set of types.
Compute Jaccard(Q, D) = |Q ∩ D| / |Q ∪ D|: the number of types shared by the query and the document, divided by the number of types in their union.
Remember: types, not tokens.
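A minimal sketch of Jaccard scoring in Python (whitespace tokenization and the example documents are purely illustrative):

```python
def jaccard(query: str, doc: str) -> float:
    """Jaccard coefficient between the type sets of a query and a document."""
    q_types = set(query.lower().split())   # types, not tokens: duplicates collapse
    d_types = set(doc.lower().split())
    if not q_types and not d_types:
        return 0.0
    return len(q_types & d_types) / len(q_types | d_types)

# Rank a small collection by Jaccard score against the query
docs = ["ides of March", "Caesar died in March", "the long March"]
ranked = sorted(docs, key=lambda d: jaccard("ides of March", d), reverse=True)
```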
Name two pros and cons of the Jaccard Coefficient.
Pros of Jaccard Coefficient
• Easy to implement, easy to understand
• The baseline method you should consider (if your proposed method performs worse than Jaccard coefficient, throw it away)
Cons of Jaccard Coefficient
• It doesn’t consider term frequency (how many times a term occurs in a document)
• It ignores word sequences (‘Alex is taller than Bob’ and ‘Bob is taller than Alex’ are regarded as the same). This can be mitigated by considering n-grams
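A minimal sketch of the n-gram mitigation: computing Jaccard over word-bigram sets instead of single-word types (tokenization and the helper names are illustrative):

```python
def word_ngrams(text: str, n: int = 2) -> set:
    """Set of word n-grams (as tuples) from a whitespace-tokenized string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_ngrams(a: str, b: str, n: int = 2) -> float:
    A, B = word_ngrams(a, n), word_ngrams(b, n)
    return len(A & B) / len(A | B) if A | B else 0.0

# Identical as plain type sets, but clearly different as bigram sets:
jaccard_ngrams("Alex is taller than Bob", "Bob is taller than Alex")  # 2/6 ≈ 0.33
```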
What is bag of words?
- View a document as a bag of words
- Count the occurrences of each word, i.e. its term frequency (unlike the Jaccard coefficient, which ignores term frequencies)
- Known as BoW or BOW representation
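A minimal BoW sketch (the fixed vocabulary and whitespace tokenization are illustrative assumptions):

```python
from collections import Counter

def bow_vector(doc: str, vocab: list[str]) -> list[int]:
    """Represent a document as term-frequency counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]

vocab = ["caesar", "brutus", "march", "the"]
bow_vector("the ides of March", vocab)       # [0, 0, 1, 1]
bow_vector("Brutus killed Caesar", vocab)    # [1, 1, 0, 0]
```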
Why should you not use euclidean distance to measure the similarity between vectors in bag of words?
And what should be used instead?
• Intuition: the larger the distance between the two vectors, the less relevant the document
• However, Euclidean distance is a bad idea, because it is sensitive to vector length: a document and that same document concatenated with itself have essentially the same content, yet a large Euclidean distance
• Use cosine similarity instead: it measures the angle between the vectors, so vector length does not matter
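A minimal sketch contrasting the two measures on toy term-count vectors (the vectors and function names are illustrative):

```python
import math

def euclidean(u: list[float], v: list[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

d1 = [3, 1, 0, 2]    # term counts of a document
d2 = [6, 2, 0, 4]    # the same document concatenated with itself
euclidean(d1, d2)    # ≈ 3.74: large, although the content is the same
cosine(d1, d2)       # 1.0: same direction, so maximally similar
```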
What is Term Frequency?
the number of times term t appears in document d
What is Inverse Document Frequency?
IDF(t) = N / df(t): the number of documents in the collection (N) divided by the number of documents that contain term t (df(t)). A smoothing constant gamma can be added to df(t) if necessary, e.g. when the numerator N is very large.
What are the problems with IDF?
What solutions are commonly used?
• If a word does not appear in any document in the corpus (this can happen when you use a given, fixed vocabulary), df(t) = 0 and the IDF is ill-defined (division by zero)
• When a word appears in very few documents (i.e. the denominator df(t) is very small) and there are many documents (i.e. the numerator N is very large), the IDF becomes very large
• Solution: add a smoothing constant gamma to df(t) in the denominator. Gamma can be any positive real number; it is usually set to 1.0
How do you calculate tf-idf?
tf-idf(t, d) = TF(t, d) * IDF(t)
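A minimal tf-idf sketch over a toy corpus, using the ratio-style IDF with gamma smoothing from the previous cards (tokenization, the corpus, and gamma = 1.0 are illustrative assumptions):

```python
def tf(term: str, doc: list[str]) -> int:
    """Number of times the term appears in the document."""
    return doc.count(term)

def idf(term: str, corpus: list[list[str]], gamma: float = 1.0) -> float:
    """No. of documents divided by (no. of documents containing the term + gamma)."""
    df = sum(term in doc for doc in corpus)
    return len(corpus) / (df + gamma)

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, corpus)

corpus = [d.lower().split() for d in
          ["Caesar died in March", "the ides of March", "the long March"]]
tf_idf("caesar", corpus[0], corpus)   # 1 * 3/(1+1) = 1.5  (rare term, higher weight)
tf_idf("march", corpus[0], corpus)    # 1 * 3/(3+1) = 0.75 (appears in every doc)
```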
Pros and cons of tf-idf?
Pros:
• Easy to understand and easy to implement
• Yields surprisingly good performance in many text classification and text clustering tasks, usually better than BoW and Jaccard
Cons:
• Superficial matching:
• Texts with similar words can convey very different meanings (S1 and S2)
• Texts with different words can convey very similar meanings (S1 and S3)
• Sparsity: if the vocabulary is large (e.g. ~30K common English words), each text vector has many 0s and very few non-zero entries. As a result, semantically similar sentences that use different words get low cosine similarity
What is the minimum edit distance?
The minimum edit distance between two strings is defined as the
minimum number of editing operations (operations like insertion,
deletion, substitution) needed to transform one string into another.
How much does substitution cost in Levenshtein?
2 (insertion and deletion each cost 1; a substitution is counted as a deletion followed by an insertion)
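A minimal dynamic-programming sketch with this cost scheme (insertion = deletion = 1, substitution = 2; the function name is illustrative):

```python
def min_edit_distance(source: str, target: str) -> int:
    """Minimum edit distance with insertion/deletion cost 1 and substitution cost 2."""
    n, m = len(source), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]   # dp[i][j]: cost of source[:i] -> target[:j]
    for i in range(1, n + 1):
        dp[i][0] = i                             # delete all of source[:i]
    for j in range(1, m + 1):
        dp[0][j] = j                             # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[n][m]

min_edit_distance("intention", "execution")   # 8 under this cost scheme
```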
What is Standard precision?
And what is Precision@K in IR evaluation?
Standard precision is the proportion of positive identifications that is actually correct: Precision = TP / (TP + FP).
• Precision@K
• Of the top K returned results, what fraction is actually relevant to the query (i.e. labelled 'R' by the human annotators)
• Decide K by your application:
• E.g., You want to make sure the user finds at least one relevant result on the first page
• For all M queries, compute the average
Performance = mean(Precision@K for results on query_i), i=1,…,M
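A minimal sketch of Precision@K and its mean over M queries (the boolean relevance lists, ordered best result first, are an assumed input format):

```python
def precision_at_k(relevance: list[bool], k: int) -> float:
    """Fraction of the top-k ranked results that are labelled relevant ('R')."""
    return sum(relevance[:k]) / k

def mean_precision_at_k(all_relevance: list[list[bool]], k: int) -> float:
    """Average Precision@K over the M queries in the evaluation set."""
    return sum(precision_at_k(r, k) for r in all_relevance) / len(all_relevance)

# Relevance judgements for M = 2 queries, ranked results listed first to last
judgements = [[True, False, True, False, False],
              [False, False, True, True, True]]
mean_precision_at_k(judgements, k=3)   # (2/3 + 1/3) / 2 = 0.5
```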