Information Retrieval Flashcards

1
Q

Information Retrieval Systems:

Overview

A
  • Simpler model than RDBMS
    • Info organized as a collection of documents
    • Documents are unstructured, with no schema (generally)
  • Locate relevant documents based on user input such as keywords
  • Can be used on textual descriptions that are provided with non-textual data
    • Example: Images with descriptions
  • Web search engines are the most popular form of IR systems
2
Q

Information Retrieval Systems:

Differences with RDBMS

A
  • IR Systems organize information into unstructured documents
    • Where database systems deal with structured data
  • IR Systems don’t deal with transactional updates
  • IR Systems deal with some querying issues not generally addressed by database systems
    • Approximate searching by keywords
    • Ranking retrieved answers
      • by estimating degree of relevance
3
Q

Information Retrieval Systems:

Querying issues that

database systems do not deal with

A
  • Approximate Searching by keywords
  • Ranking of retrieved answers by estimating degree of relevance
4
Q

IR Systems:

Terminology

A
  • Full Text Retrieval
  • Keywords
  • Keyword Queries
  • Term
  • Intent
  • Query
  • Result
5
Q

Keyword Search

A
  • In Full Text Retrieval, all the words in each document are considered to be keywords
    • The word “term” refers to the words in a document
  • IR Systems typically allow query expressions (see the sketch after this list)
    • Formed using keywords and logical connectives
      • and, or, not
  • Ranking of documents on the basis of estimated relevance to a query is critical
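
As a minimal sketch (hypothetical documents and query), such boolean keyword queries reduce to set operations over each document's terms:

```python
# Sketch of boolean keyword retrieval over a toy collection.
# In full text retrieval, every word in a document is a keyword.

docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy brown dog",
    "d3": "a quick red fox",
}

# Index each document as its set of terms
index = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}

# Query: quick AND fox AND NOT red
hits = [d for d, terms in index.items()
        if "quick" in terms and "fox" in terms and "red" not in terms]
print(hits)  # ['d1']
```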
6
Q

IR Systems:

Returned Results

A
  • In relational databases, there is generally no ranking on returned results
    • In contrast, when dealing with Keyword Queries, many documents may match the query
      • A Google search can return millions of results
  • Results must be ranked so that the most relevant tuples are the first thing a user sees
    • Need to determine what the user is looking for
7
Q

Keyword Queries:

Analysis Components

A
  • Intent
    • What the user is actually looking for
    • Also referred to as information need
  • Query
    • What the user actually submits to express their intent
  • Result
    • The documents that match the query and are ranked to match the intent as much as possible
8
Q

Static Ranking

A
  • Methods of ranking where the ranking does not change over time, unless the underlying data changes
  • Good start to ranking based on entered keywords
  • Popular Methods:
    • TF-IDF
    • BM25
    • Page Rank
    • HITS
9
Q

Static Ranking Methods:

TF-IDF

Basic Idea

A

Term Frequency - Inverse Document Frequency

  • Used to determine how relevant a document is given keywords
  • The more frequent a given keyword is in a document, the higher the ranking
  • Simple ratio of the number of times the keyword appears to the total number of terms in the document
  • Uses a logarithmic scale so that very frequent terms do not become too “important”
  • Was used for a while in early web searching
    • Yahoo, MSN, AOL, lots of search engines
10
Q

Static Ranking Methods:

TF-IDF

Basic Formula

A

Term Frequency - Inverse Document Frequency

TF(d, t) = log( 1 + n(d, t) / n(d) )

  • n(d) is the number of terms in document d
  • n(d, t) is the number of times term t occurs in document d
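
As a sketch, this translates directly into Python (whitespace tokenization is a simplifying assumption):

```python
import math

def tf(doc_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d))."""
    n_d = len(doc_terms)          # n(d): total terms in the document
    n_dt = doc_terms.count(term)  # n(d, t): occurrences of t in d
    return math.log(1 + n_dt / n_d)

print(tf("the cat sat on the mat".split(), "the"))  # log(1 + 2/6)
```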
11
Q

Static Ranking Methods:

TF-IDF:

Calculating Actual Relevance

A

relevance(d, Q) = Σt ∈ Q [ TF(d, t) / n(t) ]

n(t) is the number of documents containing the term t

Dividing by n(t) lowers the score of very frequent terms, making the rarer terms in the query more important.
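
Combining the two formulas, a minimal sketch over a made-up corpus:

```python
import math

def tf(doc_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d))."""
    return math.log(1 + doc_terms.count(term) / len(doc_terms))

def relevance(doc_terms, query_terms, corpus):
    """relevance(d, Q) = sum over t in Q of TF(d, t) / n(t)."""
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)  # n(t): documents containing t
        if n_t:
            score += tf(doc_terms, t) / n_t
    return score

corpus = [d.split() for d in ["the cat sat", "the dog sat", "a cat ran"]]
print(relevance(corpus[0], ["cat", "sat"], corpus))
```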

12
Q

Ranking:

Why is it important to rank returned results?

A
  • Users do not want to sort through potentially millions of answers
  • Most Google users never go to the second page
    • The percentage of users visiting the 2nd, 3rd, 4th, etc. page decreases exponentially with the page number
  • More likely to rewrite queries instead of searching through second page and beyond
  • Users are short on time, have little domain knowledge and don’t want to scroll forever
  • Ranking is a method to decode the user’s intent from their queries
13
Q

Static Ranking Methods:

TF-IDF:

Other methods to expand TF-IDF

A
  • Add more weight to important areas of document
    • Beginning, title, sections
  • Reduce the weight of some words
    • Words occurring late in document
    • What else did we already cover that should be reduced?
  • Proximity
    • Terms close together are weighted higher
      • Use N-grams
14
Q

Static Ranking Methods:

BM25

Basics

A
  • More effective than TF-IDF (a sketch follows this list)
    • Uses constants to scale certain values
    • Considers the frequency of a term in the query
  • Used for massive text documents
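
The card stops short of the formula, but a sketch of the standard Okapi BM25 score looks like the following; k1 and b are the scaling constants mentioned above, set here to common defaults rather than values from the slides (the query-term-frequency component is omitted for brevity):

```python
import math

def bm25(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Okapi BM25: an IDF-weighted score with term-frequency saturation (k1)
    and document-length normalization (b)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)            # docs containing t
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        f = doc_terms.count(t)                            # frequency of t in d
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [d.split() for d in ["the cat sat", "the dog sat on the mat", "a cat ran"]]
print(bm25(["cat", "sat"], corpus[0], corpus))
```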
15
Q

Ranking Web Pages

A
  • Web pages are really just a collection of documents online
  • Can search them just like local collections of documents
  • Extra things to consider:
    • When ranking, which web page should come first?
  • Many search engines used to use something like TF-IDF
    • TF-IDF can fail for many reasons
    • Somewhat easy to manipulate
  • Google revolutionized web page ranking with an algorithm called PageRank
16
Q

Page Rank:

Overview

A
  • Ranking algorithm developed by Google to solve the problems plaguing web search
  • Simple Idea:
    • Uses the links between pages to “vote” on the importance of a page
    • If a page is popular, its “vote” counts for more
  • It is recursive - earlier rankings affect future rankings
17
Q

Page Rank:

Formula

A

R(u) = c · Σv ∈ Bu [ R(v) / Nv ]

  • R(u) = rank of web page u
  • Bu = set of pages pointing to page u
  • Fu = set of pages u points to
  • Nu = |Fu| = number of links from u
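
A minimal iterative sketch of this formula (ranks are re-normalized each pass, the damping/teleportation term from the full paper is omitted, and the three-page web is hypothetical):

```python
def pagerank(links, c=1.0, eps=1e-6):
    """Iterate R(u) = c * sum over v in B_u of R(v) / N_v until the change
    between iterations is below the threshold eps (the card's epsilon).
    links maps each page u to F_u, the set of pages u points to."""
    pages = list(links)
    rank = {u: 1 / len(pages) for u in pages}
    while True:
        new = {u: c * sum(rank[v] / len(links[v])       # R(v) / N_v
                          for v in pages if u in links[v])
               for u in pages}
        total = sum(new.values()) or 1.0
        new = {u: r / total for u, r in new.items()}    # re-normalize
        if max(abs(new[u] - rank[u]) for u in pages) < eps:
            return new
        rank = new

web = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print(pagerank(web))  # a and c converge to 0.4, b to 0.2
```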
18
Q

Page Rank:

Recursion

A
  • Since a page’s ranking depends on the rankings of the pages that link to it, the computation is recursive: a page’s earlier ranking affects its later ranking
  • The formula is iterated until the change in rank from one iteration to the next is under some predefined threshold
    • Epsilon is usually used as the variable
  • Dangling links are ignored until PageRank has been calculated once
  • The original paper explicitly states “…computation of PageRank is fairly straightforward if we ignore issues of scale”
19
Q

Page Rank:

Dangling Links

A

Pages that “go nowhere”.

They do not have links to other pages.

They are ignored until PageRank has been calculated once, since the remaining pages still need to be iterated recursively.

20
Q

HITS:

Overview

A

Hypertext Induced Topics Search (HITS)

  • Developed by Jon Kleinberg in 1998
  • Distinguishes types of queries:
    • Specific Query - one that has the Scarcity Problem
    • Broad Query - one that has the Abundance Problem
  • Some pages have special roles:
    • Authorities
    • Hubs
21
Q

HITS:

Authorities

A

Pages that have many incoming links from Hubs

  • Recognized as providing significant, trustworthy, and useful information on some topic
  • Has a “degree”: the number of pages that point to it
    • Comparing degrees allows us to distinguish between authorities
22
Q

HITS:

Hubs

A

Pages that point to many other relevant pages

  • Can act as an “index”
  • Generally point to lots of authorities
  • For example, the course page is a Hub
23
Q

HITS:

How Authorities and Hubs work together

A

Hubs point to many authorities

Authorities are referenced by many Hubs

Together, they create a bipartite graph, essentially delineating different topics.
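
A small sketch of the mutually reinforcing update (a simplified version of Kleinberg's iteration, over a made-up link graph):

```python
def hits(links, iterations=20):
    """Authority score = sum of hub scores of pages pointing in;
    hub score = sum of authority scores of pages pointed to.
    links maps each page to the set of pages it points to."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # Normalize so scores stay bounded
        a = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a for p, v in auth.items()}
        hub = {p: v / h for p, v in hub.items()}
    return hub, auth

web = {"index": {"a", "b"}, "a": {"b"}, "b": set()}
hub, auth = hits(web)
print(hub)   # "index" scores highest as a hub
print(auth)  # "b" scores highest as an authority
```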

24
Q

Information Retrieval:

Determining Success

A
  • When results are returned to the user, how do we determine whether they got the correct results or not?
  • Multiple methods can be used to analyze this, depending on the data returned and the queries that were issued
  • Also used to determine how well a database retrieval system is performing
    • Specifically effectiveness, not efficiency
25
Q

IR Analysis:

Efficiency vs Effectiveness

A
  • Efficiency is primarily concerned with time
    • Query Latency vs Query Throughput
    • Indexing, query optimization, etc
  • Effectiveness is concerned with accuracy
    • Usually need some kind of feedback from the user
26
Q

Effectiveness:

Determining Feedback

A
  • Generally get feedback from the user
    • Implicit
    • Explicit
    • Pseudo Explicit
  • Explicit Feedback is provided directly by the user
    • Ratings, relevance judgements, etc
  • Implicit Feedback is inferred from user behavior
    • Clicks, view time, etc
  • Pseudo Explicit Feedback is drawn from the data itself
    • e.g., add to the search query a term that is frequent in the top results
27
Q

Effectiveness:

Common Metrics

A

There are some well defined metrics:

  • Precision
  • Recall
  • Relevance
  • F-Measure (or F-Score)
  • Reciprocal Rank
  • Mean Reciprocal Rank (MRR)
  • Normalized Discounted Cumulative Gain (NDCG)
28
Q

Effectiveness:

Usefulness of Metrics

A
  • Can be used for learning:
    • Maximize given metrics
    • Reinforce strategies
  • Can also be used to determine effectiveness
29
Q

Effectiveness:

Definitions:

False Negative

False Positive

A

False Negative

Some relevant documents may not be retrieved

False Positive

Some irrelevant documents may be retrieved

30
Q

Effectiveness:

Precision

A

The ratio of desired tuples

to the tuples returned to the user

  • P@K is used for K returned results, or per page
  • Example:
    • If IR retrieved the top 10 results and 5 of them are relevant to the user
    • Then precision is 5/10, or 1/2
  • Max precision may not always be 1
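
The example above, as a quick sketch (document IDs are hypothetical):

```python
def precision_at_k(results, relevant, k):
    """P@K: fraction of the top K returned results that are relevant."""
    return sum(1 for doc in results[:k] if doc in relevant) / k

results = [f"d{i}" for i in range(1, 11)]     # top 10 returned results
relevant = {"d1", "d3", "d5", "d7", "d9"}     # 5 of them are relevant
print(precision_at_k(results, relevant, 10))  # 5/10 = 0.5
```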
31
Q

Effectiveness:

Recall

A

Recall

is the ratio of relevant tuples returned to the user

vs the number of relevant tuples in the database

  • Example:
    • If IR returned the top 10 results
    • 5 of them are relevant
    • but there are 20 relevant tuples in the database
    • Recall is 5/20, or 1/4
  • Can be trivially increased by returning more documents,
    • But this can decrease precision
  • May not always be possible to get recall of 1
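
The card's example as a sketch (IDs are made up; note the 15 relevant tuples that were never returned):

```python
def recall(results, relevant):
    """Relevant results returned / relevant tuples in the whole database."""
    return sum(1 for doc in results if doc in relevant) / len(relevant)

results = [f"d{i}" for i in range(1, 11)]                # top 10 returned
relevant = {"d1", "d3", "d5", "d7", "d9"} | {f"r{i}" for i in range(15)}
print(recall(results, relevant))                         # 5/20 = 0.25
```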
32
Q

Effectiveness:

Precision vs Recall

A
  • Recall can be trivially increased
    • Simply return more documents
  • Returning more documents can cause precision to decrease, so there is a tradeoff
  • The converse is also true:
    • Returning a single relevant tuple gives 100% precision, but this gives terrible recall
  • The F-Measure method provides a way to combine both Precision and Recall
33
Q

Effectiveness:

F-Measure

A

F-Measure

  • Also called F-Score
  • Formula for combining the Precision and Recall metrics
  • Balances the trade-off between the two metrics
34
Q

Effectiveness:

F-Measure:

Formula

A

Applies a constant, alpha, as a weight to get a metric that is balanced between Precision and Recall:

F = 1 / ( alpha/P + (1 − alpha)/R )

Equivalently, with beta² = (1 − alpha)/alpha: F = (beta² + 1)·P·R / (beta²·P + R)

  • F1 Measure:
    • alpha = 0.5 -> beta = 1
    • F1 = 2·P·R / (P + R), perfectly balanced between Precision and Recall
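
As a sketch, using the weighted harmonic mean form and the precision/recall values from the earlier example:

```python
def f_measure(precision, recall, alpha=0.5):
    """F = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.5 gives the balanced F1."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1 / (alpha / precision + (1 - alpha) / recall)

print(f_measure(0.5, 0.25))  # F1 = 2PR / (P + R) = 1/3
```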
35
Q

Effectiveness:

Precision and Recall

Over Time

A
  • Both metrics can be used to evaluate the effectiveness of retrieval systems over time
    • The system is usually queried with more than a single intent
  • Take the total precision/recall and divide by the number of interactions
    • Gives an average
  • Good for tracking if the system is improving over time or if some change in the system worked
36
Q

Effectiveness:

Order

A

Precision, Recall and F-Measure don’t take into account order.

However, order of the results matters.

  • Users usually only look at the first few results
  • Need a way to measure satisfaction that also takes order into account
  • Metrics:
    • Reciprocal Rank (RR)
    • Mean Reciprocal Rank (MRR)
37
Q

Effectiveness:

Reciprocal Rank (RR)

A

Reciprocal Rank (RR)

The inverse of the position of the first relevant document:

RR = 1 / pos_r

where pos_r is the position of the first relevant result.

Used when we care about the position of the document in the returned results

38
Q

Effectiveness:

Mean Reciprocal Rank (MRR)

A

Mean Reciprocal Rank (MRR)

The average, across queries, of the inverse position of the first relevant document.

Useful for multiple interactions, especially when there is only a single relevant result.
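
A sketch of both metrics (the interactions below are made up):

```python
def reciprocal_rank(results, relevant):
    """RR = 1 / position of the first relevant document (0 if none)."""
    for pos, doc in enumerate(results, start=1):
        if doc in relevant:
            return 1 / pos
    return 0.0

def mean_reciprocal_rank(interactions):
    """MRR: the average RR over many (results, relevant) interactions."""
    return sum(reciprocal_rank(r, rel) for r, rel in interactions) / len(interactions)

queries = [(["d2", "d1"], {"d1"}),    # first relevant at position 2 -> 1/2
           (["d3", "d4"], {"d3"})]    # first relevant at position 1 -> 1
print(mean_reciprocal_rank(queries))  # (1/2 + 1) / 2 = 0.75
```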

39
Q

Effectiveness:

Relevance of Document

A
  • How well a document matches what a user is actually looking for
  • Some documents may be related, but not actually what they are looking for
  • Basic: Relevance Judgement Scores
    • Files that are created/learned by the system over time
    • Rank documents from 0-4 on a given intent
    • Expensive and biased
    • Best to use a simple binary model, yes or no (click or no click)
  • More Advanced: NDCG
40
Q

Effectiveness:

Considerations/Factors that determine overall effectiveness

A
  • Precision
    • Single Query
    • Over time
  • Recall
    • Single Query
    • Over time
  • Order
  • Relevance of document
41
Q

Effectiveness:

Normalized Discounted Cumulative Gain

(NDCG)

A

Normalized Discounted Cumulative Gain (NDCG)

A measure of the relevance of a document

  • Ranges between 0 and 1
  • Normalized from DCG
    • DCG gives a larger score for documents with high relevance
    • But, DCG is unbounded
    • Normalized by dividing DCG by IDCG
      • NDCG = DCG / IDCG
42
Q

Effectiveness:

Discounted Cumulative Gain (DCG)

and IDCG

A

Discounted Cumulative Gain (DCG)

Gives larger score for documents with high relevance.

Ideal DCG (IDCG) is calculated with the tuples ordered in the optimal order based on relevance. It is used to normalize DCG.
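
A sketch of both, using the common logarithmic discount (one of several DCG variants; the relevance judgement scores below are made up):

```python
import math

def dcg(relevances):
    """DCG: relevance gained at each rank, discounted logarithmically
    so documents placed later contribute less."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """NDCG = DCG / IDCG; IDCG is the DCG of the ideal (descending) order."""
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg else 0.0

print(ndcg([3, 2, 0, 4]))  # < 1 because the best document was ranked last
```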

43
Q

Ranking Optimization:

Keyword Analysis

Methods/Considerations

A
  • Stopping
  • Stemming
  • Proximity/Clustering
    • N-Grams
  • Synonyms
  • Homonyms
44
Q

Ranking Optimization:

Stopping

A

Removes extremely common terms

from the retrieval method or index

  • Some terms are so common that they are not useful:
    • “the” “and” “a” , etc
  • The most frequently occurring words account for a large share of total document size
  • Uses Zipf’s Law to determine what words count as “stop words”
  • Results in fewer terms to search over
  • More accurate keyword searches
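
A sketch of stopping (the stop list here is a tiny hypothetical sample, not a standard list):

```python
STOP_WORDS = {"the", "and", "a", "of", "to", "in"}

def remove_stop_words(terms):
    """Drop extremely common terms before indexing or searching."""
    return [t for t in terms if t not in STOP_WORDS]

print(remove_stop_words("the cat and the dog in a park".split()))
# ['cat', 'dog', 'park']
```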
45
Q

Ranking Optimization:

Stemming

A

Reduces some words to their “stem” prior to searching

  • Users’ keyword queries may not exactly match the document’s terms, but rather contain some variant of the word
    • Plurals, past tense, suffixes, prefixes, etc
  • Stemming removes prefixes and suffixes
  • Can reduce the size of the inverted index
  • May use Table Lookup to find minimum form of different terms
    • The NLTK library does this
  • Not always effective
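
A sketch using NLTK's Porter stemmer (assumes the nltk package is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["searching", "searched", "searches"]:
    print(word, "->", stemmer.stem(word))  # all three reduce to "search"
```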
46
Q

Ranking Optimization:

Proximity

A
  • Idea:
    • Keywords that are in the query might appear close to each other in the document
  • The closer they are to each other, the more important the document is
    • Vice Versa
  • Can also consider the order of the keywords
  • One way to take advantage of this is the use of N-Grams
47
Q

Ranking Optimization:

N-Grams

A
  • Basic Idea: Clustering characters together
    • Not too useful for this context
  • Better: Clustering words together
  • Given: “Hello World. Hello Ben”
  • 2-Grams:
    • “Hello World”, “World. Hello”, “Hello Ben”
  • 3-Grams:
    • “Hello World. Hello”, “World. Hello Ben”
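
A sketch that reproduces the card's example:

```python
def word_ngrams(text, n):
    """Cluster consecutive words into n-grams."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("Hello World. Hello Ben", 2))
# ['Hello World.', 'World. Hello', 'Hello Ben']
print(word_ngrams("Hello World. Hello Ben", 3))
# ['Hello World. Hello', 'World. Hello Ben']
```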
48
Q

Ranking Optimization:

Synonyms and Homonyms

A

Synonyms

Consider synonyms for some terms the user includes in the query:

  • Example: Document: “My instructor”, Query: “teacher”
    • Expand query to include both terms “teacher or instructor”
  • Expanding too much can be complicated: many terms have synonyms that are not appropriate to the user’s intent

Homonyms

Query may include a homonym of the term the user actually intended.

Can also be tricky.
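
A sketch of synonym-based query expansion (the synonym table is a hypothetical hand-built sample; a real system might draw on a thesaurus such as WordNet):

```python
SYNONYMS = {"teacher": {"instructor"}}

def expand_query(terms):
    """Expand the query to include known synonyms of each term."""
    expanded = set(terms)
    for t in terms:
        expanded |= SYNONYMS.get(t, set())
    return expanded

print(expand_query(["teacher"]))  # {'teacher', 'instructor'}
```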