CZ4034 Flashcards
What is term-document incidence matrix?
Limitation? How to improve?
One-hot encoding or basically a 0/1 vector for each term/word
ex. “Anthony” - 110100 (appears in doc 1, 2, 4)
Matrix of the terms will be extremely sparse (far more 0s than 1s)
Better representation: only record the 1s
Structured vs. Unstructured Data
Structured data tends to refer to information in “tables”
Unstructured data is information that doesn’t lend itself to the kind of table formatting required by a relational database. Can be textual or non-textual (such as audio, video, and images)
What is inverted index?
For each term t, store a list of all documents (docID) containing t
Ex. “Caesar” - [2 4 5 6 16 57] (postings)
if “Caesar” appears in a new doc 14, 14 must be added between 6 and 16 (the list must be kept sorted)
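A minimal Python sketch of building such an index (the toy documents and the whitespace tokenizer are illustrative assumptions, not from the course):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build term -> sorted list of docIDs from {docID: text} (toy tokenizer: lowercase + split)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Postings must be kept sorted so new docIDs merge in order
    return {term: sorted(postings) for term, postings in index.items()}

docs = {1: "Anthony and Cleopatra", 2: "Julius Caesar", 4: "Hamlet", 16: "Caesar again"}
index = build_inverted_index(docs)
print(index["caesar"])   # [2, 16]
```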
Query Optimization: What is the best order for query processing?
Process in order of increasing frequency
for AND, the smallest set should be on the left (executed first), because anything not in the smallest set can be ignored in the later merges
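A hedged sketch of the two-pointer AND merge, processing postings lists from shortest to longest (the postings below are made up):

```python
def intersect(p1, p2):
    """Merge two sorted postings lists (standard two-pointer AND)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def and_query(postings_lists):
    """Process an AND query in order of increasing postings length (rarest term first)."""
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect(result, plist)
    return result

# e.g. three query terms with postings of different lengths
print(and_query([[2, 4, 8, 16, 32], [1, 2, 3, 5, 8, 13, 21, 34], [8, 16]]))  # [8]
```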
What is Biword Index?
Solution 1 to Phrase Queries:
Index every consecutive pair of terms in the text as a phrase
Ex. “Friends, Romans, Countrymen” generates the biwords:
“friends romans” and “romans countrymen”
What is Positional Index?
Solution 2 to Phrase Queries:
In the postings, store (for each term) the position(s) in which tokens of it appear
<term: doc1: pos1, pos2 …; doc2: pos1, pos2 …;>
angels: 2: {35, 174}; 4: {12, 22}; 7: {17};
fools: 1: {1, 17, 74}; 4: {8, 78};
How to process a phrase query using positional index? (2)
ex. “to be or not to be”
- Extract the inverted index entries for each distinct term: to, be, or, not
- Merge their doc:position lists to enumerate all positions matching the phrase “to be or not to be”
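A small sketch of the position-merge step for a two-word phrase; the dict-of-dicts layout of the positional index is an assumption for illustration:

```python
def phrase_positions(pos_index, doc_id, w1, w2):
    """Return positions in doc_id where w1 is immediately followed by w2."""
    p1 = pos_index.get(w1, {}).get(doc_id, [])
    p2 = set(pos_index.get(w2, {}).get(doc_id, []))
    return [p for p in p1 if p + 1 in p2]

# term -> {docID: sorted positions}, toy data
pos_index = {
    "to": {4: [8, 16, 190, 429, 433]},
    "be": {4: [17, 191, 291, 430, 434]},
}
print(phrase_positions(pos_index, 4, "to", "be"))  # [16, 190, 429, 433]
```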
Problem with positional index? Despite this, why is it still standardly used?
- Expands postings storage substantially
- Standardly used because of power and usefulness of phrase and proximity queries
Lossless vs. Lossy compression?
What compression are the following steps: case folding, stopwords removal, stemming?
Lossless = all info is preserved
Lossy = some info discarded (case folding, stopwords removal, stemming)
Formula for Heaps’ Law?
What does the log-log plot of M vs. T suggest?
M = kT^b
M is size of vocabulary and T is # of tokens in the collection
The plot suggests that vocabulary size grows quickly for the first documents added, but the more docs that are added, the fewer new distinct terms appear, so vocabulary growth slows down
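A quick illustration of the sub-linear growth; the parameter values (k ≈ 44, b ≈ 0.49, often quoted for the RCV1 collection) are an assumption:

```python
k, b = 44, 0.49          # parameter values commonly reported for the RCV1 collection
for T in (10_000, 1_000_000, 100_000_000):
    M = k * T ** b       # Heaps' law estimate of vocabulary size
    print(f"T = {T:>11,}  ->  M ≈ {M:,.0f}")   # vocabulary grows much slower than T
```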
Zipf’s Law and consequences?
the ith most frequent term has frequency proportional to 1/i
- given a natural language corpus, the freq of any word is inversely proportional to its rank in the freq table
- thus the most freq word will occur approximately 2x as often as the 2nd most freq word, 3x as often as the 3rd most freq word, etc
What are fixed-width terms in dictionary? Why is it wasteful?
Allocating a fixed-width array entry for each term/word
Wasteful because most terms need far less space (e.g. one-letter words), while some super-long words may need even more space
Explain what is “Dictionary as string with pointers to every term”. Problem?
Storing dictionary as long string of characters (instead of table). Pointer to next word shows end of current word.
But difficult to find those words in the dictionary
What is blocking in dictionary?
Storing dictionary as long string of characters
where a pointer is stored only to every kth term (the term length stored in the string separates the different dictionary entries)
Ex. 7systile9syzygetic8syzygial…
What is blocking+front coding in dictionary?
In addition to blocking, consecutive entries often share a long common prefix; store that prefix once and encode only the differing suffixes
Ex. 8automata8automate9automatic10automation becomes 8automat*a1<>e2<>ic3<>ion
What is Postings Compression?
Storing the gaps between consecutive docIDs instead of the docIDs themselves, because gaps need fewer bits
Ex. docIDs 283042, 283043 can be stored as 283042 followed by a gap of 1
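A minimal sketch of gap encoding/decoding (the variable-length byte/bit coding of the gaps is omitted):

```python
def to_gaps(doc_ids):
    """Encode a sorted postings list as the first docID plus gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Decode gaps back into absolute docIDs (running sum)."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [283042, 283043, 283154, 283159]
print(to_gaps(postings))                          # [283042, 1, 111, 5]
print(from_gaps(to_gaps(postings)) == postings)   # True
```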
What is Tokenization?
Give example of language issues
Breaking down a sequence of text into smaller parts (token = an instance of a sequence of characters)
French: L’ensemble (1 token or 2?)
German: noun compounds are not segmented
Chinese/Japanese: no space
Arabic: written right to left, but with numbers left to right
What are Stopwords?
Words to exclude from dictionary, the commonest words that have little semantic content
ex. the, a, and, to, be
Give examples of cases where stopwords should not be removed (3)
- Phrase queries (King of Denmark)
- Song titles.. etc (Let it be, To be or not to be)
- Relational queries (flights to London)
What is Normalization? Examples
A mapping from all the possible spelling variants of a word to the canonical (standard) form that is stored in the dictionary
“Normalize” words in indexes text as well as query words into the same form
Define equivalence classes of terms:
- match U.S.A. and USA -> USA
- match résumé and resume (often best to normalize to a de-accented term)
- match 7月30日 and 7/30
What is Casefolding? Issues?
Reduce all letters to lowercase
It is often best to lowercase everything but there may be some exceptions
ex. WHO vs who, Polish vs. polish
What is Lemmatization?
A type of normalization that does a ‘proper’ reduction to dictionary headword / base form
Ex. am, are, is -> be
car, cars, car’s, cars’ -> car
What is Stemming? Problem?
Crude affix chopping (reduce terms to their ‘roots’ before indexing) (language dependent)
Ex. automates, automatic, automation -> automat
It can completely flip meaning/polarity
Ex. stuffed-> stuff, slimy-> slim, flourish-> flour
Lemmatization vs. Stemming when to use which
In general, if accuracy is still high enough, use stemming (because it is very powerful and reduces the dictionary much more)
If stemming not good for analysis (in cases where POS tag is important), use lemmatization
Lemmatization Pro and Con
+ improves concept extraction through text normalization (noun/verb inflection elimination)
Ex. eats_burger, ate_burger, eat_burgers, eat_the_burger all normalize to eat_burger
- may lead to misunderstanding
Ex. park -> parking, develop -> developing/developed, kick_ball -> kick_balls
What is the most popular algorithm for stemming English?
Porter’s algorithm (set of rules to replace end of a word with sth else) (example in tutorial)
doesn’t work all the time but at least as good as other stemming options
What are Wild-card queries?
How to retrieve (data structure)?
Queries where you don’t get a specific word/phrase from the user; instead you get a query containing *
Ex. mon* -> find all docs with word beginning with “mon”
Easy retrieval with binary tree (or B-tree)
What is Permuterm index?
An easy way to handle wildcard queries by adding rotations
Ex. hello ->
hello$, ello$h, llo$he, lo$hel, o$hell
(rotations let you handle * at the front, end, or middle of a term)
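A sketch of generating the rotations (note it also produces the final rotation $hello, which the list above omits); to answer a query like hel*o you would rotate it so the * ends up at the end (o$hel*) and do a prefix lookup:

```python
def permuterm_rotations(term):
    """Generate all rotations of term + '$' for the permuterm index."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
```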
Problems of permuterm query processing (2)
- Quadruples lexicon size (more memory)
- Takes a long time (expensive query execution)
2 Principal Uses of Spell Correction
- correcting documents being indexed (mapping between correct word in dictionary and misspelled word in document is done correctly)
- correcting user queries to retrieve “right” answers (whatever you do on dictionary, you must do on query)
2 Main Flavors of Spell Correction
- Isolated word: calculate the distance between a misspelled word and correct words to decide what should replace it
- Context-sensitive: look at the surrounding words to decide if a word is correct or not (statistics, machine learning)
3 disadvantages of spell correction and suggest an alternative
- Retrieving documents indexed by correct spelling is computationally expensive
- May lead to wrong normalization
- May produce too many alternatives
Instead, can suggest alternative queries with correct spelling “Did you mean: ….”
Only search if user clicks on the alternative query
Edit Distance (Levenshtein)
the minimum # of operations (insert, delete, replace; transposition in the Damerau variant) to convert one string to another
ex. From dof to dog is 1
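A standard dynamic-programming sketch of plain Levenshtein distance (insert/delete/replace only; transposition would need the Damerau extension):

```python
def edit_distance(s, t):
    """Classic Levenshtein DP: insert, delete, replace each cost 1."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace (or match)
    return dp[m][n]

print(edit_distance("dof", "dog"))  # 1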
Weighted edit distance
Adding weights to common mistakes
(OCR errors, Keyboard errors, etc)
The cost of replacing certain letters may have higher cost
Problems with Edit Distance. Suggestion
Expensive and Slow to compute edit distance to every dictionary term
Use n-gram overlap (Ex. trigrams)
november - (nov, ove, vem, emb, mbe, ber)
december - (dec, ece, cem, emb, mbe, ber)
3 of the 6 trigrams overlap -> normalize using the Jaccard coefficient (3/9 ≈ 0.33 here, since the union contains 9 trigrams)
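A small sketch reproducing the november/december example with character trigrams and the Jaccard coefficient:

```python
def char_ngrams(word, n=3):
    """Set of character n-grams (trigrams by default)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient: |intersection| / |union|."""
    return len(a & b) / len(a | b)

nov, dec = char_ngrams("november"), char_ngrams("december")
print(nov & dec)                    # the 3 shared trigrams: emb, mbe, ber (set order may vary)
print(round(jaccard(nov, dec), 2))  # 0.33
```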
Jacard distance vs. coefficient
Jaccard distance (D) = how different 2 sets are
Jaccard coefficient (J) = how similar 2 sets are
They always add up to 1 (or 100%)
What is Soundex? Is it useful in IR?
Translate any word/token into a 4-character reduced form
Not very useful for IR (many words that are different end up having the same code)
Pro and Cons of Boolean Search
+ Good for expert users with precise understanding of their needs and the collection
- Not good for majority of users as they may be incapable of writing Boolean queries (may be too much work)
- Either too few or too many results (and not ranked on relevancy)
What is the meaning of Score in ranked retrieval?
A number between [0,1] that measures how well document and query ‘match’
Issues with Jaccard for scoring
- It doesn’t consider term frequency
- Rare terms in a collection are more informative than frequent terms but Jaccard doesn’t consider document frequency
What is Term Frequency tf_t,d?
How many times a term t occurs in a document d
What is log frequency weight of term t in d?
if tf_t,d > 0, w_t,d = 1 + log tf_t,d
and 0 otherwise
used to dampen growth of number of occurrences
What is Document Frequency df_t?
The number of documents that contain term t (measure of popularity)
Frequent terms (ex. stopwords) are less informative than rare terms
We want a high weight for rare terms as they are likely to be relevant to query
What is idf (inverse document frequency)? Does it affect ranking of docs for queries with 1 term?
idf_t = log (N/df_t)
log is used to dampen effect of idf
idf affects the ranking of documents for queries with at least 2 terms
What is collection frequency of t?
The number of occurrences of t in the collection, counting multiple occurrences
What is tf-idf weighting and its score?
w_t,d = (1+log tf_t,d) x log(N/df_t)
Score(q,d) = sum tf.idf_t,d over terms t present in both the query q and the document d
Score will need to be normalized so that it’s between [0,1]
What is the Vector Space Model?
What are the axes?
How to get proximity?
A model that represents documents and queries as vectors in high-dimensional space
Terms are axes of the space
Rank documents according to their (angle) proximity to the query in this space
What is the cosine similarity for query q and document d?
The dot product of length-normalized vectors q and d
cos(q, d) = (q · d) / (||q|| ||d||)
numerator = dot product of q and d
denominator = product of the Euclidean lengths (sqrt of the sum of squares) of q and d
Vector Space Ranking steps (5)
- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top K (e.g., K = 10) to the user
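A minimal end-to-end sketch of these five steps, using the log-tf and idf weights defined above; the three toy documents, the whitespace tokenizer, and base-10 logs are assumptions:

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    """Weighted tf-idf vector: w = (1 + log10 tf) * log10(N / df)."""
    tf = Counter(tokens)
    return {t: (1 + math.log10(c)) * math.log10(N / df[t]) for t, c in tf.items() if t in df}

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

docs = {1: "new york times", 2: "new york post", 3: "los angeles times"}
tokenized = {d: text.split() for d, text in docs.items()}
N = len(docs)
df = Counter(t for toks in tokenized.values() for t in set(toks))   # document frequencies

doc_vecs = {d: tfidf_vector(toks, df, N) for d, toks in tokenized.items()}
query_vec = tfidf_vector("new new times".split(), df, N)
ranking = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
print(ranking[:2])   # top-K docIDs for the query
```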
Limitations of Vector Space Model (2)
- The vector space model tells us what is similar to what BUT doesn’t tell us what is what
- Only implements one kind of reasoning (analogical reasoning) but there are many other kinds of reasoning (e.g., inductive and deductive reasoning)
Common problems of vector space and how to solve
Sparsity; the term-document matrix becomes too big or too tall
Solution: Dimensionality Reduction (feature selection, TSVD, random projections)
What is truncated singular value decomposition TSVD?
Factorize the matrix
Discard the smallest singular values (start from the bottom of the diagonal singular-value matrix, which is ranked by importance)
Rebuild the matrix
Note that we are projecting original space into smaller one where relative distances/angles are preserved
What are the 4 ways for optimized cosine ranking?
To improve top K document retrieval:
1. Index Elimination
2. Champion Lists
3. Static Quality Scores
4. Cluster Pruning
What is Index Elimination? Explain the 2 types
Before starting dot product calculation, decide which documents to do dot product with (docs containing at least one query term)
or advanced:
a.) only consider high-idf query terms (don’t consider stop words as they contribute little to the scores)
b.) only consider docs containing many query terms = only compute scores for docs containing several of the query terms (ex. at least 3 out of 4) -> the docs will have high cosine score
What are champion lists? Advantage
Precompute for each term t in the dictionary, the r docs of highest weight (most relevant) in t’s postings
= only consider docs with high weights for each term
This is done prior to queries, only once at index build time (will save a lot of time at query time)
Problem of champion list
r must be pre-determined at index build time. Say K = 10 (present the user the top 10 relevant documents), but the champion lists only yield 9 candidate documents (r < K): we cannot return enough results
What are examples of static quality scores (2)?
- Relevance (modeled by cosine scores)
- Authority (query-independent property: paper with many citations/twitter account with many followers)
We can model authority (assign a quality score in [0,1])
What is net score?
net-score(q,d) = g(d) + cosine(q,d)
where g(d) is authority and cosine(q,d) is relevance
other linear combinations are possible
What are high and low lists?
For each term, maintain 2 postings lists called high (champion list) and low
When traversing postings on a query, only traverse high lists first
- If we get more than K docs, select the top K and stop
- Else proceed to get docs from the low lists (docs with less count in case r < K)
What is cluster pruning?
- Preprocessing: pick ~sqrt(N) docs at random (or using heuristics) as leaders; every other doc attaches to its nearest leader as a follower
- Given query Q, find its nearest leader L
- Seek the K nearest docs from among L’s followers
Only need to consider the leaders and then that leader’s followers
What are field/parametric indexes?
Postings for each field value to allow user to search for specific parameters of metadata
(similar to searching files/folders by their metadata)
A metadata may contain several fields (ex. author, title, language, year of publication, etc)
What are zone indexes?
A zone is a region of the doc that can contain an arbitrary amount of text (ex. title, abstract, body, references)
Build inverted indexes on zones as well to allow user to query for a specific zone
term.abstract / term.title / term.author
What are tiered indexes?
Break postings up into a hierarchy of lists (most important (tier 1),…,least important)
If not enough documents retrieved from tier 1, go down each tier
tier 1 could be docs containing 3 of the 4 query terms
tier 2 could be docs with 2 of 4, tier 3 with 1 of 4, etc.
What are query parsers?
- Run the query as a phrase query (‘rising interest rates’)
- If <K docs contain the phrase, split into smaller phrase queries and run (‘rising interest’, ‘interest rates’)
- If we still have <K docs, run the vector space query
- Rank matching docs by vector space scoring
What is precision?
Fraction of retrieved docs that are relevant P(relevant|retrieved) = tp/(tp+fp)
What is recall?
Fraction of relevant docs that are retrieved P(retrieved|relevant) = tp/(tp+fn)
What is F measure (weighted harmonic mean)?
(unranked retrieval evaluation)
Combined measure that assesses precision/recall tradeoff (sweet spot)
F1 = 2PR / (P+R)
What is Precision@K (P@K)?
(ranked retrieval evaluation)
Set a rank threshold K
Compute % relevant in top K
Ignores documents ranked lower than K
example. relevant [yes, no, yes, no, yes]
P@1 = 1, P@2 = 1/2, P@3 = 2/3
P@4 = 2/4, P@5 = 3/5
What is MAP (Mean Average Precision)? Assumption?
(ranked retrieval evaluation)
Average precision = average of P@K, taken at each rank K where a relevant document is retrieved
MAP = average precision across multiple queries/rankings
MAP is macro-averaging: each query counts equally
MAP assumes user is interested in finding many relevant documents for each query
What is Discounted Cumulative Gain?
(ranked retrieval evaluation)
Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
DCG_p = r_1 + r_2/log_2(2) + r_3/log_2(3) + … + r_p/log_2(p)
= r_1 + sum from i=2 to p of ( r_i / log_2(i) )
What is Normalized Discounted Cumulative Gain (nDCG) at rank n?
(ranked retrieval evaluation)
nDCG = DCG/IDCG at rank n
Ideal ranking (IDCG) is sorting the documents starting from highest relevance level, then calculating the DCG
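A short sketch of DCG and nDCG using the base-2 log convention; the graded relevance list below is made up:

```python
import math

def dcg(rels):
    """DCG = r_1 + sum_{i>=2} r_i / log2(i)  (ranks are 1-based)."""
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

def ndcg(rels):
    """nDCG = DCG of the ranking / DCG of the ideal (sorted) ranking."""
    ideal = sorted(rels, reverse=True)
    return dcg(rels) / dcg(ideal)

rels = [3, 2, 3, 0, 1, 2]        # graded relevance of the ranked results
print(round(dcg(rels), 3))       # ≈ 8.10
print(round(ndcg(rels), 3))      # ≈ 0.93
```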
2 problems with the rank-based measures?
- Need annotators (humans): they are expensive, inconsistent, decaying in value as the documents/query mix evolves, and not always representative of “real users”
- Recall is difficult to measure on the web
What is Inter-judge (dis)agreement?
Have more than one person labeling the data and calculate what is the agreement between the two using Kappa Score
What is the formula for Kappa measure?
Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]
where P(A) is proportion of time judges agree and P(E) is what agreement would be by chance
P(E) = P(nonrelevant)^2+P(relevant)^2
Kappa = 1 for total agreement, 0 for chance agreement
- If above 0.8, considered as good agreement
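A worked kappa computation; the judgment counts below are illustrative, not from the flashcards:

```python
# Toy agreement table for two judges over 400 doc/query pairs (made-up counts):
# both say relevant: 300, both say nonrelevant: 70, they disagree on 30.
both_rel, both_nonrel, disagree = 300, 70, 30
total = both_rel + both_nonrel + disagree             # 400

p_a = (both_rel + both_nonrel) / total                # observed agreement P(A) = 0.925
# Pooled marginals across both judges (each disagreement contributes one 'relevant' judgment)
p_rel = (2 * both_rel + disagree) / (2 * total)       # 630/800 = 0.7875
p_nonrel = 1 - p_rel                                  # 0.2125
p_e = p_rel ** 2 + p_nonrel ** 2                      # chance agreement P(E) ≈ 0.665
kappa = (p_a - p_e) / (1 - p_e)
print(round(kappa, 2))                                # ≈ 0.78: just below the 0.8 "good" threshold
```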
What is relevance feedback?
User feedback on relevance of docs in initial set of results
– User issues a (short, simple) query
– The user marks some results as relevant or non-relevant
– The system computes a better representation of the information need based on the feedback
– Relevance feedback can go through one or more iterations
What is the Rocchio algorithm? Formula
It uses the vector space model to pick a relevance feedback query.
New (opt) query moves toward relevant documents (positive feedback) and away from irrelevant documents (negative feedback)
qm = alpha x q0 + beta x A - gamma x B
A = averaged positive feedback vector
B = averaged negative feedback vector
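A minimal sketch of the Rocchio update; the alpha/beta/gamma values and the toy vectors are assumptions (typical defaults), and negative weights are clipped to zero by convention:

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """qm = alpha*q0 + beta*mean(relevant) - gamma*mean(nonrelevant)."""
    q0 = np.asarray(q0, dtype=float)
    pos = np.mean(relevant, axis=0) if len(relevant) else np.zeros_like(q0)
    neg = np.mean(nonrelevant, axis=0) if len(nonrelevant) else np.zeros_like(q0)
    qm = alpha * q0 + beta * pos - gamma * neg
    return np.maximum(qm, 0.0)   # common convention: drop negative term weights

q0 = [1.0, 0.0, 1.0, 0.0]        # toy tf-idf query vector
rel = [[0.9, 0.1, 0.8, 0.0], [1.0, 0.0, 0.6, 0.1]]
nonrel = [[0.0, 0.9, 0.0, 0.8]]
print(rocchio(q0, rel, nonrel))  # query moves toward relevant docs, away from non-relevant
```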
What is relevance feedback useful for?
Relevance feedback is most useful for increasing recall in situations where recall is important
Positive feedback is more valuable than negative (set γ < β)
Assumptions of relevance feedback Rocchio algorithm (2)
A1: User has sufficient knowledge for initial query
A2: Relevance prototypes are “well-behaved”
- Term distribution in relevant documents will be similar
- Term distribution in non-relevant documents will be different from those in relevant documents
4 problems of Rocchio algorithm (Relevance Feedback)
- 2 big assumptions A1 and A2
- requires a lot of time from user to select relevant documents. users are often reluctant to provide explicit feedback
- natural language is ambiguous (“break” has multiple meanings)
- inefficient: high cost, partial solution
users would prefer to revise and resubmit the query rather than give relevance feedback every time
What is Thesaurus-based query expansion? (global method = using external source)
For each term, t, in a query, expand the query with synonyms and related words of t from the thesaurus
- Generally increases recall but may decrease precision
- A high cost of manually producing a thesaurus
Supervised vs. Unsupervised learning
Supervised learning uses labeled training data, and unsupervised learning does not
Naïve Bayes classifier
1. assumption
2. P(cj)
3. P(xi|cj) with smoothing
- Conditional independence assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities
P(X1, X2, X3 | C) = P(X1|C) x P(X2|C) x P(X3|C)
- P(cj) = N(C=cj) / N = # of docs belonging to class cj / # of docs in training data
- P(xi|cj) = [N(Xi=xi, C=cj) + 1] / [N(C=cj) + |V|] = [# of occurrences of word xi in docs belonging to class cj + 1] / [# of words in docs belonging to class cj + size of vocab]
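A minimal multinomial Naive Bayes sketch with the add-one smoothing above; the toy training set mirrors a standard textbook example and is not from these flashcards:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, class). Returns priors P(c), per-class word counts, and vocab."""
    class_docs = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)            # class -> word -> count
    vocab = set()
    for tokens, c in docs:
        word_counts[c].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, word_counts, vocab

def classify(tokens, priors, word_counts, vocab):
    """argmax_c log P(c) + sum_i log P(x_i|c), with add-one (Laplace) smoothing."""
    scores = {}
    for c in priors:
        total = sum(word_counts[c].values())
        score = math.log(priors[c])
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

train = [("chinese beijing chinese".split(), "yes"),
         ("chinese chinese shanghai".split(), "yes"),
         ("chinese macao".split(), "yes"),
         ("tokyo japan chinese".split(), "no")]
priors, counts, vocab = train_nb(train)
print(classify("chinese chinese chinese tokyo japan".split(), priors, counts, vocab))  # "yes"
```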
Naive Bayes Pros (3) and Cons (4)
Advantages:
– Fast learning
– Simple, easy to implement
– No curse of dimensionality
Disadvantages:
– Can perform poorly if assumptions do not hold
– Maximum likelihood can overfit data
– It is not semantics-aware
– Word order not taken into account
What is feature selection? How? 2 advantages?
- Select a subset of ‘relevant’ terms in the vocabulary (or features)
- How: hypothesis-testing statistics such as chi-square
- Advantages: reduces training time and can improve generalization (eliminates noise features, avoids overfitting)
- What is x2 (chi-square) test?
- What does it mean when you reject the hypothesis?
- used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories
- Rejecting the hypothesis (of independence) means we have 0.999 confidence (X² > 10.83) that the class and the term are dependent, hence the term should be helpful as a feature
Formula or 2x2 chi-squared
[ N x (AD-CB)^2 ] / [ (A+C) x (B+D) x (A+B) x (C+D) ]
A=#(t,c) , B=#(t,!c), C=#(!t,c) , D=#(!t,!c)
N = A+B+C+D
t = term and c = class
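A worked 2x2 chi-squared computation with made-up counts:

```python
def chi_square_2x2(A, B, C, D):
    """X^2 for a 2x2 term/class contingency table: A=#(t,c), B=#(t,!c), C=#(!t,c), D=#(!t,!c)."""
    N = A + B + C + D
    return (N * (A * D - C * B) ** 2) / ((A + C) * (B + D) * (A + B) * (C + D))

# Illustrative counts only: the term co-occurs with the class far more often than chance would predict
print(round(chi_square_2x2(A=40, B=60, C=10, D=90), 2))
# 24.0 > 10.83 -> reject independence: term and class are dependent, keep the term as a feature
```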
Vector Space Classification vs. kNN
Vector space classification = classify by choosing the nearest centroid
kNN = compute the distance/similarity between the test point and all training points; assign the majority class among the k closest/most similar points
4 advantages 3 disadvantages of kNN
+ No feature selection necessary
+ Scales well with large number of classes
+ No training necessary
+ Most cases more accurate than NB
- Small changes to one class can have ripple effect
- Scores can be hard to convert to probabilities
- May be expensive at test time
kNN vs. NB
kNN high variance, low bias (memorizes, will always say ‘no’ to new object) (non-linear problem)
NB low variance, high bias (lazy, will say ‘yes’ if object has certain feature) (due to assumption, linear separation)
What is support vector machine? What is the goal? Limitation?
Mapping original feature space to some higher-dimension where training set is separable by a hyperplane
Maximize the margin around the separating hyperplane (between the 2 classes)
Binary classification algorithm
What is concept drift?
One measure of text classification system?
Is feature selection good in protecting against concept drift?
Categories change over time (a feature may be good back then, but bad today)
Good text classification system = how well it protects against concept drift
Feature selection NOT good in protecting
3 Issues with RNN (Recurrent Neural Networks)
- Struggle to handle large sequences of text (ex. when reach end of paragraph, they forget beginning)
- Hard to train (exploding gradient problem)
- Hard to parallelize as they process words sequentially (cannot speed up training by more GPUs)
2 Advantages of Transformers over RNN
- Transformers can be very efficiently parallelized (can train big models)
- They have attention, self-attention, positional encoding (easily identify concepts / what is important)
What is clustering?
Unsupervised learning process of grouping set of objects into classes of similar objects
Hard vs. Soft clustering
Hard clustering: each doc belongs to exactly one cluster
Soft clustering: a doc can belong to more than one cluster
K-means algorithm
- Select K random docs as seeds (initial centroids)
- Assign each document to its nearest centroid to form clusters
- Recalculate centroid of each cluster
- Repeat step 2-3 until convergence or stopping criterion reached (centroid doesn’t change / fixed # iteration / etc)
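A plain k-means sketch on 2-D points following these steps (random seeds, fixed iteration count as the stopping criterion); the points are made up:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: random seeds, assign to nearest centroid, recompute centroids, repeat."""
    random.seed(seed)
    centroids = random.sample(points, k)                 # step 1: K random docs as seeds
    for _ in range(iters):                               # stop after a fixed number of iterations
        clusters = [[] for _ in range(k)]
        for p in points:                                 # step 2: assign each point to nearest centroid
            j = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            clusters[j].append(p)
        for j, cl in enumerate(clusters):                # step 3: recompute each cluster's centroid
            if cl:
                centroids[j] = (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)
```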
Issue with k-means (2)
- Need to select initial centroids using a good heuristic
- K, the number of clusters, is unknown. Can end up with too few or too many clusters
Given a clustering, what is the Benefit and Total Benefit and Clustering Value
- Benefit of a doc = cosine similarity to its centroid
- Total Benefit = sum of individual doc benefits
- Clustering value = Total Benefit - Total Cost where Total Cost is cost for each cluster * K
2 Strengths of Hierarchical Clustering
- Do not have to assume number of clusters (any desired k can be obtained by ‘cutting’ dendrogram at that level)
- May correspond to meaningful taxonomies
Agglomerative vs. Divisive clustering
Agglomerative starts with points as individual clusters. At each step, merge closest pair until only one cluster left
Divisive starts with one, all-inclusive cluster. At each step, split a cluster until each cluster contains a single point
Hierarchical Agglomerative Clustering (HAC) Algorithm
- Compute proximity matrix where each data point is a cluster
- Merge the two closest clusters and update proximity matrix
- Repeat 2 until only single cluster remains
4 ways of defining closest pair of clusters
- Single-link (most cosine-similar)
- Complete-link (least cosine-similar)
- Centroid
- Average-link (average cosine between pairs of elements)
Clusters produced by single-link vs. complete-link clustering
Single-link clustering often produces long, straggly clusters
Complete-link produces more compact clusters but pays too much attention to outliers
What is good clustering? (2)
High quality clusters with
1. High intra-class similarity
2. Low inter-class similarity
The measured quality of a clustering depends on both the document representation and the similarity measure used
What is purity? Example: A cluster has 1 point in class A, 4 points in class B and 1 point in class C, what is the purity?
Ratio between the count of the dominant class in the cluster and the size of the cluster
Purity = max(1,4,1)/6 = 4/6
What is Rand index? Formula?
To measure agreement between a clustering and the ground-truth classes (or between two different clusterings)
RI = (A+D) / (A+B+C+D)
where A is true positive (# of correctly identified pairs),
B is false positive (# of bad cluster pairs),
C is false negative (# of pairs that were supposed to be clustered together),
D is true negative
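A small sketch that counts the A/B/C/D pair types and computes RI; the toy labels are made up:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Count pairs: A = same class & same cluster, D = different class & different cluster, etc."""
    A = B = C = D = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class and same_cluster:
            A += 1        # true positive pair
        elif not same_class and same_cluster:
            B += 1        # false positive pair
        elif same_class and not same_cluster:
            C += 1        # false negative pair
        else:
            D += 1        # true negative pair
    return (A + D) / (A + B + C + D)

true_classes = ["x", "x", "x", "o", "o", "o"]
clusters     = [ 1,   1,   2,   2,   2,   2 ]
print(round(rand_index(true_classes, clusters), 2))   # 0.67
```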
What is discriminative labeling and non-discriminative labeling?
discriminative labeling: Find terms or phrases that distinguish a cluster from the other clusters (can use feature selection criteria like chi-square)
non-discriminative labeling: Select terms or phrases based solely on information from the cluster itself (ex. terms with high weights in centroids or alternatively use titles/use AI)
Non-discriminative methods sometimes select frequent terms that do not distinguish clusters
What is Goto?
A paid search ranking in 1996
Search ranking depended on how much you paid, auction for keywords
Ads are ranked based on bids (highest bidder gets first rank)
Maximizes revenue but no relevance ranking
Trouble with paid search ads (Goto)
Advertisers are only charged when someone clicks on their ad.
Hence if a user searches and sees an irrelevant ad, they won’t click it, but it is still free publicity for the advertiser
What is Search Engine Optimization (SEO)?
“Tuning” your webpage to rank highly in the algorithmic search results for select keywords
What is keyword stuffing? What is the variant?
Dense repetitions of chosen terms (sometimes in same color as background so that it won’t be visible to humans)
Variant: keyword stuffing in meta-tags (similar to Instagram hashtags), which people abuse for free publicity
What is cloaking and doorway pages?
Cloaking: serve fake content (a different page) to the search engine spider/bot and the true page to human visitors
Doorway pages: pages optimized for a few chosen keywords that then redirect visitors to the real target page
How can user’s search query be categorized to satisfy their needs? (4)
- Informational (want to learn about sth)
- Navigational (want to go to that page)
- Transactional (want to do something, ex. shop or download)
- Gray areas
What is fingerprint? What can it be used to detect?
Fingerprint = a succinct (say 64-bit) digest of the characters in a doc; maps arbitrarily large data to a much shorter bit string
When fingerprints of two web pages are equal, test if the pages are identical (duplicates)
What is near-duplicate detection?
Detect near-duplicates (approximate match) by computing similarity using shingles (n-grams)
Similarity (= size of intersection / size of union) is estimated based on short sketch
What any crawler must do (2)
- Be polite
  - explicit politeness: only crawl the portions that webmasters specify in robots.txt
  - implicit politeness: even with no specification, avoid hitting any site too often
- Be robust: be immune to spider traps and other malicious behavior from web servers
What is Google bomb
a search with “bad” results due to maliciously manipulated anchor text
ex. search for “evil empire” resulted in Microsoft homepage as top result
What is Page Rank?
Long-term visit rate (pages visited more often are more important)
PR(u) = SUM PR(v)/Lv for all v in Bu
where Bu is set of pages pointing to u and Lv is number of outgoing links from page v (not counting duplicate links)
What are the steps of transition probability matrix? Purpose of teleportation rate?
- If there is a hyperlink from page i to page j, then Aij = 1, otherwise Aij = 0
- Divide each 1 in A by the number of 1’s in its row
- Multiply the resulting matrix by 1 − α (α: teleporting rate)
- Add α/N to every entry
Teleportation prevents the random walk from getting stuck in dead ends, so the long-term visit rate is well defined
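A sketch of this recipe plus power iteration; alpha = 0.15 is a typical teleporting rate and the 3-page link graph is made up (dead-end pages would need extra handling):

```python
import numpy as np

def transition_matrix(adj, alpha=0.15):
    """Normalize rows of the link matrix, scale by (1 - alpha), add alpha/N (teleportation)."""
    A = np.asarray(adj, dtype=float)
    N = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)   # divide each 1 by the number of 1s in its row
    return (1 - alpha) * P + alpha / N     # teleportation makes the chain ergodic

def pagerank(P, iters=100):
    """Power iteration: the long-term visit rate is the steady-state distribution of P."""
    x = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        x = x @ P
    return x

adj = [[0, 1, 1],     # page 0 links to pages 1 and 2
       [1, 0, 1],     # page 1 links to pages 0 and 2
       [1, 0, 0]]     # page 2 links back to page 0 (no dead ends in this toy graph)
P = transition_matrix(adj)
print(pagerank(P).round(3))   # steady-state visit rates, sums to 1
```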
What are the 2 conditions of Ergodic Markov Chain? What is the Theorem?
- Irreducibility. Roughly: there is a path from any page to any other page
- Aperiodicity. Roughly: The pages cannot be partitioned such that the random walker visits the partitions sequentially
- Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state (steady-state probability distribution)
PageRank issues (2)
- Real surfers are not random surfers (back button, bookmarks, search are nonrandom)
- Simple PageRank produces bad results for many pages
What are 2 types of relevance? What makes them good?
- Relevance type 1: Hubs = a hub page has a good list of links to pages answering the information need (outlinks)
- Relevance type 2: Authorities = an authority page is a direct answer to the information need (inlinks)
- A good hub page for a topic links to many authority pages for that topic
- A good authority page for a topic is linked to by many hub pages for that topic
Steps for Hyperlink-induced topic search (HITS)
Compute for each page d in the base set a hub score h(d) and an authority score a(d)
1. Initialization: for all d: h(d) = 1, a(d) = 1
2. Iteratively update: h(d) = sum of a(y) over pages y that d links to (outlinks); a(d) = sum of h(y) over pages y that link to d (inlinks)
3. After convergence:
* Output pages with highest h scores as top hubs
* Output pages with highest a scores as top authorities
* So we output two ranked lists
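A sketch of the HITS iteration with per-step normalization (the scores only matter relatively); the 4-page adjacency matrix is made up:

```python
import numpy as np

def hits(adj, iters=50):
    """HITS: h(d) = sum of a over pages d links to; a(d) = sum of h over pages linking to d."""
    A = np.asarray(adj, dtype=float)     # A[i][j] = 1 if page i links to page j
    n = A.shape[0]
    h = np.ones(n)                       # 1. initialization: h(d) = 1, a(d) = 1
    a = np.ones(n)
    for _ in range(iters):               # 2. iterative updates
        a = A.T @ h                      #    authority: sum hub scores of in-linking pages
        h = A @ a                        #    hub: sum authority scores of linked-to pages
        a /= np.linalg.norm(a)           #    normalize so the scores don't blow up
        h /= np.linalg.norm(h)
    return h, a                          # 3. top hubs = highest h, top authorities = highest a

adj = [[0, 1, 1, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 0]]
h, a = hits(adj)
print(h.round(3), a.round(3))
```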
PageRank vs. HITS
- PageRank can be precomputed, HITS has to be computed at query time (sometimes too expensive)
- They make two different design choices concerning (i) the eigenproblem formalization (ii) the set of pages to apply the formalization to
On the Web, a good hub almost always is also a good authority. The actual difference between PageRank and HITS ranking is not as large as one might expect
What is Information Retrieval?
Retrieving (getting back) information that is relevant to a specific query