CZ4034 Flashcards

1
Q

What is a term-document incidence matrix?
Limitation? How to improve?

A

One-hot encoding or basically a 0/1 vector for each term/word
ex. “Anthony” - 110100 (appears in doc 1, 2, 4)

The term-document matrix will be extremely sparse (far more 0s than 1s)
Better representation: only record the 1s

2
Q

Structured vs. Unstructured Data

A

Structured data tends to refer to information in “tables”
Unstructured data is information that doesn’t lend itself to the kind of table formatting required by a relational database. Can be textual or non-textual (such as audio, video, and images)

3
Q

What is an inverted index?

A

For each term t, store a list of all documents (docID) containing t

Ex. “Caesar” - [2 4 5 6 16 57] (postings)
if “Caesar” in new doc 14, must add 14 between 6 and 16 (LIST MUST BE SORTED)
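
A minimal Python sketch of building an inverted index as term -> sorted postings list; the three toy documents are made up for illustration:

from collections import defaultdict

docs = {1: "caesar was ambitious", 2: "brutus killed caesar", 3: "brutus was noble"}

index = defaultdict(list)
for doc_id in sorted(docs):                  # visit docs in docID order so postings stay sorted
    for term in set(docs[doc_id].split()):   # set(): record each (term, doc) pair only once
        index[term].append(doc_id)

print(index["caesar"])   # [1, 2]
print(index["brutus"])   # [2, 3]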

4
Q

Query Optimization: What is the best order for query processing?

A

Process terms in order of increasing frequency
for AND, intersect the smallest postings list first (put it on the left / execute it first), because anything not in that smallest set can be skipped in the later intersections
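
A sketch of AND processing that intersects sorted postings lists starting from the shortest (rarest term); the example lists are made up:

def intersect(p1, p2):
    # standard merge of two sorted postings lists
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(postings_lists):
    # process in order of increasing frequency (shortest postings list first)
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect(result, plist)
    return result

print(and_query([[2, 4, 8, 16, 32], [1, 2, 3, 5, 8, 13], [2, 8]]))  # [2, 8]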

5
Q

What is Biword Index?
Ex. “Friends, Romans, Countrymen”

A

Solution 1 to Phrase Queries:
Index every consecutive pair of terms in text as a phrase
The example will generate biwords:
“friends romans” and “romans countrymen”

6
Q

What is Positional Index?

A

Solution 2 to Phrase Queries:
In the postings, store (for each term) the position(s) in which tokens of it appear

<term: doc1: pos1, pos2 …; doc2: pos1, pos2 …;>

angels: 2: {35, 174}; 4: {12, 22}; 7: {17};
fools: 1: {1, 17, 74}; 4: {8, 78};

7
Q

How to process a phrase query using positional index? (2)
ex. “to be or not to be”

A
  1. Extract the inverted index entries for each distinct term: to, be, or, not
  2. Merge their doc:position list to enumerate all positions with “to be or not to be”
8
Q

Problem with positional index? Despite this, why is it still standardly used?

A
  • Expands postings storage substantially
  • Standardly used because of power and usefulness of phrase and proximity queries
9
Q

Lossless vs. Lossy compression?
What compression are the following steps: case folding, stopwords removal, stemming?

A

Lossless = all info is preserved
Lossy = some info discarded (case folding, stopwords removal, stemming)

10
Q

Formula for Heaps’ Law?
What does the log-log plot of M vs. T suggest?

A

M = kT^b
M is size of vocabulary and T is # of tokens in the collection

The plot suggests that over the initial docs, the vocabulary size increases quickly. However, the more docs are added, the fewer new distinct terms are added to the vocabulary (growth slows down)

11
Q

Zipf’s Law and consequences?

A

the ith most frequent term has frequency proportional to 1/i

  • given a natural language corpus, the freq of any word is inversely proportional to its rank in the freq table
  • thus the most freq word will occur approximately 2x as often as the 2nd most freq word, 3x as often as the 3rd most freq word, etc
12
Q

What are fixed-width terms in dictionary? Why is it wasteful?

A

Reserving a fixed-width array entry for each term/word

Wasteful because most terms do not need that much space (e.g. one-letter words), while some super-long words may need even more space

13
Q

Explain what is “Dictionary as string with pointers to every term”. Problem?

A

Storing dictionary as long string of characters (instead of table). Pointer to next word shows end of current word.
But difficult to find those words in the dictionary

14
Q

What is blocking in dictionary?

A

Storing dictionary as long string of characters
where a pointer is stored only to every kth term (term lengths are stored to separate the different entries within a block)
Ex. 7systile9syzygetic8syzygial…

15
Q

What is blocking+front coding in dictionary?

A

In addition to blocking, stores consecutive entries that share a common prefix
Ex. 8automata8automate9automatic10automation becomes 8automat*a1<>e2<>ic3<>ion

16
Q

What is Postings Compression?

A

Storing gaps between docIDs instead of the docIDs themselves, because gaps need fewer bits

Ex. docIDs 283042, 283043 can be stored as 283042 followed by a gap of 1
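
A small sketch of turning a postings list into gaps and back; a variable-length encoding would then be applied to the small gap values:

def to_gaps(postings):
    # the first entry stays absolute, the rest become differences
    return [postings[0]] + [postings[i] - postings[i - 1] for i in range(1, len(postings))]

def from_gaps(gaps):
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

p = [283042, 283043, 283154, 283202]
print(to_gaps(p))             # [283042, 1, 111, 48]
print(from_gaps(to_gaps(p)))  # back to the original postings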

17
Q

What is Tokenization?
Give example of language issues

A

Breaking down a sequence of text into smaller parts (token = an instance of a sequence of characters)

French: L’ensemble (1 token or 2?)
German: noun compounds are not segmented
Chinese/Japanese: no space
Arabic: written right to left, but with numbers left to right

18
Q

What are Stopwords?

A

Words to exclude from dictionary, the commonest words that have little semantic content
ex. the, a, and, to, be

19
Q

Give examples of cases where stopwords should not be removed (3)

A
  1. Phrase queries (King of Denmark)
  2. Song titles.. etc (Let it be, To be or not to be)
  3. Relational queries (flights to London)
20
Q

What is Normalization? Examples

A

A mapping that maps all the possible spelling permutations to the gold-standard form that you are going to save in the dictionary

“Normalize” words in the indexed text as well as query words into the same form
Define equivalence classes of terms:
- match U.S.A. and USA -> USA
- match résumé and resume (often best to normalize to a de-accented term)
- match 7月30日 and 7/30

21
Q

What is Casefolding? Issues?

A

Reduce all letters to lowercase

It is often best to lowercase everything but there may be some exceptions
ex. WHO vs who, Polish vs. polish

22
Q

What is Lemmatization?

A

A type of normalization that does a ‘proper’ reduction to dictionary headword / base form

Ex. am, are, is -> be
car, cars, car’s, cars’ -> car

23
Q

What is Stemming? Problem?

A

Crude affix chopping (reduce terms to their ‘roots’ before indexing) (language dependent)
Ex. automates, automatic, automation -> automat

It can completely flip meaning/polarity
Ex. stuffed-> stuff, slimy-> slim, flourish-> flour

24
Q

Lemmatization vs. Stemming when to use which

A

In general, if accuracy is still high enough, use stemming (very powerful, reduces the dictionary much more)

If stemming not good for analysis (in cases where POS tag is important), use lemmatization

25
Q

Lemmatization Pro and Con

A

+ improves concept extraction through text normalization (noun/verb inflection elimination)
Ex. eats_burger, ate_burger, eat_burgers, eat_the_burger
- may lead to misunderstanding
Ex. park -> parking, develop -> developing/developed, kick_ball -> kick_balls

26
Q

What is the most popular algorithm for stemming English?

A

Porter’s algorithm (set of rules to replace end of a word with sth else) (example in tutorial)

doesn’t work all the time but at least as good as other stemming options
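
For illustration, NLTK ships an implementation of Porter's algorithm (assuming the nltk package is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["automates", "automatic", "automation", "flourish", "slimy"]:
    # prints each word next to its Porter stem (exact stems depend on the rules)
    print(w, "->", stemmer.stem(w))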

27
Q

What are Wild-card queries?
How to retrieve (data structure)?

A

Queries where you don’t get a specific word/phrase from the user; instead you get a query containing *
Ex. mon* -> find all docs with word beginning with “mon”

Easy retrieval with binary tree (or B-tree)

28
Q

What is Permuterm index?

A

An easy way to handle wildcard queries by adding rotations
Ex. hello ->
hello$, ello$h, llo$he, lo$hel, o$hell
(will help find * in front, end, middle of text)
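
A sketch that generates the permuterm rotations of a term; a wildcard query such as m*n would be rotated so the * ends up last (n$m*) and matched against these keys:

def permuterm_rotations(term):
    s = term + "$"                           # $ marks the end of the term
    return [s[i:] + s[:i] for i in range(len(s))]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']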

29
Q

Problems of permuterm query processing (2)

A
  1. Quadruples lexicon size (more memory)
  2. Takes a long time (expensive query execution)
30
Q

2 Principal Uses of Spell Correction

A
  • correcting documents being indexed (mapping between correct word in dictionary and misspelled word in document is done correctly)
  • correcting user queries to retrieve “right” answers (whatever you do on dictionary, you must do on query)
31
Q

2 Main Flavors of Spell Correction

A
  1. Isolated word
    calculate distance between any misspelled word with a correct word to decide what should replace it
  2. Context-sensitive
    look at surrounding words to decide if word is correct or not (statistics, machine learning)
32
Q

3 disadvantages of spell correction and suggest an alternative

A
  1. Retrieving documents indexed by correct spelling is computationally expensive
  2. May lead to wrong normalization
  3. May produce too many alternatives

Instead, can suggest alternative queries with correct spelling “Did you mean: ….”
Only search if user clicks on the alternative query

33
Q

Edit Distance (Levenshtein)

A

the minimum # of operations (insert, delete, replace, transposition) to convert one string to another
ex. From dof to dog is 1
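
A standard dynamic-programming sketch of Levenshtein distance (insert, delete, replace; handling transpositions would need the Damerau extension):

def edit_distance(s, t):
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        dp[i][0] = i                          # delete all of s[:i]
    for j in range(len(t) + 1):
        dp[0][j] = j                          # insert all of t[:j]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # replacement (or match)
    return dp[len(s)][len(t)]

print(edit_distance("dof", "dog"))  # 1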

34
Q

Weighted edit distance

A

Adding weights to common mistakes
(OCR errors, keyboard errors, etc.)
Replacing certain letters may carry a higher cost than replacing others

35
Q

Problems with Edit Distance. Suggestion

A

Expensive and Slow to compute edit distance to every dictionary term

Use n-gram overlap (Ex. trigrams)
november - (nov, ove, vem, emb, mbe, ber)
december - (dec, ece, cem, emb, mbe, ber)
3/6 of terms overlap -> normalize using Jaccard coefficient
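
A sketch of the trigram overlap above with Jaccard normalization over the two sets of character trigrams:

def trigrams(word):
    return {word[i:i + 3] for i in range(len(word) - 2)}

def jaccard(a, b):
    # similarity = |intersection| / |union|
    return len(a & b) / len(a | b)

nov, dec = trigrams("november"), trigrams("december")
print(sorted(nov & dec))            # ['ber', 'emb', 'mbe']
print(round(jaccard(nov, dec), 2))  # 3 shared trigrams out of 9 distinct ones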

36
Q

Jaccard distance vs. coefficient

A

Jaccard distance (D) = how different 2 sets are
Jaccard coefficient (J) = how similar 2 sets are
they always add up to 1 (or 100%)

37
Q

What is Soundex? Is it useful in IR?

A

Translate any word/token into a 4-character reduced form

Not very useful for IR (many words that are different end up having the same code)

38
Q

Pro and Cons of Boolean Search

A

+ Good for expert users with precise understanding of their needs and the collection
- Not good for majority of users as they may be incapable of writing Boolean queries (may be too much work)
- Either too few or too many results (and not ranked on relevancy)

39
Q

What is the meaning of Score in ranked retrieval?

A

A number between [0,1] that measures how well document and query ‘match’

40
Q

Issues with Jaccard for scoring

A
  • It doesn’t consider term frequency
  • Rare terms in a collection are more informative than frequent terms but Jaccard doesn’t consider document frequency
41
Q

What is Term Frequency tf_t,d?

A

How many times a term t occurs in a document d

42
Q

What is log frequency weight of term t in d?

A

if tf_t,d > 0, w_t,d = 1 + log tf_t,d
and 0 otherwise

used to dampen growth of number of occurrences

43
Q

What is Document Frequency df_t?

A

The number of documents that contain term t (measure of popularity)

Frequent terms (ex. stopwords) are less informative than rare terms
We want a high weight for rare terms as they are likely to be relevant to query

44
Q

What is idf (inverse document frequency)? Does it affect the ranking of docs for queries with 1 term?

A

idf_t = log (N/df_t)
log is used to dampen effect of idf
idf affects the ranking of documents for queries with at least 2 terms

45
Q

What is collection frequency of t?

A

The number of occurrences of t in the collection, counting multiple occurrences

46
Q

What is tf-idf weighting and its score?

A

w_t,d = (1+log tf_t,d) x log(N/df_t)

Score(q,d) = sum tf.idf_t,d over terms t present in both the query q and the document d

Score will need to be normalized so that it’s between [0,1]
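
A minimal sketch of log-tf x idf weighting and the resulting (unnormalized) query-document score; the counts and collection size are made up:

import math

def tfidf_weight(tf, df, N):
    # w_t,d = (1 + log10 tf) * log10(N / df), and 0 if the term is absent
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

N = 1_000_000                        # documents in the collection
doc_tf = {"car": 3, "insurance": 2}  # term frequencies in one document
df = {"car": 10_000, "insurance": 1_000, "best": 50_000}

query = ["best", "car", "insurance"]
score = sum(tfidf_weight(doc_tf.get(t, 0), df[t], N) for t in query)
print(round(score, 3))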

47
Q

What is the Vector Space Model?
What are the axes?
How to get proximity?

A

A model that represents documents and queries as vectors in high-dimensional space
Terms are axes of the space
Rank documents according to their (angle) proximity to the query in this space

48
Q

What is the cosine similarity for query q and document d?

A

The dot product of length-normalized vectors q and d:
cos(q,d) = (q · d) / (|q| |d|)
numerator = dot product of q and d
denominator = length (norm) of q multiplied by length (norm) of d
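
A numpy sketch of the cosine computation, assuming q and d are already tf-idf vectors over the same term axes:

import numpy as np

def cosine(q, d):
    # dot product of length-normalized vectors
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

q = np.array([1.0, 0.0, 2.0])
d = np.array([2.0, 1.0, 2.0])
print(round(cosine(q, d), 3))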

49
Q

Vector Space Ranking steps (5)

A
  1. Represent the query as a weighted tf-idf vector
  2. Represent each document as a weighted tf-idf vector
  3. Compute the cosine similarity score for the query vector and each document vector
  4. Rank documents with respect to the query by score
  5. Return the top K (e.g., K = 10) to the user
50
Q

Limitations of Vector Space Model (2)

A
  1. The vector space model tells us what is similar to what BUT doesn’t tell us what is what
  2. Only implements one kind of reasoning (analogical reasoning) but there are many other kinds of reasoning (e.g., inductive and deductive reasoning)
51
Q

Common problems of vector space and how to solve

A

Sparsity; the matrix becomes too big or too tall
Solution: Dimensionality Reduction (feature selection, TSVD, random projections)

52
Q

What is truncated singular value decomposition TSVD?

A

Factorize the matrix (SVD)
Discard some of the smallest singular values (start from the bottom of the singular-value matrix, as it is ranked by importance)
Rebuild the matrix

Note that we are projecting original space into smaller one where relative distances/angles are preserved
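
A numpy sketch of truncated SVD: factorize, keep only the k largest singular values, and rebuild a low-rank approximation (the matrix here is random, purely for illustration):

import numpy as np

X = np.random.rand(6, 5)                   # toy term-document style matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                      # keep the 2 largest singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(X.shape, X_k.shape)                  # same shape, but X_k has rank <= k
print(np.linalg.matrix_rank(X_k))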

53
Q

What are the 4 ways for optimized cosine ranking?

A

To improve top K document retrieval:
1. Index Elimination
2. Champion Lists
3. Static Quality Scores
4. Cluster Pruning

54
Q

What is Index Elimination? Explain the 2 types

A

Before starting dot product calculation, decide which documents to do dot product with (docs containing at least one query term)
or advanced:
a.) only consider high-idf query terms (don’t consider stop words as they contribute little to the scores)
b.) only consider docs containing many query terms = only compute scores for docs containing several of the query terms (ex. at least 3 out of 4) -> the docs will have high cosine score

55
Q

What are champion lists? Advantage

A

Precompute for each term t in the dictionary, the r docs of highest weight (most relevant) in t’s postings
= only consider docs with high weights for each term

This is done once at index build time, prior to any query (will save a lot of time at query time)

56
Q

Problem of champion list

A

Since r must be pre-determined, it may turn out that r < K: e.g. K is set to 10 (we want to present the user the top 10 relevant documents), but the champion lists only yield 9 candidate documents

57
Q

What are examples of static quality scores (2)?

A
  1. Relevance (modeled by cosine scores)
  2. Authority (query-independent property: paper with many citations/twitter account with many followers)

We can model authority (assign a quality score in [0,1])

58
Q

What is net score?

A

net-score(q,d) = g(d) + cosine(q,d)
where g(d) is authority and cosine(q,d) is relevance
other linear combinations are possible

59
Q

What are high and low lists?

A

For each term, maintain 2 postings lists called high (champion list) and low
When traversing postings on a query, only traverse high lists first
- If we get more than K docs, select the top K and stop
- Else proceed to get docs from the low lists (docs with less count in case r < K)

60
Q

What is cluster pruning?

A
  1. Preprocessing: pick a set of leader documents (randomly or using heuristics); every other doc becomes a follower of its nearest leader
  2. Given query Q, find its nearest leader L
  3. Seek K nearest docs from among L’s followers
    only need to consider the leader and its followers
61
Q

What are field/parametric indexes?

A

Postings for each field value to allow user to search for specific parameters of metadata
(similar to how you search folders)

A metadata may contain several fields (ex. author, title, language, year of publication, etc)

62
Q

What are zone indexes?

A

A zone is a region of the doc that can contain an arbitrary amount of text (ex. title, abstract, body, references)

Build inverted indexes on zones as well to allow user to query for a specific zone

term.abstract / term.title / term.author

63
Q

What are tiered indexes?

A

Break postings up into a hierarchy of lists (most important (tier 1),…,least important)

If not enough documents retrieved from tier 1, go down each tier

e.g. tier 1 could contain docs matching 3 of 4 query terms,
tier 2 docs matching 2 of 4, tier 3 docs matching 1 of 4, etc.

64
Q

What are query parsers?

A
  1. Run the query as a phrase query (‘rising interest rates’)
  2. If <K docs contain the phrase, split into smaller phrase queries and run (‘rising interest’, ‘interest rates’)
  3. If we still have <K docs, run the vector space query
  4. Rank matching docs by vector space scoring
65
Q

What is precision?

A

Fraction of retrieved docs that are relevant P(relevant|retrieved) = tp/(tp+fp)

66
Q

What is recall?

A

Fraction of relevant docs that are retrieved P(retrieved|relevant) = tp/(tp+fn)

67
Q

What is F measure (weighted harmonic mean)?
(unranked retrieval evaluation)

A

Combined measure that assesses precision/recall tradeoff (sweet spot)
F1 = 2PR / (P+R)

68
Q

What is Precision@K (P@K)?
(ranked retrieval evaluation)

A

Set a rank threshold K
Compute % relevant in top K
Ignores documents ranked lower than K

example. relevant [yes, no, yes, no, yes]
P@1 = 1, P@2 = 1/2, P@3 = 2/3
P@4 = 2/4, P@5 = 3/5

69
Q

What is MAP (Mean Average Precision)? Assumption?
(ranked retrieval evaluation)

A

Average precision = average of the P@K values at the ranks K where relevant documents appear
MAP = average precision across multiple queries/rankings
MAP is macro-averaging: each query counts equally
MAP assumes user is interested in finding many relevant documents for each query
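
A sketch of P@K and average precision for a single ranked list, reusing the yes/no example from the previous card (MAP would then average the AP values over queries):

def precision_at_k(rels, k):
    return sum(rels[:k]) / k

def average_precision(rels):
    # average of P@K at the ranks K where a relevant doc appears
    hits = [precision_at_k(rels, k) for k in range(1, len(rels) + 1) if rels[k - 1] == 1]
    return sum(hits) / len(hits) if hits else 0.0

rels = [1, 0, 1, 0, 1]                    # relevant, not, relevant, not, relevant
print([round(precision_at_k(rels, k), 2) for k in (1, 2, 3, 4, 5)])
print(round(average_precision(rels), 2))  # (1 + 2/3 + 3/5) / 3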

70
Q

What is Discounted Cumulative Gain?
(ranked retrieval evaluation)

A

Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks

DCG_p = r1 + r2/log 2 + r3/log 3 + … + rp/log p
= r1 + sum from i=2 to p of (ri / log i), with log base 2 commonly used

71
Q

What is Normalized Discounted Cumulative Gain (nDCG) at rank n?
(ranked retrieval evaluation)

A

nDCG = DCG/IDCG at rank n

Ideal ranking (IDCG) is sorting the documents starting from highest relevance level, then calculating the DCG
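
A sketch computing DCG and nDCG from graded relevance scores, using the log2(i) discount from rank 2 onward (the ranking is made up):

import math

def dcg(rels):
    # r1 + sum over i >= 2 of r_i / log2(i)
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

def ndcg(rels):
    ideal = sorted(rels, reverse=True)     # ideal ranking: highest relevance first
    return dcg(rels) / dcg(ideal)

ranking = [3, 2, 3, 0, 1, 2]
print(round(dcg(ranking), 3), round(ndcg(ranking), 3))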

72
Q

2 problems with the rank-based measures?

A
  1. Need annotators (humans)
    humans are expensive, inconsistent, decaying in value as documents/query mix evolves, not always representative of “real users”
  2. Recall is difficult to measure on the web
73
Q

What is Inter-judge (dis)agreement?

A

Have more than one person labeling the data and calculate what is the agreement between the two using Kappa Score

74
Q

What is the formula for Kappa measure?

A

Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]
where P(A) is proportion of time judges agree and P(E) is what agreement would be by chance
P(E) = P(nonrelevant)^2+P(relevant)^2

Kappa = 1 for total agreement, 0 for chance agreement
- Above 0.8 is considered good agreement

75
Q

What is relevance feedback?

A

User feedback on relevance of docs in an initial set of results:
– User issues a (short, simple) query
– The user marks some results as relevant or non-relevant
– The system computes a better representation of the information need based on the feedback
– Relevance feedback can go through one or more iterations

76
Q

What is the Rocchio algorithm? Formula

A

It uses the vector space model to pick a relevance feedback query.
New (opt) query moves toward relevant documents (positive feedback) and away from irrelevant documents (negative feedback)

qm = alpha x q0 + beta x A - gamma x B
A = averaged positive feedback vector
B = averaged negative feedback vector
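
A numpy sketch of the Rocchio update; the vectors and the alpha/beta/gamma values are assumptions for illustration:

import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    A = np.mean(relevant, axis=0)      # centroid of relevant (positive feedback) docs
    B = np.mean(nonrelevant, axis=0)   # centroid of non-relevant (negative feedback) docs
    qm = alpha * q0 + beta * A - gamma * B
    return np.maximum(qm, 0)           # negative term weights are usually clipped to 0

q0 = np.array([1.0, 0.0, 1.0, 0.0])
rel = np.array([[1.0, 1.0, 0.0, 0.0], [1.0, 0.5, 0.5, 0.0]])
nonrel = np.array([[0.0, 0.0, 1.0, 1.0]])
print(rocchio(q0, rel, nonrel))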

77
Q

What is relevance feedback useful for?

A

Relevance feedback is most useful for increasing recall in situations where recall is important
Positive feedback is more valuable than negative (set gamma < beta)

78
Q

Assumptions of relevance feedback Rocchio algorithm (2)

A

A1: User has sufficient knowledge for initial query

A2: Relevance prototypes are “well-behaved”
- Term distribution in relevant documents will be similar
- Term distribution in non-relevant documents will be different from those in relevant documents

79
Q

4 problems of Rocchio algorithm (Relevance Feedback)

A
  1. 2 big assumptions A1 and A2
  2. requires a lot of time from user to select relevant documents. users are often reluctant to provide explicit feedback
  3. natural language is ambiguous (“break” has multiple meanings)
  4. inefficient: high cost, partial solution

users would often prefer to revise and resubmit the query rather than give relevance feedback every time

80
Q

What is Thesaurus-based query expansion? (global method = using external source)

A

For each term, t, in a query, expand the query with synonyms and related words of t from the thesaurus

  • Generally increases recall but may decrease precision
  • A high cost of manually producing a thesaurus
81
Q

Supervised vs. Unsupervised learning

A

Supervised learning uses labeled training data, and unsupervised learning does not

82
Q

Naïve Bayes classifier
1. assumption
2. P(cj)
3. P(xi|cj) with smoothing

A
  1. conditional independence assumption: Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities
    P(X1, X2, X3|C) = P(X1|C) x P(X2|C) x P(X3|C)
  2. P(cj) = N(C=cj) / N = # of docs belonging to class cj / # of docs in training data
  3. P(xi|cj) = [N(Xi=xi, C=cj)+1] / [N(C=cj)+|V|] = [# of occurrences of word xi in docs belonging to class cj + 1] / [# of words in docs belonging to class cj + size of vocab]
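
A small multinomial Naive Bayes sketch with add-1 smoothing that follows the formulas above; the tiny training set is made up:

import math
from collections import Counter, defaultdict

train = [("chinese beijing chinese", "c"), ("chinese chinese shanghai", "c"),
         ("chinese macao", "c"), ("tokyo japan chinese", "j")]

docs_per_class = Counter(c for _, c in train)
word_counts = defaultdict(Counter)            # word_counts[c][w] = count of w in class c
for text, c in train:
    word_counts[c].update(text.split())

vocab = {w for text, _ in train for w in text.split()}
N = len(train)

def score(text, c):
    # log P(c) + sum of log P(x_i | c) with add-1 (Laplace) smoothing
    total_words_in_c = sum(word_counts[c].values())
    s = math.log(docs_per_class[c] / N)
    for w in text.split():
        s += math.log((word_counts[c][w] + 1) / (total_words_in_c + len(vocab)))
    return s

test = "chinese chinese chinese tokyo japan"
print(max(docs_per_class, key=lambda c: score(test, c)))  # predicted class
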
83
Q

Naive Bayes Pros and Cons (3 pros, 4 cons)

A

Advantages:
– Fast learning
– Simple, easy to implement
– No curse of dimensionality

Disadvantages:
– Can perform poorly if assumptions do not hold
– Maximum likelihood can overfit data
– It is not semantics-aware
– Word order not taken into account

84
Q

What is feature selection? How? 2 advantages?

A
  • Select a subset of ‘relevant’ terms in the vocabulary (or features)
  • Hypothesis testing statistics using chi square
  • Reduces training time and can improve generalization (eliminate noise feature, avoids overfitting)
85
Q
  1. What is x2 (chi-square) test?
  2. What does it mean when you reject the hypothesis?
A
  1. used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories
  2. 0.999 (99.9%) confidence (chi-square value > 10.83) that the class and the term are dependent, hence the term should be helpful as a feature
86
Q

Formula for 2x2 chi-squared

A

[ N x (AD-CB)^2 ] / [ (A+C) x (B+D) x (A+B) x (C+D) ]
A=#(t,c) , B=#(t,!c), C=#(!t,c) , D=#(!t,!c)
N = A+B+C+D
t = term and c = class

87
Q

Vector Space Classification vs. kNN

A

Vector space classification = classify by choosing the nearest centroid
kNN = compute distance/similarity between test and all training points. assign the class of the closest/most similar point

88
Q

4 advantages 3 disadvantages of kNN

A

+ No feature selection necessary
+ Scales well with large number of classes
+ No training necessary
+ Most cases more accurate than NB
- Small changes to one class can have ripple effect
- Scores can be hard to convert to probabilities
- May be expensive at test time

89
Q

kNN vs. NB

A

kNN high variance, low bias (memorizes, will always say ‘no’ to new object) (non-linear problem)

NB low variance, high bias (lazy, will say ‘yes’ if object has certain feature) (due to assumption, linear separation)

90
Q

What is support vector machine? What is the goal? Limitation?

A

Mapping original feature space to some higher-dimension where training set is separable by a hyperplane

Maximize the margin around the separating hyperplane (between the 2 classes)

Binary classification algorithm

91
Q

What is concept drift?
One measure of text classification system?
Is feature selection good in protecting against concept drift?

A

Categories change over time (a feature may be good back then, but bad today)
Good text classification system = how well it protects against concept drift
Feature selection NOT good in protecting

92
Q

3 Issues with RNN (Recurrent Neural Networks)

A
  • Struggle to handle long sequences of text (e.g. by the time they reach the end of a paragraph, they have forgotten the beginning)
  • Hard to train (exploding gradient problem)
  • Hard to parallelize as they process words sequentially (cannot speed up training by more GPUs)
93
Q

2 Advantages of Transformers over RNN

A
  • Transformers can be very efficiently parallelized (can train big models)
  • They have attention, self-attention, positional encoding (easily identify concepts / what is important)
94
Q

What is clustering?

A

Unsupervised learning process of grouping set of objects into classes of similar objects

95
Q

Hard vs. Soft clustering

A

Hard clustering: each doc belongs to exactly one cluster
Soft clustering: a doc can belong to more than one cluster

96
Q

K-means algorithm

A
  1. Select K random docs as seeds (initial centroids)
  2. Assign documents to each centroid to form clusters
  3. Recalculate centroid of each cluster
  4. Repeat step 2-3 until convergence or stopping criterion reached (centroid doesn’t change / fixed # iteration / etc)
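
A compact numpy sketch of these steps on random 2-D points; K, the seed and the stopping rule are assumptions for illustration:

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # 1. random docs as seeds
    for _ in range(iters):
        # 2. assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute the centroid of each cluster; 4. stop when centroids no longer move
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centroids = kmeans(X, K=2)
print(centroids)
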
97
Q

Issue with k-means (2)

A
  1. Need to select initial centroids using a good heuristic
  2. K number of clusters is unknown. Can have too little or too many clusters
98
Q

Given a clustering, what is the Benefit and Total Benefit and Clustering Value

A
  1. Benefit of a doc = cosine similarity to its centroid
  2. Total Benefit = sum of individual doc benefits
  3. Clustering value = Total Benefit - Total Cost where Total Cost is cost for each cluster * K
99
Q

2 Strengths of Hierarchical Clustering

A
  1. Do not have to assume number of clusters (any desired k can be obtained by ‘cutting’ dendrogram at that level)
  2. May correspond to meaningful taxonomies
100
Q

Agglomerative vs. Divisive clustering

A

Agglomerative starts with points as individual clusters. At each step, merge closest pair until only one cluster left

Divisive starts with one, all-inclusive cluster. At each step, split a cluster until each cluster contains a point

101
Q

Hierarchical Agglomerative Clustering (HAC) Algorithm

A
  1. Compute proximity matrix where each data point is a cluster
  2. Merge the two closest clusters and update proximity matrix
  3. Repeat 2 until only single cluster remains
102
Q

4 ways of defining closest pair of clusters

A
  1. Single-link (most cosine-similar)
  2. Complete-link (least cosine-similar)
  3. Centroid
  4. Average-link (average cosine between pairs of elements)
103
Q

Production of single-link and complete-link clustering

A

Single-link clustering often produces long, straggly clusters
Complete-link often causes outliers

104
Q

What is good clustering? (2)

A

High quality clusters with
1. High intra-class similarity
2. Low inter-class similarity

The measured quality of a clustering depends on both the document representation and the similarity measure used

105
Q

What is purity? Example: A cluster has 1 point in class A, 4 points in class B and 1 point in class C, what is the purity?

A

Ratio between the dominant class in the cluster and the size of the cluster

Purity = max(1,4,1)/6 = 4/6

106
Q

What is Rand index? Formula?

A

To compare the similarity of results between two different clustering methods
RI = (A+D) / (A+B+C+D)
where A is true positive (# of correctly identified pairs),
B is false positive (# of bad cluster pairs),
C is false negative (# of pairs that were supposed to be clustered together),
D is true negative

107
Q

What is discriminative labeling and non-discriminative labeling?

A

discriminative labeling: Find terms or phrases that distinguish a cluster from the other clusters (can use feature selection criteria like chi-square)

non-discriminative labeling: Select terms or phrases based solely on information from the cluster itself (ex. terms with high weights in centroids or alternatively use titles/use AI)

Non-discriminative methods sometimes select frequent terms that do not distinguish clusters

108
Q

What is Goto?

A

A paid search ranking in 1996
Search ranking depended on how much you paid, auction for keywords
Ads are ranked based on bids (highest bidder gets first rank)
Maximizes revenue but no relevance ranking

109
Q

Trouble with paid search ads (Goto)

A

Advertisers are only charged when someone clicks on their ad.
Hence if a user searches and sees an irrelevant ad, they won’t click it, but it is still free publicity for the advertiser

110
Q

What is Search Engine Optimization (SEO)?

A

“Tuning” your webpage to rank highly in the algorithmic search results for select keywords

111
Q

What is keyword stuffing? What is the variant?

A

Dense repetitions of chosen terms (sometimes in same color as background so that it won’t be visible to humans)

Variant: Meta-Tags (like Instagram hashtags), but people use them irresponsibly for free publicity

112
Q

What is cloaking and doorway pages?

A

Cloaking: serve fake content (a different page) to the search engine spider/bot and serve the true page to a human visitor
Doorway pages: pages optimized for a single keyword that redirect the visitor to the real target page

113
Q

How can user’s search query be categorized to satisfy their needs? (4)

A
  1. Informational (want to learn about sth)
  2. Navigational (want to go to that page)
  3. Transactional (want to do something, ex. shop or download)
  4. Gray areas
114
Q

What is fingerprint? What can it be used to detect?

A

Fingerprint = a succinct (say 64 bits) digest of the characters in a doc. map an arbitrarily large data to a much shorter bit string

When fingerprints of two web pages are equal, test if the pages are identical (duplicates)

115
Q

What is near-duplicate detection?

A

Detect near-duplicates (approximate match) by computing similarity using shingles (n-grams)
Similarity (= size of intersection / size of union) is estimated based on short sketch

116
Q

What any crawler must do (2)

A
  1. Be polite
    - explicit politeness: only crawl portion based on what webmasters specify in robots.txt
    - implicit politeness: even with no specification, avoid hitting any site too often
  2. Be robust: be immune to spider traps and other malicious behavior from web servers
117
Q

What is Google bomb

A

a search with “bad” results due to maliciously manipulated anchor text
ex. search for “evil empire” resulted in Microsoft homepage as top result

118
Q

What is Page Rank?

A

Long-term visit rate (pages visited more often are more important)
PR(u) = SUM PR(v)/Lv for all v in Bu
where Bu is set of pages pointing to u and Lv is number of outgoing links from page v (not counting duplicate links)

119
Q

What are the steps of transition probability matrix? Purpose of teleportation rate?

A
  1. If there is a hyperlink from page i to page j, then Aij = 1, otherwise Aij = 0
  2. Divide each 1 in A by the number of 1’s in its row
  3. Multiply the resulting matrix by 1 − α (α: teleporting rate)
  4. Add α/N to every entry

To prevent from getting stuck in a dead end for a well-defined random walk
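
A numpy sketch that builds the teleporting transition matrix from a small made-up link graph and finds the steady-state (PageRank) vector by power iteration:

import numpy as np

links = np.array([[0, 1, 1],      # adjacency: page i -> page j
                  [1, 0, 0],
                  [0, 1, 0]], dtype=float)
alpha = 0.15                      # teleporting rate (assumed value)
N = len(links)

P = links / links.sum(axis=1, keepdims=True)   # divide each 1 by the number of 1s in its row
P = (1 - alpha) * P + alpha / N                # multiply by 1 - alpha, add alpha/N everywhere

pr = np.full(N, 1.0 / N)                       # start from a uniform distribution
for _ in range(100):
    pr = pr @ P                                # one step of the random walk
print(pr)                                      # long-term visit rates (PageRank)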

120
Q

What are the 2 conditions of Ergodic Markov Chain? What is the Theorem?

A
  • Irreducibility. Roughly: there is a path from any page to any other page
  • Aperiodicity. Roughly: The pages cannot be partitioned such that the random walker visits the partitions sequentially
  • Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state (steady-state probability distribution)
121
Q

PageRank issues (2)

A
  • Real surfers are not random surfers (back button, bookmarks, search are nonrandom)
  • Simple PageRank produces bad results for many pages
122
Q

What are 2 types of relevance? What makes them good?

A
  • Relevance type 1: Hubs = A hub page has a good list of links to pages answering the information need (outlinks)
  • Relevance type 2: Authorities = An authority page is a direct answer to the information need (inlinks)
  • A good hub page for a topic links to many authority pages for that topic
  • A good authority page for a topic is linked to by many hub pages for that topic
123
Q

Steps for Hyperlink-induced topic search (HITS)

A

Compute for each page d in the base set a hub score h(d) and an authority score a(d)
1. Initialization: for all d: h(d) = 1, a(d) = 1
2. Iteratively update: h(d) = sum of a(y) over pages y that d links to (outlinks); a(d) = sum of h(y) over pages y that link to d (inlinks)
3. After convergence:
* Output pages with highest h scores as top hubs
* Output pages with highest a scores as top authorities
* So we output two ranked lists
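
A numpy sketch of the HITS updates on a tiny made-up adjacency matrix (A[i][j] = 1 if page i links to page j):

import numpy as np

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

h = np.ones(len(A))                 # hub scores
a = np.ones(len(A))                 # authority scores
for _ in range(50):
    a = A.T @ h                     # authority: sum of hub scores of pages linking in
    h = A @ a                       # hub: sum of authority scores of pages linked to
    a /= np.linalg.norm(a)          # normalize so the scores don't blow up
    h /= np.linalg.norm(h)

print("hubs:", np.round(h, 3))
print("authorities:", np.round(a, 3))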

124
Q

PageRank vs. HITS

A
  1. PageRank can be precomputed, HITS has to be computed at query time (sometimes too expensive)
  2. They make two different design choices concerning (i) the eigenproblem formalization (ii) the set of pages to apply the formalization to

On the Web, a good hub almost always is also a good authority. The actual difference between PageRank and HITS ranking is not as large as one might expect

125
Q

What is Information Retrieval?

A

Retrieving (getting back) information that is relevant to a specific query