CZ4034 Flashcards
What is term-document incidence matrix?
Limitation? How to improve?
One-hot encoding or basically a 0/1 vector for each term/word
ex. “Anthony” - 110100 (appears in doc 1, 2, 4)
The matrix will be extremely sparse (far more 0s than 1s)
Better representation: only record the 1s
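A minimal Python sketch of both views, using a hypothetical four-document collection:
```python
docs = ["anthony and cleopatra", "anthony is here", "caesar", "anthony again"]
vocab = sorted({t for d in docs for t in d.split()})

# Dense 0/1 incidence row per term (wasteful: mostly 0s in real collections).
incidence = {t: [1 if t in d.split() else 0 for d in docs] for t in vocab}
print(incidence["anthony"])   # [1, 1, 0, 1]

# Better: keep only the docIDs where the bit is 1.
postings = {t: [i for i, d in enumerate(docs, 1) if t in d.split()] for t in vocab}
print(postings["anthony"])    # [1, 2, 4]
```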
Structured vs. Unstructured Data
Structured data tends to refer to information in “tables”
Unstructured data is information that doesn’t lend itself to the kind of table formatting required by a relational database. Can be textual or non-textual (such as audio, video, and images)
What is inverted index?
For each term t, store a list of all documents (docID) containing t
Ex. “Caesar” - [2 4 5 6 16 57] (postings)
if “Caesar” in new doc 14, must add 14 between 6 and 16 (LIST MUST BE SORTED)
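A sketch of an inverted index that keeps each postings list sorted on insertion (toy docIDs from the example):
```python
import bisect
from collections import defaultdict

index = defaultdict(list)   # term -> sorted list of docIDs (postings)

def add(term, doc_id):
    postings = index[term]
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)   # keeps the list sorted

for d in [2, 4, 5, 6, 16, 57]:
    add("caesar", d)
add("caesar", 14)            # lands between 6 and 16 automatically
print(index["caesar"])       # [2, 4, 5, 6, 14, 16, 57]
```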
Query Optimization: What is the best order for query processing?
Process in order of increasing frequency
for AND, the smallest set should be on the left (executed first), because anything not in the smallest set can never appear in the result, so later merges are cheaper
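A sketch of the linear merge with smallest-first ordering (toy postings assumed):
```python
def intersect(p1, p2):
    """Linear-time merge of two sorted postings lists (AND)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def and_query(postings_lists):
    # Process in order of increasing length: every later merge
    # is bounded by the size of the smallest result so far.
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for p in ordered[1:]:
        result = intersect(result, p)
    return result

print(and_query([[2, 4, 8, 16, 32], [1, 2, 3, 5, 8, 13], [8, 16]]))  # [8]
```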
What is Biword Index?
Ex. “Friends, Romans, Countrymen”
Solution 1 to Phrase Queries:
Index every consecutive pair of terms in text as a phrase
The example will generate biwords:
“friends romans” and “romans countrymen”
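A small sketch generating biwords from the example text:
```python
def biwords(text):
    tokens = text.lower().replace(",", "").split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```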
What is Positional Index?
Solution 2 to Phrase Queries:
In the postings, store (for each term) the position(s) in which tokens of it appear
<term: doc1: pos1, pos2 …; doc2: pos1, pos2 …;>
angels: 2: {35, 174}; 4: {12, 22}; 7: {17};
fools: 1: {1, 17, 74}; 4: {8, 78};
How to process a phrase query using positional index? (2)
ex. “to be or not to be”
- Extract the inverted index entries for each distinct term: to, be, or, not
- Merge their doc:position lists to enumerate all positions of “to be or not to be”
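A sketch of the positional merge, assuming a toy index shaped like the card’s <term: doc: positions> layout:
```python
pos_index = {
    "to": {1: [1, 5], 4: [8, 16, 190]},
    "be": {1: [2, 6], 4: [17, 191]},
}

def phrase_docs(phrase, index):
    terms = phrase.split()
    docs = set.intersection(*(set(index[t]) for t in terms))  # AND on docIDs
    hits = []
    for d in docs:
        # A match starts at position p if term i appears at p + i.
        for p in index[terms[0]][d]:
            if all(p + i in index[t][d] for i, t in enumerate(terms)):
                hits.append(d)
                break
    return sorted(hits)

print(phrase_docs("to be", pos_index))  # [1, 4]
```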
Problem with positional index? Despite this, why is it still standardly used?
- Expands postings storage substantially
- Standardly used because of power and usefulness of phrase and proximity queries
Lossless vs. Lossy compression?
What compression are the following steps: case folding, stopwords removal, stemming?
Lossless = all info is preserved
Lossy = some info discarded (case folding, stopwords removal, stemming)
Formula for Heaps’ Law?
What does the log-log plot of M vs. T suggest?
M = kT^b
M is size of vocabulary and T is # of tokens in the collection
The plot is approximately a straight line (log M = b log T + log k): vocabulary grows quickly over the initial docs, but the more docs added, the less distinct vocabulary is added to the collection (growth slows)
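A sketch plugging numbers into Heaps’ law; k = 44 and b = 0.49 are the fit reported for Reuters-RCV1 in the Manning et al. IR textbook, used here purely as illustrative constants:
```python
k, b = 44, 0.49   # illustrative fit, not universal constants

for T in (10_000, 1_000_000, 100_000_000):
    M = k * T ** b   # log M = b*log T + log k: a straight line on log-log axes
    print(f"T = {T:>11,} tokens -> M ≈ {M:,.0f} distinct terms")
```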
Zipf’s Law and consequences?
the ith most frequent term has frequency proportional to 1/i
- given a natural language corpus, the freq of any word is inversely proportional to its rank in the freq table
- thus the most freq word will occur approximately 2x as often as the 2nd most freq word, 3x as often as the 3rd most freq word, etc
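A tiny sketch of the consequence, with a hypothetical top-term frequency:
```python
cf_1 = 60_000   # hypothetical frequency of the most frequent word
for i in range(1, 6):
    print(f"rank {i}: ~{cf_1 / i:,.0f} occurrences")
# rank 2 gets ~1/2 of rank 1's frequency, rank 3 ~1/3, and so on
```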
What are fixed-width terms in dictionary? Why is it wasteful?
Allocating a fixed-width array entry for each term/word
Wasteful because most terms don’t need that much space (like one-letter words), while some super-long words may need even more
Explain what is “Dictionary as string with pointers to every term”. Problem?
Storing the dictionary as one long string of characters (instead of a table). The pointer to the next word marks where the current word ends.
Problem: it is harder to locate a given word within the string
What is blocking in dictionary?
Storing dictionary as long string of characters
where a pointer is stored only to every kth term string (term lengths, stored with each term, separate the different entries of the dict)
Ex. 7systile9syzygetic8syzygial…
What is blocking+front coding in dictionary?
In addition to blocking, stores consecutive entries that share a common prefix
Ex. 8automata8automate9automatic10automation becomes 8automat*a1◊e2◊ic3◊ion
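A sketch of the front-coding step for the example block (the * and ◊ markers follow the textbook notation):
```python
import os

def front_code(block):
    """Front-code one dictionary block: spell out the shared prefix once,
    then store only each remaining term's extra suffix."""
    prefix = os.path.commonprefix(block)
    first = block[0]
    out = f"{len(first)}{prefix}*{first[len(prefix):]}"
    for term in block[1:]:
        suffix = term[len(prefix):]
        out += f"{len(suffix)}\u25ca{suffix}"   # ◊ marks the next entry
    return out

print(front_code(["automata", "automate", "automatic", "automation"]))
# 8automat*a1◊e2◊ic3◊ion
```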
What is Postings Compression?
Storing the gaps between successive docIDs instead of the docIDs themselves, because gaps need fewer bits
Ex. docIDs 283042, 283043 can be stored as 283042 followed by the gap 1
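A sketch of gap encoding and decoding (toy postings assumed):
```python
def to_gaps(doc_ids):
    """Store the first docID, then differences (gaps) between neighbours."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)   # running sum restores the original docIDs
    return out

postings = [283042, 283043, 283044, 283205]
gaps = to_gaps(postings)
print(gaps)                          # [283042, 1, 1, 161]
print(from_gaps(gaps) == postings)   # True
```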
What is Tokenization?
Give example of language issues
Breaking down a sequence of text into smaller parts (token = an instance of a sequence of characters)
French: L’ensemble (1 token or 2?)
German: noun compounds are not segmented
Chinese/Japanese: no space
Arabic: written right to left, but with numbers left to right
What are Stopwords?
Words to exclude from dictionary, the commonest words that have little semantic content
ex. the, a, and, to, be
Give examples of cases where stopwords should not be removed (3)
- Phrase queries (King of Denmark)
- Song titles.. etc (Let it be, To be or not to be)
- Relational queries (flights to London)
What is Normalization? Examples
A mapping from all the possible spelling variants of a word to a single canonical (standard) form that is saved in the dictionary
“Normalize” words in indexes text as well as query words into the same form
Define equivalence classes of terms:
- match U.S.A. and USA -> USA
- match résumé and resume (often best to normalize to a de-accented term)
- match 7月30日 and 7/30
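A sketch of simple equivalence-classing: strip periods and accents (uses only the standard library; the de-accenting relies on Unicode NFKD decomposition):
```python
import unicodedata

def normalize(term):
    term = term.replace(".", "")                      # U.S.A. -> USA
    decomposed = unicodedata.normalize("NFKD", term)  # é -> e + combining accent
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(normalize("U.S.A."))   # USA
print(normalize("résumé"))   # resume
```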
What is Casefolding? Issues?
Reduce all letters to lowercase
It is often best to lowercase everything but there may be some exceptions
ex. WHO vs who, Polish vs. polish
What is Lemmatization?
A type of normalization that does a ‘proper’ reduction to dictionary headword / base form
Ex. am, are, is -> be
car, cars, car’s, cars’ -> car
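A sketch with NLTK’s WordNet lemmatizer (assumes nltk is installed and the wordnet corpus has been downloaded via nltk.download("wordnet")):
```python
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("cars"))          # 'car'
print(lem.lemmatize("are", pos="v"))  # 'be' (needs the verb POS hint)
```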
What is Stemming? Problem?
Crude affix chopping (reduce terms to their ‘roots’ before indexing) (language dependent)
Ex. automates, automatic, automation -> automat
It can completely flip meaning/polarity
Ex. stuffed-> stuff, slimy-> slim, flourish-> flour
Lemmatization vs. Stemming when to use which
In general, if accuracy is still high enough, use stemming (because it is very powerful and reduces the dictionary much more)
If stemming not good for analysis (in cases where POS tag is important), use lemmatization
Lemmatization Pro and Con
+ improves concept extraction through text normalization (noun/verb inflection elimination)
Ex. eats_burger, ate_burger, eat_burgers, eat_the_burger
- may lead to misunderstanding when distinct forms are conflated
Ex. parking -> park, developing/developed -> develop, kick_balls -> kick_ball
What is the most popular algorithm for stemming English?
Porter’s algorithm (a set of rules to replace the end of a word with something else) (example in tutorial)
doesn’t work all the time but at least as good as other stemming options
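A sketch using NLTK’s PorterStemmer (assumes nltk is installed; exact stems can vary slightly by version, so they are printed rather than assumed):
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["automates", "automatic", "automation", "stuffed", "slimy"]:
    print(word, "->", stemmer.stem(word))
```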
What are Wild-card queries?
How to retrieve (data structure)?
Queries where you don’t get a specific word/phrase from the user; instead you get a query containing *
Ex. mon* -> find all docs with word beginning with “mon”
Easy retrieval with binary tree (or B-tree)
What is Permuterm index?
An easy way to handle wildcard queries by adding rotations
Ex. hello ->
hello$, ello$h, llo$he, lo$hel, o$hell, $hello
(will help find * in front, end, middle of text)
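A sketch of generating the rotations and turning a single-* query into a prefix lookup key:
```python
def permuterm_keys(term):
    """All rotations of term + '$' (the end-of-word marker)."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

print(permuterm_keys("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']

def wildcard_to_key(query):
    """Rotate a single-* query so the * sits at the end, then drop it;
    the result is a prefix to search for among the permuterm keys."""
    s = query + "$"
    star = s.index("*")
    return s[star + 1:] + s[:star]

print(wildcard_to_key("mon*"))   # '$mon' (prefix-match this in the index)
print(wildcard_to_key("*mon"))   # 'mon$'
print(wildcard_to_key("m*n"))    # 'n$m'
```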
Problems of permuterm query processing (2)
- Roughly quadruples the lexicon size (more memory)
- Takes a long time (expensive query execution)
2 Principal Uses of Spell Correction
- correcting documents being indexed (map each misspelled word in a document to the correct word in the dictionary)
- correcting user queries to retrieve “right” answers (whatever you do on dictionary, you must do on query)
2 Main Flavors of Spell Correction
- Isolated word: calculate the distance between a misspelled word and each correct word to decide what should replace it
- Context-sensitive: look at surrounding words to decide if a word is correct or not (statistics, machine learning)
3 disadvantages of spell correction and suggest an alternative
- Retrieving documents indexed by correct spelling is computationally expensive
- May lead to wrong normalization
- May produce too many alternatives
Instead, can suggest alternative queries with correct spelling “Did you mean: ….”
Only search if user clicks on the alternative query
Edit Distance (Levenshtein)
the minimum # of operations (insert, delete, replace; the Damerau variant also allows transposition) to convert one string to another
ex. From dof to dog is 1
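A sketch of the standard dynamic-programming Levenshtein (insert/delete/replace, unit costs):
```python
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i   # delete all of a's prefix
    for j in range(n + 1):
        d[0][j] = j   # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # replace / match
    return d[m][n]

print(edit_distance("dof", "dog"))   # 1
```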
Weighted edit distance
Adding weights to common mistakes
(OCR errors, Keyboard errors, etc)
The cost of replacing certain letters may have higher cost
Problems with Edit Distance. Suggestion
Computing the edit distance to every dictionary term is expensive and slow
Use n-gram overlap (Ex. trigrams)
november - (nov, ove, vem, emb, mbe, ber)
december - (dec, ece, cem, emb, mbe, ber)
3 of the 6 trigrams overlap -> normalize using the Jaccard coefficient: |X ∩ Y| / |X ∪ Y| = 3/9
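A sketch of the trigram-overlap computation for the example:
```python
def ngrams(word, n=3):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

a, b = ngrams("november"), ngrams("december")
print(sorted(a))              # ['ber', 'emb', 'mbe', 'nov', 'ove', 'vem']
print(len(a & b) / len(a | b))  # 3 shared / 9 total = 0.333...
```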
Jaccard distance vs. coefficient
Jaccard distance (D) = how different 2 sets are
Jaccard coefficient (J) = how similar 2 sets are
they always add up to 1 (or 100%): D = 1 - J
What is Soundex? Is it useful in IR?
Translate any word/token into a 4-character reduced form
Not very useful for IR (many words that are different end up having the same code)
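A simplified Soundex sketch (it skips the h/w adjacency rule of the full algorithm):
```python
CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word):
    word = word.lower()
    digits = [CODES.get(c, "0") for c in word]   # vowels/h/w/y -> '0'
    out = [word[0].upper()]                      # keep the first letter
    prev = digits[0]
    for d in digits[1:]:
        if d != "0" and d != prev:               # drop zeros and repeats
            out.append(d)
        prev = d
    return "".join(out)[:4].ljust(4, "0")        # pad/truncate to 4 chars

print(soundex("Herman"))   # H655
print(soundex("Hermann"))  # H655: different spellings, same code
```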
Pro and Cons of Boolean Search
+ Good for expert users with precise understanding of their needs and the collection
- Not good for majority of users as they may be incapable of writing Boolean queries (may be too much work)
- Either too few or too many results (and not ranked on relevancy)
What is the meaning of Score in ranked retrieval?
A number between [0,1] that measures how well document and query ‘match’
Issues with Jaccard for scoring
- It doesn’t consider term frequency
- Rare terms in a collection are more informative than frequent terms but Jaccard doesn’t consider document frequency
What is Term Frequency tf_t,d?
How many times a term t occurs in a document d
What is log frequency weight of term t in d?
if tf_t,d > 0, w_t,d = 1 + log tf_t,d
and 0 otherwise
used to dampen growth of number of occurrences
What is Document Frequency df_t?
The number of documents that contain term t (measure of popularity)
Frequent terms (ex. stopwords) are less informative than rare terms
We want a high weight for rare terms as they are likely to be relevant to query
What is idf (inverse document frequency)? Does it affect ranking of docs for queries with 1 term?
idf_t = log (N/df_t)
log is used to dampen effect of idf
idf affects the ranking of documents for queries with at least 2 terms
What is collection frequency of t?
The number of occurrences of t in the collection, counting multiple occurrences
What is tf-idf weighting and its score?
w_t,d = (1+log tf_t,d) x log(N/df_t)
Score(q,d) = sum tf.idf_t,d over terms t present in both the query q and the document d
Score will need to be normalized so that it’s between [0,1]
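A sketch of the weight and score with hypothetical counts (base-10 logs assumed, matching the formulas above):
```python
import math

def tfidf_weight(tf, df, N):
    """w_t,d = (1 + log10 tf) * log10(N / df); 0 when the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

def score(query_terms, doc_tf, df, N):
    """Sum tf-idf over terms present in both query and document."""
    return sum(tfidf_weight(doc_tf.get(t, 0), df[t], N) for t in query_terms)

# Hypothetical counts: N = 1,000 docs; 'caesar' is rarer than 'the'.
df = {"caesar": 10, "the": 900}
doc_tf = {"caesar": 3, "the": 20}
print(score(["caesar", "the"], doc_tf, df, N=1_000))
```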
What is the Vector Space Model?
What are the axes?
How to get proximity?
A model that represents documents and queries as vectors in high-dimensional space
Terms are axes of the space
Rank documents according to their (angle) proximity to the query in this space
What is the cosine similarity for query q and document d?
The dot product of length-normalized vectors q and d
cos(q,d) = (q · d) / (|q| |d|) = sum_i q_i d_i / (sqrt(sum_i q_i^2) x sqrt(sum_i d_i^2))
numerator = dot product of q and d; denominator = product of the vectors’ lengths (norms)
Vector Space Ranking steps (5)
- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top K (e.g., K = 10) to the user
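An end-to-end sketch of the five steps on a toy three-document collection (tf-idf weights, length normalization, cosine as a dot product; base-10 logs assumed):
```python
import math
from collections import Counter

docs = {1: "antony and cleopatra", 2: "the tempest", 3: "antony antony caesar"}
N = len(docs)
df = Counter(t for text in docs.values() for t in set(text.split()))

def weight(tf, df_t):
    return (1 + math.log10(tf)) * math.log10(N / df_t) if tf else 0.0

def tfidf_vector(text):
    tf = Counter(text.split())
    v = {t: weight(c, df[t]) for t, c in tf.items() if t in df}
    norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
    return {t: w / norm for t, w in v.items()}   # length-normalize

def rank(query, k=10):
    q = tfidf_vector(query)
    scores = {d: sum(q.get(t, 0) * w for t, w in v.items())  # cosine = dot product
              for d, v in ((d, tfidf_vector(text)) for d, text in docs.items())}
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

print(rank("antony caesar"))   # doc 3 ranks highest
```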
Limitations of Vector Space Model (2)
- The vector space model tells us what is similar to what BUT doesn’t tell us what is what
- Only implements one kind of reasoning (analogical reasoning) but there are many other kinds of reasoning (e.g., inductive and deductive reasoning)