Chapter 1 - Term by Document Matrix Flashcards
Define term by document matrix
A database of webpages containing various terms can be represented as a term by document matrix. The (i,j) entry is 0 if term i is not in document j. Otherwise it can be constructed so that:
(i,j) entry is 1 if term i occurs in document j, or (i,j) entry is the number of times term i occurs in document j
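A minimal sketch of both constructions, assuming documents are plain strings split on whitespace. The function name term_by_document, the sample terms, and the sample documents are made up for illustration:

```python
def term_by_document(terms, documents, counts=False):
    """Build a matrix as a list of rows: one row per term, one column per document."""
    matrix = []
    for term in terms:
        row = []
        for doc in documents:
            # Number of times the term occurs in this document (whitespace tokens).
            n = doc.lower().split().count(term)
            # Binary version stores 1 for "present"; count version stores n.
            row.append(n if counts else min(n, 1))
        matrix.append(row)
    return matrix

# Hypothetical sample data.
terms = ["matrix", "vector", "page"]
documents = [
    "matrix times vector gives a vector",
    "a web page links to another page",
    "the matrix of a web page graph",
]

A_binary = term_by_document(terms, documents)
A_counts = term_by_document(terms, documents, counts=True)
```

Here A_binary and A_counts agree on which entries are zero. They differ only where a term occurs more than once in a document.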
Define the search vector
The search vector for terms i1, i2, …, ik is a vector with 1 in positions ij, j = 1, 2, …, k, and zeros everywhere else
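A small sketch of this definition. The helper name search_vector is hypothetical, and positions are 0-based here (Python convention) rather than 1-based as in the definition above:

```python
def search_vector(num_terms, positions):
    """Vector with 1 at each queried term position and 0 everywhere else."""
    v = [0] * num_terms
    for i in positions:
        v[i] = 1
    return v

# Query for the terms stored in rows 0 and 3 of a 5-term matrix.
v = search_vector(5, [0, 3])
```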
What information does the dot product of the search vector and column i of A, the term by document matrix, give?
v.(column i of A) > 0 means document i contains at least one of the search terms
What does taking the dot product of the search vector with each column of the term by document matrix return?
A list of all documents for which the dot product is positive
How do I find a webpage on which at least one of a list of terms appears?
Form the column vector v for the terms (the search vector) and calculate the dot product of v with each column of A. The columns where the answer is positive are the relevant pages
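The search can be sketched as follows, assuming A is stored as a list of rows (terms) and v has one entry per term. The function name matching_documents and the sample matrix are made up for illustration:

```python
def matching_documents(A, v):
    """Return the indices of the columns of A whose dot product with v is positive."""
    num_docs = len(A[0])
    hits = []
    for j in range(num_docs):
        # Dot product of the search vector with column j of A.
        score = sum(v[i] * A[i][j] for i in range(len(v)))
        if score > 0:
            hits.append(j)
    return hits

# Hypothetical count matrix: 3 terms (rows) by 3 documents (columns).
A = [[1, 0, 1],
     [2, 0, 0],
     [0, 2, 1]]

only_term_1 = matching_documents(A, [0, 1, 0])
terms_0_and_2 = matching_documents(A, [1, 0, 1])
```

Because the entries of A and v are never negative, a positive dot product can only come from a row where both the query and the document have a nonzero entry.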
Define a normalised vector
Vector of length 1 in the same direction as the original vector
When does equality occur in the Cauchy-Schwarz inequality?
When one of u and v is a scalar multiple of the other
How do I normalise the term by document matrix?
Normalise each column of the matrix: replace each column c with c/||c||. Also normalise the search vector.
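A sketch of the normalisation step, assuming the matrix is a list of rows and using the Euclidean length. The function names are hypothetical, and leaving an all-zero column unchanged is an assumption made here to avoid dividing by zero:

```python
import math

def normalise(vec):
    """Return vec / ||vec||; a zero vector is returned unchanged (assumption)."""
    length = math.sqrt(sum(x * x for x in vec))
    if length == 0:
        return list(vec)
    return [x / length for x in vec]

def normalise_columns(A):
    """Replace each column c of A (a list of rows) with c / ||c||."""
    columns = [normalise(col) for col in zip(*A)]  # zip(*A) iterates over columns
    return [list(row) for row in zip(*columns)]    # transpose back to rows

A = [[3, 0],
     [4, 1]]
N = normalise_columns(A)   # columns (3, 4) and (0, 1) become unit vectors
w = normalise([1, 1])      # normalised search vector
```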
When the term by document matrix and the search vector are normalised, what does the dot product tell you?
Calculate d.w, where d is a column of the normalised term by document matrix and w is the normalised search vector.
d.w > 0 exactly when v.c > 0 (the original un-normalised version), so this means at least one of the terms is contained in the document
But d.w <= 1, and the closer the number is to 1, the better the webpage fits our query.
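The bound follows from the Cauchy-Schwarz inequality. A short derivation, using c for the original column, v for the original search vector, and assuming both are nonzero:

```latex
\[
  d \cdot w
  = \frac{c}{\lVert c \rVert} \cdot \frac{v}{\lVert v \rVert}
  = \frac{c \cdot v}{\lVert c \rVert \, \lVert v \rVert},
  \qquad
  \lvert c \cdot v \rvert \le \lVert c \rVert \, \lVert v \rVert
  \;\Longrightarrow\;
  d \cdot w \le 1 .
\]
% The entries of c and v are non-negative, so c . v >= 0 and hence 0 <= d . w <= 1.
% Equality d . w = 1 holds when c is a positive multiple of v.
```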
Format of term by document matrix
Terms are the rows - columns are the documents
Define semantic content
d.w, the dot product of the normalised document column d and the normalised search vector w
How could the webpages returned be filtered so that they are helpful, using the dot product method?
Set a lower bound for d.w and only return webpages above that bound, so that only helpful webpages are kept. This ensures the webpages returned contain many occurrences of the query terms and speeds up response time, but it can cause relevant documents to be missed
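A sketch of the filtering step, assuming the matrix is a list of rows and that scores are computed as the normalised dot product d.w. The function names, the sample matrix, and the bound 0.5 are all illustrative choices, not values from the cards:

```python
import math

def normalised_score(column, query):
    """d.w: dot product of the column and query after scaling both to length 1."""
    dot = sum(c * q for c, q in zip(column, query))
    norm = math.sqrt(sum(c * c for c in column)) * math.sqrt(sum(q * q for q in query))
    return dot / norm if norm else 0.0  # zero vectors score 0 (assumption)

def filter_documents(A, v, lower_bound):
    """Return (document index, score) pairs with score above lower_bound, best first."""
    results = []
    for j, column in enumerate(zip(*A)):  # zip(*A) iterates over columns
        score = normalised_score(column, v)
        if score > lower_bound:
            results.append((j, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical count matrix: 3 terms (rows) by 3 documents (columns).
A = [[1, 0, 1],
     [2, 0, 0],
     [0, 2, 1]]
v = [1, 0, 1]  # query for terms 0 and 2

strict = filter_documents(A, v, lower_bound=0.5)
loose = filter_documents(A, v, lower_bound=0.0)
```

With the bound at 0.5, document 0 is dropped even though it contains one of the query terms, which illustrates the missed-documents trade-off. With the bound at 0, all three documents are returned.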