wk11 text analysis Flashcards
What is the first step to take before even performing text analysis? give examples
Pre-processing:
-remove punctuation
-make all lowercase
-find unique words
-count occurrences
Word Adjustment:
-remove stop words (useless common words, often for linking)
-Stemming, reducing all variations of a word to the same form e.g. running, runs, ran -> run
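A minimal Python sketch of the pre-processing steps (the stop-word list and the example sentence are illustrative, not from the notes):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "and", "is", "on"}   # illustrative stop-word list

def preprocess(text):
    # remove punctuation and make all lowercase
    text = re.sub(r"[^\w\s]", "", text.lower())
    # drop stop words, then count occurrences of each unique word
    words = [w for w in text.split() if w not in STOP_WORDS]
    return Counter(words)

print(preprocess("The cat sat on the mat, and the cat ran."))
# Counter({'cat': 2, 'sat': 1, 'mat': 1, 'ran': 1})
```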
What is Zipf’s Law
The frequency of the r-th most frequently used term in a document is proportional to a constant of proportionality C (related to the size of the document) divided by the word's rank r raised to some power (usually a measure of the range of vocabulary used): f(r) ~ C / r^s
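A small illustration of what the law predicts, assuming the form f(r) = C / r^s with illustrative values of C and s:

```python
# Zipf's law: frequency of the r-th most common word ~ C / r**s
C = 1000    # frequency of the most common word (illustrative)
s = 1.0     # exponent, roughly 1 for natural language

for rank in range(1, 6):
    print(rank, round(C / rank ** s))
# 1 1000, 2 500, 3 333, 4 250, 5 200 -- frequency falls off rapidly with rank
```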
what are the 3 proposed reasons Zipf's law works
-humans have limited brain capacity, so we reuse a small set of short common words
-words are often grouped together
-as sentences get longer, the number of words that can possibly be used is limited
what are stop words
lists of commonly used words that are devoid of semantic meaning and are there more for structure; they are not useful for document analysis
what is one way we can do stemming and what is it
stemming is changing all the variations of a word into one form, e.g. taught, teaching, teach into just teach. One way to do this is with RegEx
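A crude sketch of regex-based suffix stripping (the suffix list here is illustrative; a real stemmer such as the Porter stemmer uses many more rules and handles irregular forms like taught separately):

```python
import re

# strip a few common suffixes; a real stemmer has far more rules
SUFFIX = re.compile(r"(ing|ed|es|s)$")

def stem(word):
    return SUFFIX.sub("", word.lower())

print([stem(w) for w in ["teaching", "teaches", "teach", "running"]])
# ['teach', 'teach', 'teach', 'runn'] -- crude rules make mistakes ('runn')
```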
How is document similarity simply measured
Using the Term Frequency - Inverse Document Frequency (TF-IDF) score: each term is weighted by TF x IDF
What is Term Frequency score
a function of a term and a document, TF(t, d). It is just equal to the number of times term t occurs in document d
What is inverse document frequency
a function of a term and the collection of documents: IDF(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents which contain term t
what is the tf-idf weight
w_tf-idf = TF(t,d) x IDF(t)
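A small worked example of the weight, assuming a natural log and illustrative counts:

```python
import math

N   = 10     # total number of documents (illustrative)
n_t = 2      # number of documents containing term t
tf  = 3      # occurrences of t in document d

idf = math.log(N / n_t)   # IDF(t) = log(10 / 2) ~ 1.61
w   = tf * idf            # w_tf-idf = TF(t,d) x IDF(t) ~ 4.83
print(idf, w)
```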
How do you vectorise a document
1) enumerate all the unique terms in the documents analysed
2) compute their frequency in each document i.e. how many times term 1 appears in each document
3) count number of documents it appears in
4) calculate the IDF
5) the TF-IDF weights (TF x IDF) now give the value of each term's component in the document vector, see image
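A minimal sketch of these steps on a toy corpus (the documents and the natural-log base are illustrative assumptions; pre-processing is assumed to have already been done):

```python
import math
from collections import Counter

docs = [["cat", "sat", "mat"],
        ["cat", "ran"],
        ["dog", "ran", "ran"]]

# 1) enumerate all the unique terms in the documents analysed
vocab = sorted({t for doc in docs for t in doc})

# 2) compute each term's frequency in each document
tf = [Counter(doc) for doc in docs]

# 3) count the number of documents each term appears in
df = {t: sum(1 for doc in docs if t in doc) for t in vocab}

# 4) calculate the IDF of each term
N = len(docs)
idf = {t: math.log(N / df[t]) for t in vocab}

# 5) the TF-IDF weights are the components of each document vector
vectors = [[tf_d[t] * idf[t] for t in vocab] for tf_d in tf]
print(vocab)
print(vectors)
```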
why vectorise documents
computing similarity without vectorisation is very expensive, about O(n^2); vectorising documents makes it much cheaper to compute distances and similarities implicitly
what is the document similarity equation after vectorisation
sim(d1, d2) = dot(w1, w2), which is equal to the sum over all terms i of w_1,i x w_2,i, which in turn equals the vectors' magnitudes multiplied by cos(theta): |w1| |w2| cos(theta)
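A sketch of this similarity computed from two illustrative weight vectors:

```python
import math

def dot(w1, w2):
    return sum(a * b for a, b in zip(w1, w2))

def magnitude(w):
    return math.sqrt(sum(a * a for a in w))

w1 = [1.0, 2.0, 0.0]   # illustrative TF-IDF weight vectors
w2 = [2.0, 1.0, 1.0]

# dot(w1, w2) = |w1| * |w2| * cos(theta)  =>  cos(theta) = dot / (|w1| * |w2|)
cos_theta = dot(w1, w2) / (magnitude(w1) * magnitude(w2))
print(cos_theta)   # ~0.73
```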
why is the initial vectorised similarity equation for documents problematic
we want to use k-means for clustering documents, but the similarity function gives a cosine similarity (an inverse of distance), whereas k-means needs a Euclidean distance measure
what is the solution to the sim() function being a cosine similarity rather than a Euclidean distance
1) divide each document vector by its length so that all vectors have length 1
2) for unit-length vectors the Euclidean distance and the cosine similarity are directly related, ||w1 - w2||^2 = 2(1 - cos(theta)), so minimising Euclidean distance is equivalent to maximising cosine similarity (see image)
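A quick numeric check of that relationship, using two illustrative vectors: after normalising to unit length, the squared Euclidean distance equals 2(1 - cosine similarity):

```python
import math

def normalise(w):
    length = math.sqrt(sum(a * a for a in w))
    return [a / length for a in w]

w1 = normalise([1.0, 2.0, 0.0])
w2 = normalise([2.0, 1.0, 1.0])

cos_sim   = sum(a * b for a, b in zip(w1, w2))
sq_euclid = sum((a - b) ** 2 for a, b in zip(w1, w2))

# for unit-length vectors: ||w1 - w2||^2 = 2 * (1 - cos(theta))
print(sq_euclid, 2 * (1 - cos_sim))   # the two values are equal
```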
why isn't document similarity sufficient for ranking search results
1) many documents could be similar, leaving millions of results all equally ranked
2) easy to play the system, could insert white text to maximise similarity for your page
3) assigns equal importance to all documents
what does page rank do in summary
prioritises pages which are authoritative, meaning they are frequently linked to by other authoritative pages that also link selectively
How do we run page rank, and what do we simulate web browsing as to get scores
we simulate web browsing as a Markov process and the web as a Markov chain, where the user clicks any particular link on a webpage uniformly at random
How do we define the states and state transition of a system in a markov process
1) define a set of states S of size N
2) define the probability of being in each state as a vector whose entries sum to 1
3) define a transition matrix T where each element t_i,j is the probability of the transition s_i -> s_j and the sum of each row equals 1
what is the Markov equation for a state transition from time t to t+1
p^(t+1) = T^T . p^(t)
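A sketch of one state update with numpy, using an illustrative 3-state transition matrix whose rows each sum to 1:

```python
import numpy as np

# T[i, j] = probability of moving from state s_i to state s_j (rows sum to 1)
T = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

p = np.array([1.0, 0.0, 0.0])   # start in state 0 with probability 1

p_next = T.T @ p                # p^(t+1) = T^T . p^(t)
print(p_next)                   # [0.  0.5 0.5]
```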
as the Markov web-browsing process runs we know the states will eventually stabilise; how do we find the stable point
1) the state evolves by p^(t+1) = T^T . p^(t)
2) at the stable point we need p^(t+1) = p^(t)
3) this means p must be an eigenvector of T^T with eigenvalue 1
what must we do after we compute an eigenvector for the steady state in a Markov process
eigen-solvers return the eigenvector normalised so that the sum of its squared components equals 1, which is not a probability distribution. We must normalise it to obtain probabilities, which means dividing it by the sum of its components
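A sketch of finding the steady state with numpy's eigen-solver and normalising it (same illustrative matrix as above); the returned eigenvector has unit Euclidean length, so it is rescaled to sum to 1:

```python
import numpy as np

T = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

eigvals, eigvecs = np.linalg.eig(T.T)

# pick the eigenvector whose eigenvalue is (numerically) 1
idx = np.argmin(np.abs(eigvals - 1.0))
v = np.real(eigvecs[:, idx])

# eig returns a unit-length vector (squared components sum to 1),
# so divide by the sum of its components to get probabilities
steady_state = v / v.sum()
print(steady_state, steady_state.sum())   # stable probabilities, sum to 1
```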
what are the special cases when the Markov process of page ranking does not converge? What are their specific effects as time tends to infinity, and how do we recognise them in matrix form
1) a site has no incoming links: rather than converging, the probabilities oscillate; the matrix has a zero column
2) a site has no outgoing links: the probabilities tend to 0; the matrix has a zero row
how do we prevent pages with no incoming links from ruining the Markov process, and what effect does the solution have
Damping:
-give every transition a default probability of gamma/N (the damping constant gamma divided by the total number of pages N)
-columns (and rows) that were all zeros now have non-zero values, as the damping value is added to every element of the matrix
-this is done by the equation in the image, where gamma is the damping constant
how do we prevent pages with no outgoing links from ruining the Markov process, and what effect does the solution have
-give every all-zero row a default value equal to 1/N (N = number of pages); this way all rows will sum to 1
-for the transition W^T . x^(t), make it (W + V)^T . x^(t), where V_i,j = 1/N for every j if document d_i is a dangling page, and 0 otherwise
what is the full equation for the Markov evolution of the page rank system, accounting for any cases of dangling pages or pages with no incoming links
x^(t+1) = (1 - gamma) . (W + V)^T . x^(t) + (gamma / N) . 1
where W is the state transition matrix, V is the matrix that removes all-zero rows (dangling pages), gamma is the damping constant, N is the number of pages and 1 is a vector of ones
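A sketch of one full update step with both corrections applied, assuming gamma = 0.15 and an illustrative 3-page web in which page 2 is a dangling page:

```python
import numpy as np

gamma = 0.15                      # damping constant (illustrative value)

# transition matrix W: row 2 is all zeros, so page 2 is a dangling page
W = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
N = W.shape[0]

# V: rows of 1/N for dangling pages, zero everywhere else
V = np.zeros((N, N))
V[W.sum(axis=1) == 0] = 1.0 / N

x = np.full(N, 1.0 / N)           # start from a uniform distribution

# x^(t+1) = (1 - gamma) . (W + V)^T . x^(t) + (gamma / N) . 1
x_next = (1 - gamma) * (W + V).T @ x + (gamma / N) * np.ones(N)
print(x_next, x_next.sum())       # still a probability vector (sums to 1)
```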
How do we construct a transition matrix
page rank of X_n = sum (over all pages X_m that link to X_n) of page rank of X_m / number of links on page X_m
What is the issue with a naive approach to constructing a transition matrix and what’s the solution
It is self-referential, the rank of a page depends on the rank of the pages that link to it:
the solution is to initialise the transition matrix over all documents as follows:
the matrix element i,j is equal to 1 / (number of documents linked to from document i) if document i has a link to j, otherwise 0. See diagram
-this way all rows sum to 1 and all non-zero elements within a row are equal
The system evolves by x^(t+1) = W^T . x^(t) (transition matrix transposed multiplied by the previous state)
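A sketch of constructing the transition matrix from an illustrative link structure:

```python
import numpy as np

# links[i] = set of pages that page i links to (illustrative 3-page web)
links = {0: {1, 2}, 1: {0}, 2: {0, 1}}
N = len(links)

W = np.zeros((N, N))
for i, targets in links.items():
    for j in targets:
        # element i,j = 1 / (number of links out of document i)
        W[i, j] = 1.0 / len(targets)

print(W)
print(W.sum(axis=1))   # every row sums to 1
```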
What does page rank prioritise for high rankings
-pages that link to it have high page rank
-pages that link to it don’t have many links
-many pages link to it
what is the evolution of the page rank matrix and what happens at t -> infinity given that one of the exceptions does not occur
x^(t+1) = W^T . x^(t)
(W= transition matrix)
Over time the probabilities x^(t) will stabilise. These values are used as the rank of the document / website
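A sketch of iterating the evolution until the probabilities stabilise, using an illustrative transition matrix with no zero rows or columns:

```python
import numpy as np

W = np.array([[0.0, 0.5, 0.5],   # illustrative transition matrix,
              [1.0, 0.0, 0.0],   # no all-zero rows or columns
              [0.5, 0.5, 0.0]])

x = np.full(len(W), 1.0 / len(W))   # start from a uniform distribution

# iterate x^(t+1) = W^T . x^(t) until the probabilities stop changing
for _ in range(100):
    x_next = W.T @ x
    if np.allclose(x_next, x):
        break
    x = x_next

print(x)   # stabilised probabilities, used as the page ranks
```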