wk11 text analysis Flashcards
What is the first step to take before even performing text analysis? give examples
Pre-processing:
-remove punctuation
-make all lowercase
-find unique words
-count occurrences
Word Adjustment:
-remove stop words (useless common words, often for linking)
-Stemming, reducing all forms of a word to one base form e.g. running, runs, ran -> run (see the sketch below)
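A minimal pre-processing sketch in Python covering the steps above; the stop-word list and the crude suffix-stripping rule are assumptions for illustration, not the course's exact choices:

```python
# Minimal pre-processing sketch (the stop-word list and suffix rule are assumed).
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def preprocess(text):
    text = text.lower()                                         # make all lowercase
    text = re.sub(r"[^\w\s]", "", text)                         # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]   # drop stop words
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]      # crude stemming

counts = Counter(preprocess("The cat runs, and the dogs ran to the mat."))
print(set(counts))   # unique (stemmed) words
print(counts)        # occurrence count of each word
```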
What is Zipf’s Law
The frequency of a term in a document is proportional to a constant C (the constant of proportionality, related to the size of the document) divided by the term's frequency rank raised to some power (the exponent is roughly a measure of the range of vocabulary used): frequency(rank) ≈ C / rank^alpha
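A rough sketch of what this looks like in practice; the toy text and the choice of C and alpha are assumptions purely for demonstration, not fitted values:

```python
# Compare observed word frequencies against Zipf's prediction C / rank**alpha.
from collections import Counter

text = "the cat sat on the mat and the cat saw the dog on the mat".split()
freqs = sorted(Counter(text).values(), reverse=True)   # frequency at each rank

C, alpha = freqs[0], 1.0   # assumed: C = top frequency, alpha = 1
for rank, observed in enumerate(freqs, start=1):
    predicted = C / rank ** alpha
    print(rank, observed, round(predicted, 2))
```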
what are the 3 proposed reasons Zipf's law works
-humans have limited brain capacity, so they rely on a small set of common words
-words are often grouped together
-as sentences get longer, the number of words that can be used is limited
what are stop words
commonly used words that are devoid of semantic meaning and are there mostly for structure/linking; not useful for document analysis
what is one way we can do stemming and what is it
Stemming is reducing all the variations of a word to one form, e.g. taught, teaching, teach -> teach. One way to do this is with RegEx (see the sketch below).
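A minimal RegEx-based stemmer sketch; the suffix rules and the irregular-form lookup are assumptions for illustration (a real stemmer such as the Porter stemmer has far more rules):

```python
import re

# Assumed lookup for irregular forms; real stemmers handle these more carefully.
IRREGULAR = {"taught": "teach", "ran": "run"}

def stem(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    return re.sub(r"(ing|ed|es|s)$", "", word)   # strip a few common suffixes

print([stem(w) for w in ["taught", "teaching", "teach"]])   # ['teach', 'teach', 'teach']
```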
How is document similarity measured simply
Using the Term Frequency - Inverse Document Frequency (TF-IDF) score: each term is weighted by TF x IDF and documents are compared using these weights
What is Term Frequency score
a function of a term and a document, TF(t, d). It is just equal to the number of times term t occurs in document d
What is inverse document frequency
a function of a term and the set of documents. Equal to log(number of documents / number of documents which contain term t)
what is the tf-idf weight
w_tf-idf = TF(t,d) x IDF(t)
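A small sketch of these definitions in Python; the toy documents are assumptions for illustration:

```python
import math

# Toy tokenised documents (assumed example data).
docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
    ["cat", "dog", "park"],
]

def tf(t, d):
    return d.count(t)                        # TF(t, d): times t occurs in d

def idf(t, docs):
    n_t = sum(1 for d in docs if t in d)     # number of documents containing t
    return math.log(len(docs) / n_t)         # IDF(t) = log(N / n_t)

def tf_idf(t, d, docs):
    return tf(t, d) * idf(t, docs)           # w_tf-idf = TF(t, d) x IDF(t)

print(tf_idf("cat", docs[0], docs))          # 2 * log(3/2)
```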
How do you vectorise a document
1) enumerate all the unique terms in the documents analysed
2) compute their frequency in each document i.e. how many times term 1 appears in each document
3) count number of documents it appears in
4) calculate the IDF
5) the tf-idf weights now correspond to the values of each term's entry in the document vector, see image and the sketch below
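A sketch of these steps on toy documents (the data and rounding are assumptions for illustration):

```python
import math

docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
]

vocab = sorted({t for d in docs for t in d})          # 1) enumerate unique terms

def idf(t):
    n_t = sum(1 for d in docs if t in d)              # 3) documents containing t
    return math.log(len(docs) / n_t)                  # 4) IDF(t)

# 2) term frequency per document, weighted by IDF -> 5) the document vectors
vectors = [[d.count(t) * idf(t) for t in vocab] for d in docs]

for v in vectors:
    print([round(x, 2) for x in v])
```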
why vectorise documents
the similarity computation without vectorisation is very expensive, roughly O(n^2); vectorising makes it much cheaper to compute distances and similarities implicitly
what is the document similarity equation after vectorisation
sim(d1, d2) = dot(w1, w2), which is equal to the sum over all terms i of w1_i x w2_i, which is equal to the product of the vectors' magnitudes multiplied by cos(theta)
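A sketch of this with two assumed tf-idf weight vectors:

```python
import math

# Assumed tf-idf weight vectors for two documents.
w1 = [1.4, 0.0, 0.7, 0.0]
w2 = [0.7, 0.7, 0.0, 0.0]

dot = sum(a * b for a, b in zip(w1, w2))    # sum_i w1_i * w2_i
mag1 = math.sqrt(sum(a * a for a in w1))
mag2 = math.sqrt(sum(b * b for b in w2))

cos_theta = dot / (mag1 * mag2)             # dot = |w1| |w2| cos(theta)
print(dot, cos_theta)
```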
why is the initial vectorised similarity equation for documents problematic
we want to use k-means for clustering documents, but the similarity function gives the inverse of a distance (cosine similarity), whereas k-means needs Euclidean distance measures
what is the solution to the sim() function being the inverse of a distance (cosine similarity)
1) expand the cosine similarity and divide each vector by its length, so every document vector has length 1
2) take the inverse of the cosine similarity and normalise it to give us a proper distance (see image and the sketch below)
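A sketch of why this works (the toy vectors are assumptions): once the vectors are normalised to unit length, the squared Euclidean distance that k-means needs equals 2 x (1 - cosine similarity), so the two measures agree:

```python
import math

def normalise(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

# Assumed toy tf-idf vectors, normalised to unit length.
a = normalise([1.4, 0.0, 0.7])
b = normalise([0.7, 0.7, 0.0])

cos_theta = sum(x * y for x, y in zip(a, b))             # cosine similarity
sq_euclid = sum((x - y) ** 2 for x, y in zip(a, b))      # squared Euclidean distance

print(round(sq_euclid, 6), round(2 * (1 - cos_theta), 6))   # the two values match
```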
why isn't document similarity sufficient for ranking search results
1) many documents could be similar, leaving millions of results all equally ranked
2) easy to play the system, could insert white text to maximise similarity for your page
3) assigns equal importance to all documents