wk11 text analysis Flashcards

1
Q

What is the first step to take before even performing text analysis? give examples

A

Pre-processing:
-remove punctuation
-make all lowercase
-find unique words
-count occurrences
Word Adjustment:
-remove stop words (useless common words, often for linking)
-Stemming: reducing all variants of a word to one form, e.g. running, runs, ran -> run
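A minimal sketch of these steps in Python (the stop-word list and the suffix rule are illustrative placeholders, not a standard library's):

```python
import re
from collections import Counter

def preprocess(text, stop_words=("the", "a", "is", "and", "of")):
    # Remove punctuation and make everything lowercase
    text = re.sub(r"[^\w\s]", "", text.lower())
    words = text.split()
    # Remove stop words (common linking words with little meaning)
    words = [w for w in words if w not in stop_words]
    # Very crude stemming: strip a few common suffixes (a real stemmer has many more rules)
    words = [re.sub(r"(ing|ed|s)$", "", w) for w in words]
    # Count occurrences of each unique word
    return Counter(words)

print(preprocess("The cat sat on the mat; the dogs were barking."))
```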

2
Q

What is Zipf’s Law

A

The frequency of each term used in a document is approximately proportional to a constant C (related to the size of the document) divided by the word's rank raised to some power (the power is usually a measure of the range of vocabulary used)
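In symbols (a standard way to write this; r is the word's rank, C a constant tied to document size, and the exponent alpha reflects the range of vocabulary):

```latex
f(r) \approx \frac{C}{r^{\alpha}}
```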

3
Q

what are the 3 proposed reasons Zipf's law works

A

-humans have limited brain capacity, so they reuse a small set of words
-words are often grouped together
-as sentences get longer, the number of words that can plausibly be used is limited

4
Q

what are stop words

A

lists of commonly used words that are devoid of semantic meaning and are mostly there for structure (linking). they are not useful for document analysis

5
Q

what is one way we can do stemming and what is it

A

stemming is changing all the variations of a word into one form, e.g. taught, teaching, teach into just teach. One way to do this is with RegEx
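A minimal sketch of RegEx-based stemming in Python (the suffix list is an illustrative assumption, far cruder than a real stemmer such as Porter's):

```python
import re

# Crude rule-based stemmer: strip a few common suffixes.
def stem(word):
    return re.sub(r"(ing|ed|es|s)$", "", word.lower())

print([stem(w) for w in ["teaching", "teaches", "taught"]])
# Note: irregular forms like "taught" are not mapped to "teach" by simple suffix rules.
```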

6
Q

How is document similarity measured simply

A

Using the Term Frequency - Inverse Document Frequency (TF-IDF) score: each term is weighted by TF x IDF, and documents are compared using these weights

7
Q

What is Term Frequency score

A

a function of a term and a document, TF(t, d). it is just equal to the number of times the term occurs in the document

8
Q

What is inverse document frequency

A

a function of a term and the collection of documents. equal to the log of (number of documents / number of documents which contain the term t)

9
Q

what is the tf-idf weight

A

w_tf-idf = TF(t,d) x IDF(t)
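Putting the last three cards together in standard notation (N is the total number of documents; this matches the definitions above, though the lecture's exact symbols may differ):

```latex
\mathrm{TF}(t,d) = \#\{\text{occurrences of } t \text{ in } d\}, \qquad
\mathrm{IDF}(t) = \log\frac{N}{|\{d : t \in d\}|}, \qquad
w_{t,d} = \mathrm{TF}(t,d)\times\mathrm{IDF}(t)
```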

10
Q

How do you vectorise a document

A

1) enumerate all the unique terms in the documents analysed
2) compute their frequency in each document, i.e. how many times term 1 appears in each document, and so on
3) count the number of documents each term appears in
4) calculate the IDF of each term
5) multiply each term's TF by its IDF; these TF-IDF weights are the values of the document vector (see image)
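A minimal sketch of these steps in Python (the two documents are made up for illustration):

```python
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat on the log".split(),
}

# 1) enumerate the unique terms across all documents
vocab = sorted({t for words in docs.values() for t in words})

# 2) term frequency of each term in each document
tf = {name: Counter(words) for name, words in docs.items()}

# 3) + 4) number of documents containing each term, then IDF
N = len(docs)
idf = {t: math.log(N / sum(t in words for words in docs.values())) for t in vocab}

# 5) TF x IDF weights form the document vectors
vectors = {name: [tf[name][t] * idf[t] for t in vocab] for name in docs}
print(vectors["d1"])
```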

11
Q

why vectorise documents

A

the similarity computation without vectorisation is very expensive, about O(n^2). vectorising the documents makes it much cheaper to compute distances and similarities implicitly

12
Q

what is the document similarity equation after vectorisation

A

sim(d1, d2) = dot(w1, w2), which is equal to the sum over all terms i of w1_i x w2_i, which in turn equals |w1| |w2| cos(theta)

13
Q

why is the initial vectorized similarity equation for documents problematic

A

we want to use k-means for clustering documents, but the similarity function gives a cosine similarity, which behaves like an inverse distance, whereas k-means needs a Euclidean distance measure

14
Q

what is the solution to the sim() function being an inverse cosine distance

A

1) expand the cosine similarity and divide by the lengths of the vectors, so that each vector is normalised to unit length
2) take the inverse of this normalised cosine similarity so it behaves as a distance (see image)
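The image is not reproduced here, but a standard way to write the fix: once the vectors are normalised to unit length, the squared Euclidean distance (which k-means needs) is a simple function of the cosine similarity:

```latex
\hat{w} = \frac{w}{\lVert w \rVert}, \qquad
\lVert \hat{w}_1 - \hat{w}_2 \rVert^{2} = 2\,(1 - \cos\theta)
```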

15
Q

why isn't document similarity sufficient for ranking search results

A

1) many documents could be similar, leaving millions of results all equally ranked
2) easy to game the system: a page could insert hidden white text to maximise its similarity
3) it assigns equal importance to all documents

16
Q

what does page rank do in summary

A

prioritises pages which are authoritative, meaning they are frequently linked to by other authoritative pages but are also linked to selectively

17
Q

How do we run page rank; what do we simulate web browsing as to get scores

A

we simulate web browsing as a Markov process and the system as a Markov chain, where a user on a webpage clicks any particular link at random

18
Q

How do we define the states and state transition of a system in a markov process

A

1) define a set of states S of size N
2) define the probability of being in each state as a vector whose entries sum to 1
3) define a transition matrix where each entry t_i,j is the probability of the transition s_i -> s_j, and the sum of each row equals 1
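A minimal sketch of these definitions with NumPy (the three states and the probabilities are made up for illustration):

```python
import numpy as np

# States S = {s1, s2, s3}, so N = 3
p = np.array([0.5, 0.3, 0.2])        # state probabilities, must sum to 1

# Transition matrix: T[i, j] = P(s_i -> s_j), each row must sum to 1
T = np.array([
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])

assert np.isclose(p.sum(), 1.0)
assert np.allclose(T.sum(axis=1), 1.0)
```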

19
Q

what is the Markov equation for a state transition from time t to t+1

A

p^(t+1) = T^T . p^(t)

20
Q

as the Markov web browsing process runs, we know the states will eventually stabilise; how do we find the stable points

A

1) the state evolves by p^(t+1) = T^T . p^(t)
2) at a stable point we need p^(t+1) = p^(t)
3) this means p must be an eigenvector of T^T with eigenvalue 1
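A minimal sketch of finding the stable point with NumPy (reusing a small made-up transition matrix; real page rank matrices are built from link data):

```python
import numpy as np

T = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

# Find the eigenvector of T^T whose eigenvalue is (closest to) 1
vals, vecs = np.linalg.eig(T.T)
steady = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])

# Normalise so the entries sum to 1 and can be read as probabilities
steady = steady / steady.sum()
print(steady)
```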

21
Q

what must we do after we compute an eigenvector for the steady state in a markov process

A

eigenvector solvers usually return a unit-length vector (the sum of its squared entries is 1), so to read it as probabilities we must normalise it, which means dividing the eigenvector by the sum of its entries

22
Q

what are the special cases where the Markov process of page ranking does not converge? What are their specific effects as time tends to infinity, and how do we recognise them in matrix form

A

1) a site with no incoming links: rather than converging, the probabilities oscillate. The matrix has a zero column
2) a site with no outgoing links: the probabilities tend to 0. The matrix has a zero row

23
Q

how do we prevent pages with no incoming links from ruining the Markov process, and what effect does the solution have

A

Damping:
-give each state a default transition value equal to some damping constant times 1/(number of pages it could link to)
-rows with all zero values now have non-zero values, as the damping value is added to every element of the matrix
-this is done by the equation in the image, where gamma is the normalising constant

24
Q

how do we prevent pages with no outgoing links from ruining the Markov process, and what effect does the solution have

A

-give all rows with all-zero values a default value equal to 1/N (where N is the number of pages); this way all rows will sum to 1
-for the transition W^T . x^(t), make it (W + V)^T . x^(t), where V_i,j = 1/N for each j if document d_i is a dangling page, and 0 otherwise

25
Q

what is the full equation for the markov evolution of the page rank system accounting for any cases of dangling pages or pages with no incoming links

A

x^(t+1) = (1 - gamma) . (W + V)^T . x^(t) + (gamma / N) . 1, where W is the state transition matrix, V is the matrix that fixes all zero-valued rows (dangling pages), N is the number of pages, x^(t) is the previous state and 1 is a vector of ones
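A minimal sketch of this update in Python (the link matrix and the value of gamma are illustrative assumptions):

```python
import numpy as np

def pagerank(W, gamma=0.15, tol=1e-9):
    """Iterate x^(t+1) = (1-gamma) (W + V)^T x^(t) + gamma/N * 1 until it stabilises."""
    N = W.shape[0]
    # V gives dangling pages (all-zero rows of W) a uniform 1/N transition
    dangling = W.sum(axis=1) == 0
    V = np.where(dangling[:, None], 1.0 / N, 0.0)
    M = (1 - gamma) * (W + V).T
    x = np.full(N, 1.0 / N)
    while True:
        x_new = M @ x + gamma / N        # adding gamma/N to every entry = (gamma/N) * vector of 1's
        if np.abs(x_new - x).max() < tol:
            return x_new
        x = x_new

# Three pages: page 2 is dangling (no outgoing links)
W = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
print(pagerank(W))
```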

26
Q

How do we construct a transition matrix

A

page rank of X_n = the sum, over all documents X_m that link to X_n, of X_m's rank / the number of links X_m has

27
Q

What is the issue with a naive approach to constructing a transition matrix and what’s the solution

A

It is self-referential: the rank of a page depends on the rank of the pages that link to it.
The solution is to initialise the transition matrix over all documents as follows:
the matrix element i,j is equal to 1/(number of documents that document i links to) if document i has a link to j, and 0 otherwise. See diagram
-this way all rows sum to 1, and all non-zero elements within a row i have the same value
The system evolves by x^(t+1) = W^T . x^(t), where W is the transition matrix and x^(t) the previous state
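A minimal sketch of this initialisation in Python (the link structure is made up for illustration):

```python
import numpy as np

# links[i] = set of documents that document i links to (illustrative data)
links = {0: {1, 2}, 1: {2}, 2: {0}}
N = len(links)

# W[i, j] = 1 / (number of links out of i) if i links to j, else 0
W = np.zeros((N, N))
for i, outgoing in links.items():
    for j in outgoing:
        W[i, j] = 1.0 / len(outgoing)

# Each row with outgoing links sums to 1; the system evolves by x^(t+1) = W^T x^(t)
x = np.full(N, 1.0 / N)
x = W.T @ x
```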

28
Q

What does page rank prioritise for high rankings

A

-pages that link to it have high page rank
-pages that link to it don’t have many links
-many pages link to it

29
Q

what is the evolution equation of the page rank system and what happens as t -> infinity, given that none of the exceptions occur

A

x^(t+1) = W^T . x^(t)
(W= transition matrix)
Over time the probabilities x^(t) will stabilise. These values are used as the rank of the document / website