wk11 text analysis Flashcards
What is the first step to take before even performing text analysis? give examples
Pre-processing:
-remove punctuation
-make all lowercase
-find unique words
-count occurrences
Word Adjustment:
-remove stop words (useless common words, often for linking)
-Stemming, reducing all forms of a word to one base form e.g. running, runs, ran -> run (see the sketch below)
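A minimal pre-processing sketch in Python covering the steps above; the stop-word list and the crude suffix-stripping rule are assumptions for illustration, not the course's exact choices:

```python
# Minimal pre-processing sketch (the stop-word list and suffix rule are assumed).
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def preprocess(text):
    text = text.lower()                                         # make all lowercase
    text = re.sub(r"[^\w\s]", "", text)                         # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]   # drop stop words
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]      # crude stemming

counts = Counter(preprocess("The cat runs, and the dogs ran to the mat."))
print(set(counts))   # unique (stemmed) words
print(counts)        # occurrence count of each word
```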
What is Zipf’s Law
The frequency of a term in a document is proportional to a constant C (the constant of proportionality, related to the size of the document) divided by the term's frequency rank raised to some power (the exponent is roughly a measure of the range of vocabulary used): frequency(rank) ≈ C / rank^alpha
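A rough sketch of what this looks like in practice; the toy text and the choice of C and alpha are assumptions purely for demonstration, not fitted values:

```python
# Compare observed word frequencies against Zipf's prediction C / rank**alpha.
from collections import Counter

text = "the cat sat on the mat and the cat saw the dog on the mat".split()
freqs = sorted(Counter(text).values(), reverse=True)   # frequency at each rank

C, alpha = freqs[0], 1.0   # assumed: C = top frequency, alpha = 1
for rank, observed in enumerate(freqs, start=1):
    predicted = C / rank ** alpha
    print(rank, observed, round(predicted, 2))
```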
what are the 3 proposed reasons Zipf's law works
-humans have limited brain capacity, so they rely on a small set of common words
-words are often grouped together
-as sentences get longer, the number of words that can be used is limited
what are stop words
commonly used words that are devoid of semantic meaning and are there mostly for structure/linking; not useful for document analysis
what is one way we can do stemming and what is it
Stemming is reducing all the variations of a word to one form, e.g. taught, teaching, teach -> teach. One way to do this is with RegEx (see the sketch below).
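A minimal RegEx-based stemmer sketch; the suffix rules and the irregular-form lookup are assumptions for illustration (a real stemmer such as the Porter stemmer has far more rules):

```python
import re

# Assumed lookup for irregular forms; real stemmers handle these more carefully.
IRREGULAR = {"taught": "teach", "ran": "run"}

def stem(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    return re.sub(r"(ing|ed|es|s)$", "", word)   # strip a few common suffixes

print([stem(w) for w in ["taught", "teaching", "teach"]])   # ['teach', 'teach', 'teach']
```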
How is document similarity measured simply
Using the Term Frequency - Inverse Document Frequency (TF-IDF) score: each term is weighted by TF x IDF and documents are compared using these weights
What is Term Frequency score
a function of a term and a document, TF(t, d). It is just equal to the number of times term t occurs in document d
What is inverse document frequency
a function of a term and the set of documents. Equal to log(number of documents / number of documents which contain term t)
what is the tf-idf weight
w_tf-idf = TF(t,d) x IDF(t)
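A small sketch of these definitions in Python; the toy documents are assumptions for illustration:

```python
import math

# Toy tokenised documents (assumed example data).
docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
    ["cat", "dog", "park"],
]

def tf(t, d):
    return d.count(t)                        # TF(t, d): times t occurs in d

def idf(t, docs):
    n_t = sum(1 for d in docs if t in d)     # number of documents containing t
    return math.log(len(docs) / n_t)         # IDF(t) = log(N / n_t)

def tf_idf(t, d, docs):
    return tf(t, d) * idf(t, docs)           # w_tf-idf = TF(t, d) x IDF(t)

print(tf_idf("cat", docs[0], docs))          # 2 * log(3/2)
```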
How do you vectorise a document
1) enumerate all the unique terms in the documents analysed
2) compute their frequency in each document i.e. how many times term 1 appears in each document
3) count number of documents it appears in
4) calculate the IDF
5) the tf-idf weights now correspond to the values of each term's entry in the document vector, see image and the sketch below
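A sketch of these steps on toy documents (the data and rounding are assumptions for illustration):

```python
import math

docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
]

vocab = sorted({t for d in docs for t in d})          # 1) enumerate unique terms

def idf(t):
    n_t = sum(1 for d in docs if t in d)              # 3) documents containing t
    return math.log(len(docs) / n_t)                  # 4) IDF(t)

# 2) term frequency per document, weighted by IDF -> 5) the document vectors
vectors = [[d.count(t) * idf(t) for t in vocab] for d in docs]

for v in vectors:
    print([round(x, 2) for x in v])
```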
why vectorise documents
the similarity computation without vectorisation is very expensive, roughly O(n^2); vectorising makes it much cheaper to compute distances and similarities implicitly
what is the document similarity equation after vectorisation
sim(d1, d2) = dot(w1, w2), which is equal to the sum over all terms i of w1_i x w2_i, which is equal to the product of the vectors' magnitudes multiplied by cos(theta)
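A sketch of this with two assumed tf-idf weight vectors:

```python
import math

# Assumed tf-idf weight vectors for two documents.
w1 = [1.4, 0.0, 0.7, 0.0]
w2 = [0.7, 0.7, 0.0, 0.0]

dot = sum(a * b for a, b in zip(w1, w2))    # sum_i w1_i * w2_i
mag1 = math.sqrt(sum(a * a for a in w1))
mag2 = math.sqrt(sum(b * b for b in w2))

cos_theta = dot / (mag1 * mag2)             # dot = |w1| |w2| cos(theta)
print(dot, cos_theta)
```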
why is the initial vectorised similarity equation for documents problematic
we want to use k-means for clustering documents, but the similarity function gives the inverse of a distance (cosine similarity), whereas k-means needs Euclidean distance measures
what is the solution to the sim() function being the inverse of a distance (cosine similarity)
1) expand the cosine similarity and divide each vector by its length, so every document vector has length 1
2) take the inverse of the cosine similarity and normalise it to give us a proper distance (see image and the sketch below)
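A sketch of why this works (the toy vectors are assumptions): once the vectors are normalised to unit length, the squared Euclidean distance that k-means needs equals 2 x (1 - cosine similarity), so the two measures agree:

```python
import math

def normalise(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

# Assumed toy tf-idf vectors, normalised to unit length.
a = normalise([1.4, 0.0, 0.7])
b = normalise([0.7, 0.7, 0.0])

cos_theta = sum(x * y for x, y in zip(a, b))             # cosine similarity
sq_euclid = sum((x - y) ** 2 for x, y in zip(a, b))      # squared Euclidean distance

print(round(sq_euclid, 6), round(2 * (1 - cos_theta), 6))   # the two values match
```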
why isn't document similarity sufficient for ranking search results
1) many documents could be similar, leaving millions of results all equally ranked
2) easy to play the system, could insert white text to maximise similarity for your page
3) assigns equal importance to all documents