Book - Chapter 9 Analytical Theory Text Analysis Flashcards

1
Q

What is text analysis

A

Representation and processing of text

2
Q

Why is text analysis high-dimensional

A

Every distinct term is a dimension

3
Q

Is the data structured or unstructured

A

Unstructured

4
Q

What are the three important steps in the text analysis process

A

Parsing. Search/retrieval. Text mining

5
Q

What is parsing

A

Imposing structure on the unstructured/semistructured text for downstream analysis

6
Q

What is search/retrieval

A

Finding which documents contain a given word or phrase, or which documents are about a given topic or entity

7
Q

What is text mining

A

Understanding the content, for example through clustering or classification

8
Q

What are regular expressions

A

A means for finding words, strings, or particular patterns in text
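
As a rough illustration of pattern matching, the sketch below uses Python's standard `re` module. The sample text and both patterns are assumptions made up for this card; the e-mail pattern is deliberately simplified and is not a full address validator.

```python
import re

# Hypothetical sample text, invented for illustration.
text = "Contact sales@example.com or support@example.org by 2024-05-01."

# Simplified e-mail pattern (an assumption, not a complete validator).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# ISO-style date pattern: four digits, two digits, two digits.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def find_emails(s):
    """Return every substring of s that matches the e-mail pattern."""
    return EMAIL.findall(s)

def find_dates(s):
    """Return every substring of s that looks like YYYY-MM-DD."""
    return DATE.findall(s)

emails = find_emails(text)
dates = find_dates(text)
```

Here `emails` holds the two addresses and `dates` holds the single date found in the sample text.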

9
Q

What does bag of words mean

A

The most common structured representation of a document. The bag of words is a vector with one dimension for every unique term in the term space
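
A minimal sketch of the idea, assuming a naive lowercase tokenizer and a tiny made-up corpus. The helper names `tokenize` and `bag_of_words` are hypothetical and chosen only for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a naive assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({term for doc in docs for term in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # One dimension per unique term; missing terms count as 0.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocab, vectors = bag_of_words(docs)
```

Each vector has the same length as `vocab`, so even this two-document corpus already needs five dimensions, which is why real corpora become high-dimensional.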

10
Q

What is term frequency

A

The number of times a term occurs in a document

11
Q

What is a reverse index (also called an inverted index)

A

For every possible feature, a list of all the documents that contain that feature
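
One way to picture this structure, assuming whitespace tokenization and integer document IDs. The function name `build_reverse_index` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_reverse_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical three-document corpus.
docs = {
    1: "text mining basics",
    2: "mining text with regex",
    3: "search and retrieval",
}
index = build_reverse_index(docs)
```

A lookup such as `index["text"]` then answers "which documents contain this term" without scanning every document.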

12
Q

What are the corpus metrics

A

Volume. Corpus wide term frequencies. Inverse document frequency

13
Q

What is the challenge with a corpus

A

A corpus is dynamic. The index and metrics must be updated continuously

14
Q

What are the three things that determine quality of search results

A

Relevance. Precision. Recall

15
Q

What is relevance in the quality of search results

A

Is the document what I wanted? It is used to rank search results

16
Q

What is precision in the quality of search results

A

What percentage of the documents in the results are relevant

17
Q

What is recall in the quality of search results

A

Of all the relevant documents in the corpus, what percentage were returned to me
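
Precision and recall can be sketched with plain set arithmetic. The function name `precision_recall` and the document IDs below are made up for illustration, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(returned, relevant):
    """Compute precision and recall from returned and relevant document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    # Convention assumed here: an empty denominator yields 0.0.
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 5 relevant documents in the corpus.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
```

In this example two of the four returned documents are relevant (precision 0.5), and two of the five relevant documents were returned (recall 0.4).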

18
Q

What does term frequency do

A

It assigns each term in the document a weight

19
Q

What does inverse document frequency do

A

It measures the uniqueness of a term in the corpus

20
Q

What is tf-idf

A

It provides a measure that weights the presence of unusual terms in the query as higher indications of document relevance than the presence of more common terms
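
A minimal sketch of the weighting, assuming one common unsmoothed variant: raw term count multiplied by log(N / document frequency). Other variants exist, and the function name `tf_idf` and the corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

# Hypothetical corpus: "the" occurs everywhere, "cat" in only one document.
docs = ["the cat sat", "the dog sat", "the bird flew"]
scores = tf_idf(docs)
```

Under this variant a term that appears in every document, such as "the", gets weight 0, while the rarer "cat" outweighs the more common "sat".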

21
Q

What is authoritativeness

A

A measure of how authoritative a document is, for example PageRank, used by Google

22
Q

What is the recency metric

A

New documents are more relevant than old ones

23
Q

Tasks such as reverse indexing and computing inverse document frequencies and corpus term frequencies are implemented with what

A

Map and reduce algorithms
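
The map and reduce steps can be imitated in a single process to show the shape of the computation. This is only a sketch: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and a real job would run on a distributed framework rather than in one Python process.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in the document."""
    return [(term, doc_id) for term in set(text.lower().split())]

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for term, doc_id in pairs:
        groups[term].append(doc_id)
    return groups

def reduce_phase(term, doc_ids):
    """Produce the posting list and document frequency for one term."""
    return term, sorted(doc_ids), len(doc_ids)

# Hypothetical three-document corpus.
docs = {
    1: "text mining basics",
    2: "mining text with regex",
    3: "search and retrieval",
}
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
result = {}
for term, doc_ids in shuffle(pairs).items():
    key, postings, df = reduce_phase(term, doc_ids)
    result[key] = (postings, df)
```

Each entry of `result` holds both the reverse-index posting list and the document frequency, the quantity from which inverse document frequency is derived.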