Module 9 Flashcards
What is NLP?
produces machine-driven analyses of text
Why is NLP a hard problem?
Language is ambiguous; multiple people may interpret the same text differently
Applications of NLP (amn)
- automatic summarization
- machine translation
- named entity recognition
What is a corpus
a collection of written texts that serves as a dataset
What are token and tokenization
a token is a string of contiguous characters between two spaces; it can be an integer, a real number, or a number with a colon (e.g., a time such as 2:30)
tokenization is the process of converting text into tokens
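Tokenization as described above can be sketched in Python; the regex pattern below is a simplified illustration (real tokenizers handle punctuation, hyphens, and more):

```python
import re

def tokenize(text):
    # Try "number:number" forms (e.g., times) first, then runs of word characters
    return re.findall(r"\d+:\d+|\w+", text)

print(tokenize("The meeting is at 2:30 in room 101"))
# -> ['The', 'meeting', 'is', 'at', '2:30', 'in', 'room', '101']
```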
What is text preprocessing + 3 steps
data is not analyzable without pre-processing steps
- Noise removal
- Lexicon normalization
- Object standardization
what is noise removal?
removal of all noisy entities in text, not relevant to data
what are stopwords
common words such as "is" and "am" that carry little meaning
What is a general approach to noise removal?
- prepare a dictionary of noisy entities and iterate over the text word by word, eliminating the words that appear in the dictionary
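The dictionary approach above can be sketched as follows (the noise list is a hypothetical example):

```python
# Hypothetical dictionary of noisy entities (here, a few stopwords)
noise_dict = {"is", "am", "the", "a", "this"}

def remove_noise(text):
    # Iterate over the text word by word, dropping words found in the dictionary
    return " ".join(w for w in text.split() if w.lower() not in noise_dict)

print(remove_noise("this is a sample text"))  # -> "sample text"
```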
What is lexicon normalization
- converts all disparities (variant forms) of a word to its normal form
- reduces high dimensionality to low dimensionality
- player, played -> play
what are the most common normalization practices
Stemming and lemmatization
what is lemmatization
gets the root of the word -> its dictionary headword (lemma) form
am, are, is -> be
car, cars, car’s -> car
what are morphemes
small meaningful units that make up words
what is stemming
stemming is a rudimentary rule-based process that strips suffixes from words
- automate(s), automatic, automation reduced to automat
other text preprocessing steps (egs)
encoding-decoding noise
grammar checker
spelling correction
What are text-to-features used for, and what are the techniques? (SESW)
- To analyze pre-processed data
- techniques
1. Syntactical Parsing
2. Entities / N-gram / word-based features
3. Statistical features
4. Word embeddings
What is syntactical parsing, what does it involve, and what important attributes
- involves the analysis of the words in a sentence for grammar and their arrangement in a manner that shows the relationships among the words
- Dependency Grammar and Part of Speech (POS) tags are important attributes
what is dependency grammar?
- class of syntactic text analysis that deals with binary relations between two words
- every relation can be represented in the form of a triplet
What is POS tagging
- defines the usage and function of a word in a sentence
Describe the POS tagging problem
- to determine the POS tag for a particular instance of a word
- words often have more than one POS
where can POS tagging be used? (WINE)
Word sense disambiguation (e.g., "book" as a noun vs. a verb)
Improving word-based features
Normalization and lemmatization
Efficient stopword removal
What are the most important chunks of a sentence?
- Entities
Which algorithms are generally ensemble models of rule-based parsing, etc.?
- Entity detection algorithms
What is Named Entity Recognition (NER)
- Process of detecting named entities such as persons, locations, etc. from text
example — {“person”: “Ben”}
What are the three blocks NER has (NPE)
- Noun phrase identification - extracts all noun phrases using dependency parsing and POS
- Phrase classification - all extracted nouns are classified ( location, name etc)
- Entity disambiguation - validation layer on top of results
What is topic modeling, and what does it derive
- the process of automatically identifying topics in a text corpus
- derives the hidden patterns among words in an unsupervised manner
Describe N-grams as features, which ones are more informative, which is most important
- a combination of N words together is called an N-gram
- N-grams with N > 1 are more informative than unigrams
- bigrams (N = 2) are considered the most important
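Extracting N-grams from a token list can be sketched as:

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("this is a sample".split(), 2))
# -> [('this', 'is'), ('is', 'a'), ('a', 'sample')]
```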
What operations does Bag Of Words involve
- Tokenization: all words tokenized
- Vocabulary creation: unique words create vocabulary
- Vector creation: each row is a sentence vector; the number of columns equals the size of the vocabulary
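The three operations above can be sketched in one function (`bag_of_words` is a hypothetical name; sorting the vocabulary just fixes a stable column order):

```python
def bag_of_words(sentences):
    # Tokenization: all words tokenized (here, simple lowercase + split)
    tokenized = [s.lower().split() for s in sentences]
    # Vocabulary creation: unique words form the vocabulary
    vocab = sorted({w for toks in tokenized for w in toks})
    # Vector creation: one row per sentence, one column per vocabulary word
    vectors = [[toks.count(w) for w in vocab] for toks in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the dog sat sat"])
print(vocab)    # -> ['cat', 'dog', 'sat', 'the']
print(vectors)  # -> [[1, 0, 1, 1], [0, 1, 2, 1]]
```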
What is TF-IDF, what does it convert?
- a weighted model used for Information retrieval
- converts text documents into vector models
what is TF
- Term frequency = frequency of word in doc / total number of words in doc
what is IDF
Inverse document frequency = log (total number of documents / documents containing word W)
What is significant about TF-IDF
gives relative importance to a term in corpus
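The TF and IDF formulas above combine directly; a sketch over documents given as token lists (`tf`, `idf`, `tf_idf` are hypothetical helper names):

```python
import math

def tf(word, doc):
    # frequency of word in doc / total number of words in doc
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(total number of documents / documents containing the word)
    # NOTE: raises ZeroDivisionError if the word appears in no document
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

docs = [["a", "b"], ["a", "c"]]
# "a" appears everywhere -> IDF is 0, so its relative importance is 0
print(tf_idf("a", docs[0], docs))  # -> 0.0
```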
What is text classification
a technique to systematically classify a text object
what is text matching/similarity
matching text objects to find similarities
what is Levenshtein distance, list edit operations
minimum number of edits to transform one string into another
insertion, deletion, substitution of single character
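The distance can be computed with the standard dynamic-programming recurrence (a minimal sketch):

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```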
what is Phonetic matching
takes a keyword as input and produces a character string that identifies phonetically similar words
helps in searching large text corpora, correcting spelling errors, and matching relevant names
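One classic phonetic-matching scheme is Soundex; a simplified sketch (the full standard has extra rules for 'h' and 'w' that are omitted here):

```python
# Map consonants to Soundex digit groups; vowels and h/w/y get no code
CODES = {c: d for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

def soundex(word):
    word = word.lower()
    result = word[0].upper()          # keep the first letter
    prev = CODES.get(word[0], 0)
    for c in word[1:]:
        code = CODES.get(c, 0)
        if code and code != prev:     # skip repeats of the same code
            result += str(code)
        prev = code
    return (result + "000")[:4]       # pad/truncate to 4 characters

print(soundex("Smith"), soundex("Smyth"))  # -> S530 S530
```

Phonetically similar names map to the same string, so "Smith" matches "Smyth".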
What is cosine similarity
when text is represented in vector notation, similarity between the vectors can be measured
cosine similarity ranges from 0 to 1 for non-negative text vectors
closer to 1 = the two vectors have the same orientation
closer to 0 = the two vectors have little similarity
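A minimal cosine similarity over two equal-length vectors:

```python
import math

def cosine_similarity(u, v):
    # dot(u, v) / (|u| * |v|); undefined for zero vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

print(cosine_similarity([1, 0], [1, 0]))  # -> 1.0 (same orientation)
print(cosine_similarity([1, 0], [0, 1]))  # -> 0.0 (no similarity)
```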
What is text summarization
given an article, automatically summarize it to produce the most important sentences
what is machine translation
translate text from one language to another
What is Natural Language Generation and Understanding
Generation
Converting info from computer databases or semantic intents into readable human language
Understanding
converting chunks of text into logical structures for computer programs
What is optical character recognition (OCR)
given image representing text, determine corresponding text
what is document-to-information conversion
parsing textual data in documents into an analyzable and clean format
What is a Naive Bayesian classifier and input / output
determines the most probable class label for an object, assuming independence among attributes
Input - variables are discrete
Output - Probability score (proportional to true probability) and Class label (based on highest probability score)
Use cases of NBC
Spam filtering, fraud detection
Describe the Bayes law
P(C|A) = P(A & C) / P(A) = P(A|C) P(C) / P(A)
- C is the class label, A is the attribute vector
How to simplify with the Naive assumption
Assume attribute independence: P(A|C) = product over j of P(aj | C), so P(C|A) is proportional to P(C) * product over j of P(aj | C)
How to build the naive classifier
- Get P(Ci) for all class labels
- Get P(aj | Ci) for all attributes and classes
- Assign the class label that maximizes the value under the naive assumption
List the Naive Bayesian Implementation Considerations
Numerical Underflow
- resulting from multiplying probabilities near 0
- preventable by summing log probabilities instead of multiplying raw probabilities
Zero probabilities
- caused by unobserved attribute/class pairs
- handled by smoothing
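The classifier construction above, together with both implementation considerations (log probabilities against underflow, add-one smoothing against zero probabilities), can be sketched as follows; `train`, `classify`, and `n_values` are hypothetical names, and using a single `n_values` for every attribute is a simplification:

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)  # counts for P(Ci)
    attr_counts = defaultdict(Counter)              # counts for P(aj | Ci)
    for attrs, c in examples:
        for j, a in enumerate(attrs):
            attr_counts[c][(j, a)] += 1
    return class_counts, attr_counts

def classify(attrs, class_counts, attr_counts, n_values):
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n in class_counts.items():
        # Sum logs instead of multiplying probabilities -> no numerical underflow
        score = math.log(n / total)
        for j, a in enumerate(attrs):
            # Add-one (Laplace) smoothing -> no zero probability for unseen pairs
            score += math.log((attr_counts[c][(j, a)] + 1) / (n + n_values))
        if score > best_score:
            best, best_score = c, score
    return best
```

Real implementations typically smooth with a per-attribute value count rather than one global `n_values`.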
List Precision and recall
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
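The two formulas translate directly (the counts in the example are made up):

```python
def precision(tp, fp):
    # Fraction of predicted positives that are truly positive
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual positives that were found
    return tp / (tp + fn)

# e.g., 8 true positives, 2 false positives, 8 false negatives
print(precision(8, 2))  # -> 0.8
print(recall(8, 8))     # -> 0.5
```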
What are the two problems with using VSM
- synonymy: many ways to refer to the same object (car, automobile) -> poor recall (small cosine similarity even though the documents are related)
- polysemy: most words have more than one meaning (model) -> poor precision (large cosine similarity even though the documents are not related)
Solution to the VSM problems
Latent Semantic Indexing
List four steps of Latent Semantic analysis
- term by document matrix
- convert matrix entries to weights
- rank reduced singular value decomposition
- Compute similarities between entities in semantic space with cosine
what is SVD
- tool for dimension reduction
- similarity measure based on co-occurrence
- finds optimal projection into low dimensional space
- generalized least squares method
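The four LSA steps can be sketched with NumPy's SVD (assuming NumPy is available; the term-by-document matrix is a tiny hypothetical example, and weighting is skipped):

```python
import numpy as np

# Term-by-document matrix (rows: terms, cols: documents); raw counts, no weighting
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Rank-reduced singular value decomposition: keep the top k singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the k-dim semantic space

def cosine(u, v):
    # Compute similarities between entities in the semantic space with cosine
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents 0 and 1 share a term, documents 0 and 2 share none
print(cosine(doc_vectors[0], doc_vectors[1]) > cosine(doc_vectors[0], doc_vectors[2]))
# -> True
```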