Natural Language Processing Flashcards

1
Q

Text Processing

A

Deals with the algorithms and techniques used to extract valuable information from documents or text written in a natural language.

2
Q

What are the applications of Text Processing?

6

A

1) Social Media Text Mining
2) Text Classification
3) Recommender Systems
4) Conversational Agents
5) Document Summarization
6) Semantic web

3
Q

What is the difference between structured and unstructured data?

A

Unstructured data = data that lacks a prescribed semantic structure (e.g., free text)
Structured data = data organized according to a predefined schema (e.g., a database)

4
Q

Vector Space

A

Represents documents as features and weights in an n-dimensional vector space

5
Q

Term Frequency (TF)

A

The number of times a term appears in a document.

6
Q

Inverse Document Frequency (IDF)

A

A measure of how rare a term is across the searched collection, based on the number of documents that contain the term.
idf(t,D) = log2(N/df) = log2(total number of documents / number of documents containing the term)
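
A minimal sketch of this formula; the toy corpus and the idf helper below are illustrative only, not part of the card:

```python
import math

def idf(term, documents):
    """Illustrative IDF: log2(N / df), where df is the number of
    documents in the collection that contain the term."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc.split())
    return math.log2(n / df)

docs = ["midterm 1 marks", "midterm 2 marks", "total"]
print(idf("marks", docs))  # log2(3/2) ≈ 0.585
print(idf("total", docs))  # log2(3/1) ≈ 1.585
```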

7
Q

Suppose you have 3 documents: D1 = “midterm 1 marks”, D2 = “midterm 2 marks” and D3 = “total”. The term frequencies in this example would be:

| | midterm | 1 | marks | 2 | total |
| D1 | 1 | 1 | 1 | 0 | 0 |
| D2 | 1 | 0 | 1 | 1 | 0 |
| D3 | 0 | 0 | 0 | 0 | 1 |
| Term Doc Freq | 2 | 1 | 2 | 1 | 1 |

A

IDF Calculations
N = total number of documents
df = number of documents containing the term
IDF(midterm) = IDF(marks) = log2(3/2) = 0.585
IDF(1) = IDF(2) = IDF(total) = log2(3/1) = 1.585

TF.IDF
TF.IDF(D1, midterm) = TF.IDF(D2, midterm) = TF.IDF(D1, marks) = TF.IDF(D2, marks) = 1 * 0.585 = 0.585
TF.IDF(D1, 1) = TF.IDF(D2, 2) = TF.IDF(D3, total) = 1 * 1.585 = 1.585
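
The same calculation as a rough Python sketch; the token lists simply mirror the three documents above, and the values in the comment are rounded:

```python
import math

docs = {
    "D1": ["midterm", "1", "marks"],
    "D2": ["midterm", "2", "marks"],
    "D3": ["total"],
}
n = len(docs)
vocab = {t for tokens in docs.values() for t in tokens}

# document frequency: number of documents containing each term
df = {t: sum(t in tokens for tokens in docs.values()) for t in vocab}

# TF.IDF weight of every term in every document
for name, tokens in docs.items():
    weights = {t: tokens.count(t) * math.log2(n / df[t]) for t in set(tokens)}
    print(name, weights)
# D1 -> {'midterm': 0.585, '1': 1.585, 'marks': 0.585}
```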

8
Q

Precision

A

Number of relevant documents retrieved by a search / total number of documents retrieved by the search (measures how relevant the retrieved results are)

TP/(TP+FP)

9
Q

Recall

A

Number of relevant documents retrieved by a search / total number of relevant documents that exist in the collection (measures how complete the retrieval is)

TP/(TP + FN)

10
Q

F-Score

A

2 * (Precision * Recall) / (Precision + Recall)
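
A small sketch tying the three metrics together; the TP/FP/FN counts are hypothetical:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_score(p, r):
    return 2 * p * r / (p + r)

# hypothetical search result: 8 relevant docs retrieved, 2 irrelevant docs
# retrieved, and 4 relevant docs missed
p, r = precision(tp=8, fp=2), recall(tp=8, fn=4)
print(p, r, f_score(p, r))  # 0.8, 0.667, 0.727 (rounded)
```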

11
Q

Word Embeddings

A

A vector of numbers representing a specific word.
Ex: a TF-IDF word embedding would be based on the frequency of a word

12
Q

Give an example of how word embeddings work for the sentences “I like apples” and “I love apples”.

A

1) Let the vocabulary be X = {I, like, love, apples}
2) Create a vector for each word: I = [1,0,0,0]’ ; like = [0,1,0,0]’ ; love = [0,0,1,0]’ ; apples = [0,0,0,1]’

The method above uses one-hot encoding, but it treats every pair of distinct words as equally different, even though “like” and “love” are similar in meaning.
Similar words should be close to each other spatially (see the sketch below).
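
A rough sketch of the one-hot representation above, showing that every pair of distinct words ends up equally dissimilar; NumPy and the cosine helper are used only for illustration:

```python
import numpy as np

vocab = ["I", "like", "love", "apples"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# any two distinct one-hot vectors are orthogonal, so "like" and "love"
# look just as unrelated as "like" and "apples"
print(cosine(one_hot["like"], one_hot["love"]))    # 0.0
print(cosine(one_hot["like"], one_hot["apples"]))  # 0.0
```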

13
Q

What are the Word2Vec methods?

2

A

1) Continuous Bag of Words (CBOW)
2) Skip Gram

14
Q

Explain the CBOW method.

A

CBOW (Continuous Bag of Words)
Goal: Predict the target word based on its context words.

Input: Context words (surrounding words).

Output: The target (center) word.

Example: Given the sentence: “The cat sits on the mat”

For the word “sits”, with window size 2, context = [“The”, “cat”, “on”, “the”]

CBOW tries to predict “sits” using those context words.

15
Q

Explain the skip gram method.

A

Skip-gram
Goal: Predict the context words from a target word.

Input: The target (center) word.

Output: Context words (surrounding words).

Example: Given the same sentence:

For the word “sits”, with window size 2, output = [“The”, “cat”, “on”, “the”]

Skip-gram uses “sits” to predict each of those surrounding words.
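
A small sketch (not a full Word2Vec training loop) of how the (context, target) pairs for CBOW and the (target, context) pairs for skip-gram would be generated from the example sentence, assuming a window size of 2:

```python
sentence = "The cat sits on the mat".lower().split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(sentence):
    # context words within the window, excluding the target itself
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))               # CBOW: context -> target
    skipgram_pairs += [(target, c) for c in context]   # skip-gram: target -> context word

print(cbow_pairs[2])
# (['the', 'cat', 'on', 'the'], 'sits')
print([p for p in skipgram_pairs if p[0] == "sits"])
# [('sits', 'the'), ('sits', 'cat'), ('sits', 'on'), ('sits', 'the')]
```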

16
Q

String matching

A

The process of comparing two or more words using a specific set of rules.

17
Q

What are the applications of string matching?

4

A

1) Search engines
2) Spell Checkers
3) Profile matches
4) DNA sequence comparison

18
Q

What are the types of string matching algorithms?

2

A

1) Exact string matching (Jennifer ≠ Jenifer) - given a pair of strings, every character of the first string must match the corresponding character of the second string
2) Approximate string matching (Jennifer ≈ Jenifer) - given a pair of strings, the two strings only need to be similar enough

19
Q

What are the string matching algorithms?

4

A

1) Phonetic
2) Token-based
3) Pattern-matching
4) Hybrid

20
Q

Token-based string matching

A

A type of string matching that compares tokens rather than individual characters, so “John Smith” matches “Smith John” because both contain the tokens John and Smith.

21
Q

What are the different token-based string matching methods?

6

A

1) Jaccard Similarity
2) K-Shingling
3) Minhashing
4) Hamming Distance
5) Cohen TF.IDF
6) Monge and Elkan’s Atomic String

22
Q

Explain the Jaccard Similarity method of token-based string matching.

A

The Jaccard similarity is a simple and intuitive way to measure the similarity between two sets. In the context of token matching, it compares the sets of tokens (e.g., words, characters, or other elements) from two text strings to see how much they overlap.

Formula:
Jaccard Similarity = |A ∩ B| / |A ∪ B|

Where:
A and B are sets of tokens from the two strings.
|A ∩ B| is the number of tokens common to both sets.
|A ∪ B| is the total number of unique tokens in both sets.

Example:
Let’s say we have two sentences:

String 1: “apple banana mango”

String 2: “banana mango peach”

Tokens:

Set A = {“apple”, “banana”, “mango”}

Set B = {“banana”, “mango”, “peach”}

Intersection: {“banana”, “mango”} → size = 2
Union: {“apple”, “banana”, “mango”, “peach”} → size = 4

Jaccard Similarity = 2 / 4 = 0.5
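
A minimal sketch of this calculation, assuming simple whitespace tokenization:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity of two strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

print(jaccard("apple banana mango", "banana mango peach"))  # 0.5
```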

23
Q

Explain the K-Shingles method of token-based string matching.

A

The k-shingles of a document are all of the possible consecutive substrings of length k found in it.
Ex: In the string “abcdabd” with k = 2, the set of shingles is D = {ab, bc, cd, da, bd}
When using this method we often want to remove stop words like “and”, “you” and “to”.
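
A minimal sketch of character k-shingling (stop-word removal is omitted):

```python
def shingles(text: str, k: int = 2) -> set[str]:
    """All distinct consecutive substrings of length k (character k-shingles)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(shingles("abcdabd", 2))  # {'ab', 'bc', 'cd', 'da', 'bd'}
```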

24
Q

Explain the Minhashing method of token-based string matching.

A

Suppose that the universal set is {m,n,o,p,q}, with the elements mapped to rows 0-4.
Here S1 = {n,o,p}, S2 = {m,p}, S3 = {n,o,q} and S4 = {m,n,q}
| row | S1 | S2 | S3 | S4 | (x+3) mod 5 | (2x+1) mod 5 |
| 0 (m) | 0 | 1 | 0 | 1 | 3 | 1 |
| 1 (n) | 1 | 0 | 1 | 1 | 4 | 3 |
| 2 (o) | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 (p) | 1 | 1 | 0 | 0 | 1 | 2 |
| 4 (q) | 0 | 0 | 1 | 1 | 2 | 4 |

h1(x) = (x+3) mod 5
h2(x) = (2x+1) mod 5

Signature Matrix

Step 1: Initially all infinity.
| | S1 | S2 | S3 | S4 |
| h1 | inf | inf | inf | inf |
| h2 | inf | inf | inf | inf |

Step 2: Row 0. S2 and S4 have 1s, so their entries take h1(0) = 3 and h2(0) = 1, which are less than infinity.
| | S1 | S2 | S3 | S4 |
| h1 | inf | 3 | inf | 3 |
| h2 | inf | 1 | inf | 1 |

Step 3: Row 1. S1, S3 and S4 have 1s. S1 and S3 take h1(1) = 4 and h2(1) = 3; S4 keeps (3, 1) because its current values are already smaller.
| | S1 | S2 | S3 | S4 |
| h1 | 4 | 3 | 4 | 3 |
| h2 | 3 | 1 | 3 | 1 |

Step 4: Row 2. S1 and S3 have 1s; h1(2) = 0 and h2(2) = 0 are smaller than their current values, so both columns update.
| | S1 | S2 | S3 | S4 |
| h1 | 0 | 3 | 0 | 3 |
| h2 | 0 | 1 | 0 | 1 |

Step 5: Row 3. S1 and S2 have 1s; h1(3) = 1 and h2(3) = 2. S1 keeps (0, 0); S2 lowers h1 from 3 to 1 and keeps h2 = 1.
| | S1 | S2 | S3 | S4 |
| h1 | 0 | 1 | 0 | 3 |
| h2 | 0 | 1 | 0 | 1 |

Step 6: Row 4. S3 and S4 have 1s; h1(4) = 2 and h2(4) = 4. S3 keeps (0, 0); S4 lowers h1 from 3 to 2 and keeps h2 = 1.

Final signature matrix:
| | S1 | S2 | S3 | S4 |
| h1 | 0 | 1 | 0 | 2 |
| h2 | 0 | 1 | 0 | 1 |
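
A rough sketch that reproduces the walkthrough above in code; the characteristic matrix and the two hash functions are taken from the example:

```python
# characteristic matrix: rows = elements 0..4 (m, n, o, p, q), columns = S1..S4
rows = [
    [0, 1, 0, 1],  # m
    [1, 0, 1, 1],  # n
    [1, 0, 1, 0],  # o
    [1, 1, 0, 0],  # p
    [0, 0, 1, 1],  # q
]
hashes = [lambda x: (x + 3) % 5, lambda x: (2 * x + 1) % 5]

INF = float("inf")
signature = [[INF] * 4 for _ in hashes]   # one row per hash function

for x, row in enumerate(rows):
    for col, has_element in enumerate(row):
        if has_element:
            # keep the minimum hash value seen for each (hash, set) pair
            for i, h in enumerate(hashes):
                signature[i][col] = min(signature[i][col], h(x))

print(signature)  # [[0, 1, 0, 2], [0, 1, 0, 1]]
```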

25
Q

What are the pattern based matching algorithms?

3

A

1) Edit Distance
2) Jaro Distance
3) Q-gram

26
Q

What is the q-gram metric?

A

The q-gram similarity between two strings is the number of q-grams they have in common divided by the number of q-grams in the shorter string. A related q-gram distance counts the q-grams that are not shared:

Distance(A, B) = |Q(A)| + |Q(B)| − 2·|Q(A) ∩ Q(B)|

Example:
String A = “night”
String B = “nacht”
q = 2 (bigrams)

A → {“ni”, “ig”, “gh”, “ht”}

B → {“na”, “ac”, “ch”, “ht”}

Shared q-grams: {“ht”} → 1 common

Distance = 4 + 4 - 2×1 = 6
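
A minimal sketch of the distance version, treating the q-grams as sets (as in the example above):

```python
def qgrams(s: str, q: int = 2) -> set[str]:
    """Distinct consecutive substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_distance(a: str, b: str, q: int = 2) -> int:
    qa, qb = qgrams(a, q), qgrams(b, q)
    return len(qa) + len(qb) - 2 * len(qa & qb)

print(qgram_distance("night", "nacht"))  # 4 + 4 - 2*1 = 6
```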

27
Q

What is the q-gram for the following example?
‘apple’ = {##a, #ap, app, ppl, ple, le#, e##}
‘apply’ = {##a, #ap, app, ppl, ply, ly#, y##}

A

The strings are the same length, so either one can serve as the “shorter” string.
3-grams in each string = 7
Common 3-grams = 4 (##a, #ap, app, ppl)
q-gram similarity = 4/7 ≈ 0.57

28
Q

What is the hidden state in an RNN model?

A

The hidden state is the vector an RNN carries from one time step to the next; it is how the network remembers information about the sequence it has processed so far.

29
Q

What are the types of RNN models?

2

A

1) One-to-many
2) Many-to-one

30
Q

Long Short-Term Memory

A

Type of RNN designed to address the vanishing gradient problem. Its components include:
* Cell state - represents the memory of the network. Information flows through the cell state using gates that either retain or discard it.
* Input gate - determines how much new information should be stored in the cell state
* Forget gate - controls what information should be removed from the cell state
* Output gate - determines which part of the cell state should be output for the final prediction
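
A minimal usage sketch; PyTorch is an assumption here (the card does not name a framework), and nn.LSTM implements the cell state and gates listed above internally:

```python
import torch
import torch.nn as nn

# an LSTM over sequences of 10-dimensional inputs with a 20-dimensional
# hidden/cell state
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(4, 7, 10)         # batch of 4 sequences, each of length 7
output, (h_n, c_n) = lstm(x)      # h_n: final hidden state, c_n: final cell state
print(output.shape, h_n.shape, c_n.shape)
# torch.Size([4, 7, 20]) torch.Size([1, 4, 20]) torch.Size([1, 4, 20])
```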

31
Q

Hamming Distance

A

The hamming distance between two vectors is the number of components in which the two vectors differ.

Example:
Let’s compare:

String A: “karolin”

String B: “kathrin”

Now compare each character:

| Position | A | B | Match? |
| 1 | k | k | Yes |
| 2 | a | a | Yes |
| 3 | r | t | No |
| 4 | o | h | No |
| 5 | l | r | No |
| 6 | i | i | Yes |
| 7 | n | n | Yes |
Hamming distance = 3 (positions 3, 4, and 5 differ)
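
A minimal sketch of this comparison (the second call previews the binary example on the next card):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must be the same length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("karolin", "kathrin"))   # 3
print(hamming("1110010", "1011001"))   # 4
```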

32
Q

What is the hamming distance between 1110010 and 1011001.

A

1110010
1011001

The vectors differ in the second, fourth, sixth and seventh positions (they agree only in the first, third and fifth).
Therefore hamming distance = 4.