Natural Language Processing Flashcards
Text Processing
Deals with algorithms and techniques for retrieving valuable information from documents or text written in a natural language.
What are the applications of Text Processing?
6
1) Social Media Text Mining
2) Text Classification
3) Recommender Systems
4) Conversational Agents
5) Document Summarization
6) Semantic web
What is the difference between structured and unstructured data?
unstructured data = data that lacks a prescribed semantic structure
structured data = data organized according to a fixed schema, such as a database
Vector Space
Represents documents as features and weights in an n-dimensional vector space
Term Frequency (TF)
The number of times a term appears in a document.
Inverse Document Frequency (IDF)
A measure of how rare a term is across a document collection: the fewer documents contain the term, the higher its IDF.
idf(t, D) = log2(N / df(t)) = log2(total number of documents / number of documents containing the term)
Suppose you have 3 documents. D1 = midterm 1 marks, D2 = midterm 2 marks and D3 = total. The term frequency in this example would be:
| Doc | mid | 1 | marks | 2 | total |
| D1 | 1 | 1 | 1 | 0 | 0 |
| D2 | 1 | 0 | 1 | 1 | 0 |
| D3 | 0 | 0 | 0 | 0 | 1 |
| Term Doc Freq | 2 | 1 | 2 | 1 | 1 |
IDF Calculations
N = total number of documents
df = number of documents containing the term
IDF(mid) = IDF(marks) = log2(3/2) = 0.585
IDF(1) = IDF(2) = IDF(total) = log2(3/1) = 1.585
TF.IDF
TF.IDF(D1, mid) = TF.IDF(D2, mid) = TF.IDF(D1, marks) = TF.IDF(D2, marks) = 1 * 0.585 = 0.585
TF.IDF(D1, 1) = TF.IDF(D2, 2) = TF.IDF(D3, total) = 1 * 1.585 = 1.585
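A minimal Python sketch of these TF.IDF calculations (whitespace tokenization and the base-2 log follow the example above):

```python
import math

# The three documents from the example above.
docs = {
    "D1": "midterm 1 marks".split(),
    "D2": "midterm 2 marks".split(),
    "D3": "total".split(),
}

N = len(docs)
vocab = {term for doc in docs.values() for term in doc}

# Document frequency: number of documents containing each term.
df = {t: sum(t in doc for doc in docs.values()) for t in vocab}

# IDF with log base 2, as in the flashcard.
idf = {t: math.log2(N / df[t]) for t in vocab}

# TF.IDF: raw term count times IDF.
for name, doc in docs.items():
    for t in set(doc):
        print(name, t, doc.count(t) * idf[t])
```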
Precision
Number of relevant documents retrieved by a search / total number of documents retrieved by the search (a measure of result relevancy)
TP/(TP+FP)
Recall
Number of relevant documents retrieved by a search / total number of relevant documents in the collection (a measure of how completely relevant documents are found)
TP/(TP + FN)
F-Score
2 * (Precision * Recall) / (Precision + Recall)
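A small sketch of all three metrics computed from true-positive, false-positive, and false-negative counts; the counts below are made up for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-score from retrieval counts."""
    precision = tp / (tp + fp)   # fraction of retrieved docs that are relevant
    recall = tp / (tp + fn)      # fraction of relevant docs that were retrieved
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical search: 8 relevant retrieved, 2 irrelevant retrieved, 4 relevant missed.
print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
```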
Word Embeddings
Vector of numbers representing a specific word.
Ex: a TF-IDF word embedding would be based on the frequency of a word
Give an example of how word embeddings works for the sentences “I like apples” and “I love apples”.
1) Let the vocabulary be X = {I, like, love, apples}
2) Create a vector for each word: I = [1,0,0,0]' ; like = [0,1,0,0]' ; love = [0,0,1,0]' ; apples = [0,0,0,1]'
The method above uses one-hot encoding, but it is flawed because every pair of distinct words is treated as equally different (even though "like" and "love" are similar in meaning).
Similar words should be close spatially.
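A minimal one-hot encoding sketch for the vocabulary above, illustrating why every pair of distinct words ends up equally far apart:

```python
# One-hot encoding over the example vocabulary.
vocab = ["I", "like", "love", "apples"]

def one_hot(word):
    """Return the one-hot vector for a word in the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("like"))  # [0, 1, 0, 0]
print(one_hot("love"))  # [0, 0, 1, 0]  -- no closer to "like" than any other word
```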
What are the Word2Vec methods?
2
1) Continuous Bag of Words (CBOW)
2) Skip Gram
Explain the CBOW method.
CBOW (Continuous Bag of Words)
Goal: Predict the target word based on its context words.
Input: Context words (surrounding words).
Output: The target (center) word.
Example: Given the sentence: “The cat sits on the mat”
For the word “sits”, with window size 2, context = [“The”, “cat”, “on”, “the”]
CBOW tries to predict “sits” using those context words.
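A small pure-Python sketch of extracting the (context → target) pairs a CBOW model would train on for this sentence:

```python
# Build (context -> target) training pairs for CBOW with window size 2.
sentence = "The cat sits on the mat".split()
window = 2

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(context, "->", target)
# For i = 2: ['The', 'cat', 'on', 'the'] -> sits
```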
Explain the skip gram method.
Skip-gram
Goal: Predict the context words from a target word.
Input: The target (center) word.
Output: Context words (surrounding words).
Example: Given the same sentence:
For the word “sits”, with window size 2, output = [“The”, “cat”, “on”, “the”]
Skip-gram uses “sits” to predict each of those surrounding words.
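A training sketch, assuming the gensim library is available; its sg parameter chooses between the two Word2Vec variants:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sits", "on", "the", "mat"]]

# sg=1 trains skip-gram; sg=0 (the default) trains CBOW.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["sits"][:5])  # first 5 dimensions of the embedding for "sits"
```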
String matching
The process of comparing two or more words using a specific set of rules.
What are the applications of string matching?
4
1) Search engines
2) Spell Checkers
3) Profile matches
4) DNA sequence comparison
What are the types of string matching algorithms?
2
1) Exact string matching (Jennifer ≠ Jenifer): every character of the first string must match the corresponding character of the second string.
2) Approximate string matching (Jennifer ≈ Jenifer): the two strings only need to be similar enough.
What are the string matching algorithms?
4
1) Phonetic
2) Token-based
3) Pattern-matching
4) Hybrid
Token-based string matching
A type of string matching that compares tokens rather than characters, so "John Smith" matches "Smith John" since both contain the tokens John and Smith.
What are the different token-based string matching methods?
6
1) Jaccard Similarity
2) K-Shingling
3) Minhashing
4) Hamming Distance
5) Cohen TF.IDF
6) Monge and Elkan’s Atomic String
Explain the Jaccard Similarity method of token-based string matching.
The Jaccard similarity is a simple and intuitive way to measure the similarity between two sets. In the context of token matching, it compares the sets of tokens (e.g., words, characters, or other elements) from two text strings to see how much they overlap.
Formula:
Jaccard Similarity = |A ∩ B| / |A ∪ B|
Where:
A and B are sets of tokens from the two strings.
∣A∩B∣ is the number of tokens common to both sets.
∣A∪B∣ is the total number of unique tokens in both sets.
Example:
Let’s say we have two sentences:
String 1: “apple banana mango”
String 2: “banana mango peach”
Tokens:
Set A = {“apple”, “banana”, “mango”}
Set B = {“banana”, “mango”, “peach”}
Intersection: {“banana”, “mango”} → size = 2
Union: {“apple”, “banana”, “mango”, “peach”} → size = 4
Jaccard Similarity = 2 / 4 = 0.5
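A short Python sketch of token-level Jaccard similarity, reproducing the example:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

print(jaccard("apple banana mango", "banana mango peach"))  # 0.5
```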
Explain the K-Shingles method of token-based string matching.
The k-shingles of a document are all the possible consecutive substrings of length k found in it.
Ex: In the string "abcdabd" with k = 2, the set of shingles is D = {ab, bc, cd, da, bd}
When using this method we often want to remove stop words like "and", "you" and "to".
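A minimal k-shingling sketch (stop-word removal omitted for brevity):

```python
def shingles(text: str, k: int) -> set:
    """Set of all consecutive substrings of length k."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(sorted(shingles("abcdabd", 2)))  # ['ab', 'bc', 'bd', 'cd', 'da']
```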
Explain the Minhashing method of token-based string matching.
Suppose that the universal set is {m, n, o, p, q}, with rows 0-4 of the characteristic matrix corresponding to m, n, o, p, q.
Here S1 = {n, o, q}, S2 = {m, p}, S3 = {n, o, q} and S4 = {m, n, q}
| Row | S1 | S2 | S3 | S4 | (x+3) mod 5 | (2x+1) mod 5 |
| 0 (m) | 0 | 1 | 0 | 1 | 3 | 1 |
| 1 (n) | 1 | 0 | 1 | 1 | 4 | 3 |
| 2 (o) | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 (p) | 0 | 1 | 0 | 0 | 1 | 2 |
| 4 (q) | 1 | 0 | 1 | 1 | 2 | 4 |
h1(x) = (x+3) mod 5
h2(x) = (2x+1) mod 5
Signature Matrix
Step 1: Initially all infinity
| | S1 | S2 | S3 | S4 |
| h1 | inf | inf | inf | inf |
| h2 | inf | inf | inf | inf |
Step 2: Process row 0. In this row S2 and S4 have value 1, so replace their entries with this row's hash values (h1 = 3, h2 = 1), which are less than infinity.
| | S1 | S2 | S3 | S4 |
| h1 | inf | 3 | inf | 3 |
| h2 | inf | 1 | inf | 1 |
Step 3: Process row 1. In this row S1, S3 and S4 have value 1. S1 and S3 take the hash values (h1 = 4, h2 = 3); S4 keeps (3, 1) since its current entries are already smaller.
| | S1| S2 | S3 | S4 |
| h1 | 4 | 3 | 4 | 3 |
| h2 | 3 | 1 | 3 | 1 |
Step 4: Process row 2. In this row S1 and S3 have value 1. The hash values (h1 = 0, h2 = 0) are smaller than the current entries, so both columns update.
| | S1| S2 | S3 | S4 |
| h1 | 0 | 3 | 0 | 3 |
| h2 | 0 | 1 | 0 | 1 |
Step 5: Process row 3. Only S2 has value 1. The hash values are (h1 = 1, h2 = 2); 1 < 3 so S2's h1 entry updates, while 2 is not smaller than 1, so S2 becomes (1, 1).
Step 6: Process row 4. In this row S1, S3 and S4 have value 1. The hash values are (h1 = 2, h2 = 4); only S4's h1 entry updates (2 < 3), giving the final signature matrix:
| | S1 | S2 | S3 | S4 |
| h1 | 0 | 1 | 0 | 2 |
| h2 | 0 | 1 | 0 | 1 |
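A short Python sketch reproducing the signature computation above; the row-to-element mapping (rows 0-4 = m, n, o, p, q) follows the example:

```python
# Minhash signatures for the four sets in the worked example.
sets = {
    "S1": {1, 2, 4},  # {n, o, q}
    "S2": {0, 3},     # {m, p}
    "S3": {1, 2, 4},  # {n, o, q}
    "S4": {0, 1, 4},  # {m, n, q}
}
hashes = [lambda x: (x + 3) % 5, lambda x: (2 * x + 1) % 5]

INF = float("inf")
sig = {name: [INF] * len(hashes) for name in sets}

for row in range(5):                  # scan the characteristic matrix row by row
    for name, members in sets.items():
        if row in members:            # this column has a 1 in this row
            for i, h in enumerate(hashes):
                sig[name][i] = min(sig[name][i], h(row))

print(sig)  # {'S1': [0, 0], 'S2': [1, 1], 'S3': [0, 0], 'S4': [2, 1]}
```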
What are the pattern based matching algorithms?
3
Edit Distance
Jaro Distance
Q-gram
What is the q-gram metric?
The q-gram similarity between two strings is the number of q-grams they have in common divided by the number of q-grams in the shorter string. A related q-gram distance counts the q-grams not shared:
Distance(A, B) = |Q(A)| + |Q(B)| − 2·|Q(A) ∩ Q(B)|
Example:
String A = “night”
String B = “nacht”
q = 2 (bigrams)
A → {“ni”, “ig”, “gh”, “ht”}
B → {“na”, “ac”, “ch”, “ht”}
Shared q-grams: {“ht”} → 1 common
Distance = 4 + 4 - 2×1 = 6
What is the q-gram similarity for the following example (3-grams with '#' padding)?
‘apple’ = {##a, #ap, app, ppl, ple, le#, e##}
‘apply’ = {##a, #ap, app, ppl, ply, ly#, y##}
The strings are the same length, so either can serve as the shorter string.
3-grams in the shorter string = 7
common 3-grams = 4 (##a, #ap, app, ppl)
3-gram similarity = 4/7
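A small sketch of padded q-gram similarity matching this example (the '#' padding convention is the one used above):

```python
def qgrams(text: str, q: int) -> set:
    """Padded q-grams: pad with q-1 '#' characters on each side."""
    padded = "#" * (q - 1) + text + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a: str, b: str, q: int) -> float:
    """Common q-grams divided by the q-gram count of the shorter string."""
    A, B = qgrams(a, q), qgrams(b, q)
    return len(A & B) / min(len(A), len(B))

print(sorted(qgrams("apple", 3)))            # ['##a', '#ap', 'app', 'e##', 'le#', 'ple', 'ppl']
print(qgram_similarity("apple", "apply", 3)) # 0.571... (= 4/7)
```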
What is the hidden state in an RNN model?
It is when an RNN remembers some information about a sequence.
What are the types of RNN models?
2
1) One-to-many
2) Many-to-one
Long Short-Term Memory
Type of RNN designed to address the vanishing gradient problem. Its components include:
* Cell state - represents the memory of the network. Information flows through the state using gates that either retain or discard info.
* Input gate - determines how much info should be stored in the cell state
* Forget gate - controls what information should be removed from a cell state
* Output gate - determines which part of the cell state should be output as the final prediction
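A minimal sketch, assuming PyTorch, of running an LSTM over a toy input sequence; the sizes are arbitrary:

```python
import torch
import torch.nn as nn

# One LSTM layer: 8 input features per time step, 16-dimensional hidden state.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(1, 5, 8)       # batch of 1 sequence, 5 time steps, 8 features
output, (h_n, c_n) = lstm(x)   # h_n: final hidden state, c_n: final cell state
print(output.shape)            # torch.Size([1, 5, 16])
```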
Hamming Distance
The hamming distance between two vectors is the number of components in which the two vectors differ.
Example:
Let’s compare:
String A: “karolin”
String B: “kathrin”
Now compare each character:
| Position | A | B | Match? |
| 1 | k | k | Yes |
| 2 | a | a | Yes |
| 3 | r | t | No |
| 4 | o | h | No |
| 5 | l | r | No |
| 6 | i | i | Yes |
| 7 | n | n | Yes |
Hamming distance = 3 (positions 3, 4, and 5 differ)
What is the Hamming distance between 1110010 and 1011001?
1110010
1011001
The vectors agree only in the first, third and fifth positions and differ in the other four.
Therefore the Hamming distance = 4.
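A one-function Python sketch of Hamming distance covering both examples:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b), "Hamming distance requires equal lengths"
    return sum(x != y for x, y in zip(a, b))

print(hamming("karolin", "kathrin"))   # 3
print(hamming("1110010", "1011001"))   # 4
```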