Natural Language Processing Flashcards
Text Processing
Deals with algorithms and techniques for retrieving valuable information from documents or text written in a natural language.
What are the applications of Text Processing?
6
1) Social Media Text Mining
2) Text Classification
3) Recommender Systems
4) Conversational Agents
5) Document Summarization
6) Semantic web
What is the difference between structured and unstructured data?
unstructured data = data that lacks a prescribed semantic structure
structured data = data organized according to a fixed schema, such as a database
Vector Space
Represents documents as features and weights in an n-dimensional vector space
Term Frequency (TF)
The number of times a term appears in a document.
Inverse Document Frequency (IDF)
A measure of how rare a term is across a document collection: the fewer documents contain the term, the higher its IDF.
idf(t, D) = log2(N / df(t)) = log2(total number of documents / number of documents containing the term)
Suppose you have 3 documents. D1 = midterm 1 marks, D2 = midterm 2 marks and D3 = total. The term frequency in this example would be:
| Doc | mid | 1 | marks | 2 | total |
| D1 | 1 | 1 | 1 | 0 | 0 |
| D2 | 1 | 0 | 1 | 1 | 0 |
| D3 | 0 | 0 | 0 | 0 | 1 |
| Term Doc Freq | 2 | 1 | 2 | 1 | 1 |
IDF Calculations
N = total number of documents
df = number of documents containing the term
IDF(mid) = IDF(marks) = log2(3/2) = 0.585
IDF(1) = IDF(2) = IDF(total) = log2(3/1) = 1.585
TF.IDF
TF.IDF(D1, mid) = TF.IDF(D2, mid) = TF.IDF(D1, marks) = TF.IDF(D2, marks) = 1 * 0.585 = 0.585
TF.IDF(D1, 1) = TF.IDF(D2, 2) = TF.IDF(D3, total) = 1 * 1.585 = 1.585
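A minimal Python sketch of these TF.IDF calculations (whitespace tokenization and the base-2 log follow the example above):

```python
import math

# The three documents from the example above.
docs = {
    "D1": "midterm 1 marks".split(),
    "D2": "midterm 2 marks".split(),
    "D3": "total".split(),
}

N = len(docs)
vocab = {term for doc in docs.values() for term in doc}

# Document frequency: number of documents containing each term.
df = {t: sum(t in doc for doc in docs.values()) for t in vocab}

# IDF with log base 2, as in the flashcard.
idf = {t: math.log2(N / df[t]) for t in vocab}

# TF.IDF: raw term count times IDF.
for name, doc in docs.items():
    for t in set(doc):
        print(name, t, doc.count(t) * idf[t])
```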
Precision
Number of relevant documents retrieved by a search / total number of documents retrieved by the search (a measure of result relevancy)
TP/(TP+FP)
Recall
Number of relevant documents retrieved by a search / total number of relevant documents in the collection (a measure of how completely relevant documents are found)
TP/(TP + FN)
F-Score
2 * (Precision * Recall) / (Precision + Recall)
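A small sketch of all three metrics computed from true-positive, false-positive, and false-negative counts; the counts below are made up for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-score from retrieval counts."""
    precision = tp / (tp + fp)   # fraction of retrieved docs that are relevant
    recall = tp / (tp + fn)      # fraction of relevant docs that were retrieved
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical search: 8 relevant retrieved, 2 irrelevant retrieved, 4 relevant missed.
print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
```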
Word Embeddings
Vector of numbers representing a specific word.
Ex: a TF-IDF word embedding would be based on the frequency of a word
Give an example of how word embeddings works for the sentences “I like apples” and “I love apples”.
1) Let the vocabulary be X = {I, like, love, apples}
2) Create a vector for each word: I = [1,0,0,0]' ; like = [0,1,0,0]' ; love = [0,0,1,0]' ; apples = [0,0,0,1]'
The method above uses one-hot encoding, but it is flawed because every pair of distinct words is treated as equally different (even though "like" and "love" are similar in meaning).
Similar words should be close spatially.
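A minimal one-hot encoding sketch for the vocabulary above, illustrating why every pair of distinct words ends up equally far apart:

```python
# One-hot encoding over the example vocabulary.
vocab = ["I", "like", "love", "apples"]

def one_hot(word):
    """Return the one-hot vector for a word in the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("like"))  # [0, 1, 0, 0]
print(one_hot("love"))  # [0, 0, 1, 0]  -- no closer to "like" than any other word
```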
What are the Word2Vec methods?
2
1) Continuous Bag of Words (CBOW)
2) Skip Gram
Explain the CBOW method.
CBOW (Continuous Bag of Words)
Goal: Predict the target word based on its context words.
Input: Context words (surrounding words).
Output: The target (center) word.
Example: Given the sentence: “The cat sits on the mat”
For the word “sits”, with window size 2, context = [“The”, “cat”, “on”, “the”]
CBOW tries to predict “sits” using those context words.
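A small pure-Python sketch of extracting the (context → target) pairs a CBOW model would train on for this sentence:

```python
# Build (context -> target) training pairs for CBOW with window size 2.
sentence = "The cat sits on the mat".split()
window = 2

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(context, "->", target)
# For i = 2: ['The', 'cat', 'on', 'the'] -> sits
```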
Explain the skip gram method.
Skip-gram
Goal: Predict the context words from a target word.
Input: The target (center) word.
Output: Context words (surrounding words).
Example: Given the same sentence:
For the word “sits”, with window size 2, output = [“The”, “cat”, “on”, “the”]
Skip-gram uses “sits” to predict each of those surrounding words.
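A training sketch, assuming the gensim library is available; its sg parameter chooses between the two Word2Vec variants:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sits", "on", "the", "mat"]]

# sg=1 trains skip-gram; sg=0 (the default) trains CBOW.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["sits"][:5])  # first 5 dimensions of the embedding for "sits"
```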
String matching
The process of comparing two or more words using a specific set of rules.
What are the applications of string matching?
4
1) Search engines
2) Spell Checkers
3) Profile matches
4) DNA sequence comparison
What are the types of string matching algorithms?
2
1) Exact string matching (Jennifer ≠ Jenifer): every character of the first string must match the corresponding character of the second string.
2) Approximate string matching (Jennifer ≈ Jenifer): the two strings only need to be similar enough.
What are the string matching algorithms?
4
1) Phonetic
2) Token-based
3) Pattern-matching
4) Hybrid
Token-based string matching
A type of string matching that compares tokens rather than characters, so "John Smith" matches "Smith John" since both contain the tokens John and Smith.
What are the different token-based string matching methods?
6
1) Jaccard Similarity
2) K-Shingling
3) Minhashing
4) Hamming Distance
5) Cohen TF.IDF
6) Monge and Elkan’s Atomic String
Explain the Jaccard Similarity method of token-based string matching.
The Jaccard similarity is a simple and intuitive way to measure the similarity between two sets. In the context of token matching, it compares the sets of tokens (e.g., words, characters, or other elements) from two text strings to see how much they overlap.
Formula:
Jaccard Similarity = |A ∩ B| / |A ∪ B|
Where:
A and B are sets of tokens from the two strings.
∣A∩B∣ is the number of tokens common to both sets.
∣A∪B∣ is the total number of unique tokens in both sets.
Example:
Let’s say we have two sentences:
String 1: “apple banana mango”
String 2: “banana mango peach”
Tokens:
Set A = {“apple”, “banana”, “mango”}
Set B = {“banana”, “mango”, “peach”}
Intersection: {“banana”, “mango”} → size = 2
Union: {“apple”, “banana”, “mango”, “peach”} → size = 4
Jaccard Similarity = 2 / 4 = 0.5
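A short Python sketch of token-level Jaccard similarity, reproducing the example:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

print(jaccard("apple banana mango", "banana mango peach"))  # 0.5
```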
Explain the K-Shingles method of token-based string matching.
The k-shingles of a document are all the possible consecutive substrings of length k found in it.
Ex: In the string "abcdabd" with k = 2, the set of shingles is D = {ab, bc, cd, da, bd}
When using this method we often want to remove stop words like "and", "you" and "to".
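A minimal k-shingling sketch (stop-word removal omitted for brevity):

```python
def shingles(text: str, k: int) -> set:
    """Set of all consecutive substrings of length k."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(sorted(shingles("abcdabd", 2)))  # ['ab', 'bc', 'bd', 'cd', 'da']
```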
Explain the Minhashing method of token-based string matching.
Suppose that the universal set is {m, n, o, p, q}, with rows 0-4 of the characteristic matrix corresponding to m, n, o, p, q.
Here S1 = {n, o, q}, S2 = {m, p}, S3 = {n, o, q} and S4 = {m, n, q}
| Row | S1 | S2 | S3 | S4 | (x+3) mod 5 | (2x+1) mod 5 |
| 0 (m) | 0 | 1 | 0 | 1 | 3 | 1 |
| 1 (n) | 1 | 0 | 1 | 1 | 4 | 3 |
| 2 (o) | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 (p) | 0 | 1 | 0 | 0 | 1 | 2 |
| 4 (q) | 1 | 0 | 1 | 1 | 2 | 4 |
h1(x) = (x+3) mod 5
h2(x) = (2x+1) mod 5
Signature Matrix
Step 1: Initially all infinity
| | S1 | S2 | S3 | S4 |
| h1 | inf | inf | inf | inf |
| h2 | inf | inf | inf | inf |
Step 2: Process row 0. In this row S2 and S4 have value 1, so replace their entries with this row's hash values (h1 = 3, h2 = 1), which are less than infinity.
| | S1 | S2 | S3 | S4 |
| h1 | inf | 3 | inf | 3 |
| h2 | inf | 1 | inf | 1 |
Step 3: Process row 1. In this row S1, S3 and S4 have value 1. S1 and S3 take the hash values (h1 = 4, h2 = 3); S4 keeps (3, 1) since its current entries are already smaller.
| | S1| S2 | S3 | S4 |
| h1 | 4 | 3 | 4 | 3 |
| h2 | 3 | 1 | 3 | 1 |
Step 4: Process row 2. In this row S1 and S3 have value 1. The hash values (h1 = 0, h2 = 0) are smaller than the current entries, so both columns update.
| | S1| S2 | S3 | S4 |
| h1 | 0 | 3 | 0 | 3 |
| h2 | 0 | 1 | 0 | 1 |
Step 5: Process row 3. Only S2 has value 1. The hash values are (h1 = 1, h2 = 2); 1 < 3 so S2's h1 entry updates, while 2 is not smaller than 1, so S2 becomes (1, 1).
Step 6: Process row 4. In this row S1, S3 and S4 have value 1. The hash values are (h1 = 2, h2 = 4); only S4's h1 entry updates (2 < 3), giving the final signature matrix:
| | S1 | S2 | S3 | S4 |
| h1 | 0 | 1 | 0 | 2 |
| h2 | 0 | 1 | 0 | 1 |
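A short Python sketch reproducing the signature computation above; the row-to-element mapping (rows 0-4 = m, n, o, p, q) follows the example:

```python
# Minhash signatures for the four sets in the worked example.
sets = {
    "S1": {1, 2, 4},  # {n, o, q}
    "S2": {0, 3},     # {m, p}
    "S3": {1, 2, 4},  # {n, o, q}
    "S4": {0, 1, 4},  # {m, n, q}
}
hashes = [lambda x: (x + 3) % 5, lambda x: (2 * x + 1) % 5]

INF = float("inf")
sig = {name: [INF] * len(hashes) for name in sets}

for row in range(5):                  # scan the characteristic matrix row by row
    for name, members in sets.items():
        if row in members:            # this column has a 1 in this row
            for i, h in enumerate(hashes):
                sig[name][i] = min(sig[name][i], h(row))

print(sig)  # {'S1': [0, 0], 'S2': [1, 1], 'S3': [0, 0], 'S4': [2, 1]}
```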
What are the pattern based matching algorithms?
3
Edit Distance
Jaro Distance
Q-gram
What is the q-gram metric?
The q-gram similarity between two strings is the number of q-grams they have in common divided by the number of q-grams in the shorter string. A related q-gram distance counts the q-grams not shared:
Distance(A, B) = |Q(A)| + |Q(B)| − 2·|Q(A) ∩ Q(B)|
Example:
String A = “night”
String B = “nacht”
q = 2 (bigrams)
A → {“ni”, “ig”, “gh”, “ht”}
B → {“na”, “ac”, “ch”, “ht”}
Shared q-grams: {“ht”} → 1 common
Distance = 4 + 4 - 2×1 = 6
What is the q-gram similarity for the following example (3-grams with '#' padding)?
‘apple’ = {##a, #ap, app, ppl, ple, le#, e##}
‘apply’ = {##a, #ap, app, ppl, ply, ly#, y##}
The strings are the same length, so either can serve as the shorter string.
3-grams in the shorter string = 7
common 3-grams = 4 (##a, #ap, app, ppl)
3-gram similarity = 4/7
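A small sketch of padded q-gram similarity matching this example (the '#' padding convention is the one used above):

```python
def qgrams(text: str, q: int) -> set:
    """Padded q-grams: pad with q-1 '#' characters on each side."""
    padded = "#" * (q - 1) + text + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a: str, b: str, q: int) -> float:
    """Common q-grams divided by the q-gram count of the shorter string."""
    A, B = qgrams(a, q), qgrams(b, q)
    return len(A & B) / min(len(A), len(B))

print(sorted(qgrams("apple", 3)))            # ['##a', '#ap', 'app', 'e##', 'le#', 'ple', 'ppl']
print(qgram_similarity("apple", "apply", 3)) # 0.571... (= 4/7)
```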
What is the hidden state in an RNN model?
It is when an RNN remembers some information about a sequence.
What are the types of RNN models?
2
1) One-to-many
2) Many-to-one
Long Short-Term Memory
Type of RNN designed to address the vanishing gradient problem. Its components include:
* Cell state - represents the memory of the network. Information flows through the state using gates that either retain or discard info.
* Input gate - determines how much info should be stored in the cell state
* Forget gate - controls what information should be removed from a cell state
* Output gate - determines which part of the cell state should be output as the final prediction
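A minimal sketch, assuming PyTorch, of running an LSTM over a toy input sequence; the sizes are arbitrary:

```python
import torch
import torch.nn as nn

# One LSTM layer: 8 input features per time step, 16-dimensional hidden state.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(1, 5, 8)       # batch of 1 sequence, 5 time steps, 8 features
output, (h_n, c_n) = lstm(x)   # h_n: final hidden state, c_n: final cell state
print(output.shape)            # torch.Size([1, 5, 16])
```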
Hamming Distance
The hamming distance between two vectors is the number of components in which the two vectors differ.
Example:
Let’s compare:
String A: “karolin”
String B: “kathrin”
Now compare each character:
| Position | A | B | Match? |
| 1 | k | k | Yes |
| 2 | a | a | Yes |
| 3 | r | t | No |
| 4 | o | h | No |
| 5 | l | r | No |
| 6 | i | i | Yes |
| 7 | n | n | Yes |
Hamming distance = 3 (positions 3, 4, and 5 differ)
What is the Hamming distance between 1110010 and 1011001?
1110010
1011001
The vectors agree only in the first, third and fifth positions and differ in the other four.
Therefore the Hamming distance = 4.
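A one-function Python sketch of Hamming distance covering both examples:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b), "Hamming distance requires equal lengths"
    return sum(x != y for x, y in zip(a, b))

print(hamming("karolin", "kathrin"))   # 3
print(hamming("1110010", "1011001"))   # 4
```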