Week 5 - Text Mining III Flashcards

1
Q

What makes two strings similar?

A

Key factors in assessing similarity:
1. Lexical similarity - words that share characters or character sequences
2. Semantic similarity - words that share a similar or related meaning
3. Phonetic similarity - words that share similar sounds

2
Q

What is the underlying challenge of lexical and semantic similarity?

A

The underlying challenge is developing effective methods to quantify and measure the degree of similarity between strings.

3
Q

What is String Distance Measurement?

A

Measuring the similarity between two strings at the character level by quantifying the transformations needed to convert one string into the other.

4
Q

What are the types of string distance metrics?

A
  1. Levenshtein distance
  2. Damerau-Levenshtein distance
  3. Q-gram
  4. Other distance metrics include: Hamming, optimal string alignment and longest common subsequence
5
Q

What is the purpose of a string distance metric?

A

The purpose is to obtain a numerical measure of distance between strings

6
Q

What is the Levenshtein Distance Method?

A

Levenshtein distance is the minimum number of single-character edits required to transform one string (a) into another (b).

7
Q

What are the edit operations for the LD method?

A

Substitution, insertion and deletion of characters (weight = 1 point each)
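
For illustration, a minimal R sketch assuming the stringdist package (the package choice is an assumption, not something stated in the cards):

    library(stringdist)

    # "kitten" -> "sitting": substitute k->s, substitute e->i, insert g
    stringdist("kitten", "sitting", method = "lv")   # 3 edits, each weighted 1

    # A single deletion
    stringdist("cats", "cat", method = "lv")         # 1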

8
Q

What is the Damerau-Levenshtein Distance?

A

It extends the Levenshtein distance by also allowing transposition (swapping two adjacent characters) as an edit operation.
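
A small R sketch of the difference this makes, again assuming the stringdist package:

    library(stringdist)

    # Swapping two adjacent characters: "ca" -> "ac"
    stringdist("ca", "ac", method = "lv")   # 2: plain Levenshtein needs two substitutions
    stringdist("ca", "ac", method = "dl")   # 1: Damerau-Levenshtein counts one transposition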

9
Q

How can you convert the distance to similarity?

In other words, what is the formula for converting Levenshtein distance into a similarity score between 0 and 1?

A

simLev(a, b) = 1.0 - distance(a, b) / max(|a|, |b|)

where |a| and |b| represent the lengths of strings a and b
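
A short R sketch of the formula; stringsim() from the stringdist package applies the same normalisation (the package choice is an assumption):

    library(stringdist)

    a <- "kitten"; b <- "sitting"
    d <- stringdist(a, b, method = "lv")    # 3
    1.0 - d / max(nchar(a), nchar(b))       # 1 - 3/7, about 0.57

    stringsim(a, b, method = "lv")          # the same value computed directly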

10
Q

What are q-grams?

A

q-grams are another method for measuring the distance between two strings: a q-gram is a contiguous sequence of q characters taken from a string.

Note that q is analogous to n in n-grams.

11
Q

What is the q-gram string distance?

A

The q-gram string distance is the sum of the differences between the q-gram count vectors of the two strings.
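
A short R sketch with q = 2, assuming the stringdist package:

    library(stringdist)

    # 2-gram count vectors of both strings
    qgrams(a = "night", b = "nacht", q = 2)

    # Summing the differences between those vectors gives the q-gram distance
    stringdist("night", "nacht", method = "qgram", q = 2)   # 6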

12
Q

What is the distance between vectors?

A

Distance metrics can be used to measure the similarity or dissimilarity between numerical characteristics (vectors) of text.

13
Q

What is a vector?

A

A vector is a quantity that has a magnitude and a specific direction. It can describe the movement required from one point to another in space.

14
Q

What is the Euclidean distance?

A

The Euclidean distance between two points with coordinates (x1, y1) and (x2, y2) is the length of the straight-line path connecting them.
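
For example, in base R (the coordinates are arbitrary):

    # Straight-line (Euclidean) distance between the points (1, 2) and (4, 6)
    p1 <- c(1, 2)
    p2 <- c(4, 6)
    sqrt(sum((p1 - p2)^2))   # 5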

15
Q

What is the relationship between Euclidean distance and magnitude?

A

Euclidean distance takes into account the magnitude of vectors, but this is not always meaningful for text data

16
Q

What is an alternative metric to the Euclidean distance?

A

Cosine distance: a metric that does not take the magnitude of the vectors into account; it is used when the magnitude of the vectors does not matter.

17
Q

How does cosine distance measure the distance between vectors?

A

Cosine distance is computed as 1 - cosine similarity, where cosine similarity is the cosine of the angle between two vectors.

Words are more likely to be similar if the angle between their vectors is small, even if the vectors are far apart in magnitude.
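
A base-R sketch of this idea (the vectors are made up; v2 is v1 scaled by 2, so the angle between them is zero):

    cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

    v1 <- c(1, 2, 3)
    v2 <- c(2, 4, 6)           # same direction, twice the magnitude

    cosine_sim(v1, v2)         # 1 (up to floating point): the angle is 0, so maximal similarity
    1 - cosine_sim(v1, v2)     # 0: cosine distance ignores the difference in magnitude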

18
Q

What is the problem with lexical similarity II?

A

Lexical metrics fail to capture words that are similar in meaning (i.e., the semantic dimension).
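
A quick R illustration of why this matters, assuming the stringdist package (the word pairs are chosen only for illustration):

    library(stringdist)

    # Lexically close but semantically unrelated
    stringsim("car", "cat", method = "lv")          # about 0.67

    # Semantically related (synonyms) but lexically far apart
    stringsim("car", "automobile", method = "lv")   # much lower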

19
Q

What areas of lexical similarity II should be improved?

A
  1. The meaning of a word is determined by the information associated with it
  2. Each word is represented as a vector of real-valued numbers containing information associated with the word
  3. Similar words are closer/nearby in the vector space
20
Q

What is ‘one-hot encoding’? And why do people use it?

A

‘One-hot encoding’ is a basic method for representing words as vectors: each row has one element (column) that is “hot” (1) while all the others are 0s.

The 1 indicates the presence of the word. We use it because it is easy to understand and straightforward to compute.
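
A minimal base-R sketch of the idea (the tiny vocabulary is made up for illustration):

    # Each word gets a row with a single 1 in its own column and 0s elsewhere
    vocab   <- c("cat", "dog", "fish")
    one_hot <- diag(length(vocab))            # identity matrix: one "hot" element per row
    dimnames(one_hot) <- list(vocab, vocab)
    one_hot["dog", ]                          # 0 1 0 -> the 1 marks the presence of "dog"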

21
Q

What are the challenges/problems with ‘one-hot encoding’?

A
  1. Super-sparse: only one element in each row has a non-zero value
  2. Computationally inefficient - a vocabulary of 100,000 words produces vectors with a dimension of 100,000
  3. Similarity, context, or relationships with other words are not captured
  4. The dot product of two different one-hot vectors is always zero (see the sketch after this list)
  5. We need a denser representation of words that stores information about words in a manageable dimensional space
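
A base-R sketch of problems 1, 2 and 4 (the toy vocabulary is made up for illustration):

    # Toy vocabulary; a real one with 100,000 words would give 100,000-dimensional vectors
    vocab   <- c("cat", "dog", "fish", "bird")
    one_hot <- diag(length(vocab))
    dimnames(one_hot) <- list(vocab, vocab)

    mean(one_hot == 0)                         # 0.75 here; sparsity grows with vocabulary size
    sum(one_hot["cat", ] * one_hot["dog", ])   # 0: the dot product never reflects similarity
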
22
Q

What is word embedding?

A

Word embedding is a technique to address the limitations of one-hot encoding.

23
Q

What are the approaches to word embedding?

A

There are two approaches: word2vec and GloVe.

24
Q

How is information represented in word embedding?

A

Information is represented as a set of floating-point numbers in an N-dimensional vector. Each word has an embedding, every embedding has the same length, and each column (dimension) captures some contextual information.
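
A toy base-R illustration of this layout (the numbers are invented for illustration, not real embeddings):

    # Each word is one row of N real numbers; here N = 4
    embeddings <- rbind(
      king  = c( 0.50,  0.71, -0.30,  0.12),
      queen = c( 0.48,  0.69, -0.28,  0.45),
      apple = c(-0.61,  0.05,  0.88, -0.20)
    )
    dim(embeddings)        # 3 words x 4 dimensions: every embedding has the same length
    embeddings["king", ]   # the dense vector representing "king"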

25
Q

What is a learning process in the context of Text Mining III?

A

The learning process is when algorithms learn word embeddings from training text (a large corpus) in an unsupervised manner, based on the contexts in which words occur (e.g., their surrounding words).

We can download pre-trained models and retrieve the embeddings for a given word.

26
Q

What is Word2vec?

A

Word2vec: embeddings are obtained from models that predict the target word from the context words (or surrounding words), or predict the context words given the target word.

27
Q

Define CBOW and Skip Gram

A

CBOW - Continuous Bag of Words; a model that predicts the target word in the centre using the surrounding context words

Skip-gram - a model that predicts the context words from the target word in the centre
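
A base-R sketch of the training pairs each model is built from (the toy sentence and window size are chosen only for illustration):

    tokens <- c("the", "cat", "sat", "on", "the", "mat")
    window <- 1   # one context word on each side of the target

    for (i in seq_along(tokens)) {
      context <- tokens[setdiff(max(1, i - window):min(length(tokens), i + window), i)]
      target  <- tokens[i]
      # CBOW:      predict `target` from `context`
      # Skip-gram: predict each word in `context` from `target`
      cat("context:", paste(context, collapse = ", "), "-> target:", target, "\n")
    }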

28
Q

Define GloVe

A

GloVe (Global Vectors for Word Representation): This model is based on an aggregated global co-occurrence matrix that shows how often words in a corpus co-occur with each other.

It aims to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence.
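
A base-R sketch of what such a co-occurrence matrix looks like for a toy corpus (the tokens and window size are made up; GloVe then fits word vectors to this matrix):

    tokens <- c("i", "like", "deep", "learning", "i", "like", "nlp")
    vocab  <- unique(tokens)
    window <- 1

    cooc <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
    for (i in seq_along(tokens)) {
      for (j in setdiff(max(1, i - window):min(length(tokens), i + window), i)) {
        cooc[tokens[i], tokens[j]] <- cooc[tokens[i], tokens[j]] + 1
      }
    }
    cooc["i", "like"]   # 2: "i" and "like" appear next to each other twice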

29
Q

What is the process for word embedding in R?

A
  1. Initialise the pre-trained models
  2. Calculate the similarity between words
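
A minimal sketch of those two steps in R, assuming a pre-trained GloVe text file has already been downloaded; the file name and the words looked up are placeholders, and the exact loading code used in the course may differ:

    # Step 1: initialise (load) the pre-trained vectors; words become row names
    emb <- as.matrix(read.table("glove.6B.50d.txt", quote = "", comment.char = "",
                                row.names = 1))

    # Step 2: calculate the similarity between two words (cosine of the angle)
    cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
    cosine_sim(emb["king", ], emb["queen", ])   # high for related words
    cosine_sim(emb["king", ], emb["carrot", ])  # lower for unrelated words
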
30
Q

Why does arithmetic with word embeddings work?

A

Arithmetic with word embeddings works because of the ‘distributional hypothesis’: words that occur in the same contexts tend to have similar meanings.
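
A sketch of such arithmetic in R, reusing the same assumed pre-trained file as in the earlier sketch (file name and words are placeholders):

    emb <- as.matrix(read.table("glove.6B.50d.txt", quote = "", comment.char = "",
                                row.names = 1))
    cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

    # Words used in similar contexts get similar vectors, so vector offsets carry meaning:
    # king - man + woman lands near "queen"
    target <- emb["king", ] - emb["man", ] + emb["woman", ]
    sims   <- apply(emb, 1, cosine_sim, b = target)
    head(sort(sims, decreasing = TRUE))   # "queen" should rank among the closest words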

31
Q

What are the limitations of word embeddings?

A

Word embeddings are very powerful, but they can be:
1. unstable
2. hard to interpret