Week 5 - Text Mining III Flashcards

1
Q

What makes two strings similar?

A

Key factors in assessing similarity:
1. Lexical similarity - words that share characters or character sequences
2. Semantic similarity - words that share a similar or related meaning
3. Phonetic similarity - words that share similar sounds

2
Q

What is the underlying challenge of lexical and semantic similarity?

A

The underlying challenge is developing effective methods to quantify and measure the degree of similarity between strings.

3
Q

What is String Distance Measurement?

A

Measuring the similarity between two strings at the character level by quantifying the transformations needed to convert one string into the other.

4
Q

What are the types of string distance metrics?

A
  1. Levenshtein distance
  2. Damerau-Levenshtein distance
  3. Q-gram
  4. Other distance metrics include: Hamming, optimal string alignment and longest common subsequence
5
Q

What is the purpose of a string distance metric?

A

The purpose is to obtain a numerical measure of distance between strings

6
Q

What is the Levenshtein Distance Method?

A

Levenshtein distance is the minimum number of single-character edits required to transform one string (a) into another (b).

7
Q

What are the edit operations for the LD method?

A

Substitution, insertion and deletion of characters (weight = 1 point each)
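
For illustration, a minimal R sketch assuming the stringdist package (the package choice is an assumption, not something stated in the cards):

    library(stringdist)

    # "kitten" -> "sitting": substitute k->s, substitute e->i, insert g
    stringdist("kitten", "sitting", method = "lv")   # 3 edits, each weighted 1

    # A single deletion
    stringdist("cats", "cat", method = "lv")         # 1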

8
Q

What is the Damerau-Levenshtein Distance?

A

It extends the Levenshtein distance by also allowing transposition (swapping two adjacent characters) as an edit operation.
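
A small R sketch of the difference this makes, again assuming the stringdist package:

    library(stringdist)

    # Swapping two adjacent characters: "ca" -> "ac"
    stringdist("ca", "ac", method = "lv")   # 2: plain Levenshtein needs two substitutions
    stringdist("ca", "ac", method = "dl")   # 1: Damerau-Levenshtein counts one transposition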

9
Q

How can you convert the distance to similarity?

In other words, what is the formula for converting Levenshtein distance into a similarity score between 0 and 1?

A

simLev(a, b) = 1.0 - distance(a, b) / max(|a|, |b|)

where |a| and |b| represent the lengths of strings a and b
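
A short R sketch of the formula; stringsim() from the stringdist package applies the same normalisation (the package choice is an assumption):

    library(stringdist)

    a <- "kitten"; b <- "sitting"
    d <- stringdist(a, b, method = "lv")    # 3
    1.0 - d / max(nchar(a), nchar(b))       # 1 - 3/7, about 0.57

    stringsim(a, b, method = "lv")          # the same value computed directly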

10
Q

What are q-grams?

A

q-grams are another method for measuring the distance between two strings: a q-gram is a contiguous sequence of q characters taken from a string.

Note that q is analogous to n in n-grams.

11
Q

What is the q-gram string distance?

A

The q-gram string distance is the sum of the differences between the q-gram count vectors of the two strings.
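
A short R sketch with q = 2, assuming the stringdist package:

    library(stringdist)

    # 2-gram count vectors of both strings
    qgrams(a = "night", b = "nacht", q = 2)

    # Summing the differences between those vectors gives the q-gram distance
    stringdist("night", "nacht", method = "qgram", q = 2)   # 6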

12
Q

What is the distance between vectors?

A

Distance metrics can be used to measure the similarity or dissimilarity between numerical characteristics (vectors) of text.

13
Q

What is a vector?

A

A vector is a quantity that has a magnitude and a specific direction. It can describe the movement required from one point to another in space.

14
Q

What is the Euclidean distance?

A

The Euclidean distance between two points with coordinates (x1, y1) and (x2, y2) is the length of the straight-line path connecting them.
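
For example, in base R (the coordinates are arbitrary):

    # Straight-line (Euclidean) distance between the points (1, 2) and (4, 6)
    p1 <- c(1, 2)
    p2 <- c(4, 6)
    sqrt(sum((p1 - p2)^2))   # 5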

15
Q

What is the relationship between Euclidean distance and magnitude?

A

Euclidean distance takes into account the magnitude of vectors, but this is not always meaningful for text data

16
Q

What is an alternative metric to the Euclidean distance?

A

Cosine distance: a metric that does not take the magnitude of the vectors into account; it is used when the magnitude of the vectors does not matter.

17
Q

How does cosine distance measure the distance between vectors?

A

Cosine distance is computed as 1 - cosine similarity, where cosine similarity is the cosine of the angle between two vectors.

Words are more likely to be similar if the angle between their vectors is small, even if the vectors are far apart in magnitude.
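
A base-R sketch of this idea (the vectors are made up; v2 is v1 scaled by 2, so the angle between them is zero):

    cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

    v1 <- c(1, 2, 3)
    v2 <- c(2, 4, 6)           # same direction, twice the magnitude

    cosine_sim(v1, v2)         # 1 (up to floating point): the angle is 0, so maximal similarity
    1 - cosine_sim(v1, v2)     # 0: cosine distance ignores the difference in magnitude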

18
Q

What is the problem with lexical similarity II?

A

Lexical metrics fail to capture words that are similar in meaning (i.e., the semantic dimension).
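
A quick R illustration of why this matters, assuming the stringdist package (the word pairs are chosen only for illustration):

    library(stringdist)

    # Lexically close but semantically unrelated
    stringsim("car", "cat", method = "lv")          # about 0.67

    # Semantically related (synonyms) but lexically far apart
    stringsim("car", "automobile", method = "lv")   # much lower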

19
Q

What areas of lexical similarity II should be improved?

A
  1. The meaning of a word is determined by the information associated with it
  2. Each word is represented as a vector of real-valued numbers containing information associated with the word
  3. Similar words are closer/nearby in the vector space
20
Q

What is ‘one-hot encoding’? And why do people use it?

A

‘One-hot encoding’ is a basic method for representing words as vectors: each row has one element (column) that is “hot” (1) while all the others are 0s.

The 1 indicates the presence of the word. We use it because it is easy to understand and straightforward to compute.
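
A minimal base-R sketch of the idea (the tiny vocabulary is made up for illustration):

    # Each word gets a row with a single 1 in its own column and 0s elsewhere
    vocab   <- c("cat", "dog", "fish")
    one_hot <- diag(length(vocab))            # identity matrix: one "hot" element per row
    dimnames(one_hot) <- list(vocab, vocab)
    one_hot["dog", ]                          # 0 1 0 -> the 1 marks the presence of "dog"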

21
Q

What are the challenges/problems with ‘one-hot encoding’?

A
  1. Super-sparse: only one element in each row has a non-zero value
  2. Computationally inefficient - a vocabulary of 100,000 words produces vectors with a dimension of 100,000
  3. Similarity, context, or relationships with other words are not captured
  4. The dot product of two different one-hot vectors is always zero (see the sketch after this list)
  5. We need a denser representation of words that stores information about words in a manageable dimensional space
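
A base-R sketch of problems 1, 2 and 4 (the toy vocabulary is made up for illustration):

    # Toy vocabulary; a real one with 100,000 words would give 100,000-dimensional vectors
    vocab   <- c("cat", "dog", "fish", "bird")
    one_hot <- diag(length(vocab))
    dimnames(one_hot) <- list(vocab, vocab)

    mean(one_hot == 0)                         # 0.75 here; sparsity grows with vocabulary size
    sum(one_hot["cat", ] * one_hot["dog", ])   # 0: the dot product never reflects similarity
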
22
Q

What is word embedding?

A

Word embedding is a technique to address the limitations of one-hot encoding.

23
Q

What are the approaches to word embedding?

A

There are two approaches: word2vec and GloVe.

24
Q

How is information represented in word embedding?

A

Information is represented as a set of floating-point numbers in an N-dimensional vector. Each word has an embedding, every embedding has the same length, and each column (dimension) captures some contextual information.
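
A toy base-R illustration of this layout (the numbers are invented for illustration, not real embeddings):

    # Each word is one row of N real numbers; here N = 4
    embeddings <- rbind(
      king  = c( 0.50,  0.71, -0.30,  0.12),
      queen = c( 0.48,  0.69, -0.28,  0.45),
      apple = c(-0.61,  0.05,  0.88, -0.20)
    )
    dim(embeddings)        # 3 words x 4 dimensions: every embedding has the same length
    embeddings["king", ]   # the dense vector representing "king"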

25
Q

What is a learning process in the context of Text Mining III?

A

The learning process is when algorithms learn word embeddings from training text (a large corpus) in an unsupervised manner, based on the contexts in which words occur (e.g., their surrounding words).

We can download pre-trained models and retrieve the embeddings for a given word.

26
Q

What is Word2vec?

A

Word2vec: embeddings are obtained from models that predict the target word from the context words (or surrounding words), or predict the context words given the target word.

27
Q

Define CBOW and Skip Gram

A

CBOW - Continuous Bag of Words; a model that predicts the target word in the centre using the surrounding context words

Skip-gram - a model that predicts the context words from the target word in the centre
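
A base-R sketch of the training pairs each model is built from (the toy sentence and window size are chosen only for illustration):

    tokens <- c("the", "cat", "sat", "on", "the", "mat")
    window <- 1   # one context word on each side of the target

    for (i in seq_along(tokens)) {
      context <- tokens[setdiff(max(1, i - window):min(length(tokens), i + window), i)]
      target  <- tokens[i]
      # CBOW:      predict `target` from `context`
      # Skip-gram: predict each word in `context` from `target`
      cat("context:", paste(context, collapse = ", "), "-> target:", target, "\n")
    }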

28
Q

Define GloVe

A

GloVe (Global Vectors for Word Representation): This model is based on an aggregated global co-occurrence matrix that shows how often words in a corpus co-occur with each other.

It aims to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence.
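
A base-R sketch of what such a co-occurrence matrix looks like for a toy corpus (the tokens and window size are made up; GloVe then fits word vectors to this matrix):

    tokens <- c("i", "like", "deep", "learning", "i", "like", "nlp")
    vocab  <- unique(tokens)
    window <- 1

    cooc <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
    for (i in seq_along(tokens)) {
      for (j in setdiff(max(1, i - window):min(length(tokens), i + window), i)) {
        cooc[tokens[i], tokens[j]] <- cooc[tokens[i], tokens[j]] + 1
      }
    }
    cooc["i", "like"]   # 2: "i" and "like" appear next to each other twice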

29
Q

What is the process for word embedding in R?

A
  1. Initialise the pre-trained models
  2. Calculate the similarity between words
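
A minimal sketch of those two steps in R, assuming a pre-trained GloVe text file has already been downloaded; the file name and the words looked up are placeholders, and the exact loading code used in the course may differ:

    # Step 1: initialise (load) the pre-trained vectors; words become row names
    emb <- as.matrix(read.table("glove.6B.50d.txt", quote = "", comment.char = "",
                                row.names = 1))

    # Step 2: calculate the similarity between two words (cosine of the angle)
    cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
    cosine_sim(emb["king", ], emb["queen", ])   # high for related words
    cosine_sim(emb["king", ], emb["carrot", ])  # lower for unrelated words
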
30
Q

Why does arithmetic with word embeddings work?

A

Arithmetic with word embeddings works because of the ‘distributional hypothesis’: words that occur in the same contexts tend to have similar meanings.
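
A sketch of such arithmetic in R, reusing the same assumed pre-trained file as in the earlier sketch (file name and words are placeholders):

    emb <- as.matrix(read.table("glove.6B.50d.txt", quote = "", comment.char = "",
                                row.names = 1))
    cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

    # Words used in similar contexts get similar vectors, so vector offsets carry meaning:
    # king - man + woman lands near "queen"
    target <- emb["king", ] - emb["man", ] + emb["woman", ]
    sims   <- apply(emb, 1, cosine_sim, b = target)
    head(sort(sims, decreasing = TRUE))   # "queen" should rank among the closest words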

31
Q

What are the limitations of word embeddings?

A

Word embeddings are very powerful, but they can be:
1. unstable
2. hard to interpret