Chapter 3 & 4: Matching Flashcards

1
Q

Applications of set similarity

A
  • Entity resolution
  • Pages with similar words
  • Netflix movies (similar tastes)
  • Dual: movies with similar sets of fans
2
Q

Goal of finding similar documents

A

Given a large number of documents, find near-duplicate pairs.

3
Q

Problems in finding similar documents

A
  • Too many documents to compare all pairs
  • Documents are too large or too numerous -> cannot fit in main memory
  • Many small pieces of one document can appear out of order in another (order is important)
4
Q

Shingling

A

Generating a set of strings of length k that appear in the document

A k-shingle (or k-gram) for a document is a sequence of k characters, k words, or any other token.
  • Example: k = 2; Doc1 = abcab, Doc2 = abfef
  • 2-shingles of Doc1 = {ab, bc, ca}; 2-shingles of Doc2 = {ab, bf, fe, ef}
  • Shingle universe: [ab, bc, ca, bf, fe, ef]
  • Boolean vectors: D1 = [1, 1, 1, 0, 0, 0], D2 = [1, 0, 0, 1, 1, 1]
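A minimal sketch of character-level shingling in Python, reproducing the example above (the function name and the fixed shingle ordering are illustrative choices, not from the card):

```python
def shingles(doc: str, k: int = 2) -> set[str]:
    """Return the set of all k-character shingles appearing in doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

d1 = shingles("abcab")  # {'ab', 'bc', 'ca'}
d2 = shingles("abfef")  # {'ab', 'bf', 'fe', 'ef'}

# Boolean vectors over the shingle universe [ab, bc, ca, bf, fe, ef]
universe = ["ab", "bc", "ca", "bf", "fe", "ef"]
v1 = [1 if s in d1 else 0 for s in universe]  # [1, 1, 1, 0, 0, 0]
v2 = [1 if s in d2 else 0 for s in universe]  # [1, 0, 0, 1, 1, 1]
```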

5
Q

Shingling benefits

A
  • Documents that are similar will have many shingles in common
  • Changing a word only affects k-shingles within distance k from the word
  • Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries
6
Q

Challenges using shingles

A
  • Selecting the right k
  • Selecting the right token
  • Hashing shingles
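For the "hashing shingles" point, a common approach (not spelled out on the card) is to map each shingle to a fixed-size integer, so shingle sets become compact sets of 4-byte IDs; a sketch, with md5 chosen only to make the example stable:

```python
import hashlib

def shingle_id(shingle: str) -> int:
    """Map a shingle to a 4-byte integer ID."""
    return int.from_bytes(hashlib.md5(shingle.encode("utf-8")).digest()[:4], "big")

ids = {shingle_id(s) for s in {"ab", "bc", "ca"}}  # compact integer representation
```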
7
Q

Computing similarity between documents represented as sets, using Jaccard

A

Jaccard(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
But pairwise comparison of all documents is inefficient for large collections.
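A direct sketch of the formula (the helper name is illustrative):

```python
def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2|."""
    if not c1 and not c2:
        return 0.0  # convention for two empty sets
    return len(c1 & c2) / len(c1 | c2)

# Shingle sets from the shingling card: intersection {ab}, union of 6 shingles
jaccard({"ab", "bc", "ca"}, {"ab", "bf", "fe", "ef"})  # 1/6 ≈ 0.167
```

Comparing all pairs of n documents requires n(n−1)/2 such evaluations, which motivates minhashing and LSH below.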

8
Q

Similarity problems

A

Finding subsets that have significant intersection

9
Q

Minhashing

A

Minhashing compresses each set (a column of the boolean matrix) into a short signature such that the probability two sets agree on a minhash value equals their Jaccard similarity. Similarities can then be estimated from the small signatures instead of the full sets, giving large performance gains on big datasets with little loss of accuracy.
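A sketch using explicit random permutations, which matches the conceptual definition; real implementations simulate the permutations with hash functions (see the row-hashing card). Function names and the signature length are illustrative:

```python
import random

def minhash_signatures(sets: list[set], num_hashes: int = 200, seed: int = 0) -> list[list[int]]:
    """For each random permutation of the universe, record for every set the
    rank of its minimum element; matching positions estimate Jaccard similarity."""
    universe = sorted(set().union(*sets))
    rng = random.Random(seed)
    sigs = [[] for _ in sets]
    for _ in range(num_hashes):
        perm = universe[:]
        rng.shuffle(perm)
        rank = {elem: i for i, elem in enumerate(perm)}
        for sig, s in zip(sigs, sets):
            sig.append(min(rank[e] for e in s))  # first element of s under this permutation
    return sigs

s1, s2 = minhash_signatures([{"ab", "bc", "ca"}, {"ab", "bf", "fe", "ef"}])
estimate = sum(a == b for a, b in zip(s1, s2)) / len(s1)  # ≈ Jaccard = 1/6
```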

10
Q

Matrix sparsity

A

In large datasets, the boolean matrix is typically sparse, meaning most entries are 0. This happens because each set (document) typically contains only a small subset of the universal set (all possible elements).

11
Q

Boolean Matrix Representation

A

Efficiently encodes sets as columns in a matrix, with rows representing elements of the universal set.
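A tiny sketch of the encoding (the dense list-of-lists layout is for illustration only; per the sparsity card, real implementations store just the positions of the 1s):

```python
def boolean_matrix(sets: list[set]) -> tuple[list, list[list[int]]]:
    """Rows = elements of the universal set, columns = sets;
    entry (r, c) is 1 iff row element r belongs to set c."""
    universe = sorted(set().union(*sets))
    matrix = [[1 if elem in s else 0 for s in sets] for elem in universe]
    return universe, matrix

rows, m = boolean_matrix([{"ab", "bc", "ca"}, {"ab", "bf", "fe", "ef"}])
# rows labels the matrix rows; column j of m is the boolean vector for set j
```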

12
Q

Row hashing

A

Row hashing uses hash functions to simulate random permutations of the row indices, so min-hash signatures can be computed without explicitly permuting (or storing permutations of) a large matrix. One-pass implementation: initialize every signature entry to infinity; for each row r, compute h_i(r) for every hash function h_i; for each column with a 1 in row r, set its i-th signature entry to min(current value, h_i(r)). This computes all signatures in a single scan of the matrix.
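A one-pass sketch, assuming universal hash functions of the form h(r) = (a·r + b) mod p with p a prime larger than the number of rows (the prime and the column-as-set-of-row-indices layout are illustrative choices):

```python
def minhash_one_pass(columns: list[set[int]], num_rows: int,
                     hash_params: list[tuple[int, int]], p: int = 4294967311):
    """columns[c] = row indices where column c has a 1. Returns
    sig[i][c] = min over those rows of h_i(r) = (a_i * r + b_i) mod p."""
    INF = float("inf")
    sig = [[INF] * len(columns) for _ in hash_params]
    for r in range(num_rows):                      # single pass over the rows
        hashes = [(a * r + b) % p for a, b in hash_params]
        for c, one_rows in enumerate(columns):
            if r in one_rows:                      # column c has a 1 in row r
                for i, h in enumerate(hashes):
                    if h < sig[i][c]:
                        sig[i][c] = h              # keep the minimum hash seen
    return sig
```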

13
Q

Locality-sensitive hashing (LSH)

A

Focus on pairs of signatures likely to be similar:
  1. Create a shortlist of candidate pairs that need to be checked for similarity.
  2. Divide the columns into many groups (buckets) using hashing; columns that land in the same bucket are candidate pairs and are compared to determine their actual similarity (see the sketch below).
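The usual way to realize this is the banding technique: split each signature into b bands of r rows and hash every band of every column to a bucket. A sketch, where the band tuple itself serves as the bucket key:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sigs: list[list[int]], bands: int, rows: int) -> set[tuple[int, int]]:
    """sigs[c] is column c's minhash signature (len == bands * rows); columns
    whose band values collide in any band become candidate pairs."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for col, sig in enumerate(sigs):
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates
```

Only the candidate pairs are then checked with the exact similarity measure.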
14
Q

Modern entity linking

A

Schema alignment -> blocking -> matching -> clustering
