Chapter 3 & 4: Matching Flashcards

1
Q

Applications of set similarity

A
  • Entity resolution
  • Pages with similar words
  • Netflix movies (similar tastes)
  • Dual: movies with similar sets of fans
2
Q

Goal of finding similar documents

A

Given a large number of documents, find near-duplicate pairs.

3
Q

Problems in finding similar documents

A
  • Too many documents to compare all pairs
  • Documents are too large or too numerous -> cannot fit in main memory
  • Many small pieces of one document can appear out of order in another (order is important)
4
Q

Shingling

A

Generating a set of strings of length k that appear in the document

A k-shingle (or k-gram) for a document is a sequence of k characters, k words, or any other token.
  • Example: k = 2; Doc1 = abcab, Doc2 = abfef
  • 2-shingles of Doc1 = {ab, bc, ca}; 2-shingles of Doc2 = {ab, bf, fe, ef}
  • Shingle universe: [ab, bc, ca, bf, fe, ef]
  • Boolean vectors: D1 = [1, 1, 1, 0, 0, 0], D2 = [1, 0, 0, 1, 1, 1]
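A minimal sketch of character-level shingling in Python, reproducing the example above (the function name and the fixed shingle ordering are illustrative choices, not from the card):

```python
def shingles(doc: str, k: int = 2) -> set[str]:
    """Return the set of all k-character shingles appearing in doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

d1 = shingles("abcab")  # {'ab', 'bc', 'ca'}
d2 = shingles("abfef")  # {'ab', 'bf', 'fe', 'ef'}

# Boolean vectors over the shingle universe [ab, bc, ca, bf, fe, ef]
universe = ["ab", "bc", "ca", "bf", "fe", "ef"]
v1 = [1 if s in d1 else 0 for s in universe]  # [1, 1, 1, 0, 0, 0]
v2 = [1 if s in d2 else 0 for s in universe]  # [1, 0, 0, 1, 1, 1]
```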

5
Q

Shingling benefits

A
  • Documents that are similar will have many shingles in common
  • Changing a word only affects k-shingles within distance k from the word
  • Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries
6
Q

Challenges using shingles

A
  • Selecting the right k
  • Selecting the right token
  • Hashing shingles
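For the "hashing shingles" point, a common approach (not spelled out on the card) is to map each shingle to a fixed-size integer, so shingle sets become compact sets of 4-byte IDs; a sketch, with md5 chosen only to make the example stable:

```python
import hashlib

def shingle_id(shingle: str) -> int:
    """Map a shingle to a 4-byte integer ID."""
    return int.from_bytes(hashlib.md5(shingle.encode("utf-8")).digest()[:4], "big")

ids = {shingle_id(s) for s in {"ab", "bc", "ca"}}  # compact integer representation
```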
7
Q

Computing similarity between documents represented as sets, using Jaccard

A

Jaccard(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
But pairwise comparison of all documents is inefficient for large collections.
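A direct sketch of the formula (the helper name is illustrative):

```python
def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2|."""
    if not c1 and not c2:
        return 0.0  # convention for two empty sets
    return len(c1 & c2) / len(c1 | c2)

# Shingle sets from the shingling card: intersection {ab}, union of 6 shingles
jaccard({"ab", "bc", "ca"}, {"ab", "bf", "fe", "ef"})  # 1/6 ≈ 0.167
```

Comparing all pairs of n documents requires n(n−1)/2 such evaluations, which motivates minhashing and LSH below.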

8
Q

Similarity problems

A

Finding subsets that have significant intersection

9
Q

Minhashing

A

Minhashing compresses each set (a column of the boolean matrix) into a short signature such that the probability two sets agree on a minhash value equals their Jaccard similarity. Similarities can then be estimated from the small signatures instead of the full sets, giving large performance gains on big datasets with little loss of accuracy.
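A sketch using explicit random permutations, which matches the conceptual definition; real implementations simulate the permutations with hash functions (see the row-hashing card). Function names and the signature length are illustrative:

```python
import random

def minhash_signatures(sets: list[set], num_hashes: int = 200, seed: int = 0) -> list[list[int]]:
    """For each random permutation of the universe, record for every set the
    rank of its minimum element; matching positions estimate Jaccard similarity."""
    universe = sorted(set().union(*sets))
    rng = random.Random(seed)
    sigs = [[] for _ in sets]
    for _ in range(num_hashes):
        perm = universe[:]
        rng.shuffle(perm)
        rank = {elem: i for i, elem in enumerate(perm)}
        for sig, s in zip(sigs, sets):
            sig.append(min(rank[e] for e in s))  # first element of s under this permutation
    return sigs

s1, s2 = minhash_signatures([{"ab", "bc", "ca"}, {"ab", "bf", "fe", "ef"}])
estimate = sum(a == b for a, b in zip(s1, s2)) / len(s1)  # ≈ Jaccard = 1/6
```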

10
Q

Matrix sparsity

A

In large datasets, the boolean matrix is typically sparse, meaning most entries are 0. This happens because each set (document) typically contains only a small subset of the universal set (all possible elements).

11
Q

Boolean Matrix Representation

A

Efficiently encodes sets as columns in a matrix, with rows representing elements of the universal set.
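A tiny sketch of the encoding (the dense list-of-lists layout is for illustration only; per the sparsity card, real implementations store just the positions of the 1s):

```python
def boolean_matrix(sets: list[set]) -> tuple[list, list[list[int]]]:
    """Rows = elements of the universal set, columns = sets;
    entry (r, c) is 1 iff row element r belongs to set c."""
    universe = sorted(set().union(*sets))
    matrix = [[1 if elem in s else 0 for s in sets] for elem in universe]
    return universe, matrix

rows, m = boolean_matrix([{"ab", "bc", "ca"}, {"ab", "bf", "fe", "ef"}])
# rows labels the matrix rows; column j of m is the boolean vector for set j
```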

12
Q

Row hashing

A

Row hashing uses hash functions to simulate random permutations of the row indices, so min-hash signatures can be computed without explicitly permuting (or storing permutations of) a large matrix. One-pass implementation: initialize every signature entry to infinity; for each row r, compute h_i(r) for every hash function h_i; for each column with a 1 in row r, set its i-th signature entry to min(current value, h_i(r)). This computes all signatures in a single scan of the matrix.
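A one-pass sketch, assuming universal hash functions of the form h(r) = (a·r + b) mod p with p a prime larger than the number of rows (the prime and the column-as-set-of-row-indices layout are illustrative choices):

```python
def minhash_one_pass(columns: list[set[int]], num_rows: int,
                     hash_params: list[tuple[int, int]], p: int = 4294967311):
    """columns[c] = row indices where column c has a 1. Returns
    sig[i][c] = min over those rows of h_i(r) = (a_i * r + b_i) mod p."""
    INF = float("inf")
    sig = [[INF] * len(columns) for _ in hash_params]
    for r in range(num_rows):                      # single pass over the rows
        hashes = [(a * r + b) % p for a, b in hash_params]
        for c, one_rows in enumerate(columns):
            if r in one_rows:                      # column c has a 1 in row r
                for i, h in enumerate(hashes):
                    if h < sig[i][c]:
                        sig[i][c] = h              # keep the minimum hash seen
    return sig
```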

13
Q

Locality-sensitive hashing (LSH)

A

Focus on pairs of signatures likely to be similar:
  1. Create a shortlist of candidate pairs that need to be checked for similarity.
  2. Divide the columns into many groups (buckets) using hashing; columns that land in the same bucket are candidate pairs and are compared to determine their actual similarity (see the sketch below).
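The usual way to realize this is the banding technique: split each signature into b bands of r rows and hash every band of every column to a bucket. A sketch, where the band tuple itself serves as the bucket key:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sigs: list[list[int]], bands: int, rows: int) -> set[tuple[int, int]]:
    """sigs[c] is column c's minhash signature (len == bands * rows); columns
    whose band values collide in any band become candidate pairs."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for col, sig in enumerate(sigs):
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates
```

Only the candidate pairs are then checked with the exact similarity measure.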
14
Q

Modern entity linking

A

Schema alignment -> blocking -> matching -> clustering
