Chapter 3 & 4: Matching Flashcards
Applications set similarity
- Entity resolution
- Pages with similar words
- Netflix movies (similar tastes)
- Dual: movies with similar sets of fans
Goal on finding similar documents
Given a large number of documents, find near duplicate pairs
Problem of finding similar docs
- too many documents to compare all pairs
- Documents are too large or too many -> cannot fit in main memory
- Many small pieces of one document can appear out of order (order is important)
Shingling
Generating a set of strings of length k that appear in the document
A k −shingle (or k −gram) for a document is a sequence
of k characters or k works or any other token
§ Example: k=2; doc = abcab. Doc2=abfef
§ Set of 2−shingles= {ab, bc, ca}. 2-shingles2={ab bf fe ef}
§ [ab bc ca bf fe ef]
§ D1: [1,1,1,0,0,0]. D2: [1,0, 0, 1,1,,1]
Shingling benefits
- Documents that are similar will have many shingles in common
- Changing a word only affects k-shingles within distance k from the word
- Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries
Challenges using shingles
- Selecting the right k
- Selecting the rihjt token
- Hashing shingles
Computing similarities between document representation as sets with Jaccard
Jacc (C1,C2) = |C1 intersection C2|/|C1 union C2|
But inefficient for large sets of documents because of pair-wise
Similarity problems
Finsing subsets that have significant intersection
Minhashing
Using techniques like minhashing allows for efficient computation of similarities among large datasets. By reducing the size of the data through the creation of signatures, we can achieve significant performance improvements without sacrificing the accuracy of similarity measurements. This approach is crucial for handling large-scale data efficiently.
Matrix sparsity
In large datasets, the boolean matrix is typically sparse, meaning most entries are 0. This happens because each set (document) typically contains only a small subset of the universal set (all possible elements).
Boolean Matrix Representation
Efficiently encodes sets as columns in a matrix, with rows representing elements of the universal set.
Row hashing
Row hashing utilizes hash functions to induce random permutations of row indices. This approach allows for the generation of min-hash signatures without the need for explicit row permutations, thereby avoiding the associated storage and computational challenges. The one-pass implementation involves iterating through each row of the matrix, computing hash values for each row, and updating the min-hash values for each column based on the hash results. This strategy offers a scalable solution for efficiently computing min-hash signatures while avoiding the complexities of directly permuting large matrices.
Localy-sensitive hashing (LSH)
Focus on pairs of signatures likely to be similar
1. create a shortlist of potential pairs that need to be checked for similarity.
- Divide the columns into many groups using hashing. Elements within the same group are considered potential pairs that require comparison to determine their similarity
Modern entity linking
Schema alignment -> blocking -> matching -> clustering