Big Data Integration Flashcards
What is the main objective of Big Data integration?
To make heterogeneous data from different sources accessible and usable by combining traditional data integration principles with Big Data characteristics (4 Vs)
What are the 4 Vs that characterize Big Data?
Volume, Velocity, Variety, and Value
What is an example of Big Data integration in company mergers?
Customer database integration, dealing with differences in schemas, entity duplication, and data inconsistencies
Why are traditional ETL approaches less effective for Big Data?
1) Schema rigidity limits flexibility 2) Duplicate detection is computationally expensive 3) Choosing a single true representation may not work for all applications
What is the main difference between ETL and ELT?
ELT loads data before transforming it, while ETL transforms data before loading; ELT is better suited to big data because storage is cheap and transformations can be run on demand at analysis time
What is a Data Lake?
A large centralized repository that stores raw data (structured, semi-structured, and unstructured) from various sources in one place
What are the three main challenges in Big Data Integration?
1) Data Discovery 2) Schema Alignment 3) Entity Resolution
What is Schema Alignment?
The process of aligning schemas from different data sources to enable integration
What is Entity Resolution (ER)?
The process of identifying and merging records from different data sources that refer to the same real-world entity
What are probabilistic mediated schemas?
Schemas that use weighted attribute correspondences to model uncertainty about the semantics of attributes in the sources
What are the two main types of similarity used in schema alignment?
1) Metadata similarity (attribute names, documentation, table captions) 2) Instance similarity (attribute cell values, entities, columns)
What are the four main phases of Entity Resolution?
1) Candidate Generation 2) Matching 3) Clustering 4) Merging
What is Candidate Generation in ER?
The phase where pairs of records that might refer to the same entity are identified based on attribute similarity or group membership
What is the Matching phase in ER?
The phase where a matching function evaluates the probability that candidate record pairs refer to the same entity
What is the purpose of the Clustering phase in ER?
To group matching records into clusters where each cluster represents a single real-world entity
What is the Merging phase in ER?
The phase where a single record representing the entity is created from each cluster by choosing a standard format and resolving conflicts
What are the three main challenges in Entity Resolution?
1) Real-world Ambiguity 2) Data Errors 3) Scalability
What is Blocking in ER?
A technique to reduce comparisons by grouping similar records into blocks and only comparing within each block
What are the three main types of text similarity measures?
1) Character-based 2) Token-based 3) Vector-based
What is Hamming distance?
Number of positions in which two strings of equal length differ
What is Levenshtein distance?
Minimum number of character insertions, deletions, and replacements needed to transform one string into another
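Both character-based measures can be sketched in a few lines of Python (the function names and example strings are illustrative, not from the deck):

```python
def hamming_distance(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and replacements to turn s into t."""
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement
        prev = curr
    return prev[-1]


# hamming_distance("karolin", "kathrin") == 3
# levenshtein_distance("kitten", "sitting") == 3
```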
What is Jaccard similarity?
The size of the intersection divided by the size of the union of two sets
What is Jaccard containment?
The size of the intersection divided by the minimum size of the two sets
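A minimal Python sketch of both set measures (illustrative function names):

```python
def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|"""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_containment(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|)"""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# jaccard_similarity({"big", "data"}, {"big", "data", "integration"}) == 2/3
# jaccard_containment({"big", "data"}, {"big", "data", "integration"}) == 1.0
```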
What are the three subtasks of Entity Resolution?
1) Clean-Clean ER (Record Linkage) 2) Dirty-Clean ER 3) Dirty-Dirty ER
What is Clean-Clean ER?
Finding matches between two clean collections (collections free of duplicates)
What is Dirty-Clean ER?
Finding duplicates within a single dirty collection (Deduplication)
What is Dirty-Dirty ER?
Finding matches across more than two collections
What is Schema-agnostic Token Blocking?
A blocking technique that uses every token in every attribute value as a blocking key regardless of the attribute
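A minimal sketch of schema-agnostic token blocking, assuming records are plain dictionaries (the sample records are invented for illustration):

```python
from collections import defaultdict


def token_blocking(records: dict) -> dict:
    """Schema-agnostic token blocking: every token of every attribute value
    becomes a blocking key, regardless of which attribute it comes from."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for value in record.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # Blocks holding a single record yield no comparisons and can be dropped.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


records = {
    1: {"name": "John Smith", "city": "Boston"},
    2: {"full_name": "John Smith Jr.", "location": "Boston MA"},
}
# Shared tokens such as "john", "smith", and "boston" place both records in
# the same blocks even though the two schemas differ.
```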
What is the main advantage of Schema-agnostic Token Blocking?
It is very robust and achieves high recall, since it is unlikely to miss a match, and it requires no schema knowledge
What is Meta-blocking?
A technique that uses block-entity relationships to identify and maintain the most promising comparisons reducing superfluous comparisons
What is ARCS weighting scheme in Meta-blocking?
Aggregate Reciprocal Comparisons Scheme - sums the reciprocal of the number of comparisons from each shared block
What is CBS weighting scheme in Meta-blocking?
Common Blocks Scheme - uses the number of blocks shared between profiles
What is JS weighting scheme in Meta-blocking?
Jaccard Scheme - uses the Jaccard similarity of the blocks shared between profiles
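The three weighting schemes can be sketched as follows, assuming each profile is described by the set of block keys it appears in and that block_comparisons gives the number of comparisons per block (names are illustrative):

```python
def cbs(blocks_i: set, blocks_j: set) -> float:
    """Common Blocks Scheme: number of blocks the two profiles share."""
    return len(blocks_i & blocks_j)


def js(blocks_i: set, blocks_j: set) -> float:
    """Jaccard Scheme: Jaccard similarity of the two profiles' block sets."""
    union = blocks_i | blocks_j
    return len(blocks_i & blocks_j) / len(union) if union else 0.0


def arcs(blocks_i: set, blocks_j: set, block_comparisons: dict) -> float:
    """Aggregate Reciprocal Comparisons Scheme: sum of the reciprocals of the
    number of comparisons in each shared block, so small blocks weigh more."""
    return sum(1.0 / block_comparisons[b] for b in blocks_i & blocks_j)


# block_comparisons maps a block key to |b| * (|b| - 1) / 2 comparisons
# (for a single dirty collection).
```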
What are the four pruning methods in Meta-blocking?
1) Weighted Edge Pruning 2) Cardinality Edge Pruning 3) Weighted Node Pruning 4) Cardinality Node Pruning
What is Block Filtering?
A technique that retains each entity in a percentage of its smallest blocks since larger blocks are less likely to contain unique duplicates
What is Block Purging?
A technique that removes oversized blocks by setting an upper limit on block cardinality
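A rough sketch of both techniques over a blocks dictionary mapping a blocking key to the set of record ids it contains (the 0.8 ratio and the function names are assumptions for illustration):

```python
from collections import defaultdict


def block_purging(blocks: dict, max_comparisons: int) -> dict:
    """Block Purging: drop blocks whose number of comparisons exceeds a limit."""
    def comparisons(ids):
        return len(ids) * (len(ids) - 1) // 2
    return {k: ids for k, ids in blocks.items() if comparisons(ids) <= max_comparisons}


def block_filtering(blocks: dict, ratio: float = 0.8) -> dict:
    """Block Filtering: keep each entity only in a fraction of its smallest
    blocks, since large blocks rarely contribute unique duplicates."""
    entity_blocks = defaultdict(list)
    for key, ids in blocks.items():
        for rid in ids:
            entity_blocks[rid].append(key)
    kept = defaultdict(set)
    for rid, keys in entity_blocks.items():
        keys.sort(key=lambda k: len(blocks[k]))            # smallest blocks first
        for key in keys[: max(1, round(ratio * len(keys)))]:
            kept[key].add(rid)
    return {key: ids for key, ids in kept.items() if len(ids) > 1}
```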
What is Pay-as-you-go ER?
A strategy that focuses on generating candidate pairs in a specific order to maximize progressive recall with limited resources
What are the three levels of ordering in Pay-as-you-go ER?
1) Comparisons level 2) Block level 3) Entity level
What is Progressive Sorted Neighborhood (PSN)?
A method that sorts records by a blocking key and uses a sliding window to compare nearby records, assuming that records close to each other in the sorted order are more likely to match
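A compact sketch of the sliding-window idea, assuming a user-supplied blocking-key function (illustrative, not the exact PSN variants below):

```python
def progressive_sorted_neighborhood(records: dict, key_func, max_window: int):
    """Sort records by a blocking key, then emit pairs with a growing window
    so that the closest (most promising) pairs are produced first."""
    order = sorted(records, key=lambda rid: key_func(records[rid]))
    for w in range(1, max_window + 1):          # window 1 compares adjacent records
        for i in range(len(order) - w):
            yield order[i], order[i + w]


# Example with an assumed key of surname + zip code:
# pairs = progressive_sorted_neighborhood(records, lambda r: r["surname"] + r["zip"], 3)
```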
What is LS-PSN?
Local Schema-agnostic PSN - uses similarity principle and token sorting without schema knowledge
What is GS-PSN?
Global Schema-agnostic PSN - defines global execution order for all pairs within a predefined range of window sizes
What is PBS in Progressive ER?
Progressive Block Scheduling - orders blocks by weight assuming smaller blocks are more informative
What are the two main types of related tables?
1) Unionable tables 2) Joinable tables
What are Unionable tables?
Tables that are entity complements sharing the same schema but different records
What are Joinable tables?
Tables that are schema complements sharing some key attributes but having different additional attributes
What is Similarity Join?
An operation that retrieves all pairs of records whose similarity exceeds a threshold
What are the three main filters used in Similarity Joins?
1) Prefix filter 2) Length filter 3) Positional filter
What is the Prefix filter?
Filter that requires matching records to share at least one token in their prefixes
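A sketch of the prefix filter for a Jaccard threshold t, using the standard prefix length |x| − ⌈t·|x|⌉ + 1 under a lexicographic global token order (function names are illustrative):

```python
import math


def jaccard_prefix(tokens: list, t: float) -> list:
    """Prefix under a fixed global token order for Jaccard threshold t:
    the first |x| - ceil(t * |x|) + 1 tokens."""
    tokens = sorted(tokens)                      # any shared global order works
    keep = len(tokens) - math.ceil(t * len(tokens)) + 1
    return tokens[:keep]


def may_match(x: list, y: list, t: float) -> bool:
    """Keep a pair as a candidate only if the prefixes overlap; pairs failing
    this test cannot reach Jaccard similarity t and are safely discarded."""
    return bool(set(jaccard_prefix(x, t)) & set(jaccard_prefix(y, t)))
```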
What is shingling in document similarity?
Representing documents as sets of k-length substrings found within them
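For example, in Python (k and the sample text are arbitrary):

```python
def shingles(text: str, k: int = 5) -> set:
    """Represent a document as the set of all k-character substrings it contains."""
    text = " ".join(text.split())                # normalize whitespace
    return {text[i:i + k] for i in range(len(text) - k + 1)}


# shingles("big data", 4) == {"big ", "ig d", "g da", " dat", "data"}
```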
What is minhashing?
A technique that compresses sets into small signatures while preserving their Jaccard similarity
What is the connection between minhashing and Jaccard similarity?
The probability that the minhash function produces the same value for two sets equals their Jaccard similarity
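A small sketch of minhashing, where salted built-in hash values stand in for random permutations (an approximation chosen for brevity):

```python
import random


def minhash_signature(shingle_set: set, num_hashes: int = 100, seed: int = 42) -> list:
    """Compress a set into num_hashes minimum hash values. For each hash
    function, P[minhash(A) == minhash(B)] equals the Jaccard similarity of
    A and B, so agreeing signature positions estimate that similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which the two sets agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```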
What is Locality-Sensitive Hashing (LSH)?
A technique to reduce comparisons by hashing similar items to the same buckets
What is the banding technique in LSH?
Dividing the signature matrix into bands and hashing each band to find candidate pairs that match in at least one band
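A sketch of the banding step over precomputed signatures (parameter names are illustrative):

```python
from collections import defaultdict


def lsh_candidates(signatures: dict, bands: int, rows: int) -> set:
    """Split each signature into `bands` bands of `rows` values, hash each
    band, and report as candidates all pairs colliding in at least one band.
    Items with signature similarity s collide in some band with probability
    1 - (1 - s**rows)**bands."""
    buckets = defaultdict(list)
    for item_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(item_id)
    candidates = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates
```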
What is the basic LSH method for cosine similarity?
Using random hyperplane projections, where two vectors receive the same hash bit with probability 1 − θ/π, θ being the angle between them
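A sketch of the random-hyperplane (SimHash-style) construction, with invented dimensions:

```python
import random


def random_hyperplane_sketch(vector: list, planes: list) -> tuple:
    """SimHash-style sketch: one bit per random hyperplane, 1 if the vector
    falls on its positive side. For a single plane, two vectors at angle θ
    receive the same bit with probability 1 - θ/π."""
    return tuple(1 if sum(v * p for v, p in zip(vector, plane)) >= 0 else 0
                 for plane in planes)


random.seed(0)
dim, num_planes = 4, 16                     # invented sizes for illustration
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
# Vectors with nearly identical sketches are likely to have a small angle,
# hence high cosine similarity.
```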
What are examples of clean-clean entity resolution?
Finding matches between two databases without duplicates, such as merging customer records from two different companies
What is a real-world example of dirty-dirty entity resolution?
Finding matches across multiple product catalogs that may each contain duplicates
What is a use case for schema complement tables?
Finding tables that contain additional attributes for existing entities, such as adding demographic data to customer records
What are common real-world applications of similarity joins?
Record linkage, data cleaning, and deduplication tasks
How does shingling help with document similarity?
It creates a set-based representation that captures local text structure and enables set similarity comparisons
Why is minhashing more efficient than direct set comparison?
It creates small fixed-size signatures that preserve similarity while reducing storage and comparison costs
What problem does LSH solve in similarity search?
It reduces the number of necessary comparisons by only comparing items likely to be similar based on their hash values
What are key considerations in choosing shingle size?
Balance between discriminative power and computational cost: larger k gives more precision but requires more processing
What makes Meta-blocking more accurate than basic blocking?
It uses block relationships and weights to identify promising comparisons rather than just shared blocking keys
How does Pay-as-you-go ER handle resource constraints?
It prioritizes likely matches to identify as many matches as possible early with limited time/computation
What is the main trade-off in choosing LSH bands and rows?
More bands increase recall but also false positives while more rows per band increase precision but may miss matches
How do you choose blocking keys for traditional blocking?
Select attributes that are unlikely to contain errors and have good discriminative power
Why might you choose GS-PSN over LS-PSN?
GS-PSN avoids repeated pair comparisons and has better matching probability estimation
What advantage does Block Filtering offer over Block Purging?
It retains some utility from larger blocks while still reducing comparisons rather than completely removing them
Why use multiple minhash functions rather than just one?
Multiple functions provide better similarity estimation by sampling more permutations
How does the prefix filter guarantee completeness?
If records do not share any prefix tokens, they cannot meet the similarity threshold
What role does transitivity play in entity clustering?
It helps group related pairs into clusters assuming if A=B and B=C then A=C
Why might you choose supervised over unsupervised meta-blocking?
Supervised methods can learn better weighting schemes if labeled training data is available
What makes schema-agnostic methods valuable for big data?
They don’t require schema alignment or domain knowledge making them more flexible and scalable
How do you handle multi-valued attributes in similarity computation?
Use set-based similarity measures like Jaccard similarity rather than exact matching
What is the purpose of the triangle inequality in similarity metrics?
It guarantees that d(A, C) ≤ d(A, B) + d(B, C), keeping distance relationships between items consistent
Why are character-based similarities good for catching typos?
They can detect small character-level differences that indicate typing errors
How do token-based similarities handle word rearrangement?
They treat text as bags of words so word order doesn’t affect similarity
What role do thresholds play in similarity matching?
They determine the minimum similarity required to consider items as matching
Why use progressive techniques for entity resolution?
They allow useful partial results with limited resources by prioritizing likely matches
What makes blocking key selection challenging?
Keys must balance discriminative power, error tolerance, and computational efficiency
How do you evaluate the quality of blocking results?
Measure pair completeness (recall) and pair quality (precision) of the generated blocks
What is the relationship between block size and matching probability?
Smaller blocks generally have higher matching probability but may miss some matches
What factors affect the choice of similarity measure?
Data type, expected error patterns, and computational requirements
Why might you combine multiple similarity measures?
Different measures catch different types of variations and errors
What makes some blocking methods more scalable than others?
Efficient indexing structures and filtering techniques that reduce necessary comparisons
How do you handle evolving data in entity resolution?
Use incremental techniques that can process updates without full recomputation
What role does data cleaning play in entity resolution?
It reduces noise and standardizes formats improving matching accuracy
Why is entity resolution an ongoing challenge?
Data volumes, variety, and velocity keep increasing while quality requirements remain high
What makes schema alignment particularly challenging for big data?
Scale, heterogeneity, and lack of schema information in many data sources
How do probabilistic approaches help with uncertainty?
They model and propagate uncertainty rather than forcing early decisions
What role does human feedback play in entity resolution?
It helps validate matches, train models, and resolve difficult cases
How do you balance precision and recall in blocking?
Adjust blocking key selection and filtering parameters based on application needs
What makes some entity resolution tasks harder than others?
Factors like data quality, schema heterogeneity, and scale affect difficulty
How do you handle missing values in similarity computation?
Use similarity measures that can handle missing data or impute values
What role does domain knowledge play in entity resolution?
It helps select features, blocking keys, and similarity measures
What makes incremental entity resolution challenging?
New data may affect existing clusters requiring efficient updates
How do you handle temporal aspects in entity resolution?
Consider time-stamped values and evolution of entities over time
What role does data profiling play in entity resolution?
It helps understand data characteristics to choose appropriate methods
How do you handle multi-source entity resolution?
Consider source reliability and potential conflicts between sources
What makes real-time entity resolution challenging?
Need for quick decisions with limited information and resources
How do you evaluate entity resolution results?
Measure precision, recall, and clustering quality against ground truth
What role does scalability play in choosing ER methods?
Methods must handle data volume while maintaining acceptable accuracy
How do you handle privacy in entity resolution?
Use privacy-preserving techniques while maintaining matching ability
What makes some entity pairs harder to resolve than others?
Factors like data quality, conflicting information, and ambiguity
How do you handle hierarchical relationships in ER?
Consider entity relationships and dependencies during matching
What role does data standardization play in ER?
It reduces superficial differences improving matching accuracy
How do you handle multi-lingual entity resolution?
Use language-independent features or cross-lingual matching techniques
What makes schema evolution challenging for ER?
Changes in data structure require updating matching rules and models
How do you maintain entity resolution results over time?
Track changes and updates, and maintain cluster consistency
What role does data governance play in entity resolution?
It ensures consistent policies for matching and merging entities
How do you handle entity resolution in distributed systems?
Use distributed algorithms and maintain consistency across nodes
What makes reference data important for entity resolution?
It provides authoritative information for matching and validation
How do you handle streaming entity resolution?
Process updates incrementally with limited historical information
What role does metadata play in entity resolution?
It provides context and constraints for matching decisions
How do you handle uncertainty in entity resolution?
Model and propagate uncertainty through the resolution process
What makes entity resolution important for data quality?
It identifies and resolves duplicate and conflicting entity information
How do you handle scale in entity resolution?
Use efficient indexing, blocking, and filtering techniques
What role does automation play in entity resolution?
It reduces manual effort while maintaining acceptable accuracy
How do you handle complex matching rules in ER?
Break down into simpler components and combine results
What makes online entity resolution different from batch?
Need for immediate decisions with partial information
How do you handle data quality issues in ER?
Use robust matching methods and clean data when possible
What role does monitoring play in entity resolution?
Track quality and performance to maintain effectiveness
How do you handle updates to resolved entities?
Efficiently propagate changes while maintaining consistency
What makes some data sources better for ER than others?
Factors like completeness, accuracy, and structure
How do you handle schema mapping in entity resolution?
Align schemas while handling uncertainty and variations
What role does testing play in entity resolution?
Validate matching rules and measure effectiveness
How do you handle large-scale entity resolution?
Use distributed processing and efficient algorithms
What makes incremental updates challenging for ER?
Need to maintain consistency while processing changes
How do you handle entity resolution across domains?
Consider domain-specific features and matching rules
What role does documentation play in entity resolution?
Track decisions, rules, and processes for maintenance
How do you handle entity resolution in real time?
Use efficient methods that can make quick decisions
What makes data preparation important for ER?
Good preparation improves matching accuracy
How do you handle entity resolution failure cases?
Analyze failures to improve matching rules
What role does optimization play in entity resolution?
Improve efficiency while maintaining accuracy
What is JOSIE and what problem does it solve?
JOSIE (JOining Search using Intersection Estimation) is an algorithm that finds the k sets in a data lake with the largest intersections with a query set, where sets represent columns in tables
What basic data structures does JOSIE use to handle large sets?
1) Inverted index (containing posting lists) 2) Dictionary (storing tokens, frequencies, and pointers to posting lists)
What information does each posting list entry contain in JOSIE?
A tuple containing: (SetID, Position, SetSize)
What is the main limitation of the MergeList algorithm in JOSIE?
Its read time is linear to the number of matched tokens, making it inefficient for sets with thousands or millions of tokens
How does Prefix Filter optimize JOSIE’s performance?
It reduces the number of posting lists that need to be read by using the intersection size of the current k-th candidate as a threshold, reading only a prefix of |Q| − |Q ∩ Xk| + 1 lists
What two requirements must be met to use Position filter in JOSIE?
1) A global ordering of all tokens (e.g., lexicographic, by length) 2) Posting lists that contain token positions and set sizes
What are the two benefits of using Position filter in JOSIE?
1) Prunes candidates whose intersection sizes cannot reach the threshold before reading them 2) Reduces the time spent reading individual candidates by reading only from the first matching position onward
How does the Position filter calculate the upper bound of intersection size?
Using the bound |Q ∩ X| ≤ |Q ∩ X|ub = 1 + min(|Q| − iQ,0, |X| − jX,0), where iQ,0 and jX,0 are the positions of the first matching token in Q and X under the global token order
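A tiny helper illustrating the bound, assuming 1-based positions under the global token order (names are illustrative):

```python
def intersection_upper_bound(q_size: int, x_size: int, i_q0: int, j_x0: int) -> int:
    """Position-filter bound: if the first matching token sits at (1-based)
    position i_q0 in the query Q and j_x0 in the candidate X, at most
    1 + min(|Q| - i_q0, |X| - j_x0) tokens can still match."""
    return 1 + min(q_size - i_q0, x_size - j_x0)


# A candidate can be pruned before it is read whenever this bound is already
# smaller than the intersection size of the current k-th best candidate.
```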
What is the key difference between ProbeSet and MergeList in JOSIE?
ProbeSet probes candidates as it encounters them and can stop early, while MergeList reads all posting lists completely
When does JOSIE stop reading new posting lists?
When the number of lists read equals |Q| − |Q ∩ Xk| + 1, where Xk is the k-th candidate
What is MATE in contrast to JOSIE?
A system for detecting the top-k n-ary joinable tables in large table corpora, using the XASH hash function and super keys
What makes XASH unique compared to other hash functions?
It encodes distinctive properties: less frequent characters, their positions, and value length, rather than relying on uniform distribution