Big Data Integration Flashcards

1
Q

What is the main objective of Big Data integration?

A

To make heterogeneous data from different sources accessible and usable by combining traditional data integration principles with Big Data characteristics (4 Vs)

2
Q

What are the 4 Vs that characterize Big Data?

A

Volume, Velocity, Variety, and Value

3
Q

What is an example of Big Data integration in company mergers?

A

Customer database integration, dealing with differences in schemas, entity duplication, and data inconsistencies

4
Q

Why are traditional ETL approaches less effective for Big Data?

A

1) Schema rigidity limits flexibility 2) Duplicate detection is computationally expensive 3) Choosing a single true representation may not work for all applications

5
Q

What is the main difference between ETL and ELT?

A

ELT loads data before transforming it, while ETL transforms data before loading. ELT suits big data better because storage is cheap and transformations can run on demand at analysis time

6
Q

What is a Data Lake?

A

A large centralized repository that stores raw structured, semi-structured, and unstructured data from various sources in one place

7
Q

What are the three main challenges in Big Data Integration?

A

1) Data Discovery 2) Schema Alignment 3) Entity Resolution

8
Q

What is Schema Alignment?

A

The process of aligning schemas from different data sources to enable integration

9
Q

What is Entity Resolution (ER)?

A

The process of identifying and merging records from different data sources that refer to the same real-world entity

10
Q

What are probabilistic mediated schemas?

A

Schemas that use weighted attribute correspondences to model uncertainty about the semantics of attributes in the sources

11
Q

What are the two main types of similarity used in schema alignment?

A

1) Metadata similarity (attribute names, documentation, table captions) 2) Instance similarity (attribute cell values, entities, columns)

12
Q

What are the four main phases of Entity Resolution?

A

1) Candidate Generation 2) Matching 3) Clustering 4) Merging

13
Q

What is Candidate Generation in ER?

A

The phase where pairs of records that might refer to the same entity are identified based on attribute similarity or group membership

14
Q

What is the Matching phase in ER?

A

The phase where a matching function evaluates the probability that candidate record pairs refer to the same entity

15
Q

What is the purpose of the Clustering phase in ER?

A

To group matching records into clusters where each cluster represents a single real-world entity

16
Q

What is the Merging phase in ER?

A

The phase where a single record representing the entity is created from each cluster by choosing a standard format and resolving conflicts

17
Q

What are the three main challenges in Entity Resolution?

A

1) Real-world Ambiguity 2) Data Errors 3) Scalability

18
Q

What is Blocking in ER?

A

A technique to reduce comparisons by grouping similar records into blocks and only comparing within each block

19
Q

What are the three main types of text similarity measures?

A

1) Character-based 2) Token-based 3) Vector-based

20
Q

What is Hamming distance?

A

Number of positions in which two strings of equal length differ
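A minimal Python sketch of this definition (the function name is illustrative):

```python
def hamming(a: str, b: str) -> int:
    # Hamming distance is defined only for strings of equal length.
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

# hamming("karolin", "kathrin") → 3
```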

21
Q

What is Levenshtein distance?

A

Minimum number of character insertions, deletions, and replacements needed to transform one string into another
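A sketch of the standard row-by-row dynamic program for this distance (names are my own):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # replacement
        prev = cur
    return prev[-1]

# levenshtein("kitten", "sitting") → 3
```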

22
Q

What is Jaccard similarity?

A

The size of the intersection divided by the size of the union of two sets

23
Q

What is Jaccard containment?

A

The size of the intersection divided by the minimum size of the two sets
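Both Jaccard measures are one-liners over Python sets (the empty-set handling is my own choice):

```python
def jaccard_similarity(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|; two empty sets are treated as identical here.
    return len(a & b) / len(a | b) if a | b else 1.0

def jaccard_containment(a: set, b: set) -> float:
    # |A ∩ B| / min(|A|, |B|)
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

# jaccard_similarity({1, 2, 3}, {2, 3, 4}) → 0.5
```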

24
Q

What are the three subtasks of Entity Resolution?

A

1) Clean-Clean ER (Record Linkage) 2) Dirty-Clean ER 3) Dirty-Dirty ER

25
Q

What is Clean-Clean ER?

A

Finding matches between two clean collections (collections free of duplicates)

26
Q

What is Dirty-Clean ER?

A

Finding duplicates within a single dirty collection (Deduplication)

27
Q

What is Dirty-Dirty ER?

A

Finding matches across more than two collections, any of which may contain duplicates

28
Q

What is Schema-agnostic Token Blocking?

A

A blocking technique that uses every token in every attribute value as a blocking key regardless of the attribute
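A toy sketch of the idea, assuming records are attribute-value dictionaries (the tokenizer and all names are my own):

```python
import re
from collections import defaultdict

def token_blocking(records):
    # records: {record_id: {attribute: value}}
    # Every token of every value becomes a blocking key, regardless of attribute.
    blocks = defaultdict(set)
    for rid, attrs in records.items():
        for value in attrs.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(rid)
    # Singleton blocks produce no comparisons, so drop them.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}
```

Because every token is a key, each record lands in many blocks; that redundancy is what makes the method robust to schema differences.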

29
Q

What is the main advantage of Schema-agnostic Token Blocking?

A

It is very robust with high recall, since it is unlikely to miss a match, and it requires no schema knowledge

30
Q

What is Meta-blocking?

A

A technique that uses block-entity relationships to identify and retain the most promising comparisons, discarding superfluous ones

31
Q

What is ARCS weighting scheme in Meta-blocking?

A

Aggregate Reciprocal Comparisons Scheme - sums the reciprocal of the number of comparisons from each shared block

32
Q

What is CBS weighting scheme in Meta-blocking?

A

Common Blocks Scheme - uses the number of blocks shared between profiles

33
Q

What is JS weighting scheme in Meta-blocking?

A

Jaccard Scheme - uses the Jaccard similarity of the blocks shared between profiles
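The three weighting schemes above can be sketched over block memberships, assuming `blocks_of` maps each profile to the set of block ids containing it and `block_comparisons` gives each block's comparison count (all names are my own):

```python
def cbs(blocks_of, p1, p2):
    # Common Blocks Scheme: number of blocks the two profiles share.
    return len(blocks_of[p1] & blocks_of[p2])

def js(blocks_of, p1, p2):
    # Jaccard Scheme: Jaccard similarity of the profiles' block sets.
    union = blocks_of[p1] | blocks_of[p2]
    return len(blocks_of[p1] & blocks_of[p2]) / len(union) if union else 0.0

def arcs(blocks_of, block_comparisons, p1, p2):
    # ARCS: sum of reciprocals of each shared block's comparison count,
    # so small (more discriminative) blocks contribute more weight.
    shared = blocks_of[p1] & blocks_of[p2]
    return sum(1.0 / block_comparisons[b] for b in shared)
```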

34
Q

What are the four pruning methods in Meta-blocking?

A

1) Weighted Edge Pruning 2) Cardinality Edge Pruning 3) Weighted Node Pruning 4) Cardinality Node Pruning

35
Q

What is Block Filtering?

A

A technique that retains each entity in a percentage of its smallest blocks since larger blocks are less likely to contain unique duplicates
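A minimal sketch of this idea, assuming blocks are keyed sets of entity ids and a retention ratio parameter (names and defaults are my own):

```python
from collections import defaultdict

def block_filtering(blocks, ratio=0.8):
    # blocks: {key: set of entity ids}. Each entity is retained only in the
    # smallest `ratio` fraction of its blocks (at least one block).
    entity_blocks = defaultdict(list)
    for key, members in blocks.items():
        for e in members:
            entity_blocks[e].append(key)
    kept = defaultdict(set)
    for e, keys in entity_blocks.items():
        keys.sort(key=lambda k: len(blocks[k]))  # smallest blocks first
        for k in keys[:max(1, int(ratio * len(keys)))]:
            kept[k].add(e)
    return {k: v for k, v in kept.items() if len(v) > 1}
```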

36
Q

What is Block Purging?

A

A technique that removes oversized blocks by setting an upper limit on block cardinality

37
Q

What is Pay-as-you-go ER?

A

A strategy that focuses on generating candidate pairs in a specific order to maximize progressive recall with limited resources

38
Q

What are the three levels of ordering in Pay-as-you-go ER?

A

1) Comparisons level 2) Block level 3) Entity level

39
Q

What is Progressive Sorted Neighborhood (PSN)?

A

A method that sorts records by a key and uses a sliding window to compare nearby records assuming higher matching probability for proximity
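The sorted-neighborhood core can be sketched as a generator; a progressive variant would re-run it with growing window sizes, emitting the closest pairs first (names illustrative):

```python
def sorted_neighborhood_pairs(records, key, window):
    # Sort record ids by a blocking key, then compare only records whose
    # sorted positions are fewer than `window` apart.
    order = sorted(records, key=lambda rid: key(records[rid]))
    for i in range(len(order)):
        for j in range(i + 1, min(i + window, len(order))):
            yield order[i], order[j]
```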

40
Q

What is LS-PSN?

A

Local Schema-agnostic PSN - uses similarity principle and token sorting without schema knowledge

41
Q

What is GS-PSN?

A

Global Schema-agnostic PSN - defines global execution order for all pairs within a predefined range of window sizes

42
Q

What is PBS in Progressive ER?

A

Progressive Block Scheduling - orders blocks by weight assuming smaller blocks are more informative

43
Q

What are the two main types of related tables?

A

1) Unionable tables 2) Joinable tables

44
Q

What are Unionable tables?

A

Tables that are entity complements sharing the same schema but different records

45
Q

What are Joinable tables?

A

Tables that are schema complements sharing some key attributes but having different additional attributes

46
Q

What is Similarity Join?

A

An operation that retrieves all pairs of records whose similarity exceeds a threshold

47
Q

What are the three main filters used in Similarity Joins?

A

1) Prefix filter 2) Length filter 3) Positional filter

48
Q

What is the Prefix filter?

A

Filter that requires matching records to share at least one token in their prefixes
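For a Jaccard threshold t, the prefix length commonly used is |r| − ⌈t·|r|⌉ + 1. A sketch assuming both token lists are pre-sorted by the same global order, e.g. rarest token first (names are my own):

```python
import math

def prefix_length(size, t):
    # A record with `size` tokens keeps a prefix of size - ceil(t*size) + 1
    # tokens; records above threshold t must share a token in these prefixes.
    return size - math.ceil(t * size) + 1

def may_match(r1, r2, t):
    # r1, r2: token lists sorted by the same global order.
    # If the prefixes are disjoint, the pair can be pruned safely.
    p1 = set(r1[:prefix_length(len(r1), t)])
    p2 = set(r2[:prefix_length(len(r2), t)])
    return bool(p1 & p2)
```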

49
Q

What is shingling in document similarity?

A

Representing documents as sets of k-length substrings found within them
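Character k-shingles are a one-line set comprehension in Python (function name illustrative):

```python
def shingles(text, k=5):
    # Set of all length-k substrings (character k-shingles) of the document.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# shingles("abcab", 2) → {"ab", "bc", "ca"}
```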

50
Q

What is minhashing?

A

A technique that compresses sets into small signatures while preserving their Jaccard similarity

51
Q

What is the connection between minhashing and Jaccard similarity?

A

The probability that the minhash function produces the same value for two sets equals their Jaccard similarity
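A sketch of this connection, simulating random permutations with universal hashes of the form (a·x + b) mod p; within a single Python process `hash()` is stable, so signatures are comparable (all parameters and names are illustrative):

```python
import random

def minhash_signature(tokens, num_hashes=100, seed=42):
    # Each (a, b) pair defines one hash h(x) = (a*x + b) mod p, standing in
    # for a random permutation; the signature keeps each hash's minimum.
    p = 2_147_483_647  # a large prime
    rng = random.Random(seed)
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(t) + b) % p for t in tokens) for a, b in params]

def estimate_jaccard(sig1, sig2):
    # The fraction of agreeing positions estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```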

52
Q

What is Locality-Sensitive Hashing (LSH)?

A

A technique to reduce comparisons by hashing similar items to the same buckets

53
Q

What is the banding technique in LSH?

A

Dividing the signature matrix into bands and hashing each band to find candidate pairs that match in at least one band
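A minimal sketch of banding over precomputed signatures (names are my own):

```python
from collections import defaultdict

def lsh_candidates(signatures, bands, rows):
    # Hash each band (a slice of the signature) into buckets; any two items
    # sharing a bucket in at least one band become a candidate pair.
    buckets = defaultdict(set)
    for item, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(item)
    candidates = set()
    for members in buckets.values():
        ms = sorted(members)
        candidates.update((x, y) for i, x in enumerate(ms) for y in ms[i + 1:])
    return candidates
```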

54
Q

What is the basic LSH method for cosine similarity?

A

Using random hyperplane projections, where two vectors receive the same hash bit with probability 1 − θ/π (θ being the angle between them), so higher cosine similarity means more frequent collisions
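A sketch of the random-hyperplane idea (names and parameters are my own): each plane contributes one bit recording which side of it the vector falls on, and vectors at a small angle agree on most bits.

```python
import random

def rp_sketch(vec, num_planes=64, seed=0):
    # Draw num_planes random Gaussian hyperplanes and record, per plane,
    # the sign of the dot product with the input vector.
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in vec] for _ in range(num_planes)]
    return [int(sum(p * x for p, x in zip(plane, vec)) >= 0) for plane in planes]
```

Note that scaling a vector does not change any sign, so the sketch depends only on direction, which is exactly what cosine similarity measures.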

55
Q

What are examples of clean-clean entity resolution?

A

Finding matches between two databases without duplicates, such as merging customer records from two different companies

56
Q

What is a real-world example of dirty-dirty entity resolution?

A

Finding matches across multiple product catalogs that may each contain duplicates

57
Q

What is a use case for schema complement tables?

A

Finding tables that contain additional attributes for existing entities like adding demographic data to customer records

58
Q

What are common real-world applications of similarity joins?

A

Record linkage, data cleaning, and deduplication tasks

59
Q

How does shingling help with document similarity?

A

It creates a set-based representation that captures local text structure and enables set similarity comparisons

60
Q

Why is minhashing more efficient than direct set comparison?

A

It creates small fixed-size signatures that preserve similarity while reducing storage and comparison costs

61
Q

What problem does LSH solve in similarity search?

A

It reduces the number of necessary comparisons by only comparing items likely to be similar based on their hash values

62
Q

What are key considerations in choosing shingle size?

A

Balance between discriminative power and computational cost; larger k gives more precision but requires more processing

63
Q

What makes Meta-blocking more accurate than basic blocking?

A

It uses block relationships and weights to identify promising comparisons rather than just shared blocking keys

64
Q

How does Pay-as-you-go ER handle resource constraints?

A

It prioritizes likely matches to identify as many matches as possible early with limited time/computation

65
Q

What is the main trade-off in choosing LSH bands and rows?

A

More bands increase recall but also false positives, while more rows per band increase precision but may miss matches

66
Q

How do you choose blocking keys for traditional blocking?

A

Select attributes that are unlikely to contain errors and have good discriminative power

67
Q

Why might you choose GS-PSN over LS-PSN?

A

GS-PSN avoids repeated pair comparisons and has better matching probability estimation

68
Q

What advantage does Block Filtering offer over Block Purging?

A

It retains some utility from larger blocks while still reducing comparisons rather than completely removing them

69
Q

Why use multiple minhash functions rather than just one?

A

Multiple functions provide better similarity estimation by sampling more permutations

70
Q

How does the prefix filter guarantee completeness?

A

If records don’t share any prefix tokens they cannot meet the similarity threshold

71
Q

What role does transitivity play in entity clustering?

A

It helps group matching pairs into clusters, assuming that if A=B and B=C then A=C

72
Q

Why might you choose supervised over unsupervised meta-blocking?

A

Supervised methods can learn better weighting schemes if labeled training data is available

73
Q

What makes schema-agnostic methods valuable for big data?

A

They don’t require schema alignment or domain knowledge making them more flexible and scalable

74
Q

How do you handle multi-valued attributes in similarity computation?

A

Use set-based similarity measures like Jaccard similarity rather than exact matching

75
Q

What is the purpose of the triangle inequality in similarity metrics?

A

It ensures consistent distance relationships between multiple items

76
Q

Why are character-based similarities good for catching typos?

A

They can detect small character-level differences that indicate typing errors

77
Q

How do token-based similarities handle word rearrangement?

A

They treat text as bags of words so word order doesn’t affect similarity

78
Q

What role do thresholds play in similarity matching?

A

They determine the minimum similarity required to consider items as matching

79
Q

Why use progressive techniques for entity resolution?

A

They allow useful partial results with limited resources by prioritizing likely matches

80
Q

What makes blocking key selection challenging?

A

Keys must balance discriminative power, error tolerance, and computational efficiency

81
Q

How do you evaluate the quality of blocking results?

A

Measure pair completeness (recall) and pair quality (precision) of the generated blocks

82
Q

What is the relationship between block size and matching probability?

A

Smaller blocks generally have higher matching probability but may miss some matches

83
Q

What factors affect the choice of similarity measure?

A

Data type, expected error patterns, and computational requirements

84
Q

Why might you combine multiple similarity measures?

A

Different measures catch different types of variations and errors

85
Q

What makes some blocking methods more scalable than others?

A

Efficient indexing structures and filtering techniques that reduce necessary comparisons

86
Q

How do you handle evolving data in entity resolution?

A

Use incremental techniques that can process updates without full recomputation

87
Q

What role does data cleaning play in entity resolution?

A

It reduces noise and standardizes formats, improving matching accuracy

88
Q

Why is entity resolution an ongoing challenge?

A

Data volume, variety, and velocity keep increasing while quality requirements remain high

89
Q

What makes schema alignment particularly challenging for big data?

A

Scale, heterogeneity, and the lack of schema information in many data sources

90
Q

How do probabilistic approaches help with uncertainty?

A

They model and propagate uncertainty rather than forcing early decisions

91
Q

What role does human feedback play in entity resolution?

A

It helps validate matches, train models, and resolve difficult cases

92
Q

How do you balance precision and recall in blocking?

A

Adjust blocking key selection and filtering parameters based on application needs

93
Q

What makes some entity resolution tasks harder than others?

A

Factors like data quality, schema heterogeneity, and scale affect difficulty

94
Q

How do you handle missing values in similarity computation?

A

Use similarity measures that can handle missing data or impute values

95
Q

What role does domain knowledge play in entity resolution?

A

It helps in selecting features, blocking keys, and similarity measures

96
Q

What makes incremental entity resolution challenging?

A

New data may affect existing clusters requiring efficient updates

97
Q

How do you handle temporal aspects in entity resolution?

A

Consider time-stamped values and evolution of entities over time

98
Q

What role does data profiling play in entity resolution?

A

It helps understand data characteristics to choose appropriate methods

99
Q

How do you handle multi-source entity resolution?

A

Consider source reliability and potential conflicts between sources

100
Q

What makes real-time entity resolution challenging?

A

Need for quick decisions with limited information and resources

101
Q

How do you evaluate entity resolution results?

A

Measure precision recall and clustering quality against ground truth

102
Q

What role does scalability play in choosing ER methods?

A

Methods must handle data volume while maintaining acceptable accuracy

103
Q

How do you handle privacy in entity resolution?

A

Use privacy-preserving techniques while maintaining matching ability

104
Q

What makes some entity pairs harder to resolve than others?

A

Factors like data quality, conflicting information, and ambiguity

105
Q

How do you handle hierarchical relationships in ER?

A

Consider entity relationships and dependencies during matching

106
Q

What role does data standardization play in ER?

A

It reduces superficial differences, improving matching accuracy

107
Q

How do you handle multi-lingual entity resolution?

A

Use language-independent features or cross-lingual matching techniques

108
Q

What makes schema evolution challenging for ER?

A

Changes in data structure require updating matching rules and models

109
Q

How do you maintain entity resolution results over time?

A

Track changes and updates while maintaining cluster consistency

110
Q

What role does data governance play in entity resolution?

A

It ensures consistent policies for matching and merging entities

111
Q

How do you handle entity resolution in distributed systems?

A

Use distributed algorithms and maintain consistency across nodes

112
Q

What makes reference data important for entity resolution?

A

It provides authoritative information for matching and validation

113
Q

How do you handle streaming entity resolution?

A

Process updates incrementally with limited historical information

114
Q

What role does metadata play in entity resolution?

A

It provides context and constraints for matching decisions

115
Q

How do you handle uncertainty in entity resolution?

A

Model and propagate uncertainty through the resolution process

116
Q

What makes entity resolution important for data quality?

A

It identifies and resolves duplicate and conflicting entity information

117
Q

How do you handle scale in entity resolution?

A

Use efficient indexing, blocking, and filtering techniques

118
Q

What role does automation play in entity resolution?

A

It reduces manual effort while maintaining acceptable accuracy

119
Q

How do you handle complex matching rules in ER?

A

Break down into simpler components and combine results

120
Q

What makes online entity resolution different from batch?

A

Need for immediate decisions with partial information

121
Q

How do you handle data quality issues in ER?

A

Use robust matching methods and clean data when possible

122
Q

What role does monitoring play in entity resolution?

A

Track quality and performance to maintain effectiveness

123
Q

How do you handle updates to resolved entities?

A

Efficiently propagate changes while maintaining consistency

124
Q

What makes some data sources better for ER than others?

A

Factors like completeness, accuracy, and structure

125
Q

How do you handle schema mapping in entity resolution?

A

Align schemas while handling uncertainty and variations

126
Q

What role does testing play in entity resolution?

A

Validate matching rules and measure effectiveness

127
Q

How do you handle large-scale entity resolution?

A

Use distributed processing and efficient algorithms

128
Q

What makes incremental updates challenging for ER?

A

Need to maintain consistency while processing changes

129
Q

How do you handle entity resolution across domains?

A

Consider domain-specific features and matching rules

130
Q

What role does documentation play in entity resolution?

A

Track decisions, rules, and processes for maintenance

131
Q

How do you handle entity resolution in real time?

A

Use efficient methods that can make quick decisions

132
Q

What makes data preparation important for ER?

A

Good preparation improves matching accuracy

133
Q

How do you handle entity resolution failure cases?

A

Analyze failures to improve matching rules

134
Q

What role does optimization play in entity resolution?

A

Improve efficiency while maintaining accuracy

135
Q

What is JOSIE and what problem does it solve?

A

JOSIE (JOining Search using Intersection Estimation) is an algorithm that finds the k sets in a data lake with the largest intersections with a query set, where sets represent columns in tables

136
Q

What basic data structures does JOSIE use to handle large sets?

A
1) Inverted index (containing posting lists) 2) Dictionary (storing tokens, frequencies, and pointers to posting lists)

137
Q

What information does each posting list entry contain in JOSIE?

A

A tuple containing: (SetID, Position, SetSize)
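A toy version of such an index, assuming a lexicographic global token order for illustration (JOSIE's actual order differs; all names here are my own):

```python
from collections import defaultdict

def build_index(sets):
    # sets: {set_id: set of tokens}. Each posting entry records
    # (SetID, Position, SetSize), with positions under the global order.
    postings = defaultdict(list)
    for sid, tokens in sets.items():
        ordered = sorted(tokens)  # assumed global order: lexicographic
        for pos, tok in enumerate(ordered):
            postings[tok].append((sid, pos, len(ordered)))
    return postings
```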

138
Q

What is the main limitation of the MergeList algorithm in JOSIE?

A

Its read time is linear in the number of matched tokens, making it inefficient for sets with thousands or millions of tokens

139
Q

How does Prefix Filter optimize JOSIE’s performance?

A

It reduces the number of posting lists that need to be read by using the k-th candidate's intersection size as a threshold, so only a prefix of |Q| − |Q ∩ Xk| + 1 lists is read

140
Q

What two requirements must be met to use Position filter in JOSIE?

A
1) A global ordering of all tokens (e.g., lexicographic, length) 2) Posting lists that contain token positions and set sizes

141
Q

What are the two benefits of using Position filter in JOSIE?

A
1) Prunes candidates whose intersection upper bounds are below the threshold before reading them 2) Reduces the time to read individual candidates by reading only from the first matching position

142
Q

How does the Position filter calculate the upper bound of intersection size?

A

Using the bound |Q ∩ X| ≤ |Q ∩ X|ub = 1 + min(|Q| − iQ,0, |X| − jX,0), where iQ,0 and jX,0 are the positions of the first matching token in Q and X

143
Q

What is the key difference between ProbeSet and MergeList in JOSIE?

A

ProbeSet probes candidates as it encounters them and can stop early, while MergeList reads all posting lists completely

144
Q

When does JOSIE stop reading new posting lists?

A

When the number of lists read equals |Q| − |Q ∩ Xk| + 1, where Xk is the k-th candidate

145
Q

What is MATE in contrast to JOSIE?

A

A system for detecting the top-k n-ary joinable tables in large table corpora, using the XASH hash function and super keys

146
Q

What makes XASH unique compared to other hash functions?

A

It encodes distinctive properties: less frequent characters, their positions, and value length, rather than relying on uniform distribution