Big Data Integration Flashcards
What is the main objective of Big Data integration?
To make heterogeneous data from different sources accessible and usable by combining traditional data integration principles with Big Data characteristics (4 Vs)
What are the 4 Vs that characterize Big Data?
Volume, Velocity, Variety, and Value
What is an example of Big Data integration in company mergers?
Customer database integration, dealing with differences in schemas, entity duplication, and data inconsistencies
Why are traditional ETL approaches less effective for Big Data?
1) Schema rigidity limits flexibility 2) Duplicate detection is computationally expensive 3) Choosing a single true representation may not work for all applications
What is the main difference between ETL and ELT?
ELT loads data before transforming it, while ETL transforms data before loading; ELT is better suited to big data because storage is cheap and transformations can be run on demand at analysis time
What is a Data Lake?
A large centralized repository that stores raw data (structured, semi-structured, and unstructured) from various sources in one place
What are the three main challenges in Big Data Integration?
1) Data Discovery 2) Schema Alignment 3) Entity Resolution
What is Schema Alignment?
The process of aligning schemas from different data sources to enable integration
What is Entity Resolution (ER)?
The process of identifying and merging records from different data sources that refer to the same real-world entity
What are probabilistic mediated schemas?
Schemas that use weighted attribute correspondences to model uncertainty about the semantics of attributes in the sources
What are the two main types of similarity used in schema alignment?
1) Metadata similarity (attribute names, documentation, table captions) 2) Instance similarity (attribute cell values, entities, columns)
What are the four main phases of Entity Resolution?
1) Candidate Generation 2) Matching 3) Clustering 4) Merging
What is Candidate Generation in ER?
The phase where pairs of records that might refer to the same entity are identified based on attribute similarity or group membership
What is the Matching phase in ER?
The phase where a matching function evaluates the probability that candidate record pairs refer to the same entity
What is the purpose of the Clustering phase in ER?
To group matching records into clusters where each cluster represents a single real-world entity
What is the Merging phase in ER?
The phase where a single record representing the entity is created from each cluster by choosing a standard format and resolving conflicts
What are the three main challenges in Entity Resolution?
1) Real-world Ambiguity 2) Data Errors 3) Scalability
What is Blocking in ER?
A technique to reduce comparisons by grouping similar records into blocks and only comparing within each block
What are the three main types of text similarity measures?
1) Character-based 2) Token-based 3) Vector-based
What is Hamming distance?
Number of positions in which two strings of equal length differ
What is Levenshtein distance?
Minimum number of character insertions, deletions, and replacements needed to transform one string into another
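Both character-based measures can be sketched in a few lines of Python (the function names and example strings are illustrative, not from the deck):

```python
def hamming_distance(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and replacements to turn s into t."""
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement
        prev = curr
    return prev[-1]


# hamming_distance("karolin", "kathrin") == 3
# levenshtein_distance("kitten", "sitting") == 3
```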
What is Jaccard similarity?
The size of the intersection divided by the size of the union of two sets
What is Jaccard containment?
The size of the intersection divided by the minimum size of the two sets
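A minimal Python sketch of both set measures (illustrative function names):

```python
def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|"""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_containment(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|)"""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# jaccard_similarity({"big", "data"}, {"big", "data", "integration"}) == 2/3
# jaccard_containment({"big", "data"}, {"big", "data", "integration"}) == 1.0
```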
What are the three subtasks of Entity Resolution?
1) Clean-Clean ER (Record Linkage) 2) Dirty-Clean ER 3) Dirty-Dirty ER
What is Clean-Clean ER?
Finding matches between two clean collections (collections free of duplicates)
What is Dirty-Clean ER?
Finding duplicates within a single dirty collection (Deduplication)
What is Dirty-Dirty ER?
Finding matches across more than two collections
What is Schema-agnostic Token Blocking?
A blocking technique that uses every token in every attribute value as a blocking key regardless of the attribute
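A minimal sketch of schema-agnostic token blocking, assuming records are plain dictionaries (the sample records are invented for illustration):

```python
from collections import defaultdict


def token_blocking(records: dict) -> dict:
    """Schema-agnostic token blocking: every token of every attribute value
    becomes a blocking key, regardless of which attribute it comes from."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for value in record.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # Blocks holding a single record yield no comparisons and can be dropped.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


records = {
    1: {"name": "John Smith", "city": "Boston"},
    2: {"full_name": "John Smith Jr.", "location": "Boston MA"},
}
# Shared tokens such as "john", "smith", and "boston" place both records in
# the same blocks even though the two schemas differ.
```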
What is the main advantage of Schema-agnostic Token Blocking?
It is very robust and achieves high recall, since it is unlikely to miss a match, and it requires no schema knowledge
What is Meta-blocking?
A technique that uses block-entity relationships to identify and maintain the most promising comparisons reducing superfluous comparisons
What is ARCS weighting scheme in Meta-blocking?
Aggregate Reciprocal Comparisons Scheme - sums the reciprocal of the number of comparisons from each shared block
What is CBS weighting scheme in Meta-blocking?
Common Blocks Scheme - uses the number of blocks shared between profiles
What is JS weighting scheme in Meta-blocking?
Jaccard Scheme - uses the Jaccard similarity of the blocks shared between profiles
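The three weighting schemes can be sketched as follows, assuming each profile is described by the set of block keys it appears in and that block_comparisons gives the number of comparisons per block (names are illustrative):

```python
def cbs(blocks_i: set, blocks_j: set) -> float:
    """Common Blocks Scheme: number of blocks the two profiles share."""
    return len(blocks_i & blocks_j)


def js(blocks_i: set, blocks_j: set) -> float:
    """Jaccard Scheme: Jaccard similarity of the two profiles' block sets."""
    union = blocks_i | blocks_j
    return len(blocks_i & blocks_j) / len(union) if union else 0.0


def arcs(blocks_i: set, blocks_j: set, block_comparisons: dict) -> float:
    """Aggregate Reciprocal Comparisons Scheme: sum of the reciprocals of the
    number of comparisons in each shared block, so small blocks weigh more."""
    return sum(1.0 / block_comparisons[b] for b in blocks_i & blocks_j)


# block_comparisons maps a block key to |b| * (|b| - 1) / 2 comparisons
# (for a single dirty collection).
```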
What are the four pruning methods in Meta-blocking?
1) Weighted Edge Pruning 2) Cardinality Edge Pruning 3) Weighted Node Pruning 4) Cardinality Node Pruning
What is Block Filtering?
A technique that retains each entity in a percentage of its smallest blocks since larger blocks are less likely to contain unique duplicates
What is Block Purging?
A technique that removes oversized blocks by setting an upper limit on block cardinality
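A rough sketch of both techniques over a blocks dictionary mapping a blocking key to the set of record ids it contains (the 0.8 ratio and the function names are assumptions for illustration):

```python
from collections import defaultdict


def block_purging(blocks: dict, max_comparisons: int) -> dict:
    """Block Purging: drop blocks whose number of comparisons exceeds a limit."""
    def comparisons(ids):
        return len(ids) * (len(ids) - 1) // 2
    return {k: ids for k, ids in blocks.items() if comparisons(ids) <= max_comparisons}


def block_filtering(blocks: dict, ratio: float = 0.8) -> dict:
    """Block Filtering: keep each entity only in a fraction of its smallest
    blocks, since large blocks rarely contribute unique duplicates."""
    entity_blocks = defaultdict(list)
    for key, ids in blocks.items():
        for rid in ids:
            entity_blocks[rid].append(key)
    kept = defaultdict(set)
    for rid, keys in entity_blocks.items():
        keys.sort(key=lambda k: len(blocks[k]))            # smallest blocks first
        for key in keys[: max(1, round(ratio * len(keys)))]:
            kept[key].add(rid)
    return {key: ids for key, ids in kept.items() if len(ids) > 1}
```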
What is Pay-as-you-go ER?
A strategy that focuses on generating candidate pairs in a specific order to maximize progressive recall with limited resources
What are the three levels of ordering in Pay-as-you-go ER?
1) Comparisons level 2) Block level 3) Entity level
What is Progressive Sorted Neighborhood (PSN)?
A method that sorts records by a blocking key and uses a sliding window to compare nearby records, assuming that records close to each other in the sorted order are more likely to match
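A compact sketch of the sliding-window idea, assuming a user-supplied blocking-key function (illustrative, not the exact PSN variants below):

```python
def progressive_sorted_neighborhood(records: dict, key_func, max_window: int):
    """Sort records by a blocking key, then emit pairs with a growing window
    so that the closest (most promising) pairs are produced first."""
    order = sorted(records, key=lambda rid: key_func(records[rid]))
    for w in range(1, max_window + 1):          # window 1 compares adjacent records
        for i in range(len(order) - w):
            yield order[i], order[i + w]


# Example with an assumed key of surname + zip code:
# pairs = progressive_sorted_neighborhood(records, lambda r: r["surname"] + r["zip"], 3)
```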
What is LS-PSN?
Local Schema-agnostic PSN - uses similarity principle and token sorting without schema knowledge
What is GS-PSN?
Global Schema-agnostic PSN - defines global execution order for all pairs within a predefined range of window sizes
What is PBS in Progressive ER?
Progressive Block Scheduling - orders blocks by weight assuming smaller blocks are more informative
What are the two main types of related tables?
1) Unionable tables 2) Joinable tables
What are Unionable tables?
Tables that are entity complements sharing the same schema but different records
What are Joinable tables?
Tables that are schema complements sharing some key attributes but having different additional attributes
What is Similarity Join?
An operation that retrieves all pairs of records whose similarity exceeds a threshold
What are the three main filters used in Similarity Joins?
1) Prefix filter 2) Length filter 3) Positional filter
What is the Prefix filter?
Filter that requires matching records to share at least one token in their prefixes
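A sketch of the prefix filter for a Jaccard threshold t, using the standard prefix length |x| − ⌈t·|x|⌉ + 1 under a lexicographic global token order (function names are illustrative):

```python
import math


def jaccard_prefix(tokens: list, t: float) -> list:
    """Prefix under a fixed global token order for Jaccard threshold t:
    the first |x| - ceil(t * |x|) + 1 tokens."""
    tokens = sorted(tokens)                      # any shared global order works
    keep = len(tokens) - math.ceil(t * len(tokens)) + 1
    return tokens[:keep]


def may_match(x: list, y: list, t: float) -> bool:
    """Keep a pair as a candidate only if the prefixes overlap; pairs failing
    this test cannot reach Jaccard similarity t and are safely discarded."""
    return bool(set(jaccard_prefix(x, t)) & set(jaccard_prefix(y, t)))
```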
What is shingling in document similarity?
Representing documents as sets of k-length substrings found within them
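For example, in Python (k and the sample text are arbitrary):

```python
def shingles(text: str, k: int = 5) -> set:
    """Represent a document as the set of all k-character substrings it contains."""
    text = " ".join(text.split())                # normalize whitespace
    return {text[i:i + k] for i in range(len(text) - k + 1)}


# shingles("big data", 4) == {"big ", "ig d", "g da", " dat", "data"}
```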
What is minhashing?
A technique that compresses sets into small signatures while preserving their Jaccard similarity
What is the connection between minhashing and Jaccard similarity?
The probability that the minhash function produces the same value for two sets equals their Jaccard similarity
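A small sketch of minhashing, where salted built-in hash values stand in for random permutations (an approximation chosen for brevity):

```python
import random


def minhash_signature(shingle_set: set, num_hashes: int = 100, seed: int = 42) -> list:
    """Compress a set into num_hashes minimum hash values. For each hash
    function, P[minhash(A) == minhash(B)] equals the Jaccard similarity of
    A and B, so agreeing signature positions estimate that similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which the two sets agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```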
What is Locality-Sensitive Hashing (LSH)?
A technique to reduce comparisons by hashing similar items to the same buckets
What is the banding technique in LSH?
Dividing the signature matrix into bands and hashing each band to find candidate pairs that match in at least one band
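A sketch of the banding step over precomputed signatures (parameter names are illustrative):

```python
from collections import defaultdict


def lsh_candidates(signatures: dict, bands: int, rows: int) -> set:
    """Split each signature into `bands` bands of `rows` values, hash each
    band, and report as candidates all pairs colliding in at least one band.
    Items with signature similarity s collide in some band with probability
    1 - (1 - s**rows)**bands."""
    buckets = defaultdict(list)
    for item_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(item_id)
    candidates = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates
```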
What is the basic LSH method for cosine similarity?
Using random hyperplane projections, where two vectors receive the same hash bit with probability 1 − θ/π, θ being the angle between them
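A sketch of the random-hyperplane (SimHash-style) construction, with invented dimensions:

```python
import random


def random_hyperplane_sketch(vector: list, planes: list) -> tuple:
    """SimHash-style sketch: one bit per random hyperplane, 1 if the vector
    falls on its positive side. For a single plane, two vectors at angle θ
    receive the same bit with probability 1 - θ/π."""
    return tuple(1 if sum(v * p for v, p in zip(vector, plane)) >= 0 else 0
                 for plane in planes)


random.seed(0)
dim, num_planes = 4, 16                     # invented sizes for illustration
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
# Vectors with nearly identical sketches are likely to have a small angle,
# hence high cosine similarity.
```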
What are examples of clean-clean entity resolution?
Finding matches between two databases without duplicates, such as merging customer records from two different companies
What is a real-world example of dirty-dirty entity resolution?
Finding matches across multiple product catalogs that may each contain duplicates
What is a use case for schema complement tables?
Finding tables that contain additional attributes for existing entities, such as adding demographic data to customer records
What are common real-world applications of similarity joins?
Record linkage, data cleaning, and deduplication tasks
How does shingling help with document similarity?
It creates a set-based representation that captures local text structure and enables set similarity comparisons
Why is minhashing more efficient than direct set comparison?
It creates small fixed-size signatures that preserve similarity while reducing storage and comparison costs
What problem does LSH solve in similarity search?
It reduces the number of necessary comparisons by only comparing items likely to be similar based on their hash values
What are key considerations in choosing shingle size?
Balance between discriminative power and computational cost: larger k gives more precision but requires more processing
What makes Meta-blocking more accurate than basic blocking?
It uses block relationships and weights to identify promising comparisons rather than just shared blocking keys
How does Pay-as-you-go ER handle resource constraints?
It prioritizes likely matches to identify as many matches as possible early with limited time/computation
What is the main trade-off in choosing LSH bands and rows?
More bands increase recall but also false positives while more rows per band increase precision but may miss matches
How do you choose blocking keys for traditional blocking?
Select attributes that are unlikely to contain errors and have good discriminative power
Why might you choose GS-PSN over LS-PSN?
GS-PSN avoids repeated pair comparisons and has better matching probability estimation
What advantage does Block Filtering offer over Block Purging?
It retains some utility from larger blocks while still reducing comparisons rather than completely removing them
Why use multiple minhash functions rather than just one?
Multiple functions provide better similarity estimation by sampling more permutations
How does the prefix filter guarantee completeness?
If records do not share any prefix tokens, they cannot meet the similarity threshold
What role does transitivity play in entity clustering?
It helps group related pairs into clusters assuming if A=B and B=C then A=C
Why might you choose supervised over unsupervised meta-blocking?
Supervised methods can learn better weighting schemes if labeled training data is available
What makes schema-agnostic methods valuable for big data?
They don’t require schema alignment or domain knowledge making them more flexible and scalable
How do you handle multi-valued attributes in similarity computation?
Use set-based similarity measures like Jaccard similarity rather than exact matching
What is the purpose of the triangle inequality in similarity metrics?
It guarantees that d(A, C) ≤ d(A, B) + d(B, C), keeping distance relationships between items consistent
Why are character-based similarities good for catching typos?
They can detect small character-level differences that indicate typing errors
How do token-based similarities handle word rearrangement?
They treat text as bags of words so word order doesn’t affect similarity
What role do thresholds play in similarity matching?
They determine the minimum similarity required to consider items as matching
Why use progressive techniques for entity resolution?
They allow useful partial results with limited resources by prioritizing likely matches
What makes blocking key selection challenging?
Keys must balance discriminative power, error tolerance, and computational efficiency
How do you evaluate the quality of blocking results?
Measure pair completeness (recall) and pair quality (precision) of the generated blocks
What is the relationship between block size and matching probability?
Smaller blocks generally have higher matching probability but may miss some matches
What factors affect the choice of similarity measure?
Data type, expected error patterns, and computational requirements
Why might you combine multiple similarity measures?
Different measures catch different types of variations and errors
What makes some blocking methods more scalable than others?
Efficient indexing structures and filtering techniques that reduce necessary comparisons
How do you handle evolving data in entity resolution?
Use incremental techniques that can process updates without full recomputation
What role does data cleaning play in entity resolution?
It reduces noise and standardizes formats improving matching accuracy
Why is entity resolution an ongoing challenge?
Data volumes, variety, and velocity keep increasing while quality requirements remain high
What makes schema alignment particularly challenging for big data?
Scale, heterogeneity, and lack of schema information in many data sources
How do probabilistic approaches help with uncertainty?
They model and propagate uncertainty rather than forcing early decisions
What role does human feedback play in entity resolution?
It helps validate matches, train models, and resolve difficult cases
How do you balance precision and recall in blocking?
Adjust blocking key selection and filtering parameters based on application needs
What makes some entity resolution tasks harder than others?
Factors like data quality, schema heterogeneity, and scale affect difficulty
How do you handle missing values in similarity computation?
Use similarity measures that can handle missing data or impute values
What role does domain knowledge play in entity resolution?
It helps select features, blocking keys, and similarity measures
What makes incremental entity resolution challenging?
New data may affect existing clusters requiring efficient updates
How do you handle temporal aspects in entity resolution?
Consider time-stamped values and evolution of entities over time
What role does data profiling play in entity resolution?
It helps understand data characteristics to choose appropriate methods
How do you handle multi-source entity resolution?
Consider source reliability and potential conflicts between sources
What makes real-time entity resolution challenging?
Need for quick decisions with limited information and resources
How do you evaluate entity resolution results?
Measure precision, recall, and clustering quality against ground truth
What role does scalability play in choosing ER methods?
Methods must handle data volume while maintaining acceptable accuracy
How do you handle privacy in entity resolution?
Use privacy-preserving techniques while maintaining matching ability
What makes some entity pairs harder to resolve than others?
Factors like data quality, conflicting information, and ambiguity
How do you handle hierarchical relationships in ER?
Consider entity relationships and dependencies during matching
What role does data standardization play in ER?
It reduces superficial differences improving matching accuracy
How do you handle multi-lingual entity resolution?
Use language-independent features or cross-lingual matching techniques
What makes schema evolution challenging for ER?
Changes in data structure require updating matching rules and models
How do you maintain entity resolution results over time?
Track changes and updates, and maintain cluster consistency
What role does data governance play in entity resolution?
It ensures consistent policies for matching and merging entities
How do you handle entity resolution in distributed systems?
Use distributed algorithms and maintain consistency across nodes
What makes reference data important for entity resolution?
It provides authoritative information for matching and validation
How do you handle streaming entity resolution?
Process updates incrementally with limited historical information
What role does metadata play in entity resolution?
It provides context and constraints for matching decisions
How do you handle uncertainty in entity resolution?
Model and propagate uncertainty through the resolution process
What makes entity resolution important for data quality?
It identifies and resolves duplicate and conflicting entity information
How do you handle scale in entity resolution?
Use efficient indexing, blocking, and filtering techniques
What role does automation play in entity resolution?
It reduces manual effort while maintaining acceptable accuracy
How do you handle complex matching rules in ER?
Break down into simpler components and combine results
What makes online entity resolution different from batch?
Need for immediate decisions with partial information
How do you handle data quality issues in ER?
Use robust matching methods and clean data when possible
What role does monitoring play in entity resolution?
Track quality and performance to maintain effectiveness
How do you handle updates to resolved entities?
Efficiently propagate changes while maintaining consistency
What makes some data sources better for ER than others?
Factors like completeness, accuracy, and structure
How do you handle schema mapping in entity resolution?
Align schemas while handling uncertainty and variations
What role does testing play in entity resolution?
Validate matching rules and measure effectiveness
How do you handle large-scale entity resolution?
Use distributed processing and efficient algorithms
What makes incremental updates challenging for ER?
Need to maintain consistency while processing changes
How do you handle entity resolution across domains?
Consider domain-specific features and matching rules
What role does documentation play in entity resolution?
Track decisions, rules, and processes for maintenance
How do you handle entity resolution in real time?
Use efficient methods that can make quick decisions
What makes data preparation important for ER?
Good preparation improves matching accuracy
How do you handle entity resolution failure cases?
Analyze failures to improve matching rules
What role does optimization play in entity resolution?
Improve efficiency while maintaining accuracy
What is JOSIE and what problem does it solve?
JOSIE (JOining Search using Intersection Estimation) is an algorithm that finds the k sets in a data lake with the largest intersections with a query set, where sets represent columns in tables
What basic data structures does JOSIE use to handle large sets?
1) Inverted index (containing posting lists) 2) Dictionary (storing tokens, frequencies, and pointers to posting lists)
What information does each posting list entry contain in JOSIE?
A tuple containing: (SetID, Position, SetSize)
What is the main limitation of the MergeList algorithm in JOSIE?
Its read time is linear to the number of matched tokens, making it inefficient for sets with thousands or millions of tokens
How does Prefix Filter optimize JOSIE’s performance?
It reduces the number of posting lists that need to be read by using the intersection size of the current k-th candidate as a threshold, reading only a prefix of |Q| − |Q ∩ Xk| + 1 lists
What two requirements must be met to use Position filter in JOSIE?
1) A global ordering of all tokens (e.g., lexicographic, by length) 2) Posting lists that contain token positions and set sizes
What are the two benefits of using Position filter in JOSIE?
1) Prunes candidates whose intersection sizes cannot reach the threshold before reading them 2) Reduces the time spent reading individual candidates by reading only from the first matching position onward
How does the Position filter calculate the upper bound of intersection size?
Using the bound |Q ∩ X| ≤ |Q ∩ X|ub = 1 + min(|Q| − iQ,0, |X| − jX,0), where iQ,0 and jX,0 are the positions of the first matching token in Q and X under the global token order
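A tiny helper illustrating the bound, assuming 1-based positions under the global token order (names are illustrative):

```python
def intersection_upper_bound(q_size: int, x_size: int, i_q0: int, j_x0: int) -> int:
    """Position-filter bound: if the first matching token sits at (1-based)
    position i_q0 in the query Q and j_x0 in the candidate X, at most
    1 + min(|Q| - i_q0, |X| - j_x0) tokens can still match."""
    return 1 + min(q_size - i_q0, x_size - j_x0)


# A candidate can be pruned before it is read whenever this bound is already
# smaller than the intersection size of the current k-th best candidate.
```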
What is the key difference between ProbeSet and MergeList in JOSIE?
ProbeSet probes candidates as it encounters them and can stop early, while MergeList reads all posting lists completely
When does JOSIE stop reading new posting lists?
When the number of lists read equals |Q| − |Q ∩ Xk| + 1, where Xk is the k-th candidate
What is MATE in contrast to JOSIE?
A system for detecting the top-k n-ary joinable tables in large table corpora, using the XASH hash function and super keys
What makes XASH unique compared to other hash functions?
It encodes distinctive properties: less frequent characters, their positions, and value length, rather than relying on uniform distribution