Big Data Integration Flashcards
What is the main objective of Big Data integration?
To make heterogeneous data from different sources accessible and usable by combining traditional data integration principles with Big Data characteristics (4 Vs)
What are the 4 Vs that characterize Big Data?
Volume, Velocity, Variety, and Value
What is an example of Big Data integration in company mergers?
Customer database integration, which must deal with differences in schemas, entity duplication, and data inconsistencies
Why are traditional ETL approaches less effective for Big Data?
1) Schema rigidity limits flexibility; 2) Duplicate detection is computationally expensive; 3) Choosing a single true representation may not work for all applications
What is the main difference between ETL and ELT?
ETL transforms data before loading it, while ELT loads the raw data first and transforms it afterwards. ELT is better suited to Big Data because storage is cheap and transformations can be run on demand at analysis time
What is a Data Lake?
A large centralized repository that stores raw structured, semi-structured, and unstructured data from various sources in one place
What are the three main challenges in Big Data Integration?
1) Data Discovery; 2) Schema Alignment; 3) Entity Resolution
What is Schema Alignment?
The process of aligning schemas from different data sources to enable integration
What is Entity Resolution (ER)?
The process of identifying and merging records from different data sources that refer to the same real-world entity
What are probabilistic mediated schemas?
Schemas that use weighted attribute correspondences to model uncertainty about the semantics of attributes in the sources
What are the two main types of similarity used in schema alignment?
1) Metadata similarity (attribute names, documentation, table captions); 2) Instance similarity (attribute cell values, entities, columns)
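To make the two kinds of similarity concrete, here is a minimal sketch. It assumes attribute names are compared with Python's standard `difflib.SequenceMatcher` and column values with a set overlap; the function names and the sample columns are hypothetical, and real schema matchers typically combine many more signals.

```python
from difflib import SequenceMatcher


def metadata_similarity(name_a: str, name_b: str) -> float:
    """Similarity of two attribute names in [0, 1], ignoring case."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def instance_similarity(values_a, values_b) -> float:
    """Overlap of the distinct cell values of two columns (Jaccard)."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        return 0.0  # assumption: two empty columns give no evidence of a match
    return len(a & b) / len(a | b)


# Hypothetical columns from two customer tables.
print(metadata_similarity("phone", "telephone"))
print(instance_similarity(["Rome", "Milan", "Turin"], ["Milan", "Turin", "Naples"]))
```

Names that differ completely can still match on instances, and vice versa, which is why the two measures are usually used together.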
What are the four main phases of Entity Resolution?
1) Candidate Generation; 2) Matching; 3) Clustering; 4) Merging
What is Candidate Generation in ER?
The phase where pairs of records that might refer to the same entity are identified based on attribute similarity or group membership
What is the Matching phase in ER?
The phase where a matching function evaluates the probability that candidate record pairs refer to the same entity
What is the purpose of the Clustering phase in ER?
To group matching records into clusters where each cluster represents a single real-world entity
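One simple way to cluster is to treat every matched pair as an edge and take connected components (transitive closure). The sketch below does this with a small union-find; it is only one possible clustering strategy, and the function name and record IDs are hypothetical.

```python
def cluster_matches(record_ids, matched_pairs):
    """Group records into clusters via connected components of matched pairs."""
    parent = {r: r for r in record_ids}

    def find(r):
        # Follow parent links to the cluster representative, halving the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # union the two clusters

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())


# r1-r2 and r2-r3 matched, so r1, r2, r3 form one entity; r4 stays alone.
print(cluster_matches(["r1", "r2", "r3", "r4"], [("r1", "r2"), ("r2", "r3")]))
```

Transitive closure can chain weak matches into overly large clusters, so other clustering methods are often preferred when the matcher is noisy.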
What is the Merging phase in ER?
The phase where a single record representing the entity is created from each cluster by choosing a standard format and resolving conflicts
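As an illustration of conflict resolution, the sketch below merges a cluster by majority vote: for each attribute it keeps the most frequent non-empty value. Majority voting is an assumption chosen for simplicity, not the only merging rule, and the function name and sample records are hypothetical.

```python
from collections import Counter


def merge_cluster(records):
    """Build one representative record from a cluster of dict records."""
    merged = {}
    attributes = {key for record in records for key in record}
    for attr in sorted(attributes):
        # Ignore missing and empty values when voting.
        values = [r[attr] for r in records if r.get(attr) not in (None, "")]
        if values:
            merged[attr] = Counter(values).most_common(1)[0][0]
    return merged


cluster = [
    {"name": "Ann Lee", "city": "Rome"},
    {"name": "Ann Lee", "city": ""},
    {"name": "A. Lee", "city": "Rome"},
]
print(merge_cluster(cluster))
```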
What are the three main challenges in Entity Resolution?
1) Real-world Ambiguity; 2) Data Errors; 3) Scalability
What is Blocking in ER?
A technique to reduce comparisons by grouping similar records into blocks and only comparing within each block
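A minimal sketch of blocking, assuming a hypothetical blocking key (the first three letters of the name, lowercased). Records sharing a key land in the same block, and candidate pairs are generated only within a block.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_key):
    """Return the record-ID pairs that share a blocking key."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[blocking_key(record)].append(record_id)

    pairs = set()
    for ids in blocks.values():
        # Compare only records inside the same block.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    1: {"name": "Smith"},
    2: {"name": "smith"},
    3: {"name": "Jones"},
    4: {"name": "Smyth"},
}
pairs = candidate_pairs(records, lambda r: r["name"][:3].lower())
print(pairs)  # one pair instead of the six produced by comparing all records
```

Note the trade-off: "Smyth" falls into a different block from "Smith", so a key that is too strict misses true matches, while a key that is too loose saves few comparisons.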
What are the three main types of text similarity measures?
1) Character-based; 2) Token-based; 3) Vector-based
What is Hamming distance?
Number of positions in which two strings of equal length differ
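A short sketch of the definition; the function name is our own choice.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(hamming_distance("karolin", "kathrin"))  # 3
```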
What is Levenshtein distance?
Minimum number of single-character insertions, deletions, and substitutions needed to transform one string into another
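The distance can be computed with the standard dynamic-programming recurrence. The sketch below keeps only two rows of the table; the function name is our own choice.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```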
What is Jaccard similarity?
The size of the intersection divided by the size of the union of two sets
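A small sketch applied to token sets, assuming whitespace tokenization and the convention that two empty sets are identical (similarity 1.0); both choices are ours.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """|A intersect B| / |A union B| for two sets."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


tokens_a = set("big data integration".split())
tokens_b = set("data integration systems".split())
print(jaccard_similarity(tokens_a, tokens_b))  # 0.5 (2 shared tokens out of 4)
```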
What is Jaccard containment?
The size of the intersection divided by the minimum size of the two sets
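A small sketch of this definition, assuming that an empty set yields 0.0 (our own convention). Containment is useful when one set is much smaller than the other, because a short value fully included in a longer one still scores 1.0.

```python
def jaccard_containment(a: set, b: set) -> float:
    """|A intersect B| / min(|A|, |B|) for two sets."""
    if not a or not b:
        return 0.0  # assumed convention when either set is empty
    return len(a & b) / min(len(a), len(b))


short = {"data", "integration"}
long_ = {"big", "data", "integration", "flashcards"}
print(jaccard_containment(short, long_))  # 1.0, the short set is fully contained
```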
What are the three subtasks of Entity Resolution?
1) Clean-Clean ER (Record Linkage); 2) Dirty-Clean ER; 3) Dirty-Dirty ER