Week 4: Integration & Processing Pipeline Flashcards
1
Q
What is a data lake?
A
- Holds a vast amount of raw data
- No hierarchy or organisation
- Holds structured, semi-structured, and unstructured data
2
Q
Data swamp
A
A data lake that has degraded into a highly disorganised data repository
3
Q
Data lake (RUDEAS)
A
- Reason for storing data is undefined
- Used by data scientists
- Data is left raw until it is needed
- Emerging technology
- Adapts easily to changes
- Schema on read
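Schema on read can be sketched as follows. This is a minimal illustration, not a real data lake API: the raw JSON lines and the field names (user, amount) are hypothetical. The records are stored untouched, and a schema is applied only when someone reads them.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no schema enforced).
raw_lake = [
    '{"user": "ana", "amount": "12.50"}',
    '{"user": "ben", "amount": 7, "note": "extra field"}',
    '{"user": "cy"}',
]

def read_with_schema(lines):
    """Apply a schema at read time: pick fields, cast types, default missing values."""
    for line in lines:
        record = json.loads(line)
        yield {
            "user": str(record.get("user", "")),
            "amount": float(record.get("amount", 0.0)),
        }

# The schema only exists in the reader; a different reader could apply a different one.
rows = list(read_with_schema(raw_lake))
```

The raw lines keep their extra and missing fields, which is why the reason for storing them can stay undefined until read time.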
4
Q
Data Warehouse (DDRUSS)
A
- Data is processed and ready to be queried
- Difficult to change the structure
- Reason for storing data is pre-defined
- Used by business professionals
- Strong maturity model
- Schema on write
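Schema on write can be sketched as a check that runs before anything is stored. This is a minimal illustration under assumptions: SCHEMA, warehouse, and write_row are hypothetical names, and a Python list stands in for a warehouse table.

```python
# Hypothetical warehouse schema: column name -> required Python type.
SCHEMA = {"user": str, "amount": float}

warehouse = []  # stands in for a warehouse table

def write_row(row):
    """Validate a row against the schema before storing it (schema on write)."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"columns {sorted(row)} do not match schema {sorted(SCHEMA)}")
    for column, expected in SCHEMA.items():
        if not isinstance(row[column], expected):
            raise TypeError(f"{column} must be {expected.__name__}")
    warehouse.append(row)

write_row({"user": "ana", "amount": 12.5})  # accepted: matches the schema

try:
    write_row({"user": "ben", "amount": "7"})  # rejected: amount is a string
    rejected = False
except TypeError:
    rejected = True
```

Because every stored row already fits the schema, queries need no cleaning step, but changing the schema means changing the writer and the stored data.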
5
Q
3 Techniques of big data integration (DRS)
A
- Data Fusion
- Record Linkage
- Schema Mapping
6
Q
Schema Mapping
A
- Create a mediated global schema that is relevant to the business
- Identify mappings between the global schema and each data source
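The two steps of schema mapping can be sketched as a global schema plus one field mapping per source. The source names (crm, shop) and all field names here are hypothetical.

```python
# Mediated (global) schema the business queries against; hypothetical.
GLOBAL_SCHEMA = ["customer_name", "email"]

# Mappings from each source's own field names to the global schema; hypothetical.
MAPPINGS = {
    "crm": {"full_name": "customer_name", "mail": "email"},
    "shop": {"name": "customer_name", "email_address": "email"},
}

def to_global(source, record):
    """Translate one source record into the global schema."""
    mapping = MAPPINGS[source]
    # Keep only fields that have a mapping; unmapped source fields are dropped.
    translated = {mapping[k]: v for k, v in record.items() if k in mapping}
    # Fields the source does not provide become None.
    return {field: translated.get(field) for field in GLOBAL_SCHEMA}

a = to_global("crm", {"full_name": "Ana Diaz", "mail": "ana@example.com", "tier": "gold"})
b = to_global("shop", {"name": "Ben Wu"})
```

After translation, records from both sources share the same field names and can be queried together.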
7
Q
Record Linkage
A
Identify records that refer to the same logical entity across different data sources
8
Q
3 Record Linkage techniques (PCB)
A
- Pairwise matching
- Clustering
- Blocking
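The three techniques are often combined: blocking limits which pairs are compared, pairwise matching decides which pairs match, and clustering groups the matches into entities. A minimal sketch, assuming hypothetical records, a postcode as the blocking key, and string similarity from the standard library's difflib with an arbitrary 0.85 threshold:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records from different sources.
records = [
    {"name": "Jon Smith", "zip": "2000"},
    {"name": "John Smith", "zip": "2000"},
    {"name": "Mary Jones", "zip": "2000"},
    {"name": "Jon Smith", "zip": "3000"},
]

def is_match(a, b, threshold=0.85):
    """Pairwise matching: names are similar enough (threshold is an assumption)."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold

# Blocking: only records sharing a postcode are ever compared.
blocks = defaultdict(list)
for idx, rec in enumerate(records):
    blocks[rec["zip"]].append(idx)

# Pairwise matching inside each block.
matches = [
    (i, j)
    for members in blocks.values()
    for i, j in combinations(members, 2)
    if is_match(records[i], records[j])
]

# Clustering: union-find groups matched pairs transitively.
parent = list(range(len(records)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

for i, j in matches:
    parent[find(i)] = find(j)

clusters = defaultdict(list)
for idx in range(len(records)):
    clusters[find(idx)].append(idx)
linked = sorted(clusters.values())
```

Records 0 and 3 have identical names but fall in different blocks, so they are never compared. That is the trade-off of blocking: far fewer comparisons, at the risk of missing matches across blocks.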
9
Q
Data Fusion
A
A combination of techniques that aims to resolve conflicts between values from a collection of sources in order to find the true value
10
Q
Data Fusion techniques (CSV)
A
- Copy detection
- Source Quality
- Voting
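One way the three techniques can work together is a vote weighted by source quality, with detected copiers left out. A minimal sketch: the sources, quality scores, and the copies_from table (standing in for the output of a copy detection step) are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical conflicting claims for one attribute (a phone number).
claims = {
    "site_a": "555-0100",
    "site_b": "555-0199",
    "site_c": "555-0199",
    "site_d": "555-0199",
}

# Hypothetical source quality scores (higher means more trustworthy).
quality = {"site_a": 0.9, "site_b": 0.4, "site_c": 0.4, "site_d": 0.4}

# Assumed result of copy detection: site_c and site_d copy site_b.
copies_from = {"site_c": "site_b", "site_d": "site_b"}

# Plain voting: every source counts once, so copies inflate a value.
plain_winner = Counter(claims.values()).most_common(1)[0][0]

def fuse(claims, quality, copies_from):
    """Quality-weighted vote over independent sources only."""
    scores = defaultdict(float)
    for source, value in claims.items():
        if source in copies_from:
            continue  # a copier adds no independent evidence
        scores[value] += quality[source]
    return max(scores, key=scores.get)

fused_winner = fuse(claims, quality, copies_from)
```

Plain voting picks 555-0199 because three sources repeat it, while the fused result picks 555-0100 because the only independent support for 555-0199 comes from a low-quality source.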
11
Q
Big Data Processing Pipeline 3 steps
A
- Split
- Do something (e.g. machine learning)
- Merge
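The three steps can be sketched with the standard library. In this minimal illustration the "do something" step computes a partial sum and count per chunk, a stand-in for heavier work such as scoring a model; the function names and the choice of four chunks are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_chunks):
    """Split: cut the data into roughly equal chunks."""
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def do_something(chunk):
    """Do something: independent work on one chunk, here a partial (sum, count)."""
    return sum(chunk), len(chunk)

def merge(partials):
    """Merge: combine the partial results into one answer, here the overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

data = list(range(1, 101))
chunks = split(data, 4)

# Each chunk is processed independently, so the work can run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(do_something, chunks))

mean = merge(partials)
```

The pattern only works when the per-chunk results can be combined: sums and counts merge cleanly, whereas per-chunk means would not unless the chunk sizes were also kept.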