Week 4: Integration & Processing Pipeline Flashcards

1
Q

What is a data lake?

A
  • Holds a vast amount of raw data
  • No hierarchy or organisation
  • Holds structured, semi-structured and unstructured data
2
Q

Data swamp

A

Highly disorganised data repository

3
Q

Data lake (RUDEAS)

A
  1. Reason for storing data is undefined
  2. Used by data scientists
  3. Data is left raw until it is needed
  4. Emerging technology
  5. Adapt easily to changes
  6. Schema on read
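
A minimal sketch of what "schema on read" can look like in practice, assuming raw JSON lines as the stored format. The names `raw_lake` and `read_with_schema` are hypothetical and only for illustration: the data is stored untouched, and a schema is applied only when someone reads it.

```python
import json

# Raw records are stored exactly as they arrived; nothing is enforced on write.
raw_lake = [
    '{"user": "ann", "age": "31"}',
    '{"user": "bob", "signup": "2021-04-02"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields and cast them.

    Fields missing from a raw record come back as None.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows

# The reader decides which fields matter and what types they should have.
ages = read_with_schema(raw_lake, {"user": str, "age": int})
```

A different reader could pass a different schema over the same raw data, which is one way to picture why a data lake adapts easily to change.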
4
Q

Data Warehouse (DDRUSS)

A
  1. Data is processed and ready to be queried
  2. Difficult to change the structure
  3. Reason for storing data is pre-defined
  4. Used by business professionals
  5. Strong maturity model
  6. Schema on write
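
A minimal sketch of "schema on write", assuming a fixed two-column table. `WAREHOUSE_SCHEMA`, `warehouse` and `write_row` are hypothetical names for illustration: a row is checked against the pre-defined structure before it is stored, so everything in the store is ready to query.

```python
# The structure is fixed up front; every row must match it to be stored.
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}
warehouse = []

def write_row(row):
    """Validate a row against the schema, then store it."""
    if set(row) != set(WAREHOUSE_SCHEMA):
        raise ValueError("row fields do not match the schema")
    for field, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(row[field], expected_type):
            raise TypeError(f"{field} must be {expected_type.__name__}")
    warehouse.append(row)

write_row({"order_id": 1, "amount": 9.5})  # accepted

try:
    write_row({"order_id": "2", "amount": 3.0})  # order_id has the wrong type
    rejected = False
except TypeError:
    rejected = True
```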
5
Q

3 Techniques of big data integration (DRS)

A
  1. Data Fusion
  2. Record Linkage
  3. Schema Mapping
6
Q

Schema Mapping

A

Create a mediated global schema that is relevant to the business

Identify mappings between the mediated schema and each data source
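
The two steps above can be sketched as follows. This is an illustration under assumed names, not a standard API: the sources `crm` and `shop`, their field names, and the helper `to_mediated` are all hypothetical.

```python
# Step 1: a mediated (global) schema chosen for the business question.
MEDIATED_SCHEMA = ("customer_id", "full_name", "email")

# Step 2: per-source mappings from local field names to the mediated schema.
MAPPINGS = {
    "crm": {"id": "customer_id", "name": "full_name", "mail": "email"},
    "shop": {"cust_no": "customer_id", "customer": "full_name", "email_addr": "email"},
}

def to_mediated(source, record):
    """Translate one source record into the mediated schema.

    Unmapped source fields are dropped; mediated fields with no value are None.
    """
    mapping = MAPPINGS[source]
    row = dict.fromkeys(MEDIATED_SCHEMA)
    for src_field, value in record.items():
        target = mapping.get(src_field)
        if target is not None:
            row[target] = value
    return row

crm_row = to_mediated("crm", {"id": 7, "name": "Ann Lee", "mail": "ann@example.com"})
shop_row = to_mediated("shop", {"cust_no": 7, "customer": "Ann Lee", "loyalty": "gold"})
```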

7
Q

Record Linkage

A

Identify records that refer to the same logical entity across different data sources

8
Q

3 Record Linkage techniques (PCB)

A
  1. Pairwise matching
  2. Clustering
  3. Blocking
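
One way the three techniques can fit together, as a rough sketch. The assumptions are mine: the blocking key is the first letter of the name, pairwise matching uses the string similarity from Python's `difflib`, and clustering is the transitive closure of matches via union-find. Real systems use more careful keys, similarity measures and thresholds.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = {1: "Ann Lee", 2: "Anne Lee", 3: "Bob Stone", 4: "Bob Stone", 5: "Bea Smith"}

def block_key(name):
    # Blocking: only records sharing this cheap key are ever compared.
    return name[0].lower()

def is_match(a, b, threshold=0.85):
    # Pairwise matching: decide whether two records refer to the same entity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def link(records):
    blocks = defaultdict(list)
    for rid, name in records.items():
        blocks[block_key(name)].append(rid)

    # Clustering: merge matched pairs into groups with union-find.
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if is_match(records[a], records[b]):
                parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())

clusters = link(records)
```

Blocking is what keeps this affordable on large data: without it, every record would be compared with every other record.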
9
Q

Data Fusion

A

A combination of techniques that aims to resolve conflicts across a collection of sources to find the true values

10
Q

Data Fusion (techniques) (CSV)

A
  • Copy detection
  • Source quality
  • Voting
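
A toy sketch of how the three techniques can combine when sources disagree about a single value. Everything here is assumed for illustration: the source names, the quality scores, and the `copies_of` map (in practice, detecting copiers is itself a hard problem, and here it is simply taken as given).

```python
from collections import defaultdict

# Four sources make conflicting claims about the same attribute.
claims = {"site_a": "Canberra", "site_b": "Sydney", "site_c": "Sydney", "site_d": "Sydney"}

# Assumed source quality scores (higher means more trusted).
source_quality = {"site_a": 0.9, "site_b": 0.4, "site_c": 0.4, "site_d": 0.3}

# Assumed result of copy detection: site_c just repeats site_b.
copies_of = {"site_c": "site_b"}

def fuse(claims, quality=None, copies_of=None):
    """Pick the value with the highest total vote weight."""
    quality = quality or {}
    copies_of = copies_of or {}
    scores = defaultdict(float)
    for source, value in claims.items():
        if source in copies_of:
            continue  # a copier adds no independent evidence
        scores[value] += quality.get(source, 1.0)  # weight 1.0 = plain voting
    return max(scores, key=scores.get)

plain_vote = fuse(claims)                          # one source, one vote
weighted_vote = fuse(claims, source_quality)       # votes weighted by quality
fused = fuse(claims, source_quality, copies_of)    # copiers discounted as well
```

In this toy data, plain and weighted voting both favour the majority value, and the outcome only changes once the copied claim is discounted.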
11
Q

Big Data Processing Pipeline 3 steps

A
  1. Split
  2. Do something (e.g. machine learning)
  3. Merge
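
The three steps can be sketched on a small scale as below. As an assumption for brevity, the "do something" step is a word count rather than machine learning, and the function names `split`, `do_something` and `merge` are hypothetical. A thread pool stands in for the cluster workers that would process chunks in parallel.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split(items, n_chunks):
    """Step 1: partition the input into roughly equal chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def do_something(chunk):
    """Step 2: process one chunk independently (here: count words)."""
    return Counter(word for line in chunk for word in line.lower().split())

def merge(partials):
    """Step 3: combine the per-chunk results into one answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

lines = ["big data", "data lake", "data warehouse", "big lake"]

with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(do_something, split(lines, 2)))

word_counts = merge(partial_counts)
```

The pattern works because each chunk can be processed without looking at the others, so the middle step can run anywhere and in any order before the merge.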