Lecture 4 – Data quality Flashcards

1
Q

What are the 4 Vs of big data?

A

Volume
Velocity
Variety
Veracity

2
Q

4Vs: What does Velocity refer to?

A

The speed at which data is generated
- sensors generate data every second
- interactions on a website create data every second
high velocity –> analysis of streaming data

3
Q

4Vs: What does Volume refer to?

A

Scale of data

Datasets in terabytes or petabytes –> too big to be processed by a single computer –> new data storage and processing technology

4
Q

4Vs: What does Veracity refer to?

A

Uncertainty of data
Quality of the data (high veracity = valuable to analyze, contributes in a meaningful way; low veracity = not valuable, inaccurate, does not contribute)

5
Q

4Vs: What does Variety refer to?

A

Different forms of data
(sources, formats, structured/unstructured)

6
Q

Which of the 4Vs are related to each growth law?

A

Moore’s: Velocity & Variety
Koomey’s: Variety
Bell’s: Variety & Veracity
Zimmerman’s: all of them

7
Q

Elements of the NIST Framework

A

Data sources
Data volume
Data velocity
Data variety
Data veracity
Software
Analytics
Processing
Capabilities
Security/Privacy
Lifecycle
Other

8
Q

What is data wrangling?

A

Cleaning the data so it can be used

9
Q

Issues related to the 4Vs that data wrangling may need to correct for?

A

Volume: with a lot of data, irregularities creep in

Velocity: data can become out of date quickly

Variety: data can be of different formats and types

Veracity: the accuracy or consistency of data from different sources or sets can vary

10
Q

R: What is the difference between NA and NaN?

A

NA = not available, missing data point
NaN = not a number, undefined or unrepresentable value (e.g. the result of 0/0; dividing a non-zero number by 0 gives Inf or -Inf in R)
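
The distinction can be checked directly in the R console. The following is a minimal sketch in base R; the vector `x` is a made-up example, not data from the lecture.

```r
# NA marks a missing observation; NaN is the result of an undefined computation.
x <- c(1, NA, 0 / 0)

is.na(x)    # FALSE TRUE TRUE   (is.na() is TRUE for both NA and NaN)
is.nan(x)   # FALSE FALSE TRUE  (is.nan() is TRUE only for NaN)

# Dividing a non-zero number by zero gives Inf, not NaN.
is.infinite(1 / 0)   # TRUE

# na.rm = TRUE drops both NA and NaN before computing.
mean(x, na.rm = TRUE)   # 1
```

Because `is.na()` also flags NaN, use `is.nan()` when the two cases need to be told apart.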

11
Q

What are possible strategies to deal with missing data?

A

- Omit them from the data (drop the whole column, or drop the observations)
- Give them a value (impute)
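
Both strategies can be sketched in a few lines of base R. The data frame `df` and its column names are hypothetical, and mean imputation is just one possible choice for the second strategy.

```r
# Hypothetical toy data: id is complete, age and income have missing values.
df <- data.frame(
  id     = 1:5,
  age    = c(23, NA, 31, 45, NA),
  income = c(2100, 3400, NA, 5200, 2800)
)

# Strategy 1a: drop observations (rows) that contain any missing value.
rows_dropped <- na.omit(df)

# Strategy 1b: drop whole columns that contain any missing value.
cols_dropped <- df[, colSums(is.na(df)) == 0, drop = FALSE]

# Strategy 2: impute, here with the mean of the observed ages.
imputed <- df
imputed$age[is.na(imputed$age)] <- mean(df$age, na.rm = TRUE)
```

In this toy example, dropping rows keeps only 2 of the 5 observations and dropping columns keeps only `id`, which illustrates how much information omission can cost.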
12
Q

What is the purpose of the shadow matrix in R?

A

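To see how missing values relate to other variables in the table. The shadow matrix has the same shape as the data and marks each value as missing or not missing.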
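
The card does not say which function or package the lecture uses to build the shadow matrix, so the following is only a base R sketch of the idea, using `is.na()` on a hypothetical data frame `df`.

```r
# Hypothetical toy data: age has missing values, income is complete.
df <- data.frame(
  age    = c(23, NA, 31, 45, NA, 52),
  income = c(2100, 3400, 2900, 5200, 2800, 6100)
)

# Shadow matrix: same shape as the data, TRUE where a value is missing.
shadow <- is.na(df)

# Number of missing values per variable.
missing_per_variable <- colSums(shadow)

# Relate missingness in one variable to another variable:
# mean income where age is observed ("FALSE") vs. where age is missing ("TRUE").
income_by_age_missing <- tapply(df$income, shadow[, "age"], mean)
```

A clear difference between the two group means would suggest that the missingness in `age` is related to `income` rather than occurring at random.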

13
Q

Different methods for imputation?

A

Simple parametric:
use mean/median

Simple non-parametric:
find the k nearest neighbors and average these

Multiple imputation:
draw simulated values for the missing entries from a statistical distribution, several times over
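
The three methods can be compared side by side in base R. This is a rough sketch under stated assumptions: the data frame `df`, the helper `knn_impute()`, and the choice of a normal distribution for the simulation are all illustrative, not taken from the lecture. The multiple-imputation step is heavily simplified; it only shows the idea of simulating several completed data sets and pooling the results.

```r
set.seed(1)  # reproducible draws for the multiple-imputation step

# Hypothetical toy data: y has one missing value, x is fully observed.
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(10, 12, NA, 15, 18, 40)
)
miss <- is.na(df$y)

# Simple parametric: replace with the mean or median of the observed values.
y_mean   <- ifelse(miss, mean(df$y, na.rm = TRUE), df$y)
y_median <- ifelse(miss, median(df$y, na.rm = TRUE), df$y)

# Simple non-parametric: average y over the k nearest neighbours in x.
# knn_impute() is a hypothetical helper written for this example.
knn_impute <- function(x, y, k = 2) {
  out <- y
  donors <- which(!is.na(y))
  for (i in which(is.na(y))) {
    nearest <- donors[order(abs(x[donors] - x[i]))[seq_len(k)]]
    out[i] <- mean(y[nearest])
  }
  out
}
y_knn <- knn_impute(df$x, df$y, k = 2)

# Multiple imputation (simplified): simulate the missing value m times from a
# normal distribution fitted to the observed values, run the analysis on each
# completed data set, then pool the m results.
m <- 5
obs <- df$y[!miss]
estimates <- replicate(m, {
  y_sim <- df$y
  y_sim[miss] <- rnorm(sum(miss), mean = mean(obs), sd = sd(obs))
  mean(y_sim)  # the analysis of interest, here just the mean of y
})
pooled_mean <- mean(estimates)
```

Here the mean imputes 19, the median 15, and the two nearest neighbours 13.5, which shows how sensitive the simple methods are to the outlier at `y = 40`.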
