Lecture 4 – Data quality Flashcards

Question 1

Q

What are the 4 Vs of big data?

Answer

A

Volume
Velocity
Variety
Veracity

Question 2

Q

4Vs: What does Velocity refer to?

Answer

A

The speed at which data is generated
- sensors generate data every second
- interactions on a website create data every second
High velocity -> requires analysis of streaming data

Question 3

Q

4Vs: What does Volume refer to?

Answer

A

Scale of data

Datasets in terabytes or petabytes -> too big for a single computer to process -> requires new data storage and processing technology

Question 4

Q

4Vs: What does Veracity refer to?

Answer

A

Uncertainty of data
Quality of the data (high veracity = valuable to analyze and contributes in a meaningful way, low veracity = not valuable, inaccurate, not contributing)

Question 5

Q

4Vs: What does Variety refer to?

Answer

A

Different forms of data
(sources, formats, structured/unstructured)

Question 6

Q

Which of the 4Vs are related to each growth law?

Answer

A

Moore’s: Velocity & Variety
Koomey’s: Variety
Bell’s: Variety & Veracity
Zimmermann’s: all of them

Question 7

Q

Elements of the NIST Framework

Answer

A

Data sources
Data volume
Data velocity
Data variety
Data veracity
Software
Analytics
Processing
Capabilities
Security/Privacy
Lifecycle
Other

Question 8

Q

What is data wrangling?

Answer

A

Cleaning and restructuring raw data so it can be used for analysis

Question 9

Q

Issues related to the 4Vs that data wrangling may need to correct for?

Answer

A

Volume: with a lot of data, irregularities creep in

Velocity: data can become out of date quickly

Variety: data can be of different formats and types

Veracity: the accuracy or consistency of data can differ between sources or sets

Question 10

Q

R: What is the difference between NA and NaN?

Answer

A

NA = not available, missing data point
NaN = not a number, undefined or unrepresentable numeric result (e.g. 0/0; a nonzero number divided by 0 gives Inf or -Inf in R, not NaN)
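
The same distinction can be illustrated outside R. The sketch below is a rough Python analogy only, not R behavior: None stands in for NA, math-style NaN stands in for R's NaN, and the helper names is_missing and is_nan are assumptions chosen to loosely mirror is.na() and is.nan().

```python
import math

# Stand-in for R's NA: the value was never recorded.
missing = None

# inf - inf has no defined numeric result, so Python returns NaN.
undefined = float("inf") - float("inf")


def is_missing(x):
    """True for both None and NaN, loosely like is.na() in R."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def is_nan(x):
    """True only for NaN, loosely like is.nan() in R."""
    return isinstance(x, float) and math.isnan(x)
```

As in R, the NaN value counts as missing, but the missing value does not count as NaN.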

Question 11

Q

What are possible strategies to deal with missing data?

Answer

A

omit them from the data (drop whole column, or drop observations)
give them a value (impute)
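
A minimal sketch of these strategies in plain Python, assuming missing values are stored as None. The toy records and the helper names drop_rows, drop_columns, and impute_mean are made up for illustration and do not come from the lecture.

```python
from statistics import mean

# Toy records; None marks a missing value (illustrative data only).
rows = [
    {"id": 1, "age": 30, "income": 52000},
    {"id": 2, "age": None, "income": 48000},
    {"id": 3, "age": 45, "income": None},
    {"id": 4, "age": 27, "income": 39000},
]


def drop_rows(rows):
    """Omit every observation that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_columns(rows):
    """Omit every column that has at least one missing value."""
    complete = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in complete} for r in rows]


def impute_mean(rows, column):
    """Replace missing values in one column with the mean of the observed ones."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]
```

Dropping rows keeps only records 1 and 4, dropping columns keeps only id, and imputing keeps all four records but fills the gap with an estimate.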

Question 12

Q

What is the purpose of the shadow matrix in R?

Answer

A

It marks each cell as missing or not missing, so you can see how missing values relate to other variables in the table
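
The idea can be sketched in plain Python rather than R. This is an illustration under assumptions: the toy records are invented, None marks a missing value, and shadow_matrix and mean_by_missingness are hypothetical helper names, not functions from any R package.

```python
from statistics import mean

# Toy records; None marks a missing value (illustrative data only).
rows = [
    {"age": 30, "income": 52000},
    {"age": None, "income": 20000},
    {"age": 45, "income": 61000},
    {"age": None, "income": 24000},
]


def shadow_matrix(rows):
    """Same shape as the data: True where a value is missing, False otherwise."""
    return [{k: v is None for k, v in r.items()} for r in rows]


def mean_by_missingness(rows, target, other):
    """Mean of `other`, split by whether `target` is missing in that row."""
    groups = {True: [], False: []}
    for r, s in zip(rows, shadow_matrix(rows)):
        if r[other] is not None:
            groups[s[target]].append(r[other])
    return {flag: mean(vals) for flag, vals in groups.items() if vals}
```

Here mean_by_missingness(rows, "age", "income") compares income for rows with and without a recorded age; a large gap between the two groups suggests the values are not missing at random.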

Question 13

Q

Different methods for imputation?

Answer

A

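Simple parametric:
replace missing values with the mean/median of the observed values

Simple non-parametric:
find the k nearest neighbors and average their values

Multiple imputation:
draw several plausible values for each missing entry from a statistical distribution, giving multiple completed datasets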
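
The three methods can be contrasted in a small plain-Python sketch. Everything here is an assumption for illustration: the toy (x, y) data, the function names, the choice of a normal distribution fitted to the observed values, and the fixed seed. Real multiple imputation usually conditions on other variables and pools results across the completed datasets, which this sketch does not do.

```python
import random
from statistics import mean, stdev

# Toy (x, y) pairs; y is missing (None) for the last point.
data = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0), (8.0, 30.0), (2.5, None)]
observed = [(x, y) for x, y in data if y is not None]


def impute_mean(observed):
    """Simple parametric: use the mean of all observed y values."""
    return mean(y for _, y in observed)


def impute_knn(observed, x_new, k=2):
    """Simple non-parametric: average y over the k points closest in x."""
    nearest = sorted(observed, key=lambda p: abs(p[0] - x_new))[:k]
    return mean(y for _, y in nearest)


def impute_multiple(observed, m=5, seed=0):
    """Multiple imputation (simplified): draw m candidate values from a
    normal distribution fitted to the observed y values."""
    rng = random.Random(seed)
    ys = [y for _, y in observed]
    mu, sigma = mean(ys), stdev(ys)
    return [rng.gauss(mu, sigma) for _ in range(m)]
```

For the missing point at x = 2.5, the mean imputation is pulled upward by the outlying value 30.0, while the nearest-neighbor estimate only uses the points at x = 2.0 and x = 3.0.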

(13 cards)