Final Flashcards
Data Analysis Cycle
Data capture
Extraction
Data preparation
Data analysis
Communicating insights
Four V’s of big data
Volume = amount of data
Velocity = speed at which data is generated and processed (often in real time)
Variety = different forms of data, including unstructured or unprocessed data
Veracity = quality of data
SMART questions
o Specific
o Measurable
o Achievable
o Relevant
o Timely
Data structure types
o Structured, unstructured, and semi-structured
Structured query language
o Common way to extract data from a relational database (SELECT, FROM, WHERE)
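A minimal sketch of the three keywords in action, using Python's built-in sqlite3 module. The table name, columns, and rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical in-memory table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "East", 120.0), (2, "West", 80.0), (3, "East", 45.5)],
)

# SELECT picks the columns, FROM names the table, WHERE filters the rows.
rows = conn.execute(
    "SELECT id, amount FROM sales WHERE region = ? ORDER BY id", ("East",)
).fetchall()
conn.close()

print(rows)  # [(1, 120.0), (3, 45.5)]
```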
Data Prep Steps
Understand data
Standardize, structure, clean
Validate data quality and verify data meets requirements
Document transformation process
aggregate
presentation of data in summarized form
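One way to picture aggregation is summing detail rows into one total per group. The rows below are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, amount).
sales = [("East", 120.0), ("West", 80.0), ("East", 45.5)]

# Summarize the detail rows into one total per region.
running = defaultdict(float)
for region, amount in sales:
    running[region] += amount
totals = dict(running)

print(totals)  # {'East': 165.5, 'West': 80.0}
```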
data joining
process of combining different data sources
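A small sketch of joining, assuming two hypothetical sources that share a customer id as the key.

```python
# Hypothetical source 1: customer id -> customer name.
customers = {1: "Ada", 2: "Grace"}

# Hypothetical source 2: (order id, customer id, amount).
orders = [(101, 1, 30.0), (102, 2, 12.5), (103, 1, 7.0)]

# Combine the two sources on the shared customer id (an inner join).
joined = [
    (order_id, customers[cust_id], amount)
    for order_id, cust_id, amount in orders
    if cust_id in customers
]

print(joined)  # [(101, 'Ada', 30.0), (102, 'Grace', 12.5), (103, 'Ada', 7.0)]
```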
data pivoting
rotating data from rows to columns
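A sketch of pivoting with hypothetical data: each input row holds one region and quarter pair, and the result has one entry per region with the quarters as columns.

```python
# Hypothetical "long" rows: (region, quarter, amount).
long_rows = [
    ("East", "Q1", 100),
    ("East", "Q2", 120),
    ("West", "Q1", 90),
    ("West", "Q2", 95),
]

# Rotate the quarter values out of the rows and into columns.
pivoted = {}
for region, quarter, amount in long_rows:
    pivoted.setdefault(region, {})[quarter] = amount

print(pivoted)  # {'East': {'Q1': 100, 'Q2': 120}, 'West': {'Q1': 90, 'Q2': 95}}
```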
parsing
separating data from a single field into multiple fields
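For example, a name stored as one field can be parsed into two. The value and the "Last, First" layout are assumptions for illustration.

```python
# Hypothetical single field holding "Last, First".
full_name = "Smith, John"

# Split once on the comma and trim the surrounding spaces.
last_name, first_name = [part.strip() for part in full_name.split(",", 1)]

print(last_name)   # Smith
print(first_name)  # John
```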
concatenation
combining data from multiple fields into a single field
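A short sketch of concatenation, the reverse of parsing. The field values are hypothetical.

```python
# Hypothetical separate fields.
first_name = "John"
last_name = "Smith"

# Combine the two fields into one "Last, First" field.
full_name = f"{last_name}, {first_name}"

print(full_name)  # Smith, John
```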
cryptic data values
data items with no meaning without understanding a coding scheme
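Cryptic values become readable once the coding scheme is known. The codes and meanings below are a hypothetical scheme, not a standard one.

```python
# Hypothetical coding scheme for a marital-status field.
coding_scheme = {"M": "Married", "S": "Single", "D": "Divorced"}

# Raw values as stored; "X" is not in the scheme.
raw_values = ["M", "S", "X"]

# Decode each value, flagging codes the scheme does not cover.
decoded = [coding_scheme.get(code, "UNKNOWN CODE") for code in raw_values]

print(decoded)  # ['Married', 'Single', 'UNKNOWN CODE']
```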
misfielded data values
data values that are correctly formatted but not listed in the correct field
consistency
every value in a field should be stored in the same way
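A sketch of enforcing consistency on a date field, assuming the raw values arrive in three known layouts (the layouts and values are hypothetical). Every value is rewritten to one ISO format.

```python
from datetime import datetime

# Hypothetical raw values: the same date stored three different ways.
raw_dates = ["2023-01-05", "01/05/2023", "2023.01.05"]

# Layouts assumed to appear in the field.
known_formats = ["%Y-%m-%d", "%m/%d/%Y", "%Y.%m.%d"]

def to_iso(value):
    """Return the date as YYYY-MM-DD, trying each known layout in turn."""
    for fmt in known_formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date layout: {value!r}")

consistent = [to_iso(value) for value in raw_dates]

print(consistent)  # ['2023-01-05', '2023-01-05', '2023-01-05']
```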
imputation
process of replacing a null or missing value with substituted data
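Mean imputation is one common choice of substitute. This sketch uses it as an assumption, and other substitutes (median, a default value) work the same way. The data is hypothetical.

```python
from statistics import mean

# Hypothetical field with missing values stored as None.
ages = [34, None, 29, None, 45]

# Use the mean of the observed values as the substitute.
observed = [age for age in ages if age is not None]
fill_value = mean(observed)

imputed = [fill_value if age is None else age for age in ages]

print(imputed)  # [34, 36, 29, 36, 45]
```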
contradiction errors
errors that exist when the same entity is described in two conflicting ways
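A sketch of detecting contradictions, assuming hypothetical records in which a customer id should map to exactly one birth date.

```python
# Hypothetical records: (customer id, birth date).
records = [
    ("C1", "1990-04-02"),
    ("C2", "1985-11-30"),
    ("C1", "1991-04-02"),  # conflicts with the first C1 row
]

# Collect every distinct birth date seen for each customer.
seen = {}
for cust_id, birth_date in records:
    seen.setdefault(cust_id, set()).add(birth_date)

# An entity with more than one distinct value is described in conflicting ways.
contradictions = sorted(cust_id for cust_id, dates in seen.items() if len(dates) > 1)

print(contradictions)  # ['C1']
```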
threshold violation
errors that occur when a data value falls outside an allowable level
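A sketch of a threshold check, assuming weekly hours must fall between 0 and 168. The limits and data are illustrative.

```python
# Hypothetical weekly hours by employee id.
hours = {"E1": 40, "E2": 172, "E3": -5}

# Allowable range (assumed): a week has 168 hours and hours cannot be negative.
LOW, HIGH = 0, 168

# Keep only the values that fall outside the allowable range.
violations = {emp: value for emp, value in hours.items() if not LOW <= value <= HIGH}

print(violations)  # {'E2': 172, 'E3': -5}
```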
violated attribute dependencies
errors that occur when a secondary attribute in a row of data does not match the primary attribute
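A sketch of checking an attribute dependency, assuming ZIP code is the primary attribute and city is the secondary attribute it determines. The small reference table is hypothetical.

```python
# Hypothetical reference table: ZIP code (primary) -> city (secondary).
zip_to_city = {"10001": "New York", "60601": "Chicago"}

# Hypothetical data rows: (ZIP code, city).
rows = [("10001", "New York"), ("60601", "Boston")]

# Flag rows whose city does not match the city implied by the ZIP code.
mismatches = [(zip_code, city) for zip_code, city in rows if zip_to_city.get(zip_code) != city]

print(mismatches)  # [('60601', 'Boston')]
```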
predictive analytics
will it happen in the future?
descriptive analytics
what happened?
diagnostic analytics
why did it happen?
prescriptive analytics
what should we do based on what we expect to happen?
common analytics problems
o Data overfitting = a model fits the training data very well but predicts poorly when applied to other datasets
o Extrapolating beyond the range = estimating a value that lies beyond the range of the data used to create the model
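A sketch of the extrapolation problem only (not overfitting). It assumes hypothetical data generated from y = x**2 on the range 0 to 4 and fits a straight line by ordinary least squares. The line is off by a little inside the range and by a lot far outside it.

```python
# Hypothetical training data generated from y = x**2 on the range 0..4.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

# Ordinary least-squares fit of a straight line y = slope * x + intercept.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    / sum((x - x_bar) ** 2 for x in xs)
)
intercept = y_bar - slope * x_bar

def predict(x):
    """Predict y from the fitted line."""
    return slope * x + intercept

inside = predict(2)    # within the training range; true value is 4
outside = predict(10)  # far beyond the training range; true value is 100

print(inside)   # 6.0
print(outside)  # 38.0
```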
benefits of visualization
Visualized data is processed faster than written information, is easier to use, and supports the dominant learning style of the population (most people are visual learners)