data exploration Flashcards
data does not
- speak for itself
- it can be biased and is not objective (depends on how it was selected)
- the people behind it interpret the data
answers/results depend on...
the question to solve and the perspective taken
types of data sets
- cross-sectional
- time-series
- panel
cross-sectional
- many subjects/variables, one point in time
- eg sales, expenses, profit
time-series
- one subject/variable, many points in time
- eg sales over time
panel
- many subjects/variables, many points in time
- eg sales, expenses, profit over time
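sketch: the three types of data sets in code
A minimal Python sketch, assuming a small made-up table of yearly figures. The numbers and the helper names `cross_section` and `time_series` are hypothetical, not from the cards. The idea is that a panel holds many variables at many points in time; fixing one period gives a cross-section, and fixing one variable gives a time-series.

```python
# Hypothetical figures, invented purely for illustration.
# Panel data: many variables observed at many points in time.
panel = {
    2021: {"sales": 100, "expenses": 70, "profit": 30},
    2022: {"sales": 120, "expenses": 80, "profit": 40},
    2023: {"sales": 150, "expenses": 90, "profit": 60},
}


def cross_section(data, period):
    """Many variables at one point in time."""
    return dict(data[period])


def time_series(data, variable):
    """One variable at many points in time, ordered by period."""
    return {period: row[variable] for period, row in sorted(data.items())}


print(cross_section(panel, 2022))
print(time_series(panel, "sales"))
```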
dimensions of data quality
- completeness
- consistency
- conformity
- accuracy
- integrity
- timeliness
completeness
data is comprehensive and meets expectations
consistency
data across all systems, or sourced from different places, reflects the same information
conformity
follows a set of standard data definitions such as data type, size and format
accuracy
correctly reflects the real-world object or event being described
integrity
all data in a database can be traced and connected to other data
timeliness
information is available when it is expected and needed
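sketch: checking some data quality dimensions in code
A rough Python sketch of how three of the dimensions above (completeness, conformity, integrity) might be checked per record. The records, the YYYY-MM-DD date standard, and the function names are all assumptions made for illustration. Real checks depend on the data definitions an organisation has actually agreed on.

```python
import re

# Hypothetical records and rules, made up for illustration.
customers = {"C1", "C2"}
orders = [
    {"order_id": "A001", "customer_id": "C1", "amount": 25.0, "date": "2024-01-15"},
    {"order_id": "A002", "customer_id": "C9", "amount": None, "date": "15/01/2024"},
]

# Assumed standard data definition for dates: YYYY-MM-DD.
DATE_FORMAT = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_complete(row):
    """Completeness: no field is missing a value."""
    return all(value is not None for value in row.values())


def conforms(row):
    """Conformity: amount has the expected type, date the expected format."""
    return isinstance(row["amount"], float) and bool(DATE_FORMAT.fullmatch(row["date"]))


def has_integrity(row, known_customers):
    """Integrity: the order can be traced to a known customer."""
    return row["customer_id"] in known_customers


for row in orders:
    print(row["order_id"], is_complete(row), conforms(row), has_integrity(row, customers))
```

Under these assumed rules the first record passes all three checks and the second fails all three (missing amount, non-standard date, unknown customer).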
first two steps of data cleansing/processing
- sourcing raw data
- technically correct data
sourcing raw data
What do we want and need to achieve?
What data will support this outcome?
How can we source it and ensure it is of a high quality?