data exploration Flashcards
data does not
- speak for itself
- it can be biased and is not objective (depending on how it was selected)
- the people behind it interpret the data
answers/results depend on...
the question being solved and the perspective taken
types of data sets
- cross-sectional
- time-series
- panel
cross-sectional
- many subjects/variables, one point in time
- eg sales, expenses, profit
time-series
- one subject/variable, many points in time
- eg sales over time
panel
- many subjects/variables, many points in time
- eg sales, expenses, profit over time
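A minimal sketch of how the three data set types differ in shape, using plain Python structures. The company names, years, and figures are made up for illustration, and `classify` is a hypothetical helper, not a standard function.

```python
# Hypothetical figures, for illustration only.

# Cross-sectional: many subjects/variables, one point in time.
cross_sectional = {
    "Company A": {"sales": 120, "expenses": 80},
    "Company B": {"sales": 95, "expenses": 70},
}

# Time-series: one subject/variable, many points in time.
time_series = {2021: 100, 2022: 110, 2023: 120}  # Company A sales by year

# Panel: many subjects/variables, many points in time.
panel = {
    ("Company A", 2022): {"sales": 110, "expenses": 75},
    ("Company A", 2023): {"sales": 120, "expenses": 80},
    ("Company B", 2022): {"sales": 90, "expenses": 68},
    ("Company B", 2023): {"sales": 95, "expenses": 70},
}


def classify(n_subjects: int, n_periods: int) -> str:
    """Name the data set type from how many subjects and periods it covers."""
    if n_subjects > 1 and n_periods > 1:
        return "panel"
    if n_periods > 1:
        return "time-series"
    return "cross-sectional"
```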
dimensions of data quality
- completeness
- consistency
- conformity
- accuracy
- integrity
- timeliness
completeness
data is comprehensive and meets expectations
consistency
data across all systems, or sourced from different places, reflects the same information
conformity
data follows a set of standard data definitions such as data type, size and format
accuracy
data correctly reflects the real-world object or event being described
integrity
all data in a database can be traced and connected to other data
timeliness
information is available when it is expected and needed
first two steps of data cleansing/processing
- sourcing raw data
- technically correct data
sourcing raw data
What do we want and need to achieve?
What data will support this outcome?
How can we source it and ensure it is of a high quality?
technically correct data
- each value can be directly recognised as belonging to a certain variable
- is stored in a data type that represents the value domain of the real-world variable
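A rough sketch of turning raw text into technically correct data, where each field ends up in a type that matches its variable. The field names (`customer`, `amount`, `order_date`) and the day/month/year date format are assumptions made for this example.

```python
from datetime import date, datetime


def to_technically_correct(raw: dict) -> dict:
    """Convert raw text fields into types that match each variable's domain."""
    return {
        # Text stays text, but stray whitespace is removed.
        "customer": raw["customer"].strip(),
        # A money amount becomes a number (thousands separators removed).
        "amount": float(raw["amount"].replace(",", "")),
        # A date string becomes a real date (assumed format: DD/MM/YYYY).
        "order_date": datetime.strptime(raw["order_date"], "%d/%m/%Y").date(),
    }


raw_record = {"customer": " Lee ", "amount": "1,250.50", "order_date": "03/02/2024"}
clean = to_technically_correct(raw_record)
```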
data issues
- formatting/data type
- missing values
- outliers
formatting/data type
- sex: Male, M, Boy
- month: January, 1-Jan, 1
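One possible way to fix inconsistent formats like these is a lookup table that maps every known variant to a single standard value. The tables below are hypothetical and only cover the variants in the example plus a few obvious extras; real data would need its own list.

```python
# Hypothetical lookup tables; extend them with whatever variants appear in the data.
SEX_LABELS = {"male": "M", "m": "M", "boy": "M", "female": "F", "f": "F", "girl": "F"}
MONTH_LABELS = {"january": 1, "jan": 1, "1-jan": 1, "1": 1}


def standardise(value: str, lookup: dict):
    """Map a raw label to its standard form, or None if it is not recognised."""
    return lookup.get(value.strip().lower())


sexes = [standardise(v, SEX_LABELS) for v in ["Male", "M", "Boy"]]
months = [standardise(v, MONTH_LABELS) for v in ["January", "1-Jan", "1"]]
```

Returning `None` for unknown labels keeps unrecognised values visible as missing rather than silently guessing a category.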
missing values - listwise deletion
remove records with missing values in any variable
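A short sketch of listwise deletion, assuming each record is a dictionary and a missing value is stored as `None`. The variable names and figures are illustrative.

```python
def listwise_delete(records: list[dict]) -> list[dict]:
    """Keep only records where every variable has a value (None marks missing)."""
    return [r for r in records if all(v is not None for v in r.values())]


records = [
    {"sales": 100, "expenses": 60},
    {"sales": None, "expenses": 55},   # dropped: sales is missing
    {"sales": 120, "expenses": None},  # dropped: expenses is missing
    {"sales": 90, "expenses": 50},
]
complete = listwise_delete(records)
```

Half of the records are lost in this small example, which is the main cost of the method.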
missing values - mode/median/mean imputation
- mean for continuous variables
- median for skewed continuous variables
- mode for categorical variables
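The three imputation choices can be sketched with Python's `statistics` module. The `impute` helper and the sample values are assumptions for illustration, with `None` standing for a missing value.

```python
from statistics import mean, median, mode


def impute(values: list, strategy) -> list:
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


ages = impute([30, None, 34, 32], mean)               # continuous variable
incomes = impute([40, 42, None, 45, 400], median)     # skewed continuous variable
colours = impute(["red", None, "blue", "red"], mode)  # categorical variable
```

The income example shows why the median suits skewed data: the single value of 400 would pull a mean far above the typical income.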
missing values - model imputation
- interpolate/extrapolate
- use regression model to predict missing value
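A minimal sketch of model imputation, assuming a simple straight-line (least squares) model is adequate. The function names and the monthly sales figures are hypothetical.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def regression_impute(xs: list[float], ys: list) -> list[float]:
    """Fill None entries in ys with predictions from a line fitted to known points."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]


months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, None, 16.0, None]
filled = regression_impute(months, sales)
```

Here month 3 lies between known points, so its value is interpolated, while month 5 lies beyond the last known point, so its value is extrapolated.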
outliers - drop outlier record
completely remove the record to avoid severe skewness
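A sketch of dropping outlier records. The flashcard does not say how an outlier is identified, so this example assumes the common 1.5 x IQR rule; the `drop_outliers` helper and the sales figures are illustrative.

```python
from statistics import quantiles


def drop_outliers(records: list[dict], key: str, k: float = 1.5) -> list[dict]:
    """Remove records whose `key` value lies outside the IQR fences."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles([r[key] for r in records], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in records if low <= r[key] <= high]


records = [{"id": i, "sales": s} for i, s in enumerate([10, 12, 11, 13, 12, 95])]
kept = drop_outliers(records, "sales")
```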
outliers - winsorisation
- cap outlying values instead of removing them
- limit extreme values in statistical data to reduce effect of possibly spurious (false) outliers
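A minimal sketch of winsorisation, assuming the same share of values is capped at each end. The `limit` parameter and the sample data are choices made for this example.

```python
def winsorise(values: list[float], limit: float = 0.1) -> list[float]:
    """Cap the lowest and highest `limit` share of values at the nearest kept value."""
    ordered = sorted(values)
    k = int(limit * len(ordered))            # how many values to cap at each end
    low, high = ordered[k], ordered[-k - 1]  # the most extreme values left uncapped
    return [min(max(v, low), high) for v in values]


data = [1, 20, 21, 22, 23, 24, 25, 26, 27, 100]
capped = winsorise(data, limit=0.1)
```

Unlike dropping records, the data set keeps its original size; only the extreme values 1 and 100 are pulled in to 20 and 27.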
outliers - imputation
- assign a new value
- mean or regression
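A sketch of outlier imputation with the mean, assuming the acceptable range is already known. The `low` and `high` bounds and the sample values are hypothetical; a regression prediction could be used as the replacement value instead.

```python
from statistics import mean


def impute_outliers(values: list[float], low: float, high: float) -> list[float]:
    """Replace values outside [low, high] with the mean of the values inside it."""
    inliers = [v for v in values if low <= v <= high]
    fill = mean(inliers)
    return [v if low <= v <= high else fill for v in values]


cleaned = impute_outliers([10, 12, 11, 13, 14, 95], low=0, high=50)
```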
data privacy
the claim of individuals, groups, and institutions to determine for themselves when, how, and to what extent information about them is communicated to others
data privacy principles
- notice
- choice and consent
- use and retention
- access
- protection
- enforcement and redress
notice
inform users about privacy policy/protection procedures
choice and consent
obtain consent from individuals for the collection, use, disclosure, and retention of their information
use and retention
data is retained/protected as required by law or business practices
access
give individuals access to review, update, and modify their personal information
protection
data is used only for the purpose stated
enforcement and redress
provide channels for individuals to report, provide feedback, or complain
ethics of data security
- managing quality personnel to address ethical issues
- a perceived potential conflict of interest also exists between ethical behaviour and technical knowledge
Australian Security Principles protect against
- misuse
- interference
- loss
- unauthorised access, modification, disclosure