intro/eda Flashcards
ways to classify analytics
- descriptive
- prescriptive
- predictive
what is descriptive analytics
gather, organise
- tells you what is happening
what is predictive analytics
- uses data to predict future
- uses associations among variables to predict the likelihood of a phenomenon based on the relationships identified
what is prescriptive analytics/decision analytics
- looks at multiple options and strategies, then decides the best course of action
- recommends course of action
steps to EDA
- define problem
- gather data
- analyse data
- act on analysis
population
includes all entities of interest in a study
sample
- subset of population
- often randomly chosen and preferably representative of the population as a whole
what is descriptive statistics
data for whole population
- techniques to describe data
- parameters
what is inferential statistics
- steps (3)?
- generalise findings of a sample to population
- statistics
- model, estimate parameters, estimate errors via testing
types of data
- numerical data: continuous, discrete / interval, ratio
- categorical data: nominal, ordinal
- text
- geolocation data
what is numerical data
- types?
- discrete
- continuous
- cross-sectional: data on cross section of a population at a distinct point in time
- time series: data collected over time
- pooled data: time series of cross sections; observations in each cross section not the same
- panel data: samples of SAME cross-sectional data observed at multiple points in time
descriptive measures for numerical variables
- mean: average value of an interval or ratio variable; affected by outliers
- median: middle value for data arranged in either ascending or descending order; good for ordinal data; less affected by outliers
- mode: most frequent value for a variable; good for nominal data
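A minimal sketch of the three measures above, using Python's standard `statistics` module. The sample values are hypothetical and chosen to include one outlier, so the difference between mean and median is visible.

```python
import statistics

# Hypothetical sample; 90 is an outlier
values = [2, 3, 3, 4, 5, 90]

mean = statistics.mean(values)      # average; pulled upward by the outlier
median = statistics.median(values)  # middle of the sorted values; barely moved by the outlier
mode = statistics.mode(values)      # most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

The mean lands far above most of the observations, while the median stays close to them.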
measuring spread for numerical data
- range: max-min
- IQR: Q3 - Q1
- variance
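A short sketch of the spread measures above with the standard `statistics` module. The data are hypothetical. Note that quartiles can be computed by several conventions, so Q1 and Q3 (and therefore the IQR) may differ slightly between tools; this sketch uses the default method of `statistics.quantiles`.

```python
import statistics

# Hypothetical sample
values = [2, 4, 4, 5, 7, 9, 10, 12]

# Range: max - min
data_range = max(values) - min(values)

# IQR: Q3 - Q1 (n=4 splits the data into quartiles)
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Variance: divide by n-1 for a sample, by n for a whole population
sample_variance = statistics.variance(values)
population_variance = statistics.pvariance(values)

print("range:", data_range)
print("IQR:", iqr)
print("sample variance:", sample_variance)
print("population variance:", population_variance)
```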
measures of symmetry of distribution
- skewness
- kurtosis
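One common way to compute both measures is from central moments. The sketch below assumes the population (uncorrected) moment-based definitions and reports excess kurtosis (kurtosis minus 3, so a normal distribution scores 0); statistical packages may apply bias corrections and give slightly different numbers. The function names are illustrative, not from any library.

```python
def central_moment(values, k):
    """k-th central moment: mean of (x - mean)^k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Third central moment scaled by variance^1.5; 0 for a symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment scaled by variance^2, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]         # hypothetical symmetric sample
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical sample with a long right tail

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-skewed):", skewness(right_skewed))
print("excess kurtosis (symmetric):", excess_kurtosis(symmetric))
```

A positive skewness indicates a longer right tail, a negative one a longer left tail.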
charts for numerical variables
- for cross-sectional variables?
- for time series variables?
CR: histogram, boxplot
TS: time series graphs
histogram/boxplot
- purpose?
- difference?
- shows distribution/skewness of a distribution
- histogram for numerical values, bar chart for categorical values
time series graphs
- purpose?
- how variables change over time
analysis for categorical data
- types of summarization
- data processing?
bar graph, pie chart, 2way frequency table
categorical variables can be coded numerically or left uncoded
- nominal data: dummy encoding, one-hot encoding
- ordinal data: one-hot encoding, label encoding
what is dummy encoding
- use (m-1) variables
- binary or 2-value variable can be coded as 0-1 variable for a specific category
- for more than 2, drop a category
what is one-hot encoding
- use m variables, where m is the number of possible values in the categorical variable
what is label encoding
- use a variable with an ordering label
- used if order is required for comparison analysis
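The three encodings can be sketched in plain Python as below. The column values, the choice of which category to drop, and the ordinal mapping are all hypothetical; in practice a data library would usually do this for you.

```python
# Hypothetical nominal column with m = 3 possible values
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# One-hot encoding: m 0-1 columns, one per category
one_hot = [[int(value == cat) for cat in categories] for value in colors]

# Dummy encoding: m-1 columns; the dropped category ('blue' here) is the
# baseline and is represented by all zeros
dummy = [row[1:] for row in one_hot]

# Label encoding for ordinal data: one variable whose numbers follow the order
size_order = {"small": 0, "medium": 1, "large": 2}  # assumed ordering
sizes = ["medium", "small", "large"]
labels = [size_order[size] for size in sizes]

print(one_hot)
print(dummy)
print(labels)
```

Label encoding keeps comparisons such as small < medium < large meaningful, which one-hot and dummy encoding do not.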
data errors
- sources?
- data entry/cleaning
- transmission
- integration
- incorrect manipulation
data errors
- mitigation?
- check spread, missing, invalid
- use summary tables, one/two-way frequency tables, etc
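A small sketch of how one-way and two-way frequency tables can surface missing and invalid values, using `collections.Counter`. The records, field names, and set of valid plans are hypothetical.

```python
from collections import Counter

# Hypothetical records: one missing plan (None) and one misspelled plan
records = [
    {"gender": "F", "plan": "basic"},
    {"gender": "M", "plan": "premium"},
    {"gender": "F", "plan": "premium"},
    {"gender": "M", "plan": None},
    {"gender": "F", "plan": "premuim"},
]
valid_plans = {"basic", "premium"}  # assumed domain of valid values

# One-way frequency table: counts per value of a single variable
one_way = Counter(r["plan"] for r in records)

# Two-way frequency table: counts per combination of two variables
two_way = Counter((r["gender"], r["plan"]) for r in records)

missing_count = one_way[None]
invalid_values = [v for v in one_way if v is not None and v not in valid_plans]

print(one_way)
print(two_way)
print("missing:", missing_count, "invalid:", invalid_values)
```

Unexpected categories with very low counts in a frequency table are often data-entry errors.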
what are outliers
- a value or observation that lies well outside of the norm
missing values
- what to do?
- ignore row with missing data
- clean data by going back to the source and get actual value
- impute missing data with substitute data
how to impute missing values
- use central tendency
- use constant value meaningful to domain
- randomly selected
- predict by inference
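The first three imputation options can be sketched with the standard library as follows. The data, the sentinel value -1, and the random seed are hypothetical; predicting by inference would need a fitted model and is not shown.

```python
import random
import statistics

# Hypothetical column with missing values marked as None
ages = [23, None, 31, 27, None, 45]
observed = [a for a in ages if a is not None]

# Central tendency: fill with the median of the observed values
median_age = statistics.median(observed)
imputed_median = [median_age if a is None else a for a in ages]

# Constant: fill with a sentinel that is meaningful in the domain (assumed -1)
imputed_constant = [-1 if a is None else a for a in ages]

# Random selection: fill with a randomly chosen observed value
rng = random.Random(0)  # fixed seed so the run is repeatable
imputed_random = [rng.choice(observed) if a is None else a for a in ages]

print(imputed_median)
print(imputed_constant)
print(imputed_random)
```

The median is used here rather than the mean because it is less affected by outliers.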
what is univariate analysis
analysis of 1 variable
- simplest form of data analysis
- main purpose is to describe the data and find patterns that exist within it
bivariate analysis
analysis of 2 variables
- find out if there is a relationship between 2 different variables
multivariate analysis
analysis of 3 or more variables
- eg. regression analysis