Data Types Flashcards
3 types of data
quantitative, qualitative, textual
Quantitative data (1 word)
numerical
Qualitative data (1 word)
categorical
2 types of quantitative data
continuous and discrete
numerical qualitative data
numbers used as category labels; arithmetic on them is not meaningful
2 types of qualitative data
ordinal (ordered), nominal (unordered)
1st step of investigating data
examine the univariate statistics
purposes (3) of starting with univariate stats
1) detect distribution anomalies 2) get an idea of some orders of magnitude 3) see how to discretize the continuous variables, if needed
type of plot for univariate review of discrete or qualitative data
frequency table
type of plot for univariate review of continuous data
box plot, taking note of extreme percentiles
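A minimal sketch of both univariate reviews, using only the Python standard library. The variables `colors` and `incomes` are hypothetical data invented for illustration; the quartiles are the numbers a box plot would draw.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical sample data, invented for illustration.
colors = ["red", "blue", "red", "green", "red", "blue"]
incomes = [21, 25, 27, 30, 31, 33, 35, 38, 40, 250]

# Frequency table for a qualitative (or discrete) variable.
freq_table = Counter(colors)
for value, count in freq_table.most_common():
    print(f"{value:<6} {count:>3} {count / len(colors):6.1%}")

# Quartiles for a continuous variable: the basis of a box plot.
q1, median, q3 = quantiles(incomes, n=4, method="inclusive")
print("Q1, median, Q3:", q1, median, q3)

# Extreme percentiles: with n=100 the 99 cut points returned are
# the 1st through 99th percentiles.
pct = quantiles(incomes, n=100, method="inclusive")
p1, p99 = pct[0], pct[98]
print("1st and 99th percentiles:", p1, p99)
```

A 99th percentile far above Q3, as the value 250 produces here, is the kind of distribution anomaly this step is meant to surface.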
purposes (3) of bivariate analysis
1) detect incompatible variables 2) detect links between the dependent (target) variable and the independent variables 3) detect links between independent variables
simple table for bivariate analysis
contingency table
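A contingency table counts how often each pair of values occurs together. The sketch below assumes two hypothetical qualitative variables (contract type and churn), with records invented for illustration.

```python
from collections import Counter

# Hypothetical records: (contract type, churned?) pairs, invented for illustration.
records = [
    ("monthly", "yes"), ("monthly", "no"), ("monthly", "yes"),
    ("annual", "no"), ("annual", "no"), ("annual", "yes"),
    ("monthly", "yes"), ("annual", "no"),
]

# Count each (row value, column value) combination.
cells = Counter(records)
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# Print the table with one row per contract type and one column per outcome.
print(f"{'':<8}" + "".join(f"{c:>5}" for c in cols))
for r in rows:
    print(f"{r:<8}" + "".join(f"{cells[(r, c)]:>5}" for c in cols))
```

Uneven counts across a row hint at a link between the two variables; a cell that should be impossible hints at incompatible variables.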
2nd step of investigating data
Rare or Missing values
problem with rare values
can create bias in factor analysis or skew in measures of center
dealing with rare values
remove or replace with more frequent value
problem with missing values
1) may not be random, which biases the data 2) they accumulate across variables: the more variables considered, the more records have at least one missing value
dealing with missing values (4 options)
1) remove records 2) remove/replace variable 3) replace value 4) treat 'missing' qualitative data as its own value
when missing values >= 15-20% of values
cannot replace the values or treat 'missing' as its own value
Statistical replacement of the missing values uses a process called
imputation
simplest method of imputation
replace missing value with most frequent value or mean/median
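A minimal sketch of this simplest form of imputation. The columns `ages` and `cities` and the helper `impute` are hypothetical names invented for illustration; missing entries are assumed to be marked as `None`.

```python
from statistics import median, mode

# Hypothetical columns with missing entries marked as None.
ages = [34, None, 29, 41, None, 38]
cities = ["Lyon", "Paris", None, "Paris", "Nice", None]

def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]

# Continuous variable: fill with the median of the observed values.
observed_ages = [v for v in ages if v is not None]
ages_filled = impute(ages, median(observed_ages))

# Qualitative variable: fill with the most frequent observed value.
observed_cities = [v for v in cities if v is not None]
cities_filled = impute(cities, mode(observed_cities))

print(ages_filled)
print(cities_filled)
```

Every gap in a column receives the same fill value, which shrinks the variable's spread; that is one reason this approach is discouraged when many values are missing.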
most widespread imputation model (simple imputation)
each missing value is replaced with an assumed value
multiple imputation
missing values are replaced with multiple plausible values, creating several complete data tables
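A deliberately simplified sketch of the idea, not a full multiple-imputation procedure. As a stand-in for a proper imputation model, each plausible value is drawn at random from the observed values of the same column; that choice, the column `incomes`, the helper `impute_once`, and the number of tables `m` are all assumptions made for illustration.

```python
import random
from statistics import mean

# Hypothetical column with missing entries marked as None.
incomes = [30, None, 42, 35, None, 51, 38]
observed = [v for v in incomes if v is not None]

def impute_once(values, donors, rng):
    """Fill each None with a value drawn at random from the observed donors."""
    return [rng.choice(donors) if v is None else v for v in values]

rng = random.Random(0)  # fixed seed so the run is repeatable
m = 5                   # number of completed tables (an arbitrary choice)
completed_tables = [impute_once(incomes, observed, rng) for _ in range(m)]

# Analyze each completed table separately, then pool the estimates.
estimates = [mean(table) for table in completed_tables]
pooled_mean = mean(estimates)
print("per-table means:", estimates)
print("pooled mean:", pooled_mean)
```

The spread of the per-table estimates reflects the uncertainty introduced by the missing values, which a single imputed table hides.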
3rd step of investigating data
Aberrant Values
define aberrant value
erroneous value: can be caused by incorrect measurement, calculation error, input error, false declaration.
extreme values and aberrant values relationship
extreme values not always aberrant, aberrant values not always extreme.
tools for detecting aberrant values
frequency tables, univariate statistics
dealing with aberrant values (4)
1) delete records 2) delete/replace variable 3) replace value 4) tolerate small margin of error and keep values as are.
4th step of investigating data
Extreme Values
what situations are tolerant of extreme values
1) decision trees 2) where rare profiles are the subject of study (fraud prediction, etc)
what situations are especially intolerant of extreme values
continuous variables used in logistic regression, PCA, and other variance-based calculations.
dealing with extreme values (#1)
exclude outliers from model learning sample (ensuring not more than 1-2% are excluded)
dealing with extreme values (#2)
divide continuous variable into classes
dealing with extreme values (#3)
Winsorize the variable: values below the 1st percentile are set to the 1st percentile value, and values above the 99th percentile are set to the 99th percentile value.
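Winsorizing can be sketched with the standard library as follows. The function name `winsorize`, its parameters, and the sample data are hypothetical; the percentile method (`"inclusive"`) is one of several reasonable conventions.

```python
from statistics import quantiles

def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values to the given lower and upper percentiles."""
    # With n=100 the 99 cut points are the 1st through 99th percentiles.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# Hypothetical data: the integers 1..200 plus two extreme values.
data = list(range(1, 201)) + [5000, -3000]
clipped = winsorize(data)

print("before:", min(data), max(data))
print("after: ", min(clipped), max(clipped))
```

Unlike excluding outliers, this keeps every record in the sample; only the extreme values themselves are changed.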
5th step of investigating data
tests of normality
why normality matters
many parametric methods assume normally distributed data; in linear regression the assumption applies to the residuals.
Tests of normality: when to use Shapiro–Wilk
small samples (under about 2000); its statistic measures how closely a normal probability plot of the data follows the straight line that represents normality
Tests of normality: when to use Kolmogorov–Smirnov
very general test that works for any continuous distribution; for normality, its variants are typically used instead
Tests of normality: when to use Anderson–Darling test
Variant of Kolmogorov–Smirnov that gives more weight to the tails, where Kolmogorov–Smirnov is less sensitive
Tests of normality: when to use Lilliefors test
Variant of Kolmogorov–Smirnov for cases when mean and variance are estimated from sample data
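The difference between the Kolmogorov–Smirnov and Lilliefors settings can be sketched by computing the shared test statistic by hand: the largest gap between the sample's empirical CDF and the reference normal CDF. The helper `ks_statistic` and the sample values are hypothetical, and the sketch stops at the statistic; turning it into a p-value needs the critical values that belong to each test.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic(sample, dist):
    """Largest gap between the empirical CDF of sample and dist's CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

# Hypothetical sample, invented for illustration.
sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 5.1, 4.9, 5.2]

# Kolmogorov-Smirnov setting: the reference distribution is fixed in advance.
d_ks = ks_statistic(sample, NormalDist(mu=5.0, sigma=0.5))

# Lilliefors setting: mean and standard deviation are estimated from the
# same sample. The statistic is computed the same way, but it must be
# compared against Lilliefors critical values, not the standard K-S ones.
d_lilliefors = ks_statistic(sample, NormalDist(mean(sample), stdev(sample)))

print("K-S statistic:       ", d_ks)
print("Lilliefors statistic:", d_lilliefors)
```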
define homoscedasticity
the property of having equal statistical variances
homoscedasticity with a single independent variable in a discrimination model
equality of the variances of the variable in a number of samples, for example in the different groups of a population
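An informal check of this single-variable case is to compare the variance of the variable across groups. The groups and values below are hypothetical, and the max/min ratio is only a rough indicator chosen for illustration; formal tests such as Levene's or Bartlett's exist for this purpose.

```python
from statistics import variance

# Hypothetical measurements of one variable in three groups.
groups = {
    "A": [12.1, 11.8, 12.5, 12.0, 11.6],
    "B": [15.2, 14.7, 15.9, 15.1, 14.6],
    "C": [9.0, 13.5, 6.2, 14.8, 10.1],
}

# Sample variance of the variable within each group.
group_variances = {name: variance(vals) for name, vals in groups.items()}
for name, var in group_variances.items():
    print(f"group {name}: variance = {var:.3f}")

# Ratio of largest to smallest variance: values near 1 are consistent
# with homoscedasticity; large values suggest it does not hold.
ratio = max(group_variances.values()) / min(group_variances.values())
print(f"max/min variance ratio = {ratio:.1f}")
```

Here group C is far more spread out than A or B, so the ratio is large and equal variances would be doubtful.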
homoscedasticity with more than one independent variable in a discrimination model
equality of the covariance matrices of the variables in a number of samples
homoscedasticity in linear regression
variance of the residuals does not depend on the value of the predictors