Advanced Data Preparation Flashcards
Normality Assumption
-some models assume the data is normally distributed
-when we don't have normality (the data doesn't fit a normal distribution), results are biased because the assumption is violated
EX: estimating height - normally distributed
-weight - smaller weights have less variance, higher weights have more variance (a wider spread)
heteroscedasticity
unequal variance
variance definition and equation
a statistical measurement of the spread between numbers in a dataset. It measures how far each number in the set is from the mean, and thus from every other number in the set
σ² = ( Σ_{i=1}^{n} (x_i − x̄)² ) / n
where:
x_i = each value in the data set
x̄ = the mean of all values in the data set
n = number of values in the data set
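A minimal numpy sketch of the formula above, on made-up toy data (the numbers are assumed for illustration):

```python
import numpy as np

data = np.array([4.0, 7.0, 6.0, 3.0, 5.0])  # toy data, assumed for illustration

# Population variance: mean squared deviation from the mean
mean = data.mean()
variance = ((data - mean) ** 2).sum() / len(data)

# np.var computes the same population formula by default (ddof=0)
assert np.isclose(variance, np.var(data))
print(variance)  # 2.0
```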
What is the Box-Cox transformation? What does it do and what is it for?
-a family of power transformations (the log transform is the special case λ = 0)
-transforms data before you try to fit it to a model - how you deal with heteroscedasticity
What does it do?
-stretches out the smaller range to enlarge its variability
-shrinks the larger range to reduce its variability
-goal: find the best value of lambda in t(y) = (y^λ − 1) / λ (with t(y) = log(y) when λ = 0)
-software can find λ for you
-check whether you need the transformation (QQ plot)
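A minimal sketch of letting software find λ, assuming scipy is available and using synthetic skewed data (Box-Cox requires positive values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # synthetic right-skewed data

# scipy searches for the lambda that makes the transformed data most normal
y_transformed, best_lambda = stats.boxcox(y)
print(f"best lambda: {best_lambda:.3f}")

# By hand for a fixed lambda: t(y) = (y**lam - 1) / lam, or np.log(y) when lam == 0
```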
what is a qq plot used for?
quantile-quantile plot
-helps us assess whether data came from a theoretical distribution such as the normal or exponential
-if the dots fall on a straight line, the dataset follows the chosen distribution (its quantiles are comparable to the theoretical distribution's quantiles)
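A minimal sketch using scipy's probplot (matplotlib assumed available), which plots sample quantiles against theoretical normal quantiles:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=200)  # synthetic data for the demo

# Sample quantiles vs. theoretical normal quantiles; a straight line => normal
stats.probplot(sample, dist="norm", plot=plt)
plt.title("QQ plot vs. normal distribution")
plt.show()
```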
trend in this context
from time series analysis - an increase or decrease in the data over time
What can you detrend and when?
-response
-predictors
when you're using a factor-based model (regression, SVM, etc.)
How to detrend?
-factor by factor
-fit a 1-dimensional regression against time, usually y = a0 + a1·x
-detrended price = actual price − (a0 + a1·x)
(subtract the trend-line estimate from the real value)
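A minimal numpy sketch of this, with a made-up trending price series (the data and coefficients are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100, dtype=float)                        # time index (x)
price = 50.0 + 0.3 * t + rng.normal(0, 2.0, size=100)  # synthetic upward-trending series

# Fit the 1-dimensional trend line y = a0 + a1*x
a1, a0 = np.polyfit(t, price, deg=1)
trend = a0 + a1 * t

# Detrended price = actual price minus the trend-line estimate
detrended = price - trend
print(detrended.mean())  # ~0 once the trend is removed
```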
factor-by-factor detrending approach…
-simple
-works well to remove trend effects in time series data
-helpful in factor based analysis
What can PCA (principal component analysis) help you do?
-choose which predictors to use
-find which predictors are highly correlated
What 2 things does PCA do?
pca transforms data
-removes correlations within the data
-ranks coordinates by importance (variance explained)
concentrate on the first n principal components
-reduces the effect of randomness
-earlier principal components are likely to have a higher signal-to-noise ratio
PCA- linear transformation equation
X: initial matrix of data; x_ij is the jth factor of data point i
-center each factor (column) so that its mean = 0
-find all of the eigenvectors of XᵀX
-V: matrix of eigenvectors, sorted by eigenvalue (largest first)
-V = [v1 v2 …], where v_j is the jth eigenvector of XᵀX
PCA- linear transformation
-first component is Xv1, second component is Xv2, etc.
-kth new factor value for the ith data point: t_ik = Σ_{j=1}^{m} x_ij · v_jk
What does PCA's linear transformation eliminate?
it eliminates correlation between factors
How can you get fewer variables with the linear transformation?
only include the first n principal components in your model
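A minimal numpy sketch of the steps above - center X, eigendecompose XᵀX, project, then keep the first n components (the data matrix is synthetic and assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # toy data matrix: 200 points, 5 factors
X = X - X.mean(axis=0)          # center each factor so its mean is 0

# Eigenvectors of X^T X, sorted by eigenvalue (largest variance first)
eigenvalues, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigenvalues)[::-1]
V = V[:, order]

# t_ik = sum_j x_ij * v_jk  ->  the principal-component scores
T = X @ V
print(np.round(np.corrcoef(T, rowvar=False), 3))  # off-diagonals ~0: correlations removed

# Fewer variables: keep only the first n principal components
n = 2
T_reduced = T[:, :n]
```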
Can you use a kernel for non-linear PCA?
yes
-similar to the kernel trick in SVMs
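A minimal sketch of kernel PCA, assuming scikit-learn is available (the dataset and kernel parameters are illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Concentric circles: structure that linear PCA cannot separate
X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# An RBF kernel maps the data nonlinearly before extracting components,
# analogous to the kernel trick in SVMs
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
X_kpca = kpca.fit_transform(X)
```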