Advanced Data Preparation Flashcards

1
Q

Normality Assumption

A

-some models assume the data is normally distributed
-when the data does not fit a normal distribution, the assumption is wrong and the results can be biased

EX: estimating height - roughly normally distributed
-weight - smaller weights have less variance, higher weights have more variance (wider area)

2
Q

heteroscedasticity

A

unequal variance: the spread of the data is not the same across its range

3
Q

variance definition and equation

A

a statistical measurement of the spread between numbers in a dataset. It measures how far each number in the set is from the mean, and thus from every other number in the set

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

where:
x_i = each value in the data set
\bar{x} = mean of all values in the data set
n = number of values in the data set
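
A minimal Python sketch of the variance formula on this card, dividing by n (the population form). The function name `population_variance` and the sample values are made up for illustration; the result should agree with the standard library's `statistics.pvariance`.

```python
import statistics

def population_variance(values):
    """Average squared distance of each value from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up example values, mean = 5
var = population_variance(data)    # squared deviations sum to 32, so 32 / 8
print(var)
```

Note that `statistics.variance` (without the `p`) divides by n - 1 instead, which is the sample variance rather than the formula shown here.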

4
Q

what is box-cox transformation? What does it do and what is it for?

A

power transformation (the logarithm is the special case lambda = 0)
-transform the data before trying to fit it to a model - one way to deal with heteroscedasticity

What does it do?
-stretches out the smaller range to enlarge its variability
-shrinks the larger range to reduce its variability

goal: find the best value of lambda so that t(y) = (y^lambda - 1) / lambda is as close to normally distributed as possible
-software can do it for you
-first check whether you need the transformation (qq plot)
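
A small sketch of the transform itself for a fixed lambda, assuming strictly positive inputs. The helper name `box_cox` and the sample values are illustrative, and lambda = 0.5 is picked by hand here rather than optimized.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)          # limiting case as lambda -> 0
    return (y ** lam - 1) / lam

values = [1.0, 4.0, 9.0, 100.0]     # made-up, right-skewed values
transformed = [box_cox(y, 0.5) for y in values]
print(transformed)                  # large values are pulled in more than small ones
```

In practice a library routine such as `scipy.stats.boxcox` can choose lambda for you (it estimates lambda when none is supplied), which is what "software can do it for you" refers to.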

5
Q

what is a qq plot used for?

A

quantile-quantile plot
-helps us assess whether a dataset came from a theoretical distribution such as normal or exponential
-if the dots fall along a straight line, the dataset plausibly follows the chosen theoretical distribution, e.g. normal (its quantiles are comparable to the theoretical quantiles)

6
Q

trend in this context

A

from time series - increase or decrease of data over time

7
Q

What can you detrend and when?

A

-response
-predictors

when you’re using a factor-based model (regression, SVM, etc.)

8
Q

How to detrend?

A

-factor by factor
-1 dimensional regression - usually y = a0 + a1x
detrended price = actual price - (a0 + a1x)
(subtract the trend line estimate from the real value)
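
The detrending step can be sketched in plain Python as a one-dimensional least-squares fit followed by a subtraction. The function name `detrend` and the price series are made up for illustration, with x taken to be the time index.

```python
def detrend(times, values):
    """Fit y = a0 + a1*x by least squares and subtract the fitted line."""
    n = len(times)
    mean_x = sum(times) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in times)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(times, values))
    a1 = sxy / sxx                   # slope of the trend line
    a0 = mean_y - a1 * mean_x        # intercept of the trend line
    detrended = [y - (a0 + a1 * x) for x, y in zip(times, values)]
    return a0, a1, detrended

times = [0, 1, 2, 3, 4]
prices = [10.0, 12.5, 13.0, 16.5, 18.0]   # made-up upward-trending series
a0, a1, detrended = detrend(times, prices)
print(a0, a1, detrended)
```

The same function would be applied separately to the response and to each predictor, which is what "factor by factor" means here.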

9
Q

factor by factor Detrending approach…

A

-simple
-works well to remove trend effects for time series data
-helpful in factor based analysis

10
Q

What can PCA (principal component analysis) help you do?

A

-choose which predictors to use
-find which predictors are highly correlated

11
Q

What 2 things does PCA do?

A

PCA transforms the data
-removes correlations within the data
-ranks the new coordinates by importance (amount of variance)

concentrate on the first n principal components
-reduces the effect of randomness
-earlier principal components are likely to have a higher signal to noise ratio

12
Q

PCA- linear transformation equation

A

X: initial matrix of data; x_{ij} is the jth factor of data point i
-scale such that each factor has mean = 0

find all of the eigenvectors of X^T X
-V: matrix of eigenvectors (sorted by eigenvalue, largest first)
-V = [v_1 v_2 …], where v_j is the jth eigenvector of X^T X

PCA - linear transformation
-first component is Xv_1, second component is Xv_2, etc.
-kth new factor value for the ith data point: t_{ik} = \sum_{j=1}^{m} x_{ij} v_{jk}
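
A rough NumPy sketch of these steps, assuming a small synthetic dataset with two strongly correlated factors. The function name `pca_transform` and the data are illustrative, not part of the card.

```python
import numpy as np

def pca_transform(X):
    """Center X, then project it onto the eigenvectors of X^T X."""
    Xc = X - X.mean(axis=0)                   # each factor now has mean 0
    eigvals, V = np.linalg.eigh(Xc.T @ Xc)    # X^T X is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1]         # sort by eigenvalue, largest first
    eigvals, V = eigvals[order], V[:, order]
    T = Xc @ V                                # t_ik = sum_j x_ij * v_jk
    return T, V, eigvals

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(size=200)])  # correlated factors

T, V, eigvals = pca_transform(X)
print(eigvals)              # the first eigenvalue dominates
print(T[:, 0] @ T[:, 1])    # close to 0: the new coordinates are uncorrelated
```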

13
Q

What does PCA - linear transformation eliminate?

A

it eliminates correlation between factors

14
Q

How can you get fewer variables with PCA - linear transformation?

A

only include first n principal components in your model

15
Q

Can you use a kernel for non-linear pca?

A

yes
-similar to svm

16
Q

PCA - regression coefficient

A

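a_j = \sum_{k=1}^{L} b_k v_{jk}

where b_k is the regression coefficient on the kth principal component (L components used), v_{jk} is the jth entry of the kth eigenvector, and a_j is the implied coefficient on original factor j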
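
A hedged NumPy sketch of this back-transformation: regress the response on the first L principal components, then map the coefficients b back to coefficients a on the original (centered) factors. The synthetic data, the choice L = 2, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=100)     # make factor 3 track factor 1
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

Xc = X - X.mean(axis=0)                              # center each factor
eigvals, V = np.linalg.eigh(Xc.T @ Xc)
V = V[:, np.argsort(eigvals)[::-1]]                  # largest eigenvalue first

L = 2                                                # keep the first L components
T = Xc @ V[:, :L]                                    # principal component scores

# Ordinary least squares of y on an intercept plus the L components
design = np.column_stack([np.ones(len(y)), T])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b = coef[0], coef[1:]

a = V[:, :L] @ b                                     # a_j = sum_k b_k * v_jk
print(a)
```

Predictions made from the components (`b0 + T @ b`) and from the original centered factors (`b0 + Xc @ a`) are the same, which is why the model can still be read in terms of the original factors.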

17
Q

pca summary

A

-high dimensional and correlated data
-attempts to remove these correlations and rank the new coordinates by importance (variance)
-results can still be expressed in terms of the original factor space

18
Q

eigenvalue and eigenvectors

A

given lambda, solve Av = lambda v to find the corresponding eigenvector v

-v is a vector such that Av = lambda v
v: eigenvector of A
lambda: eigenvalue of A
-to find the eigenvalues, solve det(A - lambda I) = 0
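
For a 2x2 matrix, both steps can be worked by hand, and the sketch below follows that route: solve the characteristic equation for lambda, then read an eigenvector off the first row of (A - lambda I)v = 0. The helper `eigen_2x2` is illustrative and only handles real eigenvalues with a nonzero off-diagonal entry b.

```python
import math

def eigen_2x2(a, b, c, d):
    """Eigenvalues and eigenvectors of [[a, b], [c, d]] (real case, b != 0)."""
    trace, det = a + d, a * d - b * c
    disc = trace ** 2 - 4 * det      # det(A - lambda I) = lambda^2 - trace*lambda + det
    if disc < 0 or b == 0:
        raise ValueError("sketch only handles real eigenvalues with b != 0")
    root = math.sqrt(disc)
    lambdas = [(trace + root) / 2, (trace - root) / 2]
    # First row of (A - lambda I) v = 0: (a - lambda) v1 + b v2 = 0, so v = (b, lambda - a)
    vectors = [(b, lam - a) for lam in lambdas]
    return lambdas, vectors

lambdas, vectors = eigen_2x2(2, 1, 1, 2)
print(lambdas)    # eigenvalues of [[2, 1], [1, 2]]
print(vectors)    # one eigenvector per eigenvalue (any nonzero multiple also works)
```

For larger matrices a numerical routine such as `numpy.linalg.eigh` (for symmetric matrices like X^T X) is the usual choice.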

19
Q

How are eigenvalue and eigenvectors important for PCA?

A

-1st step of PCA is to find all eigenvectors v1…vn of X(transpose)X

-find principal components
-multiply X by the eigenvectors
-Xv1, Xv2, …, Xvn are the principal components
-transformed set of orthogonal coordinate directions

20
Q

Pitfalls of PCA

A

PCA is calculated without any regard for the response variable. It is possible that the response is actually influenced more by directions with less variability, even though PCA favors the directions with higher variability

21
Q

How can pca be wrong?

A

if the dimension that explains the most variance does not help separate between classes
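
A contrived NumPy illustration of this failure mode, assuming two classes that differ only along a low-variance factor. The data is constructed by hand so that the first principal component carries almost all the variance but no class information.

```python
import numpy as np

x1 = np.linspace(-10, 10, 50)                         # high-variance factor, same for both classes
class_a = np.column_stack([x1, np.full(50, 1.0)])     # class A sits at x2 = +1
class_b = np.column_stack([x1, np.full(50, -1.0)])    # class B sits at x2 = -1
X = np.vstack([class_a, class_b])
labels = np.array([0] * 50 + [1] * 50)

Xc = X - X.mean(axis=0)
eigvals, V = np.linalg.eigh(Xc.T @ Xc)
V = V[:, np.argsort(eigvals)[::-1]]                   # largest eigenvalue first
T = Xc @ V

# Distance between the class means along each principal component
gap_pc1 = abs(T[labels == 0, 0].mean() - T[labels == 1, 0].mean())
gap_pc2 = abs(T[labels == 0, 1].mean() - T[labels == 1, 1].mean())
print(gap_pc1, gap_pc2)   # the classes overlap on PC1 and separate only on PC2
```

Keeping only the first component here would throw away the one direction that distinguishes the classes.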

22
Q

How to make qq plot

A

To draw a Quantile-Quantile (Q-Q) plot, you can follow these steps:

-Collect the Data: Gather the dataset for which you want to create the Q-Q plot. Ensure that the data are numerical and represent a random sample from the population of interest.
-Sort the Data: Arrange the data in ascending order, so that the ith smallest value can be paired with the ith theoretical quantile.
-Choose a Theoretical Distribution: Determine the theoretical distribution against which you want to compare your dataset. Common choices include the normal distribution and the exponential distribution.
-Calculate Theoretical Quantiles: Compute the quantiles for the chosen theoretical distribution. For example, if you’re comparing against a normal distribution, you would use the inverse cumulative distribution function (CDF) of the normal distribution to find the expected quantiles.
-Plotting: Plot the sorted dataset values on the x-axis and the corresponding theoretical quantiles on the y-axis, so that each point (x, y) represents a pair of observed and expected values. Compare the points against a straight reference line to visually inspect the relationship between the dataset and the theoretical distribution.

Interpretation of Q-Q plot
If the points on the plot fall approximately along a straight line, it suggests that your dataset follows the assumed distribution.
Deviations from the straight line indicate departures from the assumed distribution, requiring further investigation.
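
The point pairs for a normal Q-Q plot can be computed with the standard library alone, as in the sketch below. The function name `normal_qq_points`, the sample values, and the plotting positions (i - 0.5) / n are assumptions; other plotting-position conventions exist.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(sample):
    """Return (observed value, theoretical normal quantile) pairs."""
    ordered = sorted(sample)                         # ascending order
    n = len(ordered)
    dist = NormalDist(mean(ordered), stdev(ordered)) # normal fitted to the sample
    # Plotting positions (i - 0.5) / n avoid probabilities of exactly 0 or 1
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    theoretical = [dist.inv_cdf(p) for p in probs]   # inverse CDF gives the quantiles
    return list(zip(ordered, theoretical))

sample = [4.1, 5.0, 5.9, 4.6, 5.4]                   # made-up measurements
points = normal_qq_points(sample)
for observed, expected in points:
    print(observed, round(expected, 3))
```

Passing these pairs to any scatter-plot routine and adding the line y = x gives the picture described above; points hugging that line suggest the sample is consistent with a normal distribution.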