Advanced Data Preparation Flashcards
Normality Assumption
-some models assume the data is normally distributed
-when we don't have normality (the data doesn't fit a normal distribution), results are biased because the assumption is violated
EX: estimating height - normally distributed
-weight - smaller weights have less variance, higher weights have more variance (a wider spread)
heteroscedasticity
unequal variance
variance definition and equation
a statistical measurement of the spread between numbers in a dataset. It measures how far each number in the set is from the mean, and thus from every other number in the set
σ² = ( Σ_{i=1}^{n} (x_i − x̄)² ) / n
where:
x_i = each value in the data set
x̄ = the mean of all values in the data set
n = number of values in the data set
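A minimal numpy sketch of the formula above, on made-up toy data (the numbers are assumed for illustration):

```python
import numpy as np

data = np.array([4.0, 7.0, 6.0, 3.0, 5.0])  # toy data, assumed for illustration

# Population variance: mean squared deviation from the mean
mean = data.mean()
variance = ((data - mean) ** 2).sum() / len(data)

# np.var computes the same population formula by default (ddof=0)
assert np.isclose(variance, np.var(data))
print(variance)  # 2.0
```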
What is the Box-Cox transformation? What does it do and what is it for?
-a family of power transformations (the log transform is the special case λ = 0)
-transforms data before you try to fit it to a model - how you deal with heteroscedasticity
What does it do?
-stretches out the smaller range to enlarge its variability
-shrinks the larger range to reduce its variability
-goal: find the best value of lambda in t(y) = (y^λ − 1) / λ (with t(y) = log(y) when λ = 0)
-software can find λ for you
-check whether you need the transformation (QQ plot)
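A minimal sketch of letting software find λ, assuming scipy is available and using synthetic skewed data (Box-Cox requires positive values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # synthetic right-skewed data

# scipy searches for the lambda that makes the transformed data most normal
y_transformed, best_lambda = stats.boxcox(y)
print(f"best lambda: {best_lambda:.3f}")

# By hand for a fixed lambda: t(y) = (y**lam - 1) / lam, or np.log(y) when lam == 0
```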
what is a qq plot used for?
quantile-quantile plot
-helps us assess whether data came from a theoretical distribution such as the normal or exponential
-if the dots fall on a straight line, the dataset follows the chosen distribution (its quantiles are comparable to the theoretical distribution's quantiles)
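A minimal sketch using scipy's probplot (matplotlib assumed available), which plots sample quantiles against theoretical normal quantiles:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=200)  # synthetic data for the demo

# Sample quantiles vs. theoretical normal quantiles; a straight line => normal
stats.probplot(sample, dist="norm", plot=plt)
plt.title("QQ plot vs. normal distribution")
plt.show()
```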
trend in this context
from time series analysis - an increase or decrease in the data over time
What can you detrend and when?
-response
-predictors
when you're using a factor-based model (regression, SVM, etc.)
How to detrend?
-factor by factor
-fit a 1-dimensional regression against time, usually y = a0 + a1·x
-detrended price = actual price − (a0 + a1·x)
(subtract the trend-line estimate from the real value)
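A minimal numpy sketch of this, with a made-up trending price series (the data and coefficients are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100, dtype=float)                        # time index (x)
price = 50.0 + 0.3 * t + rng.normal(0, 2.0, size=100)  # synthetic upward-trending series

# Fit the 1-dimensional trend line y = a0 + a1*x
a1, a0 = np.polyfit(t, price, deg=1)
trend = a0 + a1 * t

# Detrended price = actual price minus the trend-line estimate
detrended = price - trend
print(detrended.mean())  # ~0 once the trend is removed
```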
factor-by-factor detrending approach…
-simple
-works well to remove trend effects in time series data
-helpful in factor based analysis
What can PCA (principal component analysis) help you do?
-choose which predictors to use
-find which predictors are highly correlated
What 2 things does PCA do?
pca transforms data
-removes correlations within the data
-ranks coordinates by importance (variance explained)
concentrate on the first n principal components
-reduces the effect of randomness
-earlier principal components are likely to have a higher signal-to-noise ratio
PCA- linear transformation equation
X: initial matrix of data; x_ij is the jth factor of data point i
-center each factor (column) so that its mean = 0
-find all of the eigenvectors of XᵀX
-V: matrix of eigenvectors, sorted by eigenvalue (largest first)
-V = [v1 v2 …], where v_j is the jth eigenvector of XᵀX
PCA- linear transformation
-first component is Xv1, second component is Xv2, etc.
-kth new factor value for the ith data point: t_ik = Σ_{j=1}^{m} x_ij · v_jk
What does PCA's linear transformation eliminate?
it eliminates correlation between factors
How can you get fewer variables with the linear transformation?
only include the first n principal components in your model
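A minimal numpy sketch of the steps above - center X, eigendecompose XᵀX, project, then keep the first n components (the data matrix is synthetic and assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # toy data matrix: 200 points, 5 factors
X = X - X.mean(axis=0)          # center each factor so its mean is 0

# Eigenvectors of X^T X, sorted by eigenvalue (largest variance first)
eigenvalues, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigenvalues)[::-1]
V = V[:, order]

# t_ik = sum_j x_ij * v_jk  ->  the principal-component scores
T = X @ V
print(np.round(np.corrcoef(T, rowvar=False), 3))  # off-diagonals ~0: correlations removed

# Fewer variables: keep only the first n principal components
n = 2
T_reduced = T[:, :n]
```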
Can you use a kernel for non-linear PCA?
yes
-similar to the kernel trick in SVMs
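A minimal sketch of kernel PCA, assuming scikit-learn is available (the dataset and kernel parameters are illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Concentric circles: structure that linear PCA cannot separate
X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# An RBF kernel maps the data nonlinearly before extracting components,
# analogous to the kernel trick in SVMs
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
X_kpca = kpca.fit_transform(X)
```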