Introduction Flashcards
What is a categorical variable?
Qualitative variables are also referred to as categorical variables.
What is the logic behind classification technique?
We predict the probability that an observation belongs to each category of a qualitative (categorical) variable.
What is variable selection?
Determining which predictors are associated with the response, in order to fit a single model involving only those predictors
How to determine which model is best?
- Mallow’s Cp
- AIC
- BIC
- Adjusted R²
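As a sketch of how these criteria are computed for a least-squares fit (assuming Gaussian errors; `fit_metrics` is a hypothetical helper, not a library function):

```python
import numpy as np

def fit_metrics(X, y):
    """Fit OLS by least squares and return adjusted R^2, AIC and BIC.

    Uses the maximized Gaussian log-likelihood; d counts the fitted
    coefficients (including the intercept)."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    rss = resid @ resid
    tss = ((y - y.mean()) ** 2).sum()
    d = p + 1
    adj_r2 = 1 - (rss / (n - d)) / (tss / (n - 1))
    ll = -n / 2 * (np.log(2 * np.pi * rss / n) + 1)  # max Gaussian log-lik
    aic = 2 * (d + 1) - 2 * ll                       # +1 for the variance
    bic = np.log(n) * (d + 1) - 2 * ll
    return adj_r2, aic, bic

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)
print(fit_metrics(X, y))
```

Smaller AIC/BIC is better; BIC's log(n) penalty is heavier than AIC's once n > 7, so BIC tends to pick smaller models.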
What are the two types of resampling method?
- Cross validation
- Bootstrap
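A minimal sketch of the bootstrap idea, estimating the standard error of a statistic by resampling with replacement (`bootstrap_se` is an illustrative helper name):

```python
import numpy as np

def bootstrap_se(data, stat, B=1000, seed=0):
    """Bootstrap standard error: draw B resamples with replacement
    and take the standard deviation of the statistic across them."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = [stat(rng.choice(data, size=n, replace=True)) for _ in range(B)]
    return np.std(reps, ddof=1)

x = np.random.default_rng(1).normal(size=200)
se = bootstrap_se(x, np.mean)
print(se)  # close to the theoretical sigma/sqrt(n) for the sample mean
```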
LOOCV
Leave One Out Cross Validation
Here we use (n-1) observations to train the model and 1 observation to test it, holding each observation out once. Overall, the model is fit n times.
It produces the same result every time, since there is no randomness in how the training and validation sets are chosen. However, it is computationally expensive.
k-fold Cross validation
The whole dataset is divided into k folds; the model is fit on k-1 folds and validated on the remaining fold, rotating through all k folds.
LOOCV is a special case of k-fold CV, where k = n
Generally k = 5 or 10
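The k-fold procedure (with LOOCV as the k = n special case) can be sketched for least-squares regression as follows; `kfold_mse` is an illustrative helper, not a library API:

```python
import numpy as np

def kfold_mse(X, y, k, seed=0):
    """k-fold cross-validated test MSE for OLS; k = n gives LOOCV."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)               # all indices not in the fold
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(f)), X[f]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errs.append(np.mean((y[f] - Xte @ beta) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.3, size=60)
print(kfold_mse(X, y, 5), kfold_mse(X, y, len(y)))  # 5-fold vs LOOCV
```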
Which gives low bias: LOOCV or k-fold CV?
LOOCV,
because each training set contains n-1 observations, so the model is fit on almost the entire dataset.
Which gives low variance: LOOCV or k-fold CV?
k-fold CV,
because in LOOCV every fit uses almost the same observations, so the n outputs are highly (positively) correlated with each other, whereas in k-fold CV the outputs are only somewhat correlated.
The mean of highly correlated values has higher variance than the mean of values that are less correlated. Therefore, LOOCV has higher variance than k-fold CV.
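This claim can be checked numerically: for n unit-variance variables with pairwise correlation ρ, Var(mean) = (1 + (n-1)ρ)/n, so higher correlation gives a higher-variance mean. A small simulation (assuming an equicorrelated Gaussian; `var_of_mean` is an illustrative helper):

```python
import numpy as np

def var_of_mean(rho, n=10, reps=20000, seed=0):
    """Simulated variance of the mean of n unit-variance variables
    with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    cov = np.full((n, n), rho) + (1 - rho) * np.eye(n)
    draws = rng.multivariate_normal(np.zeros(n), cov, size=reps)
    return draws.mean(axis=1).var()

# Theory: Var(mean) = (1 + (n-1)*rho)/n -> 0.91 for rho=0.9, 0.28 for rho=0.2
print(var_of_mean(0.9), var_of_mean(0.2))
```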
Bias
The inability of the model to truly capture the relationship in the data.
Variance
How much the model's fit would change if it were trained on a different dataset.
PCA
It is a feature extraction technique.
It is a dimension reduction technique. It transforms higher dimensional data to lower dimensional data, explaining the maximum variability in the data.
First principal component
It is the line that is as close as possible to the data.
It minimizes the sum of squared perpendicular distances between each point and the line.
It is the normalized linear combination of the features with the largest variance.
Second principal component
It is the linear combination of the variables with the largest variance among all combinations uncorrelated with the first component.
The first two principal components of a data set span the plane that is as close to the observation as possible in terms of average squared Euclidean distance.
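A minimal numpy sketch of computing the first principal components via the SVD of the centered data matrix (`pca` is a hypothetical helper, not a library call):

```python
import numpy as np

def pca(X, k=2):
    """First k principal components via SVD of the centered data."""
    Xc = X - X.mean(axis=0)                 # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                  # projections onto the components
    var_explained = S[:k] ** 2 / (S ** 2).sum()
    return Vt[:k], scores, var_explained

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
loadings, scores, ve = pca(X, k=2)
print(ve.sum())  # fraction of total variance captured by the first two PCs
```

The loading vectors are orthonormal, which is why the successive components are uncorrelated.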
What is high dimensional data?
When p > n, i.e. the number of predictors p exceeds the number of observations n.