Introduction Flashcards

1
Q

What is a categorical variable?

A

Qualitative variables are referred to as categorical variables.

2
Q

What is the logic behind classification techniques?

A

We predict the probability that an observation belongs to each category of a qualitative (categorical) variable.

3
Q

What is variable selection?

A

Determining which predictors are associated with the response, in order to fit a single model involving only those predictors.

4
Q

How do you determine which model is best?

A
  1. Mallows' Cp
  2. AIC
  3. BIC
  4. Adjusted R²
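As a minimal sketch (assuming a least-squares setting; the function and argument names are illustrative), these criteria can be computed from the residual sum of squares:

    import numpy as np

    def selection_criteria(rss, n, d, sigma2, tss):
        # rss: residual sum of squares of the candidate model
        # n: number of observations; d: number of predictors used
        # sigma2: estimate of the error variance; tss: total sum of squares
        cp = (rss + 2 * d * sigma2) / n                       # Mallows' Cp
        aic = (rss + 2 * d * sigma2) / (n * sigma2)           # AIC (up to constants)
        bic = (rss + np.log(n) * d * sigma2) / n              # BIC (up to constants)
        adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))    # Adjusted R²
        return cp, aic, bic, adj_r2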
5
Q

What are the two types of resampling methods?

A
  1. Cross-validation
  2. Bootstrap
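A minimal bootstrap sketch using sklearn.utils.resample (the toy data and the choice of 1000 replicates are illustrative):

    import numpy as np
    from sklearn.utils import resample

    x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=200)  # toy sample

    # Resample with replacement 1000 times and recompute the statistic each time.
    boot_means = [resample(x, replace=True).mean() for _ in range(1000)]
    print("bootstrap SE of the mean:", np.std(boot_means))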
6
Q

LOOCV

A

Leave-One-Out Cross-Validation.
Here we use n - 1 observations for training the model and 1 observation for testing, so overall the model is fit n times.

It produces the same result every time: there is no randomness in the training and validation sets. However, it is computationally expensive.
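A minimal LOOCV sketch with scikit-learn (the diabetes data is just a convenient built-in example):

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = load_diabetes(return_X_y=True)

    # One fit per observation: n models in total, no randomness in the splits.
    scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    print("LOOCV estimate of test MSE:", -scores.mean())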

7
Q

k-fold Cross-Validation

A

The whole data set is divided into k folds; the model is fit on k - 1 folds and validated on the remaining fold, repeated k times.
LOOCV is a special case of k-fold CV, where k = n.
Generally k = 5 or 10.
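The same estimate with 5-fold CV, again a sketch on a built-in data set:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_diabetes(return_X_y=True)

    # Fit on 4 folds, validate on the held-out fold, repeated 5 times.
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                             scoring="neg_mean_squared_error")
    print("5-fold CV estimate of test MSE:", -scores.mean())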

8
Q

Which gives lower bias: LOOCV or k-fold CV?

A

LOOCV,

because each training set contains n - 1 observations, meaning the model is fit on almost the entire data set.

9
Q

Which gives lower variance: LOOCV or k-fold CV?

A

k-fold CV,

because in LOOCV every fit uses almost the same observations, so the n fitted models' outputs are highly (positively) correlated with each other, whereas in k-fold CV the outputs are only somewhat correlated.

The mean of highly correlated quantities has higher variance than the mean of quantities that are less correlated. Therefore, LOOCV has higher variance than k-fold CV.

10
Q

Bias

A

The inability of the model to capture the true relationship in the data.

11
Q

Variance

A

The change in the model's performance when it is trained on different data sets.

12
Q

PCA

A

It is a feature extraction technique.

It is also a dimensionality reduction technique: it transforms higher-dimensional data into lower-dimensional data while retaining as much of the variability in the data as possible.
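A minimal sketch with scikit-learn (iris is just a convenient 4-dimensional example):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)

    # Project the 4-dimensional data onto the first 2 principal components.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    print(X.shape, "->", X_reduced.shape)                # (150, 4) -> (150, 2)
    print("variance explained:", pca.explained_variance_ratio_)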

13
Q

First principal component

A

It is the line that is as close as possible to the data: it minimizes the sum of squared perpendicular distances between each point and the line.

It is the normalized linear combination of the features that has the largest variance.

14
Q

Second principal component

A

It is the linear combination of the variables that is uncorrelated with the first component (and, subject to that constraint, has the largest variance).

The first two principal components of a data set span the plane that is as close to the observations as possible in terms of average squared Euclidean distance.

15
Q

What is high dimensional data?

A

When p > n, i.e. when the number of predictors exceeds the number of observations.

16
Q

Can model evaluation metrics like Cp, AIC, BIC, and adjusted R² be used on high-dimensional data?

A

No.

Indirect performance metrics like Cp, AIC, and BIC are not appropriate in the high-dimensional setting: estimating σ² is problematic (the usual estimate can be zero when p > n), and the RSS itself can be driven to zero, suggesting a spuriously perfect fit.

It is also easy to obtain a model with an adjusted R² value of 1.

17
Q

Is regularization effective in high-dimensional settings?

A

Yes. Regularization and shrinkage methods (e.g. ridge regression and the lasso) play a key role in high-dimensional settings.
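For instance, the lasso (one example of a regularization method) stays well behaved on simulated p > n data; the data-generating setup below is illustrative:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 50, 200                                    # high-dimensional: p > n
    X = rng.normal(size=(n, p))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)    # only 2 true signals

    # The L1 penalty sets most coefficients exactly to zero.
    lasso = Lasso(alpha=0.1).fit(X, y)
    print("non-zero coefficients:", np.sum(lasso.coef_ != 0), "of", p)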

18
Q

What happens to the test error as the dimension increases?

A

It generally increases, unless the additional features are truly associated with the response.

19
Q

Curse of dimensionality

A

The quality of the model does not always improve as the number of predictor variables increases.

Adding variables that are truly associated with the response will improve the fitted model, but adding noise features that are not truly associated with the response will deteriorate the quality of the model and lead to overfitting.

20
Q

Undersampling

A

Randomly sampling the majority class to reduce its number of data points to match the minority class.
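A minimal sketch with pandas and sklearn.utils.resample (the frame and its 900/100 imbalance are made up for illustration):

    import pandas as pd
    from sklearn.utils import resample

    df = pd.DataFrame({"x": range(1000),
                       "label": [0] * 900 + [1] * 100})
    majority = df[df.label == 0]
    minority = df[df.label == 1]

    # Randomly sample the majority class down to the minority-class size.
    majority_down = resample(majority, replace=False,
                             n_samples=len(minority), random_state=0)
    balanced = pd.concat([majority_down, minority])
    print(balanced.label.value_counts())              # 100 and 100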

21
Q

Oversampling

A

Duplicating the data points of the minority class until it matches the majority class, in the case of an imbalanced dataset.
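The mirror image of the undersampling sketch above, on the same made-up 900/100 frame:

    import pandas as pd
    from sklearn.utils import resample

    df = pd.DataFrame({"x": range(1000),
                       "label": [0] * 900 + [1] * 100})
    majority = df[df.label == 0]
    minority = df[df.label == 1]

    # Duplicate minority rows (sampling with replacement) until the
    # minority class matches the majority-class size.
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=0)
    balanced = pd.concat([majority, minority_up])
    print(balanced.label.value_counts())              # 900 and 900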

22
Q

Advantages and disadvantages of undersampling

A

Advantages:
1. Handles class imbalance
2. Faster Training

Disadvantages:
1. Data loss
2. Sampling bias (mitigate by using random sampling)

23
Q

Advantages and disadvantages of Oversampling

A

Advantages:
1. Handles class imbalance

Disadvantages:
1. Duplication of data may cause overfitting

24
Q

SMOTE

A

Synthetic Minority Oversampling Technique:
Uses interpolation instead of duplication.

  • A KNN model is fit on the minority observations to find the k nearest neighbors.
  • Randomly select a minority data point, then randomly select one of its neighbors (for interpolation).
  • Take the difference between the sample and the selected neighbor.
  • Multiply the difference by a random number between 0 and 1.
  • Add this scaled difference to the sample to generate a new synthetic example in feature space.
  • Repeat with further points and neighbors until the minority class equals the majority class.
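A numpy sketch of the interpolation step above (the function name smote_sample and the toy minority data are illustrative; in practice a library such as imbalanced-learn is typically used):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_sample(X_min, k=5, seed=0):
        # Generate one synthetic minority point by SMOTE-style interpolation.
        rng = np.random.default_rng(seed)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: a point is its own neighbor
        i = rng.integers(len(X_min))                         # random minority point
        _, idx = nn.kneighbors(X_min[i:i + 1])
        j = rng.choice(idx[0][1:])                           # random neighbor, skipping self
        diff = X_min[j] - X_min[i]                           # difference vector
        return X_min[i] + rng.random() * diff                # scale by U(0, 1) and add

    X_min = np.random.default_rng(1).normal(size=(20, 2))    # toy minority class
    print(smote_sample(X_min))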
25
Q

Disadvantage of SMOTE

A
  1. Does not handle categorical data well
  2. Computational complexity due to KNN
  3. Sensitive to outliers
  4. Depends on the choice of k
  5. The oversampled data may not represent the actual distribution of the minority class.
26
Q

Balanced Random Forest

A

Similar to undersampling, except that a fresh random sample of the majority class is drawn for each tree, so no majority-class data is permanently discarded.

27
Q

Cost Sensitive learning

A

A type of learning that takes misclassification costs into consideration. It is one way to tackle class imbalance.

  1. Class weights - a hyperparameter in sklearn models (class_weight), where we can change the weight given to each class.
  2. Custom loss function
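A minimal class-weights sketch in scikit-learn (make_classification just provides a toy 90/10 imbalance):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                               random_state=0)       # 90/10 imbalance

    # class_weight="balanced" reweights classes inversely to their frequencies,
    # so misclassifying the rare class costs more in the loss.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)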
28
Q

Advantage of PCA

A
  1. Faster execution of algorithms
  2. Visualization
29
Q

What is the relation b/w the no. of PCs and the no. of predictors p?

A

There are at most min(n - 1, p) distinct principal components; when n > p, the number of PCs equals the number of predictors p.

30
Q

Why is MAD not used in PCA instead of variance?

A

Because the mean absolute deviation (MAD) is not differentiable at 0, which makes the optimization much harder than maximizing variance.

31
Q

How to find the optimum no. of PCs?

A

We can use an elbow (scree) plot.

However, the optimum number of PCs depends on the problem statement and the data set. If the goal is data visualization, we can start with a small number of PCs and increase it to get more information about the data.
And if the problem is, say, principal component regression, we can use cross-validation to find the optimum number of PCs.
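A sketch of an elbow (scree) plot built from the cumulative explained variance (iris is just a convenient example):

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    pca = PCA().fit(X)                  # keep all components

    # Look for the point where the curve flattens out.
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    plt.plot(range(1, len(cumvar) + 1), cumvar, marker="o")
    plt.xlabel("number of principal components")
    plt.ylabel("cumulative variance explained")
    plt.show()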

32
Q

When does PCA not work?

A
  1. Spherical data
  2. When the data follows some nonlinear pattern (e.g. sin(x), x², log(x))
33
Q

Do we need to scale the data before performing PCA?

A

Yes, the data is scaled to mean zero and unit standard deviation. This is because features may have different units.

Different features may also have different ranges, and it is not desirable for features with larger values to influence the principal components more.
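A minimal sketch of scaling before PCA as a pipeline (the wine data is used simply because its features have very different scales):

    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, _ = load_wine(return_X_y=True)

    # Standardize first so no single large-valued feature dominates the PCs.
    pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
    X_reduced = pipe.fit_transform(X)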

34
Q

Preliminary questions before EDA?

A
  1. How big is your data?
  2. What does your data look like?
  3. What are the data types of the columns?
  4. Are there any missing values in the data?
  5. Are there any duplicate values in the data?
  6. How does the data look mathematically (summary statistics)?
  7. Is there any correlation b/w the columns?
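A pandas sketch of these checks (the file name data.csv is hypothetical):

    import pandas as pd

    df = pd.read_csv("data.csv")           # hypothetical file

    print(df.shape)                        # 1. how big is the data?
    print(df.head())                       # 2. what does it look like?
    print(df.dtypes)                       # 3. data types of the columns
    print(df.isnull().sum())               # 4. missing values per column
    print(df.duplicated().sum())           # 5. duplicate rows
    print(df.describe())                   # 6. summary statistics
    print(df.corr(numeric_only=True))      # 7. correlation b/w the columns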
35
Q

EDA

A
  1. Univariate Analysis -
    For categorical data - Count plot, Pie chart
    For numerical data - Histogram, KDE, Box plot
  2. Multivariate Analysis -
    Numerical - Numerical - Scatter plot, Line plot
    Numerical - Categorical - Bar plot, Distplot (KDE), Box plot
    Categorical - Categorical - Heat map, Cluster map, Pair plot
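A seaborn sketch of a few of these plots (tips is seaborn's built-in demo data set):

    import seaborn as sns

    tips = sns.load_dataset("tips")

    sns.countplot(data=tips, x="day")                      # univariate, categorical
    sns.histplot(data=tips, x="total_bill", kde=True)      # univariate, numerical
    sns.scatterplot(data=tips, x="total_bill", y="tip")    # numerical - numerical
    sns.boxplot(data=tips, x="day", y="total_bill")        # numerical - categorical
    sns.pairplot(tips)                                     # pairwise overview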
36
Q

What is EDA?

A

Exploratory Data Analysis

To visualize and analyze the trends and patterns present in the data, and the relationships between the variables, in order to derive meaningful insights.

EDA includes missing-value, duplicate-value, and outlier identification, correlation analysis, and univariate, bivariate & multivariate analysis.

37
Q

Feature Engineering

A
  1. Feature Transformation
    a. Missing value Imputation
    b. Handling categorical features
    c. Outlier detection
    d. Feature scaling
  2. Feature Construction
    Creating new feature based on domain knowledge
  3. Feature Selection
    a. Forward selection
    b. Backward selection
    c. Mixed selection
  4. Feature Extraction
    a. PCA
    b. t-SNE
    c. LDA
38
Q

Feature scaling

A
  1. Normalization
  2. Standardization (Z-score normalization)
    Mean = 0; SD = 1
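A minimal sketch contrasting the two with scikit-learn (the 3 x 2 array is toy data):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

    X_norm = MinMaxScaler().fit_transform(X)       # normalization: squash to [0, 1]
    X_std = StandardScaler().fit_transform(X)      # standardization: mean 0, SD 1
    print(X_std.mean(axis=0), X_std.std(axis=0))   # ~[0, 0] and [1, 1]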
39
Q

When to use and when not to use standardization?

A

When to use:
1. K-means clustering
2. KNN
3. Gradient descent
4. ANN (uses GD)
5. PCA

When not to use:
1. Tree-based algorithms