Introduction Flashcards

1
Q

What is a categorical variable?

A

Qualitative variables are referred to as categorical variables.

2
Q

What is the logic behind classification techniques?

A

We predict the probability that an observation belongs to each category of a qualitative (categorical) variable.

3
Q

What is variable selection?

A

Determining which predictors are associated with the response, in order to fit a single model involving only those predictors.

4
Q

How do you determine which model is best?

A
  1. Mallows' Cp
  2. AIC
  3. BIC
  4. Adjusted R²
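As a minimal sketch (assuming a least-squares setting; the function and argument names are illustrative), these criteria can be computed from the residual sum of squares:

    import numpy as np

    def selection_criteria(rss, n, d, sigma2, tss):
        # rss: residual sum of squares of the candidate model
        # n: number of observations; d: number of predictors used
        # sigma2: estimate of the error variance; tss: total sum of squares
        cp = (rss + 2 * d * sigma2) / n                       # Mallows' Cp
        aic = (rss + 2 * d * sigma2) / (n * sigma2)           # AIC (up to constants)
        bic = (rss + np.log(n) * d * sigma2) / n              # BIC (up to constants)
        adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))    # Adjusted R²
        return cp, aic, bic, adj_r2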
5
Q

What are the two types of resampling methods?

A
  1. Cross-validation
  2. Bootstrap
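A minimal bootstrap sketch using sklearn.utils.resample (the toy data and the choice of 1000 replicates are illustrative):

    import numpy as np
    from sklearn.utils import resample

    x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=200)  # toy sample

    # Resample with replacement 1000 times and recompute the statistic each time.
    boot_means = [resample(x, replace=True).mean() for _ in range(1000)]
    print("bootstrap SE of the mean:", np.std(boot_means))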
6
Q

LOOCV

A

Leave-One-Out Cross-Validation.
Here we use n - 1 observations for training the model and 1 observation for testing, so overall the model is fit n times.

It produces the same result every time: there is no randomness in the training and validation sets. However, it is computationally expensive.
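A minimal LOOCV sketch with scikit-learn (the diabetes data is just a convenient built-in example):

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = load_diabetes(return_X_y=True)

    # One fit per observation: n models in total, no randomness in the splits.
    scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    print("LOOCV estimate of test MSE:", -scores.mean())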

7
Q

k-fold Cross-Validation

A

The whole data set is divided into k folds; the model is fit on k - 1 folds and validated on the remaining fold, repeated k times.
LOOCV is a special case of k-fold CV, where k = n.
Generally k = 5 or 10.
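The same estimate with 5-fold CV, again a sketch on a built-in data set:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_diabetes(return_X_y=True)

    # Fit on 4 folds, validate on the held-out fold, repeated 5 times.
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                             scoring="neg_mean_squared_error")
    print("5-fold CV estimate of test MSE:", -scores.mean())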

8
Q

Which gives lower bias: LOOCV or k-fold CV?

A

LOOCV,

because each training set contains n - 1 observations, meaning the model is fit on almost the entire data set.

9
Q

Which gives lower variance: LOOCV or k-fold CV?

A

k-fold CV,

because in LOOCV every fit uses almost the same observations, so the n fitted models' outputs are highly (positively) correlated with each other, whereas in k-fold CV the outputs are only somewhat correlated.

The mean of highly correlated quantities has higher variance than the mean of quantities that are less correlated. Therefore, LOOCV has higher variance than k-fold CV.

10
Q

Bias

A

The inability of the model to capture the true relationship in the data.

11
Q

Variance

A

The change in the model's performance when it is trained on different data sets.

12
Q

PCA

A

It is a feature extraction technique.

It is also a dimensionality reduction technique: it transforms higher-dimensional data into lower-dimensional data while retaining as much of the variability in the data as possible.
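A minimal sketch with scikit-learn (iris is just a convenient 4-dimensional example):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)

    # Project the 4-dimensional data onto the first 2 principal components.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    print(X.shape, "->", X_reduced.shape)                # (150, 4) -> (150, 2)
    print("variance explained:", pca.explained_variance_ratio_)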

13
Q

First principal component

A

It is the line that is as close as possible to the data: it minimizes the sum of squared perpendicular distances between each point and the line.

It is the normalized linear combination of the features that has the largest variance.

14
Q

Second principal component

A

It is the linear combination of the variables that is uncorrelated with the first component (and, subject to that constraint, has the largest variance).

The first two principal components of a data set span the plane that is as close to the observations as possible in terms of average squared Euclidean distance.

15
Q

What is high dimensional data?

A

When p > n, i.e. when the number of predictors exceeds the number of observations.

16
Q

Can model evaluation metrics like Cp, AIC, BIC, and adjusted R² be used on high-dimensional data?

A

No.

Indirect performance metrics like Cp, AIC, and BIC are not appropriate in the high-dimensional setting: estimating σ² is problematic (the usual estimate can be zero when p > n), and the RSS itself can be driven to zero, suggesting a spuriously perfect fit.

It is also easy to obtain a model with an adjusted R² value of 1.

17
Q

Is regularization effective in high-dimensional settings?

A

Yes. Regularization and shrinkage methods (e.g. ridge regression and the lasso) play a key role in high-dimensional settings.
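For instance, the lasso (one example of a regularization method) stays well behaved on simulated p > n data; the data-generating setup below is illustrative:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 50, 200                                    # high-dimensional: p > n
    X = rng.normal(size=(n, p))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)    # only 2 true signals

    # The L1 penalty sets most coefficients exactly to zero.
    lasso = Lasso(alpha=0.1).fit(X, y)
    print("non-zero coefficients:", np.sum(lasso.coef_ != 0), "of", p)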

18
Q

What happens to the test error as the dimension increases?

A

It generally increases, unless the additional features are truly associated with the response.

19
Q

Curse of dimensionality

A

The quality of the model does not always improve as the number of predictor variables increases.

Adding variables that are truly associated with the response will improve the fitted model, but adding noise features that are not truly associated with the response will deteriorate the quality of the model and lead to overfitting.

20
Q

Undersampling

A

Randomly sampling the majority class to reduce its number of data points to match the minority class.
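A minimal sketch with pandas and sklearn.utils.resample (the frame and its 900/100 imbalance are made up for illustration):

    import pandas as pd
    from sklearn.utils import resample

    df = pd.DataFrame({"x": range(1000),
                       "label": [0] * 900 + [1] * 100})
    majority = df[df.label == 0]
    minority = df[df.label == 1]

    # Randomly sample the majority class down to the minority-class size.
    majority_down = resample(majority, replace=False,
                             n_samples=len(minority), random_state=0)
    balanced = pd.concat([majority_down, minority])
    print(balanced.label.value_counts())              # 100 and 100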

21
Q

Oversampling

A

Duplicating the data points of the minority class until it matches the majority class, in the case of an imbalanced dataset.
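The mirror image of the undersampling sketch above, on the same made-up 900/100 frame:

    import pandas as pd
    from sklearn.utils import resample

    df = pd.DataFrame({"x": range(1000),
                       "label": [0] * 900 + [1] * 100})
    majority = df[df.label == 0]
    minority = df[df.label == 1]

    # Duplicate minority rows (sampling with replacement) until the
    # minority class matches the majority-class size.
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=0)
    balanced = pd.concat([majority, minority_up])
    print(balanced.label.value_counts())              # 900 and 900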

22
Q

Advantages and disadvantages of undersampling

A

Advantages:
1. Handles class imbalance
2. Faster Training

Disadvantages:
1. Data loss
2. Sampling bias (mitigate by using random sampling)

23
Q

Advantages and disadvantages of Oversampling

A

Advantages:
1. Handles class imbalance

Disadvantages:
1. Duplication of data may cause overfitting

24
Q

SMOTE

A

Synthetic Minority Oversampling Technique:
Uses interpolation instead of duplication.

  • A KNN model is fit on the minority observations to find the k nearest neighbors.
  • Randomly select a minority data point, then randomly select one of its neighbors (for interpolation).
  • Take the difference between the sample and the selected neighbor.
  • Multiply the difference by a random number between 0 and 1.
  • Add this scaled difference to the sample to generate a new synthetic example in feature space.
  • Repeat with further points and neighbors until the minority class equals the majority class.
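A numpy sketch of the interpolation step above (the function name smote_sample and the toy minority data are illustrative; in practice a library such as imbalanced-learn is typically used):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_sample(X_min, k=5, seed=0):
        # Generate one synthetic minority point by SMOTE-style interpolation.
        rng = np.random.default_rng(seed)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: a point is its own neighbor
        i = rng.integers(len(X_min))                         # random minority point
        _, idx = nn.kneighbors(X_min[i:i + 1])
        j = rng.choice(idx[0][1:])                           # random neighbor, skipping self
        diff = X_min[j] - X_min[i]                           # difference vector
        return X_min[i] + rng.random() * diff                # scale by U(0, 1) and add

    X_min = np.random.default_rng(1).normal(size=(20, 2))    # toy minority class
    print(smote_sample(X_min))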
25
Q

Disadvantage of SMOTE

A
  1. Does not handle categorical data well
  2. Computational complexity due to KNN
  3. Sensitive to outliers
  4. Depends on the choice of k
  5. The oversampled data may not represent the actual distribution of the minority class.
26
Q

Balanced Random Forest

A

Similar to undersampling, except that a fresh random sample of the majority class is drawn for each tree, so no majority-class data is permanently discarded.

27
Q

Cost Sensitive learning

A

A type of learning that takes misclassification costs into consideration. It is one way to tackle class imbalance.

  1. Class weights - a hyperparameter in sklearn models (class_weight), where we can change the weight given to each class.
  2. Custom loss function
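A minimal class-weights sketch in scikit-learn (make_classification just provides a toy 90/10 imbalance):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                               random_state=0)       # 90/10 imbalance

    # class_weight="balanced" reweights classes inversely to their frequencies,
    # so misclassifying the rare class costs more in the loss.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)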
28
Q

Advantage of PCA

A
  1. Faster execution of algorithms
  2. Visualization
29
Q

What is the relation b/w the no. of PCs and the no. of predictors p?

A

There are at most min(n - 1, p) distinct principal components; when n > p, the number of PCs equals the number of predictors p.

30
Q

Why is MAD not used in PCA instead of variance?

A

Because the mean absolute deviation (MAD) is not differentiable at 0, which makes the optimization much harder than maximizing variance.

31
Q

How to find the optimum no. of PCs?

A

We can use an elbow (scree) plot.

However, the optimum number of PCs depends on the problem statement and the data set. If the goal is data visualization, we can start with a small number of PCs and increase it to get more information about the data.
And if the problem is, say, principal component regression, we can use cross-validation to find the optimum number of PCs.
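A sketch of an elbow (scree) plot built from the cumulative explained variance (iris is just a convenient example):

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    pca = PCA().fit(X)                  # keep all components

    # Look for the point where the curve flattens out.
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    plt.plot(range(1, len(cumvar) + 1), cumvar, marker="o")
    plt.xlabel("number of principal components")
    plt.ylabel("cumulative variance explained")
    plt.show()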

32
Q

When does PCA not work?

A
  1. Spherical data
  2. When the data follows some nonlinear pattern (e.g. sin(x), x², log(x))
33
Q

Do we need to scale the data before performing PCA?

A

Yes, the data is scaled to mean zero and unit standard deviation. This is because features may have different units.

Different features may also have different ranges, and it is not desirable for features with larger values to influence the principal components more.
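A minimal sketch of scaling before PCA as a pipeline (the wine data is used simply because its features have very different scales):

    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, _ = load_wine(return_X_y=True)

    # Standardize first so no single large-valued feature dominates the PCs.
    pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
    X_reduced = pipe.fit_transform(X)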

34
Q

Preliminary questions before EDA?

A
  1. How big is your data?
  2. What does your data look like?
  3. What are the data types of the columns?
  4. Are there any missing values in the data?
  5. Are there any duplicate values in the data?
  6. How does the data look mathematically (summary statistics)?
  7. Is there any correlation b/w the columns?
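A pandas sketch of these checks (the file name data.csv is hypothetical):

    import pandas as pd

    df = pd.read_csv("data.csv")           # hypothetical file

    print(df.shape)                        # 1. how big is the data?
    print(df.head())                       # 2. what does it look like?
    print(df.dtypes)                       # 3. data types of the columns
    print(df.isnull().sum())               # 4. missing values per column
    print(df.duplicated().sum())           # 5. duplicate rows
    print(df.describe())                   # 6. summary statistics
    print(df.corr(numeric_only=True))      # 7. correlation b/w the columns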
35
Q

EDA

A
  1. Univariate Analysis -
    For categorical data - Count plot, Pie chart
    For numerical data - Histogram, KDE, Box plot
  2. Multivariate Analysis -
    Numerical - Numerical - Scatter plot, Line plot
    Numerical - Categorical - Bar plot, Distplot (KDE), Box plot
    Categorical - Categorical - Heat map, Cluster map, Pair plot
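A seaborn sketch of a few of these plots (tips is seaborn's built-in demo data set):

    import seaborn as sns

    tips = sns.load_dataset("tips")

    sns.countplot(data=tips, x="day")                      # univariate, categorical
    sns.histplot(data=tips, x="total_bill", kde=True)      # univariate, numerical
    sns.scatterplot(data=tips, x="total_bill", y="tip")    # numerical - numerical
    sns.boxplot(data=tips, x="day", y="total_bill")        # numerical - categorical
    sns.pairplot(tips)                                     # pairwise overview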
36
Q

What is EDA?

A

Exploratory Data Analysis

To visualize and analyze the trends and patterns present in the data, and the relationships between the variables, in order to derive meaningful insights.

EDA includes missing-value, duplicate-value, and outlier identification, correlation analysis, and univariate, bivariate & multivariate analysis.

37
Q

Feature Engineering

A
  1. Feature Transformation
    a. Missing value Imputation
    b. Handling categorical features
    c. Outlier detection
    d. Feature scaling
  2. Feature Construction
    Creating new feature based on domain knowledge
  3. Feature Selection
    a. Forward selection
    b. Backward selection
    c. Mixed selection
  4. Feature Extraction
    a. PCA
    b. t-SNE
    c. LDA
38
Q

Feature scaling

A
  1. Normalization
  2. Standardization (Z-score normalization)
    Mean = 0; SD = 1
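A minimal sketch contrasting the two with scikit-learn (the 3 x 2 array is toy data):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

    X_norm = MinMaxScaler().fit_transform(X)       # normalization: squash to [0, 1]
    X_std = StandardScaler().fit_transform(X)      # standardization: mean 0, SD 1
    print(X_std.mean(axis=0), X_std.std(axis=0))   # ~[0, 0] and [1, 1]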
39
Q

When to use and when not to use standardization?

A

When to use:
1. K-means clustering
2. KNN
3. Gradient descent
4. ANN (uses GD)
5. PCA

When not to use:
1. Tree-based algorithms