Classification Flashcards

Learn the general concept of classification.

1
Q

What should we do when the model is overfitting in cross-validation?

A

That means regularization would be necessary: either Lasso / Ridge regularization, or simply getting more data.

2
Q

Is it possible to predict more than two categories?

A

Yes. A classification task with more than two classes is called multi-class classification.

3
Q

What is the definition of a metric?

A

Unlike loss functions, which are often differentiable in the model's parameters and are used to train a machine learning model (via some type of optimization like gradient descent), metrics are used to track and measure a model's performance (during training and testing), and they don't have to be differentiable.

4
Q

How do we deal with missing data?

A

There are various ways to handle missing values. We can use the mean or median if it is a numerical column, and the mode if it is a categorical column. There are fancier methods, like iterative imputers, as well.
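
As a minimal sketch (the column names are hypothetical), scikit-learn's SimpleImputer covers the mean/median/mode strategies:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values in both column types.
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["NY", "LA", np.nan, "NY"]})

# Mean (or median) for the numerical column.
df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])
# Most frequent value (mode) for the categorical column.
df[["city"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]])
print(df)
```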

5
Q

How do you deal with unbalanced data in classification problems?

A

There are several ways to handle imbalance in the data. We can use resampling methods like oversampling or undersampling. We can also try methods like SMOTE or ADASYN.
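
A hedged sketch of oversampling with SMOTE, assuming the imbalanced-learn (imblearn) package is installed; the data here is synthetic:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data: roughly a 9:1 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# SMOTE synthesizes new minority-class samples until the classes are balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))
```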

6
Q

Does a categorical variable need normalization/standardization?

A

In general, a categorical variable will never have a normal distribution. Simply code dichotomous variables as 0/1 (or 1/2). So there is no need for standardizing, as it wouldn't make much sense.

7
Q

How do we figure out the optimal threshold in a linear classifier?

A

The default threshold is 0.5; however, depending on the problem at hand, we can adjust it. For example, if correctly identifying one class is more important than correctly identifying the other, or if the two classes are unbalanced, we can adjust the threshold to meet the needs. When changing the threshold, there is a compromise between precision and recall. The precision-recall curve can be used to determine the appropriate threshold.
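
As an illustrative sketch (picking the threshold that maximizes F1 is one reasonable criterion among many), scikit-learn's precision_recall_curve makes the scan straightforward:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(random_state=0)
probs = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
# precision/recall have one more entry than thresholds, hence f1[:-1].
best_threshold = thresholds[np.argmax(f1[:-1])]
print("best threshold:", best_threshold)
```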

8
Q

What type of variable is ordinal?

A

An ordinal variable is a categorical variable with an ordered set of possible values. Ordinal variables fall somewhere between categorical and quantitative variables.

9
Q

How do you adjust the threshold to reach the appropriate sensitivity if there are more than two categories?

A

We can employ a one-versus-all strategy: divide the multi-class dataset into a set of binary classification problems, and adjust the threshold for each binary classifier.

10
Q

What is gamma in machine learning?

A

Gamma is a hyperparameter that must be specified before training the model (it appears, for example, in SVMs with an RBF kernel). Gamma determines the amount of curvature in the decision boundary: a high gamma means more curvature, while a low gamma means less curvature.
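
A minimal sketch of the effect using scikit-learn's RBF-kernel SVC (the gamma values are arbitrary):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.2, random_state=0)
for gamma in (0.1, 1, 100):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    # A very high gamma bends the boundary around individual points
    # (training accuracy rises, often a sign of overfitting).
    print(gamma, clf.score(X, y))
```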

11
Q

Is the covariance matrix symmetric?

A

Any covariance matrix is symmetric and positive semi-definite, with the variances on the main diagonal (i.e., the covariance of each element with itself). To completely characterize two-dimensional variation, a matrix is required.
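
A quick NumPy check of these properties on random data:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))  # 100 samples, 3 variables
C = np.cov(X, rowvar=False)                         # 3x3 covariance matrix

print(np.allclose(C, C.T))                          # True: symmetric
print(np.diag(C))                                   # variances on the diagonal
print(np.all(np.linalg.eigvalsh(C) >= -1e-10))      # eigenvalues >= 0: PSD
```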

12
Q

What is a maximum a posteriori (MAP) hypothesis?

A

Maximum a Posteriori (MAP) is a Bayesian-based strategy to estimate a distribution and model parameters that best describe an observed dataset. MAP amounts to calculating the conditional probability of observing the data given a model, weighted by a prior probability or belief about the model.

13
Q

What is the difference between the false positive and the false negative?

A

When a researcher determines something is true when it is actually wrong, this is known as a false positive (also called a type I error); a “false alarm” is a term for a false positive. A false negative (also called a type II error) occurs when something is declared false when it is actually true.

14
Q

What are one-vs-rest and one-vs-all?

A

One-vs-rest (OvR for short, sometimes known as one-vs-all or OvA) is a heuristic technique for multi-class classification using binary classification methods. A binary classifier is trained on each binary classification task, and predictions are made using the most confident model.
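
A minimal sketch with scikit-learn's OneVsRestClassifier on a three-class dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes -> 3 binary problems
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))        # 3 underlying binary classifiers
print(ovr.predict(X[:5]))          # each prediction uses the most confident one
```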

15
Q

What about the non-diagonal terms of the covariance?

A

In the covariance table, the off-diagonal values are different from zero. This indicates the presence of redundancy in the data; in other words, there is a certain amount of correlation between variables. Such a matrix, with non-zero off-diagonal values, is called a “non-diagonal” matrix.

16
Q

Why is 3.33% the misclassification rate?

A

If you were to simply predict “no default” on this dataset, then since only 3.33% of the records belong to the bad (default) population, the error would be only 3.33%.

17
Q

Is there a method for fine-tuning the threshold?

A

We can experiment with different threshold values to see which one best separates the data. It varies from case to case; precision and recall have an inverse relationship.

18
Q

What is AUC-ROC?

A

AUC represents the degree or measure of separability, whereas ROC is a probability curve. Together they indicate how well the model can distinguish between classes: the better the model predicts 0 classes as 0 and 1 classes as 1, the higher the AUC.
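
A small sketch computing AUC from predicted probabilities with scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print(roc_auc_score(y_te, probs))  # 1.0 = perfect separability, 0.5 = random
```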

19
Q

How do you ensure that you are not overfitting a model?

A

Keep the model simpler to remove some of the noise in the training data; use cross-validation techniques such as k-fold cross-validation; and use regularization techniques such as LASSO.
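
A sketch combining two of these ideas, k-fold cross-validation and L1 (LASSO-style) regularization, in scikit-learn; a large gap between the training score and the cross-validation score is the overfitting warning sign:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(random_state=0)
model = LogisticRegression(penalty="l1", solver="liblinear")  # L1 regularization

print(cross_val_score(model, X, y, cv=5).mean())  # 5-fold CV estimate
print(model.fit(X, y).score(X, y))                # training score, for the gap
```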

20
Q

Is it possible to use PCA?

A

We can use PCA, but we will lose the interpretability of the variables.
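
A minimal PCA sketch; the components are linear mixtures of the original features, which is exactly why per-variable interpretability is lost:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)             # 4 features -> 2 components
print(pca.explained_variance_ratio_)     # variance retained per component
print(pca.components_)                   # each component mixes all 4 features
```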

21
Q

What are Type I and Type II errors? When is a Type I error committed, and how might you avoid committing one?

A

If your statistical test was significant while the null hypothesis is actually true, you would have committed a Type I error; in other words, you found a significant result merely due to chance. The flip side of this issue is committing a Type II error: failing to reject a false null hypothesis.

22
Q

How do you verify causation?

A

The best technique to find causal relationships is to use randomized experiments. Once you've found a correlation, you can test for causation by doing experiments in which you “control the other variables and measure the difference.” We can apply the following analyses to determine causation with a product: hypothesis testing and A/B/n experiments.

23
Q

In what situations is bootstrapping not applicable?

A

There are several, mostly esoteric, conditions when bootstrapping is not appropriate, such as when the population variance is infinite, or when the population values are discontinuous at the median. And there are various conditions where tweaks to the bootstrapping process are necessary to adjust for bias.

24
Q

How do you deal with highly imbalanced data?

A

Approaches to deal with an imbalanced dataset include choosing proper evaluation metrics (the accuracy of a classifier, i.e., the total number of correct predictions divided by the total number of predictions, can be misleading here), resampling (oversampling and undersampling), SMOTE, BalancedBaggingClassifier, and threshold moving.

25
Q

Is clustering an unsupervised learning method?

A

Clustering is an unsupervised method that works on datasets where neither the outcome (target) variable nor the relationship between the observations is known, i.e., unlabeled data.

26
Q

What is regularization?

A

Regularization is a technique used to reduce errors by fitting the function on the given training set while avoiding overfitting. The commonly used regularization techniques are L1 regularization and L2 regularization.

27
Q

What is the best technique for dealing with heavily imbalanced datasets?

A

Resampling is a widely adopted technique for dealing with highly unbalanced datasets. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).

28
Q

How can we use statistical imputation for missing values in machine learning?

A

A straightforward and popular approach to data imputation is to use statistical methods to estimate a value for a column from the values that are present, then replace all missing values in the column with that estimated statistic.

29
Q

Does adding more features prevent overfitting?

A

Adding new features allows us to create more expressive models that are better suited to the training data, which helps against underfitting. However, if too many new features are added, the model may end up overfitting the training set.

30
Q

How is a CNN different from conventional machine learning?

A

The fundamental difference between a convolutional neural network (CNN) and conventional machine learning is that, rather than using hand-crafted features such as SIFT and HoG, a CNN can automatically learn features from data (images) and produce scores as its output.

31
Q

What are the challenges facing microfinance institutions?

A

Microfinance has a number of obstacles, including higher interest rates than mainstream banks, widespread reliance, over-indebtedness, insufficient investment validation, and a lack of understanding of financial services in the economy, to name a few.

32
Q

How are statistics and machine learning different?

A

The objectives of machine learning and statistics differ significantly. Machine learning models are created with the goal of making the most precise predictions feasible, while the purpose of statistical models is to make inferences about the relationships between variables. Statistics is the mathematical study of data.

33
Q

Is machine learning numerical analysis?

A

Machine learning has recently grown in popularity as a method of teaching computers to learn from data.Many machine learning methods are built on the foundation of numerical analysis.

34
Q

What should we take into consideration when analyzing the values of RMSE, MAE, and MAPE to compare the model performance of the train and test dataset?

A

MAE, MAPE, and RMSE are metrics to evaluate model performance. MAE is the mean absolute error, which gives the average of the absolute values of the errors made by the model. MAPE is the mean absolute percentage error; it gives the error committed by the model as a percentage. RMSE is the Root Mean Squared Error, the square root of the average squared error. RMSE is often preferred because it gives the error in the same unit as the target variable, but it is good practice to check all the metrics to get the overall picture.
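
A sketch computing the three metrics with scikit-learn on hypothetical values (note that mean_absolute_percentage_error returns a fraction, not a percentage):

```python
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

y_true = [100, 150, 200, 250]
y_pred = [110, 140, 210, 240]

print("MAE :", mean_absolute_error(y_true, y_pred))             # 10.0
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))  # ~0.064
print("RMSE:", mean_squared_error(y_true, y_pred) ** 0.5)       # 10.0
```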

35
Q

In the loan default case study what is the meaning of default = No?

A

When a bank provides a loan to a customer, there is a chance that the customer will not be able to pay back the loan amount. In the loan default case study, default = No means the customer has not defaulted and has paid back the loan on time.

36
Q

While evaluating a classification model, how do we determine the threshold?

A

To do the performance analysis of a classification model, we need to set a threshold on the predicted probability of the corresponding outcome. In general, this value is 0.5, which means that if the probability is greater than 0.5, a certain outcome will be assigned to that record. But thresholds can also be tuned using performance metrics and precision-recall curves. The threshold value at which the model performs best should be selected.

37
Q

Are overfitting and underfitting applicable for both linear and non-linear classifiers?

A

Overfitting and underfitting are independent of the type of classifier being used. If a classification model is complex enough to capture the noise in the data, it is overfitting; if it is too simple to capture even the existing pattern in the data, it is underfitting. Hence both linear and non-linear classifiers can overfit and underfit.

38
Q

If a dataset does not have NaN in its records, can it have other types of mistakes?

A

Yes, apart from missing data, a dataset can have data-input errors, for example a negative value for a time variable, and it can have duplicate records, i.e., a few records might be repeated. These are a few common inconsistencies that can be present in the data.

39
Q

Can we do the error analysis for a linear classifier?

A

To do the error analysis, we have the option of a confusion matrix for a classification problem, and it can be used for a linear classifier too. It contains both types of errors committed by a model.

40
Q

If type 1 and type 2 errors go down, does it mean the classifier is incorrect?

A

In the confusion matrix, if both type 1 and type 2 errors are going down, it indicates that the model is performing well. Lower error values mean the model is correct and can be expected to do well on unseen data.

41
Q

Can we automate packages that run different functions, look at the different errors, and provide multiple options? Is there such a thing?

A

Yes, when we create the confusion matrix, it calculates its components: TN (True Negative), TP (True Positive), FN (False Negative), and FP (False Positive). We can also print the classification report for the model, which gives the different types of errors made by the model.
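
A minimal sketch with hypothetical labels, showing both outputs in scikit-learn:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # actual labels (hypothetical)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # model predictions (hypothetical)

print(confusion_matrix(y_true, y_pred))       # [[TN, FP], [FN, TP]]
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
```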

42
Q

In a classification problem, if the output is m-ary, would we end up with several error types?

A

No, the general errors for a classification problem are type I and type II only. If we are solving an m-ary problem, there will be multiple combinations of correct and incorrect outputs in the predicted and actual data, but they are essentially type I or type II errors.

43
Q

How do outliers affect classifiers?

A

Outliers are generally not good for a classification model. If the outliers are high in count, it may be suitable to create a separate class for them; but if they are small in count, they may lead to misclassification.
Note that outliers might represent real-world observations and should not be removed without analyzing them.

44
Q

Can a categorical variable be used in the input feature vector?

A

Yes, categorical features can be included in the input feature set of a classification problem. The output feature is bound to be categorical, but the input features can be only continuous, only categorical, or a combination of both.

45
Q

Can we know about good and bad predictions before the classification?

A

While solving a classification problem, we cannot know the counts of correct and incorrect predictions without making predictions on the unseen data. So it can be known only after prediction.

46
Q

Is it possible that a certain type of error in classification is more costly than the other type?

A

In a classification problem, it is possible that one of the error types is more costly than the other. For example, in the loan default case, the false negatives are those customers who are predicted not to default but actually will, which results in monetary losses for the bank. The false positives are those who are predicted to default but actually will not; such customers are a loss of opportunity for the bank. So the costs of making these errors are different, and the bank can decide which loss is more important to minimize, depending on the problem at hand.

47
Q

Explain the confusion matrix. Is it made over the training or the testing data?

A

While solving a classification problem, we need to do a performance analysis of the model. For this, we predict the outcomes on the test set of the data, so for the test set we have both the predicted outcome and the actual outcome. Using these two, the confusion matrix is prepared with its four components (in a binary classification problem): FN (False Negative), FP (False Positive), TN (True Negative), and TP (True Positive).
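
A small sketch unpacking the four components from scikit-learn's confusion_matrix (the labels are hypothetical):

```python
from sklearn.metrics import confusion_matrix

y_test = [0, 0, 1, 1, 1, 0, 1, 0]  # actual outcomes on the test set
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # predicted outcomes

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```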

48
Q

In a classification problem how to decide where to put the partition line?

A

In a classification problem, the model itself decides the position of the partition line. In the case of a linear classifier, it will put a straight line between the two classes that ensures the minimum misclassification error, while a non-linear classifier puts a non-linear partition curve through the data to segregate it best.

49
Q

What is the T that you are using?

A

The symbol T used with any vector or matrix represents its transpose.

50
Q

What is a convex versus non-convex curve?

A

A convex curve is a curve that resembles a quadratic plot, while a non-convex curve is any plot that does not resemble a quadratic plot. In convex optimization there is one and only one solution, the global optimum; in non-convex optimization the solution may be a local minimum, because the function contains multiple local minima in addition to the global one.

51
Q

What is an F-1 score?

A

In a classification problem, the F-1 score is one of the performance assessment measures for the models. Mathematically it is equal to the harmonic mean of the precision and recall of the model.
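
A worked example of the harmonic-mean formula with arbitrary precision and recall values:

```python
precision, recall = 0.75, 0.60
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 2 * 0.45 / 1.35 = 0.666...
```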

52
Q

Is the validation set the same as the test set?

A

A validation set is the set of data used to validate the model trained on the training set, while a test set is used to evaluate the final model. Performance on the validation set is used to improve the model, while performance on the test set is used to assess the model against unseen data.

53
Q

Can stratified sampling be used?

A

A stratified sample is one that contains the same ratio of output labels in each of the training and testing sets of the data. This is especially useful in the presence of imbalanced data.
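
A sketch using train_test_split's stratify argument to preserve the class ratio in both splits:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
print(Counter(y_tr), Counter(y_te))  # the ~9:1 ratio appears in both sets
```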

54
Q

Do we tune our hyperparameters to the training set or the validation set? Shouldn’t we choose k based on the error on the training set, and not the validation set?

A

One can do that, but the problem is that if we tune based on training-set performance, there is a strong chance we may overfit the model. We will find the k that performs best on the training data, and in the process we will miss the point that model performance should generalize to unseen data.

55
Q

Can we say scaling is the reverse of t-SNE?

A

Scaling is done to project a set of features from their existing range to a fixed, different range, while in t-SNE the existing distribution is projected to a lower dimension. So scaling should not be considered the reverse of t-SNE.

56
Q

Is the goal of classification to make the type 1 and type 2 errors zero?

A

Yes, but it might not always be possible with real-world datasets. It depends on the business case: most of the time the costs of committing these errors are different, so it is not always necessary to make both types of errors zero.

57
Q

Are training and testing part of building an ML model?

A

Yes, while building a machine learning model it is necessary to train the model. Once the model is trained, it is tested to ensure that it performs at a good level and can work well even on unseen data.

58
Q

Should we ever filter out outliers?

A

We should not throw out outliers without analyzing them. They might represent imbalance or trends in the real-world market; for example, income is most of the time a skewed variable, but all extreme points might not be outliers. Outliers should be dropped only after the EDA part.

59
Q

Are the calculations for distance and weight the same thing?

A

No, they are not the same. Calculating weights involves the parameters, while calculating distances does not involve parameters; it just needs the data associated with the features. Also, weights are estimated by using optimization techniques, while distance is just an algebraic calculation.

60
Q

What is the difference between the R-square calculation for two variables and the correlation calculation?

A

The correlation coefficient is a measure that determines the degree to which the movement of two different variables is associated. The correlation coefficient is calculated by dividing the covariance between two variables by the product of the standard deviations of each variable.
The coefficient of determination, R-Squared, shows how much of the variation of the dependent variable (y) can be explained by our model. In general, R-Squared values range from 0 to 1 and are commonly stated as percentages from 0% to 100%. An R-squared of 100% means that all movements of the dependent variable are completely explained by movements in the independent variable(s).
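
For simple linear regression with a single predictor, R-squared equals the squared correlation coefficient; a quick numerical check on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)   # noisy linear relationship

r = np.corrcoef(x, y)[0, 1]                   # correlation coefficient
y_hat = LinearRegression().fit(x.reshape(-1, 1), y).predict(x.reshape(-1, 1))
print(r ** 2, r2_score(y, y_hat))             # the two values coincide
```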

61
Q

Can we use classification algorithms for non-binary target variables as well?

A

Yes, the majority of classification algorithms work for non-binary target variables as well. Such a classification problem, where the target variable has more than two categories, is called a multi-class classification problem.

62
Q

Should all independent variables be numeric or binary? Can it be other categories like Business Type?

A

The independent variables can be of any type - numerical or categorical. But we have to encode the categorical variables into numbers before passing them to the algorithm. For example, yes or no can be encoded as 0 or 1.

63
Q

Is it fine if the number of samples for one category is very low in number as compared to another category?

A

No, but in practice you would find such datasets because, historically, very few people default on a bank loan. When the number of samples in one category of the target variable is much higher than the number of samples in the other category, it is called data imbalance. Ideally, we want the data to be balanced, but that does not happen most of the time, and we need to deal with it while working on a problem.

64
Q

Does confounding between two variables have any effect?

A

This can happen, and it is fine. It would be more problematic to leave variables out of the problem just to avoid confounding between two variables.

65
Q

Is one-hot encoding better than label encoding, or vice versa?

A

One-hot encoding creates additional dummy features based on the number of unique values in the categorical feature: every unique value in the category is added as a feature. We can apply one-hot encoding when the categorical feature is not ordinal. In label encoding, each label is assigned a unique integer; we can apply label encoding when the categorical feature is ordinal (low, medium, high).
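
A minimal sketch of both encodings in pandas (the column names and category order are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "SF"],          # nominal feature
                   "size": ["low", "high", "medium"]})  # ordinal feature

one_hot = pd.get_dummies(df["city"])           # one dummy column per value
order = {"low": 0, "medium": 1, "high": 2}     # explicit ordinal mapping
df["size_encoded"] = df["size"].map(order)

print(one_hot)
print(df)
```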

66
Q

How do you define the threshold?

A

Generally, the default threshold is 0.5 but we can change it depending on the problem at hand. For example, if identifying one variable correctly is more important than identifying the other variable, or two classes are imbalanced, then we can change the threshold as per the requirement. There is a tradeoff between precision and recall while changing the threshold. We can use the precision-recall curve to identify the appropriate threshold.

67
Q

Can some misclassifications be more tolerable than others?

A

Yes; ideally we want no wrong classifications, but in reality this is very difficult to achieve. So if we have a binary classification problem (with two kinds of misclassifications), we prioritize and try to minimize the errors in the more critical class, without having too many misclassifications of the other variety either. In loan default prediction, for example, you would ideally be cautious, and hence the more critical error concerns those customers wrongly classified as safe, but who are actually potential default risks. At the same time, the model cannot get too cautious, as the bank would then simply deny loans to most applicants, and that would represent too much of an opportunity loss. A balance needs to be achieved, and the exact threshold of that balance varies from one business to another.

68
Q

Is the confusion matrix built on the training dataset or validation dataset?

A

Ideally, it is built on the validation data to assess the performance of the model. But we can build it on the training as well to keep track of the model performance on the training data.

69
Q

Is the prior a constant or a distribution? How can we calculate it?

A

The “prior” probability is a single number. Hence, it is a constant. For each class, it is equal to the number of samples belonging to a class divided by the total number of samples.
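
A tiny sketch with hypothetical labels:

```python
from collections import Counter

y = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]  # hypothetical class labels
priors = {k: v / len(y) for k, v in Counter(y).items()}
print(priors)  # {0: 0.8, 1: 0.2}
```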

70
Q

Most of the time the variables are not normal. What should we do?

A

Yes, most variables won’t be normal, but normality is just an assumption we make for mathematical ease in order to come up with the classification algorithm. We can build the model, and if the assumption is not satisfied, the predictions will also be bad. If that is the case, we can move on and try some other algorithm.

71
Q

How does multiplying with P(X|Y=k) and dividing by P(X) convert the prior probability to posterior?

A

The formula comes from a famous rule called Bayes’ Theorem. Intuitively, the posterior probability is the revised or updated probability of an event occurring after taking new information into consideration. Here, the probability distribution of the independent variables conditioned on each class, i.e. P(X|Y=k), is the new information.
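
For reference, Bayes’ Theorem written out for class k is:

P(Y=k | X) = P(X | Y=k) * P(Y=k) / P(X)

where P(Y=k) is the prior, P(X | Y=k) is the likelihood of the observed data under class k, P(X) is a normalizing constant, and P(Y=k | X) is the posterior.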

72
Q

In the loan default dataset for example, can you artificially increase the number of defaults (create observations that are similar to the defaults) until 50% of your observations are defaults?

A

Yes, it is possible to artificially increase the proportion of an underrepresented class inside your dataset. There are many sampling techniques to do this, e.g., SMOTE oversampling, where many synthetic observations are created for the minority class. You can also try changing the threshold, which would have a similar effect to class balancing.

73
Q

When does the cost balancing happen, at the time of training or validation?

A

It is an iterative process: we build the model, analyze the cost of the predictions, and tune the model hyperparameters accordingly. If tuning the hyperparameters does not work, we can try some other algorithm.

74
Q

Is there a way to change the objective function to precision?

A

The objective function is the function we minimize to get the best parameters for the model, while precision is an evaluation metric: we check the performance of the model after training.