Classification Flashcards

Learn the general concept of classification.

1
Q

What should we do when the model is overfitting in cross-validation?

A

That means regularization would be necessary: either Lasso / Ridge regularization, or simply getting more data.

2
Q

Is it possible to predict more than two categories?

A

Yes. A classification task with more than two classes is called multi-class classification.

3
Q

What is the definition of a metric?

A

Unlike loss functions, which are often differentiable in the model's parameters and are used to train a machine learning model (via some type of optimization like gradient descent), metrics are used to track and measure a model's performance (during training and testing), and they don't have to be differentiable.

4
Q

How do we deal with missing data?

A

There are various ways to handle missing values. We can use the mean or median if it is a numerical column, and the mode if it is a categorical column. There are fancier methods, like iterative imputers, as well.
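
As a minimal sketch (the column names are hypothetical), scikit-learn's SimpleImputer covers the mean/median/mode strategies:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values in both column types.
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["NY", "LA", np.nan, "NY"]})

# Mean (or median) for the numerical column.
df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])
# Most frequent value (mode) for the categorical column.
df[["city"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]])
print(df)
```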

5
Q

How do you deal with unbalanced data in classification problems?

A

There are several ways to handle imbalance in the data. We can use resampling methods like oversampling or undersampling. We can also try methods like SMOTE or ADASYN.
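
A hedged sketch of oversampling with SMOTE, assuming the imbalanced-learn (imblearn) package is installed; the data here is synthetic:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data: roughly a 9:1 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# SMOTE synthesizes new minority-class samples until the classes are balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))
```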

6
Q

Does a categorical variable need normalization/standardization?

A

In general, a categorical variable will never have a normal distribution. Simply code dichotomous variables as 0/1 (or 1/2). So there is no need for standardizing, as it wouldn't make much sense.

7
Q

How do we figure out the optimal threshold in a linear classifier?

A

The default threshold is 0.5; however, depending on the problem at hand, we can adjust it. For example, if correctly identifying one class is more important than correctly identifying the other, or if the two classes are unbalanced, we can adjust the threshold to meet the needs. When changing the threshold, there is a compromise between precision and recall. The precision-recall curve can be used to determine the appropriate threshold.
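
As an illustrative sketch (picking the threshold that maximizes F1 is one reasonable criterion among many), scikit-learn's precision_recall_curve makes the scan straightforward:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(random_state=0)
probs = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
# precision/recall have one more entry than thresholds, hence f1[:-1].
best_threshold = thresholds[np.argmax(f1[:-1])]
print("best threshold:", best_threshold)
```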

8
Q

What type of variable is ordinal?

A

An ordinal variable is a categorical variable with an ordered set of possible values. Ordinal variables fall somewhere between categorical and quantitative variables.

9
Q

How do you adjust the threshold to reach the appropriate sensitivity if there are more than two categories?

A

We can employ a one-versus-all strategy: divide the multi-class dataset into a set of binary classification problems, and adjust the threshold for each binary classifier.

10
Q

What is gamma in machine learning?

A

Gamma is a hyperparameter that must be specified before training the model (it appears, for example, in SVMs with an RBF kernel). Gamma determines the amount of curvature in the decision boundary: a high gamma means more curvature, while a low gamma means less curvature.
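
A minimal sketch of the effect using scikit-learn's RBF-kernel SVC (the gamma values are arbitrary):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.2, random_state=0)
for gamma in (0.1, 1, 100):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    # A very high gamma bends the boundary around individual points
    # (training accuracy rises, often a sign of overfitting).
    print(gamma, clf.score(X, y))
```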

11
Q

Is the covariance matrix symmetric?

A

Any covariance matrix is symmetric and positive semi-definite, with the variances on the main diagonal (i.e., the covariance of each element with itself). To completely characterize two-dimensional variation, a matrix is required.
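
A quick NumPy check of these properties on random data:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))  # 100 samples, 3 variables
C = np.cov(X, rowvar=False)                         # 3x3 covariance matrix

print(np.allclose(C, C.T))                          # True: symmetric
print(np.diag(C))                                   # variances on the diagonal
print(np.all(np.linalg.eigvalsh(C) >= -1e-10))      # eigenvalues >= 0: PSD
```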

12
Q

What is a maximum a posteriori (MAP) hypothesis?

A

Maximum a Posteriori (MAP) is a Bayesian-based strategy to estimate a distribution and model parameters that best describe an observed dataset. MAP amounts to calculating the conditional probability of observing the data given a model, weighted by a prior probability or belief about the model.

13
Q

What is the difference between the false positive and the false negative?

A

When a researcher determines something is true when it is actually wrong, this is known as a false positive (also called a type I error); a “false alarm” is a term for a false positive. A false negative (also called a type II error) occurs when something is declared false when it is actually true.

14
Q

What are one-vs-rest and one-vs-all?

A

One-vs-rest (OvR for short, sometimes known as one-vs-all or OvA) is a heuristic technique for multi-class classification using binary classification methods. A binary classifier is trained on each binary classification task, and predictions are made using the most confident model.
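
A minimal sketch with scikit-learn's OneVsRestClassifier on a three-class dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes -> 3 binary problems
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))        # 3 underlying binary classifiers
print(ovr.predict(X[:5]))          # each prediction uses the most confident one
```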

15
Q

What about the non-diagonal terms of the covariance?

A

In the covariance table, the off-diagonal values are different from zero. This indicates the presence of redundancy in the data; in other words, there is a certain amount of correlation between variables. Such a matrix, with non-zero off-diagonal values, is called a “non-diagonal” matrix.

16
Q

Why is 3.33% the misclassification rate?

A

If you were to simply predict “no default” on this dataset, then since only 3.33% of the records belong to the bad (default) population, the error would be only 3.33%.

17
Q

Is there a method for fine-tuning the threshold?

A

We can experiment with different threshold values to see which one best separates the data. It varies from case to case; precision and recall have an inverse relationship.

18
Q

What is AUC-ROC?

A

AUC represents the degree or measure of separability, whereas ROC is a probability curve. Together they indicate how well the model can distinguish between classes: the better the model predicts 0 classes as 0 and 1 classes as 1, the higher the AUC.
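
A small sketch computing AUC from predicted probabilities with scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print(roc_auc_score(y_te, probs))  # 1.0 = perfect separability, 0.5 = random
```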

19
Q

How do you ensure that you are not overfitting a model?

A

Keep the model simpler to remove some of the noise in the training data; use cross-validation techniques such as k-fold cross-validation; and use regularization techniques such as LASSO.
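
A sketch combining two of these ideas, k-fold cross-validation and L1 (LASSO-style) regularization, in scikit-learn; a large gap between the training score and the cross-validation score is the overfitting warning sign:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(random_state=0)
model = LogisticRegression(penalty="l1", solver="liblinear")  # L1 regularization

print(cross_val_score(model, X, y, cv=5).mean())  # 5-fold CV estimate
print(model.fit(X, y).score(X, y))                # training score, for the gap
```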

20
Q

Is it possible to use PCA?

A

We can use PCA, but we will lose the interpretability of the variables.
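
A minimal PCA sketch; the components are linear mixtures of the original features, which is exactly why per-variable interpretability is lost:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)             # 4 features -> 2 components
print(pca.explained_variance_ratio_)     # variance retained per component
print(pca.components_)                   # each component mixes all 4 features
```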

21
Q

What are Type I and Type II errors? When is a Type I error committed, and how might you avoid committing one?

A

If your statistical test was significant while the null hypothesis is actually true, you would have committed a Type I error; in other words, you found a significant result merely due to chance. The flip side of this issue is committing a Type II error: failing to reject a false null hypothesis.

22
Q

How do you verify causation?

A

The best technique to find causal relationships is to use randomized experiments. Once you've found a correlation, you can test for causation by doing experiments in which you “control the other variables and measure the difference.” We can apply the following analyses to determine causation with a product: hypothesis testing and A/B/n experiments.

23
Q

In what situations is bootstrapping not applicable?

A

There are several, mostly esoteric, conditions when bootstrapping is not appropriate, such as when the population variance is infinite, or when the population values are discontinuous at the median. And there are various conditions where tweaks to the bootstrapping process are necessary to adjust for bias.

24
Q

How do you deal with highly imbalanced data?

A

Approaches to deal with an imbalanced dataset include choosing proper evaluation metrics (the accuracy of a classifier, i.e., the total number of correct predictions divided by the total number of predictions, can be misleading here), resampling (oversampling and undersampling), SMOTE, BalancedBaggingClassifier, and threshold moving.

25
Q

Is clustering an unsupervised learning method?

A

Clustering is an unsupervised method that works on datasets where neither the outcome (target) variable nor the relationship between the observations is known, i.e., unlabeled data.

26
Q

What is regularization?

A

Regularization is a technique used to reduce errors by fitting the function on the given training set while avoiding overfitting. The commonly used regularization techniques are L1 regularization and L2 regularization.

27
Q

What is the best technique for dealing with heavily imbalanced datasets?

A

Resampling is a widely adopted technique for dealing with highly unbalanced datasets. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).

28
Q

How can we use statistical imputation for missing values in machine learning?

A

A straightforward and popular approach to data imputation is to use statistical methods to estimate a value for a column from the values that are present, then replace all missing values in the column with that estimated statistic.

29
Q

Does adding more features prevent overfitting?

A

Adding new features allows us to create more expressive models that are better suited to the training data, which helps against underfitting. However, if too many new features are added, the model may end up overfitting the training set.

30
Q

How is a CNN different from conventional machine learning?

A

The fundamental difference between a convolutional neural network (CNN) and conventional machine learning is that, rather than using hand-crafted features such as SIFT and HoG, a CNN can automatically learn features from data (images) and produce scores as its output.

31
Q

What are the challenges facing microfinance institutions?

A

Microfinance has a number of obstacles, including higher interest rates than mainstream banks, widespread reliance, over-indebtedness, insufficient investment validation, and a lack of understanding of financial services in the economy, to name a few.

32
Q

How are statistics and machine learning different?

A

The objectives of machine learning and statistics differ significantly. Machine learning models are created with the goal of making the most precise predictions feasible, while the purpose of statistical models is to make inferences about the relationships between variables. Statistics is the mathematical study of data.

33
Q

Is machine learning numerical analysis?

A

Machine learning has recently grown in popularity as a method of teaching computers to learn from data.Many machine learning methods are built on the foundation of numerical analysis.

34
Q

What should we take into consideration when analyzing the values of RMSE, MAE, and MAPE to compare the model performance of the train and test dataset?

A

MAE, MAPE, and RMSE are metrics to evaluate model performance. MAE is the mean absolute error, which gives the average of the absolute values of the errors made by the model. MAPE is the mean absolute percentage error; it gives the error committed by the model as a percentage. RMSE is the Root Mean Squared Error, the square root of the average squared error. RMSE is often preferred because it gives the error in the same unit as the target variable, but it is good practice to check all the metrics to get the overall picture.
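
A sketch computing the three metrics with scikit-learn on hypothetical values (note that mean_absolute_percentage_error returns a fraction, not a percentage):

```python
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

y_true = [100, 150, 200, 250]
y_pred = [110, 140, 210, 240]

print("MAE :", mean_absolute_error(y_true, y_pred))             # 10.0
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))  # ~0.064
print("RMSE:", mean_squared_error(y_true, y_pred) ** 0.5)       # 10.0
```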

35
Q

In the loan default case study what is the meaning of default = No?

A

When a bank provides a loan to a customer, there is a chance that the customer will not be able to pay back the loan amount. In the loan default case study, default = No means the customer has not defaulted and has paid back the loan on time.

36
Q

While evaluating a classification model, how do we determine the threshold?

A

To do the performance analysis of a classification model, we need to set a threshold on the predicted probability of the corresponding outcome. In general, this value is 0.5, which means that if the probability is greater than 0.5, a certain outcome will be assigned to that record. But thresholds can also be tuned using performance metrics and precision-recall curves. The threshold value at which the model performs best should be selected.

37
Q

Are overfitting and underfitting applicable for both linear and non-linear classifiers?

A

Overfitting and underfitting are independent of the type of classifier being used. If a classification model is complex enough to capture the noise in the data, it is overfitting; if it is too simple to capture even the existing pattern in the data, it is underfitting. Hence both linear and non-linear classifiers can overfit and underfit.

38
Q

If a dataset does not have NaN in its records, can it have other types of mistakes?

A

Yes, apart from missing data, a dataset can have data-input errors, for example a negative value for a time variable, and it can have duplicate records, i.e., a few records might be repeated. These are a few common inconsistencies that can be present in the data.

39
Q

Can we do the error analysis for a linear classifier?

A

To do the error analysis, we have the option of a confusion matrix for a classification problem, and it can be used for a linear classifier too. It contains both types of errors committed by a model.

40
Q

If type 1 and type 2 errors go down, does it mean the classifier is incorrect?

A

In the confusion matrix, if both type 1 and type 2 errors are going down, it indicates that the model is performing well. Lower error values mean the model is correct and can be expected to do well on unseen data.

41
Q

Can we automate packages that run different functions, look at the different errors, and provide multiple options? Is there such a thing?

A

Yes, when we create the confusion matrix, it calculates its components: TN (True Negative), TP (True Positive), FN (False Negative), and FP (False Positive). We can also print the classification report for the model, which gives the different types of errors made by the model.
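
A minimal sketch with hypothetical labels, showing both outputs in scikit-learn:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # actual labels (hypothetical)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # model predictions (hypothetical)

print(confusion_matrix(y_true, y_pred))       # [[TN, FP], [FN, TP]]
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
```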

42
Q

In a classification problem, if the output is m-ary, would we end up with several error types?

A

No, the general errors for a classification problem are type I and type II only. If we are solving an m-ary problem, there will be multiple combinations of correct and incorrect outputs in the predicted and actual data, but they are essentially type I or type II errors.

43
Q

How do outliers affect classifiers?

A

Outliers are generally not good for a classification model. If the outliers are high in count, it may be suitable to create a separate class for them; but if they are small in count, they may lead to misclassification.
Note that outliers might represent real-world observations and should not be removed without analyzing them.

44
Q

Can a categorical variable be used in the input feature vector?

A

Yes, categorical features can be included in the input feature set of a classification problem. The output feature is bound to be categorical, but the input features can be only continuous, only categorical, or a combination of both.

45
Q

Can we know about good and bad predictions before the classification?

A

While solving a classification problem, we cannot know the counts of correct and incorrect predictions without making predictions on the unseen data. So it can be known only after prediction.

46
Q

Is it possible that a certain type of error in classification is more costly than the other type?

A

In a classification problem, it is possible that one of the error types is more costly than the other. For example, in the loan default case, the false negatives are those customers who are predicted not to default but actually will, which results in monetary losses for the bank. The false positives are those who are predicted to default but actually will not; such customers are a loss of opportunity for the bank. So the costs of making these errors are different, and the bank can decide which loss is more important to minimize, depending on the problem at hand.

47
Q

Explain the confusion matrix. Is it made over the training or the testing data?

A

While solving a classification problem, we need to do a performance analysis of the model. For this, we predict the outcomes on the test set of the data, so for the test set we have both the predicted outcome and the actual outcome. Using these two, the confusion matrix is prepared with its four components (in a binary classification problem): FN (False Negative), FP (False Positive), TN (True Negative), and TP (True Positive).
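
A small sketch unpacking the four components from scikit-learn's confusion_matrix (the labels are hypothetical):

```python
from sklearn.metrics import confusion_matrix

y_test = [0, 0, 1, 1, 1, 0, 1, 0]  # actual outcomes on the test set
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # predicted outcomes

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```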

48
Q

In a classification problem how to decide where to put the partition line?

A

In a classification problem, the model itself decides the position of the partition line. In the case of a linear classifier, it will put a straight line between the two classes that ensures the minimum misclassification error, while a non-linear classifier puts a non-linear partition curve through the data to segregate it best.

49
Q

What is the T that you are using?

A

The symbol T used with any vector or matrix represents its transpose.

50
Q

What is a convex versus non-convex curve?

A

A convex curve is a curve that resembles a quadratic plot, while a non-convex curve is any plot that does not resemble a quadratic plot. In convex optimization there is one and only one solution, the global optimum; in non-convex optimization the solution may be a local minimum, because the function contains multiple local minima in addition to the global one.

51
Q

What is an F-1 score?

A

In a classification problem, the F-1 score is one of the performance assessment measures for the models. Mathematically it is equal to the harmonic mean of the precision and recall of the model.
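
A worked example of the harmonic-mean formula with arbitrary precision and recall values:

```python
precision, recall = 0.75, 0.60
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 2 * 0.45 / 1.35 = 0.666...
```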

52
Q

Is the validation set the same as the test set?

A

A validation set is the set of data used to validate the model trained on the training set, while a test set is used to evaluate the final model. Performance on the validation set is used to improve the model, while performance on the test set is used to assess the model against unseen data.

53
Q

Can stratified sampling be used?

A

A stratified sample is one that contains the same ratio of output labels in each of the training and testing sets of the data. This is especially useful in the presence of imbalanced data.
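
A sketch using train_test_split's stratify argument to preserve the class ratio in both splits:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
print(Counter(y_tr), Counter(y_te))  # the ~9:1 ratio appears in both sets
```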

54
Q

Do we tune our hyperparameters to the training set or the validation set? Shouldn’t we choose k based on the error on the training set, and not the validation set?

A

One can do that, but the problem is that if we tune based on training-set performance, there is a strong chance we may overfit the model. We will find the k that performs best on the training data, and in the process we will miss the point that model performance should generalize to unseen data.

55
Q

Can we say scaling is the reverse of t-SNE?

A

Scaling is done to project a set of features from their existing range to a fixed, different range, while in t-SNE the existing distribution is projected to a lower dimension. So scaling should not be considered the reverse of t-SNE.

56
Q

Is the goal of classification to make the type 1 and type 2 errors zero?

A

Yes, but it might not always be possible with real-world datasets. It depends on the business case: most of the time the costs of committing these errors are different, so it is not always necessary to make both types of errors zero.

57
Q

Are training and testing part of building an ML model?

A

Yes, while building a machine learning model it is necessary to train the model. Once the model is trained, it is tested to ensure that it performs at a good level and can work well even on unseen data.

58
Q

Should we ever filter out outliers?

A

We should not throw out outliers without analyzing them. They might represent imbalance or trends in the real-world market; for example, income is most of the time a skewed variable, but all extreme points might not be outliers. Outliers should be dropped only after the EDA part.

59
Q

Are the calculations for distance and weight the same thing?

A

No, they are not the same. Calculating weights involves the parameters, while calculating distances does not involve parameters; it just needs the data associated with the features. Also, weights are estimated by using optimization techniques, while distance is just an algebraic calculation.

60
Q

What is the difference between the R-square calculation for two variables and the correlation calculation?

A

The correlation coefficient is a measure that determines the degree to which the movement of two different variables is associated. The correlation coefficient is calculated by dividing the covariance between two variables by the product of the standard deviations of each variable.
The coefficient of determination, R-Squared, shows how much of the variation of the dependent variable (y) can be explained by our model. In general, R-Squared values range from 0 to 1 and are commonly stated as percentages from 0% to 100%. An R-squared of 100% means that all movements of the dependent variable are completely explained by movements in the independent variable(s).
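
For simple linear regression with a single predictor, R-squared equals the squared correlation coefficient; a quick numerical check on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)   # noisy linear relationship

r = np.corrcoef(x, y)[0, 1]                   # correlation coefficient
y_hat = LinearRegression().fit(x.reshape(-1, 1), y).predict(x.reshape(-1, 1))
print(r ** 2, r2_score(y, y_hat))             # the two values coincide
```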

61
Q

Can we use classification algorithms for non-binary target variables as well?

A

Yes, the majority of classification algorithms work for non-binary target variables as well. Such a classification problem, where the target variable has more than two categories, is called a multi-class classification problem.

62
Q

Should all independent variables be numeric or binary? Can it be other categories like Business Type?

A

The independent variables can be of any type - numerical or categorical. But we have to encode the categorical variables into numbers before passing them to the algorithm. For example, yes or no can be encoded as 0 or 1.

63
Q

Is it fine if the number of samples for one category is very low in number as compared to another category?

A

No, but in practice you would find such datasets because, historically, very few people default on a bank loan. When the number of samples in one category of the target variable is much higher than the number of samples in the other category, it is called data imbalance. Ideally, we want the data to be balanced, but that does not happen most of the time, and we need to deal with it while working on a problem.

64
Q

Does confounding between two variables have any effect?

A

This can happen, and it is fine. It would be more problematic to leave variables out of the problem just to avoid confounding between two variables.

65
Q

Is one-hot encoding better than label encoding, or vice versa?

A

One-hot encoding creates additional dummy features based on the number of unique values in the categorical feature: every unique value in the category is added as a feature. We can apply one-hot encoding when the categorical feature is not ordinal. In label encoding, each label is assigned a unique integer; we can apply label encoding when the categorical feature is ordinal (low, medium, high).
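
A minimal sketch of both encodings in pandas (the column names and category order are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "SF"],          # nominal feature
                   "size": ["low", "high", "medium"]})  # ordinal feature

one_hot = pd.get_dummies(df["city"])           # one dummy column per value
order = {"low": 0, "medium": 1, "high": 2}     # explicit ordinal mapping
df["size_encoded"] = df["size"].map(order)

print(one_hot)
print(df)
```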

66
Q

How do you define the threshold?

A

Generally, the default threshold is 0.5 but we can change it depending on the problem at hand. For example, if identifying one variable correctly is more important than identifying the other variable, or two classes are imbalanced, then we can change the threshold as per the requirement. There is a tradeoff between precision and recall while changing the threshold. We can use the precision-recall curve to identify the appropriate threshold.

67
Q

Can some misclassifications be more tolerable than others?

A

Yes; ideally we want no wrong classifications, but in reality this is very difficult to achieve. So if we have a binary classification problem (with two kinds of misclassifications), we prioritize and try to minimize the errors in the more critical class, without having too many misclassifications of the other variety either. In loan default prediction, for example, you would ideally be cautious, and hence the more critical error concerns those customers wrongly classified as safe, but who are actually potential default risks. At the same time, the model cannot get too cautious, as the bank would then simply deny loans to most applicants, and that would represent too much of an opportunity loss. A balance needs to be achieved, and the exact threshold of that balance varies from one business to another.

68
Q

Is the confusion matrix built on the training dataset or validation dataset?

A

Ideally, it is built on the validation data to assess the performance of the model. But we can build it on the training as well to keep track of the model performance on the training data.

69
Q

Is the prior a constant or a distribution? How can we calculate it?

A

The “prior” probability is a single number. Hence, it is a constant. For each class, it is equal to the number of samples belonging to a class divided by the total number of samples.
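
A tiny sketch with hypothetical labels:

```python
from collections import Counter

y = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]  # hypothetical class labels
priors = {k: v / len(y) for k, v in Counter(y).items()}
print(priors)  # {0: 0.8, 1: 0.2}
```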

70
Q

Most of the time the variables are not normal. What should we do?

A

Yes, most variables won’t be normal, but normality is just an assumption we make for mathematical ease in order to come up with the classification algorithm. We can build the model, and if the assumption is not satisfied, the predictions will also be bad. If that is the case, we can move on and try some other algorithm.

71
Q

How does multiplying with P(X|Y=k) and dividing by P(X) convert the prior probability to posterior?

A

The formula comes from a famous rule called Bayes’ Theorem. Intuitively, the posterior probability is the revised or updated probability of an event occurring after taking new information into consideration. Here, the probability distribution of the independent variables conditioned on each class, i.e. P(X|Y=k), is the new information.
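
For reference, Bayes’ Theorem written out for class k is:

P(Y=k | X) = P(X | Y=k) * P(Y=k) / P(X)

where P(Y=k) is the prior, P(X | Y=k) is the likelihood of the observed data under class k, P(X) is a normalizing constant, and P(Y=k | X) is the posterior.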

72
Q

In the loan default dataset for example, can you artificially increase the number of defaults (create observations that are similar to the defaults) until 50% of your observations are defaults?

A

Yes, it is possible to artificially increase the proportion of an underrepresented class inside your dataset. There are many sampling techniques to do this, e.g., SMOTE oversampling, where many synthetic observations are created for the minority class. You can also try changing the threshold, which would have a similar effect to class balancing.

73
Q

When does the cost balancing happen, at the time of training or validation?

A

It is an iterative process: we build the model, analyze the cost of the predictions, and tune the model hyperparameters accordingly. If tuning the hyperparameters does not work, we can try some other algorithm.

74
Q

Is there a way to change the objective function to precision?

A

The objective function is the function we minimize to get the best parameters for the model, while precision is an evaluation metric: we check the performance of the model after training.