Classification Flashcards
impute/imputation
In statistics, imputation is the process of replacing missing data with substituted values.
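A minimal sketch of one simple imputation strategy, replacing missing values with the mean of the observed ones. Mean imputation is only one option among several, and the helper name `impute_mean` and the sample list `ages` are made up for illustration.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [20, None, 30, 40, None]
filled = impute_mean(ages)
print(filled)
```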
confusion matrix
- A cross-tabulation of our model’s predictions against actual values
- A matrix (table) used to measure the performance of a machine learning algorithm
- Rows: actual classes (Ci)
- Columns: predicted classes (Cj)
What are the 4 possible outcomes of a classification task?
True Positive
False Positive
False Negative
True Negative
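The four outcomes can be counted by comparing actual and predicted labels pair by pair. This is a small sketch, assuming binary labels where 1 is the positive class; the function name `confusion_counts` and the sample data are illustrative only.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four outcomes of a binary classification task."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)

# Rows: actual classes (positive, negative); columns: predicted classes.
matrix = [[counts["TP"], counts["FN"]],
          [counts["FP"], counts["TN"]]]
print(counts)
print(matrix)
```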
What is the common choice for the baseline model for a classification problem?
a model that simply predicts the most common class every single time
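As a rough illustration, the baseline can be built by finding the most common class in the training labels and predicting it for every observation. The label lists and the name `baseline_predictions` below are hypothetical.

```python
from collections import Counter


def baseline_predictions(train_labels, n):
    """Predict the most common training class for each of n observations."""
    most_common_class, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_class] * n


train_labels = ["no", "no", "yes", "no", "yes"]
test_labels = ["no", "yes", "no", "no"]

preds = baseline_predictions(train_labels, len(test_labels))
baseline_accuracy = sum(p == a for p, a in zip(preds, test_labels)) / len(test_labels)
print(preds, baseline_accuracy)
```

Any real model should beat this accuracy to be worth using.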
What are the common evaluation metrics for a classification problem/model?
- Accuracy
- Precision
- Recall
- Specificity
- F1 score
- ROC curve
What is accuracy?
the number of times we predicted correctly divided by the total number of observations
What is precision / positive predictive value?
the percentage of the positive predictions we made that are correct: TP / (TP + FP).
What is recall / true positive rate / sensitivity?
the percentage of actual positive cases that we correctly predicted as positive: TP / (TP + FN).
What is specificity / true negative rate?
the percentage of actual negative cases that we correctly predicted as negative: TN / (TN + FP).
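The four metrics above can be computed directly from the confusion-matrix counts. This is a sketch with made-up counts; it assumes every denominator is non-zero, which real code would need to check.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and specificity from the four counts.

    Assumes each denominator is non-zero (both classes occur and at least one
    positive prediction was made).
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,    # correct predictions / all observations
        "precision": tp / (tp + fp),      # correct positives / predicted positives
        "recall": tp / (tp + fn),         # correct positives / actual positives
        "specificity": tn / (tn + fp),    # correct negatives / actual negatives
    }


# Hypothetical counts for 100 observations.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```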
logistic regression
- A regression-based algorithm that is used for classification
- Finds the values of the coefficients that weight each input variable
- Assigns observations to a discrete set of classes
- Predicts discrete outcomes
- Can be binomial (two classes) or multinomial (more than two classes)
- The output is a value between 0 and 1 that represents the probability of one class over the other.
regularized least squares
- A way of solving least squares regression problems
- An extra constraint on the solution, which is called regularization
- It adds a penalty term to the error.
- Available as an argument in LogisticRegression (in scikit-learn, C sets the inverse of the regularization strength)
What are the components of a decision tree?
- root
- condition/internal node
- branches/edges
- decision/leaf
What is a classification tree?
A decision tree that classifies the outcome variable into discrete classes
What is a regression tree?
A decision tree that predicts continuous values, like the price of a house
CART = ?
classification and regression trees
Recursive binary splitting / greedy algorithm
- Consider all the features
- Use a cost function to try and test all different (candidate) split points
- Select the split with the best/lowest cost.
- Make the best predictor/classifier the root node, then repeat the process on each resulting subset
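One greedy step of the procedure above can be sketched as follows. The cost function here is weighted Gini impurity, which is an assumption (the card only says "a cost function"); the names `gini` and `best_split` and the toy data are illustrative. A full tree would call this recursively on the left and right subsets.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Try every feature and candidate threshold; return the lowest-cost split.

    Returns (feature_index, threshold, cost), or None if no valid split exists.
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0])):                  # consider all features
        for t in sorted({r[j] for r in rows}):     # candidate split points
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue                           # split must separate the data
            cost = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or cost < best[2]:     # keep the lowest cost
                best = (j, t, cost)
    return best


# Toy data: feature 0 separates the classes cleanly, feature 1 does not.
rows = [[2.0, 7.0], [3.0, 1.0], [6.0, 8.0], [7.0, 2.0]]
labels = ["a", "a", "b", "b"]
split = best_split(rows, labels)
print(split)
```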
What are Decision Trees?
- A supervised machine learning method, trained on labeled data.
- The training data is used to fit the tree, which learns a decision boundary expressed as a sequence of rules.
- The boundary is then used as a decision rule to classify observations into 2 or more classes.
What does each node represent?
- A splitting point in the decision tree.
- Represents a single input variable (x)
- and a split point on that variable (or a class, at a leaf)
What are the pros of decision tree?
- Simple to understand
- Simple to visualize
- Simple to explain the output
- Requires little data preparation
- Don’t need to encode our target variable
- Perform well for a broad range of problems
What is f1-score?
- The harmonic mean of Recall and Precision
- Gives both metrics equal weight
- Useful when you are looking to optimize for both Recall and Precision
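A short sketch of the harmonic mean, with made-up precision and recall values. Returning 0.0 when both inputs are zero is a convention chosen here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention chosen for this sketch
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.8, 0.5)
arithmetic_mean = (0.8 + 0.5) / 2
print(f"F1: {f1:.3f}, arithmetic mean: {arithmetic_mean:.3f}")
```

The harmonic mean sits below the arithmetic mean whenever the two metrics differ, so a model cannot get a high score by doing well on only one of them.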
What is support?
the number of occurrences of each class in y_true (the actual labels).
What is overfitting?
When a model fits the training data too closely and doesn’t generalize well to new, unseen data.
How to avoid overfitting in a decision tree?
Mechanisms such as
- Pruning.
- Set the minimum number of samples required at a leaf node
- Set the maximum depth
How to handle overfitting?
- Obtain more training data
- Feature engineering
3.
What is Logistic Regression model?
- Maps any real value into a number between 0 and 1, representing the probability that an observation is in the positive class.
How does the threshold in LR model affect the metrics?
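- Decrease the threshold: Recall increases (more observations are predicted positive), usually at the cost of Precision.
- Increase the threshold: Precision generally increases, usually at the cost of Recall.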
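The trade-off can be seen by scoring the same predicted probabilities at two thresholds. The probabilities and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(probs, actual, threshold):
    """Precision and recall when predicting positive for probs >= threshold."""
    predicted = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for a, y in zip(actual, predicted) if a == 1 and y == 1)
    fp = sum(1 for a, y in zip(actual, predicted) if a == 0 and y == 1)
    fn = sum(1 for a, y in zip(actual, predicted) if a == 1 and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


probs = [0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
actual = [1, 1, 0, 1, 1, 0, 0]

low = precision_recall_at(probs, actual, threshold=0.35)
high = precision_recall_at(probs, actual, threshold=0.70)
print("threshold 0.35 (precision, recall):", low)
print("threshold 0.70 (precision, recall):", high)
```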
What is ROC curve?
- Receiver Operating Characteristic Curve
- Summarizes the trade-off between TPR (true positive rate) and FPR (false positive rate) for a predictive model using different probability thresholds
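Each point on the curve is an (FPR, TPR) pair computed at one threshold. A small sketch with invented probabilities; it assumes both classes appear in `actual`, and the threshold list is chosen by hand.

```python
def roc_points(probs, actual, thresholds):
    """Return one (FPR, TPR) pair per threshold. Assumes both classes are present."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for t in thresholds:
        predicted = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for a, y in zip(actual, predicted) if a == 1 and y == 1)
        fp = sum(1 for a, y in zip(actual, predicted) if a == 0 and y == 1)
        points.append((fp / negatives, tp / positives))
    return points


probs = [0.9, 0.7, 0.6, 0.4, 0.2]
actual = [1, 1, 0, 1, 0]

# Thresholds from "predict nothing positive" down to "predict everything positive".
points = roc_points(probs, actual, thresholds=[1.01, 0.65, 0.5, 0.0])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```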
How to calculate baseline prediction?
- For a continuous target: the average value of the dependent variable (for classification, the most common class)
What is encoding?
Transform categorical variables to binary or numeric counterparts
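One common form of encoding is one-hot encoding, where each category becomes its own 0/1 column. This is a hand-rolled sketch for illustration; `one_hot` and the colour values are hypothetical.

```python
def one_hot(values):
    """Turn a list of categories into binary indicator rows.

    Returns (categories, rows), with one column per category in sorted order.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, encoded = one_hot(["red", "green", "red", "blue"])
print(categories)
for row in encoded:
    print(row)
```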
What is the purpose of splitting data?
to avoid overfitting the model to one sample: the model is trained on one part of the data and evaluated on data it has not seen
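A minimal sketch of a random split into training and test sets. The 25% test fraction, the fixed seed and the helper name `split_rows` are assumptions made for this example.

```python
import random


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(8))
print(len(train), len(test))
```

The model is fit on `train` only, and `test` is held back to check how well it generalizes.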