ML Learning Flashcards
List XGBoost benefits
L1/L2 regularisation prevents overfitting in high-dimensional spaces
Built-in handling of missing values
Built-in cross-validation support
Supports early stopping
Option to inspect the evaluation history (learning curve) and pick a different checkpoint
Optimised for multi-core CPUs (parallel tree construction)
Grows deep trees and prunes them back -> more efficient trees at inference time
Supports multiple objective functions
Easy Python interface (a usage sketch follows below)
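A minimal sketch showing several of these options together, assuming the xgboost and scikit-learn packages are installed and a recent xgboost version (1.6 or later, where early_stopping_rounds is a constructor argument); the hyperparameter values are illustrative, not tuned.

```python
# Illustrative XGBoost setup: L1/L2 regularisation, multi-core training, early stopping.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    reg_alpha=0.1,            # L1 regularisation
    reg_lambda=1.0,           # L2 regularisation
    n_jobs=-1,                # use all CPU cores
    early_stopping_rounds=20, # stop when the validation metric stops improving
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print(model.best_iteration)   # checkpoint chosen by early stopping
```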
GBoost start - explain in words what the initial prediction is
The constant value that minimises the loss function over all observations (e.g. the mean of the targets for squared error, the log-odds for log loss)
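In symbols (standard gradient-boosting notation, added here for clarity), the initial model is the constant that minimises the total loss:

```latex
F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i, \gamma\bigr)
```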
GBoost - explain in words how to refine the prediction over the previous prediction
Add a decision (classification/regression) tree that is fit to the errors of the current prediction and produces an output for every input. Each leaf holds the value that minimises the loss for the observations falling in that leaf
GBoost - explain in words how the new tree's prediction is added so that it reduces the errors of the previous prediction
Scale the tree's prediction by the learning rate before adding it to the previous prediction
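In symbols (standard notation, added for clarity), the model after adding the m-th tree h_m with learning rate (shrinkage) nu:

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad 0 < \nu \le 1
```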
GBoost - how each error is added to the tree
Each observation's error is routed through the decision tree to a leaf, and the leaf's value is set to the amount that minimises the loss function over the errors collected in that leaf
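A from-scratch sketch of these steps for squared-error regression, assuming scikit-learn's DecisionTreeRegressor as the base tree; hyperparameters are illustrative. For squared error, the leaf value that minimises the loss is simply the mean residual in the leaf, which a regression tree already computes.

```python
# Minimal gradient boosting loop for squared-error regression.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
n_trees = 100

# Step 1: initial prediction = constant minimising squared error = mean of targets.
prediction = np.full(len(y), y.mean())
trees = []

for _ in range(n_trees):
    # Step 2: fit a tree to the current errors (residuals).
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    trees.append(tree)
    # Step 3: add the tree's leaf values, scaled by the learning rate.
    prediction += learning_rate * tree.predict(X)

print("training MSE:", np.mean((y - prediction) ** 2))
```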
What is the purpose of the Linear Regression algorithm in machine learning?
Linear Regression is used to model the relationship between a dependent variable and one or more independent variables. It predicts continuous values by fitting a linear equation to the data. The goal is to minimize the sum of squared residuals between predicted and actual values.
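A minimal scikit-learn sketch; the synthetic data is illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
model = LinearRegression().fit(X, y)   # minimises the sum of squared residuals
print(model.coef_, model.intercept_)   # coefficients of the fitted linear equation
print(model.predict(X[:5]))            # continuous predictions
```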
What is Logistic Regression used for in machine learning?
Logistic Regression is used for binary classification tasks. It models the probability that a given input belongs to a certain class using a logistic function (sigmoid) to output values between 0 and 1. It estimates the parameters using Maximum Likelihood Estimation.
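A minimal scikit-learn sketch for binary classification; the synthetic data is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)  # parameters fit by maximum likelihood
print(model.predict_proba(X[:3]))       # sigmoid outputs between 0 and 1
print(model.predict(X[:3]))             # predicted class labels
```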
How do Decision Trees work in machine learning?
Decision Trees partition the data into subsets based on feature values, making decisions at each node to minimize impurity (like Gini index or entropy). They are simple to interpret and can be used for classification and regression tasks.
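A minimal scikit-learn sketch; criterion="gini" is the impurity measure mentioned above, and max_depth=3 is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))  # human-readable splits: the trees are simple to interpret
```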
What is the key idea behind Random Forests in machine learning?
Random Forest is an ensemble method that builds multiple decision trees on random subsets of the data and features. It then aggregates their predictions (by majority vote for classification or averaging for regression) to improve accuracy and reduce overfitting.
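A minimal scikit-learn sketch; n_estimators is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0)  # many trees on random subsets
print(cross_val_score(forest, X, y, cv=5).mean())                  # majority-vote predictions, scored by CV
```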
How does Gradient Boosting (e.g., XGBoost, LightGBM, CatBoost) improve prediction accuracy?
Gradient Boosting builds an ensemble of trees sequentially, where each new tree corrects the errors of the previous ones by focusing on the residuals.
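A minimal scikit-learn sketch showing the sequential nature of boosting: the training error drops as trees are added, each one fitted to the remaining residuals. Hyperparameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3).fit(X, y)

# staged_predict yields the ensemble's prediction after 1, 2, ..., n_estimators trees.
for i, pred in enumerate(gb.staged_predict(X), start=1):
    if i in (1, 10, 100):
        print(i, "trees -> training MSE:", round(np.mean((y - pred) ** 2), 1))
```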
What is the role of Support Vector Machines (SVM) in classification tasks?
SVM is a supervised learning algorithm used for classification and regression. It aims to find the hyperplane that best separates the data into distinct classes with the maximum margin. SVMs can handle both linear and nonlinear classification using the kernel trick.
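A minimal scikit-learn sketch; the RBF kernel illustrates the kernel trick on data that is not linearly separable.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)  # nonlinear class boundary
clf = SVC(kernel="rbf", C=1.0).fit(X, y)                     # maximum-margin separator in kernel space
print(clf.score(X, y))
```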
How does the K-Nearest Neighbors (KNN) algorithm work?
KNN is a simple, non-parametric algorithm used for classification and regression. It classifies a data point based on the majority class (for classification) or the average value (for regression) of its k nearest neighbors in the feature space.
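A minimal scikit-learn sketch; k=5 is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # "fit" mostly just stores the data (non-parametric)
print(knn.predict(X[:3]))                            # majority class of the 5 nearest neighbours
```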
What is the Naive Bayes classifier based on?
Naive Bayes is a probabilistic classifier based on Bayes’ Theorem, assuming independence between features. It calculates the probability of each class given the features and assigns the class with the highest probability. It’s often used for text classification tasks.
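A minimal scikit-learn sketch for text classification; the toy texts and spam labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["free prize click now", "meeting at noon", "win money free", "lunch with the team"]
labels = [1, 0, 1, 0]                     # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(texts)              # word-count features
nb = MultinomialNB().fit(X, labels)       # Bayes' theorem with the feature-independence assumption
print(nb.predict(vec.transform(["free money now"])))
```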
What is K-Means clustering used for?
K-Means is a clustering algorithm that partitions data into k clusters based on the mean of the points in each cluster. It minimizes the within-cluster variance by iteratively assigning points to clusters and recalculating the cluster centroids.
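A minimal scikit-learn sketch; k=3 matches the number of generated blobs and is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # iterative assign / recompute-centroid loop
print(km.cluster_centers_)  # final centroids
print(km.inertia_)          # within-cluster variance being minimised
```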
What is DBSCAN (Density-Based Spatial Clustering of Applications with Noise)?
DBSCAN is a density-based clustering algorithm that groups points closely packed together while marking outliers as noise. It requires two parameters: epsilon (the radius of a neighborhood) and min_samples (minimum points to form a dense region).
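A minimal scikit-learn sketch; eps and min_samples are the two parameters named above, and their values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(np.unique(db.labels_))  # cluster ids; -1 marks points labelled as noise
```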