ML Learning Flashcards

1
Q

List XGBoost benefits

A

L1/L2 regularisation prevents overfitting in high-dimensional spaces
Built-in handling of missing values
Built-in cross-validation
Supports early stopping
The learning curve can be inspected to choose a different checkpoint
Optimised for multi-core CPUs
Grows deep trees and prunes them, yielding trees optimised for inference
Supports multiple objective functions
Easy Python interface

2
Q

GBoost start - explain in words what the initial prediction is

A

The constant value that minimises the loss function over all training observations (e.g. the mean of the targets for squared loss)

3
Q

GBoost - explain in words how to refine the prediction over the previous prediction

A

Fit a classification/regression tree to the errors (residuals) of the previous prediction. Each leaf holds the value that minimises the loss over the observations that fall into that leaf

4
Q

GBoost - explain in words how the new tree's prediction (which corrects the errors of the previous prediction) is added

A

Scale the new tree's output by the learning rate, then add it to the previous prediction

5
Q

GBoost - how does the new tree handle each error?

A

Each observation's error (residual) is routed through the decision tree to a leaf; that leaf outputs the value that minimises the loss function over the errors that land in it
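
The four GBoost cards above can be sketched end-to-end in pure Python. This is an illustrative toy (squared loss, one feature, decision stumps), not how production libraries implement boosting:

```python
# Toy gradient boosting for squared loss on 1-D data.
# Trees are depth-1 stumps; real libraries use deeper trees plus
# regularisation and many engineering optimisations.

def fit_stump(x, residuals):
    """Find the threshold split minimising squared error of the residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        # leaf value minimising squared loss = mean of the residuals in the leaf
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda xi, t=t, lv=lv, rv=rv: lv if xi <= t else rv

def gradient_boost(x, y, n_trees=20, lr=0.3):
    f0 = sum(y) / len(y)          # initial prediction: the mean minimises squared loss
    preds = [f0] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - p for yi, p in zip(y, preds)]   # errors of current prediction
        tree = fit_stump(x, residuals)                    # fit a tree to the errors
        trees.append(tree)
        # scale the tree by the learning rate, then add it
        preds = [p + lr * tree(xi) for p, xi in zip(preds, x)]
    return f0, lr, trees

def predict(model, xi):
    f0, lr, trees = model
    return f0 + sum(lr * t(xi) for t in trees)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
model = gradient_boost(x, y, n_trees=50, lr=0.3)
```

With squared loss, the pieces line up with the cards: the initial prediction is the mean, each leaf value is the mean of the residuals that land in it, and the learning rate controls how much of each new tree is added.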

6
Q

What is the purpose of the Linear Regression algorithm in machine learning?

A

Linear Regression is used to model the relationship between a dependent variable and one or more independent variables. It predicts continuous values by fitting a linear equation to the data. The goal is to minimize the sum of squared residuals between predicted and actual values.
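
As a concrete sketch of "fitting a linear equation by minimising squared residuals", here is the closed-form least-squares solution for a single feature in plain Python:

```python
# Simple linear regression (one feature) via the closed-form
# least-squares solution.

def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope = covariance(x, y) / variance(x); intercept from the means
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return slope, intercept

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]          # exactly y = 2x + 1
slope, intercept = fit_line(x, y)
```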

7
Q

What is Logistic Regression used for in machine learning?

A

Logistic Regression is used for binary classification tasks. It models the probability that a given input belongs to a certain class using a logistic function (sigmoid) to output values between 0 and 1. It estimates the parameters using Maximum Likelihood Estimation.
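
A minimal sketch of the idea, assuming one feature and fitting by gradient ascent on the log-likelihood (a simple stand-in for full Maximum Likelihood Estimation):

```python
import math

# Toy logistic regression: one feature, gradient ascent on the
# log-likelihood of a sigmoid model.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, lr=0.1, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        # gradient of the log-likelihood: sum of (y_i - p_i) * x_i
        grad_w = sum((yi - sigmoid(w * xi + b)) * xi for xi, yi in zip(x, y))
        grad_b = sum(yi - sigmoid(w * xi + b) for xi, yi in zip(x, y))
        w += lr * grad_w / len(x)
        b += lr * grad_b / len(x)
    return w, b

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 0, 1, 1, 1]         # classes separated around x = 2.5
w, b = fit_logistic(x, y)
p_low = sigmoid(w * 0.0 + b)   # probability of class 1 at x = 0
p_high = sigmoid(w * 5.0 + b)  # probability of class 1 at x = 5
```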

8
Q

How do Decision Trees work in machine learning?

A

Decision Trees partition the data into subsets based on feature values, making decisions at each node to minimize impurity (like Gini index or entropy). They are simple to interpret and can be used for classification and regression tasks.
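
The impurity criterion can be sketched directly; a tree chooses the split whose weighted child impurity (here Gini) is lowest:

```python
# Gini impurity of a candidate split - the quantity a decision tree
# minimises when choosing where to partition the data.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(left, right):
    n = len(left) + len(right)
    # weighted average of the child impurities
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pure = split_impurity(["a", "a"], ["b", "b"])    # perfect split
mixed = split_impurity(["a", "b"], ["a", "b"])   # uninformative split
```

A split that perfectly separates the classes has impurity 0; a split that leaves both children mixed scores worse, so the tree prefers the first.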

9
Q

What is the key idea behind Random Forests in machine learning?

A

Random Forest is an ensemble method that builds multiple decision trees on random subsets of the data and features. It then aggregates their predictions (by majority vote for classification or averaging for regression) to improve accuracy and reduce overfitting.

10
Q

How does Gradient Boosting (e.g., XGBoost, LightGBM, CatBoost) improve prediction accuracy?

A

Gradient Boosting builds an ensemble of trees sequentially, where each new tree corrects the errors of the previous ones by focusing on the residuals.

11
Q

What is the role of Support Vector Machines (SVM) in classification tasks?

A

SVM is a supervised learning algorithm used for classification and regression. It aims to find the hyperplane that best separates the data into distinct classes with the maximum margin. SVMs can handle both linear and nonlinear classification using the kernel trick.

12
Q

How does the K-Nearest Neighbors (KNN) algorithm work?

A

KNN is a simple, non-parametric algorithm used for classification and regression. It classifies a data point based on the majority class (for classification) or the average value (for regression) of its k nearest neighbors in the feature space.
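
The whole algorithm fits in a few lines; a sketch with Euclidean distance and majority vote:

```python
import math
from collections import Counter

# KNN classification: predict the majority class among the k closest
# training points (Euclidean distance).

def knn_predict(train, point, k=3):
    # train: list of (features, label) pairs
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "blue"), ((0, 1), "blue"), ((1, 0), "blue"),
         ((5, 5), "red"), ((5, 6), "red"), ((6, 5), "red")]
```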

13
Q

What is the Naive Bayes classifier based on?

A

Naive Bayes is a probabilistic classifier based on Bayes’ Theorem, assuming independence between features. It calculates the probability of each class given the features and assigns the class with the highest probability. It’s often used for text classification tasks.
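
A sketch of multinomial Naive Bayes for text, using log-probabilities and Laplace smoothing (the smoothing is an implementation detail added here, not part of the card):

```python
import math
from collections import Counter, defaultdict

# Multinomial Naive Bayes: class score = log-prior + sum of per-word
# log-likelihoods, assuming words are independent given the class.

def train_nb(docs):
    # docs: list of (list_of_words, label)
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, None
    for c, n in class_counts.items():
        denom = sum(word_counts[c].values()) + len(vocab)  # Laplace smoothing
        score = math.log(n / total)
        score += sum(math.log((word_counts[c][w] + 1) / denom) for w in words)
        if best_score is None or score > best_score:
            best, best_score = c, score
    return best

docs = [(["cheap", "pills", "now"], "spam"),
        (["cheap", "offer"], "spam"),
        (["meeting", "tomorrow"], "ham"),
        (["project", "meeting", "notes"], "ham")]
model = train_nb(docs)
```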

14
Q

What is K-Means clustering used for?

A

K-Means is a clustering algorithm that partitions data into k clusters based on the mean of the points in each cluster. It minimizes the within-cluster variance by iteratively assigning points to clusters and recalculating the cluster centroids.
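
The assign/recalculate loop (Lloyd's algorithm) can be sketched in pure Python for 2-D points:

```python
import math
import random

# Lloyd's algorithm for k-means: assign each point to its nearest
# centroid, then move each centroid to the mean of its assigned points.

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # initialise from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assignment step
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):    # update step
            if cl:
                centroids[i] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = kmeans(points, k=2)
```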

15
Q

What is DBSCAN (Density-Based Spatial Clustering of Applications with Noise)?

A

DBSCAN is a density-based clustering algorithm that groups points closely packed together while marking outliers as noise. It requires two parameters: epsilon (the radius of a neighborhood) and min_samples (minimum points to form a dense region).
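
A minimal sketch of the core logic (pure Python, 2-D points): a point with at least min_samples neighbours within eps seeds a cluster that expands through density-connected points; everything else is noise:

```python
import math

# Toy DBSCAN. Labels: None = unvisited, -1 = noise, otherwise cluster id.

def dbscan(points, eps, min_samples):
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_samples:
            labels[i] = -1                 # noise (may become a border point later)
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = nbrs
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_samples:     # j is also a core point: keep expanding
                queue.extend(jn)
    return labels

points = [(0, 0), (0.5, 0), (1, 0), (10, 10), (10.5, 10), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.0, min_samples=2)
```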

16
Q

How does Hierarchical Clustering work?

A

Hierarchical Clustering creates a tree-like structure (dendrogram) by successively merging or splitting clusters based on their similarity. It can be agglomerative (bottom-up) or divisive (top-down) and doesn’t require the number of clusters to be predefined.
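
The agglomerative (bottom-up) variant can be sketched with single linkage: repeatedly merge the two closest clusters; the merge order is what a dendrogram records:

```python
import math

# Agglomerative clustering with single linkage (distance between clusters
# = distance between their closest pair of points).

def single_linkage(a, b):
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, n_clusters):
    clusters = [[p] for p in points]          # start: every point is its own cluster
    while len(clusters) > n_clusters:
        # find and merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (0.5, 0.5)]
clusters = agglomerate(points, n_clusters=2)
```

Stopping at n_clusters is just one way to cut the dendrogram; keeping the full merge sequence gives the whole hierarchy.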

17
Q

What is the goal of Dimensionality Reduction?

A

Dimensionality Reduction techniques aim to reduce the number of features in a dataset while preserving its important information. This can help improve model performance, reduce overfitting, and speed up training.

18
Q

What is PCA (Principal Component Analysis)?

A

PCA is a linear dimensionality reduction technique that transforms data into a new coordinate system where the greatest variances in the data come first. It helps reduce dimensionality while retaining the most significant features.
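
A sketch of finding the first principal component of 2-D data via power iteration on the covariance matrix (pure Python; real implementations use an eigendecomposition or SVD):

```python
import math

# First principal component of 2-D data: centre the data, build the 2x2
# covariance matrix, and extract its top eigenvector by power iteration.

def first_pc(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(100):                     # power iteration
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    return v

# data stretched along the y = x direction
points = [(-2, -2.1), (-1, -0.9), (0, 0.1), (1, 1.05), (2, 1.9)]
pc = first_pc(points)
```

For this data the direction of greatest variance is the diagonal, so the component comes out close to (1/sqrt(2), 1/sqrt(2)).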

19
Q

What is t-SNE (t-Distributed Stochastic Neighbor Embedding) used for?

A

t-SNE is a non-linear dimensionality reduction technique that is particularly effective for visualizing high-dimensional data in 2 or 3 dimensions. It preserves local neighborhood structure (points that are similar in high-dimensional space stay close in the embedding) rather than global pairwise distances, making it useful for visualizing clusters.

20
Q

What is UMAP (Uniform Manifold Approximation and Projection)?

A

UMAP is a non-linear dimensionality reduction technique that preserves both local and global structures in the data. It is often faster and more scalable than t-SNE while providing similar quality for visualizations.

21
Q

Give examples of how XGBoost, LightGBM, and CatBoost optimise gradient boosting

A

Each library optimises the process differently. XGBoost adds L1/L2 regularisation, uses second-order gradients, and handles sparse/missing values efficiently. LightGBM speeds up training with histogram-based split finding and leaf-wise tree growth. CatBoost uses ordered boosting to reduce target leakage and handles categorical features natively.