Machine Learning Flashcards
What is machine learning?
The fundamental idea of machine learning is to use data from past observations to predict unknown outcomes or values. Machine learning has its origins in statistics and mathematical modeling of data.
What are the processes in Machine Learning?
Fundamentally, a machine learning model is a software application that encapsulates a function to calculate an output value based on one or more input values. The process of defining that function is known as training. After the function has been defined, you can use it to predict new values in a process called inferencing.
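The two phases can be illustrated with a minimal sketch. It assumes a hypothetical proportional relationship y = w * x and estimates w by least squares; the names train, model, and past_observations are illustrative and not part of any library.

```python
def train(observations):
    """Training: define the function from past (x, y) observations.

    Assumption: y is proportional to x, so the function is y = w * x,
    and w is estimated by least squares as sum(x*y) / sum(x*x).
    """
    sum_xy = sum(x * y for x, y in observations)
    sum_xx = sum(x * x for x, _ in observations)
    w = sum_xy / sum_xx

    def model(x):
        # The trained model encapsulates the learned function.
        return w * x

    return model


# Hypothetical past observations as (feature, label) pairs.
past_observations = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

model = train(past_observations)   # training
prediction = model(4.0)            # inferencing on a new input
```

Training runs once on known data; inferencing then reuses the resulting function on inputs whose labels are unknown.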
What is training data?
The training data consists of past observations. In most cases, the observations include the observed attributes or features of the thing being observed, and the known value of the thing you want to train a model to predict (known as the label).
In mathematical terms, you’ll often see the features referred to using the shorthand variable name x, and the label referred to as y. Usually, an observation consists of multiple feature values, so x is actually a vector (an array with multiple values), like this: [x1,x2,x3,…].
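As a small illustration, the sketch below splits a set of observations into feature vectors and labels. The ice cream sales data and the feature names are invented for the example.

```python
# Hypothetical observations: weather features plus the known label (sales).
observations = [
    {"temperature": 22.5, "rainfall": 0.0, "is_weekend": 1, "sales": 120},
    {"temperature": 17.0, "rainfall": 4.2, "is_weekend": 0, "sales": 45},
    {"temperature": 28.1, "rainfall": 0.0, "is_weekend": 0, "sales": 150},
]

feature_names = ["temperature", "rainfall", "is_weekend"]

# Each row of X is one feature vector [x1, x2, x3]; y holds the matching labels.
X = [[obs[name] for name in feature_names] for obs in observations]
y = [obs["sales"] for obs in observations]
```

Keeping X and y as parallel lists means that X[i] and y[i] always describe the same observation.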
Types of Machine Learning
Major Types
1. Supervised machine learning
a. Regression
b. Classification
i. Binary classification
ii. Multiclass classification
2. Unsupervised machine learning
a. Clustering
What is supervised machine learning?
Supervised machine learning is a general term for machine learning algorithms in which the training data includes both feature values and known label values.
What is regression?
Regression is a form of supervised machine learning in which the label predicted by the model is a numeric value.
What is classification?
Classification is a form of supervised machine learning in which the label predicted by the model represents a category, or class, rather than a numeric value.
Types:
1. In binary classification, the label determines whether the observed item is (or isn’t) an instance of a specific class. Or put another way, binary classification models predict one of two mutually exclusive outcomes.
2. Multiclass classification extends binary classification to predict a label that represents one of multiple possible classes.
What is unsupervised machine learning?
Unsupervised machine learning involves training models using data that consists only of feature values without any known labels. Unsupervised machine learning algorithms determine relationships between the features of the observations in the training data.
What is clustering?
A clustering algorithm identifies similarities between observations based on their features, and groups them into discrete clusters.
In some cases, clustering is used to determine the set of classes that exist before training a classification model.
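One common clustering technique is k-means, sketched below for two-dimensional points. This is an illustrative assumption rather than the algorithm the notes have in mind: the function name k_means, the choice of the first k points as starting centroids, and the sample points are all made up for the example.

```python
def k_means(points, k, iterations=10):
    """Group 2-D points into k clusters (a basic k-means sketch)."""
    # Assumption: use the first k points as the initial centroids.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)

    for _ in range(iterations):
        # Step 1: assign each point to its nearest centroid
        # (squared Euclidean distance).
        for i, (px, py) in enumerate(points):
            distances = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids]
            assignments[i] = distances.index(min(distances))

        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )

    return assignments, centroids


# Hypothetical unlabeled observations with two features each.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 9.0)]
assignments, centroids = k_means(points, k=2)
```

No labels are supplied; the cluster numbers in assignments come only from how close the feature values are to one another.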
How are regression models trained?
Regression models are trained to predict numeric label values based on training data that includes both features and known labels. The process for training a regression model (or indeed, any supervised machine learning model) involves multiple iterations in which you use an appropriate algorithm (usually with some parameterized settings) to train a model, evaluate the model’s predictive performance, and refine the model by repeating the training process with different algorithms and parameters until you achieve an acceptable level of predictive accuracy.
What is linear regression?
Linear regression is a regression algorithm that works by deriving a function that produces a straight line through the plotted x and y values while minimizing the average distance between the line and the plotted points.
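A minimal sketch of fitting such a line for a single feature follows. It assumes ordinary least squares, which minimizes the squared vertical distance between the line and the points, as one common way to make "average distance" precise. The function name fit_line and the sample data are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Slope: co-variation of x and y divided by the variation of x.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator

    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# Inferencing: predict the label for a new feature value.
predicted = slope * 5.0 + intercept
```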
What are some Regression Evaluation Metrics?
1. Mean Absolute Error (MAE) - The absolute difference between each predicted label and the actual label is known as the absolute error. It can be summarized for the whole validation set as the mean absolute error (MAE).
2. Mean Squared Error (MSE) - A metric that "amplifies" larger errors by squaring the individual errors and calculating the mean of the squared values.
3. Root Mean Squared Error (RMSE) - The square root of the MSE, which expresses the error in the same units as the label.
4. Coefficient of determination (R²) - The coefficient of determination (more commonly referred to as R² or R-Squared) is a metric that measures the proportion of variance in the validation results that can be explained by the model, as opposed to some anomalous aspect of the validation data (for example, a day with a highly unusual number of ice cream sales because of a local festival).
The calculation for R² is more complex than for the previous metrics. It compares the sum of squared differences between predicted and actual labels with the sum of squared differences between the actual label values and the mean of actual label values, like this:
R² = 1 - ∑(y - ŷ)² ÷ ∑(y - ȳ)²
Here y is an actual label, ŷ is the predicted label, and ȳ is the mean of the actual labels. The result is usually a value between 0 and 1; the closer this value is to 1, the better the model fits the validation data. (It can be negative when the model predicts worse than simply using the mean of the labels.)
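The four metrics can be computed directly from their definitions, as in the sketch below. The function name regression_metrics and the sample labels are assumptions made for the example.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE and R2 for paired actual and predicted labels."""
    n = len(actual)
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n   # mean of the absolute errors
    mse = sum(e ** 2 for e in errors) / n   # mean of the squared errors
    rmse = math.sqrt(mse)                   # back in the units of the label

    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                  # sum of (y - y_hat)^2
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)  # sum of (y - y_mean)^2
    r2 = 1 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical validation labels and model predictions.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]
metrics = regression_metrics(actual, predicted)
```

For this data the errors are -1, 0, 1 and 0, so MAE and MSE are both 0.5 and R2 is 1 - 2/20 = 0.9.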
What is iterative training?
In most real-world scenarios, a data scientist will use an iterative process to repeatedly train and evaluate a model, varying:
a. Feature selection and preparation
b. Algorithm selection
c. Algorithm parameters (numeric settings to control algorithm behavior, more accurately called hyperparameters to differentiate them from the x and y parameters).
After multiple iterations, the model that results in the best evaluation metric that’s acceptable for the specific scenario is selected.
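The loop below is a rough sketch of this process for one varying setting. It assumes a straight-line model trained by gradient descent, with the learning rate as the hyperparameter being varied; the function names, candidate values, and data are all hypothetical.

```python
def train_gd(train_data, learning_rate, epochs=200):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(train_data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train_data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train_data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(model, data):
    """Evaluation metric: mean squared error of the model on the given data."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Hypothetical data following y = 2x + 1, split into training and validation sets.
train_data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
validation_data = [(5.0, 11.0), (6.0, 13.0)]

# Train once per candidate hyperparameter value and record the validation metric.
results = []
for lr in [0.001, 0.01, 0.05]:
    model = train_gd(train_data, lr)
    results.append((mse(model, validation_data), lr, model))

# Select the model with the best (lowest) validation error.
best_mse, best_lr, best_model = min(results)
```

The same pattern extends to varying features or algorithms: each candidate is trained, scored on held-back validation data, and compared on the same metric.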
What is binary classification?
Classification, like regression, is a supervised machine learning technique, and therefore follows the same iterative process of training, validating, and evaluating models. Instead of calculating numeric values like a regression model, the algorithms used to train classification models calculate probability values for class assignment, and the evaluation metrics used to assess model performance compare the predicted classes to the actual classes.
There are many algorithms that can be used for binary classification, such as logistic regression, which derives a sigmoid (S-shaped) function with values between 0.0 and 1.0.
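A sketch of how a sigmoid output can be turned into a class prediction follows. The weight w, the bias b, and the 0.5 threshold are assumptions for illustration; in practice the weight and bias would be learned during training.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0.0 and 1.0."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, w, b, threshold=0.5):
    """Return (predicted class, probability) for a single feature value x."""
    probability = sigmoid(w * x + b)
    # Assumption: predict class 1 when the probability reaches the threshold.
    predicted = 1 if probability >= threshold else 0
    return predicted, probability


# Hypothetical, already-trained parameters.
w, b = 1.5, -2.0

class_high, prob_high = predict_class(2.0, w, b)   # z = 1.0
class_low, prob_low = predict_class(0.0, w, b)     # z = -2.0
```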
What are binary classification evaluation metrics?
The first step in calculating evaluation metrics for a binary classification model is usually to create a matrix of the number of correct and incorrect predictions for each possible class label.
This matrix is called a confusion matrix, and it shows the prediction totals where:
ŷ=0 and y=0: True negatives (TN)
ŷ=1 and y=0: False positives (FP)
ŷ=0 and y=1: False negatives (FN)
ŷ=1 and y=1: True positives (TP)
Here ŷ is the predicted class label and y is the actual class label.
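The four totals can be counted with a short loop, as in this sketch. The function name confusion_counts and the sample labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TN, FP, FN and TP for binary labels encoded as 0 and 1."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for y, y_hat in zip(actual, predicted):
        if y_hat == 0 and y == 0:
            counts["TN"] += 1   # correctly predicted negative
        elif y_hat == 1 and y == 0:
            counts["FP"] += 1   # predicted positive, actually negative
        elif y_hat == 0 and y == 1:
            counts["FN"] += 1   # predicted negative, actually positive
        else:
            counts["TP"] += 1   # correctly predicted positive
    return counts


# Hypothetical validation labels (y) and model predictions (y_hat).
actual = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
```

The four counts always sum to the number of observations, which is a quick sanity check on any confusion matrix.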