Machine Learning Flashcards

1
Q

What is Machine learning?

A

The fundamental idea of machine learning is to use data from past observations to predict unknown outcomes or values. Machine learning has its origins in statistics and mathematical modeling of data.

2
Q

What are the processes in Machine Learning?

A

Fundamentally, a machine learning model is a software application that encapsulates a function to calculate an output value based on one or more input values. The process of defining that function is known as training. After the function has been defined, you can use it to predict new values in a process called inferencing.

3
Q

What is training data?

A

The training data consists of past observations. In most cases, the observations include the observed attributes or features of the thing being observed, and the known value of the thing you want to train a model to predict (known as the label).

In mathematical terms, you’ll often see the features referred to using the shorthand variable name x, and the label referred to as y. Usually, an observation consists of multiple feature values, so x is actually a vector (an array with multiple values), like this: [x1,x2,x3,…].

3
Q

Types of Machine Learning

A

Major Types
1. Supervised machine learning
   a. Regression
   b. Classification
      i. Binary classification
      ii. Multiclass classification
2. Unsupervised machine learning
   a. Clustering

4
Q

What is supervised machine learning?

A

Supervised machine learning is a general term for machine learning algorithms in which the training data includes both feature values and known label values.

5
Q

What is regression?

A

Regression is a form of supervised machine learning in which the label predicted by the model is a numeric value.

6
Q

What is classification?

A

Classification is a form of supervised machine learning in which the label represents a categorization, or class.
Types:
1. In binary classification, the label determines whether the observed item is (or isn’t) an instance of a specific class. Or put another way, binary classification models predict one of two mutually exclusive outcomes.
2. Multiclass classification extends binary classification to predict a label that represents one of multiple possible classes.

7
Q

What is unsupervised machine learning?

A

Unsupervised machine learning involves training models using data that consists only of feature values without any known labels. Unsupervised machine learning algorithms determine relationships between the features of the observations in the training data.

8
Q

What is clustering?

A

A clustering algorithm identifies similarities between observations based on their features, and groups them into discrete clusters.

In some cases, clustering is used to determine the set of classes that exist before training a classification model.

9
Q

What is regression?

A

Regression models are trained to predict numeric label values based on training data that includes both features and known labels. The process for training a regression model (or indeed, any supervised machine learning model) involves multiple iterations in which you use an appropriate algorithm (usually with some parameterized settings) to train a model, evaluate the model’s predictive performance, and refine the model by repeating the training process with different algorithms and parameters until you achieve an acceptable level of predictive accuracy.

10
Q

What is linear regression?

A

Linear regression is a regression algorithm that works by deriving a function that produces a straight line through the plotted (x, y) points while minimizing the average distance between the line and the plotted points.
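
As a rough illustration of fitting a straight line, here is a minimal sketch that assumes ordinary least squares (which minimizes the squared distance, one common way to make "average distance" precise). The function name `fit_line` and the temperature and sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: temperature (x) and ice creams sold (y).
temps = [50, 60, 70, 80]
sales = [10, 20, 30, 40]
slope, intercept = fit_line(temps, sales)

# Inferencing: apply the fitted function to a new x value.
predicted_sales = slope * 75 + intercept
```

With this made-up data the points lie exactly on a line, so the fitted function reproduces them and can be applied to unseen x values.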

11
Q

What are some Regression Evaluation Metrics?

A
1. Mean Absolute Error (MAE) - The difference between a predicted value and the actual value, ignoring its sign, is the absolute error for that prediction. Averaging the absolute errors over the whole validation set gives the mean absolute error (MAE).
2. Mean Squared Error (MSE) - A metric that “amplifies” larger errors by squaring the individual errors and calculating the mean of the squared values.
3. Root Mean Squared Error (RMSE) - The square root of the MSE, which expresses the error in the same units as the label.
4. Coefficient of determination (R²) - The coefficient of determination (more commonly referred to as R² or R-Squared) is a metric that measures the proportion of variance in the validation results that can be explained by the model, as opposed to some anomalous aspect of the validation data (for example, a day with a highly unusual number of ice cream sales because of a local festival).

The calculation for R² is more complex than for the previous metrics. It compares the sum of squared differences between predicted and actual labels with the sum of squared differences between the actual label values and the mean of actual label values, like this:

R² = 1 - ∑(y-ŷ)² ÷ ∑(y-ȳ)²

The result is usually a value between 0 and 1 (it can be negative when the model fits worse than simply predicting the mean). The closer this value is to 1, the better the model fits the validation data.
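
The four metrics can be worked through on a small example. This is a minimal sketch; the function name `regression_metrics` and the actual and predicted values are made up for illustration.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R-squared for paired label lists."""
    n = len(actual)
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(actual) / n
    # Sum of squared differences between actual and predicted labels.
    ss_res = sum(e ** 2 for e in errors)
    # Sum of squared differences between actual labels and their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical validation data.
actual = [2, 4, 6, 8]
predicted = [3, 4, 5, 8]
metrics = regression_metrics(actual, predicted)
```

Here two of the four predictions are off by 1, giving an MAE and MSE of 0.5 and an R² of 0.9.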

12
Q

What is iterative Training?

A

In most real-world scenarios, a data scientist will use an iterative process to repeatedly train and evaluate a model, varying:

a. Feature selection and preparation
b. Algorithm selection
c. Algorithm parameters (numeric settings to control algorithm behavior, more accurately called hyperparameters to differentiate them from the x and y parameters).
After multiple iterations, the model that results in the best evaluation metric that’s acceptable for the specific scenario is selected.

13
Q

What is Binary classification?

A

Classification, like regression, is a supervised machine learning technique; and therefore follows the same iterative process of training, validating, and evaluating models. Instead of calculating numeric values like a regression model, the algorithms used to train classification models calculate probability values for class assignment and the evaluation metrics used to assess model performance compare the predicted classes to the actual classes.

There are many algorithms that can be used for binary classification, such as logistic regression, which derives a sigmoid (S-shaped) function with values between 0.0 and 1.0.
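
To make the sigmoid idea concrete, here is a minimal sketch of a one-feature logistic model. The `weight`, `bias`, and 0.5 `threshold` values are assumptions chosen for illustration, not values from a trained model.

```python
import math


def sigmoid(z):
    """S-shaped function that maps any real number into the range 0.0 to 1.0."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, weight, bias, threshold=0.5):
    """Return (predicted class, probability) for a single feature value x."""
    probability = sigmoid(weight * x + bias)
    # Probabilities at or above the threshold are assigned to class 1.
    predicted = 1 if probability >= threshold else 0
    return predicted, probability


# Hypothetical model parameters.
label_high, p_high = predict_class(2.0, weight=1.0, bias=0.0)
label_low, p_low = predict_class(-2.0, weight=1.0, bias=0.0)
```

The probability is what the model calculates; the class label comes from comparing that probability with the chosen threshold.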

14
Q

What are binary classification evaluation metrics?

A

The first step in calculating evaluation metrics for a binary classification model is usually to create a matrix of the number of correct and incorrect predictions for each possible class label:
This visualization is called a confusion matrix, and it shows the prediction totals where:

ŷ=0 and y=0: True negatives (TN)
ŷ=1 and y=0: False positives (FP)
ŷ=0 and y=1: False negatives (FN)
ŷ=1 and y=1: True positives (TP)

Here ŷ is the predicted class label and y is the actual class label.
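
The four confusion matrix totals can be counted directly from paired labels. This is a minimal sketch; the function name `confusion_counts` and the label lists are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TN, FP, FN and TP for binary labels (0 or 1)."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for y, y_hat in zip(actual, predicted):
        if y_hat == 0 and y == 0:
            counts["TN"] += 1
        elif y_hat == 1 and y == 0:
            counts["FP"] += 1
        elif y_hat == 0 and y == 1:
            counts["FN"] += 1
        else:
            counts["TP"] += 1
    return counts


# Hypothetical validation labels.
actual = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
```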

15
Q

What is accuracy (in binary classification)?

A

The simplest metric you can calculate from the confusion matrix is accuracy - the proportion of predictions that the model got right. Accuracy is calculated as:

(TN+TP) ÷ (TN+FN+FP+TP)

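Accuracy can be misleading when one outcome is much more common than the other: a model that always predicts the common outcome scores a high accuracy without identifying any of the rare cases.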
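
A small made-up example shows why accuracy alone should be treated with care. It assumes 100 hypothetical patients, 5 of whom have diabetes, and a "model" that always predicts 0.

```python
# Hypothetical data: 5 positive cases and 95 negative cases.
actual = [1] * 5 + [0] * 95
# A trivial model that predicts "no diabetes" for everyone.
predicted = [0] * 100

correct = sum(1 for y, y_hat in zip(actual, predicted) if y == y_hat)
accuracy = correct / len(actual)

# Number of actual positive cases the model managed to identify.
positives_found = sum(
    1 for y, y_hat in zip(actual, predicted) if y == 1 and y_hat == 1
)
```

The accuracy is 0.95 even though the model does not find a single positive case.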

16
Q

What is recall (in binary classification)?

A

Recall is a metric that measures the proportion of positive cases that the model identified correctly. In other words, compared to the number of patients who have diabetes, how many did the model predict to have diabetes?

The formula for recall is:

TP ÷ (TP+FN)

Another name for recall is the true positive rate (TPR), and there’s an equivalent metric called the false positive rate (FPR) that is calculated as FP÷(FP+TN).

17
Q

What is precision (in binary classification)?

A

Precision is a similar metric to recall, but measures the proportion of predicted positive cases where the true label is actually positive. In other words, what proportion of the patients predicted by the model to have diabetes actually have diabetes?

The formula for precision is:

TP ÷ (TP+FP)

18
Q

What is F1 score (in binary classification)?

A

F1-score is an overall metric that combines recall and precision. The formula for F1-score is:

(2 x Precision x Recall) ÷ (Precision + Recall)
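
The accuracy, recall, precision, and F1-score formulas can be applied together to one set of confusion matrix totals. This is a minimal sketch; the function name and the totals are hypothetical.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, recall, precision and F1 from confusion totals.

    Assumes at least one actual positive and one predicted positive,
    otherwise the divisions below would fail.
    """
    accuracy = (tn + tp) / (tn + fn + fp + tp)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = (2 * precision * recall) / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical totals for 100 validation cases.
metrics = classification_metrics(tn=50, fp=10, fn=5, tp=35)
```

With these totals, accuracy is 0.85, recall is 0.875, precision is about 0.78, and the F1-score is about 0.82.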

19
Q

What is Area Under the Curve (AUC) (in binary classification)?

A

TPR and FPR are often used to evaluate a model by plotting a receiver operating characteristic (ROC) curve that compares the TPR and FPR for every possible threshold value between 0.0 and 1.0.
The ROC curve for a perfect model would go straight up the TPR axis on the left and then across the FPR axis at the top. The area under the curve (AUC) summarizes the ROC curve as a single value: 1.0 for a perfect model and about 0.5 for a model that guesses at random.
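
As a rough sketch of how an ROC curve and its area can be computed, the code below evaluates TPR and FPR at a fixed grid of thresholds and sums the area with the trapezoid rule. The grid of 11 thresholds, the function names, and the scores are assumptions for illustration.

```python
def roc_points(actual, scores, thresholds):
    """Return sorted (FPR, TPR) points, one per threshold."""
    points = []
    for t in thresholds:
        predicted = [1 if s >= t else 0 for s in scores]
        pairs = list(zip(actual, predicted))
        tp = sum(1 for y, p in pairs if y == 1 and p == 1)
        fn = sum(1 for y, p in pairs if y == 1 and p == 0)
        fp = sum(1 for y, p in pairs if y == 0 and p == 1)
        tn = sum(1 for y, p in pairs if y == 0 and p == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return sorted(points)


def area_under_curve(points):
    """Approximate the area under the points with the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted probabilities.
actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
thresholds = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
auc = area_under_curve(roc_points(actual, scores, thresholds))

# A model that ranks every positive above every negative.
perfect_auc = area_under_curve(
    roc_points([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9], thresholds)
)
```

A coarse threshold grid only approximates the curve in general; here it happens to capture every change in the predictions.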

20
Q

What is multiclass classification?

A

Multiclass classification is used to predict to which of multiple possible classes an observation belongs. As a supervised machine learning technique, it follows the same iterative train, validate, and evaluate process as regression and binary classification in which a subset of the training data is held back to validate the trained model.

21
Q

What are some of the multiclass classification model algorithms?

A

To train a multiclass classification model, we need to use an algorithm to fit the training data to a function that calculates a probability value for each possible class. There are two kinds of algorithm you can use to do this:

One-vs-Rest (OvR) algorithms
Multinomial algorithms

22
Q

What are One-vs-Rest (OvR) algorithms?

A

One-vs-Rest algorithms train a binary classification function for each class, each calculating the probability that the observation is an example of the target class. Each function calculates the probability of the observation being a specific class compared to any other class.
f0(x) = P(y=0 | x)
f1(x) = P(y=1 | x)
f2(x) = P(y=2 | x)
Each algorithm produces a sigmoid function that calculates a probability value between 0.0 and 1.0. A model trained using this kind of algorithm predicts the class for the function that produces the highest probability output.
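
A minimal sketch of One-vs-Rest prediction, assuming each class has its own one-feature sigmoid function. The `(weight, bias)` pairs below are hypothetical, not trained values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ovr_predict(x, class_models):
    """Pick the class whose binary function gives the highest probability.

    class_models holds one (weight, bias) pair per class.
    """
    probabilities = [sigmoid(w * x + b) for w, b in class_models]
    best_class = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_class, probabilities


# Hypothetical parameters for f0, f1 and f2.
class_models = [(-1.0, 0.0), (0.2, -0.5), (1.0, -1.0)]
class_for_large_x, probs_large = ovr_predict(3.0, class_models)
class_for_small_x, probs_small = ovr_predict(-2.0, class_models)
```

Because each function is trained independently, the three probabilities do not need to add up to 1.0.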

23
Q

What are multinomial algorithms?

A

A multinomial algorithm creates a single function that returns a multi-valued output. The output is a vector (an array of values) that contains the probability distribution for all possible classes, with a probability score for each class; the scores add up to 1.0:

f(x) = [P(y=0|x), P(y=1|x), P(y=2|x)]

An example of this kind of function is a softmax function, which could produce an output like the following example:

[0.2, 0.3, 0.5]

The elements in the vector represent the probabilities for classes 0, 1, and 2 respectively; so in this case, the class with the highest probability is 2.
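
A softmax function can be sketched in a few lines. The input scores `[1.0, 2.0, 3.0]` are made up for illustration, and subtracting the maximum is a common numerical-stability step rather than something the card describes.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that add up to 1.0."""
    highest = max(scores)
    # Subtracting the maximum avoids overflow and does not change the result.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for classes 0, 1 and 2.
probabilities = softmax([1.0, 2.0, 3.0])
predicted_class = probabilities.index(max(probabilities))
```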

24
Q

Evaluating a multiclass classification model

A

We can evaluate a multiclass classifier by calculating binary classification metrics for each individual class. Alternatively, you can calculate aggregate metrics that take all classes into account.
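
Both approaches can be sketched together: per-class precision and recall (treating each class in turn as the positive class), plus two aggregate values. The labels are hypothetical, and averaging the per-class recalls (a "macro" average) is one possible aggregate among several.

```python
def per_class_metrics(actual, predicted, classes):
    """Compute precision and recall for each class, one class vs. the rest."""
    results = {}
    for c in classes:
        pairs = list(zip(actual, predicted))
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        results[c] = {"precision": precision, "recall": recall}
    return results


# Hypothetical validation labels for three classes.
actual = [0, 1, 2, 2, 1, 0]
predicted = [0, 2, 2, 2, 1, 0]
per_class = per_class_metrics(actual, predicted, classes=[0, 1, 2])

# Aggregate metrics that take all classes into account.
overall_accuracy = sum(1 for y, p in zip(actual, predicted) if y == p) / len(actual)
macro_recall = sum(m["recall"] for m in per_class.values()) / len(per_class)
```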

25
Q

What is clustering?

A

Clustering is a form of unsupervised machine learning in which observations are grouped into clusters based on similarities in their data values, or features. This kind of machine learning is considered unsupervised because it doesn’t make use of previously known label values to train a model. In a clustering model, the label is the cluster to which the observation is assigned, based only on its features.

26
Q

Clustering Algorithms?

A

One clustering algorithm is K-Means clustering, which consists of the following steps:

1. The feature (x) values are vectorized to define n-dimensional coordinates (where n is the number of features). In the flower example, we have two features: number of leaves (x1) and number of petals (x2). So, the feature vector has two coordinates that we can use to conceptually plot the data points in two-dimensional space ([x1,x2]).
2. You decide how many clusters you want to use to group the flowers - call this value k. For example, to create three clusters, you would use a k value of 3. Then k points are plotted at random coordinates. These points become the center points for each cluster, so they’re called centroids.
3. Each data point (in this case a flower) is assigned to its nearest centroid.
4. Each centroid is moved to the center of the data points assigned to it based on the mean distance between the points.
5. After the centroid is moved, the data points may now be closer to a different centroid, so the data points are reassigned to clusters based on the new closest centroid.
6. The centroid movement and cluster reallocation steps are repeated until the clusters become stable or a predetermined maximum number of iterations is reached.
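
The K-Means steps can be sketched in plain Python. As an assumption for repeatability, the starting centroids are passed in rather than placed at random, and the flower measurements (leaves, petals) are made up.

```python
def k_means(points, initial_centroids, max_iterations=100):
    """Cluster points around k centroids; returns (assignments, centroids)."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid (squared distance).
        new_assignments = [
            min(
                range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Stop when the clusters are stable.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for c in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical flowers as (number of leaves, number of petals).
flowers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignments, centroids = k_means(flowers, initial_centroids=[(0, 0), (10, 10)])
```

With k = 2 and this data, the first three flowers end up in one cluster and the last three in the other.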

27
Q

Evaluating a clustering model?

A

Since there’s no known label with which to compare the predicted cluster assignments, evaluation of a clustering model is based on how well the resulting clusters are separated from one another.

There are multiple metrics that you can use to evaluate cluster separation, including:

Average distance to cluster center: How close, on average, each point in the cluster is to the centroid of the cluster.
Average distance to other center: How close, on average, each point in the cluster is to the centroid of all other clusters.
Maximum distance to cluster center: The furthest distance between a point in the cluster and its centroid.
Silhouette: A value between -1 and 1 that summarizes the ratio of distance between points in the same cluster and points in different clusters (The closer to 1, the better the cluster separation).
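
Two of these metrics, the average and maximum distance to the cluster center, can be sketched for a single cluster as follows. The points and centroid are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def distance_to_center_stats(cluster_points, centroid):
    """Return (average, maximum) Euclidean distance from points to centroid."""
    distances = [math.dist(point, centroid) for point in cluster_points]
    return sum(distances) / len(distances), max(distances)


# Hypothetical cluster and its centroid.
cluster = [(1, 1), (1, 3), (1, 2)]
centroid = (1, 2)
average_distance, maximum_distance = distance_to_center_stats(cluster, centroid)
```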

28
Q

What is deep learning?

A

Deep learning is an advanced form of machine learning that tries to emulate the way the human brain learns. The key to deep learning is the creation of an artificial neural network that simulates electrochemical activity in biological neurons by using mathematical functions.

29
Q

How does deep learning work?

A

Just like other machine learning techniques discussed in this module, deep learning involves fitting training data to a function that can predict a label (y) based on the value of one or more features (x). The function (f(x)) is the outer layer of a nested function in which each layer of the neural network encapsulates functions that operate on x and the weight (w) values associated with them. The algorithm used to train the model involves iteratively feeding the feature values (x) in the training data forward through the layers to calculate output values for ŷ, validating the model to evaluate how far off the calculated ŷ values are from the known y values (which quantifies the level of error, or loss, in the model), and then modifying the weights (w) to reduce the loss. The trained model includes the final weight values that result in the most accurate predictions.

30
Q

How does a neural network learn?

A

The weights in a neural network are central to how it calculates predicted values for labels. During the training process, the model learns the weights that will result in the most accurate predictions.

1. The training and validation datasets are defined, and the training features are fed into the input layer.
2. The neurons in each layer of the network apply their weights (which are initially assigned randomly) and feed the data through the network.
3. The output layer produces a vector containing the calculated values for ŷ. For example, an output for a penguin class prediction might be [0.3, 0.1, 0.6].
4. A loss function is used to compare the predicted ŷ values to the known y values and aggregate the difference (which is known as the loss). For example, if the known class for the case that returned the output in the previous step is Chinstrap, then the y value should be [0.0, 0.0, 1.0]. The absolute difference between this and the ŷ vector is [0.3, 0.1, 0.4]. In reality, the loss function calculates the aggregate variance for multiple cases and summarizes it as a single loss value.
5. Since the entire network is essentially one large nested function, an optimization function can use differential calculus to evaluate the influence of each weight in the network on the loss, and determine how they could be adjusted (up or down) to reduce the amount of overall loss. The specific optimization technique can vary, but usually involves a gradient descent approach in which each weight is increased or decreased to minimize the loss.
6. The changes to the weights are backpropagated to the layers in the network, replacing the previously used values.
7. The process is repeated over multiple iterations (known as epochs) until the loss is minimized and the model predicts acceptably accurately.
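
The predict, measure loss, adjust weights loop can be illustrated with a heavily simplified sketch: a single weight instead of a full network, a mean squared error loss, and plain gradient descent. The starting weight of 0.0 (rather than a random value), the learning rate, and the data are assumptions for illustration.

```python
def train_single_weight(xs, ys, weight=0.0, learning_rate=0.05, epochs=200):
    """Fit y_hat = weight * x by gradient descent; returns (weight, losses)."""
    n = len(xs)
    losses = []
    for _ in range(epochs):
        # Forward pass: calculate predictions with the current weight.
        predictions = [weight * x for x in xs]
        # Loss: mean squared difference between predictions and known labels.
        loss = sum((y_hat - y) ** 2 for y_hat, y in zip(predictions, ys)) / n
        losses.append(loss)
        # Derivative of the loss with respect to the weight.
        gradient = sum(
            2 * (y_hat - y) * x for y_hat, y, x in zip(predictions, ys, xs)
        ) / n
        # Adjust the weight in the direction that reduces the loss.
        weight -= learning_rate * gradient
    return weight, losses


# Hypothetical training data generated by y = 2 * x.
final_weight, losses = train_single_weight([1, 2, 3], [2, 4, 6])
```

The recorded loss shrinks from epoch to epoch as the weight approaches 2. A real network repeats the same idea across many weights and layers.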
