Learn the Basics of Machine Learning Flashcards

1
Q

Linear Regression

Gradient descent step

A

The size of the step that gradient descent takes is controlled by the learning rate. Finding an adequate value for the learning rate is key to achieving convergence. If the value is too large, the algorithm may overshoot and never reach the optimum; if it is too small, it will take too long to reach the desired value.

2
Q

Linear Regression

Gradient Descent in Regression

A

Gradient Descent is an iterative algorithm used to tune the parameters in regression models for minimum loss.
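
As a minimal sketch of the idea, the function below fits a line y = m*x + b by gradient descent on the mean squared error. The function name, the choice of loss, and the default values are assumptions made for illustration; they do not come from the card.

```python
def gradient_descent(xs, ys, learning_rate=0.01, iterations=1000):
    """Fit y = m*x + b by repeatedly stepping against the MSE gradient."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Gradients of the mean squared error with respect to m and b.
        grad_m = (-2 / n) * sum(x * (y - (m * x + b)) for x, y in zip(xs, ys))
        grad_b = (-2 / n) * sum(y - (m * x + b) for x, y in zip(xs, ys))
        # The learning rate scales the size of each step.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # generated from y = 2x + 1
m, b = gradient_descent(xs, ys, learning_rate=0.05, iterations=5000)
print(round(m, 3), round(b, 3))
```

With this small dataset, a learning rate much above roughly 0.1 makes the updates diverge, while a very small one needs many more iterations to get close to m = 2 and b = 1.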

3
Q

Classification: K-Nearest Neighbors

K-Nearest Neighbors Underfitting and Overfitting

A

The value of k in the KNN algorithm affects the error rate of the model. A small value of k can lead to overfitting, while a large value of k can lead to underfitting. Overfitting means that the model performs well on the training data but poorly on new data. Underfitting means that the model performs poorly on the training data and also fails to generalize to new data.

4
Q

Classification: K-Nearest Neighbors

KNN Classification Algorithm in Scikit Learn

A

Scikit-learn is a very popular machine learning library for Python. It provides a KNeighborsClassifier object that performs KNN classification. The n_neighbors parameter passed to KNeighborsClassifier sets k, the number of closest neighbors checked for each unclassified point.

The object provides a .fit() method which takes in training data and a .predict() method which returns the classification of a set of data points.

5
Q

Classification: K-Nearest Neighbors

Euclidean Distance

A

The Euclidean Distance between two points can be computed, knowing the coordinates of those points.

On a 2-D plane, the distance between two points p and q is the square-root of the sum of the squares of the difference between their x and y components. Remember the Pythagorean Theorem: a^2 + b^2 = c^2 ?

We can write a function to compute this distance. Let’s assume that points are represented by tuples of the form (x_coord, y_coord). Also remember that computing the square-root of some value n can be done in a couple of ways: math.sqrt(n), using the math library, or n ** 0.5 (n raised to the power of 1/2).
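
One possible version of such a function is sketched below. The name euclidean_distance is an assumption; the tuple format (x_coord, y_coord) follows the description above.

```python
import math


def euclidean_distance(p, q):
    """Distance between two 2-D points given as (x_coord, y_coord) tuples."""
    dx = p[0] - q[0]
    dy = p[1] - q[1]
    # Square root of the sum of the squared differences.
    return math.sqrt(dx ** 2 + dy ** 2)


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

Replacing math.sqrt(...) with (dx ** 2 + dy ** 2) ** 0.5 gives the same result without the import.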

6
Q

Classification: K-Nearest Neighbors

Elbow Curve Validation Technique in K-Nearest Neighbor Algorithm

A

The k value in KNN determines the number of neighbors we look at when we assign a label to a new observation, so choosing it well matters.

For a very low value of k (suppose k=1), the model overfits the training data, which leads to a high error rate on the validation set. On the other hand, for a high value of k, the model performs poorly on both the training and validation sets. As k increases, the validation error first decreases and then starts increasing again, in a “U” shape. An optimal value of k can be determined from the elbow curve of the validation error.

7
Q

Classification: K-Nearest Neighbors

K-Nearest Neighbors

A

The K-Nearest Neighbors algorithm is a supervised machine learning algorithm for labeling an unknown data point given existing labeled data.

The nearness of points is typically determined with a distance metric such as the Euclidean distance, computed from the features of the data. The algorithm classifies a point based on the labels of its K nearest neighbors, where the value of K can be specified.

8
Q

Classification: K-Nearest Neighbors

KNN of Unknown Data Point

A

To classify the unknown data point using the KNN (K-Nearest Neighbor) algorithm:
* Normalize the numeric data
* Find the distance between the unknown data point and all training data points
* Sort the distances and find the nearest k data points
* Classify the unknown data point with the most common label among the nearest k points
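
The steps above can be sketched in plain Python as follows. The function name classify and the sample data are hypothetical, and the sketch assumes the data has already been normalized.

```python
import math
from collections import Counter


def classify(unknown, training_points, training_labels, k):
    """Label `unknown` with the most common label among its k nearest points."""
    # Distance from the unknown point to every training point.
    distances = [
        (math.dist(unknown, point), label)
        for point, label in zip(training_points, training_labels)
    ]
    # Sort by distance and keep the labels of the k nearest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the nearest labels.
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.15, 0.2)]
labels = ["a", "a", "b", "b", "a"]
print(classify((0.2, 0.2), points, labels, k=3))  # a
```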

9
Q

Classification: K-Nearest Neighbors

Normalizing Data

A

Normalization is a process of converting the numeric columns in the dataset to a common scale while retaining the underlying differences in the range of values.

For example, Min-max normalization converts each value of the numeric column to a value between 0 and 1 using the formula Normalized value = (NumericValue - MinValue) / (MaxValue - MinValue). A downside of Min-max Normalization is that it does not handle outliers very well.
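
A small sketch of the formula applied to one numeric column is shown below. The function name is an assumption, and returning zeros for a constant column is a choice made here for illustration only.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range 0..1 with min-max normalization."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant column maps to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


print(min_max_normalize([10, 20, 30, 50]))  # [0.0, 0.25, 0.5, 1.0]
```

The outlier problem is easy to see with a column such as [1, 2, 3, 100]: the value 100 maps to 1, and the other three values are squeezed close to 0.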

10
Q

Logistic Regression

Scikit-Learn Logistic Regression Implementation

A

Scikit-Learn has a Logistic Regression implementation that fits a model to a set of training data and can classify new or test data points into their respective classes. All important parameters can be specified, such as the norm used in penalization and the solver used in optimization.

11
Q

Logistic Regression

Logistic Regression sigmoid function

A

Logistic Regression models use the sigmoid function to map the log-odds of a data point to a value between 0 and 1, providing a probability for the classification decision. The sigmoid function is widely used in machine learning classification problems because its output can be interpreted as a probability and its derivative is easy to calculate.
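
For illustration, the sigmoid function can be written in a few lines. This is a plain sketch: very large negative inputs would overflow math.exp, which a production implementation would need to handle.

```python
import math


def sigmoid(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


print(sigmoid(0))  # 0.5
```

Positive log-odds give probabilities above 0.5 and negative log-odds give probabilities below 0.5.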

12
Q

Logistic Regression

Classification Threshold definition

A

A Classification Threshold determines the cutoff where the probabilistic output of a machine learning algorithm classifies data samples as belonging to the positive or negative class. A Classification Threshold of 0.5 is a common default, but a particular classification problem may need a fine-tuned threshold in order to improve overall accuracy.

13
Q

Logistic Regression

Logistic Regression interpretability

A

Logistic Regression models have high interpretability compared to most classification algorithms because of their fitted feature coefficients. A feature coefficient can be thought of as a measure of how sensitive the prediction is to changes in the value of that feature.

14
Q

Logistic Regression

Log-Odds calculation

A

The sum of the products of the feature coefficients and feature values (plus the intercept) in a Logistic Regression model is the Log-Odds of a data sample belonging to the positive class. Log-odds can take any real value and are an indirect way to express probabilities.
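
A short sketch of the calculation follows. The function name and the example numbers are made up, and the intercept term is an assumption added here because fitted logistic regression models usually include one.

```python
def log_odds(coefficients, intercept, features):
    """Dot product of coefficients and feature values, plus the intercept."""
    return sum(c * x for c, x in zip(coefficients, features)) + intercept


# Hypothetical model with two features.
z = log_odds([0.5, -1.0], 0.25, [2.0, 1.0])
print(z)  # 0.25
```

A log-odds of 0 corresponds to a probability of 0.5; the value 0.25 here corresponds to a probability slightly above 0.5.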

15
Q

Logistic Regression

Logistic Regression Classifier

A

Logistic Regression is a supervised binary classification algorithm used to predict binary response variables that may indicate the presence or absence of some state. It is possible to extend Logistic Regression to multi-class classification problems by creating several one-vs-all binary classifiers. In a one-vs-all scheme, n - 1 classes are grouped as one and a classifier learns to discriminate the remaining class from the combined group.

16
Q

Logistic Regression

Logistic Regression prediction

A

Logistic Regression models predict the probability of an n-dimensional data point belonging to a specific class by constructing a linear decision boundary. This decision boundary splits the n-dimensional space into two half-spaces. In the prediction stage, the point is classified according to the side of the boundary it falls on, which is the class with the highest probability.

17
Q

Logistic Regression

Logistic Regression cost function

A

The cost function measuring the inaccuracy of a Logistic Regression model across all samples is Log Loss. The lower this value, the more closely the predicted probabilities match the actual labels, which generally means better classification accuracy. Log Loss is also known as Cross Entropy loss.
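
As a rough sketch, binary Log Loss can be computed as the negative average log-likelihood of the true labels. The function name is an assumption, and the code assumes every predicted probability lies strictly between 0 and 1.

```python
import math


def log_loss(y_true, y_prob):
    """Average binary cross-entropy for labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Only one of the two terms is non-zero for each sample.
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


confident = log_loss([1, 0], [0.9, 0.1])
hesitant = log_loss([1, 0], [0.6, 0.4])
print(confident < hesitant)  # True
```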

18
Q

Decision Trees

Information Gain at decision trees

A

When making decision trees, two common criteria are used to find the best feature to split a dataset on: Gini impurity and Information Gain. An intuitive interpretation of Information Gain is that it is a measure of how much information the individual features provide us about the different classes.

19
Q

Decision Trees

Gini impurity

A

When making decision trees, calculating the Gini impurity of a set of data helps determine which feature best splits the data. If a set of data has all of the same labels, the Gini impurity of that set is 0 and the set is considered pure. Gini impurity is a statistical measure: the idea behind its definition is to calculate how often a randomly chosen item would be labeled incorrectly if labels were assigned at random according to the distribution of actual labels in that subset.
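
A minimal sketch of the calculation, using the standard definition of Gini impurity as 1 minus the sum of squared label proportions. The function name is an assumption, and the list of labels is assumed to be non-empty.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared proportions of each label."""
    n = len(labels)
    counts = Counter(labels)
    return 1 - sum((count / n) ** 2 for count in counts.values())


print(gini_impurity(["a", "a", "a"]))  # 0.0
print(gini_impurity(["a", "b"]))       # 0.5
```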

20
Q

Decision Trees

Decision trees leaf creation

A

When making a decision tree, a leaf node is created when no feature results in any information gain. The Scikit-Learn implementation of decision trees allows us to modify the minimum information gain required to split a node. If this threshold is not reached, the node becomes a leaf.

21
Q

Decision Trees

Optimal decision trees

A

Creating an optimal decision tree is a difficult task. For example, the greedy approach of splitting a tree based on the feature that results in the best current information gain doesn’t guarantee an optimal tree. There are numerous heuristics for building good decision trees, and each of these methods proposes its own way to build the tree.

22
Q

Decision Trees

Decision Tree Representation

A

In a decision tree, leaves represent class labels, internal nodes represent a single feature, and the edges of the tree represent possible values of those features.

Unlike many other classifiers, this visual structure gives us great insight into how the algorithm reaches its decisions.

23
Q

Decision Trees

Decision trees pruning

A

Decision trees can be overly complex, which can result in overfitting. A technique called pruning can be used to decrease the size of the tree so that it generalizes better and is more accurate on a test set. Pruning is not an exact method, as it is not clear what the ideal size of the tree should be. It can be done bottom-up (starting at the leaves) or top-down (starting at the root).

24
Q

Decision Trees

Decision Trees Construction

A

Decision Trees are usually constructed from top to bottom. At each level of the tree, the feature that best splits the training set labels is selected as the “question” of that level. Two different criteria are available to split a node: Gini Index and Information Gain. Which one is more suitable depends on the problem.

25
Q

Decision Trees

Random Forest definition

A

A Random Forest Classifier is an ensemble machine learning model that uses multiple unique decision trees to classify unlabeled data. Compared to an individual decision tree, Random Forest is a more robust classifier, but its interpretability is reduced.

26
Q

Decision Trees

Random Forest overfitting

A

Random Forests are used to reduce overfitting. Because the classifications of multiple trees are aggregated, an individual overfitted tree in the random forest has less impact. Reduced overfitting translates to greater generalization capacity, which increases classification accuracy on new unseen data.

27
Q

Decision Trees

Random Forest feature consideration

A

When creating a decision tree in a random forest, only a random subset of features is considered when choosing the best feature to split the data on. Because each split considers a random subset of features, the estimators are trained on different aspects of the data, which reduces the probability of overfitting.

28
Q

Decision Trees

Random Forest aggregative performance

A

A random forest classifier makes its classification by taking an aggregate of the classifications from all the trees in the random forest. For classification, this aggregate is a majority vote. For regression, this could be the average of the predictions of the trees in the random forest. Because the trees can capture complex non-linear relations in the data, a random forest often outperforms a linear model when such relations are present.
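
The two kinds of aggregation can be sketched as below. The function names are hypothetical, and the per-tree predictions are passed in as plain lists rather than produced by real trees.

```python
from collections import Counter


def aggregate_votes(tree_predictions):
    """Classification: return the label predicted by the most trees."""
    # On a tie, Counter returns the label that was encountered first.
    return Counter(tree_predictions).most_common(1)[0][0]


def aggregate_average(tree_predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return sum(tree_predictions) / len(tree_predictions)


print(aggregate_votes(["cat", "dog", "cat"]))  # cat
print(aggregate_average([2.0, 3.0, 4.0]))      # 3.0
```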

29
Q

Decision Trees

Bagging at Random Forest

A

Each tree in a random forest classifier is trained on a random sample drawn from the original dataset with replacement. This process is known as bagging. Bagging helps reduce overfitting, given that each individual tree sees a different sample of the original data.

30
Q

Clustering: K-Means

K-Means: Inertia

A

Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares across all data points in all clusters.

A good model is one with low inertia AND a low number of clusters (K). However, this is a tradeoff because as K increases, inertia decreases.

To find the optimal K for a dataset, use the Elbow method: plot inertia against K and find the point where the decrease in inertia begins to slow. In the graph that accompanies this card, the “elbow” is at K=3.
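
A small sketch of the inertia calculation is given below. It sums over every point in every cluster, which matches how scikit-learn reports inertia_ for a fitted KMeans model; the function name and sample data are assumptions.

```python
def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        # Squared Euclidean distance between the point and its centroid.
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


points = [(0, 0), (0, 2), (10, 10)]
centroids = [(0, 1), (10, 10)]
labels = [0, 0, 1]
print(inertia(points, centroids, labels))  # 2.0
```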

31
Q

Clustering: K-Means

Unsupervised Learning Basics

A

Patterns and structure can be found in unlabeled data using unsupervised learning, an important branch of machine learning. Clustering is the most popular unsupervised learning technique; it groups data points into clusters based on their similarity. Because most datasets in the world are unlabeled, unsupervised learning algorithms are widely applicable.

Possible applications of clustering include:

  • Search engines: grouping news topics and search results
  • Market segmentation: grouping customers based on geography, demographics, and behaviors
32
Q

Clustering: K-Means

K-Means Algorithm: Intro

A

K-Means is the most popular clustering algorithm. It uses an iterative technique to group unlabeled data into K clusters based on cluster centers (centroids). The data in each cluster are chosen such that their average distance to their respective centroid is minimized.

  1. Randomly place K centroids for the initial clusters.
  2. Assign each data point to its nearest centroid.
  3. Update centroid locations based on the locations of the data points.

Repeat Steps 2 and 3 until points don’t move between clusters and centroids stabilize.

33
Q

Clustering: K-Means

K-Means Algorithm: 2nd Step

A

After randomly choosing centroid locations for K-Means, each data sample is allocated to its closest centroid to start creating more precise clusters.

The distance between each data sample and every centroid is calculated, the minimum distance is selected, and each data sample is assigned a label that indicates its closest cluster.

The distance formula is implemented as a distance() function and applied to each data point.

np.argmin() is used to find the minimum distance and find the cluster at that distance.
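
The assignment step could look roughly like this. The helper names distance and assign_labels and the sample arrays are assumptions; only np.argmin() is taken from the description above.

```python
import numpy as np


def distance(a, b):
    """Euclidean distance between two points given as NumPy arrays."""
    return np.sqrt(np.sum((a - b) ** 2))


def assign_labels(samples, centroids):
    """Give each sample the index of its closest centroid."""
    labels = np.zeros(len(samples), dtype=int)
    for i, sample in enumerate(samples):
        distances = [distance(sample, centroid) for centroid in centroids]
        # np.argmin returns the position of the smallest distance.
        labels[i] = np.argmin(distances)
    return labels


samples = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0]])
centroids = np.array([[1.0, 1.5], [9.0, 9.0]])
labels = assign_labels(samples, centroids)
print(labels)  # [0 0 1]
```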

34
Q

Clustering: K-Means

Scikit-Learn Datasets

A

The scikit-learn library contains built-in datasets in its datasets module that are often used in machine learning problems like classification or regression.
Examples:
* Iris dataset (classification)
* Boston house-prices dataset (regression; removed from scikit-learn in version 1.2)
The format of these datasets is important to their use with algorithms. For example, each row in the Iris dataset is a sample (one flower), and each element within a sample is a feature (e.g. petal width).

35
Q

Clustering: K-Means

K-Means Using Scikit-Learn

A

Scikit-Learn, or sklearn, is a machine learning library for Python that has a K-Means algorithm implementation that can be used instead of creating one from scratch.
To use it:
* Import the KMeans class from the sklearn.cluster module and build a model with n_clusters set to the desired K
* Fit the model to the data samples using .fit()
* Predict the cluster that each data sample belongs to using .predict() and store these as labels

36
Q

Clustering: K-Means

Cross Tabulation Overview

A

Cross-tabulations involve grouping pieces of data together in order to examine their relationship in a different way. Sometimes correlations within data can be seen better when not just looking at total responses.

This technique is often performed in Python after running K-Means; the pandas function crosstab() allows for comparison between resulting cluster labels and user-defined labels for each data sample. In order to validate the results of a K-Means model with this technique, there must be user-defined labels for all data samples.

37
Q

Clustering: K-Means

K-Means: Reaching Convergence

A

In K-Means, after placing K random centroids, the data samples are repeatedly assigned to the nearest centroid and then centroid locations are updated. This continues until each of the centroids’ coordinates converge, or stop changing.

This sequence of events can be implemented in Python using a while loop. The loop continues until the difference between each element of the updated centroids and each element of the past centroids_old is 0. This will mean the centroids have converged and the clusters are complete!
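
One way to write that loop is sketched below. The function name k_means, the seed parameter, and the handling of empty clusters (the centroid is left where it is) are assumptions made for this illustration.

```python
import numpy as np


def k_means(samples, k, seed=0):
    """Run K-Means until the centroids stop changing."""
    rng = np.random.default_rng(seed)
    # Step 1: place k random centroids within the range of the data.
    centroids = rng.uniform(samples.min(axis=0), samples.max(axis=0),
                            size=(k, samples.shape[1]))
    centroids_old = np.zeros_like(centroids)
    labels = np.zeros(len(samples), dtype=int)
    # Loop while any centroid coordinate differs from the previous pass.
    while not np.all(centroids == centroids_old):
        centroids_old = centroids.copy()
        # Step 2: assign every sample to its nearest centroid.
        diffs = samples[:, None, :] - centroids[None, :, :]
        labels = np.argmin(np.linalg.norm(diffs, axis=2), axis=1)
        # Step 3: move each centroid to the mean of its assigned samples.
        for j in range(k):
            members = samples[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels


samples = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = k_means(samples, k=2)
print(centroids)
print(labels)
```

The result depends on the random starting centroids, so different seeds can give different clusterings.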

38
Q

Clustering: K-Means

K-Means Algorithm: 3rd Step

A

The third step of K-Means updates centroid locations. After the data are assigned to their respectively closest centroid in step 2, each cluster center location is adjusted to be the average of its assigned data points.

The NumPy .mean() function is used to find the average x and y-coordinates of all data points for each cluster and store these as the new centroid locations.

39
Q

Clustering: K-Means

K-Means Algorithm: 1st Step

A

The first step of the K-Means clustering algorithm requires placing K random centroids, which will become the centers of the K initial clusters. This step can be implemented in Python using the NumPy random.uniform() function; the x and y-coordinates are randomly chosen within the x and y ranges of the data points.

40
Q

Perceptron

Perceptron Bias Term

A

The bias term is an adjustable, numerical term added to a perceptron’s weighted sum of inputs and weights that can increase classification model accuracy.

The addition of the bias term is helpful because it serves as another model parameter (in addition to weights) that can be tuned to make the model’s performance on training data as good as possible.

The default input value for the bias weight is 1 and the weight value is adjustable.

41
Q

Perceptron

Perceptrons as Linear Classifiers

A

At the end of successful training, a perceptron is able to create a linear classifier that separates data samples. It finds this decision boundary by using the linear combination (or weighted sum) of all of the features. The perceptron separates the training data set into two distinct sets of samples, bounded by the linear classifier.

42
Q

Perceptron

Adjusting Perceptron Weights

A

The main goal of a perceptron is to make accurate classifications. To train a model to do this, the perceptron’s weights must be optimized for the specific classification task at hand.

The best weight values can be found by training a perceptron on labeled training data, in which each data sample has an appropriate label. These labels are compared to the outputs of the perceptron and weight adjustments are made. Once this is done, a better classification model is created!

43
Q

Perceptron

Perceptron Weighted Sum

A

The first step in the perceptron classification process is calculating the weighted sum of the perceptron’s inputs and weights.

To do this, multiply each input value by its respective weight and then add all of these products together. This sum gives an appropriate representation of the inputs based on their importance.
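
In code, the weighted sum is a one-line calculation. The function name and the example values are chosen here for illustration.

```python
def weighted_sum(inputs, weights):
    """Multiply each input by its weight and add the products together."""
    return sum(x * w for x, w in zip(inputs, weights))


print(weighted_sum([1, 2, 3], [0.5, 0.5, 1]))  # 4.5
```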

44
Q

Perceptron

Optimizing Perceptron Weights

A

To increase the accuracy of a perceptron’s classifications, its weights need to be slightly adjusted in the direction of a decreasing training error. This will eventually lead to minimized training error and therefore optimized weight values.

Each weight is appropriately updated with this formula:
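
A common form of the perceptron update rule, sketched here as an assumption rather than as the exact formula from the original card, adds the training error multiplied by the input to each weight. The function name and the learning_rate parameter are hypothetical.

```python
def update_weights(weights, inputs, actual, predicted, learning_rate=1.0):
    """Nudge each weight by error * input (classic perceptron rule)."""
    # The error is zero when the prediction is correct, so weights stay put.
    error = actual - predicted
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]


print(update_weights([0.0, 0.0], [1, 2], actual=1, predicted=-1))  # [2.0, 4.0]
```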

45
Q

Perceptron

Introduction to Perceptrons

A

Perceptrons are the building blocks of neural networks. They are artificial models of biological neurons that simulate the task of decision-making. Perceptrons aim to solve binary classification problems given their input.

The basis of the idea of the perceptron is rooted in the words perception (the ability to sense something) and neurons (nerve cells in the human brain that turn sensory input into meaningful information).

46
Q

Perceptron

Perceptron Activation Functions

A

The second step of the perceptron classification process involves an activation function. One of these special functions is applied to the weighted sum of inputs and weights to constrain perceptron output to a value in a certain range, depending on the problem.

Some example ranges are [0,1], [-1,1], [0,100].

The sign activation function is a common activation function that constrains the perceptron output to be either 1 or -1:
* If weighted sum > 0, return 1.
* If weighted sum < 0, return -1.
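
The two rules above translate directly into code. The case where the weighted sum is exactly 0 is not covered by the rules; returning 1 for it is an assumption made in this sketch.

```python
def sign_activation(weighted_sum):
    """Return 1 for a positive weighted sum and -1 for a negative one."""
    if weighted_sum > 0:
        return 1
    if weighted_sum < 0:
        return -1
    # Assumption: a weighted sum of exactly 0 is treated as positive.
    return 1


print(sign_activation(4.5))   # 1
print(sign_activation(-0.3))  # -1
```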

47
Q

Perceptron

Perceptron Training Error

A
48
Q

Perceptron

Perceptron Main Components

A
49
Q

Artificial Intelligence Decision Making: Minimax

Minimax algorithm state value

A

When writing the minimax algorithm, each game involves two players and each game state can be evaluated as a value. One of the players is called the maximizer, because they want to maximize the value of the game; the other player is called the minimizer, because they want to minimize it.

50
Q

Artificial Intelligence Decision Making: Minimax

Minimax algorithm problem specification

A

Given a game state, the minimax algorithm finds the decision that maximizes the minimum gain. In other words, if you assume your opponent will make decisions that minimize your gain, the algorithm finds the move that will maximize it based on the options your opponent gives you. It is assumed that the game is played in turns and that the opponent plays optimally; that is, at each turn a player must make a move, and this move is the best the player can make in that situation.

51
Q

Artificial Intelligence Decision Making: Minimax

Minimax algorithm assumption

A

When running the minimax algorithm, it is assumed that your opponent is playing optimally. This assumption is a worst case scenario, given that if your opponent is not playing optimally the problem is reduced to a simpler one.

52
Q

Artificial Intelligence Decision Making: Minimax

Minimax algorithm game representation

A

When writing the minimax algorithm, a game is modeled as a tree. Different elements of the game (such as the current state and all possible moves) are represented as different parts of the tree. This visual representation of the game is a great aid when implementing the minimax algorithm.

53
Q

Artificial Intelligence Decision Making: Minimax

Minimax algorithm state evaluation

A

When running the minimax algorithm, a game state can be evaluated even if it is not a leaf. This game state evaluation is particularly important in some games such as chess, where we have a long sequence of states before reaching a leaf.

54
Q

Artificial Intelligence Decision Making: Minimax

Minimax algorithm size restriction

A

The size of the game tree is a very important restriction in a minimax algorithm, given that it is often not possible to visit all states in a reasonable time. If the maximum depth you can consider is reduced, the optimality of your solution is affected. When the game tree is very large, several heuristics can be applied to find a good approximate solution.

55
Q

Artificial Intelligence Decision Making: Minimax

Minimax algorithm with alpha-beta pruning

A

Adding alpha-beta pruning to the minimax algorithm can drastically decrease its execution time. In the best case (when moves are examined in a good order), a minimax algorithm with alpha-beta pruning can search about twice as deep in a given amount of time as a minimax algorithm without this pruning technique.

56
Q

Artificial Intelligence Decision Making: Minimax

Alpha-beta pruning variables

A

Alpha-beta pruning in a minimax algorithm tracks two variables, alpha and beta, in order to decide when to prune a part of the tree. At the start of the search, alpha is equal to negative infinity and beta is equal to positive infinity. These values are updated as the search progresses.
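
A compact sketch of minimax with alpha-beta pruning is shown below. It assumes a toy representation in which the game tree is a nested list and every leaf is a number giving the value of that state; the function name and the tree are made up for illustration.

```python
import math


def minimax(node, is_maximizer, alpha=-math.inf, beta=math.inf):
    """Return the minimax value of `node`, pruning branches that cannot matter."""
    # A leaf is a plain number: the value of that game state.
    if not isinstance(node, list):
        return node
    if is_maximizer:
        best = -math.inf
        for child in node:
            best = max(best, minimax(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the minimizer will never allow this branch
        return best
    best = math.inf
    for child in node:
        best = min(best, minimax(child, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the maximizer already has a better option elsewhere
    return best


tree = [[3, 5], [2, 9], [0, 1]]
print(minimax(tree, True))  # 3
```

In this example the leaves 9 and 1 are never evaluated: once the minimizer finds 2 and then 0, those branches cannot beat the 3 the maximizer already has.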