Module 10: Machine Learning Pt. 1 Flashcards
Explain what machine learning is and how it differs from stats
In theory, we would like to use machine learning models to predict outcomes so we can make better business decisions. Machine learning models nowadays are ubiquitous:
- Determine a diagnosis based on features and test results
- Develop an application to decide whether to grant a loan by analyzing input values
- Predict product consumption given features such as promotions, advertising, socio-demographics, and concurrent activity
- And a lot more
Machine learning is the business of predicting outcomes based on a set of features. The outcome measurements are usually quantitative (a ranking, a price, etc.) or categorical (yes or no, a risk level, etc.). A classic definition of machine learning is: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
- E: experience of playing many checker games
- T: playing checkers
- P: winning the checkers game
There is a difference between ML and statistics. In statistics, the typical workflow is retrospective: create summaries of existing data and derive insights from them. ML is different: the focus is on building a predictive model and evaluating its predictive power. This is where the train-test split comes in – you train your model on existing data, then test it on new, unseen data to evaluate how well it predicts.
A key process in machine learning is feature engineering. Each column of the data is a feature of the model, and a model can have dozens of dimensions. The point of feature engineering is to make sure these features are useful in capturing what you want to predict and are in a proper format. Improperly formatted datetime stamps, for example, will be "garbage in, garbage out."
Predictor and target: the learning model
{x(i), y(i)}
x(i) represents a set of input variables, also called features or predictors; y(i) is the output, or target, variable.
Most problems require multiple variables, so x(i) is a vector (a set of values). Every (x(i), y(i)) pair in a dataset is called a training example. A 1,000-square-foot house priced at $200,000, for example, is one observation, or training example. Ultimately the goal is
y = F(x)
The function F(x) will be used to predict the corresponding value y. For any new house with a known size, you can plug that size into F(x) to estimate its price. The goal of machine learning is to find the best F(x) so that you can accurately predict house prices.
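To make y = F(x) concrete, here is a minimal sketch with a made-up pricing rule (the $50,000 base price and $150-per-square-foot rate are purely illustrative, not values from the course data):
# A hypothetical F(x) for house prices: an intercept plus a per-square-foot rate.
def predict_price(sqft):
    return 50_000 + 150 * sqft  # made-up coefficients for illustration
print(predict_price(1000))  # 200000: the estimated price for a 1,000 sq ft house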
Supervised and unsupervised learning
In supervised learning, the output, or target values, are given in the training dataset. When we are predicting stock prices, for example, we have a dataset of N examples of stock prices versus company performance measures; here is an example pair:
(company performance measures, economic data) (stock price)
This is called supervised learning because the learning process operates as though under supervision – a set of given examples of outcomes shows what the correct output should look like.
In unsupervised learning, the problem is approached with little or no idea of what the results should look like: the data isn't labeled, or there is no target variable. The aim is to find structural patterns in the data by grouping, or clustering, the data points.
Clustering is based on the relationships among the variables in the data. Data points that belong to the same cluster are close to each other, while data points that belong to different clusters are far apart.
An example of unsupervised learning is the handwritten digits example. Thousands of scans of handwritten digits were collected, and the unsupervised program clustered the scans into 10 groups, where each group corresponded to a digit. This is unsupervised because the scans were not labeled and there was no prior knowledge that a certain pattern of pixel brightness would correspond to a certain digit.
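A minimal sketch of this idea, assuming scikit-learn's bundled digits dataset as a stand-in for the scans (KMeans is one clustering algorithm; the course example may have used a different one):
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
digits = load_digits()                      # 8x8 pixel scans flattened into 64 features; no labels are used
kmeans = KMeans(n_clusters=10, n_init=10)   # ask for 10 groups, one per digit
clusters = kmeans.fit_predict(digits.data)  # each scan is assigned to one of the 10 clusters
print(clusters[:20])                        # cluster IDs, not digit labels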
Other examples of unsupervised learning include:
- Visualization algorithms that produce 2- or 3-dimensional representations from complex, unlabeled data; their outputs can be plotted (a short sketch follows this list)
- Dimensionality reduction algorithms that simplify data without losing too much information
- Anomaly detection algorithms that detect outliers, i.e., observations that are very different from the rest of the dataset
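A minimal sketch of dimensionality reduction for visualization, assuming the same digits data as above and using PCA (an illustrative choice of algorithm, not one named in these notes):
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
digits = load_digits()
pca = PCA(n_components=2)                # compress 64 pixel features into 2 dimensions
coords = pca.fit_transform(digits.data)  # each scan becomes an (x, y) point that can be plotted
plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("Digits projected onto 2 principal components")
plt.show()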
Supervised learning: regression and classification
You may have noticed that some target values are numeric (continuous), such as house prices. In that case we are trying to obtain a continuous function F(x) that maps the input variables to a continuous output. This category of supervised learning problems is called regression. Other supervised learning problems involve predicting a discrete output, also known as a categorical response. This type of learning task is called classification, which is different from clustering: clustering is unsupervised, while in classification the output is known in the training dataset and the goal is to learn the function F(x) that maps input variables into discrete categories. In the loan application problem, the output is a grant status of yes or no, which can be encoded as 1 or 0.
This course will cover the most common supervised learning algorithms:
- Linear regression
- Logistic regression
- K-nearest neighbors
- Support vector machines
- Decision trees and random forests
Linear regression:
An algorithm for finding the linear relationship between predictors and responses. It is applicable when the response is a numeric variable. Linear regression models the relationship between a scalar dependent variable (output) and one or more independent variables (inputs). The objective is to find the line of best fit.
y = w_0 + \sum_{i=1}^{L} w_i x_i
- x_i are the features
- w_i are the coefficients, or weights, of each feature
- w_0 is the intercept, i.e., the value of the response when all x_i are equal to 0 (a short fitting sketch follows this list)
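A minimal fitting sketch for the formula above, using scikit-learn's LinearRegression on made-up house data (the sizes and prices are invented for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1000], [1500], [2000], [2500]])      # x_1: house size in square feet
y = np.array([200_000, 275_000, 350_000, 425_000])  # y: price
model = LinearRegression().fit(X, y)
print(model.intercept_)  # w_0
print(model.coef_)       # w_1, ..., w_L (here just w_1)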
Interpreting regression results: r-squared, p-value, standard error of the mean
The summary() method gives a full statistical evaluation of the regression. We will cover these parameters in the next course; for now, we will focus on the second block of the summary output (a short sketch of obtaining this output follows the list), which contains:
- R-squared: what percentage of variability in Y is explained by the model. Will be between 0-1. If 1, the model perfectly explains Y, while 0 means no linear relationship.
- P-values indicate statistical significance: how likely it is that your results are due to sampling error, and how replicable they would be in the overall population. Anything below 0.05 is conventionally considered statistically significant.
- Standard error of the mean quantifies how much the sample mean is expected to vary from the population mean if you were to take multiple samples from the population
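A minimal sketch of producing this summary output with statsmodels' OLS, on made-up data (the course example uses the class dataset instead):
import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 100)  # a linear relationship plus noise
X = sm.add_constant(x)        # adds the intercept term w_0
results = sm.OLS(y, X).fit()  # ordinary least squares fit
print(results.summary())      # includes R-squared, coefficient p-values, and standard errors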
Evaluating the regression model - mean square error and gradient descent
How did the chosen algorithm perform this fit, and how were parameters like the slope and intercept calculated? A performance measure must be selected for the machine learning algorithm; a typical performance measure for a regression algorithm is the Mean Square Error (MSE).
For each observation, the residual, which is the difference between the predicted value and response is calculated. The mean square error, also known as the cost function, is an average of the squared residuals.
Linear regression models are often fitted by minimizing the MSE. This is known as the least squares approach; we used it in the example above when the OLS method from the statsmodels library was called. Another useful metric is the absolute error: in a similar fashion, the mean absolute error is defined as the average of the absolute values of the residuals.
For the best fit, the smallest possible mean square error is expected, but we should not expect zero for a real dataset (that would mean all observations fall exactly on a straight line). The computed RMSE should be compared with the spread of the data. The graph shows that the observed values are scattered across a wide range between 0 and 700, so an RMSE of 133.067 seems reasonable. Compare it to the standard deviation: if the RMSE is smaller, the model guesses better than the original spread of the data; if it is larger, the model is not good at predicting the value.
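A minimal sketch of computing MSE, RMSE, and MAE and comparing the RMSE to the spread of the data; the y_true and y_pred arrays below are placeholders, not the course data:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error
y_true = np.array([120.0, 300.0, 560.0, 80.0, 640.0])   # observed responses (placeholder values)
y_pred = np.array([150.0, 280.0, 500.0, 130.0, 700.0])  # model predictions (placeholder values)
mse = mean_squared_error(y_true, y_pred)   # average of squared residuals
rmse = np.sqrt(mse)                        # root mean square error, in the same units as y
mae = mean_absolute_error(y_true, y_pred)  # average of absolute residuals
print(mse, rmse, mae)
print(np.std(y_true))  # an RMSE well below this spread suggests a useful fit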
Gradient descent:
Generally, a function that measures how well the model describes the data is called a cost function. For linear regression, the cost function measures the distance between the model's predictions and the training instances. The mean square error and the mean absolute error are examples of cost functions.
To find the best fit, the cost function must be minimized. In other words, we need to find the coefficient values that minimize the cost function.
Gradient descent algorithms provide methods for calculating the coefficients that minimize the cost function.
In gradient descent, the coefficients are changed iteratively by small steps in order to arrive at the minimum of the cost function. At each iteration, the derivative of the cost function is calculated to find the direction of the next step, so that the algorithm descends toward the minimum. There are two main versions of gradient descent (a short sketch of the first follows the list):
1. Batch gradient descent: on each iteration, the gradient is computed from the residuals of all observations. For large datasets this can be very slow.
2. Stochastic gradient descent: on each iteration, the gradient is computed from only one randomly sampled observation from the dataset. This makes the algorithm much faster and suitable for very large datasets.
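A minimal sketch of batch gradient descent for simple linear regression with an MSE cost, on made-up data (the learning rate and iteration count are arbitrary illustration values):
import numpy as np
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 4.0 + 2.5 * x + rng.normal(0, 1, 200)  # true intercept 4.0, true slope 2.5
w0, w1 = 0.0, 0.0
learning_rate = 0.01
for _ in range(2000):
    y_hat = w0 + w1 * x  # current predictions
    error = y_hat - y    # residuals
    grad_w0 = 2 * error.mean()        # derivative of MSE w.r.t. w0, averaged over all observations
    grad_w1 = 2 * (error * x).mean()  # derivative of MSE w.r.t. w1
    w0 -= learning_rate * grad_w0     # small step against the gradient
    w1 -= learning_rate * grad_w1
print(w0, w1)  # should end up close to 4.0 and 2.5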
K-nearest neighbors
kNN is an algorithm most widely used for classification: it analyzes the entire dataset and determines the class of a new data point based on similarity measures (e.g., distance functions). Classification is done by a majority vote of the point's neighbors: a new data point is assigned to the class that has the most nearest neighbors around it. K is the number of neighbors that the algorithm considers in the calculation. Usually k is a small positive integer; if k = 1, the object is simply assigned to the class of its single nearest neighbor.
In this example we are trying to determine the class of the green dot: should it be classified as a blue square or a red triangle? If we use k = 3, we find the 3 existing objects closest to the green dot, indicated by the solid-line circle. There are two red triangles and one blue square, so the green dot is classified as a red triangle. If we use k = 5, we extend to the dotted-line circle, and the green dot instead becomes a blue square.
The scikit-learn library has KNeighborsClassifier, which you will use in one of the assignments.
The kNN algorithm looks at distances between feature values. This means that if features use different scales (e.g., one is binary and another runs from 0 to 1000), the distance calculation will be dominated by the larger variation of the big-number feature, so some features will be more influential than others. For this reason, you want to standardize the features (for example, by subtracting the mean and dividing by the standard deviation) so they all carry equal weight.
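A minimal sketch of scaling features before kNN; the two-feature toy data below (one binary column, one 0-1000 column) is invented to show why standardization matters:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
X = np.array([[0, 900], [1, 50], [0, 850], [1, 100], [0, 950], [1, 30]])  # binary feature, 0-1000 feature
y = np.array([1, 0, 1, 0, 1, 0])
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))  # scale, then vote among 3 neighbors
knn.fit(X, y)
print(knn.predict([[1, 80]]))  # after scaling, both features contribute comparably to the distance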
Example: selecting X and y, train-test split
Select X and y:
- X is our features (predictor variables or input for the ML model), and y is the label of the data (i.e. the training target for the ML model).
- In this case, I’m choosing the binary column “high_count” as y, which contains values of either True or False. So, this is a binary classification problem.
- We can include any combinations of features we want as X. But don’t include features that won’t make sense to the model (like “datetime”) or features that will leak information about our training target (like “count”, “high_count”, “casual” and “registered”).
Note: we use the capital letter X to indicate it’s 2-D, and we use the small letter y to indicate it’s 1-D.
print(df.columns.to_list())
['datetime', 'season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count', 'high_count', 'hour']
X = df[['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed', 'hour']]
y = df["high_count"]
print(X.shape)
print(y.shape)
(10886, 9)
(10886,)
Train-test split:
I’m going to use 80% of the data for training and 20% for testing.
The ratio is usually between 70:30 and 80:20.
(Note: in production, the features fed to the model must match the features used in training)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(8708, 9)
(2178, 9)
(8708,)
(2178,)
Random forest for classification
compare y_pred to our actual y_test
Random Forest is an ensemble learning method based on decision trees (you have to take the ML course if you want to know what those terms mean and why it works).
Training:
%%time
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=200, n_jobs=-1) #create an instance of the model
rf_clf.fit(X_train, y_train) #train the model on the training set
CPU times: user 2.75 s, sys: 104 ms, total: 2.85 s
Wall time: 1.88 s
RandomForestClassifier
RandomForestClassifier(n_estimators=200, n_jobs=-1)
Testing:
y_pred = rf_clf.predict(X_test) #make predictions for the test set
print(y_pred.shape)
y_pred #these are the predictions made by the model
(2178,)
array([False, False, False, …, True, False, False])
(y_pred == y_test).sum()
1935
Let’s calculate the accuracy:
# two ways to calculate accuracy
# noob way:
print((y_pred == y_test).sum()/len(y_test))
# pro way:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.8884297520661157
0.8884297520661157
That means it's about 89% accurate.
With accuracy, though, it is important to recognize something called data imbalance. For example, if only 0.1% of the true population has cancer and your model simply predicts that no one has cancer, it will still be right 99.9% of the time. Because of the imbalance, the model can score well just by predicting the most likely outcome. So we have to check for imbalance:
y_test.value_counts()
high_count
False 1341
True 837
Name: count, dtype: int64
y_test.value_counts()[0]/y_test.value_counts().sum()
0.6157024793388429
Always predicting the majority class would be about 61.6% accurate, so even though our data is imbalanced, the model is doing better than purely guessing the most common outcome.
Another way to illustrate the results is with ConfusionMatrixDisplay:
#visualize classification results with confusion matrix
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
with plt.style.context("default"):  #temporarily set the plot style
    disp = ConfusionMatrixDisplay.from_estimator(
        rf_clf,
        X_test,
        y_test,
        display_labels=[False, True],
        cmap=plt.cm.Blues,
        normalize=None,
    )
    disp.ax_.set_title("Confusion matrix, without normalization")
This shows you, for each true label (False or True), what your model predicted, so you can detect false positives and false negatives.
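As a follow-up, precision and recall summarize those false positives and false negatives in one table; this assumes the y_test and y_pred arrays from the cells above:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=["False", "True"]))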
Feature importance
Feature importance is available from decision-tree-based models in sklearn. More important features give the model more information for making predictions. Let's take a look at the most important features for our model.
Note: feature importance is NOT correlation, so don’t interpret it that way!
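A minimal sketch of inspecting the importances, assuming the fitted rf_clf and the feature DataFrame X from above:
import pandas as pd
importances = pd.Series(rf_clf.feature_importances_, index=X.columns)  # one score per feature, summing to 1
print(importances.sort_values(ascending=False))                        # most important features first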