Domain 3: Modeling Flashcards
This is an application-agnostic standard that can be used as a baseline to understand the various phases of the ML workflow.
The Cross Industry Standard Process for Data Mining (CRISP-DM)
ML lifecycle phases
- Identify Problem
- Collect and QC
- Prepare
- Visualization
- Feature Engineering
- Model Training
- Model Evaluation
- Business Workflow Integration
T/F: Business problem identification requires senior leadership buy-in.
True
What is the goal of ML?
To predict the value or class of an unknown quantity using a mathematical model.
Data that the model can use to “learn” from, which consists of independent variables and a dependent variable.
Training data
What makes a good model?
Should be able to generalize what it has learned to unseen data, namely data where the dependent variable is unknown.
What makes a poor model?
One that has simply memorized the training data will have poor generalization performance and therefore will not be usable in a business process.
When a model is shown labeled examples of ground truth values and learns to predict the label based on the input data or features.
Supervised learning
When you do not have labeled data available and you want the model to discover patterns in the unlabeled data.
Unsupervised learning
When a model or agent learns by interacting with its environment - similar to trial-and-error learning, where an agent is given rewards and penalties for actions taken and its aim is to maximize the long-term rewards.
Reinforcement learning
T/F: The data type (whether it is structured or unstructured) does not dictate whether learning is supervised.
True
A type of supervised learning where the label is binary, such as fraud/not fraud, cat/dog, spam/not spam
Binary classification
A type of supervised learning where the label can have more than two classes
Multiclass classification
A type of supervised learning where the label is a continuous number such as a house price
Regression
A form of supervised machine learning where a model predicts a linear relationship between the data and the labels.
Linear models
Used when you have a continuous label (regression task), where the assumption is made that the label is linearly related to the data.
Linear regression
An idea that the label is a linear combination of the input data or feature vectors.
Linearity
Three assumptions that need to be tested before a linear model can be accurately fit to the data.
Linearity, constant variance, features cannot be strongly correlated w/ one another.
This is where one feature can be linearly derived from another; in the most trivial example, the two features are related by a constant.
Multicollinearity
What is often used in machine learning to penalize a model for learning weights that do not generalize well to unseen data, which reduces overall model complexity and helps prevent overfitting?
Regularization
This tends to reduce the values of weights that are unimportant in predicting the labels, where you add an L2 penalty or quadratic penalty to the weights.
Ridge
This tends to shrink the weights to zero, where you add an L1 penalty or absolute value penalty to the weights. It also eliminates unimportant features.
Lasso
This combines ridge and lasso regularization.
Elastic net
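Note: as a rough illustration of the three penalties above, the sketch below computes each penalty term for a weight vector. The function names, the strength `lam`, and the mixing parameter `alpha` are illustrative assumptions, not the API of any particular library.

```python
def ridge_penalty(weights, lam):
    """L2 (quadratic) penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def lasso_penalty(weights, lam):
    """L1 (absolute value) penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def elastic_net_penalty(weights, lam, alpha):
    """Blend of both penalties.

    Assumed convention: alpha=1 is pure lasso, alpha=0 is pure ridge.
    """
    l1 = lasso_penalty(weights, lam)
    l2 = ridge_penalty(weights, lam)
    return alpha * l1 + (1 - alpha) * l2


weights = [3.0, -4.0, 0.0]
print(ridge_penalty(weights, lam=0.5))                   # 0.5 * (9 + 16)
print(lasso_penalty(weights, lam=0.5))                   # 0.5 * (3 + 4)
print(elastic_net_penalty(weights, lam=0.5, alpha=0.5))  # halfway between the two
```

In training, one of these terms would be added to the loss, so larger weights cost more.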
Lasso regression is also known as _____.
Shrinkage
T/F: Often in machine learning, it is not the model but how you engineer features that determines model performance and ultimately business value.
True
The application of linear regression to binary or multiclass classification problems using the logit function.
Logistic regression
T/F: Logistic regression can apply to both binary and multiclass classification problems.
True
The _____ is one of the most common loss functions for classification problems in machine learning irrespective of the underlying algorithm.
Cross-entropy loss
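Note: a minimal sketch of binary cross-entropy, assuming labels are 0 or 1 and the model outputs a probability for class 1. The function name and the clipping constant `eps` are illustrative choices.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so log(0) is never evaluated.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1]
good_predictions = [0.9, 0.1, 0.8]  # confident and correct
bad_predictions = [0.1, 0.9, 0.2]   # confident and wrong

print(binary_cross_entropy(labels, good_predictions))
print(binary_cross_entropy(labels, bad_predictions))
```

The loss is small when the predicted probabilities agree with the labels and grows quickly for confident mistakes.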
Are logistic regression models large?
No, they only store coefficients and can thus be quite small.
Logistic regression often serves as a _____ for model performance.
Benchmark
_____ is used to solve classification problems; _____ is used for regression problems.
Logistic regression/linear regression
What is the built-in algorithm SageMaker has that covers both linear and logistic regression use cases?
Linear learner
What data format does Linear Learner use?
Built using the MXNet framework (recognizes RecordIO data format)
Algorithm also recognizes CSV data
_____ can be used for supervised learning on both classification and regression tasks, and take into consideration cases where the label may also be proportional to interaction terms b/w different independent variables.
Factorization Machines
What do you use when dealing with large sparse data?
Factorization Machines
What method do Factorization Machines work by?
Matrix factorization, also built on top of the MXNet framework and accepts RecordIO format, but not CSV
Which built-in algorithm to use when dealing with recommender systems or item recommendation use cases?
AWS recommends using Factorization Machines for such large sparse matrix use cases
A supervised learning algorithm on structured data that works by first building an index consisting of the distance between any two data points in your dataset; and then, when a new point whose label is unknown is provided, this algorithm calculates the nearest neighbors to that point based on a specified distance metric, and either averages the label values for those k-points in the case of regression or uses the most frequently returned label as the label for classification.
k-Nearest Neighbors
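Note: a minimal sketch of k-NN classification by majority vote. For simplicity it scans every training point on each query instead of building an index first, so it illustrates the voting logic only. The function name and sample data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the most frequent label among the k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, query=(0.5, 0.5)))
print(knn_predict(points, labels, query=(5.5, 5.5)))
```

For regression, the vote would be replaced by the average of the k neighbors' label values.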
How do you train for k-nearest neighbor?
Build an index
_____ corresponds to performing fast lookups against that index.
Inference
- Sample dataset
- Reduce dimensionality
- Assign each vector a cluster
Steps to train a k-nearest model
Logistic regression solves binary classification problems using a loss function known as _____.
Cross-entropy loss
This algorithm is particularly popular in biological fields, and it aims to find the separating hyper-plane that separates two classes by the widest so-called margin. The wider the margin, the better the quality of the algorithm and its ability to generalize.
Support vector machines
How do you generalize to nonlinear situations where the separating boundary may not be linear with support vector machines?
By using the kernel trick
Lets the tree learn when to spawn off new nodes based on the input data.
decision tree learning
Consists of a root node or parent node and spawns off child or leaf nodes based on certain criteria.
decision tree
Uses a metric to decide when it is appropriate to split a parent node into child nodes. Then this rule is recursively applied to the child nodes. Splitting stops when no further gains can be made or some other condition is met.
Classification and Regression Trees (CART)
How does the CART algorithm decide when to split a parent node?
Gini impurity or the entropy metric
A measure of the probability of incorrectly classifying a data point with a particular label.
Gini impurity
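Note: a minimal sketch of Gini impurity using the common definition 1 minus the sum of squared class proportions. The `split_impurity` helper, which weights each child node by its size, is an illustrative assumption about how a candidate split might be scored.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_i^2) over the class proportions p_i in this node."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def split_impurity(left_labels, right_labels):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left_labels) + len(right_labels)
    left_weight = len(left_labels) / n
    right_weight = len(right_labels) / n
    return (left_weight * gini_impurity(left_labels)
            + right_weight * gini_impurity(right_labels))


print(gini_impurity(["a", "a", "a", "a"]))         # pure node
print(gini_impurity(["a", "a", "b", "b"]))         # evenly mixed node
print(split_impurity(["a", "a"], ["b", "b"]))      # split that separates the classes
```

A split is attractive when the weighted impurity of the children is lower than the impurity of the parent.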
_____ operates by using a greedy algorithm to select which input variables to split on and for that input variable, all different split points are evaluated for the Gini impurity.
CART
_____ takes the same ideas behind decision trees, namely the CART algorithm, but instead of bagging, uses a technique called boosting.
XGBoost
_____ refers to sequential learning where each subsequent tree aims to correct the errors made by its predecessor, which can also prevent overfitting, as each individual tree can be a so-called weak learner or a shallow tree, but collectively, they can become a strong learner.
Boosting
Popular boosting algorithms
AdaBoost and LogitBoost, which are examples of gradient boosting
_____ refers to the ability to treat the error terms as continuous variables and to use Taylor’s expansion to expand them in terms of their gradients or derivatives.
Gradient boosting
A key benefit of XGBoost is its ability to _____, which can occur for common machine learning problems such as fraud detection.
scale to very large datasets
T/F: SageMaker offers a built-in XGBoost algorithm
True
The _____ is a popular and efficient open-source implementation of the gradient boosted trees algorithm
XGBoost (eXtreme Gradient Boosting)
Use XGBoost as a _____ to run your customized training scripts that can incorporate additional data processing into your training jobs.
framework
Use the XGBoost built-in algorithm to _____.
build an XGBoost training container
T/F: Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features.
True
_____ can be used when your data does not have labels but when you are looking to cluster your data points into “similar” groups.
k-means clustering
What are these steps:
1. identifying a random set of k points as the cluster centers
2. for each of the k centers, find a subset of points from the data that are closest to this center using a distance metric such as Euclidean distance
3. define the new centroid as the mean vector of all these points
4. repeatedly perform these steps until the algorithm converges, that is, the cluster centers do not move past a certain threshold.
how k-means training works
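Note: the four steps above can be sketched in plain Python as follows. The function name, the fixed random seed, the tolerance `tol`, and the handling of empty clusters (keep the old center) are assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k random data points as the initial cluster centers.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Step 2: assign each point to its closest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean of its assigned points.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[i])  # empty cluster: keep old center
        # Step 4: stop once no center moved more than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

On this toy data the two centers settle near the means of the two well-separated groups.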
_____ is used when you have a continuous label (regression task), where the assumption is made that the label is linearly related to the data.
Linear regression
_____ models are powerful because they are easy to interpret, but the model makes multiple assumptions that need to be tested before a linear model can be accurately fit to the data.
Linear regression
- Linearity - the label is a linear combination of the input data or feature vectors.
- Constant variance - the statistical variance in the label is identical, regardless of the value of the input data.
- Features cannot be strongly correlated with one another
Linear regression assumptions
_____ are designed to reduce decision tree overfitting by creating a collection of decision trees.
Random forests
A _____ works by building many trees, where each tree is trained on a bootstrap sample of the data (drawn with replacement, a method known as bootstrap aggregation or bagging) and considers only a random subset of the input features.
random forest
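Note: a minimal sketch of the two sampling ideas a random forest typically relies on: drawing rows with replacement (bagging) and choosing a random subset of features. The helper names and the toy rows are hypothetical, and no trees are actually trained here.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random WITH replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, n_selected, rng):
    """Pick a random subset of feature indices, without replacement."""
    return sorted(rng.sample(range(n_features), n_selected))


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = ["r0", "r1", "r2", "r3", "r4"]

sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=6, n_selected=2, rng=rng)

print(sample)    # same length as rows; some rows may repeat, others may be missing
print(features)  # two distinct feature indices between 0 and 5
```

Each tree in the forest would get its own sample, which is why the trees can be trained in parallel.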
Increase the minimum samples per leaf but decrease the maximum depth of the trees.
How to avoid overfitting
Pro: training multiple trees in parallel.
Con: different trees do not work together to reduce the overall errors.
Random forests
_____ are deep learning algorithms consisting of alternating convolutional layers, which apply various filters on the input data to capture different information at different scales, followed by pooling layers, which reduce the number of parameters in the network and also the spatial size of the representation.
CNNs
_____ have the ability to retain a user’s session history information as part of the model training.
Recurrent neural networks (RNNs)
_____ refers to taking a model that was pretrained on one dataset, freezing the initial layers, and letting it relearn the last few layers of the model on a different dataset.
Transfer learning
T/F: It is hard for an ML model to understand contextual information
True
_____ is a service that you can use to label your image, text, audio, or even tabular data; and it lets you outsource the labeling task to a public workforce (via Amazon Mechanical Turk) or a private workforce (either a third-party labeling company or your own private workforce within your organization) to label data.
Amazon SageMaker Ground Truth
How do you determine if the model is overfitting/underfitting your data?
Comparing the performance of your model against the training/validation datasets
An _____ is a function or an algorithm that adjusts the attributes of the neural network, such as weights and learning rates. Thus, it helps in reducing the overall loss and improving accuracy.
optimizer
_____ is an optimization algorithm for finding a local minimum of a differentiable function, and in machine learning, it is simply used to find the values of a function’s parameters (coefficients) that minimize a cost function as far as possible.
Gradient Descent
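Note: a minimal sketch of gradient descent on a one-dimensional function. The cost function f(x) = (x - 3)^2, the starting point, the learning rate, and the step count are all illustrative assumptions.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the cost."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# Cost f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)  # very close to 3
```

A smaller learning rate takes more steps to get close to the minimum, while a rate that is too large can overshoot and fail to converge.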
In machine learning (ML), a _____ is used to measure model performance by calculating the deviation of a model’s predictions from the correct, “ground truth” values.
loss function
_____ are places where the function attains its smallest value in a neighborhood of a point.
Local minima
_____ in machine learning refers to the point where the model’s predictions stop improving, or the error rate becomes constant
Convergence
The _____ is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.
batch size
_____ is a measure of the likelihood of an event to occur.
Probability
Which is better for handling the massively distributed computational processes required for ML: GPU or CPU?
GPU
Which system is better for ML? Distributed or Non-Distributed?
Distributed, b/c it handles large volumes of data and provides fault tolerance (if one node goes down, the others are still up)
Which is best for ML, Spark or Non-Spark?
Spark, b/c it allows us to analyze and understand complex data sets that were previously considered too difficult to work with.
What is a trigger for model retraining?
Drift
_____ is fundamental to ensure that a machine learning model is constantly providing the most up-to-date predictions, while minimizing manual interventions and optimizing for monitoring and reliability. Can happen on a schedule or be triggered by an event.
Retraining
_____ involves lifting and shifting the batch training code defined at development time into an automated workflow.
Model retraining
T/F: You should abstract feature selection, model parameters, and other configurable pipeline parameters as input variables of the retraining pipeline.
True
When you have highly correlated features in your data, to prevent linear regression models from becoming unusable, use this to penalize the model from learning weights that do not generalize well to unseen data.
Regularization
What are three common forms of regularization?
Ridge (add L2 penalty or quadratic penalty to weights)
Lasso (aka, shrinkage: add L1 penalty or absolute value penalty to weights)
Elastic net (combines the two)
_____ is a technique used in machine learning to evaluate the performance of a model on unseen data. It involves dividing the available data into multiple folds or subsets, using one of these folds as a validation set, and training the model on the remaining folds.
Cross validation
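Note: a minimal sketch of how k-fold cross validation might split sample indices into training and validation sets. The function name is hypothetical, and the data is not shuffled here, which real workflows usually do first.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, validation))
        start += size
    return splits


for train, validation in k_fold_indices(n_samples=10, k=3):
    print("train:", train, "validation:", validation)
```

Each sample lands in the validation set exactly once, and the model's scores on the k validation folds are then averaged.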
The main purpose of cross validation is to prevent _____, which occurs when a model is trained too well on the training data and performs poorly on new, unseen data.
Overfitting
_____ is a procedure to set the weights of a neural network to small random values that define the starting point for the optimization (learning or training) of the neural network model.
Weight initialization
What does every neural network consist of?
Layers of nodes (artificial neurons)
Input layer
1 or more hidden layers
Output layer
_____ allow us to classify and cluster data at a high velocity
Neural networks
The _____ is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.
learning rate
_____ influences to what extent newly acquired information overrides old information and metaphorically represents the speed at which a machine learning model “learns”.
Learning rate
_____ decides whether a neuron should be activated or not, which means that it will decide whether the neuron’s input to the network is important or not in the process of prediction using simpler mathematical operations.
Activation function
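Note: two widely used activation functions, sketched in plain Python as an illustration. The card above does not name specific functions, so choosing sigmoid and ReLU here is an assumption.

```python
import math


def sigmoid(x):
    """Squash any real input into the range (0, 1).

    Note: math.exp can overflow for very large negative x in this naive form.
    """
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Pass positive inputs through unchanged; output 0 otherwise."""
    return max(0.0, x)


for value in [-2.0, 0.0, 2.0]:
    print(value, sigmoid(value), relu(value))
```

ReLU effectively switches a neuron off for negative inputs, while sigmoid gives a smooth value between 0 and 1.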
_____ use a decision tree to represent how different input variables can be used to predict a target value, and they’re used for both classification and regression problems.
Tree-based models
This is the building block for many complex machine learning algorithms, including deep neural networks, and it predicts the target variable using a linear function of the input features.
Linear models
What techniques help avoid overfitting and underfitting?
Feature engineering, regularization, ensemble learning, and cross-validation
The _____ represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.
area under the ROC curve (AUC)
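Note: the ranking interpretation above can be computed directly by comparing every positive example with every negative example. This brute-force sketch assumes labels are 0 or 1 and counts ties as half a win; the function name is hypothetical.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc_by_ranking(labels, scores))  # 3 of the 4 pairs are ranked correctly
```

A value of 1.0 means every positive outranks every negative, and 0.5 corresponds to random ranking.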
_____ is the proportion of all classifications that were correct, whether positive or negative. It is mathematically defined as correct classification/total classification
Accuracy
_____ is the proportion of all the model’s positive classifications that are actually positive. It is mathematically defined as correctly classified actual positives/everything classified as positive.
Precision
_____ improves as false positives decrease, while recall improves when false negatives decrease.
Precision
The true positive rate (TPR), or the proportion of all actual positives that were classified correctly as positives, is also known as ____, which is defined as correctly classified actual positives/all actual positives.
Recall
_____ is commonly used in machine learning as it gives a relatively high weight to large errors, which means it should be more useful when large errors are particularly undesirable. It is also valuable because it retains the same units as the target variable, making it easier to interpret.
RMSE
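Note: a minimal sketch of root mean squared error, with mean absolute error (MAE) included only as a point of comparison. The function names and toy values are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the average squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Average absolute error, shown for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [0.0, 0.0, 0.0, 0.0]
spread_errors = [1.0, 1.0, 1.0, 1.0]  # four small errors
one_big_error = [0.0, 0.0, 0.0, 4.0]  # one large error

print(mae(actual, spread_errors), rmse(actual, spread_errors))
print(mae(actual, one_big_error), rmse(actual, one_big_error))
```

Both prediction sets have the same MAE, but the set with one large error has the higher RMSE, which is the sense in which RMSE weights large errors more heavily.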
The percentage of positive predictions when the true value is negative, i.e., FP / (FP + TN).
False Positive Rate (FPR)
The harmonic mean of precision and recall
F1 Score
A _____ is used to measure the performance of a classifier in depth and the accuracy of a classification model.
confusion matrix
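Note: the metrics on the preceding cards can all be computed from the four counts of a binary confusion matrix. This sketch assumes every denominator is nonzero; the function name and the example counts are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from true/false positive/negative counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)           # correct positives / predicted positives
    recall = tp / (tp + fn)              # true positive rate (TPR)
    false_positive_rate = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fpr": false_positive_rate,
        "f1": f1,
    }


metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Lowering `fp` raises precision, and lowering `fn` raises recall, matching the cards above.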
_____ is the process of measuring the quality and effectiveness of a machine learning model based on its interaction with real users and data in a live system.
Online evaluation
_____ is the process of measuring the quality and effectiveness of a machine learning model based on historical or simulated data and metrics.
Offline evaluation
T/F: Offline evaluation is usually faster, cheaper, and easier to perform than online evaluation, but it may not capture the true behavior and preferences of the users, the dynamics of the data, or the impact of the model on the system. Online evaluation can provide more realistic and actionable feedback, but it may also be more costly, risky, and complex to conduct.
True
_____ is an optimization technique often used to understand how an altered variable affects audience or user engagement. It’s a common method used in marketing, web design, product development, and user experience design to improve campaigns and goal conversion rates.
A/B testing
What are some metrics used to compare models?
Time to train, quality, and engineering costs