Domain 3: Modeling Flashcards
This is an application agnostic standard that can be used as a baseline to understand the various phases of the ML workflow.
The Cross Industry Standard Process for Data Mining (CRISP-DM)
ML lifecycle phases
- Identify Problem
- Collect and QC
- Prepare
- Visualization
- Feature Engineering
- Model Training
- Model Evaluation
- Business Workflow Integration
T/F: Business problem identification requires senior leadership buy-in.
True
What is the goal of ML?
To predict the value or class of an unknown quantity using a mathematical model.
Data that the model can use to “learn” from, which consists of independent variables and a dependent variable.
Training data
What makes a good model?
Should be able to generalize what it has learned to unseen data, namely data where the dependent variable is unknown.
What makes a poor model?
One that has simply memorized the training data will have poor generalization performance and therefore will not be usable in a business process.
When a model is shown labeled examples of ground truth values and learns to predict the label based on the input data or features.
Supervised learning
When you do not have labeled data available and you want the model to discover patterns in the unlabeled data.
Unsupervised learning
When a model or agent learns by interacting with its environment - similar to trial-and-error learning, where an agent is given rewards and penalties for actions taken and its aim is to maximize the long-term rewards.
Reinforcement learning
T/F: The data type (whether it is structured or unstructured) does not dictate whether learning is supervised.
True
A type of supervised learning where the label is binary, such as fraud/not fraud, cat/dog, spam/not spam
Binary classification
A type of supervised learning where the label can have more than two classes
Multiclass classification
A type of supervised learning where the label is a continuous number such as a house price
Regression
A form of supervised machine learning where a model predicts a linear relationship between the data and the labels.
Linear models
Used when you have a continuous label (regression task), where the assumption is made that the label is linearly related to the data.
Linear regression
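A minimal sketch of the idea, assuming a single feature: ordinary least squares fit with the closed-form slope and intercept formulas (pure Python, no libraries).

```python
# Minimal sketch: ordinary least squares for one feature,
# fit with the closed-form slope/intercept formulas.

def fit_linear(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy example where the label is exactly linear in the feature.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # y = 2x + 1
slope, intercept = fit_linear(xs, ys)
print(slope, intercept)            # 2.0 1.0
```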
The idea that the label is a linear combination of the input data or feature vectors.
Linearity
Assumptions that need to be tested before a linear model can be accurately fit to the data.
Linearity, constant variance, and features not strongly correlated with one another.
This is where one feature can be linearly derived from another; in the most trivial example, they are related by a constant.
Multicollinearity
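One simple way to spot the trivial case is to check pairwise Pearson correlation between features. A minimal pure-Python sketch:

```python
# Minimal sketch: flag strongly correlated feature pairs using
# the Pearson correlation coefficient.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Trivial multicollinearity: feature_b is feature_a scaled by a constant.
feature_a = [1.0, 2.0, 3.0, 4.0]
feature_b = [2.0, 4.0, 6.0, 8.0]
print(pearson(feature_a, feature_b))  # 1.0 — perfectly correlated; drop one
```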
What is often used in machine learning to penalize the model for learning weights that do not generalize well to unseen data, reducing overall model complexity and preventing overfitting?
Regularization
This tends to reduce the values of weights that are unimportant in predicting the labels by adding an L2 (quadratic) penalty to the weights.
Ridge
This tends to shrink unimportant weights to zero by adding an L1 (absolute value) penalty to the weights, effectively eliminating unimportant features.
Lasso
This combines ridge and lasso regularization.
Elastic net
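The contrast between the two penalties can be shown in a minimal sketch, assuming a single feature and no intercept so both weights have closed forms (ridge: divide by the penalized sum of squares; lasso: soft thresholding). The penalty values below are made up for illustration.

```python
# Minimal sketch (one feature, no intercept): ridge shrinks the
# weight toward zero, while lasso can drive it exactly to zero.

def ridge_weight(xs, ys, lam):
    # argmin_w  sum (y - w x)^2 + lam * w^2
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_weight(xs, ys, lam):
    # argmin_w  sum (y - w x)^2 + lam * |w|  (soft thresholding)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return (1 if sxy >= 0 else -1) * shrunk / sxx

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]                # OLS weight is exactly 1.0
print(ridge_weight(xs, ys, 0.0))    # 1.0 — no penalty, plain OLS
print(ridge_weight(xs, ys, 14.0))   # 0.5 — shrunk, but still nonzero
print(lasso_weight(xs, ys, 100.0))  # 0.0 — feature eliminated entirely
```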
Lasso regression is also known as _____.
Shrinkage
T/F: Often in machine learning, it is not the model but how you engineer features that determines model performance and ultimately business value.
True
The application of linear regression to binary or multiclass classification problems using the logit function.
Logistic regression
T/F: Logistic regression can apply to both binary and multiclass classification problems.
True
The _____ is one of the most common loss functions for classification problems in machine learning irrespective of the underlying algorithm.
Cross-entropy loss
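A minimal sketch of computing binary cross-entropy over predicted probabilities (the `eps` clipping is a common guard against `log(0)`, not part of the definition):

```python
# Minimal sketch: binary cross-entropy loss averaged over examples,
# with probabilities clipped to avoid log(0).
import math

def cross_entropy(y_true, y_prob, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident correct predictions give low loss; wrong ones give high loss.
print(cross_entropy([1, 0], [0.9, 0.1]))   # ~0.105
print(cross_entropy([1, 0], [0.1, 0.9]))   # ~2.303
```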
Are logistic regression models large?
No, they only store coefficients and can thus be quite small.
Logistic regression often serves as a _____for model performance.
Benchmark
_____ is used to solve classification problems; _____ is used for regression problems.
Logistic regression/linear regression
What is the built-in algorithm SageMaker has that covers both linear and logistic regression use cases?
Linear learner
What data format does Linear Learner use?
Built on the MXNet framework, it recognizes the RecordIO data format and also accepts CSV.
_____ can be used for supervised learning on both classification and regression tasks, and accounts for cases where the label may also be proportional to interaction terms between different independent variables.
Factorization Machines
What do you use when dealing with large sparse data?
Factorization Machines
How do Factorization Machines work?
Matrix factorization. The algorithm is also built on top of the MXNet framework and accepts the RecordIO format, but not CSV.
Which built-in algorithm to use when dealing with recommender systems or item recommendation use cases?
AWS recommends using Factorization Machines for such large sparse matrix use cases
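The Factorization Machines prediction formula can be sketched in pure Python. Each feature i gets a latent vector v_i, and pairwise interactions are scored by dot products of those vectors; all weight values below are made up purely for illustration.

```python
# Hypothetical sketch of the FM prediction:
#   y = w0 + sum_i w_i * x_i + sum_{i<j} <v_i, v_j> * x_i * x_j

def fm_predict(x, w0, w, v):
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    interactions = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            interactions += dot * x[i] * x[j]
    return linear + interactions

x = [1.0, 0.0, 1.0]                        # sparse, one-hot-style input
w0, w = 0.5, [0.1, 0.2, 0.3]               # global bias and linear weights
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 2-dim latent factors per feature
print(fm_predict(x, w0, w, v))             # 1.4
```

Because the interaction strength for a feature pair is factored through low-dimensional latent vectors rather than stored per pair, the model stays compact even on very large sparse inputs, which is why it suits recommender-style data.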
A supervised learning algorithm for structured data that works by first building an index of the distances between data points in your dataset. When a new, unlabeled point is provided, the algorithm finds that point's nearest neighbors under a specified distance metric, then either averages the labels of those k points (regression) or takes the most frequently returned label (classification).
k-Nearest Neighbors
How do you train for k-nearest neighbor?
Build an index
_____ corresponds to performing fast lookups against that index.
Inference
- Sample the dataset
- Reduce dimensionality
- Build an index
Steps to train a k-NN model
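A brute-force version of the k-NN prediction described above can be sketched in pure Python (a real implementation would use the prebuilt index instead of scanning every point):

```python
# Minimal sketch: k-nearest-neighbors classification by brute-force
# distance lookup — find the k closest training points, then take a
# majority vote over their labels.
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (features, label) pairs; query: feature tuple."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "cat"), ((0, 1), "cat"), ((1, 0), "cat"),
         ((5, 5), "dog"), ((5, 6), "dog"), ((6, 5), "dog")]
print(knn_classify(train, (0.5, 0.5)))  # cat
print(knn_classify(train, (5.5, 5.5)))  # dog
```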
Logistic regression solves binary classification problems using a loss function known as _____.
Cross-entropy loss
This algorithm is particularly popular in biological fields, and it aims to find the separating hyperplane that divides two classes by the widest possible margin. The wider the margin, the better the quality of the algorithm and its ability to generalize.
Support vector machines
How do you generalize to nonlinear situations where the separating boundary may not be linear with support vector machines?
By introducing a kernel trick, which implicitly maps the data into a higher-dimensional space where a linear separator may exist.
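A minimal sketch of the intuition, using a made-up 1-D example: points labeled by whether |x| > 1.5 cannot be split by a single threshold on x, but after explicitly mapping x to (x, x²) a linear boundary on the second coordinate separates them. Kernels such as the RBF kernel achieve this implicitly by computing similarities in a high-dimensional feature space.

```python
# Minimal sketch: an explicit feature map makes nonlinearly separable
# 1-D data linearly separable; the RBF kernel does this implicitly.
import math

def rbf_kernel(a, b, gamma=0.5):
    # Similarity of two points in an implicit high-dimensional space.
    return math.exp(-gamma * (a - b) ** 2)

points = [-2.0, -1.0, 1.0, 2.0]
labels = [1, 0, 0, 1]                # 1 when |x| > 1.5 — no single threshold on x works

mapped = [(x, x * x) for x in points]
print(mapped)                        # x^2 is 4, 1, 1, 4 — threshold at 2.25 separates
print(rbf_kernel(1.0, 1.1))          # nearby points have similarity near 1
```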
Lets the tree learn when to spawn off new nodes based on the input data.
Decision tree learning
Consists of a root (parent) node and spawns off child or leaf nodes based on certain criteria.
Decision tree
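One common criterion for spawning child nodes is Gini impurity. A minimal sketch, assuming a single numeric feature and one level of splitting (a decision "stump"):

```python
# Minimal sketch: choose the threshold that minimizes weighted Gini
# impurity — one criterion a tree node can use to spawn children.

def gini(labels):
    """Gini impurity of a set of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(xs, ys):
    best = (float("inf"), None)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[0]:
            best = (score, t)
    return best  # (weighted impurity, threshold)

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))   # (0.0, 3.0) — splitting at x <= 3 is pure
```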