Domain 3: Modeling Flashcards

Question 1

Q

This is an application-agnostic standard that can be used as a baseline to understand the various phases of the ML workflow.

Answer

A

The Cross Industry Standard Process for Data Mining (CRISP-DM)

Question 2

Q

ML lifecycle phases

Answer

A

Identify Problem
Collect and QC
Prepare
Visualization
Feature Engineering
Model Training
Model Evaluation
Business Workflow Integration

Question 3

Q

T/F: Business problem identification requires senior leadership buy-in.

Question 4

Q

What is the goal of ML?

Answer

A

To predict the value or class of an unknown quantity using a mathematical model.

Question 5

Q

Data that the model can use to “learn” from, which consists of independent variables and a dependent variable.

Answer

A

Training data

Question 6

Q

What makes a good model?

Answer

A

A good model should be able to generalize what it has learned to unseen data, namely data where the dependent variable is unknown.

Question 7

Q

What makes a poor model?

Answer

A

One that has simply memorized the training data will have poor generalization performance and therefore will not be usable in a business process.

Question 8

Q

When a model is shown labeled examples of ground truth values and learns to predict the label based on the input data or features.

Answer

A

Supervised learning

Question 9

Q

When you do not have labeled data available and you want the model to discover patterns in the unlabeled data.

Answer

A

Unsupervised learning

Question 10

Q

When a model or agent learns by interacting with its environment, similar to trial-and-error learning: the agent is given rewards and penalties for actions taken, and its aim is to maximize the long-term rewards.

Answer

A

Reinforcement learning

Question 11

Q

T/F: The data type (whether it is structured or unstructured) does not dictate whether learning is supervised.

Question 12

Q

A type of supervised learning where the label is binary, such as fraud/not fraud, cat/dog, spam/not spam

Answer

A

Binary classification

Question 13

Q

A type of supervised learning where the label can have more than two classes

Answer

A

Multiclass classification

Question 14

Q

A type of supervised learning where the label is a continuous number such as a house price

Answer

A

Regression

Question 15

Q

A form of supervised machine learning where a model predicts a linear relationship between the data and the labels.

Answer

A

Linear models

Question 16

Q

Used when you have a continuous label (regression task), where the assumption is made that the label is linearly related to the data.

Answer

A

Linear regression
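
As a rough illustration of fitting a continuous label that is assumed to be linearly related to one feature, here is a minimal ordinary least squares sketch. The function name `fit_line` and the house-size data are hypothetical, chosen only for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noise-free data generated from price = 50 * size + 100.
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [150.0, 200.0, 250.0, 300.0]
slope, intercept = fit_line(sizes, prices)
```

On data that really is linear, the fit recovers the generating slope and intercept; on real data the residuals should be checked against the linear regression assumptions.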

Question 17

Q

An idea that the label is a linear combination of the input data or feature vectors.

Answer

A

Linearity

Question 18

Q

Three assumptions that need to be tested before a linear model can be accurately fit to the data.

Answer

A

Linearity, constant variance, and features that are not strongly correlated with one another.

Question 19

Q

This is where one feature can be linearly derived from the other; in the most trivial example, they are related by a constant.

Answer

A

Multicollinearity
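
One simple way to spot the trivial case, where two features are related by a constant, is to check their pairwise correlation. The sketch below is only an illustration; the `pearson` helper and the height and weight values are made up for the example.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)


# Hypothetical features: the same height in two units (related by a constant).
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]
weight_kg = [55.0, 72.0, 64.0, 90.0]

r_duplicate = pearson(height_cm, height_in)  # essentially 1: redundant feature
r_other = pearson(height_cm, weight_kg)      # correlated, but not redundant
```

A correlation at or very near 1 (or -1) suggests that one of the two features can be dropped before fitting a linear model.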

Question 20

Q

What is often used in machine learning to penalize the model for learning weights that do not generalize well to unseen data, thereby reducing overall model complexity and preventing overfitting?

Answer

A

Regularization

Question 21

Q

This tends to reduce the values of weights that are unimportant in predicting the labels, where you add an L2 penalty or quadratic penalty to the weights.

Question 22

Q

This tends to shrink the weights to zero, where you add an L1 penalty or absolute value penalty to the weights. It also eliminates unimportant features.

Question 23

Q

This combines ridge and lasso regularization.

Answer

A

Elastic net
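
The quadratic (L2) and absolute value (L1) penalties, and their combination, can be sketched as plain functions of the weight vector. The parameter names `alpha` and `l1_ratio` and the exact mixing formula are assumptions made for this illustration; libraries differ in how they scale the two terms.

```python
def l2_penalty(weights):
    """Quadratic penalty on the weights, as used by ridge."""
    return sum(w * w for w in weights)


def l1_penalty(weights):
    """Absolute value penalty on the weights, as used by lasso."""
    return sum(abs(w) for w in weights)


def elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5):
    """Weighted mix of the L1 and L2 penalties (one possible parameterization)."""
    mix = l1_ratio * l1_penalty(weights) + (1.0 - l1_ratio) * l2_penalty(weights)
    return alpha * mix


weights = [3.0, -0.5, 0.0]
ridge_term = l2_penalty(weights)           # 9.25
lasso_term = l1_penalty(weights)           # 3.5
mixed_term = elastic_net_penalty(weights)  # 6.375
```

During training the chosen penalty is added to the loss, so large weights cost more and the optimizer is pushed toward simpler models.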

Question 24

Q

Lasso regression is also known as _____.

Answer

A

Shrinkage

Question 25

Q

T/F: Often in machine learning, it is not the model but how you engineer features that determines model performance and ultimately business value.

Question 26

Q

The application of linear regression to binary or multiclass classification problems using the logit function.

Answer

A

Logistic regression
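
For the binary case, a minimal sketch of the idea: a linear score is passed through the sigmoid (the inverse of the logit function) to produce a probability. The function names and the weights below are hypothetical.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability; the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one example."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical model: the linear score is 2*1 - 1*2 + 0 = 0, so p = 0.5.
p = predict_proba([2.0, -1.0], 0.0, [1.0, 2.0])
```

Because the model is just a weight vector and a bias, the stored artifact stays small regardless of how many rows were used in training.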

Question 27

Q

T/F: Logistic regression can apply to both binary and multiclass classification problems.

Question 28

Q

The _____ is one of the most common loss functions for classification problems in machine learning irrespective of the underlying algorithm.

Answer

A

Cross-entropy loss
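
A minimal sketch of the binary form of this loss, averaged over examples. The clipping constant `eps` is an assumption added here to avoid taking the log of zero.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0]
confident_right = binary_cross_entropy(labels, [0.9, 0.1])
uncertain = binary_cross_entropy(labels, [0.5, 0.5])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9])
```

The loss grows as the predicted probability of the true class shrinks, so confident wrong predictions are penalized most heavily.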

Question 29

Q

Are logistic regression models large?

Answer

A

No, they only store coefficients and can thus be quite small.

Question 30

Q

Logistic regression often serves as a _____ for model performance.

Answer

A

Benchmark

Question 31

Q

_____ is used to solve classification problems; _____ is used for regression problems.

Answer

A

Logistic regression/linear regression

Question 32

Q

What is the built-in algorithm SageMaker has that covers both linear and logistic regression use cases?

Answer

A

Linear learner

Question 33

Q

What data format does Linear Learner use?

Answer

A

Built using the MXNet framework (recognizes RecordIO data format)
Algorithm also recognizes CSV data

Question 34

Q

_____ can be used for supervised learning, for both classification and regression tasks, and accounts for cases where the label may also be proportional to interaction terms between different independent variables.

Answer

A

Factorization Machines

Question 35

Q

What do you use when dealing with large sparse data?

Answer

A

Factorization Machines

Question 36

Q

What method do Factorization Machines work by?

Answer

A

Matrix factorization. The algorithm is also built on top of the MXNet framework and accepts the RecordIO format, but not CSV.

Question 37

Q

Which built-in algorithm should you use when dealing with recommender systems or item recommendation use cases?

Answer

A

AWS recommends using Factorization Machines for such large sparse matrix use cases

Question 38

Q

A supervised learning algorithm on structured data that works by first building an index consisting of the distance between any two data points in your dataset; and then, when a new point whose label is unknown is provided, this algorithm calculates the nearest neighbors to that point based on a specified distance metric, and either averages the label values for those k-points in the case of regression or uses the most frequently returned label as the label for classification.

Answer

A

k-Nearest Neighbors
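
The lookup step can be sketched with a brute-force search. This is only an illustration under simplifying assumptions: it computes distances at query time instead of building an index, uses Euclidean distance, and the helper names and toy data are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train_points, query, k):
    """Indices of the k training points closest to the query point."""
    distances = sorted((math.dist(p, query), i) for i, p in enumerate(train_points))
    return [i for _, i in distances[:k]]


def knn_classify(train_points, labels, query, k=3):
    """Classification: return the most frequent label among the k neighbors."""
    votes = Counter(labels[i] for i in k_nearest(train_points, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_points, targets, query, k=3):
    """Regression: return the average target value of the k neighbors."""
    neighbors = k_nearest(train_points, query, k)
    return sum(targets[i] for i in neighbors) / k


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_label = knn_classify(points, labels, (5.5, 5.5))
predicted_value = knn_regress(points, targets, (0.2, 0.2))
```

A production implementation would replace the brute-force scan with a prebuilt index so that lookups stay fast on large datasets.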

Question 39

Q

How do you train for k-nearest neighbor?

Answer

A

Build an index

Question 40

Q

_____ corresponds to performing fast lookups against that index.

Answer

A

Inference

Question 41

Q

Sample dataset
Reduce dimensionality
Assign each vector a cluster

Answer

A

Steps to train a k-nearest neighbors model

Question 42

Q

Logistic regression solves binary classification problems using a loss function known as _____.

Answer

A

Cross-entropy loss

Question 43

Q

This algorithm is particularly popular in biological fields, and it aims to find the separating hyper-plane that separates two classes by the widest so-called margin. The wider the margin, the better the quality of the algorithm and its ability to generalize.

Answer

A

Support vector machines

Question 44

Q

How do you generalize to nonlinear situations where the separating boundary may not be linear with support vector machines?

Answer

A

By introducing the kernel trick
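
The idea can be illustrated with a degree-2 polynomial kernel: the kernel value equals a dot product in a higher-dimensional feature space, without ever building that space explicitly. The feature map `phi` and the sample points are assumptions chosen for this example.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))


def phi(x):
    """Explicit degree-2 feature map for a 2-D point."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return dot(x, z) ** 2


x = (1.0, 2.0)
z = (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # dot product in the mapped 3-D space
via_kernel = poly_kernel(x, z)  # same value, no mapping needed
```

A boundary that is linear in the mapped space can be nonlinear in the original features, which is how the margin idea extends to nonlinear problems.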

Question 45

Q

Lets the tree learn when to spawn off new nodes based on the input data.

Answer

A

decision tree learning

Question 46

Q

Consists of a root node or parent node and spawns off child or leaf nodes based on certain criteria.

Answer

A

decision tree

Question 47

Q

Uses a metric to decide when it is appropriate to split a parent node into child nodes. Then this rule is recursively applied to the child nodes. Splitting stops when no further gains can be made or some other condition is met.

Answer

A

Classification and Regression Trees (CART)

Question 48

Q

How does the CART algorithm decide when to split a parent node?

Answer

A

Gini impurity or the entropy metric

Question 49

Q

A measure of the probability of incorrectly classifying a data point with a particular label.

Answer

A

Gini impurity
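
A minimal sketch of the metric and of how a candidate split could be scored with it. The helper names are hypothetical; the split score is the size-weighted average impurity of the two child nodes.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions in a node."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left_labels, right_labels):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left_labels) + len(right_labels)
    left_part = len(left_labels) / n * gini_impurity(left_labels)
    right_part = len(right_labels) / n * gini_impurity(right_labels)
    return left_part + right_part


pure_node = gini_impurity(["a", "a", "a", "a"])       # 0.0
mixed_node = gini_impurity(["a", "a", "b", "b"])      # 0.5
perfect_split = split_impurity(["a", "a"], ["b", "b"])  # 0.0
```

A split is worth making when the weighted impurity of the children is lower than the impurity of the parent node.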

Question 50

Q

_____ operates by using a greedy algorithm to select which input variables to split on and for that input variable, all different split points are evaluated for the Gini impurity.

Question 51

Q

_____ takes the same ideas behind decision trees, namely the CART algorithm, but instead of bagging, uses a technique called boosting.

Question 52

Q

_____ refers to sequential learning where each subsequent tree aims to correctly classify the examples that were misclassified by its predecessor. This can also prevent overfitting, as each individual tree can be a so-called weak learner or shallow tree, but collectively they can become a strong learner.

Question 53

Q

Popular boosting algorithms

Answer

A

AdaBoost and Logit Boosting, which are examples of gradient boosting

Question 54

Q

_____ refers to the ability to treat the error terms as continuous variables and to use Taylor’s expansion to expand them in terms of their gradients or derivatives.

Answer

A

Gradient boosting
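
A rough sketch of the idea for regression with squared error, where the negative gradient of the loss is simply the residual. Each round fits a one-split tree (a stump) to the current residuals. The function names, the learning rate, and the data are assumptions for illustration; this is not how XGBoost is implemented.

```python
def fit_stump(xs, residuals):
    """Fit a one-split tree on a single feature by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Sequentially add stumps, each fit to the residuals of the current model."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.0, 3.1, 5.0, 5.2]
mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(boost(xs, ys), xs, ys)
```

Each stump alone is a weak learner, but the sum of many small corrections fits the training data much better than the constant baseline.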

Question 55

Q

A key benefit of XGBoost is its ability to _____, which can occur for common machine learning problems such as fraud detection.

Answer

A

scale to very large datasets

Question 56

Q

T/F: SageMaker offers a built-in XGBoost algorithm

Question 57

Q

The _____ is a popular and efficient open-source implementation of the gradient boosted trees algorithm

Answer

A

XGBoost (eXtreme Gradient Boosting)

Question 58

Q

Use XGBoost as a _____ to run your customized training scripts that can incorporate additional data processing into your training jobs.

Answer

A

framework

Question 59

Q

Use the XGBoost built-in algorithm to _____.

Answer

A

build an XGBoost training container

Question 60

Q

T/F: Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features.

Question 61

Q

_____ can be used when your data does not have labels but when you are looking to cluster your data points into “similar” groups.

Answer

A

k-means clustering

Question 62

Q

What are these steps:
1. identifying a random set of k points as the cluster centers
2. for each of the k centers, find a subset of points from the data that are closest to this center using a distance metric such as Euclidean distance
3. define the new centroid as the mean vector of all these points
4. repeatedly perform these steps until the algorithm converges, that is, the cluster centers do not move past a certain threshold.

Answer

A

how k-means training works
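
The four steps above can be sketched as follows. To keep the example reproducible, the initial centers are passed in explicitly instead of being chosen at random; the function signature, tolerance, and data are assumptions for illustration.

```python
import math


def kmeans(points, centers, max_iters=100, tol=1e-6):
    """Alternate assignment and centroid updates until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iters):
        # Step 2: assign every point to its closest center (Euclidean distance).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean vector of its assigned points.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(center)  # keep an empty cluster's center
        # Step 4: stop once no center moves more than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters


points = [(0, 0), (0, 2), (2, 0), (10, 10), (10, 12), (12, 10)]
# Step 1 would normally pick k random points; fixed here for reproducibility.
centers, clusters = kmeans(points, [(0, 0), (10, 10)])
```

With random initialization, the result can depend on the starting centers, so it is common to run the algorithm several times and keep the best clustering.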

Question 63

Q

_____ is used when you have a continuous label (regression task), where the assumption is made that the label is linearly related to the data.

Answer

A

Linear regression

Question 64

Q

_____ models are powerful because they are easy to interpret, but the model makes multiple assumptions that need to be tested before a linear model can be accurately fit to the data.

Answer

A

Linear regression

Answer 54

A

Linear regression assumptions

Answer 55

A

Random forests

Answer 56

A

random forest

Answer 57

A

How to avoid overfitting

Answer 58

A

Random forests

Answer 59

A

Recurrent neural networks (RNNs)

Answer 60

A

Transfer learning

Answer 61

A

Amazon SageMaker Ground Truth

Answer 62

A

Comparing the performance of your model against the training/validation datasets

Answer 63

A

optimizer

Answer 64

A

Gradient Descent

Answer 65

A

loss function

Answer 66

A

Local minima

Answer 67

A

Convergence

Answer 68

A

batch size

Answer 69

A

Probability

Answer 70

A

Distributed, because it handles large volumes of data and also provides fault tolerance (if one goes down, the other is still up)

Answer 71

A

Spark, because it allows us to analyze and understand complex data sets that were previously considered too difficult to work with.

Answer 72

A

Retraining

Answer 73

A

Model retraining

Answer 74

A

Regularization

Answer 75

A

Ridge (add L2 penalty or quadratic penalty to weights)
Lasso (aka, shrinkage: add L1 penalty or absolute value penalty to weights)
Elastic net (combines the two)

Answer 76

A

Cross validation

Answer 77

A

Overfitting

Answer 78

A

Weight initialization

Answer 79

A

Layers of nodes (artificial neurons)
Input layer
1 or more hidden layers
Output layer

Answer 80

A

Neural networks

Answer 81

A

learning rate

Answer 82

A

Learning rate

Answer 83

A

Activation function

Answer 84

A

Tree-based models

Answer 85

A

Linear models

Answer 86

A

Feature engineering, regularization, ensemble learning, and cross-validation

Answer 87

A

area under the ROC curve (AUC)

Answer 88

A

Precision

Answer 89

A

Precision

Answer 90

A

False Positive Rate (FPR)

Answer 91

A

confusion matrix

Answer 92

A

Online evaluation

Answer 93

A

Offline evaluation

Answer 94

A

A/B testing

Answer 95

A

Time to train, quality, and engineering costs