Domain 3: Modeling Flashcards

1
Q

This is an application-agnostic standard that can be used as a baseline to understand the various phases of the ML workflow.

A

The Cross Industry Standard Process for Data Mining (CRISP-DM)

2
Q

ML lifecycle phases

A
  1. Identify Problem
  2. Collect and QC
  3. Prepare
  4. Visualization
  5. Feature Engineering
  6. Model Training
  7. Model Evaluation
  8. Business Workflow Integration
3
Q

T/F: Business problem identification requires senior leadership buy-in.

A

True

4
Q

What is the goal of ML?

A

To predict the value or class of an unknown quantity using a mathematical model.

5
Q

Data that the model can use to “learn” from, which consists of independent variables and a dependent variable.

A

Training data

6
Q

What makes a good model?

A

Should be able to generalize what it has learned to unseen data, namely data where the dependent variable is unknown.

7
Q

What makes a poor model?

A

One that has simply memorized the training data will have poor generalization performance and therefore will not be usable in a business process.

8
Q

When a model is shown labeled examples of ground truth values and learns to predict the label based on the input data or features.

A

Supervised learning

9
Q

When you do not have labeled data available and you want the model to discover patterns in the unlabeled data.

A

Unsupervised learning

10
Q

When a model or agent learns by interacting with its environment - similar to trial-and-error learning, where an agent is given rewards and penalties for actions taken and its aim is to maximize the long-term rewards.

A

Reinforcement learning

11
Q

T/F: The data type (whether it is structured or unstructured) does not dictate whether learning is supervised.

A

True

12
Q

A type of supervised learning where the label is binary, such as fraud/not fraud, cat/dog, spam/not spam

A

Binary classification

13
Q

A type of supervised learning where the label can have more than two classes

A

Multiclass classification

14
Q

A type of supervised learning where the label is a continuous number such as a house price

A

Regression

15
Q

A form of supervised machine learning where a model predicts a linear relationship between the data and the labels.

A

Linear models

16
Q

Used when you have a continuous label (regression task), where the assumption is made that the label is linearly related to the data.

A

Linear regression
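
As a rough illustration of fitting a linear model to a continuous label, here is a minimal sketch of ordinary least squares with a single feature. The function name `fit_line` and the sample data are hypothetical and chosen only for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data generated from y = 2x + 1 (no noise).
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```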

17
Q

An idea that the label is a linear combination of the input data or feature vectors.

A

Linearity

18
Q

Three assumptions that need to be tested before a linear model can be accurately fit to the data.

A

Linearity, constant variance, and no strong correlation among features.

19
Q

This is where one feature can be linearly derived from another; in the most trivial example, they are related by a constant.

A

Multicollinearity
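
One simple way to spot the trivial case described here is to compute the Pearson correlation between two features. The sketch below assumes two hypothetical features, a length in feet and the same length in inches, which differ only by a constant factor.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical features: one is just a constant multiple of the other.
feet = [1.0, 2.0, 3.0, 4.0]
inches = [12.0 * f for f in feet]
r = pearson(feet, inches)  # a value near 1.0 signals strongly correlated features
```

A correlation close to 1 or -1 suggests that one of the two features could be dropped before fitting a linear model.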

20
Q

What is often used in machine learning to penalize the model for learning weights that do not generalize well to unseen data, reducing overall model complexity and preventing overfitting?

A

Regularization

21
Q

This tends to reduce the values of weights that are unimportant in predicting the labels, where you add an L2 penalty or quadratic penalty to the weights.

A

Ridge

22
Q

This tends to shrink the weights to zero, where you add an L1 penalty or absolute value penalty to the weights. It also eliminates unimportant features.

A

Lasso

23
Q

This combines ridge and lasso regularization.

A

Elastic net

24
Q

Lasso regression is also known as _____.

A

Shrinkage

25
Q

T/F: Often in machine learning, it is not the model but how you engineer features that determines model performance and ultimately business value.

A

True

26
Q

The application of linear regression to binary or multiclass classification problems using the logit function.

A

Logistic regression
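
To make the link between a linear model and a class probability concrete, here is a minimal sketch for the binary case. The sigmoid is the inverse of the logit function; the helper names and the example weights are hypothetical.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability; the inverse of sigmoid."""
    return math.log(p / (1.0 - p))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one example."""
    # Linear combination of the features, as in linear regression.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights and features whose linear score is exactly zero.
p = predict_proba([1.0, -1.0], 0.0, [2.0, 2.0])
```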

27
Q

T/F: Logistic regression can apply to both binary and multiclass classification problems.

A

True

28
Q

The _____ is one of the most common loss functions for classification problems in machine learning irrespective of the underlying algorithm.

A

Cross-entropy loss
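
For the binary case, the loss can be sketched as follows. This is an illustrative implementation, not a library API; the clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy over a list of examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities away from 0 and 1 so log() stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Confident, correct predictions should yield a lower loss than hesitant ones.
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
hesitant = binary_cross_entropy([1, 0], [0.6, 0.4])
```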

29
Q

Are logistic regression models large?

A

No, they only store coefficients and can thus be quite small.

30
Q

Logistic regression often serves as a _____ for model performance.

A

Benchmark

31
Q

_____ is used to solve classification problems; _____ is used for regression problems.

A

Logistic regression/linear regression

32
Q

What is the built-in algorithm SageMaker has that covers both linear and logistic regression use cases?

A

Linear learner

33
Q

What data format does Linear Learner use?

A

Built using the MXNet framework (recognizes RecordIO data format)
Algorithm also recognizes CSV data

34
Q

_____ can be used for supervised learning on both classification and regression tasks, and they account for cases where the label may also be proportional to interaction terms between different independent variables.

A

Factorization Machines

35
Q

What do you use when dealing with large sparse data?

A

Factorization Machines

36
Q

What method do Factorization Machines work by?

A

Matrix factorization. The algorithm is also built on top of the MXNet framework and accepts RecordIO format, but not CSV.

37
Q

Which built-in algorithm should you use for recommender systems or item recommendation use cases?

A

AWS recommends using Factorization Machines for such large sparse matrix use cases

38
Q

A supervised learning algorithm for structured data. It first builds an index consisting of the distance between any two data points in your dataset. Then, when given a new point whose label is unknown, it finds the k nearest neighbors to that point based on a specified distance metric and either averages the label values of those k points (regression) or returns the most frequent label among them (classification).

A

k-Nearest Neighbors
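
The prediction step can be sketched in a few lines. Note the simplification: this version scans every training point by brute force instead of building an index, and all function names and data are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train_points, query, k):
    """Indices of the k training points closest to query (Euclidean distance)."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    return order[:k]


def knn_classify(train_points, labels, query, k=3):
    """Classification: return the most frequent label among the k neighbors."""
    neighbors = k_nearest(train_points, query, k)
    return Counter(labels[i] for i in neighbors).most_common(1)[0][0]


def knn_regress(train_points, values, query, k=3):
    """Regression: return the average label value of the k neighbors."""
    neighbors = k_nearest(train_points, query, k)
    return sum(values[i] for i in neighbors) / k


# Hypothetical two-cluster dataset.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
```

For example, `knn_classify(points, labels, (0.2, 0.1))` looks at the three points near the origin, all of which carry the label `"a"`.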

39
Q

How do you train for k-nearest neighbor?

A

Build an index

40
Q

_____ corresponds to performing fast lookups against that index.

A

Inference

41
Q
  1. Sample dataset
  2. Reduce dimensionality
  3. Assign each vector a cluster
A

Steps to train a k-nearest neighbors model

42
Q

Logistic regression solves binary classification problems using a loss function known as _____.

A

Cross-entropy loss

43
Q

This algorithm is particularly popular in biological fields, and it aims to find the separating hyper-plane that separates two classes by the widest so-called margin. The wider the margin, the better the quality of the algorithm and its ability to generalize.

A

Support vector machines

44
Q

How do you generalize to nonlinear situations where the separating boundary may not be linear with support vector machines?

A

By introducing the kernel trick

45
Q

Lets the tree learn when to spawn off new nodes based on the input data.

A

decision tree learning

46
Q

Consists of a root node or parent node and spawns off child or leaf nodes based on certain criteria.

A

decision tree

47
Q

Uses a metric to decide when it is appropriate to split a parent node into child nodes. Then this rule is recursively applied to the child nodes. Splitting stops when no further gains can be made or some other condition is met.

A

Classification and Regression Trees (CART)

48
Q

How does the CART algorithm decide when to split a parent node?

A

Gini impurity or the entropy metric

49
Q

A measure of the probability of incorrectly classifying a data point with a particular label.

A

Gini impurity
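
As a sketch of how this metric could be computed, the snippet below evaluates the impurity of a node and the size-weighted impurity of a candidate split. The helper names are hypothetical; the weighting by child size is the usual convention, stated here as an assumption.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions in a node."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def split_gini(left_labels, right_labels):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left_labels) + len(right_labels)
    return (
        len(left_labels) / n * gini_impurity(left_labels)
        + len(right_labels) / n * gini_impurity(right_labels)
    )


pure = gini_impurity(["a", "a", "a"])   # a node with one class has impurity 0
mixed = gini_impurity(["a", "b"])       # an evenly mixed binary node has impurity 0.5
```

A greedy tree builder would evaluate `split_gini` for each candidate split point and keep the one with the lowest value.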

50
Q

_____ operates by using a greedy algorithm to select which input variables to split on and for that input variable, all different split points are evaluated for the Gini impurity.

A

CART

51
Q

_____ takes the same ideas behind decision trees, namely the CART algorithm, but instead of bagging, uses a technique called boosting.

A

XGBoost

52
Q

_____ refers to sequential learning where each subsequent tree aims to correct the errors made by its predecessor. This can also prevent overfitting, as each individual tree can be a so-called weak learner or a shallow tree, but collectively, they can become a strong learner.

A

Boosting

53
Q

Popular boosting algorithms

A

AdaBoost and LogitBoost, which can both be viewed as forms of gradient boosting

54
Q

_____ refers to the ability to treat the error terms as continuous variables and to use a Taylor expansion to expand them in terms of their gradients or derivatives.

A

Gradient boosting

55
Q

A key benefit of XGBoost is its ability to _____, which can occur for common machine learning problems such as fraud detection.

A

scale to very large datasets

56
Q

T/F: SageMaker offers a built-in XGBoost algorithm

A

True

57
Q

The _____ is a popular and efficient open-source implementation of the gradient boosted trees algorithm

A

XGBoost (eXtreme Gradient Boosting)

58
Q

Use XGBoost as a _____ to run your customized training scripts that can incorporate additional data processing into your training jobs.

A

framework

59
Q

Use the XGBoost built-in algorithm to _____.

A

build an XGBoost training container

60
Q

T/F: Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features.

A

True

61
Q

_____ can be used when your data does not have labels but when you are looking to cluster your data points into “similar” groups.

A

k-means clustering

62
Q

What are these steps:
1. identifying a random set of k points as the cluster centers
2. for each of the k centers, find a subset of points from the data that are closest to this center using a distance metric such as Euclidean distance
3. define the new centroid as the mean vector of all these points
4. repeatedly perform these steps until the algorithm converges, that is, the cluster centers do not move past a certain threshold.

A

how k-means training works
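
The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not SageMaker's implementation; the parameter names (`max_iters`, `tol`, `seed`) and the choice to keep the old center when a cluster is empty are assumptions made for this example.

```python
import math
import random


def k_means(points, k, max_iters=100, tol=1e-6, seed=0):
    """Return k cluster centers for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: pick k random data points as the initial centers.
    centers = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each point to its closest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Step 3: the new centroid is the mean vector of the assigned points.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dims = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dims))
                )
            else:
                new_centers.append(centers[j])  # keep the old center if a cluster is empty
        # Step 4: stop once no center moves more than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


# Hypothetical data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
found = sorted(k_means(data, 2))
```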

63
Q

_____ is used when you have a continuous label (regression task), where the assumption is made that the label is linearly related to the data.

A

Linear regression

64
Q

_____ models are powerful because they are easy to interpret, but the model makes multiple assumptions that need to be tested before a linear model can be accurately fit to the data.

A

Linear regression

65
Q
  1. Linearity - the label is a linear combination of the input data or feature vectors.
  2. Constant variance - the statistical variance in the label is identical, regardless of the value of the input data.
  3. Features cannot be strongly correlated with one another
A

Linear regression assumptions

66
Q

_____ are designed to reduce decision tree overfitting by creating a collection of decision trees.

A

Random forests

67
Q

A _____ works by building many trees, where each tree is trained on a sample of the data drawn using a method known as bootstrap aggregation or bagging (which essentially refers to sampling with replacement) and considers only a subset of the input features.

A

random forest

68
Q

Increase the minimum samples per leaf and decrease the maximum depth of the trees.

A

How to avoid overfitting in tree-based models

69
Q

Pro: training multiple trees in parallel.
Con: different trees do not work together to reduce the overall errors.

A

Random forests

70
Q

_____ are deep learning algorithms consisting of alternating convolutional layers, which apply various filters on the input data to capture different information at different scales, followed by pooling layers, which reduce the number of parameters in the network and also the spatial size of the representation.

A

CNNs

71
Q

_____ have the ability to retain a user’s session history information as part of the model training.

A

Recurrent neural networks (RNNs)

72
Q

_____ refers to taking a model that was pretrained on one dataset, freezing the initial layers, and letting it relearn the last few layers of the model on a different dataset.

A

Transfer learning

73
Q

T/F: It is hard for an ML model to understand contextual information

A

True

74
Q

_____ is a service that you can use to label your image, text, audio, or even tabular data; and it lets you outsource the labeling task to a public workforce (via Amazon Mechanical Turk) or a private workforce (either a third-party labeling company or your own private workforce within your organization) to label data.

A

Amazon SageMaker Ground Truth

75
Q

How do you determine if the model is overfitting/underfitting your data?

A

Comparing the performance of your model against the training/validation datasets

76
Q

An _____ is a function or an algorithm that adjusts the attributes of the neural network, such as weights and learning rates. Thus, it helps in reducing the overall loss and improving accuracy.

A

optimizer

77
Q

_____ is an optimization algorithm for finding a local minimum of a differentiable function, and in machine learning, it is simply used to find the values of a function’s parameters (coefficients) that minimize a cost function as far as possible.

A

Gradient Descent
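
A minimal one-dimensional sketch of the idea follows. The cost function f(x) = (x - 3)^2, the stopping tolerance, and the function name are all hypothetical choices for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, max_iters=1000, tol=1e-8):
    """Repeatedly step against the gradient until the steps become tiny."""
    x = start
    for _ in range(max_iters):
        step = learning_rate * grad(x)  # the learning rate scales the step size
        x -= step
        if abs(step) < tol:
            break
    return x


# Hypothetical cost f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)
```

With this cost function the iterates move toward x = 3, the point where the gradient vanishes.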

78
Q

In machine learning (ML), a _____ is used to measure model performance by calculating the deviation of a model’s predictions from the correct, “ground truth” predictions.

A

loss function

79
Q

_____ are places where the function attains its smallest value in a neighborhood of a point.

A

Local minima

80
Q

_____ in machine learning refers to the point where the model’s predictions stop improving, or the error rate becomes constant

A

Convergence

81
Q

The _____ is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.

A

batch size
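
To illustrate, the sketch below splits a dataset into mini-batches; in a training loop, the model parameters would typically be updated once per yielded batch. The generator name is hypothetical.

```python
def iterate_minibatches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Ten samples with a batch size of 4 give three batches (the last one is smaller).
batches = list(iterate_minibatches(list(range(10)), 4))
```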

82
Q

_____ is a measure of the likelihood of an event to occur.

A

Probability

83
Q

Which is better for handling the massively distributed computational processes required for ML: GPU or CPU?

A

GPU

84
Q

Which system is better for ML? Distributed or Non-Distributed?

A

Distributed, because it handles large volumes of data and provides fault tolerance (if one node goes down, the others are still up)

85
Q

Which is best for ML, Spark or Non-Spark?

A

Spark, because it allows us to analyze and understand complex data sets that were previously considered too difficult to work with.

86
Q

What is a trigger for model retraining?

A

Drift

87
Q

_____ is fundamental to ensure that a machine learning model is constantly providing the most up-to-date predictions, while minimizing manual interventions and optimizing for monitoring and reliability. Can happen on a schedule or be triggered by an event.

A

Retraining

88
Q

_____ involves lifting and shifting the batch training code defined at development time into an automated workflow.

A

Model retraining

89
Q

T/F: You should abstract feature selection, model parameters, and other configurable pipeline parameters as input variables of the retraining pipeline.

A

True

90
Q

When you have highly correlated features in your data, to prevent linear regression models from becoming unusable, use this to penalize the model for learning weights that do not generalize well to unseen data.

A

Regularization

91
Q

What are three common forms of regularization?

A

Ridge (add L2 penalty or quadratic penalty to weights)
Lasso (aka, shrinkage: add L1 penalty or absolute value penalty to weights)
Elastic net (combines the two)
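
The three penalties can be sketched as terms that would be added to a model's loss. The strength `lam` and the mixing weight `l1_ratio` are hypothetical parameter names; libraries parameterize the elastic net mix in different ways.

```python
def l2_penalty(weights, lam):
    """Ridge: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l1_penalty(weights, lam):
    """Lasso: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def elastic_net_penalty(weights, lam, l1_ratio=0.5):
    """Elastic net: a weighted mix of the L1 and L2 penalties (assumed mixing form)."""
    return l1_ratio * l1_penalty(weights, lam) + (1.0 - l1_ratio) * l2_penalty(weights, lam)


# Hypothetical weight vector.
w = [3.0, -4.0]
ridge = l2_penalty(w, 0.1)           # 0.1 * (9 + 16)
lasso = l1_penalty(w, 0.1)           # 0.1 * (3 + 4)
mixed = elastic_net_penalty(w, 0.1)  # halfway between the two
```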

92
Q

_____ is a technique used in machine learning to evaluate the performance of a model on unseen data. It involves dividing the available data into multiple folds or subsets, using one of these folds as a validation set, and training the model on the remaining folds.

A

Cross validation
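
The fold-splitting step might look like the sketch below, which only produces index lists and leaves model training out. The function name is hypothetical, and the data is assumed to be already shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
```

Each sample lands in exactly one validation fold, so averaging a metric over the folds uses every sample for validation once.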

93
Q

The main purpose of cross validation is to prevent _____, which occurs when a model is trained too well on the training data and performs poorly on new, unseen data.

A

Overfitting

94
Q

_____ is a procedure to set the weights of a neural network to small random values that define the starting point for the optimization (learning or training) of the neural network model.

A

Weight initialization

95
Q

What does every neural network consist of?

A

Layers of nodes (artificial neurons)
Input layer
One or more hidden layers
Output layer

96
Q

_____ allow us to classify and cluster data at a high velocity

A

Neural networks

97
Q

The _____ is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.

A

learning rate

98
Q

_____ influences to what extent newly acquired information overrides old information and metaphorically represents the speed at which a machine learning model “learns”.

A

Learning rate

99
Q

_____ decides whether a neuron should be activated or not, which means that it will decide whether the neuron’s input to the network is important or not in the process of prediction using simpler mathematical operations.

A

Activation function
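
As a small illustration, the sketch below applies two common activation functions, ReLU and sigmoid, to the weighted input of a single neuron. The weights and inputs are hypothetical.

```python
import math


def relu(z):
    """Pass positive values through and zero out negative ones."""
    return max(0.0, z)


def sigmoid(z):
    """Squash any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, bias, inputs, activation):
    """Weighted sum of the inputs followed by the activation function."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(z)


active = neuron_output([1.0, -1.0], 0.5, [2.0, 1.0], relu)    # weighted sum is 1.5
inactive = neuron_output([1.0, -1.0], 0.5, [0.0, 2.0], relu)  # weighted sum is -1.5
```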

100
Q

_____ use a decision tree to represent how different input variables can be used to predict a target value, and they’re used for both classification and regression problems.

A

Tree-based models

101
Q

This is the building block for many complex machine learning algorithms, including deep neural networks, and it predicts the target variable using a linear function of the input features.

A

Linear models

102
Q

What techniques help avoid overfitting and underfitting?

A

Feature engineering, regularization, ensemble learning, and cross-validation

103
Q

The _____ represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.

A

area under the ROC curve (AUC)
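
This ranking interpretation can be computed directly by comparing every positive example with every negative one. The sketch below is a brute-force illustration with hypothetical scores; counting ties as half a win is a common convention assumed here.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores: 3 of the 4 pairs are ranked correctly.
auc = auc_by_ranking([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```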

104
Q

_____ is the proportion of all classifications that were correct, whether positive or negative. It is mathematically defined as correct classifications / total classifications.

A

Accuracy

105
Q

_____ is the proportion of all the model’s positive classifications that are actually positive. It is mathematically defined as correctly classified actual positives / everything classified as positive.

A

Precision

106
Q

_____ improves as false positives decrease, while recall improves when false negatives decrease.

A

Precision

107
Q

The true positive rate (TPR), or the proportion of all actual positives that were classified correctly as positives, is also known as _____, which is defined as correctly classified actual positives / all actual positives.

A

Recall

108
Q

_____ is commonly used in machine learning as it gives a relatively high weight to large errors, which means it should be more useful when large errors are particularly undesirable. It is also valuable because it retains the same units as the target variable, making it easier to interpret.

A

RMSE
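
The effect of weighting large errors more heavily can be seen by comparing RMSE with the mean absolute error (MAE) on two hypothetical sets of predictions that have the same total absolute error. MAE is brought in here only as a point of comparison.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, shown for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


truth = [0.0, 0.0, 0.0, 0.0]
spread_out = [1.0, 1.0, 1.0, 1.0]  # four small errors
one_large = [0.0, 0.0, 0.0, 4.0]   # one large error with the same total
```

Both prediction sets have an MAE of 1, but the RMSE of `one_large` is twice that of `spread_out`.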

109
Q

The percentage of positive predictions when the true value is negative, i.e., FP / (FP + TN).

A

False Positive Rate (FPR)

110
Q

The harmonic mean of precision and recall

A

F1 Score

111
Q

A _____ is used to measure the performance of a classifier in depth and the accuracy of a classification model.

A

confusion matrix
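
The metrics in the preceding cards can all be derived from the four cells of a binary confusion matrix. The sketch below uses hypothetical labels and predictions and assumes every denominator is nonzero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, FPR, and F1 (assumes nonzero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical ground truth and predictions.
counts = confusion_counts([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
metrics = classification_metrics(*counts)
```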

112
Q

_____ is the process of measuring the quality and effectiveness of a machine learning model based on its interaction with real users and data in a live system.

A

Online evaluation

113
Q

_____ is the process of measuring the quality and effectiveness of a machine learning model based on historical or simulated data and metrics.

A

Offline evaluation

114
Q

T/F: Offline evaluation is usually faster, cheaper, and easier to perform than online evaluation, but it may not capture the true behavior and preferences of the users, the dynamics of the data, or the impact of the model on the system. Online evaluation can provide more realistic and actionable feedback, but it may also be more costly, risky, and complex to conduct.

A

True

115
Q

_____ is an optimization technique often used to understand how an altered variable affects audience or user engagement. It’s a common method used in marketing, web design, product development, and user experience design to improve campaigns and goal conversion rates.

A

A/B testing

116
Q

What are some metrics used to compare models?

A

Time to train, quality, and engineering costs