Data Science Flashcards
Two most common supervised tasks?
Classification and Regression
Four common unsupervised tasks?
Clustering, visualization, dimensionality reduction, association rule learning
What model is used to train a robot to walk on various unknown terrains?
Reinforcement Learning
Is spam detection a supervised or unsupervised learning problem?
Supervised, you feed the model many emails that are labeled spam or not spam
What is an online learning system?
A learning system that can learn incrementally. Capable of adapting rapidly to changing data and autonomous systems, and of training on very large quantities of data
What is out-of-core learning?
Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer’s main memory. Chops the data into mini-batches and uses online learning techniques.
What type of learning algorithm relies on a similarity measure to make predictions?
An instance-based learning system learns the training data by heart; then, when given new instances, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
Difference between a model parameter and a learning algorithm's hyperparameter?
A model parameter determines what the model will predict given a new instance (e.g., the slope of a linear model); a hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the maximum depth of a decision tree).
1. What do model-based learning algorithms search for? 2. What is the most common strategy they use to succeed? 3. How do they make predictions?
1. They search for an optimal value for the model parameters such that the model will generalize well to new instances. 2. Usually by minimizing a cost function. 3. Feed new instances into the model.
Five main challenges to ML?
1. Lack of data 2. Poor data quality 3. Nonrepresentative data 4. Uninformative features 5. Overfitting or underfitting
Four solutions to overfitting?
1. Get more data 2. Simplify the model 3. Reduce the noise in the data 4. Use a smaller learning rate
What is a test set?
A set of data used to estimate the generalization error that the model will make on new instances, before the model is launched in production
The purpose of a validation set?
To compare models and tune the hyperparameters
What is a train-dev set?
Used when there is a risk of mismatch between the training data and the data used for validation
Which Linear Regression training algorithm can you use if you have a training set with millions of features?
You can’t use SVD or the Normal Equation because the computational complexity grows quickly with the number of features. Use Stochastic Gradient Descent or Mini-batch Gradient Descent. If memory allows, you can also use Batch Gradient Descent.
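A minimal sketch of that idea, assuming scikit-learn and a synthetic wide dataset (the names and values here are illustrative, not from the card):
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Synthetic dataset with many features, where the Normal Equation / SVD would be costly.
X, y = make_regression(n_samples=1000, n_features=500, noise=0.1, random_state=42)

# Stochastic Gradient Descent scales roughly linearly with the number of features.
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X, y)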
If your training set has very different scales which algorithms might suffer? What can you do about this?
The cost function will have the shape of an elongated bowl, so the Gradient Descent algorithms will take a long time to converge. (Normal Equation or SVD approach will work fine). To solve this you should scale the data first. Moreover, regularized models may converge to a suboptimal solution if the features are not scaled.
Can Gradient Descent get stuck in a local minimum when training a Logistic Regression Model?
Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model because the cost function is convex.
Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on?
If the validation error consistently goes up after every epoch, then one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.
Do all Gradient Descent algorithms lead to the same model, provided you let them run long enough?
If the optimization problem is convex (such as Linear Regression or Logistic Regression), and assuming the learning rate is not too high, then all Gradient Descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge; instead they will keep jumping back and forth around the global optimum. This means that even if you let them run for a very long time, these Gradient Descent algorithms will produce slightly different models.
Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?
Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch Gradient Descent is guaranteed to make progress at every single training iteration. So if you immediately stop training when the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to
save the model at regular intervals; then, when it has not improved for a long time (meaning it will probably never beat the record), you can revert to the best
saved model.
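A rough sketch of the "save at intervals and revert to the best model" idea, assuming scikit-learn's SGDRegressor with warm_start (data and hyperparameters are made up for illustration):
from copy import deepcopy
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=5.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# warm_start=True with max_iter=1 means each call to fit() runs one more epoch,
# continuing from the current weights instead of restarting.
sgd_reg = SGDRegressor(max_iter=1, warm_start=True, tol=None,
                       learning_rate="constant", eta0=0.0005, random_state=42)

best_error, best_model = float("inf"), None
for epoch in range(500):
    sgd_reg.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < best_error:
        best_error = val_error
        best_model = deepcopy(sgd_reg)   # snapshot of the best model so far
# Later: revert to best_model instead of the last (possibly worse) model.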
Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?
Stochastic Gradient Descent has the fastest training iteration since it considers only one training instance at a time, so it is generally the first to reach the vicinity
of the global optimum (or Mini-batch GD with a very small mini-batch size). However, only Batch Gradient Descent will actually converge, given enough
training time. As mentioned, Stochastic GD and Mini-batch GD will bounce around the optimum, unless you gradually reduce the learning rate.
Suppose you are using Polynomial Regression. You plot the learning curves and
you notice that there is a large gap between the training error and the validation
error. What is happening? What are three ways to solve this?
If the validation error is much higher than the training error, this is likely because your model is overfitting the training set. One way to try to fix this is to reduce
the polynomial degree: a model with fewer degrees of freedom is less likely to overfit. Another thing you can try is to regularize the model—for example, by adding an ℓ2 penalty (Ridge) or an ℓ1 penalty (Lasso) to the cost function. This will also reduce the degrees of freedom of the model. Lastly, you can try to increase the size of the training set.
Suppose you are using Ridge Regression and you notice that the training error
and the validation error are almost equal and fairly high. Would you say that the
model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?
If both the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has a high
bias. You should try reducing the regularization hyperparameter α.
Why would you want to use:
a. Ridge Regression instead of plain Linear Regression (i.e., without any regularization)?
A model with some regularization typically performs better than a model without any regularization, so you should generally prefer Ridge Regression over plain Linear Regression.
Why would you want to use:
b. Lasso instead of Ridge Regression?
Lasso Regression uses an ℓ1 penalty, which tends to push the weights down to exactly zero. This leads to sparse models, where all weights are zero except for
the most important weights. This is a way to perform feature selection automatically, which is good if you suspect that only a few features actually matter. When you are not sure, you should prefer Ridge Regression.
Why would you want to use: c. Elastic Net instead of Lasso?
Elastic Net is generally preferred over Lasso since Lasso may behave erratically in some cases (when several features are strongly correlated or when there are more features than training instances). However, it does add an extra hyperparameter to tune. If you want Lasso without the erratic behavior, you can just
use Elastic Net with an l1_ratio close to 1.
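For instance, in scikit-learn (a hedged sketch; the alpha and l1_ratio values are arbitrary):
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=20, random_state=42)

# l1_ratio close to 1 makes Elastic Net behave almost like Lasso,
# while avoiding Lasso's erratic behavior with strongly correlated features.
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.9)
elastic_net.fit(X, y)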
Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime.
Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?
If you want to classify pictures as outdoor/indoor and daytime/nighttime, since these are not exclusive classes (i.e., all four combinations are possible) you should train two Logistic Regression classifiers.
What is the fundamental idea behind Support Vector Machines?
The fundamental idea behind Support Vector Machines is to fit the widest possible “street” between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets.
What is a support vector?
After training an SVM, a support vector is any instance located on the “street” (see the previous answer), including its border. The decision boundary is entirely
determined by the support vectors. Any instance that is not a support vector (i.e., is off the street) has no influence whatsoever; you could remove them, add more instances, or move them around, and as long as they stay off the street they won’t affect the decision boundary. Computing the predictions only involves the support vectors, not the whole training set.
Why is it important to scale the inputs when using SVMs?
SVMs try to fit the largest possible “street” between the classes, so if the training set is not scaled, the SVM will tend to neglect features with small scales.
Can an SVM classifier output a confidence score when it classifies an instance?
What about a probability?
An SVM classifier can output the distance between the test instance and the decision boundary, and you can use this as a confidence score. However, this score cannot be directly converted into an estimation of the class probability. If you set probability=True when creating an SVM in Scikit-Learn, then after training it will calibrate the probabilities using Logistic Regression on the SVM’s scores (trained by an additional five-fold cross-validation on the training data). This will add the predict_proba() and predict_log_proba() methods to the SVM.
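A small Scikit-Learn sketch of both outputs (the toy data is assumed, not from the card):
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=42)
svm_clf = SVC(probability=True, random_state=42)  # enables cross-validated probability calibration
svm_clf.fit(X, y)

scores = svm_clf.decision_function(X[:3])  # signed distance to the decision boundary (confidence)
probas = svm_clf.predict_proba(X[:3])      # calibrated class probability estimates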
Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?
This question applies only to linear SVMs, since kernelized SVMs can only use the dual form. The computational complexity of the primal form of the SVM problem is proportional to the number of training instances m, while the computational complexity of the dual form is proportional to a number between m² and m³. So if there are millions of instances, you should definitely use the primal form, because the dual form will be much too slow.
Say you’ve trained an SVM classifier with an RBF kernel, but it seems to underfit the training set. Should you increase or decrease γ (gamma)? What about C?
If an SVM classifier trained with an RBF kernel underfits the training set, there might be too much regularization. To decrease it, you need to increase gamma or C (or both).
What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with one million instances?
The depth of a well-balanced binary tree containing m leaves is equal to log₂(m), rounded up. A binary Decision Tree (one that makes only binary decisions, as is the case of all trees in Scikit-Learn) will end up more or less well balanced at the end of training, with one leaf per training instance if it is trained without restrictions. Thus, if the training set contains one million instances, the Decision Tree will have a depth of log₂(10⁶) ≈ 20 (actually a bit more, since the tree will generally not be perfectly well balanced).
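A quick check of that figure in Python (just arithmetic):
import math

m = 10**6
print(math.ceil(math.log2(m)))  # 20 — approximate depth of a balanced tree with one million leaves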
Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?
A node’s Gini impurity is generally lower than its parent’s. This is due to the CART training algorithm’s cost function, which splits each node in a way that minimizes the weighted sum of its children’s Gini impurities. However, it is possible for a child node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child’s impurity. For example, consider a node containing four instances of class A and one of class B. Its Gini impurity is 1 − (1/5)² − (4/5)² = 0.32. Now suppose the dataset is one-dimensional and the instances are lined up in the following order: A, B, A, A, A. You can verify that the algorithm will split this node after the second instance, producing one child node with instances A, B, and the other child node with instances A, A, A. The first child node’s Gini impurity is 1 − (1/2)² − (1/2)² = 0.5, which is higher than its parent’s. This is compensated for by the fact that the other node is pure, so the overall weighted Gini impurity is 2/5 × 0.5 + 3/5 × 0 = 0.2, which is lower than the parent’s Gini impurity.
If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?
If a Decision Tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it.
If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?
Decision Trees don’t care whether or not the training data is scaled or centered; that’s one of the nice things about them. So if a Decision Tree underfits the training set, scaling the input features will just be a waste of time.
If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?
The computational complexity of training a Decision Tree is O(n × m log(m)). So if you multiply the training set size by 10, the training time will be multiplied by
K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). If m =10^6, then K ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.
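The ratio can be computed directly (the base of the logarithm cancels out):
import math

m = 10**6
K = 10 * math.log(10 * m) / math.log(m)
print(K)  # ≈ 11.67 — so roughly 11.7 hours if 1 million instances took 1 hour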
If your training set contains 100,000 instances for Decision Tree Classifier, will setting presort=True speed up training?
Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. If it contains 100,000 instances, setting presort=True
will considerably slow down training.
If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?
If you have trained five different models and they all achieve 95% precision, you can try combining them into a voting ensemble, which will often give you even
better results. It works better if the models are very different (e.g., an SVM classifier, a Decision Tree
Classifier, a Logistic Regression classifier, and so on). It is even better if they are trained on different training instances (that’s the whole point of bagging and pasting ensembles), but if not this will still be effective as long as the models are very different.
What is the difference between hard and soft voting classifiers?
A hard voting classifier just counts the votes of each classifier in the ensemble and picks the class that gets the most votes. A soft voting classifier computes the average estimated class probability for each class and picks the class with the highest probability. This gives high-confidence votes more weight and often performs better, but it works only if every classifier is able to estimate class probabilities (e.g., for the SVM classifiers in Scikit-Learn you must set probability=True).
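A hedged Scikit-Learn sketch contrasting the two voting modes (the estimators and data are chosen arbitrarily):
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("svc", SVC(probability=True)),   # needed so the SVM can estimate class probabilities
        ("tree", DecisionTreeClassifier()),
    ],
    voting="soft",  # "hard" = majority vote; "soft" = average of predicted probabilities
)
voting_clf.fit(X, y)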
Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, Random Forests, or stacking ensembles?
It is quite possible to speed up training of a bagging ensemble by distributing it across multiple servers, since each predictor in the ensemble is independent of the others. The same goes for pasting ensembles and Random Forests, for the same reason. However, each predictor in a boosting ensemble is built based on the previous predictor, so training is necessarily sequential, and you will not gain anything by distributing training across multiple servers. Regarding stacking ensembles, all the predictors in a given layer are independent of each other, so they can be trained in parallel on multiple servers. However, the predictors in one layer can only be trained after the predictors in the previous layer have all been trained.
What is the benefit of out-of-bag evaluation?
With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated using instances that it was not trained on (they were held out). This makes it possible to have a fairly unbiased evaluation of the ensemble without the need for an additional validation set. Thus, you have more instances available for training, and your ensemble can perform slightly better.
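For example, with Scikit-Learn's BaggingClassifier (a sketch on synthetic data):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                            bootstrap=True, oob_score=True, random_state=42)
bag_clf.fit(X, y)
print(bag_clf.oob_score_)  # out-of-bag accuracy: a validation estimate with no held-out set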
What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?
When you are growing a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. This is true as well for ExtraTrees, but they go one step further: rather than searching for the best possible thresholds, like regular Decision Trees do, they use random thresholds for each feature. This extra randomness acts like a form of regularization: if a Random Forest overfits the training data, Extra-Trees might perform better. Moreover,
since Extra-Trees don’t search for the best possible thresholds, they are much faster to train than Random Forests. However, they are neither faster nor slower
than Random Forests when making predictions.
If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how?
If your AdaBoost ensemble underfits the training data, you can try increasing the number of estimators or reducing the regularization hyperparameters of the base estimator. You may also try slightly increasing the learning rate.
If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?
If your Gradient Boosting ensemble overfits the training set, you should try decreasing the learning rate. You could also use early stopping to find the right
number of predictors (you probably have too many).
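One hedged way to find the right number of predictors with Scikit-Learn, using staged_predict on a validation set (illustrative data and hyperparameters):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=300, learning_rate=0.05,
                                 random_state=42)
gbrt.fit(X_train, y_train)

# Validation error after each boosting stage; the minimum suggests how many trees are enough.
errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
best_n_estimators = int(np.argmin(errors)) + 1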
What are the main motivations for reducing a dataset’s dimensionality?
The main motivations for dimensionality reduction are:
• To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform
better)
• To visualize the data and gain insights on the most important features
• To save space (compression)
What are the main drawbacks for reducing a dataset’s dimensionality?
The main drawbacks are:
• Some information is lost, possibly degrading the performance of subsequent training algorithms.
• It can be computationally intensive.
• It adds some complexity to your Machine Learning pipelines.
• Transformed features are often hard to interpret.
What is the curse of dimensionality?
The curse of dimensionality refers to the fact that many problems that do not exist in low-dimensional space arise in high-dimensional space. In Machine
Learning, one common manifestation is the fact that randomly sampled high-dimensional vectors are generally very sparse, increasing the risk of overfitting and making it very difficult to identify patterns in the data without having plenty of training data.
Once a dataset’s dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?
Once a dataset’s dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation,
because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other
algorithms (such as t-SNE) do not.
Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?
PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions—as in a Swiss roll dataset—then
reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.
Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?
That’s a trick question: it depends on the dataset. Let’s look at two extreme examples. First, suppose the dataset is composed of points that are almost perfectly
aligned. In this case, PCA can reduce the dataset down to just one dimension while still preserving 95% of the variance. Now imagine that the dataset is composed of perfectly random points, scattered all around the 1,000 dimensions. In this case roughly 950 dimensions are required to preserve 95% of the variance. So
the answer is, it depends on the dataset, and it could be any number between 1 and 950. Plotting the explained variance as a function of the number of dimensions is one way to get a rough idea of the dataset’s intrinsic dimensionality.
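In Scikit-Learn you can let PCA pick the number of dimensions for you (a sketch with stand-in random data):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 50)     # stand-in dataset
pca = PCA(n_components=0.95)     # keep just enough components to preserve 95% of the variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_)         # how many dimensions were actually kept — it depends on the data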
In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?
Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don’t fit in memory, but it is slower
than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, when you need to apply
PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Kernel PCA is useful for nonlinear datasets.
How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One
way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation. Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm
(e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much
information, then the algorithm should perform just as well as when using the original dataset.
Does it make any sense to chain two different dimensionality reduction algorithms?
It can absolutely make sense to chain two different dimensionality reduction algorithms. A common example is using PCA to quickly get rid of a large number of useless dimensions, then applying another much slower dimensionality reduction algorithm, such as LLE. This two-step approach will likely yield the same performance as using LLE only, but in a fraction of the time.
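A hedged sketch of that two-step chain using a Scikit-Learn Pipeline (the data and parameters are placeholders):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.pipeline import Pipeline

X = np.random.rand(500, 100)  # stand-in high-dimensional dataset

pca_lle = Pipeline([
    ("pca", PCA(n_components=0.95)),                                  # fast: drop useless dimensions
    ("lle", LocallyLinearEmbedding(n_components=2, n_neighbors=10)),  # slower, nonlinear step
])
X_2d = pca_lle.fit_transform(X)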
How would you define clustering? Can you name a few clustering algorithms?
In Machine Learning, clustering is the unsupervised task of grouping similar instances together. The notion of similarity depends on the task at hand: for example, in some cases two nearby instances will be considered similar, while in others similar instances may be far apart as long as they belong to the same densely packed group. Popular clustering algorithms include K-Means, DBSCAN, agglomerative clustering, BIRCH, Mean-Shift, affinity propagation, and spectral clustering.
What are some of the main applications of clustering algorithms?
The main applications of clustering algorithms include data analysis, customer segmentation, recommender systems, search engines, image segmentation, semisupervised learning, dimensionality reduction, anomaly detection, and novelty detection.
Describe two techniques to select the right number of clusters when using K-Means.
The elbow rule is a simple technique to select the number of clusters when using K-Means: just plot the inertia (the mean squared distance from each instance to its nearest centroid) as a function of the number of clusters, and find the point in the curve where the inertia stops dropping fast (the “elbow”). This is generally close to the optimal number of clusters. Another approach is to plot the silhouette score as a function of the number of clusters. There will often be a peak, and the optimal number of clusters is generally nearby. The silhouette score is the mean silhouette coefficient over all instances. This coefficient varies from +1 for instances that are well inside their cluster and far from other clusters, to –1 for instances that are very close to another cluster. You may also plot the silhouette diagrams and perform a more thorough analysis.
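Both techniques in a short Scikit-Learn sketch (synthetic blobs, arbitrary range of k):
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

inertias, silhouettes = [], []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                          # plot these and look for the "elbow"
    silhouettes.append(silhouette_score(X, km.labels_))   # look for the peak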
What is label propagation? Why would you implement it, and how?
Labeling a dataset is costly and time-consuming. Therefore, it is common to have plenty of unlabeled instances, but few labeled instances. Label propagation is a technique that consists in copying some (or all) of the labels from the labeled
instances to similar unlabeled instances. This can greatly extend the number of labeled instances, and thereby allow a supervised algorithm to reach better performance (this is a form of semi-supervised learning). One approach is to use a clustering algorithm such as K-Means on all the instances, then for each cluster find the most common label or the label of the most representative instance (i.e., the one closest to the centroid) and propagate it to the unlabeled instances in the same cluster.
Can you name two clustering algorithms that can scale to large datasets? And two that look for regions of high density?
K-Means and BIRCH scale well to large datasets. DBSCAN and Mean-Shift look for regions of high density.
Can you think of a use case where active learning would be useful? How would you implement it?
Active learning is useful whenever you have plenty of unlabeled instances but labeling is costly. In this case (which is very common), rather than randomly selecting instances to label, it is often preferable to perform active learning, where human experts interact with the learning algorithm, providing labels for specific instances when the algorithm requests them. A common approach is uncertainty sampling (see the description in “Active Learning” on page 255).
What is the difference between anomaly detection and novelty detection?
Many people use the terms anomaly detection and novelty detection interchangeably, but they are not exactly the same. In anomaly detection, the algorithm is trained on a dataset that may contain outliers, and the goal is typically to identify these outliers (within the training set), as well as outliers among new instances. In novelty detection, the algorithm is trained on a dataset that is presumed to be “clean,” and the objective is to detect novelties strictly among new instances. Some algorithms work best for anomaly detection (e.g., Isolation Forest), while others are better suited for novelty detection (e.g., one-class SVM).
What is a Gaussian mixture? What tasks can you use it for?
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose
parameters are unknown. In other words, the assumption is that the data is grouped into a finite number of clusters, each with an ellipsoidal shape (but the clusters may have different ellipsoidal shapes, sizes, orientations, and densities), and we don’t know which cluster each instance belongs to. This model is useful for density estimation, clustering, and anomaly detection.
Can you name two techniques to find the right number of clusters when using a Gaussian mixture model?
One way to find the right number of clusters when using a Gaussian mixture model is to plot the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) as a function of the number of clusters, then choose the number of clusters that minimizes the BIC or AIC. Another technique is to use a Bayesian Gaussian mixture model, which automatically selects the number of clusters.
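A sketch of the BIC/AIC approach with Scikit-Learn (synthetic data, arbitrary range of k):
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

bics = []
for k in range(1, 8):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=42).fit(X)
    bics.append(gm.bic(X))            # could use gm.aic(X) instead
best_k = int(np.argmin(bics)) + 1     # number of clusters that minimizes the BIC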
What is θ = (XᵀX)⁻¹ Xᵀ y,
and when can we use it?
The Normal Equation. It is an alternative to Gradient Descent for training Linear Regression when the number of features isn’t too big.
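The Normal Equation written out with NumPy (a sketch; X and y are arbitrary toy data):
import numpy as np

X = np.random.rand(100, 3)
y = 4 + X @ np.array([3.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)

X_b = np.c_[np.ones((100, 1)), X]                # add the bias term x0 = 1
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y   # θ = (XᵀX)⁻¹ Xᵀ y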
How do you turn a feature into standard normal form (mean = 0, std = 1)?
Subtract the feature mean from each instance’s value and divide by the feature’s standard deviation.
What do you do if your cost function increases after each iteration?
Make the learning rate (alpha), smaller.
When is Gradient Descent a better option than the Normal Equation?
When there are too many features (e.g., more than 10,000).
Why use feature scaling?
It makes Gradient Descent converge more quickly, taking a more direct path to the minimum.
What are the dimensions of Θ(j) in a neural network?
s_(j+1) × (s_j + 1)
e.g., (number of units in the hidden layer) × (number of units in the input layer + 1)
What is a common reason for an ML model that works well in training but fails in production?
The ML dataset was improperly created
Personalized Algorithms are often built using which type of ML model?
Recommendation systems (but you must understand and know the tools and tricks of image processing and sequence systems to understand recommendation systems).
What is a key lesson Google has learned with regards to reducing the chance of failure in production ML models?
Process batch and streaming data the same way
Which of the following scenarios may require a supervised learning model to be retrained as a new model?
The model was trained on labeled data and we now wish to correct the labels of the data.
Someone read emails for a company and then forwards the emails to the appropriate department. How can we automate this process?
Use several models to read, sort, and send to departments. If there are any pre-existing models then use them.
A team is preparing to develop and deploy an ML model for use on a shopping website. They have collected a little data to train the model. The team plans on gathering more data once the model is developed. Now they are ready for the next phase, training.
Which of these scenarios will most likely lead to a successful deployment of the ML model?
The team should take time to gather more data, because with more data, it is possible to create a simpler ML model that performs better.
What are the five phases of the “Path to ML”?
Individual contributor, delegation, digitization, big data and analytics, machine learning
You are going to develop an ML model. You are in Canada and the rest of the team is in Mexico.
Your team wants to use Google Cloud Platform with Python Notebook. Which of the following statements support your decision.
Datalab notebooks are hosted in the cloud
Your team has decided to use the Compute Engine, Cloud Storage, and Datalab for ML model development
Which two statements are applicable to your situation?
Every member of the team, regardless of their location, can directly read data from Cloud Storage.
Latency of data access can be a concern, so carefully select the zone for data storage.
The third wave of cloud is _________________ so you can focus on data ___________ instead of infrastructure.
serverless, insights
Three quality attributes of data?
Consistency, accuracy, auditability
Two categories of data quality tools?
Cleaning tools, monitoring tools
Three features of low data quality?
unreliable info, incomplete data, duplicated data
What is the Orderliness of data?
The data entered has the required format and structure
Three best practices for data quality management?
resolving missing values, preventing duplicates, automating data entry
Which is the correct sequence of steps in data science after the data is gathered? 4 steps
Data Exploration -> Data Cleaning -> Model Building -> Present Results
Three objectives of exploratory data analysis?
Check for missing data and other mistakes, Gain maximum insight into the data set and its underlying structure, uncover a parsimonious model (the most useful features)
Two main methods for Exploratory Data Analysis?
Univariate and Bivariate
What machine learning models have labels, or in other words, the correct answers to whatever it is that we want to learn to predict?
Supervised model
Two most common types of Supervised machine learning models?
Regression model, and classification model
Which model would you use if your problem required a discrete number of values or classes?
Classification model
When the data isn’t labelled, what is an alternative way of predicting the output?
Clustering Algorithms
What is the most essential metric a regression model uses?
Mean squared error as their loss function
Fill in the blanks. In the video, we presented a linear equation. This hypothesis equation is applied to every _________ of our dataset, where the weight values are fixed, and the feature values are from each associated column, and our machine learning data set.
row
Fill in the blanks. Fundamentally, classification is about predicting a _______ and regression is about predicting a __________.
Label, Quantity
What component of a biological neuron is analogous to the input portion of a perceptron?
Dendrites
Which of the following is an algorithm for supervised learning of binary classifiers - given that a binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class.
Binary classifier, Perceptron, Linear Regression
Perceptron
Which model is a linear classifier, also used in supervised learning?
Neuron, Dendrites, Perceptron
Perceptron
A perceptron is a type of _____ that makes its predictions based on a linear predictor function combining a set of weights with the ________.
linear classifier, feature vector
Three steps in the Perceptron Learning Process
- Takes the inputs, multiplies them by their weights, and computes their sum.
- Adds a bias factor, the number 1 multiplied by a weight.
- Feeds the sum through the activation function.
Six elements of a perceptron?
- Input function X
- Bias b (constant)
- Weights
- Weighted sum
- Activation function
- Output
Neural Networks: If I wanted my outputs to be in the form of probabilities, which activation function should I use in the final layer?
Sigmoid
A single unit for a non-input neuron has these three things:
- Weighted Sum
- Activation function
- Output of the activation function
What activation functions are needed to get the complex chain functions that allow neural networks to learn data distributions?
Nonlinear activation functions
The range of a ReLU output?
between zero and infinity
The range of Tanh output?
between -1 and 1
The range of a Sigmoid output?
between zero and 1
The range of a ELU output?
between -1 and infinity
In a decision classification tree, what does each decision or node consist of?
Linear classifier of one feature
Mean squared error minimizer and euclidean distance minimizer are used in ______, not ______.
regression, classification
What in a neural network can map the data to a higher-dimensional vector space?
More neurons per layer
SVM: The _____ is the distance between two separate vectors.
margin
SVM: The more generalizable the decision boundary, the ____ the margin.
wider
SVMs are used for text classification tasks such as __________,
__________, and _________.
category assignment,
detecting spam, sentiment analysis
SVMs are based on the idea of finding a ________ that best divides a dataset into _____ classes. ___________ are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing ______. As a simple example, for a classification task with only two features, you can think of a _______ as a ______ that ______ separates and classifies a set of data.
hyperplane, two, support vectors, hyperplane, hyperplane, line, linearly
A _______ maps the data from our ______ vector space to a vector space that has features that can be ______ separated.
kernel transformation, input, linearly
In ML, kernel methods are a class of algorithms for ________, whose best know member is the ________.
pattern analysis, support vector machine
Dropout in neural networks works by randomly setting the _______ of hidden units to ____ at each update of training phase.
outgoing edges, 0
How does dropout help neural networks generalize?
By setting some outputs to 0, the cost function becomes more sensitive to neighboring neurons, changing the way the weights are updated during backpropagation.
Three types of modern neural networks.
Convolutional, modular, recurrent
Three ways to improve generalization in a NN?
Adding dropout layers, performing data augmentation, adding noise
At its core, a ________ is a method of evaluating how well your algorithm models your dataset. If your predictions are totally off, your _________ will output a higher number. If they’re pretty good, it will output a lower number. As you change pieces of your algorithm to try and improve your model, your ______ will tell you if you’re getting anywhere.
loss function
Simply speaking, __________ is the workhorse of basic loss functions. ______ is the sum of squared distances between our target variable and predicted values.
mean squared error
Loss functions can be broadly categorized into 2 types: Classification and Regression Loss. _____ is typically used for regression and ______ is typically used for classification.
mean squared error, cross entropy
Gradient Descent is an optimization algorithm used to _______ some function by iteratively moving in the direction of the steepest descent as by the _________. In machine learning, we use gradient descent to update the _______ of our model.
minimize, negative of the gradient, parameters
________, also called vanilla gradient descent, calculates the error for _______ within the training dataset, but only ________ all training examples have been evaluated does the model get updated. This whole process is like a cycle and it’s called a training epoch.
Batch gradient descent, each example, after
In the ________________________ method, one training sample (example) is passed through the neural network at a time and the parameters (weights) of each layer are updated with the computed gradient.
Stochastic Gradient Descent
________________: Parameters are updated after computing the gradient of error with respect to the entire training set
________________: Parameters are updated after computing the gradient of error with respect to a single training example
________________: Parameters are updated after computing the gradient of error with respect to a subset of the training set
Batch Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent
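A minimal NumPy sketch contrasting the three update schedules on a toy linear-regression loss (all names and values here are illustrative):
import numpy as np

rng = np.random.default_rng(42)
X = np.c_[np.ones((200, 1)), rng.normal(size=(200, 2))]    # bias column + 2 features
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=200)

def gradient(theta, Xb, yb):
    return 2 / len(Xb) * Xb.T @ (Xb @ theta - yb)

theta, lr = np.zeros(3), 0.1
for epoch in range(50):
    # Batch GD: one update per epoch, computed on the whole training set.
    theta -= lr * gradient(theta, X, y)

    # Stochastic GD (alternative): one update per single random example, e.g.
    #   i = rng.integers(len(X)); theta -= lr * gradient(theta, X[i:i+1], y[i:i+1])
    # Mini-batch GD (alternative): one update per small random subset, e.g.
    #   idx = rng.choice(len(X), 16, replace=False); theta -= lr * gradient(theta, X[idx], y[idx])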
What is a type one error?
When the model predicts positive but it’s actually a negative (predicts face when it’s a statue).
Formula for precision
True positives / (True positives + False Positives)
An increase in what factor will drive down the precision ratio?
False Positives
What is type two error?
When the model predicts negative and it’s actually a positive (predicts not face when it’s a face in winter clothes).
Formula for recall
True positives /(true positives + false negatives)
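Both formulas in code (a sketch with made-up counts):
def precision(tp, fp):
    return tp / (tp + fp)   # more false positives -> lower precision (type I errors)

def recall(tp, fn):
    return tp / (tp + fn)   # more false negatives -> lower recall (type II errors)

print(precision(tp=80, fp=20))  # 0.8
print(recall(tp=80, fn=40))     # ≈ 0.667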
Why is RMSE preferred?
The loss metric’s output is measured in the same units as the target variable, making it easier to interpret directly.
There will always be a ____ between the metrics we care about and the metrics that work well with gradient descent.
gap
What is the significance of performance metrics?
Plus two benefits
Performance metrics will allow us to reject models that have settled into inappropriate minima.
- easier to understand
- directly connected to business goals
Two ways to think about recall?
1. Inversely related to precision
2. Recall is like a person who never wants to be left out of a positive decision
Two parameters that affect gradient descent?
1. Learning rate
2. Batch size
What is the best way to assess the quality of a model?
To observe how well a model performs against a new dataset that it hasn’t seen before
How do you decide when to stop training a model?
When your loss metrics start to increase against the validation set
What actions can you perform on your model when it is trained and validated?
You can run it once, and only once, against the independent test dataset.
What loss functions are the most common for regression and classification?
RMSE (or MSE) for linear regression, cross-entropy for classification
Which is the most preferred way to traverse loss surfaces efficiently?
By analyzing the slopes of our loss functions, which provide us directions and step magnitude.
What core algorithm is used to construct Decision Trees?
Greedy algorithms
The RAND function in BigQuery generates a value between ____ and ____.
zero, one
How can you create repeatable samples of your data in BigQuery?
Use the last few digits of a hash function on the field that you’re using to split or bucketize your data
What allows you to split the dataset based upon a field in your data?
FARM_FINGERPRINT, an open-source hashing algorithm that is implemented in BigQuery SQL.
TensorFlow is a _____ and _____ platform programming interface for implementing and running machine learning algorithms, including convenience wrappers for deep learning.
scalable, multi
In TensorFlow, ____ are multi-dimensional arrays with a uniform type. All tensors are ____ like Python numbers and strings: you can never update the contents of a tensor, only create a new one.
tensors, immutable
How does TensorFlow represent numeric computations?
Using a Directed Acyclic Graph (or DAG)
How can we improve the calculation speed in TensorFlow, without losing accuracy?
Using GPU
tf.losses, tf.metrics, and tf.optimizers are useful components when?
building custom Neural Network models.
On which processing units can you run TensorFlow?
CPU, GPU, TPU
tf.estimator, tf.keras, tf.data are high level APIs used for?
distributed training
You need to build a custom NN model. What are two options?
We can use an estimator from TF, or we can use a high-level API such as Keras
Which of the following API’s are not used in the TensorFlow abstraction layers?
C++ API, Python API, tf.keras, tf.image
tf.image
Which API is used to build performant, complex input pipelines from simple, re-usable pieces that will feed your model’s training or evaluation loops.
tf.data.Dataset
Two operations that can be performed on tensors?
reshaped, sliced
What rank is Shape:[3,4]?
Rank 2
TensorFlow records all operations executed inside the context of a tf._______ onto a _____.
GradientTape, Tape
When we compute a loss gradient TensorFlow uses ___ and the ___ associated with each recorded operation to compute the _____.
tape, gradients, gradients
In a TensorFlow loss gradient opperation the computed gradient of a recorded computation will be used in ______ mode differentiation.
reverse
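A tiny TensorFlow 2 sketch of tape-based reverse-mode differentiation:
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:   # records operations executed inside this context onto the tape
    y = x ** 2
dy_dx = tape.gradient(y, x)       # reverse-mode differentiation: dy/dx = 2x = 6.0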
How to produce tensors that can be modified and that can be used for weights?
tf.Variable
A tf.Variable represents a tensor whose value can be _____ by running ___ on it. Specific ops allow you to read and modify the values of this tensor. Higher level libraries like ______ use tf.Variable to store model parameters.
changed, ops, tf.keras
Feature columns describe how the model should use ____ _____ data from your features ______.
raw input, dictionary
A bucketized column helps with discretizing _____ _____ _____.
continuous feature values
Two distinct ways to create a dataset in TensorFlow?
- A data source constructs a dataset from data stored in memory or in one or more files.
- A data transformation constructs a dataset from one or more tf.data.Dataset objects
_____ is used to instantiate a Dataset object which is comprised of lines from one or more text files.
TextLineDataset
The _____ format is a simple format for storing a sequence of binary records. Using ____ can be useful for standardizing input data and optimizing performance.
TFRecord (read with tf.data.TFRecordDataset)
____ has fixed-length records from one or more binary files.
FixedLengthRecordDataset
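A short tf.data sketch showing a data source plus chained transformations (toy in-memory data):
import tensorflow as tf

# Data source: build a dataset from in-memory tensors.
dataset = tf.data.Dataset.from_tensor_slices(([1.0, 2.0, 3.0, 4.0], [0, 1, 1, 0]))

# Data transformations: each call returns a new tf.data.Dataset.
dataset = dataset.shuffle(buffer_size=4).batch(2).repeat(2)

for features, labels in dataset:
    pass  # feed each batch to a training step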
Which method is invoked on the dataset - which triggers creation and execution of two operations?
iter
Three purposes of Neural Network embedding?
- Finding nearest neighbors in the embedding space. These can be used to make recommendations based on user interests or cluster categories.
- As input to a machine learning model for a supervised task.
- For visualization of concepts and relations between categories.
Three types of feature columns?
Categorical, Bucketized, Crossed
In the training phases of ML, which component is not part of the training phase?
Labeled Data, ML Algorithm, Served Model, Trained Model
Served Model
Three ways to feed TensorFlow models with data
TextLineDataset, TFRecordDataset, FixedLengthRecordDataset
What is the role of the tf.data API in TensorFlow?
It enables you to build complex input pipelines from simple, reusable pieces
Three components of the ML pipeline before running the model?
Data Extraction, Data Exploration, Data Analysis
Non-linearity helps in training your model at a _____ _____ ____ and with _____ _____ without the loss of your important information.
much faster rate, more accuracy
The activation function which is linear in the positive domain and the function is 0 in the negative domain.
ReLU
During the training process, each additional layer in your network can successively reduce signal vs. noise. How can we fix this?
Use non-saturating, nonlinear activation functions such as ReLUs
How can we solve the problem called internal covariate shift?
Batch normalization
How can we stop ReLU layers from dying?
lower your learning rates
Which model is appropriate for a plain stack of layers?
Sequential
How does Adam (optimization algorithm) help in compiling the Keras model?
By updating network weights iteratively based on training data and by diagonal rescaling of the gradients
The predict function in the tf.keras API returns what?
Numpy array(s) of predictions
Three parameters involved while compiling the Keras model?
optimizer, loss function, evaluation metrics
What is the significance of the Fit method while training a Keras model?
Defines the number of epochs.
Two weaknesses of the Keras Functional API
- It doesn’t support dynamic architectures. The Functional API treats models as DAGs of layers. This is true for most deep learning architectures, but not all: for instance, recursive networks or Tree RNNs do not follow this assumption and cannot be implemented in the Functional API.
- Sometimes we have to write from scratch and need to build subclasses. When writing advanced architectures, you may want to do things that are outside the scope of “defining a DAG of layers”: for instance, you may want to expose multiple custom training and inference methods on your model instance. This requires subclassing.
The Keras Functional API can be characterized by having:
Multiple inputs and outputs and models with shared layers
The core data structure of Keras is a model, which lets us organize and design layers. The _____ model is the simplest type of model (a linear stack of layers). If we need to build arbitrary graphs of layers, the Keras _____ ____ can do that for us.
Sequential, Functional API
The input layer of the Keras Functional API needs to have shape ___, where p is the number of ____ in your training matrix. For example: _____
(p,) , columns
inputs=Input(shape=(3,))
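A minimal Functional API sketch matching that input shape (the layer sizes are arbitrary):
from tensorflow import keras

inputs = keras.Input(shape=(3,))                     # p = 3 columns in the training matrix
hidden = keras.layers.Dense(8, activation="relu")(inputs)
outputs = keras.layers.Dense(1)(hidden)

model = keras.Model(inputs=inputs, outputs=outputs)  # a graph of layers, not a Sequential stack
model.compile(optimizer="adam", loss="mse", metrics=["mae"])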
In dropout regularization, the activations are scaled by which factor?
1/(1 - dropout probability)
How does regularization help build generalizable models ?
By adding dropout layers to our neural networks.
what does L2 regularization do?
It adds a sum of the squared parameter weights term to the loss function
___ regularization will keep the weight values smaller and ___ regularization will make the model sparser by dropping features.
L2, L1
What is the approximate equivalent of L2 regularization?
Early Stopping
What is the correct workflow to serve your model in the cloud?
create the model -> train and evaluate your model -> save your model -> serve your model
To serve our model for others to use, we export the ____ ____ and deploy the model as a _____.
model file, service
_____ is the directory in which to write the SavedModel
(EXPORT_PATH)
SavedModel is a universal _____ format for TensorFlow models. SavedModel provides a “language neutral format” to save your machine learning models that is both _____ and _____.
serialization, recoverable, hermetic
The Keras Functional API allows you to define what 3 things?
- multi-input or multi-output models
- ad hoc acyclic network graphs
- a model that shares layers
In the Keras Functional API, models are created by specifying their ____ and ____ in a graph of layers. That means that a single graph of layers can be used to generate multiple models.
inputs, outputs
What is TensorFlow Data Validation?
It is a tool that can be used to analyze data to find potential problems in data.
How to Input Feature Columns to a Keras Model?
We can use a DenseFeatures layer to input them to a Keras model.
Three reasons why the Keras Sequential model is not appropriate?
- Your model has multiple inputs or multiple outputs
- Any of your layers has multiple inputs or multiple outputs.
- You need to do layer sharing or non-linear topology
The _____ function can be used with linear regression, logistic regression, k-means, matrix factorization, and ARIMA-based time series models. The _____ function evaluates the _____ values against the ____ data, and can be used to evaluate model _____.
ML.EVALUATE, ML.EVALUATE, predicted, actual, metrics
ML.FEATURE_CROSS generates a ____ feature with all combinations of crossed _____ features except for 1-degree items.
STRUCT, categorical
ML.BUCKETIZE bucketizes a ____ numerical feature into a _____ feature with bucket names as the value.
continuous, string
Feature Cross combines features into a _____ feature, and enables a model to learn separate _____ for each combination of features.
single, weights
_____ is a process by which categorical variables are converted into a form that could be provided to neural networks to do a better job in prediction
One hot encoding
What to use to encode categorical data that is already indexed?
tf.feature_column.categorical_column_with_identity
What do you use the tf.feature_column.bucketized_column function for?
To discretize floating point values into a smaller number of categorical bins
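A sketch with the (legacy) tf.feature_column API that these cards refer to; the column name and boundaries are made up:
import tensorflow as tf

# Hypothetical numeric feature "age", discretized into 5 categorical buckets.
age = tf.feature_column.numeric_column("age")
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 25, 35, 50])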
Before being input into an ML model, raw data must be turned into:
feature vectors
Three characteristics of a good feature
- related to the objective
- known at prediction time
- numeric with meaningful magnitude
Different problems in the same domain may need _____ _____
different features
What is the relationship between Apache Beam and Cloud Dataflow?
Apache Beam is the API for data pipeline building in Java or Python, and Cloud Dataflow is the implementation and execution framework
TRUE or FALSE: The Filter method can be carried out in parallel and autoscaled by the execution framework:
True: Anything in Map or FlatMap can be parallelized by the Beam execution framework
What is the purpose of a Cloud Dataflow connector?
.apply(TextIO.write().to(“gs://…”));
Connectors allow you to output the results of a pipeline to a specific data sink like Bigtable, Google Cloud Storage, flat file, BigQuery, and more …
The stages of a pipeline
- data source
- transformation steps
- data sink
To run a pipeline you need something called a ________.
runner
TRUE or FALSE: A ParDo acts on all items at once (like a Map in MapReduce).
False. A ParDo acts on one item at a time (like a Map in MapReduce)
Three advantages with using an UI tool like Cloud Dataprep?
- Create transformations in UI tool instead of writing Java or Python
- Can chain steps together as part of a recipe
- Supports outputting your data into BigQuery, Google Cloud Storage, or flat files
TRUE or FALSE: You can automatically setup pipelines to run at defined intervals with Cloud Dataprep
True
Different cities in California have markedly different housing prices. What feature crosses could learn city-specific relationships between house characteristic and housing price?
One feature cross: [binned latitude X binned longitude X binned roomsPerPerson]
You are building a model to predict the number of points
(“margin”) by which Team A will beat Team B in a basketball game. Your input
features are (1) whether or not it is a home game for Team A (2) average
number of points Team A scored in its past 7 games and (3) average number
of points Team B scored in its past 7 games. Which two of these are linear models
suitable for machine learning?
1) margin = b + w1*is_home_game + w2*avg_points_A + w3*avg_points_B
2) margin = w1*is_home + w2*(avg_points_A - avg_points_B)^3
Feature crosses are more common in modern machine learning
because:
Feature crosses memorize, and that is okay only if you have extremely large datasets
The function tf.feature_column.crossed_column requires:
A list of categorical or bucketized features
Three reasons you might create an embedding of a feature cross.
1) Create a lower-dimensional representation of the input space
2) Identify similar sets of inputs for clustering
3) Reuse weights learned in one problem in another problem
During the training and serving phase, tf.Transform:
Provides a TensorFlow graph for preprocessing
Tensorflow transform is a hybrid of?
Apache Beam and TensorFlow
The ____ ____ is the most important concept of tf.Transform. The ____ ____ is a logical description of a transformation of the dataset. The ____ ____ accepts and returns a dictionary of tensors, where a tensor means Tensor or 2D SparseTensor.
Preprocessing function
Three steps in order that are considered a best practice in predictive modeling?
Data Cleaning > Feature engineering > Model Building
Using indicator variables to isolate key information, Highlighting interactions between two or more features. and representing the same feature in a different way are examples of?
Feature engineering
A good feature typically is _____ and _____.
related to the objective, is known at prediction time
Two benefits from Regularization?
1. Makes models smaller
2. Limits overfitting (the most important reason)
What is the key reason that we want to penalize models for over-complexity?
Overly-complex models may not be generalizable to real-world scenarios on unseen data
If your learning rate is too small, your loss function will:
Converge very slowly
If your learning rate is too high, your loss function
will converge rapidly, but not reach the lowest error value possible
If your batch size is too high, your loss function will
converge slowly
If your batch size is too low, your loss function will
oscillate wildly
If searching among a large number of hyperparameters, you should do a systematic grid search rather than start from random values, so that you are not relying on chance. True or False?
False
It is a good idea to use the training loss itself as the hyperparameter tuning metric. True or False?
False: you want to use an eval-metric as your hyperparameter tuning metric so that you are not rewarding models that overfit.
Hyperparameter tuning in Cloud ML Engine involves adding the appropriate TensorFlow function call to your model code. True or False?
False: Often, it is simply a matter of submitting a training job with an additional configuration setting
You are creating a model to predict the outcome (final score difference) of a basketball game between Team A and Team B. Your initial model is a neural network with [64, 32] nodes, learning_rate = 0.05, batch_size = 32. The input features include whether the game was played “at home” for Team A, the fraction of the last 7 games that Team A won, the average number of points scored by Team A in its last 7 games, the average score of Team A’s opponents in its last 7 games, etc.
Which of these are hyperparameters to the model?
The number of layers, batch size, number of nodes in each layer, the learning rate, AND the number of previous games that the input features are averaged over (the creation of an input feature is itself a hyperparameter)
What does L1 regularization tend to do to a model’s low-predictive features’ parameter weights?
Have zero values
Which type of regularization is more likely to lead to zero weights?
L1
Which type of regularization penalizes large weight values more?
L2
Two reasons why it’s important to add regularization to logistic regression?
- Helps stop weights being driven to +/- infinity
2. Helps logits stay away from asymptotes which can halt training
Three things you should do when performing logistic regression:
- Adding regularization
- Choosing a tuned threshold
- Checking for bias
You are training your classification model and are using Logistic Regression. Your last layer has no weights that can be _____
tuned
Why is it important to add non-linear activation functions to neural networks?
Stops the layers from collapsing back into just a linear model
Neural networks can be arbitrarily complex. To increase hidden dimensions, I can add____. To increase function composition, I can add ____. If I have multiple labels per example, I can add ____.
neurons, layers, outputs
Four things you can try if your model is experiencing exploding gradients:
- Lower the learning rate
- Add weight regularization
- Add Gradient clipping
- Add batch normalization
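A hedged sketch of the gradient-clipping and lower-learning-rate items above, using the clipnorm argument of a Keras optimizer; the specific values are illustrative, not recommendations.

import tensorflow as tf

# Clip the norm of each gradient to at most 1.0 and use a smaller learning rate.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, clipnorm=1.0)
# model.compile(optimizer=optimizer, loss="mse")  # then compile and train as usual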
Dropout acts as another form of ____. It forces data to flow down ____ paths so that there is a more even spread. It also simulates ____ learning. Don’t forget to scale the dropout activations by the inverse of the _____. We remove dropout during ____.
Regularization, multiple, ensemble, keep probability, inference
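A minimal numpy sketch of inverted dropout for one layer's activations; keep_prob and the activation shape are made-up values.

import numpy as np

keep_prob = 0.8
a = np.random.randn(4, 5)                      # activations of some hidden layer (dummy values)
mask = np.random.rand(*a.shape) < keep_prob    # keep each unit with probability keep_prob
a = (a * mask) / keep_prob                     # scale by the inverse of the keep probability
# At inference time dropout is removed entirely: no mask and no scaling.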
What are three common ways that a neural network training can fail?
- Gradients can explode if the learning rate is too high
- Entire layers can die with all their weights becoming zero
- Gradients can vanish, making it harder to train networks the deeper they are
If you see a dead layer (fraction of zero weights close to 1), what is a reasonable thing to try?
Lower the learning rate
I am training a classification neural network with 5 hidden layers, sigmoid activation function, and [128, 64, 32, 16, 8] with learning_rate=0.05 and batch_size=32. I notice from TensorBoard that gradients in the third layer are near-zero. Is this a problem?
yes
I am training a classification neural network with 5 hidden layers, sigmoid activation function, and [128, 64, 32, 16, 8] with learning_rate=0.05 and batch_size=32. I notice from TensorBoard that gradients in the third layer are near-zero. What would you try to fix this?
Try using ReLU activation function
For our classification output, if we have both mutually exclusive labels and probabilities, we should use ____. If the labels are mutually exclusive, but the probabilities aren’t, we should use _____. If our labels aren’t mutually exclusive, we should use ____.
- tf.nn.softmax_cross_entropy_with_logits_v2
- tf.nn.sparse_softmax_cross_entropy_with_logits
- tf.nn.sigmoid_cross_entropy_with_logits
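A hedged usage sketch of the three losses (TF 2.x drops the _v2 suffix on the softmax version); the logits and labels below are dummy values.

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])        # raw scores for 3 classes

# Mutually exclusive classes with soft labels (probabilities):
soft_labels = tf.constant([[0.7, 0.2, 0.1]])
loss_soft = tf.nn.softmax_cross_entropy_with_logits(labels=soft_labels, logits=logits)

# Mutually exclusive classes with hard labels (class indices):
hard_labels = tf.constant([0])
loss_hard = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=hard_labels, logits=logits)

# Labels that are not mutually exclusive (multi-label), one sigmoid per class:
multi_labels = tf.constant([[1.0, 0.0, 1.0]])
loss_multi = tf.nn.sigmoid_cross_entropy_with_logits(labels=multi_labels, logits=logits)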
If you have a classification problem with multiple labels, how does the neural network architecture change?
Have a logistic layer for each label, and send the outputs of the logistic layer to a softmax layer
If you have thousands of classes, computing the cross-entropy loss can be very slow. Which of these is a way to help address that problem?
Use a noise-contrastive loss function
What is the benefit of using a pre-canned Estimator?
It can give us a quick ML model
What is the recommended way to create distributed Keras models?
Write a Keras model as normal, and use the model_to_estimator function to convert it into an Estimator for train_and_evaluate
In the model function for a custom estimator, you can customize four things:
- the set of evaluation metrics
- The loss metric that is optimized
- The optimizer that is used
- The predictions that are returned (Correct. It is possible, for example, in a classification problem to decide to return an intermediate embedding, the class probability, and the logits. This is possible because predictions is a dictionary)
Two reasons for why an RNN (Recurrent Neural Network) is used for machine translation, say translating English to French?
- It can be trained as a supervised learning problem
2. It is applicable when the input/output is a sequence (e.g., a sequence of words).
What does a neuron compute?
A neuron computes a linear function (z=Wx + b) followed by an activation function
What is the loss function for logistic classification? Why do we use this one?
The cross-entropy loss function; it is convex, so there are no local minima for gradient descent to get stuck in
Suppose img is a (32,32,3) array, representing a 32x32 image with 3 color channels red, green and blue. How do you reshape this into a column vector?
x = img.reshape((32*32*3, 1))
a.shape = (2,3)
b.shape = (2,1)
c = a + b
c.shape = (2,3)
a.shape = (4,3)
b.shape = (3,2)
c = a*b
“Error!” the sizes don’t match for an element-wise multiplication
Suppose you have n_x input features per example. What is the dimension of X?
(n_x, m)
a.shape= (12288, 150)
b.shape = (150, 45)
c = np.dot(a,b)
c.shape = (12288, 45)
a.shape = (3,4)
b.shape = (4,1)
How do you vectorize the computation c[i][j] = a[i][j] + b[j] (i.e., without explicit for loops)?
c = a + b.T
a.shape = (3,3)
b.shape = (3,1)
c = a*b
What does python do to make this work?
This will invoke broadcasting, so b is copied three times to become (3,3), and * is an element-wise product, so c.shape will be (3,3)
Why does the tanh activation function usually work better than sigmoid as an activation function in the hidden layers?
Its output range is between -1 and 1 and thus centers the data around zero, which makes learning simpler for the next layer
You are building a binary classifier for recognizing cucumbers (y=1) vs. watermelons (y=0). Which one of these activation functions would you recommend using for the output layer?
sigmoid
A = np.random.randn(4,3)
B = np.sum(A, axis=1, keepdims=True)
what is B.shape?
(4,1)
What will happen if you build a neural network and you initialize the weights to be zero?
Each neuron in the first hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons.
You have built a network using the tanh activation for all the hidden units. You initialize the weights to relative large values, using np.random.randn(..,..)*1000. What will happen?
This will cause inputs of the tanh to also be very large, thus causing gradients to be close to zero. The optimization algorithm will thus become slow.
What is the “cache” used for in our implementation of forward propagation and backward propagation?
We use it to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.
The ____ layers of a neural network are typically computing more complex features of the input than the ____ layers.
deeper, earlier
Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l = 1, 2, …, L. True/False?
False
In a deep network, we cannot avoid a for loop iterating over the layers
Why do we need to know the activation function for backpropagation?
To compute the derivative, each activation has a different derivative.
Circuit theory: (i) To compute the function using a shallow network circuit, you will need a large network (where we measure size by the number of logic gates in the network), but (ii) To compute it using a deep network circuit, you need only an ____ smaller network.
exponentially
In general how can we find the dimension of the weight matrix associated with a layer?
W[l] has shape (n[l], n[l-1])
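A small numpy sketch of that shape rule, with made-up layer sizes.

import numpy as np

layer_dims = [5, 4, 3, 1]   # hypothetical: n_x = 5 inputs, two hidden layers, one output unit
params = {}
for l in range(1, len(layer_dims)):
    params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01   # (n[l], n[l-1])
    params["b" + str(l)] = np.zeros((layer_dims[l], 1))                               # (n[l], 1)
print(params["W2"].shape)   # (3, 4)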
If you have 10,000,000 examples, how would you split the train/dev/test set?
98%, 1% , 1%
The dev and test set should:
come from the same distribution
If your Neural Network model seems to have high bias, what two things could you try?
- increase the number of units in each hidden layer
2. make the Neural Network deeper
You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas and oranges. Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. What two things could you try?
- Increase the regularization parameter lambda
2. get more training data
What is weight decay?
A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration
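A sketch of why L2 regularization acts as weight decay in the gradient-descent update; learning_rate, lam, and m are made-up values.

import numpy as np

learning_rate, lam, m = 0.01, 0.7, 1000
W = np.random.randn(3, 4)
dW = np.random.randn(3, 4)   # gradient of the unregularized cost (dummy values)

# The L2 term adds (lam / m) * W to the gradient, so every step shrinks W a little:
W = W - learning_rate * (dW + (lam / m) * W)
# Equivalently: W = (1 - learning_rate * lam / m) * W - learning_rate * dW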
What happens when you increase the regularization hyperparameter lambda?
Weights are pushed toward becoming smaller
With the inverted dropout technique, at test time:
you do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training
Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause what two things?
- reducing the regularization effect
2. causing the neural network to end up with a lower training set error
Three techniques that can be used to reduce variance when more training data isn’t an option?
- dropout
- data augmentation
- L2 regularization
Why do we normalize the inputs x?
It makes the cost function faster to optimize
Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?
a[3]{8}(7)
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
if it’s 1 then you lose the benefits of vectorization across examples in the mini-batch
if it’s m then you end up with batch gradient descent, which can be very slow for big training sets
If you plot the cost with mini-batch what does it look like?
It looks like the batch gradient descent curve but noisier: the cost oscillates from mini-batch to mini-batch while trending downward
What’s bias correction? Is it popular?
It’s used with exponentially weighted averages: it rescales the first few estimates, which would otherwise be biased toward zero because the average starts at zero. In practice it’s not that common.
With exponentially weighted averages, what happens if β is too large? Too small? What is a good β value?
Too large and the curve is smoother but lags (shifts to the right). Too small and it oscillates a lot. Standard practice uses β = 0.9, which averages over roughly the last 10 iterations.
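A numpy sketch of an exponentially weighted average with the bias correction mentioned above; the series is dummy data.

import numpy as np

beta = 0.9                      # averages over roughly 1 / (1 - beta) = 10 values
thetas = np.random.randn(50)    # dummy series
v = 0.0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta
    v_corrected = v / (1 - beta ** t)   # bias correction mainly matters for small t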
Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function J. Which four techniques could help find parameter values that attain a small value for J?
- tuning the learning rate
- try mini-batch gradient descent
- try using Adam
- try better random initialization for the weights
Why is grid search a bad idea for searching for hyperparameters?
Random search lets you try more distinct values. A 5x5 grid search over two hyperparameters gives 25 combinations but only 5 distinct values of each hyperparameter, whereas 25 random samples give 25 distinct values for each hyperparameter.
Which hyperparameters are generally the most important?
1st - learning rate
2nd - hidden units, β (momentum), mini-batch size
3rd - number of layers, learning rate decay
The hyperparameters for Adam are usually left at their default values: β1 = 0.9, β2 = 0.999, ε = 10^(-8)
During hyperparameter search, whether you try to babysit one model (“Panda” strategy) or train a lot of models in parallel (“Caviar”) is largely determined by:
The amount of computational power you can access
If you think β (hyperparameter for momentum) is between 0.9 and 0.99, what is the recommended way to sample a value for beta?
r = np.random.rand()
beta = 1 - 10**(-r - 1)
In batch normalization as presented in the videos, if you apply it on the l-th layer of your neural network, what are you normalizing?
Z[l], that will go into the activation function
In the normalization formula z_norm^(i) = (z^(i) - μ) / sqrt(σ² + ε), why do we use ε?
to avoid division by zero
What do gamma and beta do in Batch Norm, and how can we find them?
They set the mean and variance of the linear variable z[l] of a given layer, and they can be learned using gradient descent, gradient descent with momentum, RMSprop, or Adam
After training a neural network with Batch Norm, at test time, to evaluate the neural network on a new example you should:
Perform the needed normalization using an exponentially weighted average across mini-batches seen during training.
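A numpy sketch of the Batch Norm computation from the cards above: normalize z over the mini-batch, then rescale with the learnable gamma and beta. All values here are dummies; at test time mu and sigma2 would come from the exponentially weighted averages instead.

import numpy as np

z = np.random.randn(64, 32)      # (batch_size, units) pre-activations, dummy values
gamma = np.ones((1, 32))         # learnable scale
beta = np.zeros((1, 32))         # learnable shift
eps = 1e-8

mu = z.mean(axis=0, keepdims=True)
sigma2 = z.var(axis=0, keepdims=True)
z_norm = (z - mu) / np.sqrt(sigma2 + eps)
z_tilde = gamma * z_norm + beta  # this is what goes into the activation function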
Suppose your input is a 300 by 300 color (RGB) image, and you are not using a convolutional network. If the first hidden layer has 100 neurons, each one fully connected to the input, how many parameters does this hidden layer have (including the bias parameters)?
27,000,100 = 300*300*3*100 + 100
Suppose your input is a 300 by 300 color (RGB) image, and you use a convolutional layer with 100 filters that are each 5x5. How many parameters does this hidden layer have (including the bias parameters)?
7,600 = 5*5*3*100 + 100
You have an input volume that is 63x63x16, and convolve it with 32 filters that are each 7x7, using a stride of 2 and no padding. What is the output volume?
29x29x32
dimension=((n+2p-f)/s)+1
((63+2*0-7)/2)+1 = 29
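A tiny helper for the dimension formula used here and in the next few cards (illustrative only; it assumes the division comes out even).

def conv_output_size(n, f, p=0, s=1):
    # ((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(63, 7, p=0, s=2))   # 29, matching the example above
print(conv_output_size(32, 2, p=0, s=2))   # 16, matching the max-pooling example below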
You have an input volume that is 15x15x8, and pad it using “pad=2.” What is the dimension of the resulting volume (after padding)?
19x19x8
You have an input volume that is 63x63x16, and convolve it with 32 filters that are each 7x7, and stride of 1. You want to use a “same” convolution. What is the padding?
3
dimension=((n+2p-f)/s)+1
((63+2p-7)/1)+1=63 -> p=3
You have an input volume that is 32x32x16, and apply max pooling with a stride of 2 and a filter size of 2. What is the output volume?
16x16x16
divide width and height by 2 or ((n+2p-f)/s)+1 works as well
True or False. Because pooling layers do not have parameters, they do not affect the backpropagation (derivatives) calculation.
False: pooling layers still affect the backpropagation calculation; for example, max pooling routes the gradient only through the unit that attained the maximum
Two reasons why ‘parameter’ sharing is a benefit for using convolutional networks?
- It reduces the total number of parameters, thus reducing overfitting
- It allows a feature detector to be used in multiple locations throughout the whole input/image/input volume
In lecture we talked about “sparsity of connections” as a benefit of using convolutional layers. What does this mean?
Each activation in the next layer depends on only a small number of activations from the previous layer
How do dimensions typically change in a ConvNet?
nH and nW decrease, while nC increases
Typical structure of a ConvNet?
Multiple CONV layers followed by a POOL layer repeated a few times, and FC layers in the last few layers
Suppose you have an input volume of dimension 64x64x16. How many parameters would a single 1x1 convolutional filter have (including the bias)?
17 (1*1*16 weights + 1 bias)
Suppose you have an input volume of dimension nH x nW x nC. How can you change the dimensions? (Assume that “1x1 convolutional layer” below always uses a stride of 1 and no padding.)
You can use a 1x1 convolutional layer to reduce nC but not nH, nW.
You can use a pooling layer to reduce nH, nW, but not nC.
What can a 1x1 convolution do for Inception Networks?
They can reduce the input data volume’s size before applying 3x3 and 5x5 convolutions
What is an inception block?
A single inception block allows the network to use a combination of 1x1, 3x3, 5x5 convolutions and pooling
What two reasons for using open-source implementations of ConvNets (both the model and/or weights)?
- It is a convenient way to get a working implementation of a complex ConvNet architecture.
- Parameters trained for one computer vision task are often useful as pretraining for other computer vision tasks.
You are working on a factory automation task. Your system will see a can of soft-drink coming down a conveyor belt, and you want it to take a picture and decide whether (i) there is a soft-drink can in the image, and if so (ii) its bounding box. Since the soft-drink can is round, the bounding box is always square, and the soft drink can always appears as the same size in the image. There is at most one soft drink can in each image. Here’re some typical images in your training set: What is the most appropriate set of output units for your neural network?
Logistic unit, bx, by
If you build a neural network that inputs a picture of a person’s face and outputs N landmarks on the face (assume the input image always contains exactly one face), how many output units will the network have?
2N
When training one of the object detection systems described in lecture, you need a training set that contains many pictures of the object(s) you wish to detect. However, bounding boxes do not need to be provided in the training set, since the algorithm can learn to detect the objects by itself.
False
Suppose you are applying a sliding windows classifier (non-convolutional implementation). Increasing the stride would tend to increase accuracy, but decrease computational cost.
False
In the YOLO algorithm, at training time, only one cell —the one containing the center/midpoint of an object— is responsible for detecting this object.
True
What is the IoU between these two boxes? The upper-left box is 2x2, and the lower-right box is 2x3. The overlapping region is 1x1
1/9 (intersection = 1, union = 4 + 6 - 1 = 9)
Suppose you are using YOLO on a 19x19 grid, on a detection problem with 20 classes, and with 5 anchor boxes. During training, for each image you will need to construct an output volume y as the target value for the neural network; this corresponds to the last layer of the neural network. (y may include some “?”, or “don’t cares”). What is the dimension of this output volume?
19x19x(5x25) = 19x19x125, since each anchor box needs pc, bx, by, bh, bw plus 20 class probabilities
Face _____requires comparing a new picture against one person’s face, whereas face recognition requires comparing a new picture against K person’s faces.
verification, recognition
Why do we learn a function d(img1,img2) for face verification?
This allows us to learn to recognize a new person given just a single image of that person. We need to solve a one-shot learning problem
In order to train the parameters of a face recognition system, it would be reasonable to use a training set comprising 100,000 pictures of 100,000 different persons.
False
having about 10 photos of each person would work well
Which of the following is a correct definition of the triplet loss? Consider that α>0. (We encourage you to figure out the answer from first principles, rather than just refer to the lecture.)
max(||f(A)−f(P)||^2 − ||f(A)−f(N)||^2 + α, 0)
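A numpy sketch of that triplet loss for a single (anchor, positive, negative) triple; the 128-dimensional embeddings below are random placeholders.

import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    pos_dist = np.sum((f_a - f_p) ** 2)
    neg_dist = np.sum((f_a - f_n) ** 2)
    return max(pos_dist - neg_dist + alpha, 0.0)

f_a, f_p, f_n = (np.random.randn(128) for _ in range(3))   # dummy face embeddings
print(triplet_loss(f_a, f_p, f_n))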
In a siamese network architecture neural networks will have ______ input images, but have exactly the same _____ parameters.
different, same
You train a ConvNet on a dataset with 100 different classes. You wonder if you can find a hidden unit which responds strongly to pictures of cats. (I.e., a neuron so that, of all the input/training images that strongly activate that neuron, the majority are cat pictures.) You are more likely to find this unit in layer 4 of the network than in layer 1
True
In the deeper layers of a ConvNet, each channel corresponds to a different feature detector. The style matrix G[l] measures the degree to which the activations of different feature detectors in layer l vary (or correlate) together with each other.
True
In neural style transfer, what is updated in each iteration of the optimization algorithm?
The pixel values of the generated image G
You are working with 3D data. You are building a network layer whose input volume has size 32x32x32x16 (this volume has 16 channels), and applies convolutions with 32 filters of dimension 3x3x3 (no padding, stride 1). What is the resulting output volume?
30x30x30x32
Two examples for when we would use a many-to-one RNN architecture?
- Sentiment classification from a text [0=negative, 1=positive]
- Gender recognition from speech [0=male, 1=female]
How many inputs does a<t> have in an RNN?
Two: a<t-1> and x<t>
You are training an RNN, and find that your weights and activations are all taking on the value of NaN (“Not a Number”). Which of these is the most likely cause of this problem?
Exploding gradient problem
Suppose you are training a LSTM. You have a 10000 word vocabulary, and are using an LSTM with 100-dimensional activations a. What is the dimension of Γu at each time step?
100
You have a pet dog whose mood is heavily dependent on the current and past few days’ weather. You’ve collected data for the past 365 days on the weather, which you represent as a sequence as x<1>,…,x<365>. You’ve also collected data on your dog’s mood, which you represent as y<1>,…,y<365>. You’d like to build a model to map from x→y. Should you use a Unidirectional RNN or Bidirectional RNN for this problem?
Unidirectional RNN, because the value of y<t> depends only on x<1>, …, x<t>, but not on x<t+1>, …, x<365>
What is t-SNE?
A non-linear dimensionality reduction technique. It can be used to view the relations in Word Embedding Matrix
What equations could you expect to make from ‘boy’, ‘girl’, ‘brother’ and ‘sister’ if your word embedding is good?
boy - girl =~ brother - sister
boy - brother =~ girl - sister
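A sketch of checking such an analogy with cosine similarity; the random "embeddings" below are placeholders, a real check would use trained word vectors.

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

emb = {w: np.random.randn(50) for w in ["boy", "girl", "brother", "sister"]}   # made-up vectors

# With a good embedding, these two difference vectors point in similar directions:
print(cosine(emb["boy"] - emb["girl"], emb["brother"] - emb["sister"]))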
When is it okay to use transfer learning for NLP?
When your training set is smaller than the training set used to create the word embedding
Which other file type uses JavaScript Object Notation (JSON) formatting?
Jupyter/IPython notebooks (.ipynb files)
Which residual-based approach to identifying outliers compares running a model with all data to running the same model, but dropping a single observation?
externally studentized residuals
equation to standardize the data?
(x-mean)/std
equation to min-max standardize the data?
(x-min)/(max-min)
What’s the robust scaler? And why use it?
It scales the data using the median and the interquartile range (IQR) instead of the min and max, so it’s less vulnerable to outliers.
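A numpy sketch of the three scalers above applied to one feature column (dummy data with an outlier); scikit-learn's StandardScaler, MinMaxScaler, and RobustScaler implement the same ideas.

import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])          # note the outlier

standardized = (x - x.mean()) / x.std()
min_maxed = (x - x.min()) / (x.max() - x.min())

q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)                   # IQR-based, so the outlier has less influence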
(True/False) In general, the population parameters are unknown
True
(True/False) Parametric models have a finite number of parameters
True
The most common way of estimating parameters in a parametric model is:
using the maximum likelihood estimation
A p-value is:
the smallest significance level at which the null hypothesis would be rejected
Type 1 Error is defined as:
Saying the null hypothesis is false, when it is actually true
Type 2 error is defined as:
Saying the null hypothesis is true, when it is actually false
(True/False) If you reject the null hypothesis, it means that the alternate hypothesis is true.
False
In K-fold cross-validation, how will increasing k affect the variance (across subsamples) of estimated model parameters?
increasing k will usually increase the variance of the estimated parameters
What does Bagging stand for?
bootstrap aggregating
What is the main condition to use stacking as ensemble method?
Models need to output predicted probabilities
This tree ensemble method only uses a subset of the features for each tree:
Random Forest
This is an ensemble model that does not use bootstrapped samples to fit the base trees, takes residuals into account, and fits the base trees iteratively:
Boosting
When clustering with KMeans, what’s the difference between inertia and distortion?
Inertia is the sum of squared distances from each point to its cluster centroid: use it when you want a similar number of observations in each cluster.
Distortion is the average squared distance: use it when you want the observations in each cluster to be very similar.
When using DBSCAN, how does the algorithm determine that a cluster is complete and is time to move to a different point of the data set and potentially start a new cluster?
When no point is left unvisited by the chain reaction
What are the advantages of DBSCAN?
No need to specify the number of clusters, it allows for noise, and it can handle strange cluster shapes
What are the disadvantages of DBSCAN?
Computationally expensive, hard to choose the parameters, and the clusters should have similar density
How to find the best number of clusters with the K-means algorithm?
Use the elbow method
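A hedged scikit-learn sketch of the elbow method: fit K-means for several values of k and plot the inertia, then look for the "elbow". X here is random placeholder data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.randn(300, 2)                     # placeholder data
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")              # the bend ("elbow") suggests a reasonable k
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()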
What are the key hyperparameters for the Hierarchical Clustering (Ward) algorithm?
distances and linkage
How do you choose the number of clusters when using the Mean Shift algorithm?
The algorithm chooses it for us
How do we define the core points when we use the DBSCAN algorithm?
A point that has more than n_clu neighbors in its ε-neighborhood
What are the L1 and L2 distances?
L1 = Manhattan L2 = Euclidean
When might the Manhattan distance be better?
When the data is very high dimensional
What is Cosine distance and when do we use it?
It measures the angle from the origin between two vectors. It can be good for text data when the location of occurrence is less important
Which distance metric is useful when we have text documents and we want to group similar topics together?
Jaccard
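A numpy sketch of the distance measures from the last few cards, on made-up vectors and token sets.

import numpy as np

u, v = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])

l1 = np.sum(np.abs(u - v))                          # Manhattan (L1)
l2 = np.sqrt(np.sum((u - v) ** 2))                  # Euclidean (L2)
cos_dist = 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))   # cosine distance

a, b = {"ml", "data", "model"}, {"data", "model", "cloud"}
jaccard_dist = 1 - len(a & b) / len(a | b)          # Jaccard distance between token sets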
For data with many features, principal components analysis …
generates new features that are linear combinations of other features
What is the main difference between kernel PCA and linear PCA?
Kernel PCA tends to preserve the geometric distances between the points while reducing the dimensionality of the space
Multi-Dimensional Scaling can be useful to do what?
To visualize the data
When we use the DBSCAN algorithm, how do we know that our cluster is complete and is time to move to a different point of the data set and potentially start a new cluster?
When no point is left unvisited by the chain reaction
Which statements correctly define the strengths of the DBSCAN algorithm?
No need to specify the number of clusters, allows for noise, and can handle arbitrary-shaped clusters
Which statements correctly define the weaknesses of the DBSCAN algorithm?
It needs two parameters as inputs, finding appropriate values for them can be difficult, and it does not do well with clusters of different density
How can you have clear separations of clusters while using HAC?
Use Single Linkage
Which linkage refers to maximum pairwise distance between clusters in HAC?
Complete linkage
Which of the following measure methods computes the inertia and picks the pair that is going to ultimately minimize the inertia value in HAC?
Ward Linkage
This is the type of decomposition model that is used if the magnitudes of the seasonal and residual values fluctuate with trend:
Multiplicative Decomposition Model
This decomposition model assumes that the seasonal and residual magnitudes are independent of trend.
Additive Decomposition Model
Which of the following smoothing techniques is appropriate for data with a trend but no seasonality?
Double Exponential Smoothing
Which of the following smoothing techniques is appropriate for data with both trend and seasonality?
Triple Exponential Smoothing
How is SARIMA different from ARIMA?
The S stands for seasonality: SARIMA adds seasonal terms to the ARIMA model
What is a characteristic of an autoregressive (AR) model?
A fixed number of past values of the series are used to predict future values.
What is a characteristic of a moving average (MA) model?
A fixed number of past forecast errors are used to predict future values.
An ARIMA model without differencing (I=0) is equivalent to which of the following approaches?
The sum of an AR and an MA model (an ARMA model).
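A hedged statsmodels sketch of fitting the (S)ARIMA models from these cards; the series is random placeholder data and the order/seasonal_order values are illustrative.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = np.random.randn(120).cumsum()                  # placeholder series
model = SARIMAX(y, order=(1, 1, 1),                # (p, d, q): AR terms, differencing, MA terms
                seasonal_order=(1, 0, 1, 12))      # (P, D, Q, s): the seasonal part of SARIMA
result = model.fit(disp=False)
forecast = result.forecast(steps=12)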
This plot summarizes the 2-way correlation between a variable and its past values:
Autocorrelation plot
Two major ways that data can be collected:
cross-sectional, and longitudinal (panel / time series)
Three important pillars of the mathematical sciences?
Function approximation, Optimization, Probability and Statistics.