Data Science Flashcards

1
Q

Two most common supervised tasks?

A

Classification and Regression

2
Q

Four common unsupervised tasks?

A

Clustering, visualization, dimensionality reduction, association rule learning

3
Q

What model is used to train a robot to walk on various unknown terrains?

A

Reinforcement Learning

4
Q

Is spam detection a supervised or unsupervised learning problem?

A

Supervised, you feed the model many emails that are labeled spam or not spam

5
Q

What is an online learning system?

A

A learning system that can learn incrementally. It is capable of adapting rapidly to changing data and autonomous systems, and of training on very large quantities of data.

6
Q

What is out-of-core learning?

A

Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer’s main memory. Chops the data into mini-batches and uses online learning techniques.

7
Q

What type of learning algorithm relies on a similarity measure to make predictions?

A

An instance-based learning system learns the training data by heart; then, when given new instances, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.

8
Q

What is the difference between a model parameter and a learning algorithm's hyperparameter?

A

A model parameter is what the model uses to make predictions given a new instance (e.g., the slope of a linear model); a hyperparameter is a parameter of the learning algorithm itself (e.g., the maximum depth of a decision tree).

9
Q

1. What do model-based learning algorithms search for? 2. What is the most common strategy they use to succeed? 3. How do they make predictions?

A

1. They search for an optimal value for the model parameters such that the model will generalize well to new instances. 2. They usually do so by minimizing a cost function. 3. They feed the new instance's features into the model's prediction function.

10
Q

Five main challenges to ML?

A
  1. lack of data 2. poor data quality 3. nonrepresentative data 4. uninformative features 5. overfitting or underfitting
11
Q

Four solutions to overfitting?

A
  1. get more data 2. simplify the model 3. reduce the noise in the data 4. regularize the model (constrain it)
12
Q

What is a test set?

A

A set used to estimate the generalization error that the model will make on new instances, before the model is launched in production.

13
Q

The purpose of a validation set?

A

To compare models and tune the hyperparameters

14
Q

What is a train-dev set?

A

Used when there is a risk of mismatch between the training data and the data used for validation

15
Q

Which Linear Regression training algorithm can you use if you have a training set with millions of features?

A

You can't use SVD or the Normal Equation because their computational complexity grows quickly with the number of features. Use Stochastic Gradient Descent or Mini-batch Gradient Descent. If the training set fits in memory, you can also use Batch Gradient Descent.
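A minimal Scikit-Learn sketch of this idea on a synthetic wide dataset; the shapes and hyperparameters are illustrative, not part of the original card.

```python
# Sketch: training a linear model on a wide feature matrix with
# Stochastic Gradient Descent (scales well with the number of features).
import numpy as np
from sklearn.linear_model import SGDRegressor

X = np.random.randn(1000, 5000)                      # many features, synthetic data
y = X @ np.random.randn(5000) + 0.1 * np.random.randn(1000)

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3)      # updates use one instance at a time
sgd_reg.fit(X, y)
print(sgd_reg.coef_[:5])
```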

16
Q

If the features in your training set have very different scales, which algorithms might suffer? What can you do about this?

A

The cost function will have the shape of an elongated bowl, so the Gradient Descent algorithms will take a long time to converge. (Normal Equation or SVD approach will work fine). To solve this you should scale the data first. Moreover, regularized models may converge to a suboptimal solution if the features are not scaled.

17
Q

Can Gradient Descent get stuck in a local minimum when training a Logistic Regression Model?

A

Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model because the cost function is convex.

18
Q

Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on?

A

If the validation error consistently goes up after every epoch, then one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.

19
Q

Do all Gradient Descent algorithms lead to the same model, provided you let them run long enough?

A

If the optimization problem is convex (such as Linear Regression or Logistic Regression), and assuming the learning rate is not too high, then all Gradient Descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge; instead they will keep jumping back and forth around the global optimum. This means that even if you let them run for a very long time, these Gradient Descent algorithms will produce slightly different models.

20
Q

Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?

A

Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch Gradient Descent is guaranteed to make progress at every single training iteration. So if you immediately stop training when the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to
save the model at regular intervals; then, when it has not improved for a long time (meaning it will probably never beat the record), you can revert to the best
saved model.

21
Q

Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?

A

Stochastic Gradient Descent has the fastest training iteration since it considers only one training instance at a time, so it is generally the first to reach the vicinity
of the global optimum (or Mini-batch GD with a very small mini-batch size). However, only Batch Gradient Descent will actually converge, given enough
training time. As mentioned, Stochastic GD and Mini-batch GD will bounce around the optimum, unless you gradually reduce the learning rate.

22
Q

Suppose you are using Polynomial Regression. You plot the learning curves and
you notice that there is a large gap between the training error and the validation
error. What is happening? What are three ways to solve this?

A

If the validation error is much higher than the training error, this is likely because your model is overfitting the training set. One way to try to fix this is to reduce
the polynomial degree: a model with fewer degrees of freedom is less likely to overfit. Another thing you can try is to regularize the model—for example, by adding an ℓ2 penalty (Ridge) or an ℓ1 penalty (Lasso) to the cost function. This will also reduce the degrees of freedom of the model. Lastly, you can try to increase the size of the training set.

23
Q

Suppose you are using Ridge Regression and you notice that the training error
and the validation error are almost equal and fairly high. Would you say that the
model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?

A

If both the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has a high
bias. You should try reducing the regularization hyperparameter α.

24
Q

Why would you want to use:

a. Ridge Regression instead of plain Linear Regression (i.e., without any regularization)?

A

A model with some regularization typically performs better than a model without any regularization, so you should generally prefer Ridge Regression over plain Linear Regression.

25
Q

Why would you want to use:

b. Lasso instead of Ridge Regression?

A

Lasso Regression uses an ℓ1 penalty, which tends to push the weights down to exactly zero. This leads to sparse models, where all weights are zero except for
the most important weights. This is a way to perform feature selection automatically, which is good if you suspect that only a few features actually matter. When you are not sure, you should prefer Ridge Regression.

26
Q

Why would you want to use: c. Elastic Net instead of Lasso?

A

Elastic Net is generally preferred over Lasso since Lasso may behave erratically in some cases (when several features are strongly correlated or when there are more features than training instances). However, it does add an extra hyperparameter to tune. If you want Lasso without the erratic behavior, you can just use Elastic Net with an l1_ratio close to 1.

27
Q

Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime.
Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?

A

If you want to classify pictures as outdoor/indoor and daytime/nighttime, since these are not exclusive classes (i.e., all four combinations are possible) you should train two Logistic Regression classifiers.

28
Q

What is the fundamental idea behind Support Vector Machines?

A

The fundamental idea behind Support Vector Machines is to fit the widest possible “street” between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets.

29
Q

What is a support vector?

A

After training an SVM, a support vector is any instance located on the “street” (see the previous answer), including its border. The decision boundary is entirely
determined by the support vectors. Any instance that is not a support vector (i.e., is off the street) has no influence whatsoever; you could remove them, add more instances, or move them around, and as long as they stay off the street they won’t affect the decision boundary. Computing the predictions only involves the support vectors, not the whole training set.

30
Q

Why is it important to scale the inputs when using SVMs?

A

SVMs try to fit the largest possible “street” between the classes, so if the training set is not scaled, the SVM will tend to neglect small features

31
Q

Can an SVM classifier output a confidence score when it classifies an instance?
What about a probability?

A
An SVM classifier can output the distance between the test instance and the decision boundary, and you can use this as a confidence score. However, this score
cannot be directly converted into an estimation of the class probability. If you set probability=True when creating an SVM in Scikit-Learn, then after training it
will calibrate the probabilities using Logistic Regression on the SVM’s scores (trained by an additional five-fold cross-validation on the training data). This will add the predict_proba() and predict_log_proba() methods to the SVM.
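A small Scikit-Learn sketch of this behavior on a synthetic dataset (the dataset and hyperparameters are illustrative):

```python
# Sketch: decision_function() gives a confidence score; probability=True
# adds cross-validated calibration so predict_proba() becomes available.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=42)

svm_clf = SVC(kernel="rbf", probability=True, random_state=42)
svm_clf.fit(X, y)

print(svm_clf.decision_function(X[:3]))  # signed distance to the decision boundary
print(svm_clf.predict_proba(X[:3]))      # calibrated class probabilities
```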
32
Q

Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?

A

This question applies only to linear SVMs, since kernelized SVMs can only use the dual form. The computational complexity of the primal form of the SVM problem is proportional to the number of training instances m, while the computational complexity of the dual form is proportional to a number between m^2 and m^3. So if there are millions of instances, you should definitely use the primal form, because the dual form will be much too slow.

33
Q

Say you’ve trained an SVM classifier with an RBF kernel, but it seems to underfit the training set. Should you increase or decrease γ (gamma)? What about C?

A

If an SVM classifier trained with an RBF kernel underfits the training set, there might be too much regularization. To decrease it, you need to increase gamma or C (or both).

34
Q

What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with one million instances?

A

The depth of a well-balanced binary tree containing m leaves is equal to log_2(m), rounded up. A binary Decision Tree (one that makes only binary decisions, as is the case for all trees in Scikit-Learn) will end up more or less well balanced at the end of training, with one leaf per training instance if it is trained without restrictions. Thus, if the training set contains one million instances, the Decision Tree will have a depth of log_2(10^6) ≈ 20 (actually a bit more, since the tree will generally not be perfectly well balanced).
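A quick check of the arithmetic above:

```python
# Depth estimate for a balanced binary tree with 10^6 leaves.
import math

m = 10**6
print(math.log2(m))             # ~19.93
print(math.ceil(math.log2(m)))  # 20
```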

35
Q

Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?

A
A node's Gini impurity is generally lower than its parent's. This is due to the CART training algorithm's cost function, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities. However, it is possible for a node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child's impurity. For example, consider a node containing four instances of class A and one of class B. Its Gini impurity is 1 − (1/5)^2 − (4/5)^2 = 0.32. Now suppose the dataset is one-dimensional and the instances are lined up in the following order: A, B, A, A, A. You can verify that the algorithm will split this node after the second instance, producing one child node with instances A, B and the other child node with instances A, A, A. The first child node's Gini impurity is 1 − (1/2)^2 − (1/2)^2 = 0.5, which is higher than its parent's. This is compensated for by the fact that the other node is pure, so the overall weighted Gini impurity is 2/5 × 0.5 + 3/5 × 0 = 0.2, which is lower than the parent's Gini impurity.
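The Gini arithmetic above, reproduced as a small sketch:

```python
# Reproducing the Gini impurity numbers from the answer above.
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = gini([4, 1])                      # 4 x A, 1 x B  -> 0.32
left, right = gini([1, 1]), gini([3, 0])   # {A, B} -> 0.5, {A, A, A} -> 0.0
weighted = (2 / 5) * left + (3 / 5) * right  # 0.2, lower than the parent's 0.32
print(parent, left, right, weighted)
```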
36
Q

If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?

A

If a Decision Tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it.

37
Q

If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?

A

Decision Trees don’t care whether or not the training data is scaled or centered; that’s one of the nice things about them. So if a Decision Tree underfits the training set, scaling the input features will just be a waste of time.

38
Q

If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?

A

The computational complexity of training a Decision Tree is O(n × m log(m)). So if you multiply the training set size by 10, the training time will be multiplied by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). If m = 10^6, then K ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.
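The ratio K above, reproduced as a quick calculation:

```python
# Training-time ratio K for m = 10^6 (the factor n cancels out).
import math

m = 10**6
K = 10 * math.log(10 * m) / math.log(m)
print(K)   # ~11.7, so roughly 11.7 hours if 1 million instances took 1 hour
```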

39
Q

If your training set contains 100,000 instances, will setting presort=True speed up training of a Decision Tree classifier?

A

Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. If it contains 100,000 instances, setting presort=True
will considerably slow down training.

40
Q

If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?

A

If you have trained five different models and they all achieve 95% precision, you can try combining them into a voting ensemble, which will often give you even
better results. It works better if the models are very different (e.g., an SVM classifier, a Decision Tree
Classifier, a Logistic Regression classifier, and so on). It is even better if they are trained on different training instances (that’s the whole point of bagging and pasting ensembles), but if not this will still be effective as long as the models are very different.

41
Q

What is the difference between hard and soft voting classifiers?

A
A hard voting classifier just counts the votes of each classifier in the ensemble and picks the class that gets the most votes. A soft voting classifier computes the
average estimated class probability for each class and picks the class with the highest probability. This gives high-confidence votes more weight and often performs better, but it works only if every classifier is able to estimate class probabilities (e.g., for the SVM classifiers in Scikit-Learn you must set probability=True).
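A minimal sketch of hard vs. soft voting in Scikit-Learn, assuming a synthetic moons dataset and arbitrary base models:

```python
# Sketch: hard vs. soft voting with very different base models.
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

estimators = [
    ("lr", LogisticRegression(random_state=42)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("svc", SVC(probability=True, random_state=42)),  # needed for soft voting
]
hard_clf = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority vote
soft_clf = VotingClassifier(estimators, voting="soft").fit(X, y)  # average probabilities
print(hard_clf.predict(X[:5]))
print(soft_clf.predict_proba(X[:5]))
```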
42
Q

Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, Random Forests, or stacking ensembles?

A

It is quite possible to speed up training of a bagging ensemble by distributing it across multiple servers, since each predictor in the ensemble is independent of the others. The same goes for pasting ensembles and Random Forests, for the same reason. However, each predictor in a boosting ensemble is built based on the previous predictor, so training is necessarily sequential, and you will not gain anything by distributing training across multiple servers. Regarding stacking ensembles, all the predictors in a given layer are independent of each other, so they can be trained in parallel on multiple servers. However, the predictors in one layer can only be trained after the predictors in the previous layer have all been trained.

43
Q

What is the benefit of out-of-bag evaluation?

A

With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated using instances that it was not trained on (they were held out). This makes it possible to have a fairly unbiased evaluation of the ensemble without the need for an additional validation set. Thus, you have more instances available for training, and your ensemble can perform slightly better.

44
Q

What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?

A

When you are growing a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. This is true as well for Extra-Trees, but they go one step further: rather than searching for the best possible thresholds, like regular Decision Trees do, they use random thresholds for each feature. This extra randomness acts like a form of regularization: if a Random Forest overfits the training data, Extra-Trees might perform better. Moreover,
since Extra-Trees don’t search for the best possible thresholds, they are much faster to train than Random Forests. However, they are neither faster nor slower
than Random Forests when making predictions.

45
Q

If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how?

A

If your AdaBoost ensemble underfits the training data, you can try increasing the number of estimators or reducing the regularization hyperparameters of the base estimator. You may also try slightly increasing the learning rate.

46
Q

If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?

A

If your Gradient Boosting ensemble overfits the training set, you should try decreasing the learning rate. You could also use early stopping to find the right
number of predictors (you probably have too many).

47
Q

What are the main motivations for reducing a dataset’s dimensionality?

A

The main motivations for dimensionality reduction are:
• To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform
better)
• To visualize the data and gain insights on the most important features
• To save space (compression)

48
Q

What are the main drawbacks for reducing a dataset’s dimensionality?

A

The main drawbacks are:
• Some information is lost, possibly degrading the performance of subsequent training algorithms.
• It can be computationally intensive.
• It adds some complexity to your Machine Learning pipelines.
• Transformed features are often hard to interpret.

49
Q

What is the curse of dimensionality?

A

The curse of dimensionality refers to the fact that many problems that do not exist in low-dimensional space arise in high-dimensional space. In Machine
Learning, one common manifestation is the fact that randomly sampled high-dimensional vectors are generally very sparse, increasing the risk of overfitting and making it very difficult to identify patterns in the data without having plenty of training data.

50
Q

Once a dataset’s dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?

A

Once a dataset’s dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation,
because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other
algorithms (such as t-SNE) do not.

51
Q

Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?

A

PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions—as in a Swiss roll dataset—then
reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.

52
Q

Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?

A

That’s a trick question: it depends on the dataset. Let’s look at two extreme examples. First, suppose the dataset is composed of points that are almost perfectly
aligned. In this case, PCA can reduce the dataset down to just one dimension while still preserving 95% of the variance. Now imagine that the dataset is composed of perfectly random points, scattered all around the 1,000 dimensions. In this case roughly 950 dimensions are required to preserve 95% of the variance. So
the answer is, it depends on the dataset, and it could be any number between 1 and 950. Plotting the explained variance as a function of the number of dimensions is one way to get a rough idea of the dataset’s intrinsic dimensionality.
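A minimal sketch, assuming a synthetic 1,000-dimensional dataset; the resulting number of dimensions depends entirely on the data:

```python
# Sketch: let PCA choose the number of dimensions preserving 95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(500, 1000)    # synthetic 1,000-dimensional data

pca = PCA(n_components=0.95)      # a float in (0, 1) means "explained variance ratio"
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1])                              # resulting dimensionality
print(np.cumsum(pca.explained_variance_ratio_)[-1])    # >= 0.95
```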

53
Q

In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?

A

Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don’t fit in memory, but it is slower
than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, when you need to apply
PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Kernel PCA is useful for nonlinear datasets.

54
Q

How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?

A

Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One
way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation. Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm
(e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much
information, then the algorithm should perform just as well as when using the original dataset.

55
Q

Does it make any sense to chain two different dimensionality reduction algorithms?

A

It can absolutely make sense to chain two different dimensionality reduction algorithms. A common example is using PCA to quickly get rid of a large number of useless dimensions, then applying another much slower dimensionality reduction algorithm, such as LLE. This two-step approach will likely yield the same performance as using LLE only, but in a fraction of the time.

56
Q

How would you define clustering? Can you name a few clustering algorithms?

A

In Machine Learning, clustering is the unsupervised task of grouping similar instances together. The notion of similarity depends on the task at hand: for example, in some cases two nearby instances will be considered similar, while in others similar instances may be far apart as long as they belong to the same densely packed group. Popular clustering algorithms include K-Means, DBSCAN, agglomerative clustering, BIRCH, Mean-Shift, affinity propagation, and spectral clustering.

57
Q

What are some of the main applications of clustering algorithms?

A

The main applications of clustering algorithms include data analysis, customer segmentation, recommender systems, search engines, image segmentation, semisupervised learning, dimensionality reduction, anomaly detection, and novelty detection.

58
Q

Describe two techniques to select the right number of clusters when using K-Means.

A

The elbow rule is a simple technique to select the number of clusters when using K-Means: just plot the inertia (the mean squared distance from each instance to its nearest centroid) as a function of the number of clusters, and find the point in the curve where the inertia stops dropping fast (the “elbow”). This is generally close to the optimal number of clusters. Another approach is to plot the silhouette score as a function of the number of clusters. There will often be a peak, and the optimal number of clusters is generally nearby. The silhouette score is the mean silhouette coefficient over all instances. This coefficient varies from +1 for instances that are well inside their cluster and far from other clusters, to –1 for instances that are very close to another cluster. You may also plot the silhouette diagrams and perform a more thorough analysis.
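A minimal sketch of both techniques on synthetic blobs (the data and the range of k are illustrative):

```python
# Sketch: inertia (for the elbow rule) and silhouette score for several values of k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))
# Look for the "elbow" in the inertia curve and the peak in the silhouette score.
```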

59
Q

What is label propagation? Why would you implement it, and how?

A

Labeling a dataset is costly and time-consuming. Therefore, it is common to have plenty of unlabeled instances, but few labeled instances. Label propagation is a technique that consists in copying some (or all) of the labels from the labeled
instances to similar unlabeled instances. This can greatly extend the number of labeled instances, and thereby allow a supervised algorithm to reach better performance (this is a form of semi-supervised learning). One approach is to use a clustering algorithm such as K-Means on all the instances, then for each cluster find the most common label or the label of the most representative instance (i.e., the one closest to the centroid) and propagate it to the unlabeled instances in the same cluster.

60
Q

Can you name two clustering algorithms that can scale to large datasets? And two that look for regions of high density?

A

K-Means and BIRCH scale well to large datasets. DBSCAN and Mean-Shift look for regions of high density.

61
Q

Can you think of a use case where active learning would be useful? How would you implement it?

A

Active learning is useful whenever you have plenty of unlabeled instances but labeling is costly. In this case (which is very common), rather than randomly selecting instances to label, it is often preferable to perform active learning, where human experts interact with the learning algorithm, providing labels for specific instances when the algorithm requests them. A common approach is uncertainty sampling (see the description in “Active Learning” on page 255).

62
Q

What is the difference between anomaly detection and novelty detection?

A
Many people use the terms anomaly detection and novelty detection interchangeably, but they are not exactly the same. In anomaly detection, the algorithm is
trained on a dataset that may contain outliers, and the goal is typically to identify these outliers (within the training set), as well as outliers among new instances.
In novelty detection, the algorithm is trained on a dataset that is presumed to be “clean,” and the objective is to detect novelties strictly among new instances. Some algorithms work best for anomaly detection (e.g., Isolation Forest), while others are better suited for novelty detection (e.g., one-class SVM).
63
Q

What is a Gaussian mixture? What tasks can you use it for?

A

A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose
parameters are unknown. In other words, the assumption is that the data is grouped into a finite number of clusters, each with an ellipsoidal shape (but the clusters may have different ellipsoidal shapes, sizes, orientations, and densities), and we don’t know which cluster each instance belongs to. This model is useful for density estimation, clustering, and anomaly detection.

64
Q

Can you name two techniques to find the right number of clusters when using a Gaussian mixture model?

A
One way to find the right number of clusters when using a Gaussian mixture model is to plot the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) as a function of the number of clusters, then choose the number of clusters that minimizes the BIC or AIC. Another technique is to use a Bayesian Gaussian mixture model, which automatically selects the number of clusters.
65
Q

What is θ = (X^T X)^(-1) X^T y, and when can we use it?

A

This is the Normal Equation. It is an alternative to Gradient Descent, usable when the number of features isn't too big.
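A minimal NumPy sketch of the Normal Equation on synthetic data (the true coefficients 4 and 3 are made up for illustration):

```python
# Sketch: solving linear regression with the Normal Equation.
import numpy as np

m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(m)

X_b = np.c_[np.ones((m, 1)), X]                    # add a bias column of 1s
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y     # theta = (X^T X)^(-1) X^T y
print(theta)                                       # approximately [4, 3]
```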

66
Q

How do you standardize a feature (mean = 0, std = 1)?

A

Subtract the feature's mean from each instance's value, then divide by the feature's standard deviation.
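A minimal sketch of this, both by hand and with Scikit-Learn's StandardScaler (the sample values are arbitrary):

```python
# Sketch: standardizing a feature (mean 0, std 1).
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10.0], [20.0], [30.0], [40.0]])

x_std_manual = (x - x.mean()) / x.std()           # subtract mean, divide by std
x_std_sklearn = StandardScaler().fit_transform(x)
print(x_std_manual.ravel(), x_std_sklearn.ravel())  # identical values
```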

67
Q

What do you do if your cost function increases after each iteration?

A

Make the learning rate (alpha), smaller.

68
Q

When is Gradient Descent a better option than the Normal Equation?

A

When there are too many features (e.g., more than 10,000).

69
Q

Why use feature scaling?

A

It makes Gradient Descent converge faster by letting it take a more direct path to the minimum.

70
Q

What are the dimensions of theta(j) in a neural network?

A

s_(j+1) × (s_j + 1),

e.g., (hidden layer size) × (input layer size + 1).

71
Q

What is a common reason for an ML model that works well in training but fails in production?

A

The ML dataset was improperly created

72
Q

Personalized Algorithms are often built using which type of ML model?

A

Recommendation systems (but you must understand and know the tools and tricks of image processing and sequence systems to understand recommendation systems).

73
Q

What is a key lesson Google has learned with regards to reducing the chance of failure in production ML models?

A

Process batch and streaming data the same way

74
Q

Which of the following scenarios may require a supervised learning model to be retrained as a new model?

A

The model was trained on labeled data and we now wish to correct the labels of the data.

75
Q

Someone read emails for a company and then forwards the emails to the appropriate department. How can we automate this process?

A

Use several models to read, sort, and send to departments. If there are any pre-existing models then use them.

76
Q

A team is preparing to develop and deploy an ML model for use on a shopping website. They have collected a little data to train the model. The team plans on gathering more data once the model is developed. Now they are ready for the next phase, training.

Which of these scenarios will most likely lead to a successful deployment of the ML model?

A

The team should take time to gather more data, because with more data, it is possible to create a simpler ML model that performs better.

77
Q

What are the five phases of the “Path to ML”?

A

Individual contributor, delegation, digitization, big data and analytics, machine learning

78
Q

You are going to develop an ML model. You are in Canada and the rest of the team is in Mexico.

Your team wants to use Google Cloud Platform with Python Notebook. Which of the following statements support your decision.

A

Datalab notebooks are hosted in the cloud

79
Q

Your team has decided to use Compute Engine, Cloud Storage, and Datalab for ML model development.

Which two statements are applicable to your situation?

A

Every member of the team, regardless of their location, can directly read data from Cloud Storage.

Latency of data access can be a concern, so carefully select the zone for data storage.

80
Q

The third wave of cloud is _________________ so you can focus on data ___________ instead of infrastructure.

A

serverless, insights

81
Q

Three quality attributes of data?

A

Consistency, accuracy, auditability

82
Q

Two categories of data quality tools?

A

Cleaning tools, monitoring tools

83
Q

Three features of low data quality?

A

unreliable info, incomplete data, duplicated data

84
Q

What is the Orderliness of data?

A

The data entered has the required format and structure

85
Q

Three best practices for data quality management?

A

resolving missing values, preventing duplicates, automating data entry

86
Q

Which is the correct sequence of steps in data science after the data is gathered? (4 steps)

A

Data Exploration -> Data Cleaning -> Model Building -> Present Results

87
Q

Three objectives of exploratory data analysis?

A

Check for missing data and other mistakes, Gain maximum insight into the data set and its underlying structure, uncover a parsimonious model (the most useful features)

88
Q

Two main methods for Exploratory Data Analysis?

A

Univariate and Bivariate

89
Q

What machine learning models have labels, or in other words, the correct answers to whatever it is that we want to learn to predict?

A

Supervised model

90
Q

Two most common types of Supervised machine learning models?

A

Regression model, and classification model

91
Q

Which model would you use if your problem required a discrete number of values or classes?

A

Classification model

92
Q

When the data isn’t labelled, what is an alternative way of predicting the output?

A

Clustering Algorithms

93
Q

What is the most essential metric a regression model uses?

A

Mean squared error as its loss function

94
Q

Fill in the blank. In the video, we presented a linear equation. This hypothesis equation is applied to every _________ of our dataset, where the weight values are fixed and the feature values come from each associated column of our machine learning dataset.

A

row

95
Q

Fill in the blanks. Fundamentally, classification is about predicting a _______ and regression is about predicting a __________.

A

Label, Quantity

96
Q

What component of a biological neuron is analogous to the input portion of a perceptron?

A

Dendrites

97
Q

Which of the following is an algorithm for supervised learning of binary classifiers (a binary classifier being a function that decides whether an input, represented by a vector of numbers, belongs to some specific class)?
Binary classifier, Perceptron, Linear Regression

A

Perceptron

98
Q

Which model is a linear classifier, also used in supervised learning?
Neuron, Dendrites, Perceptron

A

Perceptron

99
Q

A perceptron is a type of _____ that makes its predictions based on a linear predictor function combining a set of weights with the ________.

A

linear classifier, feature vector

100
Q

Three steps in the Perceptron Learning Process

A
  1. Takes the inputs, multiplies them by their weights, and computes their sum.
  2. Adds a bias factor, the number 1 multiplied by a weight.
  3. Feeds the sum through the activation function.
101
Q

Six elements of a perceptron?

A
  1. Input function X
  2. Bias b (constant)
  3. Weights
  4. Weighted sum
  5. Activation function
  6. Output
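A minimal sketch of a single perceptron forward pass tying the two cards above together; the weights, bias, and inputs are made-up values and the activation is a simple step function:

```python
# Sketch: one forward pass of a perceptron (weighted sum + bias + step activation).
import numpy as np

def perceptron(x, w, b):
    weighted_sum = np.dot(w, x) + b        # inputs * weights, plus the bias term
    return 1 if weighted_sum >= 0 else 0   # step activation function

x = np.array([1.0, 0.5, -0.2])   # input features (hypothetical)
w = np.array([0.4, -0.3, 0.8])   # weights (hypothetical)
b = 0.1                          # bias
print(perceptron(x, w, b))
```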
102
Q

Neural Networks: If I wanted my outputs to be in the form of probabilities, which activation function should I use in the final layer?

A

Sigmoid

103
Q

A single unit for a non-input neuron has these three things:

A
  1. Weighted Sum
  2. Activation function
  3. Output of the activation function
104
Q

What activation functions are needed to get the complex chain functions that allow neural networks to learn data distributions?

A

Nonlinear activation functions

105
Q

The range of a ReLU output?

A

between zero and infinity

106
Q

The range of Tanh output?

A

between -1 and 1

107
Q

The range of a Sigmoid output?

A

between zero and 1

108
Q

The range of a ELU output?

A

between -1 and infinity

109
Q

In a decision classification tree, what does each decision or node consist of?

A

Linear classifier of one feature

110
Q

Mean squared error minimizer and euclidean distance minimizer are used in ______, not ______.

A

regression, classification

111
Q

What in a neural network lets you map the data to a higher-dimensional vector space?

A

More neurons per layer

112
Q

SVM: The _____ is the distance between two separate vectors.

A

margin

113
Q

SVM: The more generalizable the decision boundary, the ____ the margin.

A

wider

114
Q

SVMs are used for text classification tasks such as __________, __________, and _________.

A

category assignment,

detecting spam, sentiment analysis

115
Q

SVMs are based on the idea of finding a ________ that best divides a dataset into _____ classes. ___________ are the data points nearest to the hyperplane,, the points of a data set that, if removed, would alter the position of the dividing ______. As a simple example, for a classification task with only two features, you can think of a _______ as a ______ that ______ separates and classifies a set of data.

A

hyperplane, two, support vectors, hyperplane, hyperplane, line, linearly

116
Q

A _______ maps the data from our ______ vector space to a vector space that has features that can be ______ separated.

A

kernel transformation, input, linearly

117
Q

In ML, kernel methods are a class of algorithms for ________, whose best-known member is the ________.

A

pattern analysis, support vector machine

118
Q

Dropout in neural networks works by randomly setting the _______ of hidden units to ____ at each update of training phase.

A

outgoing edges, 0

119
Q

How does dropout help neural networks generalize?

A

By setting some outputs to 0, the cost function becomes more sensitive to neighboring neurons, changing the way the weights are updated during backpropagation.

120
Q

Three types of modern neural networks.

A

Convolutional, modular, recurrent

121
Q

Three ways to improve generalization in a NN?

A

Adding dropout layers, performing data augmentation, adding noise

122
Q

At its core, a ________ is a method of evaluating how well your algorithm models your dataset. If your predictions are totally off, your _________ will output a higher number. If they’re pretty good, it will output a lower number. As you change pieces of your algorithm to try and improve your model, your ______ will tell you if you’re getting anywhere.

A

loss function

123
Q

Simply speaking, __________ is the workhorse of basic loss functions. ______ is the sum of squared distances between our target variable and predicted values.

A

mean squared error

124
Q

Loss functions can be broadly categorized into 2 types: Classification and Regression Loss. _____ is typically used for regression and ______ is typically used for classification.

A

mean squared error, cross entropy

125
Q

Gradient Descent is an optimization algorithm used to _______ some function by iteratively moving in the direction of the steepest descent as by the _________. In machine learning, we use gradient descent to update the _______ of our model.

A

minimize, negative of the gradient, parameters

126
Q

________, also called vanilla gradient descent, calculates the error for _______ within the training dataset, but only ________ all training examples have been evaluated does the model get updated. This whole process is like a cycle and is called a training epoch.

A

Batch gradient descent, each example, after

127
Q

In the ________________________ method, one training sample (example) is passed through the neural network at a time and the parameters (weights) of each layer are updated with the computed gradient.

A

Stochastic Gradient Descent

128
Q

________________: Parameters are updated after computing the gradient of error with respect to the entire training set
________________: Parameters are updated after computing the gradient of error with respect to a single training example
________________: Parameters are updated after computing the gradient of error with respect to a subset of the training set

A

Batch Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent
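A minimal NumPy sketch contrasting the three schedules on a synthetic linear-regression problem; only the batch size changes, and the learning rate, epoch count, and data are illustrative:

```python
# Sketch: batch, stochastic, and mini-batch gradient descent differ only in
# how many examples are used per parameter update.
import numpy as np

X = np.c_[np.ones(100), np.random.rand(100, 1)]
y = 4 + 3 * X[:, 1] + np.random.randn(100)

def gd(batch_size, lr=0.1, epochs=50):
    theta = np.zeros(2)
    for _ in range(epochs):
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            grad = 2 / len(batch) * X[batch].T @ (X[batch] @ theta - y[batch])
            theta -= lr * grad            # one update per batch
    return theta

print(gd(len(X)))   # batch GD: one update per epoch (whole training set)
print(gd(1))        # stochastic GD: one update per training example
print(gd(16))       # mini-batch GD: one update per small subset
```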

129
Q

What is a type one error?

A

When the model predicts positive but it’s actually a negative (predicts face when it’s a statue).

130
Q

Formula for precision

A

True positives / (True positives + False Positives)

131
Q

An increase in what factor will drive down the precision ratio?

A

False Positives

132
Q

What is type two error?

A

When the model predicts negative but it's actually a positive (predicts "not a face" when it's a face in winter clothes).

133
Q

Formula for recall

A

True positives /(true positives + false negatives)
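A small sketch reproducing both the precision and recall formulas on a toy prediction vector (the labels are made up):

```python
# Sketch: precision and recall from raw counts, checked against Scikit-Learn.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp / (tp + fp), precision_score(y_true, y_pred))  # precision
print(tp / (tp + fn), recall_score(y_true, y_pred))     # recall
```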

134
Q

Why is RMSE preferred?

A

The loss metric is measured in the same units as the target variable, making the error easier to interpret directly.

135
Q

There will always be a ____ between the metrics we care about and the metrics that work well with gradient descent.

A

gap

136
Q

What is the significance of performance metrics?

Plus two benefits

A

Performance metrics will allow us to reject models that have settled into inappropriate minima.

  1. easier to understand
  2. directly connected to business goals
137
Q

Two ways to think about recall?

A
  1. inversely related to precision

2. Recall is like a person who never wants to be left out of a positive decision

138
Q

Two parameters that affect gradient descent?

A
  1. learning rate

2. batch size

139
Q

What is the best way to assess the quality of a model?

A

To observe how well a model performs against a new dataset that it hasn’t seen before

140
Q

How do you decide when to stop training a model?

A

When your loss metrics start to increase against the validation set

141
Q

What actions can you perform on your model when it is trained and validated?

A

You can run it once, and only once, against the independent test dataset.

142
Q

What two loss functions are the most common for Regressions?

A

RMSE for linear regression, cross-entropy for classification

143
Q

Which is the most preferred way to traverse loss surfaces efficiently?

A

By analyzing the slopes of our loss functions, which provide us directions and step magnitude.

144
Q

What core algorithm is used to construct Decision Trees?

A

Greedy algorithms

145
Q

The RAND function in BigQuery generates a value between ____ and ____.

A

zero, one

146
Q

How can you create repeatable samples of your data in BigQuery?

A

Use the last few digits of a hash function on the field that you’re using to split or bucketize your data

147
Q

What allows you to split the dataset based upon a field in your data?

A

FARM_FINGERPRINT, an open-source hashing algorithm that is implemented in BigQuery SQL.

148
Q

TensorFlow is a _____ and _____ platform programming interface for implementing and running machine learning algorithms, including convenience wrappers for deep learning.

A

scalable, multi

149
Q

In TensorFlow, ____ are multi-dimensional arrays with a uniform type. All tensors are ____ like Python numbers and strings: you can never update the contents of a tensor, only create a new one.

A

tensors, immutable

150
Q

How does TensorFlow represent numeric computations?

A

Using a Directed Acyclic Graph (or DAG)

151
Q

How can we improve the calculation speed in TensorFlow, without losing accuracy?

A

Using GPU

152
Q

tf.losses, tf.metrics, and tf.optimizers are useful components when?

A

building custom Neural Network models.

153
Q

Which processing units can you run TensorFlow?

A

CPU, GPU, TPU

154
Q

tf.estimator, tf.keras, and tf.data are high-level APIs used for what?

A

distributed training

155
Q

You need to build a custom NN model. What are two options?

A

We can use an estimator from TF, or we can use a high-level API such as Keras

156
Q

Which of the following API’s are not used in the TensorFlow abstraction layers?
C++ API, Python API, tf.keras, tf.image

A

tf.image

157
Q

Which API is used to build performant, complex input pipelines from simple, reusable pieces that feed your model's training or evaluation loops?

A

tf.data.Dataset

158
Q

Two operations that can be performed on tensors?

A

reshaped, sliced
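A minimal TensorFlow sketch of both operations (the tensor values are arbitrary):

```python
# Sketch: reshaping and slicing a constant tensor.
import tensorflow as tf

x = tf.constant([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])   # Shape: [3, 4], rank 2

print(tf.reshape(x, [4, 3]))          # reshaped (same 12 elements, new shape)
print(x[:, 1:3])                      # sliced: all rows, columns 1 and 2
```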

159
Q

What rank is Shape:[3,4]?

A

Rank 2

160
Q

TensorFlow records all operations executed inside the context of a tf._______ onto a _____.

A

GradientTape, tape

161
Q

When we compute a loss gradient TensorFlow uses ___ and the ___ associated with each recorded operation to compute the _____.

A

tape, gradients, gradients

162
Q

In a TensorFlow loss gradient opperation the computed gradient of a recorded computation will be used in ______ mode differentiation.

A

reverse

163
Q

How to produce tensors that can be modified and that can be used for weights?

A

tf.Variable

164
Q

A tf.Variable represents a tensor whose value can be _____ by running ___ on it. Specific ops allow you to read and modify the values of this tensor. Higher level libraries like ______ use tf.Variable to store model parameters.

A

changed, ops, tf.keras
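A minimal TensorFlow sketch tying these cards together; the loss function and learning rate are made up for illustration:

```python
# Sketch: a tf.Variable as a trainable weight, with tf.GradientTape recording
# operations so the gradient can be computed in reverse mode.
import tensorflow as tf

w = tf.Variable(3.0)                  # modifiable tensor, suitable for weights

with tf.GradientTape() as tape:
    loss = w * w + 2.0 * w            # operations recorded onto the tape

grad = tape.gradient(loss, w)         # d(loss)/dw = 2w + 2 = 8.0
print(grad)

w.assign_sub(0.1 * grad)              # ops like assign_sub modify the variable
print(w.numpy())                      # 2.2
```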

165
Q

Feature columns describe how the model should use ____ _____ data from your features ______.

A

raw input, dictionary

166
Q

A bucketized column helps with discretizing _____ _____ _____.

A

continuous feature values

167
Q

Two distinct ways to create a dataset in TensorFlow?

A
  1. A data source constructs a dataset from data stored in memory or in one or more files.
  2. A data transformation constructs a dataset from one or more tf.data.Dataset objects
168
Q

_____ is used to instantiate a Dataset object which is comprised of lines from one or more text files.

A

TextLineDataset

169
Q

The _____ format is a simple format for storing a sequence of binary records. Using ____ can be useful for standardizing input data and optimizing performance.

A

TFRecord (read with tf.data.TFRecordDataset)

170
Q

____ has fixed-length records from one or more binary files.

A

FixedLengthRecordDataset

171
Q

Which method, when invoked on the dataset, triggers the creation and execution of two operations?

A

iter

172
Q

Three purposes of Neural Network embedding?

A
  1. Finding nearest neighbors in the embedding space. These can be used to make recommendations based on user interests or cluster categories.
  2. As input to a machine learning model for a supervised task.
  3. For visualization of concepts and relations between categories.
173
Q

Three types of feature columns?

A

Categorical, Bucketized, Crossed

174
Q

Which of the following components is not part of the ML training phase?
Labeled Data, ML Algorithm, Served Model, Trained Model

A

Served Model

175
Q

Three ways to feed TensorFlow models with data?

A

TextLineDataset, TFRecordDataset, FixedLengthRecordDataset

176
Q

What is the role of the tf.data API in TensorFlow?

A

It enables you to build complex input pipelines from simple, reusable pieces

177
Q

Three components of the ML pipeline before running the model?

A

Data Extraction, Data Exploration, Data Analysis

178
Q

Non-linearity helps in training your model at a _____ _____ ____ and with _____ _____ without the loss of your important information.

A

much faster rate, more accuracy

179
Q

The activation function which is linear in the positive domain and the function is 0 in the negative domain.

A

ReLU

180
Q

During the training process, each additional layer in your network can successively reduce signal vs. noise. How can we fix this?

A

Use non-saturating, nonlinear activation functions such as ReLUs

181
Q

How can we solve the problem called internal covariate shift?

A

Batch normalization

182
Q

How can we stop ReLU layers from dying?

A

lower your learning rates

183
Q

Which model is appropriate for a plain stack of layers?

A

Sequential

184
Q

How does Adam (optimization algorithm) help in compiling the Keras model?

A

By updating network weights iteratively based on training data and by diagonal rescaling of the gradients

185
Q

The predict function in the tf.keras API returns what?

A

Numpy array(s) of predictions

186
Q

Three parameters involved while compiling the Keras model?

A

optimizer, loss function, evaluation metrics

187
Q

What is the significance of the fit method while training a Keras model?

A

It trains the model for a fixed number of epochs (iterations over the training data).

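As a single hedged example covering the Sequential model, Adam, the compile parameters, fit, and predict (assuming TensorFlow 2.x; the random data and layer sizes are made up):

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([                    # a plain stack of layers
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",                  # optimizer
              loss="binary_crossentropy",        # loss function
              metrics=["accuracy"])              # evaluation metrics

X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=(100,))
model.fit(X, y, epochs=5, batch_size=32, verbose=0)   # fit trains for a fixed number of epochs
preds = model.predict(X[:5])                     # returns a NumPy array of predictions
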
188
Q

Two weaknesses of the Keras Functional API

A
  1. It doesn’t support dynamic architectures. The Functional API treats models as DAGs of layers. This is true for most deep learning architectures, but not all: for instance, recursive networks or Tree RNNs do not follow this assumption and cannot be implemented in the Functional API.
  2. Sometimes we have to write from scratch and need to build subclasses. When writing advanced architectures, you may want to do things that are outside the scope of “defining a DAG of layers”: for instance, you may want to expose multiple custom training and inference methods on your model instance. This requires subclassing.
189
Q

The Keras Functional API can be characterized by having:

A

Multiple inputs and outputs and models with shared layers

190
Q

The core data structure of Keras is a model, which lets us organize and design layers. The _____ model is the simplest type of model (a linear stack of layers). If we need to build arbitrary graphs of layers, the Keras _____ ____ can do that for us.

A

Sequential, Functional API

191
Q

The input layer of the Keras Functional API needs to have shape ___, where p is the number of ____ in your training matrix. For example: _____

A

(p,) , columns

inputs=Input(shape=(3,))

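A minimal Functional API sketch showing multiple inputs and a shared layer (assuming TensorFlow 2.x; the layer sizes are arbitrary):

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.models import Model

in_a = Input(shape=(3,))
in_b = Input(shape=(3,))
shared = Dense(8, activation="relu")             # one layer object shared by both inputs
merged = concatenate([shared(in_a), shared(in_b)])
out = Dense(1, activation="sigmoid")(merged)
model = Model(inputs=[in_a, in_b], outputs=out)  # a DAG of layers: two inputs, one output
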
192
Q

The activations in regularization is scaled by which equation?

A

1/(1 - dropout probability)

193
Q

How does regularization help build generalizable models ?

A

By adding dropout layers to our neural networks.

194
Q

what does L2 regularization do?

A

It adds a sum of the squared parameter weights term to the loss function

195
Q

___ regularization will keep the weight values smaller and ___ regularization will make the model sparser by dropping features.

A

L2, L1

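A hedged Keras sketch combining L2, L1, and dropout regularization (assuming TensorFlow 2.x; penalty strengths and layer sizes are illustrative):

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),  # L2: keeps weight values smaller
    layers.Dropout(0.5),      # activations are scaled by 1/(1 - 0.5) at training time
    layers.Dense(32, activation="relu",
                 kernel_regularizer=regularizers.l1(0.01)),  # L1: drives some weights to zero
    layers.Dense(1, activation="sigmoid"),
])
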
196
Q

What is an approximate equivalent of L2 regularization?

A

Early Stopping

197
Q

What is the correct workflow to serve your model in the cloud?

A

create the model -> train and evaluate your model -> save your model -> serve your model

198
Q

To serve our model for others to use, we export the ____ ____ and deploy the model as a _____.

A

model file, service

199
Q

_____ is the directory in which to write the SavedModel

A

(EXPORT_PATH)

200
Q

SavedModel is a universal _____ format for TensorFlow models. SavedModel provides a “language neutral format” to save your machine learning models that is both _____ and _____.

A

serialization, recoverable, hermetic

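A small sketch of exporting and reloading a SavedModel (assuming TensorFlow 2.x; EXPORT_PATH is a hypothetical writable directory):

import tensorflow as tf

EXPORT_PATH = "/tmp/my_model/1"                  # hypothetical export directory

model = tf.keras.Sequential([tf.keras.Input(shape=(3,)), tf.keras.layers.Dense(1)])
tf.saved_model.save(model, EXPORT_PATH)          # writes the SavedModel serialization format
restored = tf.saved_model.load(EXPORT_PATH)
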
201
Q

The Keras Functional API allows you to define what 3 things?

A
  1. multi-input or multi-output models
  2. ad hoc acyclic network graphs
  3. models with shared layers
202
Q

In the Keras Functional API, models are created by specifying their ____ and ____ in a graph of layers. That means that a single graph of layers can be used to generate multiple models.

A

inputs, outputs

203
Q

What is TensorFlow Data Validation?

A

It is a tool that can be used to analyze data to find potential problems in data.

204
Q

How to Input Feature Columns to a Keras Model?

A

We can use a DenseFeatures layer to input them to a Keras model.

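A hedged sketch of building feature columns and feeding them to a Keras model through a DenseFeatures layer (assuming the legacy tf.feature_column API in TensorFlow 2.x; the feature names, boundaries, and bucket counts are made up):

import tensorflow as tf

age = tf.feature_column.numeric_column("age")
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[25, 40, 60])
city = tf.feature_column.categorical_column_with_identity("city_id", num_buckets=100)
city_onehot = tf.feature_column.indicator_column(city)   # one-hot encodes the indexed column

feature_layer = tf.keras.layers.DenseFeatures([age_buckets, city_onehot])
model = tf.keras.Sequential([feature_layer, tf.keras.layers.Dense(1)])
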
205
Q

Three reasons why the Keras Sequential model is not appropriate?

A
  1. Your model has multiple inputs or multiple outputs.
  2. Any of your layers has multiple inputs or multiple outputs.
  3. You need to do layer sharing or non-linear topology
206
Q

The _____ function can be used with linear regression, logistic regression, k-means, matrix factorization, and ARIMA-based time series models. The _____ function evaluates the _____ values against the ____ data, and can be used to evaluate model _____.

A

ML.EVALUATE, ML.EVALUATE, predicted, actual, metrics

207
Q

ML.FEATURE_CROSS generates a ____ feature with all combinations of crossed _____ features except for 1-degree items.

A

STRUCT, categorical

208
Q

ML.BUCKETIZE bucketizes a ____ numerical feature into a _____ feature with bucket names as the value.

A

continuous, string

209
Q

Feature Cross combines features into a _____ feature, and enables a model to learn separate _____ for each combination of features.

A

single, weights

210
Q

_____ is a process by which categorical variables are converted into a form that could be provided to neural networks to do a better job in prediction

A

One hot encoding

211
Q

What to use to encode categorical data that is already indexed?

A

tf.feature_column.categorical_column_with_identity

212
Q

What do you use the tf.feature_column.bucketized_column function for?

A

To discretize floating point values into a smaller number of categorical bins

213
Q

Before being input into an ML model, raw data must be turned into:

A

feature vectors

214
Q

Three characteristics of a good feature

A
  1. related to the objective
  2. known at prediction time
  3. numeric with meaningful magnitude
215
Q

Different problems in the same domain may need _____ _____

A

different features

216
Q

What is the relationship between Apache Beam and Cloud Dataflow?

A

Apache Beam is the API for building data pipelines in Java or Python, and Cloud Dataflow is the implementation and execution framework that runs them

217
Q

TRUE or FALSE: The Filter method can be carried out in parallel and autoscaled by the execution framework:

A

True: Anything in Map or FlatMap can be parallelized by the Beam execution framework

218
Q

What is the purpose of a Cloud Dataflow connector?

.apply(TextIO.write().to(“gs://…”));

A

Connectors allow you to output the results of a pipeline to a specific data sink like Bigtable, Google Cloud Storage, flat file, BigQuery, and more …

219
Q

The stages of a pipeline

A
  1. data source
  2. transformation steps
  3. data sink
220
Q

To run a pipeline you need something called a ________.

A

runner

221
Q

TRUE or FALSE: A ParDo acts on all items at once (like a Map in MapReduce).

A

False. A ParDo acts on one item at a time (like a Map in MapReduce)

222
Q

Three advantages of using a UI tool like Cloud Dataprep?

A
  1. Create transformations in a UI tool instead of writing Java or Python
  2. Can chain steps together as part of a recipe
  3. Supports outputting your data into BigQuery, Google Cloud Storage, or flat files
223
Q

TRUE or FALSE: You can automatically setup pipelines to run at defined intervals with Cloud Dataprep

A

True

224
Q

Different cities in California have markedly different housing prices. What feature crosses could learn city-specific relationships between house characteristic and housing price?

A

One feature cross: [binned latitude X binned longitude X binned roomsPerPerson]

225
Q

You are building a model to predict the number of points (“margin”) by which Team A will beat Team B in a basketball game. Your input features are (1) whether or not it is a home game for Team A, (2) the average number of points Team A scored in its past 7 games, and (3) the average number of points Team B scored in its past 7 games. Which two of the following are linear models suitable for machine learning?

A

1) margin = b + w1*is_home_game + w2*avg_points_A + w3*avg_points_B
2) margin = w1*is_home + w2*(avg_points_A - avg_points_B)^3

226
Q

Feature crosses are more common in modern machine learning because:

A

Feature crosses memorize, and that is okay only if you have extremely large datasets

227
Q

The function tf.feature_column.crossed_column requires:

A

A list of categorical or bucketized features

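A short sketch of a feature cross over two bucketized columns (assuming the legacy tf.feature_column API; the boundaries and hash_bucket_size are illustrative):

import tensorflow as tf

lat = tf.feature_column.numeric_column("latitude")
lon = tf.feature_column.numeric_column("longitude")
lat_b = tf.feature_column.bucketized_column(lat, boundaries=[-45.0, 0.0, 45.0])
lon_b = tf.feature_column.bucketized_column(lon, boundaries=[-90.0, 0.0, 90.0])
lat_x_lon = tf.feature_column.crossed_column([lat_b, lon_b], hash_bucket_size=1000)
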
228
Q

Three reasons you might create an embedding of a feature cross.

A

1) Create a lower-dimensional representation of the input space
2) Identify similar sets of inputs for clustering
3) Reuse weights learned in one problem in another problem

229
Q

During the training and serving phase, tf.Transform:

A

Provides a TensorFlow graph for preprocessing

230
Q

Tensorflow transform is a hybrid of?

A

Apache Beam and TensorFlow

231
Q

The ____ ____ is the most important concept of tf.Transform. The ____ ____ is a logical description of a transformation of the dataset. The ____ ____ accepts and returns a dictionary of tensors, where a tensor means Tensor or 2D SparseTensor.

A

Preprocessing function

232
Q

Three steps in order that are considered a best practice in predictive modeling?

A

Data Cleaning > Feature engineering > Model Building

233
Q

Using indicator variables to isolate key information, highlighting interactions between two or more features, and representing the same feature in a different way are examples of?

A

Feature engineering

234
Q

A good feature typically is _____ and _____.

A

related to the objective, is known at prediction time

235
Q

Two benefits from Regularization?

A
  1. Makes models smaller

2. Limits overfitting (the most important reason)

236
Q

What is the key reason that we want to penalize models for over-complexity?

A

Overly-complex models may not be generalizable to real-world scenarios on unseen data

237
Q

If your learning rate is too small, your loss function will:

A

Converge very slowly

238
Q

If your learning rate is too high, your loss function

A

will converge rapidly, but not reach the lowest error value possible

239
Q

If your batch size is too high, your loss function will

A

converge slowly

240
Q

If your batch size is too low, your loss function will

A

oscillate wildly

241
Q

If searching among a large number of hyperparameters, you should do a systematic grid search rather than start from random values, so that you are not relying on chance. True or False?

A

False

242
Q

It is a good idea to use the training loss itself as the hyperparameter tuning metric. True or False?

A

False: you want to use an eval-metric as your hyperparameter tuning metric so that you are not rewarding models that overfit.

243
Q

Hyperparameter tuning in Cloud ML Engine involves adding the appropriate TensorFlow function call to your model code. True or False?

A

False: Often, it is simply a matter of submitting a training job with an additional configuration setting

244
Q

You are creating a model to predict the outcome (final score difference) of a basketball game between Team A and Team B. Your initial model is a neural network with [64, 32] nodes, learning_rate = 0.05, batch_size = 32. The input features include whether the game was played “at home” for Team A, the fraction of the last 7 games that Team A won, the average number of points scored by Team A in its last 7 games, the average score of Team A’s opponents in its last 7 games, etc.

Which of these are hyperparameters to the model?

A

The number of layers, batch size, number of nodes in each layer, the learning rate, AND the number of previous games that the input features are averaged over (the creation of a feature is a hyperparameter)

245
Q

What does L1 regularization tend to do to a model’s low-predictive features’ parameter weights?

A

Have zero values

246
Q

Which type of regularization is more likely to lead to zero weights?

A

L1

247
Q

Which type of regularization penalizes large weight values more?

A

L2

248
Q

Two reasons why it’s important to add regularization to logistic regression?

A
  1. Helps stop weights being driven to +/- infinity

2. Helps logits stay away from asymptotes which can halt training

249
Q

Three things you should do when performing logistic regression:

A
  1. Adding regularization
  2. Choosing a tuned threshold
  3. Checking for bias
250
Q

You are training your classification model and are using Logistic Regression. Your last layer has no weights that can be _____

A

tuned

251
Q

Why is it important to add non-linear activation functions to neural networks?

A

Stops the layers from collapsing back into just a linear model

252
Q

Neural networks can be arbitrarily complex. To increase hidden dimensions, I can add____. To increase function composition, I can add ____. If I have multiple labels per example, I can add ____.

A

neurons, layers, outputs

253
Q

Four things you can try if your model is experiencing exploding gradients:

A
  1. Lower the learning rate
  2. Add weight regularization
  3. Add Gradient clipping
  4. Add batch normalization
254
Q

Dropout acts as another form of ____. It forces data to flow down ____ paths so that there is a more even spread. It also simulates ____ learning. Don’t forget to scale the dropout activations by the inverse of the _____. We remove dropout during ____.

A

Regularization, multiple, ensemble, keep probability, inference

255
Q

What are three common ways that a neural network training can fail?

A
  1. Gradients can explode if the learning rate is too high
  2. Entire layers can die with all their weights becoming zero
  3. Gradients can vanish, making it harder to train networks the deeper they are
256
Q

If you see a dead layer (fraction of zero weights close to 1), what is a reasonable thing to try?

A

Lower the learning rate

257
Q

I am training a classification neural network with 5 hidden layers, sigmoid activation function, and [128, 64, 32, 16, 8] with learning_rate=0.05 and batch_size=32. I notice from TensorBoard that gradients in the third layer are near-zero. Is this a problem?

A

yes

258
Q

I am training a classification neural network with 5 hidden layers, sigmoid activation function, and [128, 64, 32, 16, 8] with learning_rate=0.05 and batch_size=32. I notice from TensorBoard that gradients in the third layer are near-zero. What would you try to fix this?

A

Try using ReLU activation function

259
Q

For our classification output, if we have both mutually exclusive labels and probabilities, we should use ____. If the labels are mutually exclusive, but the probabilities aren’t, we should use _____. If our labels aren’t mutually exclusive, we should use ____.

A
  1. tf.nn.softmax_cross_entropy_with_logits_v2
  2. tf.nn.sparse_softmax_cross_entropy_with_logits
  3. tf.nn.sigmoid_cross_entropy_with_logits
260
Q

If you have a classification problem with multiple labels, how does the neural network architecture change?

A

Have a logistic layer for each label, and send the outputs of the logistic layer to a softmax layer

261
Q

If you have thousands of classes, computing the cross-entropy loss can be very slow. Which of these is a way to help address that problem?

A

Use a noise-contrastive loss function

262
Q

What is the benefit of using a pre-canned Estimator?

A

It can give us a quick ML model

263
Q

What is the recommended way to create distributed Keras models?

A

Write a Keras model as normal, and use the model_to_estimator function to convert it into an Estimator for train_and_evaluate

264
Q

In the model function for a custom estimator, you can customize four things:

A
  1. the set of evaluation metrics
  2. The loss metric that is optimized
  3. The optimizer that is used
  4. The predictions that are returned (Correct. It is possible, for example, in a classification problem to decide to return an intermediate embedding, the class probability, and the logits. This is possible because predictions is a dictionary)
265
Q

Two reasons for why an RNN (Recurrent Neural Network) is used for machine translation, say translating English to French?

A
  1. It can be trained as a supervised learning problem

2. It is applicable when the input/output is a sequence (e.g., a sequence of words).

266
Q

What does a neuron compute?

A

A neuron computes a linear function (z=Wx + b) followed by an activation function

267
Q

What is the loss function for logistic classification? Why do we use this one?

A

The cross-entropy loss function; it is convex, so there is a single global minimum.

268
Q

Suppose img is a (32,32,3) array, representing a 32x32 image with 3 color channels red, green and blue. How do you reshape this into a column vector?

A

x = img.reshape((32*32*3, 1))

269
Q

a.shape = (2,3)
b.shape = (2,1)
c = a + b

A

c.shape = (2,3)

270
Q

a.shape = (4,3)
b.shape = (3,2)
c = a*b

A

“Error!” the sizes don’t match for an element-wise multiplication

271
Q

Suppose you have n_x input features per example. What is the dimension of X?

A

(n_x, m)

272
Q

a.shape= (12288, 150)
b.shape = (150, 45)
c = np.dot(a,b)

A

c.shape = (12288, 45)

273
Q

a.shape = (3,4)
b.shape = (4,1)
How do you vectorize this?

A

c = a + b.T

274
Q

a.shape = (3,3)
b.shape = (3,1)
c = a*b
What does python do to make this work?

A

This will invoke broadcasting, so b is copied three times to become (3,3), and * is an element-wise product so c.shape will be (3,3)

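A quick NumPy check of the broadcasting behavior described above (the shapes are the ones from these cards):

import numpy as np

a = np.random.randn(3, 3)
b = np.random.randn(3, 1)
c = a * b                      # b is broadcast across columns; element-wise product
assert c.shape == (3, 3)

d = np.random.randn(2, 3) + np.random.randn(2, 1)   # broadcasting also applies to +
assert d.shape == (2, 3)
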
275
Q

Why does the tanh activation function usually work better than sigmoid as an activation function in the hidden layers?

A

The output range is between -1 and 1, which centers the data around zero and makes learning simpler for the next layer

276
Q

You are building a binary classifier for recognizing cucumbers (y=1) vs. watermelons (y=0). Which one of these activation functions would you recommend using for the output layer?

A

sigmoid

277
Q
A = np.random.randn(4,3)
B = np.sum(A, axis = 1, keepdims = True)

what is B.shape?

A

(4,1)

278
Q

What will happen if you build a neural network and you initialize the weights to be zero?

A

Each neuron in the first hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons.

279
Q

You have built a network using the tanh activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*1000. What will happen?

A

This will cause inputs of the tanh to also be very large, thus causing gradients to be close to zero. The optimization algorithm will thus become slow.

280
Q

What is the “cache” used for in our implementation of forward propagation and backward propagation?

A

We use it to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.

281
Q

The ____ layers of a neural network are typically computing more complex features of the input than the ____ layers.

A

deeper, earlier

282
Q

Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l=1, 2, …, L. True/False?

A

False

in a deeper network, we cannot avoid a for loop iteration over the layers

283
Q

Why do we need to know the activation function for backpropagation?

A

To compute the derivative, each activation has a different derivative.

284
Q

Circuit theory: (i) To compute the function using a shallow network circuit, you will need a large network (where we measure size by the number of logic gates in the network), but (ii) To compute it using a deep network circuit, you need only an ____ smaller network.

A

exponentially

285
Q

In general how can we find the dimension of the weight matrix associated with a layer?

A

W[l] has shape (n[l], n[l-1])

286
Q

If you have 10,000,000 examples, how would you split the train/dev/test set?

A

98%, 1% , 1%

287
Q

The dev and test set should:

A

come from the same distribution

288
Q

If your Neural Network model seems to have high bias, what two things could you try?

A
  1. increase the number of units in each hidden layer

2. make the Neural Network deeper

289
Q

You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas and oranges. Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. What two things could you try?

A
  1. Increase the regularization parameter lambda

2. get more training data

290
Q

What is weight decay?

A

A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration

291
Q

What happens when you increase the regularization hyperparameter lambda?

A

Weights are pushed toward becoming smaller

292
Q

With the inverted dropout technique, at test time:

A

you do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training

293
Q

Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause what two things?

A
  1. reducing the regularization effect

2. causing the neural network to end up with a lower training set error

294
Q

Three techniques that can be used to reduce variance when more training data isn’t an option?

A
  1. dropout
  2. data augmentation
  3. L2 regularization
295
Q

Why do we normalize the inputs x?

A

It makes the cost function faster to optimize

296
Q

Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?

A

a[3]{8}(7)

297
Q

Why is the best mini-batch size usually not 1 and not m, but instead something in-between?

A

if it’s 1 then you lose the benefits of vectorization across examples in the mini-batch

if it’s m then you end up with batch gradient descent, which can be very slow for big training sets

298
Q

If you plot the cost with mini-batch what does it look like?

A

It will look like the cost curve for batch gradient descent, but with more oscillation (noisier)

299
Q

What’s bias correction? Is it popular?

A

It’s a correction used with exponentially weighted averages: it adjusts the first few iterations so they are not biased toward zero. In practice it’s not that common.

300
Q

With exponentially weighted averages, what happens if β is too large? Too small? What is a good β value?

A

Too large and the curve shifts to the right (it adapts slowly). Too small and it oscillates a lot. Standard practice uses β = 0.9, which roughly averages over the last 10 iterations.

301
Q

Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function. Which four techniques could help find parameter values that attain a small value for the cost function J?

A
  1. tuning the learning rate
  2. try mini-batch gradient descent
  3. try using Adam
  4. try better random initialization for the weights
302
Q

Why is grid search a bad idea for searching for hyperparameters?

A

Random search lets you try more distinct values. A 5x5 grid search over two hyperparameters gives 25 combinations but only 5 distinct values of each hyperparameter; a random search of 25 points gives 25 distinct values for each hyperparameter.

303
Q

Which hyperparameters are generally the most important?

A

1st - learning rate
2nd - hidden units, β (momentum), mini-batch size
3rd - number of layers, learning rate decay

The parameters for Adam are usually left fixed: β1 = 0.9, β2 = 0.999, ε = 10^(-8)

304
Q

During hyperparameter search, whether you try to babysit one model (“Panda” strategy) or train a lot of models in parallel (“Caviar”) is largely determined by:

A

The amount of computational power you can access

305
Q

If you think β (hyperparameter for momentum) is between 0.9 and 0.99, what is the recommended way to sample a value for beta?

A
import numpy as np
r = np.random.rand()           # r is uniform in [0, 1)
beta = 1 - 10**(-r - 1)        # samples beta in [0.9, 0.99) on a log scale
306
Q

In batch normalization as presented in the videos, if you apply it on the l-th layer of your neural network, what are you normalizing?

A

Z[l], that will go into the activation function

307
Q

In the normalization formula z_norm^(i) = (z^(i) - μ) / sqrt(σ^2 + ε), why do we use ε?

A

to avoid division by zero

308
Q

What do gamma and beta do in Batch Norm, and how can we find them?

A

They set the mean and variance of the linear variable z[l] of a given layer, and they can be learned using Adam, gradient descent with momentum, RMSprop, or gradient descent.

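A plain NumPy sketch of the batch-norm computation these cards describe (ε avoids division by zero; γ and β are the learnable scale and shift; the shapes are illustrative):

import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                       # per-unit mean over the mini-batch
    var = z.var(axis=0)                       # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)    # eps avoids division by zero
    return gamma * z_norm + beta              # gamma/beta set the new variance and mean

z = np.random.randn(32, 4)                    # a mini-batch of 32 examples, 4 hidden units
out = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))
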
309
Q

After training a neural network with Batch Norm, at test time, to evaluate the neural network on a new example you should:

A

Perform the needed normalization using an exponentially weighted average across mini-batches seen during training.

310
Q

Suppose your input is a 300 by 300 color (RGB) image, and you are not using a convolutional network. If the first hidden layer has 100 neurons, each one fully connected to the input, how many parameters does this hidden layer have (including the bias parameters)?

A

27,000,100 = 300*300*3*100 + 100

311
Q

Suppose your input is a 300 by 300 color (RGB) image, and you use a convolutional layer with 100 filters that are each 5x5. How many parameters does this hidden layer have (including the bias parameters)?

A

7,600 = 5*5*3*100 + 100

312
Q

You have an input volume that is 63x63x16, and convolve it with 32 filters that are each 7x7, using a stride of 2 and no padding. What is the output volume?

A

29x29x32

dimension=((n+2p-f)/s)+1

((63+2*0-7)/2)+1 = 29

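The dimension formula above can be wrapped in a tiny helper to check these answers (a sketch; the floor division matches the usual convention when the stride does not divide evenly):

def conv_output_size(n, f, p, s):
    # Output height/width for an n x n input, f x f filter, padding p, stride s.
    return (n + 2 * p - f) // s + 1

assert conv_output_size(63, 7, 0, 2) == 29    # stride 2, no padding
assert conv_output_size(63, 7, 3, 1) == 63    # "same" convolution with padding 3
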
313
Q

You have an input volume that is 15x15x8, and pad it using “pad=2.” What is the dimension of the resulting volume (after padding)?

A

19x19x8

314
Q

You have an input volume that is 63x63x16, and convolve it with 32 filters that are each 7x7, and stride of 1. You want to use a “same” convolution. What is the padding?

A

3

dimension=((n+2p-f)/s)+1

((63+2p-7)/1)+1=63 -> p=3

315
Q

You have an input volume that is 32x32x16, and apply max pooling with a stride of 2 and a filter size of 2. What is the output volume?

A

16X16X16

divide width and height by 2 or ((n+2p-f)/s)+1 works as well

316
Q

True or False. Because pooling layers do not have parameters, they do not affect the backpropagation (derivatives) calculation.

A

False

317
Q

Two reasons why ‘parameter’ sharing is a benefit for using convolutional networks?

A
  1. It reduces the total number of parameters, thus reducing overfitting
  2. It allows a feature detector to be used in multiple locations throughout the whole input/image/input volume
318
Q

In lecture we talked about “sparsity of connections” as a benefit of using convolutional layers. What does this mean?

A

Each activation in the next layer depends on only a small number of activations from the previous layer

319
Q

How do dimensions typically change in a ConvNet?

A

nH and nW decrease, while nC increases

320
Q

Typical structure of a ConvNet?

A

Multiple CONV layers followed by a POOL layer repeated a few times, and FC layers in the last few layers

321
Q

Suppose you have an input volume of dimension 64x64x16. How many parameters would a single 1x1 convolutional filter have (including the bias)?

A

17

322
Q

Suppose you have an input volume of dimension nH x nW x nC. How can you change the dimensions? (Assume that “1x1 convolutional layer” below always uses a stride of 1 and no padding.)

A

You can use a 1x1 convolutional layer to reduce nC but not nH, nW.

You can use a pooling layer to reduce nH, nW, but not nC.

323
Q

What can a 1x1 convolution do for Inception Networks?

A

They can reduce the input data volume’s size before applying 3x3 and 5x5 convolutions

324
Q

What is an inception block?

A

A single inception block allows the network to use a combination of 1x1, 3x3, 5x5 convolutions and pooling

325
Q

What are two reasons for using open-source implementations of ConvNets (both the model and/or weights)?

A
  1. It is a convenient way to get a working implementation of a complex ConvNet architecture.
  2. Parameters trained for one computer vision task are often useful as pretraining for other computer vision tasks.
326
Q

You are working on a factory automation task. Your system will see a can of soft-drink coming down a conveyor belt, and you want it to take a picture and decide whether (i) there is a soft-drink can in the image, and if so (ii) its bounding box. Since the soft-drink can is round, the bounding box is always square, and the soft drink can always appears as the same size in the image. There is at most one soft drink can in each image. Here’re some typical images in your training set: What is the most appropriate set of output units for your neural network?

A

Logistic unit, bx, by

327
Q

If you build a neural network that inputs a picture of a person’s face and outputs N landmarks on the face (assume the input image always contains exactly one face), how many output units will the network have?

A

2N

328
Q

When training one of the object detection systems described in lecture, you need a training set that contains many pictures of the object(s) you wish to detect. However, bounding boxes do not need to be provided in the training set, since the algorithm can learn to detect the objects by itself.

A

False

329
Q

Suppose you are applying a sliding windows classifier (non-convolutional implementation). Increasing the stride would tend to increase accuracy, but decrease computational cost.

A

False

330
Q

In the YOLO algorithm, at training time, only one cell —the one containing the center/midpoint of an object— is responsible for detecting this object.

A

True

331
Q

What is the IoU between these two boxes? The upper-left box is 2x2, and the lower-right box is 2x3. The overlapping region is 1x1

A

1/9

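A one-line check of that IoU computation (areas in grid cells; a sketch):

def iou(area_a, area_b, overlap_area):
    return overlap_area / (area_a + area_b - overlap_area)

assert abs(iou(2 * 2, 2 * 3, 1 * 1) - 1 / 9) < 1e-12   # union = 4 + 6 - 1 = 9
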
332
Q

Suppose you are using YOLO on a 19x19 grid, on a detection problem with 20 classes, and with 5 anchor boxes. During training, for each image you will need to construct an output volume y as the target value for the neural network; this corresponds to the last layer of the neural network. (y may include some “?”, or “don’t cares”). What is the dimension of this output volume?

A

19x19x(5x25)

333
Q

Face _____ requires comparing a new picture against one person’s face, whereas face _____ requires comparing a new picture against K persons’ faces.

A

verification, recognition

334
Q

Why do we learn a function d(img1,img2) for face verification?

A

This allows us to learn to recognize a new person given just a single image of that person. We need to solve a one-shot learning problem

335
Q

In order to train the parameters of a face recognition system, it would be reasonable to use a training set comprising 100,000 pictures of 100,000 different persons.

A

False

having about 10 photos of each person would work well

336
Q

Which of the following is a correct definition of the triplet loss? Consider that α>0. (We encourage you to figure out the answer from first principles, rather than just refer to the lecture.)

A

max(||f(A)−f(P)||^2 − ||f(A)−f(N)||^2 + α, 0)

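A hedged TensorFlow sketch of that triplet loss, averaged over a batch (assuming TensorFlow 2.x; f_a, f_p, f_n are embedding tensors of shape (batch, d)):

import tensorflow as tf

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)
    pos = tf.reduce_sum(tf.square(f_a - f_p), axis=-1)
    neg = tf.reduce_sum(tf.square(f_a - f_n), axis=-1)
    return tf.reduce_mean(tf.maximum(pos - neg + alpha, 0.0))
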
337
Q

In a siamese network architecture, the two neural networks have ______ input images, but have exactly the ____ parameters.

A

different, same

338
Q

You train a ConvNet on a dataset with 100 different classes. You wonder if you can find a hidden unit which responds strongly to pictures of cats. (I.e., a neuron so that, of all the input/training images that strongly activate that neuron, the majority are cat pictures.) You are more likely to find this unit in layer 4 of the network than in layer 1

A

True

339
Q

In the deeper layers of a ConvNet, each channel corresponds to a different feature detector. The style matrix G[l] measures the degree to which the activations of different feature detectors in layer l vary (or correlate) together with each other.

A

True

340
Q

In neural style transfer, what is updated in each iteration of the optimization algorithm?

A

The pixel values of the generated image G

341
Q

You are working with 3D data. You are building a network layer whose input volume has size 32x32x32x16 (this volume has 16 channels), and applies convolutions with 32 filters of dimension 3x3x3 (no padding, stride 1). What is the resulting output volume?

A

30x30x30x32

342
Q

Two examples for when we would use a many-to-one RNN architecture?

A
  1. Sentiment classification from a text [0=negative, 1=positive]
  2. Gender recognition from speech [0=male, 1=female]
343
Q

How many inputs does a<t> have in an RNN?

A

two: a<t-1> and x<t>

344
Q

You are training an RNN, and find that your weights and activations are all taking on the value of NaN (“Not a Number”). Which of these is the most likely cause of this problem?

A

Exploding gradient problem

345
Q

Suppose you are training a LSTM. You have a 10000 word vocabulary, and are using an LSTM with 100-dimensional activations a. What is the dimension of Γu at each time step?

A

100

346
Q

You have a pet dog whose mood is heavily dependent on the current and past few days’ weather. You’ve collected data for the past 365 days on the weather, which you represent as a sequence as x<1>,…,x<365>. You’ve also collected data on your dog’s mood, which you represent as y<1>,…,y<365>. You’d like to build a model to map from x→y. Should you use a Unidirectional RNN or Bidirectional RNN for this problem?

A

Unidirectional RNN, because the value of y<t> depends only on x<1>, …, x<t>, but not on x<t+1>, …, x<365>

347
Q

What is t-SNE?

A

A non-linear dimensionality reduction technique. It can be used to view the relations in Word Embedding Matrix

348
Q

What equations could you expect to make from ‘boy’, ‘girl’, ‘brother’ and ‘sister’ if your word embedding is good?

A

boy - girl =~ brother - sister

boy - brother =~ girl - sister

349
Q

When is it okay to use transfer learning for NLP?

A

When your training set is smaller than the training set used to create the word embedding

350
Q

Which other file type is an example of a file type that uses JavaScript Object Notation (JSON) formatting?

A

Jupyter/IPython notebooks (.ipynb files)

351
Q

Which residual-based approach to identifying outliers compares running a model with all data to running the same model, but dropping a single observation?

A

externally studentized residuals

352
Q

equation to standardize the data?

A

(x-mean)/std

353
Q

Equation to min-max standardize the data?

A

(x-min)/(max-min)

354
Q

What’s the robust scaler? And why use it?

A

It’s like min-max standardization but uses only the interquartile range, which makes it less vulnerable to outliers.

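The three scalings from these cards, sketched with scikit-learn (assuming sklearn is available; the toy data is made up and includes one outlier):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])     # 100 is an outlier

z = StandardScaler().fit_transform(X)            # (x - mean) / std
mm = MinMaxScaler().fit_transform(X)             # (x - min) / (max - min)
rb = RobustScaler().fit_transform(X)             # (x - median) / IQR, less sensitive to the outlier
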
355
Q

(True/False) In general, the population parameters are unknown

A

True

356
Q

(True/False) Parametric models have a finite number of parameters

A

True

357
Q

The most common way of estimating parameters in a parametric model is:

A

using the maximum likelihood estimation

358
Q

A p-value is:

A

the smallest significance level at which the null hypothesis would be rejected

359
Q

Type 1 Error is defined as:

A

Saying the null hypothesis is false, when it is actually true

360
Q

Type 2 error is defined as:

A

Saying the null hypothesis is true, when it is actually false

361
Q

(True/False) If you reject the null hypothesis, it means that the alternate hypothesis is true.

A

False

362
Q

In K-fold cross-validation, how will increasing k affect the variance (across subsamples) of estimated model parameters?

A

increasing k will usually increase the variance of the estimated parameters

363
Q

What does Bagging stand for?

A

bootstrap aggregating

364
Q

What is the main condition to use stacking as ensemble method?

A

Models need to output predicted probabilities

365
Q

This tree ensemble method only uses a subset of the features for each tree:

A

Random Forest

366
Q

This is an ensemble model that does not use bootstrapped samples to fit the base trees, takes residuals into account, and fits the base trees iteratively:

A

Boosting

367
Q

When clustering with KMeans, what’s the difference between inertia and distortion?

A

inertia - you want a similar number of observations in each cluster
distortion - you want the observations in each cluster to be very similar

368
Q

When using DBSCAN, how does the algorithm determine that a cluster is complete and is time to move to a different point of the data set and potentially start a new cluster?

A

When no point is left unvisited by the chain reaction

369
Q

What are the advantages of DBSCAN?

A

You don’t need to specify the number of clusters, it allows for noise, and it can handle arbitrary shapes

370
Q

What are the disadvantages of DBSCAN?

A

Computationally expensive, hard to choose parameters, and clusters should have similar density

371
Q

How to find the best number of clusters with the K-means algorithm?

A

Use the elbow method

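A hedged sketch of the elbow method with scikit-learn KMeans (random toy data; in practice you would plot inertia against k):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
# Plot inertias vs. k and pick the k at the "elbow", where the improvement levels off.
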
372
Q

What are the key hyperparameters for the Hierarchical Clustering (Ward) algorithm?

A

distance metric and linkage

373
Q

How do you choose the number of clusters when you use the Mean Shift algorithm?

A

The algorithm chooses it for us

374
Q

How do we define the core points when we use the DBSCAN algorithm?

A

A point that has more than n_clu neighbors in its ε-neighborhood

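A short scikit-learn DBSCAN sketch (toy data; eps is the ε-neighborhood radius and min_samples plays the role of the n_clu threshold above):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(300, 2)
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
n_noise = int((labels == -1).sum())              # label -1 marks noise points
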
375
Q

What are the L1 and L2 distances?

A
L1 = Manhattan
L2 = Euclidean
376
Q

When might the Manhattan distance be better?

A

When the data is very high dimensional

377
Q

What is Cosine distance and when do we use it?

A

It measures the angle from the origin. It can be good for text data where the location of occurrence is less important

378
Q

Which distance metric is useful when we have text documents and we want to group similar topics together?

A

Jaccard

379
Q

For data with many features, principal components analysis …

A

generates new features that are linear combinations of other features

380
Q

What is the main difference between kernel PCA and linear PCA?

A

Kernel PCA tends to preserve the geometric distances between the points while reducing the dimensionality of the space

381
Q

Multi-Dimensional Scaling can be useful to do what?

A

To visualize the data

382
Q

When we use the DBSCAN algorithm, how do we know that our cluster is complete and is time to move to a different point of the data set and potentially start a new cluster?

A

When no point is left unvisited by the chain reaction

383
Q

What correctly defines the strengths of the DBSCAN algorithm?

A

No need to specify the number of clusters, allows for noise, and can handle arbitrary-shaped clusters

384
Q

Which statements correctly define the weaknesses of the DBSCAN algorithm?

A

It needs two parameters as inputs, finding appropriate values can be difficult, and does not do well with clusters of different density

385
Q

How can you have clear separations of clusters while using HAC?

A

Use Single Linkage

386
Q

Which linkage refers to maximum pairwise distance between clusters in HAC?

A

Complete linkage

387
Q

Which of the following linkage methods computes the inertia and picks the pair that will ultimately minimize the inertia value in HAC?

A

Ward Linkage

388
Q

This is the type of decomposition model that is used if the magnitudes of the seasonal and residual values fluctuate with trend:

A

Multiplicative Decomposition Model

389
Q

This decomposition model assumes that the seasonal and residual magnitudes are independent of trend.

A

Additive Decomposition Model

390
Q

Which of the following smoothing techniques is appropriate for data with a trend but no seasonality?

A

Double Exponential Smoothing

391
Q

Which of the following smoothing techniques is appropriate for data with both trend and seasonality?

A

Triple Exponential Smoothing

392
Q

How is SARIMA different from ARIMA?

A

S= seasonality

393
Q

What is a characteristic of an autoregressive (AR) model?

A

A fixed number of past values of the series are used to predict future values.

394
Q

What is a characteristic of a moving average (MA) model?

A

A fixed number of past forecast errors are used to predict future values.

395
Q

An ARIMA model without differencing (I=0) is equivalent to which of the following approaches?

A

The sum of AR and MA model.

396
Q

This plot summarizes the 2-way correlation between a variable and its past values:

A

Autocorrelation plot

397
Q

Two major ways that data can be collected:

A

cross-sectional and longitudinal (time series)

398
Q

three important pillars of the mathematical sciences

A

Function approximation, Optimization, Probability and Statistics.