Data Science Flashcards
Two most common supervised tasks?
Classification and Regression
Four common unsupervised tasks?
Clustering, visualization, dimensionality reduction, association rule learning
What model is used to train a robot to walk on various unknown terrains?
Reinforcement Learning
Is spam detection a supervised or unsupervised learning problem?
Supervised, you feed the model many emails that are labeled spam or not spam
What is an online learning system?
A learning system that can learn incrementally. Capable of adapting rapidly to changing data and autonomous systems, and of training on very large quantities of data
What is out-of-core learning?
Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer’s main memory. Chops the data into mini-batches and uses online learning techniques.
What type of learning algorithm relies on a similarity measure to make predictions?
An instance-based learning system learns the training data by heart; then, when given new instances, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
Difference between a model parameter and a learning algorithm's hyperparameter?
A model parameter determines what the model will predict given a new instance (e.g., the slope of a linear model); a hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the maximum depth of a decision tree).
1. What do model-based learning algorithms search for? 2. What is the most common strategy they use to succeed? 3. How do they make predictions?
1. They search for an optimal value for the model parameters such that the model will generalize well to new instances. 2. Usually by minimizing a cost function. 3. Feed new instances into the model.
Five main challenges to ML?
1. Lack of data 2. Poor data quality 3. Nonrepresentative data 4. Uninformative features 5. Overfitting or underfitting
Four solutions to overfitting?
1. Get more data 2. Simplify the model 3. Reduce the noise in the data 4. Use a smaller learning rate
What is a test set?
A set of data used to estimate the generalization error that the model will make on new instances, before the model is launched in production
The purpose of a validation set?
To compare models and tune the hyperparameters
What is a train-dev set?
Used when there is a risk of mismatch between the training data and the data used for validation
Which Linear Regression training algorithm can you use if you have a training set with millions of features?
You can’t use SVD or the Normal Equation because the computational complexity grows quickly with the number of features. Use Stochastic Gradient Descent or Mini-batch Gradient Descent. If memory allows, you can also use Batch Gradient Descent.
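A minimal sketch of that idea, assuming scikit-learn and a synthetic wide dataset (the names and values here are illustrative, not from the card):
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Synthetic dataset with many features, where the Normal Equation / SVD would be costly.
X, y = make_regression(n_samples=1000, n_features=500, noise=0.1, random_state=42)

# Stochastic Gradient Descent scales roughly linearly with the number of features.
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X, y)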
If your training set has very different scales which algorithms might suffer? What can you do about this?
The cost function will have the shape of an elongated bowl, so the Gradient Descent algorithms will take a long time to converge. (Normal Equation or SVD approach will work fine). To solve this you should scale the data first. Moreover, regularized models may converge to a suboptimal solution if the features are not scaled.
Can Gradient Descent get stuck in a local minimum when training a Logistic Regression Model?
Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model because the cost function is convex.
Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on?
If the validation error consistently goes up after every epoch, then one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.
Do all Gradient Descent algorithms lead to the same model, provided you let them run long enough?
If the optimization problem is convex (such as Linear Regression or Logistic Regression), and assuming the learning rate is not too high, then all Gradient Descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge; instead they will keep jumping back and forth around the global optimum. This means that even if you let them run for a very long time, these Gradient Descent algorithms will produce slightly different models.
Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?
Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch Gradient Descent is guaranteed to make progress at every single training iteration. So if you immediately stop training when the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to
save the model at regular intervals; then, when it has not improved for a long time (meaning it will probably never beat the record), you can revert to the best
saved model.
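A rough sketch of the "save at intervals and revert to the best model" idea, assuming scikit-learn's SGDRegressor with warm_start (data and hyperparameters are made up for illustration):
from copy import deepcopy
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=5.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# warm_start=True with max_iter=1 means each call to fit() runs one more epoch,
# continuing from the current weights instead of restarting.
sgd_reg = SGDRegressor(max_iter=1, warm_start=True, tol=None,
                       learning_rate="constant", eta0=0.0005, random_state=42)

best_error, best_model = float("inf"), None
for epoch in range(500):
    sgd_reg.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < best_error:
        best_error = val_error
        best_model = deepcopy(sgd_reg)   # snapshot of the best model so far
# Later: revert to best_model instead of the last (possibly worse) model.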
Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?
Stochastic Gradient Descent has the fastest training iteration since it considers only one training instance at a time, so it is generally the first to reach the vicinity
of the global optimum (or Mini-batch GD with a very small mini-batch size). However, only Batch Gradient Descent will actually converge, given enough
training time. As mentioned, Stochastic GD and Mini-batch GD will bounce around the optimum, unless you gradually reduce the learning rate.
Suppose you are using Polynomial Regression. You plot the learning curves and
you notice that there is a large gap between the training error and the validation
error. What is happening? What are three ways to solve this?
If the validation error is much higher than the training error, this is likely because your model is overfitting the training set. One way to try to fix this is to reduce
the polynomial degree: a model with fewer degrees of freedom is less likely to overfit. Another thing you can try is to regularize the model—for example, by adding an ℓ2 penalty (Ridge) or an ℓ1 penalty (Lasso) to the cost function. This will also reduce the degrees of freedom of the model. Lastly, you can try to increase the size of the training set.
Suppose you are using Ridge Regression and you notice that the training error
and the validation error are almost equal and fairly high. Would you say that the
model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?
If both the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has a high
bias. You should try reducing the regularization hyperparameter α.
Why would you want to use:
a. Ridge Regression instead of plain Linear Regression (i.e., without any regularization)?
A model with some regularization typically performs better than a model without any regularization, so you should generally prefer Ridge Regression over plain Linear Regression.
Why would you want to use:
b. Lasso instead of Ridge Regression?
Lasso Regression uses an ℓ1 penalty, which tends to push the weights down to exactly zero. This leads to sparse models, where all weights are zero except for
the most important weights. This is a way to perform feature selection automatically, which is good if you suspect that only a few features actually matter. When you are not sure, you should prefer Ridge Regression.
Why would you want to use: c. Elastic Net instead of Lasso?
Elastic Net is generally preferred over Lasso since Lasso may behave erratically in some cases (when several features are strongly correlated or when there are more features than training instances). However, it does add an extra hyperparameter to tune. If you want Lasso without the erratic behavior, you can just
use Elastic Net with an l1_ratio close to 1.
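For instance, in scikit-learn (a hedged sketch; the alpha and l1_ratio values are arbitrary):
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=20, random_state=42)

# l1_ratio close to 1 makes Elastic Net behave almost like Lasso,
# while avoiding Lasso's erratic behavior with strongly correlated features.
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.9)
elastic_net.fit(X, y)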
Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime.
Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?
If you want to classify pictures as outdoor/indoor and daytime/nighttime, since these are not exclusive classes (i.e., all four combinations are possible) you should train two Logistic Regression classifiers.
What is the fundamental idea behind Support Vector Machines?
The fundamental idea behind Support Vector Machines is to fit the widest possible “street” between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets.
What is a support vector?
After training an SVM, a support vector is any instance located on the “street” (see the previous answer), including its border. The decision boundary is entirely
determined by the support vectors. Any instance that is not a support vector (i.e., is off the street) has no influence whatsoever; you could remove them, add more instances, or move them around, and as long as they stay off the street they won’t affect the decision boundary. Computing the predictions only involves the support vectors, not the whole training set.
Why is it important to scale the inputs when using SVMs?
SVMs try to fit the largest possible “street” between the classes, so if the training set is not scaled, the SVM will tend to neglect features with small scales.
Can an SVM classifier output a confidence score when it classifies an instance?
What about a probability?
An SVM classifier can output the distance between the test instance and the decision boundary, and you can use this as a confidence score. However, this score cannot be directly converted into an estimation of the class probability. If you set probability=True when creating an SVM in Scikit-Learn, then after training it will calibrate the probabilities using Logistic Regression on the SVM’s scores (trained by an additional five-fold cross-validation on the training data). This will add the predict_proba() and predict_log_proba() methods to the SVM.
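A small Scikit-Learn sketch of both outputs (the toy data is assumed, not from the card):
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=42)
svm_clf = SVC(probability=True, random_state=42)  # enables cross-validated probability calibration
svm_clf.fit(X, y)

scores = svm_clf.decision_function(X[:3])  # signed distance to the decision boundary (confidence)
probas = svm_clf.predict_proba(X[:3])      # calibrated class probability estimates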
Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?
This question applies only to linear SVMs, since kernelized SVMs can only use the dual form. The computational complexity of the primal form of the SVM problem is proportional to the number of training instances m, while the computational complexity of the dual form is proportional to a number between m² and m³. So if there are millions of instances, you should definitely use the primal form, because the dual form will be much too slow.
Say you’ve trained an SVM classifier with an RBF kernel, but it seems to underfit the training set. Should you increase or decrease γ (gamma)? What about C?
If an SVM classifier trained with an RBF kernel underfits the training set, there might be too much regularization. To decrease it, you need to increase gamma or C (or both).
What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with one million instances?
The depth of a well-balanced binary tree containing m leaves is equal to log₂(m), rounded up. A binary Decision Tree (one that makes only binary decisions, as is the case of all trees in Scikit-Learn) will end up more or less well balanced at the end of training, with one leaf per training instance if it is trained without restrictions. Thus, if the training set contains one million instances, the Decision Tree will have a depth of log₂(10⁶) ≈ 20 (actually a bit more, since the tree will generally not be perfectly well balanced).
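A quick check of that figure in Python (just arithmetic):
import math

m = 10**6
print(math.ceil(math.log2(m)))  # 20 — approximate depth of a balanced tree with one million leaves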
Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?
A node’s Gini impurity is generally lower than its parent’s. This is due to the CART training algorithm’s cost function, which splits each node in a way that minimizes the weighted sum of its children’s Gini impurities. However, it is possible for a child node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child’s impurity. For example, consider a node containing four instances of class A and one of class B. Its Gini impurity is 1 − (1/5)² − (4/5)² = 0.32. Now suppose the dataset is one-dimensional and the instances are lined up in the following order: A, B, A, A, A. You can verify that the algorithm will split this node after the second instance, producing one child node with instances A, B, and the other child node with instances A, A, A. The first child node’s Gini impurity is 1 − (1/2)² − (1/2)² = 0.5, which is higher than its parent’s. This is compensated for by the fact that the other node is pure, so the overall weighted Gini impurity is 2/5 × 0.5 + 3/5 × 0 = 0.2, which is lower than the parent’s Gini impurity.
If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?
If a Decision Tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it.
If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?
Decision Trees don’t care whether or not the training data is scaled or centered; that’s one of the nice things about them. So if a Decision Tree underfits the training set, scaling the input features will just be a waste of time.
If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?
The computational complexity of training a Decision Tree is O(n × m log(m)). So if you multiply the training set size by 10, the training time will be multiplied by
K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). If m =10^6, then K ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.
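The ratio can be computed directly (the base of the logarithm cancels out):
import math

m = 10**6
K = 10 * math.log(10 * m) / math.log(m)
print(K)  # ≈ 11.67 — so roughly 11.7 hours if 1 million instances took 1 hour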
If your training set contains 100,000 instances for Decision Tree Classifier, will setting presort=True speed up training?
Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. If it contains 100,000 instances, setting presort=True
will considerably slow down training.
If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?
If you have trained five different models and they all achieve 95% precision, you can try combining them into a voting ensemble, which will often give you even
better results. It works better if the models are very different (e.g., an SVM classifier, a Decision Tree
Classifier, a Logistic Regression classifier, and so on). It is even better if they are trained on different training instances (that’s the whole point of bagging and pasting ensembles), but if not this will still be effective as long as the models are very different.
What is the difference between hard and soft voting classifiers?
A hard voting classifier just counts the votes of each classifier in the ensemble and picks the class that gets the most votes. A soft voting classifier computes the average estimated class probability for each class and picks the class with the highest probability. This gives high-confidence votes more weight and often performs better, but it works only if every classifier is able to estimate class probabilities (e.g., for the SVM classifiers in Scikit-Learn you must set probability=True).
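A hedged Scikit-Learn sketch contrasting the two voting modes (the estimators and data are chosen arbitrarily):
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("svc", SVC(probability=True)),   # needed so the SVM can estimate class probabilities
        ("tree", DecisionTreeClassifier()),
    ],
    voting="soft",  # "hard" = majority vote; "soft" = average of predicted probabilities
)
voting_clf.fit(X, y)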
Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, Random Forests, or stacking ensembles?
It is quite possible to speed up training of a bagging ensemble by distributing it across multiple servers, since each predictor in the ensemble is independent of the others. The same goes for pasting ensembles and Random Forests, for the same reason. However, each predictor in a boosting ensemble is built based on the previous predictor, so training is necessarily sequential, and you will not gain anything by distributing training across multiple servers. Regarding stacking ensembles, all the predictors in a given layer are independent of each other, so they can be trained in parallel on multiple servers. However, the predictors in one layer can only be trained after the predictors in the previous layer have all been trained.
What is the benefit of out-of-bag evaluation?
With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated using instances that it was not trained on (they were held out). This makes it possible to have a fairly unbiased evaluation of the ensemble without the need for an additional validation set. Thus, you have more instances available for training, and your ensemble can perform slightly better.
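For example, with Scikit-Learn's BaggingClassifier (a sketch on synthetic data):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                            bootstrap=True, oob_score=True, random_state=42)
bag_clf.fit(X, y)
print(bag_clf.oob_score_)  # out-of-bag accuracy: a validation estimate with no held-out set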
What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?
When you are growing a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. This is true as well for ExtraTrees, but they go one step further: rather than searching for the best possible thresholds, like regular Decision Trees do, they use random thresholds for each feature. This extra randomness acts like a form of regularization: if a Random Forest overfits the training data, Extra-Trees might perform better. Moreover,
since Extra-Trees don’t search for the best possible thresholds, they are much faster to train than Random Forests. However, they are neither faster nor slower
than Random Forests when making predictions.
If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how?
If your AdaBoost ensemble underfits the training data, you can try increasing the number of estimators or reducing the regularization hyperparameters of the base estimator. You may also try slightly increasing the learning rate.
If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?
If your Gradient Boosting ensemble overfits the training set, you should try decreasing the learning rate. You could also use early stopping to find the right
number of predictors (you probably have too many).
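One hedged way to find the right number of predictors with Scikit-Learn, using staged_predict on a validation set (illustrative data and hyperparameters):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=300, learning_rate=0.05,
                                 random_state=42)
gbrt.fit(X_train, y_train)

# Validation error after each boosting stage; the minimum suggests how many trees are enough.
errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
best_n_estimators = int(np.argmin(errors)) + 1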
What are the main motivations for reducing a dataset’s dimensionality?
The main motivations for dimensionality reduction are:
• To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform
better)
• To visualize the data and gain insights on the most important features
• To save space (compression)
What are the main drawbacks for reducing a dataset’s dimensionality?
The main drawbacks are:
• Some information is lost, possibly degrading the performance of subsequent training algorithms.
• It can be computationally intensive.
• It adds some complexity to your Machine Learning pipelines.
• Transformed features are often hard to interpret.
What is the curse of dimensionality?
The curse of dimensionality refers to the fact that many problems that do not exist in low-dimensional space arise in high-dimensional space. In Machine
Learning, one common manifestation is the fact that randomly sampled high-dimensional vectors are generally very sparse, increasing the risk of overfitting and making it very difficult to identify patterns in the data without having plenty of training data.
Once a dataset’s dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?
Once a dataset’s dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation,
because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other
algorithms (such as t-SNE) do not.
Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?
PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions—as in a Swiss roll dataset—then
reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.
Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?
That’s a trick question: it depends on the dataset. Let’s look at two extreme examples. First, suppose the dataset is composed of points that are almost perfectly
aligned. In this case, PCA can reduce the dataset down to just one dimension while still preserving 95% of the variance. Now imagine that the dataset is composed of perfectly random points, scattered all around the 1,000 dimensions. In this case roughly 950 dimensions are required to preserve 95% of the variance. So
the answer is, it depends on the dataset, and it could be any number between 1 and 950. Plotting the explained variance as a function of the number of dimensions is one way to get a rough idea of the dataset’s intrinsic dimensionality.
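In Scikit-Learn you can let PCA pick the number of dimensions for you (a sketch with stand-in random data):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 50)     # stand-in dataset
pca = PCA(n_components=0.95)     # keep just enough components to preserve 95% of the variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_)         # how many dimensions were actually kept — it depends on the data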
In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?
Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don’t fit in memory, but it is slower
than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, when you need to apply
PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Kernel PCA is useful for nonlinear datasets.
How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One
way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation. Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm
(e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much
information, then the algorithm should perform just as well as when using the original dataset.
Does it make any sense to chain two different dimensionality reduction algorithms?
It can absolutely make sense to chain two different dimensionality reduction algorithms. A common example is using PCA to quickly get rid of a large number of useless dimensions, then applying another much slower dimensionality reduction algorithm, such as LLE. This two-step approach will likely yield the same performance as using LLE only, but in a fraction of the time.
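A hedged sketch of that two-step chain using a Scikit-Learn Pipeline (the data and parameters are placeholders):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.pipeline import Pipeline

X = np.random.rand(500, 100)  # stand-in high-dimensional dataset

pca_lle = Pipeline([
    ("pca", PCA(n_components=0.95)),                                  # fast: drop useless dimensions
    ("lle", LocallyLinearEmbedding(n_components=2, n_neighbors=10)),  # slower, nonlinear step
])
X_2d = pca_lle.fit_transform(X)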
How would you define clustering? Can you name a few clustering algorithms?
In Machine Learning, clustering is the unsupervised task of grouping similar instances together. The notion of similarity depends on the task at hand: for example, in some cases two nearby instances will be considered similar, while in others similar instances may be far apart as long as they belong to the same densely packed group. Popular clustering algorithms include K-Means, DBSCAN, agglomerative clustering, BIRCH, Mean-Shift, affinity propagation, and spectral clustering.
What are some of the main applications of clustering algorithms?
The main applications of clustering algorithms include data analysis, customer segmentation, recommender systems, search engines, image segmentation, semisupervised learning, dimensionality reduction, anomaly detection, and novelty detection.
Describe two techniques to select the right number of clusters when using K-Means.
The elbow rule is a simple technique to select the number of clusters when using K-Means: just plot the inertia (the mean squared distance from each instance to its nearest centroid) as a function of the number of clusters, and find the point in the curve where the inertia stops dropping fast (the “elbow”). This is generally close to the optimal number of clusters. Another approach is to plot the silhouette score as a function of the number of clusters. There will often be a peak, and the optimal number of clusters is generally nearby. The silhouette score is the mean silhouette coefficient over all instances. This coefficient varies from +1 for instances that are well inside their cluster and far from other clusters, to –1 for instances that are very close to another cluster. You may also plot the silhouette diagrams and perform a more thorough analysis.
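Both techniques in a short Scikit-Learn sketch (synthetic blobs, arbitrary range of k):
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

inertias, silhouettes = [], []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                          # plot these and look for the "elbow"
    silhouettes.append(silhouette_score(X, km.labels_))   # look for the peak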
What is label propagation? Why would you implement it, and how?
Labeling a dataset is costly and time-consuming. Therefore, it is common to have plenty of unlabeled instances, but few labeled instances. Label propagation is a technique that consists in copying some (or all) of the labels from the labeled
instances to similar unlabeled instances. This can greatly extend the number of labeled instances, and thereby allow a supervised algorithm to reach better performance (this is a form of semi-supervised learning). One approach is to use a clustering algorithm such as K-Means on all the instances, then for each cluster find the most common label or the label of the most representative instance (i.e., the one closest to the centroid) and propagate it to the unlabeled instances in the same cluster.
Can you name two clustering algorithms that can scale to large datasets? And two that look for regions of high density?
K-Means and BIRCH scale well to large datasets. DBSCAN and Mean-Shift look for regions of high density.
Can you think of a use case where active learning would be useful? How would you implement it?
Active learning is useful whenever you have plenty of unlabeled instances but labeling is costly. In this case (which is very common), rather than randomly selecting instances to label, it is often preferable to perform active learning, where human experts interact with the learning algorithm, providing labels for specific instances when the algorithm requests them. A common approach is uncertainty sampling (see the description in “Active Learning” on page 255).
What is the difference between anomaly detection and novelty detection?
Many people use the terms anomaly detection and novelty detection interchangeably, but they are not exactly the same. In anomaly detection, the algorithm is trained on a dataset that may contain outliers, and the goal is typically to identify these outliers (within the training set), as well as outliers among new instances. In novelty detection, the algorithm is trained on a dataset that is presumed to be “clean,” and the objective is to detect novelties strictly among new instances. Some algorithms work best for anomaly detection (e.g., Isolation Forest), while others are better suited for novelty detection (e.g., one-class SVM).
What is a Gaussian mixture? What tasks can you use it for?
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose
parameters are unknown. In other words, the assumption is that the data is grouped into a finite number of clusters, each with an ellipsoidal shape (but the clusters may have different ellipsoidal shapes, sizes, orientations, and densities), and we don’t know which cluster each instance belongs to. This model is useful for density estimation, clustering, and anomaly detection.
Can you name two techniques to find the right number of clusters when using a Gaussian mixture model?
One way to find the right number of clusters when using a Gaussian mixture model is to plot the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) as a function of the number of clusters, then choose the number of clusters that minimizes the BIC or AIC. Another technique is to use a Bayesian Gaussian mixture model, which automatically selects the number of clusters.
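A sketch of the BIC/AIC approach with Scikit-Learn (synthetic data, arbitrary range of k):
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

bics = []
for k in range(1, 8):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=42).fit(X)
    bics.append(gm.bic(X))            # could use gm.aic(X) instead
best_k = int(np.argmin(bics)) + 1     # number of clusters that minimizes the BIC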
What is θ = (XᵀX)⁻¹ Xᵀ y,
and when can we use it?
The Normal Equation. It is an alternative to Gradient Descent for training Linear Regression when the number of features isn’t too big.
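The Normal Equation written out with NumPy (a sketch; X and y are arbitrary toy data):
import numpy as np

X = np.random.rand(100, 3)
y = 4 + X @ np.array([3.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)

X_b = np.c_[np.ones((100, 1)), X]                # add the bias term x0 = 1
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y   # θ = (XᵀX)⁻¹ Xᵀ y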
How do you turn a feature into standard normal form (mean = 0, std = 1)?
Subtract the feature mean from each instance’s value and divide by the feature’s standard deviation.
What do you do if your cost function increases after each iteration?
Make the learning rate (alpha), smaller.
When is Gradient Descent a better option than the Normal Equation?
When there are too many features (e.g., more than 10,000).
Why use feature scaling?
It makes Gradient Descent converge more quickly, taking a more direct path to the minimum.
What are the dimensions of Θ(j) in a neural network?
s_(j+1) × (s_j + 1)
e.g., (number of units in the hidden layer) × (number of units in the input layer + 1)
What is a common reason for an ML model that works well in training but fails in production?
The ML dataset was improperly created
Personalized Algorithms are often built using which type of ML model?
Recommendation systems (but you must understand and know the tools and tricks of image processing and sequence systems to understand recommendation systems).
What is a key lesson Google has learned with regards to reducing the chance of failure in production ML models?
Process batch and streaming data the same way
Which of the following scenarios may require a supervised learning model to be retrained as a new model?
The model was trained on labeled data and we now wish to correct the labels of the data.
Someone read emails for a company and then forwards the emails to the appropriate department. How can we automate this process?
Use several models to read, sort, and send to departments. If there are any pre-existing models then use them.
A team is preparing to develop and deploy an ML model for use on a shopping website. They have collected a little data to train the model. The team plans on gathering more data once the model is developed. Now they are ready for the next phase, training.
Which of these scenarios will most likely lead to a successful deployment of the ML model?
The team should take time to gather more data, because with more data, it is possible to create a simpler ML model that performs better.
What are the five phases of the “Path to ML”?
Individual contributor, delegation, digitization, big data and analytics, machine learning
You are going to develop an ML model. You are in Canada and the rest of the team is in Mexico.
Your team wants to use Google Cloud Platform with Python Notebook. Which of the following statements support your decision.
Datalab notebooks are hosted in the cloud
Your team has decided to use the Compute Engine, Cloud Storage, and Datalab for ML model development
Which two statements are applicable to your situation?
Every member of the team, regardless of their location, can directly read data from Cloud Storage.
Latency of data access can be a concern, so carefully select the zone for data storage.
The third wave of cloud is _________________ so you can focus on data ___________ instead of infrastructure.
serverless, insights
Three quality attributes of data?
Consistency, accuracy, auditability
Two categories of data quality tools?
Cleaning tools, monitoring tools
Three features of low data quality?
unreliable info, incomplete data, duplicated data
What is the Orderliness of data?
The data entered has the required format and structure
Three best practices for data quality management?
resolving missing values, preventing duplicates, automating data entry
Which is the correct sequence of steps in data science after the data is gathered? 4 steps
Data Exploration -> Data Cleaning -> Model Building -> Present Results
Three objectives of exploratory data analysis?
Check for missing data and other mistakes, Gain maximum insight into the data set and its underlying structure, uncover a parsimonious model (the most useful features)
Two main methods for Exploratory Data Analysis?
Univariate and Bivariate
What machine learning models have labels, or in other words, the correct answers to whatever it is that we want to learn to predict?
Supervised model
Two most common types of Supervised machine learning models?
Regression model, and classification model
Which model would you use if your problem required a discrete number of values or classes?
Classification model
When the data isn’t labelled, what is an alternative way of predicting the output?
Clustering Algorithms
What is the most essential metric a regression model uses?
Mean squared error as their loss function
Fill in the blanks. In the video, we presented a linear equation. This hypothesis equation is applied to every _________ of our dataset, where the weight values are fixed, and the feature values are from each associated column, and our machine learning data set.
row
Fill in the blanks. Fundamentally, classification is about predicting a _______ and regression is about predicting a __________.
Label, Quantity
What component of a biological neuron is analogous to the input portion of a perceptron?
Dendrites
Which of the following is an algorithm for supervised learning of binary classifiers - given that a binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class.
Binary classifier, Perceptron, Linear Regression
Perceptron
Which model is a linear classifier, also used in supervised learning?
Neuron, Dendrites, Perceptron
Perceptron
A perceptron is a type of _____ that makes its predictions based on a linear predictor function combining a set of weights with the ________.
linear classifier, feature vector
Three steps in the Perceptron Learning Process
- Takes the inputs, multiplies them by their weights, and computes their sum.
- Adds a bias factor, the number 1 multiplied by a weight.
- Feeds the sum through the activation function.
Six elements of a perceptron?
- Input function X
- Bias b (constant)
- Weights
- Weighted sum
- Activation function
- Output
Neural Networks: If I wanted my outputs to be in the form of probabilities, which activation function should I use in the final layer?
Sigmoid
A single unit for a non-input neuron has these three things:
- Weighted Sum
- Activation function
- Output of the activation function
What activation functions are needed to get the complex chain functions that allow neural networks to learn data distributions?
Nonlinear activation functions
The range of a ReLU output?
between zero and infinity
The range of Tanh output?
between -1 and 1
The range of a Sigmoid output?
between zero and 1
The range of a ELU output?
between -1 and infinity
In a decision classification tree, what does each decision or node consist of?
Linear classifier of one feature
Mean squared error minimizer and euclidean distance minimizer are used in ______, not ______.
regression, classification
What in a neural network can map the data to a higher-dimensional vector space?
More neurons per layer
SVM: The _____ is the distance between two separate vectors.
margin
SVM: The more generalizable the decision boundary, the ____ the margin.
wider
SVMs are used for text classification tasks such as __________,
__________, and _________.
category assignment,
detecting spam, sentiment analysis
SVMs are based on the idea of finding a ________ that best divides a dataset into _____ classes. ___________ are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing ______. As a simple example, for a classification task with only two features, you can think of a _______ as a ______ that ______ separates and classifies a set of data.
hyperplane, two, support vectors, hyperplane, hyperplane, line, linearly
A _______ maps the data from our ______ vector space to a vector space that has features that can be ______ separated.
kernel transformation, input, linearly
In ML, kernel methods are a class of algorithms for ________, whose best know member is the ________.
pattern analysis, support vector machine
Dropout in neural networks works by randomly setting the _______ of hidden units to ____ at each update of training phase.
outgoing edges, 0
How does dropout help neural networks generalize?
By setting some outputs to 0, the cost function becomes more sensitive to neighboring neurons, changing the way the weights are updated during backpropagation.
Three types of modern neural networks.
Convolutional, modular, recurrent
Three ways to improve generalization in a NN?
Adding dropout layers, performing data augmentation, adding noise
At its core, a ________ is a method of evaluating how well your algorithm models your dataset. If your predictions are totally off, your _________ will output a higher number. If they’re pretty good, it will output a lower number. As you change pieces of your algorithm to try and improve your model, your ______ will tell you if you’re getting anywhere.
loss function
Simply speaking, __________ is the workhorse of basic loss functions. ______ is the sum of squared distances between our target variable and predicted values.
mean squared error
Loss functions can be broadly categorized into 2 types: Classification and Regression Loss. _____ is typically used for regression and ______ is typically used for classification.
mean squared error, cross entropy
Gradient Descent is an optimization algorithm used to _______ some function by iteratively moving in the direction of the steepest descent as by the _________. In machine learning, we use gradient descent to update the _______ of our model.
minimize, negative of the gradient, parameters
________, also called vanilla gradient descent, calculates the error for _______ within the training dataset, but only ________ all training examples have been evaluated does the model get updated. This whole process is like a cycle and it’s called a training epoch.
Batch gradient descent, each example, after
In the ________________________ method, one training sample (example) is passed through the neural network at a time and the parameters (weights) of each layer are updated with the computed gradient.
Stochastic Gradient Descent
________________: Parameters are updated after computing the gradient of error with respect to the entire training set
________________: Parameters are updated after computing the gradient of error with respect to a single training example
________________: Parameters are updated after computing the gradient of error with respect to a subset of the training set
Batch Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent
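A minimal NumPy sketch contrasting the three update schedules on a toy linear-regression loss (all names and values here are illustrative):
import numpy as np

rng = np.random.default_rng(42)
X = np.c_[np.ones((200, 1)), rng.normal(size=(200, 2))]    # bias column + 2 features
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=200)

def gradient(theta, Xb, yb):
    return 2 / len(Xb) * Xb.T @ (Xb @ theta - yb)

theta, lr = np.zeros(3), 0.1
for epoch in range(50):
    # Batch GD: one update per epoch, computed on the whole training set.
    theta -= lr * gradient(theta, X, y)

    # Stochastic GD (alternative): one update per single random example, e.g.
    #   i = rng.integers(len(X)); theta -= lr * gradient(theta, X[i:i+1], y[i:i+1])
    # Mini-batch GD (alternative): one update per small random subset, e.g.
    #   idx = rng.choice(len(X), 16, replace=False); theta -= lr * gradient(theta, X[idx], y[idx])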
What is a type one error?
When the model predicts positive but it’s actually a negative (predicts face when it’s a statue).
Formula for precision
True positives / (True positives + False Positives)
An increase in what factor will drive down the precision ratio?
False Positives
What is type two error?
When the model predicts negative and it’s actually a positive (predicts not face when it’s a face in winter clothes).
Formula for recall
True positives /(true positives + false negatives)
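Both formulas in code (a sketch with made-up counts):
def precision(tp, fp):
    return tp / (tp + fp)   # more false positives -> lower precision (type I errors)

def recall(tp, fn):
    return tp / (tp + fn)   # more false negatives -> lower recall (type II errors)

print(precision(tp=80, fp=20))  # 0.8
print(recall(tp=80, fn=40))     # ≈ 0.667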
Why is RMSE preferred?
The loss metric’s output is measured in the same units as the target variable, making it easier to interpret directly.
There will always be a ____ between the metrics we care about and the metrics that work well with gradient descent.
gap
What is the significance of performance metrics?
Plus two benefits
Performance metrics will allow us to reject models that have settled into inappropriate minima.
- easier to understand
- directly connected to business goals
Two ways to think about recall?
1. Inversely related to precision
2. Recall is like a person who never wants to be left out of a positive decision
Two parameters that affect gradient descent?
1. Learning rate
2. Batch size
What is the best way to assess the quality of a model?
To observe how well a model performs against a new dataset that it hasn’t seen before
How do you decide when to stop training a model?
When your loss metrics start to increase against the validation set
What actions can you perform on your model when it is trained and validated?
You can run it once, and only once, against the independent test dataset.
What loss functions are the most common for regression and classification?
RMSE (or MSE) for linear regression, cross-entropy for classification
Which is the most preferred way to traverse loss surfaces efficiently?
By analyzing the slopes of our loss functions, which provide us directions and step magnitude.
What core algorithm is used to construct Decision Trees?
Greedy algorithms
The RAND function in BigQuery generates a value between ____ and ____.
zero, one
How can you create repeatable samples of your data in BigQuery?
Use the last few digits of a hash function on the field that you’re using to split or bucketize your data
What allows you to split the dataset based upon a field in your data?
FARM_FINGERPRINT, an open-source hashing algorithm that is implemented in BigQuery SQL.
TensorFlow is a _____ and _____ platform programming interface for implementing and running machine learning algorithms, including convenience wrappers for deep learning.
scalable, multi
In TensorFlow, ____ are multi-dimensional arrays with a uniform type. All tensors are ____ like Python numbers and strings: you can never update the contents of a tensor, only create a new one.
tensors, immutable
How does TensorFlow represent numeric computations?
Using a Directed Acyclic Graph (or DAG)
How can we improve the calculation speed in TensorFlow, without losing accuracy?
Using GPU
tf.losses, tf.metrics, and tf.optimizers are useful components when?
building custom Neural Network models.
On which processing units can you run TensorFlow?
CPU, GPU, TPU
tf.estimator, tf.keras, tf.data are high level APIs used for?
distributed training
You need to build a custom NN model. What are two options?
We can use an estimator from TF, or we can use a high-level API such as Keras
Which of the following API’s are not used in the TensorFlow abstraction layers?
C++ API, Python API, tf.keras, tf.image
tf.image
Which API is used to build performant, complex input pipelines from simple, re-usable pieces that will feed your model’s training or evaluation loops.
tf.data.Dataset
Two operations that can be performed on tensors?
reshaped, sliced
What rank is Shape:[3,4]?
Rank 2
TensorFlow records all operations executed inside the context of a tf._______ onto a _____.
GradientTape, Tape
When we compute a loss gradient TensorFlow uses ___ and the ___ associated with each recorded operation to compute the _____.
tape, gradients, gradients
In a TensorFlow loss gradient opperation the computed gradient of a recorded computation will be used in ______ mode differentiation.
reverse
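A tiny TensorFlow 2 sketch of tape-based reverse-mode differentiation:
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:   # records operations executed inside this context onto the tape
    y = x ** 2
dy_dx = tape.gradient(y, x)       # reverse-mode differentiation: dy/dx = 2x = 6.0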
How to produce tensors that can be modified and that can be used for weights?
tf.Variable
A tf.Variable represents a tensor whose value can be _____ by running ___ on it. Specific ops allow you to read and modify the values of this tensor. Higher level libraries like ______ use tf.Variable to store model parameters.
changed, ops, tf.keras
Feature columns describe how the model should use ____ _____ data from your features ______.
raw input, dictionary
A bucketized column helps with discretizing _____ _____ _____.
continuous feature values
Two distinct ways to create a dataset in TensorFlow?
- A data source constructs a dataset from data stored in memory or in one or more files.
- A data transformation constructs a dataset from one or more tf.data.Dataset objects
_____ is used to instantiate a Dataset object which is comprised of lines from one or more text files.
TextLineDataset
The _____ format is a simple format for storing a sequence of binary records. Using ____ can be useful for standardizing input data and optimizing performance.
TFRecord (read with tf.data.TFRecordDataset)
____ has fixed-length records from one or more binary files.
FixedLengthRecordDataset
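A short tf.data sketch showing a data source plus chained transformations (toy in-memory data):
import tensorflow as tf

# Data source: build a dataset from in-memory tensors.
dataset = tf.data.Dataset.from_tensor_slices(([1.0, 2.0, 3.0, 4.0], [0, 1, 1, 0]))

# Data transformations: each call returns a new tf.data.Dataset.
dataset = dataset.shuffle(buffer_size=4).batch(2).repeat(2)

for features, labels in dataset:
    pass  # feed each batch to a training step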
Which method is invoked on the dataset - which triggers creation and execution of two operations?
iter
Three purposes of Neural Network embedding?
- Finding nearest neighbors in the embedding space. These can be used to make recommendations based on user interests or cluster categories.
- As input to a machine learning model for a supervised task.
- For visualization of concepts and relations between categories.
Three types of feature columns?
Categorical, Bucketized, Crossed
In the training phases of ML, which component is not part of the training phase?
Labeled Data, ML Algorithm, Served Model, Trained Model
Served Model
Three ways to feed TensorFlow models with data
TextLineDataset, TFRecordDataset, FixedLengthRecordDataset
What is the role of the tf.data API in TensorFlow?
It enables you to build complex input pipelines from simple, reusable pieces
Three components of the ML pipeline before running the model?
Data Extraction, Data Exploration, Data Analysis
Non-linearity helps in training your model at a _____ _____ ____ and with _____ _____ without the loss of your important information.
much faster rate, more accuracy
The activation function which is linear in the positive domain and the function is 0 in the negative domain.
ReLU
During the training process, each additional layer in your network can successively reduce signal vs. noise. How can we fix this?
Use non-saturating, nonlinear activation functions such as ReLUs
How can we solve the problem called internal covariate shift?
Batch normalization
How can we stop ReLU layers from dying?
lower your learning rates
Which model is appropriate for a plain stack of layers?
Sequential
How does Adam (optimization algorithm) help in compiling the Keras model?
By updating network weights iteratively based on training data and by diagonal rescaling of the gradients
The predict function in the tf.keras API returns what?
Numpy array(s) of predictions
Three parameters involved while compiling the Keras model?
optimizer, loss function, evaluation metrics
What is the significance of the Fit method while training a Keras model?
Defines the number of epochs.
Two weaknesses of the Keras Functional API
- It doesn’t support dynamic architectures. The Functional API treats models as DAGs of layers. This is true for most deep learning architectures, but not all: for instance, recursive networks or Tree RNNs do not follow this assumption and cannot be implemented in the Functional API.
- Sometimes we have to write from scratch and need to build subclasses. When writing advanced architectures, you may want to do things that are outside the scope of “defining a DAG of layers”: for instance, you may want to expose multiple custom training and inference methods on your model instance. This requires subclassing.
The Keras Functional API can be characterized by having:
Multiple inputs and outputs and models with shared layers
The core data structure of Keras is a model, which lets us organize and design layers. The _____ model is the simplest type of model (a linear stack of layers). If we need to build arbitrary graphs of layers, the Keras _____ ____ can do that for us.
Sequential, Functional API
The input layer of the Keras Functional API needs to have shape ___, where p is the number of ____ in your training matrix. For example: _____
(p,) , columns
inputs=Input(shape=(3,))
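A minimal Functional API sketch matching that input shape (the layer sizes are arbitrary):
from tensorflow import keras

inputs = keras.Input(shape=(3,))                     # p = 3 columns in the training matrix
hidden = keras.layers.Dense(8, activation="relu")(inputs)
outputs = keras.layers.Dense(1)(hidden)

model = keras.Model(inputs=inputs, outputs=outputs)  # a graph of layers, not a Sequential stack
model.compile(optimizer="adam", loss="mse", metrics=["mae"])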
In dropout regularization, the activations are scaled by which factor?
1/(1 - dropout probability)
How does regularization help build generalizable models ?
By adding dropout layers to our neural networks.
what does L2 regularization do?
It adds a sum of the squared parameter weights term to the loss function
___ regularization will keep the weight values smaller and ___ regularization will make the model sparser by dropping features.
L2, L1
What is the approximate equivalent of L2 regularization?
Early Stopping
What is the correct workflow to serve your model in the cloud?
create the model -> train and evaluate your model -> save your model -> serve your model
To serve our model for others to use, we export the ____ ____ and deploy the model as a _____.
model file, service
_____ is the directory in which to write the SavedModel
(EXPORT_PATH)
SavedModel is a universal _____ format for TensorFlow models. SavedModel provides a “language neutral format” to save your machine learning models that is both _____ and _____.
serialization, recoverable, hermetic
The Keras Functional API allows you to define what 3 things?
- multi-input or multi-output models
- ad hoc acyclic network graphs
- a model that shares layers
In the Keras Functional API, models are created by specifying their ____ and ____ in a graph of layers. That means that a single graph of layers can be used to generate multiple models.
inputs, outputs
What is TensorFlow Data Validation?
It is a tool that can be used to analyze data to find potential problems in data.
How to Input Feature Columns to a Keras Model?
We can use a DenseFeatures layer to input them to a Keras model.
Three reasons why the Keras Sequential model is not appropriate?
- Your model has multiple inputs or multiple outputs
- Any of your layers has multiple inputs or multiple outputs.
- You need to do layer sharing or non-linear topology
The _____ function can be used with linear regression, logistic regression, k-means, matrix factorization, and ARIMA-based time series models. The _____ function evaluates the _____ values against the ____ data, and can be used to evaluate model _____.
ML.EVALUATE, ML.EVALUATE, predicted, actual, metrics
ML.FEATURE_CROSS generates a ____ feature with all combinations of crossed _____ features except for 1-degree items.
STRUCT, categorical
ML.BUCKETIZE bucketizes a ____ numerical feature into a _____ feature with bucket names as the value.
continuous, string
Feature Cross combines features into a _____ feature, and enables a model to learn separate _____ for each combination of features.
single, weights
_____ is a process by which categorical variables are converted into a form that could be provided to neural networks to do a better job in prediction
One hot encoding
What to use to encode categorical data that is already indexed?
tf.feature_column.categorical_column_with_identity
What do you use the tf.feature_column.bucketized_column function for?
To discretize floating point values into a smaller number of categorical bins
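A sketch with the (legacy) tf.feature_column API that these cards refer to; the column name and boundaries are made up:
import tensorflow as tf

# Hypothetical numeric feature "age", discretized into 5 categorical buckets.
age = tf.feature_column.numeric_column("age")
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 25, 35, 50])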
Before being input into an ML model, raw data must be turned into:
feature vectors
Three characteristics of a good feature
- related to the objective
- known at prediction time
- numeric with meaningful magnitude
Different problems in the same domain may need _____ _____
different features
What is the relationship between Apache Beam and Cloud Dataflow?
Apache Beam is the API for data pipeline building in Java or Python, and Cloud Dataflow is the implementation and execution framework
TRUE or FALSE: The Filter method can be carried out in parallel and autoscaled by the execution framework:
True: Anything in Map or FlatMap can be parallelized by the Beam execution framework
What is the purpose of a Cloud Dataflow connector?
.apply(TextIO.write().to(“gs://…”));
Connectors allow you to output the results of a pipeline to a specific data sink like Bigtable, Google Cloud Storage, flat file, BigQuery, and more …
The stages of a pipeline
- data source
- transformation steps
- data sink
To run a pipeline you need something called a ________.
runner
TRUE or FALSE: A ParDo acts on all items at once (like a Map in MapReduce).
False. A ParDo acts on one item at a time (like a Map in MapReduce)
Three advantages with using an UI tool like Cloud Dataprep?
- Create transformations in UI tool instead of writing Java or Python
- Can chain steps together as part of a recipe
- Supports outputting your data into BigQuery, Google Cloud Storage, or flat files
TRUE or FALSE: You can automatically setup pipelines to run at defined intervals with Cloud Dataprep
True
Different cities in California have markedly different housing prices. What feature crosses could learn city-specific relationships between house characteristic and housing price?
One feature cross: [binned latitude X binned longitude X binned roomsPerPerson]
You are building a model to predict the number of points
(“margin”) by which Team A will beat Team B in a basketball game. Your input
features are (1) whether or not it is a home game for Team A (2) average
number of points Team A scored in its past 7 games and (3) average number
of points Team B scored in its past 7 games. Which two of these are linear models
suitable for machine learning?
1) margin = b + w1*is_home_game + w2*avg_points_A + w3*avg_points_B
2) margin = w1*is_home + w2*(avg_points_A - avg_points_B)^3
Feature crosses are more common in modern machine learning
because:
Feature crosses memorize, and that is okay only if you have extremely large datasets
The function tf.feature_column.crossed_column requires:
A list of categorical or bucketized features
Three reasons you might create an embedding of a feature cross.
1) Create a lower-dimensional representation of the input space
2) Identify similar sets of inputs for clustering
3) Reuse weights learned in one problem in another problem
During the training and serving phase, tf.Transform:
Provides a TensorFlow graph for preprocessing
Tensorflow transform is a hybrid of?
Apache Beam and TensorFlow
The ____ ____ is the most important concept of tf.Transform. The ____ ____ is a logical description of a transformation of the dataset. The ____ ____ accepts and returns a dictionary of tensors, where a tensor means Tensor or 2D SparseTensor.
Preprocessing function
Three steps in order that are considered a best practice in predictive modeling?
Data Cleaning > Feature engineering > Model Building
Using indicator variables to isolate key information, Highlighting interactions between two or more features. and representing the same feature in a different way are examples of?
Feature engineering
A good feature typically is _____ and _____.
related to the objective, is known at prediction time
Two benefits from Regularization?
1. Makes models smaller
2. Limits overfitting (the most important reason)
What is the key reason that we want to penalize models for over-complexity?
Overly-complex models may not be generalizable to real-world scenarios on unseen data
If your learning rate is too small, your loss function will:
Converge very slowly
If your learning rate is too high, your loss function
will converge rapidly, but not reach the lowest error value possible
If your batch size is too high, your loss function will
converge slowly
If your batch size is too low, your loss function will
oscillate wildly
If searching among a large number of hyperparameters, you should do a systematic grid search rather than start from random values, so that you are not relying on chance. True or False?
False
It is a good idea to use the training loss itself as the hyperparameter tuning metric. True or False?
False: you want to use an eval-metric as your hyperparameter tuning metric so that you are not rewarding models that overfit.
Hyperparameter tuning in Cloud ML Engine involves adding the appropriate TensorFlow function call to your model code. True or False?
False: Often, it is simply a matter of submitting a training job with an additional configuration setting
You are creating a model to predict the outcome (final score difference) of a basketball game between Team A and Team B. Your initial model is a neural network with [64, 32] nodes, learning_rate = 0.05, batch_size = 32. The input features include whether the game was played “at home” for Team A, the fraction of the last 7 games that Team A won, the average number of points scored by Team A in its last 7 games, the average score of Team A’s opponents in its last 7 games, etc.
Which of these are hyperparameters to the model?
The number of layers, batch size, number of nodes in each layer, the learning rate, AND the number of previous games that the input features are averaged over (the creation of an input feature is itself a hyperparameter)
What does L1 regularization tend to do to a model’s low-predictive features’ parameter weights?
Have zero values
Which type of regularization is more likely to lead to zero weights?
L1
Which type of regularization penalizes large weight values more?
L2
Two reasons why it’s important to add regularization to logistic regression?
- Helps stop weights being driven to +/- infinity
2. Helps logits stay away from asymptotes which can halt training
Three things you should do when performing logistic regression:
- Adding regularization
- Choosing a tuned threshold
- Checking for bias
You are training your classification model and are using Logistic Regression. Your last layer has no weights that can be _____
tuned
Why is it important to add non-linear activation functions to neural networks?
Stops the layers from collapsing back into just a linear model
Neural networks can be arbitrarily complex. To increase hidden dimensions, I can add____. To increase function composition, I can add ____. If I have multiple labels per example, I can add ____.
neurons, layers, outputs
Four things you can try if your model is experiencing exploding gradients:
- Lower the learning rate
- Add weight regularization
- Add Gradient clipping
- Add batch normalization
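A hedged sketch of the gradient-clipping and lower-learning-rate items above, using the clipnorm argument of a Keras optimizer; the specific values are illustrative, not recommendations.

import tensorflow as tf

# Clip the norm of each gradient to at most 1.0 and use a smaller learning rate.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, clipnorm=1.0)
# model.compile(optimizer=optimizer, loss="mse")  # then compile and train as usual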
Dropout acts as another form of ____. It forces data to flow down ____ paths so that there is a more even spread. It also simulates ____ learning. Don’t forget to scale the dropout activations by the inverse of the _____. We remove dropout during ____.
Regularization, multiple, ensemble, keep probability, inference
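A minimal numpy sketch of inverted dropout for one layer's activations; keep_prob and the activation shape are made-up values.

import numpy as np

keep_prob = 0.8
a = np.random.randn(4, 5)                      # activations of some hidden layer (dummy values)
mask = np.random.rand(*a.shape) < keep_prob    # keep each unit with probability keep_prob
a = (a * mask) / keep_prob                     # scale by the inverse of the keep probability
# At inference time dropout is removed entirely: no mask and no scaling.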
What are three common ways that a neural network training can fail?
- Gradients can explode if the learning rate is too high
- Entire layers can die with all their weights becoming zero
- Gradients can vanish, making it harder to train networks the deeper they are
If you see a dead layer (fraction of zero weights close to 1), what is a reasonable thing to try?
Lower the learning rate
I am training a classification neural network with 5 hidden layers, sigmoid activation function, and [128, 64, 32, 16, 8] with learning_rate=0.05 and batch_size=32. I notice from TensorBoard that gradients in the third layer are near-zero. Is this a problem?
yes
I am training a classification neural network with 5 hidden layers, sigmoid activation function, and [128, 64, 32, 16, 8] with learning_rate=0.05 and batch_size=32. I notice from TensorBoard that gradients in the third layer are near-zero. What would you try to fix this?
Try using ReLU activation function
For our classification output, if we have both mutually exclusive labels and probabilities, we should use ____. If the labels are mutually exclusive, but the probabilities aren’t, we should use _____. If our labels aren’t mutually exclusive, we should use ____.
- tf.nn.softmax_cross_entropy_with_logits_v2
- tf.nn.sparse_softmax_cross_entropy_with_logits
- tf.nn.sigmoid_cross_entropy_with_logits
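A hedged usage sketch of the three losses (TF 2.x drops the _v2 suffix on the softmax version); the logits and labels below are dummy values.

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])        # raw scores for 3 classes

# Mutually exclusive classes with soft labels (probabilities):
soft_labels = tf.constant([[0.7, 0.2, 0.1]])
loss_soft = tf.nn.softmax_cross_entropy_with_logits(labels=soft_labels, logits=logits)

# Mutually exclusive classes with hard labels (class indices):
hard_labels = tf.constant([0])
loss_hard = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=hard_labels, logits=logits)

# Labels that are not mutually exclusive (multi-label), one sigmoid per class:
multi_labels = tf.constant([[1.0, 0.0, 1.0]])
loss_multi = tf.nn.sigmoid_cross_entropy_with_logits(labels=multi_labels, logits=logits)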
If you have a classification problem with multiple labels, how does the neural network architecture change?
Have a logistic layer for each label, and send the outputs of the logistic layer to a softmax layer
If you have thousands of classes, computing the cross-entropy loss can be very slow. Which of these is a way to help address that problem?
Use a noise-contrastive loss function
What is the benefit of using a pre-canned Estimator?
It can give us a quick ML model
What is the recommended way to create distributed Keras models?
Write a Keras model as normal, and use the model_to_estimator function to convert it into an Estimator for train_and_evaluate
In the model function for a custom estimator, you can customize four things:
- the set of evaluation metrics
- The loss metric that is optimized
- The optimizer that is used
- The predictions that are returned (Correct. It is possible, for example, in a classification problem to decide to return an intermediate embedding, the class probability, and the logits. This is possible because predictions is a dictionary)
Two reasons for why an RNN (Recurrent Neural Network) is used for machine translation, say translating English to French?
- It can be trained as a supervised learning problem
2. It is applicable when the input/output is a sequence (e.g., a sequence of words).
What does a neuron compute?
A neuron computes a linear function (z=Wx + b) followed by an activation function
What is the loss function for logistic classification? Why do we use this one?
The cross-entropy loss function; it is convex, so there are no local minima for gradient descent to get stuck in
Suppose img is a (32,32,3) array, representing a 32x32 image with 3 color channels red, green and blue. How do you reshape this into a column vector?
x = img.reshape((32*32*3, 1))
a.shape = (2,3)
b.shape = (2,1)
c = a + b
c.shape = (2,3)
a.shape = (4,3)
b.shape = (3,2)
c = a*b
“Error!” the sizes don’t match for an element-wise multiplication
Suppose you have n_x input features per example. What is the dimension of X?
(n_x, m)
a.shape= (12288, 150)
b.shape = (150, 45)
c = np.dot(a,b)
c.shape = (12288, 45)
a.shape = (3,4)
b.shape = (4,1)
How do you vectorize the computation c[i][j] = a[i][j] + b[j] (i.e., without explicit for loops)?
c = a + b.T
a.shape = (3,3)
b.shape = (3,1)
c = a*b
What does python do to make this work?
This will invoke broadcasting, so b is copied three times to become (3,3), and * is an element-wise product, so c.shape will be (3,3)
Why does the tanh activation function usually work better than sigmoid as an activation function in the hidden layers?
Its output range is between -1 and 1 and thus centers the data around zero, which makes learning simpler for the next layer
You are building a binary classifier for recognizing cucumbers (y=1) vs. watermelons (y=0). Which one of these activation functions would you recommend using for the output layer?
sigmoid
A = np.random.randn(4,3)
B = np.sum(A, axis=1, keepdims=True)
what is B.shape?
(4,1)
What will happen if you build a neural network and you initialize the weights to be zero?
Each neuron in the first hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons.
You have built a network using the tanh activation for all the hidden units. You initialize the weights to relative large values, using np.random.randn(..,..)*1000. What will happen?
This will cause inputs of the tanh to also be very large, thus causing gradients to be close to zero. The optimization algorithm will thus become slow.
What is the “cache” used for in our implementation of forward propagation and backward propagation?
We use it to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.
The ____ layers of a neural network are typically computing more complex features of the input than the ____ layers.
deeper, earlier
Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l = 1, 2, …, L. True/False?
False
In a deep network, we cannot avoid a for loop iterating over the layers
Why do we need to know the activation function for backpropagation?
To compute the derivative, each activation has a different derivative.
Circuit theory: (i) To compute the function using a shallow network circuit, you will need a large network (where we measure size by the number of logic gates in the network), but (ii) To compute it using a deep network circuit, you need only an ____ smaller network.
exponentially
In general how can we find the dimension of the weight matrix associated with a layer?
W[l] has shape (n[l], n[l-1])
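A small numpy sketch of that shape rule, with made-up layer sizes.

import numpy as np

layer_dims = [5, 4, 3, 1]   # hypothetical: n_x = 5 inputs, two hidden layers, one output unit
params = {}
for l in range(1, len(layer_dims)):
    params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01   # (n[l], n[l-1])
    params["b" + str(l)] = np.zeros((layer_dims[l], 1))                               # (n[l], 1)
print(params["W2"].shape)   # (3, 4)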
If you have 10,000,000 examples, how would you split the train/dev/test set?
98%, 1% , 1%
The dev and test set should:
come from the same distribution
If your Neural Network model seems to have high bias, what two things could you try?
- increase the number of units in each hidden layer
2. make the Neural Network deeper
You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas and oranges. Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. What two things could you try?
- Increase the regularization parameter lambda
2. get more training data
What is weight decay?
A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration
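A sketch of why L2 regularization acts as weight decay in the gradient-descent update; learning_rate, lam, and m are made-up values.

import numpy as np

learning_rate, lam, m = 0.01, 0.7, 1000
W = np.random.randn(3, 4)
dW = np.random.randn(3, 4)   # gradient of the unregularized cost (dummy values)

# The L2 term adds (lam / m) * W to the gradient, so every step shrinks W a little:
W = W - learning_rate * (dW + (lam / m) * W)
# Equivalently: W = (1 - learning_rate * lam / m) * W - learning_rate * dW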
What happens when you increase the regularization hyperparameter lambda?
Weights are pushed toward becoming smaller
With the inverted dropout technique, at test time:
you do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training
Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause what two things?
- reducing the regularization effect
2. causing the neural network to end up with a lower training set error
Three techniques that can be used to reduce variance when more training data isn’t an option?
- dropout
- data augmentation
- L2 regularization
Why do we normalize the inputs x?
It makes the cost function faster to optimize
Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?
a[3]{8}(7)
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
if it’s 1 then you lose the benefits of vectorization across examples in the mini-batch
if it’s m then you end up with batch gradient descent, which can be very slow for big training sets
If you plot the cost with mini-batch what does it look like?
It looks like the batch gradient descent curve but noisier: the cost oscillates from mini-batch to mini-batch while trending downward
What’s bias correction? Is it popular?
It’s used with exponentially weighted averages: it rescales the first few estimates, which would otherwise be biased toward zero because the average starts at zero. In practice it’s not that common.
With exponentially weighted averages, what happens if β is too large? Too small? What is a good β value?
Too large and the curve is smoother but lags (shifts to the right). Too small and it oscillates a lot. Standard practice uses β = 0.9, which averages over roughly the last 10 iterations.
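A numpy sketch of an exponentially weighted average with the bias correction mentioned above; the series is dummy data.

import numpy as np

beta = 0.9                      # averages over roughly 1 / (1 - beta) = 10 values
thetas = np.random.randn(50)    # dummy series
v = 0.0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta
    v_corrected = v / (1 - beta ** t)   # bias correction mainly matters for small t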
Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function J. Which four techniques could help find parameter values that attain a small value for J?
- tuning the learning rate
- try mini-batch gradient descent
- try using Adam
- try better random initialization for the weights
Why is grid search a bad idea for searching for hyperparameters?
Random search lets you try more distinct values. A 5x5 grid search over two hyperparameters gives 25 combinations but only 5 distinct values of each hyperparameter, whereas 25 random samples give 25 distinct values for each hyperparameter.
Which hyperparameters are generally the most important?
1st - learning rate
2nd - hidden units, β (momentum), mini-batch size
3rd - number of layers, learning rate decay
The hyperparameters for Adam are usually left at their default values: β1 = 0.9, β2 = 0.999, ε = 10^(-8)
During hyperparameter search, whether you try to babysit one model (“Panda” strategy) or train a lot of models in parallel (“Caviar”) is largely determined by:
The amount of computational power you can access
If you think β (hyperparameter for momentum) is between 0.9 and 0.99, what is the recommended way to sample a value for beta?
r = np.random.rand()
beta = 1 - 10**(-r - 1)
In batch normalization as presented in the videos, if you apply it on the l-th layer of your neural network, what are you normalizing?
Z[l], that will go into the activation function
In the normalization formula z_norm^(i) = (z^(i) - μ) / sqrt(σ² + ε), why do we use ε?
to avoid division by zero
What do gamma and beta do in Batch Norm, and how can we find them?
They set the mean and variance of the linear variable z[l] of a given layer, and they can be learned using gradient descent, gradient descent with momentum, RMSprop, or Adam
After training a neural network with Batch Norm, at test time, to evaluate the neural network on a new example you should:
Perform the needed normalization using an exponentially weighted average across mini-batches seen during training.
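A numpy sketch of the Batch Norm computation from the cards above: normalize z over the mini-batch, then rescale with the learnable gamma and beta. All values here are dummies; at test time mu and sigma2 would come from the exponentially weighted averages instead.

import numpy as np

z = np.random.randn(64, 32)      # (batch_size, units) pre-activations, dummy values
gamma = np.ones((1, 32))         # learnable scale
beta = np.zeros((1, 32))         # learnable shift
eps = 1e-8

mu = z.mean(axis=0, keepdims=True)
sigma2 = z.var(axis=0, keepdims=True)
z_norm = (z - mu) / np.sqrt(sigma2 + eps)
z_tilde = gamma * z_norm + beta  # this is what goes into the activation function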
Suppose your input is a 300 by 300 color (RGB) image, and you are not using a convolutional network. If the first hidden layer has 100 neurons, each one fully connected to the input, how many parameters does this hidden layer have (including the bias parameters)?
27,000,100 = 300*300*3*100 + 100
Suppose your input is a 300 by 300 color (RGB) image, and you use a convolutional layer with 100 filters that are each 5x5. How many parameters does this hidden layer have (including the bias parameters)?
7,600 = 5*5*3*100 + 100
You have an input volume that is 63x63x16, and convolve it with 32 filters that are each 7x7, using a stride of 2 and no padding. What is the output volume?
29x29x32
dimension=((n+2p-f)/s)+1
((63+2*0-7)/2)+1 = 29
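A tiny helper for the dimension formula used here and in the next few cards (illustrative only; it assumes the division comes out even).

def conv_output_size(n, f, p=0, s=1):
    # ((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(63, 7, p=0, s=2))   # 29, matching the example above
print(conv_output_size(32, 2, p=0, s=2))   # 16, matching the max-pooling example below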
You have an input volume that is 15x15x8, and pad it using “pad=2.” What is the dimension of the resulting volume (after padding)?
19x19x8
You have an input volume that is 63x63x16, and convolve it with 32 filters that are each 7x7, and stride of 1. You want to use a “same” convolution. What is the padding?
3
dimension=((n+2p-f)/s)+1
((63+2p-7)/1)+1=63 -> p=3
You have an input volume that is 32x32x16, and apply max pooling with a stride of 2 and a filter size of 2. What is the output volume?
16x16x16
divide width and height by 2 or ((n+2p-f)/s)+1 works as well
True or False. Because pooling layers do not have parameters, they do not affect the backpropagation (derivatives) calculation.
False: pooling layers still affect the backpropagation calculation; for example, max pooling routes the gradient only through the unit that attained the maximum
Two reasons why ‘parameter’ sharing is a benefit for using convolutional networks?
- It reduces the total number of parameters, thus reducing overfitting
- It allows a feature detector to be used in multiple locations throughout the whole input/image/input volume
In lecture we talked about “sparsity of connections” as a benefit of using convolutional layers. What does this mean?
Each activation in the next layer depends on only a small number of activations from the previous layer
How do dimensions typically change in a ConvNet?
nH and nW decrease, while nC increases
Typical structure of a ConvNet?
Multiple CONV layers followed by a POOL layer repeated a few times, and FC layers in the last few layers
Suppose you have an input volume of dimension 64x64x16. How many parameters would a single 1x1 convolutional filter have (including the bias)?
17 (1*1*16 weights + 1 bias)
Suppose you have an input volume of dimension nH x nW x nC. How can you change the dimensions? (Assume that “1x1 convolutional layer” below always uses a stride of 1 and no padding.)
You can use a 1x1 convolutional layer to reduce nC but not nH, nW.
You can use a pooling layer to reduce nH, nW, but not nC.
What can a 1x1 convolution do for Inception Networks?
They can reduce the input data volume’s size before applying 3x3 and 5x5 convolutions
What is an inception block?
A single inception block allows the network to use a combination of 1x1, 3x3, 5x5 convolutions and pooling
What two reasons for using open-source implementations of ConvNets (both the model and/or weights)?
- It is a convenient way to get a working implementation of a complex ConvNet architecture.
- Parameters trained for one computer vision task are often useful as pretraining for other computer vision tasks.
You are working on a factory automation task. Your system will see a can of soft-drink coming down a conveyor belt, and you want it to take a picture and decide whether (i) there is a soft-drink can in the image, and if so (ii) its bounding box. Since the soft-drink can is round, the bounding box is always square, and the soft drink can always appears as the same size in the image. There is at most one soft drink can in each image. Here’re some typical images in your training set: What is the most appropriate set of output units for your neural network?
Logistic unit, bx, by
If you build a neural network that inputs a picture of a person’s face and outputs N landmarks on the face (assume the input image always contains exactly one face), how many output units will the network have?
2N
When training one of the object detection systems described in lecture, you need a training set that contains many pictures of the object(s) you wish to detect. However, bounding boxes do not need to be provided in the training set, since the algorithm can learn to detect the objects by itself.
False
Suppose you are applying a sliding windows classifier (non-convolutional implementation). Increasing the stride would tend to increase accuracy, but decrease computational cost.
False
In the YOLO algorithm, at training time, only one cell —the one containing the center/midpoint of an object— is responsible for detecting this object.
True
What is the IoU between these two boxes? The upper-left box is 2x2, and the lower-right box is 2x3. The overlapping region is 1x1
1/9 (intersection = 1, union = 4 + 6 - 1 = 9)
Suppose you are using YOLO on a 19x19 grid, on a detection problem with 20 classes, and with 5 anchor boxes. During training, for each image you will need to construct an output volume y as the target value for the neural network; this corresponds to the last layer of the neural network. (y may include some “?”, or “don’t cares”). What is the dimension of this output volume?
19x19x(5x25) = 19x19x125, since each anchor box needs pc, bx, by, bh, bw plus 20 class probabilities
Face _____requires comparing a new picture against one person’s face, whereas face recognition requires comparing a new picture against K person’s faces.
verification, recognition
Why do we learn a function d(img1,img2) for face verification?
This allows us to learn to recognize a new person given just a single image of that person. We need to solve a one-shot learning problem
In order to train the parameters of a face recognition system, it would be reasonable to use a training set comprising 100,000 pictures of 100,000 different persons.
False
having about 10 photos of each person would work well
Which of the following is a correct definition of the triplet loss? Consider that α>0. (We encourage you to figure out the answer from first principles, rather than just refer to the lecture.)
max(||f(A)−f(P)||^2 − ||f(A)−f(N)||^2 + α, 0)
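A numpy sketch of that triplet loss for a single (anchor, positive, negative) triple; the 128-dimensional embeddings below are random placeholders.

import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    pos_dist = np.sum((f_a - f_p) ** 2)
    neg_dist = np.sum((f_a - f_n) ** 2)
    return max(pos_dist - neg_dist + alpha, 0.0)

f_a, f_p, f_n = (np.random.randn(128) for _ in range(3))   # dummy face embeddings
print(triplet_loss(f_a, f_p, f_n))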
In a siamese network architecture neural networks will have ______ input images, but have exactly the same _____ parameters.
different, same
You train a ConvNet on a dataset with 100 different classes. You wonder if you can find a hidden unit which responds strongly to pictures of cats. (I.e., a neuron so that, of all the input/training images that strongly activate that neuron, the majority are cat pictures.) You are more likely to find this unit in layer 4 of the network than in layer 1
True
In the deeper layers of a ConvNet, each channel corresponds to a different feature detector. The style matrix G[l] measures the degree to which the activations of different feature detectors in layer l vary (or correlate) together with each other.
True
In neural style transfer, what is updated in each iteration of the optimization algorithm?
The pixel values of the generated image G
You are working with 3D data. You are building a network layer whose input volume has size 32x32x32x16 (this volume has 16 channels), and applies convolutions with 32 filters of dimension 3x3x3 (no padding, stride 1). What is the resulting output volume?
30x30x30x32
Two examples for when we would use a many-to-one RNN architecture?
- Sentiment classification from a text [0=negative, 1=positive]
- Gender recognition from speech [0=male, 1=female]
How many inputs does a<t> have in an RNN?
Two: a<t-1> and x<t>
You are training an RNN, and find that your weights and activations are all taking on the value of NaN (“Not a Number”). Which of these is the most likely cause of this problem?
Exploding gradient problem
Suppose you are training a LSTM. You have a 10000 word vocabulary, and are using an LSTM with 100-dimensional activations a. What is the dimension of Γu at each time step?
100
You have a pet dog whose mood is heavily dependent on the current and past few days’ weather. You’ve collected data for the past 365 days on the weather, which you represent as a sequence as x<1>,…,x<365>. You’ve also collected data on your dog’s mood, which you represent as y<1>,…,y<365>. You’d like to build a model to map from x→y. Should you use a Unidirectional RNN or Bidirectional RNN for this problem?
Unidirectional RNN, because the value of y<t> depends only on x<1>, …, x<t>, but not on x<t+1>, …, x<365>
What is t-SNE?
A non-linear dimensionality reduction technique. It can be used to view the relations in Word Embedding Matrix
What equations could you expect to make from ‘boy’, ‘girl’, ‘brother’ and ‘sister’ if your word embedding is good?
boy - girl =~ brother - sister
boy - brother =~ girl - sister
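A sketch of checking such an analogy with cosine similarity; the random "embeddings" below are placeholders, a real check would use trained word vectors.

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

emb = {w: np.random.randn(50) for w in ["boy", "girl", "brother", "sister"]}   # made-up vectors

# With a good embedding, these two difference vectors point in similar directions:
print(cosine(emb["boy"] - emb["girl"], emb["brother"] - emb["sister"]))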
When is it okay to use transfer learning for NLP?
When your training set is smaller than the training set used to create the word embedding
Which other file type uses JavaScript Object Notation (JSON) formatting?
Jupyter/IPython notebooks (.ipynb files)
Which residual-based approach to identifying outliers compares running a model with all data to running the same model, but dropping a single observation?
externally studentized residuals
equation to standardize the data?
(x-mean)/std
equation to min-max standardize the data?
(x-min)/(max-min)
What’s the robust scaler? And why use it?
It scales the data using the median and the interquartile range (IQR) instead of the min and max, so it’s less vulnerable to outliers.
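A numpy sketch of the three scalers above applied to one feature column (dummy data with an outlier); scikit-learn's StandardScaler, MinMaxScaler, and RobustScaler implement the same ideas.

import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])          # note the outlier

standardized = (x - x.mean()) / x.std()
min_maxed = (x - x.min()) / (x.max() - x.min())

q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)                   # IQR-based, so the outlier has less influence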
(True/False) In general, the population parameters are unknown
True
(True/False) Parametric models have a finite number of parameters
True
The most common way of estimating parameters in a parametric model is:
using the maximum likelihood estimation
A p-value is:
the smallest significance level at which the null hypothesis would be rejected
Type 1 Error is defined as:
Saying the null hypothesis is false, when it is actually true
Type 2 error is defined as:
Saying the null hypothesis is true, when it is actually false
(True/False) If you reject the null hypothesis, it means that the alternate hypothesis is true.
False
In K-fold cross-validation, how will increasing k affect the variance (across subsamples) of estimated model parameters?
increasing k will usually increase the variance of the estimated parameters
What does Bagging stand for?
bootstrap aggregating
What is the main condition to use stacking as ensemble method?
Models need to output predicted probabilities
This tree ensemble method only uses a subset of the features for each tree:
Random Forest
This is an ensemble model that does not use bootstrapped samples to fit the base trees, takes residuals into account, and fits the base trees iteratively:
Boosting
When clustering with KMeans, what’s the difference between inertia and distortion?
Inertia is the sum of squared distances from each point to its cluster centroid: use it when you want a similar number of observations in each cluster.
Distortion is the average squared distance: use it when you want the observations in each cluster to be very similar.
When using DBSCAN, how does the algorithm determine that a cluster is complete and is time to move to a different point of the data set and potentially start a new cluster?
When no point is left unvisited by the chain reaction
What are the advantages of DBSCAN?
No need to specify the number of clusters, it allows for noise, and it can handle strange cluster shapes
What are the disadvantages of DBSCAN?
Computationally expensive, hard to choose the parameters, and the clusters should have similar density
How to find the best number of clusters with the K-means algorithm?
Use the elbow method
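A hedged scikit-learn sketch of the elbow method: fit K-means for several values of k and plot the inertia, then look for the "elbow". X here is random placeholder data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.randn(300, 2)                     # placeholder data
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")              # the bend ("elbow") suggests a reasonable k
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()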
What are the key hyperparameters for the Hierarchical Clustering (Ward) algorithm?
distances and linkage
How do you choose the number of clusters when using the Mean Shift algorithm?
The algorithm chooses it for us
How do we define the core points when we use the DBSCAN algorithm?
A point that has more than n_clu neighbors in its ε-neighborhood
What are the L1 and L2 distances?
L1 = Manhattan L2 = Euclidean
When might the Manhattan distance be better?
When the data is very high dimensional
What is Cosine distance and when do we use it?
It measures the angle from the origin between two vectors. It can be good for text data when the location of occurrence is less important
Which distance metric is useful when we have text documents and we want to group similar topics together?
Jaccard
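A numpy sketch of the distance measures from the last few cards, on made-up vectors and token sets.

import numpy as np

u, v = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])

l1 = np.sum(np.abs(u - v))                          # Manhattan (L1)
l2 = np.sqrt(np.sum((u - v) ** 2))                  # Euclidean (L2)
cos_dist = 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))   # cosine distance

a, b = {"ml", "data", "model"}, {"data", "model", "cloud"}
jaccard_dist = 1 - len(a & b) / len(a | b)          # Jaccard distance between token sets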
For data with many features, principal components analysis …
generates new features that are linear combinations of other features
What is the main difference between kernel PCA and linear PCA?
Kernel PCA tends to preserve the geometric distances between the points while reducing the dimensionality of the space
Multi-Dimensional Scaling can be useful to do what?
To visualize the data
When we use the DBSCAN algorithm, how do we know that our cluster is complete and is time to move to a different point of the data set and potentially start a new cluster?
When no point is left unvisited by the chain reaction
Which statements correctly define the strengths of the DBSCAN algorithm?
No need to specify the number of clusters, allows for noise, and can handle arbitrary-shaped clusters
Which statements correctly define the weaknesses of the DBSCAN algorithm?
It needs two parameters as inputs, finding appropriate values for them can be difficult, and it does not do well with clusters of different density
How can you have clear separations of clusters while using HAC?
Use Single Linkage
Which linkage refers to maximum pairwise distance between clusters in HAC?
Complete linkage
Which of the following measure methods computes the inertia and picks the pair that is going to ultimately minimize the inertia value in HAC?
Ward Linkage
This is the type of decomposition model that is used if the magnitudes of the seasonal and residual values fluctuate with trend:
Multiplicative Decomposition Model
This decomposition model assumes that the seasonal and residual magnitudes are independent of trend.
Additive Decomposition Model
Which of the following smoothing techniques is appropriate for data with a trend but no seasonality?
Double Exponential Smoothing
Which of the following smoothing techniques is appropriate for data with both trend and seasonality?
Triple Exponential Smoothing
How is SARIMA different from ARIMA?
The S stands for seasonality: SARIMA adds seasonal terms to the ARIMA model
What is a characteristic of an autoregressive (AR) model?
A fixed number of past values of the series are used to predict future values.
What is a characteristic of a moving average (MA) model?
A fixed number of past forecast errors are used to predict future values.
An ARIMA model without differencing (I=0) is equivalent to which of the following approaches?
The sum of an AR and an MA model (an ARMA model).
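A hedged statsmodels sketch of fitting the (S)ARIMA models from these cards; the series is random placeholder data and the order/seasonal_order values are illustrative.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = np.random.randn(120).cumsum()                  # placeholder series
model = SARIMAX(y, order=(1, 1, 1),                # (p, d, q): AR terms, differencing, MA terms
                seasonal_order=(1, 0, 1, 12))      # (P, D, Q, s): the seasonal part of SARIMA
result = model.fit(disp=False)
forecast = result.forecast(steps=12)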
This plot summarizes the 2-way correlation between a variable and its past values:
Autocorrelation plot
Two major ways that data can be collected:
cross-sectional, and longitudinal (panel / time series)
Three important pillars of the mathematical sciences?
Function approximation, Optimization, Probability and Statistics.