Data Science Flashcards
Two most common supervised tasks?
Classification and Regression
Four common unsupervised tasks?
Clustering, visualization, dimensionality reduction, association rule learning
What type of learning is used to train a robot to walk on various unknown terrains?
Reinforcement Learning
Is spam detection a supervised or unsupervised learning problem?
Supervised: you feed the model many emails that are labeled spam or not spam
What is an online learning system?
A learning system that can learn incrementally. It can adapt rapidly to changing data and autonomous systems, and can train on very large quantities of data.
What is out-of-core learning?
Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer’s main memory. They chop the data into mini-batches and use online learning techniques.
What type of learning algorithm relies on a similarity measure to make predictions?
An instance-based learning system learns the training data by heart; then, when given new instances, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
Difference between a model parameter and a hyperparameter?
A model parameter determines what the model will predict given a new instance (e.g., the slope of a linear model); a hyperparameter is a parameter of the learning algorithm itself (e.g., the maximum depth of a decision tree).
1. What do model-based learning algorithms search for? 2. What is the most common strategy they use to succeed? 3. How do they make predictions?
1. They search for an optimal value for the model parameters such that the model will generalize well to new instances. 2. Usually by minimizing a cost function. 3. They feed the new instance’s features into the model.
Five main challenges to ML?
1. Lack of data, 2. poor data quality, 3. nonrepresentative data, 4. uninformative features, 5. overfitting or underfitting
Four solutions to overfitting?
1. Get more data, 2. simplify the model, 3. reduce the noise in the data, 4. use a smaller learning rate
What is a test set?
To estimate the generalization error that the model will make on new instances, before the model is launched in production
The purpose of a validation set?
To compare models and tune the hyperparameters
What is a train-dev set?
Used when there is a risk of mismatch between the training data and the data used for validation
Which Linear Regression training algorithm can you use if you have a training set with millions of features?
You can’t use SVD or the Normal Equation because their computational complexity grows quickly with the number of features. Use Stochastic Gradient Descent or Mini-batch Gradient Descent. If memory allows, you can also use Batch Gradient Descent.
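A minimal sketch of this with Scikit-Learn’s SGDRegressor (the data shapes here are illustrative):

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    # Illustrative data: many features, where SVD / the Normal Equation
    # would be costly but SGD stays cheap per step.
    X = np.random.randn(1000, 5000)
    y = X @ np.random.randn(5000) + 0.1 * np.random.randn(1000)

    sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3)
    sgd_reg.fit(X, y)  # updates the parameters one instance at a time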
If your training set has very different scales which algorithms might suffer? What can you do about this?
The cost function will have the shape of an elongated bowl, so the Gradient Descent algorithms will take a long time to converge. (Normal Equation or SVD approach will work fine). To solve this you should scale the data first. Moreover, regularized models may converge to a suboptimal solution if the features are not scaled.
Can Gradient Descent get stuck in a local minimum when training a Logistic Regression Model?
Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model because the cost function is convex.
Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on?
If the validation error consistently goes up after every epoch, then one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.
Do all Gradient Descent algorithms lead to the same model, provided you let them run long enough?
If the optimization problem is convex (such as Linear Regression or Logistic Regression), and assuming the learning rate is not too high, then all Gradient Descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge; instead they will keep jumping back and forth around the global optimum. This means that even if you let them run for a very long time, these Gradient Descent algorithms will produce slightly different models.
Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?
Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch Gradient Descent is guaranteed to make progress at every single training iteration. So if you immediately stop training when the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to
save the model at regular intervals; then, when it has not improved for a long time (meaning it will probably never beat the record), you can revert to the best
saved model.
Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?
Stochastic Gradient Descent has the fastest training iteration since it considers only one training instance at a time, so it is generally the first to reach the vicinity
of the global optimum (or Mini-batch GD with a very small mini-batch size). However, only Batch Gradient Descent will actually converge, given enough
training time. As mentioned, Stochastic GD and Mini-batch GD will bounce around the optimum, unless you gradually reduce the learning rate.
Suppose you are using Polynomial Regression. You plot the learning curves and
you notice that there is a large gap between the training error and the validation
error. What is happening? What are three ways to solve this?
If the validation error is much higher than the training error, this is likely because your model is overfitting the training set. One way to try to fix this is to reduce
the polynomial degree: a model with fewer degrees of freedom is less likely to overfit. Another thing you can try is to regularize the model—for example, by adding an ℓ2 penalty (Ridge) or an ℓ1 penalty (Lasso) to the cost function. This will also reduce the degrees of freedom of the model. Lastly, you can try to increase the size of the training set.
Suppose you are using Ridge Regression and you notice that the training error
and the validation error are almost equal and fairly high. Would you say that the
model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?
If both the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has a high
bias. You should try reducing the regularization hyperparameter α.
Why would you want to use:
a. Ridge Regression instead of plain Linear Regression (i.e., without any regularization)?
A model with some regularization typically performs better than a model without any regularization, so you should generally prefer Ridge Regression over plain Linear Regression.
Why would you want to use:
b. Lasso instead of Ridge Regression?
Lasso Regression uses an ℓ1 penalty, which tends to push the weights down to exactly zero. This leads to sparse models, where all weights are zero except for
the most important weights. This is a way to perform feature selection automatically, which is good if you suspect that only a few features actually matter. When you are not sure, you should prefer Ridge Regression.
Why would you want to use: c. Elastic Net instead of Lasso?
Elastic Net is generally preferred over Lasso since Lasso may behave erratically in some cases (when several features are strongly correlated or when there are more features than training instances). However, it does add an extra hyperparameter to tune. If you want Lasso without the erratic behavior, you can just
use Elastic Net with an l1_ratio close to 1.
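A minimal sketch with Scikit-Learn; the alpha and l1_ratio values are illustrative, not recommendations:

    import numpy as np
    from sklearn.linear_model import ElasticNet

    X = np.random.randn(100, 10)
    y = X[:, 0] - 2 * X[:, 1] + 0.1 * np.random.randn(100)

    # l1_ratio close to 1 behaves like Lasso (sparse weights)
    # while avoiding its erratic behavior with correlated features.
    enet = ElasticNet(alpha=0.1, l1_ratio=0.9).fit(X, y)
    print(enet.coef_)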
Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime.
Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?
If you want to classify pictures as outdoor/indoor and daytime/nighttime, since these are not exclusive classes (i.e., all four combinations are possible) you should train two Logistic Regression classifiers.
What is the fundamental idea behind Support Vector Machines?
The fundamental idea behind Support Vector Machines is to fit the widest possible “street” between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets.
What is a support vector?
After training an SVM, a support vector is any instance located on the “street” (see the previous answer), including its border. The decision boundary is entirely
determined by the support vectors. Any instance that is not a support vector (i.e., is off the street) has no influence whatsoever; you could remove them, add more instances, or move them around, and as long as they stay off the street they won’t affect the decision boundary. Computing the predictions only involves the support vectors, not the whole training set.
Why is it important to scale the inputs when using SVMs?
SVMs try to fit the largest possible “street” between the classes, so if the training set is not scaled, the SVM will tend to neglect small features.
Can an SVM classifier output a confidence score when it classifies an instance?
What about a probability?
An SVM classifier can output the distance between the test instance and the decision boundary, and you can use this as a confidence score. However, this score cannot be directly converted into an estimation of the class probability. If you set probability=True when creating an SVM in Scikit-Learn, then after training it will calibrate the probabilities using Logistic Regression on the SVM’s scores (trained by an additional five-fold cross-validation on the training data). This will add the predict_proba() and predict_log_proba() methods to the SVM.
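A small sketch in Scikit-Learn (the toy dataset is an assumption):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, random_state=42)
    svm_clf = SVC(probability=True)  # enables cross-validated calibration
    svm_clf.fit(X, y)
    print(svm_clf.decision_function(X[:3]))  # signed distances = confidence scores
    print(svm_clf.predict_proba(X[:3]))      # calibrated class probabilities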
Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?
This question applies only to linear SVMs, since kernelized SVMs can only use the dual form. The computational complexity of the primal form of the SVM problem is proportional to the number of training instances m, while the computational complexity of the dual form is proportional to a number between m^2 and m^3. So if there are millions of instances, you should definitely use the primal form, because the dual form will be much too slow.
Say you’ve trained an SVM classifier with an RBF kernel, but it seems to underfit the training set. Should you increase or decrease γ (gamma)? What about C?
If an SVM classifier trained with an RBF kernel underfits the training set, there might be too much regularization. To decrease it, you need to increase gamma or C (or both).
What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with one million instances?
The depth of a well-balanced binary tree containing m leaves is equal to log_2(m), rounded up. A binary Decision Tree (one that makes only binary decisions, as is the case for all trees in Scikit-Learn) will end up more or less well balanced at the end of training, with one leaf per training instance if it is trained without restrictions. Thus, if the training set contains one million instances, the Decision Tree will have a depth of log_2(10^6) ≈ 20 (actually a bit more, since the tree will generally not be perfectly well balanced).
Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?
A node’s Gini impurity is generally lower than its parent’s. This is due to the CART training algorithm’s cost function, which splits each node in a way that minimizes the weighted sum of its children’s Gini impurities. However, it is possible for a node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child’s impurity. For example, consider a node containing four instances of class A and one of class B. Its Gini impurity is 1 − (1/5)^2 − (4/5)^2 = 0.32. Now suppose the dataset is one-dimensional and the instances are lined up in the following order: A, B, A, A, A. You can verify that the algorithm will split this node after the second instance, producing one child node with instances A, B and the other child node with instances A, A, A. The first child node’s Gini impurity is 1 − (1/2)^2 − (1/2)^2 = 0.5, which is higher than its parent’s. This is compensated for by the fact that the other node is pure, so the overall weighted Gini impurity is 2/5 × 0.5 + 3/5 × 0 = 0.2, which is lower than the parent’s Gini impurity.
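A quick Python check of the arithmetic above (the gini helper is just for illustration):

    def gini(counts):
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    parent = gini([4, 1])                     # 0.32
    left, right = gini([1, 1]), gini([3, 0])  # 0.5 and 0.0 after the A,B | A,A,A split
    weighted = 2/5 * left + 3/5 * right       # 0.2, lower than the parent's 0.32
    print(parent, left, right, weighted)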
If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?
If a Decision Tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it.
If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?
Decision Trees don’t care whether or not the training data is scaled or centered; that’s one of the nice things about them. So if a Decision Tree underfits the training set, scaling the input features will just be a waste of time.
If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?
The computational complexity of training a Decision Tree is O(n × m log(m)). So if you multiply the training set size by 10, the training time will be multiplied by
K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). If m = 10^6, then K ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.
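The same calculation as a quick Python check:

    import math

    m = 10**6
    K = 10 * math.log(10 * m) / math.log(m)
    print(K)  # ≈ 11.67, so roughly 11.7 hours if 1M instances take 1 hour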
If your training set contains 100,000 instances for Decision Tree Classifier, will setting presort=True speed up training?
Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. If it contains 100,000 instances, setting presort=True
will considerably slow down training.
If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?
If you have trained five different models and they all achieve 95% precision, you can try combining them into a voting ensemble, which will often give you even
better results. It works better if the models are very different (e.g., an SVM classifier, a Decision Tree
Classifier, a Logistic Regression classifier, and so on). It is even better if they are trained on different training instances (that’s the whole point of bagging and pasting ensembles), but if not this will still be effective as long as the models are very different.
What is the difference between hard and soft voting classifiers?
A hard voting classifier just counts the votes of each classifier in the ensemble and picks the class that gets the most votes. A soft voting classifier computes the average estimated class probability for each class and picks the class with the highest probability. This gives high-confidence votes more weight and often performs better, but it works only if every classifier is able to estimate class probabilities (e.g., for the SVM classifiers in Scikit-Learn you must set probability=True).
Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, Random Forests, or stacking ensembles?
It is quite possible to speed up training of a bagging ensemble by distributing it across multiple servers, since each predictor in the ensemble is independent of the others. The same goes for pasting ensembles and Random Forests, for the same reason. However, each predictor in a boosting ensemble is built based on the previous predictor, so training is necessarily sequential, and you will not gain anything by distributing training across multiple servers. Regarding stacking ensembles, all the predictors in a given layer are independent of each other, so they can be trained in parallel on multiple servers. However, the predictors in one layer can only be trained after the predictors in the previous layer have all been trained.
What is the benefit of out-of-bag evaluation?
With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated using instances that it was not trained on (they were held out). This makes it possible to have a fairly unbiased evaluation of the ensemble without the need for an additional validation set. Thus, you have more instances available for training, and your ensemble can perform slightly better.
What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?
When you are growing a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. This is true as well for Extra-Trees, but they go one step further: rather than searching for the best possible thresholds, like regular Decision Trees do, they use random thresholds for each feature. This extra randomness acts like a form of regularization: if a Random Forest overfits the training data, Extra-Trees might perform better. Moreover,
since Extra-Trees don’t search for the best possible thresholds, they are much faster to train than Random Forests. However, they are neither faster nor slower
than Random Forests when making predictions.
If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how?
If your AdaBoost ensemble underfits the training data, you can try increasing the number of estimators or reducing the regularization hyperparameters of the base estimator. You may also try slightly increasing the learning rate.
If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?
If your Gradient Boosting ensemble overfits the training set, you should try decreasing the learning rate. You could also use early stopping to find the right
number of predictors (you probably have too many).
What are the main motivations for reducing a dataset’s dimensionality?
The main motivations for dimensionality reduction are:
• To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform
better)
• To visualize the data and gain insights on the most important features
• To save space (compression)
What are the main drawbacks for reducing a dataset’s dimensionality?
The main drawbacks are:
• Some information is lost, possibly degrading the performance of subsequent training algorithms.
• It can be computationally intensive.
• It adds some complexity to your Machine Learning pipelines.
• Transformed features are often hard to interpret.
What is the curse of dimensionality?
The curse of dimensionality refers to the fact that many problems that do not exist in low-dimensional space arise in high-dimensional space. In Machine
Learning, one common manifestation is the fact that randomly sampled high-dimensional vectors are generally very sparse, increasing the risk of overfitting and making it very difficult to identify patterns in the data without having plenty of training data.
Once a dataset’s dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?
Once a dataset’s dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation,
because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other
algorithms (such as t-SNE) do not.
Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?
PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions—as in a Swiss roll dataset—then
reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.
Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?
That’s a trick question: it depends on the dataset. Let’s look at two extreme examples. First, suppose the dataset is composed of points that are almost perfectly
aligned. In this case, PCA can reduce the dataset down to just one dimension while still preserving 95% of the variance. Now imagine that the dataset is composed of perfectly random points, scattered all around the 1,000 dimensions. In this case roughly 950 dimensions are required to preserve 95% of the variance. So
the answer is, it depends on the dataset, and it could be any number between 1 and 950. Plotting the explained variance as a function of the number of dimensions is one way to get a rough idea of the dataset’s intrinsic dimensionality.
In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?
Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don’t fit in memory, but it is slower
than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, when you need to apply
PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Kernel PCA is useful for nonlinear datasets.
How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One
way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation. Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm
(e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much
information, then the algorithm should perform just as well as when using the original dataset.
Does it make any sense to chain two different dimensionality reduction algorithms?
It can absolutely make sense to chain two different dimensionality reduction algorithms. A common example is using PCA to quickly get rid of a large number of useless dimensions, then applying another much slower dimensionality reduction algorithm, such as LLE. This two-step approach will likely yield the same performance as using LLE only, but in a fraction of the time.
How would you define clustering? Can you name a few clustering algorithms?
In Machine Learning, clustering is the unsupervised task of grouping similar instances together. The notion of similarity depends on the task at hand: for example, in some cases two nearby instances will be considered similar, while in others similar instances may be far apart as long as they belong to the same densely packed group. Popular clustering algorithms include K-Means, DBSCAN, agglomerative clustering, BIRCH, Mean-Shift, affinity propagation, and spectral clustering.
What are some of the main applications of clustering algorithms?
The main applications of clustering algorithms include data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, anomaly detection, and novelty detection.
Describe two techniques to select the right number of clusters when using K-Means.
The elbow rule is a simple technique to select the number of clusters when using K-Means: just plot the inertia (the mean squared distance from each instance to its nearest centroid) as a function of the number of clusters, and find the point in the curve where the inertia stops dropping fast (the “elbow”). This is generally close to the optimal number of clusters. Another approach is to plot the silhouette score as a function of the number of clusters. There will often be a peak, and the optimal number of clusters is generally nearby. The silhouette score is the mean silhouette coefficient over all instances. This coefficient varies from +1 for instances that are well inside their cluster and far from other clusters, to –1 for instances that are very close to another cluster. You may also plot the silhouette diagrams and perform a more thorough analysis.
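A minimal sketch of both techniques with Scikit-Learn (the toy blob data is an assumption):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
    for k in range(2, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        # look for the inertia "elbow" and the silhouette peak
        print(k, km.inertia_, silhouette_score(X, km.labels_))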
What is label propagation? Why would you implement it, and how?
Labeling a dataset is costly and time-consuming. Therefore, it is common to have plenty of unlabeled instances, but few labeled instances. Label propagation is a technique that consists in copying some (or all) of the labels from the labeled
instances to similar unlabeled instances. This can greatly extend the number of labeled instances, and thereby allow a supervised algorithm to reach better performance (this is a form of semi-supervised learning). One approach is to use a clustering algorithm such as K-Means on all the instances, then for each cluster find the most common label or the label of the most representative instance (i.e., the one closest to the centroid) and propagate it to the unlabeled instances in the same cluster.
Can you name two clustering algorithms that can scale to large datasets? And two that look for regions of high density?
K-Means and BIRCH scale well to large datasets. DBSCAN and Mean-Shift look for regions of high density.
Can you think of a use case where active learning would be useful? How would you implement it?
Active learning is useful whenever you have plenty of unlabeled instances but labeling is costly. In this case (which is very common), rather than randomly selecting instances to label, it is often preferable to perform active learning, where human experts interact with the learning algorithm, providing labels for specific instances when the algorithm requests them. A common approach is uncertainty sampling (see the description in “Active Learning” on page 255).
What is the difference between anomaly detection and novelty detection?
Many people use the terms anomaly detection and novelty detection interchangeably, but they are not exactly the same. In anomaly detection, the algorithm is trained on a dataset that may contain outliers, and the goal is typically to identify these outliers (within the training set), as well as outliers among new instances. In novelty detection, the algorithm is trained on a dataset that is presumed to be “clean,” and the objective is to detect novelties strictly among new instances. Some algorithms work best for anomaly detection (e.g., Isolation Forest), while others are better suited for novelty detection (e.g., one-class SVM).
What is a Gaussian mixture? What tasks can you use it for?
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose
parameters are unknown. In other words, the assumption is that the data is grouped into a finite number of clusters, each with an ellipsoidal shape (but the clusters may have different ellipsoidal shapes, sizes, orientations, and densities), and we don’t know which cluster each instance belongs to. This model is useful for density estimation, clustering, and anomaly detection.
Can you name two techniques to find the right number of clusters when using a Gaussian mixture model?
One way to find the right number of clusters when using a Gaussian mixture model is to plot the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) as a function of the number of clusters, then choose the number of clusters that minimizes the BIC or AIC. Another technique is to use a Bayesian Gaussian mixture model, which automatically selects the number of clusters.
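A minimal sketch of the first technique, assuming Scikit-Learn’s GaussianMixture (which exposes bic() and aic() methods):

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
    for k in range(1, 7):
        gm = GaussianMixture(n_components=k, random_state=42).fit(X)
        print(k, gm.bic(X), gm.aic(X))  # pick the k that minimizes BIC/AIC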
What is θ = (X^T X)^(-1) X^T y, and when can we use it?
It is the Normal Equation, a closed-form alternative to Gradient Descent for Linear Regression when the number of features isn’t too big.
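A minimal NumPy sketch of the Normal Equation (the toy data and true weights are assumptions):

    import numpy as np

    X = np.random.randn(100, 3)
    X_b = np.c_[np.ones((100, 1)), X]  # prepend a bias column of 1s
    y = X_b @ np.array([4.0, 2.0, -1.0, 0.5]) + 0.1 * np.random.randn(100)

    # theta = (X^T X)^(-1) X^T y
    theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    print(theta)  # close to [4, 2, -1, 0.5]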
How do you turn a feature into ‘standard normal’ form (mean = 0, std = 1)?
Subtract the feature mean from each instance and divide by the feature’s standard deviation.
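In NumPy, with illustrative values:

    import numpy as np

    feature = np.array([10.0, 12.0, 14.0, 16.0])
    standardized = (feature - feature.mean()) / feature.std()
    print(standardized.mean(), standardized.std())  # ~0 and 1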
What do you do if your cost function increases after each iteration?
Make the learning rate (alpha) smaller.
When is Gradient Descent a better option than the Normal Equation?
When there are too many features (e.g., 10,000).
Why use feature scaling?
It makes Gradient Descent converge more quickly and more directly.
What are the dimensions of theta(j) in a neural network?
Θ^(j) has dimensions s_(j+1) × (s_j + 1),
e.g., (units in the hidden layer) × (units in the input layer + 1).
What is a common reason for an ML model that works well in training but fails in production?
The ML dataset was improperly created
Personalized Algorithms are often built using which type of ML model?
Recommendation systems (but you must understand and know the tools and tricks of image processing and sequence systems to understand recommendation systems).
What is a key lesson Google has learned with regards to reducing the chance of failure in production ML models?
Process batch and streaming data the same way
Which of the following scenarios may require a supervised learning model to be retrained as a new model?
The model was trained on labeled data and we now wish to correct the labels of the data.
Someone reads emails for a company and then forwards them to the appropriate department. How can we automate this process?
Use several models to read, sort, and send to departments. If there are any pre-existing models then use them.
A team is preparing to develop and deploy an ML model for use on a shopping website. They have collected a little data to train the model. The team plans on gathering more data once the model is developed. Now they are ready for the next phase, training.
Which of these scenarios will most likely lead to a successful deployment of the ML model?
The team should take time to gather more data, because with more data, it is possible to create a simpler ML model that performs better.
What are the five phases of the “Path to ML”?
Individual contributor, delegation, digitization, big data and analytics, machine learning
You are going to develop an ML model. You are in Canada and the rest of the team is in Mexico.
Your team wants to use Google Cloud Platform with Python Notebook. Which of the following statements supports your decision?
Datalab notebooks are hosted in the cloud
Your team has decided to use Compute Engine, Cloud Storage, and Datalab for ML model development.
Which two statements are applicable to your situation?
Every member of the team, regardless of their location, can directly read data from Cloud Storage.
Latency of data access can be a concern, so carefully select the zone for data storage.
The third wave of cloud is _________________ so you can focus on data ___________ instead of infrastructure.
serverless, insights
Three quality attributes of data?
Consistency, accuracy, auditability
Two categories of data quality tools?
Cleaning tools, monitoring tools
Three features of low data quality?
unreliable info, incomplete data, duplicated data
What is the Orderliness of data?
The data entered has the required format and structure
Three best practices for data quality management?
resolving missing values, preventing duplicates, automating data entry
Which is the correct sequence of steps in data science after the data is gathered (4 steps)?
Data Exploration -> Data Cleaning -> Model Building -> Present Results
Three objectives of exploratory data analysis?
Check for missing data and other mistakes, Gain maximum insight into the data set and its underlying structure, uncover a parsimonious model (the most useful features)
Two main methods for Exploratory Data Analysis?
Univariate and Bivariate
What machine learning models have labels, or in other words, the correct answers to whatever it is that we want to learn to predict?
Supervised models
Two most common types of Supervised machine learning models?
Regression model, and classification model
Which model would you use if your problem required a discrete number of values or classes?
Classification model
When the data isn’t labelled, what is an alternative way of predicting the output?
Clustering Algorithms
What is the most essential metric a regression model uses?
Mean squared error, used as the loss function
Fill in the blank. In the video, we presented a linear equation. This hypothesis equation is applied to every _________ of our dataset, where the weight values are fixed and the feature values come from each associated column of our machine learning dataset.
row
Fill in the blanks. Fundamentally, classification is about predicting a _______ and regression is about predicting a __________.
Label, Quantity
What component of a biological neuron is analogous to the input portion of a perceptron?
Dendrites
Which of the following is an algorithm for supervised learning of binary classifiers, given that a binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class?
Binary classifier, Perceptron, Linear Regression
Perceptron
Which model is the linear classifier, also used in supervised learning?
Neuron, Dendrites, Perceptron
Perceptron
A perceptron is a type of _____ that makes its predictions based on a linear predictor function combining a set of weights with the ________.
linear classifier, feature vector
Three steps in the Perceptron Learning Process
- Takes the inputs, multiplies them by their weights, and computes their sum.
- Adds a bias factor, the number 1 multiplied by a weight.
- Feeds the sum through the activation function.
Six elements of a perceptron?
- Input function X
- Bias b (constant)
- Weights
- Weighted sum
- Activation function
- Output
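A minimal NumPy sketch combining the steps and elements from the two cards above; the step activation is the classic perceptron choice, and all values are illustrative:

    import numpy as np

    def step(z):
        return np.where(z >= 0, 1, 0)  # classic perceptron activation

    def perceptron(x, w, b):
        z = np.dot(w, x) + b           # weighted sum plus bias
        return step(z)                 # fed through the activation function

    print(perceptron(np.array([1.0, 2.0]), np.array([0.5, -0.2]), b=0.1))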
Neural Networks: If I wanted my outputs to be in the form of probabilities, which activation function should I use in the final layer?
Sigmoid
A single unit for a non-input neuron has these three things:
- Weighted Sum
- Activation function
- Output of the activation function
What activation functions are needed to get the complex chain functions that allow neural networks to learn data distributions?
Nonlinear activation functions
The range of a ReLU output?
between zero and infinity
The range of Tanh output?
between -1 and 1
The range of a Sigmoid output?
between zero and 1
The range of a ELU output?
between -1 and infinity
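A small NumPy sketch of these activations and their output ranges:

    import numpy as np

    relu = lambda z: np.maximum(0.0, z)                             # [0, inf)
    sigmoid = lambda z: 1 / (1 + np.exp(-z))                        # (0, 1)
    elu = lambda z, a=1.0: np.where(z > 0, z, a * (np.exp(z) - 1))  # (-a, inf)

    z = np.linspace(-5, 5, 5)
    print(relu(z), np.tanh(z), sigmoid(z), elu(z))  # tanh: (-1, 1)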
In a decision classification tree, what does each decision or node consist of?
Linear classifier of one feature
Mean squared error minimizer and euclidean distance minimizer are used in ______, not ______.
regression, classification
What in a neural network can map to a higher-dimensional vector space?
More neurons per layer
SVM: The _____ is the distance between two separate vectors.
margin
SVM: The more generalizable the decision boundary, the ____ the margin.
wider
SVMs are used for text classification tasks such as __________, __________, and _________.
category assignment, detecting spam, sentiment analysis
SVMs are based on the idea of finding a ________ that best divides a dataset into _____ classes. ___________ are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing ______. As a simple example, for a classification task with only two features, you can think of a _______ as a ______ that ______ separates and classifies a set of data.
hyperplane, two, support vectors, hyperplane, hyperplane, line, linearly
A _______ maps the data from our ______ vector space to a vector space that has features that can be ______ separated.
kernel transformation, input, linearly
In ML, kernel methods are a class of algorithms for ________, whose best-known member is the ________.
pattern analysis, support vector machine
Dropout in neural networks works by randomly setting the _______ of hidden units to ____ at each update of training phase.
outgoing edges, 0
How does dropout help neural networks generalize?
In setting the output to 0, the cost function becomes more sensitive to neighboring neurons changing the way the weights will be updated during the process of backpropagation.
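A minimal sketch of dropout in NumPy; the rescaling by 1/(1 − rate) ("inverted dropout") is standard practice, though it goes beyond the card's description:

    import numpy as np

    def dropout(activations, rate=0.5, training=True):
        if not training:
            return activations  # dropout is disabled at inference time
        mask = np.random.rand(*activations.shape) >= rate  # randomly zero units
        return activations * mask / (1.0 - rate)           # rescale survivors

    h = np.random.randn(4, 8)  # a batch of hidden-layer activations
    print(dropout(h))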
Three types of modern neural networks.
Convolutional, modular, recurrent
Three ways to improve generalization in a NN?
Adding dropout layers, performing data augmentation, adding noise
At its core, a ________ is a method of evaluating how well your algorithm models your dataset. If your predictions are totally off, your _________ will output a higher number. If they’re pretty good, it will output a lower number. As you change pieces of your algorithm to try and improve your model, your ______ will tell you if you’re getting anywhere.
loss function
Simply speaking, __________ is the workhorse of basic loss functions. ______ is the sum of squared distances between our target variable and predicted values.
mean squared error
Loss functions can be broadly categorized into 2 types: Classification and Regression Loss. _____ is typically used for regression and ______ is typically used for classification.
mean squared error, cross entropy
Gradient Descent is an optimization algorithm used to _______ some function by iteratively moving in the direction of the steepest descent as by the _________. In machine learning, we use gradient descent to update the _______ of our model.
minimize, negative of the gradient, parameters
________, also called vanilla gradient descent, calculates the error for _______ within the training dataset, but only ________ all training examples have been evaluated does the model get updated. This whole process is like a cycle and it’s called a training epoch.
Batch gradient descent, each example, after
In the ________________________ method, one training sample (example) is passed through the neural network at a time and the parameters (weights) of each layer are updated with the computed gradient.
Stochastic Gradient Descent
________________: Parameters are updated after computing the gradient of error with respect to the entire training set
________________: Parameters are updated after computing the gradient of error with respect to a single training example
________________: Parameters are updated after computing the gradient of error with respect to a subset of the training set
Batch Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent
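A sketch contrasting the three variants on a linear-regression MSE gradient (all data and values are illustrative):

    import numpy as np

    X = np.random.randn(200, 3)
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(200)
    theta, lr = np.zeros(3), 0.1

    def grad(theta, Xb, yb):  # MSE gradient on a batch
        return 2 / len(yb) * Xb.T @ (Xb @ theta - yb)

    theta -= lr * grad(theta, X, y)                # batch: entire training set
    i = np.random.randint(len(y))
    theta -= lr * grad(theta, X[i:i+1], y[i:i+1])  # stochastic: one example
    idx = np.random.choice(len(y), 32, replace=False)
    theta -= lr * grad(theta, X[idx], y[idx])      # mini-batch: a subset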
What is a type one error?
When the model predicts positive but it’s actually a negative (predicts face when it’s a statue).
Formula for precision
True positives / (True positives + False Positives)
An increase in what factor will drive down the precision ratio?
False Positives
What is type two error?
When the model predicts negative and it’s actually a positive (predicts not face when it’s a face in winter clothes).
Formula for recall
True positives /(true positives + false negatives)
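Both formulas as a quick Python check (the counts are made up):

    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    print(precision(tp=90, fp=10))  # 0.90; more false positives drives this down
    print(recall(tp=90, fn=30))     # 0.75; more false negatives drives this down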
Why is RMSE preferred?
The loss metric output is measured in the same units as the error making it easier to directly interpret.
There will always be a ____ between the metrics we care about and the metrics that work well with gradient descent.
gap
What is the significance of performance metrics? (Name two benefits.)
Performance metrics will allow us to reject models that have settled into inappropriate minima.
- easier to understand
- directly connected to business goals
Two ways to think about recall?
1. Inversely related to precision.
2. Recall is like a person who never wants to be left out of a positive decision.
Two parameters that affect gradient descent?
1. Learning rate
2. Batch size
What is the best way to assess the quality of a model?
To observe how well a model performs against a new dataset that it hasn’t seen before
How do you decide when to stop training a model?
When your loss metrics start to increase against the validation set
What actions can you perform on your model when it is trained and validated?
You can run it once, and only once, against the independent test dataset.
Which two loss functions are the most common for regression and classification?
RMSE for linear regression, cross-entropy for classification
Which is the most preferred way to traverse loss surfaces efficiently?
By analyzing the slopes of our loss functions, which give us both a direction and a step magnitude.
What core algorithm is used to construct Decision Trees?
Greedy algorithms
The RAND function in BigQuery generates a value between ____ and ____.
zero, one
How can you create repeatable samples of your data in BigQuery?
Use the last few digits of a hash function on the field that you’re using to split or bucketize your data
What allows you to split the dataset based upon a field in your data?
FARM_FINGERPRINT, an open-source hashing algorithm that is implemented in BigQuery SQL.
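The same repeatable-split idea sketched in Python; hashlib.md5 here stands in for FARM_FINGERPRINT, which is a BigQuery SQL function:

    import hashlib

    def bucket(key, n_buckets=10):
        # Hash the split field, then bucketize; the same key always
        # lands in the same bucket, so the split is repeatable.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_buckets

    dates = ["2020-01-01", "2020-01-02", "2020-01-03"]
    train = [d for d in dates if bucket(d) < 8]  # ~80% of keys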
TensorFlow is a ______ and ______-platform programming interface for implementing and running machine learning algorithms, including convenience wrappers for deep learning.
scalable, multi
In TensorFlow, ____ are multi-dimensional arrays with a uniform type. All tensors are ____ like Python numbers and strings: you can never update the contents of a tensor, only create a new one.
tensors, immutable
How does TensorFlow represent numeric computations?
Using a Directed Acyclic Graph (or DAG)
How can we improve the calculation speed in TensorFlow, without losing accuracy?
By using a GPU.
When are tf.losses, tf.metrics, and tf.optimizers useful components?
When building custom Neural Network models.
On which processing units can you run TensorFlow?
CPU, GPU, TPU
tf.estimator, tf.keras, and tf.data are high-level APIs used for what?
distributed training
You need to build a custom NN model. What are two options?
We can use an estimator from TF, or we can use a high-level API such as Keras
Which of the following APIs is not used in the TensorFlow abstraction layers?
C++ API, Python API, tf.keras, tf.image
tf.image
Which API is used to build performant, complex input pipelines from simple, reusable pieces that will feed your model’s training or evaluation loops?
tf.data.Dataset
Two operations that can be performed on tensors?
reshaped, sliced
What rank is a tensor of shape [3, 4]?
Rank 2
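A short TensorFlow sketch tying the last few cards together (assumes TensorFlow 2.x):

    import tensorflow as tf

    t = tf.constant([[1, 2, 3, 4],
                     [5, 6, 7, 8],
                     [9, 10, 11, 12]])  # shape [3, 4] -> rank 2
    print(tf.rank(t))                   # 2
    print(tf.reshape(t, [4, 3]))        # reshaped (contents unchanged)
    print(t[:2, 1:3])                   # sliced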