Data Science Interview Flashcards
What is supervised machine learning?
Supervised learning is a type of machine learning in which our algorithms are trained using well-labeled training data, and machines predict the output based on that data. Labeled data indicates that the input data has already been tagged with the appropriate output. Basically, it is the task of learning a function that maps the input set and returns an output. Some of its examples are: Linear Regression, Logistic Regression, KNN, etc.
What is regression? Which models can you use to solve a regression problem?
Regression is a part of supervised ML. Regression models investigate the relationship between a dependent (target) and independent variable (s) (predictor). Here are some common regression models
Linear Regression establishes a linear relationship between target and predictor (s). It predicts a numeric value and has a shape of a straight line. Polynomial Regression has a regression equation with the power of independent variable more than 1. It is a curve that fits into the data points. Ridge Regression helps when predictors are highly correlated (multicollinearity problem). It penalizes the squares of regression coefficients but doesn’t allow the coefficients to reach zeros (uses L2 regularization). Lasso Regression penalizes the absolute values of regression coefficients and allows some of the coefficients to reach absolute zero (thereby allowing feature selection).
What is linear regression? When do we use it?
Linear regression is a model that assumes a linear relationship between the input variables (X) and the single output variable (y).
The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.
What are the main assumptions of linear regression?
There are several assumptions of linear regression. If any of them is violated, model predictions and interpretation may be worthless or misleading.
Linear relationship between features and target variable. Additivity means that the effect of changes in one of the features on the target variable does not depend on values of other features. For example, a model for predicting revenue of a company have of two features - the number of items a sold and the number of items b sold. When company sells more items a the revenue increases and this is independent of the number of items b sold. But, if customers who buy a stop buying b, the additivity assumption is violated. Features are not correlated (no collinearity) since it can be difficult to separate out the individual effects of collinear features on the target variable. Errors are independently and identically normally distributed (yi = B0 + B1*x1i + ... + errori): No correlation between errors (consecutive errors in the case of time series data). Constant variance of errors - homoscedasticity. For example, in case of time series, seasonal patterns can increase errors in seasons with higher activity. Errors are normaly distributed, otherwise some features will have more influence on the target variable than to others. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.
What’s the normal distribution? Why do we care about it? 👶
The normal distribution is a continuous probability distribution whose probability density function takes the following formula:
formula
where μ is the mean and σ is the standard deviation of the distribution.
The normal distribution derives its importance from the Central Limit Theorem, which states that if we draw a large enough number of samples, their mean will follow a normal distribution regardless of the initial distribution of the sample, i.e the distribution of the mean of the samples is normal. It is important that each sample is independent from the other.
This is powerful because it helps us study processes whose population distribution is unknown to us.
How do we check if a variable follows the normal distribution?
- Plot a histogram out of the sampled data. If you can fit the bell-shaped “normal” curve to the histogram, then the hypothesis that the underlying random variable follows the normal distribution can not be rejected.
- Check Skewness and Kurtosis of the sampled data. Skewness = 0 and kurtosis = 3 are typical for a normal distribution, so the farther away they are from these values, the more non-normal the distribution.
- Use Kolmogorov-Smirnov or/and Shapiro-Wilk tests for normality. They take into account both Skewness and Kurtosis simultaneously.
- Check for Quantile-Quantile plot. It is a scatterplot created by plotting two sets of quantiles against one another. Normal Q-Q plot place the data points in a roughly straight line.
What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices?
Data is not normal. Specially, real-world datasets or uncleaned datasets always have certain skewness. Same goes for the price prediction. Price of houses or any other thing under consideration depends on a number of factors. So, there’s a great chance of presence of some skewed values i.e outliers if we talk in data science terms.
Yes, you may need to do pre-processing. Most probably, you will need to remove the outliers to make your distribution near-to-normal.
What methods for solving linear regression do you know? ⭐️
- Matrix Algebra method
- Singular Value Decomposition (SVD)
- Gradient Descent
What is gradient descent? How does it work? ⭐️
Gradient descent is an algorithm that uses calculus concept of gradient to try and reach local or global minima. It works by taking the negative of the gradient in a point of a given function, and updating that point repeatedly using the calculated negative gradient, until the algorithm reaches a local or global minimum, which will cause future iterations of the algorithm to return values that are equal or too close to the current point. It is widely used in machine learning applications.
What is the normal equation? ⭐️
Normal equations are equations obtained by setting equal to zero the partial derivatives of the sum of squared errors (least squares); normal equations allow one to estimate the parameters of a multiple linear regression.
What is SGD — stochastic gradient descent? What’s the difference with the usual gradient descent? ⭐️
In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.
While in GD, you have to run through ALL the samples in your training set to do a single update for a parameter in a particular iteration, in SGD, on the other hand, you use ONLY ONE or SUBSET of training sample from your training set to do the update for a parameter in a particular iteration. If you use SUBSET, it is called Minibatch Stochastic gradient Descent.
Which metrics for evaluating regression models do you know? 👶
Mean Squared Error(MSE)
Root Mean Squared Error(RMSE)
Mean Absolute Error(MAE)
R² or Coefficient of Determination
Adjusted R²
What are MSE and RMSE? 👶
MSE stands for Mean Square Error while RMSE stands for Root Mean Square Error. They are metrics with which we can evaluate models.
What is the bias-variance trade-off? 👶
Bias is the error introduced by approximating the true underlying function, which can be quite complex, by a simpler model. Variance is a model sensitivity to changes in the training dataset.
Bias-variance trade-off is a relationship between the expected test error and the variance and the bias - both contribute to the level of the test error and ideally should be as small as possible:
ExpectedTestError = Variance + Bias² + IrreducibleError
But as a model complexity increases, the bias decreases and the variance increases which leads to overfitting. And vice versa, model simplification helps to decrease the variance but it increases the bias which leads to underfitting.
What is overfitting? 👶
When your model perform very well on your training set but can’t generalize the test set, because it adjusted a lot to the training set.
How to validate your models? 👶
One of the most common approaches is splitting data into train, validation and test parts. Models are trained on train data, hyperparameters (for example early stopping) are selected based on the validation data, the final measurement is done on test dataset. Another approach is cross-validation: split dataset into K folds and each time train models on training folds and measure the performance on the validation folds. Also you could combine these approaches: make a test/holdout dataset and do cross-validation on the rest of the data. The final quality is measured on test dataset.
Why do we need to split our data into three parts: train, validation, and test? 👶
The training set is used to fit the model, i.e. to train the model with the data. The validation set is then used to provide an unbiased evaluation of a model while fine-tuning hyperparameters. This improves the generalization of the model. Finally, a test data set which the model has never “seen” before should be used for the final evaluation of the model. This allows for an unbiased evaluation of the model. The evaluation should never be performed on the same data that is used for training. Otherwise the model performance would not be representative.
Can you explain how cross-validation works? 👶
Cross-validation is the process to separate your total training set into two subsets: training and validation set, and evaluate your model to choose the hyperparameters. But you do this process iteratively, selecting differents training and validation set, in order to reduce the bias that you would have by selecting only one validation set.
What is K-fold cross-validation? 👶
K fold cross validation is a method of cross validation where we select a hyperparameter k. The dataset is now divided into k parts. Now, we take the 1st part as validation set and remaining k-1 as training set. Then we take the 2nd part as validation set and remaining k-1 parts as training set. Like this, each part is used as validation set once and the remaining k-1 parts are taken together and used as training set. It should not be used in a time series data.
How do we choose K in K-fold cross-validation? What’s your favorite K? 👶
There are two things to consider while deciding K: the number of models we get and the size of validation set. We do not want the number of models to be too less, like 2 or 3. At least 4 models give a less biased decision on the metrics. On the other hand, we would want the dataset to be at least 20-25% of the entire data. So that at least a ratio of 3:1 between training and validation set is maintained.
I tend to use 4 for small datasets and 5 for large ones as K.
What is classification? Which models would you use to solve a classification problem? 👶
Classification problems are problems in which our prediction space is discrete, i.e. there is a finite number of values the output variable can be. Some models which can be used to solve classification problems are: logistic regression, decision tree, random forests, multi-layer perceptron, one-vs-all, amongst others.
What is logistic regression? When do we need to use it? 👶
Logistic regression is a Machine Learning algorithm that is used for binary classification. You should use logistic regression when your Y variable takes only two values, e.g. True and False, “spam” and “not spam”, “churn” and “not churn” and so on. The variable is said to be a “binary” or “dichotomous”.
Is logistic regression a linear model? Why? 👶
Yes, Logistic Regression is considered a generalized linear model because the outcome always depends on the sum of the inputs and parameters. Or in other words, the output cannot depend on the product (or quotient, etc.) of its parameters.
What is sigmoid? What does it do? 👶
A sigmoid function is a type of activation function, and more specifically defined as a squashing function. Squashing functions limit the output to a range between 0 and 1, making these functions useful in the prediction of probabilities.
How do we evaluate classification models? 👶
Accuracy
Precision
Recall
F1 Score
Logistic loss (also known as Cross-entropy loss)
Jaccard similarity coefficient score
What is accuracy? 👶
Accuracy is a metric for evaluating classification models. It is calculated by dividing the number of correct predictions by the number of total predictions.
Is accuracy always a good metric? 👶
Accuracy is not a good performance metric when there is imbalance in the dataset. For example, in binary classification with 95% of A class and 5% of B class, a constant prediction of A class would have an accuracy of 95%. In case of imbalance dataset, we need to choose Precision, recall, or F1 Score depending on the problem we are trying to solve.
What is the confusion table? What are the cells in this table? 👶
Confusion table (or confusion matrix) shows how many True positives (TP), True Negative (TN), False Positive (FP) and False Negative (FN) model has made.
What are precision, recall, and F1-score? 👶
Precision and recall are classification evaluation metrics:
P = TP / (TP + FP) and R = TP / (TP + FN).
Where TP is true positives, FP is false positives and FN is false negatives
In both cases the score of 1 is the best: we get no false positives or false negatives and only true positives.
F1 is a combination of both precision and recall in one score (harmonic mean):
F1 = 2 * PR / (P + R).
Max F score is 1 and min is 0, with 1 being the best.
Precision-recall trade-off ⭐️
Tradeoff means increasing one parameter would lead to decreasing of other. Precision-recall tradeoff occur due to increasing one of the parameter(precision or recall) while keeping the model same.
In an ideal scenario where there is a perfectly separable data, both precision and recall can get maximum value of 1.0. But in most of the practical situations, there is noise in the dataset and the dataset is not perfectly separable. There might be some points of positive class closer to the negative class and vice versa. In such cases, shifting the decision boundary can either increase the precision or recall but not both. Increasing one parameter leads to decreasing of the other.
What is the ROC curve? When to use it? ⭐️
ROC stands for Receiver Operating Characteristics. The diagrammatic representation that shows the contrast between true positive rate vs false positive rate. It is used when we need to predict the probability of the binary outcome.
What is AUC (AU ROC)? When to use it? ⭐️
AUC stands for Area Under the ROC Curve. ROC is a probability curve and AUC represents degree or measure of separability. It’s used when we need to value how much model is capable of distinguishing between classes. The value is between 0 and 1, the higher the better.
How to interpret the AU ROC score? ⭐️
AUC score is the value of Area Under the ROC Curve.
An excellent model has AUC near to the 1 which means it has good measure of separability. A poor model has AUC near to the 0 which means it has worst measure of separability. When AUC score is 0.5, it means model has no class separation capacity whatsoever.
What is the PR (precision-recall) curve? ⭐️
A precision-recall curve (or PR Curve) is a plot of the precision (y-axis) and the recall (x-axis) for different probability thresholds. Precision-recall curves (PR curves) are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance.
What is the area under the PR curve? Is it a useful metric? ⭐️I
A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate.
In which cases AU PR is better than AU ROC? ⭐️
What is different however is that AU ROC looks at a true positive rate TPR and false positive rate FPR while AU PR looks at positive predictive value PPV and true positive rate TPR.
Typically, if true negatives are not meaningful to the problem or you care more about the positive class, AU PR is typically going to be more useful; otherwise, If you care equally about the positive and negative class or your dataset is quite balanced, then going with AU ROC is a good idea.
What do we do with categorical variables? ⭐️
Categorical variables must be encoded before they can be used as features to train a machine learning model. There are various encoding techniques, including:
One-hot encoding Label encoding Ordinal encoding Target encoding
Why do we need one-hot encoding? ⭐️
If we simply encode categorical variables with a Label encoder, they become ordinal which can lead to undesirable consequences. In this case, linear models will treat category with id 4 as twice better than a category with id 2. One-hot encoding allows us to represent a categorical variable in a numerical vector space which ensures that vectors of each category have equal distances between each other. This approach is not suited for all situations, because by using it with categorical variables of high cardinality (e.g. customer id) we will encounter problems that come into play because of the curse of dimensionality.
What is “curse of dimensionality”? ⭐️
The curse of dimensionality is an issue that arises when working with high-dimensional data. It is often said that “the curse of dimensionality” is one of the main problems with machine learning. The curse of dimensionality refers to the fact that, as the number of dimensions (features) in a data set increases, the number of data points required to accurately learn the relationships between those features increases exponentially.
A simple example where we have a data set with two features, x1 and x2. If we want to learn the relationship between these two features, we need to have enough data points so that we can accurately estimate the parameters of that relationship. However, if we add a third feature, x3, then the number of data points required to accurately learn the relationships between all three features increases exponentially. This is because there are now more parameters to estimate, and the number of data points needed to accurately estimate those parameters increases exponentially with the number of parameters.
Simply put, the curse of dimensionality basically means that the error increases with the increase in the number of features.
What happens to our linear regression model if we have three columns in our data: x, y, z — and z is a sum of x and y? ⭐️
We would not be able to perform the resgression. Because z is linearly dependent on x and y so when performing the regression XTX would be a singular (not invertible) matrix.
What is regularization? Why do we need it? 👶
Regularization is used to reduce overfitting in machine learning models. It helps the models to generalize well and make them robust to outliers and noise in the data.
Which regularization techniques do you know? ⭐️
There are mainly two types of regularization,
L1 Regularization (Lasso regularization) - Adds the sum of absolute values of the coefficients to the cost function. L2 Regularization (Ridge regularization) - Adds the sum of squares of coefficients to the cost function.
What kind of regularization techniques are applicable to linear models? ⭐️
AIC/BIC, Ridge regression, Lasso, Elastic Net, Basis pursuit denoising, Rudin–Osher–Fatemi model (TV), Potts model, RLAD, Dantzig Selector,SLOPE
How does L2 regularization look like in a linear model? ⭐️
L2 regularization adds a penalty term to our cost function which is equal to the sum of squares of models coefficients multiplied by a lambda hyperparameter. This technique makes sure that the coefficients are close to zero and is widely used in cases when we have a lot of features that might correlate with each other.
How do we select the right regularization parameters? 👶
Regularization parameters can be chosen using a grid search, for example https://scikit-learn.org/stable/modules/linear_model.html has one formula for the implementing for regularization, alpha in the formula mentioned can be found by doing a RandomSearch or a GridSearch on a set of values and selecting the alpha which gives the least cross validation or validation error.
What’s the effect of L2 regularization on the weights of a linear model? ⭐️
L2 regularization penalizes larger weights more severely (due to the squared penalty term), which encourages weight values to decay toward zero.
How L1 regularization looks like in a linear model? ⭐️
L1 regularization adds a penalty term to our cost function which is equal to the sum of modules of models coefficients multiplied by a lambda hyperparamete
What’s the difference between L2 and L1 regularization? ⭐️
Penalty terms: L1 regularization uses the sum of the absolute values of the weights, while L2 regularization uses the sum of the weights squared.
Feature selection: L1 performs feature selection by reducing the coefficients of some predictors to 0, while L2 does not.
Computational efficiency: L2 has an analytical solution, while L1 does not.
Multicollinearity: L2 addresses multicollinearity by constraining the coefficient norm.
Can we have both L1 and L2 regularization components in a linear model? ⭐️
Yes, elastic net regularization combines L1 and L2 regularization.
What’s the interpretation of the bias term in linear models? ⭐️
Bias is simply, a difference between predicted value and actual/true value. It can be interpreted as the distance from the average prediction and true value i.e. true value minus mean(predictions). But dont get confused between accuracy and bias.
How do we interpret weights in linear models? ⭐️
Without normalizing weights or variables, if you increase the corresponding predictor by one unit, the coefficient represents on average how much the output changes. By the way, this interpretation still works for logistic regression - if you increase the corresponding predictor by one unit, the weight represents the change in the log of the odds.
If the variables are normalized, we can interpret weights in linear models like the importance of this variable in the predicted result.
If a weight for one variable is higher than for another — can we say that this variable is more important? ⭐️
Yes - if your predictor variables are normalized.
Without normalization, the weight represents the change in the output per unit change in the predictor. If you have a predictor with a huge range and scale that is used to predict an output with a very small range - for example, using each nation’s GDP to predict maternal mortality rates - your coefficient should be very small. That does not necessarily mean that this predictor variable is not important compared to the others.
When do we need to perform feature normalization for linear models? When it’s okay not to do it? ⭐️
Feature normalization is necessary for L1 and L2 regularizations. The idea of both methods is to penalize all the features relatively equally. This can’t be done effectively if every feature is scaled differently.
Linear regression without regularization techniques can be used without feature normalization. Also, regularization can help to make the analytical solution more stable, — it adds the regularization matrix to the feature matrix before inverting it.
What is feature selection? Why do we need it? 👶
Feature Selection is a method used to select the relevant features for the model to train on. We need feature selection to remove the irrelevant features which leads the model to under-perform.
Is feature selection important for linear models? ⭐️
Yes, It is. It can make model performance better through selecting the most importance features and remove irrelanvant features in order to make a prediction and it can also avoid overfitting, underfitting and bias-variance tradeoff.
Which feature selection techniques do you know? ⭐️
Here are some of the feature selections:
Principal Component Analysis Neighborhood Component Analysis ReliefF Algorithm
Can we use L1 regularization for feature selection? ⭐️
Yes, because the nature of L1 regularization will lead to sparse coefficients of features. Feature selection can be done by keeping only features with non-zero coefficients.
Can we use L2 regularization for feature selection? ⭐️
No, Because L2 regularization doesnot make the weights zero but only makes them very very small. L2 regularization can be used to solve multicollinearity since it stablizes the model.
What are the decision trees? 👶
This is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables.
In this algorithm, we split the population into two or more homogeneous sets. This is done based on most significant attributes/ independent variables to make as distinct groups as possible.
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a value for the target variable.
Various techniques : like Gini, Information Gain, Chi-square, entropy.
How do we train decision trees? ⭐️
Start at the root node.
For each variable X, find the set S_1 that minimizes the sum of the node impurities in the two child nodes and choose the split {X,S} that gives the minimum over all X and S.
If a stopping criterion is reached, exit. Otherwise, apply step 2 to each child node in turn.
What are the main parameters of the decision tree model? 👶
maximum tree depth
minimum samples per leaf node
impurity criterion
How do we handle categorical variables in decision trees? ⭐️
Some decision tree algorithms can handle categorical variables out of the box, others cannot. However, we can transform categorical variables, e.g. with a binary or a one-hot encoder.
What are the benefits of a single decision tree compared to more complex models? ⭐️
easy to implement
fast training
fast inference
good explainability
How can we know which features are more important for the decision tree model? ⭐️
Often, we want to find a split such that it minimizes the sum of the node impurities. The impurity criterion is a parameter of decision trees. Popular methods to measure the impurity are the Gini impurity and the entropy describing the information gain.
What is random forest? 👶
Random Forest is a machine learning method for regression and classification which is composed of many decision trees. Random Forest belongs to a larger class of ML algorithms called ensemble methods (in other words, it involves the combination of several models to solve a single prediction problem).
Why do we need randomization in random forest? ⭐️
Random forest in an extention of the bagging algorithm which takes random data samples from the training dataset (with replacement), trains several models and averages predictions. In addition to that, each time a split in a tree is considered, random forest takes a random sample of m features from full set of n features (with replacement) and uses this subset of features as candidates for the split (for example, m = sqrt(n)).
Training decision trees on random data samples from the training dataset reduces variance. Sampling features for each split in a decision tree decorrelates trees.
What are the main parameters of the random forest model? ⭐️
max_depth: Longest Path between root node and the leaf
min_sample_split: The minimum number of observations needed to split a given node
max_leaf_nodes: Conditions the splitting of the tree and hence, limits the growth of the trees
min_samples_leaf: minimum number of samples in the leaf node
n_estimators: Number of trees
max_sample: Fraction of original dataset given to any individual tree in the given model
max_features: Limits the maximum number of features provided to trees in random forest model
How do we select the depth of the trees in random forest? ⭐️
The greater the depth, the greater amount of information is extracted from the tree, however, there is a limit to this, and the algorithm even if defensive against overfitting may learn complex features of noise present in data and as a result, may overfit on noise. Hence, there is no hard thumb rule in deciding the depth, but literature suggests a few tips on tuning the depth of the tree to prevent overfitting:
limit the maximum depth of a tree limit the number of test nodes limit the minimum number of objects at a node required to split do not split a node when, at least, one of the resulting subsample sizes is below a given threshold stop developing a node if it does not sufficiently improve the fit.
How do we know how many trees we need in random forest? ⭐️
The number of trees in random forest is worked by n_estimators, and a random forest reduces overfitting by increasing the number of trees. There is no fixed thumb rule to decide the number of trees in a random forest, it is rather fine tuned with the data, typically starting off by taking the square of the number of features (n) present in the data followed by tuning until we get the optimal results.
What happens when we have correlated features in our data? ⭐️
In random forest, since random forest samples some features to build each tree, the information contained in correlated features is twice as much likely to be picked than any other information contained in other features.
In general, when you are adding correlated features, it means that they linearly contains the same information and thus it will reduce the robustness of your model. Each time you train your model, your model might pick one feature or the other to “do the same job” i.e. explain some variance, reduce entropy, etc.
What is gradient boosting trees? ⭐️
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
What’s the difference between random forest and gradient boosting? ⭐️
Random Forests builds each tree independently while Gradient Boosting builds one tree at a time.
Random Forests combine results at the end of the process (by averaging or “majority rules”) while Gradient Boosting combines results along the way.
Is it possible to parallelize training of a gradient boosting model? How to do it? ⭐️
Yes, different frameworks provide different options to make training faster, using GPUs to speed up the process by making it highly parallelizable.For example, for XGBoost tree_method = ‘gpu_hist’ option makes training faster by use of GPUs.
What are the main parameters in the gradient boosting model? ⭐️
There are many parameters, but below are a few key defaults.
learning_rate=0.1 (shrinkage). n_estimators=100 (number of trees). max_depth=3. min_samples_split=2. min_samples_leaf=1. subsample=1.0.
How do you approach tuning parameters in XGBoost or LightGBM? 🚀
Depending upon the dataset, parameter tuning can be done manually or using hyperparameter optimization frameworks such as optuna and hyperopt. In manual parameter tuning, we need to be aware of max-depth, min_samples_leaf and min_samples_split so that our model does not overfit the data but try to predict generalized characteristics of data (basically keeping variance and bias low for our model).
How do you select the number of trees in the gradient boosting model? ⭐️
Most implementations of gradient boosting are configured by default with a relatively small number of trees, such as hundreds or thousands. Using scikit-learn we can perform a grid search of the n_estimators model parameter
Which hyper-parameter tuning strategies (in general) do you know? ⭐️
There are several strategies for hyper-tuning but I would argue that the three most popular nowadays are the following:
Grid Search is an exhaustive approach such that for each hyper-parameter, the user needs to manually give a list of values for the algorithm to try. After these values are selected, grid search then evaluates the algorithm using each and every combination of hyper-parameters and returns the combination that gives the optimal result (i.e. lowest MAE). Because grid search evaluates the given algorithm using all combinations, it's easy to see that this can be quite computationally expensive and can lead to sub-optimal results specifically since the user needs to specify specific values for these hyper-parameters, which is prone for error and requires domain knowledge. Random Search is similar to grid search but differs in the sense that rather than specifying which values to try for each hyper-parameter, an upper and lower bound of values for each hyper-parameter is given instead. With uniform probability, random values within these bounds are then chosen and similarly, the best combination is returned to the user. Although this seems less intuitive, no domain knowledge is necessary and theoretically much more of the parameter space can be explored. In a completely different framework, Bayesian Optimization is thought of as a more statistical way of optimization and is commonly used when using neural networks, specifically since one evaluation of a neural network can be computationally costly. In numerous research papers, this method heavily outperforms Grid Search and Random Search and is currently used on the Google Cloud Platform as well as AWS. Because an in-depth explanation requires a heavy background in bayesian statistics and gaussian processes (and maybe even some game theory), a "simple" explanation is that a much simpler/faster acquisition function intelligently chooses (using a surrogate function such as probability of improvement or GP-UCB) which hyper-parameter values to try on the computationally expensive, original algorithm. Using the result of the initial combination of values on the expensive/original function, the acquisition function takes the result of the expensive/original algorithm into account and uses it as its prior knowledge to again come up with another set of hyper-parameters to choose during the next iteration. This process continues either for a specified number of iterations or for a specified amount of time and similarly the combination of hyper-parameters that performs the best on the expensive/original algorithm is chosen.
What kind of problems neural nets can solve? 👶
Neural nets are good at solving non-linear problems. Some good examples are problems that are relatively easy for humans (because of experience, intuition, understanding, etc), but difficult for traditional regression models: speech recognition, handwriting recognition, image identification, etc.