General Machine Learning Flashcards

1
Q

Why do you want to lock away a test set right from the beginning?

A

If you or your algorithm look at the test data, it increases the likelihood that your model will be biased. The bias we are trying to avoid is data snooping bias.

2
Q

What is the data snooping bias?

A

Data snooping bias is a statistical bias that appears when exhaustively searching for combinations of variables: the probability that a result arose by pure chance grows with the number of combinations tested.

3
Q

What is the sampling bias?

A

Sampling bias is a bias in which a sample is collected in such a way that some members of the intended population have a lower sampling probability than others. The result is a biased, non-random sample of the population.

4
Q

What is the confirmation bias?

A

Confirmation bias is the tendency to process information by looking for, or interpreting, information that is consistent with one’s existing beliefs.

5
Q

What is the exclusion bias?

A

Exclusion bias happens as a result of excluding some features from our dataset, usually under the umbrella of cleaning our data, because we think they are irrelevant. For example, in the Titanic survival prediction problem, one might disregard the passenger ID of the travelers, thinking it is completely irrelevant, without knowing that Titanic passengers were assigned rooms according to their passenger ID: the smaller the ID number, the closer to the lifeboats.

6
Q

What is the observer bias?

A

The tendency to see what we expect to see, or what we want to see. When researchers study a certain group, they usually come to the experiment with prior knowledge and subjective feelings about the group being studied.

7
Q

What is prejudice bias?

A

Prejudice bias happens as a result of cultural influences or stereotypes in the data. Example: a computer vision program that detects people at work, trained on Google Images. It will be fed thousands of images of men coding and women cooking, so the model might conclude that only men code and only women cook.

8
Q

What is measurement bias?

A

Systematic value distortion happens when there is an issue with the device used to observe or measure. This kind of bias tends to skew the data in a particular direction. Example: shooting image data with a camera that increases the brightness. This flawed measurement tool fails to replicate the environment in which the model will operate.

9
Q

What are the eight main steps of a machine learning project?

A

1) Frame the problem and look at the big picture. 2) Get the data. 3) Explore the data to gain insights. 4) Prepare the data. 5) Explore many different models and shortlist the best ones. 6) Fine-tune your models and combine them into a great solution. 7) Present your solution. 8) Launch, monitor, and maintain your system.

10
Q

When framing the problem, which question should you ask yourself?

A

1) What is the objective in business terms? 2) How will the solution be used? 3) How should performance be measured, and is it aligned with the business objective? 4) What would be the minimum performance needed to reach the business objective? 5) List and verify the validity of your assumptions.

11
Q

In the "get the data" step, what do you need to verify (5)?

A

1) List the data you need and how much you need. 2) Find and document where you can get the data. 3) Check legal obligations. 4) Ensure sensitive information is deleted or protected. 5) Sample a test set, put it aside, and never look at it (no data snooping).

12
Q

What do we mean by exploring the data (5 points)?

A

1) Study each attribute and its characteristics. 2) Check the percentage of missing values. 3) Identify the target attribute. 4) Visualize the data. 5) Study the correlations.

13
Q

What do we mean by preparing the data (7 points)?

A

1) Make sure to work on copies of the data (keep the original dataset intact). 2) Write functions for all data transformations. 3) Fix or remove outliers. 4) Fill in missing data. 5) Feature selection: drop the attributes that provide no useful information. 6) Feature scaling. 7) Change the type of data where appropriate, for example from continuous to discrete.

14
Q

What do we mean by shortlist promising models?

A

1) Train many quick-and-dirty models and compare their performance. 2) For each model, use N-fold cross-validation. 3) Analyze the most significant variables for each algorithm. 4) Analyze the types of errors the models make. 5) Perform a quick round of feature selection.

15
Q

What do we mean by fine tuning the system?

A

1) You will want to use as much data as possible for this step. 2) Fine-tune the hyperparameters using cross-validation. 3) Try ensemble methods; combining your best models will often produce better performance than running them individually. 4) Once you are confident about your final model, measure its performance on the test set to estimate the generalization error. 5) Note: do not tweak your model after measuring the generalization error; you would just overfit the test set.

16
Q

When presenting your solution, do not forget to: (6)

A

1) Document what you have done. 2) Create a nice presentation; make sure you highlight the big picture first. 3) Explain why your solution achieves the business objective. 4) Present interesting points you noticed along the way. 5) List your system's limitations. 6) Ensure your key findings are communicated through easy-to-remember statements. For example: the median income is the number one predictor of housing prices.

17
Q

What is a model validation technique?

A

It is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.

18
Q

What is the goal of cross-validation?

A

The goal of cross-validation is to test the model’s ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).

19
Q

What are the 3 strategies to deal with missing values?

A

1) Get rid of the corresponding rows. 2) Get rid of the whole attribute. 3) Set the missing values to some value (zero, the mean, the median, etc.).

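A minimal sketch of the three strategies in Python, assuming pandas and Scikit-Learn are available; the DataFrame and column names are made up for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with a missing value
df = pd.DataFrame({"rooms": [3.0, 4.0, None, 5.0],
                   "price": [200.0, 250.0, 180.0, 300.0]})

# 1) Get rid of the corresponding rows
option1 = df.dropna(subset=["rooms"])

# 2) Get rid of the whole attribute
option2 = df.drop(columns=["rooms"])

# 3) Set the missing values to some value (here, the median)
option3 = df.copy()
option3[["rooms"]] = SimpleImputer(strategy="median").fit_transform(df[["rooms"]])
```
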
20
Q

One pro and one con of mean imputation.

A

Pro: the other attributes of the row are still used in our model. Con: the standard deviation is artificially lowered; your model thinks it has more data than it really does for the given attribute.

21
Q

What do we mean by one-hot encoding, and when is it used?

A

When we have an array of categorical values and it is not clear whether there is any order to the set, we create one binary attribute per category. Only one attribute will be equal to 1 (hot) and all the others will be 0 (cold).

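A small sketch using Scikit-Learn's OneHotEncoder; the categorical attribute and its values are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical attribute with no natural order
categories = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

encoder = OneHotEncoder()                    # output is a SciPy sparse matrix by default
one_hot = encoder.fit_transform(categories)

print(encoder.categories_)   # one binary column per category
print(one_hot.toarray())     # each row has a single 1 (hot); the rest are 0 (cold)
```
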
22
Q

What is a sparse matrix and why do we use it ?

A

A sparse matrix is a matrix in which most entries are zero. Substantial memory requirement reductions can be realized by storing only the non-zero entries.

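A small sketch with SciPy comparing dense and CSR (compressed sparse row) storage for a mostly-zero matrix; the matrix shape and values are arbitrary:

```python
import numpy as np
from scipy import sparse

# A mostly-zero matrix, stored densely and in CSR (compressed sparse row) form
dense = np.zeros((1000, 1000))
dense[0, 0] = 1.0
dense[500, 250] = 2.0

sparse_version = sparse.csr_matrix(dense)  # stores only the non-zero entries

dense_bytes = dense.nbytes  # ~8 MB for the dense array
sparse_bytes = (sparse_version.data.nbytes
                + sparse_version.indices.nbytes
                + sparse_version.indptr.nbytes)  # a few KB at most

print(dense_bytes, sparse_bytes)
```
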
23
Q

What is the difference between univariate and multivariate imputing ?

A

One type of imputation algorithm is univariate, which imputes values in the i-th feature dimension using only non-missing values in that feature dimension (e.g. impute.SimpleImputer). By contrast, multivariate imputation algorithms use the entire set of available feature dimensions to estimate the missing values (e.g. impute.IterativeImputer).

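A minimal sketch contrasting the two, using the Scikit-Learn classes mentioned above (note that IterativeImputer is still exposed through an experimental import); the data is made up:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- needed to expose IterativeImputer
from sklearn.impute import SimpleImputer, IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, np.nan],
              [np.nan, 10.0]])

# Univariate: each column is imputed using only that column (here, its mean)
univariate = SimpleImputer(strategy="mean").fit_transform(X)

# Multivariate: each feature with missing values is modelled from the other features
multivariate = IterativeImputer(random_state=0).fit_transform(X)

print(univariate)
print(multivariate)
```
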
24
Q

What is feature scaling, why do we scale the features, and what are the two most common ways to scale them?

A

1) Feature scaling is transforming the data so that all features are on the same scale. 2) With few exceptions, machine learning algorithms do not perform well when the input numerical attributes have very different scales. 3) Most optimization algorithms will also slow down considerably if the parameters do not have the same scale. 4) The two most common ways are min-max scaling and standardization.

25
Q

What is min-max scaling (often called normalization)?

A

It is a form of feature scaling. The values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called MinMaxScaler for this.

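A minimal sketch, assuming Scikit-Learn is available; the values are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])  # made-up values

# Manual min-max scaling: (x - min) / (max - min)
manual = (X - X.min()) / (X.max() - X.min())

# The same with Scikit-Learn's MinMaxScaler (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(X)

print(manual.ravel())
print(scaled.ravel())
```
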
26
Q

What is standardization ?

A

It is a form of feature scaling. In statistics, standardization is the process of putting different variables on the same scale: we transform the data so it has mean 0 and standard deviation 1. To standardize a variable, you calculate its mean and standard deviation; then, for each observed value of the variable, you subtract the mean and divide by the standard deviation.

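A minimal sketch of standardization, done manually and with Scikit-Learn's StandardScaler; the values are made up:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])  # made-up values

# Manual standardization: subtract the mean, divide by the standard deviation
manual = (X - X.mean()) / X.std()

# The same with Scikit-Learn's StandardScaler
scaled = StandardScaler().fit_transform(X)

print(manual.ravel())
print(scaled.ravel())  # mean 0, standard deviation 1
```
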
27
Q

What are the main ways to fix underfitting?

A

1) Select a more powerful model 2) feed the training algorithm with better features 3) Reduce the constraints on the model.

28
Q

What is ensemble learning ?

A

Ensemble learning is a machine learning paradigm where multiple models (often called “weak learners”) are trained to solve the same problem and combined to get better results.

29
Q

What is the bias error?

A

The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

30
Q

What is the variance error ?

A

The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).

31
Q

What is the variance bias trade off?

A

In statistics and machine learning, the bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.

32
Q

How to detect if we are overfitting ?

A

If our model does much better on the training set than on the test set, then we’re likely overfitting.

33
Q

How to prevent overfitting ?

A

1) Cross-validation. 2) Train with more data. 3) Remove features. 4) Early stopping when a model is trained iteratively; there is a point of diminishing returns (mainly in deep learning). 5) Regularization. 6) Ensemble learning.

34
Q

Why should we save every model we experiment with?

A

So that you can easily come back to any model you want. Make sure you save both the hyperparameters and the trained parameters, as well as the cross-validation scores and perhaps the actual predictions.

35
Q

What is a model parameter?

A

A model parameter is a configuration variable that is internal to the model and whose value can be estimated from the data, for example the mean and variance. - They are required by the model when making predictions. - Their values define the skill of the model on your problem. - They are estimated or learned from data.

36
Q

What is a model hyperparameter?

A

A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data. They are often used in processes to help estimate model parameters. If you have to specify a value manually, then it is probably a model hyperparameter. Some examples include: the learning rate for training a neural network; the C and sigma hyperparameters for support vector machines; the k in k-nearest neighbors.

37
Q

What do we mean by fine tuning the hyperparameter ?

A

It is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.

38
Q

What are the 6 main approaches to tuning hyperparameters?

A

1) Grid search 2) Random search 3) Bayesian optimization 4) Gradient-based optimization 5) Evolutionary optimization 6) Population-based optimization

39
Q

What is the main idea behind grid search?

A

Grid search exhaustively trains and evaluates a model for every combination of the hyperparameter values you specify. To give a concrete example, if you're using a support vector machine, you could try different values for gamma and C: grid search would train an SVM for each pair of (gamma, C) values, evaluate each using cross-validation, and select the one that did best.

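A minimal sketch of the (gamma, C) example using Scikit-Learn's GridSearchCV; the dataset and grid values are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every (gamma, C) pair in this grid is trained and scored with cross-validation
param_grid = {"gamma": [0.01, 0.1, 1.0], "C": [0.1, 1.0, 10.0]}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the (gamma, C) pair that did best
print(search.best_score_)   # its mean cross-validation score
```
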
40
Q

What is the common "recipe" that almost all machine learning algorithms follow?

A

1) a dataset, 2) a cost function, 3) an optimization procedure, 4) and a model.

41
Q

What are false positive and false negative?

A

1) A false positive is when your prediction states positive but the true value is negative. 2) A false negative is when your prediction states negative but the true value is positive.

42
Q

When using a classifier, what do we mean by precision and recall ?

A

First, let us define some notation: TP = true positive, FP = false positive, FN = false negative. Precision = TP / (TP + FP). Recall = TP / (TP + FN).

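A small sketch computing both quantities from a confusion matrix and checking them against Scikit-Learn's built-in metrics; the labels are made up:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (made up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # predicted labels (made up)

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(tp / (tp + fp), precision_score(y_true, y_pred))  # precision, both ways
print(tp / (tp + fn), recall_score(y_true, y_pred))     # recall, both ways
```
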
43
Q

What is a confusion matrix?

A

It is an N×N matrix showing the number of TP, TN, FP, and FN for each class.

44
Q

What is the accuracy of a classifier?

A

Accuracy = (TP+TN)/(TP+TN+FN+FP)

45
Q

What is the Precision/Recall trade-off?

A

A trade-off means that increasing one quantity leads to a decrease in the other. In this case, increasing precision leads to a decrease in recall, and vice versa.

46
Q

What is the ROC curve?

A

It is a tool that plots the sensitivity (recall, or true positive rate) against 1 − specificity (the false positive rate). Sensitivity measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition). Specificity measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).

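A minimal sketch with Scikit-Learn's roc_curve, which returns the false positive rate (1 − specificity) and the true positive rate (sensitivity) at each threshold; the scores are made up:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                      # actual labels (made up)
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # classifier scores (made up)

# fpr = 1 - specificity, tpr = sensitivity (recall), one point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

print(list(zip(fpr, tpr)))               # the points that would be plotted
print(roc_auc_score(y_true, y_scores))   # area under the ROC curve
```
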
47
Q

How is k-fold cross-validation performed?

A

In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data.

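A minimal sketch, assuming Scikit-Learn is available; the choice of dataset and model is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5: the data is split into 5 folds; each fold serves exactly once as validation data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores)          # one score per fold
print(scores.mean())   # the usual summary over the k runs
```
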
48
Q

An important theoretical result of statistics and machine learning is the fact that a model’s generalization error can be expressed as the sum of three very different errors:

A
• Bias: This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.
• Variance: This part is due to the model’s excessive sensitivity to small variations in the training data and is likely to overfit.
• Irreducible error: This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data.

49
Q

What tends to happen when we increase a model’s complexity? When we decrease it?

A

Increasing complexity: typically the variance will increase and the bias will decrease.

Decreasing complexity: typically the bias will increase and the variance will decrease.

This is the essence of the Bias/Variance trade-off.

50
Q

What do we mean by noise in the data?

A

By noise we mean data points that don’t really represent the true properties of your data, but rather random chance.

51
Q

What is the cause of overfitting?

A

Overfitting happens because your model is trying too hard to capture the noise in your training dataset.

52
Q

What do we mean by regularization? Give one example.

A

Regularization

This is a technique that constrains, regularizes, or shrinks the coefficient estimates towards zero. In other words, it discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.

Example: ridge regression, where the RSS (residual sum of squares) is modified by adding the shrinkage quantity λ × Σ βj². The coefficients are then estimated by minimizing this penalized function; λ is the tuning parameter that decides how much we want to penalize the flexibility of our model.

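A minimal sketch contrasting plain linear regression with ridge regression in Scikit-Learn, where alpha plays the role of λ; the synthetic data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=50)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha plays the role of the tuning parameter lambda

print(plain.coef_)   # unconstrained coefficient estimates
print(ridge.coef_)   # shrunk towards zero by the penalty
```
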
53
Q

If you want to know if a continuous variable is correlated with a categorical variable, what are your three options?

A

1) Logistic regression: if the resulting classifier has a high degree of fit (it is accurate, sensitive, and specific), we can conclude the two variables share a relationship and are indeed correlated. Note that logistic regression assumes a linear relationship between the continuous variable and the log-odds; testing whether this assumption holds is not straightforward.
2) Point-biserial correlation:
a) Similar to the Pearson coefficient, the point-biserial correlation can range from -1 to +1.
b) The point-biserial calculation assumes that the continuous variable is normally distributed and homoscedastic.

3) Kruskal-Wallis H test (or parametric forms such as the t-test or ANOVA):
A simple approach is to group the continuous variable using the categorical variable, measure the variance in each group, and compare it to the overall variance of the continuous variable. If the variance after grouping falls significantly, it means that the categorical variable can explain most of the variance of the continuous variable, and so the two variables likely have a strong association. (A small code sketch of options 2 and 3 follows below.)

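A small sketch of options 2 and 3 using SciPy; the data and group labels are made up:

```python
import numpy as np
from scipy import stats

# Hypothetical data: a continuous measurement and a binary group label
continuous = np.array([2.1, 3.4, 2.9, 5.6, 6.1, 5.8, 2.5, 6.4])
group = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# Point-biserial correlation (binary categorical vs. continuous), ranges from -1 to +1
r, p_value = stats.pointbiserialr(group, continuous)
print(r, p_value)

# Kruskal-Wallis H test: compare the continuous values across the groups
h, p_value = stats.kruskal(continuous[group == 0], continuous[group == 1])
print(h, p_value)
```
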
54
Q

What is Cramer V?

A

In statistics, Cramér's V is a measure of association between two nominal variables. It is based on Pearson's chi-squared statistic (a sketch of the calculation follows below).

1) It measures the association between two discrete variables and may be used with variables having two or more levels.
2) It is a symmetrical measure: it does not matter which variable we place in the columns and which in the rows.
3) It can also be applied to goodness-of-fit chi-squared models when there is a 1×k table.
4) It varies from 0 (no association) to 1 (complete association).
5) It can be a heavily biased estimator of its population counterpart.

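A small sketch computing Cramér's V from a chi-squared statistic with SciPy; the contingency table is made up:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table of two nominal variables (rows x columns)
table = np.array([[30, 10],
                  [15, 45]])

chi2, p, dof, expected = chi2_contingency(table)

n = table.sum()
k = min(table.shape) - 1            # min(number of rows, number of columns) - 1
cramers_v = np.sqrt(chi2 / (n * k))

print(cramers_v)  # 0 = no association, 1 = complete association
```
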
55
Q

How to detect if we are underfitting?

A

Your model is underfitting the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y).

56
Q

What is data leakage?

A

Data leakage is when information from outside the training dataset is used to create the model.

For example: splitting candlestick (price bar) data randomly minute by minute instead of day by day, so that information from the same day leaks into both the training and test sets.

57
Q

What is data drift?

A

Data-drift is defined as a variation in the production data from the data that was used to test and validate the model before deploying it in production.

For example, Interactive Brokers did not send every trade in the live data feed but did in the historical data. This created a big difference between the live data and the training data.

58
Q

What is precision? What is another name used to describe precision?

A

precision = tp/(tp+fp)

precision = positive predictive value.

59
Q

What is recall? What is another name used to describe recall?

A

recall = tp/(tp+fn)

recall = sensitivity

recall = true positive rate

60
Q

What is specificity? What is another name used to describe specificity?

A

specificity = tn/(tn+fp)

specificity = true negative rate

61
Q

What is the F1 score and what is its formula?

A

The F1 score is the harmonic mean of precision and recall.

f1 = 2*(precision*recall)/(precision+recall)

f1 = 2tp/(2tp+fp+fn)

62
Q

What is data drift?

A

Data drift is the situation where the model’s input distribution changes over time.

63
Q

What is concept drift?

A

(Real) concept drift is the situation when the functional relationship between the model inputs and outputs changes.

The cause of the relationship change is some kind of external event or process. For example, suppose we try to predict life expectancy using geographic region as an input. As a region's development level increases (or decreases), the region loses its predictive power and our model degrades.