General Machine Learning Flashcards
Why do you want to lock a test set away right from the beginning?
If you or your algorithm look at the test data, it increases the likelihood that your model will be biased. The bias we are trying to avoid is the data snooping bias.
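A minimal sketch of locking the test set away with scikit-learn's train_test_split; the DataFrame contents here are invented for illustration:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.DataFrame({"income": [1.5, 3.2, 2.8, 5.1, 4.4, 2.0],
                         "price": [110, 250, 210, 400, 330, 150]})

    # A fixed random_state keeps the split reproducible, so the same rows
    # stay in the test set across runs and are never snooped on.
    train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)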
What is the data snooping bias?
The data snooping bias is a statistical bias that appears when exhaustively searching for combinations of variables: the probability that a result arose by pure chance grows with the number of combinations tested.
What is the sampling bias?
Sampling bias is a bias in which a sample is collected in such a way that some members of the intended population have a lower sampling probability than others. The result is a biased, non-random sample of the population.
What is the confirmation bias?
Confirmation bias is the tendency to process information by looking for, or interpreting, information that is consistent with one's existing beliefs.
What is the exclusion bias?
It happens as a result of excluding some features from our dataset, usually under the umbrella of cleaning the data, because we think these features are irrelevant. For example, in the Titanic survival prediction problem, one might disregard the passenger id of the travelers, thinking it completely irrelevant. Little did they know that Titanic passengers were assigned rooms according to their passenger id: the smaller the id number, the closer to the lifeboats.
What is the observer bias?
The tendency to see what we expect to see, or what we want to see. When researchers study a certain group, they usually come to the experiment with prior knowledge and subjective feelings about the group being studied.
What is prejudice bias?
It happens as a result of cultural influences or stereotypes in the data. Example: a computer vision program that detects people at work, trained on Google Images. It will be fed thousands of images of men coding and women cooking, so the model might conclude that only men code and only women cook.
What is measurement bias?
Systematic value distortion that happens when there is an issue with the device used to observe or measure. This kind of bias tends to skew the data in a particular direction. Example: shooting image data with a camera that increases the brightness. This faulty measurement tool fails to replicate the environment in which the model will operate.
What are the eight main steps of a machine learning project?
1) Frame the problem and look at the big picture 2) Get the data 3) Explore the data to gain insights 4) Prepare the data 5) Explore many different models and shortlist the best ones 6) Fine-tune your models and combine them into a great solution 7) Present your solution 8) Launch, monitor, and maintain your system.
When framing the problem, which questions should you ask yourself (5)?
1) What is the objective in business terms? 2) How will the solution be used? 3) How should performance be measured, and is the measure aligned with the business objective? 4) What would be the minimum performance needed to reach the business objective? 5) List and verify the validity of your assumptions.
In the "get the data" step, what do you need to do (5)?
1) List the data you need and how much you need 2) Find and document where you can get the data 3) Check legal obligations 4) Ensure sensitive information is deleted or protected 5) Sample a test set, put it aside, and never look at it (no data snooping).
What do we mean by exploring the data (5 points)?
1) Study each attribute and its characteristics 2) Verify the % of missing values 3) Identify the target attribute 4) Visualize the data 5) Study the correlations.
What do we mean by preparing the data (7 points)?
1) Make sure to work on copies of the data (keep the original dataset intact) 2) Write functions for all data transformations 3) Fix or remove outliers 4) Fill in missing data 5) Feature selection: drop the attributes that provide no useful information 6) Feature scaling 7) Change the type of data, for example from continuous to discrete.
What do we mean by shortlisting promising models?
1) Train many quick-and-dirty models and compare their performance 2) For each model, use N-fold cross-validation 3) Analyze the most significant variables for each algorithm 4) Analyze the types of errors the models make 5) Perform a quick round of feature selection.
What do we mean by fine-tuning the system?
1) You will want to use as much data as possible for this step 2) Fine-tune the hyperparameters using cross-validation, as sketched below 3) Try ensemble methods: combining your best models will often produce better performance than running them individually 4) Once you are confident about your final model, measure its performance on the test set to estimate the generalization error 5) Note: do not tweak your model after measuring the generalization error; you would just overfit the test set.
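As a minimal sketch of point 2, here is one common way to fine-tune hyperparameters with cross-validation in scikit-learn; the model choice and parameter grid are made-up examples, not recommendations:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    X, y = make_regression(n_samples=200, n_features=5, random_state=42)

    # GridSearchCV tries every parameter combination, scoring each with
    # 5-fold cross-validation on the training data only.
    param_grid = {"n_estimators": [10, 50], "max_depth": [4, 8]}
    search = GridSearchCV(RandomForestRegressor(random_state=42),
                          param_grid, cv=5,
                          scoring="neg_mean_squared_error")
    search.fit(X, y)
    print(search.best_params_)  # hyperparameters with the best CV score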
When presenting your solution, do not forget to: (6)
1) Document what you have done 2) Create a nice presentation; make sure you highlight the big picture first 3) Explain why your solution achieves the business objective 4) Present interesting points you noticed along the way 5) List your system's limitations 6) Ensure your key findings are communicated through easy-to-remember statements. For example: the median income is the number one predictor of housing prices.
What is a model validation technique?
It is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
What is the goal of cross-validation?
The goal of cross-validation is to test the model’s ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give insight into how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).
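A minimal k-fold cross-validation sketch with scikit-learn; the data are synthetic and the choice of logistic regression is just for illustration:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, random_state=42)

    # Each of the 5 folds is held out once while the model trains on the
    # other 4, giving 5 out-of-sample accuracy estimates.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean(), scores.std())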
What are the 3 strategies to deal with missing values?
1) Get rid of the corresponding rows 2) Get rid of the whole attribute 3) Set the missing values to some value (zero, the mean, the median, etc.)
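A minimal pandas sketch of the three strategies; the DataFrame is invented for illustration:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"rooms": [3.0, np.nan, 5.0, 4.0],
                       "age": [10.0, 22.0, np.nan, 8.0]})

    option1 = df.dropna(subset=["rooms"])                 # 1) drop rows with missing values
    option2 = df.drop(columns=["rooms"])                  # 2) drop the whole attribute
    option3 = df.fillna({"rooms": df["rooms"].median()})  # 3) fill with some value (median here)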
Give one pro and one con of mean imputation.
Pro: the other attributes of the row are still used in our model. Con: the standard deviation is artificially lowered; your model thinks it has more data than it really does for the given attribute.
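A minimal sketch of the con, with invented values: filling the gap with the mean adds no spread, so the standard deviation shrinks:

    import numpy as np

    values = np.array([1.0, 2.0, np.nan, 10.0])
    filled = np.where(np.isnan(values), np.nanmean(values), values)

    print(np.nanstd(values))  # spread of the observed values only
    print(filled.std())       # smaller: the imputed mean sits exactly at the center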
What do we mean by one-hot encoding, and when is it used?
When we have an array of categorical variables and it is not clear whether there is any order to the set, we create one binary attribute per category. Only one attribute will be equal to 1 (hot) and all others will be 0 (cold).
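A minimal sketch with scikit-learn's OneHotEncoder; the category values are invented:

    from sklearn.preprocessing import OneHotEncoder

    categories = [["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]]

    encoder = OneHotEncoder()
    one_hot = encoder.fit_transform(categories)  # returns a SciPy sparse matrix
    print(encoder.categories_)   # one binary attribute per category
    print(one_hot.toarray())     # exactly one 1 ("hot") per row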
What is a sparse matrix and why do we use it?
A sparse matrix is a matrix that contains mostly zeros. Substantial reductions in memory requirements can be realized by storing only the non-zero entries.
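A minimal sketch with scipy.sparse showing the memory idea: only the non-zero entries and their coordinates are stored, not the full grid of zeros:

    import numpy as np
    from scipy import sparse

    dense = np.zeros((1000, 1000))
    dense[0, 1] = 3.0
    dense[500, 2] = 7.0

    sparse_version = sparse.csr_matrix(dense)
    print(sparse_version.nnz)  # 2 stored entries instead of 1,000,000 cells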
What is the difference between univariate and multivariate imputation?
One type of imputation algorithm is univariate, which imputes values in the i-th feature dimension using only non-missing values in that feature dimension (e.g. impute.SimpleImputer). By contrast, multivariate imputation algorithms use the entire set of available feature dimensions to estimate the missing values (e.g. impute.IterativeImputer).
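A minimal sketch contrasting the two imputers named above on a toy array; note that IterativeImputer is still flagged as experimental in scikit-learn, hence the extra enable_iterative_imputer import:

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 2.0], [3.0, 6.0], [np.nan, 8.0], [5.0, np.nan]])

    # Univariate: each column is filled from that column alone (its mean here).
    print(SimpleImputer(strategy="mean").fit_transform(X))

    # Multivariate: each column is modeled from the other columns, so the
    # imputed values follow the relationships between features.
    print(IterativeImputer(random_state=0).fit_transform(X))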
What is feature scaling, why do we scale the features, and what are the two most common ways to scale them?
1) Feature scaling is transforming the data so that the features are on the same scale. 2) With few exceptions, machine learning algorithms do not perform well when the input numerical attributes have very different scales. 3) Most optimization algorithms will slow down considerably if the parameters do not have the same scale. 4) The two most common ways are min-max scaling and standardization.
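For the second common way, standardization, a minimal scikit-learn sketch with invented values:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0], [5.0], [9.0]])

    scaler = StandardScaler()        # subtracts the mean, divides by the std
    print(scaler.fit_transform(X))   # result has zero mean and unit variance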
What is min-max scaling (often called normalization)?
It is a form of feature scaling. The values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Sklearn provides a transformer called MinMaxScaler.
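A minimal MinMaxScaler sketch; the feature values are invented:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[1.0], [5.0], [9.0]])

    scaler = MinMaxScaler()          # rescales each feature to the [0, 1] range
    print(scaler.fit_transform(X))   # (x - min) / (max - min) -> [[0.], [0.5], [1.]]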