Part 1 Flashcards
Q: What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms?
There are many steps that can be taken when data wrangling and data cleaning. Some of the most common steps are listed below:
Data profiling: Almost everyone starts off by getting an understanding of their dataset. More specifically, you can look at the shape of the dataset with .shape and a description of your numerical variables with .describe().
Data visualizations: Sometimes, it’s useful to visualize your data with histograms, boxplots, and scatterplots to better understand the relationships between variables and also to identify potential outliers.
Syntax errors: This includes removing stray white space, making letter casing consistent, and checking for typos. You can check for typos by using .unique() or by plotting bar graphs.
Standardization or normalization: Depending on the dataset you're working with and the machine learning method you decide to use, it may be useful to standardize or normalize your data so that the different scales of different variables don't negatively impact the performance of your model.
Handling null values: There are a number of ways to handle null values, including deleting rows with null values altogether, replacing null values with the mean/median/mode, replacing null values with a new category (e.g. unknown), predicting the values, or using machine learning models that can deal with null values.
Other things include removing irrelevant data, removing duplicates, and type conversion. A rough sketch covering several of these steps is shown below.
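As a minimal pandas sketch of a few of these steps (the DataFrame and its columns are hypothetical; adapt the column names to your own data):

```python
import pandas as pd

# Hypothetical dataset with one numeric and one categorical column
df = pd.DataFrame({
    "age": [25, 32, None, 41, 25],
    "city": ["new york ", "New York", "boston", None, "new york "],
})

# Data profiling: shape and summary statistics of numeric variables
print(df.shape)
print(df.describe())

# Syntax errors: strip white space, make casing consistent, check distinct values
df["city"] = df["city"].str.strip().str.title()
print(df["city"].unique())

# Handling null values: fill numeric nulls with the median,
# replace missing categories with a new "Unknown" category
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("Unknown")

# Removing duplicates and converting types
df = df.drop_duplicates()
df["age"] = df["age"].astype(int)
```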
Q: How to deal with unbalanced binary classification?
There are a number of ways to handle unbalanced binary classification (assuming that you want to identify the minority class):
First, reconsider the metrics you use to evaluate your model. Accuracy might not be the best metric to look at, and an example explains why: say 99 bank withdrawals were not fraudulent and 1 withdrawal was. If your model simply classified every instance as "not fraudulent", it would have an accuracy of 99%! Therefore, you may want to consider metrics like precision and recall.
Second, you can increase the cost of misclassifying the minority class. By increasing that penalty, the model should classify the minority class more accurately.
Lastly, you can improve the balance of classes by oversampling the minority class or by undersampling the majority class.
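A minimal scikit-learn sketch of two of these ideas, class weighting and random oversampling (the dataset here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, highly imbalanced binary dataset (~1% positives)
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: increase the cost of misclassifying the minority class
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Evaluate with precision and recall rather than accuracy alone
print(classification_report(y_test, clf.predict(X_test)))

# Option 2: randomly oversample the minority class in the training set
n_majority = int((y_train == 0).sum())
minority_oversampled = resample(X_train[y_train == 1], n_samples=n_majority, random_state=0)
X_balanced = np.vstack([X_train[y_train == 0], minority_oversampled])
y_balanced = np.array([0] * n_majority + [1] * n_majority)
```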
Q: What is the difference between a box plot and a histogram?
While boxplots and histograms are visualizations used to show the distribution of the data, they communicate information differently.
Histograms are bar charts that show the frequency of a numerical variable's values and are used to approximate the probability distribution of the given variable. They allow you to quickly understand the shape of the distribution, the variation, and potential outliers.
Boxplots communicate different aspects of the distribution of data. While you can't see the shape of the distribution through a box plot, you can gather other information like the quartiles, the range, and outliers. Boxplots are especially useful when you want to compare multiple distributions at the same time because they take up less space than histograms.
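A small matplotlib sketch contrasting the two (the data is randomly generated purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Two synthetic samples with different centers and spreads
rng = np.random.default_rng(0)
a = rng.normal(loc=0, scale=1, size=1000)
b = rng.normal(loc=1, scale=2, size=1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shows the shape of a single distribution
ax1.hist(a, bins=30)
ax1.set_title("Histogram of sample a")

# Boxplots: compact view of quartiles, range, and outliers,
# convenient for comparing several distributions side by side
ax2.boxplot([a, b])
ax2.set_title("Boxplots of samples a and b")

plt.tight_layout()
plt.show()
```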
Q: Describe different regularization methods, such as L1 and L2 regularization?
Both L1 and L2 regularization are methods used to reduce the overfitting of training data. Least Squares minimizes the sum of the squared residuals, which can result in low bias but high variance.
L2 Regularization, also called ridge regression, minimizes the sum of the squared residuals plus lambda times the slope squared. This additional term is called the Ridge Regression Penalty. This increases the bias of the model, making the fit worse on the training data, but also decreases the variance.
If you take the ridge regression penalty and replace it with the absolute value of the slope, then you get Lasso regression or L1 regularization.
L2 regularization is less robust to outliers but always has a stable, unique solution. L1 regularization is more robust but can have an unstable solution and possibly multiple solutions.
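A minimal scikit-learn sketch comparing the two penalties (synthetic data; the alpha values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic regression problem where only 3 of 10 features are informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=10, random_state=0)

# Plain least squares: minimizes the sum of squared residuals
ols = LinearRegression().fit(X, y)

# L2 (ridge): adds lambda * (slope squared) to the loss, shrinking coefficients
ridge = Ridge(alpha=10.0).fit(X, y)

# L1 (lasso): adds lambda * |slope|, which can drive some coefficients to exactly zero
lasso = Lasso(alpha=10.0).fit(X, y)

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
```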
Q: Explain a Neural Network.
A neural network is a multi-layered model inspired by the human brain. Like the neurons in our brain, nodes are connected across layers: an input layer, one or more hidden layers, and an output layer. Each node in the hidden layers represents a function that the inputs go through, ultimately leading to a value at the output layer. These functions are called activation functions; the sigmoid is a common example.
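A minimal sketch of a small feed-forward network using scikit-learn's MLPClassifier (the architecture and data are arbitrary and purely illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic two-class data
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Input layer (2 features), two hidden layers of 8 nodes each, and an output layer.
# Each hidden node applies an activation function (here logistic, i.e. the sigmoid)
# to a weighted sum of its inputs.
net = MLPClassifier(hidden_layer_sizes=(8, 8), activation="logistic",
                    max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print("Test accuracy:", net.score(X_test, y_test))
```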
Q: What is cross-validation?
Cross-validation is essentially a technique used to assess how well a model performs on a new independent dataset. The simplest example of cross-validation is when you split your data into two groups: training data and testing data, where you use the training data to build the model and the testing data to test the model.
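A minimal scikit-learn sketch of both a simple train/test split and k-fold cross-validation (the model and dataset are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Simplest form: hold out a test set, train on the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model.fit(X_train, y_train)
print("Hold-out score:", model.score(X_test, y_test))

# k-fold cross-validation: repeat the split k times and average the scores
scores = cross_val_score(model, X, y, cv=5)
print("5-fold CV scores:", scores, "mean:", scores.mean())
```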
Q: How to define/select metrics?
There isn’t a one-size-fits-all metric. The metric(s) chosen to evaluate a machine learning model depends on various factors:
Is it a regression or classification task?
What is the business objective? E.g., precision vs. recall
What is the distribution of the target variable?
There are a number of metrics that can be used, including adjusted r-squared, MAE, MSE, accuracy, recall, precision, f1 score, and the list goes on.
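A small sketch showing how several of these metrics are computed with scikit-learn (the true and predicted values are made up for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error, mean_squared_error)

# Classification example: hypothetical true labels and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1 score: ", f1_score(y_true, y_pred))

# Regression example: hypothetical true values and predictions
y_true_r = [3.0, 5.0, 2.5, 7.0]
y_pred_r = [2.5, 5.0, 3.0, 8.0]
print("MAE:", mean_absolute_error(y_true_r, y_pred_r))
print("MSE:", mean_squared_error(y_true_r, y_pred_r))
```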
Q: Explain what precision and recall are
Recall attempts to answer “What proportion of actual positives was identified correctly?”
RECALL = TP / (TP + FN)
Precision attempts to answer “What proportion of positive identifications was actually correct?”
PRECISION = TP / (TP + FP)
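For example, a quick computation from a confusion matrix (the counts here are made up):

```python
# Hypothetical confusion-matrix counts
tp, fp, fn = 40, 10, 20   # true positives, false positives, false negatives

recall = tp / (tp + fn)       # proportion of actual positives identified correctly
precision = tp / (tp + fp)    # proportion of positive identifications that were correct

print(f"recall = {recall:.2f}, precision = {precision:.2f}")   # 0.67 and 0.80
```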
Q: Explain what a false positive and a false negative are. Why is it important to distinguish these from each other? Provide examples of when false positives are more important than false negatives, when false negatives are more important than false positives, and when these two types of errors are equally important.
A false positive is an incorrect identification of the presence of a condition when it’s absent.
A false negative is an incorrect identification of the absence of a condition when it’s actually present.
An example of when false negatives are more important than false positives is when screening for cancer. It’s much worse to say that someone doesn’t have cancer when they do, instead of saying that someone does and later realizing that they don’t.
This is a subjective argument, but false positives can be worse than false negatives from a psychological point of view. For example, a false positive for winning the lottery could be a worse outcome than a false negative because people normally don't expect to win the lottery anyway. Finally, when both errors carry similar costs, neither type dominates and a balanced metric such as the F1 score is appropriate.
Q: What is the difference between supervised learning and unsupervised learning? Give concrete examples
Supervised learning involves learning a function that maps an input to an output based on example input-output pairs [1].
For example, if I had a dataset with two variables, age (input) and height (output), I could implement a supervised learning model to predict the height of a person based on their age.
Unlike supervised learning, unsupervised learning is used to draw inferences and find patterns from input data without references to labeled outcomes. A common use of unsupervised learning is grouping customers by purchasing behavior to find target markets.
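A minimal sketch of each (both datasets are tiny and made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised learning: labeled input-output pairs (age -> height)
ages = np.array([[5], [10], [15], [20], [25]])
heights = np.array([110, 140, 170, 178, 180])
reg = LinearRegression().fit(ages, heights)
print("Predicted height at age 12:", reg.predict([[12]]))

# Unsupervised learning: no labels, find structure in purchasing behavior
purchases = np.array([[200, 2], [220, 3], [50, 20], [60, 25]])  # [spend, visits]
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(purchases)
print("Customer segments:", segments)
```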
Q: Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model
A) Adjusted R-squared.
R-squared is a measurement that tells you what proportion of the variance in the dependent variable is explained by the variance in the independent variables. In simpler terms, while the coefficients estimate trends, R-squared represents the scatter around the line of best fit.
However, every additional independent variable added to a model always increases the R-squared value, so a model with several independent variables may seem to be a better fit even if it isn't. This is where adjusted R² comes in. The adjusted R² compensates for each additional independent variable and only increases if a given variable improves the model more than would be expected by chance. This is important since we are creating a multiple regression model (a short computation is sketched after part B below).
B) Cross-Validation
A method common to most people is cross-validation, splitting the data into two sets: training and testing data. See the answer to the first question for more on this.
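A minimal sketch of both validation steps, computing adjusted R² from the ordinary R² and running cross-validation (synthetic data, arbitrary model):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# A) Adjusted R-squared: penalize R-squared for the number of predictors p
r2 = model.score(X_test, y_test)
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")

# B) Cross-validation: average R-squared over several train/test splits
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("5-fold CV R2:", cv_scores.mean())
```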
Q: What does NLP stand for?
NLP stands for Natural Language Processing. It is a branch of artificial intelligence that gives machines the ability to read and understand human languages.
Q: When would you use random forests vs. SVM and why?
There are a few reasons why a random forest can be a better choice of model than a support vector machine:
Random forests allow you to determine the feature importance; SVMs can't do this (see the sketch below).
Random forests are much quicker and simpler to build than an SVM.
For multi-class classification problems, SVMs require a one-vs-rest method, which is less scalable and more memory intensive.
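A quick sketch of the feature-importance point, which random forests expose directly (the data is synthetic and purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic multi-class problem with 6 features, 3 of them informative
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_classes=3, random_state=0)

# Random forests handle multi-class natively and expose feature importances
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("Feature importances:", rf.feature_importances_.round(3))
```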
Q: Why is dimension reduction important?
Dimensionality reduction is the process of reducing the number of features in a dataset. This is important mainly in the case when you want to reduce variance in your model (overfitting).
Wikipedia states four advantages of dimensionality reduction:
It reduces the time and storage space required
Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model
It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D
It avoids the curse of dimensionality
Q: What is principal component analysis? Explain the sort of problems you would use PCA for.
In its simplest sense, PCA involves projecting higher-dimensional data (e.g. 3 dimensions) onto a smaller space (e.g. 2 dimensions). This results in lower-dimensional data (2 dimensions instead of 3) while still using information from all of the original variables, since each principal component is a linear combination of them.
PCA is commonly used for compression purposes, to reduce required memory and to speed up the algorithm, as well as for visualization purposes, making it easier to summarize data.
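A minimal scikit-learn sketch projecting a 4-dimensional dataset down to two components (the iris data is used only as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 4-dimensional iris data, standardized so all features are on the same scale
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Project to 2 principal components for visualization or compression
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Reduced shape:", X_2d.shape)
print("Variance explained:", pca.explained_variance_ratio_.sum().round(3))
```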