Modeling Flashcards
Tell me about how you designed a model for a past employer or client.
Answer
What are your favorite data visualization techniques?
Answer
How would you effectively represent data with 5 dimensions?
Answer
How is k-NN different from k-means clustering?
k-NN, or k-nearest neighbors, is a supervised classification algorithm in which k is an integer giving the number of neighboring labeled data points that influence the classification of a given observation. K-means is an unsupervised clustering algorithm in which k is an integer giving the number of clusters to create from the given data.
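The contrast can be made concrete with a minimal sketch in plain Python. The function names, the toy points, and the evenly spaced centroid initialization are all assumptions chosen for illustration; production code would normally rely on a library implementation with a better initialization.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def kmeans(points, k=2, iterations=10):
    """Group unlabeled points into k clusters; returns (centroids, assignments)."""
    # Naive deterministic init (an assumption): evenly spaced input points.
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                       for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments

# Toy data (hypothetical): two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

knn_label = knn_predict(points, labels, (1.1, 1.0), k=3)  # uses the labels
centroids, clusters = kmeans(points, k=2)                 # ignores the labels
```

Note that `knn_predict` needs the labels and does all of its work at prediction time, while `kmeans` never sees a label and does its work up front.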
How would you create a logistic regression model?
Answer
Have you used a time series model? Do you understand cross-correlations with time lags?
Answer
Explain the 80/20 rule, and tell me about its importance in model validation.
“People usually tend to start with a 80-20% split (80% training set – 20% test set) and split the training set once more into a 80-20% ratio to create the validation set.”
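As a rough sketch of the split described in that quote, the snippet below applies an 80/20 cut twice using only the standard library. The helper name `split_80_20`, the seeds, and the 100 placeholder rows are assumptions made for illustration.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle a copy of `items` and return (first 80%, remaining 20%)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # placeholder for 100 observations

# First split: 80% for model building, 20% held out as the test set.
train_full, test = split_80_20(rows, seed=0)

# Second split: carve a validation set out of the 80%.
train, validation = split_80_20(train_full, seed=1)
```

With 100 rows this leaves 64 for training, 16 for validation, and 20 for testing, and no row appears in more than one set.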
Explain what precision and recall are. How do they relate to the ROC curve?
Recall is the percentage of actual positives that the model labels as positive. Precision is the percentage of positive predictions that are correct. The ROC curve plots recall (the true positive rate) against the false positive rate, which is 1 minus specificity, as the classification threshold varies; specificity is the percentage of actual negatives that the model labels as negative. Precision, recall, and the ROC curve are all measures of how useful a given classification model is.
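These definitions can be sketched in a few lines of plain Python. The labels, scores, and function names below are made up for illustration, and 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn

def precision_recall_specificity(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0    # correct share of positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0       # share of actual positives found
    specificity = tn / (tn + fp) if tn + fp else 0.0  # share of actual negatives found
    return precision, recall, specificity

def roc_point(y_true, scores, threshold):
    """One point on the ROC curve: (false positive rate, true positive rate)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    _, recall, specificity = precision_recall_specificity(y_true, y_pred)
    return 1.0 - specificity, recall

# Hypothetical labels and model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1]

y_pred = [1 if s >= 0.5 else 0 for s in scores]
precision, recall, specificity = precision_recall_specificity(y_true, y_pred)

# Sweeping the threshold traces out the ROC curve.
roc = [roc_point(y_true, scores, t) for t in (0.75, 0.5, 0.35)]
```

Lowering the threshold raises recall but can also raise the false positive rate, which is the trade-off the ROC curve makes visible.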
Explain the difference between L1 and L2 regularization methods.
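“A regression model that uses L1 regularization technique is called Lasso Regression and model which uses L2 is called Ridge Regression. The key difference between these two is the penalty term.” L1 adds a penalty proportional to the sum of the absolute values of the coefficients, while L2 adds a penalty proportional to the sum of their squares.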
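To illustrate how the two penalty terms differ, here is a small sketch that computes each for the same coefficient vector. The weights and the regularization strength `lam` are hypothetical values; in a real model the penalty is added to the training loss.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge-style penalty: lam times the sum of squared coefficient values."""
    return lam * sum(w * w for w in weights)

weights = [3.0, -0.5, 0.0, 2.0]  # hypothetical coefficients
lam = 0.5                        # hypothetical regularization strength

l1 = l1_penalty(weights, lam)
l2 = l2_penalty(weights, lam)
```

Because the L2 term squares each coefficient, large coefficients dominate it, whereas the L1 term treats every unit of magnitude equally and tends to push small coefficients all the way to zero.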
In your opinion, which is more important when designing a machine learning model: model performance or model accuracy?
Answer
Is it better to spend five days developing a 90-percent accurate solution or 10 days for 100-percent accuracy?
Answer
What are some situations where a general linear model fails?
Answer
Do you think 50 small decision trees are better than a large one? Why?
Answer
Is it better to have too many false positives or too many false negatives?
Answer
Your data science team must build a binary classifier, and the number one criterion is the fastest possible scoring at deployment. It may even be deployed in real time. Which technique will produce a model that is likely to be the fastest for the deployment team to apply to new cases?
random forest
logistic regression
KNN
deep neural network
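To predict a new value:
Random Forest - the value has to be fed through every tree, and then a voting rule is applied
KNN - distances have to be computed against all n training observations
Deep Neural Network - the value has to pass through every layer of the network
Logistic Regression - a weighted sum of the features is fed into the sigmoid
Logistic Regression is likely to be the quickest in deployment.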
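As a sketch of why scoring is cheap, a fitted logistic regression needs only one dot product and one sigmoid evaluation per new case. The weights, bias, and feature values below are hypothetical, and `logistic_score` is a name chosen for illustration.

```python
import math

def logistic_score(weights, bias, x):
    """Probability of the positive class: sigmoid of the weighted feature sum."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted parameters and one new observation.
weights = [0.8, -0.4]
bias = -0.2
x = [1.0, 0.5]

probability = logistic_score(weights, bias, x)
predicted_class = 1 if probability >= 0.5 else 0
```

The work per prediction grows only with the number of features, not with the size of the training set or the number of trees or layers.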