Modeling Flashcards

Question 1

Q

Tell me about how you designed a model for a past employer or client.

Question 2

Q

What are your favorite data visualization techniques?

Question 3

Q

How would you effectively represent data with 5 dimensions?

Question 4

Q

How is k-NN different from k-means clustering?

Answer

A

k-NN (k-nearest neighbors) is a supervised classification algorithm in which k is the number of neighboring labeled data points that vote on the class of a given observation. k-means is an unsupervised clustering algorithm in which k is the number of clusters to be created from the given data.
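To make the contrast concrete, here is a minimal standard-library sketch of both algorithms. The function names, the toy points, and the evenly spaced centroid initialization are assumptions made for illustration only; a production library would normally be used instead.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: classify `query` by majority vote among its k nearest labeled points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def kmeans(points, k=2, iterations=10):
    """Unsupervised: group unlabeled points into k clusters (Lloyd's algorithm)."""
    # Assumption: naive, deterministic initialization from evenly spaced points.
    step = max(k - 1, 1)
    centroids = [points[i * (len(points) - 1) // step] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
    assignments = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
    return centroids, assignments

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

knn_label = knn_predict(points, labels, (1.1, 1.0), k=3)  # needs the labels
centroids, assignments = kmeans(points, k=2)              # never sees the labels
```

Note that `knn_predict` cannot run without `labels`, while `kmeans` only ever receives `points`; in one case k counts voters, in the other it counts clusters.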

Question 5

Q

How would you create a logistic regression model?

Question 6

Q

Have you used a time series model? Do you understand cross-correlations with time lags?

Question 7

Q

Explain the 80/20 rule, and tell me about its importance in model validation.

Answer

A

“People usually tend to start with a 80-20% split (80% training set – 20% test set) and split the training set once more into a 80-20% ratio to create the validation set.” Applied to the full data set, this yields roughly 64% training, 16% validation, and 20% test data.
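The two-stage split described in the answer can be sketched as follows. The function name `train_val_test_split`, the fixed seed, and the 100-row toy data set are illustrative assumptions, not part of the original card.

```python
import random

def train_val_test_split(rows, test_fraction=0.2, val_fraction=0.2, seed=0):
    """Hold out a test set first, then carve a validation set out of the remainder."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # shuffle a copy so the split is not order-dependent

    n_test = round(len(rows) * test_fraction)
    test, remaining = rows[:n_test], rows[n_test:]

    n_val = round(len(remaining) * val_fraction)
    val, train = remaining[:n_val], remaining[n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 64 16 20
```

The test set is touched only once, at the very end; the validation set is what gets used for model selection and tuning along the way.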

Question 8

Q

Explain what precision and recall are. How do they relate to the ROC curve?

Answer

A

Recall describes what percentage of actual positives the model classifies as positive. Precision describes what percentage of positive predictions are correct. The ROC curve plots recall (the true positive rate) against the false positive rate, which is 1 - specificity, across classification thresholds; specificity is the percentage of actual negatives the model classifies as negative. Precision does not appear on the ROC curve directly. Recall, precision, and the ROC curve are all measures used to judge how useful a given classification model is.
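As a rough illustration of these definitions, the sketch below computes precision, recall, and specificity from a confusion matrix and then traces ROC points by sweeping the decision threshold. The helper names and the six-row toy data are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_specificity(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0      # correct share of positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0         # share of actual positives found
    specificity = tn / (tn + fp) if tn + fp else 0.0    # share of actual negatives found
    return precision, recall, specificity

def roc_points(y_true, scores):
    """(false positive rate, true positive rate) pairs, one per score threshold."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        _, recall, specificity = precision_recall_specificity(y_true, y_pred)
        points.append((1 - specificity, recall))
    return points

# Hypothetical labels and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

y_pred = [1 if s >= 0.5 else 0 for s in scores]
precision, recall, specificity = precision_recall_specificity(y_true, y_pred)
curve = roc_points(y_true, scores)
```

Precision and recall describe the model at one threshold (0.5 here), whereas the ROC curve summarizes the recall/specificity trade-off across all thresholds.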

Question 9

Q

Explain the difference between L1 and L2 regularization methods.

Answer

A

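“A regression model that uses L1 regularization technique is called Lasso Regression and model which uses L2 is called Ridge Regression. The key difference between these two is the penalty term.” L1 adds the sum of the absolute values of the coefficients to the loss, which can shrink some coefficients to exactly zero; L2 adds the sum of their squares, which shrinks coefficients toward zero without eliminating them.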
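A minimal sketch of the two penalty terms, assuming a mean-squared-error loss and a regularization strength named `lam` (both choices, like the function names, are illustrative assumptions):

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * sum(w * w for w in weights)

def penalized_loss(y_true, y_pred, weights, lam, penalty):
    """Mean squared error plus the chosen penalty term."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse + penalty(weights, lam)

# Hypothetical coefficients: one small, one large, one already zero.
weights = [0.5, -2.0, 0.0]

lasso_term = l1_penalty(weights, lam=0.1)  # grows linearly with each |w|
ridge_term = l2_penalty(weights, lam=0.1)  # grows quadratically, so large weights dominate
```

Because the L2 term squares each coefficient, the `-2.0` weight accounts for almost all of `ridge_term`, while its contribution to `lasso_term` is only proportional to its size.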

Question 10

Q

In your opinion, which is more important when designing a machine learning model: model performance or model accuracy?

Question 11

Q

Is it better to spend five days developing a 90-percent accurate solution or 10 days for 100-percent accuracy?

Question 12

Q

What are some situations where a general linear model fails?

Question 13

Q

Do you think 50 small decision trees are better than a large one? Why?

Question 14

Q

Is it better to have too many false positives or too many false negatives?

Question 15

Q

Your data science team must build a binary classifier, and the number one criterion is the fastest possible scoring at deployment. It may even be deployed in real time. Which technique will produce a model that will likely be fastest for the deployment team to use on new cases?

random forest
logistic regression
KNN
deep neural network

Answer

A

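To predict a new value:

Random forest - the value has to be fed through every tree, and then a voting rule is applied

KNN - distances have to be computed against all n training observations

Deep neural network - the value has to pass through every layer of the network

Logistic regression - a weighted sum of the features is fed into the sigmoid function

Logistic regression is therefore much quicker in deployment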
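The difference in scoring cost can be sketched by writing out what each model does for a single new case. The function names, weights, and stored points below are hypothetical; the point is only that logistic regression needs one pass over the features, while k-NN needs one pass over every stored training observation.

```python
import math

def logistic_score(weights, bias, x):
    """One dot product plus one sigmoid: cost grows with the number of features only."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def knn_score(train_points, train_labels, x, k=3):
    """One distance per stored training point: cost grows with the training set size."""
    nearest = sorted(range(len(train_points)),
                     key=lambda i: math.dist(train_points[i], x))[:k]
    return sum(train_labels[i] for i in nearest) / k  # fraction of positive neighbors

# Hypothetical fitted parameters and stored training data.
weights, bias = [0.0, 0.0], 0.0
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9), (2.9, 3.2)]
train_labels = [0, 0, 0, 1, 1, 1]

p_logistic = logistic_score(weights, bias, (1.0, 2.0))
p_knn = knn_score(train_points, train_labels, (3.0, 3.1), k=3)
```

A deployed logistic regression only has to ship its weight vector and bias, whereas k-NN has to ship, and search, the whole training set.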

(15 cards)