Interview Prep Flashcards
Bias-Variance Tradeoff
Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm. High bias can cause the model to underfit the data, making it hard to generalize from the training set to the test set.
Variance is error due to too much complexity in the learning algorithm. High variance makes the model highly sensitive to noise in the training data, leading to overfitting.
Supervised vs unsupervised learning
Supervised = learns from labeled data (inputs paired with known outputs)
Unsupervised = finds structure in unlabeled data
KNN vs k-means clustering
KNN = supervised classification algorithm (predicts a label from the k nearest labeled neighbors)
K-means clustering is unsupervised (partitions unlabeled points into k clusters)
ROC Curve
Receiver operating characteristic -> plot of true positive rate vs. false positive rate across classification thresholds
Precision vs. recall vs. accuracy
Example: cancer screening
Precision = TP / predicted positive (actual cancer patients among those predicted / all predicted cancer patients)
Recall = TP / actual positive (actual cancer patients predicted / all actual cancer patients, including those missed)
Accuracy = (TP + TN) / all predictions
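The definitions above as a minimal Python sketch; the counts are made up for the cancer-screening example:

```python
# Illustrative: precision, recall, and accuracy from raw confusion-matrix counts.

def precision(tp, fp):
    """Of everything predicted positive, what fraction was truly positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything actually positive, what fraction did we catch?"""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: 80 TP, 20 FP, 10 FN, 890 TN
print(precision(80, 20))          # 0.8
print(accuracy(80, 890, 20, 10))  # 0.97
```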
Type I vs Type II
Type I = False positive
Type II = False negative
Cross-validation
Repeatedly training on one portion of the data and testing on the held-out portion (e.g. k-fold), so every observation is used for both training and evaluation
Decision tree pruning
Branches that have lower predictive power are removed in order to reduce complexity
F1 score
Harmonic mean of precision and recall: 2 * P * R / (P + R)
1 = best, 0 = worst
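A quick sketch of why the harmonic mean matters, with made-up precision/recall values:

```python
# F1 = harmonic mean of precision and recall; it punishes imbalance,
# so one very low component drags the whole score down.
def f1_score(p, r):
    return 2 * p * r / (p + r)

print(f1_score(0.8, 0.8))  # 0.8 – balanced
print(f1_score(1.0, 0.1))  # ~0.18 – perfect precision can't rescue poor recall
```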
How to avoid overfitting
Keep models simple
Use regularization (Lasso / Ridge)
Use cross-validation
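A minimal sketch of L2 (ridge) regularization via its closed-form solution, using NumPy; the data and alpha values are made up:

```python
import numpy as np

# Ridge regression closed form: w = (X^T X + alpha * I)^(-1) X^T y.
# A larger alpha shrinks the weights toward 0, keeping the model
# simpler and less prone to overfitting.

def ridge_fit(X, y, alpha):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_small = ridge_fit(X, y, alpha=0.01)   # close to the true weights
w_large = ridge_fit(X, y, alpha=100.0)  # shrunk toward zero
print(w_small, w_large)
```

Lasso (L1) has no closed form but works the same way conceptually, driving some weights all the way to zero.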
Examples of ensemble
Decision Tree + Boosting
How would you handle missing or incomplete data?
Delete the row, or replace the missing value with 0, another constant, a summary statistic (mean or median), or a predicted value
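Those options in plain Python (in practice pandas' dropna()/fillna() cover the same ground); the values are illustrative:

```python
# Simple imputation strategies for missing values (None here).
from statistics import mean, median

values = [3, None, 7, 5, None, 9]
observed = [v for v in values if v is not None]

dropped = observed                                       # option 1: delete
zero_filled = [0 if v is None else v for v in values]    # option 2: constant
mean_filled = [mean(observed) if v is None else v for v in values]
median_filled = [median(observed) if v is None else v for v in values]
print(median_filled)  # [3, 6.0, 7, 5, 6.0, 9]
```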
Write pseudocode for linear regression
1) Define cost function (total squared error) to minimize
2) Initialize gradient descent
3) Iterate on gradient descent based on alpha (learning rate) and number of iterations
4) Analyze results
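The steps above sketched for the simple univariate case (names and sample data are illustrative):

```python
# Gradient descent for simple linear regression y ~ w*x + b,
# minimizing the mean squared error cost.

def fit_linear(xs, ys, alpha=0.01, n_iters=5000):
    w, b = 0.0, 0.0                      # 2) initialize
    n = len(xs)
    for _ in range(n_iters):             # 3) iterate
        preds = [w * x + b for x in xs]
        # gradients of the MSE cost (step 1) w.r.t. w and b
        dw = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
        db = (2 / n) * sum(p - y for p, y in zip(preds, ys))
        w -= alpha * dw
        b -= alpha * db
    return w, b                          # 4) analyze results

w, b = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])  # true line: y = 2x + 1
print(round(w, 2), round(b, 2))
```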
Why and how to normalize data
Ensure all data is on the same scale
z = (x - mean) / SD
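The z-score formula as a small function (sample values are made up):

```python
# z-score standardization: z = (x - mean) / SD, putting features on the same scale.
from statistics import mean, pstdev

def standardize(values):
    mu, sigma = mean(values), pstdev(values)  # population SD
    return [(v - mu) / sigma for v in values]

z = standardize([10, 20, 30, 40, 50])
print(z)  # centered at 0 with unit spread
```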
Data analysis process
1) Get access to data
2) EDA to get familiar with data + understand any issues
3) Determine how to prepare the data
4) Explore models to use
5) Execute, visualize, and inform
Difference between variance and covariance
Variance = how far values spread from their mean
Covariance = how two variables vary together (positive when they tend to move in the same direction)
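Both definitions computed directly (population versions; the data is illustrative):

```python
# Variance and covariance from their definitions.
def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def covariance(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]          # moves with xs, so covariance is positive
print(variance(xs))        # 1.25
print(covariance(xs, ys))  # 2.5
```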
Elements of CLT
With a large enough sample size, the distribution of the sample mean approaches a normal distribution (mean = population mean, SD = population SD / sqrt(n)), regardless of the population's distribution – so you can use z-scores to get probabilities
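A quick simulation sketch of the idea (sample sizes and seed are arbitrary):

```python
# CLT by simulation: means of samples from a uniform distribution cluster
# around the population mean (0.5), even though uniform isn't normal.
import random
from statistics import mean, stdev

random.seed(42)
sample_means = [mean(random.random() for _ in range(30)) for _ in range(2000)]

print(mean(sample_means))   # ~0.5 (population mean)
print(stdev(sample_means))  # ~ population SD / sqrt(n) = 0.2887 / sqrt(30) ~ 0.053
```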
Conditions for linear regression
- There is a linear relationship between the dependent variables and the regressors
- The errors or residuals of the data are normally distributed and independent from each other
- There is minimal multicollinearity between explanatory variables
- Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variables
Python data structures and which are mutable?
Lists, Dictionaries, Sets, Strings, Tuples
Lists, dictionaries, and sets are mutable; strings and tuples are immutable (along with frozensets, numerics, and booleans)
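A two-line demonstration of the difference:

```python
# Mutable vs immutable: lists can change in place, tuples cannot.
nums_list = [1, 2, 3]
nums_list.append(4)          # fine – lists are mutable
print(nums_list)             # [1, 2, 3, 4]

nums_tuple = (1, 2, 3)
try:
    nums_tuple[0] = 99       # tuples don't support item assignment
except TypeError as e:
    print("tuples are immutable:", e)
```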
How best to split up test vs train
Use train, test, and validation – validation allows you to fine tune parameters before running final test
80:20 typically appropriate but depends on the data you have
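A sketch of a shuffled train/validation/test split (80:10:10 here; the fractions and helper name are illustrative):

```python
import random

# Shuffle, then slice into train / validation / test portions.
def split_data(rows, train_frac=0.8, val_frac=0.1, seed=0):
    rows = rows[:]                       # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

train, val, test = split_data(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```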
Regression algorithms
Linear regression
Regression trees
Neural Network
Type of classification algorithms
Support Vector Machines
Decision Trees
KNN
KNN vs SVM
KNN classifies a point by the labels of its k nearest neighbors (distance/similarity based)
SVM finds the hyperplane that maximizes the margin between classes
Git commands
git pull
git add file/folder
git commit -m "comment"
git push
explain neural networks
Loosely modeled on how the brain works
Input layer: Info comes in to an input layer
Hidden layer: n hidden layers that apply weighted computations to extract patterns from the input
Output layer: main outcome
graph database
neo4j = graph db, technically NoSQL, uses the Cypher query language
Neptune (AWS) supports SPARQL and Gremlin
stores in shape of graph/network (nodes and edges)
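The nodes-and-edges shape modeled with a plain Python adjacency list (names are made up; real graph DBs add indexing, persistence, and traversal languages on top):

```python
# Nodes are keys; edges are the lists of neighboring nodes.
graph = {
    "Alice": ["Bob", "Carol"],   # e.g. Alice KNOWS Bob, Alice KNOWS Carol
    "Bob": ["Carol"],
    "Carol": [],
}

def neighbors(graph, node):
    """One-hop traversal – the basic graph-DB query."""
    return graph.get(node, [])

print(neighbors(graph, "Alice"))  # ['Bob', 'Carol']
```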