Interview Prep Flashcards

1
Q

Bias-Variance Tradeoff

A

Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm. High bias can lead the model to underfit your data, making it hard to generalize from the training set to the test set.

Variance is error due to too much complexity in the learning algorithm. High variance makes the model highly sensitive to noise in the training data, so it overfits and performs worse on the test set.

2
Q

Supervised vs unsupervised learning

A

Supervised = learns from labeled data (inputs paired with known outputs)

Unsupervised = learns from unlabeled data (finds structure such as clusters)

3
Q

KNN vs k-means clustering

A

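KNN = supervised classification algorithm; it needs labeled data, and k is the number of nearest neighbors that vote on the label

K-means clustering is unsupervised; it needs no labels, and k is the number of clusters to find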
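
To make the contrast concrete, here is a minimal pure-Python sketch of both algorithms. The function names, the toy points, and the naive "first k points" initialization are illustrative assumptions, not a reference implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: vote among the labels of the k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def kmeans(points, k=2, iterations=10):
    """Unsupervised: no labels; alternate cluster assignment and centroid update."""
    centroids = list(points[:k])  # naive initialization (assumption for the sketch)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "b", "b"]

predicted = knn_predict(points, labels, (1.1, 0.9), k=3)  # uses the labels
centroids, clusters = kmeans(points, k=2)                 # ignores the labels
```

Note that `knn_predict` cannot run without `labels`, while `kmeans` never sees them.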

4
Q

ROC Curve

A

Receiver operating characteristic -> plot of the true positive rate against the false positive rate at different classification thresholds
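
A rough sketch of how the points on an ROC curve can be computed by sweeping a threshold over model scores. The function name and toy data are assumptions, and the sketch assumes both classes appear in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Toy example: the two positives score higher than the two negatives.
curve = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```

A model that ranks every positive above every negative, as in this toy data, reaches a true positive rate of 1.0 before any false positives appear.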

5
Q

Precision vs. recall vs. accuracy

A

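Examples: email spam, cancer screening

Precision = TP / predicted positives = TP / (TP + FP) (correctly predicted cancer patients / all predicted cancer patients)

Recall = TP / real positives = TP / (TP + FN) (correctly predicted cancer patients / (correctly predicted cancer patients + cancer patients not predicted))

Accuracy = (TP + TN) / all predictions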
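
The three metrics can be sketched from the confusion counts as follows. The function name and the toy labels are assumptions; 1 marks the positive class (e.g., spam or cancer).

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute precision, recall, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


# 3 real positives, 5 real negatives; the model flags 3 cases, 2 of them correctly.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
```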

6
Q

Type I vs Type II

A

Type I = False positive (rejecting a true null hypothesis)

Type II = False negative (failing to reject a false null hypothesis)

7
Q

Cross-validation

A

Repeatedly holding out a different part of the data to test the model on while training on the rest (e.g., k-fold), then averaging the scores
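
One common form is k-fold cross-validation. The sketch below only generates the index splits; the function name is an assumption, and shuffling and model fitting are left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) so each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(5, 2))
```

In practice you would fit the model on each `train_idx`, score it on the matching `test_idx`, and average the scores.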

8
Q

Decision tree pruning

A

Branches that have lower predictive power are removed in order to reduce complexity

9
Q

F1 score

A

Harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall)

1 = best
0 = worst
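
F1 is usually computed as the harmonic mean of precision and recall. A small sketch (the function name is an assumption):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


perfect = f1_score(1.0, 1.0)
lopsided = f1_score(0.5, 1.0)
```

The harmonic mean is pulled toward the lower of the two values, so `lopsided` comes out below the arithmetic average of 0.75.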
10
Q

How to avoid overfitting

A

Keep models simple

Use regularization (Lasso / Ridge)

Use cross-validation
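
As an illustration of how regularization keeps a model simple, here is a sketch of Ridge (L2) for the simplest possible case: one feature and no intercept. The function name and the data are assumptions; the formula follows from minimizing sum((y - w*x)^2) + lam * w^2.

```python
def ridge_slope(xs, ys, lam):
    """Slope w of y ~ w * x minimizing squared error plus the penalty lam * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularized = ridge_slope(xs, ys, lam=0.0)   # ordinary least squares
regularized = ridge_slope(xs, ys, lam=14.0)    # penalty shrinks the slope toward 0
```

A larger `lam` gives a smaller slope, which trades a little bias for lower variance.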

11
Q

Examples of ensemble

A

Random forest (many decision trees combined by bagging)

Boosted decision trees (e.g., AdaBoost, gradient boosting)

12
Q

How would you handle missing or incomplete data?

A

Delete the row, or replace the missing value with 0, another value (mean or median), or a model prediction
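
These options can be sketched in plain Python, using `None` to mark a missing value. The helper names and the `strategy` argument are assumptions for illustration; prediction-based imputation is not shown.

```python
from statistics import mean, median


def drop_missing(rows):
    """Option 1: delete any row that contains a missing value."""
    return [row for row in rows if None not in row]


def impute(values, strategy="mean"):
    """Option 2: replace missing values with 0, the mean, or the median."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


rows = [(1, 2.0), (2, None), (3, 4.0)]
complete_rows = drop_missing(rows)
filled = impute([1.0, None, 3.0], strategy="mean")
```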

13
Q

Write pseudocode for linear regression

A

1) Define cost function (total squared error) to minimize
2) Initialize the parameters (weight and intercept) for gradient descent
3) Iterate on gradient descent based on alpha (learning rate) and number of iterations
4) Analyze results
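
The steps above can be sketched for a single feature as follows. The function name, the default `alpha` and `iterations`, and the toy data are assumptions; the cost used here is the mean squared error, which is the total squared error divided by the number of samples.

```python
def fit_linear_regression(xs, ys, alpha=0.05, iterations=2000):
    """Fit y ~ w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # step 2: initialize the parameters
    n = len(xs)
    for _ in range(iterations):  # step 3: iterate
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of (1/n) * sum(error^2) with respect to w and b.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the fit should recover w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear_regression(xs, ys)
```

If `alpha` is too large the updates diverge; if it is too small, more iterations are needed.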

14
Q

Why and how to normalize data

A

Ensure all data is on the same scale

z = (x - mean) / SD
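
A short sketch of this z-score scaling using the standard library. The function name is an assumption, the population standard deviation is used, and the values must not all be identical (that would make the SD zero).

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1: z = (x - mean) / SD."""
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / sd for v in values]


scaled = standardize([2.0, 4.0, 6.0])
```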

15
Q

Data analysis process

A

1) Get access to data
2) EDA to get familiar with data + understand any issues
3) Determine how to prepare the data
4) Explore models to use
5) Execute, visualize, and inform

16
Q

Difference between variance and covariance

A

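Variance = how far the values of one variable spread out from their mean

Covariance = how two variables move together relative to their means (positive if they tend to rise together, negative if one rises as the other falls)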
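
Both quantities can be sketched directly from their definitions. These are the population versions (dividing by n); the function names are assumptions.

```python
def variance(xs):
    """Average squared distance of the values from their mean."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def covariance(xs, ys):
    """Average product of the two variables' deviations from their own means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


spread = variance([1, 2, 3, 4])
together = covariance([1, 2, 3], [2, 4, 6])   # rise together: positive
opposite = covariance([1, 2, 3], [6, 4, 2])   # move in opposite directions: negative
```

The covariance of a variable with itself is its variance.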

17
Q

Elements of CLT

A

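With a large enough sample size, the distribution of the sample mean is approximately normal, centered on the population mean with SD = population SD / sqrt(n), whatever the shape of the population (given finite variance); you can then use a z-score to get probabilities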
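
A quick simulation can illustrate the theorem. This sketch draws from a skewed exponential population with mean 1 and SD 1; the function name, the seed, and the sample sizes are assumptions.

```python
import math
import random
from statistics import mean, pstdev


def sample_means(draw, sample_size, n_samples, seed=0):
    """Return the means of n_samples independent samples of size sample_size."""
    rng = random.Random(seed)
    return [
        mean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


# Exponential(1) is far from normal, yet its sample means cluster around 1.
means = sample_means(lambda rng: rng.expovariate(1.0), sample_size=50, n_samples=2000)

center = mean(means)                 # should be close to the population mean, 1
spread = pstdev(means)               # should be close to 1 / sqrt(50)
expected_spread = 1 / math.sqrt(50)
```

A histogram of `means` would look roughly bell-shaped even though the individual draws are heavily skewed.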

18
Q

Conditions for linear regression

A
  1. There is a linear relationship between the dependent variable and the regressors
  2. The errors (residuals) are normally distributed and independent of each other
  3. There is minimal multicollinearity between explanatory variables
  4. Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variables
19
Q

Python data structures and which are mutable?

A

Lists, Dictionaries, Sets, Strings, Tuples, Boolean, Numeric

Lists, dictionaries, and sets are mutable; tuples are immutable (along with frozensets, strings, booleans, and numbers)
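
A small demonstration of the difference. The variable names are illustrative only.

```python
# Mutable: a list can be changed in place, and every alias sees the change.
nums = [1, 2, 3]
alias = nums
alias.append(4)

# Immutable: item assignment on a tuple raises TypeError.
point = (1, 2)
try:
    point[0] = 10
    tuple_mutated = True
except TypeError:
    tuple_mutated = False

# Immutable: string methods return a new string and leave the original alone.
text = "abc"
upper = text.upper()

# frozenset is the immutable counterpart of set: it has no add() method.
frozen = frozenset({1, 2})
frozen_has_add = hasattr(frozen, "add")
```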

20
Q

How best to split up test vs train

A

Use train, validation, and test sets; the validation set lets you fine-tune hyperparameters before running the final test

An 80:20 train:test split is typically appropriate, but it depends on the data you have
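
A minimal sketch of a shuffled three-way split. The function name, the seed, and the 10% validation fraction are assumptions; the 20% test fraction follows the 80:20 rule of thumb.

```python
import random


def train_val_test_split(items, val_fraction=0.1, test_fraction=0.2, seed=42):
    """Shuffle a copy of items and cut it into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
```

Tune on `val` and touch `test` only once, for the final evaluation.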

21
Q

Regression algorithms

A

Linear regression
Regression trees
Neural Network

22
Q

Type of classification algorithms

A

Support Vector Machines
Decision Trees
KNN

23
Q

KNN vs SVM

A

KNN calculates similarity (distance) scores and labels a point by its k nearest neighbors

SVM - finds the hyperplane that separates the classes with the largest margin

24
Q

Git commands

A

git pull
git add file/folder
git commit -m "comment"
git push

25
Q

explain neural networks

A

Loosely modeled on how the brain works

Input layer: info comes in
Hidden layers: n hidden layers that perform computations to make sense of pieces of info
Output layer: main outcome
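
The layer structure can be sketched as a forward pass through fully connected layers. This is an illustration only: the function names, the sigmoid activation, and the example weights are assumptions, and training (backpropagation) is not shown.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))


def dense(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of the inputs plus a bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the inputs through each (weights, biases) layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# 2 inputs -> hidden layer of 3 neurons -> 1 output neuron (arbitrary example weights).
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
output = forward([1.0, 2.0], layers)
```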

26
Q

graph database

A

neo4j = graph db, technically NoSQL, uses the Cypher query language
Neptune (AWS) = graph db, queried with SPARQL or Gremlin
stores data in the shape of a graph/network (nodes and edges)
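
To picture the nodes-and-edges model, here is a plain-Python sketch. It is not the Neo4j or Neptune API; the node names, relationship types, and the `neighbors` helper are all hypothetical.

```python
# Nodes carry properties; edges are (source, relationship, destination) triples.
nodes = {
    "alice": {"label": "Person"},
    "bob": {"label": "Person"},
    "acme": {"label": "Company"},
}
edges = [
    ("alice", "KNOWS", "bob"),
    ("alice", "WORKS_AT", "acme"),
]


def neighbors(node_id, relationship=None):
    """Follow outgoing edges from node_id, optionally filtered by relationship type."""
    return [
        dst
        for src, rel, dst in edges
        if src == node_id and (relationship is None or rel == relationship)
    ]


everyone = neighbors("alice")
friends = neighbors("alice", "KNOWS")
```

A graph database runs this kind of traversal natively through its query language instead of through joins across tables.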