Interview Prep Flashcards
Bias-Variance Tradeoff
Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm. High bias can cause the model to underfit the data, making it hard to generalize from the training set to the test set.
Variance is error due to too much complexity in the learning algorithm. High variance makes the model highly sensitive to noise in the training data, leading to overfitting.
Supervised vs unsupervised learning
Supervised = learns from labeled data (inputs paired with known outputs)
Unsupervised = finds structure in unlabeled data
KNN vs k-means clustering
KNN = supervised classification algorithm (predicts a label from the k nearest labeled neighbors)
K-means clustering is unsupervised (partitions unlabeled points into k clusters)
ROC Curve
Receiver operating characteristic -> plot of true positive rate vs. false positive rate across classification thresholds
Precision vs. recall vs. accuracy
Example: cancer screening
Precision = TP / predicted positive (actual cancer patients among those predicted / all predicted cancer patients)
Recall = TP / actual positive (actual cancer patients predicted / all actual cancer patients, including those missed)
Accuracy = (TP + TN) / all predictions
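The definitions above as a minimal Python sketch; the counts are made up for the cancer-screening example:

```python
# Illustrative: precision, recall, and accuracy from raw confusion-matrix counts.

def precision(tp, fp):
    """Of everything predicted positive, what fraction was truly positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything actually positive, what fraction did we catch?"""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: 80 TP, 20 FP, 10 FN, 890 TN
print(precision(80, 20))          # 0.8
print(accuracy(80, 890, 20, 10))  # 0.97
```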
Type I vs Type II
Type I = False positive
Type II = False negative
Cross-validation
Repeatedly training on one portion of the data and testing on the held-out portion (e.g. k-fold), so every observation is used for both training and evaluation
Decision tree pruning
Branches that have lower predictive power are removed in order to reduce complexity
F1 score
Harmonic mean of precision and recall: 2 * P * R / (P + R)
1 = best, 0 = worst
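A quick sketch of why the harmonic mean matters, with made-up precision/recall values:

```python
# F1 = harmonic mean of precision and recall; it punishes imbalance,
# so one very low component drags the whole score down.
def f1_score(p, r):
    return 2 * p * r / (p + r)

print(f1_score(0.8, 0.8))  # 0.8 – balanced
print(f1_score(1.0, 0.1))  # ~0.18 – perfect precision can't rescue poor recall
```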
How to avoid overfitting
Keep models simple
Use regularization (Lasso / Ridge)
Use cross-validation
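A minimal sketch of L2 (ridge) regularization via its closed-form solution, using NumPy; the data and alpha values are made up:

```python
import numpy as np

# Ridge regression closed form: w = (X^T X + alpha * I)^(-1) X^T y.
# A larger alpha shrinks the weights toward 0, keeping the model
# simpler and less prone to overfitting.

def ridge_fit(X, y, alpha):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_small = ridge_fit(X, y, alpha=0.01)   # close to the true weights
w_large = ridge_fit(X, y, alpha=100.0)  # shrunk toward zero
print(w_small, w_large)
```

Lasso (L1) has no closed form but works the same way conceptually, driving some weights all the way to zero.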
Examples of ensemble
Decision Tree + Boosting
How would you handle missing or incomplete data?
Delete the row, or replace the missing value with 0, another constant, a summary statistic (mean or median), or a predicted value
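Those options in plain Python (in practice pandas' dropna()/fillna() cover the same ground); the values are illustrative:

```python
# Simple imputation strategies for missing values (None here).
from statistics import mean, median

values = [3, None, 7, 5, None, 9]
observed = [v for v in values if v is not None]

dropped = observed                                       # option 1: delete
zero_filled = [0 if v is None else v for v in values]    # option 2: constant
mean_filled = [mean(observed) if v is None else v for v in values]
median_filled = [median(observed) if v is None else v for v in values]
print(median_filled)  # [3, 6.0, 7, 5, 6.0, 9]
```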
Write pseudocode for linear regression
1) Define cost function (total squared error) to minimize
2) Initialize gradient descent
3) Iterate on gradient descent based on alpha (learning rate) and number of iterations
4) Analyze results
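The steps above sketched for the simple univariate case (names and sample data are illustrative):

```python
# Gradient descent for simple linear regression y ~ w*x + b,
# minimizing the mean squared error cost.

def fit_linear(xs, ys, alpha=0.01, n_iters=5000):
    w, b = 0.0, 0.0                      # 2) initialize
    n = len(xs)
    for _ in range(n_iters):             # 3) iterate
        preds = [w * x + b for x in xs]
        # gradients of the MSE cost (step 1) w.r.t. w and b
        dw = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
        db = (2 / n) * sum(p - y for p, y in zip(preds, ys))
        w -= alpha * dw
        b -= alpha * db
    return w, b                          # 4) analyze results

w, b = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])  # true line: y = 2x + 1
print(round(w, 2), round(b, 2))
```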
Why and how to normalize data
Ensure all data is on the same scale
z = (x - mean) / SD
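The z-score formula as a small function (sample values are made up):

```python
# z-score standardization: z = (x - mean) / SD, putting features on the same scale.
from statistics import mean, pstdev

def standardize(values):
    mu, sigma = mean(values), pstdev(values)  # population SD
    return [(v - mu) / sigma for v in values]

z = standardize([10, 20, 30, 40, 50])
print(z)  # centered at 0 with unit spread
```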
Data analysis process
1) Get access to data
2) EDA to get familiar with data + understand any issues
3) Determine how to prepare the data
4) Explore models to use
5) Execute, visualize, and inform
Difference between variance and covariance
Variance = how far values spread from their mean
Covariance = how two variables vary together (positive when they tend to move in the same direction)
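Both definitions computed directly (population versions; the data is illustrative):

```python
# Variance and covariance from their definitions.
def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def covariance(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]          # moves with xs, so covariance is positive
print(variance(xs))        # 1.25
print(covariance(xs, ys))  # 2.5
```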
Elements of CLT
With a large enough sample size, the distribution of the sample mean approaches a normal distribution (mean = population mean, SD = population SD / sqrt(n)), regardless of the population's distribution – so you can use z-scores to get probabilities
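A quick simulation sketch of the idea (sample sizes and seed are arbitrary):

```python
# CLT by simulation: means of samples from a uniform distribution cluster
# around the population mean (0.5), even though uniform isn't normal.
import random
from statistics import mean, stdev

random.seed(42)
sample_means = [mean(random.random() for _ in range(30)) for _ in range(2000)]

print(mean(sample_means))   # ~0.5 (population mean)
print(stdev(sample_means))  # ~ population SD / sqrt(n) = 0.2887 / sqrt(30) ~ 0.053
```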
Conditions for linear regression
- There is a linear relationship between the dependent variables and the regressors
- The errors or residuals of the data are normally distributed and independent from each other
- There is minimal multicollinearity between explanatory variables
- Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variables
Python data structures and which are mutable?
Lists, Dictionaries, Sets, Strings, Tuples
Lists, dictionaries, and sets are mutable; strings and tuples are immutable (along with frozensets, numerics, and booleans)
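A two-line demonstration of the difference:

```python
# Mutable vs immutable: lists can change in place, tuples cannot.
nums_list = [1, 2, 3]
nums_list.append(4)          # fine – lists are mutable
print(nums_list)             # [1, 2, 3, 4]

nums_tuple = (1, 2, 3)
try:
    nums_tuple[0] = 99       # tuples don't support item assignment
except TypeError as e:
    print("tuples are immutable:", e)
```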
How best to split up test vs train
Use train, test, and validation – validation allows you to fine tune parameters before running final test
80:20 typically appropriate but depends on the data you have
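A sketch of a shuffled train/validation/test split (80:10:10 here; the fractions and helper name are illustrative):

```python
import random

# Shuffle, then slice into train / validation / test portions.
def split_data(rows, train_frac=0.8, val_frac=0.1, seed=0):
    rows = rows[:]                       # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

train, val, test = split_data(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```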
Regression algorithms
Linear regression
Regression trees
Neural Network
Type of classification algorithms
Support Vector Machines
Decision Trees
KNN
KNN vs SVM
KNN classifies a point by the labels of its k nearest neighbors (distance/similarity based)
SVM finds the hyperplane that maximizes the margin between classes
Git commands
git pull
git add file/folder
git commit -m "comment"
git push
explain neural networks
Loosely modeled on how the brain works
Input layer: Info comes in to an input layer
Hidden layer: n hidden layers that apply weighted computations to extract patterns from the input
Output layer: main outcome
graph database
neo4j = graph db, technically NoSQL, uses the Cypher query language
Neptune (AWS) supports SPARQL and Gremlin
stores in shape of graph/network (nodes and edges)
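The nodes-and-edges shape modeled with a plain Python adjacency list (names are made up; real graph DBs add indexing, persistence, and traversal languages on top):

```python
# Nodes are keys; edges are the lists of neighboring nodes.
graph = {
    "Alice": ["Bob", "Carol"],   # e.g. Alice KNOWS Bob, Alice KNOWS Carol
    "Bob": ["Carol"],
    "Carol": [],
}

def neighbors(graph, node):
    """One-hop traversal – the basic graph-DB query."""
    return graph.get(node, [])

print(neighbors(graph, "Alice"))  # ['Bob', 'Carol']
```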