Interview Flashcards
What is linear regression?
It is a linear approach to modelling the relationship between one or more explanatory variables and a response variable. With one explanatory variable it is simple linear regression; with several it is multiple linear regression. Example: number of COVID tests against number of cases.
What are the assumptions of linear regression?
1) There should be a linear relationship between the explanatory and response variables. Check with a scatter plot: the points should follow a roughly straight line, unlike a non-linear relationship such as time against actual COVID cases.
2) The explanatory variables should not exhibit multi-collinearity. Check the variance inflation factor (VIF) and aim for no more than 2.5; a VIF of 2.5 means the variance of that coefficient estimate is inflated by a factor of 2.5.
3) Homoscedasticity: the errors have constant variance. Check by plotting the residuals and looking for an even spread.
4) For any fixed value of the explanatory variables, the response is normally distributed.
What metrics can you use to evaluate linear regression models?
R2 - Proportion of the variation in the response explained by the model
Adjusted R2 - penalises additional parameters to discourage overfitting - relative fit
Mean Squared Error - average of the squared difference between observed and predicted values - absolute measure of model fit
Root Mean Squared Error - square root of the MSE - typical distance between actual and predicted values, in the units of the response
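The four metrics above can be sketched in plain Python. This is a minimal illustration, not a library implementation; the function name regression_metrics and the toy numbers are made up for the example.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return R2, adjusted R2, MSE and RMSE for a fitted regression."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Residual sum of squares: what the model fails to explain.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total sum of squares: what the null (mean-only) model fails to explain.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalises each extra explanatory variable.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mse = ss_res / n
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "rmse": math.sqrt(mse)}

# Toy data, invented for illustration.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 9.5, 10.5]
metrics = regression_metrics(y_true, y_pred, n_features=1)
print(metrics)
```

R2 and adjusted R2 are relative to the mean-only model, whereas MSE and RMSE only look at the size of the errors.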
What is the difference between an absolute fit and a relative fit?
Relative fit compares the fitted model to the null model (e.g. R2); absolute fit just looks at how close the predictions are to the data (e.g. RMSE)
What is overfitting?
It’s where we’ve fit our model to the training data too well, noise included, so it struggles to generalise to other data. Example with prime ministers. Observed as test accuracy being lower than training accuracy.
How do you deal with overfitting?
Cross-validation, more data (can help find the signal), removing irrelevant input features, early stopping (monitor validation performance over iterations, then stop), regularisation techniques (pruning, dropout, adding a penalty term to the cost function), ensembling
https://elitedatascience.com/overfitting-in-machine-learning
How do you deal with underfitting?
Increase the complexity of the model, e.g. increase the number of features. More data won’t help.
How do you evaluate the performance of a classifier?
A confusion matrix is a good place to start. Accuracy is the number of correct predictions divided by the total number of predictions. Compare it with the null accuracy, which is what you would have got by always predicting the most frequent class.
https://www.ritchieng.com/machine-learning-evaluate-classification-model/
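A small sketch of the confusion matrix counts and the null accuracy comparison, in plain Python. The helper names and the labels are invented for illustration; class 1 is assumed to be the positive class.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def null_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)

# Invented labels: 7 negatives, 3 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)
print(accuracy(y_true, y_pred), null_accuracy(y_true))
```

Here the classifier's accuracy is no better than the null accuracy, which is why accuracy alone can mislead on uneven classes.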
What is a type I error?
False positive (rejecting a null hypothesis that is true)
What is a type II error?
False negative (failing to reject a null hypothesis that is false)
How does cross-validation work?
Split the data into k folds. Hold one fold out as the test set, train on the other k-1 folds and score on the held-out fold. Repeat so that each fold is held out once, then average the k test scores.
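A minimal sketch of k-fold cross-validation in plain Python. It does not shuffle the data (in practice you usually would), and the "model" is a deliberately trivial mean predictor; every name here is made up for the example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_val_mse(ys, k):
    """Average held-out MSE of a model that always predicts the training mean."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train_idx) / len(train_idx)  # "fit"
        mse = sum((ys[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        scores.append(mse)
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))
ys = [2.0 * x for x in range(10)]
print([len(test) for _, test in folds])
print(cross_val_mse(ys, k=3))
```

Each observation lands in the test set exactly once, and the score that gets averaged is always computed on data the model did not train on.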
How to handle missing data?
Very circumstantial. Why is it missing? Is it random, or is there a reason for it not being there? How much of it is missing? The missingness itself could be useful, e.g. an internet outage. I hear it a lot: "you don’t want this, it’s full of missing data". Some algorithms, such as XGBoost, will deal with missing values themselves. Mean/median imputation for a continuous column; impute the most frequent value if categorical. K-NN imputation (computationally expensive).
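A sketch of mean, median and most-frequent imputation using only the standard library. Missing values are assumed to be encoded as None, and the columns are invented for the example.

```python
from statistics import mean, median, mode

def impute(column, strategy):
    """Replace None with the mean, median or most frequent observed value."""
    observed = [v for v in column if v is not None]
    fill_functions = {"mean": mean, "median": median, "most_frequent": mode}
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in column]

# Invented columns: one continuous, one categorical.
ages = [20, None, 40, 60, None]
cities = ["Leeds", "York", None, "Leeds"]

ages_mean = impute(ages, "mean")
ages_median = impute(ages, "median")
cities_filled = impute(cities, "most_frequent")
print(ages_mean, ages_median, cities_filled)
```

This only makes sense once you have decided the values are missing for no informative reason; otherwise a separate "was missing" flag may carry signal.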
What are the main types of machine learning?
Supervised, Unsupervised, semi-supervised, reinforcement
What is precision vs recall?
Precision is TP/(TP+FP) - true positives / predicted positives (percentage of results which are relevant): "What proportion of positive identifications was actually correct?"
Recall is TP/(TP+FN) - true positives / actual positives (percentage of total relevant results that were found): "What proportion of actual positives was identified correctly?"
How is F1 defined?
F1 = 2 / (1/precision + 1/recall) = 2 * precision * recall / (precision + recall), i.e. the harmonic mean of precision and recall
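A quick worked example of precision, recall and F1 from confusion matrix counts. The counts are invented; F1 is computed as the harmonic mean of precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Invented counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 4))
```

The harmonic mean sits closer to the smaller of the two values, so a model cannot get a high F1 by doing well on only one of precision or recall.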
How is deep learning different to machine learning?
Artificial Intelligence is a technique which enables machines to mimic human behavior.
Machine Learning is a subset of AI which uses statistical methods to enable machines to improve with experience.
Deep learning is a subset of ML which makes the computation of multi-layer neural networks feasible. It uses neural networks to simulate human-like decision making.
Deep learning does much of the feature engineering for you. It generally performs poorly with small amounts of data but excels with large amounts of data.
What open source datasets have you used?
ONS postcode centroids and local authority shape files
What is selection bias?
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample.
Example: selecting phone calls only, not web traffic etc.
How do you go about influencing technical and non-technical audiences?
Use analogies for non-technical audiences, in terms they understand. Keep in mind the purpose of why you’re telling them something: what do they need to get from this? Storytelling, an example etc.
What is ANOVA?
Analysis of variance asks whether the samples come from populations with different means. A one-way ANOVA accounts for one factor (2+ levels); a two-way ANOVA investigates two factors (each 2+ levels) at the same time.
How does a one-way ANOVA work?
Hypothesis test with a single factor (categorical variable) that compares 3 or more sample means (if 2, use a t-test). Null hypothesis: no difference in means. Alternative hypothesis: at least one mean is different. Compute the variance within samples and the variance between sample means, then produce the F statistic from the ratio "between group variability" / "within group variability".
- Example: does age, sex or income have an effect on whether someone becomes prime minister?
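The F statistic described above can be worked through in plain Python. This sketch computes only the statistic, not the p-value (which needs the F distribution); the three groups are invented for illustration.

```python
def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    k = len(groups)                       # number of factor levels
    n = sum(len(g) for g in groups)       # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variability of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Invented data: three levels of one factor.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large F means the group means differ by more than the spread inside the groups would suggest.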
What are the assumptions of ANOVA?
The responses for each factor level have a normal population distribution.
These distributions have the same variance.
The data are independent.
Explain K-Means
We can select K using our knowledge or empirically. Initialise K points at random positions in the feature space; these are known as the cluster centroids. The Euclidean distance is calculated from each observation to every centroid, and the observation is assigned to the closest centroid. Each centroid is then moved to the mean of its assigned observations, and the two steps repeat until the assignments stop changing.
Inertia is the sum of squared distances from each sample to its cluster centroid - lower values of inertia are better
Do an elbow plot of inertia vs number of clusters
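A minimal K-means sketch in plain Python, to make the assign/update loop and the inertia calculation concrete. As an assumption for simplicity, the initial centroids are K randomly chosen observations rather than arbitrary random positions; all names and data are invented.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k observations
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # Move the centroid to the mean of its members.
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    labels = assign(points, centroids)
    inertia = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return centroids, labels, inertia

# Two invented, well-separated blobs.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels, inertia = k_means(points, k=2)
print(labels, round(inertia, 4))
```

Running this for several values of k and plotting the inertia against k gives the elbow plot.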
How to choose K?
Elbow plot (K vs total within-cluster sum of squares): look for the characteristic elbow. Alternatively, silhouette analysis.
Give examples of where a false negative is more important than a false positive
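COVID-19 testing: a false negative means an infected person is told they are clear and can go on to infect others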
What is logistic regression?
A classification model that relates the input variables to the probability of a binary outcome, by passing a linear combination of the inputs through the logistic (sigmoid) function
What is the null hypothesis and how do we state it?
In a statistical test, the hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error.
In short: the observed patterns are due to random chance.
What is and how do you deal with heteroskedasticity?
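Uneven distribution of errors: the variance of the residuals is not constant across the range of the explanatory variables. Common remedies are transforming the response (e.g. log), weighted least squares, or robust standard errors.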
What is statistical power and how do you calculate it?
Power is the probability of not making a type II error, i.e. of correctly rejecting a false null hypothesis (1 - beta).
To increase power:
1. Increase the effect size (the difference between the null and alternative values) to be detected
2. Increase the sample size(s)
3. Decrease the variability in the sample(s)
4. Increase the significance level (alpha) of the test
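One way to actually calculate power, as a sketch: for a one-sided, one-sample z-test with known standard deviation (an assumption chosen because it has a closed form), power = 1 - Phi(z_alpha - d * sqrt(n)), where d is the standardised effect size (difference / sigma). The function name is made up.

```python
from statistics import NormalDist

def z_test_power(effect_size, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test.

    effect_size is (alternative mean - null mean) / sigma, so reducing the
    variability sigma increases effect_size for the same raw difference.
    """
    standard_normal = NormalDist()
    z_crit = standard_normal.inv_cdf(1 - alpha)          # rejection threshold
    return 1 - standard_normal.cdf(z_crit - effect_size * n ** 0.5)

base = z_test_power(effect_size=0.5, n=20)
bigger_effect = z_test_power(effect_size=0.8, n=20)
bigger_sample = z_test_power(effect_size=0.5, n=40)
bigger_alpha = z_test_power(effect_size=0.5, n=20, alpha=0.10)
print(round(base, 3), round(bigger_effect, 3), round(bigger_sample, 3), round(bigger_alpha, 3))
```

Each of the four levers in the list above raises the printed power relative to the base case; lower variability shows up here as a larger standardised effect size.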
How do you find the correlation between a categorical variable and a continuous variable?
You can’t; at least, not if the categorical variable has more than two levels. If it has two levels, you can use point biserial correlation.
But, with a categorical variable that has three or more levels, the notion of correlation breaks down. Correlation is a measure of the linear relationship between two variables. That makes no sense with a categorical variable.
There are ways to measure the relationship between a continuous and a categorical variable; probably the closest to correlation is a log linear model. Regression imposes a dependent and an independent variable, which correlation does not.
What is a p-value?
In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct.
It is not the likelihood that the null hypothesis is correct
Explain the bias/variance tradeoff
High bias - underfitting, high variance - overfitting. Reducing one tends to increase the other, so there is a tradeoff and the middle ground is just right.
How do you deal with imbalance?
Use the right evaluation metrics (not just accuracy). Under- or oversample the data (the k-fold split must be done before oversampling, and only the training folds oversampled - otherwise copies of the same artificial points end up in the validation fold and we overfit to them).
If you care about the minority class, oversample it or undersample the majority class
Increase the cost of misclassifying the minority class
What is the difference between a box plot and a histogram?
Whilst both show the distribution of data, they communicate it differently. Histograms show us the shape of the distribution; boxplots show us the quartiles and the Tukey fences and are better for comparing multiple distributions side by side.
Compare logistic regression to random forest
Random forest doesn’t assume a linear relationship
Logistic regression is more explainable and scales better
Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model.
Adjusted R^2 - adding more variables never decreases the plain R2 value, so use the adjusted version, which penalises extra variables
Cross Validation
When would you use random forests vs SVM, and why?
Random forests allow you to determine the feature importance. SVMs don’t give you this directly.
Random forests are much quicker and simpler to build than an SVM.
For multi-class classification problems, SVMs require a one-vs-rest method, which is less scalable and more memory intensive.
What is the difference between union and union all in SQL?
UNION returns only distinct rows; UNION ALL keeps every row, including duplicates
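A small demonstration using Python's built-in sqlite3 module and an in-memory database. The tables a and b and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (2);
    INSERT INTO b VALUES (2), (3);
""")

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute(
    "SELECT x FROM a UNION SELECT x FROM b ORDER BY x"
).fetchall()

# UNION ALL keeps every row from both queries.
union_all_rows = conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x"
).fetchall()

conn.close()
print(union_rows)
print(union_all_rows)
```

UNION ALL is usually cheaper because the database does not have to de-duplicate the result.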
Why is dimension reduction important?
1) It reduces storage space
2) Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model
3) It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D
4) It avoids the curse of dimensionality
Why is Naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
It assumes the inputs are independent of each other given the class, which rarely holds in practice. Garden flavoured ice cream
What are the drawbacks of a linear model?
A linear model holds some strong assumptions that may not be true in application. It assumes a linear relationship, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity
A linear model can’t be used for discrete or binary outcomes.
You can’t vary the model flexibility of a linear model.
What is the significance of a Cost/Loss function?
It is the function telling us how badly our model maps X -> y
When should you use precision-recall curve over ROC?
When the dataset is imbalanced
https://www.quora.com/What-is-the-difference-between-a-ROC-curve-and-a-precision-recall-curve-When-should-I-use-each
Explain what resampling methods are and why they are useful. Also explain their limitations
Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling is a data-driven, not theory-driven, methodology which is based upon repeated sampling within the same sample.
Resampling refers to methods for doing one of these:
Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
Validating models by using random subsets (bootstrapping, cross validation)
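A sketch of bootstrapping: estimating the precision of a sample median by drawing with replacement from the data. It uses the simple percentile method (one of several ways to form a bootstrap interval); the data and the function name are invented.

```python
import random
from statistics import median

def bootstrap_ci(data, stat, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    # Each resample is the same size as the data, drawn with replacement.
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lower = estimates[int((1 - level) / 2 * n_resamples)]
    upper = estimates[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper

# Invented sample.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15]
lower, upper = bootstrap_ci(data, median)
print(median(data), (lower, upper))
```

No distributional formula for the median is needed, which is the appeal; the catch is that the resamples can only ever contain values already in the sample.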
Give an example of an unsupervised learning technique for continuous data
Dimensionality reduction
How can you deal with outliers?
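Ignore, remove, log transform… You might want to keep them depending on the business problem, e.g. cyber security, where the outliers are the interesting cases.
If I repeat a cluster analysis, will I get the same result?
Not necessarily. With random initialisation (e.g. K-means) it could find a local minimum, not the global one.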
What is the difference between a bar graph and a histogram?
A bar graph is for discrete data whereas a histogram is for continuous data
How does KNN work?
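Classifies a new point by majority vote among its K nearest training points (by distance). This implicitly creates a decision boundary.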
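A minimal K-nearest-neighbours classifier in plain Python, assuming Euclidean distance and a simple majority vote. The training points and labels are invented; math.dist needs Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among the k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Invented training data: two groups in 2D.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (7, 7)))
```

There is no training step: all the work happens at prediction time, which is why KNN gets slow on large datasets and why scaling the features matters.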
What are the residuals in linear regression?
Vertical distance between the fitted line and the observed points (observed value minus predicted value)
What is a cost function?
A measure of how badly our model maps x to y
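What is the difference between long and wide data?
Wide data has one column per feature (1 feature to 1 column, one row per subject). Long data stacks the values into a single column, with another column saying which feature each value belongs to.
What are parametric vs non-parametric models?
A parametric model is an ML model that captures all the information about its predictions in a finite, fixed number of parameters. A non-parametric model does not fix the number of parameters in advance; its complexity can grow with the data (e.g. KNN).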
What is meant by the term confidence interval?
Range of plausible values for an unknown parameter
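Draw a graph of precision vs accuracy. Which is more precise, a 95% confidence interval or a 99% one?
95% - it is narrower, so it is more precise. The 99% interval is wider: more confidence that it contains the parameter, but less precision.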
How do you calculate a z-score? (use for questions comparing two results with different means and SD)
z = (x - mu) / sigma
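A worked example of using z-scores to compare two results that come from distributions with different means and standard deviations. The exam scores are invented.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies above (or below) the mean."""
    return (x - mean) / sd

# Invented exam results: which score is better relative to its class?
maths_z = z_score(70, mean=60, sd=10)
physics_z = z_score(65, mean=50, sd=5)
print(maths_z, physics_z)
```

The physics mark is lower in raw terms but sits further above its class average, so it is the stronger result.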
What is a standard deviation?
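A measure of dispersion: the square root of the variance, i.e. roughly the typical distance of the values from the mean
How does hierarchical clustering work?
Agglomerative version: starts with every observation as its own cluster, then repeatedly merges the two nearest clusters until only one is left (or a chosen number of clusters is reached)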
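How would you use a chi square test for feature selection in machine learning?
It checks the independence of two variables, so test each categorical feature against the target and keep the features for which independence is rejected
The chi square test compares observed and expected counts across discrete categories
Why are outliers a problem?
They inflate the variance, so the standard error increases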
How does CNN work?
This will be your ‘favourite’
How does scaling work?
Juice analogy. Normalisation bounds the numbers to a range, e.g. 0-1; standardisation gives zero mean and a variance of 1.
Feature scaling is essential for machine learning algorithms that calculate distances between data points:
KNN
K-means
Principal component analysis
Whereas random forest (rules) and naive Bayes (weights) are unaffected by scaling
Feature scaling also speeds up gradient descent
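Both kinds of scaling in a few lines of plain Python. The heights are invented, and the population standard deviation (pstdev) is used so the standardised values have a variance of exactly 1.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Normalisation: map the values onto the range 0-1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Standardisation: zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Invented feature.
heights_cm = [150, 160, 170, 180, 190]
normalised = min_max_scale(heights_cm)
standardised = standardise(heights_cm)
print(normalised)
print([round(v, 3) for v in standardised])
```

Normalisation is sensitive to the minimum and maximum, so a single outlier squashes everything else; standardisation is less affected but does not bound the values.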
How do a one-sample and a two-sample t-test differ?
One-sample: is the mean of the sample different to a given value?
Two-sample: is the mean of one sample different to the mean of the other sample?
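The two test statistics side by side, as a sketch. Only the t statistics are computed (turning them into p-values needs the t distribution, which is not in the standard library). The two-sample version shown is Welch's, which does not assume equal variances; the samples are invented.

```python
from statistics import mean, variance

def one_sample_t(sample, mu0):
    """t statistic for 'is the sample mean different to mu0?'"""
    standard_error = (variance(sample) / len(sample)) ** 0.5
    return (mean(sample) - mu0) / standard_error

def two_sample_t(a, b):
    """Welch's t statistic for 'do the two samples have different means?'"""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error

# Invented samples.
treatment = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]

t_one = one_sample_t(treatment, mu0=5)
t_two = two_sample_t(treatment, control)
print(round(t_one, 4), round(t_two, 4))
```

The one-sample test treats the comparison value as fixed, so only one sample contributes uncertainty; in the two-sample test both do, which is why the standard error has two terms.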
Fisher’s exact test vs chi square test
Chi squared assumes a large sample size (the p-value is approximate)
Fisher’s test computes the p-value exactly, so it is preferred for small samples
What is a confounding variable?
A confounding variable, also called a confounder or confounding factor, is a third variable in a study examining a potential cause-and-effect relationship, one that influences both the supposed cause and the supposed effect
How do you deal with confounding?
Blocking: make sure equal proportions of the confounding variable are in the treatment and control groups
What is statistical significance vs effect size
Statistical significance is how certain we are that an effect happened, rather than being due to chance. The effect size is how much difference that effect makes. You can measure effect size using Cohen’s d.
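A short sketch of Cohen's d for two independent samples: the difference in means divided by the pooled standard deviation. The samples are invented.

```python
from statistics import mean, variance

def cohens_d(a, b):
    """Standardised difference between two sample means."""
    na, nb = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_variance = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_variance ** 0.5

# Invented samples.
treatment = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]
d = cohens_d(treatment, control)
print(round(d, 4))
```

Unlike a p-value, d does not grow just because the sample gets bigger, which is the point of reporting it alongside significance.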
What values can power take?
Between 0 and 1: at 0 the test never detects the effect, at 1 it always detects it. As power increases, the chance of a type II error decreases.
What is a q-q plot?
plotting the quantiles of a variable against theoretical quantiles of a normal distribution will give a straight line if the variable is normally distributed
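A sketch of the numbers behind a normal Q-Q plot, without the plotting itself. The plotting positions (i + 0.5) / n are one common convention and are an assumption here, as are the function name and the data.

```python
from statistics import NormalDist

def qq_points(sample):
    """Pair each ordered sample value with a standard normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Invented sample.
sample = [4.1, 5.0, 5.2, 5.9, 6.8]
points = qq_points(sample)
for theoretical_q, sample_q in points:
    print(round(theoretical_q, 3), sample_q)
```

Plotting the first column against the second and seeing roughly a straight line suggests the variable is close to normal; curvature at the ends points to skew or heavy tails.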
What is the problem of missing data in ML
It tends to introduce bias - skewing results and reducing accuracy
Compare ridge and lasso regression
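Both add a penalty on the size of the coefficients to the least-squares cost. Ridge penalises the sum of squared coefficients (L2) and shrinks them towards zero; lasso penalises the sum of absolute coefficients (L1) and can set some to exactly zero, so it also does feature selection. See the scikit-learn series.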