Interview Flashcards
What is linear regression?
It is a linear approach to modelling the relationship between explanatory variables and a response variable. With one explanatory variable it is simple linear regression; with several it is multiple linear regression. Example: modelling the number of COVID cases from the number of COVID tests.
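A minimal sketch of fitting a simple linear regression, assuming scikit-learn and made-up tests/cases numbers:

```python
# Minimal sketch: simple linear regression with scikit-learn.
# The tests/cases values below are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

tests = np.array([[100], [200], [300], [400], [500]])   # explanatory variable
cases = np.array([12, 25, 33, 48, 55])                  # response variable

model = LinearRegression().fit(tests, cases)
print(model.intercept_, model.coef_)   # fitted line: cases ~ intercept + coef * tests
print(model.predict([[600]]))          # predicted cases for 600 tests
```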
What are the assumptions of linear regression?
1) There should be a linear relationship between the explanatory and response variables - check with a scatter plot and look for a roughly straight-line trend, unlike clearly non-linear relationships such as time vs actual COVID cases
2) The explanatory variables should not exhibit multicollinearity - check the variance inflation factor (VIF) and aim for no more than 2.5, meaning a coefficient's variance is inflated by a factor of 2.5 relative to uncorrelated predictors (a quick check sketch follows this list)
3) Homoscedasticity - the errors have constant variance; check by plotting residuals against fitted values and looking for an even band
4) For any fixed value of the explanatory variables the response is normally distributed
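A rough sketch of checking two of these assumptions (VIF for multicollinearity, residuals vs fitted values for homoscedasticity), assuming statsmodels, matplotlib and illustrative data:

```python
# Sketch: VIF and residual checks for a linear regression (illustrative data).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt

X = pd.DataFrame({"tests": [100, 200, 300, 400, 500],
                  "tracing_calls": [90, 210, 290, 410, 480]})  # made-up predictors
y = pd.Series([12, 25, 33, 48, 55])                            # made-up response

X_const = sm.add_constant(X)
res = sm.OLS(y, X_const).fit()

# VIF for each explanatory variable (aim for no more than ~2.5, as noted above)
for i in range(1, X_const.shape[1]):
    print(X_const.columns[i], variance_inflation_factor(X_const.values, i))

# Residuals vs fitted values: look for an even band (homoscedasticity)
plt.scatter(res.fittedvalues, res.resid)
plt.xlabel("fitted values"); plt.ylabel("residuals")
plt.show()
```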
What metrics can you use to evaluate linear regression models?
R2 - Percentage of variation explained by the model
Adjusted R2 penalises the addition of extra parameters to reduce overfitting - a relative measure of fit
Mean Squared Error - the average of the squared differences between observed and predicted values - an absolute measure of model fit
Root Mean Square Error - the square root of the MSE, so a measure of the typical distance between actual and predicted values in the units of the response (sketch below)
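A quick sketch of these metrics with scikit-learn, assuming hypothetical y_true/y_pred arrays:

```python
# Sketch: R2, MSE and RMSE for a regression model (hypothetical values).
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])

r2 = r2_score(y_true, y_pred)             # share of variation explained
mse = mean_squared_error(y_true, y_pred)  # absolute fit, in squared units
rmse = np.sqrt(mse)                       # back in the units of the response
print(r2, mse, rmse)
```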
What is the difference between an absolute fit and a relative fit?
Relative fit compares the fitted model to the null model; absolute fit just looks at how well the model fits the data on its own
What is overfitting?
It's where we've fit the model to the training data too closely, so it struggles to generalise to new data. Example with prime ministers: overfitting shows up as test-data accuracy being noticeably lower than training-data accuracy.
How do you deal with overfitting?
Cross-validation; more data (can help find the signal); removing irrelevant input features; early stopping (monitor performance across iterations and stop when it degrades); regularisation techniques (pruning, dropout, adding a penalty term to the cost function); ensembling
https://elitedatascience.com/overfitting-in-machine-learning
How do you deal with underfitting
Increase model complexity or the number of features; more data won't help
How do you evaluate the performance of a classifier
A confusion matrix is a good place to start. Accuracy is the number of correct predictions divided by the total; compare it to the null accuracy, which is what you would get by always predicting the most frequent class.
https://www.ritchieng.com/machine-learning-evaluate-classification-model/
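A minimal sketch of the confusion matrix, accuracy and null accuracy, assuming scikit-learn and hypothetical labels:

```python
# Sketch: confusion matrix, accuracy and null accuracy (hypothetical labels).
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_test = np.array([1, 0, 0, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 0, 1, 0, 1])

print(confusion_matrix(y_test, y_pred))   # rows: actual class, columns: predicted class
print(accuracy_score(y_test, y_pred))     # correct predictions / total

# Null accuracy: what you would score by always predicting the most frequent class
most_frequent = np.bincount(y_test).argmax()
print(np.mean(y_test == most_frequent))
```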
What is a type I error?
False Positive
What is a type 2 error?
False Negative
How does cross-validation work
Split the data into k folds; hold one fold out as the validation set, train on the remaining k-1 folds, and repeat so each fold is held out exactly once; then average the performance across the k folds
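A small sketch of k-fold cross-validation with scikit-learn, using its built-in iris data purely for illustration:

```python
# Sketch: 5-fold cross-validation with scikit-learn (illustrative dataset).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each fold is held out once as the validation set, the model is trained
# on the other 4 folds, and the resulting scores are averaged.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```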
How to handle missing data?
Very circumstantial: why is it missing, is it missing at random, is there a reason for it not being there, and how much is missing? Missingness can itself be informative, e.g. an internet outage leaves stretches of missing data. Some algorithms handle missing values natively, e.g. XGBoost. Otherwise: mean/median imputation for a continuous column, most-frequent-value imputation for a categorical column, or k-NN imputation (computationally expensive).
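A sketch of the imputation options above, assuming scikit-learn's SimpleImputer and made-up column names:

```python
# Sketch: simple imputation strategies (column names and values are invented).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"calls": [120.0, np.nan, 150.0, 90.0],
                   "channel": ["phone", "web", np.nan, "phone"]})

# Median imputation for a continuous column
df["calls"] = SimpleImputer(strategy="median").fit_transform(df[["calls"]]).ravel()

# Most-frequent-value imputation for a categorical column
df["channel"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["channel"]]).ravel()

# sklearn.impute.KNNImputer is an option for numeric columns,
# but it is more computationally expensive.
print(df)
```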
What are the main types of machine learning?
Supervised, Unsupervised, semi-supervised, reinforcement
What is precision vs recall?
Precision is TP/(TP+FP) - true positives divided by all predicted positives ("What proportion of positive identifications was actually correct?")
Recall is TP/(TP+FN) - true positives divided by all actual positives ("What proportion of actual positives was identified correctly?")
How is F1 defined?
F1 = 2 * (precision * recall) / (precision + recall), the harmonic mean of precision and recall
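A small sketch computing precision, recall and F1 both by hand and with scikit-learn, using hypothetical labels:

```python
# Sketch: precision, recall and F1 by hand and via scikit-learn (hypothetical labels).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```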
How is deep learning different to machine learning?
Artificial Intelligence is a technique which enables machines to mimic human behavior.
Machine Learning is a subset of AI which uses statistical methods to enable machines to improve with experience.
Deep learning is a subset of ML which makes the computation of multi-layer neural networks feasible. It uses neural networks to simulate human-like decision making.
Deep learning does the feature engineering for you; it generally performs poorly with small amounts of data but excels with large amounts of data.
What open source datasets have you used?
ONS postcode centroids and local authority shape files
What is selection bias?
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample
Example: analysing phone calls only and ignoring web traffic etc.
How do you go about influencing technical and non-technical audiences?
Use analogies for non-technical audiences in terms they understand; keep in mind why you are telling them something and what they need to get from it. Storytelling, a concrete example, etc.
What is anova?
Analysis of variance asks whether samples come from populations with different means. A one-way ANOVA accounts for one factor (with 2+ levels); a two-way ANOVA investigates two factors (each with 2+ levels) at the same time.
How does a one-way ANOVA work?
A hypothesis test with a single factor (categorical variable), comparing 3 or more sample means (with 2, use a t-test). Null hypothesis: no difference in means; alternative hypothesis: there is a difference in means. Compute the variance within samples and the variance between sample means, then form the F-statistic as the ratio 'between-group variability' / 'within-group variability'.
- Does age, sex or income have an effect on whether someone becomes prime minister?
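A minimal sketch of a one-way ANOVA with SciPy's f_oneway, using three illustrative groups of one factor:

```python
# Sketch: one-way ANOVA across three groups (illustrative data).
from scipy import stats

group_a = [23, 25, 27, 22, 26]
group_b = [30, 31, 29, 33, 28]
group_c = [24, 26, 25, 27, 23]

# F-statistic = between-group variability / within-group variability
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # a small p-value suggests rejecting the null of equal means
```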
What are the assumptions of ANOVA
The responses for each factor level have a normal population distribution.
These distributions have the same variance.
The data are independent.
Explain K-Means
We can select K using domain knowledge or empirically. Initialise K points at random positions in the feature space; these are the cluster centroids. Each observation is assigned to the closest centroid by Euclidean distance, the centroids are recomputed as the mean of their assigned points, and the assign/recompute steps repeat until the assignments stop changing.
Inertia measures the total squared distance of samples from their closest cluster centroid - lower values of inertia are better
Do an elbow plot of inertia vs number of clusters
How to choose K
Elbow plot (K vs total within-cluster sum of squares), looking for the characteristic elbow, or silhouette analysis (sketch below)
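A sketch of the elbow approach with scikit-learn's KMeans, using synthetic blobs data purely for illustration:

```python
# Sketch: elbow plot of inertia vs number of clusters (synthetic data).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # sum of squared distances to the nearest centroid

plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("number of clusters K"); plt.ylabel("inertia")
plt.show()   # look for the 'elbow' where the curve flattens
```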
Give examples of where a false negative is more important than a false positive
COVID-19 testing: a false negative lets an infected person go undetected and spread the disease, which is worse than a false positive.
What is logistic regression?
A classification model that relates input variables to the probability of a binary outcome, using the logistic (sigmoid) function.
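A minimal sketch of logistic regression as a binary classifier, assuming scikit-learn and its built-in breast cancer data for illustration:

```python
# Sketch: logistic regression for a binary outcome (illustrative dataset).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))   # class probabilities from the sigmoid
print(clf.score(X_test, y_test))       # accuracy on the held-out set
```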
What is the null hypothesis and how do we state it?
(in a statistical test) the hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error.
The observed patterns are due to random chance
What is and how do you deal with heteroskedasticity?
Non-constant (uneven) variance of the errors across fitted values. Deal with it by transforming the response (e.g. taking logs), using weighted least squares, or using heteroskedasticity-robust standard errors.
What is a p-value?
In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct. It is not the probability that the null hypothesis is true.
What is statistical power and how do you calculate it?
Power is the probability of not making a type II error (power = 1 - beta): the chance of correctly rejecting a false null hypothesis. A calculation sketch follows the list below.
To increase power
1. Increase the effect size (the difference between the null and alternative values) to be detected
2. Increase the sample size(s)
3. Decrease the variability in the sample(s)
4. Increase the significance level (alpha) of the test
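A sketch of a power/sample-size calculation for a two-sample t-test, assuming statsmodels' TTestIndPower:

```python
# Sketch: power and sample-size calculation for a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a medium effect (d = 0.5)
# at alpha = 0.05 with 80% power
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(n_per_group)

# Conversely, the power achieved with 30 observations per group
print(analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30))
```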
How do you find the correlation between a categorical variable and a continuous variable?
You can’t; at least, not if the categorical variable has more than two levels. If it has two levels, you can use point biserial correlation.
But, with a categorical variable that has three or more levels, the notion of correlation breaks down. Correlation is a measure of the linear relationship between two variables. That makes no sense with a categorical variable.
There are ways to measure the relationship between a continuous and a categorical variable; probably the closest to correlation is a log-linear model. Regression (which is sometimes suggested) imposes a dependent and independent variable, which correlation does not.
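A small sketch of the two-level case (point biserial correlation) with SciPy, using made-up binary/continuous data:

```python
# Sketch: point biserial correlation between a binary variable and a continuous one.
from scipy import stats

group = [0, 0, 0, 1, 1, 1, 0, 1]                      # two-level categorical variable
value = [2.1, 1.8, 2.4, 3.5, 3.9, 3.2, 2.0, 3.7]      # continuous variable (made up)

r, p_value = stats.pointbiserialr(group, value)
print(r, p_value)
```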