Interview Flashcards

Question

Give examples of where a false negative is more important than a false positive

Answer 1

A classification model that takes in input variables and relates them to the probability of a binary outcome

Answer 2

(In a statistical test) the hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error. In other words, the observed patterns are due to random chance.

Answer 3

uneven distribution of errors

Answer 4

In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct.

Answer 5

Power is the probability of not making a type II error. To increase power:
1. Increase the effect size (the difference between the null and alternative values) to be detected
2. Increase the sample size(s)
3. Decrease the variability in the sample(s)
4. Increase the significance level (alpha) of the test
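
The four levers for power can be checked numerically. The sketch below is only an illustration, assuming a one-sided, one-sample z-test with known sigma; the function name `z_test_power` and the example numbers are made up for this card.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Approximate power of a one-sided, one-sample z-test (assumed setup)."""
    std_normal = NormalDist()
    # Critical value the test statistic must exceed to reject the null.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Mean of the test statistic when the true effect is `effect`.
    shift = effect * n ** 0.5 / sigma
    # Probability of landing beyond the critical value under the alternative.
    return 1 - std_normal.cdf(z_crit - shift)

baseline = z_test_power(effect=0.5, sigma=2.0, n=50)
bigger_effect = z_test_power(effect=1.0, sigma=2.0, n=50)
bigger_sample = z_test_power(effect=0.5, sigma=2.0, n=200)
less_noise = z_test_power(effect=0.5, sigma=1.0, n=50)
looser_alpha = z_test_power(effect=0.5, sigma=2.0, n=50, alpha=0.10)
```

Each of the four variants should come out higher than `baseline`, matching the four points on the card.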

Answer 6

You can’t; at least, not if the categorical variable has more than two levels. If it has two levels, you can use point biserial correlation. But, with a categorical variable that has three or more levels, the notion of correlation breaks down. Correlation is a measure of the linear relationship between two variables. That makes no sense with a categorical variable. There are ways to measure the relationship between a continuous and categorical variable; probably the closest to correlation is a log linear model. Regression (which some other people said would be good) imposes a dependent and independent variable which correlation does not.

Answer 7

In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct. It is not the likelihood that the null hypothesis is correct.

Answer 8

Use the right evaluation metrics (not just accuracy) and under- or oversample the data. The k-fold split must be done before oversampling; otherwise copies of the same artificial points land in both training and validation folds and we overfit on them.

Answer 9

High bias leads to underfitting and high variance leads to overfitting; the middle ground is just right, so there is a tradeoff.

Answer 10

Use the right evaluation metrics (not just accuracy) and under- or oversample the data. The k-fold split must be done before oversampling; otherwise copies of the same artificial points land in both training and validation folds and we overfit on them. If you care about the minority class, oversample it or undersample the majority class, or increase the cost of misclassifying the minority class.
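
The ordering of splitting and oversampling can be sketched in plain Python. This is a minimal illustration, assuming simple random oversampling by duplication; `oversampled_folds` and the toy labels are hypothetical names and data, not taken from any library.

```python
import random

def oversampled_folds(labels, k=3, minority=1, seed=0):
    """Yield (train_idx, val_idx) pairs; only the training part is oversampled."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    # Split first, so every original row sits in exactly one validation fold.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        minority_idx = [j for j in train_idx if labels[j] == minority]
        majority_idx = [j for j in train_idx if labels[j] != minority]
        # Duplicate minority rows (with replacement) until the classes balance.
        extra = max(0, len(majority_idx) - len(minority_idx))
        if minority_idx:
            train_idx = train_idx + rng.choices(minority_idx, k=extra)
        yield train_idx, val_idx

labels = [0] * 16 + [1] * 4  # toy imbalanced labels
splits = list(oversampled_folds(labels))
```

Because the duplicates are drawn only from the training indices, no validation row ever appears in its own training set.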

Answer 11

Whilst both show the distribution of data, they communicate it differently. Histograms show us the shape of the distribution; boxplots show us the quartiles and the Tukey fences and are better for comparing multiple distributions side by side.

Answer 12

Random forest doesn't assume a linear relationship | LG is more explainable and scales better

Answer 13

Adjusted R^2 (adding more variables always increases the plain R^2 value, so the adjusted version penalises the number of variables). Cross validation.
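
A small sketch of the standard adjusted R^2 formula, showing how it can drop when many variables buy only a tiny gain in R^2. The function name and the example numbers are illustrative assumptions.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: penalises R^2 for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same data size, slightly higher R^2, but many more predictors.
small_model = adjusted_r2(r2=0.800, n_samples=50, n_features=5)
large_model = adjusted_r2(r2=0.801, n_samples=50, n_features=20)
```

Here `large_model` ends up lower than `small_model` even though its raw R^2 is higher.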

Answer 14

Random forests allow you to determine the feature importance. SVMs can’t do this. Random forests are much quicker and simpler to build than an SVM. For multi-class classification problems, SVMs require a one-vs-rest method, which is less scalable and more memory intensive.

Answer 15

UNION returns only distinct rows; UNION ALL keeps duplicates
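
The difference is easy to see with an in-memory SQLite database from Python's standard library. The tables `a` and `b` and their contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
""")

# UNION removes the duplicate 2; UNION ALL keeps both copies.
union_rows = conn.execute(
    "SELECT x FROM a UNION SELECT x FROM b ORDER BY x"
).fetchall()
union_all_rows = conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x"
).fetchall()
conn.close()

print(union_rows)      # [(1,), (2,), (3,)]
print(union_all_rows)  # [(1,), (2,), (2,), (3,)]
```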

Answer 16

1) It reduces storage space
2) Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model
3) It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D
4) It avoids the curse of dimensionality

Answer 17

assumes inputs are uncorrelated. Garden flavoured ice cream

Answer 18

A linear model holds some strong assumptions that may not be true in application. It assumes a linear relationship, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity. A linear model can’t be used for discrete or binary outcomes. You can’t vary the model flexibility of a linear model.

Answer 19

It is the function telling us how badly our model maps X -> y

Answer 20

When dataset is imbalanced

Answer 21

When dataset is imbalanced | https://www.quora.com/What-is-the-difference-between-a-ROC-curve-and-a-precision-recall-curve-When-should-I-use-each

Answer 22

Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling is a data-driven, not theory-driven, methodology based upon repeated sampling within the same sample. Resampling refers to methods for doing one of these:
1. Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
2. Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
3. Validating models by using random subsets (bootstrapping, cross validation)
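
As an illustration of the bootstrapping idea, the sketch below estimates the standard error of a sample median by drawing with replacement from the sample itself. The function name `bootstrap_se`, the number of resamples, and the data are assumptions chosen for the example.

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.median, n_resamples=2000, seed=0):
    """Bootstrap estimate of the standard error of `stat` on `data`."""
    rng = random.Random(seed)
    # Each resample has the same size as the data and is drawn with replacement.
    estimates = [
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    ]
    # The spread of the resampled statistics approximates the standard error.
    return statistics.stdev(estimates)

sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
se_median = bootstrap_se(sample)
```

No theoretical sampling distribution for the median is needed; the precision comes entirely from the repeated sampling.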

Answer 23

Dimensionality reduction

Answer 24

Ignore, remove, or log transform them. You might want to keep them depending on the business problem, e.g. cyber security.

Answer 25

No, it could find a local minimum rather than the global one

Answer 26

A bar graph is for discrete data whereas a histogram is for continuous data

Answer 27

Creates decision boundary

Answer 28

Creates decision boundary

Answer 29

Vertical distance between fitted line and points

Answer 30

A measure of how badly our model maps x to y

Answer 31

Long: data in one column, context in another. Wide: one feature per column.
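
A minimal sketch of the two layouts, using made-up records. In the long form each row holds one value plus a column saying what that value is; the wide form gives each feature its own column.

```python
# Long format: (subject, feature, value), one measurement per row.
long_rows = [
    ("alice", "height", 165),
    ("alice", "weight", 60),
    ("bob", "height", 180),
    ("bob", "weight", 82),
]

def long_to_wide(rows):
    """Pivot long rows into one record per subject, one key per feature."""
    wide = {}
    for subject, feature, value in rows:
        wide.setdefault(subject, {})[feature] = value
    return wide

wide = long_to_wide(long_rows)
print(wide["alice"])  # {'height': 165, 'weight': 60}
```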

Answer 32

A parametric model is an ML model that captures all the information about its predictions in a finite number of parameters

Answer 33

Range of plausible values for an unknown parameter

Answer 34

z = (x - mu) / sigma
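
A worked instance of the formula, assuming a population mean of 100 and standard deviation of 15 (numbers chosen only for illustration). The conversion to a percentile additionally assumes the variable is normally distributed.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above the mean."""
    return (x - mu) / sigma

z = z_score(x=130, mu=100, sigma=15)
print(z)  # 2.0

# Share of a normal population falling below this value.
percentile = NormalDist().cdf(z)
```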

Answer 35

Measure for dispersion

Answer 36

Starts with every point as its own cluster, then repeatedly merges each cluster with its nearest neighbour, etc.

Answer 37

Plotting the quantiles of a variable against the quantiles of a normal distribution will give a straight line if the variable is normally distributed

Answer 38

Juice analogy. Normalization bounds numbers within a range, e.g. 0-1; standardisation gives zero mean and a variance of 1. Feature scaling also speeds up gradient descent.

Answer 39

Checks the independence of two variables. The chi-square test compares proportions of discrete categories.
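
A sketch of how the chi-square statistic for independence is computed from a contingency table: observed counts are compared with the counts expected if the two variables were independent. The function name and the tables are illustrative.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

dependent = chi_square_statistic([[30, 10], [20, 40]])
independent = chi_square_statistic([[10, 20], [20, 40]])
```

For a 2x2 table there is one degree of freedom, so a statistic above 3.841 is significant at the 5% level; `dependent` clears that bar while `independent` is exactly zero because its proportions match.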

Answer 40

Standard error increases - increases variance

Answer 41

This will be your 'favourite'

Answer 42

Juice analogy. Normalization bounds numbers within a range, e.g. 0-1; standardisation gives zero mean and a variance of 1. Feature scaling is essential for machine learning algorithms that calculate distances between data: KNN, K-means, principal component analysis. Random forest (rules) and naive Bayes (weights), by contrast, are unaffected by scaling. Feature scaling also speeds up gradient descent.
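
The two kinds of scaling can be sketched with the standard library. Both helpers are illustrative and assume the values are not all identical (otherwise the denominators would be zero).

```python
import statistics

def min_max_normalise(values):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Rescale values to zero mean and unit (population) variance."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

ages = [18, 25, 40, 62, 33]  # made-up feature values
normalised = min_max_normalise(ages)
standardised = standardise(ages)
```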

Answer 43

is the mean of the sample different to a given value | is the mean of the sample different to the mean of the other sample

Answer 44

Chi squared assumes large sample size

Answer 45

Chi squared assumes a large sample size (its p-value is approximate). Fisher's exact test computes an exact p-value, so it is preferred for small samples.

Answer 46

A confounding variable, also called a confounder or confounding factor, is a third variable in a study examining a potential cause-and-effect relationship that influences both the supposed cause and the supposed effect

Answer 47

Blocking makes sure equal proportions of a confounding variable are in the treatment and control groups

Answer 48

Statistical significance is how certain we are that an effect happened. The effect size is how much difference that effect makes

Answer 49

Statistical significance is how certain we are that an effect happened. The effect size is how much difference that effect makes. You can measure effect size using Cohen's d.
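
A minimal sketch of Cohen's d for two independent groups, using the pooled standard deviation. The function name and the toy samples are assumptions for illustration.

```python
import statistics

def cohens_d(a, b):
    """Difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_var ** 0.5

treatment = [2, 4, 6]
control = [1, 3, 5]
d = cohens_d(treatment, control)
print(d)  # 0.5
```

Unlike a p-value, d does not shrink or grow with sample size alone; it expresses the gap between groups in units of their spread.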

Answer 50

0: won't detect. 1: will always detect.

Answer 51

0: won't detect. 1: will always detect. As power increases, the probability of a type II error decreases.

Answer 52

plotting the quantiles of a variable against theoretical quantiles of a normal distribution will give a straight line if the variable is normally distributed
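
The points of such a plot can be computed without any plotting library. The sketch below is an illustration under assumptions: it uses the plotting positions (i + 0.5) / n, simulated data, and the correlation of the points as a rough measure of how straight the line is.

```python
import random
from statistics import NormalDist, fmean

def qq_points(sample):
    """Pair theoretical standard-normal quantiles with the sorted sample."""
    n = len(sample)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(sample)))

def straightness(points):
    """Pearson correlation of the points; close to 1 means a straight line."""
    xs, ys = zip(*points)
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in points)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)
normal_sample = [rng.gauss(10, 2) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = straightness(qq_points(normal_sample))
r_skewed = straightness(qq_points(skewed_sample))
```

The normally distributed sample should give a value very close to 1, while the skewed sample bends away from the line and scores lower.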

Answer 53

It tends to introduce bias - skewing results and reducing accuracy

Answer 54

scikit learn series

(80 cards)