Interview Questions Flashcards
What are the steps of a Data Science project?
Define, refine and measure the problem statement; gather and understand the data; prepare the data; build the models; evaluate the models; deploy into production; track performance.
How do you multiply matrices?
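Scalar x Matrix = Each element is multiplied by the scalar.
Matrix x Matrix = The shapes must satisfy C1 = R2 (columns of the first matrix equal rows of the second). Each entry of the result is a dot product: the sum of the products of a row of the first matrix and a column of the second.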
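As a minimal sketch of both rules in plain Python (the function names and the example matrices are illustrative, not from the original card):

```python
def scalar_times_matrix(k, m):
    """Multiply every element of matrix m by the scalar k."""
    return [[k * value for value in row] for row in m]


def matmul(a, b):
    """Multiply a (r1 x c1) by b (r2 x c2); requires c1 == r2."""
    if len(a[0]) != len(b):
        raise ValueError("columns of a must equal rows of b")
    # Entry (i, j) is the dot product of row i of a and column j of b.
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


a = [[1, 2, 3], [4, 5, 6]]        # 2 x 3
b = [[7, 8], [9, 10], [11, 12]]   # 3 x 2
product = matmul(a, b)            # 2 x 2 result: [[58, 64], [139, 154]]
scaled = scalar_times_matrix(2, a)
```

Note that a 2 x 3 matrix times a 3 x 2 matrix gives a 2 x 2 result: the inner dimensions must match and the outer dimensions give the output shape.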
What is the normal distribution?
A symmetric, bell-shaped curve where roughly 68% of the data falls within 1 s.d. of the mean and roughly 95% within 2 s.d.
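The 68% and 95% figures can be checked with the standard library's NormalDist. This is a small sketch; the helper name share_within is illustrative:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Probability mass within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_one_sd = share_within(1)  # roughly 0.68
within_two_sd = share_within(2)  # roughly 0.95
```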
What is the log-normal distribution?
A variable is log-normally distributed if its logarithm is normally distributed. The values are always positive and the distribution is right-skewed.
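A quick way to see the relationship, as a sketch with hypothetical parameters (mu = 0, sigma = 0.5) and a fixed seed: draw log-normal samples, take their logs, and check that the logs look normal with the chosen mean and s.d.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(0)
mu, sigma = 0.0, 0.5  # hypothetical parameters of the underlying normal

samples = [rng.lognormvariate(mu, sigma) for _ in range(10000)]
logs = [math.log(x) for x in samples]

log_mean = mean(logs)  # should be close to mu
log_sd = stdev(logs)   # should be close to sigma
```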
Four methods to check for normality?
Visually (histogram), skewness (close to 0), kurtosis (close to 3) & Q-Q plot (points fall along a straight line).
Four common distributions in the Exponential family and why they are Exponential?
Gaussian, Poisson, Binomial & Gamma.
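All can be expressed in the form:
exp( n(beta) * T(x) - A(beta) + B(x) )
where n(beta) is the natural parameter, T(x) the sufficient statistic, A(beta) the log-partition (normalising) term and B(x) a term that depends only on x.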
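As a worked check for one of the four, the Poisson probability e^-lambda * lambda^x / x! can be rewritten as exp(x * log(lambda) - lambda - log(x!)). The sketch below compares the two forms numerically; lambda = 12 is an arbitrary choice and the function names are illustrative.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability in its usual form."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def poisson_exp_family(x, lam):
    """The same probability written as exp(eta * T(x) - A + B(x))."""
    eta = math.log(lam)       # natural parameter
    t = x                     # sufficient statistic T(x)
    a = lam                   # log-partition term A
    b = -math.lgamma(x + 1)   # B(x) = -log(x!)
    return math.exp(eta * t - a + b)


# Largest difference between the two forms over x = 0..29.
max_gap = max(
    abs(poisson_pmf(x, 12.0) - poisson_exp_family(x, 12.0))
    for x in range(30)
)
```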
What is the Dirichlet distribution?
A multivariate version of the beta distribution.
What is a Gaussian distribution?
Another name for the normal distribution: bell-shaped around the mean and symmetric on both sides.
What is a Poisson distribution?
Part of the exponential family. The distribution of the number of events occurring in a fixed interval of time. Used to calculate the probability of seeing k events in a given period.
E.g. if a bus arrives on average once every 5 minutes and you are interested in the probability of seeing 6 buses in an hour, then lambda = 60/5 = 12.
P(x) = e^-u * u^x / x!
where u is your average (lambda) and x is the number of events you’re predicting.
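The bus example can be worked through directly. This is a minimal sketch using the Poisson probability e^-u * u^x / x! with u = 12 and x = 6; the function name is illustrative.

```python
import math


def poisson_pmf(x, u):
    """Probability of exactly x events when the average count is u."""
    return math.exp(-u) * u ** x / math.factorial(x)


# Average of 12 buses per hour; probability of seeing exactly 6.
p_six_buses = poisson_pmf(6, 12)  # roughly 0.025
```

Seeing only 6 buses when 12 are expected is unlikely, which is why the probability is small.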
What is the Binomial distribution?
The distribution of the number of successes in n independent repetitions, each with the same probability of success.
E.g. rolling a 6 has 1/6 probability. If I rolled a die 60 times, the highest probability (peak of the curve) is at 10 occurrences, with lower probabilities of 9 or 11 occurrences.
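The dice example can be verified with the binomial probability C(n, k) * p^k * (1 - p)^(n - k). A short sketch, with illustrative names:

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 60, 1 / 6  # 60 rolls, probability 1/6 of rolling a 6
probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The count with the highest probability (the peak of the curve).
most_likely = max(range(n + 1), key=lambda k: probabilities[k])
```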
What is a gamma distribution?
A two-parameter distribution with shape a and scale beta. It always returns a positive continuous (real) value and is often used as a prior for a parameter of another distribution, such as the lambda in a Poisson distribution.
E.g. if we want to calculate the probability of k buses arriving at a bus station in an hour, but we’re unsure what lambda (mean) is, we can assume a gamma prior for lambda with shape a and scale beta.
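One way to sketch the bus example is by simulation: draw many plausible values of lambda from the gamma prior, then average the Poisson probability of k buses over those draws. The prior values (shape 12, scale 1), the seed and the number of draws are all hypothetical choices for illustration.

```python
import math
import random

rng = random.Random(0)
shape, scale = 12.0, 1.0  # hypothetical prior; its mean is shape * scale


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# random.gammavariate takes (shape, scale) and returns positive reals.
lambdas = [rng.gammavariate(shape, scale) for _ in range(20000)]
prior_mean = sum(lambdas) / len(lambdas)

# Probability of k buses, averaged over our uncertainty about lambda.
k = 6
predictive = sum(poisson_pmf(k, lam) for lam in lambdas) / len(lambdas)
```

Because lambda is uncertain, the averaged probability is more spread out than a single Poisson with lambda fixed at 12.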
What are summary statistics?
Statistics that can be used to capture meaningful aspects of the data, this may include:
Mean, Median, Mode, Quartiles, Kurtosis, Skewness
What is Skewness and how is it calculated?
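A measure of the asymmetry of the data around the mean, calculated from the 3rd moment: Sum((Xi - mean)^3) / ((N - 1) * sigma^3). 0 indicates no skewness; positive values indicate a longer right tail and negative values a longer left tail.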
What is Kurtosis and how is it calculated?
The heaviness of the tails compared to the normal distribution. Calculated from the 4th moment: Sum((Xi - mean)^4) / ((N - 1) * sigma^4). A value of 3 matches the normal distribution (no excess kurtosis).
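Both moments can be computed in a few lines. This sketch assumes the N - 1 convention with the sample standard deviation; other texts divide by N, which gives slightly different values for small samples. The data sets are made up for illustration.

```python
from statistics import mean, stdev


def skewness(data):
    """3rd standardised moment, dividing by N - 1 (assumed convention)."""
    m, s, n = mean(data), stdev(data), len(data)
    return sum((x - m) ** 3 for x in data) / ((n - 1) * s ** 3)


def kurtosis(data):
    """4th standardised moment, dividing by N - 1 (assumed convention)."""
    m, s, n = mean(data), stdev(data), len(data)
    return sum((x - m) ** 4 for x in data) / ((n - 1) * s ** 4)


symmetric = [1, 2, 3, 4, 5]      # skewness is 0
right_tail = [1, 1, 1, 2, 10]    # skewness is positive

symmetric_skew = skewness(symmetric)
right_tail_skew = skewness(right_tail)
symmetric_kurtosis = kurtosis(symmetric)
```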
How to calculate sample size?
Either conduct a small experiment or make valid assumptions about the proportion. Then use Cochran’s formula to estimate a sample size: n = Z^2 * p * (1 - p) / e^2, where Z is the desired Z value, p the assumed proportion and e the desired precision (margin of error).
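A sketch of Cochran’s formula, assuming a 95% confidence level, an assumed proportion of 0.5 (the most conservative choice) and a precision of 5 percentage points. The function and parameter names are illustrative.

```python
import math
from statistics import NormalDist


def cochran_sample_size(confidence, proportion, precision):
    """Sample size from Z^2 * p * (1 - p) / e^2, rounded up."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided Z value
    n = z ** 2 * proportion * (1 - proportion) / precision ** 2
    return math.ceil(n)


sample_size = cochran_sample_size(0.95, 0.5, 0.05)  # 385
```

Halving the precision to 0.025 roughly quadruples the required sample size, because precision is squared in the denominator.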
What is Power Analysis?
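Calculating the probability (the power) that a hypothesis test detects an effect, if it exists. Power equals 1 minus the Type II error rate, and the analysis is commonly used to choose a sample size.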
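As an illustration, power has a closed form for the simplest case. The sketch below assumes a one-sided, one-sample z-test with a known population s.d.; the effect size of 0.5 and the sample sizes are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test.

    effect_size is (true mean - null mean) / population s.d.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)  # rejection threshold under the null
    return 1 - z.cdf(critical - effect_size * math.sqrt(n))


power_10 = z_test_power(0.5, 10)  # small sample, lower power
power_30 = z_test_power(0.5, 30)  # larger sample, higher power
```

With no true effect (effect_size = 0) the function returns alpha, the false positive rate.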
What is a confounding variable?
A variable that influences both the dependent variable and one or more independent variables, which can create a misleading association between them.
How do you calculate a confidence interval?
Point estimate +- the desired critical value (from the Z or t table) * standard error, where standard error = standard deviation of the data / sqrt(n).
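A minimal sketch of a confidence interval for a mean. It uses the Z critical value, which is a reasonable approximation for a sample of this size; for small samples a t critical value would be the better choice. The sample is simulated and purely hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(1)
sample = [rng.gauss(100, 15) for _ in range(50)]  # hypothetical data


def confidence_interval(data, confidence=0.95):
    """Point estimate +- critical value * standard error."""
    point_estimate = mean(data)
    standard_error = stdev(data) / math.sqrt(len(data))
    critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = critical * standard_error
    return point_estimate - margin, point_estimate + margin


lower, upper = confidence_interval(sample)
```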
What is a Type I and Type II error?
Type I (False Positive) - Rejecting the null hypothesis when it is true (finding an effect that does not exist). Type II (False Negative) - Failing to reject the null hypothesis when it is false (missing an effect that does exist).
Difference between Z-test, T-test, F-test & ANOVA test?
Z-tests and T-tests are used to test hypotheses about a mean or to compare two means. A Z-test requires the population s.d. to be known, whilst a T-test estimates it from the sample.
An F-test compares the variances of two samples. ANOVA (Analysis of Variance) uses an F-test to compare the means of several groups, as the ratio of systematic (between-group) variance to error (within-group) variance.
How do you conduct A/B testing?
Choose a single change to make.
Choose a sample size (keep it small if the change is risky).
Randomly assign users to experience change.
Gather data.
T-test the difference between the two groups at a 95-99% confidence level.
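The steps above can be sketched end to end on simulated data. Everything here is hypothetical: the user count, the metric and the uplift given to group B. The test statistic is Welch's t, and because the groups are large the p-value uses the normal distribution as an approximation to the t distribution.

```python
import math
import random
from statistics import NormalDist, mean, variance

rng = random.Random(42)

# Randomly assign 2,000 hypothetical users to group A or group B.
users = list(range(2000))
rng.shuffle(users)
group_a, group_b = users[:1000], users[1000:]

# Simulated metric; group B is given a small true uplift.
metric_a = [rng.gauss(10.0, 2.0) for _ in group_a]
metric_b = [rng.gauss(10.5, 2.0) for _ in group_b]


def welch_t(x, y):
    """Welch's t statistic for the difference in means (y minus x)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(y) - mean(x)) / se


t_stat = welch_t(metric_a, metric_b)
# Normal approximation to the t distribution (large samples only).
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
significant = p_value < 0.05  # 95% confidence level
```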
What is a permutation test?
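A resampling method, like bootstrapping. It tests whether two groups come from the same distribution by repeatedly shuffling the group labels and recalculating the test statistic, then checking how extreme the observed statistic is among the shuffled results.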
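A minimal sketch of a two-sample permutation test on the difference in means. The data, the number of shuffles and the seed are made up for illustration.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Share of label shuffles with a mean difference at least as large
    as the observed one (a two-sided p-value estimate)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


control = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
treated = [13.0, 13.4, 12.9, 13.6, 13.1, 13.3, 12.8, 13.5]
p_value = permutation_test(control, treated)
```

A small p-value means that random relabelling rarely produces a gap as large as the observed one, which is evidence that the two groups differ.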