Statistics/Finance Flashcards
Central Limit Theorem
Why is it important?
- allows you to assume normality: the normality assumption matters for parametric hypothesis tests of the mean, such as the t-test. if the sample size is large enough, the CLT produces a sampling distribution of the mean that is approximately normal, even when the underlying population is not.
- precision of estimates: with a large sample size, the sample mean is more likely to be close to the true population mean.
- important for trusting validity of results and assessing precision of estimates.
- the mean of the sample means will itself equal the population mean
- the SD of the sample means equals the standard error of the mean (the population SD divided by the square root of the sample size)
- as sample size increases, the SD of the sampling distribution becomes smaller, so the sampling distribution clusters more tightly around the mean
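a quick simulation sketch of the points above (assuming numpy is available; the skewed exponential population is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# skewed (exponential) population with mean = 1 and SD = 1
population_sd, n, n_samples = 1.0, 50, 10_000

# draw many samples of size n and keep each sample mean
sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

print(sample_means.mean())         # ~1.0: mean of sample means ≈ population mean
print(sample_means.std(ddof=1))    # ~0.14: ≈ population_sd / sqrt(n), the standard error
print(population_sd / np.sqrt(n))  # 0.1414...; shrinks as n increases
```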
What is the CLT applied to?
- all probability distributions where the population has finite variance; applies to independent, identically distributed variables [the distribution must remain constant across all measurements]
- DOES NOT apply to the Cauchy distribution because it has infinite variance
Distribution: Mean, Median, Mode
Median: use the median over the mean when the data contains outliers, because the median still represents the center while the mean gives a skewed average that is heavily affected by the outliers.
mode: most useful when your data is on a nominal scale. a nominal-scale variable is categorical, with unordered categories that are mutually exclusive
mean: when data is normally distributed
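a small sketch of how an outlier pulls the mean but not the median, and of the mode for nominal data (illustrative numbers only):

```python
import numpy as np
from statistics import mode

data = [10, 12, 11, 13, 12, 250]   # 250 is an outlier
print(np.mean(data))               # ~51.3, dragged toward the outlier
print(np.median(data))             # 12.0, still represents the center

colors = ["red", "blue", "blue", "green"]   # nominal-scale (categorical) data
print(mode(colors))                          # 'blue', the most frequent category
```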
Linear Regression
- measures the linear relationship between dependent and independent variable(s)
- advantage: LINEARITY, allows us to interpret models
- assumes normally distributed error terms
- if this assumption is violated, the estimated confidence intervals of the feature weights are invalid
- helps predict future values, trends, economic conditions, etc
- helps us understand how one variable changes when another changes
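a minimal fitting sketch, assuming statsmodels is available and using made-up data with a known slope:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)   # true intercept 2, true slope 3

X = sm.add_constant(x)            # add the intercept term
model = sm.OLS(y, X).fit()
print(model.params)               # estimated intercept and slope
print(model.conf_int())           # CIs; only trustworthy if the error terms are ~normal
```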
Simple Linear Regression vs Multiple Linear Regression
SLR: relationship between 2 variables
MLR: 2 or more explanatory variables have a linear relationship with the dependent variable
- assumes a linear relationship between the dependent variable and the independent variables
- assumes no major correlation between the independent variables, i.e. no multicollinearity
Multicollinearity
- Use variance inflation factors (VIF)
- If the VIF is large, that means the predictor is highly correlated with at least one of the other predictors in the model
In order to deal with multicollinearity:
-remove violating predictors from the model.
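a sketch of checking VIFs with statsmodels (the data and the rough ~5-10 cutoff are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF per predictor (skipping the constant); values above ~5-10 are usually flagged
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```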
Gaussian Distribution
Importance
- calculate probabilities for events
- dependent on 2 parameters of data set: mean and standard deviation
- mean, mode and median of distribution are equal
- 68.3% of values fall within 1 standard deviation of the mean
- 95.5% within 2 SD
- 99.7% within 3 SD
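a quick check of those percentages with scipy (a standard normal is assumed for simplicity; any mean/SD gives the same values):

```python
from scipy.stats import norm

mu, sigma = 0, 1
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(f"within {k} SD: {p:.4f}")   # ~0.6827, ~0.9545, ~0.9973
```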
Mean [pro/con]
Pro
- utilizes all observations
- rigidly defined
- easy to understand and compute
- can be used for further mathematical treatments
Con
- it's badly affected by extremely small or large values [outliers]
- it can’t be calculated for open end class intervals
- not preferred for highly skewed distributions
Median [pro/con]
Pro
- rigidly defined
- easy to understand and compute
- great for dealing with outliers aka extremely small or large values
Con
- if we have an even number of observations, we only get an estimate of the median by taking the mean of the 2 middle values; we don't get an exact observed value
- doesn't utilize all of the observations; the median won't be affected by changes in the other values
- it is not amenable to algebraic treatments
- affected by sampling fluctuations
Mode [pro/con]
Pro
- easiest average to understand and easy to calculate
- not affected by extreme values
- can calculate for open end classes
- as long as the modal class is confirmed, the pre-modal class and the post-modal class are of equal width
- can be calculated even if other classes are of unequal width
Con
- not rigidly defined. distribution can have multiple modes
- doesn’t utilize all observations
- not amenable to algebraic treatment
- greatly affected by sampling fluctuations
coefficient of variation
statistical measure of the dispersion of data points in a data series around the mean. the coefficient of variation is the ratio of the standard deviation to the mean, and it is a useful statistic for comparing the degree of variation from one data series to another, even if the means are drastically different from one another
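a small sketch comparing dispersion across two series with very different means (the numbers are made up):

```python
import numpy as np

returns_a = np.array([0.02, 0.03, 0.025, 0.04, 0.015])   # e.g. percentage returns
prices_b = np.array([200, 300, 250, 400, 150])            # e.g. dollar prices

for name, x in [("A", returns_a), ("B", prices_b)]:
    cv = np.std(x, ddof=1) / np.mean(x)        # SD relative to the mean
    print(name, round(cv, 3))                  # comparable despite very different means
```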
Let’s say that your company is running a standard control and variant AB test on a feature to increase conversion rates on the landing page. The PM checks the results and finds a .04 p-value.
How would you assess the validity of the result?
It is always important to clarify assumptions about the question upfront. In this particular question, clarifying the context of how the AB test was set up and measured will specifically draw out the solutions that the interviewer wants to hear.
If we have an AB test to analyze, there are two main ways in which we can look for invalidity. We could likely re-phrase the question to: How do you set up and measure an AB test correctly?
Let’s start out by answering the first part of figuring out the validity of the set up of the AB test.
- How were the user groups separated?
Can we determine that the control and variant groups were sampled according to the test conditions? If we're testing changes to a landing page to increase conversion, can we compare users in the two groups on metrics whose distributions should look the same?
For example, if the groups were randomly bucketed, does the distribution of traffic from different attribution channels still look similar or is the variant A traffic channel coming primarily from Facebook ads and the variant B from email? If testing group B has more traffic coming from email then that could be a biased test.
- Were the variants equal in all other aspects?
The outside world often has a much larger effect on metrics than product changes do. Users can behave very differently depending on the day of week, the time of year, the weather (especially in the case of a travel company like Airbnb), or whether they learned about the website through an online ad or found the site organically.
If variant A's landing page has a picture of the Eiffel Tower and the submit button at the top of the page, and variant B's landing page has a large picture of an ugly man and the submit button at the bottom of the page, then we could get conflicting results based on the change to multiple features.
Measurement
Looking at the actual measurement of the p-value, the industry-standard significance threshold is .05, which means that if there were truly no difference between the populations, we would only expect to see a result this extreme about 1 time in 20. However, we have to note a couple of things about the test in the measurement process.
What was the sample size of the test?
Additionally, how long did it take before the product manager measured the p-value?
Lastly, how did the product manager measure the p-value and did they do so by continually monitoring the test?
If the product manager ran a t-test with a small sample size, they could very easily get a p-value under 0.05. Much of the confusion in AB testing comes from how much time you need before drawing a conclusion about the results of an experiment.
The problem with using the p-value as a stopping criterion is that the statistical test that gives you a p-value assumes that you designed the experiment with a sample and effect size in mind. If we continuously monitor the development of a test and the resulting p-value, we are very likely to see an effect, even if there is none. The opposite error is also common when you stop an experiment too early, before an effect becomes visible.
The most important reason is that we perform a statistical test every time we compute a p-value, and the more tests we perform, the more likely we are to find a spurious effect.
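a rough simulation of that effect, assuming scipy is available; there is no real difference between the groups, yet peeking at the p-value every "day" declares significance far more often than 5% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_experiments, n_per_day, n_days = 1000, 100, 30
false_positives = 0

for _ in range(n_experiments):
    a, b = [], []
    for _ in range(n_days):
        a.extend(rng.normal(0, 1, n_per_day))    # control: no true effect
        b.extend(rng.normal(0, 1, n_per_day))    # variant: identical distribution
        if stats.ttest_ind(a, b).pvalue < 0.05:  # "peek" at the p-value each day
            false_positives += 1
            break

print(false_positives / n_experiments)           # well above the nominal 0.05
```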
How long should we recommend an experiment to run for, then? To prevent a false negative (a Type II error), the best practice is to decide, before starting the experiment, the minimum effect size that we care about, and then compute how long to run the experiment based on the certainty we want and the sample size (the number of new samples that arrive each day).
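a sketch of that pre-experiment calculation with statsmodels; the baseline conversion rate, minimum detectable effect, significance level, power, and daily traffic are all assumed numbers:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# assume a 5% baseline conversion rate and that +1 point is the smallest effect we care about
effect = abs(proportion_effectsize(0.05, 0.06))   # Cohen's h for 5% -> 6%

n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)

daily_visitors_per_group = 500                         # assumed traffic per group
print(round(n_per_group))                              # required sample size per group
print(round(n_per_group / daily_visitors_per_group))   # days to run before stopping
```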
What does coefficient of determination tell you? When would you use it?
- r^2 is coefficient of determination
- result of dividing regression sum of squares by the total sum of squares
- explains how much of the variation in y is explained by the x variable(s)
- ranges from 0 to 1
- used to explain how much variability of one factor can be caused by its relationship to another factor
- also known as the "goodness of fit" of the regression
in investing, r^2 is interpreted as the percentage of a fund or security's movements that can be explained by movements in a benchmark index.
e.g. an r^2 for a fixed income security vs a bond index identifies the proportion of the security's price movement that is predictable based on the price movement of the index
-measures how close the data are to the fitted regression line
= explained variation / total variation
- r-squared is high when the error terms are small
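a sketch of that calculation using the equivalent 1 - (residual variation / total variation) form; the observed and fitted values are made up:

```python
import numpy as np

y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])       # observed values
y_hat = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # fitted values from some regression

ss_res = np.sum((y - y_hat) ** 2)      # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))             # close to 1 because the errors are small
```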
what is the difference between r-squared and adjusted r-squared
-r-squared is best for a simple linear regression model with one explanatory variable
- adjusted r-squared compensates for the addition of variables for a multiple linear regression
- it is a better measure of fit: it only increases if the new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model less than expected by chance
- in an overfitting situation, a misleadingly high value of r-squared is obtained
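a small sketch of the adjustment; the R² values, sample size, and predictor counts are illustrative:

```python
def adjusted_r_squared(r2, n, p):
    """adjusted R^2 for n observations and p explanatory variables"""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# a nearly useless extra predictor nudges R^2 up but pulls adjusted R^2 down
print(adjusted_r_squared(0.800, n=50, p=2))   # ~0.791
print(adjusted_r_squared(0.801, n=50, p=3))   # ~0.788, penalized for the extra term
```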
difference between r-squared and beta
beta: measure of relative risk
- if beta is high, it may produce higher returns than the benchmark
- measures how large those price changes are in relation to a benchmark
- r-squared measures how closely each change in the price of an asset is correlated to a benchmark
- used together, r-squared and beta give investors a thorough picture of the performance of asset managers
limitations of r-squared
- won't tell you whether your chosen model is good or bad
- won't tell you whether the data and predictions are biased
what is beta?
- measure of volatility of a security or portfolio compared to the market as a whole
- used in the capital asset pricing model (CAPM), which describes the relationship between systematic risk and expected return for assets (stocks)
- for beta to be meaningful, the stock should be related to the benchmark that is used in the calculation
- in statistics, beta represents the slope of the line through a regression of data points
- in finance, each of these data points represents an individual stock's returns against those of the market as a whole
- calculation allows investors to understand whether a stock moves in the same direction as the rest of the market.
- provides insights on how risky a stock is relative to the rest of the market
- used to determine a security's short-term risk and to analyze volatility when arriving at the cost of equity under the CAPM
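a sketch of beta as a regression slope on simulated daily returns (the true beta of 1.3 is an assumption of the example):

```python
import numpy as np

rng = np.random.default_rng(4)
market_returns = rng.normal(0.0005, 0.01, 250)                     # ~1 year of daily returns
stock_returns = 1.3 * market_returns + rng.normal(0, 0.005, 250)   # true beta ≈ 1.3

# beta = slope of the regression of stock returns on market returns
beta = np.cov(stock_returns, market_returns)[0, 1] / np.var(market_returns, ddof=1)
print(round(beta, 2))   # ~1.3: moves in the same direction as the market, but more
```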
disadvantages:
- less meaningful for predicting a stock's future movements, since it's calculated from historical data points