Statistics/Finance Flashcards
Central Limit Theorem
Why is it important?
- allows you to assume normality: the normality assumption matters for parametric hypothesis tests of the mean, such as the t-test. if the sample size is large enough, the CLT produces a sampling distribution of the mean that is approximately normal, even when the underlying population is not.
- precision of estimates: with a large sample size, the sample mean is more likely to be close to the true population mean.
- important for trusting validity of results and assessing precision of estimates.
- the mean of the sample means will itself equal the population mean
- the SD of the sample means equals the standard error of the mean (the population SD divided by the square root of the sample size)
- as sample size increases, the SD of the sampling distribution becomes smaller, so the sampling distribution clusters more tightly around the mean
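a quick simulation sketch of the points above (assuming numpy is available; the skewed exponential population is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# skewed (exponential) population with mean = 1 and SD = 1
population_sd, n, n_samples = 1.0, 50, 10_000

# draw many samples of size n and keep each sample mean
sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

print(sample_means.mean())         # ~1.0: mean of sample means ≈ population mean
print(sample_means.std(ddof=1))    # ~0.14: ≈ population_sd / sqrt(n), the standard error
print(population_sd / np.sqrt(n))  # 0.1414...; shrinks as n increases
```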
What is the CLT applied to?
- all probability distributions where the population has finite variance; applies to independent, identically distributed variables [the distribution must remain constant across all measurements]
- DOES NOT apply to the Cauchy distribution because it has infinite variance
Distribution: Mean, Median, Mode
Median: use the median over the mean when the data contains outliers, because the median still represents the center while the mean gives a skewed average that is heavily affected by the outliers.
mode: most useful when your data is on a nominal scale. a nominal-scale variable is categorical, with unordered categories that are mutually exclusive
mean: when data is normally distributed
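a small sketch of how an outlier pulls the mean but not the median, and of the mode for nominal data (illustrative numbers only):

```python
import numpy as np
from statistics import mode

data = [10, 12, 11, 13, 12, 250]   # 250 is an outlier
print(np.mean(data))               # ~51.3, dragged toward the outlier
print(np.median(data))             # 12.0, still represents the center

colors = ["red", "blue", "blue", "green"]   # nominal-scale (categorical) data
print(mode(colors))                          # 'blue', the most frequent category
```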
Linear Regression
- measures the linear relationship between dependent and independent variable(s)
- advantage: LINEARITY, allows us to interpret models
- assumes normally distributed error terms
- if this assumption is violated, the estimated confidence intervals of the feature weights are invalid
- helps predict future values, trends, economic conditions, etc
- helps us understand how one variable changes when another changes
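a minimal fitting sketch, assuming statsmodels is available and using made-up data with a known slope:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)   # true intercept 2, true slope 3

X = sm.add_constant(x)            # add the intercept term
model = sm.OLS(y, X).fit()
print(model.params)               # estimated intercept and slope
print(model.conf_int())           # CIs; only trustworthy if the error terms are ~normal
```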
Simple Linear Regression vs Multiple Linear Regression
SLR: relationship between 2 variables
MLR: 2 or more explanatory variables have a linear relationship with the dependent variable
- assumes a linear relationship between the dependent variable and the independent variables
- assumes no major correlation between the independent variables, i.e. no multicollinearity
Multicollinearity
- Use variance inflation factors (VIF)
- If the VIF is large, that means the predictor is highly correlated with at least one of the other predictors in the model
In order to deal with multicollinearity:
-remove violating predictors from the model.
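a sketch of checking VIFs with statsmodels (the data and the rough ~5-10 cutoff are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF per predictor (skipping the constant); values above ~5-10 are usually flagged
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```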
Gaussian Distribution
Importance
- calculate probabilities for events
- dependent on 2 parameters of data set: mean and standard deviation
- mean, mode and median of distribution are equal
- 68.3% of values fall within 1 standard deviation of the mean
- 95.5% within 2 SD
- 99.7% within 3 SD
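a quick check of those percentages with scipy (a standard normal is assumed for simplicity; any mean/SD gives the same values):

```python
from scipy.stats import norm

mu, sigma = 0, 1
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(f"within {k} SD: {p:.4f}")   # ~0.6827, ~0.9545, ~0.9973
```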
Mean [pro/con]
Pro
- utilizes all observations
- rigidly defined
- easy to understand and compute
- can be used for further mathematical treatments
Con
- it's badly affected by extremely small or large values [outliers]
- it can’t be calculated for open end class intervals
- not preferred for highly skewed distributions
Median [pro/con]
Pro
- rigidly defined
- easy to understand and compute
- great for dealing with outliers aka extremely small or large values
Con
- if we have an even number of observations, we only get an estimate of the median by taking the mean of the 2 middle values; we don't get an exact observed value
- doesn't utilize all of the observations; the median won't be affected by changes in the other values
- it is not amenable to algebraic treatments
- affected by sampling fluctuations
Mode [pro/con]
Pro
- easiest average to understand and easy to calculate
- not affected by extreme values
- can calculate for open end classes
- as long as the modal class is confirmed, the pre-modal class and the post-modal class are of equal width
- can be calculated even if other classes are of unequal width
Con
- not rigidly defined. distribution can have multiple modes
- doesn’t utilize all observations
- not amenable to algebraic treatment
- greatly affected by sampling fluctuations
coefficient of variation
statistical measure of the dispersion of data points in a data series around the mean. the coefficient of variation is the ratio of the standard deviation to the mean, and it is a useful statistic for comparing the degree of variation from one data series to another, even if the means are drastically different from one another
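a small sketch comparing dispersion across two series with very different means (the numbers are made up):

```python
import numpy as np

returns_a = np.array([0.02, 0.03, 0.025, 0.04, 0.015])   # e.g. percentage returns
prices_b = np.array([200, 300, 250, 400, 150])            # e.g. dollar prices

for name, x in [("A", returns_a), ("B", prices_b)]:
    cv = np.std(x, ddof=1) / np.mean(x)        # SD relative to the mean
    print(name, round(cv, 3))                  # comparable despite very different means
```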
Let’s say that your company is running a standard control and variant AB test on a feature to increase conversion rates on the landing page. The PM checks the results and finds a .04 p-value.
How would you assess the validity of the result?
It is always important to clarify assumptions about the question upfront. In this particular question, clarifying the context of how the AB test was set up and measured will specifically draw out the solutions that the interviewer wants to hear.
If we have an AB test to analyze, there are two main ways in which we can look for invalidity. We could likely re-phrase the question to: How do you set up and measure an AB test correctly?
Let’s start out by answering the first part of figuring out the validity of the set up of the AB test.
- How were the user groups separated?
Can we determine that the control and variant groups were sampled according to the test conditions? If we're testing changes to a landing page to increase conversion, can we compare users in the two groups on metrics whose distributions should look the same?
For example, if the groups were randomly bucketed, does the distribution of traffic from different attribution channels still look similar or is the variant A traffic channel coming primarily from Facebook ads and the variant B from email? If testing group B has more traffic coming from email then that could be a biased test.
- Were the variants equal in all other aspects?
The outside world often has a much larger effect on metrics than product changes do. Users can behave very differently depending on the day of week, the time of year, the weather (especially in the case of a travel company like Airbnb), or whether they learned about the website through an online ad or found the site organically.
If variant A's landing page has a picture of the Eiffel Tower and the submit button at the top of the page, and variant B's landing page has a large picture of an ugly man and the submit button at the bottom of the page, then we could get conflicting results based on the change to multiple features.
Measurement
Looking at the actual measurement of the p-value, the industry-standard significance threshold is .05, which means that if there were truly no difference between the populations, we would only expect to see a result this extreme about 1 time in 20. However, we have to note a couple of things about the test in the measurement process.
What was the sample size of the test?
Additionally, how long did it take before the product manager measured the p-value?
Lastly, how did the product manager measure the p-value and did they do so by continually monitoring the test?
If the product manager ran a t-test with a small sample size, they could very easily get a p-value under 0.05. Much of the confusion in AB testing comes from how much time you need before drawing a conclusion about the results of an experiment.
The problem with using the p-value as a stopping criterion is that the statistical test that gives you a p-value assumes that you designed the experiment with a sample and effect size in mind. If we continuously monitor the development of a test and the resulting p-value, we are very likely to see an effect, even if there is none. The opposite error is also common when you stop an experiment too early, before an effect becomes visible.
The most important reason is that we perform a statistical test every time we compute a p-value, and the more tests we perform, the more likely we are to find a spurious effect.
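a rough simulation of that effect, assuming scipy is available; there is no real difference between the groups, yet peeking at the p-value every "day" declares significance far more often than 5% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_experiments, n_per_day, n_days = 1000, 100, 30
false_positives = 0

for _ in range(n_experiments):
    a, b = [], []
    for _ in range(n_days):
        a.extend(rng.normal(0, 1, n_per_day))    # control: no true effect
        b.extend(rng.normal(0, 1, n_per_day))    # variant: identical distribution
        if stats.ttest_ind(a, b).pvalue < 0.05:  # "peek" at the p-value each day
            false_positives += 1
            break

print(false_positives / n_experiments)           # well above the nominal 0.05
```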
How long should we recommend an experiment to run for, then? To prevent a false negative (a Type II error), the best practice is to decide, before starting the experiment, the minimum effect size that we care about, and then compute how long to run the experiment based on the certainty we want and the sample size (the number of new samples that arrive each day).
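a sketch of that pre-experiment calculation with statsmodels; the baseline conversion rate, minimum detectable effect, significance level, power, and daily traffic are all assumed numbers:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# assume a 5% baseline conversion rate and that +1 point is the smallest effect we care about
effect = abs(proportion_effectsize(0.05, 0.06))   # Cohen's h for 5% -> 6%

n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)

daily_visitors_per_group = 500                         # assumed traffic per group
print(round(n_per_group))                              # required sample size per group
print(round(n_per_group / daily_visitors_per_group))   # days to run before stopping
```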
What does coefficient of determination tell you? When would you use it?
- r^2 is coefficient of determination
- result of dividing regression sum of squares by the total sum of squares
- explains how much of the variation in y is explained by the x variable(s)
- ranges from 0 to 1
- used to explain how much variability of one factor can be caused by its relationship to another factor
- also known as the "goodness of fit" of the regression
in investing, r^2 is interpreted as the percentage of a fund or security's movements that can be explained by movements in a benchmark index.
e.g. an r^2 for a fixed income security vs a bond index identifies the proportion of the security's price movement that is predictable based on the price movement of the index
-measures how close the data are to the fitted regression line
= explained variation / total variation
- r-squared is high when the error terms are small
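a sketch of that calculation using the equivalent 1 - (residual variation / total variation) form; the observed and fitted values are made up:

```python
import numpy as np

y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])       # observed values
y_hat = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # fitted values from some regression

ss_res = np.sum((y - y_hat) ** 2)      # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))             # close to 1 because the errors are small
```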
what is the difference between r-squared and adjusted r-squared
-r-squared is best for a simple linear regression model with one explanatory variable
- adjusted r-squared compensates for the addition of variables for a multiple linear regression
- it is a better measure of fit: it only increases if the new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model less than expected by chance
- in an overfitting situation, a misleadingly high value of r-squared is obtained
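a small sketch of the adjustment; the R² values, sample size, and predictor counts are illustrative:

```python
def adjusted_r_squared(r2, n, p):
    """adjusted R^2 for n observations and p explanatory variables"""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# a nearly useless extra predictor nudges R^2 up but pulls adjusted R^2 down
print(adjusted_r_squared(0.800, n=50, p=2))   # ~0.791
print(adjusted_r_squared(0.801, n=50, p=3))   # ~0.788, penalized for the extra term
```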
difference between r-squared and beta
beta: measure of relative risk
- if beta is high, it may produce higher returns than the benchmark
- measures how large those price changes are in relation to a benchmark
- r-squared measures how closely each change in the price of an asset is correlated to a benchmark
- used together, r-squared and beta give investors a thorough picture of the performance of asset managers
limitations of r-squared
- won't tell you whether your chosen model is good or bad
- won't tell you whether the data and predictions are biased
what is beta?
- measure of volatility of a security or portfolio compared to the market as a whole
- used in the capital asset pricing model (CAPM), which describes the relationship between systematic risk and expected return for assets (stocks)
- for beta to be meaningful, the stock should be related to the benchmark that is used in the calculation
- in statistics, beta represents the slope of the line through a regression of data points
- in finance, each of these data points represents an individual stock's returns against those of the market as a whole
- calculation allows investors to understand whether a stock moves in the same direction as the rest of the market.
- provides insights on how risky a stock is relative to the rest of the market
- used to determine a security's short-term risk and to analyze volatility when arriving at the cost of equity under the CAPM
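a sketch of beta as a regression slope on simulated daily returns (the true beta of 1.3 is an assumption of the example):

```python
import numpy as np

rng = np.random.default_rng(4)
market_returns = rng.normal(0.0005, 0.01, 250)                     # ~1 year of daily returns
stock_returns = 1.3 * market_returns + rng.normal(0, 0.005, 250)   # true beta ≈ 1.3

# beta = slope of the regression of stock returns on market returns
beta = np.cov(stock_returns, market_returns)[0, 1] / np.var(market_returns, ddof=1)
print(round(beta, 2))   # ~1.3: moves in the same direction as the market, but more
```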
disadvantages:
- less meaningful for predicting a stock's future movements, since it's calculated from historical data points