Linear regression-2 Flashcards
P-values, linear regression
What is the problem with R squared?
R squared is sensitive to the amount of data. When we see a pattern in a small dataset, we cannot be confident that the pattern is not due to random chance.
What is p-value?
A p-value gives us a measure of how confident we can be in the results of a statistical analysis.
Build the intuition for a p-value with an example
Let us consider two drugs, Drug A and Drug B. A p-value helps us determine whether the two drugs are equally effective or one is more effective than the other. Only if the p-value lets us establish a difference do we go on to ask whether A is better or worse than B.
p-value interpretation when A and B are different.
We ran one experiment, giving drugs A and B to people. A cured 37% and B cured 31%. On the face of it, A worked better than B. But how confidently can we say that A is better than B? This is where p-values help. p-values are numbers between 0 and 1, and they quantify how CONFIDENT WE SHOULD BE THAT DRUG A IS DIFFERENT FROM DRUG B.
What does a p-value close to 0 denote?
The closer a p-value is to 0, the more confidence we have that drugs A and B are different.
The closer a p-value is to 0, the more confidence we have that drugs A and B are different. How close should a p-value be to 0?
In practice, a commonly used threshold is 0.05.
What does p-value threshold of 0.05 mean?
It means that if there is no difference between drugs A and B, and we repeated the exact same experiment many times, then ONLY 5% OF THOSE EXPERIMENTS WOULD RESULT IN THE WRONG DECISION (concluding that the drugs are different).
How to calculate a p-value
Using a statistical test such as Fisher's exact test.
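As a rough illustration, here is a minimal sketch of a two-sided Fisher's exact test written with only the Python standard library. The deck gives the cure rates (37% and 31%) but not the group sizes, so the 100 people per group used below are an assumption made up for this example, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two drugs, columns are cured / not cured.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x cured in row 1, with all margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up the probabilities of every table at least as unlikely as the observed one
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )

# ASSUMPTION: 100 people per group (not stated in the deck)
p_value = fisher_exact_two_sided(37, 63, 31, 69)
print(f"p-value for 37/100 vs 31/100 cured: {p_value:.3f}")
```

With these assumed group sizes the p-value comes out well above 0.05, so this one experiment would not let us claim that the drugs differ.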
Explain what a p-value threshold of 0.05 means, with an example
Example:
1. Let's give the same drug, drug A, to two groups of people. The first test gave a p-value of 0.09, so we failed to see a difference between the two groups.
2. If we repeat the same experiment, most of the time the p-value will be large and we will again fail to see a difference.
3. Once in a while, by chance, people who are allergic to the drug might all end up in the same group. Drug A might then fail to work for that group, and we will get a small p-value, suggesting that the two groups received different drugs even though both received the same drug. This is a FALSE POSITIVE.
A threshold of 0.05 means that 5% of the experiments in which the only differences come from random chance will generate a p-value smaller than 0.05.
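The false positive rate described above can be checked with a small simulation: give the same drug to two groups many times and count how often the p-value falls below the threshold. This is only a sketch. To keep it short it uses an approximate two-proportion z-test rather than Fisher's test, and the group size, cure rate, number of repeats, and function names are all assumptions chosen for illustration.

```python
import random
from math import erfc, sqrt

def two_proportion_p_value(cured_1, cured_2, n):
    """Approximate two-sided p-value (z-test) for two groups of n people each."""
    pooled = (cured_1 + cured_2) / (2 * n)
    standard_error = sqrt(pooled * (1 - pooled) * 2 / n)
    if standard_error == 0:
        return 1.0  # everyone or no one cured in both groups: no visible difference
    z = (cured_1 - cured_2) / n / standard_error
    return erfc(abs(z) / sqrt(2))  # two-sided tail area of the normal distribution

def false_positive_rate(n_experiments=2000, n=100, cure_rate=0.35,
                        threshold=0.05, seed=0):
    """Fraction of same-drug experiments that still give p < threshold."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups get the SAME drug, so any difference is random chance
        group_1 = sum(rng.random() < cure_rate for _ in range(n))
        group_2 = sum(rng.random() < cure_rate for _ in range(n))
        if two_proportion_p_value(group_1, group_2, n) < threshold:
            false_positives += 1
    return false_positives / n_experiments

rate = false_positive_rate()
print(f"Share of same-drug experiments with p < 0.05: {rate:.3f}")
```

The printed share should land near 0.05, which is what the threshold promises when there is no real difference.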
How to set the threshold for a p-value
For extremely important tests, such as the effectiveness of a drug, we need high confidence when we state that the drugs are or are not different. In such cases we can use very small thresholds like 0.01 or 0.001 (with 0.001, 1 in 1,000 experiments with no real difference can lead to a False Positive).
Likewise, for less important tests, we can use a somewhat higher threshold such as 0.2 (2 in 10 such experiments can lead to a False Positive).
p = 0.24 denotes that drugs …..
We are not confident that drugs A and B are different.
p = 0.02 denotes that drugs ….
We are confident that the drugs A and B are different.
What is Hypothesis testing?
Trying to determine whether the drugs are the same or not is called Hypothesis testing.
What is Null Hypothesis?
The Null Hypothesis is that the drugs are the same, and the p-value helps us decide whether we should reject the Null Hypothesis.
What DOESN'T a p-value tell us?
Although a p-value helps us decide whether drugs A and B are different, it doesn't tell us HOW DIFFERENT they are.
What is the MOST IMPORTANT property of p-value that you should keep in mind while viewing the p-value results?
A small p-value DOESN’T IMPLY that the effect size or difference between drugs A and B is large.
A large p-value DOESN’T IMPLY that the effect size or difference between drugs A and B is small.
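A quick numeric sketch of both statements, using an approximate two-proportion z-test. The cure counts and group sizes are invented for illustration and the function name is hypothetical. With only 10 people per group the normal approximation is rough, so treat the second number as indicative only.

```python
from math import erfc, sqrt

def two_proportion_p_value(cured_a, cured_b, n):
    """Approximate two-sided p-value (z-test) for two groups of n people each."""
    pooled = (cured_a + cured_b) / (2 * n)
    standard_error = sqrt(pooled * (1 - pooled) * 2 / n)
    if standard_error == 0:
        return 1.0
    z = (cured_a - cured_b) / n / standard_error
    return erfc(abs(z) / sqrt(2))

# Tiny effect (51% vs 50% cured) but huge groups: the p-value is very small
p_tiny_effect = two_proportion_p_value(510_000, 500_000, 1_000_000)

# Large effect (70% vs 40% cured) but small groups: the p-value is large
p_large_effect = two_proportion_p_value(7, 4, 10)

print(f"1-point difference, 1,000,000 per group: p = {p_tiny_effect:.3g}")
print(f"30-point difference, 10 per group:       p = {p_large_effect:.3g}")
```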
How does Linear regression fit a line? What is the final choice of line based on?
Linear regression fits the line that minimizes the Sum of Squared Residuals (SSR).
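To make the quantity concrete, here is a small sketch that computes the Sum of Squared Residuals for two candidate lines. The five data points and the function name are made up for illustration.

```python
def sum_of_squared_residuals(xs, ys, intercept, slope):
    """Sum over all points of (observed y - predicted y) squared."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for this example
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

# A horizontal line at the mean of y versus a line with slope 1 through the origin
ssr_flat = sum_of_squared_residuals(xs, ys, intercept=3.04, slope=0.0)
ssr_tilted = sum_of_squared_residuals(xs, ys, intercept=0.0, slope=1.0)

print(f"SSR, horizontal line: {ssr_flat:.3f}")
print(f"SSR, tilted line:     {ssr_tilted:.3f}")
```

The tilted line has the much smaller SSR, so of these two candidates it is the better fit.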
How to quantify the accuracy or quality of the predictions in Linear regression?
Once we fit a line to the data, we can calculate R squared, which gives us a sense of how accurate our predictions will be.
Linear regression then provides a p-value for R squared, which gives us a sense of how confident we should be in the predictions.
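A minimal sketch of the R squared calculation, using the usual definition R squared = (SSR around the mean - SSR around the fitted line) / SSR around the mean. The data points, the candidate line, and the function name are assumptions for illustration.

```python
def r_squared(xs, ys, intercept, slope):
    """Fraction of the variation around the mean of y that the line explains."""
    mean_y = sum(ys) / len(ys)
    ssr_mean = sum((y - mean_y) ** 2 for y in ys)
    ssr_fit = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return (ssr_mean - ssr_fit) / ssr_mean

# Hypothetical data, invented for this example
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

r2 = r_squared(xs, ys, intercept=0.0, slope=1.0)
print(f"R squared for the line y = x: {r2:.3f}")
```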
How to minimize the SSR in linear regression
Choose the y-axis intercept and slope that minimize the SSR.
What is an analytical solution in general?
With an analytical solution, we end up with a formula that we can plug the data into, and the output is the optimal value.
What is the analytical solution for finding the optimal y-intercept in linear regression?
- Keeping the slope constant, we see how the SSR changes for different y-intercept values. This results in a U-shaped curve.
- Find the lowest point on the curve, i.e. the lowest SSR.
- To find the lowest point on the curve, calculate the derivative of the curve. Where the derivative is 0 is the bottom of the curve.
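The steps above can be sketched in code. For a fixed slope m, the derivative of the SSR with respect to the intercept b is -2 * sum(y - b - m*x); setting it to 0 gives b = mean(y - m*x). The data, the fixed slope of 1.0, and the function names below are assumptions for illustration.

```python
def ssr(xs, ys, intercept, slope):
    """Sum of Squared Residuals for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def best_intercept_for_slope(xs, ys, slope):
    """Intercept where d(SSR)/d(intercept) = 0, with the slope held constant."""
    return sum(y - slope * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data, invented for this example
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

slope = 1.0  # held constant, as in the first step
b = best_intercept_for_slope(xs, ys, slope)

# Nudging the intercept either way should only increase the SSR
for candidate in (b - 0.1, b, b + 0.1):
    print(f"intercept {candidate:.2f} -> SSR {ssr(xs, ys, candidate, slope):.4f}")
```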
What is the iterative approach in Linear regression?
Gradient Descent - a way to find the optimal slope and y-axis intercept by starting from a guess and improving it in small steps.
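A minimal sketch of Gradient Descent for a straight-line fit, assuming a fixed learning rate and a fixed number of steps. The data, the starting guess of 0 for both parameters, the learning rate, and the step count are all choices made for this example and would need tuning on other data.

```python
def gradient_descent(xs, ys, learning_rate=0.01, steps=5000):
    """Minimize the SSR by repeatedly stepping against its gradient."""
    intercept, slope = 0.0, 0.0  # starting guess
    for _ in range(steps):
        residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        # Partial derivatives of the SSR with respect to each parameter
        d_intercept = -2 * sum(residuals)
        d_slope = -2 * sum(r * x for r, x in zip(residuals, xs))
        # Step downhill; the learning rate controls the step size
        intercept -= learning_rate * d_intercept
        slope -= learning_rate * d_slope
    return intercept, slope

# Hypothetical data, invented for this example
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

intercept, slope = gradient_descent(xs, ys)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

On this data the result should agree with the least-squares line, which has slope 0.97 and intercept 0.13.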
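Which one is quicker - the analytical or the iterative approach?
When an analytical solution exists, as it does for linear regression, it is usually quicker on small problems because it is a single formula. The iterative approach takes many small steps, but it can be cheaper on very large datasets.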
Where is iterative approach used?
Scenarios where there are no analytical solutions, including Logistic regression, Neural networks etc.
What do Rsq = 0.66 and p-value = 0.1 suggest?
A p-value of 0.1 implies that there is a 10% chance that random data (random data is used in deriving the p-value) could give us Rsq >= 0.66. This is a relatively high p-value, so we might not have a lot of confidence in the predictions.
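One way to picture "random data could give us Rsq this large" is a permutation sketch: shuffle the y values so that any link to x is broken, refit, and count how often R squared is at least as large as the observed one. This is an assumption about how to generate the random data, not the F-distribution calculation that regression software normally reports. The data and function names are invented for illustration.

```python
import random

def r_squared_of_best_fit(xs, ys):
    """R squared of the least-squares line (the squared correlation of x and y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

def permutation_p_value(xs, ys, n_shuffles=5000, seed=0):
    """Share of shuffled datasets whose R squared is >= the observed one."""
    rng = random.Random(seed)
    observed = r_squared_of_best_fit(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real relationship between x and y
        if r_squared_of_best_fit(xs, shuffled) >= observed:
            hits += 1
    return observed, hits / n_shuffles

# Hypothetical data, invented for this example
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.5, 2.0, 4.5, 3.0]

observed_r2, p_value = permutation_p_value(xs, ys)
print(f"R squared = {observed_r2:.3f}, permutation p-value = {p_value:.3f}")
```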