Midterm Review Flashcards
Define Simple Linear Regression
A dependent variable (ex. Y) is predicted from one independent variable (ex. X) based on a linear relationship
Define Regression
The relationship (dependency) between 2 variables: how one variable changes as the other changes
SLR equation
y = a + βx, where a is the intercept and β is the slope
Define Residuals
The differences between the observed data values & the values predicted by the regression line
Goal with SSR
To find the one line that minimizes the SSR (sum of squared residuals)
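A minimal sketch of residuals and SSR, using made-up toy data (the x and y values and the "other" comparison line are assumptions for illustration only). It fits the line with lm() and checks that another candidate line has a larger SSR:

```r
# Hypothetical toy data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit the least-squares line y = a + beta * x
fit <- lm(y ~ x)

# Residuals = observed y minus y predicted by the fitted line
res <- residuals(fit)

# SSR = sum of squared residuals for the fitted line
ssr <- sum(res^2)

# SSR for an arbitrary other line (intercept 0, slope 2.1), for comparison
ssr_other <- sum((y - (0 + 2.1 * x))^2)

ssr <= ssr_other  # the fitted line never has a larger SSR than another line
```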
Define Stock “Beta”
Beta is a risk measure of a stock investment, calculated as the slope coefficient (β) on the market return when the stock's return is regressed on the market return
When |β| > 1…
Stock is riskier & its returns swing more than market returns (greater volatility)
When |β| < 1…
Stock is less risky & its returns swing less than market returns
Monthly return formula
Monthly return = (Current month-end price - Last month-end price)/Last month-end price
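A quick sketch of the monthly return formula in R. The three month-end prices are hypothetical numbers chosen only to make the arithmetic easy to follow:

```r
# Hypothetical month-end prices, oldest first
prices <- c(100, 105, 102.9)

# (current month-end price - last month-end price) / last month-end price
monthly_return <- diff(prices) / head(prices, -1)

monthly_return  # one return per month after the first
```

Here diff() gives the month-to-month price changes and head(prices, -1) drops the final price so each change is divided by the previous month's price.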
Basic set-up for lm() function
regression_analysis_result_name <- lm(Y ~ X, data = dataset_name)
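A sketch that ties the lm() set-up to the stock beta card. The returns are simulated, not real market data: the true slope of 1.5, the column names stock_return and market_return, and the noise levels are all assumptions for illustration.

```r
set.seed(123)

# Simulated monthly returns (assumed values, not real data)
market_return <- rnorm(120, mean = 0.01, sd = 0.04)
stock_return  <- 0.002 + 1.5 * market_return + rnorm(120, mean = 0, sd = 0.02)
returns <- data.frame(stock_return, market_return)

# Y ~ X: the stock return is predicted from the market return
beta_fit <- lm(stock_return ~ market_return, data = returns)

# Beta is the coefficient on the market return
beta <- coef(beta_fit)[["market_return"]]
beta

# |beta| > 1 suggests the stock swings more than the market
abs(beta) > 1
```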
How to calculate SD manually
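- subtract the mean from each value: (y - µ)
- square each result from step 1: (y - µ)^2
- add up the squared results + divide by the count (n for a population, n - 1 for a sample)
- take the square root of the result from step 3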
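The manual SD calculation can be sketched in R as follows. The data vector is a hypothetical example; the built-in sd() divides by n - 1, so it matches the sample version rather than the population version:

```r
# Hypothetical data
y <- c(2, 4, 4, 4, 5, 5, 7, 9)

mu <- mean(y)            # mean of the data
sq_dev <- (y - mu)^2     # squared deviations from the mean

# Population SD: divide by the count n
pop_sd <- sqrt(sum(sq_dev) / length(y))

# Sample SD: divide by n - 1 (this is what sd() computes)
sample_sd <- sqrt(sum(sq_dev) / (length(y) - 1))

pop_sd
sample_sd
sd(y)
```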
How to manually calculate Q1 & Q3
- sort the dataset, then split it into 2 halves (leave out the median if the count is odd)
- find the median of each half: lower half → Q1, upper half → Q3
How to find IQR
Q3-Q1
How to calculate Lower & Upper Whisker
LW = Q1 - 1.5IQR
UW = Q3 + 1.5IQR
How to calculate Extreme Lower & Upper Whisker
eLW = Q1 - 3IQR
eUW = Q3 + 3IQR
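A sketch of the Q1/Q3, IQR, and whisker cards on a small hypothetical dataset, using the median-of-halves method. Note that R's quantile() uses a different interpolation rule by default, so it can return slightly different quartiles than this manual method:

```r
# Hypothetical data, sorted first
x <- sort(c(7, 1, 3, 9, 5, 11, 13, 40))
n <- length(x)

# Split into halves (the middle value is left out when n is odd)
lower_half <- x[1:floor(n / 2)]
upper_half <- x[(ceiling(n / 2) + 1):n]

q1 <- median(lower_half)
q3 <- median(upper_half)
iqr <- q3 - q1

# Lower & upper whiskers
lw <- q1 - 1.5 * iqr
uw <- q3 + 1.5 * iqr

# Extreme lower & upper whiskers
elw <- q1 - 3 * iqr
euw <- q3 + 3 * iqr

# Values outside the whiskers are flagged as outliers
outliers <- x[x < lw | x > uw]
extreme_outliers <- x[x < elw | x > euw]
```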
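Difference between using summarize() & mutate()
summarize() collapses the data into summary values (ex. one row of means); mutate() adds a new variable while keeping every row. Must use <- when mutating to save the new variable into a dataset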
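A small sketch of the difference, assuming the dplyr package is installed. The returns data frame and its column names are hypothetical:

```r
library(dplyr)

# Hypothetical dataset
returns <- data.frame(
  stock = c("A", "A", "B", "B"),
  ret   = c(0.02, -0.01, 0.05, 0.03)
)

# summarize() collapses the rows into a single summary row
summary_tbl <- summarize(returns, mean_ret = mean(ret))

# mutate() keeps every row and adds a new column;
# without the <- assignment the new column would not be saved
returns <- mutate(returns, ret_pct = ret * 100)

nrow(summary_tbl)  # number of rows after summarize()
nrow(returns)      # number of rows after mutate()
```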
When do you use $?
When you are referring to a specific variable inside a dataset (ex. dataset$variable)
Define Probability Density Curve
Density curve visualizes the probability distribution → how probabilities are distributed over the values of a random variable
Advantages of Probability Density Curve
- A more refined representation of data
- Facilitates probability calculation (even if data is absent)
Describe Skewed distribution
Mean > Median → Right-skewed
Mean < Median → Left-skewed
Density Curve Properties
- A density curve must lie on or above the horizontal axis
- The area under the density curve always equals 1 or 100%
> The curve can never dip below the x-axis
Relationship between Probability Density & Probability
Probability Density ≠ Probability
- For a continuous variable, the probability of it being one specific value is not meaningful because it always equals zero
Meaning of PD & Probability
Probability = area
Likelihood = density = height of the curve at one value (a straight vertical line, which has no area)
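A sketch of density vs probability using the standard normal curve. dnorm() returns the height of the curve and pnorm() returns the area to the left of a value; the narrow SD of 0.1 is just an assumed example to show that a density can be greater than 1:

```r
# Height of the standard normal curve at 0: a density, not a probability
height_at_0 <- dnorm(0)

# A density can exceed 1 when the curve is narrow (assumed sd = 0.1)
narrow_height <- dnorm(0, mean = 0, sd = 0.1)

# Probability = area under the curve over an interval
prob_interval <- pnorm(0.5) - pnorm(-0.5)

# Probability of one exact value = area of a line = 0
prob_point <- pnorm(0) - pnorm(0)

# Total area under the density curve
total_area <- integrate(dnorm, -Inf, Inf)$value
```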
Representation of Normal Distribution
If a random variable follows a normal distribution, it is presented as:
X ~ N (µ, σ)
Mean → center of the curve
Stdev → width (spread) of the curve
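How to find the probability for a z-score in the standard normal table
- Find the row that matches the sign, units digit & first decimal of the z-score (ex. 1.9 for z = 1.96)
- Find the column that matches the second decimal of the z-score (ex. 0.06 for z = 1.96)
- Find the probability value in the cell where the row & column meet; in a cumulative (left-tail) table this is Pr(Z < z)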
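As a cross-check on a table lookup, pnorm() returns the same cumulative (left-tail) probability. The two z-scores below are example values; this assumes the table being used is a cumulative one:

```r
# z = 1.96: row 1.9, column 0.06
p_pos <- round(pnorm(1.96), 4)

# z = -1.25: row -1.2, column 0.05
p_neg <- round(pnorm(-1.25), 4)

p_pos  # 0.975
p_neg  # 0.1056
```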
Define Standardization
Transforming a general normal distribution (ex. X ~ N (µ, σ)) to a standard normal distribution (ex. Z ~ N (0, 1)) using z = (x - µ)/σ
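A sketch of standardization with assumed example values (µ = 100, σ = 15, x = 130). It shows that the probability is the same whether it is computed on the general normal or on the standardized z-score:

```r
# Assumed example values
mu <- 100
sigma <- 15
x <- 130

# Standardize: how many SDs is x away from the mean?
z <- (x - mu) / sigma

# Pr(X < 130) on the general normal N(100, 15)
p_general <- pnorm(x, mean = mu, sd = sigma)

# Pr(Z < z) on the standard normal N(0, 1)
p_standard <- pnorm(z, mean = 0, sd = 1)

z
all.equal(p_general, p_standard)
```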
If a dataset follows a normal distribution (general or standard), then (Empirical Rule)
68% of the data lies within one standard deviation of the mean
95% of the data lies within two standard deviations of the mean
99.7% of the data lies within three standard deviations of the mean
Find Pr (-1 < Z < 1) using R code
pnorm(1, mean=0, sd=1) - pnorm(-1, mean=0, sd=1)
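The same pnorm() subtraction can be repeated for 1, 2, and 3 standard deviations to recover the Empirical Rule percentages. A minimal sketch (the variable name within_k is just an illustrative choice):

```r
# Pr(-k < Z < k) for k = 1, 2, 3 standard deviations
within_k <- sapply(1:3, function(k) {
  pnorm(k, mean = 0, sd = 1) - pnorm(-k, mean = 0, sd = 1)
})

round(within_k, 4)  # 0.6827 0.9545 0.9973
```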
Checks for Normality
- Histogram & Density Curve → bell-shaped & symmetric around the mean
- Empirical Rule intervals → 68%, 95%, 99.7%
- IQR-to-SD ratio ≈ 1.3
- Quantile-Quantile (Q-Q) Normality Plot
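A sketch of these checks on simulated data (the sample of 1,000 values with mean 50 and SD 10 is an assumption; real data would replace x). The plotting steps are wrapped in a hypothetical helper, draw_checks(), so they only run when called:

```r
set.seed(1)

# Simulated data that should pass the checks (assumed mean 50, sd 10)
x <- rnorm(1000, mean = 50, sd = 10)

# IQR-to-SD ratio: roughly 1.3 for normal data
ratio <- IQR(x) / sd(x)

# Empirical Rule: share of data within 1 SD of the mean (about 68%)
share_1sd <- mean(abs(x - mean(x)) <= sd(x))

# Hypothetical helper for the visual checks
draw_checks <- function(x) {
  hist(x, freq = FALSE)   # histogram on the density scale
  lines(density(x))       # density curve on top
  qqnorm(x)               # Q-Q normality plot
  qqline(x)               # reference line: points near it suggest normality
}

ratio
share_1sd
```

Calling draw_checks(x) in an interactive session draws the histogram with its density curve, followed by the Q-Q plot.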
2 questions related to Population vs Sample
- Can we make accurate inferences about the population based on a sample of data? (Today’s class)
- How confident can we be in these inferences? (After the reading week)
Define Parameter
A numerical value that describes a specific characteristic of an entire population
Define Sample Statistic
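A numerical value that describes a specific characteristic of a sample (ex. the sample mean), used to estimate the population parameter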
What is the meaning behind Statistical Inferences
Making informed inferences about the unknown population parameter based on the sample statistics
Define normal distribution
A perfectly normal distribution appears as a smooth, symmetric, bell-shaped curve centered around the mean; real sample data only approximate this shape & are not perfectly smooth
What is the SD of the sample means
Standard error (SE) = σ/√n
- Smaller than the population SD by a factor of √n (ex. ~ 10 times smaller when n = 100)
Define Central Limit Theorem
As the sample size (n) increases, the distribution of the sample mean approaches a normal distribution centered on the actual (population) mean, whatever the shape of the population, & the spread of the sample means around the actual mean decreases
- As the # of samples drawn increases, the histogram of the sample means shows a smoother bell-shaped curve, more closely adhering to the normal distribution's characteristics
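A simulation sketch of the standard error and the Central Limit Theorem. The population here is an assumed right-skewed exponential distribution (mean 1, SD 1), with 5,000 samples of size n = 100; none of these settings come from the course material:

```r
set.seed(42)

pop_mean <- 1   # mean of the assumed Exp(1) population
pop_sd   <- 1   # SD of the assumed Exp(1) population
n        <- 100 # size of each sample

# Draw 5,000 samples and keep the mean of each one
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

# Standard error in theory: population SD / sqrt(n)
se_theory <- pop_sd / sqrt(n)

# Standard error from the simulation: SD of the sample means
se_sim <- sd(sample_means)

se_theory
se_sim
mean(sample_means)  # close to the population mean
```

Running hist(sample_means) afterwards should show a roughly bell-shaped histogram even though the population itself is skewed.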