Week 1: Central Tendency, Variability Measures, Z-Score, Linear Regression, Correlation, R Squared, Predicted Values, Residual Flashcards
1. What type of variables are these? Country, Income, Temperature, IQ, pH, Cancer types, Hair color, Socio-economic status, Number of pets
2. What measures of central tendency and variability can we calculate for each?
- Country - Nominal (mode only)
- Income - Ratio (mode, median, mean, IQR, Var, SD)
- Temperature (in °C or °F) - Interval (all of them)
- IQ - Interval (all)
- pH - Interval (all)
- Cancer types - Nominal (mode only)
- Hair color - Nominal (mode only)
- SES - Ordinal (mode, median)
- Number of pets - Ratio, discrete (all; the mean is meaningful even though it need not be a whole number)
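The answer above can be turned into a small lookup table. This is only a sketch: the names `ALL_STATS`, `ALLOWED` and `LEVEL` are made up for illustration, and the grouping simply follows the card (mode for nominal, mode and median for ordinal, everything for interval and ratio).

```python
# Statistics that are meaningful at each measurement level
# (grouping taken from the card above; all names are illustrative).
ALL_STATS = ["mode", "median", "mean", "IQR", "variance", "SD"]

ALLOWED = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ALL_STATS,
    "ratio": ALL_STATS,
}

# Measurement level of each variable from the question
LEVEL = {
    "Country": "nominal",
    "Income": "ratio",
    "Temperature": "interval",
    "IQ": "interval",
    "pH": "interval",
    "Cancer types": "nominal",
    "Hair color": "nominal",
    "Socio-economic status": "ordinal",
    "Number of pets": "ratio",
}

for variable, level in LEVEL.items():
    print(f"{variable}: {level} -> {', '.join(ALLOWED[level])}")
```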
What graphs do we use for numerical/qualitative variables?
Qualitative - bar chart
Numeric - histogram, boxplot
Central Tendency What is 1. Mean 2. Median 3. Mode
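- Mean - the average (typical) value: sum of the values / number of values
- Median - the middle value: order the values, then pick the middle one (with an even number of values, average the two middle ones)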
- Mode - the most frequent value
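A quick way to check the three definitions is Python's built-in `statistics` module. The data below (number of pets in seven households) is invented for illustration.

```python
import statistics

# Hypothetical sample: number of pets in seven households
pets = [0, 1, 1, 2, 3, 1, 6]

mean = statistics.mean(pets)      # sum of the values / number of values
median = statistics.median(pets)  # middle value after sorting
mode = statistics.mode(pets)      # most frequent value

print("sorted:", sorted(pets))
print("mean:", mean, "median:", median, "mode:", mode)
```

Note how the single large value (6) pulls the mean above the median, while the median and mode stay put.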
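Measures of Variability What is: (formulas) 1. Variance 2. Standard Deviation 3. IQR
- Variance: how much the observations differ from each other (the average squared distance from the mean)
Population Var: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where $\mu$ is the population mean and $N$ the population size
Sample Var: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the mean of all observations and $n$ the sample size
- SD: measures the amount of variation/dispersion of a set of values
Formula same as Var(x), but with a square root over all of it: $\sigma = \sqrt{\sigma^2}$, $s = \sqrt{s^2}$
- IQR: spread of the middle 50% of the data, also called midspread
3rd quartile - 1st quartile ($Q_3 - Q_1$)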
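The three measures can be computed with the `statistics` module, as sketched below on invented data. One caveat: there are several conventions for computing quartiles, so `statistics.quantiles` (default method `"exclusive"`) may give slightly different quartiles than a textbook or another program.

```python
import statistics

# Hypothetical data set
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)    # divides by N (population)
sample_var = statistics.variance(data)  # divides by n - 1 (sample)
pop_sd = statistics.pstdev(data)        # square root of pop_var
sample_sd = statistics.stdev(data)      # square root of sample_var

# Quartiles: the three cut points that split the sorted data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                           # interquartile range (midspread)

print("population variance:", pop_var, "SD:", pop_sd)
print("sample variance:", sample_var, "SD:", sample_sd)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

The sample variance is always a little larger than the population variance on the same numbers, because it divides by n - 1 instead of N.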
What is a normal distribution?
- mean = median = mode
- empirical rule: about 68% / 95% / 99.7% of the observations lie within 1 / 2 / 3 SD of the mean
- well described by its mean and SD
- unimodal
- symmetrical, centered on the mean
- fixed score distribution
The Standard Normal Distribution is…
a ND with mean: 0 and variance: 1 (so SD is also 1)
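The 68/95/99.7 empirical rule can be checked against the standard normal distribution with `statistics.NormalDist` from the standard library. A minimal sketch; the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, SD 1
std_normal = NormalDist(mu=0, sigma=1)

# Share of observations within k SDs of the mean, for k = 1, 2, 3
within = {}
for k in (1, 2, 3):
    within[k] = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within {k} SD: {within[k]:.4f}")
```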
Describe the +/- skewness
+ right (positively) skewed - long tail to the right, mode < median < mean
- left (negatively) skewed - long tail to the left, mode > median > mean
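The direction of the skew shows up in how the mean sits relative to the median: the mean is pulled toward the long tail. The sketch below uses invented values, and the left-skewed sample is just the mirror image of the right-skewed one.

```python
import statistics

# Hypothetical right-skewed sample: one large value forms a long right tail
right_skewed = [20, 22, 25, 27, 30, 35, 120]
# Mirror image of the same sample: long tail to the left
left_skewed = [-x for x in right_skewed]

right_mean = statistics.mean(right_skewed)
right_median = statistics.median(right_skewed)
left_mean = statistics.mean(left_skewed)
left_median = statistics.median(left_skewed)

print("right skewed: mean", right_mean, "> median", right_median)
print("left skewed:  mean", left_mean, "< median", left_median)
```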
What is the Z-Score?
- How far is an observation from the mean in terms of SDs
- The nr of SDs by which the value of raw score is above or below the mean value of what is being observed
- The standardized score
Z = (observed value - mean) / SD
- If we extend 1 SD above the mean and 1 SD below => approx 68% of the observations are within the interval (for a ND)
- Approx 95% of the population would be between 2 SD above the mean and 2 SD below for a ND
- Also, if x is normally distributed, then z is ND, with mean = 0 and SD = 1
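A short sketch of standardizing a set of scores. The scores are invented, and the population SD (`pstdev`) is used here as an assumption; with a sample you would typically use `stdev` instead.

```python
import statistics

# Hypothetical test scores
scores = [55, 60, 65, 70, 75, 80, 85]

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)  # population SD (an assumption for this sketch)

# z = (observed value - mean) / SD
z_scores = [(x - mean) / sd for x in scores]

for x, z in zip(scores, z_scores):
    print(f"score {x}: z = {z:+.2f}")

# After standardizing, the z-scores have mean 0 and SD 1
z_mean = statistics.mean(z_scores)
z_sd = statistics.pstdev(z_scores)
print("mean of z:", z_mean, "SD of z:", z_sd)
```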
What is the Correlation Coefficient?
A way of summarizing a scatter plot into a number between -1 and 1
Steps
1. fits a straight line to the data
2. the cc remembers whether the slope of the straight line points downwards or upwards
if slope + => coeff between 0 and 1 => positive
if slope - => coeff between -1 and 0 => negative
if the line is flat => coeff is 0 => the closer to 0, the weaker the relationship
3. looks at the quality of the fit of the straight line to the data: the closer the points are to the line, the closer the coeff is to -1 or 1
What is Pearson Correlation (Formula)
FOR LINEAR (STRAIGHT-LINE) RELATIONSHIPS ONLY!
- summarizes the strength and direction of a straight-line relationship
1. strength - the closeness of the points to a straight line
2. direction - whether one var generally increases or decreases as the other increases
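$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the mean values of x and y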
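The Pearson formula translates almost term by term into code. A minimal sketch: the function name `pearson_r` and the data (hours studied vs. grade) are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sum of cross-deviations divided by
    the square root of the product of the summed squared deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: hours studied and grade obtained
hours = [1, 2, 3, 4, 5]
grades = [2, 4, 5, 4, 5]

r = pearson_r(hours, grades)
print(f"r_xy = {r:.4f}")

# Points exactly on a straight line give r = 1 (upward) or r = -1 (downward)
r_up = pearson_r(hours, [2 * x + 1 for x in hours])
r_down = pearson_r(hours, [10 - 3 * x for x in hours])
print(r_up, r_down)
```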
What is Linear Regression Analysis?
used to predict the value of a variable based on the value of another variable
describes the average relation between y-values and x-values
the points on the regression line are the predicted y-values, denoted by y hat
explores the relation btw a quantitative response var and one or more explanatory vars
Regression Line is fully determined if:
-> the intersection with the y-axis is known -> intercept
-> it is known how steep the line is -> slope
Formulas for:
Regression Line
Regression Model
RL: $\hat{Y}_i = b_0 + b_1 X_i$, with $b_0$ the intercept and $b_1$ the slope
RM: $Y_i = \hat{Y}_i + e_i = b_0 + b_1 X_i + e_i$, with $e_i$ the residual
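The cards do not say how b0 and b1 are estimated. The sketch below assumes ordinary least squares, the standard choice, where the slope is the summed cross-deviations over the summed squared x-deviations and the intercept makes the line pass through the point of means. The data is invented.

```python
# Hypothetical data: hours studied (X) and grade obtained (Y)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares estimates (an assumption; the cards do not specify the method)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)                          # slope
b0 = y_bar - b1 * x_bar    # intercept

# Regression line: predicted value (Y hat) for each X
y_hat = [b0 + b1 * x for x in xs]
# Regression model: residual = observed Y - predicted Y
residuals = [y - yh for y, yh in zip(ys, y_hat)]

print(f"Y hat = {b0:.2f} + {b1:.2f} * X")
for x, y, yh, e in zip(xs, ys, y_hat, residuals):
    print(f"X={x}  Y={y}  Y hat={yh:.2f}  residual={e:+.2f}")
```

With a least-squares fit that includes an intercept, the residuals sum to zero (up to rounding), which is a handy sanity check.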
How are R squared and rxy related?
$R^2 = r_{xy}^2$ (in simple linear regression, R squared is the square of the Pearson correlation)
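The relation can be verified numerically. The sketch below assumes the usual definition of R squared as 1 minus (sum of squared residuals / total sum of squares), which the cards do not spell out, and a least-squares fit with one explanatory variable. The data is invented.

```python
import math

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

# Pearson correlation
r = s_xy / math.sqrt(s_xx * s_yy)

# Least-squares line and its residuals
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
sst = s_yy                                                    # total

# Assumed definition: share of the variation in Y explained by the line
r_squared = 1 - sse / sst

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, R^2 = {r_squared:.4f}")
```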
About simple linear regression
One explanatory variable→simple regression
Multiple explanatory variables→multiple regression
- describes the average relation between Y values and X values
-> used when y is numeric (continuous), and the x var as well
limited because it is useful for summarizing associations only
Y Hat = estimated value
Y Line = predicted value for an individual