Data Analysis week 6 Flashcards
What does the covariance measure
The covariance of (x, y) reflects the strength and direction of the linear relation between x and y, but it also depends on the spread of x and the spread of y.
What does the formula of the covariance look like
The variance is the covariance of a variable with itself. The formula for the covariance is therefore the formula for the variance with the product of deviations (x - mean(x)) * (y - mean(y)) in place of the squared deviation (x - mean(x))^2.
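As a minimal sketch of that formula, the snippet below computes the covariance by hand and checks that the covariance of x with itself equals the variance. The data is made up for illustration, the helper name `covariance` is hypothetical, and the n - 1 divisor (sample covariance) is an assumption; the course may use n instead.

```python
from statistics import mean, variance

def covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1 (assumed)."""
    mean_x, mean_y = mean(x), mean(y)
    products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    return sum(products) / (len(x) - 1)

# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

cov_xy = covariance(x, y)  # 1.5 for this data
cov_xx = covariance(x, x)  # covariance of x with itself

print(cov_xy)
print(cov_xx, variance(x))  # both are the sample variance of x
```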
What is the correlation
The correlation is the normalized covariance: the covariance divided by the product of the standard deviations of x and y. This makes it scale independent, so it measures only the relation between x and y.
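To illustrate the scale independence, here is a small sketch that expresses the same made-up x values in two different units. The helper names and the data are assumptions for illustration only.

```python
from math import sqrt
from statistics import mean

def covariance(x, y):
    mean_x, mean_y = mean(x), mean(y)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    # Covariance divided by the product of the two standard deviations
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

x_m = [1.0, 2.0, 3.0, 4.0, 5.0]    # made-up values, say in metres
y = [2.0, 4.0, 5.0, 4.0, 5.0]
x_cm = [100 * xi for xi in x_m]    # the same values in centimetres

cov_m, cov_cm = covariance(x_m, y), covariance(x_cm, y)
r_m, r_cm = correlation(x_m, y), correlation(x_cm, y)

print(cov_m, cov_cm)  # the covariance is 100 times larger in centimetres
print(r_m, r_cm)      # the correlation is the same in both units
```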
What does the covariance show
It shows the tendency of the variables to change together, but its size also depends on the spread of the variables.
What does a positive covariance mean
If the covariance is positive, x tends to be high when y is high, and vice versa.
What does a negative covariance mean
If the covariance is negative, x tends to be high when y is low, and vice versa.
What does a positive correlation mean
If the correlation is positive, x tends to be high when y is high, and vice versa.
What does a negative correlation mean
If the correlation is negative, x tends to be high when y is low, and vice versa.
What is the range of the correlation and what does the value tell us about x and y
The correlation is always a number between -1 and 1. If the correlation is equal to -1 or 1, all points (x, y) lie exactly on a straight line. If the correlation is 0, there is no linear relation between x and y (there may still be a non-linear relation).
What kind of relations does the correlation measure
Only linear relations.
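A small sketch of why this matters: in the made-up data below, y_curve is completely determined by x, yet the correlation is 0 because the relation is not linear. The helper name `correlation` is hypothetical.

```python
from math import sqrt
from statistics import mean

def correlation(x, y):
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y_line = [2 * xi + 1 for xi in x]   # exact linear relation
y_curve = [xi ** 2 for xi in x]     # exact, but non-linear (symmetric) relation

r_line = correlation(x, y_line)     # 1: all points on a straight line
r_curve = correlation(x, y_curve)   # 0: no linear relation, despite the perfect dependence

print(r_line, r_curve)
```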
How can you check how sure you are of your estimated correlation
Use bootstrapping on the correlation (this can be done because the correlation is a descriptive statistic).
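One way this could look in code is sketched below. The data is synthetic, and the number of resamples (2000) and the 95% percentile interval are assumptions, not something the flashcards prescribe. Note that the (x, y) pairs are resampled together, so the relation within each pair is kept.

```python
import random
from math import sqrt

def correlation(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)
# Synthetic data: y depends on x plus noise
x = [rng.gauss(0, 1) for _ in range(50)]
y = [xi + rng.gauss(0, 1) for xi in x]

n = len(x)
boot = []
for _ in range(2000):
    # Draw n pair indices WITH replacement
    idx = [rng.randrange(n) for _ in range(n)]
    boot.append(correlation([x[i] for i in idx], [y[i] for i in idx]))

boot.sort()
# 95% percentile interval of the bootstrapped correlations
low, high = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(correlation(x, y), low, high)
```

A narrow interval suggests the estimated correlation is fairly stable; a wide one suggests it could easily have come out differently with another sample.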
What is the null hypothesis in hypothesis testing for the correlation
That there is no relation between the variables.
How do you make the null hypothesis true in hypothesis testing for the correlation
You break the relation between the variables by randomly shuffling one of the variables. You do this by drawing without replacement (unlike bootstrapping, where we draw samples with replacement). This is called permutation testing.
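The shuffling procedure above can be sketched as follows. The data is synthetic, and the number of permutations (2000) and the two-sided comparison on the absolute correlation are assumptions chosen for illustration.

```python
import random
from math import sqrt

def correlation(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(1)
# Synthetic data with a real relation between x and y
x = [rng.gauss(0, 1) for _ in range(50)]
y = [xi + rng.gauss(0, 1) for xi in x]

observed = correlation(x, y)

n_perm = 2000
count = 0
y_shuffled = list(y)
for _ in range(n_perm):
    # Shuffling reorders y without replacement, breaking its link with x
    rng.shuffle(y_shuffled)
    if abs(correlation(x, y_shuffled)) >= abs(observed):
        count += 1

# Fraction of shuffles with a correlation at least as extreme as the observed one
p_value = count / n_perm
print(observed, p_value)
```

A small p-value means that a correlation as strong as the observed one rarely appears when the null hypothesis is true.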
What is a prediction model
A model that describes the relation between variables in such a way that other values of the variables can be predicted.
What is the equation of a regression model
y = alpha + beta * x
How can you compute alpha and beta for the equation of a regression model and what are these called
By using formulas that use the covariance: beta = cov(x, y) / var(x) and alpha = mean(y) - beta * mean(x). These alpha and beta are called the regression coefficients. Together they define the best-fitting (least-squares) line for the dataset.
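As a sketch, the standard least-squares formulas for simple linear regression are beta = cov(x, y) / var(x) and alpha = mean(y) - beta * mean(x). The data below is made up and the helper name `covariance` is hypothetical.

```python
from statistics import mean

def covariance(x, y):
    mean_x, mean_y = mean(x), mean(y)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)

# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta = covariance(x, y) / covariance(x, x)   # slope: cov(x, y) / var(x)
alpha = mean(y) - beta * mean(x)             # intercept: line passes through the means

print(alpha, beta)  # roughly 2.2 and 0.6 for this data

# Predicting y for a new x value with y = alpha + beta * x
x_new = 6.0
y_pred = alpha + beta * x_new
print(y_pred)
```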
What are alpha and beta
alpha is the intercept: the point where the regression line intersects the y-axis, so the predicted y-value at x = 0. beta is the slope: the change in the predicted y-value when x increases by 1.
What does the regression line show and what is a condition for a good regression line
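The regression line shows the predicted value of y for each value of x. A prediction model will make mistakes when predicting values, but it should make the same kind of mistakes over the whole range of values: the errors should not be systematically larger or more one-sided in one region than in another.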
How can you check how sure you are of your computed regression coefficients
By bootstrapping, and/or by displaying a bunch of regression lines computed on resamples.
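A minimal sketch of bootstrapping the regression coefficients is given below. The data is synthetic, the helper name `fit_line` is hypothetical, and the number of resamples (1000) and the 95% percentile interval are assumptions. Each entry of `lines` is one (alpha, beta) pair, so the list could also be used to draw the bunch of regression lines.

```python
import random

def fit_line(x, y):
    """Least-squares intercept (alpha) and slope (beta) for y = alpha + beta * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta

rng = random.Random(0)
# Synthetic data around the line y = 1 + 2x
x = [rng.uniform(0, 10) for _ in range(40)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

n = len(x)
lines = []
for _ in range(1000):
    # Resample (x, y) pairs with replacement and refit the line
    idx = [rng.randrange(n) for _ in range(n)]
    lines.append(fit_line([x[i] for i in idx], [y[i] for i in idx]))

slopes = sorted(beta for _, beta in lines)
# 95% percentile interval for the slope
slope_low, slope_high = slopes[25], slopes[975]
print(fit_line(x, y), slope_low, slope_high)
```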
What is a residual and what do residuals measure
A residual is the difference between the actual value of an observation and its predicted value. Residuals measure the prediction error.
What does a residual model show
It shows whether the prediction model fits the data well.
When does a regression model fit the data well and how do you check this
If the residuals have about zero mean everywhere and have about the same spread everywhere. You check this by plotting the residuals and adding a smooth line.
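As a rough numeric stand-in for the residual plot, the sketch below computes the residuals and compares their mean and spread in the lower and upper half of the x range. The data is synthetic and splitting into two halves is an assumption made for illustration; a plot with a smooth line shows the same thing in more detail.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)
# Synthetic data around the line y = 1 + 2x
x = [rng.uniform(0, 10) for _ in range(100)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

# Least-squares fit of y = alpha + beta * x
mean_x, mean_y = mean(x), mean(y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
beta = sxy / sxx
alpha = mean_y - beta * mean_x

# Residual = actual value minus predicted value
residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]

# Sort residuals by x and split into a low-x half and a high-x half
pairs = sorted(zip(x, residuals))
half = len(pairs) // 2
low_res = [r for _, r in pairs[:half]]
high_res = [r for _, r in pairs[half:]]

print(mean(low_res), stdev(low_res))    # should be near 0, with similar spread...
print(mean(high_res), stdev(high_res))  # ...to this half if the model fits well
```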
What is the coefficient of determination
R^2. It measures how much of the variation in y the model explains. For a least-squares regression line it is always a value between 0 and 1: 0 means we’re not explaining anything and 1 means we’re perfectly explaining the data.
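A common way to compute it, sketched here on made-up data, is R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of y from its mean. For a regression line with one x variable this equals the squared correlation.

```python
from statistics import mean

# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least-squares fit of y = alpha + beta * x
mean_x, mean_y = mean(x), mean(y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
beta = sxy / sxx
alpha = mean_y - beta * mean_x

predicted = [alpha + beta * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # unexplained variation
ss_tot = sum((yi - mean_y) ** 2 for yi in y)                  # total variation in y

r_squared = 1 - ss_res / ss_tot
r_times_r = sxy ** 2 / (sxx * ss_tot)  # squared correlation, for comparison

print(r_squared, r_times_r)  # both roughly 0.6 for this data
```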
What do prediction intervals show
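They show the range in which a new observation is expected to fall. They include the uncertainty of the prediction itself (the scatter of individual observations around the regression line), not only the uncertainty in the estimated line.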