Statistics in R: Class test 1 Flashcards
Describe the two different types of categorical data
Nominal data - Data items placed into discrete classes. There is no inherent order between categories (Gender, Blood Group)
Ordinal data - numbers (or letters) act as labels only, but the categories can be logically arranged in a meaningful order (Ranks, Grades)
Describe the difference between interval and ratio data
Interval data has no true zero, so differences are meaningful but ratios such as doubling are not (temperature in °C). Ratio data has a true zero, so ratios such as doubling are meaningful (distance)
Describe the difference between continuous and discrete data
Continuous data can take any value within a range, so it has an infinite number of possible values; for example, a weight of 1.5kg is meaningful.
Discrete data can take only a set number of distinct values; for example, 1.5 fingers is not meaningful
What indicates that a dataset is normally distributed?
The data form a symmetrical, bell-shaped curve around the mean, with about 68% of values lying within + or - 1 standard deviation of the mean
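The 68% figure can be checked numerically. The following is a minimal sketch in Python, assuming only the standard library's statistics.NormalDist; the helper name proportion_within is made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of values lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_one_sd = proportion_within(1)  # roughly 0.68
within_two_sd = proportion_within(2)  # roughly 0.95
print(round(within_one_sd, 3), round(within_two_sd, 3))
```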
Define the mean, median and mode
Mean - the sum of all values divided by the number of values
Mode - the value that occurs most frequently
Median - the middle value when the values are placed in order
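As a quick illustration of the three definitions, here is a small Python sketch using the standard statistics module. The numbers are made up for the example.

```python
from statistics import mean, median, mode

values = [2, 3, 3, 5, 7, 10]  # made-up example data, already in order

mean_value = mean(values)      # sum of the values divided by how many there are
median_value = median(values)  # middle value; with an even count, the average of the two middle values
mode_value = mode(values)      # value that occurs most frequently

print(mean_value, median_value, mode_value)
```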
What are the null hypothesis and the alternative hypothesis?
Null hypothesis - there is no significant difference, correlation or relationship
Alternative hypothesis - there is a significant difference, correlation or relationship
What is a Type I and Type II error?
A Type I error is the rejection of a true null hypothesis (a false positive)
A Type II error is the non-rejection of a false null hypothesis (a false negative)
What is the p-value?
The p-value is the probability of obtaining results at least as extreme as those observed if the null hypothesis is true, i.e. by chance alone. If the p-value is <=0.05 (the conventional threshold) then the results are considered significant and the null hypothesis can be rejected
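One way to see what a p-value measures is an exact permutation test, sketched below in Python. This is an illustration chosen for this card, not necessarily the test used in the course, and the two groups are made-up data. It counts how often a difference in means at least as large as the observed one arises when the group labels are assigned arbitrarily.

```python
from itertools import combinations
from statistics import mean

group_a = [12, 14, 15, 16]  # made-up measurements
group_b = [8, 9, 10, 11]    # made-up measurements
pooled = group_a + group_b
observed = abs(mean(group_a) - mean(group_b))

# Under the null hypothesis the group labels are arbitrary, so every way of
# splitting the pooled values into two groups of four is equally likely.
extreme = 0
total = 0
for idx in combinations(range(len(pooled)), len(group_a)):
    a = [pooled[i] for i in idx]
    b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    total += 1
    # Small tolerance guards against floating-point rounding in the comparison.
    if abs(mean(a) - mean(b)) >= observed - 1e-12:
        extreme += 1

p_value = extreme / total  # proportion of splits at least as extreme as observed
print(p_value, p_value <= 0.05)
```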
What does Pearson’s correlation coefficient (r) tell us?
The strength and direction of the linear relationship between two numerical, normally distributed variables. r ranges from -1 to +1: values close to +1 or -1 indicate a strong positive or negative correlation, and values close to 0 indicate no linear relationship
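A minimal Python sketch of how r is calculated, written out by hand rather than with a library call. The function name pearson_r and the data are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # perfect positive linear relationship
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # perfect negative linear relationship
r_none = pearson_r(x, [3, 1, 4, 1, 3])       # no linear relationship
print(r_positive, r_negative, round(r_none, 6))
```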
Why is linear regression used?
To determine the line of best fit through a dataset that is linear, normally distributed and numerical, and to make predictions based on it
What are residuals?
Residuals are the differences between the observed values and the values predicted by the line of best fit
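The line of best fit and its residuals can be sketched in a few lines of Python using the least-squares formulas for one explanatory variable. The function name fit_line and the data are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up observations

slope, intercept = fit_line(x, y)
predicted = [slope * a + intercept for a in x]

# Residual = observed value minus the value predicted by the line.
residuals = [obs - pred for obs, pred in zip(y, predicted)]
print(round(slope, 2), round(intercept, 2))
print([round(res, 2) for res in residuals])
```

For a least-squares line with an intercept, the residuals sum to zero (up to rounding), which is a useful sanity check.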
What does the Coefficient of Determination (r^2) tell us and how is it found?
The Coefficient of Determination tells us how much of the variance in the response variable is explained by the explanatory variable. For a single explanatory variable it is found by squaring Pearson’s r
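The Python sketch below, on made-up data, checks that squaring r gives the same number as the proportion of variance accounted for by the fitted line, 1 - (residual sum of squares / total sum of squares), in the single-variable case.

```python
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up observations

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)  # total sum of squares of y

# Route 1: square Pearson's r.
r = sxy / (sxx * syy) ** 0.5
r_squared_from_r = r ** 2

# Route 2: fit the least-squares line and compare residual to total variation.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
r_squared_from_fit = 1 - ss_res / syy

print(round(r_squared_from_r, 4), round(r_squared_from_fit, 4))
```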
What does the Adjusted r^2 value tell us?
Adjusted r^2 tells us the proportion of variance explained by the model, adjusted for the number of terms in the model, so that adding explanatory variables does not automatically increase it
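A short Python sketch of the usual adjustment formula, 1 - (1 - r^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory variables. The function name and the input values are made up for illustration.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted r^2 = 1 - (1 - r^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_observations - 1) / (n_observations - n_predictors - 1)

# Same r^2 and sample size, but more explanatory variables gives a lower adjusted value.
adj_one_ev = adjusted_r_squared(0.80, n_observations=30, n_predictors=1)
adj_five_evs = adjusted_r_squared(0.80, n_observations=30, n_predictors=5)
print(round(adj_one_ev, 3), round(adj_five_evs, 3))
```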
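What is the formula for multiple linear regression analysis?
y = m1ev1 + m2ev2 + … + c, where y is the response variable, ev1, ev2, … are the explanatory variables, m1, m2, … are their coefficients (slopes) and c is the intercept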
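To show how the formula is used once the coefficients are known, here is a minimal Python sketch. The coefficients, intercept and explanatory values are hypothetical numbers, not fitted from any data, and the function name predict is made up for illustration.

```python
def predict(coefficients, intercept, explanatory_values):
    """y = m1*ev1 + m2*ev2 + ... + c for one observation."""
    return sum(m * ev for m, ev in zip(coefficients, explanatory_values)) + intercept

coefficients = [2.0, -0.5]  # hypothetical m1, m2
intercept = 10.0            # hypothetical c

# 2.0 * 3.0 + (-0.5) * 4.0 + 10.0
y_hat = predict(coefficients, intercept, [3.0, 4.0])
print(y_hat)  # 14.0
```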
How would you visualise the relationships between the RV and EVs in a multiple linear regression analysis?
Using a pairs plot, which draws a scatterplot for every pair of variables, so you can visualise each single linear relationship between the RV and the EVs (and between the EVs themselves)