Stats 2 - Stats & Linear Models Flashcards

1
Q

What are the components of a Linear Model?

A
2
Q

What are the types of variables you can have in a Linear Model?

A
3
Q

How many coefficients do continuous terms and categorical factors have in a Linear Model?

A

Continuous terms always have one coefficient (β2)

Categorical Factors have N − 1 coefficients, where N is the number of levels in the category

Why N-1? Why are we missing a level?

The missing level is incorporated into the baseline/reference value known as the Intercept (β1) –> the reference level is chosen alphabetically (R's default)
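
A minimal R sketch of this (the data and variable names are invented for illustration): a factor with 3 levels yields an intercept plus N − 1 = 2 coefficients.

set.seed(1)
treatment <- factor(rep(c("A", "B", "C"), each = 10))  # "A" becomes the alphabetical reference level
response  <- rnorm(30, mean = c(1, 2, 3)[treatment])   # invented response data
model     <- lm(response ~ treatment)
coef(model)  # (Intercept), treatmentB, treatmentC –> N - 1 = 2 factor coefficients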

4
Q

What makes a Linear model Linear?

A

Linear models are just a sum of terms that are linear in the coefficients –> each coefficient enters as a simple multiplier of its term (never squared, exponentiated, etc.), although the explanatory variables themselves may be transformed
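
For example (my own illustration, using the cards' β notation): y = β1 + β2·x + β3·x² is still a linear model, because each β simply multiplies its term, even though the fitted curve bends in x. By contrast, y = β1·e^(β2·x) is not linear, because β2 sits inside the exponential.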

5
Q

Examples of types of Linear models?

A

The response variable is always Continuous

  1. Simple Regression –> Continuous explanatory variable
  2. Multiple Linear Regression –> Continuous/Categorical explanatory variables
  3. ANOVA –> Categorical Explanatory Variable
  4. ANCOVA –> Categorical/Continuous explanatory variables

Note –> MLR and ANCOVA are very similar; they just place emphasis on one type of variable or the other (see the sketch below)
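
In R, all four are fitted with lm(); a hedged sketch with invented data and variable names:

d <- data.frame(growth   = rnorm(40),   # invented example data
                temp     = rnorm(40),
                rainfall = rnorm(40),
                species  = factor(rep(c("sp1", "sp2"), 20)))
lm(growth ~ temp, data = d)             # 1. simple regression (one continuous x)
lm(growth ~ temp + rainfall, data = d)  # 2. multiple linear regression
lm(growth ~ species, data = d)          # 3. ANOVA (categorical x)
lm(growth ~ species + temp, data = d)   # 4. ANCOVA (categorical + continuous)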

6
Q

How can we decide what the best fit for our Linear Model is?

A

The Least Squares Fitting Solution

Given the data collected, we need to figure out the linear model that best represents the data. This is done by…

  • minimizing the sum of the squared vertical distances (the residuals, 'R') between the data points and the line.

This is called the 'least squares solution' –> we square the residuals so that positive and negative deviations do not cancel out.

The plots show how resvar (the sum of squared residuals) changes as the coefficient changes –> the minimum corresponds to the solution.
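
A minimal R sketch of the idea (invented data; the intercept is fixed at 0 for simplicity): compute the residual sum of squares over a grid of candidate slopes and take the minimum.

set.seed(42)
x <- 1:20
y <- 2 * x + rnorm(20)                                 # invented data with true slope 2
slopes <- seq(0, 4, by = 0.01)                         # grid of candidate slopes
rss <- sapply(slopes, function(b) sum((y - b * x)^2))  # sum of squared residuals per slope
slopes[which.min(rss)]                                 # close to 2, and to coef(lm(y ~ x - 1))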

7
Q

How can we denote the linear model with the least squares fitting solution?

A

ŷ (y with a hat) –> the model's predicted values

8
Q

What are the assumptions of a Linear Model?

A
9
Q

Outline the role of each of the diagnostic plots

A

Code

par(mfrow = c(2, 2), mar = c(5, 5, 1.5, 1.5))  # 2 x 2 grid of plots with adjusted margins
plot(model)  # 'model' is your fitted lm object –> produces Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage

10
Q

What are the two main tests performed in order to check whether your model really explains the data? How do you know whether you can rely on it?

A

F-Test –> Tests how much variation is explained

T-Test –> Tests the significance of the estimated coefficients

11
Q

Outline the meaning of the terms TSS, ESS and RSS.

A

TSS –> sum of the squared differences between the observed dependent variable (y) and the mean of y –> tells you the total spread around the mean

ESS –> sum of the squared differences between the predicted y values and the mean of y –> the part of the spread that the linear model accounts for

We hope ESS is close to TSS (i.e. the model explains most of the spread) –> otherwise our model is poor

RSS –> sum of the squared differences between observed y and predicted y (the model) –> basically the sum of the squared residuals

RSS outlines how much variation our model cannot explain/take into account
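
A short R sketch (invented data) computing all three, and confirming the identity on the next card:

set.seed(7)
x <- 1:30
y <- 3 + 0.5 * x + rnorm(30)             # invented data
model <- lm(y ~ x)
tss <- sum((y - mean(y))^2)              # total spread around the mean
ess <- sum((fitted(model) - mean(y))^2)  # spread explained by the model
rss <- sum(resid(model)^2)               # unexplained (residual) spread
all.equal(tss, ess + rss)                # TRUE –> TSS = ESS + RSS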

12
Q

What is the relationship between ESS, RSS and TSS?

A

TSS = ESS + RSS

13
Q

How is the F-Statistic calculated?

A
14
Q

Now that you have calculated the F-Statistic, how do you know whether it is significant?

A
15
Q

How is the T-Statistic/Value Calculated?

A
16
Q

Figure out the combinations of Effect size and precision for each plot.

A

Effect size –> the strength of the relationship (the slope) between the two variables

Precision –> how closely the data points lie along the slope

Dotted lines represent confidence bounds –> a visual representation of reliability –> the bounds curve outwards at the extremes because extreme values can bias the slope more (they have more leverage)

Hence, it is useful to sample at the extremes

17
Q

Now that you have calculated the t-value for your coefficients, how can you test whether they are statistically significant?

A
18
Q

Generally speaking in statistics, what is a T-Test?

A
19
Q

Why are T-Tests not useful for comparing more than 2 levels?

A
20
Q

Formula to calculate the T-value?

A
21
Q

How is the standard error calculated in a One-sample T-test?

A

One-sample –> SE = s / √n –> where s is the sample standard deviation and n is the sample size

22
Q

Break down the following One-Sample T-Test output.

A

One sample T-Test

Used to test whether the mean of a sample is different from a specific value
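
A hedged sketch of producing such an output in R (invented data; mu is the specific value being tested against):

set.seed(3)
x <- rnorm(25, mean = 5.3, sd = 1)  # invented sample
t.test(x, mu = 5)                   # one-sample t-test: is the sample mean different from 5?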

23
Q

What is a two-sample T-Test? How does it compare to a one-sample?

A

Two-sample T-Test is used to compare two means –> we are asking if they are statistically different from each other

Main Difference

  1. In the one-sample T-Test we assume that one mean value is already known (a specific value) –> hence we only have one source of error, in our single sample
  2. In the two-sample T-Test we are using two means estimated from two samples –> both of which will have error
24
Q

How is the standard error calculated for a Two-sample T-Test?

A

Since we are dealing with two sources of error for our mean values we need to change our standard error calculation to reflect this.
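
The card's formula itself is not shown; a standard version (the one used by Welch's two-sample t-test, R's default, offered here as a hedged reconstruction) is:

SE = √(s₁²/n₁ + s₂²/n₂) –> where s₁, s₂ are the two sample standard deviations and n₁, n₂ the two sample sizes.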

25
Q

Assumptions for a One and Two-sampled T-Test?

A

One-sample T-Test

  1. Normally distributed data –> so that the mean is a good reflection of central tendency

Two-sample T-Test

  1. Normally distributed data –> so that the mean is a good reflection of central tendency
  2. The two data sets have similar variance (homogeneity of variance)
26
Q

Break down the following Two-sample T-Test output

A
27
Q

Generally speaking, what is the F-Test?

A

An F-test is fundamentally used to compare two variances –> are they statistically the same or not?

28
Q

What does var.test() on R do?

A

The var.test() function can be used to compare two variances –> e.g. comparing the genome size variation for two species suborders

It asks: is the ratio of the variances = 1 –> i.e. are they the same or not?
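
A minimal sketch (invented vectors):

set.seed(9)
a <- rnorm(30, sd = 1)    # invented sample 1
b <- rnorm(30, sd = 1.5)  # invented sample 2
var.test(a, b)            # H0: ratio of variances = 1 (equal variances)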

29
Q

If we want to perform a T-Test but our data is NOT normally distributed, what non-parametric test can we perform?

A

Wilcoxon Test
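
In R this is wilcox.test(); a minimal sketch with invented data:

x <- c(1.2, 3.4, 2.2, 5.1, 4.0)  # invented samples
y <- c(2.5, 4.4, 6.1, 3.9, 5.8)
wilcox.test(x, y)                # non-parametric alternative to the two-sample t-test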

30
Q

What type of data is suitable for a Linear model - Linear Regression?

A

Linear regression is a class of linear models that is frequently a good choice if your response and explanatory variables are both continuous.

31
Q

What does the Pearson correlation coefficient tell you?

A

It tells you the association between two variables NOT the effect –> it doesn’t assign the explanatory and response variables.

Pearson correlation assumes that your data is normally distributed

Results

Pearson correlation coefficient (r) –> ranges from -1 (perfectly negatively correlated) to +1 (perfectly positively correlated), with 0 being no correlation.

‘r’ will be positive if both ‘x’ and ‘y’ tend to move in the same direction

‘r’ will be negative if ‘x’ and ‘y’ move in opposite directions

32
Q

Difference between cor() and cor.test() on R?

A
  1. cor() –> calculates correlations between every pair of variables at once –> for example, you have a table and you want the correlation between each pair of columns
  2. cor.test() –> Only looks at the correlation between a single pair of variables but also performs a t-test –> to indicate the significance of the correlation –> T-test used to check whether the correlation is statistically different from 0.

NOTE –> use = “pairwise” –> tells R to omit observations with missing data and use complete pairs of observations –> i.e. ignore NAs
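
A hedged sketch (invented data frame with some NAs):

df <- data.frame(a = c(1, 2, 3, 4, NA),  # invented columns with missing values
                 b = c(2, 4, 6, 8, 10),
                 c = c(5, 3, NA, 1, 0))
cor(df, use = "pairwise")  # correlation matrix from complete pairs only
cor.test(df$a, df$b)       # single pair, with a t-test for r != 0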

33
Q

Why is it a good idea to transform (log) your data when performing a correlation test, if it is showing a curved relationship?

A

A problem that may arise when trying to calculate the correlation coefficient –> it assumes a straight-line relationship –> hence, we run into problems when there is a curved relationship. A log transformation can straighten the relationship out so that the assumption holds.

34
Q

Outline the code needed to plot two data sets on the same scatter plot

A
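
The original answer is not shown; a standard R approach (invented data and names) would be:

set.seed(11)
x1 <- rnorm(20); y1 <- x1 + rnorm(20)      # invented data set 1
x2 <- rnorm(20); y2 <- 2 * x2 + rnorm(20)  # invented data set 2
plot(x1, y1, col = "blue", pch = 16,
     xlim = range(c(x1, x2)), ylim = range(c(y1, y2)))  # axes wide enough for both sets
points(x2, y2, col = "red", pch = 17)      # add the second data set to the same plot
legend("topleft", legend = c("set 1", "set 2"),
       col = c("blue", "red"), pch = c(16, 17))
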
35
Q

What code can you use to plot the model diagnostic plots on the same page?

A
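
The original answer is not shown; card 9's code does exactly this, e.g.:

model <- lm(mpg ~ wt, data = mtcars)  # any fitted lm object (mtcars used for illustration)
par(mfrow = c(2, 2))                  # 2 x 2 grid: all four diagnostic plots on one page
plot(model)
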
36
Q

What type of data is suitable when constructing a Linear Model - ANOVA?

A

ANOVA –> suitable when the response (dependent) variable is continuous and your predictor (independent) variable is categorical.

ANOVA –> used to see whether the linear model produced with an explanatory and response variable accounts for the variance in the data better than the null model (i.e. where there is no correlation).

37
Q

Difference between a one-way ANOVA and a Two-Way ANOVA?

A

One-way ANOVA –> the effect of one categorical variable (with different levels) on a response variable

Two-way ANOVA –> the influence of two different categorical variables on a response variable.

When there are only two levels in a category –> you can simply use a t-test to compare the means –> in this case the relationship is F = t²

38
Q

Outline how the different Df values are calculated for TSS, ESS and RSS?

A
39
Q

Relationship between R (the correlation coefficient) and R²?

A

Note that R is the square root of R² (= ESS/TSS), where R is the correlation between the observed values (y) and the predicted values (ŷ).

40
Q

What is the quickest way to plot bar plots showing the mean and the confidence intervals?

A

Install –> install.packages("gplots")

Load –> library(gplots)
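
A hedged sketch, assuming gplots::barplot2 is the intended function (mtcars used as stand-in data):

library(gplots)
means <- tapply(mtcars$mpg, mtcars$cyl, mean)  # group means
ses   <- tapply(mtcars$mpg, mtcars$cyl,
                function(v) sd(v) / sqrt(length(v)))  # standard errors
barplot2(means, plot.ci = TRUE,
         ci.l = means - 1.96 * ses,  # approximate 95% confidence bounds
         ci.u = means + 1.96 * ses)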

41
Q

What is the Tukey Test used for?

A
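
The original answer is not shown; in brief, the Tukey (HSD) test compares all pairs of group means after an ANOVA while correcting for multiple comparisons. A hedged R sketch (mtcars as stand-in data):

model <- aov(mpg ~ factor(cyl), data = mtcars)  # one-way ANOVA
TukeyHSD(model)                                 # pairwise differences between group means
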
42
Q

Breakdown the following Tukey output

A
43
Q

How can you test whether explanatory variables are independent? Can this test be used for all types of data?

A

Chi-squared test –> used on categorical (count) data, so it cannot be used for all types of data.

Null hypothesis –> factors are independent

Chi-squared is significant –> factors are not independent

Chi-squared is not significant –> factors are independent
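
A minimal sketch (invented contingency table of counts):

tab <- matrix(c(20, 15, 10, 25), nrow = 2)  # invented 2 x 2 table of counts
chisq.test(tab)                             # H0: the two factors are independent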

44
Q

What information does the Summary output directly tell you?

A

This output summarizes several statistical tests in one: coefficient estimates, t-values/t-tests, multiple/adjusted R-squared, the F-value/F-test, and the Df, which can be used to calculate the sample size

45
Q

Break down the following Linear regression summary output

A

Note - ~ is used to define the response variable (LEFT) and explanatory variable (RIGHT)

1. Call –> Highlights the Response and Explanatory Variable in question –> We are asking whether our Linear model can explain the relationship between the two

2. Residuals –> Outlines the distribution of the residual values using the min, max, quartiles and Median

3. Coefficients Table –> estimates the coefficients, calculates the t-value for each coefficient and then performs a t-test to check whether they are statistically significant

  • NOTE –> we have two coefficients –> one that describes the Intercept and another that describes the slope of the line (log(covid$Totalcases))
  • From the coefficients we can generate a linear equation
  • The t-value is calculated by dividing the Estimate by the Std. Error
  • The t-test checks whether each coefficient is equal to 0 (H0)

4. Residual Standard Error –> roughly, the typical size of the residuals (the average distance of the data points from the fitted line)

5. Multiple R-Squared –> the proportion of the variance explained by your model –> ESS/TSS (where TSS = ESS + RSS)

  • Different from the F-value, which is the ratio of explained to unexplained variance

6. Adjusted R-squared –> same idea as Multiple R-squared but includes a penalty for the degrees of freedom used –> hence the value is lower

Note - R-squared can be multiplied by 100 to obtain a percentage

7. F-Statistic –> briefly tells you the F-value for your linear model and its level of statistical significance –> Note: running anova() on the model will give you a more comprehensive F-test output
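
A hedged sketch of producing such an output (the covid data frame is invented here to mirror the card's own variable names):

covid <- data.frame(Totalcases  = c(1000, 5000, 20000, 80000, 150000),  # invented stand-in data
                    Totaldeaths = c(20, 90, 500, 1600, 3000))
model <- lm(log(Totaldeaths) ~ log(Totalcases), data = covid)
summary(model)  # Call, Residuals, Coefficients, RSE, R-squared, F-statistic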

46
Q

What does the ANOVA output in R tell you?

A

The direct output tells you how much variation is explained by each term.

But…

  • You can also calculate the total variation –> overall F-value of the model
  • You can also calculate R2 of the model
47
Q

Break down the following ANOVA output

Logtotaldeaths ~ Logtotalcases

A

Remember, in theory we want our model to maximize the F-value ratio –> proportionately more variation is explained relative to variation that is not explained.

1. Leftmost column –> each contributing term in the model

2. Degrees of freedom (Df)

a) Term Df –> calculated as n − 1, where n is the number of estimated parameters –> in simple regression we have two estimated parameters (slope and intercept), so the term Df = 1
b) Residual Df –> The residual degrees of freedom –> the number of data points that are allowed to vary from the Linear model

3. Sum sq –> Sum of squares

a) Term Sum sq –> thought of as ESS
b) Residual Sum sq –> thought of as RSS

4. Mean Sq –> each Sum Sq divided by its respective degrees of freedom –> yielding EMsq or RMsq

  • Note - Mean Sq = variance

5. F-Value –> calculated by dividing EMsq by RMsq

6. F-Test significance –> Testing our value against the Null hypothesis –> Null hypothesis (H0) = ratio of the variances = 1 –> Are EMsq and RMsq the same?
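
A hedged sketch of generating such a table (invented stand-in data; the variable names follow the card):

Logtotalcases  <- log(c(1000, 5000, 20000, 80000, 150000))  # invented data
Logtotaldeaths <- log(c(20, 90, 500, 1600, 3000))
anova(lm(Logtotaldeaths ~ Logtotalcases))  # Df, Sum Sq, Mean Sq, F value, Pr(>F)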

48
Q

When approaching a statistical test question, what is one of the first things you need to consider?

A

What type of data you are working with!

As this will dictate which statistical tests/procedures are appropriate considering their assumptions

E.g. proportions –> bounded between 0 and 1 –> meaning the data is not truly continuous –> so the mean is not a good measure of central tendency.

Thus…

Any statistical test involving the mean would not be appropriate! –> Variance, Standard deviation, sum of sq.

Whereas…

Range, median and IQR are appropriate

49
Q

Difference between a type 1 and type 2 error?

A

Type 1 Error –> False Positive –> meaning that we reject a true null hypothesis –> we think there is a correlation but, in reality, there is not!

Type 2 Error –> False Negative –> meaning that we fail to reject a false null hypothesis –> we basically accept a null hypothesis thinking there is no correlation when there is!

We spot Type 2 errors using our own intuition/previous literature on what should be happening –> normally we would expect a correlation, but this time we do not find one –> possible False Negative

50
Q

What is the non-parametric alternative to the Pearson correlation?

A

Spearman Correlation

Pearson’s is used for normally distributed data whereas Spearman is used for non-normally distributed data.
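
In R, a minimal sketch (invented data):

x <- c(1, 2, 3, 4, 5)  # invented data
y <- c(2, 1, 4, 3, 5)
cor.test(x, y, method = "spearman")  # Spearman rank correlation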

51
Q

How would you approach this question?

A

Remember, an F-test is fundamentally used to compare two variances –> are they statistically the same or not?

Context –> two-sample T-Test –> comparing the two mean numbers of hospital beds per 1000 population between North America and South America

BUT! For a two-sample T-Test to be reliable, there has to be homogeneity of variance –> so it can be followed up with an F-Test!

Output Breakdown

  1. F-Value –> Ratio of variance between NA and SA
  2. Df –> calculated by n-1 –> where n is the sample size –> work backwards to figure out the number of data points in each sample
  3. P-value –> indicates significance –> in this case it is above 0.05 –> no statistical significance –> we fail to reject the null hypothesis, meaning the variances are the SAME

This means we can interpret the results as they are!

If there was significance (different variances), we could perform a Wilcoxon test instead or attempt to transform the data (log)

52
Q

You know what one-sample and two-sample t-tests are, but what is a Paired T-test?

A

Paired T-Test –> measures the same individuals in both groups –> e.g. measure people's weight in our class today and again in 10 years' time –> matching each person up and looking at the difference.
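
A minimal R sketch (invented before/after measurements):

before <- c(70, 82, 65, 90, 75)       # invented weights today
after  <- c(72, 85, 64, 95, 78)       # invented weights 10 years later (same individuals)
t.test(before, after, paired = TRUE)  # paired t-test on the within-person differences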