Stats 2 - Stats & Linear Models Flashcards

1
Q

What are the components of a Linear Model?

A
2
Q

What are the types of variables you can have in a Linear Model?

A
3
Q

How many coefficients do continuous terms and categorical factors have in a Linear Model?

A

Continuous terms always have one coefficient (β2)

Categorical Factors have N − 1 coefficients, where N is the number of levels in the category

Why N-1? Why are we missing a level?

The missing level is incorporated into the baseline/reference value known as the Intercept (β1) –> the reference level is chosen alphabetically (R's default)
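
A minimal R sketch of this (the data and variable names are invented for illustration): a factor with 3 levels yields an intercept plus N − 1 = 2 coefficients.

set.seed(1)
treatment <- factor(rep(c("A", "B", "C"), each = 10))  # "A" becomes the alphabetical reference level
response  <- rnorm(30, mean = c(1, 2, 3)[treatment])   # invented response data
model     <- lm(response ~ treatment)
coef(model)  # (Intercept), treatmentB, treatmentC –> N - 1 = 2 factor coefficients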

4
Q

What makes a Linear model Linear?

A

Linear models are just a sum of terms that are linear in the coefficients –> each coefficient enters as a simple multiplier of its term (never squared, exponentiated, etc.), although the explanatory variables themselves may be transformed
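
For example (my own illustration, using the cards' β notation): y = β1 + β2·x + β3·x² is still a linear model, because each β simply multiplies its term, even though the fitted curve bends in x. By contrast, y = β1·e^(β2·x) is not linear, because β2 sits inside the exponential.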

5
Q

Examples of types of Linear models?

A

The response variable is always Continuous

  1. Simple Regression –> Continuous explanatory variable
  2. Multiple Linear Regression –> Continuous/Categorical explanatory variables
  3. ANOVA –> Categorical Explanatory Variable
  4. ANCOVA –> Categorical/Continuous explanatory variables

Note –> MLR and ANCOVA are very similar; they just place emphasis on one type of variable or the other (see the sketch below)
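
In R, all four are fitted with lm(); a hedged sketch with invented data and variable names:

d <- data.frame(growth   = rnorm(40),   # invented example data
                temp     = rnorm(40),
                rainfall = rnorm(40),
                species  = factor(rep(c("sp1", "sp2"), 20)))
lm(growth ~ temp, data = d)             # 1. simple regression (one continuous x)
lm(growth ~ temp + rainfall, data = d)  # 2. multiple linear regression
lm(growth ~ species, data = d)          # 3. ANOVA (categorical x)
lm(growth ~ species + temp, data = d)   # 4. ANCOVA (categorical + continuous)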

6
Q

How can we decide what the best fit for our Linear Model is?

A

The Least Squares Fitting Solution

Given the data collected, we need to figure out the linear model that best represents the data. This is done by…

  • minimizing the sum of the squared vertical distances (the residuals, 'R') between the data points and the line.

This is called the 'least squares solution' –> we square the residuals so that positive and negative deviations do not cancel out.

The plots show how resvar (the sum of squared residuals) changes as the coefficient changes –> the minimum corresponds to the solution.
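
A minimal R sketch of the idea (invented data; the intercept is fixed at 0 for simplicity): compute the residual sum of squares over a grid of candidate slopes and take the minimum.

set.seed(42)
x <- 1:20
y <- 2 * x + rnorm(20)                                 # invented data with true slope 2
slopes <- seq(0, 4, by = 0.01)                         # grid of candidate slopes
rss <- sapply(slopes, function(b) sum((y - b * x)^2))  # sum of squared residuals per slope
slopes[which.min(rss)]                                 # close to 2, and to coef(lm(y ~ x - 1))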

7
Q

How can we denote the linear model with the least squares fitting solution?

A

ŷ (y with a hat) –> the model's predicted values

8
Q

What are the assumptions of a Linear Model?

A
9
Q

Outline the role of each of the diagnostic plots

A

Code

par(mfrow = c(2, 2), mar = c(5, 5, 1.5, 1.5))  # 2 x 2 grid of plots with adjusted margins
plot(model)  # 'model' is your fitted lm object –> produces Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage

10
Q

What are the two main tests performed in order to check whether your model really explains the data? How do you know whether you can rely on it?

A

F-Test –> Tests how much variation is explained

T-Test –> Tests the significance of the estimated coefficients

11
Q

Outline the meaning of the terms TSS, ESS and RSS.

A

TSS –> sum of the squared differences between the observed dependent variable (y) and the mean of y –> tells you the total spread around the mean

ESS –> sum of the squared differences between the predicted y values and the mean of y –> the part of the spread that the linear model accounts for

We hope ESS is close to TSS (i.e. the model explains most of the spread) –> otherwise our model is poor

RSS –> sum of the squared differences between observed y and predicted y (the model) –> basically the sum of the squared residuals

RSS outlines how much variation our model cannot explain/take into account
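
A short R sketch (invented data) computing all three, and confirming the identity on the next card:

set.seed(7)
x <- 1:30
y <- 3 + 0.5 * x + rnorm(30)             # invented data
model <- lm(y ~ x)
tss <- sum((y - mean(y))^2)              # total spread around the mean
ess <- sum((fitted(model) - mean(y))^2)  # spread explained by the model
rss <- sum(resid(model)^2)               # unexplained (residual) spread
all.equal(tss, ess + rss)                # TRUE –> TSS = ESS + RSS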

12
Q

What is the relationship between ESS, RSS and TSS?

A

TSS = ESS + RSS

13
Q

How is the F-Statistic calculated?

A
14
Q

Now that you have calculated the F-Statistic, how do you know whether it is significant?

A
15
Q

How is the T-Statistic/Value Calculated?

A
16
Q

Figure out the combinations of Effect size and precision for each plot.

A

Effect size –> the strength of the relationship (the slope) between the two variables

Precision –> how closely the data points lie along the slope

Dotted lines represent confidence bounds –> a visual representation of reliability –> the bounds curve outwards at the extremes because extreme values can bias the slope more (they have more leverage)

Hence, it is useful to sample at the extremes

17
Q

Now that you have calculated the t-value for your coefficients, how can you test whether they are statistically significant?

A
18
Q

Generally speaking in statistics, what is a T-Test?

A
19
Q

Why are T-Tests not useful for comparing more than 2 levels?

A
20
Q

Formula to calculate the T-value?

A
21
Q

How is the standard error calculated in a One-sample T-test?

A

One-sample –> SE = s / √n –> where s is the sample standard deviation and n is the sample size

22
Q

Break down the following One-Sample T-Test output.

A

One sample T-Test

Used to test whether the mean of a sample is different from a specific value
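
A hedged sketch of producing such an output in R (invented data; mu is the specific value being tested against):

set.seed(3)
x <- rnorm(25, mean = 5.3, sd = 1)  # invented sample
t.test(x, mu = 5)                   # one-sample t-test: is the sample mean different from 5?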

23
Q

What is a two-sample T-Test? How does it compare to a one-sample?

A

Two-sample T-Test is used to compare two means –> we are asking if they are statistically different from each other

Main Difference

  1. In the one-sample T-Test we assume that one mean value is already known (a specific value) –> hence we only have one source of error, in our single sample
  2. In the two-sample T-Test we are using two means estimated from two samples –> both of which will have error
24
Q

How is the standard error calculated for a Two-sample T-Test?

A

Since we are dealing with two sources of error for our mean values we need to change our standard error calculation to reflect this.
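
The card's formula itself is not shown; a standard version (the one used by Welch's two-sample t-test, R's default, offered here as a hedged reconstruction) is:

SE = √(s₁²/n₁ + s₂²/n₂) –> where s₁, s₂ are the two sample standard deviations and n₁, n₂ the two sample sizes.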

25
Q

Assumptions for a One and Two-sampled T-Test?

A

One-sample T-Test

  1. Normally distributed data –> so that the mean is a good reflection of central tendency

Two-sample T-Test

  1. Normally distributed data –> so that the mean is a good reflection of central tendency
  2. The two data sets have similar variance (homogeneity of variance)
26
Q

Break down the following Two-sample T-Test output

A
27
Q

Generally speaking, what is the F-Test?

A

An F-test is fundamentally used to compare two variances –> are they statistically the same or not?

28
Q

What does var.test() on R do?

A

The var.test() function can be used to compare two variances –> e.g. comparing the genome size variation for two species suborders

It asks: is the ratio of the variances = 1 –> i.e. are they the same or not?
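
A minimal sketch (invented vectors):

set.seed(9)
a <- rnorm(30, sd = 1)    # invented sample 1
b <- rnorm(30, sd = 1.5)  # invented sample 2
var.test(a, b)            # H0: ratio of variances = 1 (equal variances)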

29
Q

If we want to perform a T-Test but our data is NOT normally distributed, what non-parametric test can we perform?

A

Wilcoxon Test
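
In R this is wilcox.test(); a minimal sketch with invented data:

x <- c(1.2, 3.4, 2.2, 5.1, 4.0)  # invented samples
y <- c(2.5, 4.4, 6.1, 3.9, 5.8)
wilcox.test(x, y)                # non-parametric alternative to the two-sample t-test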

30
Q

What type of data is suitable for a Linear model - Linear Regression?

A

Linear regression is a class of linear models that is frequently a good choice if your response and explanatory variables are both continuous.

31
Q

What does the Pearson correlation coefficient tell you?

A

It tells you the association between two variables NOT the effect –> it doesn’t assign the explanatory and response variables.

Pearson correlation assumes that your data is normally distributed

Results

Pearson correlation coefficient (r) –> ranges from -1 (perfectly negatively correlated) to +1 (perfectly positively correlated), with 0 being no correlation.

‘r’ will be positive if both ‘x’ and ‘y’ tend to move in the same direction

‘r’ will be negative if ‘x’ and ‘y’ move in opposite directions

32
Q

Difference between cor() and cor.test() on R?

A
  1. cor() –> calculates correlations between every pair of variables at once –> for example, you have a table and you want the correlation between each pair of columns
  2. cor.test() –> Only looks at the correlation between a single pair of variables but also performs a t-test –> to indicate the significance of the correlation –> T-test used to check whether the correlation is statistically different from 0.

NOTE –> use = “pairwise” –> tells R to omit observations with missing data and use complete pairs of observations –> i.e. ignore NAs
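
A hedged sketch (invented data frame with some NAs):

df <- data.frame(a = c(1, 2, 3, 4, NA),  # invented columns with missing values
                 b = c(2, 4, 6, 8, 10),
                 c = c(5, 3, NA, 1, 0))
cor(df, use = "pairwise")  # correlation matrix from complete pairs only
cor.test(df$a, df$b)       # single pair, with a t-test for r != 0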

33
Q

Why is it a good idea to transform (log) your data when performing a correlation test, if it is showing a curved relationship?

A

A problem that may arise when trying to calculate the correlation coefficient –> it assumes a straight-line relationship –> hence, we run into problems when there is a curved relationship. A log transformation can straighten the relationship out so that the assumption holds.

34
Q

Outline the code needed to plot two data sets on the same scatter plot

A
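
The original answer is not shown; a standard R approach (invented data and names) would be:

set.seed(11)
x1 <- rnorm(20); y1 <- x1 + rnorm(20)      # invented data set 1
x2 <- rnorm(20); y2 <- 2 * x2 + rnorm(20)  # invented data set 2
plot(x1, y1, col = "blue", pch = 16,
     xlim = range(c(x1, x2)), ylim = range(c(y1, y2)))  # axes wide enough for both sets
points(x2, y2, col = "red", pch = 17)      # add the second data set to the same plot
legend("topleft", legend = c("set 1", "set 2"),
       col = c("blue", "red"), pch = c(16, 17))
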
35
Q

What code can you use to plot the model diagnostic plots on the same page?

A
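
The original answer is not shown; card 9's code does exactly this, e.g.:

model <- lm(mpg ~ wt, data = mtcars)  # any fitted lm object (mtcars used for illustration)
par(mfrow = c(2, 2))                  # 2 x 2 grid: all four diagnostic plots on one page
plot(model)
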
36
Q

What type of data is suitable when constructing a Linear Model - ANOVA?

A

ANOVA –> suitable when the response (dependent) variable is continuous and your predictor (independent) variable is categorical.

ANOVA –> used to see whether the linear model produced with an explanatory and response variable accounts for the variance in the data better than the null model (i.e. where there is no correlation).

37
Q

Difference between a one-way ANOVA and a Two-Way ANOVA?

A

One-way ANOVA –> the effect of one categorical variable (with different levels) on a response variable

Two-way ANOVA –> the influence of two different categorical variables on a response variable.

When there are only two levels in a category –> you can simply use a t-test to compare the means –> in this case the relationship is F = t²

38
Q

Outline how the different Df values are calculated for TSS, ESS and RSS?

A
39
Q

Relationship between R (the correlation coefficient) and R²?

A

Note that R is the square root of R² (= ESS/TSS), where R is the correlation between the observed values (y) and the predicted values (ŷ).

40
Q

What is the quickest way to plot bar plots showing the mean and the confidence intervals?

A

Install –> install.packages("gplots")

Load –> library(gplots)
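
A hedged sketch, assuming gplots::barplot2 is the intended function (mtcars used as stand-in data):

library(gplots)
means <- tapply(mtcars$mpg, mtcars$cyl, mean)  # group means
ses   <- tapply(mtcars$mpg, mtcars$cyl,
                function(v) sd(v) / sqrt(length(v)))  # standard errors
barplot2(means, plot.ci = TRUE,
         ci.l = means - 1.96 * ses,  # approximate 95% confidence bounds
         ci.u = means + 1.96 * ses)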

41
Q

What is the Tukey Test used for?

A
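
The original answer is not shown; in brief, the Tukey (HSD) test compares all pairs of group means after an ANOVA while correcting for multiple comparisons. A hedged R sketch (mtcars as stand-in data):

model <- aov(mpg ~ factor(cyl), data = mtcars)  # one-way ANOVA
TukeyHSD(model)                                 # pairwise differences between group means
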
42
Q

Breakdown the following Tukey output

A
43
Q

How can you test whether explanatory variables are independent? Can this test be used for all types of data?

A

Chi-squared test –> used on categorical (count) data, so it cannot be used for all types of data.

Null hypothesis –> factors are independent

Chi-squared is significant –> factors are not independent

Chi-squared is not significant –> factors are independent
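
A minimal sketch (invented contingency table of counts):

tab <- matrix(c(20, 15, 10, 25), nrow = 2)  # invented 2 x 2 table of counts
chisq.test(tab)                             # H0: the two factors are independent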

44
Q

What information does the Summary output directly tell you?

A

This output summarizes several statistical tests in one: coefficient estimates, t-values/t-tests, multiple/adjusted R-squared, the F-value/F-test, and the Df, which can be used to calculate the sample size

45
Q

Break down the following Linear regression summary output

A

Note - ~ is used to define the response variable (LEFT) and explanatory variable (RIGHT)

1. Call –> Highlights the Response and Explanatory Variable in question –> We are asking whether our Linear model can explain the relationship between the two

2. Residuals –> Outlines the distribution of the residual values using the min, max, quartiles and Median

3. Coefficients Table –> estimates the coefficients, calculates the t-value for each coefficient and then performs a t-test to check whether they are statistically significant

  • NOTE –> we have two coefficients –> one that describes the Intercept and another that describes the slope of the line (log(covid$Totalcases))
  • From the coefficients we can generate a linear equation
  • The t-value is calculated by dividing the Estimate by the Std. Error
  • The t-test checks whether each coefficient is equal to 0 (H0)

4. Residual Standard Error –> roughly, the typical size of the residuals (the average distance of the data points from the fitted line)

5. Multiple R-Squared –> the proportion of the variance explained by your model –> ESS/TSS (where TSS = ESS + RSS)

  • Different from the F-value, which is the ratio of explained to unexplained variance

6. Adjusted R-squared –> same idea as Multiple R-squared but includes a penalty for the degrees of freedom used –> hence the value is lower

Note - R-squared can be multiplied by 100 to obtain a percentage

7. F-Statistic –> briefly tells you the F-value for your linear model and its level of statistical significance –> Note: running anova() on the model will give you a more comprehensive F-test output
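
A hedged sketch of producing such an output (the covid data frame is invented here to mirror the card's own variable names):

covid <- data.frame(Totalcases  = c(1000, 5000, 20000, 80000, 150000),  # invented stand-in data
                    Totaldeaths = c(20, 90, 500, 1600, 3000))
model <- lm(log(Totaldeaths) ~ log(Totalcases), data = covid)
summary(model)  # Call, Residuals, Coefficients, RSE, R-squared, F-statistic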

46
Q

What does the ANOVA output in R tell you?

A

The direct output tells you how much variation is explained by each term.

But…

  • You can also calculate the total variation –> overall F-value of the model
  • You can also calculate R2 of the model
47
Q

Break down the following ANOVA output

Logtotaldeaths ~ Logtotalcases

A

Remember, in theory we want our model to maximize the F-value ratio –> proportionately more variation is explained relative to variation that is not explained.

1. Leftmost column –> each contributing term in the model

2. Degrees of freedom (Df)

a) Term Df –> calculated as n − 1, where n is the number of estimated parameters –> in simple regression we have two estimated parameters (slope and intercept), so the term Df = 1
b) Residual Df –> The residual degrees of freedom –> the number of data points that are allowed to vary from the Linear model

3. Sum sq –> Sum of squares

a) Term Sum sq –> thought of as ESS
b) Residual Sum sq –> thought of as RSS

4. Mean Sq –> each Sum Sq divided by its respective degrees of freedom –> yielding EMsq or RMsq

  • Note - Mean Sq = variance

5. F-Value –> calculated by dividing EMsq by RMsq

6. F-Test significance –> Testing our value against the Null hypothesis –> Null hypothesis (H0) = ratio of the variances = 1 –> Are EMsq and RMsq the same?
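
A hedged sketch of generating such a table (invented stand-in data; the variable names follow the card):

Logtotalcases  <- log(c(1000, 5000, 20000, 80000, 150000))  # invented data
Logtotaldeaths <- log(c(20, 90, 500, 1600, 3000))
anova(lm(Logtotaldeaths ~ Logtotalcases))  # Df, Sum Sq, Mean Sq, F value, Pr(>F)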

48
Q

When approaching a statistical test question, what is one of the first things you need to consider?

A

What type of data you are working with!

As this will dictate which statistical tests/procedures are appropriate considering their assumptions

E.g. proportions –> bounded between 0 and 1 –> meaning the data is not truly continuous –> so the mean is not a good measure of central tendency.

Thus…

Any statistical test involving the mean would not be appropriate! –> Variance, Standard deviation, sum of sq.

Whereas…

Range, median and IQR are appropriate

49
Q

Difference between a type 1 and type 2 error?

A

Type 1 Error –> False Positive –> meaning that we reject a true null hypothesis –> we think there is a correlation but, in reality, there is not!

Type 2 Error –> False Negative –> meaning that we fail to reject a false null hypothesis –> we basically accept a null hypothesis thinking there is no correlation when there is!

We spot Type 2 errors using our own intuition/previous literature on what should be happening –> normally we would expect a correlation, but this time we do not find one –> possible False Negative

50
Q

What is the non-parametric alternative to the Pearson correlation?

A

Spearman Correlation

Pearson’s is used for normally distributed data whereas Spearman is used for non-normally distributed data.
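
In R, a minimal sketch (invented data):

x <- c(1, 2, 3, 4, 5)  # invented data
y <- c(2, 1, 4, 3, 5)
cor.test(x, y, method = "spearman")  # Spearman rank correlation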

51
Q

How would you approach this question?

A

Remember, an F-test is fundamentally used to compare two variances –> are they statistically the same or not?

Context –> two-sample T-Test –> comparing the two mean numbers of hospital beds per 1000 population between North America and South America

BUT! For a two-sample T-Test to be reliable, there has to be homogeneity of variance –> so it can be followed up with an F-Test!

Output Breakdown

  1. F-Value –> Ratio of variance between NA and SA
  2. Df –> calculated by n-1 –> where n is the sample size –> work backwards to figure out the number of data points in each sample
  3. P-value –> indicates significance –> in this case it is above 0.05 –> no statistical significance –> we fail to reject the null hypothesis, meaning the variances are the SAME

This means we can interpret the results as they are!

If there was significance (different variances), we could perform a Wilcoxon test instead or attempt to transform the data (log)

52
Q

You know what one-sample and two-sample t-tests are, but what is a Paired T-test?

A

Paired T-Test –> measures the same individuals in both groups –> e.g. measure people's weight in our class today and again in 10 years' time –> matching each person up and looking at the difference.
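
A minimal R sketch (invented before/after measurements):

before <- c(70, 82, 65, 90, 75)       # invented weights today
after  <- c(72, 85, 64, 95, 78)       # invented weights 10 years later (same individuals)
t.test(before, after, paired = TRUE)  # paired t-test on the within-person differences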