Data Distributions and Introduction to Inferential Statistics Flashcards

1
Q

What is a frequency distribution?

A

A description of how often each value (or range of values) of a variable occurs, often summarised by a theoretical curve that best fits the data histogram
- numerical discrete variables have frequency histograms, while numerical continuous variables have density curves

2
Q

Why are frequency distributions important?

A
  • help us model our data & determine which descriptive statistics would be most useful
  • they underpin parametric tests, which assume the data follow a specific distribution
3
Q

What is a parametric test?

A

A statistical method that assumes that the data come from a specific theoretical distribution (e.g. a normal distribution) and makes inferences based on that assumption.
- examples of parametric tests include t-tests and ANOVA

4
Q

When should parametric tests be used?

A

If the dependent variable has a normal frequency distribution

5
Q

What is the difference between a frequency histogram and a density curve?

A

Frequency histogram:
- used for numerical discrete variables
- displays frequency or count of observations for each discrete value or range of values

Density curve:
- used for numerical continuous variables
- displays the probability density function; the area under the curve over a range of values gives the probability of observing a value within that range

6
Q

What are some common frequency distributions?

A
  • Binomial distribution
  • Poisson distribution
  • Normal distribution
7
Q

What is a binomial distribution?

A

Describes the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes (e.g., success or failure, heads or tails, yes or no)

8
Q

What are the characteristics of a binomial distribution?

A
  • a fixed number of trials
  • only two possible outcomes per trial
  • independence of the trials
  • a constant probability of success on each trial
  • a discrete number of successes
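
As a rough illustration of these characteristics, the sketch below computes binomial probabilities in Python using only the standard library. The function name `binomial_pmf` and the coin-flip example are assumptions made for illustration; they are not part of the course material.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each with the same success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: 10 flips of a fair coin (fixed n, constant p)
n, p = 10, 0.5
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(round(binomial_pmf(5, n, p), 4))  # probability of exactly 5 heads: 0.2461
print(round(sum(probs), 4))             # probabilities over all k sum to 1.0
```

The number of successes can only take the discrete values 0 to n, which is why the probabilities are listed per value of k rather than drawn as a continuous curve.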
9
Q

What is a Poisson distribution?

A

It gives the probability of an event happening a certain number of times (k) within a given interval of time or space.

10
Q

What are the characteristics of a Poisson distribution?

A
  • a fixed interval of time or space
  • rare events occurring with a constant average rate
  • independence of the events
  • a discrete number of occurrences
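
A minimal sketch of the Poisson probability of observing k events, assuming a known average rate per interval. The function name `poisson_pmf` and the rate of 2 events per interval are hypothetical choices for illustration.

```python
from math import exp, factorial

def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events in one interval,
    given a constant average rate of events per interval."""
    return rate**k * exp(-rate) / factorial(k)

# Hypothetical example: on average 2 events per interval
rate = 2.0
for k in range(5):
    print(k, round(poisson_pmf(k, rate), 4))
```

With a rate of 2, the probability of seeing no events at all is exp(-2), roughly 0.135, and the probabilities across all possible counts add up to 1.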
11
Q

What is a normal distribution?

A

A symmetrical probability distribution that is characterised by its mean and standard deviation
- often referred to as a bell curve because of its shape

12
Q

What are the properties of a normal distribution?

A
  • symmetrical
  • mean, median and mode are equal
  • it is described by its mean & standard deviation
  • about 68% of the data fall within 1 standard deviation of the mean
  • almost all the data (about 99.7%) fall within 3 standard deviations of the mean
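
The proportions of data within a given number of standard deviations can be checked with a short sketch using `statistics.NormalDist` from the Python standard library. The helper name `share_within` is an assumption for illustration.

```python
from statistics import NormalDist

def share_within(k_sd: float, mean: float = 0.0, sd: float = 1.0) -> float:
    """Proportion of a normal distribution lying within
    k_sd standard deviations either side of the mean."""
    dist = NormalDist(mean, sd)
    return dist.cdf(mean + k_sd * sd) - dist.cdf(mean - k_sd * sd)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))  # roughly 0.68, 0.95 and 0.997
```

The result does not depend on the particular mean and standard deviation chosen, which is why the rule applies to any normal distribution.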
13
Q

What is the standard deviation?

A

A measure of how dispersed the data is in relation to the mean
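
As a worked sketch, the sample standard deviation can be computed step by step and compared with `statistics.stdev`. The data values are made up for illustration, and the n - 1 divisor (sample rather than population standard deviation) is an assumption about which version is intended.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical measurements
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

m = mean(data)                                      # centre of the data
squared_deviations = [(x - m) ** 2 for x in data]   # distance of each point from the mean
sample_sd = sqrt(sum(squared_deviations) / (len(data) - 1))

print(m)                    # 5.0
print(round(sample_sd, 3))  # 2.138
print(round(stdev(data), 3))  # the library function gives the same value
```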

14
Q

What is the null hypothesis (H0) in statistical hypothesis testing?

A

The null hypothesis (H0) is the hypothesis that there is no difference or no association between our variables.

15
Q

What is the alternative hypothesis (H1) in statistical hypothesis testing?

A

The alternative hypothesis (H1) is the hypothesis that there is a statistically significant difference or association between our variables.

16
Q

What is the goal of statistical hypothesis testing?

A

To determine if we can reject the null hypothesis and accept the alternative hypothesis

17
Q

What is the significance level in statistical hypothesis testing?

A

The threshold for rejecting the null hypothesis.
- represents the probability of making a type I error (rejecting the null hypothesis when it is actually true)
- commonly used significance level is 0.05
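
The link between the significance level and the Type I error rate can be illustrated with a small simulation. This is a sketch under stated assumptions: it uses a one-sample z-test with known standard deviation (a test not covered in these cards, chosen only because it needs no extra libraries), and the names `z_test_p_value` and `type_i_error_rate` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for a one-sample z-test with known sigma."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

def type_i_error_rate(alpha=0.05, n=20, n_sim=2000, seed=1):
    """Share of simulated experiments that reject H0 although H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        # Data are drawn with mean 0, so the null hypothesis (mean = 0) is true
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample) < alpha:
            rejections += 1
    return rejections / n_sim

rate = type_i_error_rate()
print(rate)  # expected to be close to the chosen significance level of 0.05
```

Because the null hypothesis is true in every simulated experiment, each rejection is a Type I error, and the long-run share of rejections should sit near the chosen significance level.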

18
Q

What level of confidence do we like to have before rejecting the null hypothesis?

A

By convention, we like to be at least 95% confident that the null hypothesis is wrong before we reject it

19
Q

What do most hypotheses that we test use?

A

They use data that are characterised by variation and uncertainty

20
Q

What are the two types of errors we risk when evaluating whether we can reject the null hypothesis or not?

A

Type I and Type II errors

21
Q

What is a Type I Error?

A

A Type I Error is when we reject a null hypothesis even though it is actually true.

22
Q

What is a Type II Error?

A

A Type II Error is when we fail to reject a null hypothesis even though it is actually false.

23
Q

Why do we set very stringent confidence levels in rejecting the null hypothesis?

A

Because a Type I Error is generally considered more serious than a Type II Error

24
Q

Summarise the table of errors

A

(slide 13)
                     H0 True            H0 False
Reject H0            Type I Error       Correct Rejection
Fail to reject H0    Correct Decision   Type II Error

25
Q

Why is the significance level in hypothesis testing important?

A

It is important because it helps us control the probability of making a Type I error, which is rejecting the null hypothesis when it is actually true.

26
Q

What happens if the p-value is greater than 0.05 in a hypothesis test?

A

If p > 0.05, the test is considered non-significant and we cannot reject the null hypothesis

27
Q

What happens if the p-value is less than 0.05 in a hypothesis test?

A

If p < 0.05, the test is considered significant, and we can reject the null hypothesis and report the trends in the data

28
Q

What is α in hypothesis testing, and what is its significance level?

A

α is the chosen Type I error rate, i.e. the significance level (commonly 0.05)

29
Q

What is the critical value in hypothesis testing, and how does it relate to the p-value?

A

For a test to be significant, the calculated test statistic must be more extreme than the critical value, which is equivalent to the p-value being smaller than the significance level. In practice, we use the p-values that tests output to determine significance.
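
The sketch below shows, for a two-sided z statistic, that comparing the statistic with the critical value and comparing the p-value with the significance level lead to the same decision. The choice of a z statistic and the function names are assumptions for illustration; other tests use their own distributions (t, F) but follow the same logic.

```python
from statistics import NormalDist

alpha = 0.05
# Critical value for a two-sided test: the point cutting off alpha/2 in each tail
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

def two_sided_p_value(z: float) -> float:
    """Probability of a statistic at least as extreme as z if H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def significant_by_critical_value(z: float) -> bool:
    return abs(z) > z_crit

def significant_by_p_value(z: float) -> bool:
    return two_sided_p_value(z) < alpha

print(round(z_crit, 2))  # 1.96
for z in (0.5, 1.9, 2.0, 3.1):
    print(z, significant_by_critical_value(z), significant_by_p_value(z))
```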

30
Q

How can we test for normality?

A
  • graphically using histograms and QQ plots
  • statistically using tests such as Shapiro-Wilk
31
Q

What are some ways to check for normality in data?

A
  • checking the histogram to see if it looks normal (bell-shaped)
  • checking descriptive statistics (mean, median & mode should be approximately equal)
  • checking if approximately 68% of data falls within +/- one standard deviation of the mean
  • producing a QQ plot or conducting a Shapiro-Wilk test for normality if the sample size is greater than 30
32
Q

How can we describe deviations from normality?

A

Using two measures
- kurtosis
- skewness

33
Q

What is kurtosis?

A

Peakedness or flatness of a distribution, measured relative to a normal distribution
- positive kurtosis = a more peaked distribution
- negative kurtosis = a flatter distribution

34
Q

What is skewness?

A

Measure of asymmetry of a probability distribution.
- positive skewness = a distribution with a longer tail on the right side
- negative skewness = a distribution with a longer tail on the left side
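
Both measures of deviation from normality can be computed from the central moments of the data. The sketch below uses the simple moment-based formulas without small-sample corrections, which is an assumption; statistical software may apply corrections and report slightly different values. The function names and data are hypothetical.

```python
from statistics import mean

def central_moment(data, order: int) -> float:
    """Average of (x - mean) raised to the given power."""
    m = mean(data)
    return sum((x - m) ** order for x in data) / len(data)

def skewness(data) -> float:
    """Positive = longer right tail, negative = longer left tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data) -> float:
    """Kurtosis minus 3, so that a normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric))                   # 0.0 for perfectly symmetric data
print(skewness(right_tailed) > 0)            # True: longer tail on the right
print(round(excess_kurtosis(symmetric), 2))  # -1.3: flatter than a normal distribution
```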

35
Q

What is a QQ (quantile-quantile) plot?

A

A QQ plot is a graphical tool that plots the quantiles of the dataset against the quantiles of a theoretical distribution, typically the normal distribution. If the points fall close to a straight line, the data are consistent with that distribution.
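
A minimal sketch of how the points of a normal QQ plot can be computed, leaving out the plotting itself. The plotting positions (i - 0.5) / n are one common convention and are an assumption here; the function name and data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal QQ plot."""
    ordered = sorted(data)
    n = len(ordered)
    # Reference normal distribution with the same mean and SD as the data
    reference = NormalDist(mean(ordered), stdev(ordered))
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical measurements
sample = [4.1, 5.0, 5.2, 4.8, 5.5, 4.6, 5.1]
for expected, observed in normal_qq_points(sample):
    print(round(expected, 2), observed)
```

If the data are close to normal, the observed values stay close to the expected quantiles, so the pairs fall near the line y = x when plotted.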

36
Q

What is the Shapiro-Wilk test?

A

Used to determine whether a set of data comes from a normal distribution or not

37
Q

What are the null hypothesis and alternative hypothesis in the Shapiro-Wilk test?

A
  • null hypothesis = observed data comes from a normal distribution
  • alternative hypothesis = observed data does not come from a normal distribution
38
Q

Can the Shapiro-Wilk test be performed on multiple dependent variables at once?

A

No, the Shapiro-Wilk test needs to be performed on one numeric dependent variable at a time
- if there are different levels of a categorical variable, then the test needs to be performed for each level of the categorical variable separately

39
Q

What is the output of the Shapiro-Wilk test in R?

A

Includes the
- test statistic (W)
- p-value
Interpretation: p-value > 0.05 (fail to reject the null hypothesis, so there is no evidence against normality), p-value < 0.05 (reject the null hypothesis, so the data are not normally distributed).
Example:

Shapiro-Wilk normality test

data:  mydata
W = 0.935, p-value = 0.002345

40
Q

What should be done if the Shapiro-Wilk test indicates that the data is not normally distributed?

A

Try transforming the data to achieve normality. If this is not possible, non-parametric tests can be used instead of parametric tests that require normality.

41
Q

What is the purpose of t-tests?

A

T-tests are used when the data is normally distributed and we want to test whether two means are significantly different
- e.g. control vs treatment

42
Q

What is the null hypothesis in t-tests?

A

The two samples are drawn from the same statistical population and will have the same mean.
- no significant difference between means

43
Q

What is the alternative hypothesis in t-tests?

A

The two samples are drawn from different statistical populations and have different means.

44
Q

In Excel, which t-Test should be used?

A

T-test: Paired Two Sample for Means
T-test: Unpaired Two Sample for Means

45
Q

What is the T-test: Paired Two Sample for Means used for?

A

Used for repeat measures on the same individuals
- e.g. before and after a treatment
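
A paired t-test works on the differences between the two measurements taken on each individual. The sketch below computes the t statistic and degrees of freedom only; obtaining the p-value requires the t distribution, which Excel or R provide. The function name and the before/after values are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """Return (t, df) for a paired t-test on two equal-length samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # One difference per individual
    differences = [a - b for a, b in zip(after, before)]
    n = len(differences)
    standard_error = stdev(differences) / sqrt(n)
    return mean(differences) / standard_error, n - 1

# Hypothetical repeat measures on the same five individuals
before = [12.0, 15.0, 11.0, 14.0, 13.0]
after = [14.0, 17.0, 12.0, 17.0, 15.0]

t, df = paired_t_statistic(before, after)
print(round(t, 3), df)  # 6.325 4
```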

46
Q

What is the T-test: Unpaired Two Sample for Means used for?

A

Comparing the means of two independent groups.
- e.g. a control group vs a treatment group

47
Q

What values are important from the output table in Excel of a T-test: Paired Two Sample for Means?

A
  • mean
  • df
  • t stat
  • P(T<=t) two-tail
48
Q

How is the result of a ‘T-test: Paired Two Sample for Means’ stated in a report?

A

“There was a significant difference (t = __, df = __, p = __); … ”

49
Q

What test is performed before the t-tests?

A

An F-Test to test for equality of variances

50
Q

What values are important in the output table of the F-test result?

A
  • df
  • F
  • P(F<=f) one-tail
51
Q

When are the variances significantly different in an F-test?

A

If the calculated F-value > the critical F-value (for p=0.05), then the variances are significantly different.

null hypothesis = variances of the two populations are equal
alternative hypothesis = variances of the two populations are not equal
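
The F statistic is the ratio of the two sample variances. The sketch below places the larger variance in the numerator, which is one common convention and an assumption here; the critical F-value or p-value still has to come from a table, Excel or R. The function name and group values are hypothetical.

```python
from statistics import variance

def f_statistic(sample_a, sample_b):
    """Return (F, df_numerator, df_denominator) with the larger variance on top."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

# Hypothetical control and treatment measurements
control = [10.0, 12.0, 11.0, 13.0, 9.0]
treatment = [8.0, 15.0, 11.0, 17.0, 9.0]

f_value, df_num, df_den = f_statistic(control, treatment)
print(f_value, df_num, df_den)  # 6.0 4 4
```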

52
Q

What is stated in regards to the F-test result?

A

“There was no significant difference between variances (F = __, p=___), therefore a t-test with equal variances was performed.”

53
Q

What is the Mann-Whitney U test?

A

The Mann-Whitney U test is a non-parametric statistical test that is the equivalent of an unpaired t-test.

54
Q

How is the Mann-Whitney U test calculated?

A

Raw data is first converted to ranks before calculating the test statistic.
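
The ranking step can be sketched as follows: both samples are pooled and ranked together (ties share the average rank), and U is derived from the rank sum of one sample. Function names and data are assumptions for illustration, and the p-value calculation is left to statistical software.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) computed from the pooled ranks."""
    ranks = average_ranks(list(sample_a) + list(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return u_a, n_a * n_b - u_a

# Hypothetical independent groups
group_a = [1.1, 2.3, 3.0]
group_b = [2.9, 4.2, 5.5, 6.1]
print(mann_whitney_u(group_a, group_b))  # (1.0, 11.0)
```

Because only the ranks are used, the result does not depend on the data being normally distributed.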

55
Q

What is wilcox.test in R?

A

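wilcox.test is the R function that performs the Wilcoxon rank-sum test (also known as the Mann-Whitney U test); with paired = TRUE it performs the Wilcoxon signed-rank test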

56
Q

How is the wilcox.test function used in R?

A
  • takes two sample vectors as input
  • returns the test statistic, p-value, and alternative hypothesis
57
Q

What is the process for choosing a test to determine if the means for two groups are different?

A

1: Identify whether you want to check if the means for a numerical variable are different between two groups of a categorical variable.

2: If the categorical variable is paired, proceed to Step 3a. Otherwise, proceed to Step 3b.

3a: Check if the numerical variable is normally distributed within both groups of the categorical variable. If it is, perform a paired t-test. Otherwise, perform a Wilcoxon Signed-rank test.

3b: Check if the numerical variable is normally distributed within both groups of the categorical variable. If it is, proceed to Step 4. Otherwise, perform a Mann-Whitney U test.

4: Do an F-test. If variances are equal, perform an unpaired t-test assuming equal variances. Otherwise, perform an unpaired t-test assuming unequal variances. Finally, make conclusions and STOP.
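
The steps above can be sketched as a small decision function. The function name, argument names and returned labels are hypothetical; the normality and equal-variance answers are assumed to come from tests already run elsewhere (e.g. Shapiro-Wilk and an F-test).

```python
def choose_two_group_test(paired, normal_in_both_groups, equal_variances=None):
    """Return the name of the test suggested by the decision steps."""
    if paired:
        # Step 3a: paired design
        if normal_in_both_groups:
            return "paired t-test"
        return "Wilcoxon signed-rank test"
    # Step 3b: independent groups
    if not normal_in_both_groups:
        return "Mann-Whitney U test"
    # Step 4: the F-test result decides which unpaired t-test to use
    if equal_variances is None:
        raise ValueError("run an F-test first to decide on equal variances")
    if equal_variances:
        return "unpaired t-test assuming equal variances"
    return "unpaired t-test assuming unequal variances"

print(choose_two_group_test(paired=True, normal_in_both_groups=False))
print(choose_two_group_test(paired=False, normal_in_both_groups=True,
                            equal_variances=False))
```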

58
Q

What should be included in the “Methods” section in a scientific report?

A
  • the tests used for what purposes
  • the software used to implement those tests
  • any citation required for the software used (e.g., R and RStudio, not Excel)
59
Q

What should be included in the “Results” section in a scientific report?

A
  • describe the outcome of each test result
  • report test statistics (e.g. t or F or Wilks’ lambda, a measure of effect size)
  • df (an indicator of sample size)
  • p-value and your decision (can you reject the null hypothesis or not)
  • a statement of the biological meaning of your result