STAT Notes Flashcards

Question 1

Q

Define descriptive statistics

Answer

A

Methods used to summarize or describe our observations

Question 2

Q

Describe inferential statistics

Answer

A

Using observations as a basis for making estimates or predictions

Question 3

Q

What two methods can be used to ensure random sampling is truly random?

Answer

A

Mechanical
Blind

Question 4

Q

Define mechanical sampling

Answer

A

Assigning every individual in the population a number and randomly generating numbers
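
As a rough illustration of mechanical sampling, the sketch below numbers a population and lets a random number generator pick the sample. The population size, sample size, and seed are assumptions made up for the example.

```python
import random

population_size = 100  # hypothetical population, numbered 1..100
sample_size = 10       # hypothetical number of individuals to select

# A fixed seed is used only so the sketch gives the same draw every run.
rng = random.Random(2024)

# Draw distinct ID numbers at random, without replacement.
selected_ids = rng.sample(range(1, population_size + 1), sample_size)
print(sorted(selected_ids))
```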

Question 5

Q

Define stratified random sampling

Answer

A

Divides the population into groups (strata) by a characteristic and samples randomly from each, so the sample matches the proportion of that characteristic in the population

Question 6

Q

Define dispersion of data

Answer

A

How far the data lie from a given average (how spread out the values are)

Question 7

Q

How is sample variance calculated?

Answer

A

Σ(difference between each value (xi) and the mean (x̄))^2 ÷ (n-1), where n is the number of observations
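
A minimal sketch of the sample variance calculation, dividing the sum of squared deviations by n-1. The data values and the function name `sample_variance` are made up for illustration. The result is checked against `statistics.variance` from the Python standard library.

```python
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

observations = [2.0, 4.0, 4.0, 6.0, 9.0]  # hypothetical data
variance = sample_variance(observations)

# The standard library uses the same (n - 1) denominator.
print(variance, statistics.variance(observations))
```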

Question 8

Q

How is standard deviation calculated?

Question 9

Q

How is standard error calculated?

Answer

A

sx ÷ √n where n is the number of observations and sx is standard deviation of a sample
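
The standard error formula can be sketched as follows. The sample values are hypothetical and the variable names (`s_x`, `n`) simply mirror the symbols on the card.

```python
import math
import statistics

observations = [2.0, 4.0, 4.0, 6.0, 9.0]  # hypothetical sample

s_x = statistics.stdev(observations)  # sample standard deviation
n = len(observations)                 # number of observations

standard_error = s_x / math.sqrt(n)
print(standard_error)
```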

Question 10

Q

Define confidence interval

Answer

A

A range around the sample mean that is expected to contain the population mean with a specified level of certainty (e.g. 95%), assuming a normal distribution

Question 11

Q

What proportion of the population stands within one standard error?

Question 12

Q

What proportion of the population stands within two standard errors?

Question 13

Q

What proportion of the population stands within three standard errors?

Question 14

Q

What function shows perfect normal distribution?

Answer

A

Gaussian function

Question 15

Q

Define nominal data

Answer

A

Classifies by names

Question 16

Q

Define ordinal data

Answer

A

Classified in an order (by categories)

Question 17

Q

What are the two types of variables?

Answer

A

Categorical
Numeric

Question 18

Q

What are the two subcategories of categorical data?

Answer

A

Ordinal
Nominal

Question 19

Q

How is categorical data referred to in R?

Question 20

Q

What are the two subcategories of numeric data?

Answer

A

Discrete
Continuous

Question 21

Q

How is discrete data referred to in R?

Question 22

Q

How is continuous data referred to in R?

Question 23

Q

Define skewed distribution

Answer

A

A measure of asymmetry

Question 24

Q

Define bimodal distribution

Answer

A

There are two modes (can be symmetrical or asymmetrical)

Question 25

Q

Define a bin

Answer

A

An area in which data is collected

Question 26

Q

Define central tendency

Answer

A

Central values

Question 27

Q

Define probability

Answer

A

Proportion of times a particular outcome will occur from a large sample of trials or the likelihood of a particular outcome of an event

Question 28

Q

What does a P=0 (probability=0) suggest?

Answer

A

Impossible

Question 29

Q

What does a P=1 (probability=1) suggest?

Answer

A

Certainty

Question 30

Q

What does it mean if trials are independent?

Answer

A

The actions of one have no impact on the results of the next trial

Question 31

Q

What is the probability of a OR b where they are both mutually exclusive?

Answer

A

P(a)+P(b)

Question 32

Q

What is the sum of the probabilities of mutually exclusive outcomes?

Question 33

Q

When we use OR when describing mutually exclusive probabilities how do we combine these values?

Question 34

Q

What is a probability distribution?

Answer

A

Graphical distribution of theoretical relative probabilities
y=probability, x=potential outcomes

Question 35

Q

What is true about the area of any sections of a probability distribution graph?

Answer

A

Equivalent to the relative probability

Question 36

Q

How can we draw probabilities with multiple trials but limited outcomes?

Answer

A

Table
Probability tree

Question 37

Q

How do we combine independent events using AND?

Answer

A

Multiply probabilities together
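
A small worked example of the two combination rules, using a fair six-sided die as an assumed scenario: probabilities are added for OR when the outcomes are mutually exclusive, and multiplied for AND when the trials are independent.

```python
from fractions import Fraction

p_one = Fraction(1, 6)  # P(rolling a 1) on a fair die
p_six = Fraction(1, 6)  # P(rolling a 6) on a fair die

# OR: a single roll cannot be both 1 and 6 (mutually exclusive), so add.
p_one_or_six = p_one + p_six

# AND: two separate rolls do not affect each other (independent), so multiply.
p_two_sixes = p_six * p_six

print(p_one_or_six, p_two_sixes)
```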

Question 38

Q

Define probability distribution

Answer

A

Theoretical probability of each outcome

Question 39

Q

Define frequency distribution

Answer

A

Observed frequency of each outcome

Question 40

Q

After more trials what becomes true about the frequency distribution and probability distribution?

Answer

A

Frequency distribution approaches probability distribution
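
One way to see this is a quick simulation. The sketch below assumes a fair six-sided die (theoretical probability 1/6 per face) and compares the observed frequencies after a few rolls and after many rolls. The seed and roll counts are arbitrary choices.

```python
import random
from collections import Counter

def frequency_distribution(n_rolls, rng):
    """Observed relative frequency of each face after n_rolls rolls."""
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}

def largest_gap(frequencies):
    """Largest difference between an observed frequency and 1/6."""
    return max(abs(f - 1 / 6) for f in frequencies.values())

rng = random.Random(42)  # fixed seed so the run is repeatable
few = frequency_distribution(60, rng)
many = frequency_distribution(100_000, rng)

# With many rolls the gap to the theoretical distribution is typically tiny.
print(largest_gap(few), largest_gap(many))
```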

Question 41

Q

When can we use binomial statistics?

Answer

A

Can be used when there are two groups (such as A and B or pass and fail)
NOTE: we can create these groups if we define some outcomes as “success” and the others as “failure” and classify other outcomes beneath these banners

Question 42

Q

Give examples of the types of questions the binomial distribution may be used for

Answer

A

Predict the probability of success in a single trial
Predict the proportion of successes in n trials

Question 43

Q

What are requirements for binomial statistics?

Answer

A

2 outcomes (P(success)=p and P(failure)=q) and p+q=1
Each trial is independent with equal p
Fixed number of trials
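
Under these requirements, the probability of exactly k successes in n trials is C(n, k) × p^k × q^(n-k). A minimal sketch, assuming a fair coin (p = 0.5) and 10 flips purely as an example:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    q = 1 - p  # probability of failure, so that p + q = 1
    return comb(n, k) * p**k * q**(n - k)

n_trials = 10   # fixed number of trials (hypothetical)
p_success = 0.5 # same p on every trial (hypothetical)

probabilities = [binomial_pmf(k, n_trials, p_success) for k in range(n_trials + 1)]
print(binomial_pmf(5, n_trials, p_success))  # chance of exactly 5 heads
print(sum(probabilities))                    # all outcomes together sum to 1
```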

Question 44

Q

As number of trials increases what becomes true of discrete data?

Answer

A

Begins to resemble continuous data

Question 45

Q

How can we approximate binomial distribution?

Answer

A

Normal distribution (when the number of trials is large)

Question 46

Q

How can we find the probability up to any point (normal distribution)?

Answer

A

Area under the graph up until that point
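
The area up to a point is the cumulative distribution function (CDF). A short sketch using `statistics.NormalDist` from the Python standard library, assuming a standard normal distribution (mean 0, standard deviation 1) and z = 1.96 as an example point:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Area under the curve to the left of z = 1.96.
p_up_to_point = standard_normal.cdf(1.96)

# Half of the area lies below the mean of a symmetric distribution.
p_up_to_mean = standard_normal.cdf(0.0)

print(p_up_to_point, p_up_to_mean)
```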

Question 47

Q

Rules for hypothesis testing

Answer

A

Understand the certainty of a hypothesis test
Don’t base scientific decisions on hypothesis tests alone
Consider the wider picture and plausibility of results

Question 48

Q

Which letter denotes significance level?

Question 49

Q

Which two hypotheses are needed for a hypothesis test?

Answer

A

H0: null hypothesis (no change)
HA: alternative hypothesis (covers all other possibilities)
These hypotheses must be mutually exclusive

Question 50

Q

What do we assume about H0 in a hypothesis test?

Question 51

Q

What is referred to as the critical region?

Answer

A

The area of the distribution beyond the critical value, where H0 is rejected (its total area equals alpha)

Question 52

Q

When is the null hypothesis rejected in hypothesis testing?

Question 53

Q

What is a tail?

Answer

A

Area at the end of the distribution

Question 54

Q

How do we test both tails?

Answer

A

Two-tailed test

Question 55

Q

How many critical regions are present in a two tailed test?

Question 56

Q

If alpha=0.05 and a two-tailed test is performed, what % of values lie outside the critical region?

Question 57

Q

What is a p value?

Answer

A

The probability, assuming the null hypothesis is true, of getting a result at least as extreme as the one observed
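
As a hedged illustration, the sketch below computes a one-tailed p value for a made-up experiment: 9 heads in 10 coin flips, with H0 stating that the coin is fair (p = 0.5). The scenario, the alpha of 0.05, and the function name are assumptions for the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips = 10         # hypothetical number of trials
observed_heads = 9   # hypothetical observed result
p_under_h0 = 0.5     # H0: the coin is fair

# One-tailed p value: probability of 9 or more heads if H0 is true.
p_value = sum(
    binomial_pmf(k, n_flips, p_under_h0)
    for k in range(observed_heads, n_flips + 1)
)

alpha = 0.05
reject_h0 = p_value < alpha
print(p_value, reject_h0)
```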

Question 58

Q

What is a contingency table?

Answer

A

One that shows all possible HA and H0 outcomes

Question 59

Q

If H0 is true and we reject it, what is true?

Answer

A

False Positive
Type I error
We do not know what is true

Question 60

Q

If H0 is true and we fail to reject H0 what is true?

Answer

A

There is a true negative
H0 is true

Question 61

Q

If HA is true and we fail to reject H0 what is true?

Answer

A

False negative
Type II error
HA was true

Question 62

Q

If HA is true and we reject H0 what is true?

Answer

A

True positive
H0 is untrue, this does not confirm HA

Question 63

Q

If H0 is true, what are the possible outcomes/errors?

Answer

A

True negative (H0 is true and we fail to reject H0)
Type I error (H0 is true and we reject H0)

Question 64

Q

At an alpha value of 0.05, how often would we expect a Type I error, if H0 is true?

Answer

A

5% Type I error
(95% true negative)

Answer 51

A

How powerful a test is at detecting true positives when there really is a difference to detect

Answer 52

A

When we are outside the critical value (in the direction of the H0)
This is a type II error and is shown where the HA graph overlaps with H0

Answer 53

A

The area of overlap between the H0 and HA graphs (where HA is true)

Answer 54

A

Power=1-beta

Answer 55

A

2.1% of the time

Answer 56

A

Smaller
It is more difficult to identify a true error

Answer 57

A

There will be a lower rate of false negatives (type II error)

Answer 58

A

Increase effect size:
Separate the curves to be skinnier
Increase distance between peaks

Answer 59

A

Power increases (less type II error)

Answer 60

A

Increased trials
(decreases curve dispersion)

Answer 61

A

There must be two hypotheses:
H0 - null hypothesis (no change/ effect)
HA - alternative hypothesis (mutually exclusive and covers all other options (different for one and two-tailed tests))

Answer 62

A

It is only the probability of a false positive if the alternative hypothesis is true. We cannot know whether the alternative hypothesis is true; we can only speculate based on evidence

Answer 63

A

Proportion of true positives for a particular HA

Answer 64

A

Comparing and testing several conditions or treatments

Answer 65

A

When comparing two samples with each other (i.e.: control and drug)

Answer 66

A

When comparing a sample to a mean

Answer 67

A

When samples are closely related to one another (such as before and after a treatment)

Answer 68

A

Outcome variable is continuous dependent variable and experimental variable is bivariate independent variable
Normal distribution
Equal Variance

Answer 69

A

Contains two groups

Answer 70

A

A normal quantile-quantile plot compares quantiles of your data to theoretical quantiles for a normal distribution (if these match closely the data is normally distributed)

Answer 71

A

There is an increase in the probability of false positives
(FWER (family-wise error rate))

Answer 72

A

Family-wise error rate is the probability of getting at least one false positive across a set of tests if the null hypotheses are true

Answer 73

A

(1-alpha)^n
in n tests

Answer 74

A

1-(1-alpha)^n
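
A quick numerical sketch of the formula, assuming the n tests are independent and each uses the same alpha. The values alpha = 0.05 and n = 10 are chosen only for illustration.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive in n independent tests."""
    p_no_false_positive = (1 - alpha) ** n_tests
    return 1 - p_no_false_positive

fwer_single = family_wise_error_rate(0.05, 1)   # one test: just alpha
fwer_ten = family_wise_error_rate(0.05, 10)     # ten tests: about 0.40

print(fwer_single, fwer_ten)
```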

Answer 75

A

Compares several samples with each other and compares variance within samples with that between samples

Answer 76

A

Analysis of variance (ANOVA)

Answer 77

A

Compare means with one another to find statistical difference

Answer 78

A

Mean of sample means
(Add all means and divide by number of groups)

Answer 79

A

Observational
Experimental

Answer 80

A

Makes observations without intervention

Answer 81

A

A study where an intervention is made to test a hypothesis

Answer 82

A

Any relevant condition, characteristic, number or quantity that can be measured, assessed or counted

Answer 83

A

Explanatory variable

Answer 84

A

Response variable

Answer 85

A

One that could impact the measurement from your dependent variable in addition to your independent variable

Answer 86

A

The difference between the result for a whole population and the result from our sample or experiment.

Answer 87

A

Sampling error
Bias

Answer 88

A

The possibility that the sample is not a perfect representation of the population

Answer 89

A

Normal (allowing for statistical testing)

Answer 90

A

Replication
Balance
Blocking

Answer 91

A

The more data we collect, the more insignificant errors become

Answer 92

A

Technical
Biological

Answer 93

A

These are additional measurements or analyses taken from the same sample. They help account for variability introduced by the measurement process itself.

Answer 94

A

These involve separate samples that are independently manipulated or tested under identical conditions

Answer 95

A

Grouping experimental units with similar properties

Answer 96

A

This is the process of comparing groups of similar sizes

Answer 97

A

Error caused by a systematic difference in the estimation of the sample and the whole population

Answer 98

A

Any
(Design, data collection, analysis, publication etc…)

Answer 99

A

Simultaneous control groups
Blinding
Randomisation

Answer 100

A

A group of subjects not exposed to the experimental treatment but treated the same in all other ways

Answer 101

A

Untreated control
Vehicle control

Answer 102

A

Subject in its native state with no treatment

Answer 103

A

Subject undergoes treatment with everything but the exact thing being tested (e.g.: the drug)

Answer 104

A

Testing against a pre-existing drug as opposed to a vehicle control

Answer 105

A

A control which defines what a positive result looks like

Answer 106

A

Result which defines what a negative result looks like

Answer 107

A

The process of obscuring who has which treatment to limit the placebo effect

Answer 108

A

Assigning individuals to groups at random so as not to introduce further sampling bias

Answer 109

A

Correlation
Regression

Answer 110

A

Its strength and direction

Answer 111

A

Correlation coefficient

Answer 112

A

Very weak or negligible correlation between the two variables

Answer 113

A

Weak or low correlation between the two variables

Answer 114

A

Moderate correlation between the two variables

Answer 115

A

Strong, high and marked correlation between the two variables

Answer 116

A

Very strong and very high correlation between the two variables

Answer 117

A

How much of the variation in one variable can be explained by the other
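
A minimal sketch of the Pearson correlation coefficient r and its square, r^2, computed by hand. The paired data and the function name `pearson_r` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical variable a
ys = [2.0, 4.0, 5.0, 4.0, 5.0]  # hypothetical variable b

r = pearson_r(xs, ys)
r_squared = r ** 2  # share of variation in one variable explained by the other

print(r, r_squared)
```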

Answer 118

A

Looking for an association between variables where neither is experimentally manipulated
Experimentally manipulating one variable and looking to see whether the other variable changes too

Answer 119

A

Regression

Answer 120

A

A higher correlation coefficient

Answer 121

A

There is little variability about the line of best fit

Answer 122

A

When there is a linear correlation

Answer 123

A

Assessment of how well a linear regression line fits data

Answer 124

A

Using the r^2 value
Looking at the residuals

Answer 125

A

As a straight line through the data points

Answer 126

A

The value (y) that the regression line predicts for a data point at a given x

Answer 127

A

The distance between a given point and its fitted value

Answer 128

A

Plot a residual plot - residual against fitted value - and observe if there are any patterns
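
The fitted values and residuals that go into a residual plot can be sketched as below, using an ordinary least-squares line fitted by hand. The data are hypothetical and the plotting step itself is left out.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical explanatory variable
ys = [2.0, 4.0, 5.0, 4.0, 5.0]  # hypothetical response variable

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept of the line y = intercept + slope * x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# Fitted value: where the line predicts each point should be.
fitted = [intercept + slope * x for x in xs]

# Residual: observed value minus fitted value.
residuals = [y - f for y, f in zip(ys, fitted)]

# Pairs to plot as residual (vertical axis) against fitted value (horizontal axis).
for f, res in zip(fitted, residuals):
    print(round(f, 2), round(res, 2))
```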

Answer 129

A

A linear equation may not be appropriate for the data presented

Answer 130

A

Points are evenly scattered about the line on either side, with even distribution

Answer 131

A

Yes using the linear regression

Answer 132

A

No, we need to create a regression in the other direction to describe b in terms of a

Answer 133

A

Refers to a number of activities, often related to the misinterpretation of statistics, that occur in published scientific work

Answer 134

A

The practice of cherry picking refers broadly to only presenting one side of the story. Specifically in relation to statistics, this translates as choosing not to report parts of your analysis which do not agree with the story you are trying to tell.

This is often used to “tidy up” or create a “convincing” story

Answer 135

A

Ultimately manipulating your data or analysis to result in a significant p value

Answer 136

A

check the statistical significance before deciding whether to collect more data
stopping data collection as soon as results reflect those desired
excluding data after checking impact on significance
adjust models on the basis of whether or not a significant result is obtained without proper justification
rounding a p-value to the threshold
hidden multiple testing and therefore no p value adjustments

Answer 137

A

Hypothesising after the results are known (HARKing) is presenting results that have been discovered as if they were expected or as if they were the main study aim (overstating prior knowledge of the study).
Presenting ad hoc or unexpected results in this way is misleading

Answer 138

A

An unplanned or supplementary analysis conducted to explore specific aspects of data that weren’t the primary focus of the study. This is done on an as-needed basis to investigate particular comparisons or relationships not initially accounted for in the main analysis.

Answer 139

A

No, they are questionable but not misconduct

Answer 140

A

Fabrication and falsification

Answer 141

A

Making up data or results

Answer 142

A

The manipulation of research materials, data or results

Answer 143

A

Data needs to be normally distributed

Data should be from independent observations, which means that there is no relationship between the observations in each group or between the groups themselves.

Equal variances between groups (Homogeneity of variances, Homoscedasticity)

Answer 144

A

The fundamental assumption that the variance of the errors (or residuals) should be constant across all levels of the independent variable(s)

(Violated homoscedasticity is known as heteroscedasticity)

Answer 145

A

Refers to the similarity or uniformity of certain characteristics within a group or between groups.

Answer 146

A

K-1
Where K is the number of groups being compared

Answer 147

A

N-K
Where K is the number of groups being compared and N is the total number of observations/data points collected.

Answer 148

A

Quantifies variability between the groups of interest and within groups of interest in separate rows

Answer 149

A

The sum of the squared differences between each datapoint and the overall mean, also called SST, for sum of squares (total).

Answer 150

A

The sum of squares within the groups is defined as the sum of the squared differences between each datapoint and the mean of the group it belongs to. This shows the variation within each single group.

Answer 151

A

The sum of squares between the groups is defined as the sum, over every datapoint, of the squared difference between the mean of its group and the overall mean. This shows the variation between the groups.

Answer 152

A

Q3+1.5 IQR

Answer 153

A

Q1-1.5 IQR
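
The two limits Q3+1.5 IQR and Q1-1.5 IQR can be sketched as follows. The data are hypothetical, and the "inclusive" quartile method of `statistics.quantiles` is an assumption; other software may compute quartiles slightly differently.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 50]  # hypothetical observations

# Quartiles; the "inclusive" method treats the data as the whole population.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

upper_limit = q3 + 1.5 * iqr
lower_limit = q1 - 1.5 * iqr

# Points beyond either limit are usually drawn as outliers on a boxplot.
outliers = [x for x in data if x > upper_limit or x < lower_limit]
print(lower_limit, upper_limit, outliers)
```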

Answer 154

A

The binomial distribution is discrete, dealing with the number of successes in a fixed number of trials.

Answer 155

A

The normal distribution is continuous and is often associated with the distribution of measurements in a population.

Answer 156

A

The binomial distribution is characterized by the number of trials (n) and the probability of success (p).

Answer 157

A

The normal distribution is characterized by the mean (μ) and standard deviation (σ).

Answer 158

A

Use 1-(alpha/2) at each end

Answer 159

A

A boxplot is a qualitative analysis whilst an ANOVA is quantitative

Answer 160

A

ANOVA output
This is a variance estimate and what is used to calculate the F-statistic, the next column.
Calculated by taking the Sum of Squares divided by DF on the same row

Answer 161

A

This is defined as the ratio between the Mean Squares between and within.
Calculated by Mean squares of row 1/mean squares of row 2.
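
The rows of a one-way ANOVA table can be reproduced by hand. The sketch below uses three small made-up groups and computes the sums of squares, degrees of freedom (K-1 and N-K), mean squares, and the F statistic. It does not compute the p value.

```python
groups = {  # hypothetical measurements for three treatments
    "A": [1.0, 2.0, 3.0],
    "B": [2.0, 3.0, 4.0],
    "C": [5.0, 6.0, 7.0],
}

def mean(values):
    return sum(values) / len(values)

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Sums of squares: total, within groups, and between groups.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())

# Degrees of freedom: K groups, N observations in total.
k = len(groups)
n = len(all_values)
df_between = k - 1
df_within = n - k

# Mean squares (sum of squares / DF on the same row) and their ratio.
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_statistic = ms_between / ms_within

print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```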

Answer 162

A

If it is below a threshold value, the NULL hypothesis can be rejected

Answer 163

A

More likely to be a statistically significant difference between groups.

Answer 164

A

F(dfbetween, dfwithin) = F Statistic, p =

Answer 165

A

Post-hoc tests such as Tukey's Honest Significant Difference (HSD) test