Statistics Presentation Notes Flashcards

1
Q

In statistics, what term describes a process in which the same inputs always produce exactly one output?

A

deterministic

2
Q

What is the opposite of deterministic? In other words, what term refers to a process where the same inputs and factors can produce multiple outputs?

A

stochastic

3
Q

Multiple outputs may arise from what factor, in which the same results are obtained but the technology used to document the observation is imprecise?

A

measurement error

4
Q

What stochastic factor describes the variation which exists between subjects of study that gives rise to different results?

A

natural heterogeneity

5
Q

What stochastic factor includes variables like the disappearance of funding or poor weather?

A

uncontrollable factors

6
Q

What part of data analysis refers to visually displaying observations, removing the outliers, and subsetting the data?

A

preprocessing

7
Q

What are the two goals of exploratory data analysis (EDA)?

A
  1. identifying potential issues with the observed data
  2. taking note of trends which the scientist’s intuition does not observe
8
Q

What is the long-term frequency of an event taking place known as?

A

probability

9
Q

What word refers to probabilities associated with integers and categories (e.g., the number of oranges on a tree)?

A

discrete

10
Q

What do statisticians employ to analyze discrete outcomes?

A

probability mass functions (PMFs)

11
Q

What are the four common kinds of distribution associated with probability mass functions?

A
  1. J. Bernoulli distribution
  2. S. Poisson distribution
  3. binomial distribution
  4. multinomial distribution
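
As a rough illustration of what a probability mass function does, the sketch below evaluates the binomial and Poisson PMFs from their standard formulas. The function names and the example parameters (10 trials, success probability 0.3, rate 4) are assumptions chosen for illustration, not values from these notes.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# A PMF assigns a probability to each discrete outcome; together they sum to 1.
binomial_total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))

# The Bernoulli distribution is the binomial with a single trial.
bernoulli_success = binomial_pmf(1, 1, 0.3)

# The Poisson has infinitely many outcomes; the first 50 carry nearly all the mass.
poisson_head = sum(poisson_pmf(k, 4.0) for k in range(50))

print(binomial_total, bernoulli_success, poisson_head)
```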
12
Q

What word refers to probabilities associated with non-integer numbers (e.g., the likelihood that someone was born exactly 250 years after Horatio Nelson)?

A

continuous

13
Q

What do statisticians employ to analyze continuous outcomes?

A

probability density functions (PDFs)

14
Q

What are the three common kinds of distribution associated with probability density functions?

A
  1. beta distribution
  2. gamma distribution
  3. normal distribution
15
Q

What kinds of distributions do statisticians employ for continuous outcomes without either positive or negative constraints?

A

normal distribution

16
Q

Normal distributions form the cornerstone of what three kinds of data analyses?

A
  1. t-tests
  2. ANOVAs
  3. regression analysis (simple/multiple)
17
Q

The t-test, ANOVA, simple regression analysis and multiple regression analysis are all considered what?

A

linear models

18
Q

What theorem says that even if the data are not normally distributed, the distribution of sample means moves towards normality as the samples become very large?

A

central limit theorem
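
A small simulation can make the theorem concrete. The sketch below assumes an exponential distribution as the skewed source of data and uses sample skewness as a crude measure of non-normality; both choices, and the helper names, are illustrative assumptions.

```python
import random
import statistics

random.seed(1)

def sample_mean(n):
    """Mean of n draws from a skewed (exponential) distribution with mean 1."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

def skewness(values):
    """Sample skewness; roughly 0 for a symmetric, normal-like distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

# Distribution of 2,000 sample means for a small and a large sample size.
means_small = [sample_mean(2) for _ in range(2000)]
means_large = [sample_mean(200) for _ in range(2000)]

print(skewness(means_small), skewness(means_large))
```

The means of the larger samples should show skewness much closer to zero, even though the underlying data are strongly skewed.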

19
Q

What is the formula used to express a normal distribution?

A

y ~ N(mu, sigma^2)

20
Q

In the formula for the normal distribution, what do the following variables refer to:
1. “y”
2. “N”
3. “mu”
4. “sigma^2”

A
  1. outputs based on inputs
  2. normal distribution
  3. mean/median of the data
  4. variance (spread of the values around the center)
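
To see the roles of mu and sigma^2, one can draw values from a normal distribution and check the sample mean and variance. This is only a sketch; the values mu = 10 and sigma^2 = 4 are assumed for illustration.

```python
import random
import statistics

random.seed(42)

mu, sigma_sq = 10.0, 4.0      # assumed example values
sigma = sigma_sq ** 0.5       # random.gauss expects sigma, not sigma^2

# y ~ N(mu, sigma^2): draw 20,000 outputs from the distribution.
y = [random.gauss(mu, sigma) for _ in range(20000)]

sample_mean = statistics.fmean(y)     # should land close to mu
sample_var = statistics.variance(y)   # should land close to sigma^2

print(round(sample_mean, 2), round(sample_var, 2))
```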
21
Q

The t-distribution is widely used in many statistical models and looks like the normal distribution, but becomes more divergent with smaller sample sizes due to the influence of what parameter?

A

degrees of freedom (df)
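
The effect of df can be checked numerically. The sketch below implements the standard density of Student's t-distribution (the helper name `t_pdf` is an assumption) and compares its height three units from the center against the standard normal.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))

normal_tail = NormalDist().pdf(3.0)

# Small df gives heavier tails; large df is nearly indistinguishable from normal.
for df in (2, 30, 1000):
    print(df, t_pdf(3.0, df), normal_tail)
```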

22
Q

The normal distribution is useful with what kind of model, which assumes the mean/median (mu) varies linearly?

A

regression models

23
Q

What does an analysis of variance show about the treatment?

A

whether the treatment affects the results relative to the control

24
Q

What is a necessary characteristic of a valid hypothesis?

A

falsifiability

25
Q

What is an example of a non-falsifiable hypothesis (HINT: it remains a popular idea in most people’s heads nonetheless)?

A

God created the Universe

26
Q

Information collected by hypothesis testing can cause what three things to subsequently occur?

A
  1. rejection of original claim
  2. modification of original claim
  3. confirmation of original claim
27
Q

What are the four steps statistical hypothesis testing is often broken into?

A
  1. development of null and alternative hypotheses
  2. calculation of a test statistic
  3. converting the test statistic to a P-value
  4. deriving a conclusion
28
Q

A (1)__________ hypothesis may be defined as the theory of no (2)_____________ or the absence of any (3)________________; it contradicts the notion of the (4)____________________ relationship.

A

(1) null
(2) difference
(3) pattern
(4) cause-and-effect

29
Q

It is thought the “Ghostbuster” eggplant is larger than the “Night Shadow” variety. What would be the null (Ho) and alternative (Ha) hypotheses?

A

Ho = “Ghostbuster” and “Night Shadow” eggplants are the same size
Ha = “Ghostbuster” eggplants are larger than “Night Shadow” fruits

30
Q

Considering the question of whether “Ghostbuster” eggplants are larger than or the same size as “Night Shadow” fruits, how might one go about establishing a histogram which shows the distribution curve?

A

to establish the distribution curve, we could measure 1,000 “Ghostbuster” and 1,000 “Night Shadow” eggplants and take their average masses (mu[G] and mu[NS]), then take their difference (mu[G] - mu[NS])

next, we mix the 2,000 observations, pull 1,000 of them at random and assign them to a hypothetical “Ghostbuster” group with average mass mu[g1], and assign the other 1,000 to a hypothetical “Night Shadow” group with average mass mu[ns1], after which we take the difference of their averages (mu[g1] - mu[ns1])

repeat the previous step 999 times (mu[g2] - mu[ns2], mu[g3] - mu[ns3], … mu[g999] - mu[ns999], mu[g1000] - mu[ns1000]) and plot the frequency of the differences as a histogram

31
Q

Considering the question of whether “Ghostbuster” eggplants are larger than or the same size as “Night Shadow” fruits, how might one go about converting the distribution curve to a useful P-value?

A

plot the true difference in mass (mu[G] - mu[NS]) on the histogram with the frequency of simulated mass differences

count the number of simulated differences which are larger than the true difference; the P-value will be that count divided by 1,000 (the number of simulations)
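
The shuffle-and-count procedure from this card and the previous one can be sketched in a few lines. The masses below are made-up values generated for illustration (200 fruits per cultivar rather than 1,000, to keep the run short), and the variable names are assumptions.

```python
import random
import statistics

random.seed(0)

# Made-up masses in grams; real measurements would replace these lists.
ghostbuster = [random.gauss(520, 60) for _ in range(200)]
night_shadow = [random.gauss(500, 60) for _ in range(200)]

# True difference in average mass, mu[G] - mu[NS].
true_diff = statistics.fmean(ghostbuster) - statistics.fmean(night_shadow)

pooled = ghostbuster + night_shadow
n_g = len(ghostbuster)
n_sims = 1000
simulated_diffs = []
for _ in range(n_sims):
    random.shuffle(pooled)                          # mix all observations
    fake_g, fake_ns = pooled[:n_g], pooled[n_g:]    # relabel at random
    simulated_diffs.append(statistics.fmean(fake_g) - statistics.fmean(fake_ns))

# One-sided P-value: share of simulated differences larger than the true one.
n_larger = sum(d > true_diff for d in simulated_diffs)
p_value = n_larger / n_sims

print(round(true_diff, 2), p_value)
```

A histogram of `simulated_diffs` is the distribution expected under the null hypothesis; the true difference is then located on it.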

32
Q

What reason does the textbook (Gotelli & Ellison, 2013) give for the establishment of 0.05 as the orthodox critical P-value?

A

“… after many decades of custom, tradition and vigilant enforcement by editors and journal reviewers”

33
Q

While the P-value is important, what are three aspects of the data which also need to be considered?

A

[1] sample size (n)
[2] the measurement under investigation (such as difference in size between two cultivars)
[3] level of variation (sigma^2)

34
Q

What causes larger sample sizes to result in smaller P-values than smaller sample sizes, even if both groups have a high number of observations (i.e., comparing a 1,000 size sample to a 10,000 size sample)?

A

larger sample sizes are more reflective of the overall population or true value

35
Q

Which groups will produce smaller P-values, those with a large amount of variation among their observations or those with a small amount of variation?

A

the lower the amount of variation within the groups, the lower the resulting P-values
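
The influence of sample size and variation on the P-value can be sketched with a simple one-sided z-test. The z-test, the function name, and all numbers below are assumptions made for illustration; the observed difference is held fixed while n and sigma change.

```python
from statistics import NormalDist

def one_sided_p(diff, sigma, n):
    """P-value for an observed difference `diff` from the null value,
    given standard deviation sigma and sample size n (z-test sketch)."""
    z = diff / (sigma / n ** 0.5)
    return 1 - NormalDist().cdf(z)

# Same difference and variation, different sample sizes.
p_small_n = one_sided_p(diff=1.0, sigma=20.0, n=1000)
p_large_n = one_sided_p(diff=1.0, sigma=20.0, n=10000)

# Same difference and sample size, different amounts of variation.
p_high_var = one_sided_p(diff=1.0, sigma=40.0, n=1000)
p_low_var = one_sided_p(diff=1.0, sigma=10.0, n=1000)

print(p_small_n, p_large_n, p_high_var, p_low_var)
```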

36
Q

What is the downside of the inferences made from hypothesis testing, even with very large sample sizes?

A

all conclusions are based on incomplete information, which may not actually reflect the true situation

37
Q

What are the two possible correct decisions when testing a null hypothesis?

A

(1) failure to reject a null hypothesis which is true
(2) rejection of a null hypothesis which is false

38
Q

What occurs when someone commits a Type I Error (alpha)?

A

false rejection of a null hypothesis which is true

39
Q

What is the definition of a Type I Error in simple English?

A

something is thought to be occurring when nothing actually is

40
Q

What occurs when someone commits a Type II Error (beta)?

A

failure to reject a null hypothesis which is actually false

41
Q

What is the definition of a Type II Error in simple English?

A

nothing is thought to be occurring when something actually is

42
Q

What is statistical power?

A

the probability of correctly rejecting a null hypothesis which is false

43
Q

What is the relationship between statistical power and the Type II Error?

A

statistical power = 1.0 - beta (probability of a type II error)

44
Q

If one fails to find significant results in their data, what might be the cause, which is related to statistical power?

A

the study may have been under-powered, meaning it had too few samples
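
The link between power, beta, and sample size can be sketched for a one-sided z-test. The function name, the effect size of 2, sigma of 10, and alpha of 0.05 are assumptions chosen for illustration.

```python
from statistics import NormalDist

def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test: P(reject H0 | true mean shift = effect)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)    # rejection threshold
    shift = effect / (sigma / n ** 0.5)       # true shift in standard-error units
    beta = std_normal.cdf(z_crit - shift)     # probability of a Type II error
    return 1.0 - beta                         # power = 1.0 - beta

# Power rises with sample size; a small n leaves the study under-powered.
for n in (10, 50, 200):
    print(n, round(power_one_sided_z(effect=2.0, sigma=10.0, n=n), 3))
```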

45
Q

What is the relationship between Type I and Type II Errors?

A

they are inverse – as one grows, the other shrinks

46
Q

Why are the stakes of a Type II Error less than a Type I Error, generally speaking?

A

a type I error proclaims there is a phenomenon where none actually exists, which can lead people to do foolish things for bad reasons

a type II error, as a false negative, doesn’t generally mobilize people, and can often be corrected in the future with more sensitive tech

47
Q

What statistical projection represents the simplest starting place for describing some relationship between one or more explanatory variables (x1, x2, … x[n]) and the response variable (y)?

A

linear model

48
Q

Some non-linear relationships can be approximately linear if what is done to them (think of a relationship with a semicircular curve)?

A

over a very narrow range of x values, the function can appear more linear

49
Q

What assumption for a linear model refers to the requirement that the explanatory (x) and response (y) variables have a linear relationship between them?

A

linearity

50
Q

What assumption for a linear model refers to the necessary existence of normally distributed errors for a given value in the explanatory (x) variable?

A

normality

51
Q

What assumption for a linear model refers to the requirement that the variance in the response variable be constant across the explanatory variables?

A

homogeneity of variance

52
Q

What key assumption of linear models says that, for any given value of explanatory variables, the responses will have independent errors?

A

independence

53
Q

What is the difference between the observed value of the response variable (y) and the value of response predicted by the linear model (y-hat) known as?

A

residual
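
As a minimal sketch, residuals can be computed by fitting a simple least-squares line and subtracting the predictions from the observations. The data points are made up for illustration.

```python
import statistics

# Made-up example data (x = explanatory, y = response).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Ordinary least-squares intercept a and slope b for y-hat = a + b * x.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # e = y - y-hat

print([round(e, 3) for e in residuals])
```

For a least-squares fit the residuals sum to zero, which is why they are expected to scatter around the line y = 0.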

54
Q

When testing linearity with the residual~scatter-plot method, we expect that residuals should be (1)___________________ about the line (2)_________, and that the values are (3)____________ with predictions.

A

(1) evenly distributed
(2) y = 0
(3) uncorrelated

55
Q

If a non-linear model is a better representation of a phenomenon which is erroneously being described with a linear model, what can be a consequence?

A

inflated estimate of variance

56
Q

What kind of graph is invoked to check the normality assumption?

A

normal qq-plot

57
Q

What allows the statistician to apply a relative significance to a set of values in order to make errors standard to test for normality?

A

weighted standard deviation

58
Q

What are the steps necessary to establish a qq-plot to check for normality?

A

(1) calculate the residuals (e) for each value of the response variable (y - y-hat)

(2) standardize each residual (e) by dividing it by the weighted standard deviation (sigma)

(3) for the scatter-plot, set the theoretical quantiles of the differences on the x-axis and the standardized residuals on the y-axis
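
The three steps can be sketched without a plotting library by computing the points that would be drawn. The residuals below are made up, a plain standard deviation stands in for the weighted one as a simplification, and the plotting positions (i - 0.5) / n are one common convention assumed here.

```python
from statistics import NormalDist, pstdev

# Step 1 (assumed already done): made-up residuals e = y - y-hat.
residuals = [-1.8, 0.4, 1.1, -0.3, 0.9, -0.6, 2.0, -1.2, 0.1, -0.6]

# Step 2: standardize each residual by dividing by the standard deviation.
spread = pstdev(residuals)
standardized = sorted(e / spread for e in residuals)

# Step 3: theoretical normal quantiles go on the x-axis.
n = len(standardized)
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Each (theoretical, standardized) pair is one point on the qq-plot.
qq_points = list(zip(theoretical, standardized))
for tq, sr in qq_points:
    print(round(tq, 2), round(sr, 2))
```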

59
Q

In a normal qq-plot, what does the hypothetical linear abline (which we code for separately and is not based on the real values) represent?

A

the linear abline shows where standardized residuals would fall if they were perfectly normal

60
Q

If weighted residuals are normally distributed, what should their associated qq-plot look like?

A

all the values fall nicely on the abline

61
Q

If weighted residuals are skewed to the left, what should their associated qq-plot look like?

A

all the values are linear early on and curve upward later (think of the curve of a circle with center at (0, 0) which is being looked at in Quadrant IV)

62
Q

If weighted residuals are skewed to the right, what should their associated qq-plot look like?

A

all the values curve upward early on and become more linear later (think of the curve of a circle with center at (0, 0) which is being looked at in Quadrant II)

63
Q

What type of distribution in the errors produces a qq-plot that roughly looks like an “S”?

A

distribution of errors with fat-tailed residuals

64
Q

What type of distribution in the errors produces a qq-plot that roughly looks like a backwards “S”?

A

distribution of errors with thin-tailed residuals

65
Q

Normality in the data is important because it allows us to use (1)_______________________ to construct (2)_______________________ and to engage in (3)___________________________.

A

(1) parametric theory
(2) confidence intervals
(3) hypothesis testing

66
Q

To examine homogeneity of variance, what kind of graph do we establish?

A

t-plot of residual values versus the fitted data

67
Q

In a t-plot of predicted values (y-hat) against the residuals (e), what kind of behavior do we want the values to appear as across the line e = 0?

A

homoscedastic

68
Q

What is the kind of distribution in the residual values we do not wish to see when testing for homogeneity of variance?

A

heteroscedastic

69
Q

How can statisticians best ensure that errors are independent of one another?

A

maintain a sampling design with subjects that are independent spatially and/or temporally from one another

70
Q

What are the four common kinds of data transformation used by statisticians to normalize the data?

A

(1) base-10 (“common”) logarithmic
(2) base-e (“natural”) logarithmic
(3) square-root
(4) arcsine square root
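
Each of the four transformations is a one-line call in Python's standard `math` module. The input values below are made up: counts spanning several orders of magnitude and proportions between 0 and 1.

```python
import math

counts = [1, 10, 100, 1000, 10000]      # made-up values spanning orders of magnitude
proportions = [0.05, 0.25, 0.5, 0.9]    # made-up proportions between 0 and 1

log10_vals = [math.log10(v) for v in counts]    # base-10 ("common") logarithm
ln_vals = [math.log(v) for v in counts]         # base-e ("natural") logarithm
sqrt_vals = [math.sqrt(v) for v in counts]      # square root
arcsine_vals = [math.asin(math.sqrt(p)) for p in proportions]   # arcsine square root

print(log10_vals)
print(arcsine_vals)
```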

71
Q

Log transformations are useful if the ratio between the smallest and largest outputs is what?

A

orders of magnitude in size

72
Q

Square-root transformations can be used when all of the values are greater than what?

A

values need to be greater than zero

73
Q

What is the term used for the kind of data square-root transformations are used with, which is related to the Poisson distribution?

A

count data

74
Q

Under what circumstances would someone transform the data with the arcsine square root?

A

outputs are compressed between (0, 1.0)

75
Q

How many kinds of t-test are there? What are they called? What do they have in common?

A

three kinds of t-test:

(1) one sample t-test
(2) two-sample t-test
(3) paired t-test

all t-tests compare the means of the data

76
Q

To derive a t-statistic (z[obs]), what is the relationship between the observed data (x-bar), average variation (sigma / n^0.5), and the expected value under the null hypothesis (mu)?

A

z[obs] = (x-bar - mu) / (sigma / n^0.5)

(In English, the test statistic is equal to the sample mean minus the expected value under the null hypothesis, divided by the average variation, which is the standard deviation divided by the square root of the sample size)
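
The formula translates directly into code. This is a sketch with made-up sample values; the function name `z_obs` and the assumed null mean (50) and sigma (3) are illustrative only.

```python
import statistics

def z_obs(sample, mu, sigma):
    """z[obs] = (x-bar - mu) / (sigma / n^0.5), with sigma assumed known."""
    x_bar = statistics.fmean(sample)
    n = len(sample)
    return (x_bar - mu) / (sigma / n ** 0.5)

sample = [52.0, 48.5, 55.1, 50.3, 53.6, 49.9, 54.2, 51.4, 50.8]   # made-up data
print(z_obs(sample, mu=50.0, sigma=3.0))
```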

77
Q

What is the difference between a t-statistic and a z-score?

A

the t-statistic is used when the sample size is small or the standard deviation of the population is not known

78
Q

Both the two-sample t-test and the paired t-test compare the means of two sample groups. What are the two main reasons to use a paired t-test over a two-sample t-test?

A

(1) paired t-tests are used if two measurements are taken on the same unit
(2) paired t-tests can remove variability between units
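
The second reason can be seen by computing both statistics on the same data. In this sketch the before/after measurements are made up, and the two-sample statistic uses the unpooled standard error; units differ a lot from one another, but each changes by a similar amount.

```python
import statistics

# Made-up before/after measurements taken on the same six units.
before = [10.1, 12.4, 9.8, 15.2, 11.0, 13.7]
after = [10.9, 13.1, 10.5, 16.1, 11.6, 14.6]
n = len(before)

# Two-sample t-statistic: ignores the pairing, so unit-to-unit
# variability inflates the standard error.
se_two = (statistics.variance(before) / n + statistics.variance(after) / n) ** 0.5
t_two_sample = (statistics.fmean(after) - statistics.fmean(before)) / se_two

# Paired t-statistic: works on within-unit differences,
# which removes the variability between units.
diffs = [a - b for a, b in zip(after, before)]
t_paired = statistics.fmean(diffs) / (statistics.stdev(diffs) / n ** 0.5)

print(round(t_two_sample, 2), round(t_paired, 2))
```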

79
Q

What is the underlying reason for which a randomized block design is undertaken?

A

control of sources of variation

80
Q
A