Statistics Presentation Notes Flashcards
In statistics, the same inputs and process should have only one output.
What term describes this?
deterministic
What is the opposite of deterministic; in other words, what term refers to a process where the same inputs and factors can produce multiple outputs?
stochastic
Multiple outputs may arise from what factor, in which the same results are obtained but the technology used to document the observation is imprecise?
measurement error
What stochastic factor describes the variation which exists between subjects of study that gives rise to different results?
natural heterogeneity
What stochastic factor includes variables like the disappearance of funding or poor weather?
uncontrollable factors
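A minimal sketch of the deterministic/stochastic contrast, assuming a made-up rule (fruit mass = 40 x seed mass) and made-up noise sizes for natural heterogeneity and measurement error. The function names and numbers are hypothetical and only illustrate the terms above.

```python
import random

def true_mass(seed_mass_g):
    # Deterministic: the same input always gives the same output.
    return 40.0 * seed_mass_g

def observed_mass(seed_mass_g, rng):
    # Stochastic: the same input can give different outputs.
    heterogeneity = rng.gauss(0.0, 5.0)      # assumed plant-to-plant variation (g)
    measurement_error = rng.gauss(0.0, 1.0)  # assumed imprecision of the scale (g)
    return true_mass(seed_mass_g) + heterogeneity + measurement_error

rng = random.Random(42)
deterministic_runs = [true_mass(2.5) for _ in range(5)]
stochastic_runs = [observed_mass(2.5, rng) for _ in range(5)]

print(deterministic_runs)  # five identical values
print(stochastic_runs)     # five different values
```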
What part of data analysis refers to visually displaying observations, removing the outliers, and subsetting the data?
preprocess
What are the two goals of exploratory data analysis (EDA)?
- identifying potential issues with the observed data
- taking note of trends that the scientist’s intuition does not catch
What is the long-term frequency of an event taking place known as?
probability
What word refers to probabilities associated with integers and categories (e.g., number of oranges on a tree)?
discrete
What do statisticians employ to analyze discrete outcomes?
probability mass functions (PMFs)
What are the four common kinds of distribution associated with probability mass functions?
- Bernoulli distribution
- Poisson distribution
- binomial distribution
- multinomial distribution
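A short sketch of what a probability mass function does, using the standard textbook formulas for the binomial and Poisson distributions (the Bernoulli is treated as a binomial with a single trial). The parameter values are arbitrary examples, not taken from the notes.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    # Probability of exactly k successes in n independent trials.
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def bernoulli_pmf(k, p):
    # A Bernoulli trial is a binomial with n = 1 (k is 0 or 1).
    return binomial_pmf(k, 1, p)

def poisson_pmf(k, lam):
    # Probability of exactly k events when the average count is lam.
    return lam**k * exp(-lam) / factorial(k)

# A PMF assigns a probability to each discrete outcome, and those sum to 1.
binomial_total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
print(binomial_total)
print(bernoulli_pmf(1, 0.3))
print(poisson_pmf(0, 2.0))
```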
What word refers to probabilities associated with non-integer numbers (e.g., likelihood that someone was born exactly 250 years after Horatio Nelson)?
continuous
What do statisticians employ to analyze continuous outcomes?
probability density functions (PDFs)
What are the three common kinds of distribution associated with probability density functions?
- beta distribution
- gamma distribution
- normal distribution
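For continuous outcomes the density at a single point is not itself a probability; probabilities come from the area under the curve over an interval. A small sketch using the standard normal distribution from Python's `statistics` module, chosen here only as a convenient example:

```python
from statistics import NormalDist

z = NormalDist(0.0, 1.0)  # standard normal: mean 0, standard deviation 1

# Height of the density curve at the center (a density, not a probability).
density_at_zero = z.pdf(0.0)

# Probability of landing within one standard deviation of the mean,
# computed as the area under the curve between -1 and 1.
prob_within_one_sd = z.cdf(1.0) - z.cdf(-1.0)

print(density_at_zero)
print(prob_within_one_sd)
```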
What kinds of distributions do statisticians employ for continuous outcomes without either positive or negative constraints?
normal distribution
Normal distributions form the cornerstone of what three kinds of data analyses?
- t-tests
- ANOVAs
- regression analysis (simple/multiple)
The t-test, ANOVA, simple regression analysis and multiple regression analysis are all considered what?
linear models
What theorem says that even if the data are not normally distributed, the means of sufficiently large samples will tend towards a normal distribution?
central limit theorem
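A rough simulation of the idea, assuming a strongly right-skewed exponential distribution as the raw data. The sample size of 50 and the 2,000 repeats are arbitrary choices, and skewness is used here as one simple way to check how symmetric (closer to normal) the sample means are.

```python
import random
from statistics import fmean, pstdev

def skewness(values):
    # Average cubed z-score: about 0 for a symmetric distribution.
    m = fmean(values)
    s = pstdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])

rng = random.Random(0)

# Raw observations from a skewed (exponential) distribution.
raw = [rng.expovariate(1.0) for _ in range(5000)]

# Means of many samples, each containing 50 raw observations.
sample_means = [
    fmean([rng.expovariate(1.0) for _ in range(50)])
    for _ in range(2000)
]

print(skewness(raw))           # strongly skewed
print(skewness(sample_means))  # much closer to 0
```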
What is the formula used to express a normal distribution?
y ~ N(mu, sigma^2)
In the formula for the normal distribution, what do the following variables refer to:
1. “y”
2. “N”
3. “mu”
4. “sigma^2”
- outputs based on inputs
- normal distribution
- mean/median of the data
- variance (spread of the values around the center)
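To make the notation y ~ N(mu, sigma^2) concrete, the sketch below draws values from a normal distribution and checks that their mean and variance land near mu and sigma^2. The values mu = 50 and sigma^2 = 16 are arbitrary; note that `NormalDist` takes the standard deviation (sigma), not the variance.

```python
from statistics import NormalDist, fmean, pvariance

mu = 50.0
sigma_squared = 16.0
sigma = sigma_squared ** 0.5  # NormalDist expects sigma, not sigma^2

# y ~ N(mu, sigma^2): draw 10,000 simulated observations.
y = NormalDist(mu, sigma).samples(10_000, seed=1)

sample_mean = fmean(y)
sample_variance = pvariance(y)

print(sample_mean)      # close to mu
print(sample_variance)  # close to sigma^2
```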
The t-distribution is widely used in many statistical models and looks like the normal distribution, but becomes more divergent with smaller sample sizes due to the influence of what parameter?
degrees of freedom (df)
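A sketch of how degrees of freedom change the shape, using the standard density formula for Student's t-distribution. It compares the height of the curve far from the center (at 3) with the normal curve; the df values are arbitrary examples.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist

def t_pdf(x, df):
    # Density of Student's t-distribution with df degrees of freedom.
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

normal_tail = NormalDist().pdf(3.0)

# Small df gives heavier tails; large df approaches the normal curve.
for df in (3, 30, 300):
    print(df, t_pdf(3.0, df), normal_tail)
```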
The normal distribution is useful with what kind of model, which assumes the mean/median (mu) varies linearly with the inputs?
regression models
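A minimal sketch of a simple regression under these assumptions: the mean is a straight line in x (mu = intercept + slope * x) and each observation is y ~ N(mu, sigma^2). The data are simulated with a made-up intercept of 5 and slope of 2, and the line is recovered with ordinary least squares.

```python
import random
from statistics import fmean

def fit_line(x, y):
    # Ordinary least squares for one predictor.
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

rng = random.Random(7)
x = [float(i) for i in range(100)]
# Simulated data: mu = 5 + 2x, with normal noise (sigma = 1) around mu.
y = [5.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

intercept, slope = fit_line(x, y)
print(intercept, slope)  # estimates near 5 and 2
```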
What does an analysis of variance show about the treatment?
whether the treatment affects the results relative to the control
What is a necessary characteristic of a valid hypothesis?
falsifiability
What is an example of a non-falsifiable hypothesis (HINT: it remains a popular idea in most people’s heads nonetheless)?
God created the Universe
Information collected by hypothesis testing can cause what three things to subsequently occur?
- rejection of original claim
- modification of original claim
- confirmation of original claim
What are the four steps statistical hypothesis testing is often broken into?
- development of null and alternative hypotheses
- calculation of a test statistic
- converting the test statistic to a P-value
- deriving a conclusion
A (1)__________ hypothesis may be defined as the theory of no (2)_____________ or the absence of any (3)________________; it contradicts the notion of the (4)____________________ relationship.
(1) null
(2) difference
(3) pattern
(4) cause-and-effect
It is thought the “Ghostbuster” eggplant is larger than the “Night Shadow” variety. What would be the null (Ho) and alternative (Ha) hypotheses?
Ho = “Ghostbuster” and “Night Shadow” eggplants are the same size
Ha = “Ghostbuster” eggplants are larger than “Night Shadow” fruits
Considering the question of whether or not “Ghostbuster” eggplants are larger or the same size as “Night Shadow” fruits, how might one go about establishing a histogram which shows the distribution curve?
to establish the distribution curve, we could measure 1,000 “Ghostbuster” and 1,000 “Night Shadow” eggplants and take their average masses (mu[G] and mu[NS]), then take their difference (mu[G] - mu[NS])
next, we mix the 2,000 observations, pull 1,000 of them at random to form a hypothetical “Ghostbuster” group with average mass mu[g1], and treat the other 1,000 as a hypothetical “Night Shadow” group with average mass mu[ns1], after which we take the difference of the two averages (mu[g1] - mu[ns1])
repeat the previous step 999 times (mu[g2] - mu[ns2], mu[g3] - mu[ns3], … mu[g999] - mu[ns999], mu[g1000] - mu[ns1000]) and plot the frequency of the differences as a histogram
Considering the question of whether or not “Ghostbuster” eggplants are larger or the same size as “Night Shadow” fruits, how might one go about converting the distribution curve to a useful P-value?
plot the true difference in mass (mu[G] - mu[NS]) on the histogram with the frequency of simulated mass differences
count the number of simulated differences which are at least as large as the true difference; the P-value will be that count divided by 1,000
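The shuffling procedure and the P-value count can be sketched together as below. The eggplant masses are simulated, not real measurements: the group means (310 g and 300 g) and spread (40 g) are assumptions made up for illustration, and `permutation_test` is a hypothetical helper name.

```python
import random
from statistics import fmean

def permutation_test(group_a, group_b, n_perm=1000, seed=0):
    rng = random.Random(seed)
    # True difference in average mass between the two real groups.
    observed = fmean(group_a) - fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    null_diffs = []
    for _ in range(n_perm):
        # Mix all observations, then split into two hypothetical groups.
        rng.shuffle(pooled)
        null_diffs.append(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
    # Share of simulated differences at least as large as the true one.
    n_extreme = sum(d >= observed for d in null_diffs)
    return observed, null_diffs, n_extreme / n_perm

# Simulated masses in grams (assumed values, for illustration only).
rng = random.Random(2013)
ghostbuster = [rng.gauss(310.0, 40.0) for _ in range(1000)]
night_shadow = [rng.gauss(300.0, 40.0) for _ in range(1000)]

observed, null_diffs, p_value = permutation_test(ghostbuster, night_shadow)
print(observed, p_value)
```

The list `null_diffs` holds the 1,000 simulated differences that would be plotted as the histogram, with `observed` marked on it.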
What reason does the textbook (Gotelli & Ellison, 2013) give for the establishment of 0.05 as the orthodox critical P-value?
“… after many decades of custom, tradition and vigilant enforcement by editors and journal reviewers”