STATS (BIOL 243) FALL 2024 Flashcards
five Hierarchical scales
sample unit
sample
observation unit
statistical population
population of interest
sampling unit
the unit being selected at random; it may be the same as the observation unit or contain multiple observation units
sample
collection of the sampling units
observation unit
scale of data collection, subject of study
statistical population
collection of all sampling units that could have been in your sample, and represents the true scale in which your statistical conclusions are valid
population of interest
collection of sampling units that you hope to draw conclusion about
scope of the research question
ideally the same as your statistical population
measurement variable
what we want to know/measure about the observation unit
measurement unit
scale
descriptive stats
set of tools used to describe data
inferential statistics
uses information from the sample to make a probabilistic statement about the statistical population
what is the rule for descriptive and inferential stats when there are multiple groups in a statistical population
descriptive stats are repeated for each group but inferential stats are only done once and can be used to make statements about the differences between groups
ideal sampling design
- all sample units have a probability of being included
- selection of sampling units must be unbiased
- selection of sampling units is independent
- each possible sample has an equal chance of being selected
observational studies
- researchers have no control over the variables
- it characterizes something about an existing statistical pop
- a tool for discovering associations, but cannot make statements about the involvement of the sampling unit (cannot establish causation because there is no way to know if the factor is governed by something else)
response variables
variable the investigators are interested in
explanatory variable
variable that the investigator believes may explain the response variable
confounding variables
unobserved variables that affect the response variable
simple random
starts by identifying every sampling unit in the statistical population and then selecting a random subset of those to be in your sample. Each sampling unit has the same probability of being included in your sample.
stratified
used when the statistical population has some grouping (strata)
clustered
observation units are contained within a larger group that we can randomly sample (geographical or organizational)
case control
when there is a known outcome we are trying to explain
cohort
select a sampling unit, follow them through time to see if they developed the result we want
retrospective
studies where the results are already known
ie. case control studies
prospective
outcome is not yet known
ie. cohort studies
cross-sectional
study a response variable at only a single snap shot of time
ie. simple random
longitudinal surveys
study a response variable at multiple points of time
which of the following distinguishes case control from cohort surveys
a. Whether the survey is cross-sectional or longitudinal
b. Whether strata are defined ahead of time or not
c. Whether the survey design is retrospective or prospective
d. Whether clusters of observation units were selected at random or not
c. Whether the survey design is retrospective or prospective (correct)
Which of the following distinguish stratified from clustered surveys?
Whether the survey is cross-sectional or longitudinal
Whether strata are defined ahead of time or not
Whether the survey design is retrospective or prospective
Whether clusters of observation units were selected at random or not
Whether strata are defined ahead of time or not
You design a study where you randomly select 10 car models from within each category of electric, hybrid electric-gas, gasoline, or diesel. For each model, you find the purchase cost and estimate how much it will cost you to drive the vehicle for the next 10 years. What type of survey design is this?
Stratified survey
Your children are young teenagers and you hear them listening to an entirely new genre of music called Korean Pop. You are curious whether it is just your kids that are listening to Korean Pop or if other kids their age are as well. You decide to find out by approaching 15 parents at the next Parent Teacher Night. Being a bit of a statistical geek, you mentally number each of the parents while they are talking to teachers. You pull out your cell phone with a list of random numbers and use these numbers to randomly select the parents that you approach to ask. What type of survey design is this?
Simple random survey
You are a researcher interested in the rates of mental illness in Canadian cities. You randomly select 120 cities across Canada, and conduct a survey of each to get a single estimate of per capita incidence of mental illness. The design of this surveying method is best characterized as:
cluster survey
corner stone of experimental studies
replication
number of sample units =?
number of replicates
pseudoreplicates
an error in the design of an experimental study where the observation units are analyzed instead of the sampling units
the common design elements/types
- control
- blocking
- blinded (single and double)
- placebo
- sham treatment
control treatment
reference treatment to compare against the treatment levels
blocking
used to control variation among the sampling units (similar to stratified sampling it forms subgroups or “blocks”)
single blinded
when the sampling unit does not know what treatment they are being exposed to
double blinded
both researcher and sample unit are unaware
placebo
often used in medical trials as the control treatment that helps accomplish a blinded design (has no effect)
sham treatment
method used in control treatments, accounts for the effect of delivery of a treatment that is not of interest
compare and contrast between sham and treatment
Imagine a study that evaluates the effectiveness of different over-the-counter pain relievers in alleviating the symptoms of arthritis: acetaminophen, ibuprofen and acetylsalicylic acid. Two hundred patients are randomly assigned to receive one of these three pain relievers, or to receive a placebo (control). How many factors and levels are evident in this study?
1 factor with 4 levels
Blinding patients to the experimental treatment is a crucial part of a randomized clinical trial. Why?
Reduces the possibility of placebo effects
Reduces biases in measurements stemming from the anticipation of a treatment effect
What is the reason for blinding the researcher to what experimental treatment a patient is going to receive?
Reduces biases in measurements stemming from the anticipation of a treatment effect
Reduces the possibility of placebo effects
What design characteristic distinguishes experimental studies from observational studies?
Whether sampling units are randomly assigned to treatments or not.
A researcher studied the effect of the prescription drug raloxifene on fracture risk in postmenopausal women. They found that women who took raloxifene over a five year period reduced their risk of clinical vertebral fracture compared to women who did not take the drug. What are the factors and levels in this experiment?
There is one Factor (drug) with two Levels (raloxifene, no raloxifene).
variable
any measurable characteristic of an observation
datum
value of the variable
continuous numerical variable
can take on any value (1.2 or 1/4 etc.)
discrete numerical
can only be whole numbers
ordinal categorical variable
can take on qualitative values but the values are on a ranked scale
nominal categorical variable
takes on qualitative values but they do not have any particular order
eg. types of fruit
What is the data type for describing your age
Continuous numerical
What is the data type for the description: child, teenager, adult?
Ordinal categorical
What is the data type for the number of students in a class?
Discrete numerical
What is the data type for the letter grade on your exam?
Ordinal categorical
What is the data type for the percent grade on your exam?
Continuous numerical
central tendency
describes the typical values in our sample (eg. mean)
the second quartile
dispersion
describes the spread of the values
counts
categorical variable
number of observations in your sample that fall within a particular category
proportions
percentages
variance
variance measures the amount of variation
the average squared distance of each data point from the sample mean
σ^2
calculating variance
calculate the mean
find the diff between each data point and the mean
square the value
sum the squares and divide by the # of observation points
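The four steps above can be sketched in Python. This is only an illustration: the function name `population_variance` is made up for this example, and it divides by the number of observations as the steps describe (a sample variance would divide by n - 1 instead).

```python
def population_variance(values):
    """Average squared distance of each data point from the mean."""
    n = len(values)
    mean = sum(values) / n                   # step 1: calculate the mean
    diffs = [x - mean for x in values]       # step 2: difference from the mean
    squares = [d ** 2 for d in diffs]        # step 3: square each difference
    return sum(squares) / n                  # step 4: sum and divide by n


data = [7.5, 8.6, 8.9, 8.5, 9.4, 10.7, 15.1]
print(round(population_variance(data), 1))   # prints 5.5
```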
Quartiles
ranked bins of data
1. sort from lowest to highest
finding the second quartile
split the data in half, according to
a. if you have an odd number of observations, quartile 2 is the middle value
b. if you have an even number of observations, quartile 2 is the average of the two middle values
finding the first quartile
subset the lower-valued half of observations, then use the rules in the second quartile to find the middle value
note the 2nd quartile is included if the # of observations is odd
3rd quartile
repeat steps for quartile 1 in the upper valued half
dispersion aka interquartile range
range of inner-most 50% of the data
between Q1 and Q3 (Q3-Q1)
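A minimal sketch of the quartile rules described above, assuming the convention from these cards that the middle value is included in both halves when the number of observations is odd. The function names are illustrative, and other textbooks and software use slightly different quartile conventions.

```python
def median(sorted_values):
    """Middle value, or the average of the two middle values."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the half-splitting rules."""
    data = sorted(values)                # 1. sort from lowest to highest
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        lower = data[:half + 1]          # odd n: the median is included
        upper = data[half:]
    else:
        lower = data[:half]
        upper = data[half:]
    return median(lower), median(data), median(upper)


def iqr(values):
    q1, _, q3 = quartiles(values)
    return q3 - q1


print(round(iqr([10.1, 18.6, 19.8, 15.7, 21.9, 12.9, 11.8, 26.0, 13.0, 12.9]), 1))  # prints 6.9
```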
Calculate the mean & median of the following data:
7.5 9.9 8.6 10.3 8.5 9.4 15.1
Mean is 9.9, median is 9.4
Would the mean or median be a better descriptor of the ‘middle’ value for this set of data?
7.5 9.9 8.6 10.3 8.5 9.4 15.1
Median
Calculate the population variance & interquartile range (IQR) of the following data:
7.5 8.6 8.9 8.5 9.4 10.7 15.1
Variance is 5.5, IQR is 1.5
Calculate the interquartile range (IQR) for the following set of numbers and indicate what range the answer lies within.
10.1, 18.6, 19.8, 15.7, 21.9, 12.9, 11.8, 26.0, 13.0, 12.9
5 ≤ ANSWER < 7
Calculate the interquartile range (IQR) for the following set of data and indicate what range the answer lies within.
46.7, 18.7, 39.4, 7.2, 19.8, 42.1, 2.6, 17.1, 30.7, 21.9
19 ≤ ANSWER <23
meaningfulness
the difference among groups important to your study
effect size
whether the change in the response variables is meaningful for a practical study
The rate of home ownership in Canada decreased from 46% in 2004 to 44% in 2011. What is the effect size as a difference between the years?
-2%
do relative effect sizes have units
no
In the United Kingdom, 56% of older adults (55+ years) get their news from the television whereas only 12% of youth (18-24 years) do. What is the relative effect size of youth compared to older adults?
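4.7 (0.56 / 0.12), i.e. older adults are about 4.7 times as likely as youth to get their news from television; youth relative to older adults would be 0.12 / 0.56 ≈ 0.21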
absolute effect size
the actual difference in outcomes
ie. 80%-60%=20%
relative effect size
Relative effect size compares the outcomes between two groups as a ratio or percentage.
(80% / 60%) = 1.33, or a 33% increase
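The two effect sizes can be computed side by side. A small sketch using the 80% and 60% figures from the cards above; the function names are illustrative.

```python
def absolute_effect(outcome_a, outcome_b):
    """Difference in outcomes; keeps the units of the outcome."""
    return outcome_a - outcome_b


def relative_effect(outcome_a, outcome_b):
    """Ratio of outcomes; unitless."""
    return outcome_a / outcome_b


abs_es = absolute_effect(0.80, 0.60)
rel_es = relative_effect(0.80, 0.60)
print(round(abs_es, 2), round(rel_es, 2))   # prints 0.2 1.33
```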
marginal distributions
sum the values in each row
sum the values in each column
in the last box add up every row and column, this helps make proportions
shows how many sampling units are in each level of one categorical variable
good way to describe patterns
conditional distributions
shows the relationship between the columns and the rows
take the value of the cell you are interested in and divide by the total amount of the column or row
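A short sketch of marginal and conditional distributions for a two-way table. The counts below are hypothetical, invented purely for illustration (snack preference by card game), and the variable names are not from the course.

```python
# Hypothetical two-way table: rows are games, columns are snacks.
table = {
    "poker":  {"cookies": 12, "chips": 28},
    "bridge": {"cookies": 30, "chips": 10},
}

# Marginal distributions: sum each row and each column.
row_totals = {game: sum(snacks.values()) for game, snacks in table.items()}
column_totals = {}
for snacks in table.values():
    for snack, count in snacks.items():
        column_totals[snack] = column_totals.get(snack, 0) + count
grand_total = sum(row_totals.values())

# Conditional distribution of snack given game: cell / row total.
conditional = {
    game: {snack: count / row_totals[game] for snack, count in snacks.items()}
    for game, snacks in table.items()
}

print(row_totals, column_totals, grand_total)
print(conditional["poker"]["cookies"])   # proportion liking cookies among poker players
```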
characteristics of single variable bar graphs
- gaps show the levels are categorical
- which ever variable you are most interested in goes on the x axis
- each bar is a level
two variable bar graphs
- visualizes interactions between data sets
types of two variable bar graphs
grouped bar graph
stacked bar graph
histograms
bars are side by side (no gap)
represent a small numerical range
box plots and its parts
based on quartiles and used when you have numerical data and categorical groups
- whiskers
- median: solid line
- box: drawn from the first quartile to the 3rd
- extreme threshold
whiskers
drawn from the box to the last data point before the extreme threshold
extreme thresholds
Q3 + (1.5IQR) and Q1-(1.5IQR)
scatter plot
when you have two numerical variables and you want to look at the relationship between them
x axis is the independent variable
y axis is the dependent variable
in an observational study the x and y axis are covariates
line plots
two numerical variables that have been measured repeatedly from the same sampling unit
each line is a different sampling unit
Identify which type of summary information would answer the following question “What proportion of people like cookies when playing poker?”
Conditional distribution with game as the primary variable
standard normal distribution
z = (x - μ)/σ
sample space
set of all possible outcomes
event
a subset of a sample space (2,4,6 of 1 through 6)
random trial
procedure or action that produces one outcome from a set of possible outcomes, where the result is uncertain and cannot be predicted in advance.
frequentist probability
probability based on the frequency of events occurring in repeated experiments or trials
P(A) = (number of times event A occurs) / (total number of trials)
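The frequentist calculation can be sketched with a recorded set of trials. The ten die rolls below are hypothetical, and the event is the "2, 4, 6" example used earlier in these cards.

```python
# Hypothetical record of ten die rolls (each roll is one random trial).
rolls = [2, 5, 6, 1, 4, 4, 3, 6, 2, 5]

# Event A: rolling an even number, a subset of the sample space {1, ..., 6}.
event_a = {2, 4, 6}

times_a_occurred = sum(1 for roll in rolls if roll in event_a)
p_a = times_a_occurred / len(rolls)
print(p_a)   # prints 0.6
```

With more trials the observed frequency would be expected to settle near the long-run probability of the event.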
random variable
numerical outcome of a random phenomenon. It assigns a number to each outcome in a sample space, allowing for the analysis of probabilities associated with different outcomes.
probability distribution
the probability of different possible values of a variable.
discrete distributions
a function that gives the probability of a discrete random variable, X, being exactly equal to some value
define bias and sampling independence
systematic error in a study or analysis that leads to incorrect conclusions or inferences about a population.
the selection of one sample unit does not influence the selection of another.
4 goals of an ideal sampling design
- all sampling units are selectable
- selection is unbiased
- selection is independent
- all samples are possible
spurious relationships
a situation where two variables appear to be correlated with each other but, in fact, are not directly related
one way contingency table
are for data with a single categorical variable and are shown as a one-dimensional table of columns.
two way contingency table
are for data with two categorical variables and are shown as a two-dimensional table of rows and columns.
You have been asked by a regional conservation authority to design a study to evaluate the risk that a tick will bite someone walking at one of the parks. They provide you enough money to survey 15 parks out of the 60 that are in the region. Your plan is to spend a day at each of the selected parks and survey all the people leaving the park to assess whether a tick bit them or not. You will then calculate the proportion of people bitten for each park sampled.
the 60 parks in the region
According to USA Today (Dec 30, 1999), the average age of viewers of MSNBC cable television news programming is 50 years old. A Canadian network executive thinks this might not be true in Canada, and believes that the average age of these viewers in Canada is significantly less than 50 years old.
To test her hypothesis, the Canadian executive obtains a list of Bell satellite subscribers who included MSNBC in their channel package, and then conducts a phone poll of 2,000 of these subscribers across Canada. Anyone called who reports not watching MSNBC news programming at least once a week is left out of the survey; in the end 287 respondents watch MSNBC news programming at least weekly, and report their ages as part of the survey.
What is the variable of interest?
viewer age
According to USA Today (Dec 30, 1999), the average age of viewers of MSNBC cable television news programming is 50 years old. A Canadian network executive thinks this might not be true in Canada, and believes that the average age of these viewers in Canada is significantly less than 50 years old.
To test her hypothesis, the Canadian executive obtains a list of Bell satellite subscribers who included MSNBC in their channel package, and then conducts a phone poll of 2,000 of these subscribers across Canada. Anyone called who reports not watching MSNBC news programming at least once a week is left out of the survey; in the end 287 respondents watch MSNBC news programming at least weekly, and report their ages as part of the survey.
What is the statistical population for this study?
all at-least-weekly Canadian Viewers of MSNBC news programming who watch using bell satellite
A medical study wants to relate consumption of fat to heart conditions. 100 patients with heart conditions are randomly selected from clinics in the Kingston area, and each patient is asked to track their food consumption for 6 weeks. After the six weeks, each patient’s heart health is evaluated using a standard array of tests (blood pressure, heart rate, ECG, etc.)
What term best describes each patient in this study design?
both sampling and observation unit
An ornithologist at Queen’s University is studying the development time of recently hatched black-capped chickadees on Wolfe Island. He randomly samples 20 nests from across the island and measures the weight of each new hatchling in the nest. He repeats this sampling after 1 week, and then again after 2 weeks.
What term best describes each nest included in this study?
sampling unit
Lyme disease is caused by the bacterium Borrelia burgdorferi, carried primarily by black-legged ticks. A recent study assessed the percentage of black-legged ticks that carry Borrelia from 10 random sites across North America spanning a range of mean annual temperatures. The number of ticks carrying Borrelia was quantified by collecting 100 ticks from each site and screening each tick for the bacterium (either YES or NO). The goal was to quantify the relation between annual temperature among sites and the percentage of ticks with Borrelia.
What is the observation unit in this study?
the individual tick
A medical study wants to relate consumption of fat to heart conditions. 100 patients with heart conditions are randomly selected from clinics in the Kingston area, and each patient is asked to track their food consumption for 6 weeks. After the six weeks, each patient’s heart health is evaluated using a standard array of tests (blood pressure, heart rate, ECG, etc.)
What term best describes the beats per minute of heart rate in this study design?
measurement unit
You are interested in the growth potential of a new seed variety. You gather a random selection of 1,000 seeds from a field where the new variety is growing, and measure the final height of all the resulting plants.
What kind of study design is this?
simple random
You are the quality assurance manager for a company that produces toasters. In post-production testing, you find that more toasters are failing than expected; the cause or source of the failures is not immediately clear though.
You ask your intern to gather a random selection of failed toasters, and a selection of toasters that do not fail in testing, and then to trace all those toasters back through the production process (employees that did which installation, source of the particular components, etc.)
What kind of study design is this?
case control study
A psychology professor recruits 50 randomly selected Queen’s undergraduates, and asks them to recommend friends who would also be willing to participate in an introvert/extrovert personality study; overall, 93 students complete the study.
The results are 73% of the students are extroverts, 17% are introverts, and 10% are a mix.
What would the biggest concern or risk be about this sampling strategy?
sample unit selection is not independent
A medical experiment, in which a treatment group is compared to a control group, is carried out to reduce the effect of
confounding factors
Consider a survey being designed for customers of a tour company in Paris.
Determine whether the possible responses to the following question on their survey should be classified as categorical, continuous numerical or discrete numerical.
“How many escorted vacations have you taken prior to this one?”
discrete numerical
Determine whether the possible responses to the following question should be classified as categorical, discrete numerical or continuous numerical.
“Whether you are a Canadian citizen.”
categorical
Determine whether the possible responses to the following question should be classified as categorical, discrete numerical or continuous numerical.
“The number of students in a statistics course.”
discrete numerical
number of observation units in a table
number of rows
number of variables in a table
number of columns
Customers finishing a free sample at Costco are asked to complete a survey asking whether they would be “Very interested”, “Interested” or “Not interested” in buying the food product in the future. In one day, 357 customers complete the survey.
What graph type would be most appropriate for displaying the resulting data all at once?
a bar graph
two way contingency table
What is the sample space for determining the probability of drawing a Jack of Clubs from a deck of cards in a game of poker?
list of all cards in a deck
What is the event for drawing an ace from a deck of cards in a game of poker?
list of all aces
Which of the following statements reflects a correct definition of probability?
There is a good probability of rain tomorrow
Roughly 1 in a million people have won a national lottery over hundreds of draws, which means the probability is p=0.000001.
The probability that a product fails can be calculated directly from repeated testing in a factory.
The probability that I will buy my lunch today is 100%
Roughly 1 in a million people have won a national lottery over hundreds of draws, which means the probability is p=0.000001. (correct)
The probability that a product fails can be calculated directly from repeated testing in a factory. (correct)
The probability that I will buy my lunch today is 100% (correct)
Which of the following statements describe a random trial?
The weight of an orange is measured in grams.
Observing how much a random shopper spent in a particular store.
Playing a ‘scratch and win’ lottery ticket.
Finding out that your neighbour won a million dollars in the lotto
Playing a crossword puzzle
Rolling a die in a board game
Observing how much a random shopper spent in a particular store. (correct)
Playing a ‘scratch and win’ lottery ticket. (correct)
Rolling a die in a board game (correct)
Would the following be a continuous or discrete distribution? ‘Length of time between shots on net in a soccer game’
Continuous distribution
Would the following be a continuous or discrete distribution? ‘Number of shots on net in a soccer game’
Discrete distribution
Which of the following statements about probability distributions are TRUE?
Can be used to describe both discrete and continuous numerical variables
The area beneath the function always sums to one
The y-axis of a continuous distribution is called probability mass
The x-axis is the outcome, or event, of interest
Probability distributions show the probability of some events, but they do not have to account for all possible events from a random trial.
Can be used to describe both discrete and continuous numerical variables (correct)
The area beneath the function always sums to one (correct)
The x-axis is the outcome, or event, of interest (correct)
Which of the following statements about probability distributions are FALSE?
The probability of a single event in a continuous distribution is always zero
The probability of a single event in a discrete distribution is always zero
Regardless of whether the distribution is discrete or continuous, probability is the area under the curve.
Probability distributions cannot be used for a range of events.
The probability of a single event in a discrete distribution is always zero (correct)
Probability distributions cannot be used for a range of events. (correct)
Null hypothesis
statement or position that is the skeptical view-point of the research question.
Null distribution
sampling distribution from an imaginary statistical population where the null hypothesis is true
statistical significance
conclusion that is unlikely to come from the null
hypothesis testing
used to evaluate statistical significance
P
the probability of seeing your data, or something more extreme, under the null hypothesis
helps quantify the evidence against the null hypothesis
It measures how compatible your data is with the assumption that the null is true.
If α = 0.05, a p-value below 0.05 means rejecting H0 is justified.
p = 0.03, α = 0.05
The result is statistically significant because p < 0.05. You reject H0.
p = 0.10, α = 0.05
The result is not statistically significant because p > 0.05. You fail to reject H0.
type one error rate
probability of rejecting the null when it is true (false positive)
type two error
probability of failing to reject the null when it is false (false negative)
error rates
probability of making a mistake
population parameters
descriptive statistics of the statistical population (not of the sample)
quantifiable characteristics of a statistical pop
labeled using the Greek alphabet
values are fixed
sampling distributions
shape is independent of the statistical pop if the sample size is sufficiently large
bell shaped curve
taking the mean of multiple sampling units averages out asymmetries in the statistical population
the variance of a sampling distribution increases as the # of sampling units decreases
central limit theorem
given a sufficiently large sample size, the distribution of the sample mean will approximate a normal distribution, regardless of the original population’s distribution
standard error can be calculated from the sd of the statistical pop and the sample size
SE = σ / sqrt(n), where σ is the sd of the statistical pop and n is the sample size
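A minimal sketch of the standard error calculation, showing how it shrinks as the sample size grows. The function name and the example numbers (sd of 10, sample sizes of 25 and 100) are illustrative only.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)


print(standard_error(10, 25))    # prints 2.0
print(standard_error(10, 100))   # prints 1.0
```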
student t distribution
shape depends on size of sample (influential when size is small)
has fatter tails to account for the uncertainty in estimating the sd
continuous probability distribution
sample size is small, and the population standard deviation is unknown.
As df increases, the t-distribution approaches the normal distribution.
confidence intervals
the range over a sampling distribution that brackets the center most probability of interest
confidence interval formulas
t = (x-m)/SE
x = m + t * SE
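The second formula can be read as a recipe for the interval bounds: the mean minus and plus t times SE. A minimal sketch, assuming the critical t value is looked up separately (2.776 is the two-sided 95% value for 4 degrees of freedom); the function name and the sample numbers are hypothetical.

```python
def confidence_interval(mean, se, t_crit):
    """Return (lower, upper) bounds: mean -/+ t_crit * SE."""
    margin = t_crit * se
    return mean - margin, mean + margin


# Hypothetical sample: mean 3.0, SE 0.7071, n = 5 (df = 4)
lower, upper = confidence_interval(3.0, 0.7071, 2.776)
print(round(lower, 2), round(upper, 2))   # prints 1.04 4.96
```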
single sample t-test
evaluates if the mean of your sample is different from some reference value
compares numerical variable to a reference
(sample mean - reference) / SE
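A sketch of the single sample t-score using only the Python standard library. It assumes the SE is estimated from the sample standard deviation (n - 1 denominator); the function name, data, and reference value are made up for illustration.

```python
import math
import statistics


def one_sample_t(values, reference):
    """t = (sample mean - reference) / SE, with SE = s / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)   # sample sd uses n - 1
    return (mean - reference) / se


t_score = one_sample_t([1, 2, 3, 4, 5], reference=2)
print(round(t_score, 3))   # prints 1.414
```

The resulting t-score would then be compared against a t distribution with n - 1 degrees of freedom.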
paired sample t-test
if the difference in paired data of numerical variables is different from some reference value
looks at how sampling units change across factors
t= (mean of differences-reference)/SE
two sample t-test
determines if the means of two groups are different from each other
(m1-m2)/SEs
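A sketch of the two sample t-score. It assumes the pooled-variance form (equal variances in both groups), which matches the n1 + n2 - 2 degrees of freedom listed in these cards; the function name and data are hypothetical.

```python
import math
import statistics


def two_sample_t(group1, group2):
    """Return (t, df) for a pooled-variance two sample t-test."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


t_score, df = two_sample_t([1, 2, 3], [2, 4, 6])
print(round(t_score, 3), df)   # prints -1.549 4
```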
contingency table
summarized categorical data
expected contingency table
the contingency table of expected frequencies under the null hypothesis
compare observed vs. expected
expected 1-way table
one categorical variable with levels
sum of observed counts must be the same as expected
expectation counts are distributed equally
is there a difference in counts among the level of that variable?
expected 2-way table
two categorical variables
expected counts are distributed independently
are the counts independent between variables?
calculating independence
calculate marginal distribution ………
calculating expected frequencies
(row total * column total) / table total, do it for each cell
Chi-square test
used to determine whether there is a significant association between categorical variables or whether observed data matches expected data under a certain hypothesis. It works by comparing observed frequencies (data collected) to expected frequencies (based on a hypothesis).
chi-square distribution
distribution of chi-square scores expected from repeatedly sampling a statistical pop where the null is true
can only have positive values (square everything)
shape will vary depending on df’s
calculating chi-square (X^2)
take the difference between each observed and expected cell
square the difference
divide by the expected value
sum over all cells in the table
dfs for 1 - way tables
n-1
dfs for 2-way tables
(r-1)(c-1)
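The expected frequencies, chi-square score, and degrees of freedom for a 2-way table can be sketched together. The observed counts are hypothetical and the function names are illustrative.

```python
def expected_counts(observed):
    """Expected cell = (row total * column total) / table total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square(observed, expected):
    """Sum over all cells of (observed - expected)^2 / expected."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )


observed = [[10, 20], [30, 40]]                        # hypothetical 2 x 2 table
expected = expected_counts(observed)                   # [[12.0, 18.0], [28.0, 42.0]]
score = chi_square(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)      # (r-1)(c-1)
print(expected, round(score, 3), df)
```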
names for the variable used to explain the change in the outcome of an experiment
X - Variable
independent variable
predictor variable
names for the variable used to explain the change in the outcome of an observational study
the x variable
the predictor variable
The relationship between number of beers consumed (x) and blood alcohol content (y) was studied in 16 adults by using linear regression. The following regression equation was obtained from the study:
y= -0.0127 + 0.0180x
If a individual had 4 beers and scored a blood alcohol content of 0.085, what is their residual variation?
+0.0257 (correct)
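The residual in this answer is the observed value minus the value predicted by the regression equation. A worked check in Python, using the numbers from the question; the variable names are illustrative.

```python
intercept = -0.0127
slope = 0.0180

beers = 4
observed_bac = 0.085

predicted_bac = intercept + slope * beers    # -0.0127 + 0.0720 = 0.0593
residual = observed_bac - predicted_bac      # observed minus predicted
print(round(residual, 4))                    # prints 0.0257
```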
Linearity
response variable is a linear function of the predictor variable (well described by a linear relationship)
the effect of the predictor variable on the response is additive and proportional
normality
assumption that residuals are normally distributed
Independence
assumes that the residuals are sequentially independent of each other (vary between + and - numbers seemingly at random)
when residuals are not independent there will be runs of adjacent positive or adjacent negative residuals
prevent violations by making sure units are selected at random and independently of each other
Homoscedasticity
the variance of residuals (errors) should be constant across all levels of the predictor variable (spread should be equal)
bivariate normal distribution
3D normal distribution graph depicted as contours
Pearsons correlation coefficient
r (sample) or ρ (rho, population)
measures the strength of association
ρ = -1, ρ = 0, ρ = 1 (perfect negative, no, perfect positive association)
linear regression
evaluates if changes in one numerical variable can predict changes in another
linear regression equation
y = a (intercept) + b (slope) x
systematic component
describes the function used for predictions
random component
describes the probability distribution for sampling error ( only occurs in the y variable)
link function
connects the systematic to the random component
3 parts of the statistical model
systematic component
random component
link function
minimizing residual variance
calculate residual for each data point
take the square of each residual
sum the squared residuals across all data points
divide by dfs (n-2)
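A sketch of the residual variance steps above, paired with an ordinary least-squares fit (the standard way of choosing the intercept and slope that minimize the summed squared residuals). The data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def residual_variance(xs, ys, a, b):
    """Sum of squared residuals divided by df = n - 2."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]   # residual per point
    squared = [r ** 2 for r in residuals]                    # square each residual
    return sum(squared) / (len(xs) - 2)                      # sum, divide by n - 2


xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3), round(residual_variance(xs, ys, a, b), 3))
```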
what are the four steps to the hypothesis test
define the null and alternative hypothesis
establish the null distribution
conduct the statistical test
draw scientific conclusions
F-test
determines the ratio of variance between two variables (no difference in variance, F = 1)
which sum of squares measures the variability of the observed values of the response variable around their respective treatment means in ANOVA
residual variation (MSE) (correct)
contrast statement
test the difference in means between groups in an ANOVA test
post Hoc test
secondary test used to evaluate what groups have different means in ANOVA
only used if the F-test indicates to reject the null hypothesis
TukeyHSD test
compares the means of all possible combinations of categorical levels in an ANOVA
controls the family wise error rate by using a specialized null distribution that accounts for the number of contrasts
family wise error rate
type 1 error rate for the family of contrasts
used to evaluate the adjusted p-values returned from the TukeyHSD test
P>FWER (0.05) we fail to reject
P<FWER (0.05) we reject
Two factor ANOVA
looks at the effect of two categorical variable on a numerical variable
main effects A
questions about the differences among the levels of factor A averaging across the levels of factor B. These are comparisons among full columns
main effects B
questions about the differences among the levels of factor B averaging across the levels of factor A. These are comparisons among full rows
Interactions
differences among the levels of one factor with each level of the other factor
deviation from the assumption that the levels of each factor simply add together
additivity
response from the two variables is the sum of the two
synergistic interaction
response is more than the two variables added together
antagonistic interaction
response is less than the two variables combined
What does a significant AB interaction mean in a two-way ANOVA?
The effect of factor A depends on the level of factor B. (correct)
What type of sum of squares measures the variability of the observed values of the response variable around their respective cell means?
residual
Mean sum of squares for groups
MSG = SSG (sum of squares for groups) / dfG (k-1)
k = number of groups
mean square Error
residual variation
MSE = SSE / dfE (n-k)
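MSG and MSE combine into the F-score for a single-factor ANOVA. A minimal sketch with hypothetical data for three groups; the function name is illustrative.

```python
def one_way_f(groups):
    """Return (MSG, MSE, F) for a list of groups of numerical values."""
    k = len(groups)                                # number of groups
    n = sum(len(g) for g in groups)                # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # SSG: variation of the group means around the grand mean.
    ssg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # SSE: variation of the observations around their own group mean.
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    msg = ssg / (k - 1)
    mse = sse / (n - k)
    return msg, mse, msg / mse


msg, mse, f_score = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(msg, 2), round(mse, 2), round(f_score, 2))   # prints 13.0 1.0 13.0
```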
what happens when the sample size increases
variance reduces
standard error becomes smaller
population distribution
distribution values produced from the measurement of some parameter about each individual of a population
If the coefficient of correlation r = ± 1, then the best-fit linear equation will actually include all of the data points?
true
The coefficient of correlation r is a number that indicates the direction and the strength of the relationship between the variable y and the variable x?
true
We anticipate a small P value for an ANOVA F statistic if the box plots for the samples are
wide and similarly located
narrow and located differently
identical
symmetrical
wide and have similar medians
t distributions can be used to test whether the difference between two sample means is different from zero?
true
df formulas
K-1: variation between groups (ANOVA, MSG)
N-K: variation within groups (ANOVA, residual variation (MSE))
n-1: one-way table
n-2: confidence intervals and residual analysis
(r/a-1)(c/b-1): 2-way table
ab(n-1): residual analysis (variation among sampling units within a cell)
n1+n2-2 = two sample t-test
what is the F-score
the ratio of the variation among categorical groups divided by the residual variation within a group
what is the null distribution of the F-score
represents the variation in a ratio you would expect from repeated sampling of a population where there was no true difference in means.