STATS (BIOL 243) FALL 2024 Flashcards

1
Q

five Hierarchical scales

A

sampling unit
sample
observation unit
statistical population
population of interest

2
Q

sampling unit

A

the unit that is selected at random; it may be the same as the observation unit or may contain multiple observation units

3
Q

sample

A

collection of the sampling units

4
Q

observation unit

A

scale of data collection, subject of study

5
Q

statistical population

A

collection of all sampling units that could have been in your sample, and represents the true scale in which your statistical conclusions are valid

6
Q

population of interest

A

collection of sampling units that you hope to draw conclusions about

scope of the research question

ideally the same as your statistical population

7
Q

measurement variable

A

what we want to know/measure about the observation unit

8
Q

measurement unit

A

the scale (units) in which the measurement variable is recorded

9
Q

descriptive stats

A

set of tools used to describe data

10
Q

inferential statistics

A

uses information from the sample to make a probabilistic statement about the statistical population

11
Q

what is the rule for descriptive and inferential stats when there are multiple groups in a statistical population?

A

descriptive stats are repeated for each group but inferential stats are only done once and can be used to make statements about the differences between groups

12
Q
A
13
Q

ideal sampling design

A
  1. all sampling units have a non-zero probability of being included
  2. selection of sampling units must be unbiased
  3. selection of sampling units is independent
  4. each possible sample has an equal chance of being selected
14
Q

observational studies

A
  • researchers have no control over the variables
  • it characterizes something about an existing statistical pop
  • a tool for discovering associations, but it cannot establish causation, because there is no way to know whether the factor is governed by something else
15
Q

response variables

A

variable the investigators are interested in

16
Q

explanatory variable

A

variable that the investigator believes may explain the response variable

17
Q

confounding variables

A

unobserved variables that affect the response variable

17
Q

simple random

A

starts by identifying every sampling unit in the statistical population and then selecting a random subset of those to be in your sample. Each sampling unit has the same probability of being included in your sample.
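
A minimal Python sketch of simple random sampling, assuming the statistical population can be listed as ID numbers. The names `population_ids` and `simple_random_sample`, the population size of 60, and the sample size of 15 are illustrative assumptions, not part of the course material.

```python
import random

def simple_random_sample(population_ids, n, seed=None):
    """Draw n sampling units without replacement; each unit is equally likely."""
    rng = random.Random(seed)  # the seed only makes this illustration repeatable
    return rng.sample(list(population_ids), n)

# Hypothetical statistical population of 60 sampling units, numbered 1..60
population_ids = range(1, 61)
sample = simple_random_sample(population_ids, n=15, seed=1)
print(sorted(sample))
```

Sampling without replacement means no sampling unit can appear twice in the sample.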

18
Q

stratified

A

used when the statistical population has some grouping (strata)

19
Q

clustered

A

observation units are contained within a larger group that we can randomly sample (geographical or organizational)

20
Q

case control

A

when there is a known outcome we are trying to explain

21
Q

cohort

A

select sampling units and follow them through time to see whether they develop the outcome of interest

22
Q

retrospective

A

studies where the results are already known
ie. case control studies

23
Q

prospective

A

outcome is not yet known
ie. cohort studies

24
cross-sectional
study a response variable at only a single snapshot in time, e.g., a simple random survey
25
longitudinal surveys
study a response variable at multiple points of time
26
Which of the following distinguishes case control from cohort surveys? a. Whether the survey is cross-sectional or longitudinal b. Whether strata are defined ahead of time or not c. Whether the survey design is retrospective or prospective d. Whether clusters of observation units were selected at random or not
c. Whether the survey design is retrospective or prospective (correct)
27
Which of the following distinguish stratified from clustered surveys? a. Whether the survey is cross-sectional or longitudinal b. Whether strata are defined ahead of time or not c. Whether the survey design is retrospective or prospective d. Whether clusters of observation units were selected at random or not
b. Whether strata are defined ahead of time or not
28
You design a study where you randomly select 10 car models from within each category of electric, hybrid electric-gas, gasoline, or diesel. For each model, you find the purchase cost and estimate how much it will cost you to drive the vehicle for the next 10 years. What type of survey design is this?
Stratified survey
28
Your children are young teenagers and you hear them listening to an entirely new genre of music called Korean Pop. You are curious whether it is just your kids that are listening to Korean Pop or if other kids their age are as well. You decide to find out by approaching 15 parents at the next Parent Teacher Night. Being a bit of a statistical geek, you mentally number each of the parents while they are talking to teachers. You pull out your cell phone with a list of random numbers and use these numbers to randomly select the parents that you approach to ask. What type of survey design is this?
Simple random survey
29
You are a researcher interested in the rates of mental illness in Canadian cities. You randomly select 120 cities across Canada, and conduct a survey of each to get a single estimate of per capita incidence of mental illness. The design of this surveying method is best characterized as:
cluster survey
30
corner stone of experimental studies
replication
31
number of sample units =?
number of replicates
31
pseudoreplicates
an error in the design of an experimental study where the observation units are analyzed as if they were the sampling units (replicates)
32
the common design elements/types
1. control 2. blocking 3. blinded (single and double) 4. placebo 5. sham treatment
33
control treatment
reference treatment to compare against the treatment levels
34
blocking
used to control variation among the sampling units (similar to stratified sampling it forms subgroups or "blocks")
35
single blinded
when the sampling unit does not know what treatment they are being exposed to
36
double blinded
both the researcher and the sampling unit are unaware of which treatment is being applied
37
placebo
often used in medical trials as the control treatment that helps accomplish a blinded design (has no effect)
38
sham treatment
method used in control treatments; it accounts for the effect of the delivery of a treatment, which is not itself of interest, so that the sham and the treatment can be compared and contrasted
39
Imagine a study that evaluates the effectiveness of different over-the-counter pain relievers in alleviating the symptoms of arthritis: acetaminophen, ibuprofen and acetylsalicylic acid. Two hundred patients are randomly assigned to receive one of these three pain relievers, or to receive a placebo (control). How many factors and levels are evident in this study?
1 factor with 4 levels
40
Patients who are blinded to the experimental treatment is a crucial part of a randomized clinical trial. Why?
Reduces the possibility of placebo effects Reduces biases in measurements stemming from the anticipation of a treatment effect
41
What is the reason for blinding the researcher to what experimental treatment a patient is going to receive?
Reduces biases in measurements stemming from the anticipation of a treatment effect Reduces the possibility of placebo effects
42
What design characteristic distinguish experimental studies from observational studies?
Whether sampling units are randomly assigned to treatments or not.
43
A researcher studied the effect of the prescription drug raloxifene on fracture risk in postmenopausal women. They found that women who took raloxifene over a five year period reduced their risk of clinical vertebral fracture compared to women who did not take the drug. What are the factors and levels in this experiment?
There is one Factor (drug) with two Levels (raloxifene, no raloxifene).
44
variable
any measurable characteristic of an observation
45
datum
value of the variable
46
continuous numerical variable
can take on any value (1.2 or 1/4 etc.)
47
discrete numerical
can only be whole numbers
48
ordinal categorical variable
can take on qualitative values but the values are on a ranked scale
49
nominal categorical variable
takes on qualitative values but they do not have any particular order eg. types of fruit
50
What is the data type for describing your age
Continuous numerical
51
What is the data type for the description: child, teenager, adult?
Ordinal categorical
52
What is the data type for the number of students in a class?
Discrete numerical
53
What is the data type for the letter grade on your exam?
Ordinal categorical
54
What is the data type for the percent grade on your exam?
Continuous numerical
55
central tendency
describes the typical value in our sample (e.g., the mean, or the median, which is the second quartile)
56
dispersion
describes the spread of the values
57
counts
the number of observations in your sample that fall within a particular category of a categorical variable
58
proportions
counts expressed as a fraction of the total; often reported as percentages
59
variance
measures the amount of variation: the average squared distance of each data point from the mean (σ^2 for a statistical population; the sample version is written s^2)
60
calculating variance
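1. calculate the mean 2. find the difference between each data point and the mean 3. square each difference 4. sum the squares 5. divide by the number of observations (this gives the population variance; the sample variance divides by n - 1 instead)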
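
The variance steps can be sketched in Python as below. This is a minimal illustration that divides by the number of observations (the population variance); the function name `population_variance` is an assumption, and the data are the values used in the variance and IQR question later in this deck.

```python
def population_variance(values):
    """Average squared distance from the mean, dividing by n (not n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]
    return sum(squared_diffs) / n

data = [7.5, 8.6, 8.9, 8.5, 9.4, 10.7, 15.1]
print(round(population_variance(data), 1))  # 5.5
```

Dividing by n - 1 instead of n would give the sample variance, which is slightly larger.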
61
Quartiles
ranked bins of data 1. sort from lowest to highest
62
finding the second quartile
split the sorted data in half: a. if the data set has an odd number of values, the second quartile is the middle value b. if the data set has an even number of values, the second quartile is the average of the two middle values
63
finding the first quartile
subset the lower-valued half of the observations, then use the rules for the second quartile to find its middle value; note that the second quartile is included in the subset if the # of observations is odd
64
3rd quartile
repeat steps for quartile 1 in the upper valued half
65
interquartile range (a measure of dispersion)
range of inner-most 50% of the data between Q1 and Q3 (Q3-Q1)
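
A hedged Python sketch of the quartile rules on these cards, assuming the convention stated above that the median is included in both halves when the number of observations is odd (other textbooks and software use different quartile conventions). The function names are illustrative.

```python
def median(sorted_values):
    """Middle value, or the average of the two middle values when n is even."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3); the median joins both halves when n is odd."""
    s = sorted(values)
    n = len(s)
    half = (n + 1) // 2              # half size, counting the median if n is odd
    lower, upper = s[:half], s[n - half:]
    return median(lower), median(s), median(upper)

q1, q2, q3 = quartiles([7.5, 8.6, 8.9, 8.5, 9.4, 10.7, 15.1])
print(round(q3 - q1, 1))  # 1.5

q1b, _, q3b = quartiles([10.1, 18.6, 19.8, 15.7, 21.9, 12.9, 11.8, 26.0, 13.0, 12.9])
print(round(q3b - q1b, 1))  # 6.9
```

Both data sets come from the practice questions in this deck, and the results agree with the answers given there.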
66
Calculate the mean & median of the following data: 7.5 9.9 8.6 10.3 8.5 9.4 15.1
Mean is 9.9, median is 9.4
67
Would the mean or median be a better descriptor of the ‘middle’ value for this set of data? 7.5 9.9 8.6 10.3 8.5 9.4 15.1
Median
68
Calculate the population variance & interquartile range (IQR) of the following data: 7.5 8.6 8.9 8.5 9.4 10.7 15.1
Variance is 5.5, IQR is 1.5
69
Calculate the interquartile range (IQR) for the following set of numbers and indicate what range the answer lies within. 10.1, 18.6, 19.8, 15.7, 21.9, 12.9, 11.8, 26.0, 13.0, 12.9
5 ≤ ANSWER < 7
70
Calculate the interquartile range (IQR) for the following set of data and indicate what range the answer lies within. 46.7, 18.7, 39.4, 7.2, 19.8, 42.1, 2.6, 17.1, 30.7, 21.9
19 ≤ ANSWER <23
71
meaningfulness
whether the difference among groups is important to your study
72
effect size
the size of the change in the response variable, used to judge whether the change is meaningful in practice
73
The rate of home ownership in Canada decreased from 46% in 2004 to 44% in 2011. What is the effect size as a difference between the years?
-2 percentage points (44% - 46%)
74
do relative effect sizes have units
no
75
In the United Kingdom, 56% of older adults (55+ years) get their news from the television whereas only 12% of youth (18-24 years) do. What is the relative effect size of youth compared to older adults?
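4.7 (0.56/0.12), i.e., older adults relative to youth; youth relative to older adults is the inverse, 0.12/0.56 ≈ 0.21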
76
absolute effect size
the actual difference in outcomes ie. 80%-60%=20%
77
relative effect size
Relative effect size compares the outcomes between two groups as a ratio or percentage. (80% / 60%) = 1.33, or a 33% increase
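
A small Python sketch of the two effect-size calculations, using the 80% and 60% figures from these cards. The function names are illustrative assumptions.

```python
def absolute_effect(group_a, group_b):
    """Difference in outcomes; keeps the units of the original variable."""
    return group_a - group_b

def relative_effect(group_a, group_b):
    """Ratio of outcomes; unitless."""
    return group_a / group_b

print(round(absolute_effect(0.80, 0.60), 2))  # 0.2
print(round(relative_effect(0.80, 0.60), 2))  # 1.33
```

The order of the two groups matters for both measures: swapping them flips the sign of the absolute effect and inverts the ratio.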
78
marginal distributions
sum the values in each row and the values in each column; the last box holds the grand total of all rows and columns, which helps when calculating proportions. Shows how many sampling units are in each level of one categorical variable; a good way to describe patterns
79
conditional distributions
shows the relationship between the columns and the rows: take the value of the cell you are interested in and divide it by the total of its column or row
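
A minimal Python sketch of marginal and conditional distributions for a two-way table. The counts and category names below are invented purely for illustration.

```python
# Hypothetical counts: rows are one categorical variable (game),
# columns are the other (snack)
table = {
    "poker": {"cookies": 30, "chips": 10},
    "bridge": {"cookies": 20, "chips": 40},
}

# Marginal totals: sum across each row and down each column
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {}
for cells in table.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

# Conditional distribution of snack given game: cell / row total
snack_given_game = {
    row: {col: count / row_totals[row] for col, count in cells.items()}
    for row, cells in table.items()
}
print(snack_given_game["poker"]["cookies"])  # 0.75
```

Dividing by the row total conditions on the row variable; dividing by the column total would condition on the column variable instead.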
80
characteristics of single variable bar graphs
- gaps show the levels are categorical - whichever variable you are most interested in goes on the x axis - each bar is a level
81
two variable bar graphs
- visualizes interactions between data sets
82
types of two variable bar graphs
grouped bar graph; stacked bar graph
83
histograms
bars are side by side (no gaps); each bar represents a small numerical range
84
box plots and its parts
based on quartiles and used when you have numerical data and categorical groups - whiskers - median: solid line - box: drawn from the first quartile to the third - extreme thresholds
85
whiskers
drawn from the box to the last data point before the extreme threshold
86
extreme thresholds
Q3 + (1.5 × IQR) and Q1 - (1.5 × IQR)
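
A short Python sketch of the box plot extreme thresholds. The function name is an assumption. The quartile values 8.55 and 10.05 are assumed inputs, worked out by hand from the data 7.5, 8.5, 8.6, 8.9, 9.4, 10.7, 15.1 used in an earlier practice question.

```python
def extreme_thresholds(q1, q3):
    """Lower and upper thresholds: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

low, high = extreme_thresholds(8.55, 10.05)
print(round(low, 2), round(high, 2))  # 6.3 12.3
```

With these thresholds the value 15.1 lies beyond the upper threshold, so the upper whisker would stop at 10.7.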
87
scatter plot
used when you have two numerical variables and want to look at the relationship between them; the x axis is the independent variable and the y axis is the dependent variable; in an observational study the x and y variables are covariates
88
line plots
two numerical variables that have been measured repeatedly from the same sampling unit; each line is a different sampling unit
89
Identify which type of summary information would answer the following question "What proportion of people like cookies when playing poker?"
Conditional distribution with game as the primary variable
90
standard normal distribution
a normal distribution with mean 0 and standard deviation 1; values are converted to it with z = (x - μ)/σ
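
A one-function Python sketch of the z-score conversion, where `mu` is the population mean and `sigma` the population standard deviation. The example numbers are hypothetical.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical population with mean 100 and standard deviation 15
print(z_score(130, mu=100, sigma=15))  # 2.0
```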
91
sample space
set of all possible outcomes
92
event
a subset of a sample space (e.g., 2, 4, 6 out of 1 through 6)
93
random trial
procedure or action that produces one outcome from a set of possible outcomes, where the result is uncertain and cannot be predicted in advance.
94
frequentist probability
probability based on the frequency of events occurring in repeated experiments or trials: P(A) = number of times event A occurs / total number of trials
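
A minimal Python sketch of frequentist probability as a relative frequency, that is, the number of times the event occurs divided by the total number of trials. The recorded die rolls are invented for illustration.

```python
def frequentist_probability(outcomes, event):
    """Relative frequency: times the event occurred / total number of trials."""
    hits = sum(1 for outcome in outcomes if outcome in event)
    return hits / len(outcomes)

# Hypothetical record of 10 die rolls; the event is "roll an even number"
rolls = [2, 5, 6, 1, 4, 3, 6, 2, 5, 1]
print(frequentist_probability(rolls, event={2, 4, 6}))  # 0.5
```

With more trials the relative frequency would be expected to settle closer to the long-run probability.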
95
random variable
numerical outcome of a random phenomenon. It assigns a number to each outcome in a sample space, allowing for the analysis of probabilities associated with different outcomes.
96
probability distribution
the probability of different possible values of a variable.
97
discrete distributions
a function that gives the probability of a discrete random variable, X, being exactly equal to some value
98
define bias and sampling independence
bias: systematic error in a study or analysis that leads to incorrect conclusions or inferences about a population. sampling independence: the selection of one sampling unit does not influence the selection of another.
99
4 goals of an ideal sampling design
1. all sampling units are selectable 2. selection is unbiased 3. selection is independent 4. all samples are possible
100
spurious relationships
a situation where two variables appear to be correlated with each other but, in fact, are not directly related
101
one way contingency table
used for data with a single categorical variable; shown as a one-dimensional table of columns.
102
two-way contingency table
used for data with two categorical variables; shown as a two-dimensional table of rows and columns.
103
You have been asked by a regional conservation authority to design a study to evaluate the risk that a tick will bite someone walking at one of the parks. They provide you enough money to survey 15 parks out of the 60 that are in the region. Your plan is to spend a day at each of the selected parks and survey all the people leaving the park to assess whether a tick bit them or not. You will then calculate the proportion of people bitten for each park sampled.
the 60 parks in the region
104
According to USA Today (Dec 30, 1999), the average age of viewers of MSNBC cable television news programming is 50 years old. A Canadian network executive thinks this might not be true in Canada, and believes that the average age of these viewers in Canada is significantly less than 50 years old. To test her hypothesis, the Canadian executive obtains a list of Bell satellite subscribers who included MSNBC in their channel package, and then conducts a phone poll of 2,000 of these subscribers across Canada. Anyone called who reports not watching MSNBC news programming at least once a week is left out of the survey; in the end 287 respondents watch MSNBC news programming at least weekly, and report their ages as part of the survey. What is the variable of interest?
viewer age
105
According to USA Today (Dec 30, 1999), the average age of viewers of MSNBC cable television news programming is 50 years old. A Canadian network executive thinks this might not be true in Canada, and believes that the average age of these viewers in Canada is significantly less than 50 years old. To test her hypothesis, the Canadian executive obtains a list of Bell satellite subscribers who included MSNBC in their channel package, and then conducts a phone poll of 2,000 of these subscribers across Canada. Anyone called who reports not watching MSNBC news programming at least once a week is left out of the survey; in the end 287 respondents watch MSNBC news programming at least weekly, and report their ages as part of the survey. What is the statistical population for this study?
all at-least-weekly Canadian viewers of MSNBC news programming who watch using Bell satellite
106
A medical study wants to relate consumption of fat to heart conditions. 100 patients with heart conditions are randomly selected from clinics in the Kingston area, and each patient is asked to track their food consumption for 6 weeks. After the six weeks, each patient's heart health is evaluated using a standard array of tests (blood pressure, heart rate, ECG, etc.). What term best describes each patient in this study design?
both sampling and observation unit
107
An ornithologist at Queen’s University is studying the development time of recently hatched black-capped chickadees on Wolfe Island. He randomly samples 20 nests from across the island and measures the weight of each new hatchling in the nest. He repeats this sampling after 1 week, and then again after 2 weeks. What term best describes each nest included in this study?
sampling unit
108
Lyme disease is caused by the bacterium Borrelia burgdorferi, carried primarily by black-legged ticks. A recent study assessed the percentage of black-legged ticks that carry Borrelia from 10 random sites across North America spanning a range of mean annual temperatures. The number of ticks carrying Borrelia was quantified by collecting 100 ticks from each site and screening each tick for the bacterium (either YES or NO). The goal was to quantify the relation between annual temperature among sites and the percentage of ticks with Borrelia. What is the observation unit in this study?
the individual tick
109
A medical study wants to relate consumption of fat to heart conditions. 100 patients with heart conditions are randomly selected from clinics in the Kingston area, and each patient is asked to track their food consumption for 6 weeks. After the six weeks, each patient's heart health is evaluated using a standard array of tests (blood pressure, heart rate, ECG, etc.). What term best describes the beats per minute of heart rate in this study design?
measurement unit
110
You are interested in the growth potential of a new seed variety. You gather a random selection of 1,000 seeds from a field where the new variety is growing, and measure the final height of all the resulting plants. What kind of study design is this?
simple random
111
You are the quality assurance manager for a company that produces toasters. In post-production testing, you find that more toasters are failing than expected; the cause or source of the failures is not immediately clear though. You ask your intern to gather a random selection of failed toasters, and a selection of toasters that do not fail in testing, and then to trace all those toasters back through the production process (employees that did which installation, source of the particular components, etc.) What kind of study design is this?
case control study
112
A psychology professor recruits 50 randomly selected Queen's undergraduates, and asks them to recommend friends who would also be willing to participate in an introvert/extrovert personality study; overall, 93 students complete the study. The results are 73% of the students are extroverts, 17% are introverts, and 10% are a mix. What would the biggest concern or risk be about this sampling strategy?
sample unit selection is not independent
113
A medical experiment, in which a treatment group is compared to a control group, is carried out to reduce the effect of
confounding factors
114
Consider a survey being designed for customers of a tour company in Paris. Determine whether the possible responses to the following question on their survey should be classified as categorical, continuous numerical or discrete numerical. "How many escorted vacations have you taken prior to this one?"
discrete numerical
115
Determine whether the possible responses to the following question should be classified as categorical, discrete numerical or continuous numerical. "Whether you are a Canadian citizen."
categorical
116
Determine whether the possible responses to the following question should be classified as categorical, discrete numerical or continuous numerical. "The number of students in a statistics course."
discrete numerical
117
number of observation units in a table
number of rows
118
number of variables in a table
number of columns
119
Customers finishing a free sample at Costco are asked to complete a survey asking whether they would be "Very interested", "Interested" or "Not interested" in buying the food product in the future. In one day, 357 customers complete the survey. What graph type would be most appropriate for displaying the resulting data all at once?
a bar graph
120
two way contingency table
121
What is the sample space for determining the probability of drawing a Jack of Clubs from a deck of cards in a game of poker?
list of all cards in a deck
122
What is the event for drawing an ace from a deck of cards in a game of poker?
list of all aces
123
Which of the following statements reflects a correct definition of probability? a. There is a good probability of rain tomorrow b. Roughly 1 in a million people have won a national lottery over hundreds of draws, which means the probability is p=0.000001. c. The probability that a product fails can be calculated directly from repeated testing in a factory. d. The probability that I will buy my lunch today is 100%
b. Roughly 1 in a million people have won a national lottery over hundreds of draws, which means the probability is p=0.000001. (correct) c. The probability that a product fails can be calculated directly from repeated testing in a factory. (correct) d. The probability that I will buy my lunch today is 100% (correct)
124
Which of the following statements describe a random trial? a. The weight of an orange is measured in grams. b. Observing how much a random shopper spent in a particular store. c. Playing a 'scratch and win' lottery ticket. d. Finding out that your neighbour won a million dollars in the lotto e. Playing a crossword puzzle f. Rolling a die in a board game
b. Observing how much a random shopper spent in a particular store. (correct) c. Playing a 'scratch and win' lottery ticket. (correct) f. Rolling a die in a board game (correct)
125
Would the following be a continuous or discrete distribution? ‘Length of time between shots on net in a soccer game’
Continuous distribution
126
Would the following be a continuous or discrete distribution? ‘Number of shots on net in a soccer game’
Discrete distribution
127
Which of the following statements about probability distributions are TRUE? a. Can be used to describe both discrete and continuous numerical variables b. The area beneath the function always sums to one c. The y-axis of a continuous distribution is called probability mass d. The x-axis is the outcome, or event, of interest e. Probability distributions show the probability of some events, but they do not have to account for all possible events from a random trial.
a. Can be used to describe both discrete and continuous numerical variables (correct) b. The area beneath the function always sums to one (correct) d. The x-axis is the outcome, or event, of interest (correct)
128
Which of the following statements about probability distributions are FALSE? a. The probability of a single event in a continuous distribution is always zero b. The probability of a single event in a discrete distribution is always zero c. Regardless of whether the distribution is discrete or continuous, probability is the area under the curve. d. Probability distributions cannot be used for a range of events.
b. The probability of a single event in a discrete distribution is always zero (correct) d. Probability distributions cannot be used for a range of events. (correct)
129
Null hypothesis
statement or position that is the skeptical view-point of the research question.
130
Null distribution
sampling distribution from an imaginary statistical population where the null hypothesis is true
131
statistical significance
conclusion that the observed result is unlikely to have come from the null distribution
132
hypothesis testing
used to evaluate statistical significance
133
P
the probability of seeing your data, or something more extreme, under the null hypothesis. It helps quantify the evidence against the null hypothesis by measuring how compatible your data are with the assumption that the null is true. If α = 0.05, a p-value below 0.05 means rejecting H0 is justified. Example: p = 0.03, α = 0.05: the result is statistically significant because p < 0.05; you reject H0. Example: p = 0.10, α = 0.05: the result is not statistically significant because p > 0.05; you fail to reject H0.
134
type one error rate
probability of rejecting the null when it is true (false positive)
135
type two error
probability of failing to reject the null when it is false (false negative)
136
error rates
probability of making a mistake
137
population parameters
descriptive statistics of the statistical population (not of the sample); quantifiable characteristics of a statistical pop; labeled using the Greek alphabet; values are fixed
138
sampling distributions
shape is independent of the statistical pop if the sample size is sufficiently large (a bell-shaped curve); taking the mean of multiple sampling units averages out asymmetries in the statistical population; the variance of a sampling distribution increases as the # of sampling units decreases
139
central limit theorem
given a sufficiently large sample size, the distribution of the sample mean will approximate a normal distribution, regardless of the original population's distribution; the standard error can be calculated from the sd of the statistical pop and the sample size
140
SE =
sigma (sd) / sqrt(n)
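
A minimal Python sketch of the standard error calculation, sd / sqrt(n). The function name and the example values are illustrative assumptions.

```python
import math

def standard_error(sd, n):
    """Standard deviation divided by the square root of the sample size."""
    return sd / math.sqrt(n)

# Hypothetical statistical population with sd = 10, sampled with n = 25
print(standard_error(sd=10, n=25))  # 2.0
```

Quadrupling the sample size halves the standard error, because n sits under a square root.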
141
student t distribution
shape depends on the size of the sample (influential when the size is small); has fatter tails to account for the uncertainty in estimating the sd; a continuous probability distribution used when the sample size is small and the population standard deviation is unknown. As df increases, the t-distribution approaches the normal distribution.
142
confidence intervals
the range over a sampling distribution that brackets the center most probability of interest
143
confidence interval formulas
t = (x - m)/SE, which rearranges to x = m ± t × SE (the interval limits x lie t standard errors on either side of the mean m)
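
A hedged Python sketch of a confidence interval built from the rearranged t formula, reading it as mean ± t × SE (that reading, and the function name, are assumptions). The critical t-score is passed in rather than computed; 2.064 is the two-tailed 95% value for df = 24 from a standard t table, and the sample numbers are hypothetical.

```python
import math

def confidence_interval(sample_mean, sample_sd, n, t_crit):
    """Return (lower, upper) limits: mean -/+ t_crit * SE."""
    se = sample_sd / math.sqrt(n)
    margin = t_crit * se
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 50, sd 10, n = 25 (so df = 24)
lower, upper = confidence_interval(50, 10, 25, t_crit=2.064)
print(round(lower, 3), round(upper, 3))  # 45.872 54.128
```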
144
single sample t-test
evaluates if the mean of your sample is different from some reference value; compares a numerical variable to a reference: t = (sample mean - reference) / SE
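
A minimal Python sketch of the single sample t-score, (sample mean - reference) / SE, assuming SE is the sample standard deviation divided by sqrt(n). The data and reference value are invented for illustration, and the function name is an assumption.

```python
import math
import statistics

def one_sample_t(values, reference):
    """t = (sample mean - reference) / SE, with df = n - 1."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample sd (divides by n - 1)
    se = sd / math.sqrt(n)
    return (mean - reference) / se

data = [4.0, 5.0, 6.0, 7.0, 8.0]    # hypothetical measurements
print(round(one_sample_t(data, reference=5.0), 3))  # 1.414
```

The resulting t-score would then be compared against a t distribution with n - 1 degrees of freedom.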
145
paired sample t-test
evaluates if the difference in paired data of numerical variables is different from some reference value; looks at how sampling units change across factors: t = (mean of differences - reference) / SE
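
A short Python sketch of the paired sample t-score: take the difference within each pair, then apply (mean of differences - reference) / SE to those differences. The before and after values are hypothetical, and the function name is an assumption.

```python
import math
import statistics

def paired_t(before, after, reference=0.0):
    """t-score for the mean within-pair difference (after - before)."""
    diffs = [a - b for a, b in zip(after, before)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return (statistics.mean(diffs) - reference) / se

# Hypothetical measurements on the same four sampling units
before = [10.0, 12.0, 9.0, 11.0]
after = [12.0, 13.0, 12.0, 13.0]
print(round(paired_t(before, after), 3))  # 4.899
```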
146
two sample t-test
determines if the means of two groups are different from each other: t = (m1 - m2) / SE of the difference between the means
147
contingency table
summarized categorical data
148
expected contingency table
the contingency table of expected frequencies under the null hypothesis; used to compare observed vs. expected
149
expected 1-way table
one categorical variable with levels; the sum of the observed counts must be the same as the sum of the expected counts; expected counts are distributed equally among the levels; asks: is there a difference in counts among the levels of that variable?
150
expected 2-way table
two categorical variables; expected counts are distributed independently; asks: are the counts independent between the variables?
151
calculating independence
calculate marginal distribution .........
152
calculating expected frequencies
(row total * column total) / table total, do it for each cell
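
A minimal Python sketch of the expected-frequency rule, (row total × column total) / table total, applied to every cell. The observed counts are invented for illustration, and the function name is an assumption.

```python
def expected_counts(observed):
    """Expected two-way table under independence of rows and columns."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

observed = [[30, 10],
            [20, 40]]   # hypothetical observed counts
print(expected_counts(observed))  # [[20.0, 20.0], [30.0, 30.0]]
```

The expected table keeps the same row totals, column totals, and table total as the observed table.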
153
Chi-square test
used to determine whether there is a significant association between categorical variables or whether observed data matches expected data under a certain hypothesis. It works by comparing observed frequencies (data collected) to expected frequencies (based on a hypothesis).
154
chi-square distribution
distribution of chi-square scores expected from repeatedly sampling a statistical pop where the null is true; can only have positive values (everything is squared); shape varies depending on the dfs
155
calculating chi-square (X^2)
1. take the difference between each observed and expected cell 2. square the difference 3. divide by the expected value 4. sum over all cells in the table
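
The chi-square steps can be sketched in Python as below. The observed counts are invented, and the expected counts were worked out by hand from them using (row total × column total) / table total; the function name is an assumption.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

observed = [[30, 10], [20, 40]]           # hypothetical counts
expected = [[20.0, 20.0], [30.0, 30.0]]   # from the row and column totals
print(round(chi_square(observed, expected), 2))  # 16.67
```

For this 2 by 2 table the score would be compared against a chi-square distribution with (2 - 1)(2 - 1) = 1 degree of freedom.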
156
dfs for 1 - way tables
n - 1, where n is the number of levels (categories)
157
dfs for 2-way tables
(r - 1)(c - 1), where r is the number of rows and c is the number of columns
158
names for the variable used to explain the change in the outcome of an experiment
x variable; independent variable; predictor variable
159
names for the variable used to explain the change in the outcome of an observational study
the x variable the predictor variable
160
The relationship between number of beers consumed (x) and blood alcohol content (y) was studied in 16 adults by using linear regression. The following regression equation was obtained from the study: y = -0.0127 + 0.0180x. If an individual had 4 beers and scored a blood alcohol content of 0.085, what is their residual variation?
+0.0257 (correct)
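
A worked Python sketch of the residual calculation for this card: residual = observed value - predicted value. The intercept and slope come from the regression equation in the question; the function name is an assumption.

```python
def predict_bac(beers, intercept=-0.0127, slope=0.0180):
    """Predicted blood alcohol content from the regression line."""
    return intercept + slope * beers

observed = 0.085
predicted = predict_bac(4)          # -0.0127 + 0.0180 * 4 = 0.0593
residual = observed - predicted
print(round(residual, 4))  # 0.0257
```

A positive residual means the observed value sits above the regression line.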
161
Linearity
the response variable is a linear function of the predictor variable (well described by a linear relationship); the effect of the predictor variable on the response is additive and proportional
162
normality
assumption that residuals are normally distributed
163
Independence
assumes that the residuals are sequentially independent of each other (they vary between + and - numbers seemingly at random); when residuals are not independent there will be adjacent runs of positive and negative residuals; prevent violations by making sure units are selected at random and independently of each other
164
Homoscedasticity
the variance of residuals (errors) should be constant across all levels of the predictor variable (spread should be equal)
165
bivariate normal distribution
3D normal distribution graph depicted as contours
166
Pearsons correlation coefficient
r (sample) or ρ, rho (population); measures the strength of association: ρ = -1, ρ = 0, ρ = 1 (negative, no, positive association)
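
A minimal Python sketch of Pearson's correlation coefficient for a sample, using the standard formula r = Sxy / sqrt(Sxx × Syy). The formula is not stated on the card, so treat it as a supplied assumption; the data are invented.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation: covariation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical perfectly linear data: positive, then negative association
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # 1.0
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # -1.0
```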
167
linear regression
evaluates if changes in one numerical variable can predict changes in another
168
linear regression equation
y = a (intercept) + b (slope) x
169
systematic component
describes the function used for predictions
170
random component
describes the probability distribution for sampling error (only occurs in the y variable)
171
link function
connects the systematic to the random component
172
3 parts of the statistical model
systematic component, random component, link function
173
minimizing residual variance
calculate the residual for each data point; square each residual; sum the squared residuals across all data points; divide by the dfs (n - 2)
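The steps above can be sketched as follows. The least-squares formulas for the slope and intercept are the standard ones and are an addition here (the card only lists the residual steps); the function names and the five data points are hypothetical.

```python
def least_squares_fit(xs, ys):
    """Return (intercept a, slope b) of the line that minimizes squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_variance(xs, ys, intercept, slope):
    """Sum of squared residuals divided by the degrees of freedom (n - 2)."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sum_sq = sum(r ** 2 for r in residuals)
    return sum_sq / (len(xs) - 2)


# Hypothetical data set with n = 5 points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_fit(xs, ys)
var = residual_variance(xs, ys, a, b)

print(round(a, 4), round(b, 4), round(var, 4))
```

Two degrees of freedom are lost because both the intercept and the slope are estimated from the data.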
174
what are the four steps to the hypothesis test
define the null and alternative hypotheses; establish the null distribution; conduct the statistical test; draw scientific conclusions
175
F-test
determines the ratio of two variances (no difference in variance, F = 1)
176
which sum of squares measures the variability of the observed values of the response variable around their respective treatment means in ANOVA
residual sum of squares, SSE (dividing it by its dfs gives the residual variation, MSE) (correct)
177
contrast statement
tests the difference in means between groups in an ANOVA
178
post hoc test
secondary test used to evaluate which groups have different means in an ANOVA; only used if the F-test indicates that the null hypothesis should be rejected
179
TukeyHSD test
compares the means of all possible pairs of categorical levels in an ANOVA; controls the family-wise error rate by using a specialized null distribution that accounts for the number of contrasts
180
family-wise error rate
type I error rate for the whole family of contrasts; used to evaluate the adjusted p-values returned from the TukeyHSD test: if the adjusted p-value > FWER (0.05), we fail to reject the null hypothesis
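To see why the family of contrasts needs its own error rate, it can help to compute how fast the chance of at least one false positive grows when each contrast is tested separately at 0.05. The sketch below assumes the tests are independent, which pairwise contrasts are not, so treat it as a rough illustration rather than what TukeyHSD computes. Both function names are hypothetical.

```python
def pairwise_contrasts(k):
    """Number of pairwise comparisons among k group means."""
    return k * (k - 1) // 2


def uncorrected_fwer(alpha, m):
    """Chance of at least one type I error across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


m = pairwise_contrasts(4)         # 4 groups give 6 pairwise contrasts
fwer = uncorrected_fwer(0.05, m)  # 1 - 0.95**6

print(m, round(fwer, 4))  # prints "6 0.2649"
```

With only four groups, the uncorrected chance of a false positive somewhere in the family is already about five times the per-test rate.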
181
Two factor ANOVA
looks at the effect of two categorical variables on a numerical variable
182
main effects A
questions about the differences among the levels of factor A averaging across the levels of factor B. These are comparisons among full columns
183
main effects B
questions about the differences among the levels of factor B averaging across the levels of factor A. These are comparisons among full rows
184
Interactions
differences among the levels of one factor within each level of the other factor; a deviation from the assumption that the effects of the two factors simply add together
185
additivity
the combined response to the two factors is the sum of their individual effects
186
synergistic interaction
response is more than the two variables added together
187
antagonistic interaction
response is less than the two variables added together
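The three cases (additive, synergistic, antagonistic) can be told apart by comparing the observed combined response with the sum of the individual effects. This sketch assumes a simple design with a baseline (neither factor applied), each factor alone, and both together; the function name, the design, and all numbers are hypothetical.

```python
def classify_interaction(baseline, a_only, b_only, both, tol=1e-9):
    """Compare the observed combined response with the additive expectation."""
    effect_a = a_only - baseline
    effect_b = b_only - baseline
    additive_expectation = baseline + effect_a + effect_b
    if abs(both - additive_expectation) <= tol:
        return "additive"
    if both > additive_expectation:
        return "synergistic"
    return "antagonistic"


# Hypothetical cell means: baseline 10, factor A alone 15, factor B alone 13.
# The additive expectation for both factors together is 10 + 5 + 3 = 18.
print(classify_interaction(10, 15, 13, 18))  # additive
print(classify_interaction(10, 15, 13, 25))  # synergistic
print(classify_interaction(10, 15, 13, 14))  # antagonistic
```

In a real two-factor ANOVA the cell means carry sampling error, so the interaction term is tested with an F-score rather than by exact comparison.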
188
What does a significant AB interaction mean in a two-way ANOVA?
The effect of factor A depends on the level of factor B. (correct)
189
What type of sum of squares measures the variability of the observed values of the response variable around their respective cell means?
residual
190
Mean sum of squares for groups
MSG = SSG / dfG, where SSG is the sum of squares for groups and dfG = k - 1 (k = number of groups)
191
mean square error
residual variation; MSE = SSE / dfE, where dfE = n - k
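MSG and MSE can be combined into an F-score for a one-way ANOVA. A minimal sketch: the sums of squares use the standard definitions (group means around the grand mean for SSG, observations around their own group mean for SSE), which are an addition to the cards; the function name and the three groups are invented.

```python
def one_way_anova(groups):
    """Return (MSG, MSE, F) for a list of groups of numerical observations."""
    k = len(groups)                   # number of groups
    n = sum(len(g) for g in groups)   # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # SSG: variation of group means around the grand mean, weighted by group size.
    ssg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # SSE: variation of observations around their own group mean.
    sse = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)

    msg = ssg / (k - 1)  # dfG = k - 1
    mse = sse / (n - k)  # dfE = n - k
    return msg, mse, msg / mse


# Hypothetical data: three groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
msg, mse, f_score = one_way_anova(groups)

print(round(msg, 4), round(mse, 4), round(f_score, 4))
```

A large F-score means the variation among group means is large relative to the residual variation within groups.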
192
what happens when the sample size increases
the variance of the sampling distribution decreases, so the standard error becomes smaller
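The effect of sample size is easy to see with the standard error of the mean, SE = s / sqrt(n). This formula is the usual one for a sample mean and is an assumption here, since the card does not state it; the standard deviation of 10 is made up.

```python
import math


def standard_error(sample_sd, n):
    """Standard error of the mean: sample standard deviation / sqrt(sample size)."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return sample_sd / math.sqrt(n)


# Same hypothetical standard deviation, increasing sample sizes.
for n in (25, 100, 400):
    print(n, standard_error(10.0, n))
```

Quadrupling the sample size halves the standard error, because n sits under a square root.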
193
population distribution
distribution of the values produced by measuring some variable on each individual in a population
194
If the coefficient of correlation r = ± 1, then the best-fit linear equation will actually include all of the data points?
true
195
The coefficient of correlation r is a number that indicates the direction and the strength of the relationship between the variable y and the variable x?
true
196
We anticipate a small P value for an ANOVA F statistic if the box plots for the samples are
wide and similarly located; narrow and located differently (correct); identical; symmetrical; wide and have similar medians
197
t distributions can be used to test whether the difference between two sample means is different from zero?
true
198
df formulas
k - 1: variation between groups (ANOVA, MSG); n - k: variation within groups (ANOVA, residual variation (MSE)); n - 1: one-way table; n - 2: confidence intervals and residual analysis; (r/a - 1)(c/b - 1): 2-way table; ab(n - 1): residual analysis (variation among sampling units within a cell); n1 + n2 - 2: two-sample t-test
199
what is the F-score
the ratio of the variation among categorical groups divided by the residual variation within a group
200
what is the null distribution of the F-score
represents the variation in a ratio you would expect from repeated sampling of a population where there was no true difference in means.