Biostats Test 1 Flashcards

Question

Standard deviation

Answer 1

The standard deviation of a set of sample values is a measure of variation of values about the mean - The standard deviation "s" is used to describe the variation around the mean. - Like the mean, it is NOT resistant to skew or outliers. (used to estimate population)

Answer 2

The variance of a set of values is a measure of variation equal to the square of the standard deviation (s^2) (used to estimate population)

Answer 3

(also known as standardized score) A z-score can be used to compare values from different data sets. - The z-score is the number of standard deviations that a given value x is above or below the mean. - If the z-score is 1, the value is 1 standard deviation above the mean. If the z-score is -1.5, the value is 1.5 standard deviations below the mean. Use when we want to compare different populations - need to standardize data to make it comparable
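
A minimal Python sketch of the z-score comparison described above. The scores, means, and standard deviations are made up for illustration, and `z_score` is a hypothetical helper name.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical example: the same student on two differently scaled tests.
z_test_a = z_score(85, mean=75, sd=10)  # 1 standard deviation above the mean
z_test_b = z_score(70, mean=60, sd=5)   # 2 standard deviations above the mean

# The raw score is higher on test A, but the relative standing is better on test B.
better_test = "A" if z_test_a > z_test_b else "B"
print(z_test_a, z_test_b, better_test)
```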

Answer 4

Positive z-score: indicates that the value is above the mean | Negative z-score: indicates that the value is below the mean.

Answer 5

Quartiles divide the sorted data values into four equal parts. - The median divides the data into two equal components. - Q1: 25% of values are less than or equal to Q1, and 75% of values are greater than or equal to Q1 - Q2: equal to the median - Q3: 75% of values are less than or equal to Q3, and 25% of values are greater than or equal to Q3

Answer 6

the process of using statistical tools to investigate data sets in order to understand their important characteristics, including: center, variation, distribution, outliers and time

Answer 7

An outlier is a value that is located very far away from almost all of the other values. Relative to the other data, an outlier is an extreme value - An outlier can have a dramatic effect on the mean, the standard deviation, and the scale of the histogram so that the true nature of the distribution is obscured

Answer 8

``` min Q1 M (median) Q3 max ```

Answer 9

The IQR is the distance between the first and third quartiles (the length of the box in the box plot) - IQR = Q3-Q1 - used to find suspected high and low outliers

Answer 10

An outlier is an individual value that falls outside the overall pattern. How far outside the overall pattern does a value have to fall to be considered a suspected outlier? - Suspected low outlier: any value < Q1 - 1.5 IQR - Suspected high outlier: any value > Q3 + 1.5 IQR
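
The 1.5 IQR rule above can be sketched in Python as follows. The data values are made up, and the "inclusive" quartile method is an assumption: textbooks and software use slightly different quartile conventions, so Q1 and Q3 may differ a little from hand calculations.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up values; 30 looks suspicious

# Assumption: "inclusive" quartiles; other conventions shift Q1 and Q3 slightly.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr    # anything below this is a suspected low outlier
high_fence = q3 + 1.5 * iqr   # anything above this is a suspected high outlier

suspected_outliers = [x for x in data if x < low_fence or x > high_fence]
print(suspected_outliers)
```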

Answer 11

1. Find the 5 number summary 2. Construct a scale with values that includes the minimum and maximum data values 3. Construct a box extending from Q1 to Q3, and draw a line in the box at the median values 4. Draw lines extending outward from the box to the minimum and maximum data values (won't be asked to do this on exam)

Answer 12

can be analyzed to determine if there is an association between the two variables. - We explore only linear associations within quantitative data

Answer 13

A correlation exists between two variables when one of them is linearly related to the other in some way - must be quantitative variables

Answer 14

1. Make a scatterplot - - What type of relationship is there? linear or nonlinear - - Direction of relationship? positive (as x increases, y increases) or negative (as x increases, y decreases) - - How strong is the relationship? strong (if you can connect dots), weak (if scattered) - - Look for potential outliers

Answer 15

The correlation coefficient r measures the strength of the linear association between paired x and y quantitative values in a sample. r is a sample statistic that estimates the population correlation coefficient, rho. Requirements for making inferences about rho, using r: 1. Paired data (x, y) must be a random sample 2. A scatterplot must confirm that the points approximate a straight-line pattern 3. Outliers should be removed if they are known to be errors

Answer 16

- The value of r is always between -1 and 1, inclusive (-1 less than or equal to r less than or equal to 1) - The value of r does not change if all values of either variable are converted to a different scale - The value of r is not affected by the choice of x and y (ex: doesn't matter if BMI x or y, cholesterol, y or x) - r measures the strength and direction of a linear association Negative correlation: - slope Positive correlation: + slope
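
A short Python sketch that checks the listed properties of r numerically. The paired data are made up, and `pearson_r` is a hypothetical helper that averages the products of z-scores (dividing by n - 1), which is one common way to write the formula.

```python
import statistics

def pearson_r(xs, ys):
    """Sample correlation: sum of products of z-scores divided by (n - 1)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return sum(((x - x_bar) / sx) * ((y - y_bar) / sy) for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]  # made-up paired data
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_swapped = pearson_r(ys, xs)                      # swap the roles of x and y
r_rescaled = pearson_r([100 * x for x in xs], ys)  # change the scale of x
print(r, r_swapped, r_rescaled)
```

All three values agree up to rounding, which illustrates that r does not depend on the choice of x and y or on the measurement scale.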

Answer 17

If r is close to zero, we can conclude that there is no significant linear correlation between x and y. If r is close to -1 or 1, we conclude that there is significant linear correlation (values closer to -1 or 1 indicate stronger correlation) * CANNOT conclude that there is no relationship at all (there could be another relationship like a parabola)

Answer 18

If we conclude that there is a linear correlation between x and y, we can find a linear equation that expresses y in terms of x and that equation can be used to predict values of y for given values of x. (Simple Linear Regression) The value of r^2 is the proportion of variation in y that is explained by x. In addition to x, there may be a variety of other factors affecting y, such as random variation or other factors not included in the study. We will explore this in more detail with linear regression.

Answer 19

- Concluding that correlation implies causality (ex: shark attacks and ice cream consumption) - Data based on averages: Averages suppress individual variation and may inflate the correlation coefficient (averages may make things look better than they are) - Linearity: An association may exist between x and y even when there is no significant linear correlation. r is not resistant to outliers: - Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers - Outliers will make a relationship look stronger/weaker than it actually is

Answer 20

The regression equation expresses an association between x and y. Variable x: the independent, predictor*, or explanatory* variable Variable y: the dependent or response* variable Data comes in pairs (xi, yi) where xi is the ith observation for variable x and yi is the ith observation for variable y A linear regression model with one predictor variable is a simple linear regression (SLR) model x is what we are using to predict y

Answer 21

the unique line such that the sum of the vertical distances between the data points and the line is zero, and the sum of the squared vertical distances is the smallest possible - same as line of best fit - line: smallest amount of vertical distances squared (minimizes error) - sum of all vertical distances has to = 0 - always has to pass through the point (x bar, y bar) * Only for linear associations - Don't compute the regression line until you have confirmed that there is a linear relationship between x and y - always plot the raw data first to confirm linear association (always do a scatterplot and correlation coefficient first)

Answer 22

y hat: the predicted value of y for a given value of x y hat = intercept + slope * x *always have to write y hat slope of the regression line: describes how much we expect y to change, on average, for every unit change in x intercept: a necessary mathematical descriptor of the regression line (it does not describe a specific property of the data)

Answer 23

b1 = r (sy/sx) - r: the correlation coefficient between x and y - sy: standard deviation of the response variable y - sx: standard deviation of the explanatory variable x

Answer 24

b0 = y bar - b1 (x bar) | - x bar and y bar are the respective means of the x and y variables
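
The slope and intercept formulas, b1 = r (sy/sx) and b0 = y bar - b1 (x bar), can be sketched in Python like this. The data are made up, and `predict` is a hypothetical helper name.

```python
import statistics

xs = [1, 2, 3, 4, 5]  # made-up explanatory values
ys = [2, 4, 5, 4, 5]  # made-up response values

n = len(xs)
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
sx, sy = statistics.stdev(xs), statistics.stdev(ys)

# Correlation coefficient between x and y.
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

b1 = r * (sy / sx)       # slope
b0 = y_bar - b1 * x_bar  # intercept

def predict(x):
    """y hat = b0 + b1 * x"""
    return b0 + b1 * x

print(b0, b1, predict(x_bar))
```

Predicting at x bar returns y bar (up to rounding), which matches the fact that the regression line passes through (x bar, y bar).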

Answer 25

coefficient of determination = r^2 It is the square of the correlation coefficient. It represents the fraction of the variation (%) in y that is explained by the regression model. - Always between 0 and 1 - The closer r^2 gets to 1, the better the model explains (fits) the data Interpretation ex: - If r=0.87 then r^2 = 0.76...... About 76% of the variation in children's heights is explained by the regression model with FEV, or: The regression model explains 76% of the variation in y.

Answer 26

Outlier: an observation that lies outside the overall pattern "Influential individual/point": an observation that markedly changes the regression if removed. This is often an isolated point

Answer 27

the vertical distances from each point to the least-squares regression line - The sum of all the residuals is by definition 0 - Outliers have unusually large residuals (in absolute value) - residuals can be positive or negative
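
A small Python sketch of the residual properties above, using made-up data. The slope here is computed as the sum of cross-deviations divided by the sum of squared x deviations, which is algebraically the same as r (sy/sx).

```python
xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

# Residual = observed y minus predicted y (y hat).
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sum_of_residuals = sum(residuals)

print(residuals)
print(sum_of_residuals)  # zero up to floating-point rounding
```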

Answer 28

Association, however strong, does NOT imply causation. - The observed association could have an external cause. - A lurking variable is a variable that is not among the explanatory or response variables in a study, and yet may influence the relationship between the variables studied - We say that two variables are confounded when their effects on a response variable cannot be distinguished from each other.

Answer 29

Establishing causation from an observed association can be done if: - The association is strong - The association is consistent - Higher doses are associated with stronger responses - The alleged cause precedes the effect - The alleged cause is plausible

Answer 30

record data on individuals without attempting to influence the responses - not imposing anything on anyone/manipulating anything about individuals

Answer 31

deliberately imposing or assigning a treatment on individuals and record their responses - influential factors can be controlled - implement something that changes person's lifestyle, etc.

Answer 32

Two variables are confounded when their effects on a response variable cannot be distinguished - Observational studies often fail to yield clear causal conclusions because the explanatory variable is confounded with lurking variables Lurking variables: didn't collect info on it (can't account for it) Confounding variables: collected the info while doing the study Can control for confounding variables in experiments generally but not in observational studies

Answer 33

Population: the entire group of individuals in which we are interested but can't usually assess directly Sample: the part of the population we actually examine and for which we do have data

Answer 34

Parameter: a number summarizing a characteristic of the Population Statistic: a number summarizing a characteristic of a Sample

Answer 35

Probability sampling: individuals or units are randomly selected; the sampling process is unbiased Methods that are NOT ideal but only done if probability sampling not possible: - Voluntary response sampling: individuals choose to be involved - Convenience sampling: ask whoever is around (mail, street) or take the next 10 units

Answer 36

Made of randomly selected individuals - Each individual in the population has the same probability of being in the sample - All possible samples of size n have the same chance of being drawn How to choose an SRS: - draw from a hat (lottery style) - flip a coin - use a table of published random numbers - use software that generates random numbers
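
The "use software that generates random numbers" option might look like this in Python. The population of 100 ID numbers, the sample size, and the seed are all made up for illustration.

```python
import random

population = list(range(1, 101))  # hypothetical ID numbers 1..100

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
srs = rng.sample(population, 10)  # sampling without replacement; every subset of size 10 is equally likely

print(sorted(srs))
```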

Answer 37

an observational study that relies on a random sample drawn from the entire population - Opinion polls are sample surveys that typically use voter registries or telephone numbers to select their samples - In epidemiology, sample surveys are used to establish the incidence (rate of new cases per year) and the prevalence (rate of all cases at one point in time) of various medical conditions, diseases, and lifestyles.

Answer 38

- undercoverage or selection bias: parts of the population are systematically left out (based on the way you choose to distribute the survey) - nonresponse: some people choose not to answer/participate - wording effects: biased or leading questions, complicated/confusing statements can influence survey results - response bias: fancy term for lying or forgetting (especially on sensitive/personal issues) - can be exacerbated by survey method (in person vs. by phone or online) - more likely to lie to someone's face than online

Answer 39

start with 2 random samples of individuals with different outcomes and look for exposure factors in the subjects' past ("retrospective") - common for rare diseases

Answer 40

enlist individuals of common demographic, and keep track of them over a long period of time ("prospective") - individuals who later develop a condition are compared with those who don't

Answer 41

measure the exposure and the outcome at the same time (i.e. surveys)

Answer 42

experimental units | - if they are human, we call them subjects

Answer 43

any specific experimental condition applied to the subjects - if an experiment has several factors, a treatment is a combination of specific levels of each factor ex: the factor may be the administration of a drug

Answer 44

compare the response to a given treatment versus: - another treatment - the absence of treatment (often called a control) - a placebo (a fake treatment) Experiments randomize the assignment of subjects to treatments. Experiments use replication: several or many individuals are studied

Answer 45

negative control: expect outcome to stay the same (expect that not going to help/hurt) positive control: expect outcome to change

Answer 46

improvement in health or perceived condition due not to any active treatment but only to the patient's belief that he or she is being cared for or helped - therapeutic results on up to 35% of patients - neural response to the placebo effect seen as early as the spinal cord

Answer 47

term used to describe a type of bias that may occur due to behavior modification because of study enrollment - also known as "observer effect" - blinding can help against this bias: if people know what group they are in, or the doctor knows who is in the placebo group, it may change the individual's behavior or the doctor's treatment - people behave differently if they know they are being observed by a doctor

Answer 48

one in which neither the subjects nor the experimenter(s) know which individuals received which treatment until the experiment is completed

Answer 49

individuals are randomly assigned to groups then the groups are assigned to treatments completely at random

Answer 50

choose pairs of subjects that are closely matched (like twins but doesn't have to be) - within each pair, randomly assign who will receive which treatment (each gets a different treatment)

Answer 51

give the two (or more) treatments to each subject over time, in random order, so we have repeated measures for each subject

Answer 52

established IRB (partly in response to Tuskegee Syphilis study - also Stanford Prison experiment, Milgram experiment) 3 main aims: - respect for persons (consent) - beneficence (maximize benefit while minimizing harm - study will be beneficial) - justice (will the study be worthwhile?)

Answer 53

summarize data about two categorical variables (or factors) collected on the same set of individuals - Each factor can have any number of levels. If the row factor has "r" levels and the column factor has "c" levels, we say that the two-way table is an "r by c" table

Answer 54

- We can examine each factor in a two-way table separately by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percents - total all rows or total all columns

Answer 55

is the distribution of one factor for each level of the other factor - fix either a row or a column and calculate percentages across that fixed row or column - A conditional percent is computed using the counts within a single row or a single column. The denominator is the corresponding row or column total (rather than the table grand total)
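
A minimal Python sketch contrasting marginal and conditional distributions in a 2 by 2 table. The factor names and counts are made up for illustration.

```python
# Hypothetical two-way table of counts: rows = smoking status, columns = disease status.
table = {
    "smoker":     {"disease": 30, "no disease": 70},
    "non-smoker": {"disease": 10, "no disease": 190},
}

row_totals = {row: sum(cols.values()) for row, cols in table.items()}
grand_total = sum(row_totals.values())

# Marginal distribution of the row factor: row total / grand total.
marginal_rows = {row: total / grand_total for row, total in row_totals.items()}

# Conditional distribution of the column factor within each row: count / row total.
conditional = {
    row: {col: count / row_totals[row] for col, count in cols.items()}
    for row, cols in table.items()
}

print(marginal_rows)
print(conditional)
```

Here 30% of smokers but only 5% of non-smokers have the disease, a comparison that the grand-total percentages alone would not show.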

Answer 56

outcomes are uncertain, but there is nonetheless a regular distribution of outcomes in a large number of repetitions

Answer 57

We define the probability of any outcome of a random phenomenon as the proportion of times the outcome would occur in a very long series of repetitions (number of times outcome will occur in long series of replications) - description of the pattern in the LONG RUN

Answer 58

Probability models mathematically describe the outcome of random processes. They consist of two parts: 1) S= Sample Space: This is a list or description of ALL possible outcomes of a random process. An event is a subset of the sample space. 2) A probability assigned for each possible simple event in the sample space S

Answer 59

Discrete variables that can take on only certain values (a whole number or a descriptor) - sample with finite number of outcomes (ex: blood types - there are only 4)

Answer 60

Continuous variables that can take on any one of an infinite number of possible values over an interval (ex: height, weight, BMI generally continuous - can be any number within lower/upper bound) - have a minimum/lower bound, upper bound but there are unlimited values within

Answer 61

- Probabilities range from 0 (no chance) to 1 (event has to happen) -- For any event A, P(A) is between 0 and 1 - The probability of the complete sample space S must equal 1: P(sample space) = 1 (probabilities of all outcomes must add up to 1) - Complement rule: The probability that an event A does not occur (not A) equals 1 minus the probability that it does occur: P(not A) = 1 - P(A)

Answer 62

Two events are disjoint or mutually exclusive if they can never happen together (have no outcome in common)

Answer 63

Addition rules for disjoint events: When two events A and B are disjoint, P(A or B) = P(A) + P(B) General addition rule for ANY two events A and B: P(A or B) = P(A) + P(B) - P(A and B)
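
The addition rules can be checked by direct counting on a small sample space. This Python sketch assumes one roll of a fair six-sided die; the events A, B, and C are made up for illustration, and `prob` is a hypothetical helper.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed equally likely outcomes)

def prob(event):
    """P(event) = favorable outcomes / all outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}  # even number
B = {4, 5, 6}  # greater than 3
C = {1}        # rolling a 1; disjoint from A

# General addition rule: subtract the overlap so it is not counted twice.
p_a_or_b = prob(A) + prob(B) - prob(A & B)

# Disjoint events: no overlap, so the probabilities simply add.
p_a_or_c = prob(A) + prob(C)

print(p_a_or_b, p_a_or_c)
```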

Answer 64

contain an infinite number of events - We use density curves to model continuous probability distributions - They assign probabilities over the range of values making up the sample space

Answer 65

Events are defined over intervals of values - The total area under a density curve represents the whole population (sample space) and equals 1 (100%) - Probabilities are computed as areas under the corresponding portion of the density curve for the chosen interval - The probability of an event being equal to a single numerical value is zero when the sample space is continuous Area under curve between specified points = P(A) - probability of event A happening

Answer 66

two events are independent if knowing that one event is true or has happened does not change the probability of the other event - If knowledge of the first event affects the second -> dependent

Answer 67

reflect how the probability of an event can be different if we know that some other event has occurred or is true The conditional probability of event B, given event A is: P(B|A) = P(A and B)/P(A) (this is the probability B will occur given that A has already occurred) When two events A and B are independent, P(B|A) = P(B). No information is gained from the knowledge of event A.

Answer 68

P(B|A) = P(B) = P(B|A complement)

Answer 69

General multiplication rule: The probability that ANY two events, A and B both occur is: P(A and B) = P(A)P(B|A) Multiplication rule for independent events: If A and B are independent, then: P(A and B) = P(A)P(B)
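
A Python sketch of the multiplication rules, again assuming one roll of a fair die. The events are made up, and `prob` and `cond_prob` are hypothetical helpers.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed)

def prob(event):
    return Fraction(len(event & sample_space), len(sample_space))

def cond_prob(b, a):
    """P(B|A) = P(A and B) / P(A)"""
    return prob(a & b) / prob(a)

A = {2, 4, 6}     # even number
B = {4, 5, 6}     # greater than 3
C = {1, 2, 3, 4}  # at most 4

# General multiplication rule holds for any two events.
p_a_and_b = prob(A) * cond_prob(B, A)

# A and C happen to be independent: P(C|A) equals P(C), so P(A and C) = P(A)P(C).
a_c_independent = cond_prob(C, A) == prob(C)
a_b_independent = cond_prob(B, A) == prob(B)

print(p_a_and_b, a_c_independent, a_b_independent)
```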

Answer 70

are used to represent probabilities graphically and facilitate computations

Answer 71

P(A|B) is generally not equal to P(B|A). If we know the conditional probability P(B|A) and the individual probability P(A), we can use Bayes' theorem to find the conditional probability P(A|B). The equation is on the equation sheet.
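
A minimal Python sketch of Bayes' theorem applied to a diagnostic test. It uses the standard form P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|not A)P(not A)], which may be laid out differently on the course equation sheet. Note that it also needs P(B|not A). The prevalence, sensitivity, and specificity values are made up, and `bayes_posterior` is a hypothetical helper name.

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B), with P(B) expanded over A and not A."""
    p_not_a = 1 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return p_b_given_a * p_a / p_b

# Hypothetical screening test.
prevalence = 0.01   # P(disease)
sensitivity = 0.95  # P(positive | disease)
specificity = 0.90  # P(negative | no disease)

# P(disease | positive): the false positive rate is 1 - specificity.
p_disease_given_positive = bayes_posterior(sensitivity, prevalence, 1 - specificity)
print(p_disease_given_positive)
```

With these made-up numbers, P(positive | disease) is 0.95 but P(disease | positive) is under 0.10, which shows how different the two conditional probabilities can be.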

Answer 72

probability you get negative result when you don't have the disease - P(negative result given negative disease status)

Answer 73

how likely it is to give positive result when you have the disease (want this to be close to 1) - how accurate the test is P( positive result given positive disease status)

Answer 74

a family of symmetrical, bell-shaped density curves defined by a mean u ("mu") and a standard deviation (sigma): N(u, sigma) Normal curves are used to model many biological variables. They can describe a population distribution or a probability distribution.

Answer 75

68-95-99.7 rule All normal curves share the same properties: - About 68% of all observations are within 1 standard deviation (sigma) of the mean (mu), i.e. in the range mu - sigma to mu + sigma - About 95% of all observations are within 2 sigma of the mean mu - Almost all (99.7%) observations are within 3 sigma of the mean; observations farther out are probably outliers To obtain any other area under a normal curve, use Table B
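
The 68-95-99.7 rule can be checked with software instead of Table B. This Python sketch uses `statistics.NormalDist` from the standard library (Python 3.8 or later); `area_within` is a hypothetical helper name.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard Normal curve between -k and +k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

Because every Normal curve standardizes to N(0,1), the same areas apply for any mu and sigma.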

Answer 76

We can standardize data by computing a z score: z = (x-mu)/sigma where x= an observation - If x has the N(mu, sigma) distribution, then z has the N(0,1) distribution - Mean of 0, sd of 1 is the standard Normal distribution

Answer 77

measures the number of standard deviations that a data value x is from the mu - When x is 1 standard deviation larger than the mean, then z = 1 - When x is 2 standard deviations larger than the mean, then z = 2 When x is larger than the mean, z is positive. When x is smaller than the mean, z is negative. The area under N(0,1) for a single value of z is zero

Answer 78

area to the right of z = 1 - area left of z OR = area left of -z

Answer 79

- find the desired area/proportion in the body of the table - then read the corresponding z-value from the left column and top row - percentile corresponds to area under curve

Answer 80

One way to assess if a data set has an approximately Normal distribution is to plot the data on a QQ Plot (assess normality of data) - The data points are ranked and the percentile ranks are converted to z-scores. The z-scores are then used for the horizontal axis and the actual data values are used for the vertical axis. Use technology to obtain normal quantile plots - If the data have approximately a Normal distribution, the Normal quantile plot will have roughly a straight-line pattern (if straight line, then data probably normally distributed)

Answer 81

- Different random samples taken from the same population will give different statistics, but there is a predictable pattern in the long run - A statistic computed from a random sample is a random variable The sampling distribution of a statistic is the probability distribution of that statistic for samples of a given size n taken from a given population - Every time you do simple random sample, get slightly different average (due to sampling error and non-sampling error (any error involving human - data collection, etc.))

Answer 82

The mean of the sampling distribution of x bar is mu. - There is no tendency for a sample average to fall systematically above or below mu, even if the population distribution is skewed. - x bar is an unbiased estimate of the population mean mu. The standard deviation of the sampling distribution of means is sigma/square root of n. - The standard deviation of the sampling distribution measures how much the sample statistic x bar varies from sample to sample - Averages are less variable than individual observations

Answer 83

When a variable in a population is Normally distributed, the sampling distribution of the sample mean x bar is also Normally distributed Population: N(mu, sigma) Sampling distribution: N(mu, sigma/square root of n)

Answer 84

When the sampling distribution is Normal, we can standardize the value of a sample mean x bar to obtain a z-score. This z-score can then be used to find areas under the sampling distribution from the Normal probability table. z = (x bar - mu) / (sigma / square root of n) Here we work with the sampling distribution, and sigma/square root of n is its standard deviation (indicative of spread)
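
A short Python sketch of standardizing a sample mean with z = (x bar - mu) / (sigma / square root of n). The population values, sample size, and observed mean are made up, and `statistics.NormalDist` stands in for the Normal probability table.

```python
from math import sqrt
from statistics import NormalDist

mu = 100      # hypothetical population mean
sigma = 15    # hypothetical population standard deviation
n = 25        # sample size
x_bar = 106   # hypothetical observed sample mean

standard_error = sigma / sqrt(n)  # standard deviation of the sampling distribution
z = (x_bar - mu) / standard_error

# Area to the right of z = 1 - area to the left of z.
p_mean_above = 1 - NormalDist().cdf(z)
print(standard_error, z, p_mean_above)
```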

Answer 85

When randomly sampling from ANY population with mean mu and standard deviation (sigma), when n is large enough, the sampling distribution of x bar is approximately Normal: N(mu, sigma/square root of n) - The larger the sample size n, the better the approximation of Normality - This is very useful in inference: Many statistical tests assume Normality for the sampling distribution. The central limit theorem tells us that, if the sample size is large enough, we can safely make this assumption even if the raw data appear non-Normal
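
The central limit theorem can be illustrated with a small simulation. This Python sketch assumes a right-skewed exponential population with mu = 1 and sigma = 1. The sample size, number of samples, and seed are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
n = 40                  # size of each sample
num_samples = 2000      # how many samples to draw

# Each entry is the mean of one random sample from a skewed population (mu = 1, sigma = 1).
sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)  # should be close to mu = 1
sd_of_means = statistics.stdev(sample_means)   # should be close to sigma / sqrt(n)
theoretical_sd = 1 / n ** 0.5

print(mean_of_means, sd_of_means, theoretical_sd)
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is strongly skewed.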

Answer 86

It depends on the population distribution. More observations are required if the population distribution is far from Normal. - A sample size of 25 or more is generally enough to obtain a Normal sampling distribution from a skewed population, even with mild outliers in the sample - A sample size of 40 or more will typically be good enough to overcome an extremely skewed population and mild (but not extreme) outliers in the sample In many cases, n=25 isn't a huge sample. Thus, even for strange population distributions, we can assume a Normal sampling distribution of the sample mean and work with it to solve problems

Answer 87

- Sometimes we are told that a variable has an approximately Normal distribution (e.g. large studies on human height or bone density) - Most of the time, we just don't know. All we have is sample data. - We can summarize the data with a histogram and describe its shape. - If the sample is random, the shape of the histogram should be similar to the shape of the population distribution. - The central limit theorem can help guess whether the sampling distribution should look roughly Normal or not

Answer 88

As the number of randomly drawn observations (n) in a sample increases: - the mean of the sample (x bar) gets closer and closer to the population mean mu (quantitative variable) - the sample proportion (p hat) gets closer and closer to the population proportion p (categorical variable) x bar should be getting closer to population mean as sample size increases

Answer 89

When sampling randomly from a given population: - The law of large numbers describes what would happen if we took samples of increasing size n - A sampling distribution describes what would happen if we took all possible random samples of a fixed size n Both are conceptual ideas with many important practical applications. We rely on their known mathematical properties but we don't actually build them from data

Answer 90

- distribution of x bar -> normally distributed - mean of x bar -> close to the population mean mu (mean of sample mean close to mean of population) - spread/standard deviation of sampling distribution of x bar ALWAYS smaller than the population spread/standard deviation ***

Answer 91

P (A|+) | probability that you have the disease given + result on test (if positive result, how likely that you have the disease?)

Answer 92

P(Disease complement | negative)

Answer 93

P (not A) P(Ac) P(A bar)

Answer 94

For her: strong correlation is r of (-1 to -.7) or (.7 to 1) moderate correlation: (.4 to .69) or (-.4 to -.69) weak correlation: (-.39 to .39)

Answer 95

y value when x=0 | note, it may not make sense in the context

Answer 96

b0: When the ___ is 0, the predicted ____ is ___. b1: For a ___ unit increase in____, there is, on average, a ____ increase in_____.

Answer 97

variation: spread (how spread out the data is) standard deviation: measures spread (large standard deviation = more spread out data) variance: standard deviation squared

(122 cards)