Biostats Test 1 Flashcards

1
Q

Statistics

A

the science of data

2
Q

Data

A

numbers with a context

3
Q

Biostatistics

A

the application of statistics to topics in biology, including, but not limited to the design and analysis of biological experiments and observational studies

4
Q

Descriptive Statistics

A

Methods of organizing, summarizing and presenting data in an informative way

5
Q

Inferential Statistics

A

Methods for drawing conclusions about a phenomenon (population) on the basis of data (sample)
- draw conclusions about hypotheses

6
Q

Population vs. sample

A

Population: all subjects or items of interest (whose size, the number of subjects in the population, is denoted by N)
Sample: a group (or subset) selected from a population whose size is denoted by n

  • Many different samples can be selected from any given population
  • The number of distinct samples depends on the size of both the population and the sample
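
As a hedged illustration of the last point: if a sample is treated as an unordered set of n distinct individuals drawn from a population of size N (an assumption, since the card does not say how samples are counted), the number of distinct samples is the binomial coefficient "N choose n". The function name is made up for this sketch.

```python
import math

def count_distinct_samples(N, n):
    """Number of unordered samples of size n from a population of size N."""
    return math.comb(N, n)

# Example: a population of 10 individuals and samples of size 3
print(count_distinct_samples(10, 3))  # 120
```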
7
Q

Data

A

observations (such as measurements, genders or survey responses) that have been collected

8
Q

Parameter

A

a number that describes a characteristic of a population

9
Q

Statistic

A

a number that describes a characteristic of a sample (aka sample statistics)
- The observed value of a statistic is used to estimate the unobserved value of a parameter

10
Q

Unbiased statistic

A

A statistic is unbiased if the mean of its sampling distribution is the same as the parameter it is intended to estimate

11
Q

Individuals

A

Individuals are the objects described in a set of data

- Individuals may be people, animals, plants or things (ex: freshmen, newborns, fields of corn, cells)

12
Q

Variable

A

A variable is any property that characterizes an individual.

  • A variable can take different values for different individuals (ex: age, gender, blood pressure, blood types, flower color)
  • two types: quantitative, categorical
13
Q

Quantitative variable

A

Some quantity assessed or measured for each individual. We can then report the average of all individuals.
- Numeric (ex: age in years, blood pressure)

14
Q

Categorical variable

A

Some characteristic describing each individual. We can then report the count or proportion of individuals with that characteristic.

  • Gender (male, female), blood type (A, AB, O, B), flower color (white, yellow, red)
  • finite number of categories
  • don’t calculate averages for categorical variables - instead, often calculate proportions

pie charts, bar graphs often used to represent

15
Q

Histograms

A

This is a summary graph for a single variable. Histograms are useful to understand the pattern of variability in the data, especially for large data sets

A histogram is a graph in which the horizontal scale represents classes of data values and the vertical scale represents frequencies. The heights of the bars correspond to the frequency values and the bars are drawn adjacent to each other.

  • tells us shape and distribution
  • break data into bins/ranges of equal length
16
Q

Dotplots and stemplots

A

These are graphs for the raw data. They are useful to describe the pattern of variability in the data, especially for small data sets

  • Also called stem and leaf plots
  • usually when 20 or fewer observations (if 21+, use histogram)

A graph in which each data value is plotted as a point along a scale of values. Dots representing equal values are stacked.

  • not recommended unless small sample size (few observations)
  • dots: where the observations are located along the line
17
Q

Measures of center

A

The center of a data set is a representative or average value that indicates where the middle of the data set is located.

  • Mean
  • Median
  • Mode
18
Q

Mean

A

The mean or arithmetic average of a data set is the measure of center found by adding the values and dividing the total by the number of values
- sample mean = summation of all observations / number of values in sample

19
Q

Median

A

The median of a data set is the measure of center that is the middle value when the data values are arranged in increasing or decreasing order.

To find the median, first sort the values, then:

  1. If the number of values is odd, the median is the number located in the exact middle of the list
  2. If the number of values is even, the median is found by computing the mean of the two middle numbers
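
The two cases above can be sketched in a few lines of Python. This is only a minimal illustration; the function name is chosen for this example.

```python
def median(values):
    """Median of a list of numbers: sort first, then take the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the value in the exact middle of the sorted list
        return ordered[mid]
    # Even count: the mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0
```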
20
Q

Mode

A

The mode of a data set is the value that occurs most frequently.

  • When two values occur with the same (greatest) frequency, each one is a mode and the data set is bimodal.
  • When more than two values occur with the same (greatest) frequency, each is a mode and the data set is multimodal.
  • When no value is repeated, there is no mode.
  • One mode: unimodal
21
Q

Skewed data distribution

A

A distribution of data is skewed if it is not symmetric and extends more to one side than the other

  • If tail is on left (skinny side), mean pulled towards left (mean < median)
  • If tail is on right, mean pulled towards right (mean > median)

Left skew (negative skew): the mean and median are to the LEFT of the mode (mean < median)
Symmetric (zero skew): the mean, median, and mode are the same
Right-skew (positive skew): the mean and median are to the RIGHT of the mode (mean > median)

22
Q

The Best Measure of Center

A

Each measure of center has advantages and disadvantages

  • Mean: is unique in that it takes all data values into account. However, it is NOT resistant to skew and extreme values (outliers)
  • Median: is resistant to skew and outliers
  • For data that is approximately symmetric with only one mode, the mean, median, mode and midrange will be approximately the same
  • For data that is obviously asymmetric, you should report both the mean and the median
23
Q

Variation

A

a measure of the amount that values within a data set vary among themselves

24
Q

Range

A

The range of a set of data is the difference between the maximum value and the minimum value
- Range = max - min

25
Q

Standard deviation

A

The standard deviation of a set of sample values is a measure of variation of values about the mean

  • The standard deviation “s” is used to describe the variation around the mean.
  • Like the mean, it is NOT resistant to skew or outliers.

(the sample standard deviation s is used to estimate the population standard deviation)

26
Q

Variance

A

The variance of a set of values is a measure of variation equal to the square of the standard deviation (s^2)

(the sample variance s^2 is used to estimate the population variance)
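
The cards do not give the formula, so the sketch below assumes the usual sample definitions: the variance divides the sum of squared deviations from the mean by n - 1, and the standard deviation is its square root. The function names are illustrative.

```python
import math

def sample_variance(values):
    """Sample variance s^2, using the n - 1 denominator (assumed convention)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Sample standard deviation s, the square root of the variance."""
    return math.sqrt(sample_variance(values))

data = [1, 2, 3, 4, 5]          # made-up values
print(sample_variance(data))    # 2.5
print(sample_sd(data) ** 2)     # approximately 2.5 again: variance = s^2
```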

27
Q

Z-score

A

(also known as standardized score)

A z-score can be used to compare values from different data sets.

  • The Z-score is the number of standard deviations that a given value x is above or below the mean.
  • If the z-score is 1, the value is 1 standard deviation above the mean. If the z-score is -1.5, the value is 1.5 standard deviations below the mean.

use when we want to compare different populations
- need to standardize data to make comparable
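
A small sketch of how standardizing makes values from different data sets comparable. The scores, means and standard deviations are made-up numbers for illustration only.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical example: two exams graded on different scales
z_exam_a = z_score(85, mean=75, sd=5)    # 2.0 -> 2 sd above its mean
z_exam_b = z_score(90, mean=80, sd=10)   # 1.0 -> 1 sd above its mean
print(z_exam_a, z_exam_b)
```

Although the raw score on the second exam is higher, the first score is more unusual relative to its own distribution.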

28
Q

positive vs. negative z-score

A

Positive z-score: indicates that the value is above the mean

Negative z-score: indicates that the value is below the mean.

29
Q

Quartiles

A

Quartiles divide the sorted data values into four equal parts.

  • The median divides the data into two equal components.
  • Q1: 25% of values are less than or equal to Q1, and 75% of values are greater than or equal to Q1
  • Q2: equal to the median
  • Q3: 75% of values are less than or equal to Q3, and 25% of values are greater than or equal to Q3
30
Q

Exploratory data analysis

A

the process of using statistical tools to investigate data sets in order to understand their important characteristics, including: center, variation, distribution, outliers and time

31
Q

Outlier

A

An outlier is a value that is located very far away from almost all of the other values. Relative to the other data, an outlier is an extreme value
- An outlier can have a dramatic effect on the mean, the standard deviation, and the scale of the histogram so that the true nature of the distribution is obscured

32
Q

Five-number summary

A
min
Q1
M (median)
Q3
max
33
Q

Inter-quartile range (IQR)

A

The IQR is the distance between the first and third quartiles (the length of the box in the box plot)

  • IQR = Q3-Q1
  • used to find suspected high and low outliers
34
Q

Calculating outliers

A

An outlier is an individual value that falls outside the overall pattern. How far outside the overall pattern does a value have to fall to be considered a suspected outlier?

  • Suspected low outlier: any value < Q1 - 1.5 IQR
  • Suspected high outlier: any value > Q3 + 1.5 IQR
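
The 1.5 IQR rule above can be sketched as follows. Q1 and Q3 are passed in directly because quartile conventions differ between textbooks and software; the function names and data are illustrative.

```python
def outlier_fences(q1, q3):
    """Lower and upper cutoffs for suspected outliers (1.5 x IQR rule)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def suspected_outliers(values, q1, q3):
    """Values falling below the lower fence or above the upper fence."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

print(outlier_fences(10, 20))                          # (-5.0, 35.0)
print(suspected_outliers([-6, 0, 12, 36], 10, 20))     # [-6, 36]
```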
35
Q

How to draw a boxplot

A
  1. Find the 5 number summary
  2. Construct a scale with values that includes the minimum and maximum data values
  3. Construct a box extending from Q1 to Q3, and draw a line in the box at the median value
  4. Draw lines extending outward from the box to the minimum and maximum data values

(won’t be asked to do this on exam)

36
Q

Bivariate (or paired) data

A

can be analyzed to determine if there is an association between the two variables.
- We explore only linear associations within quantitative data

37
Q

Correlation

A

A correlation exists between two variables when one of them is linearly related to the other in some way
- must be quantitative variables

38
Q

How to investigate the association between two variables

A
  1. Make a scatterplot
    - What type of relationship is there? linear or nonlinear
    - Direction of relationship? positive (as x increases, y increases) or negative (as x increases, y decreases)
    - How strong is the relationship? strong (if you can connect dots), weak (if scattered)
    - Look for potential outliers
39
Q

Linear correlation coefficient (r) - definition + requirements

A

The correlation measures the strength of the linear association between paired x and y quantitative values in a sample. r is a sample statistic that estimates the population correlation coefficient, ρ (rho).

Requirements for making inferences about ρ, using r:

  1. Paired data (x, y) must be a random sample
  2. A scatterplot must confirm that the points approximate a straight-line pattern
  3. Outliers should be removed if they are known to be errors
40
Q

Properties of the correlation coefficient

A
  • The value of r is always between -1 and 1, inclusive (-1 less than or equal to r less than or equal to 1)
  • The value of r does not change if all values of either variable are converted to a different scale
  • The value of r is not affected by the choice of x and y (ex: doesn’t matter if BMI x or y, cholesterol, y or x)
  • r measures the strength and direction of a linear association

Negative correlation: - slope
Positive correlation: + slope
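
A hedged sketch that computes r from its usual definition (the cards do not state the formula, so this form is an assumption) and checks two of the properties listed above: swapping x and y, or rescaling a variable, leaves r unchanged. The data are made up.

```python
import math

def correlation(xs, ys):
    """Linear correlation coefficient r for paired quantitative data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_xy = correlation(xs, ys)
r_yx = correlation(ys, xs)                          # roles of x and y swapped
r_scaled = correlation([2.54 * x for x in xs], ys)  # x converted to another unit
print(round(r_xy, 3), round(r_yx, 3), round(r_scaled, 3))
```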

41
Q

Interpreting the correlation coefficient

A

If r is close to zero, we conclude that there is no significant linear correlation between x and y.

If r is close to -1 or 1, we conclude that there is significant linear correlation (values closer to -1 or 1 indicate stronger correlation)

  • CANNOT conclude that there is no relationship at all (there could be another relationship like a parabola)
42
Q

Interpreting r

A

If we conclude that there is a linear correlation between x and y, we can find a linear equation that expresses y in terms of x and that equation can be used to predict values of y for given values of x. (Simple Linear Regression)

The value of r^2 is the proportion of variation in y that is explained by x. In addition to x, there may be a variety of other factors affecting y, such as random variation or other factors not included in the study. We will explore this in more detail with linear regression.

43
Q

Interpreting r - common errors

A
  • Concluding that correlation implies causality (ex: shark attacks and ice cream consumption)
  • Data based on averages: Averages suppress individual variation and may inflate the correlation coefficient (averages may make things look better than they are)
  • Linearity: An association may exist between x and y even when there is no significant linear correlation.

r is not resistant to outliers:

  • Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers
  • Outliers will make a relationship look stronger/weaker than it actually is
44
Q

Simple linear regression

A

The regression equation expresses an association between x and y.

Variable x: the independent, predictor, or explanatory variable
Variable y: the dependent or response* variable

Data comes in pairs (xi, yi) where xi is the ith observation for variable x and yi is the ith observation for variable y

A linear regression model with one predictor variable is a simple linear regression (SLR) model

x is what we are using to predict y

45
Q

least-squares regression line: definition

A

the unique line such that the sum of the (signed) vertical distances between the data points and the line is zero, and the sum of the squared vertical distances is the smallest possible

  • same as line of best fit
  • line: smallest amount of vertical distances squared (minimizes error)
  • sum of all vertical distances has to = 0
  • always has to pass through the point (x bar, y bar)
  • Only for linear associations
  • Don’t compute the regression line until you have confirmed that there is a linear relationship between x and y - always plot the raw data first to confirm linear association (always do a scatterplot and correlation coefficient first)
46
Q

least squares regression line: notation and interpretations

A

y hat: the predicted value of y for a given value of x
y hat = intercept + slope × x

*always have to write y hat

slope of the regression line: describes how much we expect y to change, on average, for every unit change in x
intercept: a necessary mathematical descriptor of the regression line (it does not describe a specific property of the data)

47
Q

Slope of the regression line:

A

b1 = r (sy/sx)

  • r: the correlation coefficient between x and y
  • sy: standard deviation of the response variable y
  • sx: standard deviation of the explanatory variable x
48
Q

Intercept:

A

b0 = y bar - b1 (x bar)

- x bar and y bar are the respective means of the x and y variables
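
The slope and intercept formulas (b1 = r sy/sx and b0 = y bar - b1 x bar) can be combined into a short sketch. The summary statistics below are made-up numbers and the function names are illustrative.

```python
def regression_line(r, sx, sy, x_bar, y_bar):
    """Intercept b0 and slope b1 of the least-squares line from summary statistics."""
    b1 = r * sy / sx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

def predict(b0, b1, x):
    """y hat: the predicted value of y for a given x."""
    return b0 + b1 * x

b0, b1 = regression_line(r=0.5, sx=2, sy=4, x_bar=10, y_bar=20)
print(b0, b1)               # 10.0 1.0
print(predict(b0, b1, 10))  # 20.0, the line passes through (x bar, y bar)
```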

49
Q

coefficient of determination

A

coefficient of determination = r^2

It is the square of the correlation coefficient. It represents the fraction of the variation (%) in y that is explained by the regression model.

  • Always between 0 and 1
  • The closer r^2 gets to 1, the better the model explains (fits) the data

Interpretation ex:
- If r = 0.87, then r^2 ≈ 0.76: about 76% of the variation in children’s heights is explained by the regression model with FEV. In general terms, the regression model explains 76% of the variation in y.

50
Q

Outliers vs. influential points

A

Outlier: an observation that lies outside the overall pattern
“Influential individual/point”: an observation that markedly changes the regression if removed. This is often an isolated point

51
Q

Residuals

A

the vertical distances from each point to the least-squares regression line

  • The sum of all the residuals is by definition 0
  • Outliers have unusually large residuals (in absolute value) - residuals can be positive or negative
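
A minimal sketch of residuals for a least-squares fit, with made-up data. The slope is computed here as the sum of cross-deviations over the sum of squared x deviations, which is algebraically the same as r (sy/sx); the function names are illustrative.

```python
def least_squares(xs, ys):
    """Intercept b0 and slope b1 of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def residuals(xs, ys):
    """Observed y minus predicted y hat for each point (can be + or -)."""
    b0, b1 = least_squares(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

res = residuals([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print([round(e, 2) for e in res])
print(abs(sum(res)) < 1e-9)  # True: the residuals sum to zero (up to rounding)
```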
52
Q

Association + Causation (and reason)

A

Association, however strong, does NOT imply causation.

  • The observed association could have an external cause.
  • A lurking variable is a variable that is not among the explanatory or response variables in a study, and yet may influence the relationship between the variables studied
  • We say that two variables are confounded when their effects on a response variable cannot be distinguished from each other.
53
Q

Establishing causation

A

Establishing causation from an observed association can be done if:

  • The association is strong
  • The association is consistent
  • Higher doses are associated with stronger responses
  • The alleged cause precedes the effect
  • The alleged cause is plausible
54
Q

Observational study

A

record data on individuals without attempting to influence the responses
- not imposing anything on anyone or manipulating anything about individuals

55
Q

Experimental study

A

deliberately impose or assign a treatment on individuals and record their responses

  • influential factors can be controlled
  • implement something that changes person’s lifestyle, etc.
56
Q

confounded variables

A

Two variables are confounded when their effects on a response variable cannot be distinguished
- Observational studies often fail to yield clear causal conclusions because the explanatory variable is confounded with lurking variables

Lurking variables: didn’t collect info on it (Can’t account for it)
Confounding variables: collected the info while doing the study (can account for it)

Can control for confounding variables in experiments generally but not in observational studies

57
Q

Population vs. sample

A

Population: the entire group of individuals in which we are interested but can’t usually assess directly

Sample: the part of the population we actually examine and for which we do have data

58
Q

Parameter vs. Statistic

A

Parameter: a number summarizing a characteristic of the Population
Statistic: a number summarizing a characteristic of a Sample

59
Q

Randomization in sampling

A

Probability sampling: individuals or units are randomly selected; the sampling process is unbiased

Methods that are NOT ideal but only done if probability sampling not possible:

  • Voluntary response sampling: individuals choose to be involved
  • Convenience sampling: ask whoever is around (mail, street) or take the next 10 units
60
Q

Simple random sample (SRS)

A

Made of randomly selected individuals

  • Each individual in the population has the same probability of being in the sample
  • All possible samples of size n have the same chance of being drawn

How to choose an SRS:

  • draw from a hat (lottery style)
  • flip a coin
  • use a table of published random numbers
  • use software that generates random numbers
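
As one possible sketch of the software option, Python's standard `random` module can draw an SRS without replacement. The population of ID numbers and the function name are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals; every sample of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical population: individuals labeled 1 through 100
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=42)
print(sorted(sample))
```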
61
Q

Sample Survey

A

an observational study that relies on a random sample drawn from the entire population

  • Opinion polls are sample surveys that typically use voter registries or telephone numbers to select their samples
  • In epidemiology, sample surveys are used to establish the incidence (rate of new cases per year) and the prevalence (rate of all cases at one point in time) of various medical conditions, diseases, and lifestyles.
62
Q

Some survey challenges

A
  • undercoverage or selection bias: parts of the population are systematically left out (based on the way you choose to distribute the survey)
  • nonresponse: some people choose not to answer/participate
  • wording effects: biased or leading questions, complicated/confusing statements can influence survey results
  • response bias: fancy term for lying or forgetting (especially on sensitive/personal issues) - can be exacerbated by survey method (in person vs. by phone or online) - more likely to lie to someone’s face than online
63
Q

case control studies

A

start with 2 random samples of individuals with different outcomes and look for exposure factors in the subjects’ past (“retrospective”)
- common for rare diseases

64
Q

cohort studies

A

enlist individuals of common demographic, and keep track of them over a long period of time (“prospective”)
- individuals who later develop a condition are compared with those who don’t

65
Q

cross-sectional studies

A

measure the exposure and the outcome at the same time (i.e. surveys)

66
Q

the individuals in an experiment are called the….

A

experimental units

- if they are human, we call them subjects

67
Q

the explanatory variables in an experiment are often called the

A

factors

68
Q

treatments

A

any specific experimental condition applied to the subjects
- if an experiment has several factors, a treatment is a combination of specific levels of each factor

ex: the factor may be the administration of a drug

69
Q

experiments

A

compare the response to a given treatment versus:

  • another treatment
  • the absence of treatment (often called a control)
  • a placebo (a fake treatment)

Experiments randomize the assignment of subjects to treatments.

Experiments use replication: several or many individuals are studied

70
Q

Negative vs. Positive control in cellular biology experiments

A

negative control: expect outcome to stay the same (expect that not going to help/hurt)

positive control: expect outcome to change

71
Q

placebo effect

A

improvement in health or perceived condition due not to any active treatment but only to the patient’s belief that he or she is being cared for or helped

  • therapeutic results on up to 35% of patients
  • neural response to the placebo effect seen as early as the spinal cord
72
Q

Hawthorne effect

A

term used to describe a type of bias that may occur due to behavior modification because of study enrollment

  • also known as “observer effect”
  • blinding can help against bias

if people know what group they are in, or the doctor knows who’s in the placebo group, this may cause a change in the individual’s behavior or the doctor’s treatment
- people behave differently if they know they are being observed by a doctor

73
Q

double blind experiment

A

one in which neither the subjects nor the experimenter(s) know which individuals received which treatment until the experiment is completed

74
Q

completely randomized experimental design

A

individuals are randomly assigned to groups then the groups are assigned to treatments completely at random

75
Q

matched pairs design

A

choose pairs of subjects that are closely matched (like twins but doesn’t have to be) - within each pair, randomly assign who will receive which treatment (each gets a different treatment)

76
Q

repeated measures design

A

give the two (or more) treatments to each subject over time, in random order, so we have repeated measures for each subject

77
Q

Belmont report 1979

A

established IRB (partly in response to Tuskegee Syphilis study - also Stanford Prison experiment, Milgram experiment)

3 main aims:

  • respect for persons (consent)
  • beneficence (maximize benefit while minimizing harm - study will be beneficial)
  • justice (fair selection of subjects; the burdens and benefits of research are distributed fairly)
78
Q

two-way tables

A

summarize data about two categorical variables (or factors) collected on the same set of individuals
- Each factor can have any number of levels. If the row factor has “r” levels and the column factor has “c” levels, we say that the two-way table is an “r by c” table

79
Q

marginal distributions

A
  • We can examine each factor in a two-way table separately by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percents
  • total all rows or total all columns
80
Q

conditional distribution

A

is the distribution of one factor for each level of the other factor

  • fix either a row or column and calculate percentages across that fixed row/column
  • A conditional percent is computed using the counts within a single row or a single column. The denominator is the corresponding row or column total (rather than the table grand total)
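
A short sketch of marginal totals and a conditional distribution for a two-way table. The 2 by 2 table and its counts are entirely made up for illustration.

```python
# Hypothetical table: rows = exposure group, columns = outcome (made-up counts)
table = {
    "exposed":   {"disease": 30, "no disease": 70},
    "unexposed": {"disease": 10, "no disease": 90},
}

def marginal_row_totals(table):
    """Row totals: the marginal distribution of the row factor, in counts."""
    return {row: sum(cells.values()) for row, cells in table.items()}

def conditional_given_row(table, row):
    """Distribution of the column factor within one fixed row.

    The denominator is that row's total, not the table grand total.
    """
    total = sum(table[row].values())
    return {col: count / total for col, count in table[row].items()}

print(marginal_row_totals(table))
print(conditional_given_row(table, "exposed"))
```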
81
Q

Random event

A

outcomes are uncertain, but there is nonetheless a regular distribution of outcomes in a large number of repetitions

82
Q

probability

A

We define the probability of any outcome of a random phenomenon as the proportion of times the outcome would occur in a very long series of repetitions (number of times outcome will occur in long series of replications)
- description of the pattern in the LONG RUN

83
Q

probability models

A

Probability models mathematically describe the outcome of random processes. They consist of two parts:

1) S= Sample Space: This is a list or description of ALL possible outcomes of a random process. An event is a subset of the sample space.
2) A probability assigned for each possible simple event in the sample space S

84
Q

Discrete sample space

A

Discrete variables that can take on only certain values (a whole number or a descriptor) - sample space with a finite number of outcomes (ex: blood types - there are only 4)

85
Q

Continuous sample space

A

Continuous variables that can take on any one of an infinite number of possible values over an interval (ex: height, weight, BMI generally continuous - can be any number within lower/upper bound)
- have a minimum/lower bound, upper bound but there are unlimited values within

86
Q

Probability rules

A
  • Probabilities range from 0 (no chance) to 1 (event has to happen) – For any event A, P(A) is between 0 and 1
  • The probability of the complete sample space S must equal 1: P(sample space) = 1 (probabilities of all outcomes must add up to 1)
  • Complement rule: The probability that an event A does not occur (not A) equals 1 minus the probability that it does occur: P(not A) = 1 - P(A)
87
Q

disjoint events

A

Two events are disjoint or mutually exclusive if they can never happen together (have no outcome in common)

88
Q

Addition Rules

A

Addition rules for disjoint events: When two events A and B are disjoint, P(A or B) = P(A) + P(B)

General addition rule for ANY two events A and B:
P(A or B) = P(A) + P(B) - P(A and B)
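
The two addition rules can be checked on a small example. The sketch assumes one roll of a fair die with equally likely outcomes; the events A, B and C are chosen only for illustration.

```python
from fractions import Fraction

SAMPLE_SPACE = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed example)

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))

A = {2, 4, 6}  # roll is even
B = {5, 6}     # roll is greater than 4 (overlaps A at 6)
C = {1}        # roll is 1 (disjoint from A)

p_a_or_b = prob(A) + prob(B) - prob(A & B)  # general addition rule
p_a_or_c = prob(A) + prob(C)                # disjoint events: nothing to subtract
print(p_a_or_b, p_a_or_c)                   # 2/3 2/3
```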

89
Q

Continuous sample spaces

A

contain an infinite number of events

  • We use density curves to model continuous probability distributions
  • They assign probabilities over the range of values making up the sample space
90
Q

Continuous Probabilities + intervals (density curves)

A

Events are defined over intervals of values

  • The total area under a density curve represents the whole population (sample space) and equals 1 (100%)
  • Probabilities are computed as areas under the corresponding portion of the density curve for the chosen interval
  • The probability of an event being equal to a single numerical value is zero when the sample space is continuous

Area under curve between specified points = P(A) - probability of event A happening

91
Q

Independent events

A

two events are independent if knowing that one event is true or has happened does not change the probability of the other event
- If knowledge of the first event affects the second -> dependent

92
Q

Conditional probabilities

A

reflect how the probability of an event can be different if we know that some other event has occurred or is true

The conditional probability of event B, given event A is: P(B|A) = P(A and B)/P(A) (this is the probability B will occur given that A has already occurred)

When two events A and B are independent, P(B|A) = P(B). No information is gained from the knowledge of event A.

93
Q

To show independence, you have to show:

A

P(B|A) = P(B) = P(B|A complement)

94
Q

Multiplication rule

A

General multiplication rule: The probability that ANY two events, A and B both occur is: P(A and B) = P(A)P(B|A)

Multiplication rule for independent events: If A and B are independent, then: P(A and B) = P(A)P(B)

95
Q

Tree diagrams

A

are used to represent probabilities graphically and facilitate computations

96
Q

Bayes’ Theorem

A

P(A|B) is generally not equal to P(B|A)
If we know the conditional probability P(B|A) and the individual probability P(A), we can use Bayes’ theorem to find the conditional probability P(A|B)

equation is on equation sheet

97
Q

Specificity

A

probability you get negative result when you don’t have the disease
- P(negative result given negative disease status)

98
Q

Sensitivity

A

how likely it is to give positive result when you have the disease (want this to be close to 1) - how accurate the test is
- P(positive result given positive disease status)
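
Sensitivity and specificity connect to Bayes’ theorem when we ask for the probability of disease given a positive result. The course equation sheet is not reproduced in these cards, so the sketch below assumes the standard form of the theorem; the prevalence value and the function name are hypothetical.

```python
def prob_disease_given_positive(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' theorem.

    sensitivity = P(positive | disease)
    specificity = P(negative | no disease)
    prevalence  = P(disease)
    """
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)

# Hypothetical test: 99% sensitive, 95% specific, disease prevalence 1%
print(round(prob_disease_given_positive(0.99, 0.95, 0.01), 3))
```

With a rare disease, even an accurate test can give a fairly low probability of disease after a positive result, which is why P(A|B) and P(B|A) should not be confused.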

99
Q

Normal (Gaussian) distributions

A

a family of symmetrical, bell-shaped density curves defined by a mean mu and a standard deviation sigma: N(mu, sigma)

Normal curves are used to model many biological variables. They can describe a population distribution or a probability distribution.

100
Q

Rule for any N(mu, sigma) - all normal curves:

A

68-95-99.7 rule

All normal curves share the same properties:

  • About 68% of all observations are within 1 standard deviation (sigma) of the mean (mu), i.e. in the range from mu - sigma to mu + sigma
  • About 95% of all observations are within 2 sigma of the mean mu
  • Almost all (99.7%) observations are within 3 sigma of the mean (values farther out are probably outliers)

To obtain any other area under a normal curve, use Table B
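
The 68-95-99.7 figures can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library in place of Table B; the function name is illustrative.

```python
from statistics import NormalDist

def area_within(k):
    """Area under the standard Normal curve within k standard deviations of the mean."""
    z = NormalDist()  # mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```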

101
Q

Standard Normal Distribution

A

We can standardize data by computing a z score: z = (x-mu)/sigma where x= an observation

  • If x has the N(mu, sigma) distribution, then z has the N(0,1) distribution
  • N(0,1), with a mean of 0 and an sd of 1, is the standard Normal distribution
102
Q

z score

A

measures the number of standard deviations that a data value x is from the mean mu

  • When x is 1 standard deviation larger than the mean, then z = 1
  • When x is 2 standard deviations larger than the mean, then z = 2

When x is larger than the mean, z is positive.
When x is smaller than the mean, z is negative.

The area under N(0,1) for a single value of z is zero

103
Q

Z table: finding area to the right of a z-value

A

area to the right of z
= 1 - area left of z OR
= area left of -z
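
Both routes to the right-tail area can be compared in a short sketch. It assumes `NormalDist().cdf` stands in for the z table, and z = 1.25 is an arbitrary example value.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)
z = 1.25                   # arbitrary example value

area_left = std_normal.cdf(z)
right_by_complement = 1 - area_left       # 1 - area left of z
right_by_symmetry = std_normal.cdf(-z)    # area left of -z

print(round(right_by_complement, 4), round(right_by_symmetry, 4))
```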

104
Q

Area -> Z score

A
  • find the desired area/proportion in the body of the table
  • then read the corresponding z-value from the left column and top row
  • percentile corresponds to area under curve
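
The reverse lookup (area to z) can be sketched with `NormalDist().inv_cdf`, used here as a stand-in for reading the table backwards. The percentiles are example values.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

# z-value with 97.5% of the area to its left (the 97.5th percentile)
z_975 = std_normal.inv_cdf(0.975)
# The 50th percentile of a symmetric curve sits at the mean
z_median = std_normal.inv_cdf(0.5)

print(round(z_975, 2), round(z_median, 2))
```
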
105
Q

Normal Quantile Plots (QQ Plots)

A

One way to assess if a data set has an approximately Normal distribution is to plot the data on a QQ Plot (assess normality of data)

  • The data points are ranked and the percentile ranks are converted to z-scores. The z-scores are then used for the horizontal axis and the actual data values are used for the vertical axis. Use technology to obtain normal quantile plots
  • If the data have approximately a Normal distribution, the Normal quantile plot will have roughly a straight-line pattern (if straight line, then data probably normally distributed)
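
A rough sketch of how the points of a normal quantile plot could be built by hand. The data are invented, and the percentile rank (i - 0.5)/n is one common convention that I am assuming here; software packages may use a slightly different one.

```python
from statistics import NormalDist

data = [4.1, 5.3, 4.8, 6.0, 5.1, 4.5, 5.6, 4.9, 5.2, 5.0]  # hypothetical sample

sorted_values = sorted(data)          # rank the data points
n = len(sorted_values)

# Assumed percentile rank for the i-th smallest value: (i - 0.5) / n
percentiles = [(i - 0.5) / n for i in range(1, n + 1)]
# Convert each percentile rank to a z-score
z_scores = [NormalDist().inv_cdf(p) for p in percentiles]

# Horizontal axis: z-score, vertical axis: actual data value
qq_points = list(zip(z_scores, sorted_values))
for z, value in qq_points:
    print(f"{z:6.2f}  {value}")
```

If the printed pairs fall roughly on a straight line when plotted, the data are plausibly Normal.
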
106
Q

Sampling Distributions

A
  • Different random samples taken from the same population will give different statistics, but there is a predictable pattern in the long run
  • A statistic computed from a random sample is a random variable

The sampling distribution of a statistic is the probability distribution of that statistic for samples of a given size n taken from a given population

  • Every simple random sample gives a slightly different average, due to sampling error and non-sampling error (any human error, such as in data collection)
107
Q

Sampling distribution of x bar (the sample mean)

A

The mean of the sampling distribution of x bar is mu.

  • There is no tendency for a sample average to fall systematically above or below mu, even if the population distribution is skewed.
  • x bar is an unbiased estimate of the population mean mu.

The standard deviation of the sampling distribution of means is sigma/square root of n.

  • The standard deviation of the sampling distribution measures how much the sample statistic x bar varies from sample to sample
  • Averages are less variable than individual observations
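
These two facts can be checked by simulation. The sketch below assumes a skewed exponential population with mu = 1 and sigma = 1; the seed, n = 25 and 5000 repetitions are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)
n = 25        # size of each sample
reps = 5000   # number of samples drawn

# Assumed population: exponential with rate 1 (skewed, mu = 1, sigma = 1)
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

mean_of_means = mean(sample_means)   # should be close to mu = 1
sd_of_means = stdev(sample_means)    # should be close to sigma / sqrt(n)
theoretical_sd = 1.0 / sqrt(n)       # 0.2

print(round(mean_of_means, 3), round(sd_of_means, 3), theoretical_sd)
```
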
108
Q

Sample mean for normally distributed populations

A

When a variable in a population is Normally distributed, the sampling distribution of the sample mean x bar is also Normally distributed

Population: N(mu, sigma)
Sampling distribution: N(mu, sigma/square root of n)

109
Q

Standardizing a normal sample distribution

A

When the sampling distribution is Normal, we can standardize the value of a sample mean x bar to obtain a z-score. This z-score can then be used to find areas under the sampling distribution from the Normal probability table.

z = (x bar - mu) / (sigma / square root of n)

Here we work with the sampling distribution
sigma/square root of n is its standard deviation (indicative of spread)
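
A worked sketch of standardizing a sample mean. The numbers mu = 100, sigma = 15, n = 25 and x bar = 106 are hypothetical, and `NormalDist().cdf` stands in for the Normal probability table.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 100, 15, 25   # hypothetical population and sample size
x_bar = 106                  # hypothetical observed sample mean

# Standard deviation of the sampling distribution of x bar
sd_of_x_bar = sigma / sqrt(n)
# z = (x bar - mu) / (sigma / square root of n)
z = (x_bar - mu) / sd_of_x_bar
# Probability of a sample mean at least this large
p_above = 1 - NormalDist().cdf(z)

print(sd_of_x_bar, z, round(p_above, 4))
```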

110
Q

Central Limit Theorem

A

When randomly sampling from ANY population with mean mu and standard deviation sigma, if the sample size n is large enough, the sampling distribution of x bar is approximately Normal: N(mu, sigma/square root of n)

  • The larger the sample size n, the better the approximation of Normality
  • This is very useful in inference: Many statistical tests assume Normality for the sampling distribution. The central limit theorem tells us that, if the sample size is large enough, we can safely make this assumption even if the raw data appear non-Normal
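
One way to see the theorem at work is to compare a Normal benchmark across sample sizes. In this sketch the population is assumed to be exponential (strongly skewed, mu = 1, sigma = 1), and the helper `fraction_within_one_sd` is a made-up name. For a Normal sampling distribution about 68% of sample means should fall within one standard deviation (sigma/square root of n) of mu.

```python
import random
from math import sqrt

random.seed(2)
MU, SIGMA = 1.0, 1.0  # assumed exponential(rate=1) population


def fraction_within_one_sd(n, reps=5000):
    """Share of simulated sample means within sigma/sqrt(n) of mu."""
    sd_of_x_bar = SIGMA / sqrt(n)
    hits = 0
    for _ in range(reps):
        x_bar = sum(random.expovariate(1.0) for _ in range(n)) / n
        if abs(x_bar - MU) <= sd_of_x_bar:
            hits += 1
    return hits / reps


frac_n1 = fraction_within_one_sd(1)    # single observations: far from 68%
frac_n40 = fraction_within_one_sd(40)  # means of 40: close to 68%

print(frac_n1, frac_n40)
```
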
111
Q

How large a sample size

A

It depends on the population distribution. More observations are required if the population distribution is far from Normal.

  • A sample size of 25 or more is generally enough to obtain a Normal sampling distribution from a skewed population, even with mild outliers in the sample
  • A sample size of 40 or more will typically be good enough to overcome an extremely skewed population and mild (but not extreme) outliers in the sample

In many cases, n=25 isn’t a huge sample. Thus, even for strange population distributions, we can assume a Normal sampling distribution of the sample mean and work with it to solve problems

112
Q

How do we know if the population is normal

A
  • Sometimes we are told that a variable has an approximately Normal distribution (e.g. large studies on human height or bone density)
  • Most of the time, we just don’t know. All we have is sample data.
  • We can summarize the data with a histogram and describe its shape.
  • If the sample is random, the shape of the histogram should be similar to the shape of the population distribution.
  • The central limit theorem can help guess whether the sampling distribution should look roughly Normal or not
113
Q

Law of large numbers

A

As the number of randomly drawn observations (n) in a sample increases:

  • the mean of the sample (x bar) gets closer and closer to the population mean mu (quantitative variable)
  • the sample proportion (p hat) gets closer and closer to the population proportion p (categorical variable)

x bar should be getting closer to population mean as sample size increases
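
A simulation sketch of the law of large numbers, assuming fair die rolls as the population (mu = 3.5). The seed and checkpoints are arbitrary.

```python
import random

random.seed(3)
population_mean = 3.5            # mean of a fair six-sided die
checkpoints = (10, 1_000, 100_000)

running_means = {}
total = 0
for i in range(1, 100_001):
    total += random.randint(1, 6)        # one more random observation
    if i in checkpoints:
        running_means[i] = total / i     # x bar after i observations

for size, x_bar in running_means.items():
    print(size, round(x_bar, 3))
```

The sample mean at the largest checkpoint should sit very close to 3.5, while the early ones can wander.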

114
Q

Law of large numbers and sampling distribution (when sampling randomly from a given population)

A

When sampling randomly from a given population:

  • The law of large numbers describes what would happen if we took samples of increasing size n
  • A sampling distribution describes what would happen if we took all possible random samples of a fixed size n

Both are conceptual ideas with many important practical applications. We rely on their known mathematical properties but we don’t actually build them from data

115
Q

sampling distribution of x bar

A
  • distribution of x bar -> Normal if the population is Normal; approximately Normal if n is large enough (central limit theorem)
  • mean of x bar -> equal to the population mean mu (x bar is unbiased)
  • spread/standard deviation of sampling distribution of x bar (sigma/square root of n) is ALWAYS smaller than the population spread/standard deviation when n is greater than 1
116
Q

PPV (Positive predictive value)

A

P(A|+), where A is the event of having the disease

probability that you have the disease given + result on test (if positive result, how likely that you have the disease?)

117
Q

negative predictive value (NPV)

A

P(Disease complement | negative)

probability that you do not have the disease given a negative result on the test
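
A minimal sketch of both predictive values computed from a 2x2 table. The counts are invented for illustration.

```python
# Hypothetical counts from a 2x2 table of test result vs. disease status
true_pos, false_neg = 90, 10    # people WITH the disease
true_neg, false_pos = 950, 50   # people WITHOUT the disease

# PPV: P(disease | positive result)
ppv = true_pos / (true_pos + false_pos)
# NPV: P(no disease | negative result)
npv = true_neg / (true_neg + false_neg)

print(round(ppv, 3), round(npv, 3))
```

Predictive values condition on the test result, while sensitivity and specificity condition on the disease status.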

118
Q

notation for complement rule

A

P(not A)
P(A^c)
P(A bar)

119
Q

Strong vs. weak vs. moderate correlation

A

For her:
strong correlation: (-1 to -.7) or (.7 to 1)
moderate correlation: (-.69 to -.4) or (.4 to .69)
weak correlation: (-.39 to .39)
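
These cutoffs could be coded as a small helper. The function name is hypothetical, and treating 0.7 and 0.4 as the exact boundaries (so values such as 0.695 count as moderate) is my assumption.

```python
def correlation_strength(r):
    """Label a correlation coefficient using the cutoffs on this card."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)          # the sign gives direction, not strength
    if size >= 0.7:
        return "strong"
    if size >= 0.4:        # assumed boundary between weak and moderate
        return "moderate"
    return "weak"


labels = [correlation_strength(r) for r in (0.85, -0.55, 0.10)]
print(labels)
```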

120
Q

y intercept

A

y value when x=0

note, it may not make sense in the context

121
Q

Interpretation of b0 and b1

A

b0: When the ___ is 0, the predicted ____ is ___.
b1: For a ___ unit increase in____, there is, on average, a ____ increase in_____.
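
To make the templates concrete, here is a sketch that fits a least-squares line to a tiny invented data set. It assumes the usual least-squares formulas for b1 and b0, and the x and y values are hypothetical.

```python
x = [1, 2, 3, 4, 5]   # hypothetical explanatory values
y = [2, 4, 5, 4, 5]   # hypothetical response values

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Slope: sum of (x - x mean)(y - y mean) over sum of (x - x mean)^2
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b1 = s_xy / s_xx
# Intercept: the line passes through (x mean, y mean)
b0 = y_mean - b1 * x_mean

print(f"b0: When x is 0, the predicted y is {b0:.1f}.")
print(f"b1: For a 1 unit increase in x, there is, on average, a {b1:.1f} increase in y.")
```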

122
Q

variation vs. standard deviation vs. variance

A

variation: spread (how spread out the data is)
standard deviation: measures spread (large standard deviation = more spread out data)
variance: standard deviation squared
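
A quick sketch of the relationship between the two measures, using the sample versions (`stdev` and `variance`) from Python's standard library on invented data.

```python
from statistics import stdev, variance

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample

s = stdev(data)            # sample standard deviation
s_squared = variance(data) # sample variance

# The variance equals the standard deviation squared
print(round(s, 3), round(s_squared, 3), round(s ** 2, 3))
```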