Biostats Test 1 Flashcards
Statistics
the science of data
Data
numbers with a context
Biostatistics
the application of statistics to topics in biology, including, but not limited to, the design and analysis of biological experiments and observational studies
Descriptive Statistics
Methods of organizing, summarizing and presenting data in an informative way
Inferential Statistics
Methods for drawing conclusions about a phenomenon (population) on the basis of data (sample)
- draw conclusions about hypotheses
Population vs. sample
Population: all subjects or items of interest (whose size, the number of subjects in the population, is denoted by N)
Sample: a group (or subset) selected from a population whose size is denoted by n
- Many different samples can be selected from any given population
- The number of distinct samples depends on the size of both the population and the sample
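A minimal Python sketch of the last point, assuming samples are drawn without replacement and that order does not matter (neither assumption is stated on the card). The sizes N and n are hypothetical.

```python
import math

# Hypothetical sizes, chosen only for illustration.
N = 10  # population size
n = 3   # sample size

# Number of distinct unordered samples of size n drawn without
# replacement from a population of size N ("N choose n").
num_samples = math.comb(N, n)
print(num_samples)  # 120
```

Even a small population yields many possible samples, which is why a statistic varies from sample to sample.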
Data
observations (such as measurements, genders or survey responses) that have been collected
Parameter
a number that describes a characteristic of a population
Statistic
a number that describes a characteristic of a sample (aka sample statistics)
- The observed value of a statistic is used to estimate the unobserved value of a parameter
Unbiased statistic
A statistic is unbiased if the mean of its sampling distribution is the same as the parameter it is intended to estimate
Individuals
Individuals are the objects described in a set of data
- Individuals may be people, animals, plants or things (ex: freshmen, newborns, fields of corn, cells)
Variable
A variable is any property that characterizes an individual.
- A variable can take different values for different individuals (ex: age, gender, blood pressure, blood types, flower color)
- two types: quantitative, categorical
Quantitative variable
Some quantity assessed or measured for each individual. We can then report the average of all individuals.
- Numeric (ex: age in years, blood pressure)
Categorical variable
Some characteristic describing each individual. We can then report the count or proportion of individuals with that characteristic.
- Gender (male, female), blood type (A, AB, O, B), flower color (white, yellow, red)
- finite number of categories
- don’t calculate averages for categorical variables - instead, often calculate proportions
- pie charts and bar graphs are often used to display categorical variables
Histograms
This is a summary graph for a single variable. Histograms are useful to understand the pattern of variability in the data, especially for large data sets
A histogram is a graph in which the horizontal scale represents classes of data values and the vertical scale represents frequencies. The heights of the bars correspond to the frequency values and the bars are drawn adjacent to each other.
- tells us shape and distribution
- break data into bins/ranges of equal length
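The binning step can be sketched in Python as follows. The function name `histogram_counts`, the data, and the bin settings are hypothetical, and the convention that each bin includes its left edge but not its right edge is an assumption.

```python
def histogram_counts(values, start, width, num_bins):
    """Count how many values fall in each equal-width bin.

    Bin i covers [start + i*width, start + (i+1)*width).
    Values outside the covered range are ignored.
    """
    counts = [0] * num_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < num_bins:
            counts[index] += 1
    return counts

data = [1, 2, 2, 3, 5, 6, 6, 7, 9]  # hypothetical observations
counts = histogram_counts(data, start=0, width=2, num_bins=5)
print(counts)  # [1, 3, 1, 3, 1]
```

Each count is the height of one bar in the histogram.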
Dotplots and stemplots
These are graphs for the raw data. They are useful to describe the pattern of variability in the data, especially for small data sets
- Stemplots are also called stem-and-leaf plots
- usually when 20 or fewer observations (if 21+, use histogram)
Dotplot: a graph in which each data value is plotted as a point along a scale of values. Dots representing equal values are stacked.
- not recommended unless small sample size (few observations)
- dots: where the observations are located along the line
Measures of center
The center of a data set is a representative or average value that indicates where the middle of the data set is located.
- Mean
- Median
- Mode
Mean
The mean or arithmetic average of a data set is the measure of center found by adding the values and dividing the total by the number of values
- sample mean = summation of all observations / number of values in sample
Median
The median of a data set is the measure of center that is the middle value when the data values are arranged in increasing or decreasing order.
To find the median, first sort the values, then:
- If the number of values is odd, the median is the number located in the exact middle of the list
- If the number of values is even, the median is found by computing the mean of the two middle numbers
Mode
The mode of a data set is the value that occurs most frequently.
- When two values occur with the same (greatest) frequency, each one is a mode and the data set is bimodal.
- When more than two values occur with the same (greatest) frequency, each is a mode and the data set is multimodal.
- When no value is repeated, there is no mode.
- One mode: unimodal
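A short Python sketch of the three measures of center, using hypothetical data. The helper `modes` is an illustrative name; it follows the card's convention that a data set with no repeated value has no mode.

```python
from collections import Counter
import statistics

def modes(values):
    """Return all modes; an empty list means no value is repeated."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [v for v, c in counts.items() if c == top]

values = [2, 3, 3, 5, 7, 10]  # hypothetical data

mean = sum(values) / len(values)    # 30 / 6 = 5.0
median = statistics.median(values)  # mean of the two middle values, 3 and 5
print(mean, median, modes(values))

print(modes([1, 1, 2, 2, 3]))  # two modes: bimodal
print(modes([1, 2, 3]))        # no repeated value: no mode
```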
Skewed data distribution
A distribution of data is skewed if it is not symmetric and extends more to one side than the other
- If tail is on left (skinny side), mean pulled towards left (mean < median)
- If tail is on right, mean pulled towards right (mean > median)
Left skew (negative skew): the mean and median are to the LEFT of the mode (mean < median)
Symmetric (zero skew): the mean, median, and mode are the same
Right-skew (positive skew): the mean and median are to the RIGHT of the mode (mean > median)
The Best Measure of Center
Each measure of center has advantages and disadvantages
- Mean: is unique in that it takes all data values into account. However, it is NOT resistant to skew and extreme values (outliers)
- Median: is resistant to skew and outliers
- For data that is approximately symmetric with only one mode, the mean, median, mode and midrange will be approximately the same
- For data that is obviously asymmetric, you should report both the mean and the median
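To illustrate resistance, the sketch below adds a single extreme value to a small hypothetical data set and compares how far the mean and the median move.

```python
import statistics

data = [4, 5, 5, 6, 7]      # hypothetical, roughly symmetric
with_outlier = data + [50]  # same data plus one extreme value

mean_before = statistics.mean(data)            # 27 / 5
median_before = statistics.median(data)        # middle value
mean_after = statistics.mean(with_outlier)     # 77 / 6
median_after = statistics.median(with_outlier) # mean of 5 and 6

print(mean_before, median_before)
print(mean_after, median_after)
```

The mean more than doubles while the median moves by only half a unit.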
Variation
a measure of the amount that values within a data set vary among themselves
Range
The range of a set of data is the difference between the maximum value and the minimum value
- Range = max - min
Standard deviation
The standard deviation of a set of sample values is a measure of variation of values about the mean
- The standard deviation “s” is used to describe the variation around the mean.
- Like the mean, it is NOT resistant to skew or outliers.
- the sample standard deviation s is used to estimate the population standard deviation
Variance
The variance of a set of values is a measure of variation equal to the square of the standard deviation (s^2)
- the sample variance s^2 is used to estimate the population variance
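A minimal sketch of the sample variance and standard deviation in Python. The cards do not give the formula, so the usual sample version with an n - 1 divisor is assumed here; the data are hypothetical.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(sample)
mean = sum(sample) / n

# Assumed formula: sum of squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
s = math.sqrt(variance)  # standard deviation is the square root of the variance

print(variance, s)

# The standard library uses the same n - 1 convention.
print(statistics.variance(sample), statistics.stdev(sample))
```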
Z-score
(also known as standardized score)
A z-score can be used to compare values from different data sets.
- The Z-score is the number of standard deviations that a given value x is above or below the mean.
- If the z-score is 1, the value is 1 standard deviation above the mean. If the z-score is -1.5, the value is 1.5 standard deviations below the mean.
use when we want to compare different populations
- need to standardize data to make comparable
positive vs. negative z-score
Positive z-score: indicates that the value is above the mean
Negative z-score: indicates that the value is below the mean.
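A small Python sketch of a z-score comparison. The function name `z_score` and all numbers are hypothetical; the formula z = (x - mean) / standard deviation follows from the definition above.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical scores on two tests with different scales.
z_a = z_score(85, mean=70, sd=10)  # 1.5 standard deviations above the mean
z_b = z_score(30, mean=24, sd=3)   # 2.0 standard deviations above the mean
z_c = z_score(55, mean=70, sd=10)  # negative: below the mean

print(z_a, z_b, z_c)
```

Although 85 is the larger raw score, 30 is the more unusual value relative to its own data set.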
Quartiles
Quartiles divide the sorted data values into four equal parts.
- The median divides the data into two equal components.
- Q1: 25% of values are less than or equal to Q1, and 75% of values are greater than or equal to Q1
- Q2: equal to the median
- Q3: 75% of values are less than or equal to Q3, and 25% of values are greater than or equal to Q3
Exploratory data analysis
the process of using statistical tools to investigate data sets in order to understand their important characteristics, including: center, variation, distribution, outliers and time
Outlier
An outlier is a value that is located very far away from almost all of the other values. Relative to the other data, an outlier is an extreme value
- An outlier can have a dramatic effect on the mean, the standard deviation, and the scale of the histogram so that the true nature of the distribution is obscured
Five-number summary
min Q1 M (median) Q3 max
Inter-quartile range (IQR)
The IQR is the distance between the first and third quartiles (the length of the box in the box plot)
- IQR = Q3-Q1
- used to find suspected high and low outliers
Calculating outliers
An outlier is an individual value that falls outside the overall pattern. How far outside the overall pattern does a value have to fall to be considered a suspected outlier?
- Suspected low outlier: any value < Q1 - 1.5 IQR
- Suspected high outlier: any value > Q3 + 1.5 IQR
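The five-number summary, IQR, and 1.5 IQR rule can be sketched together in Python. Quartile conventions differ between textbooks and software; this sketch assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data (excluding the overall median when n is odd). The function name and data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # When n is odd, the middle value belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

data = [1, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical observations
minimum, q1, median, q3, maximum = five_number_summary(data)

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
suspected = [v for v in data if v < low_fence or v > high_fence]

print(minimum, q1, median, q3, maximum)  # 1 3.5 6 8.5 30
print(iqr, low_fence, high_fence)        # 5.0 -4.0 16.0
print(suspected)                         # [30]
```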
How to draw a boxplot
- Find the 5 number summary
- Construct a scale with values that includes the minimum and maximum data values
- Construct a box extending from Q1 to Q3, and draw a line in the box at the median value
- Draw lines extending outward from the box to the minimum and maximum data values
(won’t be asked to do this on exam)
Bivariate (or paired) data
can be analyzed to determine if there is an association between the two variables.
- We explore only linear associations within quantitative data
Correlation
A correlation exists between two variables when one of them is linearly related to the other in some way
- must be quantitative variables
How to investigate the association between two variables
- Make a scatterplot
- - What type of relationship is there? linear or nonlinear
- - Direction of relationship? positive (as x increases, y increases) or negative (as x increases, y decreases)
- - How strong is the relationship? strong (points fall close to a line), weak (points are widely scattered)
- - Look for potential outliers
Linear correlation coefficient (r) - definition + requirements
The correlation measures the strength of the linear association between paired x and y quantitative values in a sample. r is a sample statistic that estimates the population correlation coefficient, ρ (rho).
Requirements for making inferences about ρ, using r:
- Paired data (x, y) must be a random sample
- A scatterplot must confirm that the points approximate a straight-line pattern
- Outliers should be removed if they are known to be errors
Properties of the correlation coefficient
- The value of r is always between -1 and 1, inclusive (-1 less than or equal to r less than or equal to 1)
- The value of r does not change if all values of either variable are converted to a different scale
- The value of r is not affected by the choice of x and y (ex: it doesn’t matter whether BMI is x and cholesterol is y, or the reverse)
- r measures the strength and direction of a linear association
Negative correlation: - slope
Positive correlation: + slope
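A sketch of how r can be computed and how the properties above can be checked numerically. The cards do not give a formula, so the standard Pearson formula is assumed; the function name `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired quantitative data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]  # hypothetical paired data
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                           # swap the roles of x and y
r_rescaled = pearson_r([2.54 * v + 10 for v in x], y) # change the units of x
r_flipped = pearson_r(x, [-v for v in y])             # reverse the direction

print(r, r_swapped, r_rescaled, r_flipped)
```

Swapping x and y or changing units leaves r unchanged, and reversing the direction of one variable only flips its sign.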
Interpreting the correlation coefficient
If r is close to zero, we conclude that there is no significant linear correlation between x and y.
If r is close to -1 or 1, we conclude that there is significant linear correlation (values closer to -1 or 1 indicate stronger correlation)
- CANNOT conclude that there is no relationship at all (there could be another relationship like a parabola)
Interpreting r
If we conclude that there is a linear correlation between x and y, we can find a linear equation that expresses y in terms of x and that equation can be used to predict values of y for given values of x. (Simple Linear Regression)
The value of r^2 is the proportion of variation in y that is explained by x. In addition to x, there may be a variety of other factors affecting y, such as random variation or other factors not included in the study. We will explore this in more detail with linear regression.
Interpreting r - common errors
- Concluding that correlation implies causality (ex: shark attacks and ice cream consumption)
- Data based on averages: Averages suppress individual variation and may inflate the correlation coefficient (averages may make things look better than they are)
- Linearity: An association may exist between x and y even when there is no significant linear correlation.
r is not resistant to outliers:
- Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers
- Outliers will make a relationship look stronger/weaker than it actually is
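To illustrate the lack of resistance, the sketch below computes r for a small hypothetical data set, then again after adding one point far from the pattern. The standard Pearson formula is assumed, and `pearson_r` is an illustrative helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired quantitative data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]  # hypothetical data with a clear positive trend
y = [2, 1, 4, 3, 5]
r_clean = pearson_r(x, y)

# Add one hypothetical outlier at (6, 0), far below the trend.
r_with_outlier = pearson_r(x + [6], y + [0])

print(r_clean, r_with_outlier)
```

A single point drops r from 0.8 to nearly zero here; an outlier that lies along the trend could instead inflate r.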
Simple linear regression
The regression equation expresses an association between x and y.
Variable x: the independent, predictor, or explanatory variable
Variable y: the dependent or response* variable
Data comes in pairs (xi, yi) where xi is the ith observation for variable x and yi is the ith observation for variable y
A linear regression model with one predictor variable is a simple linear regression (SLR) model
x is what we are using to predict y
least-squares regression line: definition
the unique line such that the sum of the signed vertical distances (residuals) between the data points and the line is zero, and the sum of the squared vertical distances is the smallest possible
- same as line of best fit
- the line with the smallest sum of squared vertical distances (minimizes squared error)
- sum of all vertical distances has to = 0
- always has to pass through the point (x bar, y bar)
- Only for linear associations
- Don’t compute the regression line until you have confirmed that there is a linear relationship between x and y - always plot the raw data first to confirm linear association (always do a scatterplot and correlation coefficient first)
least squares regression line: notation and interpretations
y hat: the predicted value of y for a given value of x
y hat = intercept + slope x
*always have to write y hat
slope of the regression line: describes how much we expect y to change, on average, for every unit change in x
intercept: a necessary mathematical descriptor of the regression line (it does not describe a specific property of the data)
Slope of the regression line:
b1 = r (sy/sx)
- r: the correlation coefficient between x and y
- sy: standard deviation of the response variable y
- sx: standard deviation of the explanatory variable x
Intercept:
b0 = y bar - b1 (x bar)
- x bar and y bar are the respective means of the x and y variables
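The slope and intercept formulas can be applied in a few lines of Python. The data are hypothetical, and r is computed with the standard Pearson formula (an assumption, since the cards do not give it). The sketch also checks the stated properties: residuals sum to zero, the line passes through (x bar, y bar), and r^2 equals the proportion of variation in y explained by x.

```python
import statistics

x = [1, 2, 3, 4, 5]  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]  # hypothetical response values
n = len(x)

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)  # sample SDs (n - 1 divisor)

# r as the sum of products of z-scores, divided by n - 1.
r = sum(((a - x_bar) / s_x) * ((b - y_bar) / s_y) for a, b in zip(x, y)) / (n - 1)

b1 = r * s_y / s_x       # slope
b0 = y_bar - b1 * x_bar  # intercept

def predict(x_value):
    """y hat = intercept + slope * x"""
    return b0 + b1 * x_value

residuals = [b - predict(a) for a, b in zip(x, y)]
sse = sum(e ** 2 for e in residuals)       # unexplained variation
sst = sum((b - y_bar) ** 2 for b in y)     # total variation in y
explained = 1 - sse / sst                  # should match r ** 2

print(b1, b0)
print(sum(residuals), predict(x_bar), y_bar)
print(explained, r ** 2)
```

For these data the fitted line is approximately y hat = 2.2 + 0.6 x.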