Biostats Test 1 Flashcards
Statistics
the science of data
Data
numbers with a context
Biostatistics
the application of statistics to topics in biology, including, but not limited to, the design and analysis of biological experiments and observational studies
Descriptive Statistics
Methods of organizing, summarizing and presenting data in an informative way
Inferential Statistics
Methods for drawing conclusions about a phenomenon (population) on the basis of data (sample)
- draw conclusions about hypotheses
Population vs. sample
Population: all subjects or items of interest (whose size, the number of subjects in the population, is denoted by N)
Sample: a group (or subset) selected from a population whose size is denoted by n
- Many different samples can be selected from any given population
- The number of distinct samples depends on the size of both the population and the sample
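To see how the number of distinct samples depends on both N and n, here is a minimal sketch. It assumes sampling without replacement where order does not matter (so the count is "N choose n"); the function name is hypothetical.

```python
from math import comb

def count_distinct_samples(population_size: int, sample_size: int) -> int:
    """Number of distinct unordered samples of size n drawn without replacement from N subjects."""
    return comb(population_size, sample_size)

print(count_distinct_samples(10, 3))  # 120
print(count_distinct_samples(50, 5))  # 2118760
```

Even a small population yields many possible samples, which is why different samples can give different values of the same statistic.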
Data
observations (such as measurements, genders or survey responses) that have been collected
Parameter
a number that describes a characteristic of a population
Statistic
a number that describes a characteristic of a sample (aka sample statistics)
- The observed value of a statistic is used to estimate the unobserved value of a parameter
Unbiased statistic
A statistic is unbiased if the mean of its sampling distribution is the same as the parameter it is intended to estimate
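One way to check this definition on a tiny example is to list every possible sample and average the statistic over all of them. The sketch below does that for the sample mean, assuming samples are drawn without replacement and each is equally likely; the population values and function names are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations

def mean(values):
    """Exact arithmetic mean, returned as a Fraction to avoid rounding."""
    values = list(values)
    return Fraction(sum(values), len(values))

def mean_of_sample_means(population, sample_size):
    """Average the sample mean over every possible sample of the given size."""
    sample_means = [mean(sample) for sample in combinations(population, sample_size)]
    return mean(sample_means)

population = [2, 4, 6, 8, 10]  # hypothetical population, N = 5
print(mean(population))                     # the parameter (population mean)
print(mean_of_sample_means(population, 2))  # mean of the sampling distribution
```

Both lines print the same value, which is what "unbiased" means for the sample mean in this setup.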
Individuals
Individuals are the objects described in a set of data
- Individuals may be people, animals, plants or things (ex: freshmen, newborns, fields of corn, cells)
Variable
A variable is any property that characterizes an individual.
- A variable can take different values for different individuals (ex: age, gender, blood pressure, blood types, flower color)
- two types: quantitative, categorical
Quantitative variable
Some quantity assessed or measured for each individual. We can then report the average of all individuals.
- Numeric (ex: age in years, blood pressure)
Categorical variable
Some characteristic describing each individual. We can then report the count or proportion of individuals with that characteristic.
- Gender (male, female), blood type (A, AB, O, B), flower color (white, yellow, red)
- finite number of categories
- don’t calculate averages for categorical variables - instead, often calculate proportions
- pie charts and bar graphs are often used to represent categorical variables
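A short sketch of summarizing a categorical variable by proportions rather than an average. The blood type data and the function name are hypothetical.

```python
from collections import Counter

def category_proportions(values):
    """Return the proportion of individuals falling in each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}

blood_types = ["A", "O", "O", "B", "A", "O", "AB", "O"]  # made-up sample, n = 8
print(category_proportions(blood_types))
# {'A': 0.25, 'O': 0.5, 'B': 0.125, 'AB': 0.125}
```

These proportions are the heights of the bars in a bar graph or the slice sizes in a pie chart.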
Histograms
This is a summary graph for a single variable. Histograms are useful to understand the pattern of variability in the data, especially for large data sets
A histogram is a graph in which the horizontal scale represents classes of data values and the vertical scale represents frequencies. The heights of the bars correspond to the frequency values and the bars are drawn adjacent to each other.
- tells us shape and distribution
- break data into bins/ranges of equal length
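The binning step can be sketched as follows. This is only an illustration: it assumes the bins span from the minimum to the maximum value and that the values are not all identical; the ages and the function name are hypothetical.

```python
def histogram_counts(values, num_bins):
    """Split the data range into num_bins equal-width bins and count values per bin."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins  # assumes the values are not all identical
    counts = [0] * num_bins
    for value in values:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((value - low) / width), num_bins - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i]) for i in range(num_bins)]

ages = [21, 23, 25, 30, 31, 34, 38, 40, 41]  # hypothetical ages in years
for start, end, count in histogram_counts(ages, 4):
    # One '#' per observation stands in for the height of the bar.
    print(f"{start:5.1f} to {end:5.1f} | {'#' * count}")
```

Each printed row corresponds to one bar: the bin is the class of data values and the number of marks is the frequency.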
Dotplots and stemplots
These are graphs for the raw data. They are useful to describe the pattern of variability in the data, especially for small data sets
- Stemplots are also called stem-and-leaf plots
- usually used when there are 20 or fewer observations (if 21+, use a histogram)
Dotplot: a graph in which each data value is plotted as a point along a scale of values. Dots representing equal values are stacked.
- not recommended unless small sample size (few observations)
- dots: where the observations are located along the line
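A rough text version of a stemplot, to show how the raw data stay visible in the graph. It assumes non-negative whole numbers, with the tens as the stem and the ones digit as the leaf; the function name and data are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a stem-and-leaf plot (stem = tens, leaf = ones digit)."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves_by_stem[stem].append(leaf)
    lines = []
    # Include empty stems so gaps in the data remain visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

for line in stemplot([12, 15, 21, 21, 34]):  # made-up observations
    print(line)
```

Repeated values (the two 21s here) appear as repeated leaves, much like stacked dots in a dotplot.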
Measures of center
The center of a data set is a representative or average value that indicates where the middle of the data set is located.
- Mean
- Median
- Mode
Mean
The mean or arithmetic average of a data set is the measure of center found by adding the values and dividing the total by the number of values
- sample mean = summation of all observations / number of values in sample
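The formula above, written out as a small sketch. The blood pressure readings and the function name are hypothetical.

```python
def sample_mean(observations):
    """Sum of all observations divided by the number of observations."""
    return sum(observations) / len(observations)

blood_pressures = [120, 130, 125, 135]  # made-up readings, n = 4
print(sample_mean(blood_pressures))  # 127.5
```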
Median
The median of a data set is the measure of center that is the middle value when the data values are arranged in increasing or decreasing order.
To find the median, first sort the values, then:
- If the number of values is odd, the median is the number located in the exact middle of the list
- If the number of values is even, the median is found by computing the mean of the two middle numbers
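The sort-then-pick-the-middle steps can be sketched like this, assuming a non-empty list of numbers (the function name is hypothetical):

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single value sits in the exact middle.
        return ordered[middle]
    # Even count: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```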
Mode
The mode of a data set is the value that occurs most frequently.
- When two values occur with the same (greatest) frequency, each one is a mode and the data set is bimodal.
- When more than two values occur with the same (greatest) frequency, each is a mode and the data set is multimodal.
- When no value is repeated, there is no mode.
- One mode: unimodal
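A sketch of the rules above, assuming a non-empty data set. It follows this card's convention that a data set with no repeated value has no mode; the function names are hypothetical.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the greatest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value is repeated, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)

def describe_modality(values):
    """Label the data set by how many modes it has."""
    labels = {0: "no mode", 1: "unimodal", 2: "bimodal"}
    return labels.get(len(modes(values)), "multimodal")

print(modes([1, 2, 2, 3]), describe_modality([1, 2, 2, 3]))        # [2] unimodal
print(modes([1, 1, 2, 2, 3]), describe_modality([1, 1, 2, 2, 3]))  # [1, 2] bimodal
print(modes([1, 2, 3]), describe_modality([1, 2, 3]))              # [] no mode
```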
Skewed data distribution
A distribution of data is skewed if it is not symmetric and extends more to one side than the other
- If tail is on left (skinny side), mean pulled towards left (mean < median)
- If tail is on right, mean pulled towards right (mean > median)
Left skew (negative skew): the mean and median are to the LEFT of the mode (mean < median)
Symmetric (zero skew): the mean, median, and mode are the same
Right-skew (positive skew): the mean and median are to the RIGHT of the mode (mean > median)
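Comparing the mean and median gives a quick check of skew direction. The sketch below applies that rule of thumb only; it is not a formal test of skewness, and the function name and data sets are hypothetical.

```python
from statistics import mean, median

def skew_direction(values):
    """Guess the skew direction from the relative position of mean and median."""
    data_mean, data_median = mean(values), median(values)
    if data_mean > data_median:
        return "right-skewed (mean > median)"
    if data_mean < data_median:
        return "left-skewed (mean < median)"
    return "roughly symmetric (mean == median)"

print(skew_direction([1, 2, 2, 3, 3, 3, 20]))        # long tail on the right
print(skew_direction([1, 18, 19, 19, 20, 20, 20]))   # long tail on the left
print(skew_direction([1, 2, 3, 4, 5]))               # symmetric
```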
The Best Measure of Center
Each measure of center has advantages and disadvantages
- Mean: is unique in that it takes all data values into account. However, it is NOT resistant to skew and extreme values (outliers)
- Median: is resistant to skew and outliers
- For data that is approximately symmetric with only one mode, the mean, median, mode and midrange will be approximately the same
- For data that is obviously asymmetric, you should report both the mean and the median
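To illustrate what "resistant" means, this sketch adds one extreme value to a data set and reports how far each measure of center moves. The data and function name are made up for illustration.

```python
from statistics import mean, median

def center_shift_from_outlier(values, outlier):
    """Report how much the mean and median change when one outlier is added."""
    with_outlier = list(values) + [outlier]
    return {
        "mean_shift": mean(with_outlier) - mean(values),
        "median_shift": median(with_outlier) - median(values),
    }

shifts = center_shift_from_outlier([10, 12, 14, 16, 18], 100)
print(shifts)
```

In this example the median moves by 1 while the mean moves by more than 14, which is why the median is preferred for skewed data or data with outliers.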
Variation
a measure of the amount that values within a data set vary among themselves
Range
The range of a set of data is the difference between the maximum value and the minimum value
- Range = max - min
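The range formula as a small sketch, using hypothetical readings and a hypothetical function name:

```python
def data_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

blood_pressures = [120, 130, 125, 135]  # made-up readings
print(data_range(blood_pressures))          # 15
print(data_range(blood_pressures + [200]))  # 80
```

Because the range uses only the two most extreme values, a single outlier (the 200 here) changes it substantially.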