elementary statistics vocabulary CH 1-3 Flashcards
CH 1-3
Data
Collections of observations (such as measurements, genders, survey responses).
CH 1-3
Statistics
The science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.
CH 1-3
Population
The complete collection of all individuals (scores, people, measurements, and so on) to be studied. The collection is complete in the sense that it includes all of the individuals to be studied.
CH 1-3
Census
The collection of data from every member of the population.
CH 1-3
Sample
A subcollection of members selected from a population.
CH 1-3
Collection of sample data
Sample data must be collected in an appropriate way, such as through a process of random selection.
CH 1-3
Inappropriate collection of sample data
If sample data are not collected in an appropriate way, the data may be so completely useless that no amount of statistical torturing can salvage them.
CH 1-3
Statistical thinking - factors
- Context of the data
- Source of the data
- Sampling method
- Conclusions
- Practical implications
CH 1-3
Practical implications
Statistical significance vs. practical significance
CH 1-3
Parameter
A numerical measurement describing some characteristic of a population.
CH 1-3
Statistic
A numerical measurement describing some characteristic of a sample.
CH 1-3
Quantitative (numerical) data
Numbers representing counts or measurements.
CH 1-3
Categorical (qualitative, attribute) data
Names or labels that are not numbers representing counts or measurements.
CH 1-3
Discrete data
Result when the number of possible values is either a finite number or a “countable” number. (That is, the number of possible values is 0 or 1 or 2, and so on.)
CH 1-3
Continuous (numerical) data
Result from infinitely many possible values that correspond to some continuous scale that covers a range of values without gaps, interruptions, or jumps.
CH 1-3
Nominal level of measurement
Is characterized by data that consist of names, labels, or categories only. The data cannot be arranged in an ordering scheme (such as low to high).
CH 1-3
Ordinal level of measurement
Data can be arranged in some order, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless.
CH 1-3
Interval level of measurement
Is like the ordinal level, with the additional property that the difference between any two data values is meaningful. However, data at this level do not have a natural zero starting point (where none of the quantity is present).
CH 1-3
Ratio level of measurement
The interval level with the additional property that there is also a natural zero starting point (where zero indicates that none of the quantity is present). For values at this level, differences and ratios are both meaningful.
CH 1-3
Voluntary response sample (self-selected sample)
One in which the respondents themselves decide whether to be included.
Cannot be used for making conclusions about a population.
CH 1-3
Correlation
A statistical association between two variables.
CH 1-3
Causality
The dependence of one variable upon another.
CH 1-3
Correlation caveat
Correlation does not imply causality.
CH 1-3
Observational study
Subjects are observed and specific characteristics are measured, but there is no attempt to modify the subjects being studied.
CH 1-3
Experiment
Some treatment is applied to the subjects (experimental units), and its effects upon them are observed.
CH 1-3
Experimental units
The subjects of an experiment.
CH 1-3
Simple random sample
A sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen.
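A minimal Python sketch of drawing a simple random sample, assuming the population fits in a list. The function name, the population of ID numbers 1 to 100, and the seed are illustrative assumptions. `random.sample` selects without replacement, so every possible set of n members is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n members chosen so that every size-n subset is equally likely."""
    rng = random.Random(seed)          # seeded only to make the sketch repeatable
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical population: ID numbers 1 through 100.
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```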
CH 1-3
Random sample
Each member of the population has an equal chance of being selected. Computers are often used to generate random samples.
CH 1-3
Probability sample
Involves selecting members of a population in such a way that each member of the population has a known (but not necessarily the same) chance of being selected.
CH 1-3
Systematic sample
Select some starting point, then select every kth (such as every 50th) element in the population.
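A short Python sketch of systematic sampling. Choosing the starting point at random among the first k elements is an assumption made here; the definition only says to select some starting point. The function name and the population of 500 ID numbers are hypothetical.

```python
import random

def systematic_sample(population, k, seed=None):
    """Pick a start among the first k elements, then take every kth element."""
    rng = random.Random(seed)
    start = rng.randrange(k)       # index 0 .. k-1
    return population[start::k]    # every kth element from the start

# Hypothetical population: ID numbers 1 through 500, taking every 50th.
population = list(range(1, 501))
sample = systematic_sample(population, 50, seed=0)
print(sample)
```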
CH 1-3
Convenience sampling
Use results that are easy to get.
CH 1-3
Stratified sampling
Subdivide the population into at least two different subgroups (or strata) so that subjects within the same subgroup share the same characteristics (such as gender or age bracket), then draw a sample from each subgroup.
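A possible Python sketch of stratified sampling, assuming each record carries the characteristic used to form strata. The records, the two age brackets, and the choice of a simple random sample of equal size within each stratum are all illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_stratum, seed=None):
    """Group records into strata by key(record), then sample from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for label in sorted(strata):   # fixed order keeps the sketch repeatable
        members = strata[label]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Hypothetical records: (subject ID, age bracket).
records = [(i, "under 30" if i % 3 else "30 and over") for i in range(1, 31)]
sample = stratified_sample(records, key=lambda r: r[1], per_stratum=3, seed=7)
print(sample)
```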
CH 1-3
Cluster sampling
Divide the population into sections (or clusters), then randomly select some of those clusters, and then choose all members from those selected clusters.
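A brief Python sketch of cluster sampling, assuming the clusters are already defined as a dictionary. The precinct names and member IDs are made up for illustration. Note the contrast with stratified sampling: here whole clusters are selected at random and every member of a chosen cluster is kept.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose n_clusters clusters and return all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members

# Hypothetical clusters: voters grouped by precinct.
clusters = {
    "precinct A": [1, 2, 3],
    "precinct B": [4, 5],
    "precinct C": [6, 7, 8, 9],
    "precinct D": [10],
}
chosen, members = cluster_sample(clusters, 2, seed=3)
print(chosen, members)
```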
CH 1-3
Multistage sampling
Uses some combination of the basic sampling methods.
CH 1-3
Multistage sample design
Pollsters select a sample in different stages, and each stage might use different sampling methods.
CH 1-3
Cross-sectional study
Data are observed, measured, and collected at one point in time.
CH 1-3
Retrospective (case-control) study
Data are collected from the past by going back in time (through examination of records, interviews, and so on).
CH 1-3
Prospective (longitudinal, cohort) study
Data are collected in the future from groups sharing common factors (called cohorts).
CH 1-3
Cohort
A group sharing common factors.
CH 1-3
Randomization
The assigning of subjects to different groups through a process of random selection.
CH 1-3
Replication
The repetition of an experiment on more than one subject.
Alternatively, replication refers to the repetition or duplication of an experiment so that results can be confirmed or verified.
CH 1-3
Blinding
A technique in which the subject doesn’t know whether he or she is receiving a treatment or a placebo.
CH 1-3
Placebo effect
Occurs when an untreated subject reports an improvement in symptoms.
CH 1-3
Double-blind
Blinding occurs at two levels: (1) the subject doesn’t know whether he or she is receiving the treatment or a placebo, and (2) the dispenser of the treatment doesn’t know either.
CH 1-3
Confounding
Occurs in an experiment when the experimenter cannot distinguish among the effects of various factors.
CH 1-3
Completely randomized experimental design
Assign subjects to different treatment groups through a process of random selection.
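One way to sketch a completely randomized assignment in Python: shuffle the subjects, then deal them out to the groups in turn. The subject labels, the group names, and the equal group sizes are assumptions for illustration only.

```python
import random

def randomize(subjects, groups, seed=None):
    """Randomly assign subjects to groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)                      # random order
    assignment = {g: [] for g in groups}
    for i, subject in enumerate(shuffled):     # deal out like cards
        assignment[groups[i % len(groups)]].append(subject)
    return assignment

# Hypothetical experiment: 12 subjects, two groups.
subjects = [f"subject {i}" for i in range(1, 13)]
assignment = randomize(subjects, ["treatment", "placebo"], seed=42)
print(assignment)
```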
CH 1-3
Randomized block design
If testing one or more different treatments with different blocks:
(1) Form blocks (or groups) of subjects with similar characteristics.
(2) Randomly assign treatments to the subjects within each block.
CH 1-3
Block
A group of subjects that are similar, but where the groups differ in ways that might affect the outcome of the experiment.
CH 1-3
Rigorously controlled design
Carefully assign subjects to different treatment groups, so that those given each treatment are similar in the ways that are important to the experiment.
Extremely difficult to implement due to possible lack of consideration of all relevant factors.
CH 1-3
Matched pairs design
Compare exactly two treatment groups by using subjects matched in pairs that are somehow related or have similar characteristics.
The matched pairs may also consist of before and after measurements.
CH 1-3
Sampling error
The difference between a sample result and the true population result, resulting from chance sample fluctuations.
CH 1-3
Nonsampling error
Occurs when the sample data are incorrectly collected, recorded, or analyzed.
CH 1-3
Characteristics of data
CVDOT
- Center
- Variation
- Distribution
- Outliers
- Time
CH 1-3
Center
A representative or average value that indicates where the middle of the data set is located.
CH 1-3
Variation
A measure of the amount that the data values vary.
CH 1-3
Distribution
The nature or shape of the spread of the data over the range of values (such as bell-shaped, uniform, or skewed).
CH 1-3
Outliers
Sample values that lie very far away from the vast majority of the other sample values.
CH 1-3
Time
Changing characteristics of the data over time.
CH 1-3
Frequency distribution (frequency table)
Shows how a data set is partitioned among all of several categories (or classes) by listing all of the categories along with the number of data values in each of the categories.
CH 1-3
Frequency
The number of original values within a particular class.
CH 1-3
Lower class limits
The smallest numbers that can belong to the different classes.
CH 1-3
Upper class limits
The largest numbers that can belong to the different classes.
CH 1-3
Class boundaries
The numbers used to separate classes, but without the gaps caused by class limits.
CH 1-3
Class midpoints
The values in the middle of the classes.
CH 1-3
Class width
The difference between two consecutive lower class limits or two consecutive lower class boundaries in a frequency distribution.
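The class terms above (limits, midpoints, width, frequency) can be tied together in a small Python sketch. It assumes integer data, so each upper class limit is one less than the next lower class limit. The scores, the first lower limit of 50, and the class width of 10 are made-up values.

```python
def frequency_table(values, first_lower_limit, class_width, num_classes):
    """Return rows of (lower limit, upper limit, midpoint, frequency)."""
    rows = []
    for i in range(num_classes):
        lower = first_lower_limit + i * class_width
        upper = lower + class_width - 1        # assumes integer data
        midpoint = (lower + upper) / 2
        frequency = sum(lower <= v <= upper for v in values)
        rows.append((lower, upper, midpoint, frequency))
    return rows

# Hypothetical test scores.
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 80, 84, 88, 91, 97]
table = frequency_table(scores, first_lower_limit=50, class_width=10, num_classes=5)
for lower, upper, midpoint, frequency in table:
    print(f"{lower}-{upper}  midpoint {midpoint}  frequency {frequency}")
```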
CH 1-3
Relative frequency distribution (percentage frequency distribution)
The frequency of a class is replaced with a relative frequency (a proportion) or a percentage frequency (a percent).
CH 1-3
Sum of relative frequencies
The sum of the relative frequencies in a relative frequency distribution must be close to 1 (or 100%).
CH 1-3
Cumulative frequency
The sum of the frequencies for that class and all previous classes.
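A minimal Python sketch of relative and cumulative frequencies, starting from a list of class frequencies. The frequencies are made up for illustration. Because of rounding, the relative frequencies are checked against 1 with a small tolerance rather than exact equality.

```python
from itertools import accumulate

# Hypothetical class frequencies, in class order.
frequencies = [2, 3, 5, 3, 2]
total = sum(frequencies)

relative = [f / total for f in frequencies]    # proportions
percent = [100 * r for r in relative]          # percentages
cumulative = list(accumulate(frequencies))     # running totals

print(relative)
print(cumulative)
```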
CH 1-3
Normal frequency distribution
- The frequencies start low, then increase to 1 or 2 high frequencies, then decrease to a low frequency.
- The distribution is approximately symmetric, with frequencies preceding that maximum being roughly a mirror image of those that follow the maximum.
CH 1-3
Histogram
A graph consisting of bars of equal width drawn adjacent to each other (without gaps).
- The horizontal scale represents classes of quantitative data values.
- The vertical scale represents frequencies.
- The heights of the bars correspond to the frequency values.
CH 1-3
Relative frequency histogram
- Same shape and horizontal scale as a histogram
- The vertical scale is marked with relative frequencies, as percentages or proportions, instead of actual frequencies.
CH 1-3
Frequency polygon
Uses line segments connected to points located directly above class midpoint values.
CH 1-3
Relative frequency polygon
Use relative frequencies, either proportions or percentages, for the vertical scale.
CH 1-3
Ogive
A line graph that depicts cumulative frequencies.
CH 1-3
Dotplot
A graph in which each data value is plotted as a point (or dot) along a scale of values. Dots representing equal values are stacked.
CH 1-3
Stemplot (stem-and-leaf plot)
Represents quantitative data by separating each value into 2 parts: the stem (such as the leftmost digit) and the leaf (such as the rightmost digit).
CH 1-3
Bar graph
Uses bars of equal width to show frequencies of categories of qualitative data.
- Vertical scale represents frequencies or relative frequencies.
- Horizontal scale identifies the different categories of qualitative data.
- Bars may or may not be separated by small gaps.
CH 1-3
Multiple bar graph
Has 2 or more sets of bars, and is used to compare 2 or more data sets.
CH 1-3
Pareto chart
A bar graph for qualitative data, with the added stipulation that the bars are arranged in descending order according to frequencies.
- Vertical scale - frequencies or relative frequencies.
- Horizontal scale - different categories of qualitative data.
CH 1-3
Pie chart
A graph that depicts qualitative data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category.
CH 1-3
Scatterplot (scatter diagram)
A plot of paired (x, y) quantitative data with a horizontal x-axis and a vertical y-axis.
- Horizontal axis - first (x) value
- Vertical axis - second (y) value
CH 1-3
Time-series graph
A graph of time-series data, which are quantitative data that have been collected at different points in time.
CH 1-3
Descriptive statistics
Summarize or describe relevant characteristics of data.
CH 1-3
Inferential statistics
Used to make inferences, or generalizations, about a population.
CH 1-3
Mean (arithmetic mean)
The measure of a data set’s center, found by adding the data values and dividing by the number of data values.
CH 1-3
Sample size
The number of data values.
CH 1-3
Measure of center
A value at the center or middle of a data set.
CH 1-3
Median
The measure of center that is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude.
CH 1-3
Mode
The value that occurs with the greatest frequency.
CH 1-3
Bimodal
A data set with 2 data values that occur with the same greatest frequency.
CH 1-3
Multimodal
A data set with more than 2 data values which occur with the same greatest frequency.
CH 1-3
No mode
A data set has no mode when no data value is repeated.
CH 1-3
Midrange
The measure of center that is the value midway between the maximum and minimum values in the original data set.
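The four measures of center defined so far (mean, median, mode, midrange) can be computed with Python's standard `statistics` module. The data values are made up. `statistics.multimode` returns every value tied for the greatest frequency, which covers the bimodal and multimodal cases; note that when no value is repeated it returns every value, whereas these cards call that "no mode".

```python
import statistics

# Hypothetical data set.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)             # sum of values / number of values
median = statistics.median(data)         # middle value of the sorted data
modes = statistics.multimode(data)       # all values with the greatest frequency
midrange = (max(data) + min(data)) / 2   # midway between maximum and minimum

print(mean, median, modes, midrange)
```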
CH 1-3
Round-off rule, mean, median and midrange
Carry one more decimal place than is present in the original set of values.
CH 1-3
Weighted mean
The mean calculated when data values are assigned different weights.
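As a sketch, the weighted mean is the sum of each value times its weight, divided by the sum of the weights. The grade-point example below (grade points weighted by credit hours) is hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical grade points (A=4, B=3, C=2) weighted by credit hours.
grade_points = [4, 3, 2]
credit_hours = [3, 4, 3]
gpa = weighted_mean(grade_points, credit_hours)
print(gpa)
```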
CH 1-3
Skewed
A distribution of data is skewed if it is not symmetric and extends more to one side than to the other.
CH 1-3
Symmetric
A distribution of data is symmetric if the left half of its histogram is roughly a mirror image of its right half.
CH 1-3
Negatively skewed (skewed to the left)
Data with a longer left tail, and a mean and median to the left of the mode.
CH 1-3
Positively skewed (skewed to the right)
Data with a longer right tail, and a mean and median to the right of the mode.
CH 1-3
Range
The difference between the maximum data value and the minimum data value.
CH 1-3
Standard deviation
A measure of variation of data values about the mean.
CH 1-3
Variance
A measure of variation equal to the square of the standard deviation.
Sample variance is an unbiased estimator of the population variance.
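A short Python sketch contrasting the sample and population versions, using the standard `statistics` module. `variance` and `stdev` divide by n - 1 (the sample formulas), while `pvariance` and `pstdev` divide by n. The data values are made up.

```python
import statistics

# Hypothetical data set: mean 5, sum of squared deviations 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_var = statistics.variance(data)        # divides by n - 1
sample_sd = statistics.stdev(data)            # square root of sample variance
population_var = statistics.pvariance(data)   # divides by n
population_sd = statistics.pstdev(data)

print(sample_var, sample_sd, population_var, population_sd)
```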
CH 1-3
Unbiased
The sample variance tends to target the population variance instead of systematically over- or underestimating it.
CH 1-3
Range rule of thumb
For many data sets, the vast majority (~95%) of sample values lie within 2 standard deviations of the mean.
CH 1-3
Empirical rule
For data sets having a distribution that is approximately bell-shaped:
- 68% (1 SD)
- 95% (2 SD)
- 99.7% (3 SD)
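The empirical rule can be checked informally by simulation. The sketch below draws values from a normal (bell-shaped) distribution and counts how many fall within 1, 2, and 3 standard deviations of the mean; the proportions should come out close to 0.68, 0.95, and 0.997. The mean of 100, standard deviation of 15, sample size, and seed are arbitrary assumptions.

```python
import random

def proportion_within(values, mean, sd, k):
    """Fraction of values lying within k standard deviations of the mean."""
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)

rng = random.Random(0)                              # seeded for repeatability
values = [rng.gauss(100, 15) for _ in range(20000)]

proportions = {k: proportion_within(values, 100, 15, k) for k in (1, 2, 3)}
for k, p in proportions.items():
    print(f"within {k} SD: {p:.3f}")
```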
CH 1-3
Chebyshev’s Theorem
The proportion (or fraction) of any data set lying within K SDs of the mean is always at least 1 - 1/K^2, where K is positive and greater than 1.
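A minimal Python sketch of the Chebyshev bound, 1 - 1/K^2, compared with the actual proportion in a deliberately skewed, made-up data set. The population standard deviation (`statistics.pstdev`) is used here as an assumption; the function names are illustrative.

```python
import statistics

def chebyshev_bound(k):
    """Minimum proportion of any data set within k SDs of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def proportion_within(data, k):
    """Actual proportion of data within k population SDs of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

# Hypothetical skewed data set with one extreme value.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
print(chebyshev_bound(2), proportion_within(data, 2))
```

For K = 2 the bound is 0.75, and for K = 3 it is about 0.89, regardless of the shape of the distribution.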
CH 1-3
Mean absolute deviation (MAD)
The mean distance of data from the mean.
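As a sketch in Python, the mean absolute deviation averages the absolute distances of the values from their mean. The data values are made up.

```python
import statistics

def mean_absolute_deviation(data):
    """Average of |x - mean| over all data values."""
    mean = statistics.mean(data)
    return sum(abs(x - mean) for x in data) / len(data)

# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mad = mean_absolute_deviation(data)
print(mad)
```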
CH 1-3
Coefficient of variation (CV)
Describes the standard deviation relative to the mean, expressed as a percent.
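A short Python sketch of the coefficient of variation for sample data, computed as the sample standard deviation divided by the mean, times 100. The heights are made-up values, and using the sample (rather than population) standard deviation is an assumption.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation as a percent of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

# Hypothetical heights in centimeters.
heights_cm = [160, 165, 170, 175, 180]
cv = coefficient_of_variation(heights_cm)
print(f"{cv:.2f}%")
```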
CH 1-3
z score (standardized value)
The number of standard deviations that a given value is above or below the mean.
- Ordinary: z score between -2 and 2
- Unusual: z score less than -2 or greater than 2
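The z score is computed as (value - mean) / standard deviation. The Python sketch below also applies the ordinary/unusual cutoff from this card; treating a z score of exactly -2 or 2 as ordinary is an assumption. The mean of 100 and standard deviation of 15 are made up.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def is_unusual(z):
    """Unusual means z is less than -2 or greater than 2."""
    return abs(z) > 2

# Hypothetical scores from a distribution with mean 100 and SD 15.
for score in (130, 55):
    z = z_score(score, 100, 15)
    print(score, z, "unusual" if is_unusual(z) else "ordinary")
```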
CH 1-3
Percentile
Percentiles are measures of location, denoted P1, P2, ..., P99, which divide a set of data into 100 groups with about 1% of the values in each group.
CH 1-3
Quartile
Quartiles are measures of location, denoted Q1, Q2, and Q3, which divide a set of data into 4 groups with about 25% of the values in each group.
CH 1-3
Q1 (first quartile)
Separates the bottom 25% of the sorted values from the top 75%.
At least 25% of the sorted values are less than or equal to Q1, and at least 75% of the values are greater than or equal to Q1.
CH 1-3
Q2 (second quartile)
Same as the median; separates the bottom 50% of the sorted values from the top 50%.
CH 1-3
Q3 (third quartile)
Separates the bottom 75% of the sorted values from the top 25%.
At least 75% of the sorted values are less than or equal to Q3, and at least 25% of the values are greater than or equal to Q3.
CH 1-3
Interquartile range (IQR)
Q3 - Q1
CH 1-3
Semi-quartile range
(Q3 - Q1) / 2
CH 1-3
Midquartile
(Q3 + Q1) / 2
CH 1-3
10-90 percentile range
P90 - P10
CH 1-3
5-number summary
- Minimum
- Q1 (first quartile)
- Q2 (median)
- Q3 (third quartile)
- Maximum
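A possible Python sketch of the 5-number summary, along with the interquartile range, semi-quartile range, and midquartile. It uses `statistics.quantiles` with the "inclusive" method, which is one of several quartile conventions; a textbook or calculator may use a different procedure and give slightly different Q1 and Q3. The data values are made up.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, Q2, Q3, maximum) using the 'inclusive' method."""
    ordered = sorted(data)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical data set.
data = [7, 1, 9, 3, 5, 2, 8, 4, 6]
minimum, q1, q2, q3, maximum = five_number_summary(data)

iqr = q3 - q1                  # interquartile range
semi_iqr = (q3 - q1) / 2       # semi-quartile range
midquartile = (q3 + q1) / 2    # midquartile

print(minimum, q1, q2, q3, maximum, iqr, semi_iqr, midquartile)
```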
CH 1-3
Boxplot (box-and-whisker diagram)
A graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile Q1, the median, and the third quartile Q3.