AP Stat Vocabulary Flashcards
Individuals
the objects described by a set of data (people, animals, things, etc.)
Categorical Variable
a variable that places an individual into one of several groups or categories (displayed with pie charts, bar graphs, or two-way tables)
Quantitative Variable
a variable that takes numerical values for which it makes sense to find an average
Discrete Variables
variables that can take only a countable set of separate values, such as whole-number counts
Continuous
variables that can take any value within an interval, such as a height or a time
Univariate Data
When we conduct a study that looks at only one variable, we say that we are working with univariate data. Suppose, for example, that we conducted a survey to estimate the average weight of high school students. Since we are only working with one variable (weight), we would be working with univariate data.
Bivariate Data
When we conduct a study that examines the relationship between two variables, we are working with bivariate data. Suppose we conducted a study to see if there were a relationship between the height and weight of high school students. Since we are working with two variables (height and weight), we would be working with bivariate data.
Variable
any characteristic of an individual
population
refers to the total set of observations that can be made.
sample
set of observations drawn from the population
census
a study that obtains data from every member of a population. In most studies, a census is not practical, because of the cost and/or time required.
distribution
tells us what values a variable takes and how often it takes those values; the pattern of variation of a variable
Inference
drawing conclusions that go beyond the data at hand; making a conclusion about a population based on data from a sample
Frequency Table
displays the frequency counts for the categories of a categorical variable
Relative Frequency
how often an event occurs relative to the total number of observations; usually expressed as a proportion or percentage
Relative frequency = Subgroup count / Total count
Relative Frequency Table
a table that shows relative frequencies for different categories of a categorical variable.
Roundoff Error
when each percent is rounded (for example, to the nearest tenth) and the rounded percents do not add up to exactly 100%; the gap comes from rounding, not from a mistake in the data
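The relative frequency formula and roundoff error can be seen together in a short Python sketch. The survey counts and the function name `relative_frequency_table` are made up for illustration.

```python
def relative_frequency_table(counts):
    """Map each category to its count divided by the total count."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

# Hypothetical survey counts, made up for illustration.
counts = {"car": 1, "bus": 1, "walk": 1}
rel_freq = relative_frequency_table(counts)

# Express each relative frequency as a percent rounded to the nearest tenth.
percents = {category: round(100 * p, 1) for category, p in rel_freq.items()}

# Add the rounded percents; round again only to hide floating-point noise.
percent_total = round(sum(percents.values()), 1)

print(percents)
print(percent_total)
```

Each category rounds to 33.3%, so the rounded percents fall just short of 100% even though the unrounded relative frequencies add to 1.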
Pie Chart
shows the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percentages for the categories.
- used when you want to emphasize a category's relation to the whole
Bar Graph
represents each category as a bar. The bar height shows the category's count or percent
- compares quantities by comparing the heights of the bars
Two-way Table
A two-way table (also called a contingency table) is a useful tool for examining relationships between categorical variables. The entries in the cells of a two-way table can be frequency counts or relative frequencies (just like a one-way table ).
Marginal Distribution
Entries in the “Total” row and “Total” column are called marginal frequencies or the marginal distribution. They tell us the distribution of one variable among ALL the individuals in the table.
Conditional Distribution
The relative frequencies in the body of the table, computed within a single row or column, are called conditional frequencies or the conditional distribution.
- describes the values of one variable among individuals who have a specific value of another variable
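As a rough sketch of how marginal and conditional distributions come out of a two-way table, the Python below uses a small table of made-up counts. The variable names and categories are hypothetical.

```python
# Hypothetical two-way table of counts: rows are grade level, columns are
# preferred sport. The numbers are made up for illustration.
table = {
    "junior": {"soccer": 20, "tennis": 30},
    "senior": {"soccer": 40, "tennis": 10},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of grade level: each row total over the grand total.
marginal_grade = {
    grade: sum(row.values()) / grand_total for grade, row in table.items()
}

# Conditional distribution of sport given grade: each cell over its row total.
conditional_sport = {
    grade: {sport: count / sum(row.values()) for sport, count in row.items()}
    for grade, row in table.items()
}

print(marginal_grade)
print(conditional_sport)
```

In this made-up table the conditional distributions of sport differ between juniors and seniors, which is the kind of pattern that suggests an association between the two variables.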
Segmented Bar Graph
bars divided into segments stacked on top of each other, showing how each larger category breaks down into smaller ones.
- shows each segment's share of its category as a whole
Side by Side Bar Graph
the bars for each subgroup are drawn next to each other within every category, usually in different colors, so that their heights can be compared directly (displays two categorical variables)
Association
when knowing the value of one variable helps you predict the value of another, then the variables have this
Simpson’s Paradox
a situation in which a trend that appears within each of several groups reverses when the groups are combined
- a lurking variable (the one that defines the groups) can be hidden in the combined data and greatly influence the result
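A small numerical sketch can make the reversal concrete. The counts below are invented for illustration (they are not from a real study), and the group and treatment labels are hypothetical.

```python
# Hypothetical (successes, trials) for two treatments in two groups.
# All counts are made up to illustrate the reversal.
results = {
    "mild":   {"A": (8, 10),   "B": (70, 100)},
    "severe": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Success rate as a proportion."""
    return successes / trials

# Success rate of each treatment within each group.
within = {
    group: {t: rate(*cell) for t, cell in by_treatment.items()}
    for group, by_treatment in results.items()
}

# Combine the groups by adding successes and trials for each treatment.
overall = {}
for treatment in ("A", "B"):
    successes = sum(results[g][treatment][0] for g in results)
    trials = sum(results[g][treatment][1] for g in results)
    overall[treatment] = rate(successes, trials)

print(within)
print(overall)
```

Treatment A has the higher rate in both groups, yet B has the higher rate overall, because A was used mostly on the group with low success rates.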
Dot Plot
each data value is shown as a dot above its location on the number line.
- used to compare frequency counts within categories or groups. As you might guess, a dotplot is made up of dots plotted on a graph.
- The dots are stacked in a column over a category, so that the height of the column represents the relative or absolute frequency of observations in the category
Shape
describes the overall pattern of a distribution as seen in its graph
- the shape is described by whether it is symmetric, how many peaks it has, if it is skewed, or whether it is uniform
- clusters? gaps?
Mode
most frequently appearing value in a population or sample
Center
the midpoint of the data
Spread
the extent to which a distribution is stretched or squeezed
- can be measured by variance, IQR, and standard deviation
Range
the difference between the maximum and minimum value in a data set
Outlier
a data point that diverges greatly from the overall pattern of data is called an outlier.
- “rule of thumb”- an extreme value is considered to be an outlier if it is at least 1.5 interquartile ranges below the first quartile (Q1), or at least 1.5 interquartile ranges above the third quartile (Q3).
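The rule of thumb can be sketched in Python as follows. The function names `outlier_fences` and `find_outliers` and the data set are hypothetical, and Q1 and Q3 are supplied by hand rather than computed here.

```python
def outlier_fences(q1, q3):
    """Return the lower and upper cutoffs, 1.5 IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """List values at least 1.5 IQRs below Q1 or above Q3.

    The "at least" follows the wording of the card; many textbooks use a
    strict inequality instead, so a value exactly on a fence is a judgment call.
    """
    lower, upper = outlier_fences(q1, q3)
    return [x for x in data if x <= lower or x >= upper]

# Made-up data; Q1 = 10.5 and Q3 = 14.5 are the medians of its two halves.
data = [2, 10, 11, 12, 13, 14, 15, 30]
outliers = find_outliers(data, q1=10.5, q3=14.5)
print(outlier_fences(10.5, 14.5))
print(outliers)
```

Here the IQR is 4, so the cutoffs sit at 4.5 and 20.5, and the values 2 and 30 are flagged.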
Symmetric
if the right and left sides of the graph are approximately mirror images of each other
Skewed Right
when the right side of the graph is much longer than the left side; most of the values are bunched on the left (the values tend to be lower), and the tail of the graph extends to the right
Skewed Left
when the left side of the graph is much longer than the right side; most of the values are bunched on the right (the values tend to be higher), and the tail of the graph extends to the left
Unimodal
when a graph has a clear single peak “one mode”
Bimodal
when a graph has two clear peaks “two modes”
Multimodal
when a graph has more than two clear peaks “many modes”
Stem Plot
A stem plot is used to display quantitative data, generally from small data sets (50 or fewer observations).
- gives us the shape of the distribution while keeping the actual numerical values
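A minimal Python sketch of a stem plot, assuming the data are non-negative whole numbers with the tens digit (and above) as the stem and the ones digit as the leaf. The function name `stemplot` and the scores are made up for illustration.

```python
from collections import defaultdict

def stemplot(values):
    """Return stem plot lines: stem = value // 10, leaf = value % 10."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves_by_stem[stem].append(leaf)

    lines = []
    # Include stems with no leaves so that gaps in the data stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

# Made-up test scores.
for line in stemplot([12, 15, 21, 34, 38, 30]):
    print(line)
```

Reading the rows sideways gives the same picture a histogram would, but every original value can still be recovered from its stem and leaf.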
Splitting Stems
each stem is split into two lines: one for leaves 0-4 and one for leaves 5-9
Back to back Stem Plot
Back-to-back stemplots are a graphic option for comparing data from two populations. The center of a back-to-back stemplot consists of a column of stems, with a vertical line on each side. Leaves representing one data set extend from the right, and leaves representing the other data set extend from the left.
Histogram
made up of columns plotted on a graph
-displays the distribution of a quantitative variable
Mean
A mean score is an average score, often denoted by $\bar{x}$. It is the sum of individual scores divided by the number of individuals.
mean = sum of the scores / number of scores
Median
a simple measure of central tendency; the middle value of a data set
TO FIND:
- arrange data from smallest to largest values
-If there is an odd number of observations, the median is the middle value.
- If there is an even number of observations, the median is the average of the two middle values.
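The steps above can be sketched in Python as follows. The function name `median` is chosen here for illustration (Python's standard library also provides `statistics.median`).

```python
def median(values):
    """Middle value of the data, following the sort-then-pick-middle steps."""
    ordered = sorted(values)  # arrange from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd number of observations: the single middle value.
        return ordered[middle]
    # Even number of observations: average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([7, 1, 5]))
print(median([7, 1, 5, 3]))
```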
Interquartile Range (IQR)
The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles.
-defined as the difference between the largest and smallest values in the middle 50% of a set of data.
- “Q3- Q1”
Q1 is the “middle” value in the first half of the rank-ordered data set.
Q2 is the median value in the set.
Q3 is the “middle” value in the second half of the rank-ordered data set.
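One way to compute the quartiles and the IQR is sketched below in Python. It assumes the common classroom convention that the median itself is left out of both halves when the number of observations is odd. The card does not specify this, and other conventions give slightly different quartiles. The function name `quartiles` is hypothetical, and the data need at least two values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: with an odd count, skip the middle value in both halves.
    upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)

q1, q2, q3 = quartiles([1, 3, 5, 7, 9, 11, 13])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```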
Five Number Summary
consists of the smallest observation, first quartile, median, the third quartile, and the largest observation
“minimum Q1 median Q3 maximum”
Box Plot/ Box and Whisker Plot
- type of graph used to display patterns of quantitative data.
- splits the data set into quartiles
- box- goes from first quartile to third quartile
- vertical line is drawn at Q2 (median)
- whiskers go from the minimum to Q1, and from Q3 to the maximum
- If the data set includes one or more outliers, they are plotted separately as points on the chart
Standard Deviation
The standard deviation is a numerical value used to indicate how widely individuals in a group vary. If individual observations vary greatly from the group mean, the standard deviation is big; and vice versa.
It is important to distinguish between the standard deviation of a population and the standard deviation of a sample. They have different notation, and they are computed differently. The standard deviation of a population is denoted by σ and the standard deviation of a sample, by s.
Formula:
$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$
where $\sigma$ is the population standard deviation, $\mu$ is the population mean, $X_i$ is the ith element from the population, and $N$ is the number of elements in the population.
The standard deviation of a sample is defined by a slightly different formula:
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
where $s$ is the sample standard deviation, $\bar{x}$ is the sample mean, $x_i$ is the ith element from the sample, and $n$ is the number of elements in the sample.
And finally, the standard deviation is equal to the square root of the variance.
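The two formulas differ only in the divisor, which the Python sketch below makes explicit. The function names `population_sd` and `sample_sd` and the data are made up for illustration. The standard library functions `statistics.pstdev` and `statistics.stdev` are used only as a cross-check.

```python
import math
import statistics

def population_sd(values):
    """Square root of the mean squared deviation (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def sample_sd(values):
    """Same sum of squared deviations, but divide by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

# Made-up data with mean 5 and squared deviations that sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_sd(data), statistics.pstdev(data))
print(sample_sd(data), statistics.stdev(data))
```

For this data the population formula gives sqrt(32 / 8) = 2, while the sample formula gives the slightly larger sqrt(32 / 7).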
Variance
The variance is a numerical value used to indicate how widely individuals in a group vary. If individual observations vary greatly from the group mean, the variance is big; and vice versa.
It is important to distinguish between the variance of a population and the variance of a sample. They have different notation, and they are computed differently. The variance of a population is denoted by $\sigma^2$, and the variance of a sample by $s^2$.
The variance of a population is defined by the following formula:
$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$
where $\sigma^2$ is the population variance, $\mu$ is the population mean, $X_i$ is the ith element from the population, and $N$ is the number of elements in the population.
The variance of a sample is defined by a slightly different formula:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
where $s^2$ is the sample variance, $\bar{x}$ is the sample mean, $x_i$ is the ith element from the sample, and $n$ is the number of elements in the sample. Using this formula, the variance of the sample is an unbiased estimate of the variance of the population.
-the variance is equal to the square of the standard deviation.
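As a quick check of the relationship between variance and standard deviation, the sketch below uses Python's standard `statistics` module (`pvariance` and `pstdev` for a population, `variance` and `stdev` for a sample). The data are made up for illustration.

```python
import math
import statistics

# Made-up data with mean 5 and squared deviations that sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)   # divides by N
samp_var = statistics.variance(data)   # divides by n - 1

# The variance is the square of the corresponding standard deviation.
pop_sd_squared = statistics.pstdev(data) ** 2
samp_sd_squared = statistics.stdev(data) ** 2

print(pop_var, pop_sd_squared)
print(samp_var, samp_sd_squared)
```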