Prob/Stats Final Exam Review Flashcards
What is the difference between quantitative and categorical data?
Quantitative data measure a quantity, usually with units. Categorical data describe a category that a case falls into.
Is “zip code” a quantitative or categorical variable?
Categorical variable. Zip codes label a category (a geographic area); they do not measure a quantity.
Is “distance from home” a quantitative or categorical variable?
Quantitative variable. “Distance from home” measures a quantity with units of length.
Is “social security number” a quantitative or categorical variable?
Categorical variable. Social security numbers do not measure a quantity. They categorize people by both state and individual identity.
What types of graphs can represent quantitative data?
Histograms, dot plots, stemplots.
What types of graphs can represent categorical data?
Bar graphs, pie charts.
Imagine a table with two variables: Gender and Handedness. Describe what marginal distributions would be.
Marginal distributions come from numbers in the…drum roll please…MARGINS! In other words, the numbers in the “totals” row or column. These usually are reported as PERCENTS.
Imagine a table with two variables: Gender and Handedness. Describe what conditional distributions would be.
Conditional distributions come from a row or a column “inside” the table (NOT from the margins or totals). For example: the handedness of males would be a conditional distribution. Or the gender of left-handers would be another example. And use PERCENTS!
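A quick sketch of both kinds of distribution in Python. The counts in the table are made up for illustration (they are not from a real survey), and the function names are just labels chosen for this example.

```python
# Hypothetical two-way table of counts: rows = gender, columns = handedness.
counts = {
    "Male":   {"Left": 6, "Right": 44},
    "Female": {"Left": 4, "Right": 46},
}

def marginal_gender(table):
    """Marginal distribution of gender: each row total as a percent of the grand total."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return {g: 100 * sum(row.values()) / grand_total for g, row in table.items()}

def conditional_handedness(table, gender):
    """Conditional distribution of handedness for one gender: one row, as percents of that row."""
    row = table[gender]
    row_total = sum(row.values())
    return {h: 100 * n / row_total for h, n in row.items()}

print(marginal_gender(counts))                 # {'Male': 50.0, 'Female': 50.0}
print(conditional_handedness(counts, "Male"))  # {'Left': 12.0, 'Right': 88.0}
```

Notice that the marginal percents use the grand total as the denominator, while the conditional percents use only the total for the row being examined.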
What is the difference between a population and a sample?
A population is the “larger” group about which we hope to learn. A sample is a subset of the population that is easier to obtain and analyze.
What are some features of a histogram?
It has a continuous and labeled number line on the horizontal axis. Then the data is grouped so that the heights of equally wide bars represent the number of data points in each interval. The vertical axis can represent counts OR percents.
If you are asked to “describe this distribution,” what should you always include?
Always comment about the shape, center, spread and outliers. AND be sure to include CONTEXT (words that show what the data is representing). “SOCS + context” can help you remember.
What is the “5-Number Summary?”
The Five Number Summary is the minimum, quartile 1, median, quartile 3, and maximum.
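Here is a minimal sketch of computing a 5-Number Summary in Python. It assumes one common textbook convention for quartiles (Q1 and Q3 are the medians of the lower and upper halves, leaving out the overall median when the count is odd). Other software may use slightly different quartile rules, so small differences are possible. The scores are hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the 'median of each half' rule.

    Assumes at least two data points.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # values below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]   # skip the middle value when n is odd
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

scores = [62, 70, 71, 75, 78, 80, 84, 88, 95]  # hypothetical test scores
print(five_number_summary(scores))  # (62, 70.5, 78, 86.0, 95)
```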
Describe the features of a box plot.
The box in the middle represents the “middle 50%” of the data. The width of the box is the IQR. The segment inside the box marks the MEDIAN. Each “whisker” represents another 25% of the data. The five key points (the ends of the two whiskers, the two edges of the box, and the median line) make up the “5-Number Summary.”
What does a box plot NOT reveal?
Sample size, mean, shape (at least not completely)
If the mean is higher than the median in a data set, then which way is the data likely skewed?
If the mean is higher than the median, typically the data is skewed right. Higher numbers (especially high outliers) will tend to “pull” the mean up. Medians are more resistant to outliers and skewness.
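To see the “pull” in action, here is a small sketch with made-up incomes (in thousands of dollars) that include one high outlier. The data are hypothetical.

```python
from statistics import mean, median

incomes = [30, 32, 35, 36, 38, 40, 250]  # hypothetical; 250 is a high outlier
without_outlier = incomes[:-1]           # same data with the outlier removed

print(round(mean(incomes), 1), median(incomes))                  # 65.9 36
print(round(mean(without_outlier), 1), median(without_outlier))  # 35.2 35.5
```

The single outlier moves the mean by about 30, but the median barely changes. That is what “resistant” means.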
What does a z-score tell us about a data point?
A z-score tells us how many standard deviations it is from the mean. Positive z-scores are above the mean, negative z-scores are below the mean.
How do you calculate a z-score?
z-score = (x – mean) ÷ (standard deviation)
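The formula translates directly into code. In this sketch the function name and the example values (a mean of 170 cm and a standard deviation of 5 cm) are assumptions chosen for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical heights: mean 170 cm, standard deviation 5 cm.
print(z_score(180, 170, 5))  # 2.0  (two standard deviations above the mean)
print(z_score(165, 170, 5))  # -1.0 (one standard deviation below the mean)
```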
What does the standard deviation of a data set describe?
The standard deviation of a data set is the “typical” (roughly “average”) distance of the data points from the mean.
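For reference, here is a sketch of the sample standard deviation computed by hand and checked against Python's built-in `statistics.stdev`. It assumes the sample version of the formula (dividing by n - 1), and the quiz scores are hypothetical.

```python
from math import sqrt
from statistics import stdev

def sample_sd(data):
    """Sample standard deviation: square root of (sum of squared deviations) / (n - 1)."""
    n = len(data)
    m = sum(data) / n
    return sqrt(sum((x - m) ** 2 for x in data) / (n - 1))

quiz_scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

print(round(sample_sd(quiz_scores), 3))  # hand-computed value
print(round(stdev(quiz_scores), 3))      # built-in value, should match
```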
Assume the mean height of students is 170 cm and the standard deviation is 5 cm. If all heights were converted to inches, which statistics would change?
To change to inches, we would need to DIVIDE by 2.54. Therefore ALL measures of position and spread would be divided by 2.54, including the mean and standard deviation.
Assume the mean height of students is 170 cm and the standard deviation is 5 cm. If all heights were decreased by 3 cm, which statistics would change?
When 3cm is subtracted from data, ONLY MEASURES OF POSITION will change by 3cm (mean, median, minimum, Q3, etc.). Measures of spread will NOT change (standard deviation, range, IQR, etc.)
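Both height cards can be checked with a few lines of Python. The five heights below are hypothetical (their mean is 170 cm, but their standard deviation is not 5 cm); the point is only to show which statistics move.

```python
from statistics import mean, stdev

heights_cm = [160, 165, 170, 175, 180]       # hypothetical heights, mean 170 cm

inches = [h / 2.54 for h in heights_cm]      # rescale: divide every value by 2.54
shifted = [h - 3 for h in heights_cm]        # shift: subtract 3 cm from every value

# Rescaling divides BOTH the mean and the standard deviation by 2.54.
print(round(mean(inches), 2), round(stdev(inches), 2))

# Shifting lowers the mean by 3 but leaves the standard deviation alone.
print(mean(shifted), round(stdev(shifted), 2), round(stdev(heights_cm), 2))
```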
If data were graphed on a scatterplot showing outdoor temperature vs. heating costs, which variable would be the explanatory variable?
Temperature would be the explanatory variable since temperature is “explaining” (or perhaps causing) the heating costs. Heating costs are “responding” to the temperatures, so heating cost is the response variable.
Would it be appropriate to use a scatterplot to graph gender vs. height?
No. Scatterplots are only for quantitative variables, and gender is a categorical variable. Parallel dot plots or parallel box plots would be better.
When describing a scatterplot between two variables, what should you always include?
Strength (weak, moderate, strong)
Direction (positive, negative, none)
Form (linear or not; outliers or not)
Context (words that describe the data story)
If you saw a graph of students’ heights vs. students’ arm spans, what would be a good description of the scatterplot?
“There is a strong, positive, linear association between heights and arm spans of students.”
“If two variables have a strong association, then they have a strong correlation.” True?
NO–False! Correlation (r) only measures the strength and direction of a LINEAR relationship! If data is curved, it can have a strong association, yet the correlation could come out strong or weak. Correlation should NEVER be used to describe curved data.
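A small sketch makes this concrete. The helper `pearson_r` below is written out by hand for illustration, and the data are made up: a perfect parabola (as strong an association as possible) and a perfect line, both over the same x-values.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
curved = [x ** 2 for x in xs]     # perfect curved pattern: y = x^2
linear = [2 * x + 1 for x in xs]  # perfect linear pattern: y = 2x + 1

print(pearson_r(xs, curved))  # 0.0  (strong association, zero correlation)
print(pearson_r(xs, linear))  # 1.0
```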
If there is a strong correlation between two variables, then the explanatory variable is likely causing the reactions in the response variable. True?
NO–False! For example: there can be a strong correlation (r) between the number of people eating ice cream and the number of drownings, but that does not necessarily mean that eating ice cream causes drownings. The outside temperature might be a third (lurking) variable.
What is the name of the line that we sometimes create to fit onto data?
The Least-Squares Regression Line (LSRL). It’s called LinReg(a+bx) on graphing calculators, and sometimes referred to as the “regression line” or “best-fit line” in other textbooks.
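For the curious, here is a minimal sketch of what a least-squares fit computes, using the same a + bx form. The function name `lsrl` and the data are hypothetical; a calculator or statistics package would normally do this for you.

```python
def lsrl(xs, ys):
    """Return (a, b) for the least-squares line: predicted y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of products of deviations divided by sum of squared x-deviations.
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    # Intercept: the line always passes through the point (x_bar, y_bar).
    a = y_bar - b * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical response values

a, b = lsrl(xs, ys)
print(round(a, 2), round(b, 2))  # 2.2 0.6
```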
If two variables show a high correlation, then the data must be linear. True?
NO–False! If two variables showed a very slight curvature, then if someone calculated the correlation (r) value, it might still be pretty high even though the data is clearly curved.
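As a quick check, this sketch computes r for made-up data that follow a curve exactly (y = x² for x from 1 to 10). The helper `pearson_r` is written by hand for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = list(range(1, 11))       # 1, 2, ..., 10
ys = [x ** 2 for x in xs]     # clearly curved: y = x^2

r = pearson_r(xs, ys)
print(round(r, 3))  # about 0.975, even though the pattern is a curve
```

A high r alone does not prove linearity. Always look at the scatterplot too.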
Which correlation (r) value shows the strongest linear association? –.08, –.29, –.88, .38, .82
–.88 is stronger than .82 even though it’s negative. The strongest correlations are the ones closest to 1 or –1.
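In code, “closest to 1 or –1” just means “largest absolute value.” A one-line sketch using the values from this card:

```python
r_values = [-0.08, -0.29, -0.88, 0.38, 0.82]

# Strength depends on distance from 0, so compare absolute values.
strongest = max(r_values, key=abs)
print(strongest)  # -0.88
```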
Where do you go on your graphing calculator to enter data?
STAT–EDIT