Section 2.1: Examining Numerical Data Flashcards
What is a scatterplot used for?
Visualizing the relationship between two numerical variables
What does a dot plot visualize?
One numerical variable. Darker colors represent areas with more observations
What does a stacked dot plot represent?
Taller stacks of dots indicate values with more observations, which helps in judging the center and shape of the distribution
What is the purpose of a histogram?
Provides a view of data density, showing where data is relatively more common
What does the term ‘center’ refer to in statistics?
A typical or middle value of the distribution, most often measured by the mean (average) or the median
What is the formula for the sample mean?
x̄ = (x1 + x2 + x3 + … + xn) / n
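As a worked instance of the sample mean formula, here is a minimal Python sketch. The function name `sample_mean` and the five data values are made up for illustration and do not come from the flashcards.

```python
def sample_mean(values):
    """Return the sample mean: the sum of the observations divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("sample mean is undefined for an empty sample")
    return sum(values) / n


# Hypothetical sample of five observations (illustrative values only).
sample = [2, 4, 4, 5, 10]
x_bar = sample_mean(sample)
print(x_bar)  # (2 + 4 + 4 + 5 + 10) / 5 = 5.0
```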
How is the population mean computed?
It is computed the same way as the sample mean, but over every member of the population. It usually cannot be calculated because the entire population is rarely accessible
What does x̄ represent?
Sample mean
What does μ represent?
Population mean
Define unimodal
A distribution with a single peak
What is the difference between bimodal and multimodal?
A bimodal distribution has two prominent peaks, while a multimodal distribution has more than two
What characterizes a uniform distribution?
No apparent peaks
What does ‘right skewed’ refer to?
A distribution with a tail extending to the right
What does ‘left skewed’ mean?
A distribution with a tail extending to the left
What is the formula for variance?
s^2 = Σ(xi - x̄)^2 / (n - 1), where the sum runs over all n observations
How is standard deviation calculated?
s = √(s^2)
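The variance and standard deviation formulas can be traced step by step in a short Python sketch. The function names and the sample values are illustrative assumptions, chosen so the arithmetic is easy to check by hand.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Hypothetical sample (illustrative values only); its mean is 5.
sample = [2, 4, 4, 5, 10]
s2 = sample_variance(sample)  # deviations -3, -1, -1, 0, 5 -> squares sum to 36
s = sample_sd(sample)
print(s2, s)  # 36 / 4 = 9.0, and sqrt(9.0) = 3.0
```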
What is the median in a dataset?
The value that splits the data in half when ordered in ascending order
What does Q1 represent?
25th percentile, also called the first quartile
What is the 50th percentile also known as?
The median
What does Q3 represent?
75th percentile, also called the third quartile
Define interquartile range (IQR)
The range where the middle 50% of the data lies, calculated as IQR = Q3 - Q1
What is the maximum upper whisker reach?
Q3 + 1.5 × IQR
What is the maximum lower whisker reach?
Q1 - 1.5 × IQR
Define an outlier
An observation beyond the maximum reach of the whiskers
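The quartile, IQR, whisker, and outlier definitions above can be combined in one small Python sketch. Quartile conventions differ between textbooks and software. This sketch assumes the "median of each half" convention, leaving the median out of both halves when n is odd. The function name `box_plot_summary` and the data are hypothetical.

```python
from statistics import median


def box_plot_summary(values):
    """Return quartiles, IQR, maximum whisker reach, and outliers."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower_half = data[:half]
    # Assumption: for odd n, the median is excluded from both halves.
    upper_half = data[half + 1:] if n % 2 else data[half:]
    q1, q3 = median(lower_half), median(upper_half)
    iqr = q3 - q1
    lower_reach = q1 - 1.5 * iqr
    upper_reach = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_reach or x > upper_reach]
    return {
        "q1": q1, "median": median(data), "q3": q3, "iqr": iqr,
        "lower_reach": lower_reach, "upper_reach": upper_reach,
        "outliers": outliers,
    }


# Hypothetical observations (illustrative values only).
observations = [1, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(observations)
print(summary)
```

With these values Q1 = 3.5 and Q3 = 8.5, so the IQR is 5.0, the whiskers can reach from -4.0 to 16.0, and 30 is flagged as an outlier.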
What are robust statistics?
Statistics that are affected very little by extreme observations, such as the median and IQR
What are not robust statistics?
Mean, variance (standard deviation)
When describing distributions, what three aspects do we focus on?
Center, shape, and spread of distributions
Which plots are used to show the relationship between two numerical variables?
Scatter plot
Which plots are used to show the distribution of one numerical variable?
Dot plot, stacked dot plot, histogram, box plot
Why are histograms important?
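They show the shape of a numerical variable's distribution (modality, skewness, and where the data are concentrated), which makes them one of the most useful plots for analysis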
How can the chosen bin width affect a histogram?
It can alter the story the histogram is telling
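To illustrate how bin width can change the story, the following Python sketch counts the same hypothetical data with three different bin widths. The helper `bin_counts` and the data are made up for illustration. Bins are assumed to be left-closed intervals starting at 0.

```python
def bin_counts(values, bin_width, start=0):
    """Map each bin's left edge to the number of observations in
    [edge, edge + bin_width), including empty bins inside the data range."""
    indices = [int((v - start) // bin_width) for v in values]
    lo, hi = min(indices), max(indices)
    return {start + k * bin_width: indices.count(k) for k in range(lo, hi + 1)}


# Hypothetical data with two clusters (illustrative values only).
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 11, 12, 12, 13, 13, 13, 14, 14, 15]

for width in (2, 10, 20):
    print(f"bin width {width}")
    for edge, count in bin_counts(data, width).items():
        print(f"  {edge:>2}: {'#' * count}")
```

With a width of 2 the two clusters and the gap between them are visible. With a width of 10 the two bins hold 9 observations each, so the data look uniform. With a width of 20 everything falls into a single bar.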
What does the median represent in relation to data values?
50% of the values are below it and 50% are above
What are ways to measure center?
- Histogram (a visual read of where the data are centered, not a numerical measure)
- Mean (average)
- Median
What are ways to measure shape?
- Modality
- Skewness
What are ways to measure spread?
- Variance (standard deviation)
- IQR
For skewed distributions, which measures are more helpful?
Median and IQR to describe center and spread
For symmetric distributions, which measures are more helpful?
Mean and SD to describe center and spread
Which variable is expected to be uniformly distributed: (a) heights of KSU students, (b) salaries of a random sample of people from North Carolina, (c) house prices in America, (d) birthdays of classmates (day of the month)?
(d) Birthdays of classmates (day of the month)
Why is it important to look for outliers?
- Identify extreme skew in the distribution
- Identify data collection and entry errors
- Provide insight into interesting features of the data
How would replacing the largest value with $10 million affect the mean, median, standard deviation, and IQR of household income?
- Mean: increase
- Median: stay the same, because the largest value is not one of the middle values
- Standard deviation (variance): increase
- IQR: stay the same
If the smallest value in household income is replaced with $10 million, how does it affect the mean and median?
- Mean: increase
- Median: stay the same or increase slightly, because one value moves from below the median to above it
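The two replacement questions above can be checked numerically. The sketch below uses a made-up list of eleven household incomes, not data from the source, and an assumed "median of each half" convention for the IQR. The helper names `iqr` and `summarize` are hypothetical.

```python
from statistics import mean, median, stdev


def iqr(values):
    """IQR using the median-of-halves convention (an assumption)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(upper) - median(lower)


def summarize(values):
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "iqr": iqr(values),
    }


# Hypothetical household incomes in dollars (illustrative values only).
incomes = [22_000, 31_000, 38_000, 45_000, 52_000, 58_000,
           64_000, 71_000, 85_000, 98_000, 250_000]

before = summarize(incomes)
# Replace the largest value with $10 million.
largest_replaced = summarize(sorted(incomes)[:-1] + [10_000_000])
# Replace the smallest value with $10 million.
smallest_replaced = summarize(sorted(incomes)[1:] + [10_000_000])

for label, stats in [("original", before),
                     ("largest replaced", largest_replaced),
                     ("smallest replaced", smallest_replaced)]:
    print(label, stats)
```

In this sample, replacing the largest value raises the mean and standard deviation but leaves the median and IQR unchanged. Replacing the smallest value moves the median up by one position, from 58,000 to 64,000.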
For estimating typical household income for a student, is the mean or median more relevant?
The median, because household income is right skewed and a few very high incomes pull the mean above what is typical