Exploratory Data Analysis 6.3 Distributions Flashcards
What is a distribution?
A distribution shows how data points are spread out or arranged. It helps us understand patterns in the data.
What is a probability density function (PDF)?
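- What it does: Describes how likely a continuous variable is to fall near each value. The area under the curve over a range gives the probability of that range, and the total area is 1.
- Example: If you have people’s heights, the PDF tells you which heights are common and which are rare.
- Think of it like: A smoothed histogram. A taller curve at a point means values near that point are more common.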
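As a rough illustration of reading a PDF, the sketch below models heights with a normal distribution. The mean of 170 cm and standard deviation of 10 cm are made-up values chosen for the example, not real measurements.

```python
from statistics import NormalDist

# Hypothetical height model: mean 170 cm, standard deviation 10 cm (assumed values).
heights = NormalDist(mu=170, sigma=10)

# The density is higher near the mean than far out in the tail.
density_at_mean = heights.pdf(170)
density_in_tail = heights.pdf(190)

# A probability is an area under the curve, obtained here from the CDF.
prob_165_to_175 = heights.cdf(175) - heights.cdf(165)

print(f"density at 170 cm: {density_at_mean:.4f}")
print(f"density at 190 cm: {density_in_tail:.4f}")
print(f"P(165 <= height <= 175): {prob_165_to_175:.3f}")
```

Note that a density value is not itself a probability. Only an area under the curve, such as the interval computed above, is a probability.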
What is the purpose of a box plot?
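Box plots give a compact summary of a single variable: the box spans the first quartile (Q1) to the third quartile (Q3), a line marks the median, and whiskers show the range of the remaining data.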
What does the median line in a box plot represent?
The median line marks the middle value of the sorted data: half of the values lie below it and half above.
What do box plots clearly show?
Box plots clearly show outliers, usually drawn as individual points beyond the whiskers.
What is a limitation of box plots?
Box plots can hide the shape of the distribution. For example, two separate peaks in the data are not visible in a box plot.
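The numbers behind a box plot can be sketched with the standard library. The data below is made up, and the 1.5 × IQR cutoff for outliers is a common convention assumed here, not something the cards specify.

```python
from statistics import quantiles

# Made-up sample with one extreme value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Quartiles; method="inclusive" treats the data as the whole population.
q1, median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed convention: points more than 1.5 * IQR outside the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, median, Q3:", q1, median, q3)
print("outliers:", outliers)
```

Different tools compute quartiles slightly differently, so a plotting library may place the box edges at slightly different values for the same data.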
What is a violin plot?
Violin plots show the estimated density of the data at each value: the wider the violin at a value, the more common that value is.
What is the significance of unimodal, bimodal, and multimodal distributions?
- Unimodal has one peak.
- Bimodal has two peaks.
- Multimodal has multiple peaks, which can indicate different groups.
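One crude way to check the number of peaks is to count local maxima in histogram bin counts. The helper `count_peaks` and the bin counts below are hypothetical, and the method is only a heuristic: real data is noisy, and flat-topped peaks are not counted by this sketch.

```python
def count_peaks(counts):
    """Count bins strictly higher than both neighbours (edges compare against 0)."""
    padded = [0] + list(counts) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

# Made-up histogram bin counts.
unimodal_counts = [1, 3, 7, 3, 1]
bimodal_counts = [1, 5, 2, 6, 1]

print(count_peaks(unimodal_counts))
print(count_peaks(bimodal_counts))
```

In practice the result depends heavily on the bin width, so it is worth looking at the histogram or density plot as well.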
How do distributions affect statistical analysis?
- Distributions determine which statistical tests are appropriate for your data
- Skewed distributions can make the mean misleading; the median may be more representative
- Bimodal distributions suggest two distinct groups within your data
What is the mean?
The mean is the sum of the values divided by the number of values.
What is the mode?
The mode is the most common value.
What is the median?
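The median is the middle value when the data is sorted. With an even number of values, it is the average of the two middle values.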
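The three measures can be compared on a small made-up list using Python's `statistics` module. The extreme value 12 is included to show how it pulls the mean but not the median or mode.

```python
from statistics import mean, median, mode

data = [1, 2, 2, 3, 12]  # made-up values; 12 is an extreme value

data_mean = mean(data)      # (1 + 2 + 2 + 3 + 12) / 5
data_median = median(data)  # middle value of the sorted list
data_mode = mode(data)      # most frequent value

# With an even number of values, the median averages the two middle values.
even_median = median([1, 2, 3, 4])

print(data_mean, data_median, data_mode, even_median)
```

Here the mean is twice the median, which is the kind of gap that suggests a skewed distribution.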
What can skewness reveal about data?
- Direction of imbalance: Positive skew means a long tail to the right, while negative skew means a long tail to the left
- Presence of outliers: Extreme values often cause skewness by extending one tail
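Skewness can also be computed as a number. The sketch below uses the moment-based definition (third central moment divided by the second central moment raised to the power 1.5), which is one of several definitions in use. The helper name `skewness` and the data are illustrative.

```python
from statistics import fmean

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5, using population moments."""
    m = fmean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 12]           # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]  # mirror image: tail on the left

print(skewness(symmetric))
print(skewness(right_tailed))
print(skewness(left_tailed))
```

The sign gives the direction of the tail. The function raises `ZeroDivisionError` if all values are identical, because the spread is then zero.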
What is the normal distribution?
- The normal distribution is one of the most widely used probability distributions in statistics
- Also called the bell curve: most values cluster around the average, as with human height.
What is variance?
Variance measures how spread out the data is. The population variance is the sum of the squared differences between each value and the mean, divided by the total number of values. The sample variance divides by one less than the number of values instead.
What is standard deviation?
Standard deviation (σ or STD) is the square root of the variance, so it is expressed in the same units as the data.
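The calculation can be worked through by hand and checked against Python's `statistics` module. The data is made up, and it is treated as a whole population (dividing by n); the sample variance is shown for comparison.

```python
from math import sqrt
from statistics import pstdev, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

# Population variance by hand: mean of squared deviations from the mean.
mu = sum(data) / len(data)
manual_variance = sum((x - mu) ** 2 for x in data) / len(data)
manual_std = sqrt(manual_variance)

# The same quantities from the standard library.
population_variance = pvariance(data)
population_std = pstdev(data)

# Sample variance divides by n - 1 instead of n, so it is slightly larger.
sample_variance = variance(data)

print(manual_variance, manual_std)
print(population_variance, population_std)
print(sample_variance)
```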
What percentage of data lies within 1 standard deviation of the mean in a normal distribution?
About 68% of the data lies within 1 σ of the mean.
What percentage of data lies within 2 standard deviations of the mean in a normal distribution?
About 95% of the data lies within 2 σ of the mean.
What percentage of data lies within 3 standard deviations of the mean in a normal distribution?
About 99.7% of the data lies within 3 σ of the mean.
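These percentages can be checked from the cumulative distribution function of a standard normal distribution. The helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} σ: {share_within(k):.4f}")
```

The rule applies to normal distributions only; skewed or multimodal data can have quite different shares within each band.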
What is Positive Skew?
(Right-Skewed) → Tail on the Right
- Most data points are concentrated on the left side
- Mean is typically greater than median
- Examples: income distributions, house prices, reaction times
What is Negative Skew?
(Left-Skewed) → Tail on the Left
- Most data points are concentrated on the right side
- Mean is typically less than median
- Examples: exam scores with ceiling effects, age at death, depreciation
What is Zero Skew?
(Symmetrical, for example the Normal Distribution)
- What it means: Data is distributed symmetrically around the center, so the mean and median are roughly equal.
- Example: Heights of people in a population.