Chapter 2 Flashcards
Data
Facts that convey information
Two parts of Data
Observation and variable
Variable
Name for what is being counted, measured, or observed
Observations
Actual data values observed
Variation
Observations vary
Data Distribution
Pattern summarizing variation
Two main types of Quantitative Variables
Discrete and Continuous
Two main types of Categorical (Qualitative) Variables
Ordinal and Nominal
Discrete
Possible values belong to a set of distinct numbers
Continuous
Possible values belong to an interval (such as 10-50) and can take on any value in that interval.
Ordinal
Ordered categories (education levels)
Nominal
Un-ordered categories (color, marital status)
Two notes on Variable types
A continuous variable may be simplified into a categorical one for convenience (for example, by grouping its values into ranges)
What is a Bar Chart used for?
Displaying Categorical variables
Bar Chart x-axis
Categories or classes
Bar Chart y-axis
Count (frequency) or relative frequency
Relative Frequency for sample size n
The proportion (often shown as a percent) found by dividing the frequency of a given class by the sample size (A/n, B/n, …)
What does using Relative Frequency change in a Bar Chart?
Only the y-axis scale, from counts to proportions or percents; the relative heights of the bars stay the same
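A minimal sketch of the relative-frequency calculation, assuming Python. The function name `relative_frequencies` and the eye-color sample are made up for illustration and are not part of the flashcards:

```python
from collections import Counter

def relative_frequencies(observations):
    """Return {category: frequency / n} for a list of categorical observations."""
    counts = Counter(observations)   # frequency of each category
    n = len(observations)            # sample size
    return {category: count / n for category, count in counts.items()}

# Hypothetical sample (illustrative data only).
sample = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]
rel = relative_frequencies(sample)
```

Plotting `rel` instead of the raw counts rescales the y-axis only, so the bars keep the same relative heights.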
What is a Frequency Histogram used for?
Displaying quantitative variables
Frequency Histogram x-axis
Intervals (bins)
Frequency Histogram y-axis
Counts (frequency) or relative frequency
Sturges' Rule
Rule of thumb for the number of histogram bins: $k = 1 + \log_2 n$ (rounded up) for n observations
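Sturges' rule is commonly stated as k = 1 + log2(n) bins for n observations. A minimal sketch, assuming that form and rounding up to a whole number of bins (the function name `sturges_bins` is hypothetical):

```python
import math

def sturges_bins(n):
    """Suggested number of histogram bins for n observations: 1 + log2(n), rounded up."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(1 + math.log2(n))

bins_for_100 = sturges_bins(100)
```

The result is only a starting point; trying a few nearby bin counts is still worthwhile.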
Two main types of Variables
Quantitative and Categorical (Qualitative)
Quantitative
Observations which take on numerical values
Categorical (Qualitative)
Each observation belongs to one of a set of categories
Different interval locations
Change the appearance of the histogram
How to determine if Frequency Histogram shows true pattern of variation?
Create several histograms and choose one that displays features common to most to analyze
Four patterns to look for in Frequency Histogram
1. Modality (number of peaks) 2. Symmetry (is it mirrored?) 3. Center (where is it?) 4. Spread (how spread out is the data?)
Outliers
Values lying well away from rest of data
Modal Bar
Bar with height greater than or equal to those adjacent to it
Mode
Location of modal bar
Unimodal
Single modal bar
Bi-modal
Exactly two modal bars
Multimodal
More than one modal bar
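The modal-bar definition can be checked mechanically. This is a sketch only, assuming the bar heights are given left to right as a list; the name `modal_bars` and the example heights are hypothetical. Note that under the "greater than or equal" wording, every bar in a flat run of equal heights counts as modal.

```python
def modal_bars(heights):
    """Return the indices of bars whose height is >= each adjacent bar."""
    modal = []
    last = len(heights) - 1
    for i, h in enumerate(heights):
        left_ok = i == 0 or h >= heights[i - 1]      # no left neighbor, or not lower than it
        right_ok = i == last or h >= heights[i + 1]  # no right neighbor, or not lower than it
        if left_ok and right_ok:
            modal.append(i)
    return modal

unimodal = modal_bars([2, 5, 9, 4, 1])     # one modal bar
bimodal = modal_bars([1, 6, 3, 2, 7, 4])   # two modal bars
```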
Symmetric
Bars to left of some point are mirror images of those to right of same point
Skewed
Not symmetric
Right skewed
Tail extends farther right
Left skewed
Tail extends farther left
Symmetric relative to mean and median
Both are equal
Right skewed relative to mean and median
Mean greater than median
Left skewed relative to mean and median
Mean less than median
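The mean/median comparison can be turned into a rough check. This is a rule-of-thumb sketch, not a formal test of skewness; the function name `skew_hint` and both data sets are invented for illustration.

```python
from statistics import mean, median

def skew_hint(data):
    """Compare mean and median and report the direction of skew they suggest."""
    m, med = mean(data), median(data)
    if m > med:
        return "right"        # mean pulled toward a longer right tail
    if m < med:
        return "left"         # mean pulled toward a longer left tail
    return "no skew indicated"

# Hypothetical values: salaries in thousands, and ages at death.
salaries = [30, 32, 35, 38, 40, 45, 120]
ages_at_death = [40, 70, 75, 78, 80, 82, 85]
salary_hint = skew_hint(salaries)
age_hint = skew_hint(ages_at_death)
```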
What causes symmetric unimodal?
Homogeneous populations or measurement errors
What causes right skewness?
Data that is lower bounded but not upper bounded (salaries, age of living adults)
What causes left skewness?
Data that is upper bounded but not lower bounded (age at death, lifespan)
What causes multi-modality?
Non-homogeneous populations (male/female mixed heights)
What causes short tails?
Mixture of streams.
Time Series Plot
Plots data against time
Time Series Plot x-axis
Time or order
Time Series Plot y-axis
Observed value at given times
Stationary process
A data-generating mechanism in which the variation of data does not change over time (graph averages out to a horizontal line)
Stratified Plot
Data broken into groups (strata) and distributions are compared
What is a Stratified Plot used for?
Comparing data from stationary processes
What to look for in a Stratified Plot
Spread within strata and differences in centers between different strata
Within-Variation
Variation or spread within strata
Between-Variation
Differences in center location between strata
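A small sketch of how within- and between-variation might be summarized, assuming the mean is used as the center and the standard deviation as the spread of each stratum. The function name `stratum_summaries` and the two machine samples are hypothetical.

```python
from statistics import mean, stdev

def stratum_summaries(strata):
    """Return {stratum: (center, spread)} using the mean and sample standard deviation."""
    return {name: (mean(values), stdev(values)) for name, values in strata.items()}

# Hypothetical measurements from two machines (illustrative data only).
strata = {
    "machine A": [10, 11, 9, 10],
    "machine B": [14, 15, 13, 14],
}
summaries = stratum_summaries(strata)

# Between-variation: difference in centers between the two strata.
center_gap = summaries["machine B"][0] - summaries["machine A"][0]
```

Here the spreads within each stratum are equal, while the centers differ by `center_gap`.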
Noise
Random variation caused by poor supervision or training
How to reduce Noise?
Change in process
Bias
Systematic variation caused by improper machines or tools
How to reduce Bias?
Correct mistakes
Descriptive Statistics
Numerical summaries reflecting important characteristics of a data set
Three Types of Numerical Summaries
1. Measures of center 2. Measures of spread 3. Measures of location
Measures of Center
Mean and Median
Measures of Spread
Range, Variance, Standard Deviation, IQR
Measures of Location
Percentiles and Quartiles
Mean
Sum of observations divided by number of observations
Median
Halfway point of ordered observation values
Median if odd
The middle observation, at position (n+1)/2 of the n ordered observations
Median if even
Average of the two middle observations, at positions n/2 and (n/2 + 1) of the n ordered observations
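The odd and even cases can be sketched directly from the position rules above. The function name `median_by_position` is hypothetical; positions are 1-based in the cards, so the code subtracts 1 for Python's 0-based indexing.

```python
def median_by_position(values):
    """Median via the position rules: (n+1)/2 if n is odd, average of n/2 and n/2 + 1 if even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                 # the single middle observation
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2   # average of the two middle observations

odd_case = median_by_position([7, 1, 5])
even_case = median_by_position([7, 1, 5, 3])
```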
Robust
Median is more robust than mean
Mean relative to tails
The mean is pulled toward the longer tail of the distribution
Range
Max value minus min value
Variance
Deviation of each observation from the mean combined into a single number reflecting overall spread of data
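Formula for Variance
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ (sample variance, where $\bar{x}$ is the mean of the n observations)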
Standard Deviation
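Square root of the variance, that is, of the (roughly) average squared distance from the mean: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$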
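A minimal sketch of the variance and standard deviation calculations, assuming the sample versions with an n - 1 divisor (the usual convention when the standard deviation is written as s). The function names and the data are illustrative only.

```python
import math

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)

def sample_sd(data):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(data))

# Hypothetical data with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
var = sample_variance(data)
sd = sample_sd(data)
```

Python's standard library offers the same calculations as `statistics.variance` and `statistics.stdev`.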
Empirical Rule
For distributions that are bell shaped and approximately symmetric: about 68% of observations fall within 1 SD of the mean (mean - s, mean + s); about 95% fall within 2 SD (mean - 2s, mean + 2s); about 99.7% fall within 3 SD (mean - 3s, mean + 3s)
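The three intervals of the Empirical Rule can be computed from a mean and a standard deviation. A sketch, with a hypothetical function name and made-up values (mean 100, s 15):

```python
def empirical_rule_intervals(mean, s):
    """Return the (mean - k*s, mean + k*s) intervals for k = 1, 2, 3."""
    return {
        "68%": (mean - s, mean + s),
        "95%": (mean - 2 * s, mean + 2 * s),
        "99.7%": (mean - 3 * s, mean + 3 * s),
    }

intervals = empirical_rule_intervals(100, 15)
```

The percentages are approximations and apply only when the distribution is roughly bell shaped.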
The pth Percentile
A value such that p% of the observations fall below it
Median as a Percentile/Quartile
50th percentile or Q2 (second quartile)
First, Second, and Third Quartiles
First (Q1): 25% of observations fall below it; Second (Q2, the median): 50% fall below it and 50% above; Third (Q3): 75% fall below it
Finding first quartile
Arrange the data in increasing order and find the median, then find the median of the first half of the data (median excluded)
Finding third quartile
Arrange the data in increasing order and find the median, then find the median of the second half of the data (median excluded)
Interquartile range (IQR)
Range of middle 50% of data
Finding IQR
IQR = Q3 - Q1
Mathematical determination of a Potential Outlier
An observation that falls more than 1.5(IQR) below the first quartile or more than 1.5(IQR) above the third quartile
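The quartile, IQR, and potential-outlier cards can be combined into one sketch. It follows the "median excluded" method described above; other textbooks and software use slightly different quartile conventions, so results may differ. The function names and the data are hypothetical.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves, median excluded."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle observation (the median itself).
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

def potential_outliers(data):
    """Observations more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

# Hypothetical data with one unusually large value.
data = [1, 3, 4, 5, 6, 7, 8, 9, 30]
q1, q2, q3 = quartiles(data)
flagged = potential_outliers(data)
```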
Five Number Summaries
1. Minimum value 2. First Quartile 3. Median/Second Quartile 4. Third Quartile 5. Maximum value
Box-Plot
Graphical display using Five Number Summaries
Four features of a Box-Plot
1. Box goes from Q1 to Q3 and contains the central 50% of the data 2. Line inside the box marks the median 3. Lines (whiskers) extend from the box to the smallest and largest observations that are not potential outliers 4. Potential outliers are shown separately using another symbol
Whiskers
Lines extending from the box to the most extreme observations that are not potential outliers
Box-Plot skewness
The side with the larger part of the box and the longer whisker usually indicates the direction of skew
Box-Plot Modality
Good for unimodal but NOT multimodal data
Resistant
Measures that are not seriously affected by outliers
Measures that are resistant
Median, mode, IQR
Measures that are not resistant
Mean, standard deviation, range
Z-Score
Measure that specifies the number of standard deviations an observation falls from the mean
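Z-Score formula
$z = \frac{x - \bar{x}}{s}$, where x is the observation, $\bar{x}$ is the mean, and s is the standard deviation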
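A z-score is the observation minus the mean, divided by the standard deviation. A minimal sketch, assuming the sample mean and sample standard deviation are computed from the same data set (the function name `z_score` and the data are illustrative):

```python
from statistics import mean, stdev

def z_score(x, data):
    """Number of sample standard deviations that x falls from the sample mean."""
    return (x - mean(data)) / stdev(data)

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z_at_mean = z_score(5, data)   # an observation equal to the mean
z_high = z_score(9, data)      # an observation above the mean
```

A positive z-score means the observation is above the mean and a negative one means it is below.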
Quartiles and shape of the distribution
If the distance between Q1 and Q2 is greater than that between Q2 and Q3, the data are left skewed, and vice versa