Unit 1 - Exploring One-Variable Data Flashcards
How can statistics be used to help answer important, real-world questions based on data that vary?
Collect data
Analyze data
Interpret results
Individuals
May be people, animals, or things described by a set of data
Variable
Characteristic that changes from one individual to another
Categorical variable
Takes values that are category names or labels
Quantitative variable
Takes numerical values for a measured or counted quantity
Not all variables that take numerical values are
quantitative
It is possible to make a quantitative variable categorical by
grouping values
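A minimal sketch of turning a quantitative variable into a categorical one by grouping values. The ages and the cutoffs (18 and 65) are made up for illustration and are not from the original cards.

```python
# Hypothetical ages (a quantitative variable).
ages = [4, 15, 17, 22, 35, 41, 67, 70]


def age_group(age):
    """Return a category label for a numerical age (illustrative cutoffs)."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


# Each numerical value is replaced by the label of the group it falls in.
groups = [age_group(age) for age in ages]
print(groups)
```

After grouping, the variable takes labels rather than numbers, so it would be summarized with counts and proportions instead of a mean.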
How can we represent categorical data in tabular form?
With a frequency table or relative frequency table
How do these tabular representations help us describe categorical data?
Counts & relative frequencies of categorical data reveal information that can be used to justify claims about data in context
Frequency table
gives the number of individuals or counts in each category
Relative frequency table
Gives the proportion or percent of individuals (cases) in each category
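The two tables can be sketched in a few lines of Python. The color responses are invented sample data, and `Counter` is just one convenient way to tally them.

```python
from collections import Counter

# Hypothetical responses for a categorical variable (favorite color).
responses = ["red", "blue", "blue", "green", "red", "blue", "red", "blue"]

# Frequency table: the count of individuals in each category.
frequency = Counter(responses)

# Relative frequency table: each count divided by the total number of individuals.
total = len(responses)
relative_frequency = {
    category: count / total for category, count in frequency.items()
}

# Print category, count, and proportion side by side.
for category in sorted(frequency):
    print(f"{category:<6} {frequency[category]:>3} {relative_frequency[category]:.3f}")
```

The relative frequencies should always add up to 1 (or 100%), which is a quick check on the table.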
How to represent categorical data
Bar chart
Pie chart
Frequency/ relative frequency table
Making bar charts for categorical data
Label axes
Scale axes
Draw bars
Label axes
Variable name on horizontal axis
Frequency/ Relative frequency on vertical axis
Scale axes
Category labels spread out along horizontal axis
Start scaling vertical axis at 0 and go up in equal increments until you equal or exceed maximum frequency or relative frequency
Draw bars
Make the bars equal in width and leave gaps between them
Heights of the bars represent the category frequencies or relative frequencies
Pie charts
Include legend or key to indicate what each part means
Relative frequencies can make it easier to compare
distributions of data sets with different numbers of individuals
Charts can be built around whichever variable is
stronger or more supportive of the claim being made
Discrete variable
Usually involves counting
Variable that can take on a countable number of values, with gaps between them
Example of discrete variable
number of siblings
Continuous variable
Usually involves measuring
Variable that can take on infinitely many values in an interval, with no gaps between them
Example of continuous variable
Height
Dotplot
Shows every single point in a data set
Easy to see shape of distribution
May be difficult to make for large data sets
Stem-and-leaf plot
Shows all points in a data set
Easy to see shape
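A rough sketch of a stem-and-leaf plot for two-digit values, using the tens digit as the stem and the ones digit as the leaf. The scores and the `stemplot` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical two-digit test scores.
scores = [62, 67, 71, 74, 74, 78, 83, 85, 89, 91]


def stemplot(values):
    """Map each stem (tens digit) to its sorted leaves (ones digits)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return dict(stems)


plot = stemplot(scores)

# Print every stem in the range, including any stems with no leaves (gaps).
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Every original value can be read back from the plot, which is why it shows all points while still revealing the shape.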
Histogram
Easier for large data sets
Easy to see shapes
Individual data values are lost
4 factors to consider when describing distribution
Shape
Unusual Features
Center
Variability
Shape
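Symmetric -> about the same amount of data on both sides
Skewed left -> tail extends toward low values; more data on the high end
Skewed right -> tail extends toward high values; more data on the low end
Unimodal -> one peak
Bimodal -> two peaks
Uniform -> roughly the same frequency across all values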
Unusual Features
Outliers or gaps/clusters
Center
Which value in distribution best describes the typical response?
Variability
Are values in distribution packed close together?
Mean
Sum of all data values divided by number of values
Median
Middle value of an ordered data set (odd number of values)
Average of the two middle values of an ordered data set (even number of values)
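A short check of the mean and both median cases, using Python's standard `statistics` module on made-up data.

```python
import statistics

# Hypothetical data sets, one with an odd and one with an even number of values.
odd_data = [3, 7, 8, 12, 20]
even_data = [3, 7, 8, 12, 20, 22]

# Mean: sum of all values divided by the number of values (50 / 5).
mean_odd = statistics.mean(odd_data)

# Median, odd count: the single middle value of the ordered data (8).
median_odd = statistics.median(odd_data)

# Median, even count: the average of the two middle values (8 and 12).
median_even = statistics.median(even_data)

print(mean_odd, median_odd, median_even)
```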
Q1
First quartile is median of first half of ordered data set
Q3
Third quartile is median of second half of ordered data set
Range
Difference between the max and min
Interquartile Range
Difference between third and first quartiles
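A sketch of the quartile, range, and IQR definitions above. It assumes the common classroom convention that the overall median is left out of both halves when the number of values is odd; software packages often use other quartile rules and may give slightly different answers. The `quartiles` helper and the data are hypothetical.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two values. With an odd count, the overall median
    is excluded from both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return statistics.median(lower), statistics.median(upper)


data = [2, 4, 5, 7, 8, 10, 13, 15, 21]

q1, q3 = quartiles(data)
data_range = max(data) - min(data)  # max minus min
iqr = q3 - q1                       # third quartile minus first quartile

print(q1, q3, data_range, iqr)
```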
Standard Deviation
Typical distance that each value is away from the mean
$s_x = \sqrt{\dfrac{1}{n-1} \sum (x_i - \bar{x})^2}$
Square of standard deviation is
variance
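The sample standard deviation can be worked out step by step as a sanity check. This is a sketch on made-up data; `statistics.stdev` is the standard-library function that uses the same n - 1 divisor.

```python
import math
import statistics

# Hypothetical data set.
data = [2, 4, 4, 6, 9]
n = len(data)

# Step 1: the mean.
mean = sum(data) / n

# Step 2: squared deviations of each observed value from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Step 3: variance divides the sum of squared deviations by n - 1.
variance = sum(squared_deviations) / (n - 1)

# Step 4: standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(variance, std_dev, statistics.stdev(data))
```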
What summary stats can be used to describe the center of a distribution of quantitative data?
Mean
Median
Q1 and Q3 (measures of position rather than center)
What summary stats can be used to describe the variability of a distribution of quantitative data?
Range
IQR
standard deviation
What is the 5 number summary?
Min, Q1, Median, Q3, Max
How do we use the 5 number summary to make a boxplot?
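The five numbers split the data into quartiles: draw a box from Q1 to Q3 with a line at the median, and extend whiskers out to the min and max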
In a skewed right distribution, how do the mean and median compare?
mean > median
In a skewed left distribution, how do the mean and median compare?
mean < median
In a symmetric distribution, how do the mean and median compare?
mean ≈ median
Boxplot
Shows the 5 number summary and outliers
Splits the data into quartiles
Does not show every individual value
Can hide certain features of the shape of a distribution
How can we determine if a value in a data set is an outlier?
More than 1.5 IQR below Q1 or more than 1.5 IQR above Q3
2 or more standard deviations away from the mean
Which summary statistics are resistant and which are not?
Resistant - median and IQR
Nonresistant - mean, SD, range
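One way to see resistance is to replace the largest value with an extreme one and compare the statistics before and after. The data below are invented for this sketch.

```python
import statistics

# Hypothetical data, and the same data with the largest value made extreme.
original = [10, 12, 13, 15, 17, 18, 20]
with_outlier = [10, 12, 13, 15, 17, 18, 200]

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", round(statistics.mean(data), 2),
        "median:", statistics.median(data),
        "SD:", round(statistics.stdev(data), 2),
        "range:", max(data) - min(data),
    )
```

The median stays at 15 in both versions, while the mean, SD, and range all jump, which matches the resistant and nonresistant lists above.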
Which measures of center & variability are best for describing a skewed distribution?
Median
IQR
Which measures of center & variability are best for describing a symmetric distribution?
Mean
SD
Low outlier
< Q1 - 1.5 IQR
OR
< mean - 2SD
High outlier
> Q3 + 1.5 IQR
OR
> mean + 2SD
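Both outlier rules can be sketched as a pair of cutoff ("fence") calculations. The helper names and data are hypothetical, and the quartiles assume the median-of-halves convention with the overall median excluded for an odd count.

```python
import statistics


def iqr_fences(values):
    """Return (low, high) cutoffs: Q1 - 1.5 IQR and Q3 + 1.5 IQR."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def sd_fences(values):
    """Return (low, high) cutoffs: mean - 2 SD and mean + 2 SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - 2 * sd, mean + 2 * sd


# Hypothetical data with one suspiciously large value.
data = [2, 4, 5, 7, 8, 10, 13, 15, 40]

low_iqr, high_iqr = iqr_fences(data)
iqr_outliers = [x for x in data if x < low_iqr or x > high_iqr]

low_sd, high_sd = sd_fences(data)
sd_outliers = [x for x in data if x < low_sd or x > high_sd]

print(iqr_outliers, sd_outliers)
```

The two rules need not agree on every data set; here both happen to flag the same value.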
What are important characteristics to discuss when comparing distributions of quantitative data?
Think SOCS: Shape, Outliers / Unusual features, Center, Spread / Variability
What is needed for a complete response when comparing distributions of quantitative data?
Address the 4 important characteristics
Use comparative words
Include context
Percentile
Percent of data values less than or equal to a given value
How to interpret the percentile
“The value of _____ is at the pth percentile. About (p) percent of the values are less than or equal to _____”
Standardized score
Calculated as (data value - mean) / standard deviation
How to interpret the z-score
“The value of _____ is (z-score) standard deviations above or below the mean.”
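A tiny sketch of the z-score calculation and its interpretation. The test score, mean, and standard deviation are made-up numbers.

```python
def z_score(value, mean, sd):
    """Standardized score: how many SDs the value lies from the mean."""
    return (value - mean) / sd


# Hypothetical test: mean 80, standard deviation 4.
z_high = z_score(86, 80, 4)  # positive, so above the mean
z_low = z_score(74, 80, 4)   # negative, so below the mean

print(f"A score of 86 is {z_high} standard deviations above the mean.")
print(f"A score of 74 is {abs(z_low)} standard deviations below the mean.")
```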
Percentiles and z-scores can be calculated for
Distributions with any shape
If a number repeats, use the last value of the repeated number to
calculate the percentile.
Example: 2 2 2 2 2 3 4
Use the 5th “2”
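Using the "less than or equal to" definition, repeated values are handled automatically, because every copy of the value is counted. A sketch with a hypothetical `percentile` helper on the example data:

```python
def percentile(values, value):
    """Percent of data values less than or equal to the given value."""
    at_or_below = sum(1 for v in values if v <= value)
    return 100 * at_or_below / len(values)


data = [2, 2, 2, 2, 2, 3, 4]

# All five 2s count, so the value 2 sits at 5/7 of the data.
p_two = percentile(data, 2)
p_four = percentile(data, 4)

print(round(p_two, 1), p_four)
```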
Normal distribution
Mound-shaped and symmetric
Determined by mean and standard deviation
Many quantitative variables in the real world can be modeled by
a normal distribution
Within 1 SD of the mean, about
68% of the data lie
Within 2 SD of the mean, about
95% of the data lie
Within 3 SD of the mean, about
99.7% of the data lie
Empirical rule
68-95-99.7
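The 68-95-99.7 figures are rounded areas under the normal curve. They can be checked with `statistics.NormalDist` from Python's standard library; this sketch uses the standard normal distribution, but any mean and SD give the same proportions.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)
}

for k, area in within.items():
    print(f"within {k} SD: {area:.4f}")
```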
How can we use the z-score to find the percent of data values in a given interval for a normal distribution?
Calculate a z-score and then use Table A
How can we use z-score to find the percent of data values in a given interval for a normal distribution?
Left : get area from Table A
Right : 1 - area from Table A
Between: subtract two areas from Table A
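The three Table A cases can be mirrored in code, with `NormalDist().cdf(z)` standing in for a table lookup (it returns the area to the left of z). The mean of 100, SD of 15, and the cutoffs 85 and 130 are assumptions chosen for this sketch.

```python
from statistics import NormalDist

# Hypothetical normal model.
mean, sd = 100, 15

# Standardize the two boundary values.
z_low = (85 - mean) / sd    # -1.0
z_high = (130 - mean) / sd  # 2.0

table_a = NormalDist()  # standard normal; cdf(z) is the area to the left of z

left = table_a.cdf(z_low)                           # area below 85
right = 1 - table_a.cdf(z_high)                     # area above 130
between = table_a.cdf(z_high) - table_a.cdf(z_low)  # area from 85 to 130

print(round(left, 4), round(right, 4), round(between, 4))
```

A printed table rounds z to two decimal places, so hand answers may differ slightly in the last digit.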
How do we find a value, given an area from a normal distribution?
Use Table A to find z-score
Set up the equation z = (value - mean) / standard deviation and solve for the value
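Working backward from an area can be sketched the same way, with `NormalDist().inv_cdf` playing the role of reading Table A in reverse. The model (mean 100, SD 15) and the 90th percentile target are assumptions for illustration.

```python
from statistics import NormalDist

# Hypothetical normal model and target area to the left.
mean, sd = 100, 15
area_left = 0.90

# Step 1: find the z-score with the given area to its left.
z = NormalDist().inv_cdf(area_left)

# Step 2: solve z = (value - mean) / sd for the value.
value = mean + z * sd

print(round(z, 2), round(value, 1))
```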