Chpt 3 - Numerical Descriptive Measures Flashcards
How can we organize numerical data?
Graphical Methods
Numerical Methods
How does a histogram compare to a bar chart in what data they are representing?
They are similar but bar charts are for categorical data and histograms are for numerical data
How does a histogram compare to a bar chart in how close the bars are to each other?
Bars are touching in a histogram, but not a bar chart
How does a histogram compare to a bar chart in what each bar represents?
In a bar chart each bar represents a different category, but in a histogram each bar represents a group of values that the variable can take
How does a histogram compare to a bar chart in the height of each bar?
In a bar chart, the height of a bar is determined by frequency or relative frequency.
In a histogram, the height of the bar is the frequency or relative frequency of the group of values that the bar represents
How should we group the values when making a histogram for discrete data with only a small number of distinct values?
Single value grouping
When should single value grouping be applied to a histogram?
When using discrete data with only a small number of distinct values
What is single value grouping for a histogram?
Each bar represents a distinct value (similar to bar charts)
The height of the bar is determined by the frequency or relative frequency of the corresponding values in the sample
These would be called a frequency histogram or a relative frequency histogram respectively.
What type of histogram uses the height of the bar to represent relative frequency?
relative frequency histogram :)
How should we group the values when making a histogram for discrete data with many distinct values?
Limit grouping
What are the steps to making a histogram using limit grouping?
- Choose an appropriate range which includes all the distinct values
- Divide the range into sub-intervals of equal length
- Summarize the data using a frequency (f) or relative frequency (f/n) table. Here a frequency is the number of individuals falling into a sub-interval
When should limit grouping be applied to a histogram?
When using discrete data with many distinct values
What is the number of sub-intervals that work best for limit grouping? Explain
Should be between 5 and 20
Otherwise the histogram won't tell us much about the data. Imagine if there was only one bar in the histogram, or a separate bar for each of 100 distinct values. Gross lol
Let's say we want to analyze how many hours per week students are studying. A survey of 20 people gave answers ranging from 5 hours to 96 hours. How would you choose sub-intervals to make the limit grouping histogram?
Option A:
0-19
20-39
40-59
60-79
80-99
Option B:
0-9
10-19
etc. (would give 10 sub-intervals)
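The two options above can be checked with a short sketch. The function name `limit_grouping` and the 20 survey answers are made up for illustration (only the 5 to 96 hour range comes from the card), so treat this as one possible way to build the f and f/n table, not the textbook's method.

```python
from collections import Counter

def limit_grouping(values, start, width):
    """Frequency table for whole-number data using classes of equal width.

    Each row is (lower limit, upper limit, frequency), e.g. (0, 19, 3).
    """
    counts = Counter((v - start) // width for v in values)
    last_class = (max(values) - start) // width
    table = []
    for k in range(last_class + 1):
        lower = start + k * width
        upper = lower + width - 1  # limits are inclusive for discrete data
        table.append((lower, upper, counts.get(k, 0)))
    return table

# Hypothetical survey answers (hours studied per week), invented for illustration.
hours = [5, 12, 18, 20, 22, 25, 31, 35, 38, 40,
         41, 44, 47, 52, 55, 58, 63, 70, 84, 96]

option_a = limit_grouping(hours, start=0, width=20)  # 0-19, 20-39, ...
option_b = limit_grouping(hours, start=0, width=10)  # 0-9, 10-19, ...

for lower, upper, f in option_a:
    print(f"{lower}-{upper}: f = {f}, f/n = {f / len(hours):.2f}")
```

Option A gives 5 sub-intervals and Option B gives 10, so both fall inside the suggested 5 to 20 range.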
What grouping is applied to continuous data when making a histogram?
Cutpoint grouping
When is cutpoint grouping used in a histogram?
When using continuous data
What is cutpoint grouping?
Used for continuous data, it defines sub-intervals such that any value (decimal or whole number) in the whole interval is assigned to one, and only one, sub-interval. This is needed because a continuous variable can take any number in an interval
What are the steps to creating a histogram using cutpoint grouping?
- Choose the whole interval which includes all of the data values
- Divide this whole interval into 6 sub-intervals of equal length (i.e. 0-under 10, 10-under 20 etc.)
- Count the number of individuals falling into each sub-interval and summarize in a frequency or relative frequency table
- Plot the histogram with 1 bar corresponding to a sub-interval and the height of the bar = frequency or relative frequency as desired
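As a rough sketch of the counting step, the code below assigns each value to exactly one class of the form "lower to under upper". The function name `cutpoint_grouping` and the ten measurements are assumptions made up for this example.

```python
def cutpoint_grouping(values, start, width, n_classes):
    """Frequency of each class [lower, lower + width), i.e. 'lower to under upper'."""
    freqs = [0] * n_classes
    for v in values:
        k = int((v - start) // width)  # index of the class that contains v
        if not 0 <= k < n_classes:
            raise ValueError(f"{v} is outside the chosen whole interval")
        freqs[k] += 1
    return [(start + k * width, start + (k + 1) * width, f)
            for k, f in enumerate(freqs)]

# Hypothetical continuous measurements, invented for illustration.
data = [3.2, 9.99, 10.0, 14.5, 19.9, 20.0, 27.3, 35.1, 41.8, 59.4]

table = cutpoint_grouping(data, start=0, width=10, n_classes=6)
for lower, upper, f in table:
    print(f"{lower}-under {upper}: f = {f}, f/n = {f / len(data):.1f}")
```

Note that 9.99 lands in "0-under 10" while 10.0 lands in "10-under 20", which is the "one, and only one, sub-interval" rule in action.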
What is the purpose of organizing data?
To analyze the distribution of the data
What is a distribution and what are its 2 important features?
Distribution of a variable is a table, graph, or formula that provides
- All the possible values that this variable can take
- How often these values occur
Why is it important to determine the shape of the distribution of a variable?
Give an example
Plays a role in determining the appropriate inferential methods to analyze its data
If the distribution of a variable is bell shaped, a lot of inferential methods can be applied to analyze its data
What are the 3 important aspects when describing the shape of a distribution?
Symmetry
Skewness
Modality
What is symmetry in regards to distribution shape?
The left side of the distribution mirrors the right side, such as a bell-shape
What is skewness in regards to distribution shape?
Describes an asymmetric shape, where the distribution has a longer tail on one side
If a distribution has a longer left tail, what is this called?
Left skewed, or negatively skewed
What is it called when the distribution has a longer right tail?
Right skewed or positively skewed
What is left skewed distribution?
When the left has a longer tail (so the peak is to the right)
What is right skewed distribution?
When the right has a longer tail (so the peak is to the left)
What is modality in regards to distribution shape?
It's the number of peaks in a distribution. May have one (unimodal), two (bimodal), or many (multimodal)
What is a unimodal distribution?
There is only one peak in the distribution
What is it called when there are many peaks in the distribution?
multimodal
What is bimodal distribution?
When there are 2 peaks in the distribution
What are 2 well-known distribution shapes?
Bell-shaped
Uniform
What are the features of a bell-shaped distribution?
Unimodal
Symmetric
What is another name for a bell-shaped distribution?
Normal distribution
What are the features of a uniform model of distribution?
- If all the possible values that a variable can take have equal chance to happen, the distribution of this variable is a uniform distribution
- Uniform distributions have no mode and are symmetric
Give examples of graphical methods for organizing numerical data (4)
-histogram graph
-stem-and-leaf diagram
-dot-plot
-boxplot
Give examples of numerical methods for organizing numerical data (2)
-calculating center of data (mode, mean, median)
-calculating spread (range, IQR, standard deviation)
What is the leaf?
The rightmost digit of the data value
2005 - leaf is 5
34 - leaf is 4
What is a stem?
All digits of the data value except the rightmost digit
2005 - stem is 200
34 - stem is 3
What are the stem and leaf values of 15?
Leaf - 5
Stem - 1
What are the stem and leaf values of 183?
Leaf - 3
Stem - 18
What are the steps to creating a stem-and-leaf diagram?
- Identify stem and leaf of each data value
- Draw a vertical line, write the stems from the smallest to largest in the vertical column to the left of the vertical line
- Write each leaf to the right of the vertical line in the same row as its corresponding stem
- Arrange the leaves in each row from the smallest to the largest
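The steps above can be sketched in a few lines. The function names and the sample values are made up for illustration, and the sketch assumes non-negative whole numbers. It only lists stems that actually occur, whereas a hand-drawn diagram would usually also show the empty stems in between.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves} for non-negative whole numbers."""
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)  # leaf = rightmost digit, stem = the rest
        rows[stem].append(leaf)
    # Stems from smallest to largest, leaves in each row from smallest to largest.
    return {stem: sorted(rows[stem]) for stem in sorted(rows)}

def render(rows):
    """Draw one 'stem | leaves' row per stem."""
    return "\n".join(
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in rows.items()
    )

# Hypothetical data values, invented for illustration.
diagram = stem_and_leaf([34, 15, 183, 31, 12, 15, 38])
print(render(diagram))
```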
How is a dot plot read?
Each point corresponds to a data value. Points of the same value are stacked
What are descriptive measures?
Numbers that summarize a numerical data set, which includes finding the center of the data set and describing its spread
What is the center of a data set?
The most typical value of the data set
What is the most typical value of a data set called?
Center
What are the 3 options for the center of a data set?
Mode, mean, median
20 students are asked who they are going to vote for in the next election, these are the results
UCP - 8
Liberal - 5
NDP - 3
Green - 4
What is the mode?
UCP
What is the mode of a data set?
The value that occurs most frequently
20 students are asked who they are going to vote for in the next election, these are the results
UCP - 8
Liberal - 5
NDP - 3
Green - 4
What type of data is this?
Categorical data
What value occurs most frequently in a data set?
The mode
16 students were asked how many email addresses they had and below are the results
1 email - 3
2 emails - 4
3 emails - 7
4 emails - 2
What is the mode?
3 emails
What is the mode in this data set?
{2, 4, 1, 6, 5, 7}
There is no mode in this example as no value occurs more than once
What is the mode in this data set?
{2, 4, 1, 2, 4, 6, 5}
Two modes: 2, 4
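The mode cards above can be reproduced with a small sketch. The function name `modes` is made up for this example, and it follows the convention used in these cards that a data set where no value repeats has no mode.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs most often, or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

votes = ["UCP"] * 8 + ["Liberal"] * 5 + ["NDP"] * 3 + ["Green"] * 4
emails = [1] * 3 + [2] * 4 + [3] * 7 + [4] * 2

print(modes(votes))                  # works for categorical data too
print(modes(emails))
print(modes([2, 4, 1, 6, 5, 7]))     # no mode
print(modes([2, 4, 1, 2, 4, 6, 5]))  # two modes
```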
What does this symbol mean?
x̄
Pronounced X Bar
Denotes the mean of a data set
What does this symbol mean?
∑
Summation (or add up the included values)
What is the mean for the following data set?
{5, 7, 10, 13, 15}
x̄ = ∑x / n
∑x = 5+7+10+13+15 = 50
n = 5
x̄ = 50/5 = 10
How do we denote a sample mean?
x̄
Pronounced X Bar
How do we denote a population mean?
μ
Pronounced mu
How do you find the mean of the population?
μ = ∑x / N
So you add up all of the individual values of the whole population, and then divide that by the number of individuals in the entire population
Is the sample mean the same as the population mean?
No, the sample mean is only an estimate of the population mean
Because the sample mean is only an estimate of the population mean, what do we introduce?
Error, specifically sampling error
How can we measure a sampling error?
By using statistical inferential methods (if we learn this later, I don’t know it yet lol)
What are the steps to finding the median?
Sort the data values from the smallest to largest
If the number of data values is odd, the median is the middle value of the sorted data
If the number of the data values is even, the median is the average of the two values in the middle of the sorted data
What is the median?
A numerical value separating the higher half of values in a data set from the lower half
What is the numerical value separating the higher half of values in a data set from the lower half?
Median
How do you determine the median if the number of data values in a set is odd?
It is the middle value of the sorted data
How do you determine the median if the number of data values in a set is even?
It is the average of the two values in the middle of the sorted data
Find the median in the data set
{4, 7, 9, 12, 101}
9
It’s just the middle number in the ordered data set
Find the median of the following data set
{1, 5, 2, 7, 9}
reorder to
1, 2, 5, 7, 9
Median is 5
Find the median of the following data set
{3, 6, 2, 8, 4, 7}
reorder to
2, 3, 4, 6, 7, 8
Median is the average of 4 and 6
(4+6)/2 = 5
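The median steps can be written out as a short sketch and checked against the three examples above. The function name `median` is just a label chosen for this example.

```python
def median(values):
    """Median following the steps above: sort, then take the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values

print(median([4, 7, 9, 12, 101]))
print(median([1, 5, 2, 7, 9]))
print(median([3, 6, 2, 8, 4, 7]))
```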
What can be used to describe the center of a data set?
mode, mean, median
How do we describe the center of categorical data?
Mode
What are the most common ways to find the center of a data set?
Mean and median are used more commonly than mode
If a data set does not have outliers and its distribution is symmetric, what method should be used for describing the center of the data?
Mean
If a data set has outliers, what method should be used for describing the center of the data?
Median
How do we determine which method should be used for describing the center of data?
Mode - used for categorical data
Mean - used for numerical sets that have a symmetric distribution and no outliers
Median - used for numerical sets that have outliers
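To see why outliers push us toward the median, here is a small sketch using the data set {4, 7, 9, 12, 101} from the median card, with and without the value 101. It uses Python's standard `statistics` module. Treating 101 as the outlier is an assumption made for this illustration.

```python
import statistics

without_outlier = [4, 7, 9, 12]
with_outlier = [4, 7, 9, 12, 101]

# The mean gets dragged toward the extreme value...
mean_before = statistics.mean(without_outlier)
mean_after = statistics.mean(with_outlier)

# ...while the median barely moves.
median_before = statistics.median(without_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

The mean jumps from 8 to 26.6, which is larger than four of the five values, while the median only moves from 8 to 9.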
What is an outlier in a data set?
Observations very far away from most data values
What can be used to describe the spread of a numerical data set?
Range
Interquartile range (IQR)
Standard deviation
How do we calculate range?
Range = maximum-minimum
Determine the range of the following data set:
{2, 8, 12, 38, 58}
Range = max - min
Range = 58 - 2 = 56
Determine the range of the following data set:
{38, 12, 39, 24, 24, 5}
Range = max - min
Range = 39 - 5 = 34
What equation is used to determine the IQR?
IQR = Q3 - Q1
What does IQR stand for?
Interquartile Range
How much of the data set is included in the IQR?
The middle 50% of the data values
What does a small/large IQR tell us about the data?
Small IQR - small spread of the middle data values
Large IQR - Large spread of the middle data values
What is Q2 equivalent to?
The median
What are the steps to determining the IQR?
- Arrange data values in increasing order and determine the median (Q2)
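- Find the lower half and upper half of the data set (when the number of values is odd, the worked example in this set includes the median in both halves)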
- Find Q1, which is the median of the lower half, and Q3 which is the median of the upper half
- IQR = Q3-Q1
Determine the IQR of the following data set:
{13, 15, 21, 25, 26, 27, 30,
32, 34, 35, 38, 41, 43, 236}
Q2 (median) = 31
Q1 = 25
Q3 = 38
IQR = Q3-Q1
IQR = 38-25 =13
Determine the IQR of the following data set:
{13, 15, 16, 20, 21,
25, 26, 27, 30, 31,
32, 32, 34, 35,
38, 38, 41, 43, 46}
Q2 (median) = 31
Q1 = 23
Q3 = 36.5
IQR = Q3 - Q1
IQR = 36.5-23 = 13.5
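Both IQR examples can be reproduced with the sketch below. It assumes the convention that matches the second example: when the number of values is odd, the median is included in both halves. Other textbooks and software use slightly different quartile rules, so their answers may not match these. The function names are made up for illustration.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3), including the median in both halves when n is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2          # size of each half
    lower = ordered[:half]       # lower half
    upper = ordered[n - half:]   # upper half
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)

first = [13, 15, 21, 25, 26, 27, 30, 32, 34, 35, 38, 41, 43, 236]
second = [13, 15, 16, 20, 21, 25, 26, 27, 30, 31,
          32, 32, 34, 35, 38, 38, 41, 43, 46]

for data in (first, second):
    q1, q2, q3 = quartiles(data)
    print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
```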
What is the best way to describe the spread of a data set when there are outliers?
IQR
While the range of a data set is easy to find, what is it very sensitive to?
Extreme values or outliers
What does xi mean?
Data values in a set
What is a standard deviation?
The “average” distance between data values and the sample mean
What value determines the “average” distance between data values and the sample mean?
Standard deviation
What is the notation for sample standard deviation?
s
What does s stand for?
Standard deviation in a sample
Which standard deviation equation needs to be used if you only have the sums but not the individual values?
The computing formula
Which standard deviation equation should be used if you have all the individual values?
Either the defining formula or computing formula
What is the difference in outcome (or answer) between the defining and computing formulas of standard deviation?
Nothing, the answers are the same, they just get you there a different way
What is the defining formula for standard deviation?
The square root of
[ ∑(xi - x̄) squared ] / (n - 1)
i.e. s = √[ ∑(xi - x̄)² / (n - 1) ]
What is the computing formula for standard deviation?
The square root of
[ (∑xi squared) - (∑xi) squared / n ] / (n - 1)
i.e. s = √[ (∑xi² - (∑xi)²/n) / (n - 1) ]
What is the difference between (∑xi) squared and (∑xi squared)
(∑xi) squared = the values are added and then squared
(∑xi squared) = the values are squared and then added
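A quick sketch can confirm that the defining and computing formulas give the same answer. It reuses the data set {5, 7, 10, 13, 15} from the mean card. The function names `sd_defining` and `sd_computing` are made up for this example.

```python
import math

def sd_defining(values):
    """Defining formula: s = sqrt( sum((xi - x_bar)^2) / (n - 1) )."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

def sd_computing(n, sum_x, sum_x_sq):
    """Computing formula: only needs n, sum(xi) and sum(xi^2), not the raw values."""
    return math.sqrt((sum_x_sq - sum_x ** 2 / n) / (n - 1))

data = [5, 7, 10, 13, 15]
sum_x = sum(data)                    # 50
sum_x_sq = sum(x * x for x in data)  # values squared, then added: 568
sq_of_sum = sum_x ** 2               # values added, then squared: 2500

print(sd_defining(data))
print(sd_computing(len(data), sum_x, sum_x_sq))
```

Both give the square root of 68/4 = 17, which is roughly 4.12.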
What does the value of s tell us about the spread of a set of data values?
It tells us the “average” distance between data values and the sample mean. If the s value is large, the spread is large; if the s value is small, the spread is small
Generally speaking, if a data set has no outliers and is not skewed, what methods should be used to describe its center and spread?
Mean and standard deviation, respectively
What is standard deviation sensitive to?
Outliers
If a data set has outliers and is skewed, what methods should be used to describe its center and spread?
Median and IQR, respectively
What is μ?
The population mean
Pronounced mu
What is σ?
The population standard deviation
Pronounced sigma
What is the population standard deviation denoted by?
σ
Pronounced sigma
Why is the population mean (μ) and population standard deviation (σ) usually unknown?
Because ALL of the population values are needed, but this is often impossible to obtain
What is a parameter?
Descriptive measure for a population including population mean (μ) or a population standard deviation (σ)
Is a parameter fixed or variable?
It is fixed, for example, a population has only one mean (μ)
What are statistics?
Descriptive measures for a sample such as sample mean (x̄) and sample standard deviation (s)
Are statistics fixed or variable?
They are variable; each sample is going to have slightly different values and therefore slightly different sample means (x̄) and sample standard deviations (s)
What are the properties of parameters?
fixed
usually unknown
What are the properties of statistics?
easily calculated given a sample
varies from sample to sample
What is the five-number summary of a data set?
Minimum
Q1
Q2 (median)
Q3
Maximum
What is a boxplot used for?
Provide a graphical display of the center and variation of a numerical data set
What is a boxplot based on?
The five-number summary
What are the steps to creating a box plot?
- Draw short horizontal lines at Q1, Q2, Q3. Then connect them with vertical lines to form a box
- Find potential outliers which are data values < lower limit or > upper limit and denote these outliers by dots in the boxplot
- Find the max and min of the data values that are NOT outliers and draw short horizontal lines at these values; draw a “whisker” from the box to these lines
How do you find the upper and lower limits of a box plot?
Lower limit = Q1 - 1.5 X IQR
Upper limit = Q3 + 1.5 X IQR
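Here is a sketch of the outlier and whisker steps, using the first IQR data set from the earlier card (Q1 = 25, Q3 = 38). It takes the lower limit as Q1 - 1.5 X IQR and the upper limit as Q3 + 1.5 X IQR. The function name `boxplot_limits` is made up for illustration.

```python
def boxplot_limits(q1, q3):
    """Return (lower limit, upper limit) used to flag potential outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [13, 15, 21, 25, 26, 27, 30, 32, 34, 35, 38, 41, 43, 236]

lower, upper = boxplot_limits(q1=25, q3=38)  # quartiles taken from the IQR card

# Potential outliers are drawn as dots.
outliers = [x for x in data if x < lower or x > upper]

# Whiskers end at the min and max of the values that are NOT outliers.
inside = [x for x in data if lower <= x <= upper]
whiskers = (min(inside), max(inside))

print(f"limits: {lower} to {upper}")
print(f"outliers: {outliers}")
print(f"whiskers end at: {whiskers}")
```

The value 236 is the only potential outlier, so the upper whisker stops at 43 instead of stretching all the way out to 236.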
What can we tell about the data set distribution when a boxplot has an upper whisker that is longer than the lower whisker and there is a large distance between the Q2-Q3 with a small distance between Q1-Q2?
It is right skewed
How can you tell that a data set has a right skewed distribution when looking at a boxplot?
- upper whisker is longer than lower whisker
- large distance between Q2-Q3; small distance between Q1-Q2
How can you tell that a data set has a left skewed distribution when looking at a boxplot?
- lower whisker is longer than upper whisker
- large distance between Q1-Q2; small distance between Q2-Q3
How can you tell that a data set has a bell shaped distribution when looking at a boxplot?
- upper and lower whiskers have equal lengths
- the box in the middle is divided into 2 equal parts
What can we tell about the data set distribution when a boxplot has a lower whisker that is longer than the upper whisker and there is a large distance between the Q1-Q2 with a small distance between Q2-Q3?
Left skewed distribution
What can we tell about the data set distribution when a boxplot has upper and lower whiskers of equal length and the box in the middle is divided into 2 equal parts?
It has a bell shaped distribution