additional Flashcards
Definition of statistics
the art and science of collecting, analyzing, presenting and interpreting data
the term Statistics refers to
numerical facts such as averages, medians, percentages and maximums that help us understand a variety of business and economic situations
Data
the facts and figures collected, analyzed and summarized for presentation and interpretation
data set
all the data collected in a particular study
Elements
entities on which data are collected
Variable
characteristic of interest for the elements
Observation
set of measurements obtained for a particular element
What are the scales of measurement
- Nominal scale
- Ordinal Scale
- Interval Scale
- Ratio Scale
What is the nominal scale
the data for a variable consists of labels or names used to identify an attribute of the element
What is ordinal scale
the data exhibit the properties of nominal data and, in addition, the order or rank of the data is meaningful
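What is interval scale
the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure
What is ratio scale
the data have all the properties of interval data and the ratio of two values is meaningful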
The statistical method appropriate for summarizing data depends on whether the data are
categorical or quantitative
categorical data
data that can be grouped by specific categories
- uses either the nominal or ordinal scale of measurement
Quantitative data
uses numeric values to indicate how much or how many
- uses either the interval or ratio scale of measurement
Cross-sectional data
data collected at the same or approximately the same point in time
Time series data
data collected over several time periods
An observation is the set of measurements obtained for each element in a data set. Hence, the number of observations is always the same as
the number of elements
Quantitative data can be
Discrete (finite)
Continuous (time/weight): no separation b/w possible data values
Descriptive statistics
summaries of data which may be tabular, graphical or numerical
Statistical inference
the process of using data obtained from a sample to make estimates or test hypotheses about the characteristics of a population
Data mining
deals with methods for developing useful decision-making information from large databases
- very useful for companies with a strong consumer focus such as retail businesses, financial organizations, and communication companies
- the process of using procedures from statistics and computer science to extract useful information from extremely large databases
descriptive statistics
tabular, graphical, and numerical summaries of data
Data visualization
used to describe the use of graphical displays to summarize and present information about a data set
frequency distribution
a tabular summary of data showing number (frequency) of observations in each of several nonoverlapping categories or classes
Relative frequency distribution
gives a tabular summary of data showing the relative frequency for each class; the total = 1
Percent frequency
summarizes the percent frequency of the data for each class; the total = 100
How can you summarize data for categorical variables
- tabular or graphical displays
What types of tabular data can be used for categorical variables
Frequency distribution table
- the number (frequency) of observations in each of several nonoverlapping categories
- how many times an element appears
what types of graphical displays are there for categorical variables
- pie charts
- bar charts
describe pie charts
use relative frequency or % frequency
- not generally the best display, usually people can better judge differences in length compared to slices
Describe bar charts
shows categorical data in frequency, relative frequency or % frequency
pareto diagram
when the bars are arranged in descending order of height from left to right with the most frequently occurring cause appearing first
Often the number of classes in a frequency distribution for categorical data is
the same as the number of categories
e.g., Coke, Diet Coke, ….
it is recommended that classes with smaller frequencies be
grouped into an aggregate class called “other” - classes with frequencies of 5% or less
The sum of frequencies in a frequency distribution always equals
the number of observations
The sum of the relative frequencies in any relative frequency distribution always equals
1.00
the sum of the percentages in a percent frequency distribution always equals
100
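A minimal Python sketch of the three distributions above, assuming a small hypothetical sample of soft drink purchases (the data and variable names are illustrative only):

```python
from collections import Counter

# Hypothetical sample of soft drink purchases (illustrative data only)
purchases = ["Coke", "Diet Coke", "Pepsi", "Coke", "Sprite",
             "Coke", "Pepsi", "Diet Coke", "Coke", "Pepsi"]

n = len(purchases)
frequency = Counter(purchases)                        # count per category
relative = {k: v / n for k, v in frequency.items()}   # frequency / n
percent = {k: 100 * r for k, r in relative.items()}   # relative frequency x 100

for category in frequency:
    print(f"{category:10s} {frequency[category]:3d} "
          f"{relative[category]:6.2f} {percent[category]:6.1f}")

# The three totals match the cards above: n, 1.00 and 100
print(sum(frequency.values()), sum(relative.values()), sum(percent.values()))
```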
How can you summarize quantitative variables
- tabular
- graphical
What is the tabular summary for Quantitative variables
Frequency distribution
- but we need to be more careful in defining the non overlapping classes
what are the steps to constructing a frequency distribution for quantitative data
- determine the # of nonoverlapping classes (5-20)
- determine the width of the classes; use the same width for each class
(largest data value - smallest data value) / # of classes; class widths can be rounded
- determine the class limits (so each data value belongs to exactly one class)
- each class has a lower limit and an upper limit
What graphical representations can we use for quantitative data
- Dot Plot
- Histogram
- Stem and Leaf
what is the formula to determine the class widths
(largest data value - smallest data value) / # of classes
what is the class midpoint
is the value halfway between the lower and upper class limits
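The construction steps can be sketched in Python. This is only an illustration: the audit times, the choice of 5 classes, the starting lower limit of 10, and the helper names `class_width` and `build_classes` are all assumptions.

```python
import math

def class_width(data, num_classes):
    """Approximate class width: (largest - smallest) / number of classes."""
    return (max(data) - min(data)) / num_classes

def build_classes(lower, width, num_classes):
    """Return (lower limit, upper limit) pairs for integer-valued data."""
    return [(lower + i * width, lower + (i + 1) * width - 1)
            for i in range(num_classes)]

# Illustrative audit times in days
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

raw_width = class_width(audit_times, 5)   # (33 - 12) / 5 = 4.2
width = math.ceil(raw_width)              # rounded up to a convenient width
classes = build_classes(10, width, 5)     # 10-14, 15-19, 20-24, 25-29, 30-34
midpoints = [(lo + hi) / 2 for lo, hi in classes]
frequencies = [sum(lo <= x <= hi for x in audit_times) for lo, hi in classes]

for (lo, hi), mid, f in zip(classes, midpoints, frequencies):
    print(f"{lo}-{hi}  midpoint={mid}  frequency={f}")
```

Each data value falls in exactly one class, so the frequencies add up to the number of observations.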
describe the dot plot
- one of the simplest graphical summaries
- horizontal axis shows the range for the data
- useful for comparing the distribution of the data for two or more variables
describe the histogram
- for quantitative data (categorical use bar chart)
- similar to a bar chart but does not have spaces between the bars
- common for quantitative data
What does the histogram help show
the shape or the skewness of the distribution
What type of skewness are there
- skewed left
- skewed right
- symmetrical
what are some examples of distributions that are roughly symmetrical
SAT scores, heights and weights of people
what are some examples of distributions that are generally skewed right (more data closer to 0 than the higher side)
Data from applications in business and economics often tend to be skewed right
example:
1. housing prices, salaries, purchase amounts, etc
Describe the stem and leaf
graphical display used to show the rank order and shape of a distribution of data
What advantages does the stem and leaf have over a histogram
- easier to construct by hand
- within a class interval, the stem and leaf provides more information than the histogram because the stem and leaf shows the actual data
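A small Python sketch of a stem-and-leaf display, assuming hypothetical test scores and a leaf unit of 1 (the helper name `stem_and_leaf` is illustrative). Every leaf is an actual data digit, which is why the display keeps more detail than a histogram:

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group integers by stem (all digits but the last) and leaf (last digit)."""
    stems = defaultdict(list)
    for value in sorted(data):            # sorting gives the rank order
        stems[value // 10].append(value % 10)
    return dict(stems)

scores = [112, 72, 69, 97, 107, 73, 92, 76, 86, 73]   # hypothetical test scores
display = stem_and_leaf(scores)

for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:2d} | {leaves}")
```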
frequency distribution, histogram and stem and leaf
do not have an absolute number of classes, rows or stems
stretched stem and leaf: whenever a stem value is stated twice, the first value corresponds to leaf values of _________ and the second value corresponds to leaf values of __________
- 0-4
- 5-9
is a stem and leaf display with more than 3 digits possible
yes, note that a single digit is used to define each leaf and that only the first 3 digits of each data value have been used to construct the display
example: the number 1565 has stem 15 and leaf 6 (leaf unit = 10); the final digit 5 is dropped
however, it is not possible to reconstruct the exact values
what is an open-ended class (when speaking of classes for a frequency distribution)
open-end class requires only a lower class limit or an upper class limit
- ex. suppose two of the audit times had taken 58 and 65 days. rather than continue with the classes of width 5 with classes 35-39, 40-44 etc, we could simplify it
- we could show an open end class of 35 or more
when do you most often see open end classes
at the upper end of the distribution
sometimes they are seen at the lower end
and occasionally at both ends
the last entry in a cumulative frequency distribution always equals
the total number of observations
the last entry in a cumulative relative frequency distribution is always
1.00
the last entry in a cumulative percent frequency distribution is always
100
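A short Python sketch of the cumulative distributions, assuming hypothetical class frequencies (the labels and counts are illustrative only):

```python
from itertools import accumulate

# Hypothetical classes and frequencies
class_labels = ["10-14", "15-19", "20-24", "25-29", "30-34"]
frequencies = [4, 8, 5, 2, 1]

n = sum(frequencies)
cumulative = list(accumulate(frequencies))            # running totals
cumulative_relative = [c / n for c in cumulative]     # running total / n
cumulative_percent = [100 * c / n for c in cumulative]

for label, c, r, p in zip(class_labels, cumulative,
                          cumulative_relative, cumulative_percent):
    print(f"<= {label.split('-')[1]}: {c:3d} {r:5.2f} {p:6.1f}")
```

The last entries come out as n, 1.00 and 100, as the cards state.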
How can you summarize data for 2 variables
- Cross tabulations
- graphically
Define cross tabulations to summarize 2 variables
- both variables can be either categorical or quantitative
- can have one categorical and one quantitative variable, or any combination of the two
give an example of a cross tabulation
Restaurant   Quality Rating   Meal Price ($)
1            good             18
2            very good        22
3            excellent        28
4            bad              38
etc.
Explain cross tabulations
- need to decide the # of classes to use when making a freq. dist. for quantitative variables
- margins provide info about each of the variables individually
- primary value: provide insight about the relationship b/w 2 variables
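A minimal Python sketch of a crosstabulation with margins. The records, the $10 class width, and the helper name `price_class` are hypothetical choices for illustration:

```python
from collections import Counter

# Hypothetical (quality rating, meal price in $) records
records = [("good", 18), ("very good", 22), ("excellent", 28), ("good", 38),
           ("very good", 33), ("good", 12), ("excellent", 45), ("very good", 14)]

def price_class(price):
    """Assign a meal price to a class of width $10 (assumed class limits)."""
    lower = (price // 10) * 10
    return f"${lower}-{lower + 9}"

# Cell counts: one count per (row category, column class) pair
cells = Counter((quality, price_class(price)) for quality, price in records)

# Margins: totals for each variable on its own
row_totals = Counter(quality for quality, _ in records)
col_totals = Counter(price_class(price) for _, price in records)

for (quality, pclass), count in sorted(cells.items()):
    print(f"{quality:10s} {pclass:8s} {count}")
print("row margins:", dict(row_totals))
print("column margins:", dict(col_totals))
```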
when summarizing data for two variables graphically what can you use
- scatter Diagrams and
- Trendlines
Describe Scatter Diagrams
- graphical display of the relationship b/w 2 quantitative variables
Describe trend lines
a line that produces an approximation of the relationship
what kind of relationships can trendlines show
- positive relationship
- negative relationship
- no apparent relationship
when do you use stacked bar charts
- used to display rel. frequency of each class, similar to a pie chart based on %
can also be used to show frequencies
what types of bar charts are there
- stacked
- side by side
What is Simpson’s paradox
data in 2 or more crosstabulations are often combined to produce a summary crosstabulation showing how 2 variables are related
- conclusions from 2 or more separate crosstabs can be reversed when the data are aggregated into a single crosstab
- the reversal of conclusions based on aggregated and unaggregated data is Simpson’s paradox; investigate whether the aggregated or unaggregated crosstab provides better insight
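The reversal can be seen in a small Python sketch. The (successes, trials) counts below are hypothetical and chosen only to show the effect: option A looks better inside each group, yet option B looks better once the groups are combined.

```python
# Hypothetical (successes, trials) counts for two options in two subgroups
data = {
    "group 1": {"A": (81, 87), "B": (234, 270)},
    "group 2": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    """Success rate as a proportion."""
    return successes / trials

# Unaggregated: compare the options inside each group
for group, options in data.items():
    print(group, {name: round(rate(*counts), 3) for name, counts in options.items()})

# Aggregated: add the counts across groups, then compare again
totals = {}
for options in data.values():
    for name, (s, t) in options.items():
        prev_s, prev_t = totals.get(name, (0, 0))
        totals[name] = (prev_s + s, prev_t + t)

print("combined", {name: round(rate(*counts), 3) for name, counts in totals.items()})
```

The reversal happens here because the two options are spread very unevenly across the groups, which is the kind of hidden variable worth checking before trusting the aggregated table.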
What are bar charts used for
- categorical data
- freq. and rel. freq distributions
what are pie charts used for
- categorical data
- rel freq and % freq
what are Dot plots used for
- Quantitative data
- used to show the distribution over the entire range of the data
What are Histograms used for
- quantitative data
- used to show freq. dist. data over a set of class intervals
What is the stem and leaf used for
- quantitative data
- to show rank order and shape of the distribution
what graphical displays are used to show relationships
- scatter diagram (the relationship b/w 2 quantitative variables)
- trendlines (used to approximate the relationship of data in a scatter diagram)
what are the displays used to show comparisons
- side by side bar chart (used to compare 2 variables)
- stacked bar chart (used to compare the rel freq or % freq of 2 variables)
What can you use to show multiple variables
- Radar charts
- Bubble charts
- recommend not using them b/c can be over complicated
- use bar charts and scatter diagrams
What are the measures of location
- mean
- weighted mean
- GEOMETRIC mean
- median
- mode
- percentiles
- quartiles
What is the most important measure of location
Mean
what is the mean also called
average or arithmetic average
which measure of central location is commonly used
mean
How do you calculate the mean
add up all the data values and divide by the number of values
What is the weighted mean
arithmetic mean where some data values contribute more than others
what is the formula for the weighted mean
sum(wi * xi) / sum(wi)
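A quick Python sketch of the weighted mean formula, assuming hypothetical purchases where the cost per pound is the value (xi) and the pounds bought is the weight (wi):

```python
def weighted_mean(values, weights):
    """sum(wi * xi) / sum(wi)"""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical purchases: cost per pound (xi) weighted by pounds bought (wi)
cost_per_pound = [3.00, 3.40, 2.80]
pounds = [1200, 500, 2750]

average_cost = weighted_mean(cost_per_pound, pounds)
print(round(average_cost, 3))

# With equal weights the weighted mean reduces to the ordinary mean
print(weighted_mean([1, 2, 3], [1, 1, 1]))
```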
What is the geometric mean
finding the nth root of the product of n values
what is the geometric mean often used for
used to analyze growth rates in financial data
- in these cases, the arithmetic mean will provide misleading results
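A small Python sketch of why the arithmetic mean misleads for growth rates. The two yearly returns are hypothetical, and `math.prod` assumes Python 3.8 or later:

```python
import math

def geometric_mean(growth_factors):
    """nth root of the product of n growth factors."""
    n = len(growth_factors)
    return math.prod(growth_factors) ** (1 / n)

# Hypothetical returns: +10% one year, -10% the next, as growth factors
factors = [1.10, 0.90]

g = geometric_mean(factors)
arithmetic = sum(factors) / len(factors)

print(f"geometric mean factor:  {g:.4f}")          # slightly below 1, a small loss
print(f"arithmetic mean factor: {arithmetic:.4f}")  # suggests no change at all

# $100 compounded at the geometric mean rate matches the real ending balance
print(round(100 * g ** 2, 2), round(100 * 1.10 * 0.90, 2))
```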
What is the mode
data value that occurs the most often (greatest frequency)
what is it called if there are 2 modes
bimodal
what if there are more than 2 modes
called multimodal
- don’t report the mode b/c listing 3 or more modes would not be helpful in describing the location of the data
What are percentiles
values that show how the data are spread over the interval from the smallest value to the largest
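What are the steps in determining the percentiles
- arrange data in ascending order
- compute an index i = (p/100)n
- p = percentile of interest
- n = # of observations
- if i is not an integer, round up; the next integer greater than i is the position of the pth percentile
- if i is an integer, the pth percentile is the average of the values in positions i and i + 1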
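The index method can be sketched in Python as follows. The salary figures are illustrative, the function name `percentile` is an assumption, and other software may use a different percentile rule and give slightly different answers:

```python
import math

def percentile(data, p):
    """pth percentile by the index method, for 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be between 0 and 100 (exclusive)")
    values = sorted(data)                  # step 1: ascending order
    n = len(values)
    i = p * n / 100                        # step 2: index i = (p/100)n
    if i == int(i):                        # integer: average positions i and i + 1
        i = int(i)
        return (values[i - 1] + values[i]) / 2
    return values[math.ceil(i) - 1]        # not an integer: round up

# Illustrative monthly starting salaries
salaries = [3450, 3550, 3650, 3480, 3355, 3310,
            3490, 3730, 3540, 3925, 3520, 3480]

print(percentile(salaries, 85))   # i = 10.2, so the 11th value
print(percentile(salaries, 50))   # i = 6, so the average of the 6th and 7th values
```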
What are Quartiles
- values that divide the data set into quarters
- each containing 25% of the observations
- always start with Q2 or the median
what is the median
- it is not affected by outliers
- the middle of a sorted list of data values
1. arrange in ascending order
2. odd # of observations - the median is the middle value
3. even # of observations - the median is the average of the two middle values
what is the median most often used for
annual income and property value data
b/c very low or very high values can inflate the mean
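A short Python sketch of how one extreme value moves the mean but barely moves the median. The incomes are hypothetical:

```python
from statistics import mean, median

incomes = [42_000, 45_000, 47_000, 51_000, 55_000]   # hypothetical annual incomes
with_outlier = incomes + [1_000_000]                 # add one extreme value

print(mean(incomes), median(incomes))
print(round(mean(with_outlier)), median(with_outlier))
```

With the extra value there is an even number of observations, so the median is the average of the two middle values.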
What are the measures of variability
- range
- interquartile range
- variance
- standard deviation
- coefficient of variation
What is the range
largest value - smallest value
why is the range seldom used
since it is only based on 2 observations it is highly influenced by extreme values
What is Interquartile range
Q3 - Q1
- overcomes the dependency on extreme values
- the range of the middle 50% of the data
What is variance
- utilizes all of the data
- based on difference b/w each observation (xi) and the mean
- called deviation about the mean
what is the formula for variance of a data set
sum((xi - mean)^2) / (n - 1) (sample)
for any data set the sum of the deviations about the mean will always be
0
sum(xi - mean) = 0
What is the standard deviation
- positive square root of the variance
- easier to interpret than the variance b/c sd is measured in the same units as the data
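A minimal Python sketch of the sample variance and standard deviation, assuming a hypothetical sample of five class sizes:

```python
import math

def sample_variance(data):
    """sum((xi - mean)^2) / (n - 1)"""
    n = len(data)
    m = sum(data) / n
    return sum((x - m) ** 2 for x in data) / (n - 1)

class_sizes = [46, 54, 42, 46, 32]        # hypothetical sample

m = sum(class_sizes) / len(class_sizes)
deviations = [x - m for x in class_sizes]  # deviations about the mean

variance = sample_variance(class_sizes)
sd = math.sqrt(variance)                   # positive square root of the variance

print("sum of deviations:", sum(deviations))
print("variance:", variance, "sd:", sd)
```

The variance is in squared units (students squared here), while the standard deviation is back in the units of the data.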
what is the standard deviation commonly used for
used as a measure of the risk associated with investing in stock and stock funds
what is the coefficient of variation
- used when interested in a descriptive statistic that indicates how large the SD is relative to the mean
- usually expressed as %
what is the formula for coefficient of variation
(sd/mean) x 100
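A short Python sketch of the coefficient of variation using the standard library, assuming a hypothetical sample:

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """(sample sd / sample mean) x 100, expressed as a percentage."""
    return stdev(data) / mean(data) * 100

class_sizes = [46, 54, 42, 46, 32]   # hypothetical sample
cv = coefficient_of_variation(class_sizes)
print(f"the sample sd is {cv:.1f}% of the sample mean")
```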
what does the coefficient of variation tell us
it tells us the sample sd is x% of the value of the sample mean
- useful for comparing the variability of variables that have different sd and different means
What is MAE
mean absolute error
how do you calculate MAE
sum the absolute values of the deviations of the observations about the mean and divide it by the # of observations
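The calculation described in this card, sketched in Python with a hypothetical sample (the function name is illustrative):

```python
def mean_absolute_error(data):
    """Average absolute deviation of the observations about their mean."""
    n = len(data)
    m = sum(data) / n
    return sum(abs(x - m) for x in data) / n

class_sizes = [46, 54, 42, 46, 32]   # hypothetical sample
mae = mean_absolute_error(class_sizes)
print(mae)
```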
When using the weighted mean, what is usually the weighted part
wi = pounds, dollars, volume or GPA by number
Whenever a data set contains extreme values, which measure of central location is preferred
Median because it is NOT influenced by extremely small and large data values
when is the mean appropriate for financial data
as an additive process
when do you use geometric mean vs mean
any time you want to determine the mean rate of change over several successive periods
or
changes in populations of species, crop yields, pollution levels and birth and death rates
(applied to changes that occur over any number of successive periods of any length)
The difference between each xi and the mean is called a
deviation about the mean
what does the Coefficient of variation measure
it measures the standard deviation relative to the mean
the coefficient of variation is a useful statistic for
comparing the variability of variables that have different standard deviations and different means
What are the measures of distribution shape
- Skewness
- Chebyshev’s theorem
- z-Scores
- Empirical Rule
- Detecting outliers
What is Skewness
can be
- skewed left
- skewed right
- symmetrical, skewness is zero
what is skewed left
skewness is negative
- mean is usually less than the median
What is skewed right
Skewness is positive
- the mean is usually more than the median
What is Symmetrical skewness
- the skewness is zero
- the mean and median are equal
What are z-scores
to find relative locations of values within a data set
- also called standard value
what does measures of relative location help us determine
how far a particular value is from the mean
the process of converting a value for a variable to a z-score is often called
z-transformation
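A minimal Python sketch of the z-transformation, z = (x - mean) / sd, assuming a hypothetical sample and the sample standard deviation:

```python
from statistics import mean, stdev

def z_scores(data):
    """Number of standard deviations each value lies from the mean."""
    m, s = mean(data), stdev(data)
    return [(x - m) / s for x in data]

class_sizes = [46, 54, 42, 46, 32]   # hypothetical sample
z = z_scores(class_sizes)

for x, score in zip(class_sizes, z):
    print(f"{x}: z = {score:+.2f}")
```

A positive z-score means the value is above the mean, a negative one means it is below.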
Chebyshev’s theorem allows us to do what
make statements about the proportion of data values that must be within a specified # of sd of the mean
Chebyshev’s theorem - how do you calculate the "at least" proportion
1 - (1/z^2), where z is any value greater than 1
what are the rules for chebyshev
- at least 75% of the data must be within 2 sd of the mean
- at least 89% is within 3 sd of the mean
- at least 94% is within 4 sd of the mean
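The three rules follow directly from 1 - 1/z^2, as this small Python sketch shows (the function name is illustrative):

```python
def chebyshev_minimum(z):
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("the theorem applies only for z greater than 1")
    return 1 - 1 / z ** 2

for z in (2, 3, 4):
    print(f"z = {z}: at least {chebyshev_minimum(z):.1%}")
```

Because the theorem holds for any distribution shape, these are lower bounds, not estimates.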
what is the empirical rule
based on a normal prob distribution
- used for symmetrical or bell shaped distribution
- to determine % of data values that must be within a specified # of sd from the mean
what are rules for Empirical rule
- approx 68% of the data values will be within 1 sd of the mean
- approx 95% of the data values will be within 2 sd of the mean
- almost all of the values will be within 3 sd of the mean
What can you use to detect outliers
- IQR limits: based on the 1st and 3rd quartiles (lower limit = Q1 - 1.5(IQR), upper limit = Q3 + 1.5(IQR))
- if a value is outside these limits, it is considered an outlier
- z-scores: see the empirical rule; treat any data value with a z-score less than -3 or greater than +3 as an outlier (double check)
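A Python sketch of the IQR limits. The salaries and the quartile values passed in are illustrative assumptions; the quartiles are supplied by hand because different quartile rules give slightly different Q1 and Q3:

```python
def iqr_limits(q1, q3):
    """Lower and upper limits: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Values that fall outside the IQR limits."""
    lower, upper = iqr_limits(q1, q3)
    return [x for x in data if x < lower or x > upper]

# Illustrative monthly starting salaries, with Q1 = 3465 and Q3 = 3600 assumed
salaries = [3310, 3355, 3450, 3480, 3480, 3490,
            3520, 3540, 3550, 3650, 3730, 3925]

print(iqr_limits(3465, 3600))
print(find_outliers(salaries, 3465, 3600))
```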
What is the 5 number summary
- smallest value
- first Quartile
- Median - Q2
- 3rd Quartile
- Largest value
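A Python sketch of the five-number summary using `statistics.quantiles` (Python 3.8 or later). The salaries are illustrative. The library interpolates between data values, so Q1 and Q3 may differ slightly from quartiles found with an index-based rule:

```python
from statistics import quantiles

def five_number_summary(data):
    """Smallest value, Q1, median (Q2), Q3, largest value."""
    q1, q2, q3 = quantiles(data, n=4)   # default method is "exclusive"
    return min(data), q1, q2, q3, max(data)

# Illustrative monthly starting salaries
salaries = [3310, 3355, 3450, 3480, 3480, 3490,
            3520, 3540, 3550, 3650, 3730, 3925]

summary = five_number_summary(salaries)
print(summary)
```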
What are the measures of association b/w 2 variables
- covariance
- Correlation Coefficient
What is covariance
a descriptive measure of the linear association between 2 variables
covariance is the measure of what
of how much TWO random variables vary together
- similar to variance but where variance tells you for a single variable, covariance tells you for TWO variables together
- if a relationship exists b/w the two variables
Types of covariance - may be incorrect
- negative
- near zero - or no association
- positive
Positive covariance -
- a positive number, can be any positive number (doesn’t tell us much, just that they are positive related)
- as the value of x increases, the value of y increases
- the points slant upward toward the upper right-hand corner of the page
- it doesn’t tell us if the dots are close to the trendline or far away from the trendline, this is why we use correlation coefficient
negative covariance -
- can be any negative number (doesn’t tell us much, just that they are negatively related)
- as the value of x increases, the value of y decreases
- for the correlation coefficient (not the covariance), the closer to -1 the stronger the negative relationship
zero covariance - maybe incorrect
no linear association b/w x and y: the covariance after calculation will be 0 or near 0 (no trend)
Correlation Coefficient formula
covariance / (sd of x * sd of y)
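A Python sketch of the sample covariance and the correlation coefficient. The data (number of commercials and sales volume) and the helper names are illustrative. The last lines rescale x to show that the covariance depends on the units while the correlation does not:

```python
import math

def sample_covariance(x, y):
    """sum((xi - mean_x)(yi - mean_y)) / (n - 1)"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def sample_sd(values):
    """Sample standard deviation; the variance is the covariance of a variable with itself."""
    return math.sqrt(sample_covariance(values, values))

def correlation(x, y):
    """covariance / (sd of x * sd of y)"""
    return sample_covariance(x, y) / (sample_sd(x) * sample_sd(y))

# Illustrative data: commercials aired (x) and sales volume (y)
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

cov = sample_covariance(commercials, sales)
r = correlation(commercials, sales)
print(f"covariance = {cov:.2f}, correlation = {r:.3f}")

# Change the units of x: covariance is multiplied by 100, correlation is unchanged
rescaled = [100 * c for c in commercials]
print(f"{sample_covariance(rescaled, sales):.2f}, {correlation(rescaled, sales):.3f}")
```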
What are the advantages of correlation Coefficient
- covariance can take on any number while a correlation is limited to -1 to +1
- more useful for determining how strong the relationship is b/w the TWO variables
- it does not have units, covariance has units
- it isn’t affected by changes in the centre (ie mean) or scale of the variables
what is the simple definition of covariance
a statistical measure that shows whether two variables are related by measuring how the variables change in relation to each other
- tells you if there is a relationship between two things and the relationship (+ -)
What is the simple definition of correlation
a measure of how two variables change in relation to each other, but it goes one step further than covariance in that correlation tells HOW STRONG the relationship is
what does a positive covariance indicate
as one increases, the other also increases
what does a negative covariance indicate
as one increases, the other actually decreases
Why is interpreting Covariance difficult
because covariance values are sensitive to the scale of the data. It does not tell us how close to the line the data are
What does correlation describe
describes relationships and is not sensitive to the scale of the data
How is correlation helpful
- it can tell us how strong the relationship is, so if we know the value of, say, x, we can estimate the value of y pretty easily (within a range)
- lets us make predictions and inferences (aka educated guesses); the stronger the relationship, the smaller the range
Correlation does not mean
causation
What is the maximum value for Correlation Coefficient
1
What does it mean when Correlation = 1
when a straight line with a positive slope can go through the centre of every data point
Does correlation depend on the scale of the data?
no
Correlation can equal 1 when the slope is
large and when the slope is small
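A small Python sketch of this point, using made-up points that lie exactly on straight lines with very different slopes (the `correlation` helper is illustrative):

```python
import math

def correlation(x, y):
    """Pearson correlation computed from sums of squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
steep = [100 * v for v in x]          # large positive slope
shallow = [0.01 * v + 7 for v in x]   # small positive slope
falling = [-3 * v for v in x]         # negative slope

print(round(correlation(x, steep), 6))
print(round(correlation(x, shallow), 6))
print(round(correlation(x, falling), 6))
```

The size of the slope does not matter; only whether the points fall on a line and which way the line tilts.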
Correlation can equal 1 with a small amount of data points
be careful because we should not have much confidence in predictions made with this line (need more data)
Correlation and 3 points vs only 2
there is a very small chance that we will be able to draw a straight line through all 3 points. You can always draw a straight line between 2 points. 3 points gives us more confidence in the trend
What does a correlation of - 1 tell us
straight line with a negative slope goes through the centre of EVERY data point
- strong Negative Relationship
- if we know the value of x we can estimate within a narrow range the value of y
What does a correlation of 0 tell us
no linear relationship
Correlation = -0.02
negative relationship but since it is close to zero, it is not a strong relationship
What is another name for correlation Coefficient
Pearson Product Moment correlation Coefficient