additional Flashcards

1
Q

Definition of statistics

A

the art and science of collecting, analyzing, presenting and interpreting data

2
Q

the term Statistics refers to

A

numerical facts such as averages, medians, percentages and maximums that help us understand a variety of business and economic situations

3
Q

Data

A

the facts and figures collected, analyzed and summarized for presentation and interpretation

4
Q

data set

A

all the data collected in a particular study

5
Q

Elements

A

entities on which data are collected

6
Q

Variable

A

characteristic of interest for the elements

7
Q

Observation

A

set of measurements obtained for a particular element

8
Q

What are the scales of measurement

A
  1. Nominal scale
  2. Ordinal Scale
  3. Interval Scale
  4. Ratio Scale
9
Q

What is the nominal scale

A

the data for a variable consists of labels or names used to identify an attribute of the element

10
Q

What is ordinal scale

A

the data exhibit the properties of nominal data and, in addition, the order or rank of the data is meaningful

11
Q

What is interval scale

A

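the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure (the ratio scale adds one more property: the ratio of two values is meaningful)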

12
Q

The statistical method appropriate for summarizing data depends on whether the data are

A

categorical or quantitative

13
Q

categorical data

A

data that can be grouped by specific categories

- uses either the nominal or ordinal scale of measurement

14
Q

Quantitative data

A

uses numeric values to indicate how much or how many

- uses either the interval or ratio scale of measurement

15
Q

Cross-sectional data

A

data collected at the same or approximately the same point in time

16
Q

Time series data

A

data collected over several time periods

17
Q

An observation is the set of measurements obtained for each element in a data set. Hence, the number of observations is always the same as

A

the number of elements

19
Q

Quantitative data can be

A

Discrete (finite)

Continuous (e.g., time, weight): no separation between possible data values

20
Q

Descriptive statistics

A

summaries of data which may be tabular, graphical or numerical

21
Q

Statistical inference

A

the process of using data obtained from a sample to make estimates or test hypotheses about the characteristics of a population

22
Q

Data mining

A

deals with methods for developing useful decision-making information from large databases

  • very useful for companies with strong consumer focus such as retail business, financial organizations, and communication companies
  • the process of using procedures from statistics and computer science to extract useful information from extremely large databases
23
Q

descriptive statistics

A

tabular, graphical, and numerical summaries of data

24
Q

Data visualization

A

used to describe the use of graphical displays to summarize and present information about a data set

25
frequency distribution
a tabular summary of data showing number (frequency) of observations in each of several nonoverlapping categories or classes
26
Relative frequency distribution
a tabular summary of data showing the relative frequency for each class; the relative frequencies total 1
27
Percent frequency
a tabular summary of data showing the percent frequency for each class; the percentages total 100
28
How can you summarize data for categorical variables
1. tabular or graphical displays
29
What types of tabular data can be used for categorical variables
Frequency distribution table - the # (frequency) of observations in each of several nonoverlapping categories - how many times each category appears
30
what types of graphical displays are there for categorical variables
1. pie charts | 2. bar charts
31
describe pie charts
use relative frequency or % frequency | - not generally the best display; people can usually judge differences in bar length better than differences in slice size
32
Describe bar charts
shows categorical data in frequency, relative frequency or % frequency
33
pareto diagram
when the bars are arranged in descending order of height from left to right with the most frequently occurring cause appearing first
34
Often the number of classes in a frequency distribution for categorical data is
the same as the number of categories | e.g., Coke, Diet Coke, ....
35
it is recommended that classes with smaller frequencies be
grouped into an aggregate class called "other" - classes with frequencies of 5% or less
36
The sum of frequencies in a frequency distribution always equals
the number of observations
37
The sum of the relative frequencies in any relative frequency distribution always equals
1.00
38
the sum of the percentages in a percent frequency distribution always equals
100
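The three totals on the preceding cards can be checked with a short Python sketch. The soft-drink purchases below are hypothetical data made up for illustration, not figures from the course material.

```python
from collections import Counter

# Hypothetical purchases (illustrative data only).
purchases = ["Coke", "Diet Coke", "Pepsi", "Coke", "Sprite",
             "Coke", "Pepsi", "Diet Coke", "Coke", "Pepsi"]

# Frequency distribution: count of observations in each category.
frequency = Counter(purchases)
n = sum(frequency.values())          # equals the number of observations

# Relative and percent frequency distributions.
relative = {drink: count / n for drink, count in frequency.items()}
percent = {drink: 100 * share for drink, share in relative.items()}

for drink in frequency:
    print(f"{drink:10s} {frequency[drink]:3d} {relative[drink]:.2f} {percent[drink]:5.1f}%")
```

The frequencies sum to the number of observations, the relative frequencies to 1.00, and the percent frequencies to 100.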
39
How can you summarize Quantitative Variable
1. tabular | 2. graphical
40
What is the tabular summary for Quantitative variables
Frequency distribution | - but we need to be more careful in defining the non overlapping classes
41
what are the steps to constructing a frequency distribution for quantitative data
1. determine the # of nonoverlapping classes (5-20) 2. determine the width of the classes - use the same width for each class: (largest data value - smallest data value) / # of classes - class widths can be rounded 3. determine the class limits (so each data value belongs to exactly one class) - lower limit and upper limit
42
What graphical representations can we use for Quantitative data
1. Dot Plot 2. Histogram 3. Stem and Leaf
43
what is the formula to determine the class widths
(largest data value - smallest data value) / # of classes
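A minimal Python sketch of the class-width formula and the resulting frequency distribution. The `audit_days` values, the choice of 5 classes, and the first lower limit of 10 are assumptions picked for illustration.

```python
import math

# Illustrative data: days needed to complete 20 audits (assumed values).
audit_days = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
              22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

num_classes = 5                                   # step 1: choose 5-20 classes
raw_width = (max(audit_days) - min(audit_days)) / num_classes
width = math.ceil(raw_width)                      # step 2: round the width up

# Step 3: class limits for integer data; each value falls in exactly one class.
first_lower = 10                                  # a convenient starting limit
limits = [(first_lower + i * width, first_lower + (i + 1) * width - 1)
          for i in range(num_classes)]

freq = [sum(lo <= x <= hi for x in audit_days) for lo, hi in limits]

for (lo, hi), f in zip(limits, freq):
    print(f"{lo}-{hi}: {f}")
```

Rounding the width up rather than down keeps the largest value inside the last class.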
44
what is the class midpoint
is the value halfway between the lower and upper class limits
45
describe the dot plot
- one of the simplest graphical summaries - horizontal axis shows the range for the data - useful for comparing the distribution of the data for two or more variables
46
describe the histogram
- for quantitative data (for categorical data use a bar chart) - similar to a bar chart but with no spaces between the bars - common for quantitative data
47
What does the histogram help show
the shape or the skewness of the distribution
48
What type of skewness are there
1. skewed left 2. skewed right 3. symmetrical
49
what are some examples of distributions that are roughly symmetrical
SAT scores, heights and weights of people
50
what are some examples of distributions that are generally skewed right (more data closer to 0 than the higher side)
Data from applications in business and economics often tend to be skewed right example: 1. housing prices, salaries, purchase amounts, etc
51
Describe the stem and leaf
graphical display used to show the rank order and shape of a distribution of data
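As a rough sketch, a stem-and-leaf display for two-digit values can be built in Python by using the tens digit as the stem and the units digit as the leaf. The scores below are hypothetical, and `stem_and_leaf` is a helper name introduced only for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit integers by stem (tens digit); leaves are units digits."""
    stems = defaultdict(list)
    for v in sorted(values):          # sorting gives the rank order
        stems[v // 10].append(v % 10)
    return dict(stems)

scores = [68, 72, 75, 81, 69, 92, 86, 73]   # hypothetical data
display = stem_and_leaf(scores)

for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Because every leaf is an actual digit of a data value, the original values can be read back from the display, which a histogram does not allow.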
52
What advantages does the stem and leaf have over a histogram
1. easier to construct by hand 2. within a class interval, the stem and leaf provides more information than the histogram because the stem and leaf shows the actual data
54
frequency distribution, histogram and stem and leaf
does not have an absolute number of rows or stems
55
stretched stem and leaf: whenever a stem value is stated twice, the first value corresponds to leaf values of _________ and the second value corresponds to leaf values of __________
1. 0-4 | 2. 5-9
56
is a stem and leaf display with more than 3 digits possible
yes; note that a single digit is used to define each leaf and that only the first 3 digits of each data value are used to construct the display (for example, the number 1565 is shown as stem 15, leaf 6); however, it is not possible to reconstruct the exact values from the display
57
what is an open-ended class (when speaking of classes for a frequency distribution)
open-end class requires only a lower class limit or an upper class limit - ex. suppose two of the audit times had taken 58 and 65 days. rather than continue with the classes of width 5 with classes 35-39, 40-44 etc, we could simplify it - we could show an open end class of 35 or more
58
when do you most often see open end classes
at the upper end of the distribution; sometimes they are seen at the lower end and occasionally at both ends
59
the last entry in a cumulative frequency distribution always equals
the total number of observations
60
the last entry in a cumulative relative frequency distribution is always
1.00
61
the last entry in a cumulative percent frequency distribution is always
100
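The last-entry rules on the three preceding cards can be verified with a small Python sketch. The class frequencies are assumed values chosen for illustration.

```python
from itertools import accumulate

# Assumed class frequencies for five classes (illustrative only).
class_freq = [4, 8, 5, 2, 1]
n = sum(class_freq)

cum_freq = list(accumulate(class_freq))        # running totals
cum_rel = [c / n for c in cum_freq]            # cumulative relative frequency
cum_pct = [100 * c / n for c in cum_freq]      # cumulative percent frequency

print(cum_freq)
print(cum_rel)
print(cum_pct)
```

The final cumulative frequency equals the number of observations, so the final relative and percent entries are 1.00 and 100.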
62
How can you summarize data for 2 variables
1. Cross tabulations | 2. graphically
63
Define cross tabulations to summarize 2 variables
- a tabular summary of data for two variables | - both variables can be categorical, both can be quantitative, or one can be categorical and the other quantitative
64
give an example of a cross tabulation
data with columns Restaurant, Quality Rating, Meal Price ($): (1, good, $18), (2, very good, $22), (3, excellent, $28), (4, bad, $38), etc.
65
Explain cross tabulations
- need to decide the # of classes to use when making a freq. dist. for quantitative variables - the margins provide info about each of the variables individually - primary value: provides insight about the relationship between the 2 variables
66
when summarizing data for two variables graphically what can you use
1. scatter Diagrams and | 2. Trendlines
67
Describe Scatter Diagrams
- graphical display of the relationship b/w 2 quantitative variables
68
Describe trend lines
a line that produces an approximation of the relationship
69
what kind of relationships can trendlines show
1. positive relationship 2. negative relationship 3. no apparent relationship
70
when do you use stacked bar charts
- used to display rel. frequency of each class, similar to a pie chart based on % can also be used to show frequencies
71
what types of bar charts are there
1. stacked | 2. side by side
72
What is Simpson's paradox
data in 2 or more crosstabulations are often combined to produce a summary crosstabulation showing how 2 variables are related - conclusions drawn from the separate crosstabs can be reversed when the data are aggregated into a single crosstab - this reversal of conclusions between aggregated and unaggregated data is Simpson's paradox - investigate whether the aggregated or the unaggregated form gives better insight
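A small numeric sketch in Python makes the reversal concrete. The counts are illustrative only, and the labels `treatment_a`, `treatment_b`, "small" and "large" are placeholders introduced for this example.

```python
# (successes, trials) for each subgroup; illustrative counts only.
treatment_a = {"small": (81, 87), "large": (192, 263)}
treatment_b = {"small": (234, 270), "large": (55, 80)}

def rate(successes, trials):
    return successes / trials

def aggregate_rate(groups):
    """Success rate after combining all subgroups into one crosstab row."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(t for _, t in groups.values())
    return total_successes / total_trials

for group in ("small", "large"):
    print(group, round(rate(*treatment_a[group]), 3), round(rate(*treatment_b[group]), 3))

print("aggregated", round(aggregate_rate(treatment_a), 3), round(aggregate_rate(treatment_b), 3))
```

Here A has the higher success rate within each subgroup, yet B looks better once the subgroups are combined, because the two treatments were applied to the subgroups in very different proportions.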
73
What are bar charts used for
- categorical data | - freq. and rel. freq distributions
74
what are pie charts used for
- categorical data | - rel freq and % freq
75
what are Dot plots used for
- Quantitative data | - used to show the distribution over the entire range of the data
76
What are Histograms used for
- quantitative data | - used to show freq. dist. data over a set of class intervals
77
What is the stem and leaf used for
- quantitative data | - to show rank order and shape of the distribution
78
what graphical displays are used to show relationships
1. scatter diagram (the relationship b/w 2 quantitative variables) 2. trendlines (used to approximate the relationship of data in a scatter diagram)
79
what are the displays used to show comparisons
1. side-by-side bar chart (used to compare 2 variables) | 2. stacked bar chart (used to compare the rel. freq. or % freq. of 2 variables)
80
What can you use to show multiple variables
1. Radar charts 2. Bubble charts - recommend not using them b/c can be over complicated - use bar charts and scatter diagrams
81
What are the measures of location
1. mean 2. weighted mean 3. geometric mean 4. median 5. mode 6. percentiles 7. quartiles
82
What is the most important measure of location
Mean
83
what is the mean also called
average or arithmetic average
84
which measure of central location is commonly used
mean
85
How do you calculate the mean
add up all the data values and divide by the number of values
86
What is the weighted mean
arithmetic mean where some data values contribute more than others
87
what is the formula for the weighted mean
sum (wi x xi)/ Sum wi
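A minimal Python sketch of the weighted mean formula. The purchase data (cost per pound weighted by pounds bought) is illustrative, and `weighted_mean` is a helper name introduced for this example.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Illustrative data: five purchases of a raw material.
cost_per_pound = [3.00, 3.40, 2.80, 2.90, 3.25]
pounds = [1200, 500, 2750, 1000, 800]

avg_cost = weighted_mean(cost_per_pound, pounds)
simple_avg = sum(cost_per_pound) / len(cost_per_pound)

print(round(avg_cost, 2), round(simple_avg, 2))
```

The weighted result is pulled toward 2.80 because the largest purchase was made at that price, whereas the simple average treats all five prices equally.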
88
What is the geometric mean
finding the nth root of the product of n values
89
what is the geometric mean often used for
used to analyze growth rates in financial data | - in these cases, the arithmetic mean will provide misleading results
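A short Python sketch of why the arithmetic mean misleads for growth rates. The two yearly returns (+50%, then -50%) are hypothetical, and `geometric_mean` is a helper written for this example.

```python
import math

def geometric_mean(factors):
    """nth root of the product of n growth factors."""
    return math.prod(factors) ** (1 / len(factors))

# Hypothetical returns: +50% in year 1, -50% in year 2.
growth_factors = [1.5, 0.5]

arithmetic = sum(growth_factors) / len(growth_factors)   # suggests no change
geometric = geometric_mean(growth_factors)               # about 0.866

# Value of 100 invested for two years at each "average" factor.
actual_final = 100 * math.prod(growth_factors)            # 75.0
via_geometric = 100 * geometric ** 2
via_arithmetic = 100 * arithmetic ** 2

print(actual_final, round(via_geometric, 2), via_arithmetic)
```

Compounding at the geometric mean reproduces the actual ending value of 75, while the arithmetic mean implies the investment is still worth 100.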
90
What is the mode
data value that occurs the most often (greatest frequency)
91
what is it called if there are 2 modes
bimodal
92
what if there are more than 2 modes
called multimodal | - don't report the mode b/c listing 3 or more modes would not be helpful in describing the location of the data
93
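What are percentiles
values that show how the data are spread over the interval from the smallest value to the largest; at least p% of the observations are less than or equal to the pth percentile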
94
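What are the steps in determining the percentiles
1. arrange data in ascending order 2. compute an index i = (p/100)n, where p = percentile of interest and n = # of observations 3. if i is not an integer, round up: the next integer greater than i is the position of the pth percentile; if i is an integer, the pth percentile is the average of the values in positions i and i + 1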
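The index method can be sketched in Python as follows. This assumes the usual companion rule that a non-integer index is rounded up to the next position; the salary figures are illustrative, `percentile` is a helper name introduced here, and p is taken to be a whole number strictly between 0 and 100.

```python
def percentile(data, p):
    """pth percentile by the index method i = (p/100) * n (1-based positions)."""
    values = sorted(data)                 # step 1: ascending order
    n = len(values)
    if (p * n) % 100 == 0:                # step 2: i is an integer
        i = p * n // 100
        return (values[i - 1] + values[i]) / 2   # average positions i and i+1
    i = p * n // 100 + 1                  # otherwise round i up
    return values[i - 1]

# Illustrative monthly starting salaries (assumed values).
salaries = [3310, 3355, 3450, 3480, 3480, 3490,
            3520, 3540, 3550, 3650, 3730, 3925]

print(percentile(salaries, 85))   # i = 10.2, so the 11th value
print(percentile(salaries, 50))   # i = 6, so the average of the 6th and 7th values
```

Other software may use different interpolation rules, so results can differ slightly from this method.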
95
What are Quartiles
- values that divide the data set into quarters - each containing 25% of the observations - always start with Q2 or the median
96
what is the median
- the middle of a sorted list of data values - it is not affected by outliers 1. arrange in ascending order 2. odd # of observations - the median is the middle value 3. even # of observations - the median is the average of the two middle values
97
what is the median most often used for
annual income and property value data | b/c very low or very high values can inflate the mean
98
What are the measures of variability
1. range 2. interquartile range 3. variance 4. standard deviation 5. coefficient of variation
99
What is the range
largest value - smallest value
100
why is the range seldom used
since it is only based on 2 observations it is highly influenced by extreme values
101
What is Interquartile range
Q3 - Q1 - overcomes the dependency on extreme values - the range of the middle 50% of the data
102
What is variance
- utilizes all of the data - based on difference b/w each observation (xi) and the mean - called deviation about the mean
103
what is the formula for variance of a data set
sum (xi - mean)^2 / (n - 1) (sample)
104
for any data set the sum of the deviations about the mean will always be
0 | sum (xi - mean) = 0
105
What is the standard deviation
- positive square root of the variance | - easier to interpret than the variance b/c sd is measured in the same units as the data
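A minimal Python sketch of the deviations about the mean, the sample variance, and the standard deviation. The class sizes are illustrative data, and `sample_variance` is a helper name introduced for this example.

```python
import math

def sample_variance(data):
    """Sum of squared deviations about the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

class_sizes = [46, 54, 42, 46, 32]     # illustrative data

mean = sum(class_sizes) / len(class_sizes)
deviations = [x - mean for x in class_sizes]   # always sum to zero

variance = sample_variance(class_sizes)        # in squared units
sd = math.sqrt(variance)                       # back in the data's units

print(mean, deviations, variance, sd)
```

The variance here is in "students squared", which is why the standard deviation is the easier number to interpret.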
106
what is the standard deviation commonly used for
commonly used as a measure of the risk associated with investing in stocks and stock funds
107
what is the coefficient of variation
- used when interested in a descriptive statistics that indicates how large the SD is relative to the mean - usually expressed as %
108
what is the formula for coefficient of variation
(sd/mean) x 100
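A quick Python sketch of the formula using the standard library's `statistics` module. The class sizes are illustrative data.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]     # illustrative data

mean = statistics.mean(class_sizes)
sd = statistics.stdev(class_sizes)     # sample standard deviation (n - 1)

cv = sd / mean * 100                   # coefficient of variation, in percent

print(f"sd is {cv:.1f}% of the mean")
```

Because the result is unit-free, it can be compared across variables measured on different scales.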
109
what does the coefficient of variation tell us
it tells us the sample sd is x% of the value of the sample mean - useful for comparing the variability of variables that have different sd and different means
110
What is MAE
mean absolute error
111
how do you calculate MAE
sum the absolute values of the deviations of the observations about the mean and divide it by the # of observations
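The calculation described above can be sketched in a few lines of Python. The data are illustrative, and `mean_absolute_error` is a helper name chosen to match the card's term.

```python
def mean_absolute_error(data):
    """Average absolute deviation of the observations about their mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n

class_sizes = [46, 54, 42, 46, 32]     # illustrative data
mae = mean_absolute_error(class_sizes)

print(mae)
```

Taking absolute values keeps positive and negative deviations from cancelling, which is the same problem the variance solves by squaring.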
112
When using the weighted mean, what is usually the weighted part
wi = pounds, dollars, volume or GPA by number
113
Whenever a data set contains extreme values, which measure of central location is preferred
Median because it is NOT influenced by extremely small and large data values
114
when is the mean appropriate for financial data
as an additive process
115
when do you use geometric mean vs mean
any time you want to determine the mean rate of change over several successive periods or changes in populations of species, crop yields, pollution levels and birth and death rates (applied to changes that occur over any number of successive periods of any length)
116
The difference between each xi and the mean is called a
deviation about the mean
117
what does the Coefficient of variation measure
it measures the standard deviation relative to the mean
118
the coefficient of variation is a useful statistic for
comparing the variability of variables that have different standard deviations and different means
119
What are the measures of distribution shape
1. Skewness 2. Chebyshev's theorem 3. z-Scores 4. Empirical Rule 5. Detecting outliers
120
What is Skewness
can be 1. skewed left 2. skewed right 3. symmetrical, skewness is zero
121
what is skewed left
skewness is negative | - mean is usually less than the median
122
What is skewed right
Skewness is positive | - the mean is usually more than the median
123
What is Symmetrical skewness
- the skewness is zero | - the mean and median are equal
124
What are z-scores
used to find the relative location of values within a data set - also called the standardized value
125
what do measures of relative location help us determine
how far a particular value is from the mean
126
the process of converting a value for a variable to a z-score is often called
z-transformation
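As a sketch, the z-score of a value is its deviation from the mean divided by the standard deviation, z = (x - mean) / s. The Python below applies that to illustrative data; `z_scores` is a helper name introduced here.

```python
import statistics

def z_scores(data):
    """Standardize each value: (x - sample mean) / sample standard deviation."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [(x - mean) / sd for x in data]

class_sizes = [46, 54, 42, 46, 32]     # illustrative data
z = z_scores(class_sizes)

for x, score in zip(class_sizes, z):
    print(x, round(score, 2))
```

A z-score of -1.5 for the value 32 means it lies 1.5 standard deviations below the mean.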
127
Chebyshev's theorem allows us to do what
make statements about the proportion of data values that must be within a specified # of sd of the mean
128
Chebyshev's theorem - how do you calculate "at least"
1 - (1/z^2), where z is the number of sd from the mean (z > 1)
129
what are the rules for chebyshev
1. at least 75% of the data must be within 2 sd of the mean 2. at least 89% is within 3 sd of the mean 3. at least 94% is within 4 sd of the mean
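The three percentages come straight from the bound 1 - 1/z^2, as this short Python sketch shows. `chebyshev_bound` is a helper name introduced for the example.

```python
def chebyshev_bound(z):
    """Minimum proportion of values within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("the theorem is only informative for z > 1")
    return 1 - 1 / z ** 2

for z in (2, 3, 4):
    print(z, f"{chebyshev_bound(z):.1%}")
```

The bound holds for any data set regardless of shape, which is why it is weaker than the empirical rule for bell-shaped data.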
130
what is the empirical rule
based on a normal prob distribution - used for symmetrical or bell shaped distribution - to determine % of data values that must be within a specified # of sd from the mean
131
what are rules for Empirical rule
1. approx 68% of the data values will be within 1 sd of the mean 2. approx 95% of the data values will be within 2 sd of the mean 3. almost all of the values will be within 3 sd of the mean
132
What can you use to detect outliers
1. limits based on the 1st and 3rd quartiles: lower limit = Q1 - 1.5(IQR), upper limit = Q3 + 1.5(IQR) - a value outside these limits is considered an outlier 2. z-scores - following the empirical rule, treat any data value with a z-score less than -3 or greater than +3 as an outlier
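A minimal Python sketch of the quartile-based limits. The quartiles are passed in directly so the example does not depend on a particular quartile method; the salary values and the quartiles 3465 and 3600 are assumed for illustration, and `outlier_limits` is a helper name introduced here.

```python
def outlier_limits(q1, q3):
    """Lower and upper limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Illustrative data with assumed quartiles Q1 = 3465 and Q3 = 3600.
salaries = [3310, 3355, 3450, 3480, 3480, 3490,
            3520, 3540, 3550, 3650, 3730, 3925]

lower, upper = outlier_limits(3465, 3600)
outliers = [x for x in salaries if x < lower or x > upper]

print(lower, upper, outliers)
```

Only 3925 falls outside the limits, so it is the one value flagged for a closer look.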
133
What is the five-number summary
1. smallest value 2. first Quartile 3. Median - Q2 4. 3rd Quartile 5. Largest value
134
What are the measures of association b/w 2 variables
1. covariance | 2. Correlation Coefficient
135
What is covariance
a descriptive measure of the linear association between 2 variables
136
covariance is the measure of what
of how much TWO random variables vary together - similar to variance but where variance tells you for a single variable, covariance tells you for TWO variables together
137
covariance is the measure of what
of how much TWO random variables vary together - similar to variance but where variance tells you for a single variable, covariance tells you for TWO variables together - if a relationship exists b/w the two variables
138
Types of covariance - may be incorrect
1. negative 2. near zero - or no association 3. positive
139
Positive covariance -
- a positive number, can be any positive number (doesn't tell us much, just that the variables are positively related) - as the value of x increases, the value of y increases - the points slant upward toward the upper right-hand corner of the scatter diagram - it doesn't tell us if the dots are close to the trendline or far away from it, which is why we use the correlation coefficient
140
negative covariance -
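- can be any negative number (doesn't tell us much, just that the variables are negatively related) - as the value of x increases, the value of y decreases - the size of the number does not show strength; the "closer to -1, the stronger" rule applies to the correlation coefficient, not to covariance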
141
zero covariance - maybe incorrect
no linear association between x and y - the computed covariance is 0 (or near 0) - no trend
142
Correlation Coefficient formula
covariance / (sd of x × sd of y)
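A Python sketch of the sample covariance and the correlation coefficient, including a rescaling check. The x and y values are illustrative, and `sample_covariance` and `correlation` are helper names introduced for this example.

```python
import statistics

def sample_covariance(x, y):
    """Sum of products of deviations about the means, divided by n - 1."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance divided by the product of the two sample standard deviations."""
    return sample_covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))

# Illustrative data: weekly commercials (x) and sales in $100s (y).
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

cov = sample_covariance(x, y)
r = correlation(x, y)

# Rescale x (for example, a change of units): covariance changes, correlation does not.
x_scaled = [100 * a for a in x]
cov_scaled = sample_covariance(x_scaled, y)
r_scaled = correlation(x_scaled, y)

print(cov, round(r, 4), cov_scaled, round(r_scaled, 4))
```

The covariance grows by a factor of 100 after rescaling while the correlation stays near 0.93, which is the reason the correlation coefficient is easier to interpret.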
143
What are the advantages of correlation Coefficient
1. covariance can take on any number while a correlation is limited to -1 to +1 2. more useful for determining how strong the relationship is b/w the TWO variables 3. it does not have units, covariance has units 4. it isn't affected by changes in the centre (i.e., mean) or scale of the variables
144
what is the simple definition of covariance
a statistical measure that shows whether two variables are related by measuring how the variables change in relation to each other - tells you if there is a relationship between two things and the relationship (+ -)
145
What is the simple definition of correlation
a measure of how two variables change in relation to each other, but it goes one step further than covariance in that correlation tells HOW STRONG the relationship is
146
what does a positive covariance indicate
as one increases, the other also increases
147
what does a negative covariance indicate
as one increases, the other actually decreases
148
Why is interpreting Covariance difficult
because covariance values are sensitive to the scale of the data. It does not tell us how close to the line the data are
149
What does correlation describe
describes relationships and is not sensitive to the scale of the data
150
How is correlation helpful
- it tells us how strong the relationship is, so if we know the value of, say, x, we can estimate the value of y within a range (make predictions and inferences, aka educated guesses) - the stronger the relationship, the smaller the range
151
Correlation does not mean
causation
152
What is the maximum value for Correlation Coefficient
1
153
What does it mean when Correlation = 1
when a straight line with a positive slope can go through the centre of every data point
154
Does correlation depend on the scale of the data?
no
155
Correlation can equal 1 when the slope is
large and when the slope is small
156
Correlation can equal 1 with a small amount of data points
be careful because we should not have much confidence in predictions made with this line (need more data)
157
Correlation and 3 points vs only 2
there is a very small chance that we will be able to draw a straight line through all 3 points. You can always draw a straight line between 2 points. 3 points gives us more confidence in the trend
158
What does a correlation of - 1 tell us
straight line with a negative slope goes through the centre of EVERY data point - strong Negative Relationship - if we know the value of x we can estimate within a narrow range the value of y
159
What does a correlation of 0 tell us
no linear relationship
160
Correlation = -0.02
negative relationship but since it is close to zero, it is not a strong relationship
161
What is another name for correlation Coefficient
Pearson Product Moment correlation Coefficient