additional Flashcards

1
Q

Definition of statistics

A

the art and science of collecting, analyzing, presenting and interpreting data

2
Q

the term Statistics refers to

A

numerical facts such as averages, medians, percentages and maximums that help us understand a variety of business and economic situations

3
Q

Data

A

the facts and figures collected, analyzed and summarized for presentation and interpretation

4
Q

data set

A

all the data collected in a particular study

5
Q

Elements

A

entities on which data are collected

6
Q

Variable

A

characteristic of interest for the elements

7
Q

Observation

A

set of measurements obtained for a particular element

8
Q

What are the scales of measurement

A
  1. Nominal scale
  2. Ordinal Scale
  3. Interval Scale
  4. Ratio Scale
9
Q

What is the nominal scale

A

the data for a variable consists of labels or names used to identify an attribute of the element

10
Q

What is ordinal scale

A

the data exhibit the properties of nominal data and, in addition, the order or rank of the data is meaningful

11
Q

What is interval scale

A

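the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure

- ratio scale: the data have all the properties of interval data and the ratio of two values is meaningful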

12
Q

The statistical method appropriate for summarizing data depends on whether the data are

A

categorical or quantitative

13
Q

categorical data

A

data that can be grouped by specific categories

- uses either the nominal or ordinal scale of measurement

14
Q

Quantitative data

A

uses numeric values to indicate how much or how many

- uses either the interval or ratio scale of measurement

15
Q

Cross-sectional data

A

data collected at the same or approximately the same point in time

16
Q

Time series data

A

data collected over several time periods

17
Q

An observation is the set of measurements obtained for each element in a data set. Hence, the number of observations is always the same as

A

the number of elements

19
Q

Quantitative data can be

A

Discrete (finite)

Continuous (time/weight) - no separation b/w possible data values

20
Q

Descriptive statistics

A

summaries of data which may be tabular, graphical or numerical

21
Q

Statistical inference

A

the process of using data obtained from a sample to make estimates or test hypotheses about the characteristics of a population

22
Q

Data mining

A

deals with methods for developing useful decision-making information from large databases

  • very useful for companies with a strong consumer focus such as retail businesses, financial organizations, and communication companies
  • the process of using procedures from statistics and computer science to extract useful information from extremely large databases
23
Q

descriptive statistics

A

tabular, graphical, and numerical summaries of data

24
Q

Data visualization

A

used to describe the use of graphical displays to summarize and present information about a data set

25
Q

frequency distribution

A

a tabular summary of data showing number (frequency) of observations in each of several nonoverlapping categories or classes

26
Q

Relative frequency distribution

A
gives a tabular summary of data showing the relative frequency for each class 
total = 1
27
Q

Percent frequency

A
summarizes the percent frequency of the data for each class 
total = 100
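
The three tabular summaries above (frequency, relative frequency, percent frequency) can be sketched in a few lines. This is a minimal illustration, not part of the original deck: the purchase list is made-up data, and the variable names are assumptions chosen for readability.

```python
from collections import Counter

# Hypothetical sample of soft-drink purchases (made-up data for illustration).
purchases = ["Coke", "Diet Coke", "Coke", "Pepsi", "Coke",
             "Sprite", "Pepsi", "Diet Coke", "Coke", "Pepsi"]

n = len(purchases)

# Frequency distribution: number of observations in each nonoverlapping class.
frequency = Counter(purchases)

# Relative frequency: fraction of observations in each class (sums to 1).
relative = {drink: count / n for drink, count in frequency.items()}

# Percent frequency: relative frequency times 100 (sums to 100).
percent = {drink: 100 * count / n for drink, count in frequency.items()}

for drink in frequency:
    print(f"{drink:10s} {frequency[drink]:3d} {relative[drink]:6.2f} {percent[drink]:6.1f}")
```

The sum of the frequencies equals the number of observations, which is why the relative frequencies total 1 and the percent frequencies total 100.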
28
Q

How can you summarize data for categorical variables

A
  1. tabular or graphical displays
29
Q

What types of tabular data can be used for categorical variables

A

Frequency distribution table

  • the number (frequency) of observations in each of several nonoverlapping categories
  • how many times an element appears
30
Q

what types of graphical displays are there for categorical variables

A
  1. pie charts
  2. bar charts

31
Q

describe pie charts

A

use relative frequency or % frequency

- not generally the best display, usually people can better judge differences in length compared to slices

32
Q

Describe bar charts

A

shows categorical data in frequency, relative frequency or % frequency

33
Q

pareto diagram

A

when the bars are arranged in descending order of height from left to right with the most frequently occurring cause appearing first

34
Q

Often the number of classes in a frequency distribution for categorical data is

A

the same as the number of categories

e.g. Coke, Diet Coke, ….

35
Q

it is recommended that classes with smaller frequencies be

A

grouped into an aggregate class called “other” - classes with frequencies of 5% or less

36
Q

The sum of frequencies in a frequency distribution always equals

A

the number of observations

37
Q

The sum of the relative frequencies in any relative frequency distribution always equals

A

1.00

38
Q

the sum of the percentages in a percent frequency distribution always equals

A

100

39
Q

How can you summarize Quantitative Variable

A
  1. tabular
  2. graphical

40
Q

What is the tabular summary for Quantitative variables

A

Frequency distribution

- but we need to be more careful in defining the non overlapping classes

41
Q

what are the steps to constructing a frequency distribution for quantitative data

A
  1. determine # of nonoverlapping classes (5-20)
  2. determine the width of the classes - use the same width for each class
    approximate class width = (largest data value - smallest data value) / # of classes
    • class widths can be rounded
  3. determine class limits (so each data value belongs to exactly one class)
    • lower class limit and upper class limit
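
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the audit times are illustrative values, the choice of 5 classes and a first lower class limit of 10 are assumptions, and the approximate width is rounded up to a whole number.

```python
import math

# Illustrative audit times in days (assumed values, not from the deck).
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

# Step 1: choose the number of nonoverlapping classes (assumption: 5).
num_classes = 5

# Step 2: approximate class width, then round up to a convenient value.
approx_width = (max(data) - min(data)) / num_classes   # (33 - 12) / 5 = 4.2
width = math.ceil(approx_width)                        # rounded up to 5

# Step 3: class limits, so that each value falls in exactly one class.
first_lower = 10  # assumed convenient starting point at or below the minimum
classes = [(first_lower + i * width, first_lower + (i + 1) * width - 1)
           for i in range(num_classes)]

# Count the observations in each class.
freq = {c: sum(1 for x in data if c[0] <= x <= c[1]) for c in classes}

for (low, high), count in freq.items():
    print(f"{low}-{high}: {count}")
```

Because the classes do not overlap and cover the whole range, the class frequencies add up to the number of observations.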
42
Q

What graphical representations can we use for Quantitative data

A
  1. Dot Plot
  2. Histogram
  3. Stem and Leaf
43
Q

what is the formula to determine the class widths

A

(largest data value - smallest data value) / # of classes

44
Q

what is the class midpoint

A

is the value halfway between the lower and upper class limits

45
Q

describe the dot plot

A
  • one of the simplest graphical summaries
  • horizontal axis shows the range for the data
  • useful for comparing the distribution of the data for two or more variables
46
Q

describe the histogram

A
  • for quantitative data (categorical use bar chart)
  • similar to a bar chart but has no spaces between the bars
  • common for quantitative data
47
Q

What does the histogram help show

A

the shape or the skewness of the distribution

48
Q

What type of skewness are there

A
  1. skewed left
  2. skewed right
  3. symmetrical
49
Q

what are some examples of distributions that are roughly symmetrical

A

SAT scores, heights and weights of people

50
Q

what are some examples of distributions that are generally skewed right (more data closer to 0 than the higher side)

A

Data from applications in business and economics often tend to be skewed right

example:
1. housing prices, salaries, purchase amounts, etc

51
Q

Describe the stem and leaf

A

graphical display used to show the rank order and shape of a distribution of data

52
Q

What advantages does the stem and leaf have over a histogram

A
  1. easier to construct by hand
  2. within a class interval, the stem and leaf provides more information than the histogram because the stem and leaf shows the actual data
54
Q

frequency distribution, histogram and stem and leaf

A

none of them has an absolute (fixed) number of classes, rows or stems

55
Q

stretched stem and leaf: whenever a stem value is stated twice, the first value corresponds to leaf values of _________ and the second value corresponds to leaf values of __________

A
  1. 0-4
  2. 5-9

56
Q

is a stem and leaf display with more than 3 digits possible

A

yes, note that a single digit is used to define each leaf and that only the first 3 digits of each data value are used to construct the display

example: the number 1565 is shown as stem 15 and leaf 6; the final digit 5 is dropped

however, it is not possible to reconstruct the exact values

57
Q

what is an open-ended class (when speaking of classes for a frequency distribution)

A

open-end class requires only a lower class limit or an upper class limit

  • ex. suppose two of the audit times had taken 58 and 65 days. rather than continue with the classes of width 5 with classes 35-39, 40-44 etc, we could simplify it
  • we could show an open end class of 35 or more
58
Q

when do you most often see open end classes

A

at the upper end of the distribution
sometimes they are seen at the lower end
and occasionally at both ends

59
Q

the last entry in a cumulative frequency distribution always equals

A

the total number of observations

60
Q

the last entry in a cumulative relative frequency distribution is always

A

1.00

61
Q

the last entry in a cumulative percent frequency distribution is always

A

100

62
Q

How can you summarize data for 2 variables

A
  1. Cross tabulations
  2. graphically

63
Q

Define cross tabulations to summarize 2 variables

A
  • a tabular summary of data for two variables
  • both variables can be either categorical or quantitative
  • can also have one categorical and one quantitative variable

64
Q

give an example of a cross tabulation

A

Restaurant   Quality rating   Meal price ($)
1            good             18
2            very good        22
3            excellent        28
4            bad              38
etc.

65
Q

Explain cross tabulations

A
  • need to decide the # of classes to use when making a freq. dist. for quantitative variables
  • margins provide info about each variable individually
  • primary value - provide insight about the relationship b/w 2 variables
66
Q

when summarizing data for two variables graphically what can you use

A
  1. scatter diagrams
  2. trendlines

67
Q

Describe Scatter Diagrams

A
  • graphical display of the relationship b/w 2 quantitative variables
68
Q

Describe trend lines

A

a line that produces an approximation of the relationship

69
Q

what kind of relationships can trendlines show

A
  1. positive relationship
  2. negative relationship
  3. no apparent relationship
70
Q

when do you use stacked bar charts

A
  • used to display rel. frequency of each class, similar to a pie chart based on %
    can also be used to show frequencies
71
Q

what types of bar charts are there

A
  1. stacked
  2. side by side

72
Q

What is Simpson’s paradox

A

data in 2 or more crosstabulations are often combined to produce a summary crosstabulation showing how 2 variables are related

  • conclusions from 2 or more separate crosstabs can be reversed when the data are aggregated into a single crosstab
    = the reversal of conclusions based on aggregated and unaggregated data is Simpson’s paradox
  • investigate whether the aggregated or unaggregated form of the crosstab provides better insight
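
The reversal described above can be checked with a small numeric sketch. All counts below are hypothetical, and the labels "A", "B", "group 1" and "group 2" are placeholders rather than anything from the deck. A has the higher success rate within each group, yet B has the higher rate once the groups are combined.

```python
# Hypothetical (successes, total) counts for two subjects in two groups.
results = {
    "A": {"group 1": (29, 32), "group 2": (100, 118)},
    "B": {"group 1": (90, 100), "group 2": (20, 25)},
}

def rate(successes, total):
    """Success rate for a single (unaggregated) crosstab cell."""
    return successes / total

def aggregate_rate(groups):
    """Success rate after combining all groups into one crosstab."""
    successes = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return successes / total

for group in ("group 1", "group 2"):
    a = rate(*results["A"][group])
    b = rate(*results["B"][group])
    print(f"{group}: A = {a:.3f}, B = {b:.3f}")

print(f"aggregated: A = {aggregate_rate(results['A']):.3f}, "
      f"B = {aggregate_rate(results['B']):.3f}")
```

The reversal arises because the two subjects have very different group sizes, so the aggregated rates weight the groups differently.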
73
Q

What are bar charts used for

A
  • categorical data

- freq. and rel. freq distributions

74
Q

what are pie charts used for

A
  • categorical data

- rel freq and % freq

75
Q

what are Dot plots used for

A
  • Quantitative data

- used to show the distribution over the entire range of the data

76
Q

What are Histograms used for

A
  • quantitative data

- used to show frequency distribution data over a set of class intervals

77
Q

What is the stem and leaf used for

A
  • quantitative data

- to show rank order and shape of the distribution

78
Q

what graphical displays are used to show relationships

A
  1. scatter diagram (the relationship b/w 2 quantitative variables)
  2. trendlines (used to approximate the relationship of data in a scatter diagram)
79
Q

what are the displays used to show comparisons

A
  1. side by side chart (used to compare 2 variables)
  2. stacked bar chart (used to compare the rel freq or % freq of 2 variables)

80
Q

What can you use to show multiple variables

A
  1. Radar charts
  2. Bubble charts
    - recommend not using them b/c can be over complicated
    - use bar charts and scatter diagrams
81
Q

What are the measures of location

A
  1. mean
  2. weighted mean
  3. GEOMETRIC mean
  4. median
  5. mode
  6. percentiles
  7. quartiles
82
Q

What is the most important measure of location

A

Mean

83
Q

what is the mean also called

A

average or arithmetic average

84
Q

which measure of central location is commonly used

A

mean

85
Q

How do you calculate the mean

A

add up all the data values and divide by the number of values

86
Q

What is the weighted mean

A

arithmetic mean where some data values contribute more than others

87
Q

what is the formula for the weighted mean

A

sum(wi x xi) / sum(wi)
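
A minimal sketch of the weighted mean formula, using a grade point average as the example. The grade points and credit hours are made-up values, and the function name is an assumption.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(wi * xi) / sum(wi)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical grade points earned in three courses and their credit hours.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 3]   # the weights wi

gpa = weighted_mean(grade_points, credit_hours)
print(gpa)  # (12 + 12 + 6) / 10
```

When every weight is equal, the weighted mean reduces to the ordinary arithmetic mean.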

88
Q

What is the geometric mean

A

finding the nth root of the product of n values

89
Q

what is the geometric mean often used for

A

used to analyze growth rates in financial data

- in these cases, the arithmetic mean will provide misleading results
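
A small sketch of why the arithmetic mean can mislead for growth rates. The two yearly growth factors are hypothetical (a 10% gain followed by a 10% loss), and `math.prod` assumes Python 3.8 or later.

```python
import math

# Hypothetical yearly growth factors: +10% then -10%.
growth_factors = [1.10, 0.90]
n = len(growth_factors)

# Geometric mean: nth root of the product of the n values.
geometric = math.prod(growth_factors) ** (1 / n)

# Arithmetic mean of the same factors, for comparison.
arithmetic = sum(growth_factors) / n

# Actual value of 100 invested at the start.
final_value = 100 * math.prod(growth_factors)

print(f"geometric mean factor:  {geometric:.4f}")
print(f"arithmetic mean factor: {arithmetic:.4f}")
print(f"final value of 100:     {final_value:.2f}")
```

The arithmetic mean factor of 1.00 suggests no change, while the investment actually shrinks; compounding the geometric mean factor over the two periods reproduces the true final value.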

90
Q

What is the mode

A

data value that occurs the most often (greatest frequency)

91
Q

what is it called if there are 2 modes

A

bimodal

92
Q

what if there are more than 2 modes

A

called multimodal

- don’t report the mode b/c listing 3 or more modes would not be helpful in describing the location of the data

93
Q

What are percentiles

A

values that show how the data are spread over the interval from the smallest value to the largest

94
Q

What are the steps in determining the percentiles

A
  1. arrange data in ascending order
  2. compute an index i = (p/100)n
    p = percentile of interest
    n = # of observations
  3. if i is not an integer, round up; the next integer greater than i is the position of the pth percentile
    if i is an integer, add the values in positions i and i + 1, then divide by 2
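
The index method can be sketched as below. The integer case follows the card; the non-integer case (round the index up to the next position) is the usual companion rule and is an assumption here. The data values and the function name are illustrative.

```python
import math

def percentile(data, p):
    """pth percentile (0 < p < 100) using the index i = (p/100)n."""
    values = sorted(data)            # step 1: ascending order
    n = len(values)
    i = p * n / 100                  # step 2: compute the index
    if i == int(i):
        # i is an integer: average the values in positions i and i + 1
        i = int(i)
        return (values[i - 1] + values[i]) / 2
    # i is not an integer (assumed rule): round up to the next position
    return values[math.ceil(i) - 1]

# Hypothetical data set with n = 10 observations.
data = [3, 1, 7, 5, 9, 11, 15, 13, 19, 17]

print(percentile(data, 50))  # i = 5 is an integer: average of 5th and 6th values
print(percentile(data, 85))  # i = 8.5: round up to the 9th value
```

Other software may use different percentile conventions, so results can differ slightly from this method on small data sets.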
95
Q

What are Quartiles

A
  • values that divide the data set into quarters
  • each containing 25% of the observations
  • always start with Q2 or the median
96
Q

what is the median

A
  • it is not affected by outliers
  • the middle of a sorted list of data values
    1. arrange in ascending order
    2. odd # - the median is the middle
    3. even # - the median is the average of the two middle values
97
Q

what is the median most often used for

A

annual income and property value data

b/c a few very low or very high values can distort the mean

98
Q

What are the measures of variability

A
  1. range
  2. interquartile range
  3. variance
  4. standard deviation
  5. coefficient of variation
99
Q

What is the range

A

largest value - smallest value

100
Q

why is the range seldom used

A

since it is only based on 2 observations it is highly influenced by extreme values

101
Q

What is Interquartile range

A

Q3 - Q1

  • overcomes the dependency on extreme values
  • the range of the middle 50% of the data
102
Q

What is variance

A
  • utilizes all of the data
  • based on difference b/w each observation (xi) and the mean
  • called deviation about the mean
103
Q

what is the formula for variance of a data set

A

sum((xi - mean)^2) / (n - 1) (sample)

104
Q

for any data set the sum of the deviations about the mean will always be

A

0

sum(xi - mean) = 0

105
Q

What is the standard deviation

A
  • positive square root of the variance

- easier to interpret than the variance b/c sd is measured in the same units as the data
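
A short sketch tying together the deviations about the mean, the sample variance and the standard deviation. The five data values are illustrative and the variable names are assumptions.

```python
import math

# Illustrative sample (assumed values).
data = [46, 54, 42, 46, 32]
n = len(data)

mean = sum(data) / n

# Deviations about the mean; these always sum to zero.
deviations = [x - mean for x in data]

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)

# Standard deviation: positive square root of the variance,
# measured in the same units as the data.
sample_sd = math.sqrt(sample_variance)

print(mean, sum(deviations), sample_variance, sample_sd)
```

The variance is in squared units, which is why the standard deviation is usually the easier number to interpret.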

106
Q

what is the standard deviation commonly used for

A

commonly used as a measure of the risk associated with investing in stock and stock funds

107
Q

what is the coefficient of variation

A
  • used when interested in a descriptive statistic that indicates how large the SD is relative to the mean
  • usually expressed as %
108
Q

what is the formula for coefficient of variation

A

(sd/mean) x 100
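
A minimal sketch of the coefficient of variation. The data are illustrative, and the second data set is simply the first multiplied by 10 to show that rescaling changes the standard deviation but not the coefficient of variation. `statistics.stdev` computes the sample (n - 1) standard deviation.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Illustrative sample (assumed values) and the same sample in units 10x smaller.
data = [46, 54, 42, 46, 32]
rescaled = [x * 10 for x in data]

cv_a = coefficient_of_variation(data)
cv_b = coefficient_of_variation(rescaled)

print(f"sd = {statistics.stdev(data):.1f}, CV = {cv_a:.1f}%")
print(f"sd = {statistics.stdev(rescaled):.1f}, CV = {cv_b:.1f}%")
```

This is what makes the statistic useful for comparing variables with different standard deviations and different means.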

109
Q

what does the coefficient of variation tell us

A

it tells us the sample sd is x% of the value of the sample mean
- useful for comparing the variability of variables that have different sd and different means

110
Q

What is MAE

A

mean absolute error

111
Q

how do you calculate MAE

A

sum the absolute values of the deviations of the observations about the mean and divide it by the # of observations
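
The calculation described above, sketched with illustrative data (the values and names are assumptions):

```python
# Illustrative sample (assumed values).
data = [46, 54, 42, 46, 32]
n = len(data)
mean = sum(data) / n

# Sum the absolute deviations about the mean, then divide by n.
mae = sum(abs(x - mean) for x in data) / n

print(mae)  # (2 + 10 + 2 + 2 + 12) / 5
```

Unlike the variance, this measure does not square the deviations, so a single extreme value has less influence on it.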

112
Q

When using the weighted mean, what is usually the weighted part

A

wi = a quantity such as pounds, dollars or volume (for a GPA, the number of credit hours)

113
Q

Whenever a data set contains extreme values, which measure of central location is preferred

A

Median because it is NOT influenced by extremely small and large data values

114
Q

when is the mean appropriate for financial data

A

only when the data describe an additive process

115
Q

when do you use the geometric mean vs the mean

A

any time you want to determine the mean rate of change over several successive periods
or
changes in populations of species, crop yields, pollution levels and birth and death rates

(applied to changes that occur over any number of successive periods of any length)

116
Q

The difference between each xi and the mean is called a

A

deviation about the mean

117
Q

what does the Coefficient of variation measure

A

it measures the standard deviation relative to the mean

118
Q

the coefficient of variation is a useful statistic for

A

comparing the variability of variables that have different standard deviations and different means

119
Q

What are the measures of distribution shape

A
  1. Skewness
  2. Chebyshev’s theorem
  3. z-Scores
  4. Empirical Rule
  5. Detecting outliers
120
Q

What is Skewness

A

can be

  1. skewed left
  2. skewed right
  3. symmetrical, skewness is zero
121
Q

what is skewed left

A

skewness is negative

- mean is usually less than the median

122
Q

What is skewed right

A

Skewness is positive

- the mean is usually more than the median

123
Q

What is Symmetrical skewness

A
  • the skewness is zero

- the mean and median are equal

124
Q

What are z-scores

A

to find relative locations of values within a data set

  • also called standard value
125
Q

what do measures of relative location help us determine

A

how far a particular value is from the mean

126
Q

the process of converting a value for a variable to a z-score is often called

A

z-transformation

127
Q

Chebyshev’s theorem allows us to do what

A

make statements about the proportion of data values that must be within a specified # of sd of the mean

128
Q

Chebyshev’s theorem - how do you calculate the "at least" proportion

A

1 - (1/z^2), where z is any value greater than 1

129
Q

what are the rules for chebyshev

A
  1. at least 75% of the data must be within 2 sd of the mean
  2. at least 89% is within 3 sd of the mean
  3. at least 94% is within 4 sd of the mean
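
The three percentages above follow directly from the formula 1 - 1/z^2. A minimal sketch (the function name is an assumption):

```python
def chebyshev_min_fraction(z):
    """Minimum fraction of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2

for z in (2, 3, 4):
    print(f"z = {z}: at least {chebyshev_min_fraction(z):.2%}")
```

The bound holds for any data set regardless of its shape, which is why it is weaker than the empirical rule for bell-shaped data.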
130
Q

what is the empirical rule

A

based on a normal prob distribution

  • used for symmetrical or bell shaped distribution
  • to determine % of data values that must be within a specified # of sd from the mean
131
Q

what are rules for Empirical rule

A
  1. approx 68% of the data values will be within 1 sd of the mean
  2. approx 95% of the data values will be within 2 sd of the mean
  3. almost all of the values will be within 3 sd of the mean
132
Q

What can you use to detect outliers

A
  1. based on 1st and 3rd quartiles (lower limit = Q1 - 1.5(IQR), upper limit = Q3 + 1.5(IQR))
    - if the value is outside of these limits, it is considered an outlier
  2. z-scores
    - see empirical rule: treat any data value with a z-score less than -3 or greater than +3 as an outlier (double check)
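
Both approaches can be sketched side by side. The data set is hypothetical, the quartiles use an index-based percentile rule (with non-integer indexes rounded up, which is an assumption), and the z-score is taken as (value - mean) / sd with the sample standard deviation.

```python
import math
import statistics

def percentile(values, p):
    """pth percentile using the index rule i = (p/100)n on sorted data."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i == int(i):                      # integer index: average positions i and i + 1
        i = int(i)
        return (ordered[i - 1] + ordered[i]) / 2
    return ordered[math.ceil(i) - 1]     # otherwise round up to the next position

# Hypothetical data with one unusually large value.
data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 60]

# Approach 1: limits based on the quartiles and the IQR.
q1, q3 = percentile(data, 25), percentile(data, 75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower or x > upper]

# Approach 2: z-scores beyond -3 or +3.
mean, sd = statistics.mean(data), statistics.stdev(data)
z_scores = [(x - mean) / sd for x in data]
z_outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]

print("IQR limits:", lower, upper, "outliers:", iqr_outliers)
print("z-score outliers:", z_outliers)
```

For this small sample the IQR rule flags the value 60 while no z-score exceeds 3 in magnitude, so the two rules need not agree; the extreme value itself inflates the standard deviation.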
133
Q

What is the 5 number summary

A
  1. smallest value
  2. first Quartile
  3. Median - Q2
  4. 3rd Quartile
  5. Largest value
134
Q

What are the measures of association b/w 2 variables

A
  1. covariance
  2. Correlation Coefficient

135
Q

What is covariance

A

a descriptive measure of the linear association between 2 variables

137
Q

covariance is the measure of what

A

how much TWO random variables vary together
- similar to variance, but where variance tells you how a single variable varies, covariance tells you how TWO variables vary together

  • indicates whether a relationship exists b/w the two variables
138
Q

Types of covariance - may be incorrect

A
  1. negative
  2. near zero - or no association
  3. positive
139
Q

Positive covariance -

A
  • a positive number, can be any positive number (doesn’t tell us much, just that the variables are positively related)
  • as the value of x increases, the value of y increases
  • the points slant upward toward the right-hand corner of the page
  • it doesn’t tell us if the dots are close to the trendline or far away from the trendline; this is why we use the correlation coefficient
140
Q

negative covariance -

A
  • can be any negative number (doesn’t tell us much, just that the variables are negatively related)
  • as the value of x increases, the value of y decreases
  • for the correlation coefficient (not the covariance), the closer to -1 the stronger the negative relationship
141
Q

zero covariance - maybe incorrect

A
no association 
- no linear association b/w x and y 
- the number after calculation will be 0
or Covariance = 0
- no trend
142
Q

Correlation Coefficient formula

A

covariance / (sd of x times sd of y)
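
A minimal sketch of the sample covariance and the correlation coefficient, where the correlation is the covariance divided by the product of the two standard deviations. The five (x, y) pairs are hypothetical, and the rescaled x values are included only to show that the covariance depends on the units while the correlation does not.

```python
import math

def sample_covariance(x, y):
    """Sum of products of deviations about the means, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_sd(x):
    """Sample standard deviation: square root of the variance."""
    return math.sqrt(sample_covariance(x, x))

def correlation(x, y):
    """Covariance divided by (sd of x times sd of y); always between -1 and +1."""
    return sample_covariance(x, y) / (sample_sd(x) * sample_sd(y))

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov = sample_covariance(x, y)
r = correlation(x, y)

# Same x measured in units 100 times smaller.
x_scaled = [100 * a for a in x]
cov_scaled = sample_covariance(x_scaled, y)
r_scaled = correlation(x_scaled, y)

print(f"covariance = {cov:.2f}, correlation = {r:.4f}")
print(f"after rescaling x: covariance = {cov_scaled:.2f}, correlation = {r_scaled:.4f}")
```

The covariance grows by a factor of 100 after rescaling, while the correlation is unchanged, which is why the correlation is the easier measure of strength to interpret.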

143
Q

What are the advantages of correlation Coefficient

A
  1. covariance can take on any number while a correlation is limited to -1 to +1
  2. more useful for determining how strong the relationship is b/w the TWO variables
  3. it does not have units, covariance has units
  4. it isn’t affected by changes in the centre (ie mean) or scale of the variables
144
Q

what is the simple definition of covariance

A

a statistical measure that shows whether two variables are related by measuring how the variables change in relation to each other
- tells you if there is a relationship between two things and the relationship (+ -)

145
Q

What is the simple definition of correlation

A

a measure of how two variables change in relation to each other, but it goes one step further than covariance in that correlation tells HOW STRONG the relationship is

146
Q

what does a positive covariance indicate

A

as one increases, the other also increases

147
Q

what does a negative covariance indicate

A

as one increases, the other decreases

148
Q

Why is interpreting Covariance difficult

A

because covariance values are sensitive to the scale of the data. It does not tell us how close to the line the data are

149
Q

What does correlation describe

A

describes relationships and is not sensitive to the scale of the data

150
Q

How is correlation helpful

A
  • it can tell us how strong the relationship is, so if we know the value of, say, x, we can estimate the value of y pretty easily (within a range)
    (make predictions and inferences, aka educated guesses)
  • strong relationship = smaller range
151
Q

Correlation does not mean

A

causation

152
Q

What is the maximum value for Correlation Coefficient

A

1

153
Q

What does it mean when Correlation = 1

A

when a straight line with a positive slope can go through the centre of every data point

154
Q

Does correlation depend on the scale of the data?

A

no

155
Q

Correlation can equal 1 when the slope is

A

large and when the slope is small

156
Q

Correlation can equal 1 with a small amount of data points

A

be careful because we should not have much confidence in predictions made with this line (need more data)

157
Q

Correlation and 3 points vs only 2

A

there is a very small chance that we will be able to draw a straight line through all 3 points. You can always draw a straight line between 2 points. 3 points gives us more confidence in the trend

158
Q

What does a correlation of - 1 tell us

A

straight line with a negative slope goes through the centre of EVERY data point

  • strong Negative Relationship
  • if we know the value of x we can estimate within a narrow range the value of y
159
Q

What does a correlation of 0 tell us

A

no linear relationship

160
Q

Correlation = -0.02

A

negative relationship but since it is close to zero, it is not a strong relationship

161
Q

What is another name for correlation Coefficient

A

Pearson Product Moment correlation Coefficient