Exam Revision Flashcards
Statistics
Statistics is the branch of mathematics that examines ways to process and analyse data. Statistics provides procedures to collect and transform data in ways that are useful to business decision makers. To understand anything about statistics, you first need to understand the meaning of a variable.
4 fundamental terms of statistics
Population
Sample
Parameter
Statistic
Population
A population consists of all the members of a group about which you want to draw a conclusion.
Sample
A sample is the portion of the population selected for analysis
Parameter
A parameter is a numerical measure that describes a characteristic of a population (measures used to describe a population). GREEK LETTERS REFER TO A PARAMETER
Statistic
A statistic is a numerical measure that describes a characteristic of a sample (measures calculated from sample data). ROMAN LETTERS REFER TO STATISTICS
2 types of statistics
Descriptive statistics
Inferential statistics
Descriptive statistics
Collecting, summarising and presenting data
Inferential statistics
Drawing conclusions about a population based on sample data/results (i.e. estimating a parameter based on a statistic, as in hypothesis testing).
2 types of data
Categorical (defined categories)
Numerical (quantitative)
2 types of numerical variables
Discrete (counted items)
Continuous (measured characteristics)
4 levels of Measurement and Measurement Scales from highest to lowest
Ratio data
Interval data
Ordinal data
Nominal data
Ratio data
Differences between measurements are meaningful and a true zero exists
Interval data
Differences between measurements are meaningful but no true zero exists (values can be negative)
Ordinal data
Ordered categories (rankings, order or scaling)
Nominal data
Categories (no ordering or direction)
4 measures used to describe data
Central tendency
Quartiles
Variation
Shape
4 measures of central tendency
Arithmetic mean
Median
Mode
Geometric mean
5 measures of variation
Range
Interquartile range
Variance
Standard deviation
Coefficient of variation
1 measure of shape
Skewness
Arithmetic mean
Arithmetic mean is summing up the observations and dividing by the number of observations.
Median and mode extreme values
The median is not sensitive to extreme values and the mean is sensitive to extreme values.
Sigma
Sigma (Σ) is the summation symbol: it means adding up the values.
Median
In an ordered array, the median is the middle number (50% above and 50% below). Its main advantage over the arithmetic mean is that it is not affected by extreme values.
Mode
A measure of central tendency. Value that occurs most often (the most frequent). Not affected by extreme values. Never use the mode by itself, always use in conjunction with median or mean. Unlike mean and median, there may be no unique (single) mode for a given data set. Used for either numerical or categorical (nominal) data.
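A minimal Python sketch of these three measures, using made-up numbers with one extreme value to show that the mean is sensitive to it while the median and mode are not:

```python
import statistics

# Illustrative data; 100 is an extreme value
data = [2, 3, 3, 4, 5, 100]

mean = statistics.mean(data)      # pulled upward by the extreme value
median = statistics.median(data)  # middle of the ordered array, resistant
mode = statistics.mode(data)      # most frequent value

print(mean)    # 19.5
print(median)  # 3.5
print(mode)    # 3
```

Without the extreme value the mean of the remaining five values would be 3.4, close to the median, which is the usual sign of roughly symmetric data.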
Quartiles
Quartiles split the ranked data into four segments, with an equal number of values per segment. The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger. The second quartile, Q2, is the same as the median (50% are smaller, 50% are larger). Only 25% of the observations are greater than the third quartile, Q3
Measures of variation
Measures of variation give information on the spread or variability of the data values
Interquartile range
Like the median, Q1 and Q3, the IQR is a resistant summary measure (resistant to the presence of extreme values). It eliminates outlier problems because the highest- and lowest-valued observations are removed from the calculation. IQR = 3rd quartile - 1st quartile, i.e. IQR = Q3 - Q1
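A sketch of quartiles and the IQR in Python. Note that textbooks and libraries use slightly different quartile conventions; `method='inclusive'` here interpolates between sorted data points, so your course's hand-calculation rule may give slightly different values:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]  # illustrative, already ordered

# Q1, Q2 (median), Q3 split the ranked data into four segments
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1  # resistant to extreme values

print(q1, q2, q3)  # 2.75 4.5 6.25
print(iqr)         # 3.5
```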
Sample variance
Measures the average squared deviation of the values around the mean, so its units are the square of the original units. The deviations are squared because some are negative and some are positive, so they would otherwise cancel out. The sample variance is the sum of squared differences from the mean divided by n - 1.
Sample standard deviation
Most commonly used measure of variation. Shows variation about the mean. Has the same units as the original data. It can be considered a measure of uncertainty.
Coefficient of variation
Measures relative variation i.e. shows variation relative to mean. Can be used to compare two or more sets of data measured in different units. Always expressed as percentage (%)
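A small sketch of the coefficient of variation; the data are hypothetical, chosen so the arithmetic is easy to check:

```python
import statistics

def coefficient_of_variation(data):
    """CV = (standard deviation / mean) * 100, a unit-free percentage."""
    return statistics.stdev(data) / statistics.mean(data) * 100

# mean = 20, sample standard deviation = 10, so CV = 50%
print(coefficient_of_variation([10, 20, 30]))  # 50.0
```

Because the units cancel, CVs can be compared across datasets measured in different units (e.g. dollars vs kilograms).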
The Z score
The difference between a given observation and the mean, divided by the standard deviation. A Z score of 2.0 means that a value is 2.0 standard deviations from the mean. A Z score above 3.0 or below -3.0 is considered an outlier
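The Z score card can be sketched directly from its definition; the data and the outlier rule (|Z| > 3) follow the card, and the numbers are illustrative:

```python
import statistics

def z_scores(data):
    """(observation - mean) / standard deviation, for each observation."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [(x - mean) / sd for x in data]

def outliers(data):
    """Values whose Z score is above 3.0 or below -3.0."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > 3.0]

scores = z_scores([1, 2, 3, 4, 5])
print([round(z, 4) for z in scores])  # [-1.2649, -0.6325, 0.0, 0.6325, 1.2649]
```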
The shape of a distribution
Describes how data are distributed. Measures of shape are symmetric or skewed
Left skewed and right skewed
When the data are left (negatively) skewed, the distance between Q1 and Q2 is greater than the distance between Q2 and Q3. The reverse applies for right (positively) skewed data. If the data are symmetric the distances are the same.
What does a box and whisker plot show
Box and whisker plots show location, spread and shape.
Population variance
the average of the squared deviations of values from the mean
Population standard deviation
shows variation about the mean. is the square root of the population variance. has the same units as the original data
Covariance
The sample covariance measures the direction of the linear relationship between two numerical variables. It indicates direction only, not relative strength, because it is affected by the units of measurement. No causal effect is implied.
Correlation
Measures the relative strength of the linear relationship between two variables
Features of correlation coefficient
Also called Standardised Covariance i.e. invariant to units of measure. Ranges between –1 and 1. The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship. The closer to 0, the weaker the linear relationship
5 number summary
Numerical data summarised by quartiles. Xsmallest Q1 Median Q3 Xlargest
3 approaches to assessing probability
a priori
Empirical
Subjective
a priori
Classical probability. Based on prior knowledge
Empirical
Classical probability. Based on observed data
Subjective
Subjective probability. Based on individual judgment or opinion about the probability of occurrence
Probability
a numerical value that represents the chance, likelihood, possibility that an event will occur (always between 0 and 1)
Discrete probability
A discrete random variable can take only certain values; a discrete probability distribution lists each possible value together with its probability.
4 essential properties of the binomial distribution
A fixed number of observations
Two mutually exclusive and collectively exhaustive events
Constant probability for each observation
Observations are independent
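Under those four properties, the binomial probability of k successes follows directly from counting; a minimal sketch with an illustrative coin-flip example:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): k successes in n independent trials, constant success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 2 heads in 4 fair coin flips: C(4,2) * 0.5^2 * 0.5^2
print(binomial_pmf(2, 4, 0.5))  # 0.375
```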
Index numbers
Index numbers allow relative comparisons over time. Index numbers are reported relative to a base period index. Base period index = 100 by definition. Used for an individual item or measurement.
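A sketch of a simple price index for an individual item; the prices are made up:

```python
def price_index(price, base_price):
    """Index relative to the base period; the base period index is 100 by definition."""
    return price / base_price * 100

base = 2.00     # hypothetical base-period price
current = 2.50  # hypothetical current price

print(price_index(base, base))     # 100.0 (base period, by definition)
print(price_index(current, base))  # 125.0, i.e. a 25% rise since the base period
```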
Which price index to use
The Paasche index is more accurate but more difficult to compute, because it requires current-period quantities.
Characteristics of the normal distribution
Bell-shaped
Symmetrical
Mean, median and mode are equal
Central location is determined by the mean
Spread is determined by the standard deviation (IT IS THE POPULATION STANDARD DEVIATION)
The random variable x has an infinite theoretical range
What is the height of the curve a measure of
Probability
What must the area under the curve be
1
Calculate descriptive numerical measures to determine normality
Do the mean and median have similar values? (Remember there may be no unique mode or there may be multiple modes.)
Is the interquartile range approximately 1.33 times the standard deviation?
Is the range approximately 6 times the standard deviation?
Calculate standard deviation to determine normality
Do approximately 2/3 of the observations lie within the mean ± 1 standard deviation?
Do approximately 80% of the observations lie within the mean ± 1.28 standard deviations?
Do approximately 95% of the observations lie within the mean ± 2 standard deviations?
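These checks can be automated; a sketch that computes the proportion of observations within ± k standard deviations of the mean, applied to simulated normal data (the seed and sample size are arbitrary):

```python
import random
import statistics

def proportion_within(data, k):
    """Proportion of observations within k standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

random.seed(42)
data = [random.gauss(0, 1) for _ in range(1000)]  # simulated normal sample

print(proportion_within(data, 1))     # close to 2/3 (about 68%)
print(proportion_within(data, 1.28))  # close to 80%
print(proportion_within(data, 2))     # close to 95%
```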
Continuous probability density function
Mathematical expression that defines the distribution of the values for a continuous random variable.
Sampling distribution
A sampling distribution is a distribution of all of the possible values of a statistic for a given size sample selected from a population.
Standard error of the mean
Different samples of the same size from the same population will yield different sample means.
A measure of the variability in the mean from sample to sample is given by the Standard Error of the Mean. Note that the standard error of the mean decreases as the sample size increases.
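The shrinking standard error can be shown with a two-line function (the σ = 10 figure is illustrative):

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

# Quadrupling the sample size halves the standard error
print(standard_error(10, 25))   # 2.0
print(standard_error(10, 100))  # 1.0
```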
If the population is not normal
We can apply the Central Limit Theorem, which states that regardless of the shape of individual values in the population distribution, as long as the sample size is large enough (generally n ≥ 30) the sampling distribution of X̄ will be approximately normally distributed, with mean μ and standard error σ/√n.
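A small simulation of the Central Limit Theorem: sample means of size n = 30 drawn from a strongly right-skewed (exponential) population still cluster around the population mean with spread close to σ/√n. The population choice and counts are illustrative:

```python
import random
import statistics

random.seed(1)

# Skewed population: exponential with population mean 1.0 and sd 1.0
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(30))
    for _ in range(2000)
]

print(round(statistics.mean(sample_means), 2))   # near 1.0 (population mean)
print(round(statistics.stdev(sample_means), 2))  # near 1/sqrt(30) ≈ 0.18
```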
Sampling Distribution of the Proportion
Selecting all possible samples of a certain size, the distribution of all possible sample proportions is the sampling distribution of the proportion.
Simple random sampling
Every individual or item from the frame (N) has an equal chance of being selected (1/N).
Selection may be with replacement or without replacement.
Samples can be obtained from a table of random numbers or computer random number generators.
Simple to use but may not be a good representation of the population’s underlying characteristics.
Systematic sampling
Divide frame of N individuals into n groups of k individuals: k = N/n.
Randomly select one individual from the 1st group.
Select every kth individual thereafter.
Like simple random sampling, simple to use but may not be a good representation of the population’s underlying characteristics.
Stratified sampling
Divide population into two or more subgroups (called strata) according to some common characteristic.
A simple random sample is selected from each subgroup, with sample sizes proportional to strata sizes – called proportionate stratified sampling.
Samples from subgroups are combined into one.
Stratified sampling pros
More efficient than simple random sampling or systematic sampling because of assured representation of items across entire population.
Homogeneity of items within each stratum provides greater precision in the estimates of underlying population parameters.
Cluster samples
Population is divided into several ‘clusters’, each representative of the population e.g. postcode areas, electorates etc.
A simple random sample of clusters is selected:
All items in the selected clusters can be used, or items can be chosen from a cluster using another probability sampling technique.
Cluster sampling pros
More cost effective than random sampling, especially if population is geographically widespread.
Often requires a larger sample size compared to simple random sampling or stratified sampling for same level of precision.
Survey errors
Coverage error – appropriate or adequate frame?
Non-response error – results in non-response bias.
Measurement error – ambiguous wording, halo effect or respondent error.
Sampling error – always exists and is the difference between sample statistic and population parameter.
Point estimate
A point estimate is the value of a single sample statistic.
Confidence interval
A confidence interval provides a range of values constructed around the point estimate.
Confidence interval estimation
An interval gives a range of values: Takes into consideration variation in sample statistics from sample to sample. Based on observations from 1 sample.
Gives information about closeness to unknown population parameters.
Stated in terms of level of confidence. Can never be 100% confident.
A relative frequency interpretation
In the long run, 90%, 95% or 99% of all the confidence intervals that can be constructed (in repeated samples) will contain the unknown true parameter.
Confidence Interval for μ (σ Known) assumptions
Assumptions:
Population standard deviation σ is known
Population is normally distributed
If population is not normal, use Central Limit Theorem.
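Under those assumptions the interval is x̄ ± z · σ/√n; a sketch with hypothetical numbers (x̄ = 100, σ = 15, n = 36):

```python
from statistics import NormalDist

def ci_mean_sigma_known(xbar, sigma, n, confidence=0.95):
    """Confidence interval for mu with sigma known: xbar ± z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # e.g. ~1.96 for 95%
    margin = z * sigma / n ** 0.5
    return xbar - margin, xbar + margin

lo, hi = ci_mean_sigma_known(xbar=100, sigma=15, n=36)
print(round(lo, 1), round(hi, 1))  # 95.1 104.9
```

Raising the confidence level widens the interval; raising n narrows it, matching the cards below.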
Will the true average always be in the middle of the confidence interval
Not necessarily. It is a good, but not perfect, measure.
Confidence interval for μ (σ Unknown)
If the population standard deviation σ is unknown, we can substitute the sample standard deviation, S.
This introduces extra uncertainty, since S is variable from sample to sample.
So we use the Student t distribution instead of the normal distribution:
The t value depends on the degrees of freedom, given by the sample size minus 1 (d.f. = n - 1).
d.f. are the number of observations that are free to vary after the sample mean has been calculated.
Degrees of freedom
Number of observations that are free to vary after the sample mean has been calculated
Confidence interval example interpretation
We are 95% confident that the true percentage of left-handers in the population is between 0.1651 and 0.3349 i.e.:
Although the interval from 0.1651 to 0.3349 may or may not contain the true proportion, 95% of intervals formed from repeated samples of size 100 in this manner will contain the true proportion.
Sampling error
The required sample size can be found to reach a desired margin of error (e) with a specified level of confidence (1 - alpha).
The margin of error is also called a sampling error:
The amount of imprecision in the estimate of the population parameter.
The amount added and subtracted to the point estimate to form the confidence interval.
Rule for rounding confidence intervals
Always round outward (lower limit down, upper limit up)
Hypothesis
A hypothesis is a statement (assumption) about a population parameter
The Null Hypothesis, H0
States the belief or assumption in the current situation (status quo)
Begin with the assumption that the null hypothesis is true
(similar to the notion of innocent until proven guilty)
Refers to the status quo
Always contains the ‘=’, ‘≤’ or ‘≥’ sign
May or may not be rejected
Is always about a population parameter; e.g. μ, not about a sample statistic
The Alternative Hypothesis, H1
Is the opposite of the null hypothesis
e.g. the average number of TV sets in Australian homes is not equal to 3 (H1: μ ≠ 3)
Challenges the status quo
Can only contain the ‘<’, ‘>’ or ‘≠’ sign
May or may not be proven
Is generally the claim or hypothesis that the researcher is trying to prove
Errors in making decisions (Hypothesis testing)
Type I error
Reject a true null hypothesis
Considered a serious type of error
Type II error
Fail to reject a false null hypothesis
The probability of errors
The probability of Type I error is alpha
Called level of significance of the test; i.e. 0.01, 0.05, 0.10
Set by the researcher in advance
The probability of Type II error is β
p-value approach to testing
p-value: probability of obtaining a test statistic more extreme (≤ or ≥) than the observed sample value, given H0 is true
Also called observed level of significance
Smallest value of alpha for which H0 can be rejected
Obtain the p-value from Table E.2 or computer
If p-value < alpha , reject H0
If p-value >= alpha , do not reject H0
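A sketch of the p-value approach for a z test with σ known. The delivery-time numbers (x̄ = 24, μ0 = 25, σ = 3, n = 36) are hypothetical, chosen to echo the interpretation cards later in these notes:

```python
import math
from statistics import NormalDist

def z_test_p_value(xbar, mu0, sigma, n, tail="two"):
    """p-value for a z test of H0: mu = mu0, with sigma known."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if tail == "two":
        return 2 * (1 - NormalDist().cdf(abs(z)))
    if tail == "lower":
        return NormalDist().cdf(z)
    return 1 - NormalDist().cdf(z)

# H0: mu = 25 vs H1: mu < 25 (lower-tail test)
p = z_test_p_value(xbar=24, mu0=25, sigma=3, n=36, tail="lower")
print(round(p, 4))          # 0.0228
print(p < 0.05)             # True, so reject H0 at alpha = 0.05
```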
Regression analysis
Regression analysis is used to:
predict the value of a dependent variable (Y) based on the value of at least one independent variable (X)
explain the impact of changes in an independent variable on the dependent variable
Dependent variable (y)
Dependent variable (Y): the variable we wish to predict or explain (response variable)
Independent variable (x)
Independent variable (X): the variable used to explain the dependent variable (explanatory variable)
Simple linear regression
Only one independent variable, X
Relationship between X and Y is described by a linear function
Changes in Y are assumed to be caused by changes in X
b0 and b1
b0 and b1 are obtained by finding the values of b0 and b1 that minimise the sum of the squared differences between actual values (Y) and predicted values (Ŷ)
b0
b0 is the estimated average value of Y when the value of X is zero
b1
b1 is the estimated change in the average value of Y as a result of a one-unit change in X
Coefficient of Determination, r2
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
The coefficient of determination is also called r-squared and is denoted as r2
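The least-squares coefficients and r² can be computed from first principles; a minimal sketch with illustrative data lying exactly on a line (so b0 = 0, b1 = 2, r² = 1):

```python
def simple_regression(x, y):
    """Least squares b0, b1 (minimising sum of squared Y - Yhat) and r-squared."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx              # estimated change in Y per one-unit change in X
    b0 = ybar - b1 * xbar       # estimated average Y when X = 0
    ss_tot = sum((yi - ybar) ** 2 for yi in y)              # total variation
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1 - ss_res / ss_tot    # portion of variation in Y explained by X
    return b0, b1, r2

b0, b1, r2 = simple_regression([1, 2, 3, 4], [2, 4, 6, 8])
print(b0, b1, r2)  # 0.0 2.0 1.0
```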
ASSUMPTIONS OF REGRESSION
Linearity of the relationship
Independence of error values
Normality of error values
constant variance of the errors of the probability distribution
Check these assumptions by examining residuals
residual for observation
The residual for observation i, ei, is the difference between its observed and predicted value
Idea of the multiple regression model
Examine the linear relationship between
1 dependent (Y) & 2 or more independent variables (Xi).
Why we need Adjusted r^2
r2 never decreases when a new X variable is added to the model.
This can be a disadvantage when comparing models.
What is the net effect of adding a new variable?
We lose a degree of freedom when a new X variable is added.
Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
Adjusted r^2
Shows the proportion of variation in Y explained by all X variables adjusted for the number of X variables used.
Penalises excessive use of unimportant independent variables.
Smaller than r2
Useful in comparing among models.
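The adjustment is a direct formula; a sketch with hypothetical figures (r² = 0.8, n = 25 observations, k = 3 independent variables):

```python
def adjusted_r2(r2, n, k):
    """Adjusted r^2 = 1 - (1 - r^2) * (n - 1) / (n - k - 1), k = number of X variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.8, n=25, k=3), 4))  # 0.7714, slightly below r^2 = 0.8
```

Adding a fourth, useless X variable would leave r² at 0.8 or above but push the adjusted value down, which is the penalty the card describes.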
F Test for Overall Significance of the Model:
Shows if there is a linear relationship between all of the X variables considered together and Y.
multiple regression assumptions
The errors are normally distributed.
Errors have a constant variance.
The model errors are independent.
Using dummy variables
A dummy variable is a categorical explanatory variable with two levels:
yes or no, on or off, male or female
coded as 0 or 1
Regression intercepts are different if the variable is significant.
Assumes equal slopes for other variables.
If more than two levels, the number of dummy variables needed is number of levels minus 1.
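A sketch of the levels-minus-1 coding rule; the category names and the choice of baseline level are illustrative:

```python
def dummies(values, levels):
    """For a categorical variable with m levels, build m - 1 dummy (0/1) columns.
    The first level acts as the baseline and gets no column."""
    return {level: [1 if v == level else 0 for v in values]
            for level in levels[1:]}

obs = ["red", "blue", "green", "red"]
cols = dummies(obs, levels=["red", "blue", "green"])
print(cols)  # {'blue': [0, 1, 0, 0], 'green': [0, 0, 1, 0]}
```

A two-level variable (e.g. yes/no) therefore needs just one 0/1 column, as the card states.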
Time-series data and plot
Numerical data obtained at regular time intervals.
The time intervals can be annually, quarterly, daily, hourly etc.
A time-series plot is a two-dimensional plot of time series data.
The vertical axis measures the variable of interest.
The horizontal axis corresponds to the time periods.
Classical Multiplicative Time-series Model Components
Trend component
Seasonal component
Cyclical component
Irregular component
Trend component
Long-run increase or decrease over time (overall upward or downward movement).
Data taken over a long period of time.
Trend can be upward or downward.
Trend can be linear or non-linear.
Seasonal component
Short-term regular wave-like patterns.
Observed within 1 year.
Often monthly or quarterly.
Cyclical component
Long-term wave-like patterns.
Usually occur every 2-10 years.
Often measured peak to peak or trough to trough.
Irregular component
Unpredictable, random, ‘residual’ fluctuations.
Due to random variations of:
Nature.
Accidents or unusual events.
‘Noise’ in the time series.
Usually short duration and non-repeating.
Smoothing the Annual Time Series – Moving Averages
A series of arithmetic means over time.
Calculate moving averages to get an overall impression of the pattern of movement over time.
Moving averages can be used for smoothing: averages of consecutive time-series values for a chosen period of length (L).
Result dependent upon choice of L (length of period for computing means).
Examples:
For a 5 year moving average, L = 5.
For a 7 year moving average, L = 7 etc.
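The smoothing above can be sketched in a few lines; the series values are illustrative:

```python
def moving_average(series, L):
    """Arithmetic means of consecutive time-series values over windows of length L."""
    return [sum(series[i:i + L]) / L for i in range(len(series) - L + 1)]

series = [2, 4, 6, 8, 10]           # illustrative annual values
print(moving_average(series, 3))    # [4.0, 6.0, 8.0]
```

Note the smoothed series is shorter than the original (L - 1 values are lost), and a larger L gives a smoother but shorter result.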
PHOTOS 1-8
Frequency distribution, histogram and graphing
PHOTO 9
CV
PHOTO 10
SKEWNESS
PHOTOS 11-12
EMPIRICAL RULE
PHOTOS 13-14
BOX AND WHISKER
PHOTOS 15-18
BAYES THEOREM
PHOTOS 19-22
INVESTMENT RETURNS
PHOTOS 23-24
PORTFOLIO RETURN AND RISK
PHOTO 25
INDEX NUMBERS INTERPRETATION
GO OVER DECISION MAKING FLASHCARDS AND PHOTOS
PHOTOS 26-27
NORMAL PROBABILITY PLOT
PHOTOS 28-29
TUTORS NORMAL DISTRIBUTION EXAMPLE
PHOTO 30
STANDARD ERROR OF THE MEAN
PHOTO 31
SAMPLING DISTRIBUTION PROPERTIES
PHOTOS 32-33
CENTRAL LIMIT THEOREM
PHOTO 34
CONFIDENCE INTERVAL ESTIMATION PROCESS
PHOTOS 35-36
CONFIDENCE INTERVAL EXAMPLE
PHOTOS 37-41
DETERMINING SAMPLE SIZE
PHOTO 42
OUTCOMES AND PROBABILITIES OF HYPOTHESIS TESTING
PHOTO 43
2 TAIL TESTS
PHOTO 44-45
P VALUE 2 TAIL TESTS
PHOTOS 46-47
1 TAIL TESTS
PHOTO 48
P VALUE 1 TAIL
PHOTOS 49-50
HYPOTHESIS TESTING FOR THE PROPORTION
PHOTOS 51-52
SIMPLE REGRESSION MODEL AND EQUATION
PHOTOS 53-58
SIMPLE REGRESSION EXAMPLE
PHOTO 59
INTERPOLATION V EXTRAPOLATION
PHOTO 60
EXAMPLES OF R2
PHOTOS 61-62
COMPARING STANDARD ERRORS
PHOTOS 63-65
F TEST FOR SIGNIFICANCE
PHOTO 66
CONFIDENCE INTERVAL ESTIMATE FOR THE SLOPE
PHOTOS 67-68
MULTIPLE REGRESSION MODEL AND EQUATION
PHOTOS 69-73
MULTIPLE REGRESSION EXAMPLE
PHOTO 74-75
ADJUSTED R2
PHOTOS 76-79
SIGNIFICANCE F TEST MULTIPLE
PHOTOS 80-83
ARE INDIVIDUAL VARIABLES SIGNIFICANT
PHOTOS 84-85
CONFIDENCE INTERVAL ESTIMATE FOR THE SLOPE MULTIPLE
PHOTOS 86-91
DUMMY VARIABLES
PHOTOS 92-94
INTERACTION BETWEEN VARIABLES
PHOTOS 95-96
TREND AND SEASONAL COMPONENT
PHOTOS 97-98
MULTIPLICATIVE TIME SERIES MODEL
PHOTOS 99-102
MOVING AVERAGES
PHOTO 103
LEAST SQUARES TREND FITTING
PHOTO 104
QUADRATIC FORM TREND FORECASTING
PHOTOS 105-106
EXPONENTIAL TREND FORECASTING
PHOTOS 107-108
MODEL SELECTION
PHOTO 109
RESIDUAL ANALYSIS FORECASTING
PHOTO 110
FORECASTING WITH SEASONAL DATA
PHOTOS 111-114
QUARTERLY MODEL
As an aid to the establishment of personnel requirements, the director of a hospital wishes to estimate the mean number of people who are admitted to the emergency room during a 24-hour period. The director randomly selects 64 different 24-hour periods and determines the number of admissions for each. For this sample, x̄ = 19.8 and s² = 25. Which of the following assumptions is necessary in order for a confidence interval to be valid?
No assumptions are necessary (Central limit theorem)
It is desired to estimate the average total compensation of CEOs in the Service industry. Data were randomly collected from 18 CEOs and the 95% confidence interval was calculated to be ($2,181,260, $5,836,180). Which of the following interpretations is correct?
We are 95% confident that the average total compensation of all CEOs in the Service industry falls in the interval $2,181,260 to $5,836,180.
The power of a statistical test is
the probability of rejecting H0 when it is false.
Statistical independence determination
P(A ∩ B) = P(A) × P(B).
Implications of increasing the sample size (sampling distributions - normal distribution)
With the sample size increasing from n = 25 to n = 100, more sample means will be closer to the distribution mean. The standard error of the sampling distribution of size 100 is much smaller than that of size 25, so the likelihood that the sample mean will fall within ± 0.2 minutes of the mean is much higher for samples of size 100 (probability = 0.8413) than for samples of size 25 (probability = 0.3830).
A market researcher states that she has 95% confidence that the mean monthly sales of a product are between $170,000 and $200,000. Explain the meaning of this statement.
If all possible samples of the same size n are taken, 95% of them include the true population average monthly sales of the product within the interval developed. Thus you are 95% confident that this sample is one that does correctly estimate the true average amount.
When can you assume that the sampling distribution is approx normal
Since the population standard deviation is known and n = 50, from the Central Limit Theorem we may assume that the sampling distribution of X̄ is approximately normal.
What does reducing the confidence level do to the confidence interval
The reduced confidence level narrows the width of the confidence interval.
A stationery store wants to estimate the mean retail value of greeting cards that it has in its inventory. A random sample of 20 greeting cards indicates a mean value of $4.95 and a standard deviation of $0.82.
Interpret the confidence interval and how is this helpful in estimating the value of total inventory
The store owner can be 95% confident that the population mean retail value
of greeting cards that the store has in its inventory is somewhere between
$4.56 and $5.34. The store owner could multiply the ends of the confidence
interval by the number of cards to estimate the total value of his inventory.
Interpret a proportion confidence interval
You are 95% confident that the population proportion of employers who
have used a recruitment service within the past two months to find new
staff is between 0.17 and 0.24.
You are 99% confident that the population proportion of employers who
have used a recruitment service within the past two months to find new
staff is between 0.17 and 0.25.
What happens to the confidence interval when you increase the level of confidence
When the level of confidence is increased, the confidence interval becomes
wider. The loss in precision reflected as a wider confidence interval is the
price you have to pay to achieve a higher level of confidence.
When do you reject the null hypothesis
Decision rule: reject H0 if the test statistic is smaller than the lower critical value or greater than the upper critical value.
p value interpretation
photo in favourites on phone 31/5/2018
Interpretation of hypothesis testing answer
There is enough evidence to conclude the population mean delivery time
has been reduced below the previous value of 25 minutes, at the 5% level
of significance.
p-value = 0.0047 interpretation
Since the p-value = 0.0047 is less than alpha, there is enough evidence to conclude the population mean delivery time has been reduced below the previous value of 25 minutes.
What does increasing the sample size do in regards to hypothesis testing and proportions
A larger sample size implies that there is more information about the population and reduces the standard error (variation) of the sample proportion
Conditions of hypothesis testing when the distribution isn't exactly normal
The samples used need to be random. As the sample size is large, the conditions np > 5 and n(1 - p) > 5 need to be met.
What do you need to know to perform the t test on the population mean
You must assume that the observed sequence in which the data were collected is random and that the data are approximately normally distributed.
forecasting questions
Photos on phone album