Exam Revision Flashcards
Statistics
Statistics is the branch of mathematics that examines ways to process and analyse data. Statistics provides procedures to collect and transform data in ways that are useful to business decision makers. To understand anything about statistics, you first need to understand the meaning of a variable.
4 fundamental terms of statistics
Population
Sample
Parameter
Statistic
Population
A population consists of all the members of a group about which you want to
draw a conclusion.
Sample
A sample is the portion of the population selected for analysis
Parameter
A parameter is a numerical measure that describes a characteristic of a
population (measures used to describe a population) GREEK LETTERS REFER
TO A PARAMETER
Statistic
A statistic is a numerical measure that describes a characteristic of a sample
(measures calculated from sample data) ROMAN LETTERS REFER TO
STATISTICS
2 types of statistics
Descriptive statistics
Inferential statistics
Descriptive statistics
Collecting, summarising and presenting data
Inferential statistics
Drawing conclusions about a population based on sample data/results (e.g. estimating a parameter from a statistic, or hypothesis testing).
2 types of data
Categorical (defined categories)
Numerical (quantitative)
2 types of numerical variables
Discrete (counted items)
Continuous (measured characteristics)
4 levels of Measurement and Measurement Scales from highest to lowest
Ratio data
Interval data
Ordinal data
Nominal data
Ratio data
Differences between measurements are meaningful and a true zero
exists
Interval data
Differences between measurements are meaningful but no true zero
exists (zero is an arbitrary point, so negative values are possible, e.g. temperature in °C)
Ordinal data
Ordered categories (rankings, order or scaling)
Nominal data
Categories (no ordering or direction)
4 measures used to describe data
Central tendency
Quartiles
Variation
Shape
4 measures of central tendency
Arithmetic mean
Median
Mode
Geometric mean
5 measures of variation
Range
Interquartile range
Variance
Standard deviation
Coefficient of variation
1 measure of shape
Skewness
Arithmetic mean
The arithmetic mean is the sum of the observations divided by the number of observations.
Median and mean: effect of extreme values
The median is not sensitive to extreme values and the mean is sensitive to extreme values.
Sigma
Sigma (Σ) is shorthand for adding up (summing) the values.
Median
In an ordered array, the median is the middle number (50% above and 50% below). Its main advantage over the arithmetic mean is that it is not affected by extreme values.
Mode
A measure of central tendency. Value that occurs most often (the most frequent). Not affected by extreme values. Never use the mode by itself, always use in conjunction with median or mean. Unlike mean and median, there may be no unique (single) mode for a given data set. Used for either numerical or categorical (nominal) data.
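As a quick illustration of the three measures above, the sketch below uses Python's standard `statistics` module on a small set of made-up exam marks (the data are an assumption for illustration only). It shows how a single extreme value moves the mean much more than the median.

```python
from statistics import mean, median, multimode

# Hypothetical data: seven exam marks, then the same marks plus one extreme value.
marks = [52, 55, 55, 60, 63, 68, 74]
marks_with_outlier = marks + [300]

mean_before = mean(marks)                    # sum of values / number of values
mean_after = mean(marks_with_outlier)
median_before = median(marks)                # middle value of the ordered data
median_after = median(marks_with_outlier)
modes = multimode(marks)                     # list of ALL most frequent values

print(mean_before, mean_after)       # mean jumps from 61 to 90.875
print(median_before, median_after)   # median only moves from 60 to 61.5
print(modes)                         # [55]
```

`multimode` is used rather than `mode` because it returns every most frequent value, which matches the point that a data set may have no single mode.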
Quartiles
Quartiles split the ranked data into four segments, with an equal number of values per segment. The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger. The second quartile, Q2, is the same as the median (50% are smaller, 50% are larger). Only 25% of the observations are greater than the third quartile, Q3.
Measures of variation
Measures of variation give information on the spread or variability of the data values
Interquartile range
Like the median, Q1 and Q3, the IQR is a resistant summary measure (resistant to the presence of extreme values). Using the interquartile range eliminates outlier problems, because the highest- and lowest-valued observations are excluded from the calculation. IQR = 3rd quartile – 1st quartile = Q3 – Q1
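A minimal sketch of quartiles and the IQR, using `statistics.quantiles` from the Python standard library on hypothetical delivery times. Software packages and textbooks use slightly different quartile conventions, so results on other data may differ a little from a hand calculation. The example also shows why the IQR is called resistant: making the largest value ten times bigger changes the range but not the IQR.

```python
from statistics import quantiles

# Hypothetical data: 11 delivery times in minutes, including one extreme value (95).
times = [21, 24, 27, 30, 32, 35, 38, 41, 44, 50, 95]

# n=4 asks for the three cut points that split the ranked data into quarters.
q1, q2, q3 = quantiles(times, n=4)
iqr = q3 - q1
value_range = max(times) - min(times)

# Make the extreme value ten times larger and recompute.
times_extreme = times[:-1] + [950]
q1_e, q2_e, q3_e = quantiles(times_extreme, n=4)
iqr_extreme = q3_e - q1_e
range_extreme = max(times_extreme) - min(times_extreme)

print(q1, q2, q3)                    # quartiles 27, 35 and 44
print(iqr, value_range)              # IQR 17, range 74
print(iqr_extreme, range_extreme)    # IQR still 17, range now 929
```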
Sample variance
Measures the average scatter around the mean. It is the sum of the squared differences between each observation and the sample mean, divided by n – 1 (approximately the average squared deviation from the mean). The deviations are squared because some are negative and some are positive, so they would otherwise cancel out. As a result, the units are also squared.
Sample standard deviation
Most commonly used measure of variation. Shows variation about the mean. Has the same units as the original data. It can be considered a measure of uncertainty.
Coefficient of variation
Measures relative variation i.e. shows variation relative to mean. Can be used to compare two or more sets of data measured in different units. Always expressed as percentage (%)
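The three sample measures above can be sketched step by step as follows. The data are made up for illustration, and the variable names are this example's own, not notation from the cards.

```python
import math

data = [4, 8, 6, 5, 7]   # hypothetical sample, e.g. waiting times in minutes

n = len(data)
x_bar = sum(data) / n                                   # sample mean
squared_deviations = [(x - x_bar) ** 2 for x in data]   # squaring removes the signs

sample_variance = sum(squared_deviations) / (n - 1)     # units: minutes squared
sample_sd = math.sqrt(sample_variance)                  # back in the original units
cv_percent = sample_sd / x_bar * 100                    # variation relative to the mean, in %

print(sample_variance)          # 2.5
print(round(sample_sd, 3))      # 1.581
print(round(cv_percent, 1))     # 26.4
```

Because the coefficient of variation is a percentage with no units, it can be compared across data sets measured in different units, which the variance and standard deviation cannot.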
The Z score
The difference between a given observation and the mean, divided by the standard deviation. A Z score of 2.0 means that a value is 2.0 standard deviations from the mean. A Z score above 3.0 or below -3.0 is considered an outlier
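A small sketch of the Z score calculation and the outlier rule of thumb above, using hypothetical data in which one value sits far from the rest. The function name `z_scores` is illustrative only.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (value - mean) / standard deviation for every value."""
    x_bar = mean(values)
    s = stdev(values)            # sample standard deviation
    return [(x - x_bar) / s for x in values]

# Hypothetical data: 20 ordinary values plus one extreme value.
values = [48, 49, 50, 51, 52] * 4 + [120]

scores = z_scores(values)
outliers = [x for x, z in zip(values, scores) if z > 3.0 or z < -3.0]

print(round(scores[-1], 2))   # Z score of 120, a little over 4.3
print(outliers)               # [120]
```

Note that in very small samples no value can reach a Z score of 3, because the extreme value itself inflates the standard deviation, so the rule is more useful with larger data sets.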
The shape of a distribution
Describes how the data are distributed. The shape of a distribution is either symmetric or skewed
Left skewed and right skewed
When the data are left (negatively) skewed, the distance between Q1 and Q2 is greater than the distance between Q2 and Q3. The reverse applies for right (positively) skewed data. If the data are symmetric, the two distances are the same
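The quartile comparison above can be turned into a rough check, sketched below with hypothetical data sets. The function name `describe_shape` is made up for this example, and with real data the two distances are rarely exactly equal, so this is a quick heuristic rather than a formal test.

```python
from statistics import quantiles

def describe_shape(values):
    """Compare the Q1-to-Q2 distance with the Q2-to-Q3 distance."""
    q1, q2, q3 = quantiles(values, n=4)
    lower_distance = q2 - q1
    upper_distance = q3 - q2
    if lower_distance > upper_distance:
        return "left-skewed"
    if upper_distance > lower_distance:
        return "right-skewed"
    return "symmetric"

# Hypothetical data sets of 11 values each.
long_left_tail = [1, 16, 21, 24, 26, 27, 28, 28, 29, 29, 30]
long_right_tail = [1, 2, 2, 3, 3, 4, 5, 7, 10, 15, 30]
evenly_spread = list(range(1, 12))

print(describe_shape(long_left_tail))    # left-skewed
print(describe_shape(long_right_tail))   # right-skewed
print(describe_shape(evenly_spread))     # symmetric
```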
What does a box and whisker plot show
A box-and-whisker plot shows location, spread and shape.
Population variance
The average of the squared deviations of values from the population mean.
Population standard deviation
Shows variation about the mean. It is the square root of the population variance and has the same units as the original data.
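To see how the population measures differ from the sample versions, the sketch below applies both to the same hypothetical values. The only difference is the divisor: the population variance divides by N, the sample variance by n – 1.

```python
from statistics import pvariance, pstdev, variance

# Hypothetical values, first treated as the WHOLE population.
values = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = pvariance(values)   # sum of squared deviations / N
population_sd = pstdev(values)            # square root of the population variance

# The same numbers treated as a sample instead (divides by n - 1).
sample_variance = variance(values)

print(population_variance, population_sd)   # 4 and 2.0
print(round(sample_variance, 3))            # 4.571
```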
Covariance
The sample covariance measures the direction (positive or negative) of the linear relationship between two numerical variables. Its size is not a reliable guide to the strength of the relationship, because it is affected by the units of measurement. No causal effect is implied.
Correlation
Measures the relative strength of the linear relationship between two variables
Features of correlation coefficient
Also called Standardised Covariance i.e. invariant to units of measure. Ranges between –1 and 1. The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship. The closer to 0, the weaker the linear relationship
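The contrast between covariance and correlation can be sketched as follows, with hypothetical height and weight data. Changing height from metres to centimetres multiplies the covariance by 100 but leaves the correlation coefficient unchanged. The function names are this example's own.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance standardised by the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # covariance of x with itself = variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical paired data.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weights_kg = [50, 58, 65, 72, 85]
heights_cm = [h * 100 for h in heights_m]

cov_m = sample_covariance(heights_m, weights_kg)
cov_cm = sample_covariance(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

print(round(cov_m, 2), round(cov_cm, 2))   # covariance changes with the units
print(round(r_m, 3), round(r_cm, 3))       # correlation does not
```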
5 number summary
Numerical data summarised by quartiles: Xsmallest, Q1, Median, Q3, Xlargest
3 approaches to assessing probability
a priori
Empirical
Subjective
a priori
Classical probability. Based on prior knowledge
Empirical
Classical probability. Based on observed data
Subjective
Subjective probability. Based on individual judgment or opinion about the probability of occurrence
Probability
A numerical value that represents the chance, likelihood or possibility that an event will occur (always between 0 and 1 inclusive)
Discrete probability
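A discrete random variable can take only certain distinct (countable) values; a discrete probability distribution gives the probability of each of those values.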
4 essential properties of the binomial distribution
A fixed number of observations
Two mutually exclusive and collectively exhaustive events
Constant probability for each observation
Observations are independent
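When the four properties above hold, the probability of exactly k events of interest in n observations follows the standard binomial formula, C(n, k) × p^k × (1 – p)^(n – k). The cards do not state the formula, so the sketch below is an addition for illustration, with a hypothetical example of 10 fair coin tosses.

```python
from math import comb

def binomial_probability(k, n, p):
    """P(exactly k events of interest in n independent observations)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical example: 10 tosses of a fair coin (fixed n, two outcomes,
# constant p = 0.5, tosses independent of each other).
n, p = 10, 0.5
p_three_heads = binomial_probability(3, n, p)
total = sum(binomial_probability(k, n, p) for k in range(n + 1))

print(p_three_heads)   # 0.1171875
print(total)           # probabilities over all possible k add up to 1
```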
Index numbers
Index numbers allow relative comparisons over time. Index
numbers are reported relative to a Base Period Index. Base period index = 100 by
definition. Used for an individual item or measurement.
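A simple price index for a single item can be sketched as each period's price divided by the base-period price, multiplied by 100. The prices and years below are hypothetical, and the function name is illustrative only.

```python
def simple_price_index(prices, base_period):
    """Express each period's price relative to the base period (base = 100)."""
    base_price = prices[base_period]
    return {
        period: round(price / base_price * 100, 1)
        for period, price in prices.items()
    }

# Hypothetical prices of one item in three years, with 2020 as the base period.
prices = {2020: 2.50, 2021: 2.75, 2022: 3.00}
index = simple_price_index(prices, base_period=2020)

print(index)   # {2020: 100.0, 2021: 110.0, 2022: 120.0}
```

An index of 120 in 2022 reads as "the price is 20% higher than in the base period".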
Which price index to use
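The Paasche index is more accurate but more difficult to compute: it weights prices by current-period quantities, so up-to-date quantity data are needed for every period.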
Characteristics of the normal distribution
Bell-shaped
Symmetrical
Mean, median and mode are equal
Central location is determined by the mean
Spread is determined by the standard deviation (IT IS THE POPULATION STANDARD DEVIATION)
The random variable x has an infinite theoretical range
What is the height of the curve a measure of
Probability density (probabilities themselves are given by areas under the curve)
What must the area under the curve be
1
Calculate descriptive numerical measures to determine normality
Do the mean and median have similar values? (Remember there may be no unique mode or there may be multiple modes.)
Is the interquartile range approximately 1.33 times the standard deviation?
Is the range approximately 6 times the standard deviation?
Calculate standard deviation to determine normality
Do approximately 2/3 of the observations lie within mean ± 1 standard deviation?
Do approximately 80% of the observations lie within mean ± 1.28 standard deviations?
Do approximately 95% of the observations lie within mean ± 2 standard deviations?
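The benchmark figures in these checks come from the normal distribution itself. As a sketch, Python's `statistics.NormalDist` can reproduce them: the proportions come out at roughly 68%, 80% and 95%, and the theoretical interquartile range is about 1.35 standard deviations, close to the 1.33 rule of thumb quoted above.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within mean ± k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.28, 2, 3):
    print(k, round(share_within(k), 4))

# Theoretical interquartile range, measured in standard deviations.
iqr_in_sds = standard_normal.inv_cdf(0.75) - standard_normal.inv_cdf(0.25)
print(round(iqr_in_sds, 2))
```

Almost all of the distribution (over 99%) lies within 3 standard deviations either side of the mean, which is where the "range is about 6 standard deviations" check comes from.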
Continuous probability density function
Mathematical expression that defines the distribution of the values for a continuous random variable.
Sampling distribution
A sampling distribution is a distribution of all of the possible values of a statistic for a given size sample selected from a population.
Standard error of the mean
Different samples of the same size from the same population will yield different sample means.
A measure of the variability in the mean from sample to sample is given by the Standard Error of the Mean. Note that the standard error of the mean decreases as the sample size increases.
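The standard error of the mean equals the population standard deviation divided by the square root of the sample size (σ/√n). The simulation below is a rough sketch of that idea, assuming a hypothetical normal population with mean 50 and standard deviation 10: it draws many samples, records each sample mean, and compares the spread of those means with σ/√n for two sample sizes.

```python
import math
import random

random.seed(1)                 # fixed seed so repeated runs give the same numbers
POP_MEAN, POP_SD = 50, 10      # hypothetical population parameters

def sd_of_sample_means(n, repeats=2000):
    """Draw `repeats` samples of size n; return the standard deviation of their means."""
    means = []
    for _ in range(repeats):
        sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]
        means.append(sum(sample) / n)
    grand_mean = sum(means) / repeats
    return math.sqrt(sum((m - grand_mean) ** 2 for m in means) / (repeats - 1))

for n in (25, 100):
    simulated = sd_of_sample_means(n)
    theoretical = POP_SD / math.sqrt(n)     # 2.0 for n = 25, 1.0 for n = 100
    print(n, round(simulated, 2), theoretical)
```

Quadrupling the sample size from 25 to 100 halves the standard error, so the simulated spread for n = 100 should come out at about half of that for n = 25.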
If the population is not normal
We can apply the Central Limit Theorem, which states that regardless of the shape of the distribution of individual values in the population, as long as the sample size is large enough (generally n ≥ 30) the sampling distribution of XBAR will be approximately normally distributed with mean equal to the population mean (μ) and standard deviation equal to the standard error of the mean (σ/√n).
Sampling Distribution of the Proportion
Selecting all possible samples of a certain size, the distribution of all possible sample proportions is the sampling distribution of the proportion.
Simple random sampling
Every individual or item from the frame (N) has an equal chance of being selected (1/N).
Selection may be with replacement or without replacement.
Samples can be obtained from a table of random numbers or computer random number generators.
Simple to use but may not be a good representation of the population’s underlying characteristics.
Systematic sampling
Divide frame of N individuals into n groups of k individuals: k = N/n.
Randomly select one individual from the 1st group.
Select every kth individual thereafter.
Like simple random sampling, simple to use but may not be a good representation of the population’s underlying characteristics.
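The three steps of systematic sampling can be sketched as below. The frame of 100 numbered individuals, the seed and the function name are all assumptions made for illustration.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick a random start in the first group, then take every kth individual."""
    k = len(frame) // n              # group size k = N/n (any remainder is ignored here)
    start = rng.randrange(k)         # random position within the first group
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 101))          # hypothetical frame: individuals numbered 1 to 100
sample = systematic_sample(frame, n=10, rng=random.Random(7))

print(sample)    # 10 individuals, each exactly k = 10 positions apart
```

Only the starting point is random. Once it is chosen, the rest of the sample is fixed, which is why a pattern in the frame that repeats every k items can make the sample unrepresentative.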
Stratified sampling
Divide population into two or more subgroups (called strata) according to some common characteristic.
A simple random sample is selected from each subgroup, with sample sizes proportional to strata sizes – called proportionate stratified sampling.
Samples from subgroups are combined into one.
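A minimal sketch of proportionate stratified sampling, assuming a hypothetical population of 1,000 people in three strata. Each stratum contributes a simple random sample whose size is proportional to the stratum's share of the population. The names used here are illustrative only.

```python
import random

def proportionate_stratified_sample(strata, total_n, rng):
    """strata maps each stratum name to the list of its members."""
    population_size = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        # Sample size proportional to stratum size; rounding can shift the total slightly.
        stratum_n = round(total_n * len(members) / population_size)
        # Simple random sample (without replacement) from this stratum.
        sample.extend(rng.sample(members, stratum_n))
    return sample

# Hypothetical strata: members are (stratum name, id) pairs.
sizes = {"undergraduate": 600, "postgraduate": 300, "staff": 100}
strata = {name: [(name, i) for i in range(size)] for name, size in sizes.items()}

sample = proportionate_stratified_sample(strata, total_n=50, rng=random.Random(3))
counts = {name: sum(1 for stratum, _ in sample if stratum == name) for name in strata}

print(counts)   # {'undergraduate': 30, 'postgraduate': 15, 'staff': 5}
```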
Stratified sampling pros
More efficient than simple random sampling or systematic sampling because of assured representation of items across entire population.
Homogeneity of items within each stratum provides greater precision in the estimates of underlying population parameters.
Cluster samples
Population is divided into several ‘clusters’, each representative of the population e.g. postcode areas, electorates etc.
A simple random sample of clusters is selected:
All items in the selected clusters can be used, or items can be chosen from a cluster using another probability sampling technique.
Cluster sampling pros and cons
More cost effective than random sampling, especially if population is geographically widespread.
Often requires a larger sample size compared to simple random sampling or stratified sampling for same level of precision.
Survey errors
Coverage error – arises when the frame is inappropriate or inadequate (some groups are excluded).
Non-response error – results in non-response bias.
Measurement error – ambiguous wording, halo effect or respondent error.
Sampling error – always exists and is the difference between sample statistic and population parameter.
Point estimate
A point estimate is the value of a single sample statistic.
Confidence interval
A confidence interval provides a range of values constructed around the point estimate.
Confidence interval estimation
An interval estimate gives a range of values. It takes into consideration the variation in sample statistics from sample to sample, but is based on observations from one sample.
Gives information about closeness to unknown population parameters.
Stated in terms of level of confidence. Can never be 100% confident.
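One common case can be sketched as follows: a confidence interval for the population mean when the population standard deviation is assumed known, so that the interval is the sample mean plus or minus a Z value times the standard error (σ/√n). The cards do not give this formula, and the sample figures below are hypothetical.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, pop_sd, n, confidence=0.95):
    """Interval for the population mean, assuming the population sd is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z * pop_sd / math.sqrt(n)                   # Z value x standard error
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 50 from n = 25 observations, population sd 10.
low_95, high_95 = z_confidence_interval(50, 10, 25, confidence=0.95)
low_99, high_99 = z_confidence_interval(50, 10, 25, confidence=0.99)

print(round(low_95, 2), round(high_95, 2))   # 46.08 53.92
print(round(low_99, 2), round(high_99, 2))   # wider interval for higher confidence
```

Raising the level of confidence widens the interval, and the width grows without limit as the level approaches 100%, which is why a usable interval is never stated with 100% confidence.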