Stats Flashcards
Most common observational study?
Surveys
What are surveys? (Observational study)
Questionnaires presented to individuals, selected from a POPULATION OF INTEREST
What is the role of surveys (what they can and can’t do)?
- Can only report relationships between variables
- Cannot claim CAUSE and EFFECT
What is an experiment?
A systematic procedure carried out under controlled conditions
What is the role of experiments (3)?
- To discover an unknown effect
- To illustrate a known effect
- To test OR establish a hypothesis
What should experiments be designed to do?
Minimise BIASES that might occur
When analysing a process, experiments are used to evaluate…
- Which PROCESS INPUTS have a significant impact on the PROCESS OUTPUTS
What is the name of the systematic approach behind the several different ways to collect experimental process input/output information?
Design of Experiments (DOE)
Purpose of experimentation… (6)
- Comparing alternatives
- Identifying the significant inputs (factors) which affect the output (response), i.e. separating the vital few from the trivial many
- Achieving an OPTIMAL PROCESS OUTPUT (response)
- Reduce Variability
- Minimizing, Maximizing, or Targeting an Output
- Achieve product & process robustness
To minimize bias, you need to…
Select your sample of individuals randomly!
What are the three data collection types? (3)
- Categorical data
- Numerical data
- Ordinal data
What is Categorical data?
Records qualities or characteristics about the individual, such as eye color or opinions (agree/disagree)
(NB Numbers do not have “real numerical meaning”)
What is Numerical data?
Records measurements or counts regarding each individual
What is Ordinal data?
Lies in between categorical and numerical data: the data appear in categories, but the categories have a meaningful order (e.g. rankings 1st to 5th, best to worst)
If the data set contains an even number of values… (median)
The median is the average of the two values that are in the middle
Standard Deviation? (definition)
Quantifies the typical distance from any value in the data set to the centre
Standard Deviation (equation)
sigma = sqrt( sum (xi - mean x)^2 / (n - 1) )
Properties of standard deviation
- Is never negative
- Smallest possible value is zero
- Affected by OUTLIERS
- Has the same UNITS as the original data
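A minimal Python sketch of the median and standard deviation cards above. The data set is made up for illustration, and the helper names `median` and `sample_std` are assumptions, not part of the notes:

```python
import math

def median(values):
    """Middle value, or the average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def sample_std(values):
    """Standard deviation with the n - 1 divisor, as in the equation card."""
    n = len(values)
    mean = sum(values) / n
    squared_distances = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_distances / (n - 1))

data = [2, 4, 4, 5, 7, 8]       # hypothetical measurements (even count)
print(median(data))             # average of the two middle values, 4 and 5
print(sample_std(data))         # same units as the data
print(sample_std([3, 3, 3]))    # identical values give the smallest value, zero
```

Replacing the 8 with a much larger value leaves the median unchanged but inflates the standard deviation, which is the sense in which it is affected by outliers.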
A random variable is…
a variable whose possible values are numerical outcomes of a RANDOM PHENOMENON
Types of random variables:
- Continuous
- Discrete
A probability distribution is…
a list of possible values of a random variable,
together with their probabilities
A binomial distribution is…
a frequency distribution of the possible number of
successful outcomes in a given number of trials in each of which there is the same probability of success… (I.e. SUCCESS/FAILURE)
Characteristics of a Binomial Distribution (4)
- Must be a fixed number of trials (n)
- Only two outcomes: SUCCESS/FAILURE
- The probability of success, p, must remain the same for each trial
- The outcomes of each trial must be INDEPENDENT of each other
If a random variable X has a binomial distribution, PROBABILITIES for X can be calculated using the following formula:
P(X = x) = (n choose x) p^x (1-p)^(n-x)
Binomial Distribution parameters:
- n = no. trials
- x = no. successes
- n-x = no. fails
- p = success probability (any trial)
- 1-p = failure probability
A binomial random variable X can take values between…
0 and n (the least/most no. successes in n trials)
For a binomial random variable the mean is:
µ = n.p
The variance of a random variable is…
The weighted average of the squared distances from the mean
The variance of a binomial random variable is… (formula)
sigma^2 = n*p(1-p)
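As a hedged illustration of the binomial cards above, the sketch below evaluates the probability formula for every x from 0 to n and checks the mean and variance as weighted averages. The values n = 10 and p = 0.3 and the function name `binomial_pmf` are assumptions chosen for the example:

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x): x successes in n independent trials, success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3   # hypothetical number of trials and success probability
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean: probability-weighted average of the possible values 0..n
mean = sum(x * pr for x, pr in enumerate(probs))

# Variance: probability-weighted average of the squared distances from the mean
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probs))

print(sum(probs))                   # probabilities over 0..n add up to 1
print(mean, n * p)                  # both approximately 3.0
print(variance, n * p * (1 - p))    # both approximately 2.1
```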
Discrete random variable:
A variable which can only take a countable number
of values
Continuous random variable:
A random variable that takes on values within AN INTERVAL (or has so many possible values that they might as well be considered continuous)
The most adopted distribution for continuous
random variables:
The normal distribution
The Normal Distribution: Definition
Random Variable X follows a normal distribution if its values fall into a bell-shaped continuous curve that is symmetric
The Normal Distribution: Fundamental characteristics (3)
- The area under the curve is EQUAL TO UNITY
- It has symmetry about the centre (i.e., it has 50% of values less than the mean and 50% greater than the mean)
- Each normal distribution is described via the mean, µ, and the standard deviation, σ
Saddle Points (more precisely, inflection points):
Where the bell-shaped curve changes from concave down to concave up.
Distance between the mean and the saddle (inflection) points
1 σ
For any normal distribution, almost all
its values lie within __ standard
deviations of the mean
3
The Standard Normal Distribution, AKA:
The Z-Distribution
The Standard Normal Distribution has mean equal to:
0
The Standard Normal Distribution has S.D. equal to:
Unity
The normal random variable of a standard normal distribution is called a…
Standard score / z-value
A value on the Z-distribution represents…
the number of standard deviations the data is
above or below the mean
68% of Standard normal distribution values are:
within 1 σ of the mean
95% of Standard normal distribution values are:
within 2 σs of the mean
99.7% of Standard normal distribution values are:
within 3 σs of the mean
To change a value of X into a value of Z, you can use this formula:
z = (X - µ)/σ
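The z-value formula and the 68 / 95 / 99.7 cards above can be checked numerically. This is a small sketch using `statistics.NormalDist` from the Python standard library; the example values (X = 130, µ = 100, σ = 15) are invented for illustration:

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

STANDARD_NORMAL = NormalDist(mu=0, sigma=1)   # the Z-distribution

def share_within(k):
    """Share of values within k standard deviations of the mean."""
    return STANDARD_NORMAL.cdf(k) - STANDARD_NORMAL.cdf(-k)

print(z_score(130, 100, 15))    # 2 standard deviations above the mean
for k in (1, 2, 3):
    print(k, round(share_within(k), 4))   # roughly 0.68, 0.95, 0.997
```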
If your problem follows a normal distribution, this is what you have to do to find a probability for X (steps a to f):
a) Define your problem as either P(X < a), P(X > b), or P(a < X < b)
b) Calculate the corresponding z-values via: Z = (X - µ)/σ
c) Find the probability for the transformed Z-value using the Z-table
d) If P(X < a), the probability found under c) is the result: problem solved!
e) If P(X > b), the result is one minus the probability determined under c): problem solved!
f) If P(a < X < b), do steps a) to c) for both b and a and subtract the results: problem solved!
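The procedure above can be sketched in Python, with `NormalDist().cdf` standing in for the Z-table lookup. The function names and the example population (µ = 100, σ = 15) are assumptions for illustration only:

```python
from statistics import NormalDist

Z_TABLE = NormalDist()   # standard normal; cdf(z) plays the role of the Z-table

def prob_less_than(a, mu, sigma):
    """P(X < a): transform a to a z-value, then look up the probability."""
    z = (a - mu) / sigma
    return Z_TABLE.cdf(z)

def prob_greater_than(b, mu, sigma):
    """P(X > b): one minus the table probability."""
    return 1 - prob_less_than(b, mu, sigma)

def prob_between(a, b, mu, sigma):
    """P(a < X < b): table probability for b minus table probability for a."""
    return prob_less_than(b, mu, sigma) - prob_less_than(a, mu, sigma)

mu, sigma = 100, 15                        # hypothetical population
print(prob_less_than(100, mu, sigma))      # half the values lie below the mean
print(prob_greater_than(130, mu, sigma))   # upper tail beyond z = 2
print(prob_between(85, 115, mu, sigma))    # within one standard deviation
```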
When a sample of data is taken from a given population of data…
the statistical results/characteristics vary from sample to sample
To build the sampling distribution of the sample mean (3):
1) Take a sample of values from random variable X (population)
2) Calculate the mean of the sample, x~
3) Repeat step 1) and 2) over and over again
All the sample means result in a new population which is denoted using random variable
X~
The sampling distribution of the sample means gives all the possible values of the sample mean and quantifies…
how often they occur
A sampling distribution has its own…
shape, centre, and variability.
The mean of SAMPLING DISTRIBUTION X~ is denoted as:
µx~
The variability characterising a population of values (X) is quantified in terms of
Standard deviations
The variability in the sample mean X~ is measured in terms of standard errors
σx~ = σx/sqrt n
If the distribution of X is normal, then also the distribution of X~ is…
normal
If the distribution of X is unknown or not normal, according to the Central Limit Theorem (CLT), the distribution of X~ can be…
approximated with a normal distribution
The sampling distribution of X~ can be approximated by a normal distribution if: (2)
- The population has mean µ and standard deviation σ
- The samples are RANDOM and sufficiently LARGE
Further, the larger the sample size, n, the closer the distribution of the sample means will be to a…
normal distribution
Probability for X~ (formula)
Z = (X~-µx~)/(σx/sqrt n)
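A rough simulation of the sampling distribution cards above: draw many samples from a clearly non-normal population, record each sample mean, and compare the spread of the means with σx/sqrt n. The exponential population (mean 1, standard deviation 1), the sample size 40, the 5000 repeats and the fixed seed are all assumptions made for this sketch:

```python
import random
import statistics

def sample_means(draw, n, repeats, rng):
    """Take a sample of size n, calculate its mean, repeat over and over."""
    means = []
    for _ in range(repeats):
        sample = [draw(rng) for _ in range(n)]
        means.append(sum(sample) / n)
    return means

def draw_exponential(rng):
    """One value from a skewed population with mean 1 and standard deviation 1."""
    return rng.expovariate(1.0)

rng = random.Random(0)     # fixed seed so the run is repeatable
n = 40
means = sample_means(draw_exponential, n, 5000, rng)

centre = statistics.mean(means)             # close to the population mean
standard_error = statistics.stdev(means)    # close to sigma_x / sqrt(n)
print(centre, standard_error, 1.0 / n ** 0.5)
```

A histogram of `means` would look roughly bell-shaped even though the population itself is strongly skewed.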
Confidence Interval:
A range of values so defined that there is a specified
probability that the value of a parameter lies within it
- sample statistic ± (margin of error) gives a range of likely values for the parameter under investigation.
The goal when making an estimate using a confidence interval is to
minimise the margin of error.
The size of the margin of error is affected by:
1) Confidence level
2) Sample size
3) Variability in the population
Confidence Level:
The probability that the value of a parameter falls within a specified range of values.
… in other words, the confidence level of a confidence interval corresponds to the percentage of the time the result would be correct if numerous random samples were taken.
For a given confidence level, the number of standard errors to be added and subtracted (±) is given by…
z*, which is determined from the standard normal distribution (Z-distribution)
The confidence interval for a population mean is:
x~ ± z*(σx/sqrt n)
This means that as n increases both the standard error and the margin of error decrease, with this resulting in a
narrower confidence interval
as the confidence level increases,
the margin of error increases
When estimating a population mean, the sample size needed to achieve the desired margin of error can be estimated a priori via the following formula:
n = (z*σx/MOE)^2 (rounded up to the next integer)
If σx is unknown,
a pilot test can be run in order to make a rough estimate
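A hedged sketch of the confidence interval and sample size cards above, assuming σx is known (or roughly estimated from a pilot test). The function names and the numbers used are illustrative assumptions; z* is taken from the standard normal distribution via `NormalDist().inv_cdf`:

```python
import math
from statistics import NormalDist

def z_star(confidence_level):
    """Critical value leaving (1 - level) / 2 in each tail of the Z-distribution."""
    return NormalDist().inv_cdf((1 + confidence_level) / 2)

def mean_confidence_interval(sample_mean, sigma, n, confidence_level=0.95):
    """Sample mean plus or minus z* standard errors."""
    margin_of_error = z_star(confidence_level) * sigma / math.sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error

def required_sample_size(sigma, margin_of_error, confidence_level=0.95):
    """Sample size for a desired margin of error, rounded up to an integer."""
    return math.ceil((z_star(confidence_level) * sigma / margin_of_error) ** 2)

low, high = mean_confidence_interval(sample_mean=50, sigma=10, n=100)
print(low, high)                                         # roughly 48.04 to 51.96
print(required_sample_size(sigma=15, margin_of_error=3))
```

Raising the confidence level to 0.99 widens the interval, and raising n narrows it, matching the cards on how the margin of error behaves.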
The margin of error can be estimated (very roughly!) from the sample size via the following formula:
MOE ≈ 1/sqrt n (so the sample size needed is roughly n ≈ 1/MOE^2)
Variability (also called spread or dispersion) refers to how
spread out a set of data is. Variability is measured in terms of standard errors/deviations
To compare two different populations, it is common practice to calculate the confidence interval for the difference of two population means as:
x~-y~ ± z*sqrt(σ1^2/n1+σ2^2/n2)
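For the difference of two population means, the standard error combines the two variances, sqrt(σ1^2/n1 + σ2^2/n2). A small sketch under the assumption that both population standard deviations are known; every number and name below is invented for illustration:

```python
import math
from statistics import NormalDist

def difference_confidence_interval(mean_x, mean_y, sigma1, n1, sigma2, n2,
                                   confidence_level=0.95):
    """(x~ - y~) plus or minus z* times the standard error of the difference."""
    z_star = NormalDist().inv_cdf((1 + confidence_level) / 2)
    standard_error = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    difference = mean_x - mean_y
    margin_of_error = z_star * standard_error
    return difference - margin_of_error, difference + margin_of_error

# Hypothetical samples from two populations
low, high = difference_confidence_interval(mean_x=10, mean_y=8,
                                           sigma1=2, n1=50,
                                           sigma2=3, n2=60)
print(low, high)   # an interval that excludes 0 suggests the means differ
```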
A hypothesis test is
a procedure that uses data from a sample to confirm or
deny a claim about a population
Every hypothesis test is based on two hypotheses, i.e.:
- null hypothesis H0
- the research (or alternative) hypothesis (denoted Ha)
Ha can be formed in three different ways, the population parameter is _____ to the claimed value (3)
- Not equal to
- Larger than
- Smaller than
The null hypothesis is set up so that H0 is
assumed to be true unless the data and statistics demonstrate otherwise
a statistically significant result is when:
H0 is rejected in favour of Ha
As soon as the z-value of interest is known, proceed as follows:
⊗ if Ha is the less than alternative then: p-value = P(Z < z-value)
⊗ if Ha is the greater than alternative then: p-value = 1 - P(Z < z-value)
⊗ if Ha is the not-equal-to alternative then: p-value = 2 * (the smaller of the two tail probabilities above)
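The three cases above can be sketched as one function, where the probability looked up for the z-value is the area to its left under the standard normal curve. The function name and the labels "less", "greater" and "not-equal" are assumptions for this example:

```python
from statistics import NormalDist

def p_value(z, alternative):
    """p-value for a z test statistic under the chosen form of Ha."""
    left_area = NormalDist().cdf(z)     # P(Z < z), the Z-table probability
    if alternative == "less":
        return left_area
    if alternative == "greater":
        return 1 - left_area
    if alternative == "not-equal":
        # Double the smaller tail so the result never exceeds 1
        return 2 * min(left_area, 1 - left_area)
    raise ValueError("alternative must be 'less', 'greater' or 'not-equal'")

print(p_value(-1.96, "less"))        # about 0.025
print(p_value(1.96, "greater"))      # about 0.025
print(p_value(1.96, "not-equal"))    # about 0.05
```

A small p-value (below the chosen significance level) is what leads to H0 being rejected in favour of Ha.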
bivariate data set
each observation is described using two variables, x and y
After organising your bivariate data set, you can…
⊗ look for patterns
⊗ find a possible correlation
⊗ predict a value for y for a given value for x
⊗ summarise the dataset with scatterplots
given a bivariate data set, it is important to quantify
STRENGTH & DIRECTION of linear relationship
n in the correlation coefficient equation is..
the number of pairs of data
we have a strong linear relationship when
|r| ≥ 0.6 (i.e. r at or beyond +0.6 or -0.6)
the correlation coefficient is dimensionless, so that changing the units of X and Y
does not affect r
the correlation coefficient does not change if variables X and Y are
switched in the data set
The Pearson product moment correlation coefficient, r, ranges from -1 to +1; its square, R^2, ranges between
0 to 1 for no to perfect linear correlation
Function y=f(x) can be determined using a regression line provided that: (2)
- the data in the scatterplot follow (roughly) a linear distribution
- we have a strong linear relationship between x and y, i.e. |r| ≥ 0.6
To determine m and b, you can use the following relationships
m = r(σy/σx), b = (mean y) - m(mean x)
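A minimal sketch of the correlation and regression cards above, computing r from n pairs of data and then the slope m = r(σy/σx). The intercept is taken as b = (mean y) - m(mean x), which is the usual least-squares choice. The data pairs and helper names are assumptions made up for this example:

```python
import math

def mean_and_sd(values):
    """Mean and standard deviation (n - 1 divisor) of one variable."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, sd

def correlation(xs, ys):
    """Pearson correlation coefficient r for n pairs of data."""
    n = len(xs)
    mean_x, sd_x = mean_and_sd(xs)
    mean_y, sd_y = mean_and_sd(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / ((n - 1) * sd_x * sd_y)

def regression_line(xs, ys):
    """Slope m and intercept b of the line y = m*x + b."""
    mean_x, sd_x = mean_and_sd(xs)
    mean_y, sd_y = mean_and_sd(ys)
    m = correlation(xs, ys) * sd_y / sd_x
    b = mean_y - m * mean_x
    return m, b

xs = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical bivariate data set
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r = correlation(xs, ys)
m, b = regression_line(xs, ys)
print(r)                                        # strong positive relationship
print(m, b)                                     # slope and intercept
print(correlation(ys, xs))                      # unchanged when x and y are switched
print(correlation([100 * x for x in xs], ys))   # unchanged when units change
```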
A log-log regression line is expressed mathematically as:
y= a x^k
log-log line
Y = mX + b (X = log x, Y = log y, m = k, b = log a)
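Taking logs of y = a x^k gives log y = log a + k log x, so a straight-line fit to (log x, log y) recovers k as the slope and log a as the intercept. The sketch below assumes an ordinary least-squares line fit and base-10 logs; the data are generated from made-up values a = 3 and k = 1.5:

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept of Y = m*X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x

def fit_power_law(xs, ys):
    """Fit y = a * x**k by fitting a line to (log x, log y)."""
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    m, b = fit_line(log_x, log_y)
    return 10 ** b, m          # a = 10**b, k = m

xs = [1.0, 2.0, 4.0, 8.0]                   # hypothetical inputs (must be > 0)
ys = [3.0 * x ** 1.5 for x in xs]           # exact power law, a = 3, k = 1.5
a, k = fit_power_law(xs, ys)
print(a, k)                                 # recovers approximately 3 and 1.5
```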