Stats Flashcards by Callum Birse

Most common observational study?

Surveys

What are surveys? (Observational study)

Questionnaires presented to individuals, selected from a POPULATION OF INTEREST

What is the role of surveys (what they can and can’t do)?

Can only report relationships between variables

- Cannot claim CAUSE and EFFECT

What is an experiment?

A systematic procedure carried out under controlled conditions

What is the role of experiments (3)?

To discover an unknown effect
To illustrate a known effect
To test OR establish a hypothesis

What should experiments be designed to do?

Minimise BIASES that might occur

When analysing a process, experiments are used to evaluate…

Which PROCESS INPUTS have a significant impact on the PROCESS OUTPUTS

What is the name for the set of methods used to collect experimental process input/output information?

Design of Experiments (DOE)

Purpose of experimentation… (6)

Comparing alternatives
Identifying the significant inputs (factors) which affect the output (response),
i.e. separating the vital few from the trivial many
Achieving an OPTIMAL PROCESS OUTPUT (response)
Reducing variability
Minimizing, maximizing, or targeting an output
Achieving product & process robustness

To minimize bias, you need to…

Select your sample of individuals randomly!

What are the three data collection types? (3)

Categorical data
Numerical data
Ordinal data

What is Categorical data?

Records qualities or characteristics about the individual, such as eye color or opinions (agree/disagree)
(NB: any numbers used as category labels do not have “real numerical meaning”)

What is Numerical data?

Records measurements or counts regarding each individual

What is Ordinal data?

In between categorical and numerical: the data appear in categories, but the categories have a meaningful order (e.g. rankings 1st to 5th, best to worst)

If the data set contains an even number of values… (median)

The median is the average of the two values that are in the middle
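
A minimal sketch of the even/odd rule for the median, in Python. The function name and the example values are illustrative and not part of the original cards.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value is the median.
        return ordered[mid]
    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


even_median = median([7, 1, 5, 3])  # sorted: 1, 3, 5, 7; middle values are 3 and 5
odd_median = median([7, 1, 5])      # sorted: 1, 5, 7; middle value is 5
```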

Standard Deviation? (definition)

Quantifies the typical distance from any value in the data set to the centre

Standard Deviation (equation)

σ = sqrt( Σ (xi - x̄)^2 / (n - 1) ), where x̄ is the mean of the data and n is the number of values
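
As a rough check of the equation, the sketch below computes the standard deviation by hand with the n - 1 divisor and compares it with Python's `statistics.stdev`, which uses the same divisor. The data set and the function name are made up for illustration.

```python
import math
import statistics


def sample_std(values):
    """Standard deviation with the n - 1 divisor, computed step by step."""
    n = len(values)
    mean = sum(values) / n
    squared_distances = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_distances / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data; mean is 5
s = sample_std(data)             # sqrt(32 / 7)
library_s = statistics.stdev(data)
```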

Properties of standard deviation

Is never negative
Smallest possible value is zero (when all values are identical)
Affected by OUTLIERS
Has the same UNITS as the original data

A random variable is…

a variable whose possible values are numerical outcomes of a RANDOM PHENOMENON

Types of random variables:

- Continuous
- Discrete

A probability distribution is…

a list of possible values of a random variable, together with their probabilities

A binomial distribution is…

a frequency distribution of the possible number of successful outcomes in a given number of trials, in each of which there is the same probability of success (i.e. SUCCESS/FAILURE)

Characteristics of a Binomial Distribution (4)

Must be a fixed number of trials (n)
Only two outcomes: SUCCESS/FAILURE
The probability of success, p, must remain the same for each trial
The outcomes of each trial must be INDEPENDENT of each other

If a random variable X has a binomial distribution, PROBABILITIES for X can be calculated using the following formula:

P(X = x) = (n choose x) · p^x · (1 - p)^(n - x)
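
A small sketch of the formula in Python, using `math.comb` for "n choose x". The function name and the coin-toss example (4 tosses, p = 0.5) are assumptions for illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Probability of exactly 2 heads in 4 fair coin tosses: 6 * 0.5^2 * 0.5^2
prob_two_heads = binomial_pmf(2, 4, 0.5)

# The probabilities for x = 0..n should add up to 1.
total = sum(binomial_pmf(x, 4, 0.5) for x in range(5))
```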

Binomial Distribution parameters:

n = no. of trials
x = no. of successes
n - x = no. of failures
p = probability of success (on any trial)
1 - p = probability of failure

Probabilities of a binomial distribution are defined for x between...

0 and n (the least/most possible number of successes in n trials)

For a binomial random variable the mean is:

µ = n.p

The variance of a random variable is...

The weighted average of the squared distances from the mean

The variance of a binomial random variable is... (formula)

σ^2 = n·p·(1 - p)
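
The shortcut formulas for a binomial random variable (mean n·p, variance n·p·(1 - p)) can be checked against the general definitions: the mean as a probability-weighted average of the values, and the variance as the weighted average of squared distances from the mean. The sketch below does this for a hypothetical case with n = 10 and p = 0.3.

```python
from math import comb

n, p = 10, 0.3  # hypothetical parameters

# Probability of each possible number of successes x = 0..n
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

# General definitions: weighted averages over all possible values
mean_from_definition = sum(x * prob for x, prob in enumerate(pmf))
variance_from_definition = sum(
    (x - mean_from_definition) ** 2 * prob for x, prob in enumerate(pmf)
)

# Binomial shortcut formulas
mean_shortcut = n * p                 # 3.0
variance_shortcut = n * p * (1 - p)   # 2.1
```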

Discrete random variable:

A variable which can only take a countable number of values

Continuous random variable:

A random variable that takes on values within AN INTERVAL (or has so many possible values that they might as well be considered continuous)

The most adopted distribution for continuous random variables:

The normal distribution

The Normal Distribution: Definition

Random Variable X follows a normal distribution if its values fall into a bell-shaped continuous curve that is symmetric

The Normal Distribution: Fundamental characteristics (3)

- The area under the curve is EQUAL TO UNITY
- It has symmetry about the centre (i.e., it has 50% of values less than the mean and 50% greater than the mean)
- Each normal distribution is described via the mean, µ, and the standard deviation, σ

Saddle Points (more precisely, inflection points):

Where the bell-shaped curve changes from concave down to concave up.

Distance between the mean and the saddle (inflection) points

1 σ

For any normal distribution, almost all its values lie within __ standard deviations of the mean

3

The Standard Normal Distribution, AKA:

The Z-Distribution

The Standard Normal Distribution has mean equal to:

Zero

The Standard Normal Distribution has S.D. equal to:

Unity

The normal random variable of a standard normal distribution is called a...

Standard score / z-value

A value on the Z-distribution represents...

the number of standard deviations the value lies above or below the mean

68% of Standard normal distribution values are:

within 1 σ of the mean

95% of Standard normal distribution values are:

within 2 σs of the mean

99.7% of Standard normal distribution values are:

within 3 σs of the mean

To change a value of X into a value of Z, you can use this formula:

z = (X - µ)/σ

If your problem follows a normal distribution, this is what you have to do to find a probability for X (steps a to f):

a) Define your problem as either P(X < a), P(X > b), or P(a < X < b)
b) Calculate the corresponding z-values via: Z = (X - µ)/σ
c) Find the probability for the transformed Z-value using the Z-table
d) If P(X < a): problem solved!
e) If P(X > b): the result is one minus the probability determined under c); problem solved!
f) If P(a < X < b): repeat steps b) and c) for both b and a, and subtract the results; problem solved!
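
The procedure above can be sketched in Python, with `statistics.NormalDist().cdf` standing in for the Z-table lookup. The mean of 100, the standard deviation of 15, and all function names are hypothetical, chosen only to make the example concrete.

```python
from statistics import NormalDist

# Hypothetical example: X is normal with mean 100 and standard deviation 15.
MU, SIGMA = 100.0, 15.0

# Standard normal distribution (mean 0, S.D. 1); its cdf replaces the Z-table.
standard_normal = NormalDist()


def z_value(x, mu=MU, sigma=SIGMA):
    """Convert a value of X into a value of Z."""
    return (x - mu) / sigma


def prob_less_than(a):
    """P(X < a): read the probability for the z-value directly."""
    return standard_normal.cdf(z_value(a))


def prob_greater_than(b):
    """P(X > b): one minus the less-than probability."""
    return 1.0 - prob_less_than(b)


def prob_between(a, b):
    """P(a < X < b): less-than probability for b minus the one for a."""
    return prob_less_than(b) - prob_less_than(a)


p_less = prob_less_than(115)       # z = 1
p_greater = prob_greater_than(115)
p_between = prob_between(85, 115)  # within 1 S.D. of the mean
```

Here `p_between` comes out at roughly 0.68, which matches the 68% rule for values within 1 σ of the mean.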

When a sample of data is taken from a given population of data...

the statistical results/characteristics vary from sample to sample

To build the sampling distribution of the sample mean (3):

1) Take a sample of values from random variable X (population)
2) Calculate the mean of the sample
3) Repeat steps 1) and 2) over and over again

All the sample means result in a new population which is denoted using random variable...

X~

The sampling distribution of the sample means gives all the possible values of the sample mean and quantifies...

how often they occur

A sampling distribution has its own...

shape, centre, and variability.

The mean of SAMPLING DISTRIBUTION X~ is denoted as:

µx~

The variability characterising a population of values (X) is quantified in terms of

Standard deviations

The variability in the sample mean X~ is measured in terms of standard errors

σx~ = σx/sqrt n
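
A simulation sketch of the standard error formula: many samples of size n are drawn from a population, and the spread of their means is compared with σx/sqrt n. The population parameters, the sample size, the number of repeats, and the fixed random seed are all assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

MU, SIGMA = 50.0, 10.0  # hypothetical population mean and standard deviation
N = 25                  # size of each sample
REPEATS = 5000          # how many samples are drawn

# Steps 1) to 3): take a sample, calculate its mean, repeat over and over.
sample_means = []
for _ in range(REPEATS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

observed_se = statistics.stdev(sample_means)  # spread of the sample means
predicted_se = SIGMA / math.sqrt(N)           # formula value: 10 / 5 = 2.0
centre_of_means = statistics.fmean(sample_means)
```

The observed spread of the sample means should land close to the predicted standard error, and the centre of the sample means close to the population mean.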

If the distribution of X is normal, then also the distribution of X~ is...

normal

If the distribution of X is unknown or not normal, according to the Central Limit Theorem (CLT), the distribution of X~ can be...

approximated with a normal distribution

For the sampling distribution X~, it can be approximated to the normal distribution if: (2)

- The population has mean µ and standard deviation σ
- A sufficient number of LARGE/RANDOM samples are taken

Further, the larger the sample size, n, the closer the distribution of the sample means will be to a...

normal distribution

Probability for X~ (formula)

Z = (X~-µx~)/(σx/sqrt n)
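
A short sketch of this formula in use, taking µx~ equal to the population mean µ and using `statistics.NormalDist().cdf` in place of the Z-table. The numbers (µ = 100, σx = 15, n = 9) and the function name are hypothetical.

```python
import math
from statistics import NormalDist


def prob_sample_mean_below(value, mu, sigma, n):
    """P(X~ < value) for samples of size n from a population with mean mu and S.D. sigma."""
    standard_error = sigma / math.sqrt(n)
    z = (value - mu) / standard_error
    return NormalDist().cdf(z)


# Standard error is 15 / 3 = 5, so a sample mean of 105 corresponds to z = 1.
p_below = prob_sample_mean_below(105, mu=100, sigma=15, n=9)
```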

Confidence Interval:

A range of values so defined that there is a specified probability that the value of a parameter lies within it - sample statistic ± (margin of error) gives a range of likely values for the parameter under investigation.

The goal when making an estimate using a confidence interval is to

minimise the margin of error.

The size of the margin of error is affected by:

1) Confidence level 2) Sample size 3) Variability in the population

Confidence Level:

The probability that the value of a parameter falls within a specified range of values. ... in other words, the confidence level of a confidence interval corresponds to the percentage of the time the result would be correct if numerous random samples were taken.

For a given confidence level, the number of standard errors to be added and subtracted (±) is proportional to...

z*, which is determined from the standard normal distribution (Z-distribution)

The confidence interval for a population mean is:

x~ ± z*(σx/sqrt n)
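
A minimal sketch of this interval in Python. The z* value is taken from `statistics.NormalDist().inv_cdf` rather than a printed table, and the sample mean, σx, and sample sizes are made-up values.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (low, high) for x~ ± z* · σx / sqrt n."""
    # z* leaves (1 - confidence) / 2 in each tail of the standard normal distribution.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin_of_error = z_star * sigma / math.sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error


low, high = mean_confidence_interval(50.0, sigma=10.0, n=25)

# Same data summary with a larger sample: the interval should be narrower.
low_big, high_big = mean_confidence_interval(50.0, sigma=10.0, n=100)
```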

This means that as n increases both the standard error and the margin of error decrease, with this resulting in a

narrower confidence interval

as the confidence level increases,

the margin of error increases

When estimating a population mean, the sample size needed to achieve the desired margin of error can be estimated a priori via the following formula:

n = (z*·σx/MOE)^2 (rounded up to the next integer)
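
The sample size formula as a short Python sketch. The function name and the example inputs (z* = 1.96 for 95% confidence, σx = 10, desired margin of error 2) are assumptions for illustration.

```python
import math


def required_sample_size(z_star, sigma, margin_of_error):
    """Smallest whole n for which z* · σx / sqrt n does not exceed the desired margin of error."""
    return math.ceil((z_star * sigma / margin_of_error) ** 2)


# (1.96 * 10 / 2)^2 = 96.04, which rounds up to 97
n_needed = required_sample_size(1.96, sigma=10.0, margin_of_error=2.0)
```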

If σx is unknown,

a pilot test can be run in order to make a rough estimate

The sample size needed to achieve the desired margin of error can be estimated (very roughly!) via the following formula:

n ≈ (1/MOE)^2, since MOE ≈ 1/sqrt n

Variability (also called spread or dispersion) refers to how

spread out a set of data is. Variability is measured in terms of standard errors/deviations

To compare two different populations, it is common practice to calculate the confidence interval for the difference of two population means as:

x~ - y~ ± z*·sqrt(σ1^2/n1 + σ2^2/n2)
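
A sketch of the interval for the difference of two means, using the standard form in which the standard error is sqrt(σ1^2/n1 + σ2^2/n2), i.e. the variances (squared standard deviations) are divided by the sample sizes. All input values and the function name are hypothetical.

```python
import math


def difference_confidence_interval(mean_x, mean_y, sigma1, n1, sigma2, n2, z_star=1.96):
    """Return (low, high) for (x~ - y~) ± z* · sqrt(σ1²/n1 + σ2²/n2)."""
    standard_error = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    margin_of_error = z_star * standard_error
    difference = mean_x - mean_y
    return difference - margin_of_error, difference + margin_of_error


# Standard error: sqrt(100/100 + 64/64) = sqrt(2)
diff_low, diff_high = difference_confidence_interval(
    52.0, 50.0, sigma1=10.0, n1=100, sigma2=8.0, n2=64
)
```

In this made-up case the interval contains zero, so the data would not rule out equal population means at the 95% level.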

A hypothesis test is

a procedure that uses data from a sample to confirm or deny a claim about a population

Every hypothesis test is based on two hypotheses, i.e.:

- the null hypothesis (denoted H0)
- the research (or alternative) hypothesis (denoted Ha)

Ha can be formed in three different ways, the population parameter is _____ to the claimed value (3)

- Not equal to
- Larger than
- Smaller than

The null hypothesis is set up so that H0 is

true unless some data and statistics demonstrate otherwise

a statistically significant result is when:

H0 is rejected in favour of Ha

As soon as the z-value of interest is known, proceed as follows:

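⊗ if Ha is the less-than alternative, then: p-value = P(Z < z-value), the probability read from the Z-table
⊗ if Ha is the greater-than alternative, then: p-value = 1 - P(Z < z-value)
⊗ if Ha is the not-equal-to alternative, then: p-value = 2 × the tail probability beyond the z-value, i.e. 2·P(Z < -|z-value|)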
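
A sketch of the three cases in Python, reading each p-value as a probability looked up for the z-value (here via `statistics.NormalDist().cdf` instead of a Z-table), with the two-sided case doubling the tail probability. The function name and the labels for the alternatives are illustrative.

```python
from statistics import NormalDist


def p_value(z, alternative):
    """p-value for a test statistic z under the given form of Ha."""
    cdf = NormalDist().cdf  # P(Z < z) for the standard normal distribution
    if alternative == "less":
        return cdf(z)
    if alternative == "greater":
        return 1.0 - cdf(z)
    if alternative == "not equal":
        # Double the probability in the tail beyond |z|.
        return 2.0 * cdf(-abs(z))
    raise ValueError("alternative must be 'less', 'greater' or 'not equal'")


p_less = p_value(-1.96, "less")
p_greater = p_value(1.96, "greater")
p_two_sided = p_value(1.96, "not equal")
```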

bivariate data set

each observation is described using two variables, x and y

After organising your bivariate data set, you can...

⊗ look for patterns
⊗ find a possible correlation
⊗ predict a value for y for a given value for x
⊗ summarise the dataset with scatterplots

given a bivariate data set, it is important to quantify

STRENGTH & DIRECTION of linear relationship

n in the correlation coefficient equation is..

the number of pairs of data

we have a strong linear relationship when

|r| ≥ 0.6 (i.e. r ≥ +0.6 or r ≤ -0.6)

the correlation coefficient is dimensionless, so that changing the units of X and Y

does not affect r

the correlation coefficient does not change if variables X and Y are

switched in the data set

The square of the Pearson product moment correlation coefficient, R^2, ranges between

0 and 1, for no to perfect linear correlation (r itself ranges from -1 to +1)

Function y=f(x) can be determined using a regression line provided that: (2)

- the data in the scatterplot follow (roughly) a linear distribution
- we have a strong linear relationship between x and y, i.e. |r| ≥ 0.6

To determine m and b, you can use the following relationships

m = r(σy/σx)
b = ȳ - m·x̄ (x̄ and ȳ are the means of the x and y values)
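
A sketch that computes r, m and b for a small made-up bivariate data set. The intercept uses the standard fact that the least-squares line passes through the point of means, b = ȳ - m·x̄. The data and variable names are assumptions for illustration.

```python
import statistics

# Hypothetical bivariate data: n = 5 pairs (x, y)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)  # number of pairs of data
mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)

# Correlation coefficient: sum of products of deviations, scaled by (n - 1)·σx·σy
r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / ((n - 1) * sd_x * sd_y)

m = r * (sd_y / sd_x)    # slope of the regression line
b = mean_y - m * mean_x  # intercept: the line passes through (mean_x, mean_y)

predicted_y = m * 6 + b  # predict y for a new x value of 6
```

Because r is roughly 0.77 here, the data meet the "strong linear relationship" condition before the line is used for prediction.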

A log-log regression line is expressed mathematically as:

y= a x^k

log-log line

Y = mX + b (X = log x, Y = log y)
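
The link between the power law and the straight line can be written out by taking logarithms of both sides. This is a short derivation sketch, assuming x > 0, y > 0 and a > 0 so that the logarithms exist; the identification of m and b follows from comparing terms.

```latex
\begin{align*}
y &= a\,x^{k} \\
\log y &= \log a + k \log x \\
Y &= mX + b, \qquad Y = \log y,\quad X = \log x,\quad m = k,\quad b = \log a
\end{align*}
```

So the slope of the log-log line gives the exponent k, and the intercept gives log a.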

Stats Flashcards

(91 cards)