Statistics Flashcards
What two things must you do when making a stemplot?
- Use equal intervals.
- Give a key with an example to show how to read it.
What are the two definitions of outliers?
- More than 1.5x the IQR above the upper quartile or below the lower quartile.
- More than 2 standard deviations above or below the mean.
If there are n data points, which data point is the:
- median?
- UQ?
- LQ?
- [(n+1)/2]th point
Let m be the number of points strictly above or below the median
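- [(n+1)/2] + [(m+1)/2] th point if n is odd; in general, the [(m+1)/2] th point counting down from the largest value, i.e. the [n + 1 - (m+1)/2] th point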
- [(m+1)/2] th point
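A minimal sketch of this card, assuming the quartiles are taken as the medians of the m points below and the m points above the median. The function names are illustrative and the data needs at least two points.

```python
def median(sorted_values):
    """Median of an already-sorted list: the (n+1)/2 th point."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (LQ, median, UQ) using the m points either side of the median."""
    data = sorted(values)
    n = len(data)
    m = n // 2  # number of points strictly below (or above) the median
    lower_half = data[:m]
    upper_half = data[n - m:]
    return median(lower_half), median(data), median(upper_half)


print(quartiles([1, 2, 3, 4, 5, 6, 7]))     # (2, 4, 6)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 4.5, 6.5)
```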
What are the formulae for:
- s.d. of a population?
- s.d. of a sample?
- σ = sqrt(Sxx/n) = sqrt([Σ(x_i - x_bar)^2] / n) = sqrt([Σ(x_i^2) - n(x_bar)^2] / n)
- s = sqrt(Sxx/(n - 1)) = sqrt([Σ(x_i - x_bar)^2] / (n - 1)) = sqrt([Σ(x_i^2) - n(x_bar)^2] / (n - 1))
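As a quick check, the sketch below computes both standard deviations from Sxx for a small made-up data set and compares them with Python's statistics module. The data values are hypothetical.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set
n = len(data)
x_bar = sum(data) / n

# Sxx computed in the two equivalent ways
s_xx = sum((x - x_bar) ** 2 for x in data)
s_xx_alt = sum(x * x for x in data) - n * x_bar ** 2

sigma = math.sqrt(s_xx / n)    # population s.d. (divide by n)
s = math.sqrt(s_xx / (n - 1))  # sample s.d. (divide by n - 1)

print(sigma, statistics.pstdev(data))
print(s, statistics.stdev(data))
```

The sample value is always slightly larger than the population value because it divides by n - 1 instead of n.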
What are the formulae for:
- mean
- Sxx
when using grouped data?
- x_bar = Σ(f_i x_i) / Σ f_i
- Sxx = Σ(f_i x_i^2) - [(Σ f_i x_i)^2 / n], where n = Σ f_i
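A short sketch of the grouped-data formulae, using a made-up frequency table. The values and frequencies are hypothetical.

```python
values = [1, 2, 3, 4]  # x_i (hypothetical)
freqs = [2, 3, 4, 1]   # f_i (hypothetical)

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, values))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, values))

x_bar = sum_fx / n
s_xx = sum_fx2 - sum_fx ** 2 / n

print(x_bar)  # 2.4
print(s_xx)   # approximately 8.4
```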
- Where is the tail for a positively skewed distribution?
- Where is the tail for a negatively skewed distribution?
- On the right, extending away from the y-axis (towards higher values).
- On the left, close to the y-axis (towards lower values).
What are the two defining features of a normal distribution?
- Symmetrical
- Bell-shaped
If a variable X is coded to a variable Y such that y = ax + b, what are the mean and s.d. of y in terms of the mean and s.d. of x?
- y_bar = a(x_bar) + b
- s_y = |a| s_x
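The coding rules can be checked numerically. This sketch uses made-up data and a negative value of a to show why the modulus is needed.

```python
import statistics

x = [3, 5, 8, 10, 14]  # hypothetical data
a, b = -2, 7           # hypothetical coding y = ax + b
y = [a * xi + b for xi in x]

mean_x, sd_x = statistics.fmean(x), statistics.pstdev(x)
mean_y, sd_y = statistics.fmean(y), statistics.pstdev(y)

print(mean_y, a * mean_x + b)  # the two means agree
print(sd_y, abs(a) * sd_x)     # the s.d. scales by |a|, not a
```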
If two data sets, X and Y, of sizes n_x and n_y are combined, what are the new mean and variance?
- mean = (Σx + Σy) / (n_x + n_y)
- variance = [(Σx^2 + Σy^2) / (n_x + n_y)] - (mean)^2
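A small sketch of combining two data sets from their summary totals, checked against the variance of the pooled data. The data sets are hypothetical.

```python
import statistics

xs = [2, 4, 6]        # hypothetical data set X
ys = [1, 3, 5, 7, 9]  # hypothetical data set Y

n_total = len(xs) + len(ys)
sum_total = sum(xs) + sum(ys)
sum_sq_total = sum(x * x for x in xs) + sum(y * y for y in ys)

combined_mean = sum_total / n_total
combined_var = sum_sq_total / n_total - combined_mean ** 2

print(combined_mean)  # 4.625
print(combined_var, statistics.pvariance(xs + ys))
```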
What are the formulae for:
- E(X)
- E(f(X))
- E(aX + b)
- E(f(X) + g(X))
- Var(X)
- Var(aX + b)
- Σ_{i=1}^{n} x_i P(X = x_i) = μ
- Σ_{i=1}^{n} f(x_i) P(X = x_i)
- a E(X) + b
- E(f(X)) + E(g(X))
- E(X^2) - μ^2
- a^2 Var(X)
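These results can be verified exactly for a small distribution. The sketch below uses a fair six-sided die as a hypothetical example and exact fractions to avoid rounding.

```python
from fractions import Fraction

# Hypothetical distribution: a fair six-sided die
die = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(dist, f=lambda x: x):
    """E(f(X)) = sum of f(x_i) * P(X = x_i)."""
    return sum(f(x) * p for x, p in dist.items())


def variance(dist, f=lambda x: x):
    """Var(f(X)) = E(f(X)^2) - [E(f(X))]^2."""
    return expectation(dist, lambda x: f(x) ** 2) - expectation(dist, f) ** 2


mu = expectation(die)                               # 7/2
var = variance(die)                                 # 35/12
coded_mean = expectation(die, lambda x: 2 * x + 3)  # 2 * 7/2 + 3 = 10
coded_var = variance(die, lambda x: 2 * x + 3)      # 2^2 * 35/12 = 35/3

print(mu, var, coded_mean, coded_var)
```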
What are the formulae for:
- Var(X + Y)
- E(aX + bY)
- Var(aX + bY)
- Var(X) + Var(Y), provided X and Y are independent
- a E(X) + b E(Y)
- a^2 Var(X) + b^2 Var(Y), provided X and Y are independent
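What formulae are used to calculate expectation and variance:
- for multiple observations of the same variable?
- for scaling one observation by a factor?
- E(X1 + X2 + … + Xn) = n E(X) & Var(X1 + X2 + … + Xn) = n Var(X), for n independent observations
- E(nX) = n E(X) & Var(nX) = n^2 Var(X)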
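The difference between adding two independent observations and doubling one observation can be seen exactly with a fair die (a hypothetical example). Both have the same mean, but the variances differ.

```python
from fractions import Fraction
from itertools import product

die = {x: Fraction(1, 6) for x in range(1, 7)}  # hypothetical: fair die


def mean(dist):
    return sum(x * p for x, p in dist.items())


def var(dist):
    return sum(x * x * p for x, p in dist.items()) - mean(dist) ** 2


# Distribution of X1 + X2 for two independent rolls
sum_dist = {}
for (x1, p1), (x2, p2) in product(die.items(), repeat=2):
    sum_dist[x1 + x2] = sum_dist.get(x1 + x2, 0) + p1 * p2

# Distribution of 2X for a single roll
double_dist = {2 * x: p for x, p in die.items()}

print(mean(sum_dist), var(sum_dist))        # 7 and 35/6  (2 Var(X))
print(mean(double_dist), var(double_dist))  # 7 and 35/3  (2^2 Var(X))
```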
What are the conditions for a discrete uniform distribution?
Each value is equally likely to occur i.e. P(X= xi) = 1/n for i = 1, 2, 3, …, n
E(X) = a + [(n+1)/2] where a is one less than the lowest value which is included in the distribution
Var(X) = (n^2 - 1)/12
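A quick numerical check of the uniform formulae, assuming the values are the consecutive integers a + 1, …, a + n. The choice a = 2, n = 5 is hypothetical.

```python
import statistics

a, n = 2, 5                             # hypothetical: values 3, 4, 5, 6, 7
values = list(range(a + 1, a + n + 1))  # each value has probability 1/n

mean_formula = a + (n + 1) / 2
var_formula = (n ** 2 - 1) / 12

print(mean_formula, statistics.fmean(values))     # 5.0 5.0
print(var_formula, statistics.pvariance(values))  # both equal 2
```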
What are the conditions for a discrete geometric distribution?
- Outcome of each trial is either success or failure
- Trials are independent
- Prob. of success, p, is same for each trial
- X ~ Geo(p)
- P(X = r) = q^(r-1) × p, where q = 1 - p
- Mode is always 1
- P(X <= x) = 1 - q^x
- E(X) = 1/p
- Var(X) = q/p^2
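The geometric results can be checked numerically. This sketch uses a hypothetical p = 0.25 and approximates the infinite sums by truncating them after 2000 terms, by which point the remaining terms are negligible.

```python
def geo_pmf(r, p):
    """P(X = r): r - 1 failures followed by one success."""
    return (1 - p) ** (r - 1) * p


def geo_cdf(x, p):
    """P(X <= x) = 1 - q^x."""
    return 1 - (1 - p) ** x


p = 0.25  # hypothetical probability of success
q = 1 - p
terms = range(1, 2001)  # truncated stand-in for the infinite sum

mean_approx = sum(r * geo_pmf(r, p) for r in terms)
var_approx = sum(r * r * geo_pmf(r, p) for r in terms) - mean_approx ** 2

print(mean_approx, 1 / p)     # both close to 4
print(var_approx, q / p ** 2)  # both close to 12
```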
What are the conditions for the binomial model?
- Finite number of trials, n
- Each trial is a success or failure
- Prob. of success, p, is same for each trial
- Trials are independent
- Discrete random variable X gives the number of successful outcomes in n trials
- X ~ Bin(n,p)
- P(X = r) = nCr × q^(n-r) × p^r, where q = 1 - p
- E(X) = np
- Var(X) = npq
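A short sketch of the binomial probability function, checking that the probabilities sum to 1 and reproduce np and npq. The values n = 10 and p = 0.3 are hypothetical.

```python
from math import comb


def binom_pmf(r, n, p):
    """P(X = r) for X ~ Bin(n, p)."""
    return comb(n, r) * (1 - p) ** (n - r) * p ** r


n, p = 10, 0.3  # hypothetical parameters
q = 1 - p
probs = [binom_pmf(r, n, p) for r in range(n + 1)]

total = sum(probs)
mean = sum(r * pr for r, pr in enumerate(probs))
var = sum(r * r * pr for r, pr in enumerate(probs)) - mean ** 2

print(total)          # close to 1
print(mean, n * p)    # both close to 3
print(var, n * p * q)  # both close to 2.1
```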
What is mutual exclusivity?
P(A ∩ B) = 0
P(A ∪ B) = P(A) + P(B)
In general, what is the formula for P(A ∪ B)?
P(A) + P(B) - P(A ∩ B)
How can you show that event B happening is independent of event A happening?
By definition, P(B | A) = P(B ∩ A) / P(A)
If A and B are independent, P(A ∩ B) = P(A) × P(B)
Then: P(B | A) = [P(A) × P(B)] / P(A) = P(B)
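To illustrate, the sketch below enumerates two fair dice and checks the addition rule and independence for two hypothetical events: A (first die is even) and B (the total is 7).

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls


def prob(event):
    """Probability of an event given as a true/false function of an outcome."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))


def event_a(o):  # hypothetical event A: first die shows an even number
    return o[0] % 2 == 0


def event_b(o):  # hypothetical event B: the two dice total 7
    return o[0] + o[1] == 7


p_a, p_b = prob(event_a), prob(event_b)
p_a_and_b = prob(lambda o: event_a(o) and event_b(o))
p_a_or_b = prob(lambda o: event_a(o) or event_b(o))
p_b_given_a = p_a_and_b / p_a

print(p_a_or_b == p_a + p_b - p_a_and_b)  # True: addition rule
print(p_a_and_b == p_a * p_b)             # True: A and B are independent
print(p_b_given_a == p_b)                 # True: P(B | A) = P(B)
```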
How do you find μ & σ of a normal distribution when given two probabilities?
Let the given probabilities be:
P(X > a) = p1 and P(X > b) = p2
- Standardise each boundary: z1 = (a - μ)/σ and z2 = (b - μ)/σ
- Write each z in terms of Φ^(-1): z1 = Φ^(-1)(1 - p1) and z2 = Φ^(-1)(1 - p2)
- Equate the two expressions for z1 and the two expressions for z2.
- Eliminate μ or σ to solve for the other, then find the one you eliminated.
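The steps above can be sketched as follows, using `NormalDist().inv_cdf` from Python's statistics module in place of tables for the inverse normal function. The function name `solve_mu_sigma` and the example numbers are illustrative, and the sketch assumes a ≠ b with both probabilities strictly between 0 and 1.

```python
from statistics import NormalDist


def solve_mu_sigma(a, p1, b, p2):
    """Find (mu, sigma) given P(X > a) = p1 and P(X > b) = p2."""
    z1 = NormalDist().inv_cdf(1 - p1)  # P(Z > z1) = p1
    z2 = NormalDist().inv_cdf(1 - p2)  # P(Z > z2) = p2
    # a = mu + z1 * sigma and b = mu + z2 * sigma; subtract to eliminate mu
    sigma = (a - b) / (z1 - z2)
    mu = a - z1 * sigma
    return mu, sigma


# Hypothetical check: generate the two probabilities from a known N(50, 10^2)
known = NormalDist(mu=50, sigma=10)
p1 = 1 - known.cdf(60)  # P(X > 60)
p2 = 1 - known.cdf(45)  # P(X > 45)

mu, sigma = solve_mu_sigma(60, p1, 45, p2)
print(mu, sigma)  # close to 50 and 10
```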
How can we approximate the Binomial distribution with the Normal distribution and what are the conditions that make this a good approximation (including continuity corrections)?
If X ~ Bin(n, p), X can be approximated by Y ~ N(np, npq), where q = 1 - p.
Only if np > 5 and nq > 5.
Continuity corrections:
If a boundary is inclusive, the normal range extends 0.5 beyond it: 0.5 above an upper boundary or 0.5 below a lower boundary, e.g. P(X <= 7) ≈ P(Y < 7.5).
If a boundary is exclusive, the normal range stops 0.5 short of it: 0.5 below an upper boundary or 0.5 above a lower boundary, e.g. P(X < 7) ≈ P(Y < 6.5).
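A sketch comparing the exact binomial probability P(X <= 22) with its normal approximation, with and without the continuity correction. The parameters n = 40 and p = 0.5 are hypothetical and satisfy np > 5 and nq > 5.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 40, 0.5  # hypothetical; np = 20 and nq = 20
q = 1 - p

# Exact P(X <= 22) for X ~ Bin(n, p)
exact = sum(comb(n, r) * p ** r * q ** (n - r) for r in range(0, 23))

# Normal approximation Y ~ N(np, npq); NormalDist takes the s.d., not the variance
normal = NormalDist(mu=n * p, sigma=sqrt(n * p * q))
corrected = normal.cdf(22.5)  # inclusive upper boundary: go 0.5 above
uncorrected = normal.cdf(22)  # no continuity correction

print(exact, corrected, uncorrected)
```

The corrected value lands much closer to the exact probability than the uncorrected one.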
What are the steps of a normal/binomial hypothesis test?
- Define the variable, stating n but leaving p unknown, and state the assumptions that make the trials independent.
- State hypotheses & distribution according to H0.
- State level (%) and type (one/two-tailed) as well as rejection criterion e.g. ‘The test value, x, will lie in the critical region if P(X >= x) < 5%.’
- Calculate required probability and make conclusion.
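As an illustration of the last two steps, here is a sketch of a one-tailed binomial test. The scenario is hypothetical: H0: p = 0.5, H1: p > 0.5, n = 20 trials, 15 successes observed, 5% significance level.

```python
from math import comb


def binom_upper_tail(x, n, p):
    """P(X >= x) for X ~ Bin(n, p)."""
    return sum(comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(x, n + 1))


n = 20         # hypothetical number of trials
p0 = 0.5       # value of p under H0
observed = 15  # hypothetical test value
level = 0.05   # 5% significance level, one-tailed

p_value = binom_upper_tail(observed, n, p0)
reject_h0 = p_value < level

print(p_value)    # about 0.0207
print(reject_h0)  # True: 15 lies in the critical region
```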
When is taking a census impractical and why?
When the population size is large.
It is time-consuming, expensive and difficult to carry out accurately.
What are the advantages of taking a sample survey?
Can get data quickly and cheaply
Can give accurate indications if sample is representative
What are some sources of bias?
- Bad sampling frame
- Wrong sampling unit
- Non-response by some of the units
- Bias from person conducting survey
What is simple random sampling?
Assigning a number to every unit in the sampling frame and picking numbers randomly, without replacement.
What is systematic sampling?
List the population in some order and choose every kth member after picking a random starting point.
What is stratified sampling?
Splitting the population into strata, e.g. age groups, and taking a simple random sample from each stratum in proportion to its size.
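The simple random, systematic and stratified methods could be sketched as below. The function names, the numbered sampling frame and the strata are all hypothetical, and the stratified version simply rounds each stratum's share, so the total may differ slightly from k when the shares are not whole numbers.

```python
import random


def simple_random_sample(frame, k, rng):
    """Pick k distinct units at random, without replacement."""
    return rng.sample(frame, k)


def systematic_sample(frame, k, rng):
    """Random starting point, then every step-th unit."""
    step = len(frame) // k
    start = rng.randrange(step)
    return frame[start::step][:k]


def stratified_sample(strata, k, rng):
    """Simple random sample from each stratum, proportional to its size."""
    total = sum(len(units) for units in strata.values())
    sample = []
    for units in strata.values():
        share = round(k * len(units) / total)  # rounded proportional share
        sample.extend(rng.sample(units, share))
    return sample


rng = random.Random(0)    # seeded so the run is repeatable
frame = list(range(100))  # hypothetical sampling frame of 100 numbered units
strata = {"under_30": frame[:60], "30_plus": frame[60:]}  # hypothetical strata

srs = simple_random_sample(frame, 10, rng)
systematic = systematic_sample(frame, 10, rng)
stratified = stratified_sample(strata, 10, rng)
print(srs, systematic, stratified, sep="\n")
```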
What is cluster sampling?
Population naturally split into clusters. Random sampling used to determine which clusters to sample, and then sample within each chosen cluster.
What is quota sampling?
Population split into subgroups and a certain number (quota) from each subgroup is chosen, not necessarily randomly. Used if no sampling frame is available.
What are the conditions for a Poisson distribution?
- Events occur singly and at random in a given interval of time or space.
- λ, the mean number of occurrences in the given interval, is known and finite.
- X, the number of occurrences in the given interval, is modelled as X ~ Po(λ)
- P(X = x) = e^(-λ) × [λ^x / x!]
- E(X) = λ & Var(X) = λ
- If λ is an integer, there are two modes, λ - 1 & λ
- If λ is not an integer, the mode is the integer below λ
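A small sketch of the Poisson probability function that also checks the two statements about the mode. The values of λ and the search limit `upto` are hypothetical choices.

```python
from math import exp, factorial, isclose


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam)."""
    return exp(-lam) * lam ** x / factorial(x)


def modes(lam, upto=50):
    """Values of x in 0..upto-1 with the highest probability."""
    probs = [poisson_pmf(x, lam) for x in range(upto)]
    best = max(probs)
    return [x for x, pr in enumerate(probs) if isclose(pr, best)]


print(modes(3))    # [2, 3]: integer λ gives two modes, λ - 1 and λ
print(modes(2.5))  # [2]: non-integer λ gives the integer below λ

mean_approx = sum(x * poisson_pmf(x, 3) for x in range(50))
print(mean_approx)  # close to 3
```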
When and how can we approximate the Binomial distribution with a Poisson distribution?
When n is large (> 50) and p is small (< 0.1): if X ~ Bin(n, p), then X ~ Po(np) is an appropriate approximation.
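To see how close the approximation is, this sketch compares Bin(100, 0.03) with Po(3) term by term. The parameters are hypothetical but meet the conditions on the card.

```python
from math import comb, exp, factorial

n, p = 100, 0.03  # hypothetical: n > 50 and p < 0.1
lam = n * p


def binom_pmf(r):
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


def poisson_pmf(r):
    return exp(-lam) * lam ** r / factorial(r)


for r in range(6):
    print(r, round(binom_pmf(r), 4), round(poisson_pmf(r), 4))

# Largest difference between the two models over r = 0..10
max_gap = max(abs(binom_pmf(r) - poisson_pmf(r)) for r in range(11))
print(max_gap)
```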
If X~Po(λ) and Y~Po(μ), what is the distribution of X + Y (assuming X and Y are independent)?
X + Y ~ Po(λ + μ)
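This result can be checked numerically by building the distribution of X + Y from the two separate distributions. The values λ = 1.5 and μ = 2 are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(x, mean):
    return exp(-mean) * mean ** x / factorial(x)


lam, mu = 1.5, 2.0  # hypothetical means of X and Y


def sum_pmf(k):
    """P(X + Y = k) for independent X and Y: sum over every split of k."""
    return sum(poisson_pmf(i, lam) * poisson_pmf(k - i, mu) for i in range(k + 1))


max_gap = max(abs(sum_pmf(k) - poisson_pmf(k, lam + mu)) for k in range(15))
print(max_gap)  # effectively zero
```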
When is the least squares regression line x on y used instead of y on x?
- When neither variable is controlled (independent) and you want to interpolate a value of x for a given value of y
- When y is the independent variable and you want to interpolate either x given y or y given x.
- y on x is used in the opposite of each of these situations.
How to carry out a significance test for PMCC?
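- H0 : ρ = 0; H1 : ρ > 0 or ρ < 0 for a 1-tailed test (depending on whether you are testing for a +ve or -ve correlation), or ρ ≠ 0 for a 2-tailed test
- State significance level and read critical value from tables.
- Reject H0 if r is more extreme than the critical value (r > critical value when testing for +ve correlation, r < -critical value for -ve correlation, |r| > critical value for a 2-tailed test) and make conclusion.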
How to carry out a Spearman’s Rank hypothesis test?
- State H0 (ρs = 0) & H1 (ρs > or < or ≠ 0)
- State level and type of test.
- State rejection criterion (sample size is n, from tables critical value is ‘a’, so reject H0 if rs > ‘a’)
- Calculate rs :
- Rank all points by one metric and then by the other. If k values of a metric are tied, give each the average of the ranks they occupy: the lowest of those ranks + [(k-1)/2]
- For each point, square the difference between its two ranks
- Apply the formula in the formula booklet
- Make conclusion.
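The calculation of rs could be sketched as below. This assumes the booklet formula is the usual rs = 1 - 6Σd^2 / [n(n^2 - 1)], which the card does not state explicitly. The function names and data are hypothetical, and the critical value still has to come from tables.

```python
def average_ranks(values):
    """Rank values from 1 upwards, giving tied values the average of their ranks."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1  # lowest rank occupied by this value
        count = ordered.count(v)      # how many values share it
        ranks.append(first + (count - 1) / 2)
    return ranks


def spearman(xs, ys):
    """rs = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


print(average_ranks([10, 20, 20, 30]))               # [1.0, 2.5, 2.5, 4.0]
print(spearman([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]))    # 0.8
print(spearman([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))   # -1.0
```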