AS Stats Flashcards
define population
the whole set of items that are of interest
define census
observes or measures every member of a population
what is the advantage of using the census?
it should give a completely accurate result
what are the disadvantages of using the census?
time consuming & expensive
cannot be used if the testing process destroys the item
difficult to process a large quantity of data
define sample
a selection of observations taken from a subset of the population, which is used to find out information about the whole population
what are the advantages of using a sample?
less time consuming & expensive than the census
fewer people have to respond
less data to process than a census
what are the disadvantages of using a sample?
data might not be as accurate
sample might not be large enough to give info about small subsets of the population
define sampling units
individual units of a population
define sampling frame
a list of individually named or numbered sampling units of a population
how does sample size affect the validity of the conclusions?
sample size depends on required accuracy & resources
larger sample sizes are more accurate
a varied population requires a larger sample than a uniform population
different samples produce differing results due to natural variation within populations
what are the 3 types of random sampling?
simple random
systematic
stratified
define simple random sampling
every sample of size n has an equal chance of being selected
need a sampling frame
what are advantages of simple random sampling?
no bias
easy & cheap for small sample
each sampling unit has a known & equal chance of selection
what are disadvantages of simple random sampling?
not suitable for large samples/populations bc time consuming, disruptive & expensive
need sampling frame
define systematic (random) sampling
the required elements are chosen at regular intervals from an ordered list
what are advantages of systematic sampling?
simple & quick to use
suitable for large samples/populations
what are disadvantages of systematic sampling?
need sampling frame
can be biased if sampling frame is not random
define stratified (random) sampling
population is divided into mutually exclusive strata & a random sample is taken from each
what are advantages of stratified sampling?
sample accurately reflects the population structure
guarantees proportional representation of groups within the population
what are disadvantages of stratified sampling?
population must be clearly classified into distinct strata
selection within each stratum is random so same disadvantages as random
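A minimal Python sketch of the three random sampling methods above. The sampling frame (units numbered 1 to 100), the two year-group strata and the helper names are made up for illustration; the stratified version assumes each stratum's share of the sample is (stratum size / population size) x sample size.

```python
import random

def simple_random_sample(frame, n, rng):
    # every sample of size n from the frame is equally likely
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    # take every k-th unit from the ordered list, after a random start
    k = len(frame) // n
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(strata, n, rng):
    # strata: dict of stratum name -> list of units (mutually exclusive groups)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        size = round(n * len(units) / total)  # proportional share of the sample
        sample[name] = rng.sample(units, size)
    return sample

rng = random.Random(0)            # seeded so the run is repeatable
frame = list(range(1, 101))       # hypothetical sampling frame of 100 numbered units
srs = simple_random_sample(frame, 10, rng)
sys_sample = systematic_sample(frame, 10, rng)
strata = {"year12": list(range(60)), "year13": list(range(60, 100))}
strat = stratified_sample(strata, 10, rng)
```

With 60 units in one stratum and 40 in the other, a sample of 10 takes 6 and 4 respectively.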
what are the 2 types of non-random sampling?
quota
opportunity
define quota sampling
researcher selects a sample that reflects the characteristics of the whole population
what are advantages of quota sampling?
allows a small sample to be representative of the population
no sampling frame needed
quick, easy & cheap
easy comparison b/w different groups within population
what are disadvantages of quota sampling?
non-random can introduce bias
population must be divided into groups - expensive or inaccurate
increasing the scope of the study increases the # of groups, which increases time & cost
non-responses not recorded
define opportunity/convenience sampling
take sample from people available at the time of study & who fit the criteria
what are advantages of opportunity sampling?
easy
cheap
what are disadvantages of opportunity sampling?
unlikely to be representative of the population
dependent on individual researcher
define quantitative variables
variables/data associated with numerical observations
define qualitative variables
variables/data associated with non-numerical observations
define continuous variable
can take any value within a given range
define discrete variable
can take only specific values within a given range
grouped frequency table
data is grouped into classes
class boundaries show max. & min. values in each class
midpoint is the average of the upper & lower class boundaries
class width is the difference b/w the upper & lower class boundaries
when is it best to use mean, median or mode?
mean: quantitative data with no extreme values
median: quantitative data with extreme values
mode: qualitative or quantitative data with 1 or 2 modes
what is the formula for the mean & for mean of data in frequency table?
Σx / n
n = Σf
Σxf / Σf
how do you calculate median from frequency table?
arrange data points in ascending order
add 1 to the # of data points then divide by 2
the median is the (n + 1) / 2 th data point, where n = Σf (same as position Σf / 2 + 0.5)
how do you calculate the mode from frequency table?
x value with the highest frequency
value that appears the most
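A short Python sketch of the mean, median & mode from a frequency table, using the formulae above. The table values are invented for illustration.

```python
import statistics

# hypothetical frequency table: x value -> frequency f
table = {1: 2, 2: 5, 3: 2, 4: 1}

n = sum(table.values())                              # n = Σf
mean = sum(x * f for x, f in table.items()) / n      # Σxf / Σf

# write the data out in ascending order, repeating each x value f times
data = sorted(x for x, f in table.items() for _ in range(f))
median_position = (n + 1) / 2                        # (n + 1) / 2 th data point
median = statistics.median(data)                     # averages the two middle values when n is even

mode = max(table, key=table.get)                     # x value with the highest frequency
```

Here n = 10, so the median sits at position 5.5, halfway b/w the 5th & 6th data points.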
how do you calculate mean, median & mode from grouped frequency table?
mean: Σ(midpoint x f) / Σf
median: linear interpolation
or Σf / 2 is the number of the value & see what class it is in
mode: class with the highest f
linear interpolation if specified
what are the other measures of location?
Q1 - lower quartile (first 25% of data)
Q2 - median (first 50% of data)
Q3 - upper quartile (first 75% of data)
P10 - 10th percentile (first 10% of data)
how do you calculate the location of Q1, Q2 & Q3 for discrete data?
Q2: (Σf + 1) / 2
Q1: 1/4 x Σf
Q3: 3/4 x Σf
if whole number, Q1/Q3 is halfway b/w this data point & one above
if not whole number, round up & Q1/Q3 is this data point
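The whole number / round up rule can be sketched in Python as below. The function name and data are made up; it treats Q2 with the same rule at position Σf / 2, which lands on the same data point as (Σf + 1) / 2.

```python
def quartile(sorted_data, quarters):
    # quarters = 1 for Q1, 2 for Q2, 3 for Q3
    n = len(sorted_data)
    numerator = quarters * n               # position is numerator / 4
    if numerator % 4 == 0:
        k = numerator // 4                 # whole number: halfway b/w k-th point & one above
        return (sorted_data[k - 1] + sorted_data[k]) / 2
    k = numerator // 4 + 1                 # not whole: round up & take that data point
    return sorted_data[k - 1]

even_data = [2, 3, 5, 7, 8, 10, 12, 13]        # n = 8 (hypothetical)
odd_data = [2, 3, 5, 7, 8, 10, 12, 13, 15]     # n = 9 (hypothetical)

q1_even = quartile(even_data, 1)   # position 2 is whole: halfway b/w 2nd & 3rd points
q3_even = quartile(even_data, 3)   # position 6 is whole: halfway b/w 6th & 7th points
q1_odd = quartile(odd_data, 1)     # position 2.25 rounds up to the 3rd point
q3_odd = quartile(odd_data, 3)     # position 6.75 rounds up to the 7th point
```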
what is the assumption made by using linear interpolation?
that the data is evenly distributed within each class
what is the formula for linear interpolation?
GLB + (PV/GF x CW)
lower bound of class + (place value/group frequency x class width)
place value - how much you have to count up to get into that class
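A minimal Python sketch of the interpolation formula, assuming the position used for a grouped table is a fraction of Σf (Σf / 2 for the median). The classes and frequencies are invented for illustration.

```python
def interpolate(position, classes):
    # classes: list of (lower bound, upper bound, frequency) in ascending order
    cumulative = 0
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= position:
            place_value = position - cumulative      # how far to count up into this class
            class_width = upper - lower
            return lower + (place_value / freq) * class_width
        cumulative += freq
    raise ValueError("position is larger than the total frequency")

# hypothetical grouped frequency table
classes = [(0, 10, 4), (10, 20, 10), (20, 30, 6)]
total = sum(freq for _, _, freq in classes)          # Σf = 20

median = interpolate(total / 2, classes)             # 10th value: 10 + (6/10 x 10)
q1 = interpolate(total / 4, classes)                 # 5th value: 10 + (1/10 x 10)
```

The 10th value is 6 places into the 10 to 20 class, so the estimate is 16.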
what are 3 ways of measuring spread of data & define them?
range - difference b/w largest & smallest values in the data set
interquartile range (IQR) - Q3 - Q1, the difference b/w Q3 & Q1
interpercentile range (IPR) - difference b/w the values for 2 given percentiles
what are the other ways of measuring spread, define & formulae?
variance - each point deviates from the mean by: x - x̄
Sxx/n
Sxx is in FB
standard deviation - square root of variance
see FB
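A small Python check of variance & standard deviation, assuming the summary form Sxx = Σx² - (Σx)² / n, which equals the sum of squared deviations Σ(x - x̄)². The data set is made up.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data
n = len(data)
mean = sum(data) / n

# Sxx two ways: from the deviations, and from the summary statistics
sxx_deviations = sum((x - mean) ** 2 for x in data)
sxx_summary = sum(x ** 2 for x in data) - sum(data) ** 2 / n

variance = sxx_summary / n          # Sxx / n
sd = math.sqrt(variance)            # square root of variance
```

For this data Σx = 40 & Σx² = 232, so Sxx = 32, variance = 4 & sd = 2.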
what is the formula for coded data?
y = (x - a) / b
what is the formula for the mean of coded data?
ȳ = (x̄ - a) / b
what is the formula for standard deviation of coded data?
σy = σx / b
how does coding affect the mean & sd?
the code is applied directly to the mean
sd is only impacted by b
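A quick Python check of the coding rules, with an invented data set & the code y = (x - 100) / 10 (a = 100 & b = 10 are assumptions for the example).

```python
import statistics

x = [110, 120, 130, 140, 150]       # hypothetical original data
a, b = 100, 10                      # assumed coding constants

y = [(value - a) / b for value in x]            # coded data: 1, 2, 3, 4, 5

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sd_x, sd_y = statistics.pstdev(x), statistics.pstdev(y)   # population sd (divide by n)

mean_rule_holds = abs(mean_y - (mean_x - a) / b) < 1e-9   # code applied directly to the mean
sd_rule_holds = abs(sd_y - sd_x / b) < 1e-9               # only b changes the sd
```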
how do you draw a box plot?
see notes sheet
needs scale
x = outlier
how are box plots interpreted?
comparison of position of median
what are the formulae for an outlier?
outlier < Q1 - kIQR
outlier > Q3 + kIQR
or, if the Q defines it this way: any value outside mean ± 2σ
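A sketch of the IQR rule in Python. The cards leave k unspecified (it is given in the Q), so k = 1.5 here is only an assumed example value, as are the quartiles.

```python
def is_outlier(value, q1, q3, k):
    # outlier if below Q1 - k x IQR or above Q3 + k x IQR
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr

q1, q3, k = 10, 20, 1.5             # hypothetical quartiles & assumed k
lower_fence = q1 - k * (q3 - q1)    # 10 - 15 = -5
upper_fence = q3 + k * (q3 - q1)    # 20 + 15 = 35

flags = {value: is_outlier(value, q1, q3, k) for value in (-6, 0, 35, 36)}
```

A value exactly on a fence (35 here) is not an outlier because the inequalities are strict.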
how do you compare measures of location & spread?
location:
1. compare the means or medians
2. e.g. so people in set A have to travel further than set B on average
spread:
1. compare the standard deviations, variance, range or IQR
2. so there is more/less variability in data set A than data set B
define outlier
an extreme value that lies outside of the pattern of data
it is mathematically defined
define anomalies
result caused by error
it is removed from the data set (= cleaning)
what are the key aspects of a cumulative frequency graph?
start at frequency 0
continuous & CW doesn’t need to be equal
join w smooth curve through all points
points plotted at the upper class boundary of each class
why is a CF graph better than linear interpolation when estimating quartiles & percentiles?
it doesn’t assume even distribution within class
what are the key aspects of a histogram?
area of the bar is proportional to the frequency
x: class width (may not be =), continuous variable
y: f density
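A small Python sketch of histogram bar heights, assuming frequency density = frequency / class width so that each bar's area equals its frequency. The classes are invented & deliberately have unequal widths.

```python
# hypothetical classes: (lower boundary, upper boundary, frequency)
classes = [(0, 10, 5), (10, 15, 10), (15, 30, 15)]

densities = [freq / (upper - lower) for lower, upper, freq in classes]

# area of each bar = class width x frequency density, which recovers the frequency
areas = [(upper - lower) * d for (lower, upper, _), d in zip(classes, densities)]
```

The 10 to 15 class has the tallest bar even though 15 to 30 has the highest frequency, bc it is much narrower.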
what is a frequency polygon?
joining the middle of the top of each bar on a histogram with equal class widths
mean & sd compared
median & IQR compared
cannot mix up bc…
mean & sd more affected by outliers than median & IQR
mixed up are not comparable
what is the difference b/w correlation & causation?
correlation: pattern/trend b/w data sets
causation: one variable is directly impacted by the other variable
define bivariate data
data that has pairs of values for 2 variables
what relationship does correlation assume?
linear
always say ‘linear correlation’
what are the types of correlation?
+ve
-ve
strong
weak
none
describe regression line
least squares regression line b/w bivariate data
= straight line that minimises the sum of the squares of the vertical distances of each point from the line
y=a+bx
gradient of line will be +ve for +ve linear correlation & -ve for -ve linear correlation
can only be used to find y from x not x from y
how can you interpret correlation of the data?
r (product moment correlation coefficient) measures how close the data is to a straight line
-1≤ r ≤ 1
r = 0: no linear correlation
r closer to -1: stronger -ve linear correlation
r closer to +1: stronger +ve linear correlation
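A minimal Python sketch of the regression line y = a + bx & of r, assuming the usual summary formulae b = Sxy / Sxx, a = ȳ - b x̄ & r = Sxy / √(Sxx Syy). The bivariate data is made up; on a calculator these values come straight from the statistics mode.

```python
import math

# hypothetical bivariate data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx                       # gradient of the regression line of y on x
a = mean_y - b * mean_x             # intercept
r = sxy / math.sqrt(sxx * syy)      # between -1 & 1

prediction = a + b * 3.5            # interpolation: 3.5 is inside the range of x
```

Here b & r are both +ve, matching +ve linear correlation.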
interpolation vs extrapolation of linear regression line
interpolation - extracting/predicting value from inside range of data
extrapolation - predicting value from outside the range of data = do not do bc less reliable
what variable can you predict using linear regression line?
dependent only
how would you predict IV from linear regression line?
use regression line of x on y = map it the other way round
define experiment
repeatable process that gives rise to a number of outcomes (results)
define event
collection of one or more outcomes
define sample space
set of all possible outcomes
venn diagrams
table
tree diagram
define equally likely
same probability of outcome
P(event) = # of outcomes in the event / total # of possible outcomes
what 2 ways can probability be calculated?
sample space e.g. venn diagram, table, tree diagram
linear interpolation - for continuous data/grouped frequency table
rules for venn diagrams
fill from middle outwards
assign a value to central intersection - if unknown, put x
shade intersection, union & complement on venn diagram
what are the notations?
union: A or B or both
see notes
define mutually exclusive & what is the formula?
events that cannot happen at the same time
P(A n B) = 0
P(A u B) = P(A) + P(B)
define independent & what is the formula?
the outcome of one event does not affect the outcome of the others
the probability of one event is not impacted by the probability of another event
(probability of A happening is the same whether or not B happens)
P(A n B) = P(A) x P(B)
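A quick Python check of both formulae on one roll of a fair die. The events A, B & C are invented for illustration, & exact fractions are used to avoid rounding.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (assumed example)

def prob(event):
    # equally likely outcomes: # of outcomes in the event / total # of outcomes
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}    # even number
B = {1, 2}       # 1 or 2
C = {1, 3}       # 1 or 3

# independent: P(A n B) = P(A) x P(B), here 1/6 = 1/2 x 1/3
independent_AB = prob(A & B) == prob(A) * prob(B)

# mutually exclusive: P(A n C) = 0, so P(A u C) = P(A) + P(C)
exclusive_AC = prob(A & C) == 0
addition_rule_AC = prob(A | C) == prob(A) + prob(C)
```

A & C are mutually exclusive but not independent, since P(A) x P(C) is not 0.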
what is the assumption for tree diagrams?
an object is not replaced
define random variable
variable whose value depends on the outcome of a random event
notation for random variable & outcome
X - random variable
x - a particular value that X can take
sum of the probabilities of all outcomes of a random variable
1
what are the types of probability distribution?
probability mass function:
P(X=x) = 1/6, x = 1,2,3,4,5,6
table
diagram
what is a uniform discrete probability distribution?
every outcome has the same probability
fixed numerical values
what is a cumulative probability function?
tells you the sum of all individual probabilities up to & including x in the calculation for P(X≤x)
binomial distribution
X ~ B(n,p)
B - 2 possible outcomes (success & failure)
n - fixed number of trials
p - fixed probability of success on each trial
outcomes are independent
what is the probability mass function of random variable X, which has binomial distribution
P(X = r) = nCr p^r (1-p)^(n-r)
see notes booklet
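The binomial probability mass function can be sketched in Python as below, using X ~ B(10, 0.5) as an assumed example.

```python
from math import comb

def binomial_pmf(r, n, p):
    # P(X = r) = nCr x p^r x (1 - p)^(n - r)
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 10, 0.5                       # hypothetical distribution X ~ B(10, 0.5)
p_five = binomial_pmf(5, n, p)       # 252 / 1024
total = sum(binomial_pmf(r, n, p) for r in range(n + 1))   # all probabilities sum to 1
```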
how do you find constant k in random variable probability Qs?
use the formula they give & sum all the probabilities to 1
then solve for k
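A worked sketch in Python, assuming a made-up probability function P(X = x) = kx for x = 1, 2, 3, 4.

```python
from fractions import Fraction

values = [1, 2, 3, 4]                # hypothetical outcomes
# P(X = x) = kx, so the probabilities sum to k(1 + 2 + 3 + 4) = 10k = 1
k = Fraction(1, sum(values))

pmf = {x: k * x for x in values}     # 1/10, 2/10, 3/10, 4/10
total = sum(pmf.values())
```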
define population parameter
the quantity describing the population (e.g. p in B(n,p)) that is being tested
define test statistic
the actual result of doing the experiment
define null & alternative hypothesis
null: H0
the hypothesis that you assume to be correct
alternative: H1
tells you your assumption about the population parameter is wrong
one-tailed vs two-tailed tests
one-tailed - one direction
H1: p>… or H1: p<….
two-tailed - 2 directions
H1: p≠…
define significance level
threshold probability decided before the experiment: reject H0 if a result at least as extreme as the test statistic has probability below it
what is the critical region & what is the critical value:
critical region is the region of the probability distribution which, if the test statistic falls inside it, would cause you to reject the null hypothesis
critical value is the first value to be inside the critical region
define actual significance level
probability of incorrectly rejecting H0
the actual probability of critical region
P(X≤CV) or P(X≥CV)
what is the critical region for a two-tailed test?
2 parts - half at each end of the distribution
what are the 2 methods for conducting a hypothesis test?
- probability of test statistic
- calculate critical region & compare test statistic
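Both methods can be sketched in Python for a one-tailed (lower tail) binomial test. The test set-up is an assumption for illustration: H0: p = 0.5, H1: p < 0.5, n = 20 trials, 5% significance level & an observed value of 4.

```python
from math import comb

def binomial_cdf(c, n, p):
    # P(X <= c) for X ~ B(n, p)
    return sum(comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(c + 1))

def lower_critical_value(n, p, alpha):
    # largest c with P(X <= c) <= alpha, or None if there is no such value
    cv = None
    for c in range(n + 1):
        if binomial_cdf(c, n, p) <= alpha:
            cv = c
        else:
            break
    return cv

n, p0, alpha, observed = 20, 0.5, 0.05, 4     # hypothetical test set-up

# method 1: probability of the test statistic (or more extreme)
p_value = binomial_cdf(observed, n, p0)
reject_by_probability = p_value <= alpha

# method 2: critical region, then compare the test statistic
cv = lower_critical_value(n, p0, alpha)       # critical region is X <= cv
actual_significance = binomial_cdf(cv, n, p0) # P(X <= CV)
reject_by_region = observed <= cv
```

Here P(X ≤ 5) is about 0.0207 & P(X ≤ 6) is about 0.0577, so the critical region is X ≤ 5 & both methods reject H0 for an observed value of 4.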
structure
see notes sheet