Statistics Flashcards
discrete vs continuous data
discrete:
- can only take certain distinct values, e.g. shoe size
continuous:
- can take any value within a range, e.g. height
definition:
population
total set of possible values that could be selected for the sample
definition
sampling unit
a single member of the population
definition
sample
a selection of sampling units observed to make conclusions about population as a whole
definition
sampling frame
a list of all members of the population
advantages and disadvantages:
sample
advantages
- less time consuming/ expensive
- fewer people to respond
- less data to process than census
disadvantages:
* data may not be as accurate as census
* may not be large enough to give info about small sub groups of population
dis/advantages
census
pros
* should give accurate results
cons
* time consuming / expensive
* can’t be used when testing process destroys the item
* hard to process large quantity of data
Systematic sampling definition
A sample is formed by choosing members of a population at regular intervals using a list
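Worked sketch (Python), assuming a hypothetical sampling frame of 20 numbered members and a sample of 5. It takes every k-th member, with k = population size / sample size. The random starting point within the first k members is an assumption here (it is the usual approach, not stated on the card), and the function name is made up for illustration.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick n members at regular intervals from a list (the sampling frame)."""
    k = len(frame) // n               # interval between chosen members
    rng = random.Random(seed)
    start = rng.randrange(k)          # assumed: random start among the first k members
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 21))            # hypothetical frame: members numbered 1..20
sample = systematic_sample(frame, 5, seed=0)
print(sample)                         # 5 members, each 4 places apart in the list
```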
stratified sampling
- population divided into specific groups (strata) & random sample taken from each group
- number sampled from each group = (group size / total population N) x sample size n, so each group's share of the sample equals its share of the population
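Worked sketch (Python) of the proportion rule: number taken from a group = group size x n / N. The year groups and their sizes are hypothetical, and so is the function name.

```python
def stratum_sample_sizes(strata_sizes, n):
    """Number to sample from each group = group size * n / N."""
    N = sum(strata_sizes.values())    # total population
    return {name: size * n / N for name, size in strata_sizes.items()}

# hypothetical population of 200 split into three groups, sample size 40
sizes = stratum_sample_sizes({"Year 10": 100, "Year 11": 60, "Year 12": 40}, n=40)
print(sizes)                          # {'Year 10': 20.0, 'Year 11': 12.0, 'Year 12': 8.0}
```

If a result is not a whole number it would need rounding so that the group totals still add up to n.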
pros and cons of stratified sampling
PROS
* useful when very different groups in population
* sample representative of population structure
* members selected randomly
CONS
* can’t be used if not possible to split population into specific groups
* same cons as simple random
opportunity sampling
sample is formed using available members of population who fit criteria
Pros and cons of opportunity sampling
PROS
* Quick and easy
* useful when list of population not possible
CONS
* unlikely to be representative of population structure
* likely to produce biased results
pros and cons of quota sampling
PROS
* useful when sampling frame not available
* sample will be representative of population structure
CONS
* may introduce bias as some members of the population may choose not to be sampled
in a data set
outliers are
any data points more than 2 standard deviations above or below the mean
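Worked sketch (Python) of the 2 standard deviation rule. The data values are hypothetical, and using the population standard deviation (dividing by n, `statistics.pstdev`) is an assumption.

```python
from statistics import mean, pstdev

def sd_outliers(data, k=2):
    """Values more than k standard deviations away from the mean."""
    m = mean(data)
    s = pstdev(data)                  # assumed: population standard deviation
    return [x for x in data if abs(x - m) > k * s]

data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 30]   # hypothetical data set
print(sd_outliers(data))              # [30]
```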
in a box plot
outliers are
any data point that is more than 1.5 x IQR above the upper quartile or below the lower quartile
how to work out estimated mean in a frequency table
- mid interval value (x)
- frequency (f)
- estimated mean = Σfx / Σf
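Worked sketch (Python) of the estimated mean Σfx / Σf, with x taken as the mid interval value of each class. The frequency table is hypothetical.

```python
def estimated_mean(classes):
    """classes: list of ((lower, upper), frequency) pairs."""
    total_fx = sum((lo + hi) / 2 * f for (lo, hi), f in classes)   # sum of f * midpoint
    total_f = sum(f for _, f in classes)                           # sum of f
    return total_fx / total_f

# hypothetical grouped data: 0-10 (f=4), 10-20 (f=10), 20-30 (f=6)
table = [((0, 10), 4), ((10, 20), 10), ((20, 30), 6)]
print(estimated_mean(table))          # 16.0
```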
coding
measure of location is affected by:
measure of spread is affected by:
measure of location is affected by: all operations
measure of spread is affected by: only multiplication or division
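Worked sketch (Python) of the coding rule, assuming a hypothetical coding y = (x - 3) / 2 and using the mean as the measure of location and the standard deviation as the measure of spread.

```python
from statistics import mean, pstdev

x = [2, 4, 6, 8]                      # hypothetical raw data
y = [(v - 3) / 2 for v in x]          # hypothetical coding: subtract 3, then divide by 2

# location (mean): affected by both the subtraction and the division
print(mean(x), mean(y))               # mean(y) = (mean(x) - 3) / 2
# spread (standard deviation): affected only by the division
print(pstdev(x), pstdev(y))           # pstdev(y) = pstdev(x) / 2
```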
linear interpolation
what do you do to the value when finding quartiles / percentiles for discrete data?
- position (e.g. n/4 for Q1) is a decimal number: round up and take that data value
- position is a whole number: take average of that data value and the next one
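Worked sketch (Python) of the rule above, assuming the position is k x n / 4 for the k-th quartile (or k x n / 100 for a percentile). The function name and data are hypothetical.

```python
def discrete_quantile(data, k, parts=4):
    """k=1, parts=4 gives Q1; k=3 gives Q3; parts=100 gives percentiles."""
    if not 0 < k < parts:
        raise ValueError("k must be between 1 and parts - 1")
    values = sorted(data)
    whole, remainder = divmod(k * len(values), parts)   # position = k * n / parts
    if remainder:
        # position is a decimal: round up (1-based position whole + 1)
        return values[whole]
    # position is a whole number: average that value and the next one
    return (values[whole - 1] + values[whole]) / 2

eight = [1, 3, 4, 6, 7, 9, 10, 12]    # n = 8, Q1 position = 2 (whole number)
ten = list(range(1, 11))              # n = 10, Q1 position = 2.5 (decimal)
print(discrete_quantile(eight, 1))    # 3.5
print(discrete_quantile(ten, 1))      # 3
```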
How to work out outliers?
if not in the range:
Q1 - 1.5 x IQR to Q3 + 1.5 x IQR
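Worked sketch (Python) of the outlier range, taking Q1 and Q3 as given. The quartiles and data are hypothetical.

```python
def iqr_fences(q1, q3):
    """Lower and upper limits: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def iqr_outliers(data, q1, q3):
    """Values that fall outside the range between the two limits."""
    low, high = iqr_fences(q1, q3)
    return [x for x in data if x < low or x > high]

# hypothetical data with Q1 = 20 and Q3 = 30 taken as given
print(iqr_fences(20, 30))                               # (5.0, 45.0)
print(iqr_outliers([4, 18, 22, 25, 31, 46], 20, 30))    # [4, 46]
```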
2 events CANNOT be both:
independent and mutually exclusive
because
- when mutually exclusive: P(A ∩ B) = 0
- when independent: P(A ∩ B) = P(A) x P(B), which is only 0 if P(A) = 0 or P(B) = 0, so for events with non-zero probabilities these 2 cannot be equal
to work out P(A | B’):
P(A ∩ B’) / P(B’)
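Worked sketch (Python) of the conditional probability formula, using P(B’) = 1 - P(B). The probabilities are hypothetical, and fractions are used to keep the arithmetic exact.

```python
from fractions import Fraction

def p_a_given_not_b(p_a_and_not_b, p_b):
    """P(A | B') = P(A and B') / P(B'), where P(B') = 1 - P(B)."""
    p_not_b = 1 - p_b
    return p_a_and_not_b / p_not_b

# hypothetical values: P(A and B') = 3/10, P(B) = 2/5, so P(B') = 3/5
print(p_a_given_not_b(Fraction(3, 10), Fraction(2, 5)))   # 1/2
```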
probability
condition for independence:
P(A ∩ B) = P(A) x P(B)
condition for mutually exclusive:
P(A ∩ B) = 0
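Worked sketch (Python) that checks both conditions for given probabilities. The function names and values are hypothetical, and fractions are used so the equality tests are exact.

```python
from fractions import Fraction

def is_independent(p_a, p_b, p_a_and_b):
    """Independent when P(A and B) = P(A) * P(B)."""
    return p_a_and_b == p_a * p_b

def is_mutually_exclusive(p_a_and_b):
    """Mutually exclusive when P(A and B) = 0."""
    return p_a_and_b == 0

# hypothetical: P(A) = 1/2, P(B) = 1/3, P(A and B) = 1/6
p_a, p_b, p_both = Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)
print(is_independent(p_a, p_b, p_both))    # True
print(is_mutually_exclusive(p_both))       # False
```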
What is a histogram?
- A histogram is used for grouped continuous data, whereas a bar chart is used for discrete or qualitative data
- no gaps between the bars
- Whilst in a bar chart the frequency is read from the height of the bar, in a histogram the height of the bar is the frequency density
- On a histogram frequency density is plotted on the y-axis. This allows a histogram to be plotted for unequal class intervals
- It is particularly useful if data is spread out at either or both ends
- The area of each bar on a histogram will be proportional to the frequency in that class
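Worked sketch (Python), assuming the usual relationship frequency density = frequency / class width, so that bar area (width x height) gives back the frequency. The table of unequal classes is hypothetical.

```python
def frequency_densities(classes):
    """classes: list of ((lower, upper), frequency); height = frequency / class width."""
    return [f / (hi - lo) for (lo, hi), f in classes]

# hypothetical unequal classes: 0-10 (f=20), 10-15 (f=15), 15-30 (f=30)
table = [((0, 10), 20), ((10, 15), 15), ((15, 30), 30)]
heights = frequency_densities(table)
print(heights)                        # [2.0, 3.0, 2.0]

# area of each bar = class width * height, which recovers the frequency
areas = [(hi - lo) * h for ((lo, hi), _), h in zip(table, heights)]
print(areas)                          # [20.0, 15.0, 30.0]
```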
give a reason to justify the use of a histogram to represent these data
it is (unequal) grouped
continuous data
grouped frequency table given
explain why the mean and standard deviation are just estimates
because the data is grouped, so the exact values are not known
definition
census
observation of every member of a population to make a conclusion
Write down the underlying feature associated with each of the bars in a histogram.
area of the bar is proportional to the frequency
Explain why a linear model may be appropriate to describe the relationship between f and d (positive correlation) (1)
points lie reasonably close to a straight line
Reliability of extra data points using line of best fit (2)
- reliable: interpolation as …. within range of values collected
- unreliable: extrapolation as …. outside range of data collected
y = axⁿ and y = kbˣ as an equation of a straight line
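- y = axⁿ: log y = log a + n log x
Y = C + nX, where Y = log y, X = log x, C = log a
(n is the gradient, log a is the constant)
- y = kbˣ: log y = log k + x log b
Y = C + X log b, where Y = log y, X = x, C = log k
(log b is the gradient, log k is the constant)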
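Worked sketch (Python) checking both straight-line forms numerically with base 10 logs. For y = axⁿ, plotting log y against log x should give gradient n and intercept log a. For y = kbˣ, plotting log y against x should give gradient log b and intercept log k. The constants a = 3, n = 2, k = 5, b = 2 and the helper name are hypothetical.

```python
import math

def gradient_and_intercept(points):
    """Gradient and Y-intercept of the line through the first and last points."""
    (x1, y1), (x2, y2) = points[0], points[-1]
    gradient = (y2 - y1) / (x2 - x1)
    return gradient, y1 - gradient * x1

# y = a * x^n with hypothetical a = 3, n = 2: plot log y against log x
a, n = 3, 2
power_points = [(math.log10(x), math.log10(a * x ** n)) for x in [1, 10, 100]]
print(gradient_and_intercept(power_points))   # approx (n, log10(a))

# y = k * b^x with hypothetical k = 5, b = 2: plot log y against x
k, b = 5, 2
exp_points = [(x, math.log10(k * b ** x)) for x in [0, 1, 2]]
print(gradient_and_intercept(exp_points))     # approx (log10(b), log10(k))
```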