statistics Flashcards
random error
lecture: can be conceptualised as sampling variability
notes: it can’t be avoided completely; the only way would be to test the whole population, which is usually impossible, but you can minimise it by increasing the sample size
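A minimal sketch of the note above, assuming a made-up population of heights (mean 170, SD 10, both hypothetical). It draws repeated random samples and shows that the sample means scatter less when the sample size is larger.

```python
import random
import statistics

# Hypothetical population: 50,000 heights, roughly normal (values are made up).
rng = random.Random(42)
population = [rng.gauss(170, 10) for _ in range(50_000)]


def spread_of_sample_means(sample_size, repeats=300):
    """SD of the means of `repeats` random samples of the given size."""
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)


small_n_spread = spread_of_sample_means(25)
large_n_spread = spread_of_sample_means(400)

print(f"n = 25:  sample means vary with SD {small_n_spread:.2f}")
print(f"n = 400: sample means vary with SD {large_n_spread:.2f}")
```

The random error never reaches zero here, it only shrinks as the sample grows.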
bias (systematic error)
a difference between the observed value and true value due to all causes except sampling variability
random sample
each member of the population has an equal chance of being chosen
properties of a good sample
representative by structure
random
representative by number of cases
how to select a sample
- select a sampling method
- define target population
- determine sample size
high power
large sample size
little scatter
low power
small sample size
scatter is large
definition of paired data
when 2 or more measurements are made on the same observational unit
descriptive statistics
organising and summarising the data: tables (frequency distributions and relative frequency distributions), histograms, pie charts, etc.
measures of central tendency (mean, median, mode)
central tendency describes location and variation describes SPREAD (red book lec 2)
measures of variability (range, variance, standard deviation)
inferential statistics
using the sample that you worked with to draw a general conclusion about the population
uses probability to determine how confident we can be that the conclusions we derive are correct
what are the measures of variation in descriptive statistics
IQR (interquartile range)
variance
SD
range
mean
it’s the balance point
can be heavily affected by outliers, so outliers can make the mean a poor measure of central tendency
median
it’s the middle value when the values are ranked in order
it’s the point that divides a distribution into 2 equal halves
it’s largely unaffected by outliers
if you have normally distributed (symmetric) data how does this affect the central tendency
mean, median and mode will all be the same
what happens to the central tendency in skewed data
the mean lies further towards the tail of the skew than the median does (because remember the mean is affected by outliers)
in skewed data the median and mean both lie further towards the tail than the mode
mode
the most common data point. it’s possible to have more than 1. if all values are unique there is no mode
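A small sketch of the mean, median and mode cards above, using Python's standard `statistics` module. The data values are made up for illustration. It shows one outlier pulling the mean while the median and modes barely move.

```python
import statistics

# Made-up data: the second list is the first plus one extreme value.
values = [2, 3, 3, 4, 5, 5, 6]
with_outlier = values + [60]


def summarise(data):
    """Return the three measures of central tendency for a list of numbers."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        # multimode returns every most-common value, so ties are kept.
        # Note: if all values are unique it returns all of them,
        # which these notes treat as "no mode".
        "modes": statistics.multimode(data),
    }


plain = summarise(values)
skewed = summarise(with_outlier)

print(plain)   # mean 4, median 4, modes [3, 5]
print(skewed)  # mean 11, median 4.5, modes [3, 5]
```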
SD
takes into account all individual deviations
the larger the SD , the greater the variation around the mean
google: SD is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
when is SD zero
only when all values are the same
is SD affected by outliers
YES
can range be affected by outliers
yes
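IQR (interquartile range)
a measure of variability based on dividing the ranked data set into quartiles; it is the difference between the 3rd and 1st quartiles (Q3 - Q1)
rank the data set first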
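A sketch comparing the measures of variation from the cards above on made-up data, with and without one outlier. It assumes the default quartile method of `statistics.quantiles`; other textbooks and packages use slightly different quartile rules, so IQR values can differ a little.

```python
import statistics


def spread(data):
    """Range, sample variance, sample SD and IQR for a list of numbers."""
    q1, _q2, q3 = statistics.quantiles(data, n=4)  # three cut points: Q1, Q2, Q3
    return {
        "range": max(data) - min(data),
        "variance": statistics.variance(data),
        "sd": statistics.stdev(data),
        "iqr": q3 - q1,
    }


# Made-up data: the second list is the first plus one extreme value.
values = [2, 3, 3, 4, 5, 5, 6]
with_outlier = values + [60]

before = spread(values)
after = spread(with_outlier)

for name in ("range", "variance", "sd", "iqr"):
    print(f"{name:8} without outlier: {before[name]:8.2f}   with outlier: {after[name]:8.2f}")

# SD is zero only when every value is the same.
print(statistics.stdev([7, 7, 7, 7]))
```

Range, variance and SD jump when the outlier is added, while the IQR changes only slightly.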
what is best to use in symmetrical data
mean + SD
what is best to use in skewed data
median and IQR (as they are not affected by outliers)
3 sigma rule - NB
ONLY WORKS for normally distributed data
68% of the data lie within 1 SD of the mean, above and below
95% of the data lie within 2 SD of the mean, above and below
99.7% of the data lie within 3 SD of the mean, above and below
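A quick simulation of the rule, assuming simulated normal data (mean 50 and SD 8 are arbitrary choices). It counts the share of values within 1, 2 and 3 SD of the mean; the shares should land close to the percentages on the card, not exactly on them.

```python
import random
import statistics

# Simulated, normally distributed data (parameters are arbitrary).
rng = random.Random(0)
data = [rng.gauss(50, 8) for _ in range(20_000)]

mean = statistics.fmean(data)
sd = statistics.pstdev(data)


def share_within(k):
    """Fraction of values lying within k standard deviations of the mean."""
    inside = sum(1 for x in data if abs(x - mean) <= k * sd)
    return inside / len(data)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD of the mean: {share:.1%}")
```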
central limit theorem
- create a population with a known distribution that is not normal
- randomly select many samples of equal size from that population
- tabulate the means of these samples and graph the frequency distribution
it states that if your samples are large enough, the distribution of the sample means will approximate a normal distribution, even if the population is not normal or ‘gaussian’
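The three steps above can be sketched as a simulation. The choices here are assumptions for illustration: an exponential population (strongly right-skewed), samples of size 50, and 2,000 samples. Skewness is used as a rough numeric stand-in for graphing the frequency distribution, with values near 0 indicating a symmetric shape.

```python
import math
import random
import statistics

rng = random.Random(1)

# Step 1: a population with a known, non-normal (right-skewed) distribution.
population = [rng.expovariate(1.0) for _ in range(20_000)]

# Step 2: many random samples of equal size; keep the mean of each one.
sample_size = 50
sample_means = [
    statistics.fmean(rng.sample(population, sample_size))
    for _ in range(2_000)
]


def skewness(values):
    """Average cubed z-score: about 0 for symmetric data, positive for a right tail."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


# Step 3: compare the shape of the population with the shape of the means.
population_skew = skewness(population)
means_skew = skewness(sample_means)
expected_sd = statistics.pstdev(population) / math.sqrt(sample_size)

print(f"skewness of population:   {population_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
print(f"SD of sample means: {statistics.pstdev(sample_means):.3f} "
      f"(population SD / sqrt(n) = {expected_sd:.3f})")
```

The sample means cluster around the population mean and are far less skewed than the population itself.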
what can be the cause of outliers
invalid data entry
biological diversity
random chance
experimental error
skewed distribution
tests used to eliminate outliers from results
Chauvenet’s criterion
Grubbs’ test
Peirce’s criterion
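As an illustration of the first test on this card, here is a rough sketch of one common statement of Chauvenet's criterion: flag a value when the expected number of observations at least that far from the mean, under a normal distribution fitted to the sample, is below 0.5. The function name and the measurements are hypothetical, and using the sample SD is an assumption; check the lecture's version before relying on it.

```python
import statistics


def chauvenet_outliers(data):
    """Return the values flagged by Chauvenet's criterion (as stated above)."""
    n = len(data)
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # assumption: sample SD
    standard_normal = statistics.NormalDist()

    flagged = []
    for x in data:
        z = abs(x - mean) / sd
        # Two-sided probability of a deviation at least this large.
        tail_probability = 2 * (1 - standard_normal.cdf(z))
        if n * tail_probability < 0.5:
            flagged.append(x)
    return flagged


# Hypothetical repeated measurements with one suspicious reading.
measurements = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 15.0]
print(chauvenet_outliers(measurements))  # [15.0]
```

A flagged value is only a candidate for removal; the cause (data entry, experimental error, real biological diversity) still has to be checked.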
confidence interval for population means: what do we assume
- population is normally distributed
- random representative sample
- independent observations
what does the confidence interval depend on
- sample mean (your estimate of the population mean)
- SD
- sample size
- degree of confidence
what is the purpose of the confidence interval for the mean
it gives a range of values around the sample mean where the true population mean is expected to be located
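A minimal sketch of how the four inputs (sample mean, SD, sample size, degree of confidence) combine into an interval. It uses the normal (z) approximation from the standard library, which is an assumption that suits larger samples; small samples would normally use the t distribution and give a slightly wider interval. The function name and the simulated data are hypothetical.

```python
import math
import random
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean (z-based)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)
    # z value leaving (1 - confidence) split equally between the two tails.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Simulated measurements (made-up parameters: mean 100, SD 15).
rng = random.Random(7)
small_sample = [rng.gauss(100, 15) for _ in range(50)]
large_sample = [rng.gauss(100, 15) for _ in range(800)]

low95, high95 = mean_confidence_interval(small_sample, 0.95)
low99, high99 = mean_confidence_interval(small_sample, 0.99)
big_low, big_high = mean_confidence_interval(large_sample, 0.95)

print(f"n = 50,  95%: {low95:.1f} to {high95:.1f}")
print(f"n = 50,  99%: {low99:.1f} to {high99:.1f}")
print(f"n = 800, 95%: {big_low:.1f} to {big_high:.1f}")
```

A higher degree of confidence widens the interval, and a larger sample narrows it.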
regression analysis models
statistical models - describe the relationship between 2 variables
deterministic - hypothesize exact relationships
probabilistic - hypothesize 2 components
- deterministic
- random error
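For the simple linear case, the two components of a probabilistic model are often written as below. The symbols $\beta_0$ (intercept), $\beta_1$ (slope) and $\varepsilon$ (random error) are notation assumed here, not taken from the lecture notes.

```latex
y = \underbrace{\beta_0 + \beta_1 x}_{\text{deterministic component}}
  + \underbrace{\varepsilon}_{\text{random error}}
```

A deterministic model is the same expression with the $\varepsilon$ term left out.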
types of regression models
simple (1 explanatory variable): divides into linear and non-linear
multiple (2+ explanatory variables): divides into linear and non-linear
regression modelling
- determine the problem
- specify model
- collect data
- do descriptive data analysis
- estimate unknown parameter
- evaluate model
- use model for prediction
(remember model is 25 )
definition of regression analysis
regression analysis is helpful in assessing specific forms of relationships between variables; the ultimate objective is to predict or estimate the value of 1 variable corresponding to a given value of another variable.
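A sketch of the simplest case, one explanatory variable and a straight line, fitted by ordinary least squares. The function names and the data points are hypothetical; the formulas are the standard least-squares estimates of slope and intercept.

```python
def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Estimate y for a given x using the fitted line."""
    return intercept + slope * x


# Made-up paired observations (x = explanatory variable, y = response).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} x")
print(f"predicted y at x = 6: {predict(intercept, slope, 6):.2f}")
```

This covers the "estimate unknown parameter" and "use model" steps only; evaluating the fit is a separate step.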
advantages of mean
- very sensitive measure
- can be combined with the means of other groups to give the overall mean
- considers all the available information
disadvantages of mean
- affected by outliers
- can only be used on interval or ratio data
- only a good measure of central tendency if the distribution is roughly symmetric (e.g. normal)
advantages of median
- unaffected by outliers
- can be used with ordinal (ranked) data, not only numerical data
disadvantages of median
- only takes into account order - value is ignored
disadvantage of mode
is a terminal statistic - a given subgroup could make this measure unrepresentative
advantages of mode
- quick and easy
- unaffected by outliers
- can be used at any level of measurement
assumptions about bivariate data
- For each value of X there is a normally distributed subpopulation of Y values.
- For each value of Y there is a normally distributed subpopulation of X values.
- The joint distribution of X and Y is a normal distribution called the bivariate normal distribution.
- The subpopulations of Y values all have the same variance.
- The subpopulations of X values all have the same variance.