Sub-topic 1: Variables, distributions and summary statistics Flashcards
Week 2
Biologists make observations of (collect data on) selected
variables on a sample from the population to estimate
the value of one or more parameters of that
population.
Yes.
Variable: any observable feature of the natural world (e.g. number of limpets in a quadrat, sex of a frog, moisture content of a leaf). These are all variables as they have the potential to vary
Yes.
Population: the target group of interest in
the study. Can be finite (e.g. number of fish
in a pond) or infinite (e.g. number of fish in
the ocean)
Yes.
Sample: we cannot practically count every unit in a population, therefore we sample a subset of a population and attempt to draw inferences about the entire population from this sample
Yes.
Parameters: a parameter is some
characteristic of the distribution of the
variables in a population (e.g. the
average or variance of weights of fish in
a pond)
Yes.
VARIABLE A variable is any observable feature of the natural world
Yes.
DATUM A datum, or observation, is any one record of the state of a variable.
Yes.
DATASET Any collection of observations made on a variable is a data set
Yes.
POPULATION The set of all possible observations on a variable is the population
Yes.
FINITE
POPULATION
Populations can be either finite or infinite. Finite populations have a finite, countable
number of elements and can, in theory at least, be completely sampled.
Yes.
INFINITE
POPULATION
Infinite populations have an infinite number of elements and can never be completely
sampled.
Yes.
SAMPLE Large and infinite populations cannot be observed in their entirety, so we take
a sample: a subset of observations drawn (nearly always at random) from the population.
Yes.
PARAMETER A parameter is some characteristic of the distribution of the values of a variable in a
population.
Yes.
STATISTIC The term “statistic” is used in two ways: to refer to the entire body of procedures for
dealing with data; or to refer to estimates of population parameters based on samples.
Yes.
NOMINAL/
CLASSIFICATION
Features which can be classified into named groups, lacking
order
Yes.
ORDINAL/
RANKING
Features which can be ranked in order
Yes.
NUMERICAL/
QUANTITATIVE
Features which can be enumerated or quantified (counted or
measured)
e.g. weight, number, temperature, counts of animals
Can be subdivided into, e.g.
• Interval v Ratio: arbitrary zero and unit
(temperature [Celsius]) v true zero (weight)
• Discrete v Continuous: values which are whole
numbers (counts) v values which can be fractions
(weight)
Yes.
Measures of location
• Mode: most common value
• Median: middle value
• Mean: average value
Yes.
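A minimal sketch of the three measures of location, using Python's standard `statistics` module. The limpet counts are made-up illustrative data, not values from the unit.

```python
from statistics import mean, median, multimode

# Hypothetical counts of limpets in nine quadrats (illustrative data only).
limpet_counts = [3, 5, 5, 6, 7, 8, 5, 9, 6]

sample_mean = mean(limpet_counts)        # average value
sample_median = median(limpet_counts)    # middle value of the sorted data
sample_modes = multimode(limpet_counts)  # most common value(s), returned as a list

print("mean:", sample_mean)
print("median:", sample_median)
print("mode(s):", sample_modes)
```

`multimode` is used rather than `mode` so that a data set with tied most-common values reports all of them.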
Symbols
• Sample mean: x̄
• Population mean: μ
Yes.
Measures of shape
• Variance: spread of distribution
• Skewness: skew of peak of distribution to one side of the mean
• Kurtosis: “peakedness” or “flatness” of distribution
Yes.
Symbols
• Sample variance: s²
• Population variance: σ²
• Sample standard deviation: s
• Population standard deviation: σ
Yes.
VARIANCE (s²): measures the dispersion of data around their
mean value
Yes.
NORMAL DISTRIBUTION: a symmetric distribution, often called a
bell curve, which describes many variables of the natural world,
e.g. height, weight, test scores in a very large class
Yes.
STANDARD DEVIATION: s = √s² = a measure of dispersion in the
data, expressed in the same units as the mean
Yes.
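A short sketch of how the sample variance and standard deviation can be computed by hand and checked against Python's `statistics` module. The fish weights are hypothetical, and the n − 1 divisor is the usual sample (not population) formula.

```python
from math import sqrt
from statistics import variance, stdev

# Hypothetical weights (g) of five fish sampled from a pond.
fish_weights = [12.0, 15.0, 11.0, 14.0, 13.0]

n = len(fish_weights)
x_bar = sum(fish_weights) / n  # sample mean

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in fish_weights) / (n - 1)

# Sample standard deviation: square root of the variance, in the data's units.
s = sqrt(s2)

print("s2 =", s2, "(library:", variance(fish_weights), ")")
print("s  =", round(s, 3), "(library:", round(stdev(fish_weights), 3), ")")
```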
SKEWNESS: measures the extent to which the distribution is
“pushed” to either side of the mean
Yes.
KURTOSIS: measures the “peakedness” of the distribution
Yes.
• meso – middle, intermediate, halfway
• lepto – small, fine, thin, delicate
• platy – broad, flat
Yes.
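A hedged sketch of skewness and kurtosis using the simple moment-based definitions (third and fourth central moments scaled by the variance). Statistical packages often apply small-sample corrections, so their output may differ slightly. The function names and data are illustrative only.

```python
def central_moment(values, k):
    """k-th central moment: mean of (x - mean)^k, dividing by n."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** k for x in values) / n

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric data
right_skewed = [1, 1, 1, 2, 10]  # hypothetical data with a long right tail

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-skewed):", round(skewness(right_skewed), 3))
print("excess kurtosis (symmetric):", round(excess_kurtosis(symmetric), 3))
```

Under this convention, negative excess kurtosis corresponds to a platykurtic (flat) distribution, values near zero to mesokurtic, and positive values to leptokurtic.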
Accuracy • How close the estimate is to the true value • A biased method gives estimates which differ consistently from the true value • Cannot be determined from the data
Yes.
Precision • How close repeated estimates are • Precision can be determined from the data (standard error, confidence interval)
Yes.
Accuracy – how true (unbiased) is the result? • Requires attention to the methods of sample selection and measurement • Ensure no bias in instruments or techniques • Calibrate, ground-truth or similar • Randomly select samples
Yes.
Precision – how variable is the result? • Requires attention to sampling design and effort • Vary the number of samples taken • Vary the size of the sampling unit • Vary the arrangement of the sampling units
Yes.
Aim • To estimate the lead content of oysters with a SE (standard error) of 8 ppm • Prior sampling indicates that s² (variance) is usually about 900 ppm² Method • Calculate SE = √(s²/n) for varying n and use the n which gives the desired SE → this is about 14
Yes.
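The oyster calculation can be sketched as follows, assuming the usual formula SE = √(s²/n). The variable names are illustrative. Solving for n directly gives roughly 14, matching the card; rounding up to the next whole sample is shown as an optional, more cautious choice.

```python
from math import sqrt, ceil

s2 = 900.0       # prior estimate of the variance (ppm squared)
target_se = 8.0  # desired standard error (ppm)

# Tabulate SE for a range of sample sizes.
for n in range(10, 19, 2):
    print(n, round(sqrt(s2 / n), 2))

# Rearranging SE = sqrt(s2 / n) gives n = s2 / SE^2.
n_exact = s2 / target_se ** 2  # 900 / 64
n_required = ceil(n_exact)     # smallest whole n with SE at or below the target

print("n (exact):", n_exact)
print("n (rounded up):", n_required)
```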
Recap: Take home messages
• Variables are any feature of the world – divided into 3 main groups:
• NOMINAL/CLASSIFICATION variables can be classified into named groups
e.g. sex, colour, habitat type etc.
• ORDINAL/RANKING variables can be ranked in order e.g. social position,
size-class, education level etc.
• NUMERICAL/QUANTITATIVE variables can be enumerated or quantified
(counted or measured), e.g. height, weight etc.
Yes.
• Variable distribution can be graphically represented in frequency
histograms showing the ‘shape’ and ‘spread’ of the data
Yes.
Summary statistics can be broadly divided into measures of location
(e.g. mean, median and mode) and measures of shape (e.g. variance
and standard deviation)
Yes.
Shape of distribution is vital in choosing appropriate statistical tests
Yes.
To be reliable observations and estimates should be accurate (not
showing bias) and precise (not too variable)
Yes.
Scheme: Simple random (SR)
• Advantages: Usually simple to use
• Disadvantages: Provides limited information, probably not efficient or precise
• Uses: Pilot studies, simple studies
Scheme: Stratified
• Advantages: Usually provides more precise results than other methods; provides more information than SR
• Disadvantages: More complex to run and analyse; may take more time to sample
• Uses: Situations where an area, or population, can be divided into homogeneous strata; testing hypotheses
Scheme: Cluster
• Advantages: When the situation is suitable, this scheme is likely to be more efficient; provides more information than SR
• Disadvantages: More complex to run and analyse; may be less precise than stratified sampling
• Uses: Situations where items of interest are naturally grouped in clusters; can be used to test some types of hypotheses
Scheme: Systematic
• Advantages: Usually simple to use
• Disadvantages: Unless done carefully, may provide biased estimates
• Uses: Drawing maps and similar situations
Yes.
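A minimal sketch contrasting simple random and stratified random sampling. The sampling frame (60 quadrats in three shore-height strata), the sample sizes and the seed are all hypothetical choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the example is repeatable

# Hypothetical sampling frame: 20 quadrats in each of three strata.
strata = ("low shore", "mid shore", "high shore")
frame = [(stratum, quadrat) for stratum in strata for quadrat in range(20)]

# Simple random: draw 12 quadrats from the whole frame, ignoring strata.
simple_random = rng.sample(frame, 12)

# Stratified random: draw 4 quadrats at random within each stratum.
stratified = []
for stratum in strata:
    units = [unit for unit in frame if unit[0] == stratum]
    stratified.extend(rng.sample(units, 4))

for stratum in strata:
    n_sr = sum(1 for unit in simple_random if unit[0] == stratum)
    n_st = sum(1 for unit in stratified if unit[0] == stratum)
    print(stratum, "simple random:", n_sr, "stratified:", n_st)
```

The stratified draw guarantees equal effort in every stratum, whereas the simple random draw can leave some strata under-represented by chance.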
There is no one-size-fits-all approach; the choice depends on, e.g.:
• Cost/benefit → pilot studies useful in this regard
• Accuracy and precision trade-offs
• The study focus and ramifications
• Three main sampling schemes (and a special fourth case):
• Simple random
• Stratified random
• Cluster
• (Systematic)
• Models may be developed from observations and tested by
• Sampling (mensurative) experiments [more general results];
or,
• Manipulative experiments, generally trickier but can provide
much more explicit tests of mechanisms
Yes.
Use a balanced design when possible
• Balanced designs have equal numbers of replicates in all treatments
• Analysis is usually easier
• More readily meet the assumptions necessary for some tests
• May conflict with the requirements of a stratified sampling scheme
Yes.
Use a multifactor design when possible
• These designs are usually more efficient (more powerful for same
effort)
• These designs are usually more informative
• Ensure that all combinations of treatments are included
This is referred to as an orthogonal design
Non-orthogonal designs can be difficult to analyse
Yes.
REPLICATION → DO IT!
• Studies must be replicated in order to draw correct inferences
• Avoiding pseudoreplication is imperative and requires close
attention
Yes.
CONFOUNDING FACTORS
• To be avoided at all costs
• Makes it very difficult, if not impossible, to test hypotheses
Yes.
RANDOM & INDEPENDENT
• Random samples alleviate bias and maintain independence
• Non-independence can lead to incorrect conclusions
• Non-independent sampling may be used if it is implicitly part of the study and
appropriately accounted for
Yes.
BALANCED & COMPLETE (ORTHOGONAL)
• Usually provide more/better information
• Easier to analyse in many instances
• Unbalanced can be accommodated, incomplete not so much
Yes.
Test Statistics
• Theoretical distributions based on sample data for varying sample sizes
(degrees of freedom)
Yes.
Multiple pairwise tests (3+ means)
• DON’T DO IT
• Greatly inflate Type I error rate (≈ 5% per comparison)
• ANOVA
• Use when comparing 3+ means
• Controls α at 0.05 (5%) for the entire procedure
• Assumptions
• Independent, normal, equal variance, additive
• Check assumptions graphically and using Cochran’s test
*generally robust to violations of normality and equal
variance assumptions
Yes.
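A rough sketch of why multiple pairwise tests inflate the Type I error rate. It assumes every comparison is independent and tested at α = 0.05. That independence is only an approximation for real pairwise tests, so the figures are indicative rather than exact. The function name is illustrative.

```python
from math import comb

def familywise_error(k_means, alpha=0.05):
    """Return (number of pairwise comparisons, chance of at least one Type I error).

    Assumes the comparisons are independent, which is a simplification.
    """
    n_comparisons = comb(k_means, 2)  # k(k - 1) / 2 pairs of means
    return n_comparisons, 1 - (1 - alpha) ** n_comparisons

for k in (3, 4, 5, 6):
    m, rate = familywise_error(k)
    print(k, "means:", m, "comparisons, overall Type I rate ~", round(rate, 3))
```

With only 5 means the approximate overall rate is already about 40%, which is why a single ANOVA is preferred.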
Post-hoc multiple comparisons
• ANOVA identifies significant result but doesn’t tell you WHERE
• Use post-hoc tests, e.g. Tukey’s HSD, to determine which means
differ
Yes.
Sub-topic 5: Two-factor ANOVA
Design
• Two rivers were sampled: one with pollutant released in upper reaches; the other was the closest similar unpolluted control
• Samples were taken in the upper reaches of each river, at the mouth, and about half-way between
• 3 water samples were collected at each combination of river and section
• The number of plankton was counted and pollutants measured
Factors
• River (Pollution): Polluted, Control
• Section (Area): Upper, Middle, Lower
Replicates
• Water samples: 3
Variables
• Number of plankton
• Pollutants
Yes.
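The river design can be laid out as a crossed (orthogonal) and balanced design: every level of River occurs with every level of Section, with 3 replicates per combination. The sketch below only enumerates the design; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

rivers = ["Polluted", "Control"]
sections = ["Upper", "Middle", "Lower"]
replicates_per_cell = 3

# Crossed design: every combination of river and section is a cell.
cells = list(product(rivers, sections))

# One row per water sample: (river, section, replicate number).
design = [
    (river, section, rep)
    for river, section in cells
    for rep in range(1, replicates_per_cell + 1)
]

# Balanced design: every cell holds the same number of replicates.
cell_counts = Counter((river, section) for river, section, _ in design)

print("cells:", len(cells))
print("total water samples:", len(design))
print("balanced:", len(set(cell_counts.values())) == 1)
```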
Arrangement and number of factors:
• Stratified (crossed/orthogonal) – all levels of one factor are present with all levels of the other factor(s)
• Cluster (nested/hierarchical) – some levels of one factor are present only at some levels of the other factor(s)
Selection and number of replicates:
• Random – the replicates in each subgroup are randomly and independently selected
• Repeated measures – the replicates in some subgroups are the same as replicates in other subgroups
Selection and number of levels:
• Fixed – specific levels are chosen from the range available (or all available levels are used)
• Random – the levels in the study are randomly selected from those available and not all available are used
These affect:
• the complexity of the design
• the appropriate model for the analysis
• the power of tests for different effects
Yes.
Correlated variables vary together
Parametric (Pearson’s)
• For numerical or quantitative variables
• r measures the closeness of the relationship
• Correlation analyses linear relationships; non-linear relations need other methods
Non-parametric (Spearman’s rank)
• Non-parametric correlation is used when one or both variables are ordinal
• May be useful when the assumptions of parametric analyses do not hold
Yes.
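A small sketch of both coefficients written out by hand, so the link between them is visible: Spearman’s rank correlation is computed here as Pearson’s r on the ranks, with tied values given their average rank. The function names and the data are illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: covariation of x and y scaled by their individual variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upwards; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: y rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print("Pearson r:", round(pearson_r(x, y), 3))
print("Spearman rho:", spearman_rho(x, y))
```

Because the relationship is monotonic but curved, the rank-based coefficient reports a perfect association while Pearson’s r falls short of 1.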
Assumptions of parametric correlation
Normality of observations
• The observations on each of the variables are assumed to be normally distributed
Linearity of relationship
• The relationship is assumed to be linear (a straight line)
Checking assumptions
Normality of observations
• With many observations (>40) can plot frequency distribution
• With fewer observations can do normal probability plots (not discussed in this unit)
Linearity of relationship
• Graph the data!
Yes.
A valid test for a correlation requires the following:
• Quantitative observations: if one or both variables are ranking, or ordinal,
variables, use the non-parametric correlation coefficient (Sub-Topic 2)
Yes.
Independent observations: the selection of one point (e.g. animal) must not
influence whether or not any other point is selected. Analyses of non-independent observations may be unreliable
Yes.
Observations bivariate normal: this means that the two variables must be
normally distributed. If one or both variables are not normally distributed it
may be possible to transform them so that they are (Sub-Topic 4)
Yes.
Linear relationship: the correlation coefficient measures only the degree of
linear relationship. If the relationship is not linear it may be possible to
transform one or both variables so that it is (Sub-topic 4)
Yes.