Statistical analyses Flashcards
Under what circumstances are data transformations important for multivariate analyses?
If data do not have a uniform scale
How do you deal with qualitative variables in multivariate analyses?
Give them a numerical value
Example: for seasonality, use four separate variables and code the absence or presence of each season as 0 or 1
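A minimal Python sketch of the 0/1 coding described above. The season labels and the `dummy_code` helper are illustrative assumptions, not part of the cards.

```python
SEASONS = ("spring", "summer", "autumn", "winter")  # assumed labels


def dummy_code(season):
    """Return one 0/1 indicator variable per season."""
    if season not in SEASONS:
        raise ValueError(f"unknown season: {season!r}")
    return {name: int(name == season) for name in SEASONS}


# Each sample becomes four binary variables; exactly one of them is 1.
samples = ["summer", "winter", "summer"]
coded = [dummy_code(s) for s in samples]
```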
What does standardization do? (2)
used to remove influences of magnitude difference
results in dimensionless variables
What is a z-score and how is it calculated? (2 steps)
Used to standardize data;
- take difference between the value and mean of the variable
- divide by the stdev of the variable
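The two steps can be sketched in Python as follows. The cards do not say which standard deviation is meant, so this sketch assumes the sample standard deviation (n - 1 denominator); `z_scores` and the example values are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize a variable: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / sd for v in values]


temperatures = [10.0, 12.0, 14.0, 16.0, 18.0]  # made-up example values
z = z_scores(temperatures)
```

The result is dimensionless, with mean 0 and standard deviation 1.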
What is the difference between an object and a variable?
Object: categories of data (samples, sites, time periods, etc.)
Variable: measured value for each object
What does normalization do?
Corrects the distribution shapes of variables that depart from normality and tries to obtain homogeneous variances across variables for better multivariate analyses
What transformation can be done on data with a lot of zeroes?
Hellinger transformation
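As a rough illustration, the Hellinger transformation is usually described as the square root of each value divided by its row (site) total. The sketch below assumes that definition; returning zeros for an all-zero row is a choice made here, not something the cards specify.

```python
import math


def hellinger(row):
    """Hellinger-transform one site's abundances: sqrt(x / row total)."""
    total = sum(row)
    if total == 0:
        # Assumption: an empty site is returned as all zeros.
        return [0.0 for _ in row]
    return [math.sqrt(x / total) for x in row]


site = [4, 0, 12]  # made-up counts with a zero
transformed = hellinger(site)
```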
What are exploratory multivariate analyses?
multivariate analyses that are used to reveal patterns in large datasets, but do not explain why those patterns exist
What does a cluster analysis do?
Minimizes w/in group variation, maximizes between group variation (reduces the dimensionality of the dataset to a few groups of objects)
Under what circumstances might a cluster analysis be useful?
When distinct discontinuities are expected
What are the two steps for a cluster analysis?
- use a relevant association coefficient to calculate a dissimilarity/similarity matrix between or among objects/variables.
- represent the association matrix as a tree (hierarchical clustering) or as groups of objects (k-means clustering)
What types of linkage rules are generally used to merge clusters in hierarchical clustering? (3)
- nearest neighbor: distance between two clusters is equal to the distance between their CLOSEST neighboring points
- furthest neighbor: distance between 2 clusters is equal to the distance between their two furthest objects
- UPGMA: distance between 2 clusters is equal to the avg. distance between all inter-cluster pairs
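The three rules differ only in how they summarize the distances between all inter-cluster pairs. A small Python sketch, assuming Euclidean distance between points (function names are illustrative):

```python
import math
from itertools import product


def pair_distances(cluster_a, cluster_b):
    """Euclidean distances between every inter-cluster pair of points."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(a, b):
    """Nearest neighbor: smallest inter-cluster distance."""
    return min(pair_distances(a, b))


def complete_linkage(a, b):
    """Furthest neighbor: largest inter-cluster distance."""
    return max(pair_distances(a, b))


def average_linkage(a, b):
    """UPGMA: mean of all inter-cluster distances."""
    d = pair_distances(a, b)
    return sum(d) / len(d)


cluster_1 = [(0, 0), (0, 1)]  # made-up points
cluster_2 = [(3, 0), (4, 0)]
```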
How does k-means clustering work?
Objects are clustered into k clusters (k is defined in advance) based on their Euclidean distance to the nearest cluster mean
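A minimal sketch of this idea, assuming the common alternating scheme (assign each object to the nearest mean, then recompute the means) and caller-supplied starting centroids. It is not a production implementation; the names and data are hypothetical.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Cluster points around k centroids; k is the number of starting centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each mean; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        if updated == centroids:  # converged: means no longer move
            break
        centroids = updated
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # made-up data
centroids, clusters = k_means(points, [(0, 0), (10, 10)])
```

Because each mean is an average of its members, a single extreme point can pull a centroid far from the rest of its cluster, which is the outlier sensitivity mentioned on the next card.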
What is one advantage and one disadvantage of using k-means clustering?
advantage: don’t need a similarity matrix
disadvantage: sensitive to outliers
What is a PCA?
Principal component analysis
What does a PCA do?
Calculates new synthetic variables (principal components) using linear combinations of the original variables to account for as much variability as possible
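For just two variables the idea can be sketched without a linear algebra library, because the eigenvalues of a 2x2 variance-covariance matrix have a closed form. This is an illustrative special case with hypothetical names, not a general PCA routine.

```python
import math


def covariance(x, y):
    """Sample covariance (n - 1 denominator); covariance(x, x) is the variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def pca_2d_eigenvalues(x, y):
    """Eigenvalues of the 2x2 covariance matrix [[a, b], [b, c]], largest first."""
    a, b, c = covariance(x, x), covariance(x, y), covariance(y, y)
    mid = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    return mid + half_gap, mid - half_gap


x = [1, 2, 3, 4, 5]   # made-up data: y is exactly 2 * x
y = [2, 4, 6, 8, 10]
first, second = pca_2d_eigenvalues(x, y)
```

With perfectly collinear variables the first component carries all of the variance (the sum of the two original variances) and the second carries none.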
What kind of matrix is used for PCA when all data points have the same units (ex: species abundance)?
Variance-covariance matrix
What kind of matrix is used for PCA when data points have different units?
correlation matrix, variables must be standardized so that distances are independent of original scales
What are the dots on a PCA ordination?
Objects
What are the vectors on a PCA ordination and what do they mean?
Variables
Vector direction indicates the direction of greatest change; vector length may indicate the rate of change
Under what conditions should a PCA be used?
Good when looking at linear responses across short gradients (otherwise CA, NMDS are better)
What is an eigenvalue?
Value denoting how much variance is explained by a given principal component.
When is an eigenvalue considered significant?
If its value is greater than the average of all eigenvalues
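Both eigenvalue cards reduce to simple arithmetic. A short sketch with made-up eigenvalues (function names are illustrative):

```python
def variance_explained(eigenvalues):
    """Proportion of total variance carried by each component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def above_average(eigenvalues):
    """Indices of components whose eigenvalue exceeds the mean eigenvalue."""
    mean_ev = sum(eigenvalues) / len(eigenvalues)
    return [i for i, ev in enumerate(eigenvalues) if ev > mean_ev]


eigenvalues = [4.0, 2.5, 1.0, 0.5]  # hypothetical values
proportions = variance_explained(eigenvalues)
retained = above_average(eigenvalues)
```

Here the mean eigenvalue is 2.0, so only the first two components pass the rule.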
Why are correlations between principal components and original variables not statistically valid in terms of describing which variables contribute most to variation observed in a PCA ordination?
components and variables are already linearly correlated and are not independent of one another
What does PCoA stand for?
Principal coordinate analysis
How is PCoA different from PCA?
Works with any dissimilarity measure; you can pick the association coefficient that works best for your data
Why are components more difficult to interpret for PCoA than for PCA?
There is no direct link between components and the original variables, because PCoA components are complex functions of the variables that depend on the association coefficient used to form the matrix; variables can still be correlated with the axes (but not with statistical significance)
What does NMDS stand for?
Non-metric multidimensional scaling
What analysis is good for identifying underlying gradients and representing relationships based on various distance measures?
NMDS
How does NMDS work?
Ranks distances between objects (matrix), uses those ranks to map them non-linearly in ordination so as to preserve their ranks with the least amount of “stress”; proximity between objects corresponds to their similarity.
How does NMDS calculate stress?
Goes through several iterations of ordinations to identify the lowest stress based on comparisons to original distances between samples
How do you interpret stress values for NMDS?
Stress > 0.3 indicates a poor representation of the data
Which two statistical tests are usually used for multivariate data?
NPMANOVA, ANOSIM
What is NPMANOVA used for?
used to test for significance between at least two sets of multivariate, quantitative data
What is the null hypothesis for NPMANOVA for multivariate data and how is it tested?
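The null hypothesis of no difference between groups is tested with a pseudo-F statistic whose significance is assessed by permutation (Wilks' lambda is the statistic used by parametric MANOVA); a post-hoc test can then be used to assess the significance of pairwise comparisons.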
What is ANOSIM used for?
Can test for significance based on any distance measure, compares ranks of distance in and between groups
What statistic do you get after running an ANOSIM and how is it interpreted?
R statistic, measures whether separation is found (R=1) or no separation occurs (R=0)
R>0.75 considered well-separated statistically
R> 0.5 considered separated but overlapping
R
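A sketch of how the R statistic is commonly computed: rank all pairwise dissimilarities, then compare the mean rank of between-group pairs with the mean rank of within-group pairs, scaled by M/2 where M is the number of pairs. The helper names and the example matrix are assumptions for illustration, and the permutation test for significance is omitted.

```python
from itertools import combinations


def average_rank(value, pooled):
    """Rank of value within pooled (1 = smallest); ties share the average rank."""
    less = sum(1 for v in pooled if v < value)
    equal = sum(1 for v in pooled if v == value)
    return less + (equal + 1) / 2


def anosim_r(dist, groups):
    """ANOSIM R = (mean between-group rank - mean within-group rank) / (M / 2)."""
    pairs = list(combinations(range(len(groups)), 2))
    values = [dist[i][j] for i, j in pairs]
    ranks = [average_rank(v, values) for v in values]
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    m = len(pairs)
    return (sum(between) / len(between) - sum(within) / len(within)) / (m / 2)


# Made-up dissimilarities: every within-group distance is smaller than
# every between-group distance, so the groups are fully separated.
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.85],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.85, 0.2, 0.0],
]
groups = ["a", "a", "b", "b"]
r_stat = anosim_r(dist, groups)
```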
What “indirect” gradient analyses can be used to assess environmental gradients for PCA and CA?
ANOVA
What “indirect” gradient analyses can be used to assess environmental gradients for PCoA and NMDS?
Spearman Rank Correlations
What are some other “indirect” gradient analyses? (2)
Run linear regression of variables onto existing ordination (done in R)
Can also use site symbols where size is proportional to the environmental value (good for NMDS)
What is a constrained (canonical) ordination?
A direct gradient analysis where only the variation that can be explained by the environmental variables (provided in a separate table) is displayed in the ordination
In Constrained (canonical) ordination, species abundance is usually considered a ______.
response
Constrained (canonical) ordination is usually based on _______ ______ _______ that relate axes to environmental variables
multivariate linear models
Redundancy analysis (RDA) is another _________ analysis for environmental gradients.
Direct
RDA is an extension of _____, where components are constrained to be linear combinations of environmental variables
PCA
How does RDA “explain” variation between independent and dependent variables?
uses multiple linear regression, get correlation coefficients between each species and each environmental variable
______ is similar to RDA, but uses unimodal species-environment relationships.
Canonical correspondence analysis
How is a Mantel test used?
To compare two matrices; calculates correlation coefficients between corresponding matrix positions.
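A simplified sketch of a Mantel test, assuming symmetric distance matrices, Pearson correlation over the off-diagonal entries, and a one-sided permutation p-value obtained by shuffling the rows and columns of one matrix together. Function names, the permutation count, and the example matrices are illustrative.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_r(a, b, order=None):
    """Correlate corresponding upper-triangle entries of a and (reordered) b."""
    n = len(a)
    if order is None:
        order = list(range(n))
    pairs = list(combinations(range(n), 2))
    xs = [a[i][j] for i, j in pairs]
    ys = [b[order[i]][order[j]] for i, j in pairs]
    return pearson(xs, ys)


def mantel_test(a, b, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = mantel_r(a, b)
    order = list(range(len(a)))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # permute objects, i.e. rows and columns together
        if mantel_r(a, b, order) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Made-up matrices: b is a doubled copy of a, so they correlate perfectly.
a = [[0, 1, 4, 5], [1, 0, 3, 4], [4, 3, 0, 2], [5, 4, 2, 0]]
b = [[2 * v for v in row] for row in a]
r_obs, p_value = mantel_test(a, b)
```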
What are two common diversity indices?
Shannon and Simpson
What types of diversity can be calculated based on Simpson diversity? (3)
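A short sketch of the Shannon and Simpson indices computed from species counts. Simpson's index has several conventions; this sketch assumes the 1 - D form (one minus the sum of squared proportions), and uses the natural logarithm for Shannon.

```python
import math


def proportions(counts):
    """Relative abundance of each species that is present."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def shannon(counts):
    """Shannon index H = -sum(p * ln p)."""
    return -sum(p * math.log(p) for p in proportions(counts))


def simpson(counts):
    """Simpson diversity in the 1 - D form, where D = sum(p^2)."""
    return 1 - sum(p * p for p in proportions(counts))


even_community = [5, 5, 5, 5]  # made-up counts, four equally common species
h = shannon(even_community)
d = simpson(even_community)
```

For four equally abundant species, H equals ln 4 and the 1 - D value is 0.75.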
alpha: within sample diversity
beta: between sample diversity
gamma: landscape scale diversity
What is a key difference between CA and PCA?
CA is used for categorical rather than continuous data
For CA, all data should be on the ____ scale and ________.
same; non-negative
CA decomposes the chi-squared statistic associated with a table into __________ __________.
Orthogonal factors
What is Euclidean distance?
Derived from the Pythagorean theorem; it is the “ordinary” straight-line distance between two points in Euclidean space (usually pictured in 2D or 3D, but defined for any number of dimensions)
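In code, the Pythagorean idea is the square root of the summed squared differences along each axis. A minimal sketch (the function name is illustrative) that works for any number of dimensions:

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d3 = euclidean((0, 0, 0), (1, 2, 2))        # three variables
d4 = euclidean((0, 0, 0, 0), (1, 1, 1, 1))  # four variables
```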
Bray-Curtis dissimilarity is used to:
quantify the compositional dissimilarity between sites based on counts at each site.
ANOVA stands for:
Analysis of variance
What are three assumptions of an ANOVA?
normality, independence, homogeneity of variance
What is an ANOVA used for?
To analyze differences among group means by comparing the variation between groups with the variation within groups
What is a MANOVA?
A multivariate ANOVA (ANOVA with several dependent variables)
What dissimilarity metric is chi squared distance based on?
Euclidean distance
What statistic is used by MANOVA?
Wilks' lambda, a multivariate generalization of the F-distribution used in univariate analyses
What statistic is used by ANOVA?
F statistic; the F distribution describes the distribution of the test statistic when the null hypothesis is true
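A sketch of the one-way ANOVA F statistic as the ratio of the between-group mean square to the within-group mean square. The function name and data are hypothetical, and the p-value lookup against the F distribution is left out.

```python
def one_way_f(groups):
    """F = (SS_between / (k - 1)) / (SS_within / (N - k)) for k groups, N values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up measurements
f_stat = one_way_f(groups)
```

For this data SS_between is 26 on 2 degrees of freedom and SS_within is 6 on 6 degrees of freedom, so F is about 13.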
What is Euclidean distance?
Straight-line distance between samples in Euclidean space (not limited to 3D), also called Pythagorean distance
What is Bray-Curtis dissimilarity? How can the Bray-Curtis similarity be calculated?
A common ecological metric for determining the dissimilarity between two sites based on counts at both sites; Bray-Curtis similarity can be calculated by subtracting the Bray-Curtis dissimilarity from 100 (when expressed as a percentage) or from 1 (when expressed as a proportion).
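A minimal sketch using the usual count-based formula: the sum of absolute differences divided by the sum of all counts at both sites. The result is on a 0 to 1 scale, so the similarity here is 1 minus the dissimilarity; multiply by 100 for percentages. Names and counts are illustrative.

```python
def bray_curtis(site_a, site_b):
    """Bray-Curtis dissimilarity between two sites' species counts (0 to 1)."""
    if len(site_a) != len(site_b):
        raise ValueError("both sites need counts for the same species list")
    total = sum(site_a) + sum(site_b)
    if total == 0:
        raise ValueError("dissimilarity is undefined when both sites are empty")
    return sum(abs(a - b) for a, b in zip(site_a, site_b)) / total


site_1 = [6, 7, 4]   # made-up counts for three species
site_2 = [10, 0, 6]
dissimilarity = bray_curtis(site_1, site_2)
similarity = 1 - dissimilarity
```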
When should a Kruskal-Wallis test be performed/used?
When you have one nominal variable and one ranked or scaled variable; is non-parametric so can be used in the place of a one-way ANOVA when data is not normally distributed
What type of data can a Mann-Whitney U test be performed on?
Two independent samples that do not meet parametric assumptions (not normally distributed, unequal variances)
What does the Mann-Whitney U test test?
Whether two independent groups of samples come from the same distribution; non-parametric version of the t-test, based on rank abundances
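The U statistic can be sketched by counting, over all cross-group pairs, how often a value from one group exceeds a value from the other (ties count as one half). This is equivalent to the rank-based formula. The sketch returns the smaller of the two U values and does not compute a p-value; names and data are hypothetical.

```python
def mann_whitney_u(x, y):
    """Smaller of the two U statistics for independent samples x and y."""
    # u_x counts pairs where the x value is larger; ties contribute 0.5.
    u_x = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    u_y = len(x) * len(y) - u_x
    return min(u_x, u_y)


separated = mann_whitney_u([1, 2, 3], [4, 5, 6])  # no overlap between groups
interleaved = mann_whitney_u([1, 4], [2, 3])      # groups overlap
```

A U of 0 means the two samples do not overlap at all; the largest possible value of the smaller U is half the number of pairs.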
What is the null hypothesis of a Kruskal-Wallis test?
That the mean ranks of all groups (two or more) are the same, i.e. the groups come from the same distribution
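A sketch of the Kruskal-Wallis H statistic: pool all values, rank them, and compare the rank sums of the groups. This version gives tied values their average rank but omits the tie correction factor and the chi-squared p-value; names and data are illustrative.

```python
def average_rank(value, pooled):
    """Rank of value within pooled (1 = smallest); ties share the average rank."""
    less = sum(1 for v in pooled if v < value)
    equal = sum(1 for v in pooled if v == value)
    return less + (equal + 1) / 2


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), without tie correction."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    total = 0.0
    for g in groups:
        rank_sum = sum(average_rank(v, pooled) for v in g)
        total += rank_sum ** 2 / len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # made-up, fully separated groups
h_stat = kruskal_wallis_h(groups)
```

The rank sums here are 6, 15, and 24, which gives an H of about 7.2.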