Stats Year 2 - Man Flashcards
SSR
The sum of the squares of the distances from each predicted y value to y bar
SSR + SSE
The total sum of squares (SST): the total variation of y about its mean
Regression line will always pass through (…) so if you know that and the gradient, you know the line
x bar, y bar
SSE
Sum of squares of errors
SSE = Σ(actual y − predicted y)²
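The cards above can be checked numerically. A minimal sketch with made-up data (Python/numpy rather than the course's R) fitting a least-squares line and confirming that SSR + SSE equals the total sum of squares:

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
y_hat = slope * x + intercept

sse = np.sum((y - y_hat) ** 2)           # errors: actual y vs predicted y
ssr = np.sum((y_hat - y.mean()) ** 2)    # regression: predicted y vs y bar
sst = np.sum((y - y.mean()) ** 2)        # total variation of y about its mean

# SSR + SSE = SST (up to floating-point error)
assert np.isclose(ssr + sse, sst)
```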
Linear regression measures
A presumed causal relationship between two variables
Regression analysis fits … line to a … plot
straight, scatter
Multiple linear regression is superior to classical linear regression because:
- It allows us to perform all the calculations in one go
- It determines the contribution of each independent variable
- The result is a model that can predict the DV using two or more IVs
- The model is usually good or adequate without all IVs
Limitation of multiple linear regression
The process is iterative, so each step depends on the previous one
How to run multiple linear regression?
- Make an initial model and run with lm
- Remove the variable with the highest p-value
- Repeat until all remaining variables have a p-value below 0.05
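The backward-elimination loop above can be sketched in code. The course uses R's lm(); this is an equivalent Python/numpy/scipy sketch on simulated data, with illustrative variable names (`x1`, `x2`, `noise_var` are made up):

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """OLS fit; returns two-sided t-test p-values for each coefficient."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)                     # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(beta / se), df=n - p)

def backward_eliminate(X, y, names, alpha=0.05):
    """Repeatedly drop the predictor with the highest p-value."""
    keep = list(range(X.shape[1]))
    while True:
        pvals = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(pvals[1:])) + 1            # never drop intercept
        if pvals[worst] < alpha or len(keep) == 2:
            return [names[i] for i in keep]
        keep.pop(worst)

# Simulated data: y depends on x1 and x2 but not on noise_var
rng = np.random.default_rng(0)
n = 200
x1, x2, noise_var = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
y = 3 * x1 + 2 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, noise_var])
kept = backward_eliminate(X, y, ["intercept", "x1", "x2", "noise_var"])
```

The genuine predictors survive elimination; an irrelevant variable is typically removed in the first pass.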
ANOVA test of successive models in MLR
Tells us whether there is a significant difference in the performance of successive models.
A non-significant result means the simpler model performs about as well.
A significant result means the two models are significantly different, so the removed variable mattered.
AIC
An assessment of how well the model is doing given the number of parameters it uses.
Want to minimise the AIC
Stepwise selection stops when removing any remaining variable would increase the AIC
AIC equation
AIC = 2k - 2 ln(L)
where k is the number of parameters in the model and L is the maximum likelihood of …
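The formula can be applied by hand for a Gaussian linear model. A sketch on made-up data (Python rather than R; k counts slope, intercept and the error variance):

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.8, 4.1, 5.2, 5.9])

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
n = len(y)
sigma2 = np.mean(resid ** 2)             # ML estimate of the error variance

# Maximised Gaussian log-likelihood ln(L)
log_l = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

k = 3                                    # slope, intercept, sigma^2
aic = 2 * k - 2 * log_l                  # AIC = 2k - 2 ln(L)
```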
Using AIC to compare models
ΔAIC < 2 means the models are similarly good
ΔAIC of 4-7 means the lower-AIC model is probably better
ΔAIC > 10 means there is strong evidence that the lower-AIC model is better.
Principal Component Analysis
A tool for exploring the structure of multivariate data.
PCA as a Data Reduction Technique
Allows us to reduce the number of variables to a manageable number of new variables, or components, without sacrificing too much information
What do the new variables represent in PCA
The variance in the data set
If the data represents measurements from model cars that are all exactly to scale but have different sizes, then the only variable would be size.
Additional differences cause additional variables
Limitations of PCA
Not a test, hence no hypotheses
Categorical variables cannot be used.
Missing values cannot be accommodated
Assumption of PCA
Assumes variables are continuous or on an interval scale
Two types of PCA
Covariance or correlation matrix
Covariance matrix
Applies more weight to some variables than others depending on their variance.
Use when scales are similar.
Correlation matrix
Effectively expresses each variable (column of matrix) in standard deviation units, giving each variable equal weight
Standardised measurement for PCA
Standardised measurement = (measurement - mean of that type of measurement)/SD of that type of measurement
This is the same as the Z score
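Standardising explains the link between the two PCA types: the covariance matrix of Z-scored data is the correlation matrix of the original data, which is why correlation-matrix PCA gives each variable equal weight. A numpy sketch on made-up data with very different scales:

```python
import numpy as np

# Made-up data: three variables on wildly different scales
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])

# Z-score each column: (measurement - column mean) / column SD
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance of the standardised data == correlation of the original data,
# so correlation-matrix PCA weights every variable equally
assert np.allclose(np.cov(Z.T), np.corrcoef(X.T))
```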
The first PC expresses the … portion of the variance, the second the next … and so on
Biggest
What to look for in a scree plot?
An obvious elbow which tells us which PC are the most important
Best way to visualise what’s happening on the first two PCs is by …
Using a biplot.
Variables plotted closely together are closely correlated.
Arrows in a biplot indicate…
…the contribution of each variable to each component
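The PCA cards above can be illustrated with a small eigendecomposition of the correlation matrix (numpy only, made-up data where two variables share a common "size" factor). The eigenvalues give each PC's share of the variance, ordered largest first, and the eigenvector entries are the loadings drawn as arrows in a biplot:

```python
import numpy as np

# Made-up data: x1 and x2 both driven by "size", x3 unrelated
rng = np.random.default_rng(2)
size = rng.normal(10, 2, size=200)
X = np.column_stack([size + rng.normal(0, 0.3, 200),
                     2 * size + rng.normal(0, 0.3, 200),
                     rng.normal(0, 1, 200)])

R = np.corrcoef(X.T)
eigvals, eigvecs = np.linalg.eigh(R)                 # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # PC1 first

share = eigvals / eigvals.sum()   # variance explained per PC, biggest first
# Columns of eigvecs are the loadings (the biplot arrows) for each PC
```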
Hierarchical, Agglomerative Cluster Analysis
A technique for exploring the structure of complex multivariate data.
Aims to find groups of objects within the data and represent them graphically.
Three principal stages in classical clustering
- Transforming or scaling the data
- Producing a triangular distance matrix between all possible pairs of objects
- Making clusters.
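The three stages can be sketched with scipy (hypothetical 2-D points; "average" linkage is just one way of defining distances between groups):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical points: two obvious groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# Stage 1: scaling (columns here are already comparable)
# Stage 2: triangular (condensed) distance matrix between all pairs of objects
d = pdist(X)
# Stage 3: making clusters; "average" linkage defines group distances
Z = linkage(d, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut dendrogram at 2 clusters
```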
Transformation in clustering
PCA of correlations effectively rescales the variables to unit variance so no single variable overpowers the model.
Usual procedure is to convert each variable to a set of Z scores.
Distance matrix
Gives the distance between each object and any other
Making clusters
Methods for this diverge based on how differences between groups are defined.
Strengths of clustering
Variables can be of any type provided the distance metric is appropriate.
Many procedures allow for missing values
Cophenetic Correlation
Simplest measure of how well a particular dendrogram fits the distance matrix
Correlation between distance matrix and matrix of distances between clusters as shown in dendrogram
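A sketch of the cophenetic correlation with scipy on hypothetical data: with two tight, well-separated groups, the dendrogram reproduces the distance matrix closely and the coefficient is near 1:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

# Hypothetical data: two tight, well-separated groups
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.2, (5, 2)), rng.normal(4, 0.2, (5, 2))])

d = pdist(X)                       # triangular distance matrix
Z = linkage(d, method="average")
c, coph_dists = cophenet(Z, d)     # c = cophenetic correlation coefficient
```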
Limitations of clustering
Depends on how clusters are defined
Depends on how distances between clusters are defined.
Height on dendrogram
Indicates how dissimilar two variables (or clusters) are at the point where they join.
Lower height of horizontal connector = more similar
Cophenetic function
Creates a triangular distance matrix from the dendrogram.
Values are repeated because the cophenetic distance between two objects is the height at which their clusters merge, not their original pairwise distance.