Stats Year 2 - Man Flashcards
SSR
The sum of the squares of the distances from each predicted y value to y bar
SSR + SSE
The total sum of squares (SST): the total variation of y about its mean
Regression line will always pass through (…) so if you know that and the gradient, you know the line
x bar, y bar
SSE
Sum of squares of errors
SSE = Σ(actual y − predicted y)²
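The cards above can be checked numerically. A minimal sketch with made-up data (Python/numpy rather than the course's R) fitting a least-squares line and confirming that SSR + SSE equals the total sum of squares:

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
y_hat = slope * x + intercept

sse = np.sum((y - y_hat) ** 2)           # errors: actual y vs predicted y
ssr = np.sum((y_hat - y.mean()) ** 2)    # regression: predicted y vs y bar
sst = np.sum((y - y.mean()) ** 2)        # total variation of y about its mean

# SSR + SSE = SST (up to floating-point error)
assert np.isclose(ssr + sse, sst)
```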
Linear regression measures
A presumed causal relationship between two variables
Regression analysis fits … line to a … plot
straight, scatter
Multiple linear regression is superior to classical linear regression because:
- It allows us to perform all the calculations in one go
- It determines the contribution of each independent variable
- The result is a model that can predict the DV using two or more IVs
- The model is usually good or adequate without all IVs
Limitation of multiple linear regression
The process is iterative, so each step depends on the previous one
How to run multiple linear regression?
- Make an initial model and run with lm
- Remove the variable with the highest p-value
- Repeat until all remaining variables have a p-value below 0.05
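The backward-elimination loop above can be sketched in code. The course uses R's lm(); this is an equivalent Python/numpy/scipy sketch on simulated data, with illustrative variable names (`x1`, `x2`, `noise_var` are made up):

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """OLS fit; returns two-sided t-test p-values for each coefficient."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)                     # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(beta / se), df=n - p)

def backward_eliminate(X, y, names, alpha=0.05):
    """Repeatedly drop the predictor with the highest p-value."""
    keep = list(range(X.shape[1]))
    while True:
        pvals = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(pvals[1:])) + 1            # never drop intercept
        if pvals[worst] < alpha or len(keep) == 2:
            return [names[i] for i in keep]
        keep.pop(worst)

# Simulated data: y depends on x1 and x2 but not on noise_var
rng = np.random.default_rng(0)
n = 200
x1, x2, noise_var = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
y = 3 * x1 + 2 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, noise_var])
kept = backward_eliminate(X, y, ["intercept", "x1", "x2", "noise_var"])
```

The genuine predictors survive elimination; an irrelevant variable is typically removed in the first pass.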
ANOVA test of successive models in MLR
Tells us whether there is a significant difference in the performance of successive models.
A non-significant result means the simpler model performs about as well.
A significant result means the two models are significantly different, so the removed variable mattered.
AIC
An assessment of how well the model is doing given the number of parameters it uses.
Want to minimise the AIC
Stepwise selection stops when removing any remaining variable would increase the AIC
AIC equation
AIC = 2k - 2 ln(L)
where k is the number of parameters in the model and L is the maximum likelihood of …
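The formula can be applied by hand for a Gaussian linear model. A sketch on made-up data (Python rather than R; k counts slope, intercept and the error variance):

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.8, 4.1, 5.2, 5.9])

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
n = len(y)
sigma2 = np.mean(resid ** 2)             # ML estimate of the error variance

# Maximised Gaussian log-likelihood ln(L)
log_l = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

k = 3                                    # slope, intercept, sigma^2
aic = 2 * k - 2 * log_l                  # AIC = 2k - 2 ln(L)
```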
Using AIC to compare models
ΔAIC < 2 means the models are similarly good
ΔAIC of 4-7 means the lower-AIC model is probably better
ΔAIC > 10 means there is strong evidence that the lower-AIC model is better.
Principal Component Analysis
A tool for exploring the structure of multivariate data.
PCA as a Data Reduction Technique
Allows us to reduce the number of variables to a manageable number of new variables, or components, without sacrificing too much information
What do the new variables represent in PCA
The variance in the data set
If the data represents measurements from model cars that are all exactly to scale but have different sizes, then the only variable would be size.
Additional differences cause additional variables
Limitations of PCA
Not a test, hence no hypotheses
Categorical variables cannot be used.
Missing values cannot be accommodated
Assumption of PCA
Assumes variables are continuous or on an interval scale
Two types of PCA
Covariance or correlation matrix
Covariance matrix
Applies more weight to some variables than others depending on their variance.
Use when scales are similar.
Correlation matrix
Effectively expresses each variable (column of matrix) in standard deviation units, giving each variable equal weight
Standardised measurement for PCA
Standardised measurement = (measurement - mean of that type of measurement)/SD of that type of measurement
This is the same as the Z score
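Standardising explains the link between the two PCA types: the covariance matrix of Z-scored data is the correlation matrix of the original data, which is why correlation-matrix PCA gives each variable equal weight. A numpy sketch on made-up data with very different scales:

```python
import numpy as np

# Made-up data: three variables on wildly different scales
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])

# Z-score each column: (measurement - column mean) / column SD
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance of the standardised data == correlation of the original data,
# so correlation-matrix PCA weights every variable equally
assert np.allclose(np.cov(Z.T), np.corrcoef(X.T))
```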
The first PC expresses the … portion of the variance, the second the next … and so on
Biggest
What to look for in a scree plot?
An obvious elbow which tells us which PC are the most important
Best way to visualise what’s happening on the first two PCs is by …
Using a biplot.
Variables plotted closely together are closely correlated.
Arrows in a biplot indicate…
…the contribution of each variable to each component
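The PCA cards above can be illustrated with a small eigendecomposition of the correlation matrix (numpy only, made-up data where two variables share a common "size" factor). The eigenvalues give each PC's share of the variance, ordered largest first, and the eigenvector entries are the loadings drawn as arrows in a biplot:

```python
import numpy as np

# Made-up data: x1 and x2 both driven by "size", x3 unrelated
rng = np.random.default_rng(2)
size = rng.normal(10, 2, size=200)
X = np.column_stack([size + rng.normal(0, 0.3, 200),
                     2 * size + rng.normal(0, 0.3, 200),
                     rng.normal(0, 1, 200)])

R = np.corrcoef(X.T)
eigvals, eigvecs = np.linalg.eigh(R)                 # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # PC1 first

share = eigvals / eigvals.sum()   # variance explained per PC, biggest first
# Columns of eigvecs are the loadings (the biplot arrows) for each PC
```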
Hierarchical, Agglomerative Cluster Analysis
A technique for exploring the structure of complex multivariate data.
Aims to find groups of objects within the data and represent them graphically.
Three principal stages in classical clustering
- Transforming or scaling the data
- Producing a triangular distance matrix between all possible pairs of objects
- Making clusters.
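The three stages can be sketched with scipy (hypothetical 2-D points; "average" linkage is just one way of defining distances between groups):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical points: two obvious groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# Stage 1: scaling (columns here are already comparable)
# Stage 2: triangular (condensed) distance matrix between all pairs of objects
d = pdist(X)
# Stage 3: making clusters; "average" linkage defines group distances
Z = linkage(d, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut dendrogram at 2 clusters
```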
Transformation in clustering
PCA of correlations effectively rescales the variables to unit variance so no single variable overpowers the model.
Usual procedure is to convert each variable to a set of Z scores.
Distance matrix
Gives the distance between each object and any other
Making clusters
Methods for this diverge based on how differences between groups are defined.
Strengths of clustering
Variables can be of any type provided the distance metric is appropriate.
Many procedures allow for missing values
Cophenetic Correlation
Simplest measure of how well a particular dendrogram fits the distance matrix
Correlation between distance matrix and matrix of distances between clusters as shown in dendrogram
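A sketch of the cophenetic correlation with scipy on hypothetical data: with two tight, well-separated groups, the dendrogram reproduces the distance matrix closely and the coefficient is near 1:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

# Hypothetical data: two tight, well-separated groups
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.2, (5, 2)), rng.normal(4, 0.2, (5, 2))])

d = pdist(X)                       # triangular distance matrix
Z = linkage(d, method="average")
c, coph_dists = cophenet(Z, d)     # c = cophenetic correlation coefficient
```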
Limitations of clustering
Depends on how clusters are defined
Depends on how distances between clusters are defined.
Height on dendrogram
Indicates how dissimilar two variables (or clusters) are at the point where they join.
Lower height of horizontal connector = more similar
Cophenetic function
Creates a triangular distance matrix from the dendrogram.
Values are repeated because the cophenetic distance between two objects is the height at which their clusters merge, not their original pairwise distance.