Stats Year 2 - Man Flashcards

1
Q

SSR

A

The sum of the squares of the distances from each predicted value (y hat) to y bar: sum((predicted y - y bar) squared)

2
Q

SSR + SSE

A

The total variation in y (the total sum of squares, SST): sum((actual y - y bar) squared)

3
Q

Regression line will always pass through (…) so if you know that and the gradient, you know the line

A

x bar, y bar

4
Q

SSE

A

Sum of squares of errors (residuals)
sum((actual y - predicted y) squared)
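These sums of squares are easy to check numerically. A minimal Python sketch, using made-up data points, that fits the least-squares line by hand, confirms SSR + SSE = SST, and confirms the fitted line passes through (x bar, y bar):

```python
# Minimal numeric check of SSE, SSR and SST for a simple linear fit.
# The data points are invented for illustration.
x = [1, 2, 3, 4]
y = [2, 4, 5, 7]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares gradient and intercept
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]                        # predicted values

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # regression sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares

print(sse, ssr, sst)   # SSR + SSE equals SST
```

The same decomposition holds for any least-squares line, which is why SSR + SSE gives the total variation in y.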

5
Q

Linear regression measures

A

How one variable (the DV) depends on another (the IV); strictly, it measures association, not proof of causation

6
Q

Regression analysis fits … line to a … plot

A

straight, scatter

7
Q

Multiple linear regression is superior to classical linear regression because:

A
  1. It allows us to perform all the calculations in one go
  2. It determines the contribution of each independent variable
  3. The result is a model that can predict the DV using two or more IVs
  4. The model is usually good or adequate without all IVs
8
Q

Limitation of multiple linear regression

A

The process is iterative, so each step depends on the previous one

9
Q

How to run multiple linear regression?

A
  1. Make an initial model and run with lm
  2. Remove the variable with the highest p value
  3. Repeat until all variables have a p value less than 0.05
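The loop above can be sketched in code. This is a hypothetical Python version using ordinary least squares; the course itself would use R's lm, and the variable names and data here are invented:

```python
# Backward elimination sketch: fit, drop the variable with the highest
# p-value, refit, and stop when every remaining p-value is below 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                   # pure noise, unrelated to y
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

def p_values(X, y):
    """Two-sided p-values for each OLS coefficient (including intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / df
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(beta / se), df)

cols = ["x1", "x2"]
data = {"x1": x1, "x2": x2}
while True:
    X = np.column_stack([np.ones(n)] + [data[c] for c in cols])
    p = p_values(X, y)[1:]                # skip the intercept
    worst = int(np.argmax(p))
    if p[worst] < 0.05 or len(cols) == 1:
        break                             # all remaining variables significant
    cols.pop(worst)                       # remove the highest p-value variable

print(cols)
```

In R, each iteration would be a call to lm() followed by reading the p-values from summary().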
10
Q

ANOVA test of successive models in MLR

A

Tells us whether there is a significant difference in the performance of successive models.
A non-significant result means the simpler model performs about as well, so the dropped variable was not needed.
A significant result means the two models really do differ in performance.

11
Q

AIC

A

An assessment of how well the model is doing given the number of parameters (IVs).
Want to minimise the AIC
Stepwise selection stops when removing any variable would increase the AIC

12
Q

AIC equation

A

AIC = 2k - 2 ln(L)

where k is the number of parameters in the model and L is the maximised value of the likelihood function
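A quick numeric check of the formula, using a deliberately simple model as an illustration: a Bernoulli model for 7 successes in 10 invented trials, where the likelihood is maximised at p hat = 0.7 and there is k = 1 parameter.

```python
import math

# AIC = 2k - 2 ln(L) for a Bernoulli model of 7 successes in 10 trials.
successes, trials = 7, 10
p_hat = successes / trials                 # maximum-likelihood estimate
log_l = successes * math.log(p_hat) + (trials - successes) * math.log(1 - p_hat)
k = 1                                      # one fitted parameter
aic = 2 * k - 2 * log_l
print(round(aic, 4))
```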

13
Q

Using AIC to compare models

A

A difference in AIC (delta AIC) < 2 means the models are similarly good
Delta AIC of 4-7 means one model is probably better
Delta AIC > 10 means there is strong evidence that the lower-AIC model is better.

14
Q

Principal Component Analysis

A

A tool for exploring the structure of multivariate data.

15
Q

PCA as a Data Reduction Technique

A

Allows us to reduce the number of variables to a manageable number of new variables or components without sacrificing too much information

16
Q

What do the new variables represent in PCA

A

The variance in the data set.
If the data represent measurements from model cars that are all exactly to scale but have different sizes, then the only variable would be size.
Additional differences introduce additional components.

17
Q

Limitations of PCA

A

Not a test, hence no hypotheses
Categorical variables cannot be used.
Missing values cannot be accommodated

18
Q

Assumption of PCA

A

Assumes variables are continuous or on an interval scale

19
Q

Two types of PCA

A

Covariance or correlation matrix

20
Q

Covariance matrix

A

Applies more weight to some variables than others depending on their variance.
Use when scales are similar.

21
Q

Correlation matrix

A

Effectively expresses each variable (column of matrix) in standard deviation units, giving each variable equal weight
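The difference between the two types can be illustrated with numpy. This sketch uses invented two-variable data where one variable sits on a much larger scale, so it dominates the covariance-based version:

```python
import numpy as np

# PCA on the covariance matrix vs the correlation matrix.
# 'height' has a much larger variance than 'mass', so it dominates
# the covariance-based eigenvalues; the correlation matrix gives
# each variable equal weight.
rng = np.random.default_rng(1)
height = rng.normal(170, 10, size=50)             # e.g. centimetres
mass = 0.4 * height + rng.normal(0, 2, size=50)   # correlated, smaller scale
X = np.column_stack([height, mass])

cov_eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
cor_eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))

# Proportion of variance explained by each component
# (this is what a scree plot displays)
print(cov_eigvals / cov_eigvals.sum())
print(cor_eigvals / cor_eigvals.sum())
```

Note the correlation-matrix eigenvalues always sum to the number of variables, since each standardised variable contributes exactly one unit of variance.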

22
Q

Standardised measurement for PCA

A

Standardised measurement = (measurement - mean of that type of measurement)/SD of that type of measurement

This is the same as the Z score
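A minimal sketch of the formula using the standard library and invented measurements. Note that pstdev is the population SD; the sample SD (stdev) is also commonly used, which would change the values slightly.

```python
from statistics import mean, pstdev

# Standardised measurement = (measurement - mean) / SD, i.e. the Z score.
measurements = [2, 4, 4, 4, 5, 5, 7, 9]
m, sd = mean(measurements), pstdev(measurements)   # mean 5, population SD 2
z_scores = [(x - m) / sd for x in measurements]
print(z_scores)
```

By construction the Z scores have mean 0 and (population) SD 1, which is what gives every variable equal weight in a correlation-matrix PCA.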

23
Q

The first PC expresses the … portion of the variance, the second the next … and so on

A

Biggest

24
Q

What to look for in a scree plot?

A

An obvious elbow, which tells us which PCs are the most important

25
Q

Best way to visualise what’s happening on the first two PCs is by …

A

Using a biplot.
Variables plotted closely together are closely correlated.

26
Q

Arrows in a biplot indicate…

A

…the contribution of each variable to each component

27
Q

Hierarchical, Agglomerative Cluster Analysis

A

A technique for exploring the structure of complex multivariate data.
Aims to find groups of objects within the data and represent them graphically.

28
Q

Three principal stages in classical clustering

A
  1. Transforming or scaling the data
  2. Producing a triangular distance matrix between all possible pairs of objects
  3. Making clusters.
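The three stages can be sketched with scipy on invented data. Average linkage is used here purely as an example; it is only one of the possible ways of defining distances between clusters.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))              # 6 objects measured on 3 variables

# 1. Transform/scale: convert each variable to Z scores
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Triangular (condensed) distance matrix between all pairs of objects
d = pdist(Z, metric="euclidean")         # 6 * 5 / 2 = 15 pairwise distances

# 3. Make clusters; each row of the result records one merge step
merges = linkage(d, method="average")
print(merges.shape)                      # (n - 1, 4) merge steps
```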
29
Q

Transformation in clustering

A

PCA of correlations effectively rescales the variables to unit variance so no single variable overpowers the model.
Usual procedure is to convert each variable to a set of Z scores.

30
Q

Distance matrix

A

Gives the distance between each object and any other

31
Q

Making clusters

A

Methods for this diverge based on how differences between groups are defined.

32
Q

Strengths of clustering

A

Variables can be of any type provided the distance metric is appropriate.
Many procedures allow for missing values

33
Q

Cophenetic Correlation

A

The simplest measure of how well a particular dendrogram fits the distance matrix.
The correlation between the original distance matrix and the matrix of cophenetic distances (the distances between clusters as shown in the dendrogram).
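This correlation can be computed directly with scipy's cophenet function, again on invented data: it correlates the original pairwise distances with the heights at which the dendrogram joins each pair of objects.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))          # 8 objects, 2 variables

d = pdist(X)                         # original triangular distance matrix
Z = linkage(d, method="average")     # build the dendrogram

# cophenet returns the correlation and the cophenetic distances themselves
c, coph_d = cophenet(Z, d)
print(round(c, 3))
```

A value of c close to 1 means the dendrogram preserves the original distances well.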

34
Q

Limitations of clustering

A

Depends on how clusters are defined
Depends on how distances between clusters are defined.

35
Q

Height on dendrogram

A

Indicates the degree of similarity between two variables.
Lower height of horizontal connector = more similar

36
Q

Cophenetic function

A

Creates a triangular distance matrix from the dendrogram.
Values will be repeated because the dendrogram records only the heights at which clusters join, not the individual distances between their members.