Multivariate Analysis Flashcards

1
Q

rows = ______

columns = ______

A

observations

variables

2
Q

Multicollinearity

A

• large datasets are often composed of multiple variables or cases that are similar, and therefore could be considered redundant
• for instance, income data could be broken down into a large number of subcategories, each being a separate variable
• many of these subcategories could be highly correlated with each other, such that the data from one variable could explain the data from another
• in this case the second variable is redundant and unnecessary
• the same situation holds for observations that are similar
• if 2 census tracts behave similarly, they could be treated as a single census tract without losing too much information

3
Q

Data reduction is ______

and the two main ways to do it are ______

A

Combining similar variables and observations into factors and clusters, to reduce the size of the dataset

1. factor analysis: groups together variables that behave similarly and produces factors; each factor is composed of one or more variables
2. cluster analysis: groups together observations that behave similarly and produces clusters; each cluster is composed of one or more observations

4
Q

Factor Analysis

A

• in large datasets, where many of the variables are correlated with each other, factor analysis can be used to reduce or collapse the large number of variables into a small number of factors
• in this case, the word 'factor' is equivalent to 'variable', and a factor could be considered a group of similar variables
• identifying the factors, and deciding how many of them are important, is where an understanding of factor analysis matters

5
Q

the common factors or unique components of a factor analysis are ______

A

length, width, and depth

6
Q

Components must be perpendicular and therefore perfectly _______ (r = 0)

A

uncorrelated

7
Q

Standardization

A

When you have large datasets with variables measured in different units or at different scales, standardization is the best way to put them on common ground, i.e. the same scale (sketched below)
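A minimal sketch of what standardization does, in Python with NumPy; the three census-style variables and their values are invented for illustration:

    import numpy as np

    # hypothetical data: income ($), household size (people), median age (years)
    data = np.array([
        [52000.0, 3.1, 34.0],
        [61000.0, 2.4, 41.0],
        [38000.0, 4.0, 29.0],
    ])

    # z-score standardization: subtract each column's mean and divide by its
    # standard deviation, so every variable is measured in the same unit
    # (standard deviations from the mean)
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    print(z.round(2))  # each column now has mean 0 and standard deviation 1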

8
Q

In a factor analysis, how do we know which variables behave the same?

A

Principal component analysis

9
Q

PCA VS. FA

A

• principal components analysis vs. factor analysis (PCA vs. FA)
• PCA decomposes the variation in a dataset of p variables into p principal components
• these components are linear combinations of the original variables
-PCA ends with the same number of components as variables

• FA models the variability in the dataset using a reduced number of factors (k < p)
• these factors are linear combinations of the original variables plus an error component, known as a uniqueness
• PCA is done as part of FA, and FA then uses a selection of the most important principal components
-FA ends with fewer components than it started with (see the sketch below)
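A rough Python sketch of the contrast, using scikit-learn; the random dataset is a placeholder, not from the cards:

    import numpy as np
    from sklearn.decomposition import PCA, FactorAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))          # n = 100 observations, p = 10 variables

    pca = PCA().fit(X)                      # PCA keeps all p components
    print(len(pca.explained_variance_))     # -> 10: same number as the variables

    fa = FactorAnalysis(n_components=3).fit(X)  # FA models the data with k < p factors
    print(fa.components_.shape)             # -> (3, 10): 3 factors, each a combination
                                            #    of the 10 original variables
    print(fa.noise_variance_.shape)         # -> (10,): one uniqueness per variable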

10
Q

eigenvalues

A

the axes are defined by their lengths, which are known as eigenvalues (a long axis has a high eigenvalue…)

11
Q

PCA

A

PCA essentially reorganizes the data, sorting it from where the most important variation is to where the least important is

• in this example, we have 10 variables, which means we are working in a 10-dimensional space and we should find 10 perpendicular axes for the ellipse
• the axes are defined by their lengths, which are known as eigenvalues (a long axis has a high eigenvalue…)
• PCA (and factor analysis as a whole) requires software, so there is no mathematics to worry about here, only outputs from programs like SPSS
• depending on the specific software, the eigenvalues may also be called "sums of squared loadings"
• the process of defining the components is known as extraction, and the extraction method used in factor analysis is called principal components (see the sketch below)
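A small Python sketch of the extraction idea: on standardized data, the eigenvalues of the correlation matrix are the component "lengths" that SPSS reports. The data here is random, purely to show the shapes:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 10))             # 10 variables -> a 10-dimensional space

    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize first
    R = np.corrcoef(Z, rowvar=False)           # 10 x 10 correlation matrix

    eigenvalues = np.linalg.eigvalsh(R)[::-1]  # one eigenvalue per perpendicular axis,
                                               # sorted from longest axis to shortest
    print(eigenvalues.round(2))
    print(eigenvalues.sum().round(2))          # eigenvalues of R sum to p (= 10)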

12
Q

PCA is done when…

A

we have identified the length of each of the 10 axes

13
Q

FA: to find which components from the PCA are important, we can use one of 2 criteria:

A

-they have to have an eigenvalue > 1, and those components become the factors
or
-examine the scree plot and find the inflection point which separates the steep part from the flat part (see the sketch below)
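A short Python sketch of both selection rules; the eigenvalues are invented so that 3 of them exceed 1:

    import numpy as np
    import matplotlib.pyplot as plt

    # hypothetical eigenvalues from a 10-variable PCA
    eigenvalues = np.array([3.9, 2.1, 1.3, 0.8, 0.6, 0.4, 0.3, 0.25, 0.2, 0.15])

    # rule 1 (Kaiser criterion): keep components with eigenvalue > 1;
    # those become the factors
    print((eigenvalues > 1).sum())              # -> 3

    # rule 2: scree plot; look for the inflection point between steep and flat
    plt.plot(range(1, 11), eigenvalues, "o-")
    plt.xlabel("component")
    plt.ylabel("eigenvalue")
    plt.show()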

14
Q

factor loadings

A

the correlations between the chosen factors and the original variables

-tells you how much of a role each variable plays within the different factors

15
Q

Uniqueness

A

• by extracting 3 of the 10 components, we have necessarily lost some of the information: the 3 factors only account for 72.638% of the variance in the dataset
• the rest of the variability (27.362%) remains unexplained, and is called the uniqueness
• uniqueness can be quantified by examining the communalities (see the sketch below)
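A minimal sketch of how the two quantities relate: a variable's communality is the sum of its squared loadings across the retained factors, and its uniqueness is whatever is left. The loadings below are invented:

    import numpy as np

    # rows = 3 of the original variables, columns = the 3 retained factors
    loadings = np.array([
        [0.85, 0.10, 0.05],
        [0.20, 0.75, 0.30],
        [0.05, 0.15, 0.90],
    ])

    communalities = (loadings ** 2).sum(axis=1)  # variance of each variable explained
    uniqueness = 1 - communalities               # variance left unexplained
    print(communalities.round(3))
    print(uniqueness.round(3))                   # low communality -> high uniqueness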

16
Q

5 parts of factor analysis

A
  1. PCA
  2. Isolate significant components
  3. Factor Loadings
  4. Determine uniqueness of selected components
  5. Factor/Component scores
17
Q

Using factor analysis data in regression?

A

the factor scores can be used in regression analysis
• instead of including all the variables in a regression model, we could include the factors
• this has the added benefit of reducing the multicollinearity to 0
• remember, the components and factors are perpendicular to each other and therefore perfectly uncorrelated (so the independent variables, the factors in this case, really are independent)
• however, interpreting the regression coefficients, an important part of regression analysis, is much more difficult, since the factors are combinations of variables and in some cases can be very abstract (see the sketch below)
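A sketch of the idea in Python with scikit-learn; X and y are random placeholders, not a real dataset:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(2)
    X = rng.normal(size=(150, 10))              # the 10 original variables
    y = rng.normal(size=150)                    # some response of interest

    # regress on the 3 factor scores instead of the 10 original variables
    scores = FactorAnalysis(n_components=3).fit_transform(X)
    model = LinearRegression().fit(scores, y)
    print(model.coef_)  # harder to interpret: each coefficient now belongs to a
                        # factor, i.e. an abstract blend of the original variables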

18
Q

the primary goal of cluster analysis is to

A

the primary goal of cluster analysis is to minimize the within-group variability and maximize the between-group variability
-this means that the observations in a cluster will be very similar amongst themselves, but will be very different from any other group

19
Q

there are 2 primary methods of cluster analysis:

A

• agglomerative or hierarchical
• this method starts with each observation being its own cluster; then 2 observations are combined into 1 cluster, leaving n-1 clusters; this process continues until all of the observations are combined into 1

• non-agglomerative or non-hierarchical
• this method begins with a decision to make k clusters; the process of assigning an individual observation to a specific cluster requires software, since it is a complex and iterative calculation
• the non-hierarchical approach is preferred because it is easier; in SPSS you can choose either, listed as hierarchical cluster analysis and k-means cluster analysis (both are sketched below)
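A rough Python sketch of both routes, using SciPy for the hierarchical method and scikit-learn for k-means; the data is random, as a stand-in:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    X = rng.normal(size=(30, 4))                # 30 observations, 4 variables

    # hierarchical: start from 30 singleton clusters and merge step by step
    merges = linkage(X, method="ward")
    hier_labels = fcluster(merges, t=3, criterion="maxclust")  # cut the tree at 3

    # non-hierarchical: decide on k up front, then assign iteratively
    km_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
    print(hier_labels)
    print(km_labels)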

20
Q

Standardization

A

• when the data is not standardized, the results of the cluster analysis will be dependent on the units and relative size of the values for each variable, which is not what we're looking for

21
Q

z-score

A

• these results are z-scores, such that cluster 1 has a centre point with above-average everything (all variables are +, and therefore above the mean), while cluster 2 has below-average everything (all variables are -, and therefore below the mean)

-z-score = the number of standard deviations from the mean

22
Q

Low communality means _____ uniqueness

A

high

23
Q

unlike many of the statistical approaches already discussed, cluster analysis has few assumptions: 2

A
  1. representativeness: it is up to the researcher to ensure that the sample on which the analysis is performed is representative of the population
  2. multicollinearity: if several variables are correlated, they will be more likely to generate clusters, potentially resulting in a false number of clusters and poor data structure
24
Q

which is better – hierarchical or k-means clustering?

A

• hierarchical is easier to implement and interpret, but becomes much more difficult as the sample size increases
• k-means clustering suffers from requiring a predefined number of clusters; this requires several iterations of the analysis for different values of k, with the researcher ultimately choosing one of the scenarios
• it is possible to combine the approaches (sketched below):
• start with a hierarchical clustering approach to help define how many clusters exist and approximately where their centres occur
• use k-means clustering with the defined k value to discern cluster membership
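A sketch of that combined workflow in Python; the data is a random placeholder and the choice of k = 3 is arbitrary here:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    X = rng.normal(size=(50, 4))

    # step 1: hierarchical clustering suggests how many clusters exist
    # and roughly where their centres occur
    labels = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")
    centres = np.array([X[labels == c].mean(axis=0) for c in (1, 2, 3)])

    # step 2: k-means, seeded with the hierarchical centres, refines membership
    final = KMeans(n_clusters=3, init=centres, n_init=1).fit_predict(X)
    print(final)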