Multivariate Analysis Flashcards
rows = ______
columns = ______
observations
variables
Multicollinearity
however, large datasets are often composed of multiple variables or cases that are
similar, and therefore could be considered redundant
• for instance, income data could be broken down into a large number of
subcategories, each being a separate variable
• but many of these subcategories could be highly correlated with each other such
that the data from one variable could explain the data from the other
• in this case the second variable is redundant and unnecessary
• the same situation holds for observations that are similar
• if 2 census tracts behave similarly, they could be treated as a single census
tract without losing too much information
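As a sketch of how redundancy between two variables shows up numerically, the Pearson correlation between two hypothetical income subcategories (made-up numbers, purely illustrative) will be close to 1 when one variable largely explains the other:

```python
# Hypothetical data: two income subcategories that move together.
wages = [30, 45, 50, 62, 80, 95]
salaries = [28, 44, 52, 60, 83, 92]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson_r(wages, salaries)
print(round(r, 3))  # very close to 1 -> the second variable is largely redundant
```

A correlation near 1 (or -1) is the numerical signal that keeping both variables adds little information.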
Data reduction is
and the two main ways to do so
combining similar variables and similar observations together to reduce the size of the dataset
1• factor analysis: this groups together variables that behave similarly and produces
factors – each factor is composed of one or more variables
-produces factors
2• cluster analysis: this groups together observations that behave similarly and
produces clusters – each cluster is composed of one or more observations
-produces clusters
Factor Analysis
• in large datasets, where many of the variables are correlated with each other, factor
analysis can be used to reduce or collapse the large number of variables into a small
number of factors
• in this case, the word ‘factor’ is equivalent to ‘variable’, and a factor could be
considered a group of similar variables
• the identification of the factors and how many factors are important is where
understanding factor analysis is important
the common factors or unique components of a factor analysis are
length, width, and depth
Components must be perpendicular and therefore perfectly _______ (r = 0)
unrelated
Standardization
When you have large datasets with variables measured in different units or at different scales, STANDARDIZATION is the best way to put them on the same level, a common ground
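A minimal sketch of the usual standardization (z-scores), with hypothetical values: each variable is rescaled to mean 0 and standard deviation 1, so income in dollars and age in years end up on the same unitless scale:

```python
def standardize(values):
    """Rescale a list of values to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

incomes = [20000, 35000, 50000, 65000, 80000]  # dollars (hypothetical)
ages = [22, 31, 40, 49, 58]                    # years (hypothetical)

z_incomes = standardize(incomes)
z_ages = standardize(ages)
# Both variables now have mean 0 and variance 1, so they are directly comparable.
```

After standardizing, correlations (and therefore factor analysis) are not distorted by the original measurement units.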
In a factor analysis, how do we know which variables behave the same?
Principal components analysis
PCA VS. FA
• principal components analysis vs factor analysis (PCA vs FA)
• PCA decomposes the variation in the data set of p variables into p principal
components
• these components are linear combinations of the original variables
-PCA ends with the same number of components as original variables
• FA models the variability in the dataset using a reduced number of factors (k < p)
• these factors are linear combinations of the original variables plus an error
component, known as a uniqueness
• PCA is done as part of FA, and FA then uses a selection of the most important principal
components
-FA ends with fewer components than it started with
eigenvalues
the axes are defined by their length, which are known as eigenvalues (a long axis has a
high eigenvalue…)
PCA
PCA essentially reorganizes the data and sorts them by where the most important variation is versus the least important
• in this example, we have 10 variables, which means we are working in a 10-dimensional
space and we should find 10 perpendicular axes for the ellipse
• the axes are defined by their length, which are known as eigenvalues (a long axis has a
high eigenvalue…)
• PCA (and factor analysis as a whole) requires software, so there are no mathematics to
worry about here, only outputs from programs like SPSS
• depending on the specific software, the eigenvalues may also be called “sum of
square loadings”
• the process of defining the components is known as extraction, and the extraction
process in factor analysis is called principal components
PCA is done when...
we have identified the length of each of the 10 axes
FA: to find which components from the PCA are important, we have to use 2 criteria
-they have to have an eigenvalue of > 1, and those become the factors
Or
-examine the scree plot and find the inflection point, which separates the steep part from the flat part
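The eigenvalue > 1 rule (the Kaiser criterion) can be sketched with hypothetical eigenvalues such as software like SPSS might report for 10 standardized variables (they sum to 10, the number of variables):

```python
# Hypothetical eigenvalues for 10 components, sorted from largest to smallest.
eigenvalues = [3.9, 2.1, 1.3, 0.9, 0.6, 0.4, 0.3, 0.2, 0.2, 0.1]

# Kaiser criterion: keep only components with an eigenvalue greater than 1.
retained = [lam for lam in eigenvalues if lam > 1]
print(len(retained))  # 3 components retained -> these become the factors
```

The rationale: a standardized variable contributes a variance of 1, so a component with an eigenvalue below 1 explains less than a single original variable would.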
factor loadings
the correlations between the chosen factors
and the original variables
-tells you how much each variable plays a role within the different factors
Uniqueness
• by extracting 3 of the 10 components, we have necessarily lost some of the information
– the 3 factors only account for 72.638% of the variance in the dataset
• the rest of the variability (27.362%) remains unexplained, and is called the uniqueness
• uniqueness can be quantified by examining the communalities
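A per-variable sketch of that relationship, with hypothetical loadings: a variable's communality is the sum of its squared loadings on the retained factors, and its uniqueness is whatever variance is left over:

```python
# Hypothetical factor loadings for ONE variable on the 3 retained factors.
loadings = [0.8, 0.4, 0.2]

# Communality = sum of squared loadings = variance explained by the factors.
communality = sum(l ** 2 for l in loadings)
uniqueness = 1 - communality  # variance of this variable left unexplained
print(communality, uniqueness)  # approximately 0.84 and 0.16
```

Here the 3 factors explain about 84% of this variable's variance, and the remaining ~16% is its uniqueness, the per-variable counterpart of the 27.362% left unexplained for the dataset as a whole.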