Multivariate Analysis Flashcards
rows = ______
columns = ______
observations
variables
Multicollinearity
however, large datasets are often composed of multiple variables or cases that are
similar, and therefore could be considered redundant
• for instance, income data could be broken down into a large number of
subcategories, each being a separate variable
• but many of these subcategories could be highly correlated with each other such
that the data from one variable could explain the data from the other
• in this case the second variable is redundant and unnecessary
• the same situation holds for observations that are similar
• if 2 census tracts behave similarly, they could be treated as a single census
tract without losing too much information
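As a sketch of how redundancy between two variables shows up numerically, the Pearson correlation between two hypothetical income subcategories (made-up numbers, purely illustrative) will be close to 1 when one variable largely explains the other:

```python
# Hypothetical data: two income subcategories that move together.
wages = [30, 45, 50, 62, 80, 95]
salaries = [28, 44, 52, 60, 83, 92]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson_r(wages, salaries)
print(round(r, 3))  # very close to 1 -> the second variable is largely redundant
```

A correlation near 1 (or -1) is the numerical signal that keeping both variables adds little information.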
Data reduction is
and the two main ways to do so
combining similar variables and similar observations together to reduce the size of the dataset
1• factor analysis: this groups together variables that behave similarly and produces
factors – each factor is composed of one or more variables
-produces factors
2• cluster analysis: this groups together observations that behave similarly and
produces clusters – each cluster is composed of one or more observations
-produces clusters
Factor Analysis
• in large datasets, where many of the variables are correlated with each other, factor
analysis can be used to reduce or collapse the large number of variables into a small
number of factors
• in this case, the word ‘factor’ is equivalent to ‘variable’, and a factor could be
considered a group of similar variables
• the identification of the factors and how many factors are important is where
understanding factor analysis is important
the common factors or unique components of a factor analysis are
length, width, and depth
Components must be perpendicular and therefore perfectly _______ (r = 0)
unrelated
Standardization
When you have large datasets with variables measured in different units or at different scales, STANDARDIZATION is the best way to put them on the same level, a common ground
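A minimal sketch of the usual standardization (z-scores), with hypothetical values: each variable is rescaled to mean 0 and standard deviation 1, so income in dollars and age in years end up on the same unitless scale:

```python
def standardize(values):
    """Rescale a list of values to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

incomes = [20000, 35000, 50000, 65000, 80000]  # dollars (hypothetical)
ages = [22, 31, 40, 49, 58]                    # years (hypothetical)

z_incomes = standardize(incomes)
z_ages = standardize(ages)
# Both variables now have mean 0 and variance 1, so they are directly comparable.
```

After standardizing, correlations (and therefore factor analysis) are not distorted by the original measurement units.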
In a factor analysis, how do we know which variables behave the same?
Principal components analysis
PCA VS. FA
• principal components analysis vs factor analysis (PCA vs FA)
• PCA decomposes the variation in the data set of p variables into p principal
components
• these components are linear combinations of the original variables
-PCA ends with the same number of components as original variables
• FA models the variability in the dataset using a reduced number of factors (k < p)
• these factors are linear combinations of the original variables plus an error
component, known as a uniqueness
• PCA is done as part of FA, and FA then uses a selection of the most important principal
components
-FA ends with fewer components than it started with
eigenvalues
the axes are defined by their length, which are known as eigenvalues (a long axis has a
high eigenvalue…)
PCA
PCA essentially reorganizes the data and sorts them by where the most important variation is versus the least important
• in this example, we have 10 variables, which means we are working in a 10-dimensional
space and we should find 10 perpendicular axes for the ellipse
• the axes are defined by their length, which are known as eigenvalues (a long axis has a
high eigenvalue…)
• PCA (and factor analysis as a whole) requires software, so there are no mathematics to
worry about here, only outputs from programs like SPSS
• depending on the specific software, the eigenvalues may also be called “sum of
square loadings”
• the process of defining the components is known as extraction, and the extraction
process in factor analysis is called principal components
PCA is done when...
we have identified the length of each of the 10 axes
FA: to find which components from the PCA are important, we have to use 2 criteria
-they have to have an eigenvalue of > 1, and those become the factors
Or
-examine the scree plot and find the inflection point, which separates the steep part from the flat part
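The eigenvalue > 1 rule (the Kaiser criterion) can be sketched with hypothetical eigenvalues such as software like SPSS might report for 10 standardized variables (they sum to 10, the number of variables):

```python
# Hypothetical eigenvalues for 10 components, sorted from largest to smallest.
eigenvalues = [3.9, 2.1, 1.3, 0.9, 0.6, 0.4, 0.3, 0.2, 0.2, 0.1]

# Kaiser criterion: keep only components with an eigenvalue greater than 1.
retained = [lam for lam in eigenvalues if lam > 1]
print(len(retained))  # 3 components retained -> these become the factors
```

The rationale: a standardized variable contributes a variance of 1, so a component with an eigenvalue below 1 explains less than a single original variable would.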
factor loadings
the correlations between the chosen factors
and the original variables
-tells you how much each variable plays a role within the different factors
Uniqueness
• by extracting 3 of the 10 components, we have necessarily lost some of the information
– the 3 factors only account for 72.638% of the variance in the dataset
• the rest of the variability (27.362%) remains unexplained, and is called the uniqueness
• uniqueness can be quantified by examining the communalities
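A per-variable sketch of that relationship, with hypothetical loadings: a variable's communality is the sum of its squared loadings on the retained factors, and its uniqueness is whatever variance is left over:

```python
# Hypothetical factor loadings for ONE variable on the 3 retained factors.
loadings = [0.8, 0.4, 0.2]

# Communality = sum of squared loadings = variance explained by the factors.
communality = sum(l ** 2 for l in loadings)
uniqueness = 1 - communality  # variance of this variable left unexplained
print(communality, uniqueness)  # approximately 0.84 and 0.16
```

Here the 3 factors explain about 84% of this variable's variance, and the remaining ~16% is its uniqueness, the per-variable counterpart of the 27.362% left unexplained for the dataset as a whole.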