Chapter 4 Flashcards
dimension reduction
dimension of a dataset
the number of variables
several dimension reduction approaches
(1) Incorporating domain knowledge to remove or combine categories, (2) using data summaries to detect information overlap between variables (and remove or combine redundant variables or categories), (3) using data conversion techniques such as converting categorical variables into numerical variables, and (4) employing automated reduction techniques, such as principal components analysis (PCA), where a new set of variables (weighted averages of the original variables) replaces the original variables
domain knowledge
common sense about which variables are needed
summary of data
Average, median, minimum, maximum, standard deviation, counts & percentages
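These summaries are quick to compute in practice. A minimal sketch using pandas (the library and the column names are illustrative, not from the notes):

import pandas as pd

# Hypothetical data; column names are made up for illustration
df = pd.DataFrame({
    "price": [285000, 310000, 195000, 450000, 325000],
    "rooms": [6, 7, 5, 9, 7],
    "type": ["single", "single", "condo", "single", "condo"],
})

# Average, median, minimum, maximum, and standard deviation of the numerical variables
print(df[["price", "rooms"]].describe())

# Counts and percentages for a categorical variable
print(df["type"].value_counts())
print(df["type"].value_counts(normalize=True))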
Reducing Categories
In particular, a variable with m categories will be transformed
into either m or m − 1 dummy variables (depending on the method).
This means that even if we have very few original categorical variables, they can
greatly inflate the dimension of the dataset. One way to handle this is to reduce
the number of categories by combining close or similar categories. Combining
categories requires incorporating expert knowledge and common sense. Pivot
tables are useful for this task: We can examine the sizes of the various categories
and how the outcome variable behaves in each category. Generally, categories
that contain very few observations are good candidates for combining with other
categories.
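A sketch of both ideas, dummy-variable inflation and combining rare categories, using pandas (assumed here; the categories, the outcome column, and the cutoff for "very few observations" are illustrative):

import pandas as pd

df = pd.DataFrame({
    "zone":  ["A", "A", "B", "C", "C", "C", "D", "E"],
    "price": [200, 210, 180, 300, 310, 305, 150, 400],
})

# m categories become m dummies, or m - 1 when one is dropped as the reference
dummies_m = pd.get_dummies(df["zone"])                    # 5 dummy columns
dummies_m1 = pd.get_dummies(df["zone"], drop_first=True)  # 4 dummy columns

# Pivot table: category sizes and outcome behavior in each category
print(pd.pivot_table(df, index="zone", values="price", aggfunc=["count", "mean"]))

# Combine categories with very few observations (here: fewer than 2) into "Other"
counts = df["zone"].value_counts()
rare = counts[counts < 2].index
df["zone_reduced"] = df["zone"].replace(dict.fromkeys(rare, "Other"))
print(df["zone_reduced"].value_counts())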
Principal Components Analysis
Goal: Reduce a set of numerical variables.
The idea: Remove the overlap of information between these variables.
“Information” is measured by the sum of the variances of the variables.
Final product: A smaller number of numerical variables that contain most of the information.
How does PCA do this?
Create new variables that are linear combinations of the original variables (i.e., they are weighted averages of the original variables).
These linear combinations are uncorrelated (no information overlap), and only a few of them contain most of the original information.
The new variables are called principal components.
general process
X1, X2, X3, … Xp, original p variables
Z1, Z2, Z3, … Zp, weighted averages of original variables, aka Principal Components
All pairs of Z variables have 0 correlation
Order Z’s by variance (Z1 largest, Zp smallest)
Usually the first few Z variables contain most of the information, and so the rest can be dropped.
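One way to run these steps end to end, sketched with scikit-learn's PCA on synthetic data (neither is mentioned in the notes):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Three variables with heavy information overlap: X2 and X3 are noisy copies of X1
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     x1 + 0.1 * rng.normal(size=200),
                     x1 + 0.1 * rng.normal(size=200)])

pca = PCA()              # the Z's come out ordered by variance, largest first
Z = pca.fit_transform(X)

# Share of the total variance ("information") carried by each principal component;
# here the first component carries nearly all of it, so the rest can be dropped
print(pca.explained_variance_ratio_)

# The pairwise correlations between the Z's are (numerically) zero
print(np.round(np.corrcoef(Z, rowvar=False), 2))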
normalizing data for PCA
Normalize each variable to remove the scale effect
Divide by the standard deviation (may subtract the mean first)
Normalization (= standardization) is usually performed before PCA; otherwise variables measured in larger units dominate the results
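A sketch of this normalization step before running PCA, using scikit-learn's StandardScaler (assumed here) to subtract the mean and divide by the standard deviation:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Two variables on very different scales, e.g. dollars vs. number of rooms
X = np.column_stack([rng.normal(300_000, 50_000, size=200),
                     rng.normal(6, 1, size=200)])

# Without normalization the large-scale variable dominates the first component
print(PCA().fit(X).explained_variance_ratio_)

# Standardize (subtract mean, divide by standard deviation), then run PCA
X_std = StandardScaler().fit_transform(X)
print(PCA().fit(X_std).explained_variance_ratio_)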