Chapter 4 Flashcards

dimension reduction

1
Q

dimension of a dataset

A

the number of variables

2
Q

several dimension reduction approaches

A

(1) Incorporating domain knowledge to remove or combine categories, (2) using data summaries to detect information overlap between variables (and remove or combine redundant variables or categories), (3) using data conversion techniques such as converting categorical variables into numerical variables, and (4) employing automated reduction techniques, such as principal components analysis (PCA), where a new, smaller set of variables is created that contains most of the information in the original variables.
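As a sketch of approach (3), a categorical variable can be converted into dummy variables with pandas; the data frame and column name below are made up for illustration:

```python
import pandas as pd

# Made-up data frame with one categorical variable
df = pd.DataFrame({"fuel_type": ["gas", "diesel", "gas", "electric"]})

# An m-category variable becomes m - 1 dummies (drop_first=True
# drops the redundant m-th dummy; omit it to get all m)
dummies = pd.get_dummies(df["fuel_type"], prefix="fuel", drop_first=True)
print(dummies)
```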

3
Q

domain knowledge

A

common sense about which variables are needed

4
Q

summary of data

A
Average
Median
Minimum
Maximum
Standard deviation
Counts & percentages
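These summaries can be computed directly with pandas; the data frame below is purely illustrative:

```python
import pandas as pd

# Illustrative data: one numerical and one categorical variable
df = pd.DataFrame({
    "price": [25000, 18000, 32000, 21000],
    "fuel_type": ["gas", "diesel", "gas", "electric"],
})

# Average, standard deviation, minimum, maximum (and quartiles)
print(df["price"].describe())
# Median
print(df["price"].median())
# Counts and percentages for a categorical variable
print(df["fuel_type"].value_counts())
print(df["fuel_type"].value_counts(normalize=True))
```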
5
Q

Reducing Categories

A

When categorical variables are converted into dummy variables, a variable with m categories is transformed
into either m or m − 1 dummy variables (depending on the method).
This means that even if we have very few original categorical variables, they can
greatly inflate the dimension of the dataset. One way to handle this is to reduce
the number of categories by combining close or similar categories. Combining
categories requires incorporating expert knowledge and common sense. Pivot
tables are useful for this task: We can examine the sizes of the various categories
and how the outcome variable behaves in each category. Generally, categories
that contain very few observations are good candidates for combining with other
categories.
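A minimal sketch of this use of pivot tables with pandas, on a hypothetical data frame with a categorical predictor and a numerical outcome:

```python
import pandas as pd

# Hypothetical data: category membership and a numerical outcome
df = pd.DataFrame({
    "category": ["A", "A", "B", "C", "C", "C", "D"],
    "outcome":  [10, 12, 9, 15, 14, 16, 8],
})

# Size of each category and the average outcome within it;
# categories with very few records are candidates for combining
pivot = df.pivot_table(index="category", values="outcome",
                       aggfunc=["count", "mean"])
print(pivot)
```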

6
Q

Principal Components Analysis

A

Goal: Reduce a set of numerical variables.

The idea: Remove the overlap of information between these variables.
“Information” is measured by the sum of the variances of the variables.

Final product: A smaller number of numerical variables that contain most of the information
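A tiny numerical check of this notion of information, on made-up data with scikit-learn's PCA: the full set of components repackages the same total variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # made-up numerical data

# "Information" = sum of the variances of the variables
total_var = X.var(axis=0, ddof=1).sum()

# Keeping all components preserves that total, only repackaged
pca = PCA().fit(X)
print(total_var, pca.explained_variance_.sum())   # (approximately) equal
```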

7
Q

How does PCA do this?

A

Create new variables that are linear combinations of the original variables (i.e., they are weighted averages of the original variables).

These linear combinations are uncorrelated (no information overlap), and only a few of them contain most of the original information.

The new variables are called principal components.
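A small sketch with scikit-learn, assuming purely numerical (made-up) data; pca.components_ holds the weights of each linear combination, and the resulting component scores are uncorrelated:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)   # introduce information overlap

pca = PCA()
Z = pca.fit_transform(X)        # the principal components (new Z variables)
print(pca.components_)          # weights of each linear combination
print(np.corrcoef(Z, rowvar=False).round(2))   # ~0 off the diagonal
```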

8
Q

general process

A

X1, X2, X3, … Xp, original p variables

Z1, Z2, Z3, … Zp, weighted averages of original variables, aka Principal Components

All pairs of Z variables have 0 correlation

Order the Z’s by variance (Z1 largest, Zp smallest)

Usually the first few Z variables contain most of the information, and so the rest can be dropped.
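In scikit-learn, the explained-variance output reflects this ordering and helps decide how many components to keep (again on made-up data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)

pca = PCA().fit(X)
# Share of total variance per component, largest first
print(pca.explained_variance_ratio_)
# Cumulative share: keep the first few components that already
# account for most of it (e.g., ~90%) and drop the rest
print(np.cumsum(pca.explained_variance_ratio_))
```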

9
Q

what to do before running PCA

A

Normalize each variable to remove scale effect
Divide by std. deviation (may subtract mean first)
Normalization (= standardization) is usually performed as part of PCA; otherwise, measurement units affect the results.
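A sketch of standardizing each variable before fitting PCA, using scikit-learn's StandardScaler on made-up data with very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0, 1, 100),       # small-scale variable
                     rng.normal(0, 1000, 100)])   # large-scale variable

# Subtract the mean and divide by the standard deviation so that
# measurement units no longer dominate the components
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
print(pca.explained_variance_ratio_)
```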
