Chapter 4 Flashcards
dimension reduction
dimension of a dataset
the number of variables
several dimension reduction approaches
(1) Incorporating domain knowledge to remove or combine categories, (2) using data summaries to detect information overlap between variables (and remove or combine redundant variables or categories), (3) using data conversion techniques such as converting categorical variables into numerical variables, and (4) employing automated reduction techniques, such as principal components analysis (PCA), where a new set of variables (weighted averages of the original variables) replaces the original variables
domain knowledge
common sense about which variables are needed
summary of data
Average, median, minimum, maximum, standard deviation, counts & percentages
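These summaries are quick to compute in practice. A minimal sketch using pandas (the library and the column names are illustrative, not from the notes):

import pandas as pd

# Hypothetical data; column names are made up for illustration
df = pd.DataFrame({
    "price": [285000, 310000, 195000, 450000, 325000],
    "rooms": [6, 7, 5, 9, 7],
    "type": ["single", "single", "condo", "single", "condo"],
})

# Average, median, minimum, maximum, and standard deviation of the numerical variables
print(df[["price", "rooms"]].describe())

# Counts and percentages for a categorical variable
print(df["type"].value_counts())
print(df["type"].value_counts(normalize=True))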
Reducing Categories
In particular, a variable with m categories will be transformed
into either m or m − 1 dummy variables (depending on the method).
This means that even if we have very few original categorical variables, they can
greatly inflate the dimension of the dataset. One way to handle this is to reduce
the number of categories by combining close or similar categories. Combining
categories requires incorporating expert knowledge and common sense. Pivot
tables are useful for this task: We can examine the sizes of the various categories
and how the outcome variable behaves in each category. Generally, categories
that contain very few observations are good candidates for combining with other
categories.
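A sketch of both ideas, dummy-variable inflation and combining rare categories, using pandas (assumed here; the categories, the outcome column, and the cutoff for "very few observations" are illustrative):

import pandas as pd

df = pd.DataFrame({
    "zone":  ["A", "A", "B", "C", "C", "C", "D", "E"],
    "price": [200, 210, 180, 300, 310, 305, 150, 400],
})

# m categories become m dummies, or m - 1 when one is dropped as the reference
dummies_m = pd.get_dummies(df["zone"])                    # 5 dummy columns
dummies_m1 = pd.get_dummies(df["zone"], drop_first=True)  # 4 dummy columns

# Pivot table: category sizes and outcome behavior in each category
print(pd.pivot_table(df, index="zone", values="price", aggfunc=["count", "mean"]))

# Combine categories with very few observations (here: fewer than 2) into "Other"
counts = df["zone"].value_counts()
rare = counts[counts < 2].index
df["zone_reduced"] = df["zone"].replace(dict.fromkeys(rare, "Other"))
print(df["zone_reduced"].value_counts())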
Principal Components Analysis
Goal: Reduce a set of numerical variables.
The idea: Remove the overlap of information between these variables.
“Information” is measured by the sum of the variances of the variables.
Final product: A smaller number of numerical variables that contain most of the information.
How does PCA do this?
Create new variables that are linear combinations of the original variables (i.e., they are weighted averages of the original variables).
These linear combinations are uncorrelated (no information overlap), and only a few of them contain most of the original information.
The new variables are called principal components.
general process
X1, X2, X3, … Xp, original p variables
Z1, Z2, Z3, … Zp, weighted averages of original variables, aka Principal Components
All pairs of Z variables have 0 correlation
Order Z’s by variance (Z1 largest, Zp smallest)
Usually the first few Z variables contain most of the information, and so the rest can be dropped.
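One way to run these steps end to end, sketched with scikit-learn's PCA on synthetic data (neither is mentioned in the notes):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Three variables with heavy information overlap: X2 and X3 are noisy copies of X1
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     x1 + 0.1 * rng.normal(size=200),
                     x1 + 0.1 * rng.normal(size=200)])

pca = PCA()              # the Z's come out ordered by variance, largest first
Z = pca.fit_transform(X)

# Share of the total variance ("information") carried by each principal component;
# here the first component carries nearly all of it, so the rest can be dropped
print(pca.explained_variance_ratio_)

# The pairwise correlations between the Z's are (numerically) zero
print(np.round(np.corrcoef(Z, rowvar=False), 2))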
normalizing data for PCA
Normalize each variable to remove the scale effect
Divide by the standard deviation (may subtract the mean first)
Normalization (= standardization) is usually performed before PCA; otherwise variables measured in larger units dominate the results
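A sketch of this normalization step before running PCA, using scikit-learn's StandardScaler (assumed here) to subtract the mean and divide by the standard deviation:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Two variables on very different scales, e.g. dollars vs. number of rooms
X = np.column_stack([rng.normal(300_000, 50_000, size=200),
                     rng.normal(6, 1, size=200)])

# Without normalization the large-scale variable dominates the first component
print(PCA().fit(X).explained_variance_ratio_)

# Standardize (subtract mean, divide by standard deviation), then run PCA
X_std = StandardScaler().fit_transform(X)
print(PCA().fit(X_std).explained_variance_ratio_)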