Week 10: Feature Engineering & Dimensionality Reduction Flashcards

1
Q

Standardise Numeric Values

A

Subtract the mean \mu from each value and divide each centred value by the standard deviation \sigma, so the transformed variable has zero mean and unit variance: z = (x - \mu) / \sigma.
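A minimal NumPy sketch of this transformation (the sample values are illustrative only):

```python
import numpy as np

def standardise(x):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    mu = x.mean()
    sigma = x.std()
    return (x - mu) / sigma

values = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
z = standardise(values)  # the result has mean ~0 and standard deviation ~1
```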

2
Q

Convert Numeric Values into Percentiles

A

The x-th percentile means that x percent of the samples are less than the current sample.
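A rough sketch of this conversion, assuming the strictly-less-than convention described above (other percentile definitions exist):

```python
import numpy as np

def to_percentiles(x):
    """Replace each value with the percentage of samples strictly below it."""
    x = np.asarray(x)
    return np.array([100.0 * np.mean(x < v) for v in x])

print(to_percentiles([3, 1, 4, 1, 5]))  # the largest value maps to 80.0
```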

3
Q

Convert Counts into Rates

A

This is useful when tracking events over time: it measures how often an event occurs in a specific time unit.
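A tiny illustration with made-up numbers: dividing each count by the length of its observation window turns raw counts into comparable rates.

```python
# Hypothetical data: raw event counts and the observation window for each user.
counts = [12, 45, 7]                # events observed per user
hours_observed = [3.0, 10.0, 0.5]   # length of each user's observation window, in hours

rates = [c / t for c, t in zip(counts, hours_observed)]  # events per hour
print(rates)  # [4.0, 4.5, 14.0]
```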

4
Q

Replace Categorical Variables with Numeric Variables

A

Replacements include numeric descriptor variables and binary indicator variables.

For example, cities can be described by a series of numeric descriptor variables (e.g. population, median income, annual rainfall).

One-hot encoding can be used for a variable with only a few categories.
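A short pandas sketch of one-hot encoding (the city values are made up):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Sydney", "Melbourne", "Sydney", "Perth"]})

# One binary indicator column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")
print(one_hot)
```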

5
Q

Replace Numerical Variables with Categorical Variables

A

Binning is a common technique; variants include equal-width, equal-weight (equal-frequency), and supervised binning.
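A brief pandas sketch of equal-width and equal-weight binning (supervised binning, which uses the target variable, is not shown; the ages are illustrative):

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Equal-width binning: each bin spans an equal range of the variable.
equal_width = pd.cut(ages, bins=4)

# Equal-weight (equal-frequency) binning: each bin holds roughly the same number of samples.
equal_weight = pd.qcut(ages, q=4)
```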

6
Q

Combining Variables

A

Common examples include BMI (weight divided by height squared) and the price-to-earnings ratio.
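A minimal example of combining two variables into one feature, using BMI with made-up measurements:

```python
# BMI combines weight and height into a single feature: weight / height^2.
weight_kg = [70.0, 85.0]
height_m = [1.75, 1.80]

bmi = [w / h ** 2 for w, h in zip(weight_kg, height_m)]  # kg / m^2
```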

7
Q

Problems with High Dimensional Data

A

Issues include the risk of correlation (redundancy) between input variables and an increased risk of overfitting.

8
Q

Sparse Data Problem

A

Sparse data may result in isolated points without many neighbours, which makes pattern recognition more difficult.

9
Q

Variable Selection

A

This is key for reducing the number of predictors in high-dimensional problems. The relevance of variables depends on independence, correlation, and average mutual information.

It’s important to remove attributes with low mutual information with the target attribute, attributes correlated with other attributes, and attributes independent of the target attribute.

10
Q

Correlation

A

\rho(\boldsymbol{x}_j, \boldsymbol{y}) = \frac{\sum_{i=1}^n (x_{i,j} - \overline{x}_j)(y_i - \overline{y})}{\sqrt{\sum_{i=1}^n (x_{i,j} - \overline{x}_j)^2 \sum_{i=1}^n (y_i - \overline{y})^2}}
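A direct NumPy translation of this formula, checked against NumPy's built-in correlation (the data are arbitrary):

```python
import numpy as np

def correlation(x_j, y):
    """Pearson correlation between one input column x_j and the target y."""
    x_j, y = np.asarray(x_j, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x_j - x_j.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])
print(correlation(x, y))        # close to 1
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in Pearson correlation, should agree
```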

11
Q

Average Mutual Information

A

I(\boldsymbol{y}; \boldsymbol{x}_j) = H(\boldsymbol{y}) - H(\boldsymbol{y} \mid \boldsymbol{x}_j)

Note that H(\boldsymbol{y} \mid \boldsymbol{x}_j) is the conditional entropy:

H(\boldsymbol{y} \mid \boldsymbol{x}_j) = - \sum_{y \in Dom(\boldsymbol{y})} \sum_{x \in Dom(\boldsymbol{x}_j)} P(x,y) \log [P(y \mid x)]
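A small sketch estimating I(\boldsymbol{y}; \boldsymbol{x}_j) from joint counts of two discrete variables; it uses the equivalent form I(y; x) = \sum_{x,y} P(x,y) \log [P(x,y) / (P(x)P(y))], and the toy data are made up:

```python
import numpy as np
from collections import Counter

def average_mutual_information(x, y):
    """Estimate I(y; x) = H(y) - H(y | x) from joint counts of two discrete variables."""
    n = len(x)
    p_xy = {pair: c / n for pair, c in Counter(zip(x, y)).items()}
    p_x = {v: c / n for v, c in Counter(x).items()}
    p_y = {v: c / n for v, c in Counter(y).items()}
    return sum(p * np.log(p / (p_x[a] * p_y[b])) for (a, b), p in p_xy.items())

x = ["a", "a", "b", "b"]
y = [0, 0, 1, 1]
print(average_mutual_information(x, y))  # log(2): x determines y completely
```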

12
Q

Exhaustive Feature Selection

A

Exhaustively try all combinations of a set of variables. This approach is impractical for high-dimensional data.
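A quick illustration of why exhaustive search blows up: it must enumerate every non-empty subset of the features (2^n - 1 of them). The feature names are placeholders and the scoring step is omitted.

```python
from itertools import combinations

features = ["x1", "x2", "x3"]  # hypothetical feature names

# Every non-empty subset of the features: 2^n - 1 candidate models to evaluate.
subsets = [c for r in range(1, len(features) + 1) for c in combinations(features, r)]
print(len(subsets))  # 7 subsets for 3 features; over a million for just 20 features
```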

13
Q

Forward Selection

A

A sequential feature selection method. Start without any variables in the model. Build a family of models, each with one input variable. Pick the best input variable. Repeat by adding one variable at a time. Terminate once a predefined maximum number of variables is reached, or when adding a new variable no longer improves the model.
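A greedy sketch of this procedure; `score_model` is an assumed helper (not defined here) that fits a model on the given variables and returns a score where higher is better:

```python
def forward_selection(candidates, score_model, max_vars):
    """Greedy forward selection over a list of candidate variable names."""
    selected, best_score = [], float("-inf")
    while len(selected) < max_vars:
        # Build one candidate model per unused variable and keep the best.
        scored = [(score_model(selected + [v]), v) for v in candidates if v not in selected]
        if not scored:
            break
        score, best_var = max(scored)
        if score <= best_score:  # adding another variable no longer helps
            break
        selected.append(best_var)
        best_score = score
    return selected
```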

14
Q

Backward Selection

A

Start with all variables initially included in the model. Each variable is removed in turn to test its importance to the model, and the least important variable is removed. Variables are removed until a minimum number of variables is reached or until the remaining variables are all above a certain level of importance.

This is typically a time-consuming approach.
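A matching sketch of backward elimination under the same assumptions (`score_model` is a hypothetical scoring helper, higher is better):

```python
def backward_selection(variables, score_model, min_vars):
    """Greedy backward elimination over a list of variable names."""
    selected = list(variables)
    while len(selected) > min_vars:
        current_score = score_model(selected)
        # Try dropping each variable in turn; keep the removal that hurts the score least.
        scored = [(score_model([v for v in selected if v != drop]), drop) for drop in selected]
        best_score, least_important = max(scored)
        if best_score < current_score:  # every removal makes the model worse
            break
        selected.remove(least_important)
    return selected
```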

15
Q

Projections

A

Transform points from the \mathbb{R}^n space to the \mathbb{R}^k space, with k < n. For example, projecting a 3-D ball onto a 2-D plane results in a circle.
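A concrete toy example: projecting points from \mathbb{R}^3 onto the x-y plane with a 2 \times 3 projection matrix.

```python
import numpy as np

# Drop the z coordinate: a linear projection from R^3 to R^2.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])           # 2 x 3 projection matrix

points_3d = np.array([[1.0, 2.0, 3.0],
                      [4.0, 5.0, 6.0]])   # one point per row

points_2d = points_3d @ P.T               # each row is now a point in R^2
print(points_2d)                          # [[1. 2.] [4. 5.]]
```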

16
Q

Eigenvectors and Eigenvalues

A

Given a square matrix \boldsymbol{A}, eigenvector \boldsymbol{u}, and eigenvalue \lambda,

\boldsymbol{Au} = \lambda \boldsymbol{u}
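A quick numerical check of the definition using NumPy (the matrix is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)  # columns of `eigenvectors` are the u_i

u, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ u, lam * u))  # True: A u = lambda u
```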

17
Q

Eigendecomposition

A

The eigendecomposition of an n \times n square matrix \boldsymbol{A} is
\boldsymbol{A} = \boldsymbol{U \Lambda U}^{-1}

\boldsymbol{U} is an n \times n square matrix whose i-th column is the i-th eigenvector \boldsymbol{u}_i of \boldsymbol{A}.

\boldsymbol{\Lambda} is an n \times n diagonal matrix whose i-th diagonal element \lambda_i is the eigenvalue corresponding to the eigenvector \boldsymbol{u}_i.
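A short NumPy check that the factors returned by `np.linalg.eig` reconstruct \boldsymbol{A} (the matrix is an arbitrary diagonalisable example):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, U = np.linalg.eig(A)  # columns of U are the eigenvectors u_i
Lambda = np.diag(eigenvalues)      # diagonal matrix of eigenvalues

print(np.allclose(A, U @ Lambda @ np.linalg.inv(U)))  # True: A = U Lambda U^{-1}
```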

18
Q

Singular Vectors and Singular Values

A

A non-negative number \sigma is a singular value of an m \times n matrix \boldsymbol{A} if there exist unit-length vectors \boldsymbol{u} (left-singular vector) and \boldsymbol{v} (right-singular vector) such that:

\boldsymbol{Av} = \sigma \boldsymbol{u} and
\boldsymbol{A}^T\boldsymbol{u} = \sigma \boldsymbol{v}

19
Q

Singular Value Decomposition

A

A factorisation \boldsymbol{A} = \boldsymbol{U \Sigma V}^T of an m \times n matrix \boldsymbol{A}.

\boldsymbol{U} = m \times m orthogonal matrix whose columns are the left singular vectors.

\boldsymbol{\Sigma} = m \times n diagonal matrix whose diagonal elements \sigma_{i,i} are the singular values.

\boldsymbol{V} = n \times n orthogonal matrix whose columns are the right singular vectors.

In the context of data mining, each row \boldsymbol{u}_i of \boldsymbol{U} corresponds to a document and each column \boldsymbol{v}_j of \boldsymbol{V} to a term.
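A NumPy sketch of the factorisation; note that `np.linalg.svd` returns the singular values as a 1-D array, so \boldsymbol{\Sigma} has to be expanded into an m \times n diagonal matrix (the random matrix stands in for a small document-term matrix):

```python
import numpy as np

m, n = 5, 3                      # e.g. 5 documents and 3 terms
A = np.random.rand(m, n)

U, s, Vt = np.linalg.svd(A, full_matrices=True)  # U: m x m, Vt: n x n
Sigma = np.zeros((m, n))
np.fill_diagonal(Sigma, s)       # place the singular values on the diagonal

print(np.allclose(A, U @ Sigma @ Vt))  # True: A = U Sigma V^T
```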

20
Q

Latent Semantic Indexing

A

This method is used in NLP and applies SVD to reduce the number of columns (term dimensions) while preserving the similarity structure among the rows (documents). It reduces the m \times n matrix to an m \times k matrix. Terms with similar meaning are expected to be merged into the same dimension after reducing dimensionality.
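A minimal sketch of the reduction step on a made-up document-term matrix, keeping k = 2 latent dimensions (the standard U_k \Sigma_k document representation is assumed):

```python
import numpy as np

# Toy document-term matrix: m = 4 documents (rows) x n = 5 terms (columns).
A = np.array([[2, 1, 0, 0, 0],
              [1, 2, 0, 0, 0],
              [0, 0, 1, 2, 1],
              [0, 0, 2, 1, 1]], dtype=float)

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k representation of the documents: an m x k matrix in the latent space.
docs_k = U[:, :k] * s[:k]
print(docs_k.shape)  # (4, 2)
```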

21
Q

Projection Pursuit Regression

A

Nonlinear transformation of linear combinations of variables by finding the most interesting projections of the data.

\hat{y} = \sum_{j=1}^k w_j h_j (\boldsymbol{\alpha}_j^T \boldsymbol{x})

k = number of new variables, usually much smaller than n, the original number of variables

\boldsymbol{\alpha}_j^T \boldsymbol{x} = projection of vector \boldsymbol{x} onto the j-th weight vector \boldsymbol{\alpha}_j
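A tiny sketch that just evaluates this formula for fixed, hand-picked components; the directions \boldsymbol{\alpha}_j, weights w_j, and nonlinear functions h_j are assumptions for illustration, and fitting them from data is not shown:

```python
import numpy as np

alphas = [np.array([1.0, 0.5]), np.array([-0.5, 1.0])]  # projection directions alpha_j
weights = [0.7, 0.3]                                     # weights w_j
ridge_functions = [np.tanh, np.sin]                      # nonlinear functions h_j

def ppr_predict(x):
    """y_hat = sum_j w_j * h_j(alpha_j^T x) for a single input vector x."""
    return sum(w * h(a @ x) for w, h, a in zip(weights, ridge_functions, alphas))

print(ppr_predict(np.array([0.2, -0.1])))
```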