Week 7: Feature Extraction Flashcards
Feature Engineering
The process of changing or transforming measurements to make them more useful for classification. Examples include translating a date into the day of the week.
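A minimal sketch of the date-to-day-of-week example, using only Python's standard library (the dates and variable names are illustrative, not from the lecture):

```python
from datetime import date

# Raw measurement: a calendar date attached to each sample.
raw_dates = [date(2024, 3, 4), date(2024, 3, 9)]

# Engineered feature: day of the week (0 = Monday, ..., 6 = Sunday),
# often more useful for classification than the raw date itself.
day_of_week = [d.weekday() for d in raw_dates]
print(day_of_week)  # [0, 5]
```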
Feature Selection
The process of choosing the subset of variables that are most useful for classification. This helps reduce the risk of over-fitting and reduces the training time required for classification. It is important to weigh how difficult a measurement is to obtain against how informative it is. Domain knowledge and data analysis may be needed to find a suitable subset of features.
Feature Extraction
The process of finding and calculating features that are functions of raw data or selected features. This helps reduce the dimensionality of the data. An example would be calculating BMI from body height and body weight.
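A quick sketch of the BMI example in Python (the heights and weights below are made-up values):

```python
import numpy as np

# Raw features: body height in metres and body weight in kilograms.
height_m = np.array([1.70, 1.85])
weight_kg = np.array([65.0, 90.0])

# Extracted feature: BMI = weight / height^2, collapsing two raw
# measurements into one informative value per sample.
bmi = weight_kg / height_m ** 2
print(bmi.round(1))  # [22.5 26.3]
```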
Principal Component Analysis (PCA)
The process of taking N-D data and finding M (M \le N) orthogonal directions in which the data has the most variance.
Karhunen-Loève Transformation (KLT)
A popular method for performing PCA. First calculate the mean \mu of all data vectors. Then calculate the covariance matrix C. Then find the eigenvalues (E) and eigenvectors (V) of the covariance matrix, so that C = VEV^T. Form the matrix \hat{V}, whose columns are the M eigenvectors with the largest eigenvalues (the principal components), and apply it to the data: y_i = \hat{V}^T(x_i - \mu)
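A minimal NumPy sketch of these KLT steps; the random data and the choice M = 2 are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # 200 samples, N = 5 dimensions
M = 2                                # number of principal components to keep

mu = X.mean(axis=0)                  # 1. mean of all vectors
C = np.cov(X - mu, rowvar=False)     # 2. covariance matrix
eigvals, V = np.linalg.eigh(C)       # 3. eigenvalues E and eigenvectors V, with C = V E V^T

order = np.argsort(eigvals)[::-1]    # sort directions by decreasing variance
V_hat = V[:, order[:M]]              # 4. V_hat holds the M principal components
Y = (X - mu) @ V_hat                 # 5. y_i = V_hat^T (x_i - mu)
print(Y.shape)                       # (200, 2)
```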
Neural Networks for PCA
Using zero-mean data, linear neurons, and specific learning rules, PCA can be done with neural networks.
Hebbian Learning
A learning method for PCA neural networks: \Delta w = \eta y x^T. This method aligns w with the 1st principal component, although the length of w grows without bound since nothing constrains it.
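A rough sketch of Hebbian learning on zero-mean data; the toy data, learning rate, and number of steps are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])  # most variance along the 1st axis
X -= X.mean(axis=0)                  # Hebbian PCA assumes zero-mean data

eta = 0.01
w = rng.normal(size=3)
for x in X:
    y = w @ x                        # linear neuron output
    w += eta * y * x                 # Hebbian update: Delta w = eta * y * x^T

# w points along the first principal component, but its length has
# grown without bound; normalise before inspecting the direction.
print(w / np.linalg.norm(w))         # approximately +/- [1, 0, 0]
```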
Oja’s Rule
A learning method for PCA neural networks: \Delta w = \eta y (x^T - yw). This method aligns w with the 1st principal component, with the weight decay term causing w to approach unit length.
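A sketch of the same set-up with Oja's rule instead; note how the weight vector stays near unit length (the data and learning rate are again arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3)) * np.array([2.0, 1.0, 0.5])  # most variance along the 1st axis
X -= X.mean(axis=0)                  # zero-mean data

eta = 0.01
w = rng.normal(size=3)
w /= np.linalg.norm(w)               # start from a unit-length vector
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)       # Oja's rule: the decay term -y^2 w bounds |w|

print(np.linalg.norm(w))             # approximately 1
print(w)                             # approximately +/- the first principal component
```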
Sanger’s Rule
For the j-th neuron, subtract from x the contributions of the neurons representing the first j-1 principal components, then apply Oja's Rule: \Delta w_j = \eta y_j (x^T - \sum_{k=1}^{j} y_k w_k). The weight vector w_j will learn the j-th principal component.
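A sketch of Sanger's rule for the first M = 2 components, written so that the subtraction over neurons 1..j (own term included, as in Oja's rule) is explicit; the data and hyperparameters are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 4)) * np.array([2.0, 1.5, 1.0, 0.5])
X -= X.mean(axis=0)                  # zero-mean data

eta, M = 0.005, 2
W = rng.normal(size=(M, 4)) * 0.1    # row j will learn the j-th principal component
for x in X:
    y = W @ x
    for j in range(M):
        # subtract the contributions of neurons 1..j, then do the Oja-style update
        residual = x - y[:j + 1] @ W[:j + 1]
        W[j] += eta * y[j] * residual

print(W)                             # rows approximate +/- the 1st and 2nd principal components
```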
Oja’s Subspace Rule
This rule learns the subspace spanned by the first M principal components, but the individual weight vectors come out in no particular order and need not align with individual principal components. The rule is identical to the rule used to update the Negative Feedback Network: \Delta W = \eta y (x - W^T y)^T
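A sketch of Oja's subspace rule in its negative-feedback form (feed-forward output, feedback residual, then the weight update); the data and M are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 4)) * np.array([2.0, 1.5, 1.0, 0.5])
X -= X.mean(axis=0)                  # zero-mean data

eta, M = 0.005, 2
W = rng.normal(size=(M, 4)) * 0.1
for x in X:
    y = W @ x                        # feed-forward output
    e = x - W.T @ y                  # negative feedback: residual x - W^T y
    W += eta * np.outer(y, e)        # Delta W = eta * y * (x - W^T y)^T

# The rows of W span (approximately) the same subspace as the first M
# principal components, without individually matching any one of them.
print((W @ W.T).round(2))            # approximately the identity (near-orthonormal rows)
```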
Autoencoder and PCA
An autoencoder with linear units can also learn the PCA subspace, either through negative feedback learning (Oja's subspace rule) or through backpropagation of the reconstruction error.
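A rough sketch of the backpropagation route, assuming a linear encoder/decoder pair trained by stochastic gradient descent on the squared reconstruction error (the sizes and learning rate are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4)) * np.array([2.0, 1.5, 1.0, 0.5])
X -= X.mean(axis=0)                  # zero-mean data

eta, M = 0.01, 2
E = rng.normal(size=(M, 4)) * 0.1    # linear encoder (bottleneck of size M)
D = rng.normal(size=(4, M)) * 0.1    # linear decoder
for x in X:
    z = E @ x                        # code
    e = x - D @ z                    # reconstruction error
    step_D = np.outer(e, z)          # descent direction for 0.5 * ||e||^2 w.r.t. D
    step_E = np.outer(D.T @ e, x)    # descent direction for 0.5 * ||e||^2 w.r.t. E
    D += eta * step_D
    E += eta * step_E

# After training, the columns of D (and rows of E) move toward spanning
# the subspace of the first M principal components.
```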
Whitening Transform
This process makes the covariance matrix in the new space equal to the identity matrix, so that each dimension has the same variance. This is in contrast to PCA, whose variances are typically unequal, with the first principal component carrying the greatest variance in the data.
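A NumPy sketch of a whitening transform built from the eigendecomposition of the covariance matrix (the correlated toy data is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))   # correlated toy data
Xc = X - X.mean(axis=0)

eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
# Rotate onto the eigenvectors, then rescale each direction by
# 1/sqrt(eigenvalue) so that every dimension has unit variance.
Z = Xc @ V / np.sqrt(eigvals)

print(np.cov(Z, rowvar=False).round(2))   # approximately the identity matrix
```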
Linear Discriminant Analysis (LDA)
As PCA is unsupervised, it may remove discriminative dimensions. This may result in overlap between samples of different classes when projecting onto the principal components. LDA looks for maximally discriminative projections. This requires feature extraction to be informed by class labels (i.e. supervised learning). Overlap between elements from different categories is minimised.
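A small scikit-learn comparison, assuming toy data in which the classes are separated only along a low-variance direction (so PCA is likely to discard it while LDA keeps it):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Classes differ only along the low-variance (second) axis.
X0 = rng.normal(size=(100, 2)) * [5.0, 0.5] + [0.0, -2.0]
X1 = rng.normal(size=(100, 2)) * [5.0, 0.5] + [0.0, 2.0]
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

Z_pca = PCA(n_components=1).fit_transform(X)       # unsupervised: classes overlap
Z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)  # supervised: classes separate
```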
Fisher’s Method
This is used in LDA. Fisher's Method seeks the w that maximises J(w) = s_B / s_W: the between-class scatter s_B must be maximised while the within-class scatter s_W is minimised. s_B = (w^T(m_1 - m_2))^2, where m_i = (1/n_i) \sum_{x \in \omega_i} x. s_W = s_1^2 + s_2^2, where s_i^2 = \sum_{x \in \omega_i} (w^T(x - m_i))^2
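A NumPy sketch of Fisher's method for two classes. It uses the standard closed-form solution w \propto S_W^{-1}(m_1 - m_2), which is not spelled out on the card; the toy data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 2)) * [5.0, 0.5] + [0.0, -2.0]   # class omega_1
X2 = rng.normal(size=(100, 2)) * [5.0, 0.5] + [0.0, 2.0]    # class omega_2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter matrix, summed over both classes.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
# The w maximising J(w) = s_B / s_W is proportional to S_W^{-1}(m_1 - m_2).
w = np.linalg.solve(S_W, m1 - m2)
w /= np.linalg.norm(w)

s_B = (w @ (m1 - m2)) ** 2
s_W = np.sum(((X1 - m1) @ w) ** 2) + np.sum(((X2 - m2) @ w) ** 2)
print(s_B / s_W)   # J(w) evaluated at the Fisher direction
```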
Independent Component Analysis (ICA)
ICA finds statistically independent components, unlike PCA, which finds uncorrelated components. For Gaussian distributions, uncorrelatedness is equivalent to independence, so the two coincide; for non-Gaussian distributions, the results of ICA and PCA diverge.
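A small scikit-learn illustration using FastICA (one common ICA algorithm; the uniform sources and mixing matrix below are made up):

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(1000, 2))   # two non-Gaussian (uniform) sources
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                   # mixing matrix
X = S @ A.T                                  # observed mixtures

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)  # aims to recover the sources
S_pca = PCA(n_components=2).fit_transform(X)                      # only decorrelates the mixtures
```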