Week 7: Feature Extraction Flashcards
Feature Engineering
The process of changing or transforming measurements to make them more useful for classification. Examples include translating a date into the day of the week.
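A minimal sketch of the date-to-day-of-week example, using only Python's standard library (the dates and variable names are illustrative, not from the lecture):

```python
from datetime import date

# Raw measurement: a calendar date attached to each sample.
raw_dates = [date(2024, 3, 4), date(2024, 3, 9)]

# Engineered feature: day of the week (0 = Monday, ..., 6 = Sunday),
# often more useful for classification than the raw date itself.
day_of_week = [d.weekday() for d in raw_dates]
print(day_of_week)  # [0, 5]
```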
Feature Selection
The process of choosing the subset of variables that are most useful for classification. This helps reduce the risk of over-fitting and reduces the training time required for classification. It is important to weigh how difficult a measurement is to obtain against how informative it is. Domain knowledge and data analysis may be needed to find a suitable subset of features.
Feature Extraction
The process of finding and calculating features that are functions of raw data or selected features. This helps reduce the dimensionality of the data. An example would be calculating BMI from body height and body weight.
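A quick sketch of the BMI example in Python (the heights and weights below are made-up values):

```python
import numpy as np

# Raw features: body height in metres and body weight in kilograms.
height_m = np.array([1.70, 1.85])
weight_kg = np.array([65.0, 90.0])

# Extracted feature: BMI = weight / height^2, collapsing two raw
# measurements into one informative value per sample.
bmi = weight_kg / height_m ** 2
print(bmi.round(1))  # [22.5 26.3]
```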
Principal Component Analysis (PCA)
The process of taking N-D data and finding M (M \le N) orthogonal directions in which the data has the most variance.
Karhunen-Loève Transformation (KLT)
A popular method for performing PCA. First calculate the mean \mu of all data vectors. Then calculate the covariance matrix C. Then find the eigenvalues (E) and eigenvectors (V) of the covariance matrix, so that C = VEV^T. Form the matrix \hat{V}, whose columns are the M eigenvectors with the largest eigenvalues (the principal components), and apply it to the data: y_i = \hat{V}^T(x_i - \mu)
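A minimal NumPy sketch of these KLT steps; the random data and the choice M = 2 are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # 200 samples, N = 5 dimensions
M = 2                                # number of principal components to keep

mu = X.mean(axis=0)                  # 1. mean of all vectors
C = np.cov(X - mu, rowvar=False)     # 2. covariance matrix
eigvals, V = np.linalg.eigh(C)       # 3. eigenvalues E and eigenvectors V, with C = V E V^T

order = np.argsort(eigvals)[::-1]    # sort directions by decreasing variance
V_hat = V[:, order[:M]]              # 4. V_hat holds the M principal components
Y = (X - mu) @ V_hat                 # 5. y_i = V_hat^T (x_i - mu)
print(Y.shape)                       # (200, 2)
```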
Neural Networks for PCA
Using zero-mean data, linear neurons, and specific learning rules, PCA can be done with neural networks.
Hebbian Learning
A learning method for PCA neural networks: \Delta w = \eta y x^T. This method aligns w with the 1st principal component, although the length of w grows without bound since nothing constrains it.
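A rough sketch of Hebbian learning on zero-mean data; the toy data, learning rate, and number of steps are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])  # most variance along the 1st axis
X -= X.mean(axis=0)                  # Hebbian PCA assumes zero-mean data

eta = 0.01
w = rng.normal(size=3)
for x in X:
    y = w @ x                        # linear neuron output
    w += eta * y * x                 # Hebbian update: Delta w = eta * y * x^T

# w points along the first principal component, but its length has
# grown without bound; normalise before inspecting the direction.
print(w / np.linalg.norm(w))         # approximately +/- [1, 0, 0]
```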
Oja’s Rule
A learning method for PCA neural networks: \Delta w = \eta y (x^T - yw). This method aligns w with the 1st principal component, with the weight decay term causing w to approach unit length.
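A sketch of the same set-up with Oja's rule instead; note how the weight vector stays near unit length (the data and learning rate are again arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3)) * np.array([2.0, 1.0, 0.5])  # most variance along the 1st axis
X -= X.mean(axis=0)                  # zero-mean data

eta = 0.01
w = rng.normal(size=3)
w /= np.linalg.norm(w)               # start from a unit-length vector
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)       # Oja's rule: the decay term -y^2 w bounds |w|

print(np.linalg.norm(w))             # approximately 1
print(w)                             # approximately +/- the first principal component
```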
Sanger’s Rule
For the j-th neuron, subtract from x the contributions of the neurons representing the first j-1 principal components, then apply Oja's Rule: \Delta w_j = \eta y_j (x^T - \sum_{k=1}^{j} y_k w_k). The weight vector w_j will learn the j-th principal component.
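A sketch of Sanger's rule for the first M = 2 components, written so that the subtraction over neurons 1..j (own term included, as in Oja's rule) is explicit; the data and hyperparameters are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 4)) * np.array([2.0, 1.5, 1.0, 0.5])
X -= X.mean(axis=0)                  # zero-mean data

eta, M = 0.005, 2
W = rng.normal(size=(M, 4)) * 0.1    # row j will learn the j-th principal component
for x in X:
    y = W @ x
    for j in range(M):
        # subtract the contributions of neurons 1..j, then do the Oja-style update
        residual = x - y[:j + 1] @ W[:j + 1]
        W[j] += eta * y[j] * residual

print(W)                             # rows approximate +/- the 1st and 2nd principal components
```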
Oja’s Subspace Rule
This rule learns the subspace spanned by the first M principal components, but the individual weight vectors come out in no particular order and need not align with individual principal components. The rule is identical to the rule used to update the Negative Feedback Network: \Delta W = \eta y (x - W^T y)^T
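A sketch of Oja's subspace rule in its negative-feedback form (feed-forward output, feedback residual, then the weight update); the data and M are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 4)) * np.array([2.0, 1.5, 1.0, 0.5])
X -= X.mean(axis=0)                  # zero-mean data

eta, M = 0.005, 2
W = rng.normal(size=(M, 4)) * 0.1
for x in X:
    y = W @ x                        # feed-forward output
    e = x - W.T @ y                  # negative feedback: residual x - W^T y
    W += eta * np.outer(y, e)        # Delta W = eta * y * (x - W^T y)^T

# The rows of W span (approximately) the same subspace as the first M
# principal components, without individually matching any one of them.
print((W @ W.T).round(2))            # approximately the identity (near-orthonormal rows)
```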
Autoencoder and PCA
An autoencoder with linear units can also learn the PCA subspace, either through negative feedback learning (Oja's subspace rule) or through backpropagation of the reconstruction error.
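A rough sketch of the backpropagation route, assuming a linear encoder/decoder pair trained by stochastic gradient descent on the squared reconstruction error (the sizes and learning rate are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4)) * np.array([2.0, 1.5, 1.0, 0.5])
X -= X.mean(axis=0)                  # zero-mean data

eta, M = 0.01, 2
E = rng.normal(size=(M, 4)) * 0.1    # linear encoder (bottleneck of size M)
D = rng.normal(size=(4, M)) * 0.1    # linear decoder
for x in X:
    z = E @ x                        # code
    e = x - D @ z                    # reconstruction error
    step_D = np.outer(e, z)          # descent direction for 0.5 * ||e||^2 w.r.t. D
    step_E = np.outer(D.T @ e, x)    # descent direction for 0.5 * ||e||^2 w.r.t. E
    D += eta * step_D
    E += eta * step_E

# After training, the columns of D (and rows of E) move toward spanning
# the subspace of the first M principal components.
```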
Whitening Transform
This process makes the covariance matrix in the new space equal to the identity matrix, so that each dimension has the same variance. This is in contrast to PCA, whose variances are typically unequal, with the first principal component carrying the greatest variance in the data.
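A NumPy sketch of a whitening transform built from the eigendecomposition of the covariance matrix (the correlated toy data is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))   # correlated toy data
Xc = X - X.mean(axis=0)

eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
# Rotate onto the eigenvectors, then rescale each direction by
# 1/sqrt(eigenvalue) so that every dimension has unit variance.
Z = Xc @ V / np.sqrt(eigvals)

print(np.cov(Z, rowvar=False).round(2))   # approximately the identity matrix
```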
Linear Discriminant Analysis (LDA)
As PCA is unsupervised, it may remove discriminative dimensions. This may result in overlap between samples of different classes when projecting onto the principal components. LDA looks for maximally discriminative projections. This requires feature extraction to be informed by class labels (i.e. supervised learning). Overlap between elements from different categories is minimised.
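A small scikit-learn comparison, assuming toy data in which the classes are separated only along a low-variance direction (so PCA is likely to discard it while LDA keeps it):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Classes differ only along the low-variance (second) axis.
X0 = rng.normal(size=(100, 2)) * [5.0, 0.5] + [0.0, -2.0]
X1 = rng.normal(size=(100, 2)) * [5.0, 0.5] + [0.0, 2.0]
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

Z_pca = PCA(n_components=1).fit_transform(X)       # unsupervised: classes overlap
Z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)  # supervised: classes separate
```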
Fisher’s Method
This is used in LDA. Fisher's Method seeks the w that maximises J(w) = s_B / s_W: the between-class scatter s_B must be maximised while the within-class scatter s_W is minimised. s_B = (w^T(m_1 - m_2))^2, where m_i = (1/n_i) \sum_{x \in \omega_i} x. s_W = s_1^2 + s_2^2, where s_i^2 = \sum_{x \in \omega_i} (w^T(x - m_i))^2
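A NumPy sketch of Fisher's method for two classes. It uses the standard closed-form solution w \propto S_W^{-1}(m_1 - m_2), which is not spelled out on the card; the toy data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 2)) * [5.0, 0.5] + [0.0, -2.0]   # class omega_1
X2 = rng.normal(size=(100, 2)) * [5.0, 0.5] + [0.0, 2.0]    # class omega_2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter matrix, summed over both classes.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
# The w maximising J(w) = s_B / s_W is proportional to S_W^{-1}(m_1 - m_2).
w = np.linalg.solve(S_W, m1 - m2)
w /= np.linalg.norm(w)

s_B = (w @ (m1 - m2)) ** 2
s_W = np.sum(((X1 - m1) @ w) ** 2) + np.sum(((X2 - m2) @ w) ** 2)
print(s_B / s_W)   # J(w) evaluated at the Fisher direction
```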
Independent Component Analysis (ICA)
ICA finds statistically independent components, unlike PCA, which finds uncorrelated components. For Gaussian distributions, uncorrelatedness is equivalent to independence, so the two coincide; for non-Gaussian distributions, the results of ICA and PCA diverge.
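A small scikit-learn illustration using FastICA (one common ICA algorithm; the uniform sources and mixing matrix below are made up):

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(1000, 2))   # two non-Gaussian (uniform) sources
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                   # mixing matrix
X = S @ A.T                                  # observed mixtures

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)  # aims to recover the sources
S_pca = PCA(n_components=2).fit_transform(X)                      # only decorrelates the mixtures
```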