Week 10: Feature Engineering & Dimensionality Reduction Flashcards
Standardise Numeric Values
Subtract the mean \mu from each value and divide each centred value by the standard deviation \sigma: z = (x - \mu) / \sigma.
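A minimal NumPy sketch (the sample values are hypothetical):

```python
import numpy as np

x = np.array([12.0, 15.0, 9.0, 21.0, 18.0])   # hypothetical values
z = (x - x.mean()) / x.std()                   # (x - mu) / sigma
```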
Convert Numeric Values into Percentiles
The x-th percentile means that x percent of the samples are less than the current sample.
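A minimal sketch of the percentile rank under this definition (values are hypothetical):

```python
import numpy as np

x = np.array([12.0, 15.0, 9.0, 21.0, 18.0])   # hypothetical values
# Percentile rank: fraction of samples strictly less than each value, in percent.
percentiles = np.array([(x < v).mean() * 100 for v in x])
```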
Convert Counts into Rates
This is useful when tracking events over time, since a rate measures how often an event occurs per unit of time.
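A minimal pandas sketch; the column names and counts are hypothetical:

```python
import pandas as pd

# Convert an event count into a rate per time unit.
df = pd.DataFrame({"logins": [42, 7], "days_observed": [30, 5]})
df["logins_per_day"] = df["logins"] / df["days_observed"]
```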
Replace Categorical Variables with Numeric Variables
Replacements include numeric descriptor variables and binary indicator variables.
For example, cities can be described by a series of numeric descriptor variables (e.g. population, median income, annual rainfall).
One-hot encoding can be used for a variable with only a few categories.
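A minimal one-hot encoding sketch in pandas (the city values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Sydney", "Melbourne", "Sydney", "Perth"]})
one_hot = pd.get_dummies(df["city"], prefix="city")   # one binary indicator column per category
```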
Replace Numerical Variables with Categorical Variables
Binning is a common technique; variants include equal-width binning, equal-frequency binning (each bin carries equal weight), and supervised binning.
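A minimal sketch of equal-width vs. equal-frequency binning in pandas (values and labels are hypothetical):

```python
import pandas as pd

ages = pd.Series([23, 31, 45, 52, 67, 18, 39])   # hypothetical values
# Equal-width: bins span equal ranges of the variable.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "older"])
# Equal-frequency: each bin holds roughly the same number of samples.
equal_freq = pd.qcut(ages, q=3, labels=["young", "middle", "older"])
```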
Combining Variables
Common examples include BMI and Price-to-Earnings ratio.
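A minimal sketch deriving BMI as a combined feature (weight in kg divided by height in m squared; the data are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"weight_kg": [70.0, 85.0], "height_m": [1.75, 1.80]})
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
```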
Problems with High Dimensional Data
Issues include correlated (redundant) input variables and an increased risk of overfitting.
Sparse Data Problem
Sparse data may result in isolated points with few neighbours, which makes pattern recognition more difficult.
Variable Selection
This is key for reducing the number of predictors in high-dimensional problems. The relevance of variables depends on independence, correlation, and average mutual information.
It’s important to remove attributes with low mutual information with the target attribute, attributes highly correlated with other attributes (redundant), and attributes independent of the target attribute.
Correlation
\rho(\boldsymbol{x}_j, \boldsymbol{y}) = \frac{\sum_{i=1}^m (x_{i,j} - \overline{x}_j)(y_i - \overline{y})}{\sqrt{\sum_{i=1}^m (x_{i,j}-\overline{x}_j)^2 \sum_{i=1}^m (y_i - \overline{y})^2}}
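A minimal sketch computing the same Pearson correlation with NumPy (the data are hypothetical):

```python
import numpy as np

x_j = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical input column
y   = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical target
rho = np.corrcoef(x_j, y)[0, 1]             # same quantity as the formula above
```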
Average Mutual Information
I(\boldsymbol{y}; \boldsymbol{x}_j) = H(\boldsymbol{y}) - H(\boldsymbol{y} \mid \boldsymbol{x}_j)
Note that H(\boldsymbol{y} \mid \boldsymbol{x}_j) is conditional entropy.
H(\boldsymbol{y} \mid \boldsymbol{x}_j) = - \sum_{y \in Dom(\boldsymbol{y})} \sum_{x \in Dom(\boldsymbol{x}_j)} P(x,y) \log [P(y \mid x)]
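A minimal sketch of I(y; x_j) = H(y) - H(y | x_j) from a joint probability table (the probabilities are hypothetical):

```python
import numpy as np

# Rows index values of x_j, columns index values of y.
P_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
P_y = P_xy.sum(axis=0)                               # marginal of y
H_y = -np.sum(P_y * np.log2(P_y))                    # entropy H(y)
P_y_given_x = P_xy / P_xy.sum(axis=1, keepdims=True) # P(y | x)
H_y_given_x = -np.sum(P_xy * np.log2(P_y_given_x))   # conditional entropy H(y | x_j)
I = H_y - H_y_given_x                                # average mutual information
```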
Exhaustive Feature Selection
Exhaustively try all combinations of a set of variables. This approach is impractical for high-dimensional data.
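A minimal sketch of the enumeration; score() is a hypothetical function returning validation performance for a subset:

```python
from itertools import combinations

variables = ["age", "income", "rainfall"]
subsets = [c for r in range(1, len(variables) + 1)
           for c in combinations(variables, r)]
# best = max(subsets, key=score)   # 2^n - 1 subsets: infeasible for many variables
```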
Forward Selection
A sequential feature selection method. Start without any variables in the model. Build a family of models with one input variable per model. Pick the best input variable. Repeat by adding one variable at a time. Terminate once a predefined maximum number of variables is reached, or once adding a new variable no longer improves the model (see the sketch below).
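A minimal sketch of forward selection, assuming a pandas DataFrame X, a target y, a hypothetical limit max_vars, and cross-validated linear regression as the scoring model:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_vars):
    selected, remaining = [], list(X.columns)
    best_score = float("-inf")
    while remaining and len(selected) < max_vars:
        # Score each candidate variable when added to the current set.
        scores = {v: cross_val_score(LinearRegression(),
                                     X[selected + [v]], y, cv=5).mean()
                  for v in remaining}
        best_var = max(scores, key=scores.get)
        if scores[best_var] <= best_score:   # no improvement: stop
            break
        selected.append(best_var)
        remaining.remove(best_var)
        best_score = scores[best_var]
    return selected
```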
Backward Selection
Start with all the variables initially included in the model. Each variable is removed one at a time to test its importance to the model. The least important variable is removed. Variables are removed until a minimum number of variables is reached or until the remaining variables are all above a certain level of importance.
This is typically a time-consuming approach.
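A minimal sketch using scikit-learn's SequentialFeatureSelector on synthetic data; n_features_to_select=5 is an arbitrary stopping point, and the estimator choice is an assumption:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=20, random_state=0)
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=5,
                                direction="backward")   # remove variables one at a time
sfs.fit(X, y)
kept = sfs.get_support()   # boolean mask of the variables that remain
```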
Projections
Transform points from the \mathbb{R}^n space to the \mathbb{R}^k space. For example, projecting a 3-D ball onto a 2-D plane results in a circle.
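A minimal sketch of a linear projection from \mathbb{R}^3 to \mathbb{R}^2 (here simply dropping the third axis; the point cloud is hypothetical):

```python
import numpy as np

points_3d = np.random.default_rng(0).normal(size=(100, 3))   # hypothetical 3-D points
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])        # projection matrix from R^3 to R^2
points_2d = points_3d @ P.T            # each 3-D point mapped to the 2-D plane
```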