Dimensionality Reduction Flashcards
Covering subset selection and PCA
What is Dimensionality Reduction?
It is the process of reducing the number of variables under consideration by obtaining a smaller set of principal variables.
What are the methods to implement dimensionality reduction?
It can be implemented in two ways:
1. Feature Selection
2. Feature Extraction
What is Feature Selection?
Here we are interested in finding k of the total n features that give us the most information, and we discard the other (n-k) dimensions.
e.g., subset selection methods (forward and backward selection)
What is Feature Extraction?
Here, we are interested in finding a new set of k features that are combinations of the original features, e.g., Principal Component Analysis (PCA).
How is error measured in machine learning problems?
In regression:
We usually use the Mean Squared Error (MSE) or the Root Mean Squared Error (RMSE).
In Classification:
Here, we may use the misclassification rate as the measure of error.
What is MSE?
It is the sum of the squared differences between the predicted and the actual target values, divided by the number of data points.
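As a formula, with y_i the actual target, \hat{y}_i the prediction, and N the number of data points:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2$$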
What is the misclassification rate?
It is the ratio of misclassified examples to the total number of examples.
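In symbols, with \mathbf{1}(\cdot) the indicator function:

$$\text{misclassification rate} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left(\hat{y}_i \neq y_i\right)$$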
Why is dimensionality reduction useful?
- Decreasing the size of the dataset decreases the complexity of the inference algorithm during testing.
- It saves the cost of extracting unnecessary features.
- Simpler models have less variance.
- We get a better idea of the process that underlies the data, which allows knowledge extraction.
- Data with fewer dimensions can be plotted and analyzed visually for structure and outliers.
What is Subset Selection?
It is also known as feature selection, variable selection, or attribute selection. It is the process of selecting a subset of relevant features for use in model construction.
What are the two approaches in subset selection?
- Forward Selection
- Backward Selection
What is Forward selection? Explain in detail.
Here, we start with no variables and add them one at a time: at each step, we add the variable that decreases the error the most. Additions continue until no remaining variable decreases the error.
Explain the algorithm.
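A minimal Python sketch of the greedy procedure. The cards do not fix a particular model or error estimator, so `LinearRegression` and cross-validated MSE via scikit-learn's `cross_val_score` are assumptions here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, model=None):
    """Start with no features; repeatedly add the feature that lowers
    the cross-validated MSE the most; stop when nothing improves."""
    model = model if model is not None else LinearRegression()
    n = X.shape[1]
    selected, best_err = [], np.inf
    while len(selected) < n:
        # Error of the current subset plus each remaining candidate.
        trial_errs = {}
        for j in range(n):
            if j not in selected:
                cols = selected + [j]
                trial_errs[j] = -cross_val_score(
                    model, X[:, cols], y,
                    scoring="neg_mean_squared_error").mean()
        j_best = min(trial_errs, key=trial_errs.get)
        if trial_errs[j_best] >= best_err:
            break                      # no candidate decreases the error
        best_err = trial_errs[j_best]
        selected.append(j_best)
    return selected
```

Calling `forward_selection(X, y)` on an (N, n) NumPy array returns the indices of the selected k features, in the order they were added.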
What is Backward Selection? Explain in detail.
Here, we start with the set containing all the variables (features); at each step, we remove the variable whose removal decreases the error the most (or increases it the least), stopping when every removal makes the error worse.
Explain the algorithm.
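A minimal Python sketch, mirroring the forward-selection sketch above and under the same assumptions (scikit-learn, cross-validated MSE as the error measure):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def backward_selection(X, y, model=None):
    """Start with all features; repeatedly drop the feature whose removal
    yields the lowest error; stop once every removal hurts."""
    model = model if model is not None else LinearRegression()
    selected = list(range(X.shape[1]))
    best_err = -cross_val_score(model, X, y,
                                scoring="neg_mean_squared_error").mean()
    while len(selected) > 1:
        # Error after removing each variable in turn.
        trial_errs = {}
        for j in selected:
            cols = [c for c in selected if c != j]
            trial_errs[j] = -cross_val_score(
                model, X[:, cols], y,
                scoring="neg_mean_squared_error").mean()
        j_drop = min(trial_errs, key=trial_errs.get)
        if trial_errs[j_drop] >= best_err:
            break                      # removing anything increases the error
        best_err = trial_errs[j_drop]
        selected.remove(j_drop)
    return selected
```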
What is Principal Component Analysis (PCA)?
It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations.
What are the various steps in PCA? Explain.
- Consider a dataset having n features and N examples.
- Compute the mean of each variable and center the data by subtracting these means.
- Calculate the covariance matrix (S) of the centered data.
- Calculate the eigenvalues and eigenvectors of the covariance matrix (S).
- Sort the eigenvectors by decreasing eigenvalue and keep the top k as the principal components.
- Derive a new dataset by projecting the centered data onto these k principal components.
- Conclusion: the new dataset has k uncorrelated features that retain as much of the original variance as possible (see the sketch below).
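A minimal NumPy sketch that follows these steps. The number of components to keep, k, is a parameter we introduce; the cards do not specify how to choose it:

```python
import numpy as np

def pca(X, k):
    """PCA via the covariance matrix. X: (N, n) data; k: components kept."""
    # Compute the means of the variables and center the data.
    X_centered = X - X.mean(axis=0)
    # Calculate the covariance matrix S (n x n).
    S = np.cov(X_centered, rowvar=False)
    # Eigenvalues and eigenvectors of S (eigh, since S is symmetric).
    eigvals, eigvecs = np.linalg.eigh(S)
    # Keep the k eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, order]
    # Derive the new dataset: project onto the principal components.
    return X_centered @ components
```

For example, `pca(X, 2)` on an (N, 3) dataset returns an (N, 2) array whose two columns are uncorrelated and capture the directions of greatest variance.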