Lectures Flashcards
What are the drawbacks of increasing dimensionality?
- Data becomes sparse
- It becomes harder to generalize the model
- Increasing the number of features will not always improve classification accuracy
What is the correlation between the number of training examples and dimensionality?
The number of training examples required increases exponentially with dimensionality D
What is the Hughes phenomenon?
If the number of training samples is fixed and we keep increasing the number of dimensions, the predictive power of the machine learning model first increases, but after a certain point it tends to decrease
Why is dimensionality reduction required?
- The space required to store the dataset also gets reduced
- Less computation/training time is required
- It removes redundant and irrelevant features
- It helps in interpretation and visualization
How is dimensionality reduction achieved?
- Some features may contain negligible or irrelevant information
- Several features can be combined together without loss or gain of information
What is dimensionality reduction?
It is a data preparation technique performed on data prior to modeling. It might be performed after data cleaning and data scaling and before training a predictive model
What are the dimensionality reduction techniques?
- Feature selection:
Chooses a subset of the original features
- Feature extraction:
Computes a new set of features from the original features through some transformation f()
Explain the feature selection technique
- Selects the most relevant ones to build better, faster, and easier to understand learning models.
Filter approach:
These methods evaluate the relevance of features independently of the chosen learning algorithm based on statistical measurements
Wrapper approach:
These methods assess the performance of a specific machine learning algorithm by repeatedly training and evaluating models with different subsets of features and select the best ones
Embedding approach:
These methods integrate feature selection within the model building process itself
What are the common techniques of feature selection through filtering?
- This method is done as one of the pre-processing steps before passing the data to build a model
- Mutual information:
Calculate the MI (degree of dependence) of each feature with respect to the class variable. Next, rank the features by their MI and select the top ones
- Correlation coefficient:
A statistical measure of the strength of the linear association between two variables. It helps identify which variables closely resemble each other. If the coefficient value is higher than a chosen threshold, we can remove one of the two variables from the dataset. It ranges from -1 to 1, where a value closer to 1 shows that the variables are highly positively correlated and a value closer to -1 shows that they are negatively correlated
- Variance threshold:
Removes all features whose variance is lower than a given threshold
Set of features -> selecting best feature -> learning algorithm -> performance
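For illustration, a minimal sketch of these filter techniques with scikit-learn, assuming a feature matrix X and class labels y; the Iris dataset and the threshold values below are arbitrary example choices, not part of the lecture.

```python
# Filter-based feature selection sketch: variance threshold, mutual
# information ranking, and correlation-based removal.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Variance threshold: drop features whose variance is below 0.2
# (0.2 is an arbitrary choice for this example).
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Mutual information: score each feature against the class variable,
# rank them, and keep the top k=2.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_mi = selector.fit_transform(X, y)
print("MI scores per feature:", selector.scores_)

# Correlation coefficient: drop one of any pair of features whose
# absolute Pearson correlation exceeds 0.9.
corr = np.corrcoef(X, rowvar=False)
to_drop = {j for i in range(corr.shape[0])
             for j in range(i + 1, corr.shape[1])
             if abs(corr[i, j]) > 0.9}
X_corr = np.delete(X, list(to_drop), axis=1)
```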
How is feature selection done through wrapper methods?
- The selection of features is treated as a search problem, in which different combinations are made, evaluated, and compared with other combinations.
1- Split the data into subsets and train a model
2- Based on the output of the model, add or remove features and train the model again
3- Evaluate the accuracy of all the possible feature combinations
What are some common techniques for feature selection through wrapper methods?
1- Forward selection:
Start with an empty set of attributes S. At each step, add the one attribute that decreases the validation error the most; stop when the validation error becomes stable or shows no significant improvement
2- Backward elimination:
Start with the set of all attributes, then repeatedly drop the feature with the smallest impact on the error
set of features -> (generate subset -> algorithm) -> performance
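For illustration, a sketch of wrapper-style selection using scikit-learn's SequentialFeatureSelector; the estimator, cv folds, and number of features to keep are example choices only.

```python
# Wrapper feature selection sketch: greedy forward selection and
# backward elimination driven by cross-validated model accuracy.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Forward selection: start from an empty set and greedily add the feature
# that improves cross-validated accuracy the most.
forward = SequentialFeatureSelector(model, n_features_to_select=2,
                                    direction="forward", cv=5)
forward.fit(X, y)
print("Forward selection kept:", forward.get_support())

# Backward elimination: start from all features and drop the one whose
# removal hurts cross-validated accuracy the least.
backward = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction="backward", cv=5)
backward.fit(X, y)
print("Backward elimination kept:", backward.get_support())
```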
Filtering vs wrapper methods
- Filter methods evaluate features independently of the learning algorithm using statistical measures, so they are fast and model-agnostic but ignore how features interact with the chosen model
- Wrapper methods evaluate feature subsets by repeatedly training and testing the chosen model, so they are slower and more expensive but usually find subsets better suited to that model
Explain feature extraction
- It transforms the space containing too many dimensions into a space with fewer dimensions
- It aims to reduce the number of features in a dataset by creating new features from existing ones
- The primary goal is to compress the data with the goal of maintaining most of the relevant information
What are feature extraction techniques?
- Principal component analysis (PCA):
Seeks a projection that preserves as much information in the data as possible
- Linear discriminant analysis (LDA):
Seeks a projection that best discriminates the data
Explain the PCA technique
- It is an unsupervised linear dimensionality reduction method that increases interpretability and minimizes information loss
- PCA assumes linear relationships between variables
- It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation
- These new transformed features are called the principal components which capture the maximum variance in the data
- So they are straight lines that capture most of the variance of the data and they have a direction and magnitude
- These components are linear combinations of the original features and provide a new coordinate system for the data
What are the mathematical steps of the PCA algorithm?
1- Standardize the data:
PCA requires standardized data so the first step is to standardize the data to ensure that all variables have a mean of 0 and a standard deviation of 1
2- Calculate the covariance matrix:
The next step is to calculate the covariance matrix of the standardized data. This matrix shows how each variable is related to every other variable
3- Calculate the eigenvectors and eigenvalues
4- Choose the principal components:
Computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance
5- Create the new feature vector:
The final step is to transform the original data into the lower-dimensional space defined by the principal components
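A small NumPy sketch that mirrors these five steps; X is assumed to be an (n_samples, n_features) array, and the random data at the end is only for demonstration.

```python
# Step-by-step PCA sketch following the flashcard steps above.
import numpy as np

def pca(X, k):
    # 1) Standardize: zero mean, unit standard deviation per feature.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2) Covariance matrix of the standardized data (features x features).
    cov = np.cov(X_std, rowvar=False)

    # 3) Eigenvectors and eigenvalues of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: cov is symmetric

    # 4) Order components by eigenvalue, largest (most variance) first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 5) Project the data onto the top-k eigenvectors (principal components).
    components = eigvecs[:, :k]
    return X_std @ components, eigvals / eigvals.sum()

# Example: keep 2 components of random data and inspect explained variance.
X = np.random.RandomState(0).rand(100, 5)
X_reduced, explained_ratio = pca(X, k=2)
print(X_reduced.shape, explained_ratio[:2])
```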
What are some important properties of PCA?
- PCA assumes that the relationships between variables are linear
- PCA assumes that principal components with larger variances are more important and should be retained
- PCA assumes that the principal components are orthogonal to each other
- PCA works best when the data is approximately normally distributed
- The number of principal components is always less than or equal to the number of attributes
- The importance of the principal components decreases as their index increases
- In general, the first components explain the largest variance of the data
- PCA does not handle missing data directly; missing values are typically handled beforehand using techniques such as mean imputation
What are the advantages of PCA?
- Dimensionality reduction:
By determining the most crucial features or components, PCA reduces the dimensionality of the data, which is one of its primary benefits. This is helpful when the initial data contains many variables and is therefore challenging to visualize or analyze
- Feature extraction:
PCA can also be used to derive new features or components from the original data that might be more insightful or understandable than the original features. This is particularly helpful when the original features are correlated or noisy
- Data visualization:
By projecting the data onto the first few principal components, PCA can be used to visualize high-dimensional data in two or three dimensions. This can help locate patterns or clusters that may not be visible in the original high-dimensional space
- Noise reduction:
By locating the underlying signal or pattern in the data, PCA can also be used to lessen the impact of noise or measurement errors
What are the limitations of PCA?
- Interpretability:
The principal components may lack interpretability as they are linear combinations of the original features
- Scale dependence:
PCA is sensitive to the scaling of the features, so features should be standardized before applying PCA
- Linear assumption:
PCA is a linear technique and may not capture nonlinear relationships in the data
- Outlier sensitivity:
PCA is sensitive to outliers in the data, which can distort the principal components
- Computational complexity:
For big datasets, it may be costly to compute the eigenvectors and eigenvalues of the covariance matrix
How we should choose K in PCA?
- K is typically chosen based on how much information (variance) we want to preserve in the data:
- It is usually chosen to preserve 90% of the information in the data
- If K = D we preserve 100% of the information in the data
- Use cross validation to determine the number of PCs that maximizes the model’s performance on unseen data
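A sketch of this choice with scikit-learn's PCA, assuming illustrative random data; the 90% figure matches the rule of thumb above.

```python
# Choose K so that roughly 90% of the variance is preserved.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(200, 20)

pca = PCA().fit(X)                                   # fit all D components
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cumulative >= 0.90)) + 1           # smallest K reaching 90%
print("K preserving 90% of the variance:", k)

# Alternatively, scikit-learn picks K for you when given a fraction:
X_reduced = PCA(n_components=0.90).fit_transform(X)
print("Reduced shape:", X_reduced.shape)
```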
Explain feature extraction using LDA
- It is a supervised linear dimensionality reduction technique that aims to find a new set of variables that maximizes the separation between classes while minimizing the variation within each class
- The resulting components are ranked by their discriminative power and can be used to visualize and interpret the data, as well as for classification or regression tasks
- LDA assumes that the input data follows a Gaussian distribution, so applying LDA to non-Gaussian data can lead to poor classification results
- LDA assumes that the classes are linearly separable in the lower-dimensional space
- LDA seeks to find directions along which the classes are best separated
- It takes into consideration the scatter within-classes and between classes
How does LDA work?
1- Computing the within-class and between-class scatter matrices
2- Computing the eigenvectors and their corresponding eigenvalues for the scatter matrices
3- Sorting the eigenvalues and selecting the top k
4- Creating a new matrix that will contain the eigenvectors mapped to the k eigenvalues
5- The data is then projected onto the eigenvectors with the largest eigenvalues, which represent the most discriminative directions
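A NumPy sketch of these steps, assuming X is an (n_samples, n_features) array, y holds class labels, and k <= n_classes - 1; the Iris example at the end is only for demonstration.

```python
# LDA sketch: build scatter matrices, eigendecompose, and project.
import numpy as np

def lda(X, y, k):
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((n_features, n_features))   # within-class scatter
    S_B = np.zeros((n_features, n_features))   # between-class scatter

    # 1) Build the scatter matrices class by class.
    for c in np.unique(y):
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_W += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)

    # 2) Eigenvectors/eigenvalues of S_W^{-1} S_B.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)

    # 3-4) Sort by eigenvalue and keep the top-k eigenvectors.
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:k]].real

    # 5) Project onto the most discriminative directions.
    return X @ W

# Example with the Iris data (3 classes -> at most 2 components).
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
print(lda(X, y, k=2).shape)
```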
How to evaluate the performance of dimensionality reduction techniques?
- Explained variance ratio (PCA):
The amount of variance in the original data that is captured or explained by each principal component
- Classification accuracy (LDA):
Train a classifier on the lower-dimensional data and measure the classification accuracy on a test set
- Visualization (PCA/LDA):
Visualize the lower-dimensional data and assess whether the classes are well separated and the structure of the data is preserved
- Cross validation (PCA/LDA):
Use cross validation to estimate the generalization performance of the dimensionality reduction technique on unseen data
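A sketch of two of these checks with scikit-learn; the dataset, classifier, and component counts are illustrative assumptions.

```python
# Evaluate dimensionality reduction: explained variance of a PCA fit and
# cross-validated accuracy of a classifier trained on LDA-reduced data.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Explained variance ratio of a 2-component PCA.
pca = PCA(n_components=2).fit(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Cross-validated classification accuracy after LDA reduction.
pipe = make_pipeline(LinearDiscriminantAnalysis(n_components=2),
                     KNeighborsClassifier())
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy with LDA features:", scores.mean())
```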
Explain the difference between linear discriminant analysis and PCA
- PCA ignores class labels and focuses on finding the principal components that maximize the variance in the data. Thus it is an unsupervised algorithm
- LDA is a supervised algorithm that finds the linear discriminants representing the axes that maximize the separation between different classes
- LDA is typically chosen over PCA when the goal is classification or when the class structure in the data is known and important
- PCA and LDA can be used together in a pipeline, where PCA is applied first to reduce the dimensionality of the data, followed by LDA for class separation
- PCA can have at most n_features components while LDA can have at most n_classes -1 components
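A sketch of the PCA-then-LDA pipeline mentioned above, with an illustrative dataset and component counts.

```python
# PCA first reduces dimensionality; LDA then finds class-separating axes.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)           # 64 features, 10 classes

pipe = make_pipeline(PCA(n_components=20),                       # <= n_features
                     LinearDiscriminantAnalysis(n_components=9)) # <= n_classes - 1
X_2step = pipe.fit_transform(X, y)
print(X_2step.shape)                          # (n_samples, 9)
```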
When to choose LDA over PCA?
- Supervised learning:
Maximize the separation between classes for better classification or visualization
- Class separation:
Find a lower-dimensional representation that best separates the classes rather than simply capturing the maximum variance as PCA does
- Interpretability:
LDA components may be more interpretable than PCA components since they are directly related to class separation
Can LDA handle nonlinear relationships between features?
Not directly, but it can through extensions such as kernel LDA and quadratic discriminant analysis
What is unsupervised learning?
- An unsupervised model uncovers interesting structure in the data.
- It can identify clusters of related datapoints without relying on pre-existing labels or target variables
What is clustering?
- It is the classification of objects into different groups, or more precisely, the partitioning of a dataset into subsets, so that the data in each subset share some common traits, often according to some defined distance measure
- The information clustering uses is the similarity between examples
- The data within the same cluster are very similar, while the data in distinct clusters are different
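As an illustration, a minimal k-means run on two synthetic blobs; k-means is just one common clustering algorithm, chosen here as an example, and is not prescribed by the lecture card.

```python
# Cluster two well-separated groups of points: similar points (close under
# Euclidean distance) end up in the same cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))   # roughly 50 points per cluster
```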
What are the reasons for data clustering?
- Discover the nature and structure of the data
- Data classification
- Data coding and compression
- Cluster data whose characteristics change over time