Feature engineering Flashcards
Feature engineering
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to predictive models, resulting in improved model accuracy on unseen data. It is an essential step that requires a mix of domain knowledge, intuition, and some trial and error; when done well, it can significantly improve the performance of machine learning models.
- Definition
Feature engineering is a crucial step in the machine learning pipeline that involves creating new features or modifying existing features to improve machine learning model performance.
- Importance
The performance of machine learning models heavily depends on the quality of the features in the dataset. Even sophisticated models cannot compensate for irrelevant or uninformative features. Good feature engineering can often make the difference between a poor model and an excellent one.
- Domain Knowledge
Incorporating domain knowledge can help in creating features that make machine learning algorithms work better. By understanding the context of the problem, one can create relevant features that capture essential aspects of the problem.
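For instance, in a health-care setting, knowing that body mass index (BMI) relates to many outcomes suggests combining height and weight into a single feature. A minimal sketch, assuming a hypothetical dataset with `height_cm` and `weight_kg` columns:

```python
import pandas as pd

# Hypothetical health dataset; column names are illustrative.
df = pd.DataFrame({
    "height_cm": [170, 160, 182],
    "weight_kg": [72, 55, 95],
})

# Domain knowledge: BMI = weight (kg) / height (m)^2 combines two raw
# measurements into one feature known to be predictive of health outcomes.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2
print(df)
```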
- Categorical Encoding
Many machine learning models require their input data to be in numerical format. Categorical variables (such as 'color' or 'city') are typically converted to numbers using techniques like one-hot encoding, label encoding, or target encoding.
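A minimal sketch of one-hot and label encoding with pandas (the columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "city": ["Paris", "Oslo", "Oslo"],
})

# One-hot encoding: one binary indicator column per category.
one_hot = pd.get_dummies(df, columns=["color", "city"])
print(one_hot)

# Label encoding: one integer code per category (the ordering is arbitrary,
# so this suits tree-based models better than linear ones).
df["color_code"] = df["color"].astype("category").cat.codes
print(df)
```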
- Handling Missing Values
Missing data is a common problem in real-world datasets. Techniques to handle missing data include imputation (filling missing values with statistical measures like mean or median) and creating an indicator feature to highlight when a value was missing.
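A minimal sketch of both techniques with pandas, using a hypothetical `income` column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [42000.0, np.nan, 58000.0, np.nan]})

# Create the indicator first, so the fact that a value was missing
# survives the imputation step.
df["income_was_missing"] = df["income"].isna().astype(int)

# Median imputation; it is more robust to outliers than the mean.
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```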
- Feature Scaling
Certain machine learning algorithms, such as SVMs, k-nearest neighbors (KNN), neural networks, and linear or logistic regression trained with gradient descent or regularization, work best when the input features are on similar scales. Techniques like min-max scaling and standardization are used to scale the features.
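A minimal sketch of both scalers using scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling maps each feature to the [0, 1] range.
print(MinMaxScaler().fit_transform(X))

# Standardization centers each feature at 0 with unit variance.
print(StandardScaler().fit_transform(X))
```

In practice, fit the scaler on the training set only and reuse it to transform validation and test data, so no information leaks from held-out data.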
- Feature Transformation
Features can be transformed to better fit the assumptions of a machine learning algorithm. Common transformations include logarithmic, square-root, and power (e.g., squaring) transformations.
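A minimal sketch applying log and square-root transforms with NumPy (the `price` column is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 100.0, 1000.0, 10000.0]})

# log1p computes log(1 + x): it compresses a right-skewed tail
# and handles zero values safely.
df["log_price"] = np.log1p(df["price"])

# Square root is a milder variance-stabilizing transform.
df["sqrt_price"] = np.sqrt(df["price"])
print(df)
```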
- Feature Selection
Feature selection involves selecting the most useful features to train your machine learning model. This can reduce overfitting, improve accuracy, and reduce training time. Methods include correlation coefficients, chi-square test, mutual information, and feature importance from tree-based models.
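A minimal sketch of one of these methods, using scikit-learn's SelectKBest with mutual information on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest mutual information with the target.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())  # boolean mask of the kept features
print(X.shape, "->", X_selected.shape)
```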
- Feature Extraction
Feature extraction reduces the dimensionality of high-dimensional data by deriving a smaller set of new features from the original ones. Techniques like Principal Component Analysis (PCA), t-SNE, and UMAP are used for feature extraction.
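A minimal sketch using scikit-learn's PCA on the built-in digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 64 pixel features per image

# Project onto the directions of maximum variance, keeping enough
# components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
```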
- Time-Series Specific
In time-series problems, features are often engineered from date-time variables, such as hour of day, day of week, quarter of year, month, year, etc.
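A minimal sketch extracting such features with pandas datetime accessors:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2024-01-15 08:30", "2024-06-03 17:45", "2024-11-21 23:10",
])})

# Decompose the datetime into model-friendly numeric features.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["month"] = df["timestamp"].dt.month
df["quarter"] = df["timestamp"].dt.quarter
df["year"] = df["timestamp"].dt.year
print(df)
```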