Preprocessing & EDA Flashcards
Augmentation
A data preprocessing technique used to artificially increase the size of a training dataset by applying various transformations to existing data samples. These transformations can include rotation, scaling, translation, cropping, and flipping, among others. Augmentation helps improve model generalization by exposing it to a wider range of variations in the input data.
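A minimal sketch of label-preserving image augmentation using plain NumPy (random flips and 90-degree rotations); in practice, libraries such as torchvision or albumentations offer richer transforms. The 8x8 array is a stand-in for a real image:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> np.ndarray:
    """Apply simple, label-preserving transformations to a 2D image array."""
    if rng.random() < 0.5:        # random horizontal flip
        image = np.fliplr(image)
    k = rng.integers(0, 4)        # random rotation by 0, 90, 180, or 270 degrees
    return np.rot90(image, k)

# generate several augmented variants of a toy 8x8 "image"
image = np.arange(64).reshape(8, 8)
augmented = [augment(image) for _ in range(4)]
```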
Bar Charts
Bar charts are graphical representations of categorical data using rectangular bars. Each bar represents a discrete category or group, and its length or height reflects the frequency, relative frequency, or other numerical value associated with that category. Bar charts are useful for visually comparing the frequency or distribution of different categories and are commonly used for categorical data such as survey responses, product sales, or demographic characteristics. They are especially effective for displaying discrete data with a small number of categories.
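A minimal matplotlib sketch with hypothetical survey counts:

```python
import matplotlib.pyplot as plt

categories = ["Yes", "No", "Undecided"]   # hypothetical survey responses
counts = [42, 31, 12]

plt.bar(categories, counts)
plt.xlabel("Response")
plt.ylabel("Frequency")
plt.title("Survey responses")
plt.show()
```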
Binning
Binning (also called bucketing) is used when you have a numerical feature and want to convert it into a categorical one. The continuous range of values is partitioned into intervals (bins or buckets), often equal-width or quantile-based, and each value is replaced by its bin or expanded into one binary feature per bin. Binning is commonly used to simplify complex datasets, reduce noise, and handle outliers.
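A minimal pandas sketch on a hypothetical series of ages: pd.cut builds fixed-edge bins, pd.qcut builds quantile-based bins, and get_dummies turns the bins into binary features:

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 34, 46, 58, 61, 72])   # hypothetical values

# explicit bin edges with readable labels
age_group = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                   labels=["child", "young", "middle", "senior"])

# quantile-based bins (roughly equally populated buckets)
age_quartile = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# one binary feature per bin, as described above
age_dummies = pd.get_dummies(age_group, prefix="age")
```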
Box Plots
Box plots, also known as box-and-whisker plots, are graphical representations of the distribution of numerical data through quartiles. They consist of a box that spans the interquartile range (IQR), with a line inside representing the median. “Whiskers” extend from the edges of the box to the minimum and maximum values within 1.5 times the IQR from the first and third quartiles, respectively. Potential outliers beyond the whiskers are often displayed as individual data points. Box plots are useful for visualizing the spread and skewness of data and identifying outliers in a dataset.
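A minimal matplotlib sketch with synthetic data; the default whiskers use the 1.5 * IQR rule described above:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
data = np.append(rng.normal(loc=50, scale=10, size=200),  # synthetic measurements
                 [110, 5])                                 # two injected outliers

plt.boxplot(data)          # whis=1.5 by default
plt.ylabel("Value")
plt.title("Box plot with 1.5 * IQR whiskers")
plt.show()
```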
Broadcasting
Broadcasting is used to perform operations on arrays of different shapes efficiently. It allows arrays with different dimensions to be combined or operated on without explicit looping, improving both computational performance and code readability. When you perform an operation between arrays, NumPy (and similar libraries) automatically expands the smaller array's shape to match the larger one, conceptually stretching or replicating its elements so the two can be combined element-wise. Broadcasting enables seamless operations across multidimensional data structures, facilitating tasks such as batch processing, data augmentation, and model training.
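A minimal NumPy sketch: a (4,) vector and a (3, 1) column are both broadcast against a (3, 4) matrix without any explicit loop:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)              # shape (3, 4)

col_means = X.mean(axis=0)                   # shape (4,)
X_centered = X - col_means                   # (3, 4) - (4,)   -> (3, 4)

row_bias = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1)
X_shifted = X + row_bias                     # (3, 4) + (3, 1) -> (3, 4)
```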
Cardinality
The number of unique values in a categorical variable or feature. High-cardinality variables have a large number of distinct categories, while low-cardinality variables have few. Cardinality is an important consideration in feature engineering and can impact model performance and complexity.
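A quick pandas check of cardinality on a hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "DE", "US", "FR", "PL", "DE"],   # low cardinality
    "user_id": [101, 102, 103, 104, 105, 106],         # high cardinality (all unique)
})

print(df.nunique())               # unique values per column
print(df["country"].nunique())    # cardinality of a single feature
```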
Categorical Plots
A type of graph used to visualize the distribution and relationships within categorical data. Categorical data is data that falls into distinct groups or categories.
Categorical plots are useful for:
Distribution: Showing how frequently each category occurs.
Comparison: Comparing different categories side-by-side.
Relationships: Investigating potential relationships between different categorical variables.
Common types of categorical plots
- Bar Plots
- Pie Charts
- Count Plots
- Box Plots
- Strip Plots
- Swarm Plots
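A minimal seaborn sketch using its built-in tips example dataset:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

sns.countplot(data=tips, x="day")                 # frequency of each category
plt.show()

sns.boxplot(data=tips, x="day", y="total_bill")   # numeric distribution per category
plt.show()
```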
Class imbalance
An unequal distribution of classes or categories in a classification dataset, where one class is significantly more prevalent than others. Class imbalance can lead to biased model predictions, as the model may have a tendency to favor the majority class and overlook minority classes. Addressing class imbalance often requires specific techniques such as resampling methods, cost-sensitive learning, or ensemble methods.
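A quick way to inspect class balance, assuming a hypothetical label series:

```python
import pandas as pd

y = pd.Series([0] * 950 + [1] * 50)     # hypothetical labels: 95% vs 5%

print(y.value_counts())                 # absolute counts per class
print(y.value_counts(normalize=True))   # class proportions
```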
Correlation Analysis
Statistical technique used to measure and assess the strength and direction of the relationship between two or more variables in a dataset. It quantifies the degree of association between variables using correlation coefficients, such as the Pearson correlation coefficient, Spearman rank correlation coefficient, or Kendall tau rank correlation coefficient. Correlation analysis helps identify patterns and dependencies among variables (although correlation alone does not establish causation), facilitating feature selection, model building, and predictive modeling in machine learning and data analysis.
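A minimal pandas/seaborn sketch with hypothetical numeric features:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({                      # hypothetical housing-style data
    "rooms": [2, 3, 3, 4, 5, 5],
    "area":  [40, 62, 58, 85, 110, 120],
    "price": [120, 180, 170, 250, 330, 360],
})

corr = df.corr(method="pearson")         # or "spearman", "kendall"
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```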
Data Balancing
Data balancing, also known as class imbalance correction or oversampling/undersampling, is a preprocessing technique used in machine learning to address imbalanced datasets where one class is significantly more prevalent than others. It involves modifying the dataset to ensure that each class is represented fairly during model training. Techniques for data balancing include random undersampling, random oversampling, Synthetic Minority Over-sampling Technique (SMOTE), and ensemble methods. Data balancing is crucial for improving the performance and fairness of classification models, particularly in applications where class distribution is skewed.
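A minimal sketch of random oversampling with scikit-learn's resample utility, assuming a hypothetical DataFrame with a binary label column; SMOTE (from the imbalanced-learn package) would be applied in a similar spot in the workflow:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({                          # hypothetical imbalanced data
    "feature": range(100),
    "label":   [0] * 90 + [1] * 10,
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# randomly oversample the minority class (with replacement) to match the majority
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())
```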
Data Cleaning
Process within data preparation that involves identifying and addressing errors, inconsistencies, outliers, and missing values within a dataset to enhance its quality and reliability.
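A minimal pandas sketch on hypothetical messy data: dropping duplicates, treating implausible values as missing, and filling the gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({                                   # hypothetical messy data
    "age":   [25, 32, np.nan, 40, 40, 250],
    "email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "d@x.com", "e@x.com"],
})

df = df.drop_duplicates()                             # remove duplicate rows
df.loc[~df["age"].between(0, 120), "age"] = np.nan    # treat implausible ages as missing
df["age"] = df["age"].fillna(df["age"].median())      # fill remaining gaps
```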
Data imputation
A process of filling in missing values in a dataset with estimated or predicted values. It is a common technique used to handle missing data before performing analysis or training machine learning models. Imputation methods can range from simple strategies like mean or median imputation to more complex techniques such as regression-based imputation or k-nearest neighbors imputation.
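A minimal scikit-learn sketch showing simple and k-nearest-neighbors imputation on a toy array:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

mean_imputer = SimpleImputer(strategy="mean")   # or "median", "most_frequent"
X_mean = mean_imputer.fit_transform(X)

knn_imputer = KNNImputer(n_neighbors=2)         # k-nearest neighbors imputation
X_knn = knn_imputer.fit_transform(X)
```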
Data Preprocessing
Techniques used to condition raw data before feeding it into a machine learning model. This includes tasks like scaling, normalization, transforming data types, feature engineering, and reducing the number of features (dimensionality reduction).
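A minimal sketch of a scikit-learn pipeline that couples scaling with a model, using random toy data; wrapping preprocessing in a Pipeline keeps the same transformations applied at training and prediction time:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)                 # toy features
y = np.random.randint(0, 2, size=100)      # toy binary target

model = Pipeline([
    ("scale", StandardScaler()),           # preprocessing step
    ("clf", LogisticRegression()),         # estimator
])
model.fit(X, y)
print(model.predict(X[:5]))
```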
Data Quality Assessment
Process of evaluating the accuracy, completeness, consistency, and reliability of data to ensure that it meets the requirements of the intended use. It involves identifying and correcting errors, anomalies, and inconsistencies in the data, as well as assessing its fitness for specific purposes. Data quality assessment encompasses various techniques and methodologies, including data profiling, data cleansing, outlier detection, and validation. It is essential for ensuring the integrity and trustworthiness of data in decision-making, analysis, and modeling processes.
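A minimal pandas profiling sketch, assuming a hypothetical data.csv file:

```python
import pandas as pd

df = pd.read_csv("data.csv")          # hypothetical input file

df.info()                             # dtypes, non-null counts, memory usage
print(df.describe(include="all"))     # basic statistics per column
print(df.isna().sum())                # missing values per column
print(df.duplicated().sum())          # number of duplicate rows
```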
Data Sampling
Process of selecting a subset of observations or data points from a larger dataset to represent the population or distribution of interest. Sampling techniques can be random or non-random and may involve techniques such as simple random sampling, stratified sampling, systematic sampling, or cluster sampling. Data sampling is widely used in statistics, survey research, and machine learning for estimating population parameters, reducing computational complexity, and generating training datasets for model training and evaluation.
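A minimal sketch of simple random and stratified sampling, assuming a hypothetical DataFrame with a "label" column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                    # hypothetical dataset with a "label" column

simple = df.sample(n=1000, random_state=0)      # simple random sample

# stratified split: class proportions are preserved in both parts
train, test = train_test_split(df, test_size=0.2,
                               stratify=df["label"], random_state=0)
```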
Data Transformation
Process of converting or modifying raw data into a more suitable format for analysis, modeling, or visualization. It involves operations such as normalization, standardization, scaling, encoding, imputation, aggregation, and feature engineering. Data transformation aims to improve the quality, interpretability, and performance of data in machine learning, statistical analysis, and data-driven decision-making processes. It plays a crucial role in preprocessing pipelines, where it prepares the data for subsequent tasks such as modeling, clustering, or classification.
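A minimal scikit-learn sketch that standardizes a numeric column and one-hot encodes a categorical one in a single ColumnTransformer, on hypothetical data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({                              # hypothetical raw data
    "income": [30_000, 52_000, 87_000, 41_000],
    "city":   ["Kraków", "Berlin", "Berlin", "Paris"],
})

transform = ColumnTransformer([
    ("scale",  StandardScaler(), ["income"]),    # standardize the numeric column
    ("encode", OneHotEncoder(),  ["city"]),      # one-hot encode the categorical column
])
X = transform.fit_transform(df)
```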
Data Visualization
Graphical representation of data and information to facilitate understanding and interpretation. It encompasses a wide range of techniques and tools for creating visual representations such as charts, graphs, maps, and dashboards. Data visualization is used to explore patterns, trends, and relationships in data, communicate insights, and support decision-making in various fields including business, science, and engineering. It plays a crucial role in exploratory data analysis, storytelling, and conveying complex information to diverse audiences.
Dealing with missing features
- removing rows or columns
- imputing values
- using domain knowledge to create derived features
- getting new and more data from source or other data sets
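A minimal pandas sketch of the first two strategies on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [50_000, 60_000, np.nan]})

rows_dropped = df.dropna()                             # remove rows with any missing value
cols_dropped = df.dropna(axis=1)                       # remove columns with any missing value
imputed = df.fillna(df.median(numeric_only=True))      # impute with column medians
```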
Decoder
A component or algorithm that transforms encoded data or representations back into their original format or domain. Decoders are commonly used in autoencoders, generative models, and communication systems to recover information from compressed or encoded representations. In natural language processing, decoders are used in sequence-to-sequence models for generating output sequences from encoded input representations.
Density Plots
Density plots are best suited for continuous numeric data. A density plot is a smoothed version of a histogram, used to visualize the distribution of a continuous numerical variable. It shows the estimated probability density of the data. The plot consists of a curve that represents the probability density function (PDF) of the variable. The area under the curve always totals to 1.
- Peaks in the curve indicate regions where data points are more concentrated.
- Valleys represent areas where data is less frequent.
- The overall shape gives insights into the spread, skewness, and whether the distribution has multiple modes (peaks).
Density plots are useful for identifying distributions with multiple peaks, which histograms might obscure.
The smoothness of a density plot is controlled by a parameter called the bandwidth. Experimenting with different bandwidths can change the level of detail revealed.
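A minimal seaborn sketch on its built-in tips dataset; bw_adjust scales the bandwidth, so smaller values reveal more detail and larger values smooth more:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

sns.kdeplot(data=tips, x="total_bill", bw_adjust=0.5, label="bw_adjust=0.5")
sns.kdeplot(data=tips, x="total_bill", bw_adjust=2.0, label="bw_adjust=2.0")
plt.legend()
plt.show()
```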
Encoder
A component or algorithm that converts raw input data into a suitable format for processing, analysis, or modeling. Encoders transform data from one representation to another, such as converting categorical variables into numerical representations or compressing high-dimensional data into low-dimensional embeddings. In deep learning, encoders are commonly used in autoencoders, sequence-to-sequence models, and neural network architectures to learn compact and informative representations of input data for downstream tasks such as classification, regression, and generation.
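A minimal PyTorch autoencoder sketch illustrating both the Encoder and Decoder entries: the encoder compresses the input into a latent code and the decoder reconstructs the input from it. The layer sizes are arbitrary examples:

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        # encoder: compress the input into a low-dimensional representation
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # decoder: reconstruct the original input from the latent code
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # encoded representation
        return self.decoder(z)     # reconstruction

x = torch.rand(8, 784)             # a toy batch of flattened 28x28 inputs
reconstruction = AutoEncoder()(x)
```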
Encoding
A process of converting categorical variables or features into numerical representations that can be used as input for machine learning algorithms. Encoding allows categorical information to be effectively incorporated into machine learning models, which typically require numerical input data.
We can encode ordinal data (with an intrinsic order) or nominal data (without an intrinsic order). For ordinal data we often use label (ordinal) encoding; for nominal data, one-hot encoding. Other common encoding techniques include binary encoding, frequency encoding, and target encoding.
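A minimal sketch of ordinal and one-hot encoding on a hypothetical DataFrame:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "size":  ["small", "large", "medium", "small"],   # ordinal feature
    "color": ["red", "green", "blue", "green"],       # nominal feature
})

# label/ordinal encoding with an explicit category order
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = ord_enc.fit_transform(df[["size"]]).ravel()

# one-hot encoding for the nominal feature (sklearn's OneHotEncoder works similarly)
df = pd.get_dummies(df, columns=["color"])
```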
Feature Engineering
The problem of transforming raw data into a dataset of features is called feature engineering.
Feature Engineering is a process of creating new features or modifying existing features in a dataset to improve the performance of machine learning models. It involves selecting relevant features, transforming data, creating derived features, and reducing dimensionality. Effective feature engineering can enhance model interpretability, accuracy, and generalization to new data.
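A minimal pandas sketch of derived features built from hypothetical order data:

```python
import pandas as pd

df = pd.DataFrame({                                  # hypothetical transactions
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-17", "2024-03-02"]),
    "total":      [120.0, 80.0, 200.0],
    "n_items":    [4, 2, 5],
})

# derived features built from the raw columns
df["avg_item_price"] = df["total"] / df["n_items"]
df["order_month"] = df["order_date"].dt.month
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5
```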
Feature Importance Analysis
Feature importance analysis is a set of techniques used to rank the features (input variables) in a machine learning model based on how much they contribute to the model’s predictions. Feature importance doesn’t mean causation. Highly important features may be correlated with other important features.
Common Techniques
1) Permutation Importance (a minimal sketch follows this technique list):
Shuffle the values of a single feature randomly.
Re-evaluate the model’s performance.
A large drop in performance indicates that the feature is important.
Repeat for all features to see relative importance.
2) Mean Decrease in Impurity (Tree-based Models):
For decision trees and random forests, calculate how much each feature decreases the impurity (e.g., Gini index or entropy) across the splits in the trees.
Features that create purer splits are assigned higher importance.
3) Coefficients (in Linear Models):
For linear models like linear regression and logistic regression, the magnitude of feature coefficients indicates a feature’s impact (assuming features are scaled properly).
4) Partial Dependence Plots (PDP):
Show the marginal effect of one feature on the predicted outcome.
Helps visualize how changes in a feature influence the prediction, even if the relationship is non-linear.
5) Information Gain:
Used in decision trees and similar models to determine the most informative features for splitting nodes.
It measures the reduction in entropy (or increase in information) achieved by splitting data based on a particular feature.
Features with higher information gain are considered more important for classification tasks.
6) SHAP Values (SHapley Additive exPlanations):
Provides a unified measure of feature importance based on game theory concepts.
It calculates the contribution of each feature to the difference between the actual prediction and the average prediction across all samples.
Positive SHAP values indicate features that increase the prediction, while negative values indicate features that decrease the prediction.
7) L1 (Lasso) Regularization:
In regularized linear models like Lasso Regression, features with non-zero coefficients after regularization are considered important.
L1 regularization encourages sparsity by penalizing the absolute values of the coefficients, effectively selecting a subset of the most important features.
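A minimal scikit-learn sketch of techniques 1 and 2 (permutation importance and mean decrease in impurity) on the built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

impurity_importance = model.feature_importances_   # mean decrease in impurity (technique 2)

# permutation importance on held-out data (technique 1)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean),
                key=lambda p: p[1], reverse=True)
print(ranked[:5])
```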
Libraries for Feature Importance
Scikit-learn (Python): Offers permutation importance and built-in methods for tree-based models.
ELI5 (Python): Great for explaining model predictions and visualizing feature importance.
DALEX (R): Provides a range of feature importance methods and explainers.
Feature selection
A process of choosing a subset of relevant features from a larger set of features in a dataset. It aims to reduce dimensionality, improve model performance, and enhance interpretability by focusing on the most informative features. Feature selection techniques include filter methods, wrapper methods, and embedded methods, which assess feature importance based on statistical measures, model performance, or feature relevance to the target variable. The main methods are listed below, followed by a short code sketch.
Filter Methods
- Information Gain: Measures how much information a feature provides about the target variable.
- Chi-Squared Test: Evaluates the independence between a feature and the target variable.
- Correlation Analysis: Identifies features that are highly correlated with the target variable or strongly correlated among themselves (potentially leading to redundancy).
- Variance Threshold: Removes features with low variance, as they are unlikely to carry much predictive information.
Wrapper Methods
- Forward Selection: Starts with an empty feature set and iteratively adds the feature that most improves model performance.
- Backward Elimination: Starts with all features and iteratively removes the least important feature until performance drops below a threshold.
- Recursive Feature Elimination (RFE): A variant of backward elimination that uses a model to rank feature importance and recursively eliminates the least important ones.
Embedded Methods
- Regularization (L1/Lasso, L2/Ridge): Penalizes model complexity, forcing coefficients of less important features towards zero.
- Decision Trees and Tree-Based Ensembles: Tree-based algorithms (like Random Forests) provide feature importance scores that can be used for selection.
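A minimal scikit-learn sketch of one filter method (SelectKBest with the ANOVA F-score) and one wrapper method (RFE) on the built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# filter method: keep the 10 features with the highest ANOVA F-score
filtered = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print(list(X.columns[filtered.get_support()]))

# wrapper method: recursive feature elimination driven by a linear model
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)
print(list(X.columns[rfe.get_support()]))
```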