Data Preprocessing Flashcards
What is Data Preprocessing?
A step in data science that transforms raw data into a format that can be understood and analyzed by computers and machine learning models.
Why is Data Preprocessing important?
Real-world data is often dirty, incomplete, noisy, or inconsistent, and preprocessing helps clean and prepare it for analysis.
What are the major tasks involved in Data Preprocessing?
Data cleaning, data integration, data reduction, and data transformation.
What is Data Cleaning?
The process of handling missing values, smoothing noisy data, identifying/removing outliers, and resolving inconsistencies.
What are the common issues in real-world data?
Missing values, noise, errors, duplicate records, and inconsistent data formats.
What are some methods to handle missing data?
Removal of affected rows or columns, interpolation, or replacement with the mean or the most common value.
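A minimal sketch of these options using pandas on a hypothetical toy DataFrame (column names are illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing ages and cities.
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan, 40],
    "city": ["Oslo", "Bergen", None, "Oslo", "Bergen"],
})

dropped = df.dropna()                                                   # removal: drop rows with any missing value
mean_filled = df.assign(age=df["age"].fillna(df["age"].mean()))         # replace with the mean
mode_filled = df.assign(city=df["city"].fillna(df["city"].mode()[0]))   # replace with the most common value
interpolated = df.assign(age=df["age"].interpolate())                   # linear interpolation between known values

print(mean_filled)
```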
What is noisy data?
Random errors or variance in a dataset that can affect model performance.
What are common techniques for handling noisy data?
Binning, clustering, and combined computer-human inspection.
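A small sketch of binning-based smoothing, assuming pandas and a hypothetical noisy series: each value is replaced by the mean of its bin.

```python
import pandas as pd

# Hypothetical noisy measurements.
values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Equal-width binning into 3 bins, then smoothing by bin means.
bins = pd.cut(values, bins=3)
smoothed = values.groupby(bins).transform("mean")

print(pd.DataFrame({"raw": values, "bin": bins.astype(str), "smoothed": smoothed}))
```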
What is Data Integration?
The process of combining data from multiple sources to create a unified dataset.
What challenges arise in Data Integration?
Schema mismatches, redundancy, and inconsistencies across different sources.
What is Data Reduction?
Reducing the amount of data while maintaining its integrity to improve efficiency.
What are the types of Data Reduction?
Numerosity reduction and dimensionality reduction.
What is Numerosity Reduction?
Reducing the number of data objects (rows) in a dataset.
What is Dimensionality Reduction?
Reducing the number of features (columns) while preserving data structure.
What are common Numerosity Reduction methods?
Random sampling, stratified sampling, and random over/undersampling.
What is Random Sampling?
Randomly selecting a subset of data points to reduce computational cost.
What is Stratified Sampling?
Selecting a sample that maintains the original proportions of different groups in the dataset.
What is Random Over/Undersampling?
Altering the sample proportions to balance class distributions for classification tasks.
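A sketch of the three numerosity reduction methods above, assuming pandas and a hypothetical imbalanced dataset with a "label" column:

```python
import pandas as pd

# Hypothetical imbalanced dataset: 90 negatives, 10 positives.
df = pd.DataFrame({"x": range(100), "label": [0] * 90 + [1] * 10})

# Random sampling: keep 20% of rows.
random_sample = df.sample(frac=0.2, random_state=0)

# Stratified sampling: sample 20% within each class so proportions are preserved.
stratified = df.groupby("label", group_keys=False).sample(frac=0.2, random_state=0)

# Random oversampling: draw the minority class with replacement until classes balance.
minority = df[df["label"] == 1]
majority = df[df["label"] == 0]
oversampled = pd.concat([majority, minority.sample(n=len(majority), replace=True, random_state=0)])

print(stratified["label"].value_counts())
print(oversampled["label"].value_counts())
```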
What are common Dimensionality Reduction methods?
Linear Regression, Decision Trees, Random Forest, PCA, Functional Data Analysis (FDA).
How does Linear Regression help in Dimensionality Reduction?
By identifying independent variables with weak predictive power and eliminating them.
What is PCA (Principal Component Analysis)?
A technique that transforms data into new components, capturing the most variance while reducing dimensionality.
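A minimal PCA sketch with scikit-learn, assuming a hypothetical random 5-feature matrix reduced to 2 components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 5-feature dataset reduced to 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # share of variance captured by each component
```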
What is the difference between Supervised and Unsupervised Dimensionality Reduction?
Supervised methods use the target variable to keep features that improve predictions, while unsupervised methods reduce the feature space based only on the structure of the data itself.
What is Data Transformation?
Modifying the dataset to ensure it is suitable for analysis and improves model performance.
What are the key types of Data Transformation?
Normalization and Standardization.
What is Normalization?
Scaling data so that all values fall within a specific range, usually [0,1].
What is Standardization?
Transforming data to have a mean of 0 and standard deviation of 1.
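A short sketch contrasting the two transformations with scikit-learn scalers on a hypothetical two-column matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features with very different scales per column.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 1000.0]])

normalized = MinMaxScaler().fit_transform(X)       # each column rescaled to [0, 1]
standardized = StandardScaler().fit_transform(X)   # each column to mean 0, std 1

print(normalized)
print(standardized.mean(axis=0), standardized.std(axis=0))
```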
When should Normalization be used?
When distance-based algorithms like K-Means and KNN are used.
When should Standardization be used?
When maintaining data variance is important for analysis.
What is Clustering in Data Preprocessing?
A method to identify and group similar data points to detect outliers or patterns.
What are the types of Clustering?
Partitioning methods (K-Means), Hierarchical methods, and Density-based methods (DBSCAN).
What is K-Means Clustering?
An iterative clustering method that partitions data into K groups based on distance to centroids.
What is the Elbow Method?
A technique for choosing the number of clusters in K-Means: plot inertia against K and pick the point where adding more clusters yields diminishing reductions (the "elbow").
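A sketch of K-Means and the elbow method with scikit-learn, using hypothetical synthetic blob data; the loop prints inertia for each candidate K:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-D data with 3 true clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit K-Means for several K and record inertia (sum of squared distances
# to the nearest centroid); look for the "elbow" where it flattens out.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```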
What is DBSCAN Clustering?
A density-based clustering method that identifies clusters based on regions of high point density.
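A minimal DBSCAN sketch with scikit-learn on hypothetical non-spherical data; eps and min_samples are illustrative values:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Hypothetical non-spherical data with some noise.
X, _ = make_moons(n_samples=200, noise=0.08, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Points labelled -1 are treated as noise/outliers rather than forced into a cluster.
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))
```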
What is the Silhouette Score?
A measure of how well a data point fits within its assigned cluster, ranging from -1 to 1.
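A sketch of comparing cluster counts by silhouette score, assuming scikit-learn and hypothetical blob data (values closer to 1 indicate better-separated clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data with 4 true clusters.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```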
What is Inertia in Clustering?
The sum of squared distances between data points and their respective cluster centroids.
What is Hierarchical Clustering?
A method that builds a hierarchy of clusters, merging them step by step.
What is the difference between Agglomerative and Divisive Clustering?
Agglomerative starts with individual points and merges clusters, while Divisive starts with one cluster and splits it.
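A minimal agglomerative (bottom-up) example with scikit-learn on hypothetical blob data; the cluster count and linkage are illustrative choices:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical data grouped bottom-up into 3 clusters using Ward linkage.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print(labels[:20])
```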
What is an Outlier?
A data point significantly different from the rest of the dataset.
What are common Outlier Detection methods?
Statistical methods, clustering-based methods, and density-based methods.
What is an example of Outlier Handling using Clustering?
Using DBSCAN to separate noise points from valid clusters.
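A sketch of the statistical (IQR) approach on a hypothetical series, as a complement to the DBSCAN example above: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.

```python
import pandas as pd

# Hypothetical series with one obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 95, 12, 13])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(outliers)   # the value 95 is flagged
```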
What is Feature Engineering?
Creating new features from raw data to improve predictive performance.
What is Feature Selection?
Choosing the most relevant features to improve model accuracy and reduce complexity.
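A small feature selection sketch, assuming scikit-learn's SelectKBest with an ANOVA F-score on the built-in iris dataset (k=2 is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features most associated with the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)   # (150, 4) -> (150, 2)
print(selector.get_support())            # boolean mask of kept features
```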
Why is Data Preprocessing crucial for machine learning?
Ensures high-quality input data, improving model accuracy and efficiency.
What happens if data preprocessing is not done correctly?
Models may learn misleading patterns, perform poorly, or be biased.