Chapter 6 - Data Preprocessing for Machine Learning Flashcards
What is data preprocessing and why is it important?
Data preprocessing is the essential first step in creating a machine learning model. It involves transforming raw data into a clean and structured format, ready for analysis. Without it, machine learning models may suffer from poor performance due to noise, missing values, or unstandardized scales. The goal is to prepare the dataset so the machine learning algorithm can train on it effectively.
What are some common issues with real-world data that require preprocessing?
Real-world data is often incomplete or inconsistent and may contain outliers, missing values, or noisy entries.
What are the typical steps involved in data preprocessing?
Data preprocessing typically involves data cleaning, data transformation, data reduction, and data splitting.
What does data cleaning involve?
Data cleaning includes identifying and handling missing values (e.g., replacing with mean, median, or mode), addressing inconsistencies, removing duplicates, and detecting and handling outliers.
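A minimal sketch of these cleaning steps in pandas (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values and a duplicate row.
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 30, 45],
    "income": [50000, np.nan, 62000, np.nan, 80000],
})

# Impute missing values: mean for age, median for income.
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

# Remove duplicate rows (rows 1 and 3 become identical after imputation).
df = df.drop_duplicates()
```

The choice of mean versus median is a judgment call: the median is more robust when a column contains outliers.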
What is feature scaling?
Feature scaling ensures that numerical data is on a similar scale. Common methods include normalization (scaling data between 0 and 1) and standardization (scaling data to have a mean of 0 and a standard deviation of 1).
Why is feature scaling important?
Feature scaling is important because many algorithms converge faster when features are on similar scales, and algorithms that rely on distances between points (e.g., k-nearest neighbors, k-means) are sensitive to feature scales.
Explain the difference between normalization and standardization.
Normalization scales data between 0 and 1, while standardization scales data to have a mean of 0 and a standard deviation of 1.
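Both transformations can be written in a few lines of NumPy (the sample values are arbitrary):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max scaling): maps values into [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1.
x_std = (x - x.mean()) / x.std()
```

In practice, libraries such as scikit-learn wrap these formulas in `MinMaxScaler` and `StandardScaler`, which also remember the fitted statistics so the same transform can be applied to test data.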
How is categorical data converted to numerical form?
Categorical data can be converted to numerical form using one-hot encoding or label encoding.
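A sketch of both encodings using pandas (the `color` column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: one integer code per category.
df["color_label"] = df["color"].astype("category").cat.codes
```

One-hot encoding avoids implying an order between categories, while label encoding is more compact but suggests an ordinal relationship that may mislead some models.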
What is dimensionality reduction and why is it used?
Dimensionality reduction techniques, like PCA, are used to reduce the number of input variables without losing much information.
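The core of PCA can be sketched with NumPy alone: center the data, take an SVD, and keep the top components (the synthetic data here has 5 features but only 2 underlying dimensions; libraries like scikit-learn provide this as a ready-made `PCA` class):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))            # 2 true underlying factors
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 5 correlated features

# PCA via SVD: center, then project onto the top-k right singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T

# Fraction of total variance retained by the first k components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
```

Because the five features were built from two factors, two components capture essentially all of the variance.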
Why is it important to split data into training and testing sets?
Splitting the dataset into training and testing sets is important to evaluate the model’s performance. A common split is 80% for training and 20% for testing.
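An 80/20 split can be sketched with a shuffled index (scikit-learn's `train_test_split` does the same thing with extra conveniences):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(100).reshape(50, 2)   # 50 samples, 2 features (dummy data)
y = np.arange(50)

# Shuffle indices, then take 80% for training and 20% for testing.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Shuffling before splitting matters: if the data is sorted (e.g., by class), a naive head/tail split would give the model an unrepresentative test set.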
What is data integration?
Data integration is the merging of data from multiple data stores.
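A minimal integration sketch with pandas, joining two hypothetical sources on a shared key:

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cal"]})
orders = pd.DataFrame({"id": [1, 2, 2], "amount": [100, 50, 75]})

# Merge the two sources on the shared key column.
merged = customers.merge(orders, on="id", how="inner")
```

An inner join keeps only customers with at least one order; a left join (`how="left"`) would keep all customers and fill missing order fields with NaN.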
What is data discretization?
Data discretization converts continuous numerical attributes into a small number of intervals (bins). It is part of data reduction and is particularly important for numerical data.
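As a sketch, pandas can discretize a numeric column into labeled bins (the age ranges and labels here are arbitrary choices):

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 67, 80])

# Discretize continuous ages into three labeled intervals.
bins = [0, 18, 65, 120]
labels = ["child", "adult", "senior"]
age_group = pd.cut(ages, bins=bins, labels=labels)
```

After binning, the single numeric column is replaced by a small categorical one, which can then be one-hot encoded if the model requires it.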
What are the benefits of normalization?
Normalization can improve convergence speed, enhance model performance, and balance feature contributions.
How are missing values handled in regression models?
Missing values in regression models can be handled by imputation (replacing with mean, median, or mode) or by dropping rows or columns that contain missing data.
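Both options side by side in pandas (a small made-up feature matrix with a target column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0],
    "x2": [10.0, np.nan, 30.0, 40.0],
    "y":  [1.5, 2.5, 3.5, 4.5],
})

# Option 1: impute each feature with its column mean.
imputed = df.fillna(df.mean())

# Option 2: drop any row containing a missing value.
dropped = df.dropna()
```

Imputation preserves sample size, while dropping rows is simpler but can discard a large fraction of the data when missingness is widespread.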
Why is feature scaling important for regression models?
Feature scaling is critical for linear regression models trained with gradient-based optimization: unscaled features can slow or destabilize convergence, and when regularization is used, features on large scales are penalized inconsistently, biasing the coefficients.
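A small illustrative sketch: gradient descent on a standardized feature converges with an ordinary learning rate, whereas the same learning rate diverges on the raw large-scale feature (the data-generating numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1000, size=200)            # large-scale raw feature
y = 3.0 * x + 5.0 + rng.normal(size=200)      # linear target plus noise

# Standardize the feature before gradient descent.
x_s = (x - x.mean()) / x.std()

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * x_s + b
    grad_w = 2 * ((pred - y) * x_s).mean()    # d(MSE)/dw
    grad_b = 2 * (pred - y).mean()            # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

mse = np.mean((w * x_s + b - y) ** 2)
# Running the same loop on the raw x with lr=0.1 would diverge,
# because the gradient is amplified by mean(x**2) ~ 3e5.
```

After standardization the fit reaches the noise floor; the residual MSE is close to the variance of the added noise.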