Chapter 6 - Data Preprocessing for Machine Learning Flashcards
What is data preprocessing and why is it important?
Data preprocessing is the essential first step in creating a machine learning model. It involves transforming raw data into a clean and structured format, ready for analysis. Without it, machine learning models may suffer from poor performance due to noise, missing values, or unstandardized scales. The goal is to prepare the dataset so the machine learning algorithm can train on it effectively.
What are some common issues with real-world data that require preprocessing?
Real-world data is often incomplete or inconsistent and may contain outliers, missing values, or noisy entries.
What are the typical steps involved in data preprocessing?
Data preprocessing typically involves data cleaning, data transformation, data reduction, and data splitting.
What does data cleaning involve?
Data cleaning includes identifying and handling missing values (e.g., replacing with mean, median, or mode), addressing inconsistencies, removing duplicates, and detecting and handling outliers.
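A minimal sketch of these cleaning steps in pandas (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values and a duplicate row.
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 30, 45],
    "income": [50000, np.nan, 62000, np.nan, 80000],
})

# Impute missing values: mean for age, median for income.
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

# Remove duplicate rows (rows 1 and 3 become identical after imputation).
df = df.drop_duplicates()
```

The choice of mean versus median is a judgment call: the median is more robust when a column contains outliers.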
What is feature scaling?
Feature scaling ensures that numerical data is on a similar scale. Common methods include normalization (scaling data between 0 and 1) and standardization (scaling data to have a mean of 0 and a standard deviation of 1).
Why is feature scaling important?
Feature scaling is important because many algorithms converge faster when features are on similar scales, and algorithms that rely on distances between points (e.g., k-nearest neighbors, k-means) are sensitive to feature scales.
Explain the difference between normalization and standardization.
Normalization scales data between 0 and 1, while standardization scales data to have a mean of 0 and a standard deviation of 1.
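Both transformations can be written in a few lines of NumPy (the sample values are arbitrary):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max scaling): maps values into [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1.
x_std = (x - x.mean()) / x.std()
```

In practice, libraries such as scikit-learn wrap these formulas in `MinMaxScaler` and `StandardScaler`, which also remember the fitted statistics so the same transform can be applied to test data.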
How is categorical data converted to numerical form?
Categorical data can be converted to numerical form using one-hot encoding or label encoding.
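A sketch of both encodings using pandas (the `color` column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: one integer code per category.
df["color_label"] = df["color"].astype("category").cat.codes
```

One-hot encoding avoids implying an order between categories, while label encoding is more compact but suggests an ordinal relationship that may mislead some models.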
What is dimensionality reduction and why is it used?
Dimensionality reduction techniques, like PCA, are used to reduce the number of input variables without losing much information.
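The core of PCA can be sketched with NumPy alone: center the data, take an SVD, and keep the top components (the synthetic data here has 5 features but only 2 underlying dimensions; libraries like scikit-learn provide this as a ready-made `PCA` class):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))            # 2 true underlying factors
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 5 correlated features

# PCA via SVD: center, then project onto the top-k right singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T

# Fraction of total variance retained by the first k components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
```

Because the five features were built from two factors, two components capture essentially all of the variance.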
Why is it important to split data into training and testing sets?
Splitting the dataset into training and testing sets is important to evaluate the model’s performance. A common split is 80% for training and 20% for testing.
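An 80/20 split can be sketched with a shuffled index (scikit-learn's `train_test_split` does the same thing with extra conveniences):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(100).reshape(50, 2)   # 50 samples, 2 features (dummy data)
y = np.arange(50)

# Shuffle indices, then take 80% for training and 20% for testing.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Shuffling before splitting matters: if the data is sorted (e.g., by class), a naive head/tail split would give the model an unrepresentative test set.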
What is data integration?
Data integration is the merging of data from multiple data stores.
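A minimal integration sketch with pandas, joining two hypothetical sources on a shared key:

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cal"]})
orders = pd.DataFrame({"id": [1, 2, 2], "amount": [100, 50, 75]})

# Merge the two sources on the shared key column.
merged = customers.merge(orders, on="id", how="inner")
```

An inner join keeps only customers with at least one order; a left join (`how="left"`) would keep all customers and fill missing order fields with NaN.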
What is data discretization?
Data discretization converts continuous numerical attributes into a small number of intervals (bins). It is part of data reduction and is particularly important for numerical data.
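As a sketch, pandas can discretize a numeric column into labeled bins (the age ranges and labels here are arbitrary choices):

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 67, 80])

# Discretize continuous ages into three labeled intervals.
bins = [0, 18, 65, 120]
labels = ["child", "adult", "senior"]
age_group = pd.cut(ages, bins=bins, labels=labels)
```

After binning, the single numeric column is replaced by a small categorical one, which can then be one-hot encoded if the model requires it.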
What are the benefits of normalization?
Normalization can improve convergence speed, enhance model performance, and balance feature contributions.
How are missing values handled in regression models?
Missing values in regression models can be handled by imputation (replacing with mean, median, or mode) or by dropping rows or columns that contain missing data.
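Both options side by side in pandas (a small made-up feature matrix with a target column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0],
    "x2": [10.0, np.nan, 30.0, 40.0],
    "y":  [1.5, 2.5, 3.5, 4.5],
})

# Option 1: impute each feature with its column mean.
imputed = df.fillna(df.mean())

# Option 2: drop any row containing a missing value.
dropped = df.dropna()
```

Imputation preserves sample size, while dropping rows is simpler but can discard a large fraction of the data when missingness is widespread.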
Why is feature scaling important for regression models?
Feature scaling is critical for linear regression models trained with gradient-based optimization: unscaled features can slow or destabilize convergence, and when regularization is used, features on large scales are penalized inconsistently, biasing the coefficients.
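A small illustrative sketch: gradient descent on a standardized feature converges with an ordinary learning rate, whereas the same learning rate diverges on the raw large-scale feature (the data-generating numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1000, size=200)            # large-scale raw feature
y = 3.0 * x + 5.0 + rng.normal(size=200)      # linear target plus noise

# Standardize the feature before gradient descent.
x_s = (x - x.mean()) / x.std()

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * x_s + b
    grad_w = 2 * ((pred - y) * x_s).mean()    # d(MSE)/dw
    grad_b = 2 * (pred - y).mean()            # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

mse = np.mean((w * x_s + b - y) ** 2)
# Running the same loop on the raw x with lr=0.1 would diverge,
# because the gradient is amplified by mean(x**2) ~ 3e5.
```

After standardization the fit reaches the noise floor; the residual MSE is close to the variance of the added noise.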