Data Preprocessing Flashcards
What is Data Preprocessing?
A step in data science that transforms raw data into a format that can be understood and analyzed by computers and machine learning models.
Why is Data Preprocessing important?
Real-world data is often dirty, incomplete, noisy, or inconsistent, and preprocessing helps clean and prepare it for analysis.
What are the major tasks involved in Data Preprocessing?
Data cleaning, data integration, data reduction, and data transformation.
What is Data Cleaning?
The process of handling missing values, smoothing noisy data, identifying/removing outliers, and resolving inconsistencies.
What are the common issues in real-world data?
Missing values, noise, errors, duplicate records, and inconsistent data formats.
What are some methods to handle missing data?
Removal of affected rows or columns, interpolation, or replacement with the mean or the most common value.
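A minimal sketch of these options using pandas on a hypothetical toy DataFrame (column names are illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing ages and cities.
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan, 40],
    "city": ["Oslo", "Bergen", None, "Oslo", "Bergen"],
})

dropped = df.dropna()                                                   # removal: drop rows with any missing value
mean_filled = df.assign(age=df["age"].fillna(df["age"].mean()))         # replace with the mean
mode_filled = df.assign(city=df["city"].fillna(df["city"].mode()[0]))   # replace with the most common value
interpolated = df.assign(age=df["age"].interpolate())                   # linear interpolation between known values

print(mean_filled)
```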
What is noisy data?
Random errors or variance in a dataset that can affect model performance.
What are common techniques for handling noisy data?
Binning, clustering, and combined computer-human inspection.
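A small sketch of binning-based smoothing, assuming pandas and a hypothetical noisy series: each value is replaced by the mean of its bin.

```python
import pandas as pd

# Hypothetical noisy measurements.
values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Equal-width binning into 3 bins, then smoothing by bin means.
bins = pd.cut(values, bins=3)
smoothed = values.groupby(bins).transform("mean")

print(pd.DataFrame({"raw": values, "bin": bins.astype(str), "smoothed": smoothed}))
```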
What is Data Integration?
The process of combining data from multiple sources to create a unified dataset.
What challenges arise in Data Integration?
Schema mismatches, redundancy, and inconsistencies across different sources.
What is Data Reduction?
Reducing the amount of data while maintaining its integrity to improve efficiency.
What are the types of Data Reduction?
Numerosity reduction and dimensionality reduction.
What is Numerosity Reduction?
Reducing the number of data objects (rows) in a dataset.
What is Dimensionality Reduction?
Reducing the number of features (columns) while preserving data structure.
What are common Numerosity Reduction methods?
Random sampling, stratified sampling, and random over/undersampling.
What is Random Sampling?
Randomly selecting a subset of data points to reduce computational cost.
What is Stratified Sampling?
Selecting a sample that maintains the original proportions of different groups in the dataset.
What is Random Over/Undersampling?
Altering the sample proportions to balance class distributions for classification tasks.
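A sketch of the three numerosity reduction methods above, assuming pandas and a hypothetical imbalanced dataset with a "label" column:

```python
import pandas as pd

# Hypothetical imbalanced dataset: 90 negatives, 10 positives.
df = pd.DataFrame({"x": range(100), "label": [0] * 90 + [1] * 10})

# Random sampling: keep 20% of rows.
random_sample = df.sample(frac=0.2, random_state=0)

# Stratified sampling: sample 20% within each class so proportions are preserved.
stratified = df.groupby("label", group_keys=False).sample(frac=0.2, random_state=0)

# Random oversampling: draw the minority class with replacement until classes balance.
minority = df[df["label"] == 1]
majority = df[df["label"] == 0]
oversampled = pd.concat([majority, minority.sample(n=len(majority), replace=True, random_state=0)])

print(stratified["label"].value_counts())
print(oversampled["label"].value_counts())
```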
What are common Dimensionality Reduction methods?
Linear Regression, Decision Trees, Random Forest, PCA, Functional Data Analysis (FDA).
How does Linear Regression help in Dimensionality Reduction?
By identifying independent variables with weak predictive power and eliminating them.
What is PCA (Principal Component Analysis)?
A technique that transforms data into new components, capturing the most variance while reducing dimensionality.
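A minimal PCA sketch with scikit-learn, assuming a hypothetical random 5-feature matrix reduced to 2 components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 5-feature dataset reduced to 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # share of variance captured by each component
```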
What is the difference between Supervised and Unsupervised Dimensionality Reduction?
Supervised methods use the target variable to keep features that improve predictions, while unsupervised methods reduce the feature space based only on the structure of the data itself.
What is Data Transformation?
Modifying the dataset to ensure it is suitable for analysis and improves model performance.
What are the key types of Data Transformation?
Normalization and Standardization.
What is Normalization?
Scaling data so that all values fall within a specific range, usually [0,1].
What is Standardization?
Transforming data to have a mean of 0 and standard deviation of 1.
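A short sketch contrasting the two transformations with scikit-learn scalers on a hypothetical two-column matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features with very different scales per column.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 1000.0]])

normalized = MinMaxScaler().fit_transform(X)       # each column rescaled to [0, 1]
standardized = StandardScaler().fit_transform(X)   # each column to mean 0, std 1

print(normalized)
print(standardized.mean(axis=0), standardized.std(axis=0))
```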
When should Normalization be used?
When distance-based algorithms like K-Means and KNN are used.
When should Standardization be used?
When maintaining data variance is important for analysis.
What is Clustering in Data Preprocessing?
A method to identify and group similar data points to detect outliers or patterns.
What are the types of Clustering?
Partitioning methods (K-Means), Hierarchical methods, and Density-based methods (DBSCAN).
What is K-Means Clustering?
An iterative clustering method that partitions data into K groups based on distance to centroids.
What is the Elbow Method?
A technique for choosing the number of clusters in K-Means: plot inertia against K and pick the point where adding more clusters yields diminishing reductions (the "elbow").
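A sketch of K-Means and the elbow method with scikit-learn, using hypothetical synthetic blob data; the loop prints inertia for each candidate K:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-D data with 3 true clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit K-Means for several K and record inertia (sum of squared distances
# to the nearest centroid); look for the "elbow" where it flattens out.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```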
What is DBSCAN Clustering?
A density-based clustering method that identifies clusters based on regions of high point density.
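A minimal DBSCAN sketch with scikit-learn on hypothetical non-spherical data; eps and min_samples are illustrative values:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Hypothetical non-spherical data with some noise.
X, _ = make_moons(n_samples=200, noise=0.08, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Points labelled -1 are treated as noise/outliers rather than forced into a cluster.
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))
```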
What is the Silhouette Score?
A measure of how well a data point fits within its assigned cluster, ranging from -1 to 1.
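A sketch of comparing cluster counts by silhouette score, assuming scikit-learn and hypothetical blob data (values closer to 1 indicate better-separated clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data with 4 true clusters.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```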
What is Inertia in Clustering?
The sum of squared distances between data points and their respective cluster centroids.
What is Hierarchical Clustering?
A method that builds a hierarchy of clusters, merging them step by step.
What is the difference between Agglomerative and Divisive Clustering?
Agglomerative starts with individual points and merges clusters, while Divisive starts with one cluster and splits it.
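A minimal agglomerative (bottom-up) example with scikit-learn on hypothetical blob data; the cluster count and linkage are illustrative choices:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical data grouped bottom-up into 3 clusters using Ward linkage.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print(labels[:20])
```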
What is an Outlier?
A data point significantly different from the rest of the dataset.
What are common Outlier Detection methods?
Statistical methods, clustering-based methods, and density-based methods.
What is an example of Outlier Handling using Clustering?
Using DBSCAN to separate noise points from valid clusters.
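A sketch of the statistical (IQR) approach on a hypothetical series, as a complement to the DBSCAN example above: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.

```python
import pandas as pd

# Hypothetical series with one obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 95, 12, 13])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(outliers)   # the value 95 is flagged
```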
What is Feature Engineering?
Creating new features from raw data to improve predictive performance.
What is Feature Selection?
Choosing the most relevant features to improve model accuracy and reduce complexity.
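A small feature selection sketch, assuming scikit-learn's SelectKBest with an ANOVA F-score on the built-in iris dataset (k=2 is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features most associated with the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)   # (150, 4) -> (150, 2)
print(selector.get_support())            # boolean mask of kept features
```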
Why is Data Preprocessing crucial for machine learning?
Ensures high-quality input data, improving model accuracy and efficiency.
What happens if data preprocessing is not done correctly?
Models may learn misleading patterns, perform poorly, or be biased.