Data Preprocessing Flashcards

1
Q

What is Data Preprocessing?

A

A step in data science that transforms raw data into a format that computers and machine learning models can understand and analyze.

2
Q

Why is Data Preprocessing important?

A

Real-world data is often dirty, incomplete, noisy, or inconsistent, and preprocessing helps clean and prepare it for analysis.

3
Q

What are the major tasks involved in Data Preprocessing?

A

Data cleaning, data integration, data reduction, and data transformation.

4
Q

What is Data Cleaning?

A

The process of handling missing values, smoothing noisy data, identifying/removing outliers, and resolving inconsistencies.

5
Q

What are the common issues in real-world data?

A

Missing values, noise, errors, duplicate records, and inconsistent data formats.

6
Q

What are some methods to handle missing data?

A

Removal of incomplete records, interpolation, or imputation with the mean or most common value.
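A minimal sketch of two of these options, using a hypothetical column of ages with missing entries marked as None:

```python
import statistics

# Hypothetical feature column with missing entries (None).
ages = [25, None, 31, 40, None, 28]

# Option 1: removal — drop records with missing values.
dropped = [a for a in ages if a is not None]

# Option 2: mean imputation — replace each None with the column mean.
mean_age = statistics.mean(dropped)
imputed = [a if a is not None else mean_age for a in ages]
```

Removal is safest when few records are affected; imputation keeps the sample size but can bias the distribution.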

7
Q

What is noisy data?

A

Random errors or variance in a dataset that can affect model performance.

8
Q

What are common techniques for handling noisy data?

A

Binning, clustering, and combined computer-human inspection.
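A small sketch of binning: sort the values, split them into equal-size bins, and smooth each value to its bin mean (the data here is made up for illustration):

```python
# Smoothing noisy data by bin means: sort, split into equal-size bins,
# then replace every value with the mean of its bin.
values = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(values), bin_size):
    bin_vals = values[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([mean] * len(bin_vals))
```

Smoothing by bin boundaries (snapping each value to the nearer bin edge) is a common alternative.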

9
Q

What is Data Integration?

A

The process of combining data from multiple sources to create a unified dataset.

10
Q

What challenges arise in Data Integration?

A

Schema mismatches, redundancy, and inconsistencies across different sources.

11
Q

What is Data Reduction?

A

Reducing the amount of data while maintaining its integrity to improve efficiency.

12
Q

What are the types of Data Reduction?

A

Numerosity reduction and dimensionality reduction.

13
Q

What is Numerosity Reduction?

A

Reducing the number of data objects (rows) in a dataset.

14
Q

What is Dimensionality Reduction?

A

Reducing the number of features (columns) while preserving data structure.

15
Q

What are common Numerosity Reduction methods?

A

Random sampling, stratified sampling, and random over/undersampling.

16
Q

What is Random Sampling?

A

Randomly selecting a subset of data points to reduce computational cost.

17
Q

What is Stratified Sampling?

A

Selecting a sample that maintains the original proportions of different groups in the dataset.
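A sketch of stratified sampling over a toy labeled dataset (80% class 'a', 20% class 'b'), drawing the same fraction from each class so the sample keeps the original proportions:

```python
import random

random.seed(0)

# Toy dataset: 80% class 'a', 20% class 'b' (hypothetical labels).
data = [('a', i) for i in range(80)] + [('b', i) for i in range(20)]

def stratified_sample(rows, frac):
    """Draw the same fraction from each class, preserving proportions."""
    by_class = {}
    for label, x in rows:
        by_class.setdefault(label, []).append((label, x))
    sample = []
    for group in by_class.values():
        k = round(len(group) * frac)
        sample.extend(random.sample(group, k))
    return sample

sample = stratified_sample(data, 0.1)  # 8 'a' rows and 2 'b' rows
```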

18
Q

What is Random Over/Undersampling?

A

Altering the sample proportions to balance class distributions for classification tasks.
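A sketch of random oversampling on a made-up imbalanced label list, duplicating minority-class samples until the classes balance:

```python
import random

random.seed(42)

# Imbalanced labels: 9 negatives, 3 positives (hypothetical).
majority = [0] * 9
minority = [1] * 3

# Random oversampling: draw minority samples with replacement
# until the minority class matches the majority class in size.
extra = random.choices(minority, k=len(majority) - len(minority))
balanced = majority + minority + extra
```

Random undersampling is the mirror image: discard majority samples (e.g., `random.sample(majority, len(minority))`) instead of duplicating minority ones.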

19
Q

What are common Dimensionality Reduction methods?

A

Linear Regression, Decision Trees, Random Forest, PCA, Functional Data Analysis (FDA).

20
Q

How does Linear Regression help in Dimensionality Reduction?

A

By examining coefficients (or their p-values) to identify independent variables with weak predictive power, which can then be eliminated.

21
Q

What is PCA (Principal Component Analysis)?

A

A technique that transforms data into new components, capturing the most variance while reducing dimensionality.
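A NumPy sketch of PCA on synthetic 2-D data where the second feature nearly duplicates the first, so one component captures almost all the variance (the data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D data where feature 2 is roughly 2x feature 1 plus small noise,
# so nearly all variance lies along a single direction.
x = rng.normal(size=(100, 1))
data = np.hstack([x, 2 * x + rng.normal(scale=0.1, size=(100, 1))])

# PCA sketch: center, eigendecompose the covariance, project onto
# the top component (the direction of maximum variance).
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
top = eigvecs[:, -1]                     # top principal component
projected = centered @ top               # 2-D reduced to 1-D

explained = eigvals[-1] / eigvals.sum()  # fraction of variance captured
```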

22
Q

What is the difference between Supervised and Unsupervised Dimensionality Reduction?

A

Supervised methods use the target variable to keep the features most useful for prediction, while unsupervised methods (e.g., PCA) reduce dimensionality from the feature structure alone, without a target.

23
Q

What is Data Transformation?

A

Modifying the dataset to ensure it is suitable for analysis and improves model performance.

24
Q

What are the key types of Data Transformation?

A

Normalization and Standardization.

25
Q

What is Normalization?

A

Scaling data so that all values fall within a specific range, usually [0,1].
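A minimal min-max normalization sketch on made-up values:

```python
# Min-max normalization: rescale values linearly into [0, 1].
values = [10.0, 20.0, 30.0, 50.0]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
```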

26
Q

What is Standardization?

A

Transforming data to have a mean of 0 and standard deviation of 1.
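A minimal z-score standardization sketch on made-up values (using the population standard deviation):

```python
import statistics

# Standardization (z-score): shift to mean 0, scale to std 1.
values = [2.0, 4.0, 6.0, 8.0]
mean = statistics.mean(values)    # 5.0
std = statistics.pstdev(values)   # population standard deviation
standardized = [(v - mean) / std for v in values]
```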

27
Q

When should Normalization be used?

A

When distance-based algorithms like K-Means and KNN are used.

28
Q

When should Standardization be used?

A

When the data is approximately Gaussian or the algorithm assumes zero-centered features (e.g., PCA, linear models); it is also less distorted by outliers than min-max normalization.

29
Q

What is Clustering in Data Preprocessing?

A

A method to identify and group similar data points to detect outliers or patterns.

30
Q

What are the types of Clustering?

A

Partitioning methods (K-Means), Hierarchical methods, and Density-based methods (DBSCAN).

31
Q

What is K-Means Clustering?

A

An iterative clustering method that partitions data into K groups based on distance to centroids.
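A minimal NumPy K-Means sketch on two well-separated toy blobs. For simplicity the centroids are initialized with evenly spaced points; random initialization (often repeated) is the more common choice:

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Minimal K-Means: assign to nearest centroid, recompute means."""
    # Deterministic init for this sketch: evenly spaced points.
    centroids = points[np.linspace(0, len(points) - 1, k).astype(int)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(k)])
    return labels, centroids

# Two well-separated toy blobs (hypothetical data).
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (10, 2)),
                 rng.normal(5, 0.1, (10, 2))])
labels, centroids = kmeans(pts, k=2)
```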

32
Q

What is the Elbow Method?

A

A technique to determine the optimal number of clusters in K-Means by plotting inertia against K and choosing the point where the decrease levels off (the "elbow").

33
Q

What is DBSCAN Clustering?

A

A density-based clustering method that identifies clusters based on regions of high point density.

34
Q

What is the Silhouette Score?

A

A measure of how well a data point fits within its assigned cluster, ranging from -1 to 1.
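A sketch of the silhouette formula for a single point, s = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster (the distances below are made up):

```python
# Distances from one point to the rest of its own cluster,
# and to the points of the nearest other cluster (hypothetical).
own_cluster = [1.0, 2.0, 3.0]
other_cluster = [8.0, 9.0, 10.0]

a = sum(own_cluster) / len(own_cluster)      # mean intra-cluster distance
b = sum(other_cluster) / len(other_cluster)  # mean nearest-cluster distance
s = (b - a) / max(a, b)                      # close to 1 => good fit
```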

35
Q

What is Inertia in Clustering?

A

The sum of squared distances between data points and their respective cluster centroids.
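A tiny sketch computing inertia for a made-up clustering of four points into two clusters:

```python
import numpy as np

# Inertia: sum of squared distances from points to their assigned centroids.
points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5, 0.0], [5.5, 0.0]])

inertia = sum(
    np.sum((points[labels == j] - centroids[j]) ** 2)
    for j in range(len(centroids))
)
```

Each point lies 0.5 from its centroid, so the four squared distances of 0.25 sum to 1.0.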

36
Q

What is Hierarchical Clustering?

A

A method that builds a hierarchy of clusters, merging them step by step.

37
Q

What is the difference between Agglomerative and Divisive Clustering?

A

Agglomerative starts with individual points and merges clusters, while Divisive starts with one cluster and splits it.

38
Q

What is an Outlier?

A

A data point significantly different from the rest of the dataset.

39
Q

What are common Outlier Detection methods?

A

Statistical methods, clustering-based methods, and density-based methods.
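A sketch of the statistical approach: flag values far from the mean in units of standard deviation. A cutoff of 3 standard deviations is the common convention; 2 is used here only so the toy dataset can stay small:

```python
import statistics

# Hypothetical measurements with one obvious outlier (50).
values = [10, 11, 9, 10, 12, 10, 11, 50]
mean = statistics.mean(values)
std = statistics.pstdev(values)

# Flag values more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) > 2 * std]
```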

40
Q

What is an example of Outlier Handling using Clustering?

A

Using DBSCAN to separate noise points from valid clusters.

41
Q

What is Feature Engineering?

A

Creating new features from raw data to improve predictive performance.

42
Q

What is Feature Selection?

A

Choosing the most relevant features to improve model accuracy and reduce complexity.

43
Q

Why is Data Preprocessing crucial for machine learning?

A

Ensures high-quality input data, improving model accuracy and efficiency.

44
Q

What happens if data preprocessing is not done correctly?

A

Models may learn misleading patterns, perform poorly, or be biased.