Data Cleaning and Processing Flashcards

1
Q

Data types

A
  • numeric, categorical
  • static, dynamic (temporal)
2
Q

Other kinds of data

A
  • distributed data
  • text, Web, metadata
  • images, audio/video
3
Q

missing attribute values, lacking certain attributes of interest, or containing only aggregate data

e.g., occupation=“”

A

incomplete data

4
Q

containing errors or outliers

e.g., Salary=“-10”

A

noisy data

5
Q

containing discrepancies in codes or names

Age=“42” Birthday=“03/07/1997”
Was rating “1,2,3”, now rating “A, B, C”

A

inconsistent data

6
Q

Why is data processing important?

A
  • No quality data, no quality mining results
  • Quality decisions must be based on quality data
  • Duplicate or missing data may cause incorrect or even misleading statistics
7
Q

Multi-Dimensional Measure of Data Quality

A
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Value added
  • Interpretability
  • Accessibility
8
Q

Major Tasks in Data Processing

A
  • Data Cleaning
  • Data Integration
  • Data Transformation
  • Data Reduction
  • Data discretization
9
Q

Fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies

number one problem in data warehousing

A

Data Cleaning

10
Q

Data Cleaning Tasks:

A
  • Fill in Missing Values
  • Identify outliers and smooth out noisy data
  • Correct Inconsistent Data
  • Resolve Redundancy caused by data integration
11
Q

Data is not always available (many tuples have no recorded values for several attributes, such as customer income in sales data)

A

Missing Data

12
Q

Missing data causes:

A
  • Equipment malfunction
  • Inconsistent with other recorded data and thus deleted
  • Data not entered due to misunderstanding
  • Certain data may not be considered important at the time of entry
  • History or changes of the data were not registered
13
Q

Handling missing data

A
  • Ignore the tuple
  • Fill in missing values manually: tedious and infeasible
  • Fill it automatically
14
Q

Fill missing data with

A
  • A global constant, e.g., "unknown"
  • The attribute mean
  • The most probable value: inference-based, such as a Bayesian formula, decision tree, or EM algorithm
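Example (a minimal sketch, not part of the original deck): the first two fill strategies in plain Python. It assumes missing entries are represented as None; the function names and the income values are hypothetical.

```python
from statistics import mean


def fill_with_constant(values, constant="unknown"):
    """Replace every missing entry (None) with a global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace every missing entry (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


# Hypothetical customer incomes with two missing entries.
incomes = [30000, None, 50000, 40000, None]
print(fill_with_constant(incomes))
print(fill_with_mean(incomes))
```

The most-probable-value strategy needs a fitted model (Bayesian, decision tree, EM) and is not shown here.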
15
Q

Random error or variance in a measured variable

A

Noisy Data

16
Q

Incorrect attribute values may be due to:

A
  • Faulty data collection instruments
  • Data entry problems
  • Data transmission problems
17
Q

Handling Noisy Data

A
  • Binning Method
  • Clustering
  • Combined computer and human inspection
18
Q

When noise reduction and trend analysis are needed

A

Smoothing by bin means
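
Example (a minimal sketch, not part of the original deck): smoothing by bin means, assuming the values are sorted and partitioned into equal-frequency bins. The function name and the sample values are illustrative.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Three bins of three values each; the bin means are 9, 22 and 29.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
```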

19
Q

When keeping real-world constraints and preserving limits is important

A

Smoothing by bin boundaries
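
Example (a minimal sketch, not part of the original deck): smoothing by bin boundaries, one technique that preserves observed limits because every value is replaced by the closer of its bin's minimum and maximum. Equal-frequency bins are assumed, and sending ties to the lower boundary is an arbitrary choice made here.

```python
def smooth_by_bin_boundaries(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the nearer of its bin's two boundaries."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        low, high = bin_values[0], bin_values[-1]
        # Assumption: a value equally close to both boundaries goes to the lower one.
        smoothed.extend(low if v - low <= high - v else high for v in bin_values)
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_boundaries(prices, 3))
```

No output value falls outside the range actually observed in its bin, which is the point of the method.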

20
Q

Detect and remove outliers (data points inconsistent with the majority of the data)

A

Clustering

21
Q

Integration of multiple databases or files

A

Data Integration

22
Q

Integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources

A

Schema Integration

23
Q

Removing noise from data

A

Smoothing

24
Q

scaled to fall within a small, specified range

A

Normalization

25
Q

summarization

A

Aggregation

26
Q

concept hierarchy climbing

A

Generalization

27
Q

Normalization

A
  • Min-max normalization
  • Z-score normalization
  • Normalization by decimal scaling
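Example (a minimal sketch, not part of the original deck): the three normalization methods in plain Python. It assumes the values are not all identical (otherwise min-max and z-score divide by zero) and uses the population standard deviation for the z-score. The function names are illustrative.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer that brings
    every absolute value below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max([10, 20, 30]))
print(z_score([10, 20, 30]))
print(decimal_scaling([-986, 917]))
```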
28
Q

Obtains a representation that is reduced in volume but produces the same or similar analytical results. Used when the data is too big to work with.

A

Data Reduction

29
Q

Data Reduction Strategies

A
  • Dimension reduction: remove unimportant attributes
  • Aggregation and clustering
  • Sampling
30
Q

Feature selection (i.e., attribute subset selection): Select a minimum set of attributes (features) that is sufficient for the data mining task.

A

Dimension Reduction

31
Q

A popular data reduction technique: divide the data into buckets and store the average (or sum) for each bucket

A

Histograms

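Example (a minimal sketch, not part of the original deck): reducing a list of values to per-bucket summaries. Equal-width, half-open buckets [start, start + width) are an assumption; the deck does not say how the buckets are formed.

```python
def histogram_summary(values, bucket_width):
    """Group values into equal-width buckets and keep only the count,
    sum, and mean of each bucket instead of the raw values."""
    buckets = {}
    for v in values:
        start = (v // bucket_width) * bucket_width  # lower edge of the bucket
        count, total = buckets.get(start, (0, 0))
        buckets[start] = (count + 1, total + v)
    return {
        (start, start + bucket_width): {"count": c, "sum": t, "mean": t / c}
        for start, (c, t) in sorted(buckets.items())
    }


print(histogram_summary([1, 5, 8, 12, 15, 27], 10))
```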
32
Q

Choose a representative subset of data

A

Sampling

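Example (a minimal sketch, not part of the original deck): simple random sampling without replacement, which is only one of several sampling schemes. The function name and the optional seed parameter are illustrative.

```python
import random


def simple_random_sample(records, n, seed=None):
    """Draw n distinct records uniformly at random (without replacement)."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(records, n)


population = list(range(100))
print(simple_random_sample(population, 5, seed=0))
```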
33
Q

Data discretization: three types of attributes

A
  • Nominal
  • Ordinal
  • Continuous
34
Q

Data discretization, three types of attributes: values from an unordered set

A

Nominal

35
Q

Data discretization, three types of attributes: values from an ordered set

A

Ordinal

36
Q

Data discretization, three types of attributes: real numbers

A

Continuous

37
Q

Data discretization techniques

A
  • Binning method: equal-width, equal-frequency
  • Entropy-based (1)
  • Entropy-based (2)
38
Q

Binning with a fixed bin width, e.g., 10

A

Equi-width binning

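Example (a minimal sketch, not part of the original deck): equal-width binning with a width of 10. Bins are taken as half-open intervals [low, low + width) aligned to multiples of the width, which is an assumption; the sample values are illustrative.

```python
def equal_width_bins(values, width):
    """Assign each value to the interval [low, low + width) that contains it."""
    bins = {}
    for v in sorted(values):
        low = (v // width) * width
        bins.setdefault((low, low + width), []).append(v)
    return bins


print(equal_width_bins([0, 4, 12, 16, 16, 18, 24, 26, 28], 10))
```

Every bin spans the same range, but the number of values per bin can differ.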
39
Q

Binning with a fixed bin density (number of values per bin), e.g., 3

A

Equi-frequency binning

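Example (a minimal sketch, not part of the original deck): equal-frequency binning with three values per bin. If the number of values is not a multiple of the bin size, the last bin is simply shorter; that is a choice made here, not something the deck specifies.

```python
def equal_frequency_bins(values, per_bin):
    """Sort the values and cut them into consecutive bins of per_bin values each."""
    ordered = sorted(values)
    return [ordered[i:i + per_bin] for i in range(0, len(ordered), per_bin)]


print(equal_frequency_bins([0, 4, 12, 16, 16, 18, 24, 26, 28], 3))
```

Every bin holds the same number of values, but the ranges the bins cover can differ.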
40
Q

Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

A

Discretization

41
Q

Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)

A

Concept Hierarchies
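
Example (a minimal sketch, not part of the original deck): replacing numeric ages with the higher-level concepts named on the card. The cut-off ages 29 and 59 are hypothetical; the deck gives no thresholds.

```python
def age_to_concept(age, young_max=29, middle_max=59):
    """Map a numeric age to a higher-level concept.
    The default cut-offs are illustrative assumptions."""
    if age <= young_max:
        return "young"
    if age <= middle_max:
        return "middle-aged"
    return "senior"


ages = [23, 42, 67]
print([age_to_concept(a) for a in ages])
```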