Data Cleaning and Processing Flashcards by KHARYLL MAE ANDRES

Data types

numeric, categorical
static, dynamic (temporal)

Other kinds of data

distributed data
text, Web, meta data
images, audio/video

lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

e.g., occupation=“”

incomplete data

containing errors or outliers

e.g., Salary=“-10”

noisy data

containing discrepancies in codes or names

e.g., Age=“42” but Birthday=“03/07/1997”
e.g., was rating “1, 2, 3”, now rating “A, B, C”

inconsistent data

Why Is Data Processing Important?

No quality data, no quality mining results
Quality decisions must be based on quality data
Duplicate or missing data may cause incorrect or even misleading statistics

Multi-Dimensional Measure of Data Quality

Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility

Major Tasks in Data Processing

Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Discretization

Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

The number one problem in data warehousing

Data Cleaning

Data Cleaning Tasks:

Fill in Missing Values
Identify outliers and smooth out noisy data
Correct Inconsistent Data
Resolve Redundancy caused by data integration

Data is not always available (many tuples have no recorded values for several attributes, such as customer income in sales data)

Missing Data

Missing data causes:

Equipment malfunction
Inconsistent with other recorded data and thus deleted
Data not entered due to misunderstanding
Certain data may not be considered important at the time of entry
History or changes of the data not registered

Handling missing data

Ignore the tuple
Fill in missing values manually: tedious and infeasible
Fill it automatically

Fill missing data with

A global constant, e.g., "unknown"
The attribute mean
The most probable value: inference-based, such as a Bayesian formula, a decision tree, or the EM algorithm
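
The first two fill strategies can be sketched in a few lines of Python. This is only an illustration: the function names and the sample incomes are made up for the example, missing values are assumed to be stored as None, and the inference-based strategy is left out because it needs a trained model.

```python
from statistics import mean


def fill_with_constant(values, constant="unknown"):
    """Replace every missing entry (None) with a global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace every missing entry (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing observed, so there is no mean to use
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


# Made-up sample: customer income with two missing entries.
incomes = [30, None, 50, 40, None]
filled_constant = fill_with_constant(incomes)
filled_mean = fill_with_mean(incomes)
print(filled_constant)
print(filled_mean)
```

Filling with a constant keeps the gap visible to later steps, while filling with the mean keeps the attribute numeric but shrinks its variance.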

Random error or variance in a measured variable

Noisy Data

Incorrect attribute values may be due to:

Faulty data collection instruments
Data entry problems
Data transmission problems

Handling Noisy Data

Binning Method
Clustering
Combined computer and human inspection

When reducing noise and trend analysis is needed

Smoothing by bin means

When keeping real-world constraints and preserving limits is important

Smoothing by bin boundaries
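
A small Python sketch may help contrast the two binning cards above: smoothing by bin means, and the limit-preserving variant commonly called smoothing by bin boundaries. The function names and the price list are assumptions made for this illustration, and the bins are built as equal-frequency bins over the sorted values.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_bin_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # each bin is sorted, so these are its min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Made-up sample values for illustration.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, bin_size=4)
by_means = smooth_by_bin_means(bins)
by_boundaries = smooth_by_bin_boundaries(bins)
print(by_means)
print(by_boundaries)
```

Bin means pull every value toward the local average, which suits trend analysis. Bin boundaries only ever emit values that already occur in the data, so the observed limits are preserved.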

Detect and remove outliers (data points inconsistent with the majority of the data)

Clustering

Integration of multiple databases or files

Data Integration

Integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources

Schema Integration

Removing noise from data

Smoothing

Attribute values are scaled to fall within a small, specified range

Normalization

summarization

Aggregation

concept hierarchy climbing

Generalization

Normalization

Min-max normalization
Z-score normalization
Normalization by decimal scaling
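
The three normalization methods can be written out as a rough Python sketch. The function names, the default target range of 0 to 1, the use of the population standard deviation for the z-score, and the sample values are all assumptions made for this example.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min(values), max(values)] onto [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        return [new_min for _ in values]  # constant attribute: no spread to rescale
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer making every |v| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest >= 10 ** j:
        j += 1
    return [v / 10 ** j for v in values]


# Made-up sample values for illustration.
sample = [200, 300, 400, 600, 1000]
scaled = min_max(sample)
standardized = z_score(sample)
shifted = decimal_scaling([-986, 917])
print(scaled)
print(standardized)
print(shifted)
```

Min-max needs known bounds and is sensitive to outliers, z-score is the usual choice when the bounds are unknown, and decimal scaling only moves the decimal point.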

Obtains a reduced representation of the data that is smaller in volume but produces the same or similar analytical results. Used when the data is too big to work with.

Data Reduction

Data Reduction Strategies

Dimension reduction: remove unimportant attributes
Aggregation and clustering
Sampling

Feature selection (i.e., attribute subset selection): Select a minimum set of attributes (features) that is sufficient for the data mining task.

Dimension Reduction

A popular data reduction technique. Divide the data into buckets and store the average (or sum) for each bucket

Histograms
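
As a minimal sketch of the histogram idea, the Python below groups values into equal-width buckets and keeps only a count, sum, and average per bucket. The function name, the bucket width of 10, and the sample readings are assumptions made for the example.

```python
def equal_width_histogram(values, bucket_width):
    """Group values into equal-width buckets; keep count, sum and average per bucket."""
    buckets = {}
    for v in values:
        start = (v // bucket_width) * bucket_width  # lower edge of v's bucket
        count, total = buckets.get(start, (0, 0))
        buckets[start] = (count + 1, total + v)
    return {
        start: {"count": c, "sum": t, "avg": t / c}
        for start, (c, t) in sorted(buckets.items())
    }


# Made-up sample values for illustration.
readings = [1, 5, 8, 12, 15, 18, 21, 25]
hist = equal_width_histogram(readings, bucket_width=10)
for start, summary in hist.items():
    print(start, summary)
```

Eight raw values are reduced to three bucket summaries, which is the sense in which a histogram is a data reduction technique.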

Choose a representative subset of data

Sampling
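
One simple way to choose a representative subset is a simple random sample without replacement, sketched below in Python. The function name, the seed parameter, and the population of 100 record IDs are assumptions for the example. Other schemes, such as stratified sampling, are not shown.

```python
import random


def simple_random_sample(records, n, seed=None):
    """Draw n distinct records uniformly at random, without replacement."""
    rng = random.Random(seed)  # a private generator keeps the draw reproducible
    return rng.sample(records, n)


# Made-up population of 100 record IDs.
population = list(range(100))
sample = simple_random_sample(population, 10, seed=0)
print(sample)
```

Fixing the seed makes the same subset come back on every run, which is useful when results must be repeatable.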

Data discretization: three types of attributes

Nominal
Ordinal
Continuous

Data discretization, three types of attributes: values from an unordered set

Nominal

Data discretization, three types of attributes: values from an ordered set

Ordinal

Data discretization, three types of attributes: real numbers

Continuous

Data discretization techniques

Binning method: equal-width, equal-frequency
Entropy-based methods

Binning with a fixed bin width, e.g., 10

Equi-width binning

Binning with a fixed bin density, e.g., 3

Equi-frequency binning
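
The two binning cards above can be compared with a short Python sketch. It assumes that "bin density" means the number of values per bin, and the function names and sample values are made up for the illustration.

```python
def equi_width_bins(values, width):
    """Map each bin's lower edge to the values falling in [edge, edge + width)."""
    bins = {}
    for v in sorted(values):
        edge = (v // width) * width
        bins.setdefault(edge, []).append(v)
    return bins


def equi_frequency_bins(values, per_bin):
    """Split the sorted values into consecutive bins holding per_bin values each."""
    ordered = sorted(values)
    return [ordered[i:i + per_bin] for i in range(0, len(ordered), per_bin)]


# Made-up sample values for illustration.
data = [0, 4, 12, 16, 16, 18, 24, 26, 28]
by_width = equi_width_bins(data, width=10)
by_frequency = equi_frequency_bins(data, per_bin=3)
print(by_width)
print(by_frequency)
```

Equi-width bins cover equal ranges but can hold very different numbers of values. Equi-frequency bins hold the same number of values but cover ranges of different sizes.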

Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace the actual data values.

Discretization

Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)

Concept Hierarchies
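
The age example can be sketched as a small Python function that climbs one level of a concept hierarchy. The cut-off ages of 35 and 60 and the function name are hypothetical choices for this illustration. The cards do not define where one concept ends and the next begins.

```python
def age_to_concept(age, young_max=35, middle_max=60):
    """Replace a numeric age with a higher-level concept label.

    The thresholds are hypothetical; adjust them to the domain at hand.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= young_max:
        return "young"
    if age <= middle_max:
        return "middle-aged"
    return "senior"


# Made-up sample ages for illustration.
ages = [23, 42, 71]
labels = [age_to_concept(a) for a in ages]
print(labels)
```

After the replacement the attribute takes only three distinct values, which reduces the data at the cost of detail.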

Data Cleaning and Processing Flashcards

(41 cards)