Data Cleaning and Processing Flashcards

1
Q

Data types

A
  • numeric, categorical
  • static, dynamic (temporal)
2
Q

Other kinds of data

A
  • distributed data
  • text, Web, meta data
  • images, audio/video
3
Q

missing attribute values, lack of certain attributes of interest, or containing only aggregate data

e.g., occupation=“”

A

incomplete data

4
Q

containing errors or outliers

e.g., Salary=“-10”

A

noisy data

5
Q

containing discrepancies in codes or names

Age=“42” Birthday=“03/07/1997”
Was rating “1,2,3”, now rating “A, B, C”

A

inconsistent data

6
Q

Why is data processing important?

A
  • No quality data, no quality mining results
  • Quality decisions must be based on quality data
  • Duplicate or missing data may cause incorrect or even misleading statistics
7
Q

Multi-Dimensional Measure of Data Quality

A
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Value added
  • Interpretability
  • Accessibility
8
Q

Major Tasks in Data Processing

A
  • Data Cleaning
  • Data Integration
  • Data Transformation
  • Data Reduction
  • Data discretization
9
Q

Fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies

number one problem in data warehousing

A

Data Cleaning

10
Q

Data Cleaning Tasks:

A
  • Fill in Missing Values
  • Identify outliers and smooth out noisy data
  • Correct Inconsistent Data
  • Resolve Redundancy caused by data integration
11
Q

Data is not always available (many tuples have no recorded values for several attributes, such as customer income in sales data)

A

Missing Data

12
Q

Missing data causes:

A
  • Equipment malfunction
  • Inconsistent with other recorded data and thus deleted
  • Data not entered due to misunderstanding
  • Certain data may not be considered important at the time of entry
  • History or changes of the data were not registered
13
Q

Handling missing data

A
  • Ignore the tuple
  • Fill in missing values manually: tedious and infeasible
  • Fill it automatically
14
Q

Fill missing data with

A
  • A global constant e.g., unknown
  • The attribute mean
  • The most probable value: inference-based, such as a Bayesian formula, decision tree, or EM algorithm
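
The first two fill strategies above can be sketched in a few lines. This is a minimal illustration, assuming missing entries are represented as None; the function names and the sample incomes are hypothetical, and the inference-based option is not shown.

```python
from statistics import mean


def fill_with_constant(values, constant="unknown"):
    """Replace every missing entry (None) with a global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace every missing entry (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot compute a mean: every value is missing")
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


# Hypothetical customer incomes with two missing entries.
incomes = [30000, None, 50000, 40000, None]
filled_incomes = fill_with_mean(incomes)  # missing entries become 40000
occupations = fill_with_constant(["clerk", None, "engineer"])
```

Filling with the mean keeps the attribute's average unchanged but shrinks its variance, which is one reason inference-based fills are often preferred when enough data is available.
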
15
Q

Random error or variance in a measured variable

A

Noisy Data

16
Q

Incorrect attribute values may be due to:

A
  • Faulty data collection instruments
  • Data entry problems
  • Data transmission problems
17
Q

Handling Noisy Data

A
  • Binning Method
  • Clustering
  • Combined computer and human inspection
18
Q

When reducing noise and trend analysis is needed

A

Smoothing by bin means

19
Q

When keeping real-world constraints and preserving limits is important

A

Smoothing by bin boundaries
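
The two binning-based smoothing methods (by bin means and by bin boundaries) can be compared on a small example. This is a sketch under stated assumptions: the data is first split into equal-frequency bins, ties in boundary smoothing go to the lower boundary, and the function names and price values are illustrative only.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min or max.

    Assumption: a value exactly halfway goes to the lower boundary.
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 4)
by_means = smooth_by_means(bins)
# [[9.0, 9.0, 9.0, 9.0], [22.75, 22.75, 22.75, 22.75], [29.25, 29.25, 29.25, 29.25]]
by_boundaries = smooth_by_boundaries(bins)
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

Bin means flatten each bin to a single value, which suits trend analysis; bin boundaries only ever output values that actually occur in the data, so observed limits are preserved.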

20
Q

Detect and remove outliers (data points inconsistent with the majority of the data)

A

Clustering

21
Q

Integration of multiple databases or files

A

Data Integration

22
Q

Integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources

A

Schema Integration

23
Q

Removing noise from data

A

Smoothing

24
Q

scaled to fall within a small, specified range

A

Normalization

25
Q

summarization

A

Aggregation

26
Q

concept hierarchy climbing

A

Generalization

27
Q

Normalization

A
  • Min-max normalization
  • Z-score normalization
  • Normalization by decimal scaling
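
The three normalization methods listed above can be sketched as follows. This is a minimal illustration using the standard textbook definitions; the function names and the choice of population standard deviation for the z-score are assumptions, not something the cards specify.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization needs at least two distinct values")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("z-score normalization needs non-constant values")
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer so that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max([10, 20, 30]))                 # [0.0, 0.5, 1.0]
print(z_score([2, 4, 4, 4, 5, 5, 7, 9])[0])  # -1.5
print(decimal_scaling([-986, 917]))          # [-0.986, 0.917]
```

Min-max is sensitive to outliers because a single extreme value stretches the range; z-score is the usual alternative when the true minimum and maximum are unknown.
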
28
Q

Obtains reduced representation in volume but produces the same or similar analytical results.
Data is too big to work with.

A

Data Reduction

29
Q

Data Reduction Strategies

A
  • Dimension reduction—remove unimportant attributes
  • Aggregation and clustering
  • Sampling
30
Q

Feature selection (i.e., attribute subset selection):
Select a minimum set of attributes (features) that is sufficient for the data mining task.

A

Dimension Reduction

31
Q

Popular reduction technique. Divide data into buckets and store average (sum) for each bucket

A

Histograms
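
As a rough sketch of histogram-based reduction, the snippet below groups values into equal-width buckets and keeps only a count, sum, and mean per bucket. The function name, the equal-width choice, and the sample values are assumptions for illustration.

```python
def histogram_summary(values, bucket_width):
    """Summarize values per equal-width bucket, keyed by the bucket's lower edge."""
    buckets = {}
    for v in values:
        start = (v // bucket_width) * bucket_width  # lower edge of v's bucket
        count, total = buckets.get(start, (0, 0))
        buckets[start] = (count + 1, total + v)
    return {
        start: {"count": count, "sum": total, "mean": total / count}
        for start, (count, total) in sorted(buckets.items())
    }


summary = histogram_summary([1, 3, 12, 15, 18, 27], bucket_width=10)
# {0: {'count': 2, 'sum': 4, 'mean': 2.0},
#  10: {'count': 3, 'sum': 45, 'mean': 15.0},
#  20: {'count': 1, 'sum': 27, 'mean': 27.0}}
```

Six raw values are reduced to three bucket summaries; the saving grows with the number of values per bucket.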

32
Q

Choose a representative subset of data

A

Sampling

33
Q

Data discretization: three types of attributes

A
  • Nominal
  • Ordinal
  • Continuous
34
Q

Data discretization: three types of attributes

values from an unordered set

A

Nominal

35
Q

Data discretization: three types of attributes

values from an ordered set

A

Ordinal

36
Q

Data discretization: three types of attributes

real numbers

A

Continuous

37
Q

Data discretization techniques

A
  • Binning Method - equal-width, equal-frequency
  • Entropy-based discretization
38
Q

Binning with a fixed bin width, e.g., 10

A

Equi-width binning

39
Q

Binning with a fixed bin density (number of values per bin), e.g., 3

A

Equi-frequency binning
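
The difference between the two binning schemes is easiest to see side by side. This is a small sketch, assuming non-negative integer values, a bin width of 10, and a bin density of 3; the function names and sample values are hypothetical.

```python
def equal_width_labels(values, width):
    """Label each value with the fixed-width interval it falls into."""
    labels = []
    for v in values:
        start = (v // width) * width
        labels.append(f"[{start}, {start + width})")
    return labels


def equal_frequency_groups(values, per_bin):
    """Sort the values and group them so each bin holds per_bin items."""
    ordered = sorted(values)
    return [ordered[i:i + per_bin] for i in range(0, len(ordered), per_bin)]


values = [0, 4, 12, 16, 16, 18, 24, 26, 28]

print(equal_width_labels(values, 10))
# ['[0, 10)', '[0, 10)', '[10, 20)', '[10, 20)', '[10, 20)', '[10, 20)',
#  '[20, 30)', '[20, 30)', '[20, 30)']

print(equal_frequency_groups(values, 3))
# [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```

Equal-width bins all span the same range but can hold very different numbers of values (2, 4, and 3 here); equal-frequency bins hold the same number of values but span ranges of different widths.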

40
Q

Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values

A

Discretization

41
Q

Reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior)

A

Concept Hierarchies
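
A one-level concept hierarchy for age might look like the sketch below. The cut-offs (35 and 60) are purely illustrative assumptions; the cards name the concepts but do not define their boundaries.

```python
def age_to_concept(age):
    """Map a numeric age to a higher-level concept.

    Assumption: the boundaries 35 and 60 are hypothetical, chosen for illustration.
    """
    if age < 0:
        raise ValueError("age cannot be negative")
    if age < 35:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


ages = [22, 41, 67, 35]
concepts = [age_to_concept(a) for a in ages]
# ['young', 'middle-aged', 'senior', 'middle-aged']
```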