Lecture 5 (done) Flashcards

1
Q

Measures for data quality:

A
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Interpretability
2
Q

Major Tasks in Data Preprocessing:

A
  • Data cleaning
  • Data integration
  • Data transformation
  • Data reduction
3
Q

Data cleaning includes:

A

1-Fill in missing values
2-Smooth noisy data
3-Identify or remove outliers
4-Resolve inconsistencies

4
Q

Data integration is:

A

Integration of multiple:
1-Databases
2-Data cubes
3-Files

5
Q

Data reduction includes:

A

1-Dimensionality reduction

2-Numerosity reduction

6
Q

Data transformation includes:

A

1-Normalization
2-Data discretization
3-Concept hierarchy generation
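
The card names normalization without saying which method is meant. As a rough sketch, assuming the two common choices of min-max scaling and z-score standardization (both function names and the sample values are hypothetical, not from the lecture):

```python
from statistics import mean, pstdev


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they fall in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale, map everything to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score_normalize(values):
    """Center values on the mean and divide by the population std. deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute values, for illustration only.
incomes = [12000, 54000, 73600, 98000]
scaled = min_max_normalize(incomes)
standardized = z_score_normalize(incomes)
```

Min-max scaling keeps the original ordering and bounds the range, while z-scores are less sensitive to a single extreme minimum or maximum; which one fits depends on the mining algorithm used afterwards.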

7
Q

Real-world data tend to be:

A

1-Incomplete (missing): missing attribute values
2-Inaccurate (noisy): containing errors, or outliers
3-Inconsistent: containing discrepancies

8
Q

Reasons for missing data:

A

1-Information is not collected
2-Attributes may not be applicable to all cases
3-Inconsistent with other recorded data and thus deleted
4-Human/Hardware/Software problems

9
Q

How to Handle Missing Data?

A

1-Ignore the tuple
2-Fill in the missing value manually: tedious and often infeasible
3-Fill in the missing value automatically with:
o-A global constant: e.g., “unknown” or ∞
o-The attribute mean or median
o-The attribute mean for all samples belonging to the same class
o-The most probable value
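
The options above can be sketched in plain Python. This is a minimal illustration, assuming each record is a hypothetical (class label, value) pair with None marking a missing value; the "most probable value" option is left out because it needs a predictive model.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records: (class_label, income); None marks a missing value.
records = [
    ("low", 20.0), ("low", None), ("low", 30.0),
    ("high", 80.0), ("high", 100.0), ("high", None),
]


def drop_incomplete(rows):
    """Option 1: ignore tuples that have a missing value."""
    return [(c, v) for c, v in rows if v is not None]


def fill_with_constant(rows, constant):
    """Option 3a: replace missing values with a global constant."""
    return [(c, constant if v is None else v) for c, v in rows]


def fill_with_statistic(rows, stat=mean):
    """Option 3b: replace missing values with the attribute mean or median."""
    known = [v for _, v in rows if v is not None]
    fill = stat(known)
    return [(c, fill if v is None else v) for c, v in rows]


def fill_with_class_mean(rows):
    """Option 3c: replace missing values with the mean of the same class.

    Assumes every class has at least one known value.
    """
    by_class = defaultdict(list)
    for c, v in rows:
        if v is not None:
            by_class[c].append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]
```

For example, `fill_with_statistic(records, median)` uses the median instead of the mean, which is usually the safer choice when the attribute is skewed.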

10
Q

Reasons for noisy data:

A
1-Faulty data collection instruments
2-Human errors at data entry
3-Data transmission problems
4-Technology limitations
5-Inconsistency in naming conventions
11
Q

Data Smoothing techniques

A
1-Binning:
   o-Smooth by bin means
   o-Smooth by bin medians
   o-Smooth by bin boundaries
2-Regression:
   o-Smooth by fitting the data into regression functions
3-Clustering:
   o-Detect and remove outliers
12
Q

Binning Methods steps:

A

1-Sort the data values
2-Partition data into:
o-Equal-depth (equal-frequency) bins, or
o-Equal-width bins, where the interval range of values in each bin is constant.

3-Smooth the data:
o-Smooth by bin means: replace each bin value by the bin mean.
o-Smooth by bin medians: replace each bin value by the bin median.
o-Smooth by bin boundaries: replace each bin value with the closest boundary value (min, max).
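
The three steps above can be sketched as follows. This is a minimal illustration; the function names and the sample prices are assumptions made for the example, and ties in boundary smoothing are resolved toward the lower boundary.

```python
from statistics import mean, median


def equal_depth_bins(values, depth):
    """Sort, then cut into consecutive bins holding `depth` values each."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def equal_width_bins(values, n_bins):
    """Sort, then assign values to `n_bins` intervals of equal width."""
    ordered = sorted(values)
    lo, hi = ordered[0], ordered[-1]
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in ordered:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1) if width else 0
        bins[idx].append(v)
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by that bin's mean (empty bins stay empty)."""
    return [[mean(b)] * len(b) if b else [] for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin by that bin's median."""
    return [[median(b)] * len(b) if b else [] for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        if not b:
            smoothed.append([])
            continue
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Illustrative sample values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
depth_bins = equal_depth_bins(prices, 3)
# depth_bins is [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```

With these bins, smoothing by means gives 9, 22 and 29 for the three bins, and smoothing by boundaries gives [4, 4, 15], [21, 21, 24] and [25, 25, 34].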

13
Q

How to apply regression:

A

1-Linear Regression: finding the best line to fit two variables, so that one variable can be used to predict the other.

2-Multiple Linear Regression: more than two attributes are involved and the data are fit to a multidimensional surface.
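
As an illustration of the two-variable case, here is a minimal sketch that assumes "best line" means the ordinary least-squares line. The function names and sample points are hypothetical; the multiple-regression case is not shown.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def smooth_with_line(xs, ys):
    """Replace each observed y by the value the fitted line predicts for its x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
smoothed_ys = smooth_with_line(xs, ys)
```

The smoothed values all lie on the fitted line, so the noise in the individual observations is removed at the cost of assuming the relationship really is linear.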
