Visualization and Exploration Flashcards by Eric van Lessen

Unstructured data (tree)

Investigating individual features

Continuous features

Feature | Count | % Miss | Card | Min | 1st Qrt | Mean | Median | 3rd Qrt | Max | Std Dev

Investigating individual features

Categorical Features

Feature | Count | % Miss | Card | Mode | Mode Freq | Mode % | 2nd Mode | 2nd Mode Freq | 2nd Mode %

Investigating individual features

Feature | Count | % Miss | Card | Min | 1st Qrt | Mean | Median | 3rd Qrt | Max | Std Dev

Investigating individual features

Feature

Count - number of instances
% Miss - percentage of instances with a missing value
Card - cardinality: number of unique values
Min - minimum
1st Qrt - first quartile (25th percentile)
Mean - arithmetic mean
Median - median (middle value)
3rd Qrt - third quartile (75th percentile)
Max - maximum
Std Dev - standard deviation
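
As a rough illustration, the columns above can be computed for one continuous feature with the Python standard library. This is a minimal sketch, not a prescribed method: the function name `continuous_report`, the use of `None` to mark missing values, the "inclusive" quartile method, and the sample ages are all assumptions made for the example.

```python
import statistics

def continuous_report(values):
    """Summarise one continuous feature; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(values),
        "pct_miss": 100.0 * (len(values) - len(present)) / len(values),
        "card": len(set(present)),
        "min": min(present),
        "q1": q1,
        "mean": statistics.mean(present),
        "median": median,
        "q3": q3,
        "max": max(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation
    }

ages = [23, 35, None, 41, 35, 58, 29, None, 35, 44]  # hypothetical data
report = continuous_report(ages)
for name, value in report.items():
    print(f"{name:8} {value}")
```

Other quartile conventions exist, so the 1st Qrt and 3rd Qrt values may differ slightly between tools.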

Investigating individual features

Categorical Features

Feature | Count | % Miss | Card | Mode | Mode Freq | Mode % | 2nd Mode | 2nd Mode Freq | 2nd Mode %

Investigating individual features

Categorical Features

Feature

Count - number of instances
% Miss - percentage of instances with a missing value
Card - cardinality: number of unique values
Mode - most common value
Mode Freq - frequency of the mode
Mode % - percentage of instances that have the mode value
2nd Mode, 2nd Mode Freq, 2nd Mode % - the same measures for the second most common value
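
The categorical columns can be sketched in the same way with `collections.Counter`. The function name `categorical_report`, the `None` marker for missing values, the choice to compute percentages over the non-missing values, and the sample colours are assumptions for illustration only.

```python
from collections import Counter

def categorical_report(values):
    """Summarise one categorical feature; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    ranked = Counter(present).most_common(2)  # two most frequent values
    mode, mode_freq = ranked[0]
    second, second_freq = ranked[1] if len(ranked) > 1 else (None, 0)
    return {
        "count": len(values),
        "pct_miss": 100.0 * (len(values) - len(present)) / len(values),
        "card": len(set(present)),
        "mode": mode,
        "mode_freq": mode_freq,
        # Percentages are taken over non-missing values here (assumption).
        "mode_pct": 100.0 * mode_freq / len(present),
        "second_mode": second,
        "second_mode_freq": second_freq,
        "second_mode_pct": 100.0 * second_freq / len(present),
    }

colours = ["red", "blue", "red", None, "green", "red", "blue", "red"]  # hypothetical
report = categorical_report(colours)
print(report)
```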

Different types of histograms

Uniform

Normal (Unimodal)

Unimodal (skewed left/right)

Exponential

Multimodal (more than one peak)

Normal distribution (68-95-99.7)

In statistics, the 68–95–99.7 rule, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie within an interval estimate in a normal distribution: 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively.
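
The three percentages can be checked numerically from the cumulative distribution function of a standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is an assumption for the example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability mass within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.2%}")
```

Because any normal distribution is a shifted and scaled standard normal, the same shares hold for every mean and standard deviation.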

Six sigma

Processes that operate with “six sigma quality” are assumed to have fewer than 3.4 defects per million cases. This is based on six sigma with a “drift” of +/- 1.5 sigma.
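
The 3.4 figure can be reproduced under the usual reading of this rule: if the mean drifts 1.5 sigma toward a limit that sits 6 sigma away, the limit is effectively 4.5 sigma from the shifted mean, and only the tail on that side is counted (the opposite tail, 7.5 sigma away, is negligible). A minimal sketch assuming normally distributed output:

```python
from statistics import NormalDist

SIGMA_LEVEL = 6.0  # distance of the specification limit from the target mean
DRIFT = 1.5        # assumed long-term shift of the process mean

# One-sided tail probability beyond the limit after the drift.
tail = NormalDist().cdf(-(SIGMA_LEVEL - DRIFT))
defects_per_million = tail * 1_000_000
print(f"{defects_per_million:.2f} defects per million")
```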

Data quality - typical problems

• Data may be …

incomplete (missing instances/attributes),
invalid (impossible values),
inconsistent (conflicting values),
imprecise (approximated or rounded), and/or
outdated (based on old observations)

Missing values handling

4 options

Feature is missing for some instances.
Options:

1. Remove the feature completely.
2. Only consider instances that have a value (per feature).
3. Remove all instances that have one of the features missing.
4. Repair missing features (imputation).
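
The four options can be sketched on a small table held as a list of dictionaries. The data, the feature names `age` and `income`, and the choice of mean imputation for the last option are assumptions for illustration; imputation can be done in many other ways.

```python
from statistics import mean

# Hypothetical instances; None marks a missing value.
rows = [
    {"age": 34, "income": 4200},
    {"age": None, "income": 3900},
    {"age": 51, "income": None},
    {"age": 29, "income": 3100},
]

# Option 1: remove the feature completely.
without_age = [{k: v for k, v in r.items() if k != "age"} for r in rows]

# Option 2: per feature, only consider instances that have a value.
ages_present = [r["age"] for r in rows if r["age"] is not None]

# Option 3: remove all instances that have any feature missing.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Option 4: repair missing values, here by imputing the feature mean.
age_mean = mean(ages_present)
imputed = [
    {**r, "age": r["age"] if r["age"] is not None else age_mean} for r in rows
]

print(len(complete_rows), ages_present, imputed[1]["age"])
```

Options 1 and 3 discard the most data; option 4 keeps every instance but introduces values that were never observed.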

Impossible values

Can be handled like missing values.

Unlikely values

• Dates: One may expect a uniform distribution over months and days.
− When formats DD-MM-YYYY and MM-DD-YYYY are mixed up this is no longer the case.
− Days 1-12 are more frequent than 13-31.
• Age: An age of 123 is not impossible, but unlikely.
• Price: An item priced 120.000 (rather than 120).

Unlikely values are identified based on domain knowledge.
Outlier values are identified based on the distribution.
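
The date check can be sketched as a comparison between the observed share of days 1-12 and the share expected under a uniform spread. The function name `low_day_share`, the sample dates, and the 0.2 margin are hypothetical choices for the example, not a calibrated test.

```python
from datetime import date

def low_day_share(dates):
    """Fraction of dates whose day of month is 12 or lower."""
    return sum(d.day <= 12 for d in dates) / len(dates)

# In a 365-day year, 12 days in each of the 12 months fall in the range 1-12,
# so a uniform spread gives about 144/365 (roughly 39%).
EXPECTED = 144 / 365

parsed = [  # hypothetical parsed dates
    date(2023, 3, 4), date(2023, 5, 11), date(2023, 1, 2),
    date(2023, 7, 9), date(2023, 2, 28),
]
share = low_day_share(parsed)
suspicious = share > EXPECTED + 0.2  # hypothetical margin
print(f"share of days 1-12: {share:.0%}, suspicious: {suspicious}")
```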

Box plots

• Median value (middle) depicted by “Bar”.
• IQR = Interquartile Range (covers the “middle” 50% of instances) depicted by “Box”.
• Upper whisker: maximal value below 3rd quartile + 1.5 * IQR.
• Lower whisker: minimal value above 1st quartile - 1.5 * IQR.
• Outliers are drawn separately.
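
The elements of a box plot can be computed directly from the data. The sketch below follows the definitions above; the function name `box_plot_stats`, the "inclusive" quartile method, and the sample values are assumptions, and values lying exactly on a fence are treated as inside it.

```python
import statistics

def box_plot_stats(values):
    """Compute the quantities a box plot draws for one feature."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": min(inside),  # smallest value still inside the fence
        "upper_whisker": max(inside),  # largest value still inside the fence
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

stats = box_plot_stats([2, 4, 4, 5, 6, 7, 8, 9, 30])  # hypothetical data
print(stats)
```

Note that the whiskers end at actual data values, not at the fences themselves.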

Handling outliers with Boxplots

Remove values above and below thresholds (e.g., upper and lower fences).
Clamp values above and below thresholds to these thresholds.
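
Both strategies can be sketched with the box plot fences (1st quartile - 1.5 * IQR and 3rd quartile + 1.5 * IQR) as thresholds. The function names and the sample data are assumptions for illustration.

```python
import statistics

def fences(values):
    """Lower and upper fence based on the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def remove_outliers(values):
    """Drop values outside the fences."""
    low, high = fences(values)
    return [v for v in values if low <= v <= high]

def clamp_outliers(values):
    """Replace values outside the fences by the nearest fence."""
    low, high = fences(values)
    return [min(max(v, low), high) for v in values]

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data
removed = remove_outliers(data)
clamped = clamp_outliers(data)
print(removed)
print(clamped)
```

Removing shrinks the data set, while clamping keeps every instance but changes the extreme values.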

Some basic descriptive statistics

sample mean formula

Some basic descriptive statistics

sample variance
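
The formulas on these cards did not survive in the text. For reference, the standard definitions of the sample mean and the sample variance for observations $x_1, \dots, x_n$ are given below; the $n-1$ denominator (the unbiased estimator) is assumed here, and the original card may have used a different convention.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
```

The sample standard deviation is $s = \sqrt{s^2}$.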

Some basic descriptive statistics

Sample covariance

Correlation
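
The formulas for these two cards are not preserved in the text. The standard definitions for paired observations $(x_i, y_i)$, $i = 1, \dots, n$, are shown below, assuming the $n-1$ denominator and the Pearson correlation coefficient; $\bar{x}$, $\bar{y}$ are the sample means and $s_x$, $s_y$ the sample standard deviations.

```latex
\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right) \left( y_i - \bar{y} \right)
\qquad
r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
```

The correlation always lies between $-1$ and $1$, which makes it comparable across feature pairs with different units.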

Matrix to know correlation

Example correlation matrix
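
The example matrix itself is not preserved in the text. As a stand-in, the sketch below builds a correlation matrix for three hypothetical features using the Pearson correlation coefficient; the feature names, the data, and the helper `pearson` are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; the 1/(n-1) factors cancel, so plain sums suffice."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

features = {  # hypothetical data
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 91],
    "errors": [9, 7, 5, 3, 1],
}
names = list(features)
matrix = [[pearson(features[a], features[b]) for b in names] for a in names]

print(" " * 8 + "".join(f"{n:>8}" for n in names))
for name, row in zip(names, matrix):
    print(f"{name:>8}" + "".join(f"{value:8.2f}" for value in row))
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal carry information.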

Preparing for analysis

Normalization (make comparable)

Binning (make categorical)

Sampling (make data smaller or to change the bias)

Normalization typically maps values onto a predefined range (e.g. [0,1], [-1,1]) while maintaining relative differences.
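
One common way to do this is range (min-max) normalization, sketched below. The function name `min_max_normalize` and the handling of a constant feature are assumptions for the example.

```python
def min_max_normalize(values, low=0.0, high=1.0):
    """Linearly map values onto [low, high], preserving relative differences."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # Constant feature: map everything to the lower bound (assumption).
        return [low for _ in values]
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]

unit = min_max_normalize([10, 20, 25, 50])           # onto [0, 1]
signed = min_max_normalize([10, 20, 25, 50], -1, 1)  # onto [-1, 1]
print(unit)
print(signed)
```

Because the minimum and maximum define the mapping, a single extreme outlier can squeeze all other values into a narrow part of the range.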

Standard score uses the standard deviation to normalize.
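
A minimal sketch of the standard score (z-score): each value is expressed as the number of standard deviations it lies from the mean. The sample standard deviation is used here, which is an assumption; the population version divides by n instead of n - 1.

```python
import statistics

def standard_scores(values):
    """Return (value - mean) / standard deviation for every value."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / std for v in values]

z = standard_scores([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical data
print([round(score, 2) for score in z])
```

The resulting scores have mean 0 and standard deviation 1, but unlike range normalization they are not confined to a fixed interval.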

1. Why binning?
2. What is a bin?
3. Two types of binning

• Binning is used to make continuous features categorical.
• Bins = a series of ranges.
• Equal-width binning versus equal-frequency binning

Equal-width binning

Bins have a fixed width, but the number of items per bin may vary greatly.

Equal-frequency binning

Bins have a variable width, but the number of items per bin is fixed.
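
The difference between the two approaches shows up clearly on skewed data. The sketch below assigns each value a bin index from 0 to k - 1; the function names and the data are assumptions, the equal-width version assumes the values are not all identical, and the equal-frequency version may split tied values across neighbouring bins.

```python
def equal_width_bins(values, k):
    """Assign bin indices using k ranges of identical width."""
    smallest, largest = min(values), max(values)
    width = (largest - smallest) / k
    # The maximum falls on the last boundary, so it is placed in the last bin.
    return [min(int((v - smallest) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bin indices so that each bin holds (nearly) the same number of items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * k // len(values)
    return bins

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical skewed data
width_bins = equal_width_bins(data, 3)
freq_bins = equal_frequency_bins(data, 3)
print(width_bins)
print(freq_bins)
```

With equal-width bins the single large value leaves the middle bin empty and crowds the rest into the first bin; equal-frequency bins put three items in each bin.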

Different types of sampling

top, random, stratified, and under/over sampling to make data smaller or to remove/introduce a sample bias

Top sampling
Random sampling
Stratified sampling - sampling such that relative frequencies are maintained, e.g., by taking the same percentage from every group
Under-sampling - creating a balance by leaving out instances of the over-represented group
Over-sampling - creating a balance by possibly duplicating instances of the under-represented group
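
The five sampling types can be sketched with the standard `random` module. Everything named here is an assumption for illustration: the helper functions, the labelled sample rows, the fixed seed, and the reading of top sampling as "keep the first n instances".

```python
import random
from collections import defaultdict

def group_by(rows, key):
    """Collect rows into lists per group label."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups

def stratified_sample(rows, key, fraction, rng):
    """Take the same fraction from every group, keeping relative frequencies."""
    sample = []
    for members in group_by(rows, key).values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

def under_sample(rows, key, rng):
    """Shrink every group to the size of the smallest group."""
    groups = group_by(rows, key)
    target = min(len(members) for members in groups.values())
    return [row for members in groups.values() for row in rng.sample(members, target)]

def over_sample(rows, key, rng):
    """Grow every group to the size of the largest group by duplicating instances."""
    groups = group_by(rows, key)
    target = max(len(members) for members in groups.values())
    result = []
    for members in groups.values():
        result.extend(members)
        result.extend(rng.choices(members, k=target - len(members)))
    return result

rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = [("ok", i) for i in range(90)] + [("fraud", i) for i in range(10)]
label = lambda row: row[0]

top = rows[:20]               # top sampling: first 20 instances (assumption)
rand = rng.sample(rows, 20)   # random sampling without replacement
strat = stratified_sample(rows, label, 0.2, rng)
under = under_sample(rows, label, rng)
over = over_sample(rows, label, rng)
print(len(top), len(rand), len(strat), len(under), len(over))
```

In this data the top sample contains no "fraud" rows at all, which shows how top sampling can introduce a bias when the data is ordered.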

Visualization and Exploration Flashcards

(29 cards)