Visualization and Exploration Flashcards

1
Q

Unstructured data (tree)

A
2
Q

Investigating individual features

Continuous features

A

Feature

Count

% Miss

Card

Min

1st Qrt

Mean

Median

3rd Qrt

Max

Std Dev

3
Q

Investigating individual features

Categorical Features

A

Feature

Count

% Miss

Card

Mode

Mode Freq

Mode %

2nd Mode

2nd Mode Freq

2nd Mode %

4
Q

Investigating individual features

Feature

Count

% Miss

Card

Min

1st Qrt

Mean

Median

3rd Qrt

Max

Std Dev

A

Investigating individual features

Feature

Count - number of instances

% Miss - percentage missing

Card - cardinality: number of unique values

Min - minimum

1st Qrt - first quartile (25th percentile)

Mean - mean

Median - median (middle value)

3rd Qrt - third quartile (75th percentile)

Max - maximum

Std Dev - standard deviation
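A minimal sketch of how such a report row could be computed for one continuous feature with Python's standard library. The function name `continuous_report`, the use of `None` for missing values, and the "inclusive" quartile method are assumptions made for illustration; other tools may compute quartiles slightly differently.

```python
import statistics

def continuous_report(values):
    """Summarize one continuous feature; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    # quantiles(n=4) returns the three cut points: 1st quartile, median, 3rd quartile.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(values),  # assumption: count includes missing instances
        "pct_miss": 100 * (len(values) - len(present)) / len(values),
        "card": len(set(present)),
        "min": min(present),
        "q1": q1,
        "mean": statistics.mean(present),
        "median": median,
        "q3": q3,
        "max": max(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation (n - 1)
    }

report = continuous_report([1, 2, 3, 4, 5, None])
for name, value in report.items():
    print(f"{name}: {value}")
```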

5
Q

Investigating individual features

Categorical Features

Feature

Count

% Miss

Card

Mode

Mode Freq

Mode %

2nd Mode

2nd Mode Freq

2nd Mode %

A

Investigating individual features

Categorical Features

Feature

Count - number of instances

% Miss - percentage missing

Card - cardinality: number of unique values

Mode - mode: most common value

Mode Freq - frequency of mode

Mode % - percentage of mode

2nd Mode, 2nd Mode Freq, 2nd Mode % - the same three values for the second most common value
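A minimal sketch of the same report for one categorical feature, using `collections.Counter`. The function name `categorical_report`, the `None` marker for missing values, and the choice to compute percentages over the non-missing values are assumptions made for illustration.

```python
from collections import Counter

def categorical_report(values):
    """Summarize one categorical feature; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    report = {
        "count": len(values),
        "pct_miss": 100 * (len(values) - len(present)) / len(values),
        "card": len(set(present)),
    }
    # most_common(2) yields the mode and the second mode with their frequencies.
    ranked = Counter(present).most_common(2)
    for label, (value, freq) in zip(("mode", "2nd_mode"), ranked):
        report[label] = value
        report[label + "_freq"] = freq
        report[label + "_pct"] = 100 * freq / len(present)
    return report

report = categorical_report(["red", "blue", "red", None, "green", "red", "blue"])
for name, value in report.items():
    print(f"{name}: {value}")
```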

6
Q

Different types of histograms

Uniform

Normal (Unimodal)

Unimodal (skewed left/right)

Exponential

Multimodal

A

Uniform

Normal (Unimodal)

Unimodal (skewed left/right)

Exponential

Multimodal (more than one peak)

7
Q

Normal distribution (68-95-99.7)

A

In statistics, the 68–95–99.7 rule, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie within an interval estimate in a normal distribution: 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively.
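The three percentages can be checked numerically: for a normal distribution, the share of values within k standard deviations of the mean equals erf(k / sqrt(2)). A short sketch (the helper name `share_within` is illustrative):

```python
import math

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```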

8
Q

Six sigma

A

Processes that operate with “six sigma quality” are assumed to have fewer than 3.4 defects per million cases. This is based on six sigma with a “drift” of +/- 1.5 sigma.
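The 3.4 figure can be reproduced from the normal distribution: with a drift of 1.5 sigma, the nearest limit is effectively 6 - 1.5 = 4.5 sigma away from the mean. The sketch below assumes only the tail on the side of the drift is counted (the opposite tail is negligible); the helper name `upper_tail` is illustrative.

```python
import math

def upper_tail(z):
    """Probability that a standard normal value exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Defects per million with the mean drifted 1.5 sigma toward one limit.
dpmo_with_drift = upper_tail(6 - 1.5) * 1_000_000
# For comparison: one-sided defects per million without any drift.
dpmo_without_drift = upper_tail(6) * 1_000_000

print(f"with drift:    {dpmo_with_drift:.2f} per million")
print(f"without drift: {dpmo_without_drift:.5f} per million")
```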

9
Q

Data quality - typical problems

A

• Data may be …

  • incomplete (missing instances/attributes),
  • invalid (impossible values),
  • inconsistent (conflicting values),
  • imprecise (approximated or rounded), and/or
  • outdated (based on old observations)
10
Q

Missing values handling

4 options

A
  • Feature is missing for some instances.
  • Options:
  1. Remove feature completely.
  2. Only consider instances that have a value (per feature).
  3. Remove all instances that have one of the features missing.
  4. Repair missing features (imputation).
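The four options can be sketched on a tiny made-up data set. The rows, the feature names `age` and `income`, and the use of mean imputation for option 4 are assumptions for illustration; imputation could equally use the median, the mode, or a model.

```python
import statistics

# Hypothetical instances; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
]

# Option 1: remove the feature completely.
without_age = [{k: v for k, v in row.items() if k != "age"} for row in rows]

# Option 2: per feature, only consider instances that have a value.
known_ages = [row["age"] for row in rows if row["age"] is not None]

# Option 3: remove all instances that have one of the features missing.
complete_rows = [row for row in rows if None not in row.values()]

# Option 4: repair missing values (here: impute the mean of the known values).
mean_age = statistics.mean(known_ages)
imputed = [
    {**row, "age": mean_age if row["age"] is None else row["age"]}
    for row in rows
]

print(len(without_age), known_ages, len(complete_rows), imputed[1]["age"])
```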
11
Q

Impossible values

A

Can be handled like missing values.

12
Q

Unlikely values

A

• Dates: One may expect a uniform distribution over months and days.

  − When formats DD-MM-YYYY and MM-DD-YYYY are mixed up, this is no longer the case.

  − Days 1-12 are more frequent than 13-31.

• Age: An age of 123 is not impossible, but unlikely.
• Price: An item priced 120.000 (rather than 120).
• Unlikely values are identified based on domain knowledge.
• Outlier values are identified based on the distribution.
13
Q

Box plots

A

• Median value (middle) depicted by “Bar”.

• IQR = Interquartile Range (covers 50% of “middle” instances) depicted by “Box”.

• Upper whisker: maximal value below 3rd quartile + 1.5 * IQR.

• Lower whisker: minimal value above 1st quartile - 1.5 * IQR.

• Outliers are drawn separately.
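The quantities a box plot draws can be computed directly. The sketch below assumes the "inclusive" quartile method of Python's `statistics` module (plotting libraries may use a slightly different quartile definition); the function name `box_plot_stats` and the sample data are illustrative.

```python
import statistics

def box_plot_stats(values):
    """Compute the bar, box, whiskers, and outliers of a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that are still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }

stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

In this example the value 30 lies above the upper fence (7.75 + 1.5 * 4.5 = 14.5), so the upper whisker ends at 9 and 30 is drawn as an outlier.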

14
Q

• Median value (middle) depicted by “Bar”.

• IQR = Interquartile Range (covers 50% of “middle” instances) depicted by “Box”.

• Upper whisker: maximal value below 3rd quartile + 1.5 * IQR.

• Lower whisker: minimal value above 1st quartile - 1.5 * IQR.

• Outliers are drawn separately.

A
15
Q

Handling outliers with Boxplots

A
  1. Remove values above and below thresholds (e.g., upper and lower fences).
  2. Clamp values above and below thresholds to these thresholds.
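Both options can be sketched in a few lines. The function names and the fence values below are hypothetical; in practice the thresholds would come from the data, e.g. 1st quartile - 1.5 * IQR and 3rd quartile + 1.5 * IQR.

```python
def remove_outliers(values, lower, upper):
    """Option 1: drop values outside the thresholds."""
    return [v for v in values if lower <= v <= upper]

def clamp_outliers(values, lower, upper):
    """Option 2: replace values outside the thresholds by the nearest threshold."""
    return [min(max(v, lower), upper) for v in values]

values = [-20, 1, 2, 3, 4, 5, 30]
lower, upper = -3.5, 14.5  # hypothetical lower and upper fences

removed = remove_outliers(values, lower, upper)
clamped = clamp_outliers(values, lower, upper)
print(removed)  # [1, 2, 3, 4, 5]
print(clamped)  # [-3.5, 1, 2, 3, 4, 5, 14.5]
```

Removing shrinks the data set, whereas clamping keeps every instance but distorts the extreme values.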
16
Q

Some basic descriptive statistics

sample mean formula

A
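The sample mean is the sum of the observations divided by their number: mean = (x_1 + ... + x_n) / n. A minimal sketch (the function name `sample_mean` is illustrative):

```python
def sample_mean(xs):
    """Sum of the observations divided by the number of observations."""
    return sum(xs) / len(xs)

mean = sample_mean([2, 4, 6, 8])
print(mean)  # 5.0
```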
17
Q

Some basic descriptive statistics

sample variance

A
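The sample variance is the sum of squared deviations from the sample mean, divided by n - 1. The sketch below assumes this n - 1 convention (dividing by n instead gives the population variance); the function name `sample_variance` is illustrative.

```python
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

data = [2, 4, 6, 8]
variance = sample_variance(data)
print(variance)
# The standard library uses the same n - 1 convention.
print(statistics.variance(data))
```

The sample standard deviation is the square root of this value.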
18
Q

Some basic descriptive statistics

A
19
Q

Sample covariance

A
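The sample covariance of two features is the sum of the products of their paired deviations from the respective means, divided by n - 1. A minimal sketch assuming that convention (the function name `sample_covariance` is illustrative):

```python
def sample_covariance(xs, ys):
    """Average product of paired deviations from the means (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

cov = sample_covariance([1, 2, 3, 4], [2, 4, 6, 8])
print(cov)
```

A positive value means the two features tend to move in the same direction, a negative value that they move in opposite directions.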
20
Q

Correlation

A
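Correlation is the covariance divided by the product of the two standard deviations, which rescales it to the range [-1, 1]. The sketch below assumes the Pearson correlation coefficient is meant; the function name `correlation` is illustrative.

```python
import math

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1 / (n - 1) factors of covariance and variances cancel out.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)

r_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_down = correlation([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```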
21
Q

Matrix to know correlation

A

Example correlation matrix

22
Q

Preparing for analysis

A

Normalization (make comparable)

Binning (make categorical)

Sampling (make data smaller or to change the bias)

23
Q

Normalization typically maps values onto a predefined range (e.g. [0,1], [-1,1]) while maintaining relative differences.

A
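A minimal sketch of range (min-max) normalization, one common way to map values onto a predefined range. The function name `range_normalize` is illustrative, and the sketch assumes the values are not all identical (otherwise it would divide by zero).

```python
def range_normalize(values, low=0.0, high=1.0):
    """Linearly map values so the minimum becomes low and the maximum becomes high."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

unit = range_normalize([10, 20, 40])
signed = range_normalize([10, 20, 40], low=-1.0, high=1.0)
print(unit)
print(signed)
```

Because the mapping is linear, 20 stays one third of the way between the minimum and the maximum in both target ranges.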
24
Q

Standard score uses the standard deviation to normalize.

A
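The standard score (z-score) of a value is its distance from the mean, measured in standard deviations: z = (x - mean) / std dev. The sketch below assumes the sample standard deviation is used; the function name `standard_scores` is illustrative.

```python
import statistics

def standard_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)
    return [(v - mean) / std_dev for v in values]

scores = standard_scores([2, 4, 6, 8])
print(scores)
```

After this transformation the feature has mean 0 and standard deviation 1, but unlike range normalization the result is not confined to a fixed interval.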
25
Q
  1. Why binning
  2. What is a bin
  3. Two types of binning
A

• Binning is used to make continuous features categorical.

• Bins = a series of ranges.

• Equal-width binning versus equal-frequency binning

26
Q

Equal-width binning

A

Bins have a fixed width, but the number of items per bin may vary greatly.
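A minimal sketch of equal-width binning that assigns each value a bin index from 0 to k - 1. The function name `equal_width_bins` and the sample data are illustrative, and the sketch assumes the values are not all identical.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] into k ranges of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum itself would fall into bin k, so cap the index at k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
bins = equal_width_bins(values, 3)
print(bins)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
```

The single large value stretches the range, so nine of the ten items end up in the first bin and the middle bin stays empty.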

27
Q

Equal-frequency binning

A

Bins have a variable width, but the number of items per bin is fixed.
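A minimal sketch of equal-frequency binning that ranks the values and gives each bin the same number of items. The function name `equal_frequency_bins` is illustrative; the sketch splits ties arbitrarily, and the counts are only exactly equal when the number of items is divisible by k.

```python
def equal_frequency_bins(values, k):
    """Assign bin indices 0..k-1 so that each bin holds about len(values) / k items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
bins = equal_frequency_bins(values, 3)
print(bins)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Each bin holds three items, but the last bin spans the range 7 to 100, far wider than the others.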

28
Q

Different types of sampling

A

Top, random, stratified, and under-/over-sampling.

Sampling is used to make the data smaller or to remove/introduce a sample bias.

29
Q

Top sampling

Random sampling

Stratified sampling

Under-sampling

Over-sampling

A

Top sampling - take the first instances of the data set

Random sampling - select instances at random

Stratified sampling - sample such that relative frequencies are maintained, e.g., by taking the same percentage from every group

Under-sampling - achieve a balance by leaving out instances of the over-represented group

Over-sampling - achieve a balance by possible duplication of under-represented instances
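The five sampling types can be sketched with Python's `random` module. All function names, the fixed seeds, and the toy data (eight instances of group "a", two of group "b") are assumptions for illustration.

```python
import random
from collections import defaultdict

def group_by(rows, key):
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups

def top_sample(rows, n):
    """Take the first n instances."""
    return rows[:n]

def random_sample(rows, n, seed=0):
    """Take n instances at random, without replacement."""
    return random.Random(seed).sample(rows, n)

def stratified_sample(rows, key, fraction, seed=0):
    """Take the same fraction from every group, keeping relative frequencies."""
    rng = random.Random(seed)
    picked = []
    for members in group_by(rows, key).values():
        picked += rng.sample(members, round(len(members) * fraction))
    return picked

def under_sample(rows, key, seed=0):
    """Shrink every group to the size of the smallest group."""
    rng = random.Random(seed)
    groups = group_by(rows, key)
    size = min(len(members) for members in groups.values())
    return [row for members in groups.values() for row in rng.sample(members, size)]

def over_sample(rows, key, seed=0):
    """Grow every group to the size of the largest group by duplicating instances."""
    rng = random.Random(seed)
    groups = group_by(rows, key)
    size = max(len(members) for members in groups.values())
    return [
        row
        for members in groups.values()
        for row in members + rng.choices(members, k=size - len(members))
    ]

rows = ["a"] * 8 + ["b"] * 2
same = lambda row: row  # the group of an instance is the instance itself here

print(top_sample(rows, 3))
print(sorted(stratified_sample(rows, same, 0.5)))
print(sorted(under_sample(rows, same)))
print(len(over_sample(rows, same)))
```

Top sampling is the cheapest but inherits any ordering bias in the data; here it returns only group "a".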