Outliers Flashcards

1
Q

An _ is a data point that differs significantly from other values in a dataset.

A

Outlier

2
Q

A single data point that significantly deviates from the rest of the dataset.

A

Point Outlier

3
Q

This type of outlier is an isolated data point that is far away from the main body of the data.

A

Global outlier

It is often easy to identify and remove.

4
Q

This type of outlier is a data point that is unusual in a specific context but may not be an outlier in a different context.

A

Contextual outlier

It is often more difficult to identify and may require additional information or domain knowledge to determine its significance.

5
Q

What causes outliers?

A
  • variability in data
  • measurement errors
  • novel phenomena
6
Q

Why is it important to identify and handle outliers?

A

Outliers can skew results and affect the performance of machine learning models. By identifying and removing or handling outliers effectively, we can prevent them from biasing the model, reducing its performance, and hindering its interpretability.

7
Q

The analysis of outlier data is referred to as _.

A

Outlier Analysis or Outlier Mining

8
Q

_ plays a crucial role in ensuring the quality and accuracy of machine learning models.

A

Outlier detection

9
Q

What are some techniques to detect outliers?

A
  • statistical tests
    1. Z-Score
    2. Interquartile Range (IQR)
  • visualization techniques
    1. box plots
    2. histograms
  • machine learning algorithms
10
Q

What are some ways to handle outliers?

A
  • removing outliers
  • transforming data
  • using models that are robust to outliers
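
The three options above can be sketched on a small list. This is only an illustration: the values and the cutoff of 100 are made up, and the median stands in for a robust method rather than a full robust model.

```python
import math
import statistics

# Hypothetical values: four typical points and one extreme point.
values = [10, 20, 30, 40, 1000]

# Option 1: remove points beyond a cutoff chosen for this illustration only.
cutoff = 100
removed = [v for v in values if v <= cutoff]

# Option 2: transform the data; a log scale pulls the extreme value closer.
logged = [math.log10(v) for v in values]

# Option 3: use a statistic that is robust to outliers, such as the median.
robust_center = statistics.median(values)

print(removed)
print([round(v, 2) for v in logged])
print(robust_center)
```

On the log scale the extreme point is about 3 times the smallest value instead of 100 times, which is why transformations are often tried before deleting data.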
11
Q

A _, such as the median, is relatively unaffected by extreme values.

A

resistant statistic

A statistic is resistant if it is relatively unaffected by extreme values.

12
Q

Which statistic is resistant, the mean or the median?

A

The median (middle value) is resistant while the mean (average) is not.

df.cgpa.mean()

Example: World Gross (in millions)

With Harry Potter
Mean = $150,742,300
Median = $76,658,500

Without Harry Potter
Mean = $141,889,900
Median = $75,009,000
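
A minimal sketch of the same effect, using made-up numbers rather than the movie data above: adding one very large value moves the mean a lot and the median only slightly.

```python
import statistics

# Hypothetical values: five typical grosses, then one blockbuster added.
grosses = [60, 70, 75, 80, 90]
with_outlier = grosses + [1300]

mean_without = statistics.mean(grosses)
median_without = statistics.median(grosses)

mean_with = statistics.mean(with_outlier)
median_with = statistics.median(with_outlier)

print(mean_without, median_without)
print(round(mean_with, 2), median_with)
```

Here the mean jumps from 75 to roughly 279, while the median moves only from 75 to 77.5.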

13
Q

What should you do if an outlier is not a mistake?

A

Run the analysis twice: once with the outlier and once without, to assess its impact.
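
One way to do this, sketched with hypothetical CGPA-like values and a helper named `summarize` that is not from the original deck: compute the same summary with and without the suspect point and compare.

```python
import statistics

def summarize(values):
    """Return the statistics we want to compare across the two runs."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

# Hypothetical data: 9.9 is a genuine but extreme value.
scores = [6.1, 6.8, 7.0, 7.2, 7.5, 9.9]
suspect = 9.9

with_outlier = summarize(scores)
without_outlier = summarize([v for v in scores if v != suspect])

# How much each statistic changes when the outlier is included.
impact = {name: with_outlier[name] - without_outlier[name] for name in with_outlier}
print(impact)
```

If the conclusions are the same in both runs, the outlier matters little; if they differ, report both.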

14
Q

This measures the spread of the data from the mean.

A

Standard deviation

df.cgpa.std()

Sample standard deviation: s
Population standard deviation: σ (“sigma”)
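
The difference between s and σ is the divisor: n − 1 for a sample, n for a population. A small sketch with made-up values, using Python's standard library (pandas' `Series.std()` computes the sample version by default):

```python
import statistics

# Hypothetical values with mean 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

s = statistics.stdev(values)       # sample standard deviation (divides by n - 1)
sigma = statistics.pstdev(values)  # population standard deviation (divides by n)

print(round(s, 3), sigma)
```

The sample version is always slightly larger, and the gap shrinks as n grows.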

15
Q

What does a larger standard deviation indicate?

A

A larger standard deviation indicates more variability and that the data are more spread out.

16
Q

For a bell-shaped distribution, about _ of the data falls within two standard deviations of the mean.

A

95%
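
The 95% figure can be checked against an exact normal distribution. This sketch assumes a perfectly normal (bell-shaped) population; real data will only match approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_one = share_within(1)
within_two = share_within(2)
within_three = share_within(3)

print(round(within_one, 4), round(within_two, 4), round(within_three, 4))
```

The three values are roughly 68%, 95%, and 99.7%.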

For a population, about 95% of the data will be between µ – 2σ and µ + 2σ, where σ (sigma) is the population standard deviation.

17
Q

A _ indicates how many standard deviations a value is from the mean.

A

Z-score

Z = (X - mean) / Standard Deviation

For a sample, Z = (x - x̄) / s. For a population, x̄ is replaced with µ and s is replaced with σ (sigma).

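Remove outliers using z-score
from scipy import stats
# z-score of each CGPA value (scipy uses the population standard deviation by default)
df['cgpa_zscore'] = stats.zscore(df.cgpa)
# keep only rows within 3 standard deviations of the mean
df[(df.cgpa_zscore > -3) & (df.cgpa_zscore < 3)]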
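
The same filter can be sketched without pandas or scipy to show what happens at each step. The CGPA values here are made up, and the population standard deviation is used to mirror scipy's default.

```python
import statistics

# Hypothetical CGPA values: twenty typical scores and one extreme score.
cgpa = [6.8, 6.9, 7.0, 7.1, 7.2] * 4 + [9.9]

mean = statistics.mean(cgpa)
sd = statistics.pstdev(cgpa)  # population standard deviation

# Z = (X - mean) / Standard Deviation for every value.
z_scores = [(x - mean) / sd for x in cgpa]

# Keep only values whose z-score lies strictly between -3 and 3.
kept = [x for x, z in zip(cgpa, z_scores) if -3 < z < 3]

print(len(cgpa), len(kept))
```

Only the 9.9 score falls outside the ±3 band, so twenty values remain.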

18
Q

Advantages of Z-score

A
  • Straightforward
  • Easy to use
  • Useful for normally distributed data
  • Quantifies deviation
19
Q

Disadvantages of Z-score

A
  • Sensitive to non-uniform or skewed data distributions
  • Not effective in datasets with many outliers
  • Assumes data follows a normal distribution, making it less reliable for non-normal distributions
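
The second point can be illustrated with a small made-up dataset: when a fifth of the values are extreme, they inflate the mean and standard deviation enough that none of them reaches a z-score of 3. This is a sketch of the effect, not a general proof.

```python
import statistics

# Hypothetical data: twenty values near 10 and five extreme values at 100.
values = [9, 10, 11, 10] * 5 + [100] * 5

mean = statistics.mean(values)
sd = statistics.pstdev(values)

z_scores = [(x - mean) / sd for x in values]

# Flag anything more than 3 standard deviations from the mean.
flagged = [x for x, z in zip(values, z_scores) if abs(z) > 3]

print(mean, round(sd, 2), round(max(z_scores), 2), flagged)
```

The extreme values end up with a z-score of about 2, so the ±3 rule flags nothing.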
20
Q

What is considered an extreme z-score?

A

A z-score below -2 or above 2 (a stricter cutoff of ±3 is also common)

21
Q

The _ divide data into four equal parts.

A

Quartiles

Q1 is the median of the lower half, and Q3 is the median of the upper half.

22
Q

What is a five-number summary?

A

minimum (Min) = smallest data value
Q1 = median of the values below m
median (m) = middle data value
Q3 = median of the values above m
maximum (Max) = largest data value
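
A minimal sketch of the definition above, where Q1 and Q3 are the medians of the values below and above m. The function name `five_number_summary` is an illustration, and library functions that interpolate quantiles may give slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, the middle value itself belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )

odd_summary = five_number_summary([1, 3, 5, 7, 9, 11, 13])
even_summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8])
print(odd_summary)
print(even_summary)
```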

23
Q

The _ is the value which is greater than P% of the data.

A

Pth percentile
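
Going the other way, the percentile of a given value is the share of the data that falls below it. A small sketch with made-up scores; `percentile_rank` is a hypothetical helper, not a library function.

```python
def percentile_rank(values, x):
    """Percentage of values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical scores: the integers 1 through 100.
scores = list(range(1, 101))

rank = percentile_rank(scores, 92)
print(rank)
```

A score of 92 is greater than 91 of the 100 values, so it sits at the 91st percentile.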

We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better.

We could also have used percentiles:
  • ACT score of 28: 91st percentile
  • SAT score of 2100: 97th percentile

24
Q

The _ is Q3 - Q1, representing the middle 50% of the data.

A

Interquartile Range (IQR)

Remove outliers using IQR

q1 = df['placement-exam-marks'].quantile(0.25)
q3 = df['placement-exam-marks'].quantile(0.75)
iqr = q3 - q1
iqr

# fences: 1.5 * IQR beyond each quartile
upper = q3 + (1.5 * iqr)
lower = q1 - (1.5 * iqr)

# keep only rows inside the fences
df[(df['placement-exam-marks'] < upper) & (df['placement-exam-marks'] > lower)]

25
Q

Advantages of IQR

A
  • Robust in non-normal distributions
  • Less affected by extreme values or outliers
  • Effective for skewed datasets
26
Q

Disadvantages of IQR

A
  • Will not detect outliers effectively in small datasets
  • May miss outliers in the tails if the distribution is highly skewed
27
Q

Is the IQR resistant to outliers?

A

Yes, the IQR is resistant to outliers.

28
Q

How can outliers be identified using the IQR?

A

Outliers are data points smaller than Q1 − 1.5(IQR) or larger than Q3 +1.5(IQR).

upper = q3 + (1.5 * iqr)
lower = q1 - (1.5 * iqr)

df[(df['placement-exam-marks'] < upper) & (df['placement-exam-marks'] > lower)]
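
The rule can also be sketched end to end on a plain list. The marks are made up, and `statistics.quantiles` with `method="inclusive"` is one of several quartile conventions, so other tools may place the fences slightly differently.

```python
import statistics

# Hypothetical exam marks with one extreme value.
marks = [10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 60]

# Quartiles by linear interpolation between the sorted data points.
q1, q2, q3 = statistics.quantiles(marks, n=4, method="inclusive")
iqr = q3 - q1

upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr

outliers = [m for m in marks if m < lower or m > upper]
kept = [m for m in marks if lower < m < upper]

print(q1, q3, iqr, lower, upper, outliers)
```

With Q1 = 13.5 and Q3 = 18.5 the fences are 6 and 26, so only the mark of 60 is flagged.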

29
Q

A _ visualizes the five-number summary and shows outliers as points beyond the whiskers.

A

Boxplot

import seaborn as sb
sb.boxplot(x=df['placement-exam-marks'])

Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier.

30
Q

What is the relationship between range and outliers?

A

The range is not resistant to outliers because it depends entirely on extreme values.

Range = Max – Min
IQR = Q3 – Q1
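
A short comparison with made-up marks shows the difference: adding one extreme value multiplies the range but barely changes the IQR. The quartiles here use `statistics.quantiles` with `method="inclusive"`, which is an assumption about the quartile convention.

```python
import statistics

def value_range(values):
    """Range = Max - Min."""
    return max(values) - min(values)

def iqr(values):
    """IQR = Q3 - Q1, with quartiles found by linear interpolation."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical marks, then the same marks with one extreme value added.
marks = [10, 12, 13, 14, 15, 16, 17, 18, 19, 20]
marks_with_outlier = marks + [60]

print(value_range(marks), value_range(marks_with_outlier))
print(iqr(marks), iqr(marks_with_outlier))
```

The range goes from 10 to 50, while the IQR only moves from 4.5 to 5.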

31
Q

When should you use the mean and standard deviation?

A

Use the mean and standard deviation when you want to incorporate all data points into the analysis, but be cautious of outliers.

32
Q

What percentile is a data value at if it is greater than 75% of the data?

A

The 75th percentile (or Q3)