Outliers Flashcards by Gabbie Pedrialva

An _ is a data point that differs significantly from other values in a dataset.

Outlier

How well did you know this?

Not at all

Perfectly

A single data point that significantly deviates from the rest of the dataset.

Point Outlier

How well did you know this?

Not at all

Perfectly

This type of outlier is an isolated data point that is far away from the main body of the data.

Global outlier

It is often easy to identify and remove.

How well did you know this?

Not at all

Perfectly

This type of outlier is a data point that is unusual in a specific context but may not be outlier in a different context.

Contextual outlier

It is often more difficult to identify and may require additional information or domain knowledge to determine its significance.

How well did you know this?

Not at all

Perfectly

What causes outliers?

variability in data
measurement errors
novel phenomena

How well did you know this?

Not at all

Perfectly

Why is it important to identify and handle outliers?

Outliers can skew results and affect the performance of machine learning models. By identifying and removing or handling outliers effectively, we can prevent them from biasing the model, reducing its performance, and hindering its interpretability.

How well did you know this?

Not at all

Perfectly

The analysis of outlier data is referred to as _.

Outlier Analysis or Outlier Mining

How well did you know this?

Not at all

Perfectly

_ plays a crucial role in ensuring the quality and accuracy of machine learning models.

Outlier detection

How well did you know this?

Not at all

Perfectly

What are some techniques to detect outliers?

statistical tests

Z-Score
Interquartile Range (IQR)

visualization techniques
1. box plots
2. histograms
machine learning algorithms

How well did you know this?

Not at all

Perfectly

What are some ways to handle outliers?

removing outliers
transforming data
using models that are robust to outliers

How well did you know this?

Not at all

Perfectly

A _ is relatively unaffected by extreme values, such as the median.

resistant statistic

A statistic is resistant if it is relatively unaffected by extreme values.

How well did you know this?

Not at all

Perfectly

Which statistic is resistant, the mean or the median?

The median (middle value) is resistant while the mean (average) is not.

df.cgpa.mean()

Example: World Gross (in millions)

With Harry Potter
Mean = $150,742,300
Median = $76,658,500

Without Harry Potter
Mean = $141,889,900
Median = $75,009,000

How well did you know this?

Not at all

Perfectly

What should you do if an outlier is not a mistake?

Run the analysis twice: once with the outlier and once without, to assess its impact.

How well did you know this?

Not at all

Perfectly

This measures the spread of the data from the mean.

Standard deviation

df.cgpa.std()

Sample standard deviation: s
Population standard deviation:  (“sigma”)

How well did you know this?

Not at all

Perfectly

What does a larger standard deviation indicate?

A larger standard deviation indicates more variability and that the data are more spread out.

How well did you know this?

Not at all

Perfectly

For a bell-shaped distribution, about _ of the data falls within two standard deviations of the mean.

Study These Flashcards

95%

For a population, 95% of the data will be between µ – 2 and µ + 2
 = sigma symbol

A _ indicates how many standard deviations a value is from the mean.

Study These Flashcards

Z-score

Z = (X - mean) / Standard Deviation

For a population, !𝑥 is replaced with µ and s is replaced with 
 = sigma symbol

Remove outliers using z-score
from scipy import stats
df[‘cgpa_zscore’] = stats.zscore(df.cgpa)
df[(df.cgpa-zscore > -3) & (df.cgpa-zscore < 3)]

Advantages of Z-score

Study These Flashcards

Straightforward
Easy to use
Useful for normally distributed data
Quantifies deviation

Disadvantages of Z-score

Study These Flashcards

Sensitive to non-uniform or skewed data distributions
Not effective in datasets with many outliers
Assumes data follows a normal distribution, making it less reliable for non-normal distributions

What is considered an extreme z-score?

Study These Flashcards

A z-score beyond -2 or 2

The _ divide data into four equal parts.

Study These Flashcards

Quartiles

Q1 is the median of the lower half, Q3 is the median of the upper half.

What is a five-number summary?

Study These Flashcards

minimum (Min) = smallest data value
Q1= median of the values below m
median (m) = middle data value
Q3 = median of the values above m
maximum (Ma) = largest data value

The _ is the value which is greater than P% of the data

Study These Flashcards

Pth percentile

We already used z-scores to determine whether an SAT score of 2100 or
an ACT score of 28 is better

 We could also have used percentiles:
 ACT score of 28: 91st percentile
 SAT score of 2100: 97th percentile

The _ is Q3 - Q1, representing the middle 50% of the data.

Study These Flashcards

Interquartile Range (IQR)

Remove outliers using IQR

q1 = df.placement-exam-marks.quantile(0.25)
q3 = df.placement-exam-marks.quantile(0.75)
iqr = q3 - q1
iqr

upper = q3 + (1.5 * iqr)
lower = q1 - (1.5 * iqr)

df[(df[‘placement-exam-marks’] < upper) & (df[‘placement-exam-marks’] > lower)]

Advantages of IQR

* Robust in non-normal distributions * Less affected by extreme values or outliers * Effective for skewed datasets

Disadvantages of IQR

* Will not detect outliers effectively in small datasets * May miss outliers in the tails if the distribution is highly skewed

Is the IQR resistant to outliers?

Yes, the IQR is resistant to outliers.

How can outliers be identified using the IQR?

Outliers are data points smaller than Q1 − 1.5(IQR) or larger than Q3 +1.5(IQR). ## Footnote upper = q3 + (1.5 * iqr) lower = q1 - (1.5 * iqr) df[(df['placement-exam-marks'] < upper) & (df['placement-exam-marks'] > lower)]

A _ visualizes the five-number summary and shows outliers as points beyond the whiskers.

Boxplot | sb.boxplot(df.placement-exam-marks) ## Footnote Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier

What is the relationship between range and outliers?

The range is **not resistant** to outliers because it **depends entirely on extreme values**. | Range = Max – Min IQR = Q3 – Q1

When should you use the mean and standard deviation?

Use the mean and standard deviation when you want to incorporate all data points into the analysis, but be cautious of outliers.

What percentile is a data value at if it is greater than 75% of the data?

The 75th percentile (or Q3)

Outliers Flashcards

(32 cards)