Exploratory Data Analysis Flashcards by Dawn Gunter

What does EDA stand for?

exploratory data analysis

What is the purpose of EDA?

to convert the available data from raw form into an informative form in which the main features of the data are illuminated

What are the three things we should always do when performing EDA?

use visual displays plus numerical summaries
describe the overall pattern and mention any striking deviations from that pattern
interpret the results we got in context

What are the two categories of variable when examining the distribution of a single variable?

categorical and quantitative

What are the four types of methods used to summarize the distribution of a categorical variable?

pie chart
bar chart
pictogram (can be misleading)
category (group) percentages
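
As a rough illustration of the last item, category percentages can be computed by counting each category and dividing by the total. The sketch below assumes a made-up list of eye colors; the function name `category_percentages` and the data are hypothetical, not part of the deck.

```python
from collections import Counter


def category_percentages(values):
    """Return {category: percentage of observations} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}


# Hypothetical sample: eye color recorded for 8 people.
eye_color = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]
percentages = category_percentages(eye_color)
print(percentages)
```

The same percentages are what a pie chart or bar chart of this variable would display.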

What are the three types of methods used to summarize the distribution of a quantitative variable?

histogram
stemplot
descriptive statistics

When describing the distribution displayed by a histogram or stemplot, what are the four factors that should be described?

Overall pattern:
1. shape
2. center
3. spread

Deviations from the pattern:
4. outliers

What do descriptive statistics generally cover?

a measure of center plus a measure of spread

What descriptive statistics should be included in the numerical summary of a quantitative variable when the distribution is symmetric with no outliers?

1. mean
2. standard deviation

What descriptive statistics should be included in the numerical summary of a quantitative variable when the distribution is skewed?

the five-number summary, with the median as the measure of center and the IQR as the measure of spread

What makes up the five number summary?

Min (minimum value)
Q1 (quartile 1)
M (median)
Q3 (quartile 3)
Max (maximum value)
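
A minimal sketch of how the five numbers might be computed is shown below. It assumes the "median of each half" convention for quartiles (the median itself is left out of both halves when the number of observations is odd); other textbooks and software use slightly different quartile rules, so results can differ. The function name and the sample data are hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return (Min, Q1, M, Q3, Max) using the median-of-halves convention.

    Assumes at least two observations.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]          # observations below the median position
    upper = values[half + n % 2:]  # observations above the median position
    return (values[0], median(lower), median(values), median(upper), values[-1])


sample = [1, 3, 4, 7, 8, 10, 12, 15, 20]  # hypothetical data
summary = five_number_summary(sample)
print(summary)  # (1, 3.5, 8, 13.5, 20)
```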

What rule is used for identifying outliers?

the 1.5(IQR) criterion, where IQR is the interquartile range

What are the two cutoffs in the 1.5(IQR) criterion for outliers?

an observation is a suspected outlier if it is:
below Q1 - 1.5(IQR)
above Q3 + 1.5(IQR)

How do you find the 1.5(IQR) cutoffs?

IQR = Q3 - Q1
lower cutoff = Q1 - 1.5(IQR)
upper cutoff = Q3 + 1.5(IQR)
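
The arithmetic can be sketched in a few lines. The example below takes Q1 and Q3 as given (here 3.5 and 13.5, the median-of-halves quartiles of the made-up sample) and flags anything below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR). The function name `iqr_outliers` and the data are hypothetical.

```python
def iqr_outliers(data, q1, q3):
    """Apply the 1.5(IQR) criterion and return (lower cutoff, upper cutoff, outliers)."""
    iqr = q3 - q1
    lower_cutoff = q1 - 1.5 * iqr
    upper_cutoff = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_cutoff or x > upper_cutoff]
    return lower_cutoff, upper_cutoff, outliers


# Hypothetical data; Q1 = 3.5 and Q3 = 13.5 are supplied rather than computed here.
sample = [1, 3, 4, 7, 8, 10, 12, 15, 40]
lower_cutoff, upper_cutoff, outliers = iqr_outliers(sample, q1=3.5, q3=13.5)
print(lower_cutoff, upper_cutoff, outliers)  # -11.5 28.5 [40]
```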

What are three factors to consider when deciding whether to keep outliers in your data?

if the outlier, though extreme, was produced by the same physical/biological process as the rest of the data and can be expected to occur again, it should be kept in the data
if the outlier was produced under fundamentally different conditions or by a different process from the rest of the data, it can be removed when the goal is to investigate only the process that produced the rest of the data
the outlier may indicate a mistake in the data (such as a typo or a measuring error), in which case it should be corrected if possible or else removed

Which relationship uses boxplots for examination?

C > Q (categorical explanatory variable, quantitative response variable)

To which distribution shape does the Standard Deviation Rule apply?

normal distribution

What is the Standard Deviation Rule?

tells us what percentage of observations fall within 1, 2, or 3 standard deviations of the mean

What are the percentage ranges under the Standard Deviation Rule?

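about 68% of observations fall within 1 standard deviation of the mean
about 95% fall within 2 standard deviations of the mean
about 99.7% fall within 3 standard deviations of the mean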
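
These percentages can be checked numerically against a normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Share of observations within k standard deviations of the mean.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} SD of the mean: {share:.1%}")
```

The exact shares are roughly 68.3%, 95.4%, and 99.7%, which the rule rounds to 68%, 95%, and 99.7%.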

What is the symbol for the mean?

x̄ (x with a line over it, read "x-bar")

What is the symbol for the median?

M

How is standard deviation calculated? (5 steps)

find the mean of the data
find the deviations from the mean (subtract the mean from each observation)
square each of the deviations
find the variance of the data: average the squared deviations by adding them up and dividing by (n - 1), where n is the sample size
find the SD by finding the square root of the variance
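
The five steps above can be sketched directly in code. The function name and the sample data are hypothetical; the division by (n - 1) follows the card.

```python
import math


def sample_standard_deviation(data):
    """Compute the sample standard deviation following the five steps."""
    n = len(data)
    mean = sum(data) / n                       # step 1: mean of the data
    deviations = [x - mean for x in data]      # step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]     # step 3: squared deviations
    variance = sum(squared) / (n - 1)          # step 4: variance, dividing by n - 1
    return math.sqrt(variance)                 # step 5: square root of the variance


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5
sd = sample_standard_deviation(sample)
print(round(sd, 3))  # 2.138
```

For this sample the squared deviations add up to 32, so the variance is 32/7 and the standard deviation is its square root.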

What type(s) of graphical display and/or numerical summaries are used for C > Q? (2)

1. side-by-side boxplots
2. numerical summaries (descriptive statistics of the quantitative variable for each category)

What types of graphical display and/or numerical summaries are used for C > C? (2)

1. two-way table
2. conditional percentages
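
As an illustration, conditional percentages can be computed by dividing each count in a two-way table by its row total. The sketch assumes the rows hold the explanatory variable's categories and the columns the response variable's categories; the table, labels, and function name are made up.

```python
# Hypothetical two-way table of counts:
# rows = categories of the explanatory variable, columns = categories of the response.
two_way = {
    "group A": {"yes": 30, "no": 20},
    "group B": {"yes": 10, "no": 40},
}


def conditional_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {col: 100 * count / row_total for col, count in row.items()}
    return result


conditional = conditional_percentages(two_way)
print(conditional)
```

Comparing the "yes" percentage across rows (60% versus 20% here) is what reveals the relationship between the two categorical variables.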

What types of graphical display and/or numerical summaries are used for Q > Q? (2)

1. scatterplot
2. numerical summary: the correlation coefficient r (only if the scatterplot displays a linear relationship)

When describing the relationship displayed by the scatterplot, what 3 items should be considered?

1. overall pattern (direction, form, strength)
2. deviations from the pattern (outliers)
3. labeling the scatterplot may add insight into the relationship

What is the correlation coefficient?

r, a numerical measure of the direction and strength of the linear relationship between two quantitative variables
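
One common way to compute r, sketched below, is to standardize each variable, multiply the standardized values pairwise, add them up, and divide by (n - 1). The deck does not give this formula, so treat it as an assumption; the function name and data are hypothetical.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r as the averaged product of standardized values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = [((x - mean_x) / sx) * ((y - mean_y) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # perfectly linear, positive direction
falling = [10, 8, 6, 4, 2]   # perfectly linear, negative direction
r_rising = correlation(xs, rising)
r_falling = correlation(xs, falling)
print(round(r_rising, 6), round(r_falling, 6))
```

The sign of r gives the direction, and values close to 1 or -1 indicate a strong linear relationship.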

Will a strong correlation coefficient prove a linear relationship without reviewing a scatterplot?

no; r only measures the strength of a linear relationship, so it must be combined with a scatterplot to confirm that the relationship is actually linear and not some other form

What is the least squares regression line?

the line that has the smallest sum of squared vertical deviations of the data points from the line

What is the intercept of a line?

the value that Y takes when X = 0

What is the symbol for the intercept?

a

What is the slope of a line?

the change in Y for every 1-unit increase in X

What is the symbol for the slope?

b

What is the equation of the least squares regression line?

Y = a + bX

Y = value on the y-axis
a = intercept
b = slope
X = value on the x-axis

What is the equation for finding the slope (b) of the line?

b = r (Sy / Sx)

r = correlation coefficient
Sx = standard deviation of the explanatory variable's values
Sy = standard deviation of the response variable's values

What is the formula for the intercept (a) of the line?

a = ȳ - b(x̄), where ȳ and x̄ are the means of the response and explanatory variables
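
The slope and intercept formulas can be combined into a short sketch that fits the line Y = a + bX. The function name and data are hypothetical, and r is computed here from the sample covariance divided by the product of the standard deviations, which is an assumption beyond what the cards state.

```python
import math


def least_squares_line(xs, ys):
    """Return (a, b) for Y = a + bX using b = r(Sy / Sx) and a = mean_y - b * mean_x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx           # slope
    a = mean_y - b * mean_x   # intercept
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = least_squares_line(xs, ys)
print(round(a, 6), round(b, 6))
```

Because the points lie exactly on a line, the fitted intercept and slope recover 1 and 2 up to floating-point rounding.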

Why does an observed relationship not imply causation?

because of possible lurking variables

What is Simpson's paradox?

when including a lurking variable in the analysis leads us to rethink the direction of the relationship
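
A small numerical sketch may make the paradox concrete. The counts below are made up for illustration only: treatment A has the higher success rate within each severity group, yet treatment B looks better once the groups are pooled and the lurking variable (case severity) is ignored.

```python
# Made-up (successes, trials) counts, for illustration only.
counts = {
    "mild cases":   {"A": (8, 10),   "B": (70, 100)},
    "severe cases": {"A": (30, 100), "B": (2, 10)},
}


def success_rate(successes, trials):
    return successes / trials


# Rates within each level of the lurking variable (case severity).
within = {
    group: {t: success_rate(*pair) for t, pair in by_treatment.items()}
    for group, by_treatment in counts.items()
}

# Rates after pooling the groups, i.e. ignoring the lurking variable.
overall = {}
for t in ("A", "B"):
    successes = sum(counts[group][t][0] for group in counts)
    trials = sum(counts[group][t][1] for group in counts)
    overall[t] = success_rate(successes, trials)

for group, rates in within.items():
    print(group, rates)
print("overall", overall)
```

The reversal happens because treatment A was mostly given to severe cases, which have low success rates under either treatment.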

Exploratory Data Analysis Flashcards

(38 cards)