Exam 3 Lecture 4 Flashcards
Weird number? Data literacy = asking: Did they have a plan?
With all data, the goal is always to use as much as possible!
- Some people think: if a data point looks weird, chuck it!
BUT IT’S NOT THAT SIMPLE!
- Yes, errors add BIAS
- But chucking data can also introduce BIAS
Like in ‘real life’ -> BIAS is bad!
To avoid BIAS, you need ______________ about what you are chucking, and why.
A clear plan about what you are chucking, and why.
Errors vs. outliers
Errors = not ‘real’ data -> always discard
Outlier = ‘real’ data -> decide: keep, discard, or transform
Statisticians have _______ about discarding a data point. They also have multiple ways to ________ a data point to make it more ‘manageable’ but still ‘different’.
Statisticians have rules about discarding a data point. They also have multiple ways to transform a data point to make it more ‘manageable’ but still ‘different’.
A good statistician…
Tries to use all available data whenever possible.
Just because it’s weird/unexpected…
Extreme numbers (at the tail end of a range) are rare, but they do happen, so you can’t just say: nah, I doubt it!
Remember: your data are from a SAMPLE that represents a POPULATION. Sometimes a sample represents every possible thing, sometimes it doesn’t.
There is no single rule about what an outlier is.
- All data ‘behave’ differently, but you can look at the DISTRIBUTION of your data (that is, look at how the numbers are spread out)
Even if a data point is real and correct, it might be so extreme that it adds BIAS to your results, affecting both the mean and the variance, just as an error would. When a data point adds bias, it is called an influential outlier.
When a data point adds bias, it is called an
Influential outlier
Aren’t you important! The power of outliers example
When a data point is so extreme that it throws off the central tendency, the variance or the distribution, it is called an influential outlier.
Data: The number of students who fail Art History per semester
(1+2+3+2+2)/5 = 10/5 = 2 (average), 0.7 (std dev)
(1+2+3+2+22)/5 = 30/5 = 6 (average), 9.0 (std dev)
Wow! One bad semester and look at the difference!
We need to determine if the 22 is real data. Go back to the source.
IF…
The class size was 20 until the course went online. In the 5th semester, 200 students registered.
Possibly real. Keep, but you NEED to TRANSFORM the data (to % failing) to account for different class sizes.
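Here is a minimal sketch of that transformation, assuming the class sizes described in this scenario (20 students for the first four semesters, 200 in the fifth). The variable names are made up for illustration.

```python
from statistics import mean, stdev

failures = [1, 2, 3, 2, 22]
# ASSUMPTION: class size was 20 for semesters 1-4 and 200 in semester 5,
# as described in the scenario.
class_sizes = [20, 20, 20, 20, 200]

# The same equation is applied to EVERY data point: failures / class size * 100.
percent_failing = [round(100 * f / n, 1) for f, n in zip(failures, class_sizes)]

print(percent_failing)
print(round(mean(percent_failing), 1), round(stdev(percent_failing), 1))
```

As a percentage, the fifth semester (11%) sits right in the middle of the other semesters instead of towering over them.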
IF…
The class size was CAPPED at 20 each semester.
Not real. Can’t fail more than took the class! Discard. You lose a data point.
(1+2+3+2)/4 = 8/4 = 2 (average), 0.8 (std dev)
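The three calculations in this example can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module. It assumes the std dev on the cards is the sample standard deviation (dividing by n - 1), which matches the reported numbers. The dictionary labels are made up for illustration.

```python
from statistics import mean, stdev

# Failures per semester, taken from the Art History example.
semesters = {
    "typical": [1, 2, 3, 2, 2],
    "with the 22": [1, 2, 3, 2, 22],
    "22 discarded": [1, 2, 3, 2],
}

summary = {}
for label, data in semesters.items():
    # stdev() is the SAMPLE standard deviation (divides by n - 1).
    summary[label] = (round(mean(data), 1), round(stdev(data), 1))
    print(f"{label}: mean = {summary[label][0]}, std dev = {summary[label][1]}")
```

One value moves the mean from 2 to 6 and the std dev from 0.7 to 9.0, which is what makes it an influential outlier.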
Managing outliers
You get to make a rule about how extreme is too extreme. Some examples:
RULE: discard if >3 std dev from next closest point
RULE: discard if above a specific # (resting HR > 120; drinks per day > 20)
As long as:
- you believe the value is possible (not error)
- other people will agree that this is reasonable
- you have a good rationale AND
- you apply this rule to all data
It can still be good/reliable data.
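The two example rules can be sketched in code. This is only an illustration: the function names and the resting heart rate values are made up, and the "3 std dev from next closest point" rule is read here as "the gap to the next closest point is more than 3 times the std dev of the remaining points". That reading is an assumption, not a standard definition.

```python
from statistics import stdev

def apply_cutoff_rule(values, max_allowed):
    """Split values into (kept, discarded) using ONE cutoff for every point."""
    kept = [v for v in values if v <= max_allowed]
    discarded = [v for v in values if v > max_allowed]
    return kept, discarded

def is_far_from_neighbor(values, n_sd=3):
    """Check whether the largest value is more than n_sd std devs above the
    next closest point. ASSUMPTION: the std dev is computed from the
    remaining points (largest value left out). Needs at least 3 values."""
    ordered = sorted(values)
    candidate, rest = ordered[-1], ordered[:-1]
    gap = candidate - rest[-1]
    return gap > n_sd * stdev(rest)

resting_hr = [62, 71, 58, 134, 80, 66]  # made-up values for illustration
kept, discarded = apply_cutoff_rule(resting_hr, max_allowed=120)
print(kept, discarded)

print(is_far_from_neighbor([1, 2, 3, 2, 22]))  # the Art History data
```

Writing the rule as a function makes it easy to apply the same rule to all data, which is the last condition on the list.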
If we don’t look at outliers/errors…
We might mislead people with data.
Managing outliers through transformations
You can ‘fix’ the distribution of your data (make it more normal!)
A transformation is an equation that is applied to ALL data points to make the overall data distribution more normal.
A common transformation is a log transformation. By taking the log of all values, you can ‘pull in the tails’, especially a right tail. But since you can’t take the log of 0, people often first add a constant to every value (for example, x+1).
Takes the data from skewed (long right tail) to closer to normal.
An equation that is applied to all data points to make the overall data distribution more normal
A transformation
What is a log transformation and what does it do?
A log transformation is a type of transformation- an equation that is applied to all data points to make the overall data distribution more normal. By taking the log of all values, you can ‘pull in the tails’, especially a right tail.
By log transforming the values, the outlier is still there, but now manageable.
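A minimal sketch of a log transformation, using the Art History numbers. The choice of base 10 and the function name are assumptions for illustration. The cards only say "log", and any base pulls in the right tail in the same way.

```python
import math

def log_transform(values):
    """Apply log10(x + 1) to EVERY value; the + 1 keeps zeros valid."""
    return [math.log10(x + 1) for x in values]

raw = [1, 2, 3, 2, 22]
transformed = log_transform(raw)
print([round(t, 2) for t in transformed])

# How far the outlier sits from the next largest value, before and after.
raw_ratio = raw[-1] / 3               # 22 vs. next largest raw value (3)
log_ratio = transformed[-1] / transformed[2]
print(round(raw_ratio, 1), round(log_ratio, 1))

print(log_transform([0]))  # works because of the + 1
```

The 22 is still the largest value after the transformation, but it is much closer to the rest of the data.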
Central tendencies + variability =
A full picture.
Most often research uses mean and standard deviation.
How to report mean and standard deviation
We report it like this:
- 300 +/- 10
- 300 ± 10
- M: 300, SD: 10
- Mean = 300, Std Dev = 10
Standard error of the mean is also a common variability term.
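A minimal sketch of producing these reports from raw data. The function name is made up, and the standard error of the mean is computed with the usual formula SD / sqrt(n), which the cards do not spell out.

```python
from math import sqrt
from statistics import mean, stdev

def format_report(values):
    """Return (mean ± SD string, SEM) for a list of numbers."""
    m = mean(values)
    sd = stdev(values)             # sample standard deviation (n - 1)
    sem = sd / sqrt(len(values))   # standard error of the mean
    return f"{m:.1f} ± {sd:.1f}", sem

report, sem = format_report([1, 2, 3, 2, 2])
print(report)
print(f"SEM: {sem:.2f}")
```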
State examples of central tendency and examples of variability.
Central tendency: Mean, median, mode
Variability: Range, standard deviation, variance
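All six measures can be computed with Python's standard `statistics` module. This is a quick sketch on the Art History data that includes the 22. The dictionary names are made up for illustration.

```python
from statistics import mean, median, mode, stdev, variance

data = [1, 2, 3, 2, 22]

central = {
    "mean": mean(data),
    "median": median(data),
    "mode": mode(data),
}
spread = {
    "range": max(data) - min(data),
    "std dev": round(stdev(data), 1),     # sample standard deviation
    "variance": round(variance(data), 1), # sample variance
}

print(central)
print(spread)
```

The median and mode stay at 2 while the mean jumps to 6, so the outlier affects some measures far more than others.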
Graph
A graph is a visual representation of a table (or dataset)
When and why is graphical data analysis performed?
Often performed early on to determine whether the data “look good”
- Have enough variance (without variance, there is nothing to predict)
- Are normal
- Have no outliers
Often performed prior to (or following) statistical analyses to visualize patterns and relationships
- Commonly reported in research studies to help readers digest what the statistical tests mean
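An early "do the data look good?" check does not need fancy software. Here is a rough sketch of a text histogram, assuming small whole-number data. The function name is made up for illustration.

```python
from collections import Counter

def text_histogram(values):
    """Draw one row per whole number from min to max, one * per data point.
    ASSUMPTION: values are integers with a small range."""
    counts = Counter(values)
    lines = []
    for v in range(min(values), max(values) + 1):
        lines.append(f"{v:>3} | {'*' * counts[v]}")
    return "\n".join(lines)

print(text_histogram([1, 2, 3, 2, 22]))
```

The long run of empty rows between 3 and 22 makes the outlier easy to spot before any statistics are run.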
What is a line graph?
It uses lines to connect individual data points.
Used mainly to show:
a) change over time (time always goes on the x-axis)
b) relationship between two variables
The graph CAN include more than 1 line.
The line CAN be straight or curved.
Line graphs are used to look at the relationship between two variables (what happens to the # on the y-axis as the # on the x-axis gets bigger?)
This type of graph is used when both variables (1 on the x- and 1 on the y-axis) are CONTINUOUS (in other words, no groups!)
This type of graph uses continuous data. To statistically ANALYZE it, we’ll need to do a REGRESSION or CORRELATION
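As a rough illustration of the correlation option, here is a hand-written Pearson correlation for two continuous variables. The data (hours studied and exam score) are hypothetical and only there to show the calculation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(ss_x * ss_y)

hours = [1, 2, 3, 4, 5]         # hypothetical x-axis values
scores = [52, 60, 71, 75, 88]   # hypothetical y-axis values

r = pearson_r(hours, scores)
print(round(r, 2))
```

A value near +1 means y goes up as x gets bigger, a value near -1 means y goes down, and a value near 0 means no straight-line relationship.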