Exam 3 Lecture 4 Flashcards

1
Q

Weird number? Data literacy = asking: Did they have a plan?

A

With all data, the goal is always to use as much as possible!
- Some people think: if a data point looks weird, chuck it!
BUT IT’S NOT THAT SIMPLE!
- Yes, errors add BIAS
- But chucking data can also introduce BIAS
Like in ‘real life’ -> BIAS is bad!

2
Q

To avoid BIAS, you need ______________ about what you are chucking, and why.

A

A clear plan about what you are chucking, and why.

3
Q

Errors vs. outliers

A

Errors = not ‘real’ data -> always discard
Outlier = ‘real’ data -> decide whether to discard or transform

4
Q

Statisticians have _______ about discarding a data point. They also have multiple ways to ________ a data point to make it more ‘manageable’ but still ‘different’.

A

Statisticians have rules about discarding a data point. They also have multiple ways to transform a data point to make it more ‘manageable’ but still ‘different’.

5
Q

A good statistician…

A

Tries to use all available data whenever possible.

6
Q

Just because it’s weird/unexpected…

A

Rare numbers (at the tail end of a range) are rare, but they do happen, so you can’t just say ‘nah, I doubt it!’
Remember: your data are from a SAMPLE that represents a POPULATION. Sometimes a sample represents every possible thing, sometimes it doesn’t.

There is no single rule about what an outlier is.
- All data ‘behave’ differently but you can look at the DISTRIBUTION of your data (that is, look at the list of numbers)

Even if a data point is real and correct, it might be so extreme that it adds BIAS to your results, affecting both the mean and the variance, just as an error would. When a data point adds bias, it is called an influential outlier.

7
Q

When a data point adds bias, it is called an

A

Influential outlier

8
Q

Aren’t you important! The power of outliers example

A

When a data point is so extreme that it throws off the central tendency, the variance or the distribution, it is called an influential outlier.

Data: The number of students who fail Art History per semester
(1+2+3+2+2)/5 = 10/5 = 2 (average), 0.7 (std dev)
(1+2+3+2+22)/5 = 30/5 = 6 (average), 9.0 (std dev)
Wow! One bad semester and look at the difference!

We need to determine if the 22 is real data. Go back to the source.

IF…
The class size was 20 until the course went online. In the 5th semester, 200 students registered.
Possibly real. Keep. NEED to TRANSFORM the data (to % failing) to account for different class sizes.

IF…
The class size was CAPPED at 20 each semester.
Not real. Can’t fail more than took the class! Discard. You lose a data point.
(1+2+3+2)/4 = 8/4 = 2 (average), 0.8 (std dev)
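
The averages and standard deviations in this example can be checked with a short sketch. It assumes the std dev quoted on the card is the sample standard deviation (dividing by n - 1), which matches the card's numbers. The class sizes come from the first IF scenario (20 students per semester, then 200).

```python
from statistics import mean, stdev

# Failures per semester, copied from the example on this card.
typical = [1, 2, 3, 2, 2]
with_outlier = [1, 2, 3, 2, 22]
outlier_discarded = [1, 2, 3, 2]

for label, data in [
    ("typical", typical),
    ("with the 22", with_outlier),
    ("22 discarded", outlier_discarded),
]:
    # stdev() is the sample standard deviation (divides by n - 1).
    print(f"{label}: average = {mean(data):.1f}, std dev = {stdev(data):.1f}")

# First IF scenario: the class grew from 20 to 200, so transform to % failing.
class_sizes = [20, 20, 20, 20, 200]
percent_failing = [100 * f / n for f, n in zip(with_outlier, class_sizes)]
print("percent failing:", percent_failing)
```

After the transformation to % failing, the fifth semester (11%) sits inside the range of the other four, so the point can be kept without distorting the average.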

9
Q

Managing outliers

A

You get to make a rule about how extreme is too extreme. Some examples:
RULE: discard if >3 std dev from next closest point
RULE: discard if above a specific # (resting HR > 120; drinks per day > 20)

As long as:
- you believe the value is possible (not error)
- other people will agree that this is reasonable
- you have a good rationale AND
- you apply this rule to all data

It can still be good/reliable data.
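
A minimal sketch of applying one rule to ALL data, using the resting HR > 120 cutoff from the example above. The heart rate values and the function name `apply_rule` are hypothetical, made up for illustration.

```python
# Hypothetical resting heart rates (beats per minute); not from the lecture.
resting_hr = [58, 64, 71, 66, 135, 80, 62]

MAX_RESTING_HR = 120  # example cutoff from the rule on this card


def apply_rule(values, cutoff):
    """Split values into kept and discarded, using the same cutoff for every point."""
    kept = [v for v in values if v <= cutoff]
    discarded = [v for v in values if v > cutoff]
    return kept, discarded


kept, discarded = apply_rule(resting_hr, MAX_RESTING_HR)
print("kept:", kept)
print("discarded:", discarded)
```

Keeping the discarded values in their own list (rather than silently dropping them) makes it easy to report what was chucked, and why.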

10
Q

If we don’t look at outliers/errors…

A

We might mislead people with data.

11
Q

Managing outliers through transformations

A

You can ‘fix’ the distribution of your data (make it more normal!)
A transformation is an equation that is applied to ALL data points to make the overall data distribution more normal.
A common transformation is a log transformation. By taking the log of all values, you can ‘pull in the tails’, especially a right tail. But, since you can’t take a log of 0, people often first add a constant (x+1).

Takes the data from skewed to closer to normal.
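
A minimal sketch of the log(x+1) transformation described above, applied to every data point. The data are hypothetical, and the natural log is an assumption (the card does not say which base to use).

```python
import math
from statistics import mean, median

# Hypothetical right-tailed data, including a zero; not from the lecture.
raw = [0, 1, 2, 2, 3, 4, 40]

# Same equation for ALL points: log(x + 1). Adding 1 first avoids log(0).
transformed = [math.log(x + 1) for x in raw]

print("raw:  mean", round(mean(raw), 2), "median", median(raw))
print("log:  mean", round(mean(transformed), 2), "median", round(median(transformed), 2))
```

The largest value is still the largest after the transformation, but the gap between mean and median shrinks because the right tail has been pulled in.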

12
Q

An equation that is applied to all data points to make the overall data distribution more normal

A

A transformation

13
Q

What is a log transformation and what does it do?

A

A log transformation is a type of transformation: an equation that is applied to all data points to make the overall data distribution more normal. By taking the log of all values, you can ‘pull in the tails’, especially a right tail.
By log transforming the values, the outlier is still there, but now manageable.

14
Q

Central tendencies + variability =

A

A full picture.
Most often research uses mean and standard deviation.

15
Q

How to report mean and standard deviation

A

We report it like this:
- 300 +/- 10
- 300 ± 10
- M: 300, SD: 10
- Mean = 300, Std Dev = 10

Standard error of the mean is also a common variability term.
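
A minimal sketch of computing and reporting these values. The scores are hypothetical, and the standard error of the mean is computed with the usual formula, SD / sqrt(n).

```python
import math
from statistics import mean, stdev

# Hypothetical measurements; not from the lecture.
scores = [290, 310, 300, 295, 305]

m = mean(scores)
sd = stdev(scores)                 # sample standard deviation
sem = sd / math.sqrt(len(scores))  # standard error of the mean = SD / sqrt(n)

print(f"{m:.0f} ± {sd:.1f}")
print(f"M: {m:.0f}, SD: {sd:.1f}, SEM: {sem:.1f}")
```

The SEM is always smaller than the SD for more than one data point, so a report should say which of the two follows the ± sign.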

16
Q

State examples of central tendency and examples of variability.

A

Central tendency: Mean, median, mode
Variability: Range, standard deviation, variance

17
Q

Graph

A

A graph is a visual representation of a table (or dataset)

18
Q

When and why is graphical data analysis performed?

A

Often performed early on to determine whether the data “look good”
- Have enough variance (without variance, there is nothing to predict)
- Are normal
- Have no outliers

Often performed prior to (or following) statistical analyses to visualize patterns and relationships
- Commonly reported in research studies to help readers digest what the statistical tests mean

19
Q

What is a line graph?

A

It uses lines to connect individual data points.
Used mainly to show:
a) change over time (time is always on the x-axis)
b) relationship between two variables

The graph CAN include more than 1 line.
The line CAN be straight or curved.

Line graphs are used to look at the relationship between two variables (what happens to the # on the y-axis as the # on the x-axis gets bigger?)

This type of graph is used when both variables (1 on the x- and 1 on the y-axis) are CONTINUOUS (in other words, no groups!)

This type of graph uses continuous data. To statistically ANALYZE it, we’ll need to do a REGRESSION or CORRELATION
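
A minimal sketch of what a regression and a correlation compute for two continuous variables, written out with the standard least-squares formulas. The data are hypothetical (x = day, y = weight in lbs), chosen to fall on a perfectly straight line.

```python
from statistics import mean

# Hypothetical paired, continuous measurements; not from the lecture.
x = [0, 1, 2, 3, 4]
y = [150.0, 150.5, 151.0, 151.5, 152.0]

mx, my = mean(x), mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # how x and y vary together
sxx = sum((a - mx) ** 2 for a in x)                   # how much x varies
syy = sum((b - my) ** 2 for b in y)                   # how much y varies

slope = sxy / sxx             # regression: change in y per 1-unit change in x
intercept = my - slope * mx   # regression: value of y when x = 0
r = sxy / (sxx * syy) ** 0.5  # correlation: direction and strength, from -1 to +1

print(f"slope = {slope}, intercept = {intercept}, r = {r}")
```

A positive slope and a positive r both say the same thing the graph does: as x gets bigger, so does y.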

20
Q

Line graphs use ____________ data.

A

Continuous

21
Q

To statistically analyze a line graph, you need to do a _____________ or a _______________

A

Regression or correlation

22
Q

Line graphs tell a story
Without knowing anything other than the SLOPE of a line, we can say

A

- All datapoints belong to a SINGLE GROUP or CONDITION (there is only 1 line)
- Group average ‘STARTING VALUE’ is 150 (y-intercept, the value of y when x=0)
- There is a LINEAR relationship between x and y (it’s a straight line)
- There is a POSITIVE relationship between x and y (as x gets bigger, so does y)

23
Q

Line graphs tell a story:
How many GROUPS/CONDITIONS?
What is ‘STARTING VALUE’?
Is this LINEAR?
Is it a POSITIVE or NEGATIVE (or null) relationship between x and y?

A

1 Group/condition if there is 1 line
2 Groups/conditions if there are 2 lines
Starting value is the y-value where the line starts (the y-intercept)
Yes, linear if the line is straight
DON’T MISTAKE THIS THOUGH. It is always a line graph, but not always linear. If the line is curved, it is not linear.
Positive if going up, negative if going down
Null if the line is flat (horizontal)

24
Q

Line graphs tell a story. These groups change the same but start at different places.
Example: 8th- versus 12th-grade males (weight), eating the same diet.

A

Our result: 8th graders weighed less than 12th graders at the START of the study, but gained weight at the same RATE across the study period.

25
Q

Line graphs tell a story. These groups start at the same place but change differently.

A

Example: All participants weighed the same (160 lbs) at the beginning of the study. When they were split into groups that did or did not go on a calorie-restriction diet, their weights diverged (changed differently) over the study period.

26
Q

Protocol: 4 participants added 150 cal/day to their diet, with no other changes. This is what happened to average weight. What would a line graph tell you?

A

Tells you how many days it took to gain 5 lbs.
Also you can see:
- Weight kept increasing for every day that the extra calories were consumed.
- The rate of increase in weight was consistent across all days.

27
Q

What is missing in just line graphs? What doesn’t a line tell us?

A

Individual variability!
A line doesn’t tell us what every individual in a group is doing. It is just telling us about the average of a group.

We are missing error bars.

28
Q

What is a way to see the individual in a line graph?

A

ERROR bars.
They provide some information on INDIVIDUAL VARIABILITY.
These are the standard deviation around the mean. You can read these too!

29
Q

What is an error bar?

A

Shows the standard deviation around the mean. Provides information on individual variability.
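
A minimal sketch of where the line and its error bars come from: at each time point, the line passes through the group mean and the error bar spans one standard deviation on either side. The weights are hypothetical (4 participants measured on 3 days).

```python
from statistics import mean, stdev

# Hypothetical weights (lbs) for 4 participants on 3 days; not from the lecture.
weights_by_day = {
    0: [148, 150, 151, 151],
    10: [149, 151, 153, 155],
    20: [149, 152, 155, 160],
}

for day, weights in weights_by_day.items():
    m, sd = mean(weights), stdev(weights)
    # The line shows m; the error bar runs from m - sd to m + sd.
    print(f"day {day}: mean = {m:.1f}, error bar = {m - sd:.1f} to {m + sd:.1f}")
```

In this made-up data the error bars get longer over time, which says the individuals spread apart even though the average rises steadily.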

30
Q

What is a scatterplot?

A

Scatterplots are used to look at the relationship between 2 variables (like line graphs).
(what happens to the # on the y-axis as the # on the x-axis gets bigger?)
But it also provides information about every individual in a group!
This type of graph is used when both variables (1 on the x- and 1 on the y-axis) are CONTINUOUS (in other words, no groups!)
This type of graph uses continuous data. To statistically ANALYZE it, we’ll need to do a REGRESSION or CORRELATION

31
Q

How do you statistically analyze a scatterplot?

A

With a REGRESSION or CORRELATION

32
Q

Benefits of scatterplots

A

They show ACTUAL individual variability!
From the sleep study example, we can now see that people vary in how much they change.

33
Q

Related to scatterplots, a spaghetti plot.

A

Same data as scatterplot, uses lines rather than dots to show all data (not just average)
- Good for when time is on x-axis
- Still shows variability, but now you see data by person.
Conclusions:
- People are different from one another.
- Change varies over time (it is not a straight line).

34
Q

What does ‘continuous’ mean for a graph? Which graphs does it apply to?

A

A graph is continuous when there are no groups (we are looking at the relationship between two variables). This would be a line graph, scatterplot, or spaghetti plot.

35
Q

Sometimes it’s not a relationship, it’s a competition.

A

Graphs can help you compare one group to another
- The x-axis isn’t a continuum; it is either:
  - Bins of numbers, called intervals (histogram)
  - Categories/groups (bar chart)

36
Q

When comparing one group to another, this is a __________ variable (Categorical or continuous)

A

Categorical

37
Q

When comparing one group to another, what type of graphs would we use and which would we not use? Why?

A

When comparing one group to another, we would use a histogram or a bar chart because the comparison is categorical. We could not use a line graph, scatterplot or spaghetti plot because those are for continuous variables.

38
Q

What is a bar graph?

A

Bar graphs are used to look at differences between groups or times. The CATEGORICAL variable (group or binned) is on the x-axis and the CONTINUOUS variable is on the y-axis.

This type of graph uses a combination of categorical and continuous data. To statistically analyze it, we’ll need to do a T-TEST or ANOVA.
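
A minimal sketch of the idea behind a t-test for two groups: the difference between the group means, divided by its standard error. This uses Welch's version of the t statistic (an assumption; the card does not say which t-test is meant) and stops before the p-value, which needs a t-distribution table or a statistics package. The sick-day counts are hypothetical.

```python
from statistics import mean, variance

# Hypothetical sick days per year for workers in two groups; not from the lecture.
nj = [4, 6, 5, 7, 8]
pa = [2, 3, 4, 3, 3]


def welch_t(a, b):
    """Welch's two-sample t statistic: difference in means over its standard error."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error


t = welch_t(nj, pa)
print(f"mean NJ = {mean(nj)}, mean PA = {mean(pa)}, t = {t:.2f}")
```

The further t is from 0, the harder it is to explain the difference in bar heights by variability within the groups alone.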

39
Q

How do you statistically analyze a bar graph?

A

With a T-TEST or ANOVA

40
Q

Where is the categorical and where is the continuous variable on a bar graph?

A

Bar graphs have both categorical and continuous data. The categorical variable (group or binned, e.g. NJ, PA) is on the x-axis and the continuous variable (numbers) is on the y-axis.

41
Q

Graphs that compare groups

A

Bar Graph
This bar graph compares average sick days/year among workers in different states.
- The x-axis is the groups.
- The y-axis shows sick days data.
- Error bars = variability from year to year.

42
Q

What is a histogram?

A

A histogram is a special type of BAR graph.
The y-axis is always FREQUENCY (the # of participants in the bin) and the data are always from 1 variable (nothing being compared).
It is a graph used for descriptive statistics (to describe data)
This is a different type of graph! What’s the major difference?
a) Bars are created by BINNING values on the x-axis
b) We are not looking at a RELATIONSHIP between x and y (no line), we are COMPARING the height of the bars.

43
Q

This is a histogram. Example

A

This histogram partitions grades into 5-point bins. It tells us that the most common grade on the exam was a B (80-85): 10 out of 23 students. 9 more earned a B+.
There are no gaps between the bars because the data on the x-axis are actually CONTINUOUS, but BINNED into GROUPS.
*Sometimes continuous data isn’t necessary. I might not care who got an 88 versus an 89; I just care how many B+s there were.
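
A minimal sketch of binning continuous grades into 5-point bins and counting the frequency in each bin. The grades are hypothetical (not the 23 grades behind the histogram on this card), and the bin edges are an assumption: each bin includes its lower edge and stops just short of the next one (80-84, 85-89, and so on).

```python
from collections import Counter

# Hypothetical exam grades; not the data behind the lecture's histogram.
grades = [72, 78, 81, 83, 84, 80, 86, 88, 89, 91, 95, 82]

BIN_WIDTH = 5

# Each grade goes into the bin starting at the multiple of 5 at or below it.
counts = Counter((g // BIN_WIDTH) * BIN_WIDTH for g in grades)

for start in sorted(counts):
    # Frequency (the y-axis of a histogram) = number of grades in the bin.
    print(f"{start}-{start + BIN_WIDTH - 1}: {'#' * counts[start]}")
```

Once the grades are binned, an 88 and an 89 are indistinguishable; only the height of each bar is left to compare.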

44
Q

Graphs that bin & measure how common something is

A

A histogram.

45
Q

Bar graphs are more common, but this one gives more detail!

A

A BOX & WHISKER (a fancy bar graph)

46
Q

What is a box and whisker graph?

A

A fancy bar graph. Still a combo of categorical and continuous data. To statistically analyze it, we’ll need to do a T-TEST or ANOVA.

47
Q

Quartiles

A

Order the data, then cut the dataset into 4 parts with the same number of data points in each. Sometimes a quarter spans a wider range of #s (especially in the tails).
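
A minimal sketch of finding the quartile cut points with Python's standard library. The data are hypothetical, with a long right tail. Note that software packages use slightly different interpolation rules for quartiles, so other tools may give slightly different cut points.

```python
from statistics import quantiles

# Hypothetical ordered data with a long right tail; not from the lecture.
data = [1, 2, 2, 3, 3, 4, 5, 6, 9, 15, 30]

# n=4 returns the 3 cut points that split the ordered data into 4 parts.
q1, q2, q3 = quantiles(data, n=4)

print("cut points:", q1, q2, q3)
print("lowest quarter spans ", min(data), "to", q1)
print("highest quarter spans", q3, "to", max(data))
```

The lowest quarter covers a narrow range of values while the highest quarter covers a much wider one, which is what a long whisker on a box & whisker graph shows.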

48
Q

How does a box & whisker show outliers?

A

An asterisk is used to denote that a value is outside the expected range based on other values, and is likely to BIAS the results. It is an influential outlier.

A circle is used to denote that a value is outside the expected range based on other values but should be OK to keep.
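
A minimal sketch of how software might decide between a circle and an asterisk. The cutoffs are an assumption, not something stated in the lecture: a common convention marks a value with a circle when it lies more than 1.5 box lengths (interquartile ranges) beyond the box, and with an asterisk when it lies more than 3 box lengths beyond it. The data and the function name `classify` are hypothetical.

```python
from statistics import quantiles

# Hypothetical data; not from the lecture.
data = [10, 11, 12, 12, 13, 13, 14, 15, 16, 23, 30]


def classify(values):
    """Label values far outside the box (ASSUMED cutoffs: 1.5 and 3 box lengths)."""
    q1, _, q3 = quantiles(values, n=4)
    box_length = q3 - q1  # the interquartile range
    flagged = []
    for v in values:
        distance = max(q1 - v, v - q3, 0)  # how far v sits outside the box
        if distance > 3 * box_length:
            flagged.append((v, "asterisk"))
        elif distance > 1.5 * box_length:
            flagged.append((v, "circle"))
    return flagged


print(classify(data))
```

Whatever the exact cutoffs, the rule is applied to every value in the dataset, which fits the earlier point about having a clear plan for what counts as too extreme.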