Exam 2 Lecture 1 Flashcards
Why are protocols necessary?
Once we know the question we want to ask, we have to devise a way to get the answer.
Why is the structure of data important?
Once we have a plan to get the answer, we also need a plan to manage the data.
We got ourselves some data! Now, how do we know it’s not junk?
Quality checking
Garbage In, Garbage Out
In computer science, garbage in, garbage out (GIGO) is the concept that flawed or nonsensical input data produce nonsense output.
Data are never as easy as they seem… what are the 3 data problems that can introduce bias?
- Messy data
- Dirty data
- Missing data
You can’t assume anything!
What is bias?
When the data you have don't actually represent the parameter you are studying, or the sample you have doesn't actually represent the population you are interested in. Bias is not 'by chance'; it is systematic error.
Fixable errors that do not make ASSUMPTIONS about what the right answer is + examples
Messy data
Examples:
Human error
- Question was clear, checked wrong box
- “Eight” instead of “8”
Computer error
- Zip code started with 0 (e.g., 08854), but the computer dropped the zero and recorded '8854'
Equipment error
- Noise in signal that can be removed because the source of the noise is known (i.e., interference from a nearby power line)
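The fixes above can be sketched in code. This is a minimal Python sketch of cleaning messy data, where the intended value is unambiguous; the function names, lookup table, and example values are all made up for illustration:

```python
# Hypothetical lookup table for spelled-out numbers
WORD_TO_DIGIT = {"eight": "8", "seven": "7"}

def clean_age(raw):
    """Fix human error: convert 'Eight' to '8'; leave other values alone."""
    cleaned = raw.strip()
    return WORD_TO_DIGIT.get(cleaned.lower(), cleaned)

def clean_zip(raw):
    """Fix computer error: restore the leading zero dropped from a 5-digit ZIP."""
    return raw.strip().zfill(5)

print(clean_age("Eight"))  # -> 8
print(clean_zip("8854"))   # -> 08854
```

Both fixes are "messy data" repairs because no assumption is needed about what the right answer is.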
What are messy data?
Fixable errors that do not make assumptions about what the right answer is
What are dirty data?
Unfixable errors. You cannot deduce what the answer should be. Data must be discarded.
Dirty data and examples
These are unfixable errors. You cannot deduce what the answer should be. Data must be discarded.
Examples
Equipment problems
- Noise in signal that cannot be explained
Protocol failure
- Some people smoked cigarettes right before breath test
Question problems
- Poor wording that creates confusion
  "Is an allergic property a key component of your health?"
- Wording that biases the answer
  "How bad do you think marijuana is for you?"
Response problems
- Open-ended answers that are hard to categorize/quantify
  "Sometimes, once when I was at my mom's but later more often"
- Incomplete response options, so people don't know how to answer
  Real answer: weekly. Actual options: every day, once a month, once a year.
Even if you are perfect, if your study involves people (like YOU), there will be ____________
Problems
Unfixable errors are
Dirty data
Missing data
The absence of clean = dirt
- Very few datasets are complete. There are always little issues.
- Sometimes stuff is just missing
  A subject skipped a question, a researcher forgot to weigh a subject
- Sometimes data need to be discarded
  Heart rate monitor failed for 1 person
- If there are clear reasons and rules for discarding data, OK, but you can't just discard data because you don't like it
You CAN discard data from 3 adult subjects whose recorded heights are 9 ft, 12 ft, and 25 in (you can google the tallest and shortest adults)
You cannot discard data from a young woman who says she can deadlift 350 lbs just because you don't believe it
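The rule above can be sketched as code: drop a value only when it violates a plausibility rule set in advance, never just because it is surprising. The bounds and the height records here are illustrative:

```python
# Plausibility bounds (in inches) based on verified human extremes;
# rough values for illustration only
MIN_ADULT_HEIGHT = 22   # shortest verified adult was roughly this tall
MAX_ADULT_HEIGHT = 107  # tallest verified adult was just under 9 ft (108 in)

heights = [66, 108, 70, 144, 63]  # 108 in = 9 ft, 144 in = 12 ft

# Keep only values that satisfy the pre-set rule
kept = [h for h in heights if MIN_ADULT_HEIGHT <= h <= MAX_ADULT_HEIGHT]
print(kept)  # the impossible 9 ft and 12 ft values are dropped

# A surprising-but-possible value (like a 350 lb deadlift) would NOT be
# filtered here: it violates no rule, only disbelief.
```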
Having some missing data is usually _________
Acceptable, because statistics has a few ways to deal with it.
- You must check WHETHER there is a pattern to what is missing
- Is the missing stuff RANDOM? If not, it can introduce BIAS into your results
College athletes may be less likely than non-athletes to report drug use
- Your parameter estimates (RESULTS) will not be accurate for college athletes
Non-exercisers may be more likely to overestimate their activity level
- Your parameter estimates (RESULTS) will not be accurate/representative for non-exercisers
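The pattern check above can be sketched by comparing missingness rates across groups; if one group skips a question more often, the missingness is not random. The group labels and responses below are invented, with None marking a skipped question:

```python
from collections import Counter

# Hypothetical survey data: (group, answer); None = question skipped
responses = [
    ("athlete", None), ("athlete", "no"), ("athlete", None),
    ("non-athlete", "yes"), ("non-athlete", "no"), ("non-athlete", "yes"),
]

# Count skipped questions and totals per group
missing = Counter(group for group, answer in responses if answer is None)
total = Counter(group for group, _ in responses)

for group in total:
    rate = missing.get(group, 0) / total[group]
    print(f"{group}: {rate:.0%} missing")
# Athletes skip far more often -> missingness is patterned, not random,
# so estimates for athletes will be biased.
```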
Hihg instead of High
Messy, fixable. No doubt what they meant: 100 out of 100 people would say 'High'.
133 when the highest number is 100
Dirty, NOT fixable! 100 is the highest possible value, so 133 is clearly wrong, but there is no way to deduce what it was meant to be.
Raw data and what must you do with it
Raw data are data direct from the source, before you make any corrections or changes, even if there are 'errors'.
- Raw data must be checked and cleaned
Transformed data
Sometimes response options didn’t quite work, but you can go back and bin or fix the information so it makes more sense/is more usable for statistical analyses
- Chartreuse, evergreen-> GREEN (details lost, structure gained)
- Green = 1, Blue = 2, Purple = 3 (turns qualitative/unstructured/descriptive data into quantitative data)
- Heart rate counted over 15 sec? But HR is usually beats per MINUTE, so we multiply by 4 (a constant) -> 15 beats × 4 = 60 bpm
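The three transformations above can be sketched in Python; the mappings and function names are illustrative, not standard:

```python
# Binning: fine-grained shades collapse into a coarser category
# (details lost, structure gained)
SHADE_TO_COLOR = {"chartreuse": "green", "evergreen": "green", "navy": "blue"}

# Coding: qualitative categories become numbers for analysis
COLOR_CODE = {"green": 1, "blue": 2, "purple": 3}

def bin_shade(shade):
    return SHADE_TO_COLOR.get(shade.lower(), shade.lower())

def code_color(color):
    return COLOR_CODE[color]

def bpm(beats_in_15s):
    # Rescaling by a constant: 15 s x 4 = 60 s, so counts scale by 4
    return beats_in_15s * 4

print(bin_shade("Chartreuse"))  # -> green
print(code_color("green"))      # -> 1
print(bpm(15))                  # -> 60
```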
All stats require
Interpretation and a little common sense
Population
Everyone
Parameter
What you want to know about a population
Sample
Some of a population
Statistic
What you can compute from a sample to estimate a parameter
How good a statistic is depends on
how good your sample is
Accuracy
How closely your statistic (estimate) actually reflects your parameter
Since you can’t measure everyone, you ________ a _________ by using a _________
Estimate a parameter by using a statistic
What undermines accuracy?
Uncertainty and bias
A data point =
An exact value + noise + error (we are always trying to minimize noise and error)
Noise
Data irregularities that have no pattern. Chalk it up to the reality of life. Unavoidable and unpredictable. Always leaves just a little bit of uncertainty.
Error
Data irregularities that are explainable (but not always avoidable or detectable). There was a problem in what you did or how you did it. Intentional or unintentional, this is a common way that results become biased.
Noise and error=
Uncertainty and bias
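The "data point = exact value + noise + error" card can be simulated to show the difference between the two: noise is random and averages out over many measurements, while systematic error does not. All numbers below are made up:

```python
import random

random.seed(0)

TRUE_VALUE = 60.0       # the exact value we wish we could observe
SYSTEMATIC_ERROR = 2.0  # e.g., a miscalibrated instrument reading high

# Each data point = exact value + error + noise (Gaussian, no pattern)
measurements = [TRUE_VALUE + SYSTEMATIC_ERROR + random.gauss(0, 1.0)
                for _ in range(10_000)]

mean = sum(measurements) / len(measurements)
print(round(mean, 1))  # close to 62.0: the noise cancels, the error remains
```

Averaging shrinks uncertainty (noise) but cannot remove bias (error); that is why a biased protocol stays biased no matter how much data you collect.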
Systematic error
Your parameter estimate is off because of an ERROR IN YOUR PROTOCOL. Bias everywhere!
Ex: estimating average US fitness by surveying only people at gyms
Measurement error
Your parameter estimate is off because an UNEXPECTED, UNRELATED factor made some data different from the rest. BIAS in some data! Leads to missing data unless you can measure the factor and transform/correct for it during data cleaning.
Ex: Today was windy, so running speeds were lower
Sampling error
Your parameter estimate is off because of A PROBLEM WITH YOUR SAMPLE (it includes people who aren't part of the population, or excludes people who are). BIAS before you even begin!
Ex: Average daily step counts, including people with leg injuries
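The step-count example above can be simulated: including people outside the target population drags the estimate away from the true parameter. All numbers are invented:

```python
import random

random.seed(1)

# Target population: healthy adults. People with leg injuries are NOT part
# of the population of interest, but sneak into the bad sample.
healthy = [random.gauss(8000, 1500) for _ in range(500)]
injured = [random.gauss(2000, 800) for _ in range(100)]

good_sample = healthy            # matches the population of interest
bad_sample = healthy + injured   # sampling error: includes outsiders

good_mean = sum(good_sample) / len(good_sample)
bad_mean = sum(bad_sample) / len(bad_sample)

print(round(good_mean))  # near the true average of ~8000 steps
print(round(bad_mean))   # pulled well below it: biased before you begin
```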
Statistics ___________ error and uncertainty. This helps determine how _________ an estimate is to be correct.
Statistics measures error and uncertainty. This helps determine how likely an estimate is to be correct.