Exam 2 Lecture 1 Flashcards
Why are protocols necessary?
Once we know the question we want to ask, we have to devise a way to get the answer.
Why is the structure of data important?
Once we have a plan to get the answer, we have to have a plan to manage the data.
We got ourselves some data! Now, how do we know it’s not junk?
Quality checking
Garbage In, Garbage Out
In computer science, garbage in, garbage out (GIGO) is the concept that flawed or nonsensical input data produces nonsense output.
Data are never as easy as they seem… what are the 3 kinds of data problems that can introduce bias?
- Messy data
- Dirty data
- Missing data
You can’t assume anything!
What is bias?
When the data you have don't actually represent the parameter you are studying, or the sample you have doesn't actually represent the population you are interested in. Bias is not 'by chance'; it is systematic error.
Fixable errors that do not make ASSUMPTIONS about what the right answer is + examples
Messy data
Examples:
Human error
- Question was clear, checked wrong box
- “Eight” instead of “8”
Computer error
- Zip code started with a 0 (e.g., 08854), but the computer dropped the leading zero and recorded '8854'
Equipment error
- Noise in signal that can be removed because the source of the noise is known (i.e., interference from a power line nearby)
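The two human/computer error examples above can be fixed programmatically, because the correct value can be recovered without guessing. A minimal sketch (the helper names and the word-to-digit table are my own, not from the lecture):

```python
# Sketch: repairing messy (fixable) data without assuming what the answer is.
# Helper names and the mapping below are illustrative, not from the lecture.

WORDS_TO_DIGITS = {"eight": "8", "seven": "7", "six": "6"}  # extend as needed

def normalize_number(value: str) -> str:
    """Turn a spelled-out number like 'Eight' into '8'; leave digits alone."""
    v = value.strip().lower()
    return WORDS_TO_DIGITS.get(v, value.strip())

def fix_zip(zip_code: str) -> str:
    """Restore a leading zero that software dropped (e.g. '8854' -> '08854').

    US zip codes are 5 digits, so a shorter all-digit value lost leading zeros.
    """
    z = zip_code.strip()
    return z.zfill(5) if z.isdigit() and len(z) < 5 else z

print(normalize_number("Eight"))  # -> 8
print(fix_zip("8854"))            # -> 08854
```

Both fixes are deterministic: every reasonable person would agree on the corrected value, which is what makes the data messy rather than dirty.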
What are messy data?
Fixable errors that do not make assumptions about what the right answer is
What are dirty data?
Unfixable errors. You cannot deduce what the answer should be. Data must be discarded.
Dirty data and examples
These are unfixable errors. You cannot deduce what the answer should be. Data must be discarded.
Examples
Equipment problems
- Noise in signal that cannot be explained
Protocol failure
- Some people smoked cigarettes right before breath test
Question problems
- Poor wording that creates confusion
Is an allergic property a key component of your health?
- Wording that biases the answer
How bad do you think marijuana is for you?
Response problems
- Open-ended answers that are hard to categorize/quantify
Sometimes, once when I was at my mom’s but later more often
- Incomplete response options so people don’t know how to answer
Real answer: weekly. Actual options given: every day, once a month, once a year.
Even if you are perfect, if your study involves people (like YOU), there will be ____________
Problems
Unfixable errors are
Dirty data
Missing data
Missing data are not the same as dirty data: the absence of clean data does not equal dirt.
- Very few datasets are complete. There are always little issues.
- Sometimes data are just missing
A subject skipped a question; a researcher forgot to weigh a subject
- Sometimes data need to be discarded
Heart rate monitor failed for 1 person
- If there are clear reasons and rules for discarding data, that's OK, but you can't just discard data because you don't like it
You can discard data from 3 adult subjects whose recorded heights are 9 ft, 12 ft, and 25 in (you can google the tallest and shortest adults)
You cannot discard data from a young woman who says she can deadlift 350 lbs just because you don't believe it
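The rule above — discard only by a pre-stated, documented criterion — can be sketched in a few lines. The height bounds here are illustrative study-specific limits I chose for the example, not values from the lecture; the point is that the rule is written down before looking at the data:

```python
# Sketch: discarding data only by a documented rule, never by gut feeling.
# Bounds are illustrative: suppose this adult study pre-registered that any
# recorded height outside 36-96 inches is an impossible/equipment error.
MIN_HEIGHT_IN, MAX_HEIGHT_IN = 36.0, 96.0

# 9 ft (108 in), 12 ft (144 in), and 25 in cannot be real adult heights here
heights_in = [68, 70, 9 * 12, 12 * 12, 25, 64]

kept = [h for h in heights_in if MIN_HEIGHT_IN <= h <= MAX_HEIGHT_IN]
discarded = [h for h in heights_in if not (MIN_HEIGHT_IN <= h <= MAX_HEIGHT_IN)]

print(kept)       # plausible heights stay
print(discarded)  # impossible heights go, with a reason you can defend
```

A surprising-but-possible value (the 350 lb deadlift) would pass any honest rule like this, which is exactly why it must stay in the dataset.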
Having some missing data is usually _________
Acceptable, because statistics has a few ways to deal with it.
- You must check whether there is A PATTERN to what is missing
- Is the missing stuff RANDOM? If not, it can introduce BIAS into your results
College athletes are less likely than non-athletes to report drug use
- Your parameter estimates (RESULTS) will not be accurate for college athletes
Non-exercisers may be more likely to overestimate their activity level
- Your parameter estimates (RESULTS) will not be accurate/representative for non-exercisers
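A first check for a pattern in missingness is simply comparing missing-response rates across groups. A minimal sketch with made-up data (the group names and counts are illustrative only):

```python
# Sketch: is missingness random, or patterned by group?
# Data are invented for illustration; None means the subject skipped the question.
from collections import defaultdict

responses = [
    ("athlete", None), ("athlete", "yes"), ("athlete", None), ("athlete", None),
    ("non-athlete", "no"), ("non-athlete", "yes"),
    ("non-athlete", None), ("non-athlete", "no"),
]

missing = defaultdict(int)
total = defaultdict(int)
for group, answer in responses:
    total[group] += 1
    if answer is None:
        missing[group] += 1

for group in total:
    rate = missing[group] / total[group]
    print(f"{group}: {missing[group]}/{total[group]} missing ({rate:.0%})")
# If one group's missing rate is much higher, missingness is NOT random,
# and parameter estimates for that group will be biased.
```

Here athletes skip the question far more often than non-athletes, so any drug-use estimate for athletes would be biased, matching the example above.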
"Hihg" instead of "High"
Messy, and fixable. There is no doubt what was meant: 100 out of 100 people would say "High".