Exam 2 Lecture 1 Flashcards
Why are protocols necessary?
Once we know the question we want to ask, we have to devise a way to get the answer.
Why is the structure of data important?
Once we have a plan to get the answer, we also need a plan to manage the data.
We got ourselves some data! Now, how do we know it’s not junk?
Quality checking
Garbage In, Garbage Out
In computer science, garbage in, garbage out (GIGO) is the concept that flawed or nonsensical input data produce nonsense output.
Data are never as easy as they seem… what are the 3 data problems that can introduce bias?
- Messy data
- Dirty data
- Missing data
You can’t assume anything!
What is bias?
When the data you have don't actually represent the parameter you are studying, or the sample you have doesn't actually represent the population you are interested in. Bias is not 'by chance'; it is systematic error.
Fixable errors that do not make ASSUMPTIONS about what the right answer is + examples
Messy data
Examples:
Human error
- Question was clear, checked wrong box
- “Eight” instead of “8”
Computer error
- Zip code started with 0 (e.g., 08854), but the computer dropped the zero and recorded '8854'
Equipment error
- Noise in signal that can be removed because the source of the noise is known (i.e., interference from a nearby power line)
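The fixes above can be sketched in code. This is a minimal Python sketch of cleaning messy data, where the intended value is unambiguous; the function names, lookup table, and example values are all made up for illustration:

```python
# Hypothetical lookup table for spelled-out numbers
WORD_TO_DIGIT = {"eight": "8", "seven": "7"}

def clean_age(raw):
    """Fix human error: convert 'Eight' to '8'; leave other values alone."""
    cleaned = raw.strip()
    return WORD_TO_DIGIT.get(cleaned.lower(), cleaned)

def clean_zip(raw):
    """Fix computer error: restore the leading zero dropped from a 5-digit ZIP."""
    return raw.strip().zfill(5)

print(clean_age("Eight"))  # -> 8
print(clean_zip("8854"))   # -> 08854
```

Both fixes are "messy data" repairs because no assumption is needed about what the right answer is.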
What are messy data?
Fixable errors that do not make assumptions about what the right answer is
What are dirty data?
Unfixable errors. You cannot deduce what the answer should be. Data must be discarded.
Dirty data and examples
These are unfixable errors. You cannot deduce what the answer should be. Data must be discarded.
Examples
Equipment problems
- Noise in signal that cannot be explained
Protocol failure
- Some people smoked cigarettes right before breath test
Question problems
- Poor wording that creates confusion
  "Is an allergic property a key component of your health?"
- Wording that biases the answer
  "How bad do you think marijuana is for you?"
Response problems
- Open-ended answers that are hard to categorize/quantify
  "Sometimes, once when I was at my mom's but later more often"
- Incomplete response options, so people don't know how to answer
  Real answer: weekly. Actual options: every day, once a month, once a year.
Even if you are perfect, if your study involves people (like YOU), there will be ____________
Problems
Unfixable errors are
Dirty data
Missing data
The absence of clean = dirt
- Very few datasets are complete. There are always little issues.
- Sometimes stuff is just missing
  A subject skipped a question, a researcher forgot to weigh a subject
- Sometimes data need to be discarded
  Heart rate monitor failed for 1 person
- If there are clear reasons and rules for discarding data, OK, but you can't just discard data because you don't like it
You CAN discard data from 3 adult subjects whose recorded heights are 9 ft, 12 ft, and 25 in (you can google the tallest and shortest adults)
You cannot discard data from a young woman who says she can deadlift 350 lbs just because you don't believe it
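The rule above can be sketched as code: drop a value only when it violates a plausibility rule set in advance, never just because it is surprising. The bounds and the height records here are illustrative:

```python
# Plausibility bounds (in inches) based on verified human extremes;
# rough values for illustration only
MIN_ADULT_HEIGHT = 22   # shortest verified adult was roughly this tall
MAX_ADULT_HEIGHT = 107  # tallest verified adult was just under 9 ft (108 in)

heights = [66, 108, 70, 144, 63]  # 108 in = 9 ft, 144 in = 12 ft

# Keep only values that satisfy the pre-set rule
kept = [h for h in heights if MIN_ADULT_HEIGHT <= h <= MAX_ADULT_HEIGHT]
print(kept)  # the impossible 9 ft and 12 ft values are dropped

# A surprising-but-possible value (like a 350 lb deadlift) would NOT be
# filtered here: it violates no rule, only disbelief.
```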
Having some missing data is usually _________
Acceptable, because statistics has a few ways to deal with it.
- You must check WHETHER there is a pattern to what is missing
- Is the missing stuff RANDOM? If not, it can introduce BIAS into your results
College athletes may be less likely than non-athletes to report drug use
- Your parameter estimates (RESULTS) will not be accurate for college athletes
Non-exercisers may be more likely to overestimate their activity level
- Your parameter estimates (RESULTS) will not be accurate/representative for non-exercisers
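The pattern check above can be sketched by comparing missingness rates across groups; if one group skips a question more often, the missingness is not random. The group labels and responses below are invented, with None marking a skipped question:

```python
from collections import Counter

# Hypothetical survey data: (group, answer); None = question skipped
responses = [
    ("athlete", None), ("athlete", "no"), ("athlete", None),
    ("non-athlete", "yes"), ("non-athlete", "no"), ("non-athlete", "yes"),
]

# Count skipped questions and totals per group
missing = Counter(group for group, answer in responses if answer is None)
total = Counter(group for group, _ in responses)

for group in total:
    rate = missing.get(group, 0) / total[group]
    print(f"{group}: {rate:.0%} missing")
# Athletes skip far more often -> missingness is patterned, not random,
# so estimates for athletes will be biased.
```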
Hihg instead of High
Messy, fixable. No doubt what they meant: 100 out of 100 people would say 'High'.
133 when the highest number is 100
Dirty, NOT fixable! 100 is the highest possible value, so 133 is clearly wrong, but there is no way to deduce what it was meant to be.
Raw data and what must you do with it
Raw data are data direct from the source, before you make any corrections or changes, even if there are 'errors'.
- Raw data must be checked and cleaned
Transformed data
Sometimes response options didn’t quite work, but you can go back and bin or fix the information so it makes more sense/is more usable for statistical analyses
- Chartreuse, evergreen-> GREEN (details lost, structure gained)
- Green = 1, Blue = 2, Purple = 3 (turns qualitative/unstructured/descriptive data into quantitative data)
- Heart rate counted over 15 sec? But HR is usually beats per MINUTE, so we multiply by 4 (a constant) -> 15 beats × 4 = 60 bpm
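The three transformations above can be sketched in Python; the mappings and function names are illustrative, not standard:

```python
# Binning: fine-grained shades collapse into a coarser category
# (details lost, structure gained)
SHADE_TO_COLOR = {"chartreuse": "green", "evergreen": "green", "navy": "blue"}

# Coding: qualitative categories become numbers for analysis
COLOR_CODE = {"green": 1, "blue": 2, "purple": 3}

def bin_shade(shade):
    return SHADE_TO_COLOR.get(shade.lower(), shade.lower())

def code_color(color):
    return COLOR_CODE[color]

def bpm(beats_in_15s):
    # Rescaling by a constant: 15 s x 4 = 60 s, so counts scale by 4
    return beats_in_15s * 4

print(bin_shade("Chartreuse"))  # -> green
print(code_color("green"))      # -> 1
print(bpm(15))                  # -> 60
```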
All stats require
Interpretation and a little common sense
Population
Everyone
Parameter
What you want to know about a population
Sample
Some of a population
Statistic
What you can compute from a sample to estimate a parameter
How good a statistic is depends on
how good your sample is
Accuracy
How closely your statistic (estimate) actually reflects your parameter
Since you can’t measure everyone, you ________ a _________ by using a _________
Estimate a parameter by using a statistic
What undermines accuracy?
Uncertainty and bias
A data point =
An exact value + noise + error (we are always trying to minimize noise and error)
Noise
Data irregularities that have no pattern. Chalk it up to the reality of life. Unavoidable and unpredictable. Always leaves just a little bit of uncertainty.
Error
Data irregularities that are explainable (but not always avoidable or detectable). There was a problem in what you did or how you did it. Intentional or unintentional, this is a common way that results become biased.
Noise and error=
Uncertainty and bias
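The "data point = exact value + noise + error" card can be simulated to show the difference between the two: noise is random and averages out over many measurements, while systematic error does not. All numbers below are made up:

```python
import random

random.seed(0)

TRUE_VALUE = 60.0       # the exact value we wish we could observe
SYSTEMATIC_ERROR = 2.0  # e.g., a miscalibrated instrument reading high

# Each data point = exact value + error + noise (Gaussian, no pattern)
measurements = [TRUE_VALUE + SYSTEMATIC_ERROR + random.gauss(0, 1.0)
                for _ in range(10_000)]

mean = sum(measurements) / len(measurements)
print(round(mean, 1))  # close to 62.0: the noise cancels, the error remains
```

Averaging shrinks uncertainty (noise) but cannot remove bias (error); that is why a biased protocol stays biased no matter how much data you collect.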
Systematic error
Your parameter estimate is off because of an ERROR IN YOUR PROTOCOL. Bias everywhere!
Ex: estimating average US fitness by surveying only people at gyms
Measurement error
Your parameter estimate is off because an UNEXPECTED, UNRELATED factor made some data different from the rest. BIAS in some data! Leads to missing data unless you can measure the factor and transform/correct for it during data cleaning.
Ex: Today was windy, so running speeds were lower
Sampling error
Your parameter estimate is off because of A PROBLEM WITH YOUR SAMPLE (it includes people who aren't part of the population, or excludes people who are). BIAS before you even begin!
Ex: Average daily step counts, including people with leg injuries
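The step-count example above can be simulated: including people outside the target population drags the estimate away from the true parameter. All numbers are invented:

```python
import random

random.seed(1)

# Target population: healthy adults. People with leg injuries are NOT part
# of the population of interest, but sneak into the bad sample.
healthy = [random.gauss(8000, 1500) for _ in range(500)]
injured = [random.gauss(2000, 800) for _ in range(100)]

good_sample = healthy            # matches the population of interest
bad_sample = healthy + injured   # sampling error: includes outsiders

good_mean = sum(good_sample) / len(good_sample)
bad_mean = sum(bad_sample) / len(bad_sample)

print(round(good_mean))  # near the true average of ~8000 steps
print(round(bad_mean))   # pulled well below it: biased before you begin
```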
Statistics ___________ error and uncertainty. This helps determine how _________ an estimate is to be correct.
Statistics measures error and uncertainty. This helps determine how likely an estimate is to be correct.