large data set Flashcards
what is our large data set
data about individuals who took part in the American National Health and Nutrition Examination Survey (NHANES) in 2003-4.
what is the sample size
random sample of 200 people from the 5000 who took the survey
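A minimal sketch of how a simple random sample of 200 could be drawn from 5000 respondents. The ID numbers and the seed are assumptions made for illustration; the cards do not say how the real sample was actually selected.

```python
import random

# Hypothetical population: ID numbers 1..5000 standing in for the survey respondents.
population_ids = list(range(1, 5001))

# Fixed seed (an arbitrary choice) so the same sample is produced on every run.
rng = random.Random(2003)

# sample() draws without replacement, so nobody can be picked twice.
sample_ids = rng.sample(population_ids, k=200)

print(len(sample_ids))       # 200
print(len(set(sample_ids)))  # 200, i.e. no repeats
```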
what is the age of the people in the sample / large data set
aged 16 and over
how is this data collected
combination of interview and physical examination to assess the health and nutritional status of adults and children in the United States
are there people under 16 in the actual 5000-person survey
likely yes - but their data was not included in the LDS
is it really random then?
What does N/A mean
data was not available for that data point
Why is N/A used instead of leaving the cell blank
prevents some software from reading a blank as 0.
zero is a recorded value and cannot be used to represent no data being collected
Do we include data points with N/A in calculations - standard deviation / mean
No - we exclude that data point, which reduces the value of n
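A rough sketch of what excluding N/A does to the mean and standard deviation. The readings below are made up for illustration and are not taken from the large data set; the standard deviation uses the divisor n form.

```python
import math

# Made-up pulse readings for illustration only (not values from the large data set).
raw = ["72", "N/A", "64", "80", "N/A", "68"]

# Keep only the entries that hold an actual number.
values = [float(v) for v in raw if v != "N/A"]

n = len(values)                       # 4, not 6: the N/A entries no longer count
mean = sum(values) / n                # 284 / 4 = 71.0

# Standard deviation with divisor n: sqrt(sum of x^2 / n - mean^2)
variance = sum(x * x for x in values) / n - mean ** 2
sd = math.sqrt(variance)              # sqrt(35), about 5.92

print(n, mean, round(sd, 2))
```

Note that n drops from 6 to 4, so both the mean and the standard deviation are worked out from four values only.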
What does tr mean as a recorded value
Tr = trace amounts
Data is recorded but numerical value is so close to 0 it is negligible
Do we include data points with tr in calculations
Yes - we treat it as 0 in calculations
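A small sketch showing the two rules side by side: N/A entries are dropped, tr entries are kept and counted as 0. The helper name `to_number` and the values are hypothetical, chosen only to illustrate the idea.

```python
def to_number(entry):
    """Convert one recorded entry to a float, or None if it must be excluded."""
    text = entry.strip().lower()
    if text == "n/a":
        return None   # no data available: leave it out of the calculation
    if text == "tr":
        return 0.0    # trace amount: counted as zero
    return float(text)

# Made-up entries for illustration only.
raw = ["1.5", "Tr", "N/A", "0.5", "tr"]

numbers = [x for x in map(to_number, raw) if x is not None]

n = len(numbers)          # 4: the two tr entries still count, the N/A does not
mean = sum(numbers) / n   # (1.5 + 0 + 0.5 + 0) / 4 = 0.5

print(numbers, n, mean)
```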
What is cleaning data
Fixing / removing incorrect, corrupted, incorrectly formatted, duplicate or incomplete data
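One possible sketch of cleaning, using made-up (id, pulse) records. It fixes a formatting problem, then removes incomplete, corrupted and duplicate entries. The field names and the rule "pulse must be a whole number" are assumptions for this example.

```python
# Made-up records (id, pulse) for illustration; not rows from the large data set.
records = [
    ("A1", "72"),
    ("A2", " 64 "),   # stray spaces: incorrectly formatted
    ("A1", "72"),     # duplicate of the first row
    ("A3", ""),       # incomplete: no pulse recorded
    ("A4", "8o"),     # corrupted: letter o typed instead of zero
    ("A5", "80"),
]

cleaned = []
seen = set()
for person_id, pulse in records:
    pulse = pulse.strip()               # fix the formatting
    if not pulse.isdigit():             # drop incomplete or corrupted entries
        continue
    if (person_id, pulse) in seen:      # drop duplicates
        continue
    seen.add((person_id, pulse))
    cleaned.append((person_id, int(pulse)))

print(cleaned)  # [('A1', 72), ('A2', 64), ('A5', 80)]
```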
if you have to extrapolate the data at any point…
not reliable
Use your knowledge of the large data set to suggest two reasons why the sample data in the table may not be representative of the population
think about what's in the data set
both males + females
wider age range in population
Should outliers be removed
- Think…
- Is this BMI possible
- Is this pulse rate possible etc
If PMCC is close to 1…
strong positive linear correlation - the data can be modelled as a straight line
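A short sketch of calculating the PMCC from the usual formula r = Sxy / sqrt(Sxx × Syy). The function name `pmcc` and the data points are made up for illustration; they lie almost on a straight line, so r comes out very close to 1.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Made-up points lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

r = pmcc(xs, ys)
print(round(r, 3))  # 0.999
```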