large data set Flashcards
what is our large data set
data about individuals who took part in the American National Health and Nutrition Examination Survey (NHANES) in 2003-4.
what is the sample size
random sample of 200 people from the 5000 who took the survey
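A minimal sketch of drawing a simple random sample of 200 from 5000, assuming every respondent is equally likely to be picked. The ID numbers and the fixed seed are hypothetical; these notes do not say how the real sample was selected.

```python
import random

POPULATION_SIZE = 5000   # people who took the survey
SAMPLE_SIZE = 200        # people in the large data set

rng = random.Random(1)   # fixed seed so the draw can be repeated (assumption)
respondent_ids = range(1, POPULATION_SIZE + 1)   # hypothetical ID numbers

# sample() picks without replacement, so nobody is chosen twice
sample_ids = rng.sample(respondent_ids, SAMPLE_SIZE)
```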
what is the age of the people in the sample / large data set
aged 16 and over
how is this data collected
combination of interview and physical examination to assess the health and nutritional status of adults and children in the United States
are there people under 16 in the full survey of 5000 people
likely yes - but their data was not collected in the LDS
is it really random then?
What does N/A mean
data was not available for that data point
Why is N/A used instead of leaving the cell blank
prevents some software from reading it as 0.
zero is a recorded value and cannot be used to represent no data being collected
Do we include data points with N/A in calculations - standard deviation / mean
No - we exclude that data point - therefore reduces the value of n
What does tr mean as a recorded value
Tr = trace amounts
Data is recorded but numerical value is so close to 0 it is negligible
Do we include data points with tr in calculations
Yes - we treat it as 0 in calculations
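The N/A and tr rules above can be sketched in code. The entries below are made up for illustration, not taken from the LDS, and the helper name `to_number` is hypothetical. `statistics.stdev` uses divisor n - 1; if your course uses divisor n, swap in `statistics.pstdev`.

```python
import statistics

def to_number(raw):
    """Return a float, or None when the entry must be excluded."""
    text = raw.strip().lower()
    if text == "n/a":
        return None   # no data available: exclude, so n goes down
    if text == "tr":
        return 0.0    # trace amount: treated as 0 in calculations
    return float(text)

raw_values = ["12", "N/A", "tr", "6"]   # made-up entries

# keep only the entries that can be used in calculations
usable = [v for v in map(to_number, raw_values) if v is not None]

n = len(usable)                  # 3, not 4, because N/A is excluded
mean = statistics.mean(usable)
sd = statistics.stdev(usable)    # divisor n - 1 (assumption, see above)
```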
What is cleaning data
Fixing / removing incorrect, corrupted, incorrectly formatted, duplicate or incomplete data
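A rough sketch of what cleaning could look like in code, assuming each record is a dictionary of text fields. The field names and values are hypothetical. It fixes stray spaces, removes exact duplicates and removes records with an N/A entry.

```python
def clean(records):
    """Return the records with formatting fixed and bad rows removed."""
    seen = set()
    cleaned = []
    for rec in records:
        # fix formatting: strip stray spaces from text fields
        rec = {k: v.strip() if isinstance(v, str) else v
               for k, v in rec.items()}
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue              # duplicate of an earlier record
        seen.add(key)
        if "N/A" in rec.values():
            continue              # incomplete record
        cleaned.append(rec)
    return cleaned

records = [   # made-up rows, not from the LDS
    {"id": 1, "height_cm": "170.2", "weight_kg": "65.0"},
    {"id": 1, "height_cm": " 170.2 ", "weight_kg": "65.0"},  # duplicate
    {"id": 2, "height_cm": "158.4", "weight_kg": "N/A"},     # incomplete
    {"id": 3, "height_cm": "181.0", "weight_kg": "80.5"},
]
cleaned = clean(records)
```

Whether to drop a whole record or only skip its N/A field depends on the calculation; dropping the record is just one choice.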
if you have to extrapolate the data at any point…
not reliable
Use your knowledge of the large data set to suggest two reasons why the sample data in the table may not be representative of the population
think about what's in the data set
both males + females
wider age range in population
Should outliers be removed
- Think…
- Is this BMI possible
- Is this pulse rate possible etc
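One way to check whether a BMI is possible is to recompute it from the same person's weight and height, using BMI = weight in kg divided by (height in m) squared. This is only a sketch: the function names, the tolerance and the example values are assumptions, not part of the LDS.

```python
def bmi(weight_kg, height_cm):
    """BMI = weight (kg) / height (m) squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

def bmi_is_consistent(recorded_bmi, weight_kg, height_cm, tolerance=0.1):
    """True if the recorded BMI matches the recomputed one."""
    # tolerance is an assumed allowance for rounding in the recorded value
    return abs(recorded_bmi - bmi(weight_kg, height_cm)) <= tolerance

# made-up person: 70 kg, 175 cm, so BMI is about 22.86
plausible = bmi_is_consistent(22.86, 70.0, 175.0)
suspicious = bmi_is_consistent(32.9, 70.0, 175.0)
```

A value that fails the check is a candidate for an error, not automatically one to delete.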
If PMCC is close to 1…
strong positive linear correlation - can be modelled as a straight line
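A short sketch of how the PMCC is calculated, using r = Sxy / sqrt(Sxx × Syy). The heights and weights are made up for illustration and are not values from the LDS.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

heights = [150.0, 160.0, 170.0, 180.0]   # made-up values in cm
weights = [50.0, 58.0, 69.0, 77.0]       # made-up values in kg

r = pmcc(heights, weights)   # close to 1, so a straight line fits well
```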
what are the categories in the LDS
sex
age
marital status
weight
height
BMI
upper leg length
upper arm length
waist circumference
food in last 30 minutes
pulse readings
lowest and highest weight
41.4kg
193.1kg
lowest and highest height
140.9 cm
193.8cm
oldest and youngest
17 and 85
which arm was used for blood pressure measurements
everyone was either right or n/a
highest and lowest pulse - beats in 60 seconds
44 to 128
highest and lowest bmi
16.54 and 62.77
what is systolic blood pressure and diastolic blood pressure
The systolic blood pressure is the pressure at the time when the heart beats.
The diastolic blood pressure is the pressure between heart beats.
how many measurements of blood pressure were taken
up to 4
systolic + diastolic averages - how are they taken
The first reading is ignored when taking the average, unless there is only one reading
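The averaging rule above could be sketched like this, assuming the list holds only the readings that were actually taken, in the order they were taken. The function name and the example readings are hypothetical.

```python
def average_reading(readings):
    """Average the readings, ignoring the first unless it is the only one."""
    if not readings:
        return None              # no readings taken
    if len(readings) == 1:
        return readings[0]       # only one reading, so it is used
    later = readings[1:]         # drop the first reading
    return sum(later) / len(later)

# made-up systolic readings
three_readings = average_reading([120, 116, 118])   # mean of 116 and 118
one_reading = average_reading([130])                # the single reading
```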
0 diastolic pressure
outlier - random error
N/A vs could not obtain for arm
N/A for arm - then all of the corresponding pulse readings were N/A
could not obtain for arm - then all of the corresponding pulse readings were recorded
n/a for arm - did not do that part of the exam at all
could not obtain for arm - did do that part of the exam but unsure which arm
why has a larger sample, more than 200, not been used
to ensure that dealing with missing data and copying into other software is not too time-consuming