Presentation of Data Flashcards
what are the purposes of screening data
to detect blunders
locate outliers
determine distributional properties
determine number of missing values
how is a small data set screened?
by eye
how is a large data set screened
frequency table or histogram
what are the 2 main types of variables
categorical and quantitative
what are categorical variables
occur when each individual falls into a category
divided into nominal (no ordering, e.g. sex)
and ordinal (have an ordering, e.g. pain)
what is the frequency distribution
frequency of the occurrence of different values of a variable
what is relative frequency
frequency expressed as a proportion of the total frequency
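
A minimal Python sketch of a frequency table with relative frequencies. The function name `frequency_table` and the pain ratings are made up for illustration and are not from the cards.

```python
from collections import Counter

def frequency_table(values):
    """Map each distinct value to (frequency, relative frequency)."""
    counts = Counter(values)
    total = sum(counts.values())  # total frequency = number of observations
    return {value: (n, n / total) for value, n in counts.items()}

# Hypothetical ordinal data: pain ratings for five patients.
pain = ["mild", "severe", "mild", "moderate", "mild"]
table = frequency_table(pain)

for value, (freq, rel) in table.items():
    print(f"{value:10s} {freq:3d} {rel:6.2f}")
```

The relative frequencies always sum to 1, which is a quick check when screening a table by eye.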
how are interval scale variables graphically presented
histograms or box plots
what is a box plot
graphical display of the 5 point summary of the data: the minimum, 1st quartile, median, 3rd quartile and maximum values
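
A sketch of the five numbers a box plot displays, using Python's standard `statistics` module. Quartile conventions differ between textbooks and software; the `method="inclusive"` choice here is an assumption, and the helper name `five_number_summary` and the data are illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, 1st quartile, median, 3rd quartile, maximum)."""
    # n=4 splits the data into quarters and returns the three cut points.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Hypothetical sample; the order of the observations does not matter.
sample = [7, 1, 9, 3, 5, 2, 8, 4, 6]
summary = five_number_summary(sample)
print(summary)
```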
what do summary statistics attempt to capture
a typical value (the location) and the spread (or dispersion)
what 2 measurements are used for location
mean and median
what is mean
sum of all the observations divided by the total number of observations
what is the median
middle value when a sample is arranged in increasing order. approx 50% of the sample is less than the median and approx 50% is greater than the median
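
The two location measures can be sketched directly from their definitions. Averaging the two central values for an even-sized sample is a common convention assumed here, since the cards only describe the "middle value"; the function names and data are illustrative.

```python
def sample_mean(values):
    """Sum of all the observations divided by the number of observations."""
    return sum(values) / len(values)

def sample_median(values):
    """Middle value of the sample once arranged in increasing order."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Assumed convention: average the two central values when n is even.
    return (ordered[mid - 1] + ordered[mid]) / 2

sample = [7, 3, 9, 5, 1]
print(sample_mean(sample), sample_median(sample))
```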
what summary statistics measure the spread
range, interquartile range, variance, standard deviation, coefficient of variation
what is the range
difference between the largest and smallest observations in the sample - not recommended as it is severely affected by outlying observations
what is the interquartile range
difference between the 3rd and 1st quartiles
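
A small sketch comparing the range and the interquartile range on a sample with and without one outlying observation. The data and function names are made up, and the `method="inclusive"` quartile convention is an assumption.

```python
import statistics

def value_range(values):
    """Difference between the largest and smallest observations."""
    return max(values) - min(values)

def interquartile_range(values):
    """Difference between the 3rd and 1st quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # largest value replaced by an outlier

print(value_range(clean), interquartile_range(clean))
print(value_range(with_outlier), interquartile_range(with_outlier))
```

In this example the single outlier inflates the range, while the interquartile range stays the same.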
what is variance (s²)
sum of the squared distances of each value from the mean, divided by (the number of values - 1)
what is the standard deviation
the square root of the variance. used in preference to variance as it is in the original scale of measurement
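
A sketch of the sample variance and standard deviation computed from their definitions, with the divisor being the number of values minus 1. The function names and data are illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared distances from the mean, divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n
    squared_distances = sum((x - mean) ** 2 for x in values)
    return squared_distances / (n - 1)

def sample_sd(values):
    """Square root of the sample variance, in the original units."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared distances sum to 32
variance = sample_variance(data)
sd = sample_sd(data)
print(variance, sd)
```

Python's built-in `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and can be used to check the result.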
what is the coefficient of variation defined by
c = (s / x̄) × 100%, where s is the standard deviation and x̄ is the mean
provides a measure of variation which is independent of the unit of measurement and hence can be used to compare the variation of variables measured on different scales
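
A sketch showing that the coefficient of variation does not change when the unit of measurement changes. The heights are hypothetical, and the function name is illustrative.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean * 100

heights_cm = [160, 170, 180]    # hypothetical heights in centimetres
heights_m = [1.60, 1.70, 1.80]  # the same heights in metres

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
print(round(cv_cm, 2), round(cv_m, 2))
```

The standard deviations differ by a factor of 100 between the two units, but the two percentages agree up to floating-point rounding.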
what summary statistics are used if the distribution is roughly symmetrical
mean and std deviation
what summary stats are used if the distribution is skewed
median and interquartile range