Week 2 - data wrangling Flashcards
variables
characteristics that differ among individuals or other sampling units.
categorical variables
Categorical variables: qualitative characteristics of individuals that do not have magnitude on a numerical scale.
Nominal - no order
Ordinal - can be ordered (letter grades)
numerical variables
Numerical data: quantitative measurements that have magnitude on a numerical scale.
Continuous
Discrete
explanatory variable
response variable
Explanatory variable: the independent variable, used to predict or explain the response.
Response variable: the dependent variable, the outcome being predicted or explained.
frequency distribution
the number of times each value of a variable occurs in a sample.
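A minimal sketch of building a frequency distribution, assuming a made-up sample of letter grades (the data and the variable names are illustrative only):

```python
from collections import Counter

# Hypothetical sample: letter grades for ten students.
grades = ["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"]

# Frequency distribution: how many times each value occurs in the sample.
frequency = Counter(grades)

# Relative frequency: each count as a proportion of the sample size.
n = len(grades)
relative_frequency = {value: count / n for value, count in frequency.items()}

for value in sorted(frequency):
    print(value, frequency[value], relative_frequency[value])
```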
probability distribution
distribution of a variable in the whole population.
normal distribution
a symmetric, bell-shaped probability distribution centered on the mean.
sample mean
Sample mean: the sum of all observations in a sample divided by n, the number of observations.
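Written as a formula, with the symbols assumed here for illustration (Y_i for the i-th observation, n for the number of observations):

```latex
\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i
```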
standard deviation
Standard deviation: a common measure of the spread of a distribution. It indicates how far the different measurements typically are from the mean.
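Calculated as the square root of the variance
For a roughly normal distribution, about ⅔ of the data lie within 1 sd of the mean
For a roughly normal distribution, about 95% lie within 2 sd of the mean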
Can be calculated from a frequency table
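A short sketch of both routes to the standard deviation, from raw data and from a frequency table. The sample values are made up, and the n - 1 divisor (the usual sample variance) is an assumption about which variance the card means:

```python
import math
import statistics

# Hypothetical sample of measurements.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)
mean = sum(sample) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
variance = sum((y - mean) ** 2 for y in sample) / (n - 1)

# Standard deviation: the square root of the variance.
sd = math.sqrt(variance)

# The same calculation from a frequency table of {value: count}.
freq_table = {2: 1, 4: 3, 5: 2, 7: 1, 9: 1}
n_ft = sum(freq_table.values())
mean_ft = sum(value * count for value, count in freq_table.items()) / n_ft
variance_ft = sum(
    count * (value - mean_ft) ** 2 for value, count in freq_table.items()
) / (n_ft - 1)
sd_ft = math.sqrt(variance_ft)

# Cross-check against the standard library.
print(sd, sd_ft, statistics.stdev(sample))
```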
coefficient of variation
Coefficient of variation: the standard deviation expressed as a percentage of the mean.
Higher CV means more variability and lower means more consistency relative to the mean
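A small sketch of the comparison, assuming two made-up samples with the same standard deviation but different means (`coefficient_of_variation` is a hypothetical helper, not a library function):

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical data: both samples have a standard deviation of 1.
small = [9, 10, 11]
large = [99, 100, 101]

cv_small = coefficient_of_variation(small)
cv_large = coefficient_of_variation(large)

# The sample with the smaller mean is more variable relative to its mean.
print(cv_small, cv_large)
```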
median
the middle measurement of a set of observations sorted from smallest to largest
interquartile range
the difference between the third and first quartiles of the data. It is the span of the middle 50% of the data.
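A sketch of the median and interquartile range on made-up data. Quartile conventions differ between textbooks and software; the "inclusive" method of `statistics.quantiles` is just one choice, assumed here:

```python
import statistics

# Hypothetical sample.
data = [1, 3, 4, 6, 7, 9, 12, 15, 20]

# Median: the middle value of the sorted observations.
median = statistics.median(data)

# Quartiles: the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Interquartile range: third quartile minus first quartile.
iqr = q3 - q1

print(median, q1, q3, iqr)
```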
box plot
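Displays the median, the first and third quartiles (the box, whose length is the interquartile range), and the smallest and largest non-extreme values (those not more than 1.5x IQR from the box edge). Extreme values beyond that are plotted as individual points.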
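A sketch of the numbers behind a box plot, using made-up data and assuming the "inclusive" quartile convention (plotting libraries may compute quartiles slightly differently):

```python
import statistics

# Hypothetical sample containing one extreme value.
data = sorted([1, 3, 4, 6, 7, 9, 12, 15, 40])

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: 1.5 x IQR beyond each edge of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the smallest and largest values inside the fences.
non_extreme = [y for y in data if lower_fence <= y <= upper_fence]
lower_whisker = min(non_extreme)
upper_whisker = max(non_extreme)

# Values outside the fences are the extreme values drawn as separate points.
outliers = [y for y in data if y < lower_fence or y > upper_fence]

print(q1, median, q3, lower_whisker, upper_whisker, outliers)
```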
how measures of location and spread compare
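The mean is more sensitive to extremes than the median (the middle value).
The standard deviation is more sensitive to extremes than the mean, because it is built from squared deviations. The IQR is not sensitive to extremes and is a better measure of spread when extremes are present.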
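A sketch of the comparison above: one value in a made-up sample is replaced by an extreme one, and each summary is recomputed. The `summarize` helper and the "inclusive" quartile method are assumptions for illustration:

```python
import statistics


def summarize(values):
    """Return mean, median, standard deviation, and IQR for a sample."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }


# Hypothetical sample, and the same sample with its largest value made extreme.
original = [1, 3, 4, 6, 7, 9, 12, 15, 20]
with_extreme = [1, 3, 4, 6, 7, 9, 12, 15, 200]

before = summarize(original)
after = summarize(with_extreme)

# The mean and sd shift; the median and IQR stay the same for this sample.
for name in before:
    print(name, before[name], after[name])
```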
when does data become information?
when processed