Week 2 - data wrangling Flashcards
variables
characteristics that differ among individuals or other sampling units.
categorical variables
Categorical variables: qualitative characteristics of individuals that do not have magnitude on a numerical scale.
Nominal - no order
Ordinal - can be ordered (letter grades)
numerical variables
Numerical data: quantitative measurements that have magnitude on a numerical scale.
Continuous
Discrete
explanatory variable
response variable
Explanatory variable: independent variable
Response variable: dependent variable
frequency distribution
the number of times each value of a variable occurs in a sample.
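A frequency distribution can be tallied in a few lines. The sketch below uses a made-up sample of letter grades (the data and variable names are assumptions for illustration) and also converts the counts to proportions.

```python
from collections import Counter

# Hypothetical sample: letter grades for 10 students
grades = ["B", "A", "C", "B", "B", "A", "D", "C", "B", "A"]

# Frequency distribution: how many times each value occurs in the sample
frequency = Counter(grades)

# Relative frequency: the proportion of the sample taking each value
n = len(grades)
relative_frequency = {value: count / n for value, count in frequency.items()}

for value in sorted(frequency):
    print(value, frequency[value], relative_frequency[value])
```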
probability distribution
distribution of a variable in the whole population.
normal distribution
bell-shaped curve.
sample mean
Sample mean: the sum of all observations in a sample divided by n, the number of observations.
standard deviation
Standard deviation: a common measure of the spread of a distribution. It indicates how far the different measurements typically are from the mean.
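Calculated as the square root of the variance
For a roughly normal distribution, about ⅔ of the data lie within 1 sd of the mean
About 95% lie within 2 sd of the mean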
Can be calculated from a frequency table
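The mean, variance and standard deviation described above can be worked through by hand. This is a minimal sketch on made-up measurements (the data are an assumption), computed once from the raw values and once from a frequency table to show that both routes agree.

```python
import math
import statistics
from collections import Counter

# Hypothetical measurements
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Sample mean: sum of all observations divided by n
mean = sum(data) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1
variance = sum((y - mean) ** 2 for y in data) / (n - 1)

# Standard deviation: square root of the variance
sd = math.sqrt(variance)

# The same quantities from a frequency table (value -> count)
table = Counter(data)
n_ft = sum(table.values())
mean_ft = sum(value * count for value, count in table.items()) / n_ft
variance_ft = sum(count * (value - mean_ft) ** 2
                  for value, count in table.items()) / (n_ft - 1)
sd_ft = math.sqrt(variance_ft)

# Cross-check against the standard library
print(mean, sd, statistics.stdev(data), sd_ft)
```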
coefficient of variation
Coefficient of variation: the standard deviation expressed as a percentage of the mean.
Higher CV means more variability and lower means more consistency relative to the mean
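To see why the CV is useful, the sketch below compares two made-up samples (hypothetical body masses in grams) that have the same standard deviation but very different means. The function name is an assumption for illustration.

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical body masses (g): same absolute spread, different means
mice = [18, 20, 22]
rats = [298, 300, 302]

cv_mice = coefficient_of_variation(mice)
cv_rats = coefficient_of_variation(rats)

# Both samples have sd = 2 g, but the mice vary more relative to their mean
print(cv_mice, cv_rats)
```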
median
the middle measurement of a set of observations when they are sorted from smallest to largest
interquartile range
the difference between the third and first quartiles of the data. It is the span of the middle 50% of the data.
box plot
Displays the median, the first and third quartiles (the box, whose length is the interquartile range), and the smallest and largest non-extreme values (those no more than 1.5 × IQR from the box edge).
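The numbers a box plot displays can be computed directly. This is a sketch on made-up data. Quartile conventions differ between textbooks and software, so the "inclusive" method used here is an assumption and other methods may give slightly different quartiles.

```python
import statistics

# Hypothetical sorted measurements with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

median = statistics.median(data)

# First and third quartiles (the edges of the box)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Values more than 1.5 x IQR beyond the box edges count as extreme
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

non_extreme = [y for y in data if lower_fence <= y <= upper_fence]
outliers = [y for y in data if y < lower_fence or y > upper_fence]

# Whiskers reach the smallest and largest non-extreme values
whisker_low = min(non_extreme)
whisker_high = max(non_extreme)

print(median, q1, q3, iqr, whisker_low, whisker_high, outliers)
```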
how measures of location and spread compare
Mean is more sensitive to extremes than median (middle)
Standard deviation is more sensitive to extremes than the interquartile range. The IQR is barely affected by extremes and is a better measure of spread when they are present
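The comparison above can be checked by replacing one value in a made-up sample with an extreme one and recomputing each measure. The data, the helper name, and the "inclusive" quartile method are assumptions for illustration.

```python
import statistics

def summarize(values):
    """Return measures of location and spread for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical sample, then the same sample with its largest value made extreme
typical = [10, 11, 12, 13, 14, 15, 16]
with_extreme = typical[:-1] + [100]

before = summarize(typical)
after = summarize(with_extreme)

# Mean and sd shift a lot; median and IQR stay the same here
for name in before:
    print(name, before[name], after[name])
```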
when does data become information?
when processed
descriptions of data
Location (central tendency)
Width (spread)
distributions
If the median, mean and mode are similar, the distribution is roughly symmetric and may be close to a normal distribution
If the median, mean and mode are not similar, the data are skewed
Why is sample standard deviation formula divided by n-1 not n?
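Dividing by n would underestimate the population variance on average, because deviations are measured from the sample mean rather than the true mean. Dividing by n-1 corrects this bias (aside from this we don’t need to know all the details about why this happens). One degree of freedom is lost by estimating the mean.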
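The bias can be seen in a small simulation. This sketch assumes a normal population with a known variance of 4 (a made-up setup) and averages both versions of the variance formula over many small samples.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

population_sd = 2.0                 # assumed population: normal, sd 2
population_variance = population_sd ** 2
n = 5                               # small samples make the bias easy to see
trials = 20000

sum_div_n = 0.0
sum_div_n_minus_1 = 0.0
for _ in range(trials):
    sample = [random.gauss(0, population_sd) for _ in range(n)]
    ybar = sum(sample) / n
    sum_of_squares = sum((y - ybar) ** 2 for y in sample)
    sum_div_n += sum_of_squares / n
    sum_div_n_minus_1 += sum_of_squares / (n - 1)

# Long-run averages of the two estimators
avg_div_n = sum_div_n / trials
avg_div_n_minus_1 = sum_div_n_minus_1 / trials

print(population_variance, avg_div_n, avg_div_n_minus_1)
```

Dividing by n should average close to (n-1)/n times the true variance, while dividing by n-1 should average close to the true variance itself.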
Why do we use squared deviation in order to calculate variance (Yi - Y bar)^2?
Gets rid of negatives
Weights larger deviations more heavily
estimation
Estimation: the process of inferring a population parameter from sample data. The value of an estimate calculated from the data is almost never exactly the same as the true value (the parameter).
Use standard error formula to see what variability we should expect and if we can trust our estimates
By repeatedly sampling the population and calculating the estimate each time, we eventually end up with the probability distribution of all the values of the estimate we might have obtained if we sampled the population this way
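Repeated sampling is easy to mimic on a computer. The sketch below builds a made-up population (an assumption, since real populations are rarely available in full), draws many random samples at two sample sizes, and compares the spread of the resulting sample means.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 measurements
population = [random.gauss(50, 10) for _ in range(10000)]

def sample_means(sample_size, repeats=2000):
    """Draw many random samples and return the mean of each one."""
    return [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]

means_n10 = sample_means(10)
means_n100 = sample_means(100)

# Spread of the simulated sampling distributions of the mean
spread_n10 = statistics.stdev(means_n10)
spread_n100 = statistics.stdev(means_n100)

print(statistics.mean(population), spread_n10, spread_n100)
```

The means from the larger samples should cluster more tightly around the population mean than those from the smaller samples.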
standard error
an estimate of the standard deviation of the sampling distribution.
Predicts the sampling error of the estimate. It can indicate uncertainty and precision
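In most cases we do not have the real sampling distribution, so we estimate its standard deviation from a single sample using the standard error of the mean (the sample standard deviation divided by the square root of n)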
sampling distribution
Sampling distribution: theoretical distribution of an estimate.
The larger the sample size, the narrower the sampling distribution and the more precise the estimate
standard error of the mean
Standard error of the mean: gives us understanding of the likely difference between our sample mean and the true population mean.
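The standard error of the mean is usually computed as the sample standard deviation divided by the square root of n. A minimal sketch on made-up data (the sample and the function name are assumptions):

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Sample standard deviation divided by the square root of n."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Hypothetical sample of 8 measurements
sample = [2, 4, 4, 4, 5, 5, 7, 9]

sample_mean = statistics.mean(sample)
se = standard_error_of_mean(sample)

print(sample_mean, se)
```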
standard deviation and variance
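Variance is the square of the standard deviation, so it is larger than the standard deviation only when the standard deviation is greater than 1
Standard deviation has the same units as the data values, whereas variance is in squared units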
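A quick way to see the role of units is to express the same made-up heights in metres and in centimetres (the data are an assumption). The standard deviation scales with the unit, while the variance scales with the unit squared, so which of the two is numerically larger depends on the unit chosen.

```python
import statistics

# Hypothetical heights, first in metres, then converted to centimetres
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
heights_cm = [100 * h for h in heights_m]

sd_m = statistics.stdev(heights_m)        # units: m
var_m = statistics.variance(heights_m)    # units: m squared

sd_cm = statistics.stdev(heights_cm)      # units: cm
var_cm = statistics.variance(heights_cm)  # units: cm squared

# In metres the variance is smaller than the sd; in centimetres it is larger
print(sd_m, var_m)
print(sd_cm, var_cm)
```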
summary
Estimation is the process of inferring a population parameter from sample data.
All estimates have a sampling distribution, which is the probability distribution of all possible values of the estimate that might be obtained under random sampling with a given sample size.
The standard error of an estimate is the standard deviation of its sampling distribution. It measures precision. The smaller the SE, the more precise the estimate. Fortunately, we can approximate the standard error from a single sample.
As an approximation, 95% of the time the true parameter value will lie within two SEs of the sample estimate (the estimate plus or minus two SEs).
Standard errors (and standard error bars on a graph) indicate uncertainty about sample estimates (mean, sd, median, etc.), NOT variability in the raw data.
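The two-SE rule of thumb from the summary can be checked by simulation. This sketch assumes a normal population with a known mean (a made-up setup), draws many samples, and counts how often the interval from the estimate minus two SEs to the estimate plus two SEs contains the true mean.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

true_mean = 50.0   # assumed population mean
true_sd = 10.0     # assumed population standard deviation
n = 30
trials = 5000

covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    estimate = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    # Does the estimate plus or minus two SEs contain the true mean?
    if estimate - 2 * se <= true_mean <= estimate + 2 * se:
        covered += 1

coverage = covered / trials
print(coverage)
```

The proportion should come out close to 95%, which is why the rule is described as an approximation rather than an exact guarantee.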