Week 2 - data wrangling Flashcards by Megan Anderson

variables

characteristics that differ among individuals or other sampling units.

categorical variables

Categorical variables: qualitative characteristics of individuals that do not have magnitude on a numerical scale.
Nominal - no order
Ordinal - can be ordered (letter grades)

numerical variables

Numerical data: quantitative measurements that have magnitude on a numerical scale.
Continuous
Discrete

explanatory variable
response variable

Explanatory variable: independent variable

Response variable: dependent variable

frequency distribution

the number of times each value of a variable occurs in a sample.
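
A frequency distribution can be tabulated in a few lines. The sketch below is illustrative only: the `grades` sample is made-up data, and `collections.Counter` is just one convenient way to count values.

```python
from collections import Counter

# Hypothetical sample: letter grades of 10 students (made-up data).
grades = ["B", "A", "C", "B", "B", "A", "D", "C", "B", "A"]

# Count how many times each value of the variable occurs in the sample.
frequency = Counter(grades)

# Print the frequency table in alphabetical order of the values.
for value in sorted(frequency):
    print(value, frequency[value])
```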

probability distribution

distribution of a variable in the whole population.

normal distribution

a symmetric, bell-shaped probability distribution, described by its mean and standard deviation.

sample mean

Sample mean: the sum of all observations in a sample divided by n, the number of observations.

standard deviation

Standard deviation: a common measure of the spread of a distribution. It indicates how far the different measurements typically are from the mean.
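Calculated as the square root of the variance
About ⅔ (68%) of the data lie within 1 SD of the mean (for a roughly normal distribution)
About 95% lie within 2 SD of the mean (for a roughly normal distribution)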
Can be calculated from a frequency table
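
The points on this card can be checked numerically. The following is a minimal sketch, assuming a made-up sample of eight measurements; `freq_table` is the same sample rewritten as a frequency table, and the last part uses the standard library's `NormalDist` to look up the 1 SD and 2 SD shares for a normal distribution.

```python
import math
from statistics import NormalDist, stdev

# Hypothetical sample of 8 measurements (made-up data).
sample = [4, 8, 6, 5, 3, 7, 9, 6]
n = len(sample)

mean = sum(sample) / n
# Sample variance: sum of squared deviations from the mean, divided by n - 1.
variance = sum((y - mean) ** 2 for y in sample) / (n - 1)
# Standard deviation: square root of the variance.
sd = math.sqrt(variance)

# The same sample as a frequency table: value -> number of occurrences.
freq_table = {3: 1, 4: 1, 5: 1, 6: 2, 7: 1, 8: 1, 9: 1}
n_ft = sum(freq_table.values())
mean_ft = sum(value * count for value, count in freq_table.items()) / n_ft
variance_ft = sum(
    count * (value - mean_ft) ** 2 for value, count in freq_table.items()
) / (n_ft - 1)
sd_ft = math.sqrt(variance_ft)

# Share of a normal distribution lying within 1 and 2 SD of its mean.
standard_normal = NormalDist(mu=0, sigma=1)
within_1_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(mean, variance, sd, sd_ft, stdev(sample))
print(round(within_1_sd, 3), round(within_2_sd, 3))
```

Both routes (raw data and frequency table) give the same standard deviation, and the two normal-distribution shares come out at roughly 0.68 and 0.95.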

coefficient of variation

Coefficient of variation: the standard deviation expressed as a percentage of the mean.
Higher CV means more variability and lower means more consistency relative to the mean

median

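the middle measurement of a set of observations sorted from smallest to largest (with an even number of observations, the average of the two middle values)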

interquartile range

the difference between the third and first quartiles of the data. It is the span of the middle 50% of the data.

box plot

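Displays the median, the first and third quartiles (the edges of the box, whose length is the interquartile range), and the smallest and largest non-extreme values (the whiskers, which reach values not more than 1.5 x IQR from the box edge). More extreme values are plotted as individual points.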
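
The numbers a box plot is drawn from can be worked out by hand or in code. This is a sketch under stated assumptions: the data are made up, and the `"inclusive"` quartile method is one of several conventions, so other software may give slightly different quartiles for the same data.

```python
import statistics

# Hypothetical sample (made-up data) containing one extreme value.
data = [2, 4, 5, 6, 7, 8, 9, 11, 30]

# Quartile conventions differ between textbooks and software; the
# "inclusive" method is an assumption made for this sketch.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Values more than 1.5 x IQR beyond the box edges count as extreme.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the smallest and largest non-extreme observations.
non_extreme = [y for y in data if lower_fence <= y <= upper_fence]
whisker_low, whisker_high = min(non_extreme), max(non_extreme)
outliers = [y for y in data if y < lower_fence or y > upper_fence]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```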

how measures of location and spread compare

The mean is more sensitive to extreme values than the median (the middle value)
The standard deviation is more sensitive to extremes than the mean is. The IQR is not sensitive to extremes and is a better measure of spread when extremes are present
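
A quick way to see this is to swap one ordinary value for an extreme one and recompute each measure. The data and the helper `iqr` below are assumptions for illustration, and the quartiles use the `"inclusive"` convention.

```python
import statistics


def iqr(data):
    """Interquartile range using the 'inclusive' quartile convention."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1


# Hypothetical sample, and the same sample with its largest value made extreme.
clean = [2, 4, 5, 6, 7, 8, 9, 11, 12]
with_extreme = clean[:-1] + [120]

for label, data in (("clean", clean), ("with extreme", with_extreme)):
    print(
        label,
        "mean:", round(statistics.mean(data), 2),
        "median:", statistics.median(data),
        "sd:", round(statistics.stdev(data), 2),
        "iqr:", iqr(data),
    )
```

In this example the median and IQR are unchanged by the extreme value, while the mean and the standard deviation both increase.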

when does data become information?

when processed

descriptions of data

Location (central tendency)
Width (spread)

distributions

If the median, mean and mode are similar, the distribution is roughly symmetric and may be close to a normal distribution
If the median, mean and mode are not similar, the data are skewed

Why is sample standard deviation formula divided by n-1 not n?

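Dividing by n - 1 corrects a bias that occurs during the estimation process: deviations are measured from the sample mean rather than the true population mean, so dividing by n would underestimate the population variance on average (aside from this we don’t need to know all the details about why this happens). One degree of freedom is lost because the mean is estimated from the same sample.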

Why do we use squared deviations, (Yi - Ȳ)^2, to calculate variance?

Gets rid of negatives
Weights larger deviations more heavily

estimation

Estimation: the process of inferring a population parameter from sample data. The value of an estimate calculated from the data is almost never exactly the same as the true value (the parameter).
Use the standard error formula to see how much variability we should expect and whether we can trust our estimates
By repeatedly sampling the population we would eventually end up with the probability distribution of all the values of an estimate we might have obtained if we sampled the population this way (the sampling distribution)

standard error

an estimate of the standard deviation of the sampling distribution.
Predicts the sampling error of the estimate, so it indicates uncertainty and precision
In most cases we do not have the real sampling distribution, so we use the standard error of the mean estimated from a single sample: SE = s / sqrt(n), where s is the sample standard deviation

sampling distribution

Sampling distribution: theoretical distribution of an estimate.
The larger the sample size, the narrower the sampling distribution and the more precise the estimate
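
The effect of sample size can be simulated. The sketch below assumes a hypothetical normal population (mean 50, SD 10), draws many samples at two sample sizes, and measures the spread of the resulting sample means. The function name, the seed and the population values are all choices made for this example.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable


def sd_of_sample_means(n, replicates=2000, mu=50, sigma=10):
    """Draw many samples of size n and return the SD of their means."""
    means = [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(replicates)
    ]
    return statistics.stdev(means)


spread_small = sd_of_sample_means(n=5)
spread_large = sd_of_sample_means(n=50)

print("SD of sample means, n = 5: ", round(spread_small, 2))
print("SD of sample means, n = 50:", round(spread_large, 2))
```

The spread of the sample means is noticeably smaller for n = 50 than for n = 5, close to sigma / sqrt(n) in each case.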

standard error of the mean

Standard error of the mean: gives us an understanding of the likely difference between our sample mean and the true population mean.
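
As a worked example, the standard error of the mean can be estimated from one sample as the sample standard deviation divided by the square root of n. The sample below is made up, and the "mean plus or minus 2 SE" range is only the rough rule of thumb, not an exact confidence interval.

```python
import math
import statistics

# Hypothetical sample of 8 measurements (made-up data).
sample = [4, 8, 6, 5, 3, 7, 9, 6]
n = len(sample)

sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # divides by n - 1

# Standard error of the mean: sample SD divided by the square root of n.
standard_error = sample_sd / math.sqrt(n)

# Rough range likely to contain the population mean (rule of thumb).
lower = sample_mean - 2 * standard_error
upper = sample_mean + 2 * standard_error

print(sample_mean, round(standard_error, 3))
print(round(lower, 2), round(upper, 2))
```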

standard deviation and variance

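Variance is the square of the standard deviation, so it is larger than the standard deviation only when the standard deviation is greater than 1
Standard deviation has the same units as the data values whereas variance is in squared units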

summary

Estimation is the process of inferring a population parameter from sample data. All estimates have a sampling distribution, which is the probability distribution of all possible values of the estimate that might be obtained under random sampling with a given sample size. The standard error of an estimate is the standard deviation of its sampling distribution. It measures precision: the smaller the SE, the more precise the estimate. Fortunately, we can approximate the standard error from a single sample. As an approximation, 95% of the time the true parameter value will lie within plus or minus two SEs of the sample estimate. Standard errors (and standard error bars on graphs) indicate uncertainty about the sample estimates (mean, SD, median, etc.), NOT variability in the raw data.
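
The two-SE rule of thumb can be checked by simulation. The sketch below assumes a hypothetical normal population (mean 50, SD 10) and samples of size 30; the seed, sample size and number of replicates are arbitrary choices for this example. It counts how often the true mean falls within two standard errors of the sample mean.

```python
import math
import random
import statistics

rng = random.Random(2)  # fixed seed so the sketch is repeatable

TRUE_MEAN, TRUE_SD = 50, 10  # hypothetical population parameters
n, replicates = 30, 2000

hits = 0
for _ in range(replicates):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    estimate = statistics.fmean(sample)
    # Standard error estimated from this single sample.
    se = statistics.stdev(sample) / math.sqrt(n)
    # Does the true mean lie within 2 SE of the sample mean?
    if abs(estimate - TRUE_MEAN) <= 2 * se:
        hits += 1

coverage = hits / replicates
print("share of samples within 2 SE of the true mean:", coverage)
```

The share should come out close to 0.95, which is why the rule is described as an approximation.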

Week 2 - data wrangling Flashcards

(25 cards)