Week 2 - data wrangling Flashcards

1
Q

variables

A

characteristics that differ among individuals or other sampling units.

2
Q

categorical variables

A

Categorical variables: qualitative characteristics of individuals that do not have magnitude on a numerical scale.
Nominal - no order
Ordinal - can be ordered (letter grades)

3
Q

numerical variables

A

Numerical data: quantitative measurements that have magnitude on a numerical scale.
Continuous
Discrete

4
Q

explanatory variable
response variable

A

Explanatory variable: independent variable

Response variable: dependent variable

5
Q

frequency distribution

A

the number of times each value of a variable occurs in a sample.
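
A minimal sketch of tabulating a frequency distribution. The `grades` list is hypothetical data made up for illustration, and `Counter` is just one convenient way to do the counting:

```python
from collections import Counter

# Hypothetical sample: letter grades for ten students
grades = ["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"]

# Frequency: how many times each value occurs in the sample
frequency = Counter(grades)

# Relative frequency: the proportion of the sample taking each value
relative = {value: count / len(grades) for value, count in frequency.items()}

for value in sorted(frequency):
    print(value, frequency[value], relative[value])
```

The counts always add up to the sample size, and the relative frequencies add up to 1.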

6
Q

probability distribution

A

distribution of a variable in the whole population.

7
Q

normal distribution

A

a symmetric, bell-shaped probability distribution, described by its mean and standard deviation.

8
Q

sample mean

A

Sample mean: the sum of all observations in a sample divided by n, the number of observations.
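
Written as a formula, using the Yi and Y bar notation that appears in a later card (the index i = 1, ..., n is an assumed labelling of the observations):

```latex
\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i
```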

9
Q

standard deviation

A

Standard deviation: a common measure of the spread of a distribution. It indicates how far the different measurements typically are from the mean.
Calculated from variance
About ⅔ of the data lie within 1 SD of the mean (for roughly normal data)
About 95% lie within 2 SD of the mean
Can be calculated from a frequency table
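
A minimal sketch of the calculation, both from raw observations and from a frequency table. The function names and the data are hypothetical, chosen only so that the two routes can be compared:

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)

def sample_sd(values):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))

def sd_from_frequency_table(table):
    """table maps each observed value to the number of times it occurs."""
    n = sum(table.values())
    mean = sum(value * count for value, count in table.items()) / n
    # Each squared deviation is weighted by how often the value occurs
    ss = sum(count * (value - mean) ** 2 for value, count in table.items())
    return math.sqrt(ss / (n - 1))

raw = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical observations
table = {2: 1, 4: 3, 5: 2, 7: 1, 9: 1}    # the same data as a frequency table

print(sample_sd(raw), sd_from_frequency_table(table))
```

Both routes give the same answer because the frequency table holds exactly the same information as the raw list.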

10
Q

coefficient of variation

A

Coefficient of variation: the standard deviation expressed as a percentage of the mean.
Higher CV means more variability and lower means more consistency relative to the mean
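
A short sketch of why the CV is useful when comparing groups measured on very different scales. The two data sets are invented for illustration and are not real measurements:

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical body masses (made-up numbers)
elephant_mass_kg = [4800, 5200, 5000, 5100, 4900]
mouse_mass_g = [18, 22, 20, 21, 19]

cv_elephant = coefficient_of_variation(elephant_mass_kg)
cv_mouse = coefficient_of_variation(mouse_mass_g)

print(f"elephant CV = {cv_elephant:.1f}%, mouse CV = {cv_mouse:.1f}%")
```

In this made-up example the elephants have a far larger standard deviation in absolute terms, yet the mice are more variable relative to their mean.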

11
Q

median

A

the middle measurement of a set of observations

12
Q

interquartile range

A

the difference between the third and first quartiles of the data. It spans the middle 50% of the data.

13
Q

box plot

A

Displays the median, the first and third quartiles (the box, whose length is the interquartile range), and the smallest and largest non-extreme values (those not more than 1.5 × IQR from the box edge).
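
A minimal sketch of the numbers behind a box plot. The data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption: different textbooks and software packages compute quartiles slightly differently.

```python
import statistics

data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # hypothetical observations

# Quartiles under one common convention (an assumption; others exist)
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Values more than 1.5 x IQR beyond the box edges count as extreme
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

non_extreme = [y for y in data if lower_fence <= y <= upper_fence]
extreme = [y for y in data if y < lower_fence or y > upper_fence]

# Whiskers reach the smallest and largest non-extreme values
whisker_low, whisker_high = min(non_extreme), max(non_extreme)

print(q1, median, q3, iqr, whisker_low, whisker_high, extreme)
```

Here the value 30 falls outside the upper fence, so the upper whisker stops at 9 and 30 is treated as an extreme value.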

14
Q

how measures of location and spread compare

A

Mean is more sensitive to extremes than median (middle)
Standard deviation is even more sensitive to extremes than the mean, because deviations are squared. The IQR is not sensitive to extremes and is a better measure of spread when there are extremes
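
The comparison can be checked on a small example. This is a sketch with made-up numbers, and the `iqr` helper uses one of several possible quartile conventions:

```python
import statistics

def iqr(values):
    """Interquartile range under the 'inclusive' quartile convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

without_extreme = [10, 11, 12, 13, 14]   # hypothetical data
with_extreme = [10, 11, 12, 13, 140]     # same data, one extreme value

for label, values in [("without", without_extreme), ("with", with_extreme)]:
    print(
        label,
        "mean:", statistics.mean(values),
        "median:", statistics.median(values),
        "sd:", round(statistics.stdev(values), 2),
        "iqr:", iqr(values),
    )
```

The median and IQR are identical for the two lists, while the mean and especially the standard deviation change substantially.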

15
Q

when does data become information?

A

when processed

16
Q

descriptions of data

A

Location (central tendency)
Width (spread)

17
Q

distributions

A

If the median, mean and mode are similar, the distribution is roughly symmetric, as a normal distribution is
If the median, mean and mode are not similar, the data are skewed

18
Q

Why is sample standard deviation formula divided by n-1 not n?

A

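Dividing by n - 1 corrects a bias in the estimate: deviations are measured from the sample mean rather than the true population mean, so dividing by n would tend to underestimate the spread (aside from this we don't need to know all the details about why this happens). Estimating the mean uses up one degree of freedom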
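
The bias can be seen exactly on a tiny example. This is a sketch under stated assumptions: a hypothetical population of four values, and every possible sample of size 2 drawn with replacement, each equally likely.

```python
from itertools import product

population = [1, 2, 3, 4]  # hypothetical population
mu = sum(population) / len(population)
population_variance = sum((y - mu) ** 2 for y in population) / len(population)

def variance(sample, divisor):
    """Sum of squared deviations from the sample mean, over the given divisor."""
    mean = sum(sample) / len(sample)
    return sum((y - mean) ** 2 for y in sample) / divisor

n = 2
samples = list(product(population, repeat=n))  # all 16 samples with replacement

# Average each version of the estimate over every possible sample
avg_div_n = sum(variance(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(population_variance, avg_div_n, avg_div_n_minus_1)
```

Averaged over all samples, dividing by n gives only half the true variance here, while dividing by n - 1 recovers it.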

19
Q

Why do we use squared deviation in order to calculate variance (Yi - Y bar)^2?

A

Gets rid of negatives
Weights larger deviations more heavily
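
A short derivation of the first point: without squaring, the deviations from the sample mean always cancel out, so their sum carries no information about spread. The index i = 1, ..., n is an assumed labelling, and s^2 denotes the sample variance:

```latex
\sum_{i=1}^{n} (Y_i - \bar{Y})
  = \sum_{i=1}^{n} Y_i - n\bar{Y}
  = n\bar{Y} - n\bar{Y}
  = 0,
\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2
```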

20
Q

estimation

A

Estimation: the process of inferring a population parameter from sample data. The value of an estimate calculated from the data is almost never exactly the same as the true value (the parameter).
Use the standard error to see how much variability we should expect and how far we can trust our estimates
By repeatedly sampling the population we eventually end up with the probability distribution of all the values of an estimate we might have obtained if we sampled the population this way

21
Q

standard error

A

the standard deviation of an estimate's sampling distribution.
It reflects the sampling error we should expect in the estimate, so it indicates uncertainty and precision
In most cases we do not have the real sampling distribution, so we use the standard error of the mean, estimated from a single sample

22
Q

sampling distribution

A

Sampling distribution: theoretical distribution of an estimate.
The larger the sample size, the narrower the sampling distribution and the more precise the estimate
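
A simulation sketch of this idea. Everything here is an assumption made for illustration: the population is taken to be normal with mean 50 and SD 10, the sample sizes are 10 and 100, and the seed is fixed only so that the run is repeatable.

```python
import random
import statistics

def simulate_sample_means(sample_size, repeats=2000, seed=1):
    """Draw many samples of one size and record the mean of each."""
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        sample = [rng.gauss(50, 10) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return means

# The spread of the sample means approximates the width of the sampling distribution
spread_by_n = {n: statistics.stdev(simulate_sample_means(n)) for n in (10, 100)}

for n, spread in spread_by_n.items():
    print(f"n={n}: SD of sample means = {spread:.2f}")
```

The sample means from the larger samples cluster more tightly around the population mean, which is what a narrower sampling distribution looks like.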

23
Q

standard error of the mean

A

Standard error of the mean: gives us an understanding of the likely difference between our sample mean and the true population mean.
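
The usual formula for estimating it from a single sample, where s is the sample standard deviation and n is the sample size (the subscript notation is an assumption, not taken from these cards):

```latex
\mathrm{SE}_{\bar{Y}} = \frac{s}{\sqrt{n}}
```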

24
Q

standard deviation and variance

A

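Variance is the square of the standard deviation, so it is larger than the standard deviation only when the standard deviation is greater than 1
Standard deviation has the same units as the data values, whereas variance is in squared units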
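
A quick check on made-up numbers, showing that which of the two is larger depends on the scale of the data:

```python
import statistics

small_scale = [0.1, 0.2, 0.3, 0.4, 0.5]   # hypothetical data, e.g. in metres
large_scale = [10, 20, 30, 40, 50]        # the same data expressed in centimetres

var_small, sd_small = statistics.variance(small_scale), statistics.stdev(small_scale)
var_large, sd_large = statistics.variance(large_scale), statistics.stdev(large_scale)

print(var_small, sd_small)   # variance is the smaller number here
print(var_large, sd_large)   # variance is the larger number here
```

Changing the units multiplies the standard deviation by 100 but the variance by 100 squared, which is why variance is not on the same scale as the data.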

25
Q

summary

A

Estimation is the process of inferring a population parameter from sample data.
All estimates have a sampling distribution, which is the probability distribution of all possible values of the estimate that might be obtained under random sampling with a given sample size.
The standard error of an estimate is the standard deviation of its sampling distribution. It measures precision. The smaller the SE, the more precise the estimate. Fortunately, we can approximate the SE from a single sample.
As an approximation, 95% of the time the true parameter value will lie within plus or minus two SEs of the sample estimate.
Standard errors (and standard error bars on graphs) indicate uncertainty about the sample estimates (mean, SD, median, etc.), NOT variability in the raw data.
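
A minimal sketch that pulls the summary together for a sample mean. The function name and the wing-length data are hypothetical, and the plus or minus two SE interval is the rough rule stated above, not an exact confidence interval.

```python
import math
import statistics

def mean_with_approx_95_interval(values):
    """Return the sample mean, its estimated SE, and mean +/- 2 SE."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # SE of the mean from one sample
    return mean, se, (mean - 2 * se, mean + 2 * se)

# Hypothetical wing lengths in mm (made-up numbers)
wing_lengths_mm = [36, 37, 38, 38, 39, 39, 40, 40, 40, 41, 41, 42, 42, 43, 44, 45]

mean, se, (low, high) = mean_with_approx_95_interval(wing_lengths_mm)
print(f"mean = {mean:.2f}, SE = {se:.2f}, approx. 95% interval = ({low:.2f}, {high:.2f})")
```

The interval describes uncertainty about the mean; the standard deviation of the raw data is larger than the SE by a factor of the square root of n.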