Chapter 3 (Summarizing Distributions) Flashcards

1
Q

Mode

A

the value of a variable that occurs most frequently

2
Q

Central tendency

A

values that are central in the distribution of a variable; describes what is typical

3
Q

Variation

A

describes how dispersed the data is over the range of possible values and what is atypical

4
Q

When a curve is bell-shaped (normal distribution), where do the mean, median, and mode lie?

A

They are all equal and lie in the middle of the distribution

5
Q

Sample (arithmetic) mean or average

A

most common measure of centrality; applies only to data where adding and dividing the values makes sense (numeric, not nominal); minimizes the sum of squared deviations (if the mean were replaced with any other number, the variance computed around that number would increase)

6
Q

Sample mean formula

A

the sum of all values xi from i=1 to n, divided by the number of observations or sample size (n): xbar = (x1 + x2 + ... + xn) / n
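
The formula can be sketched in a few lines of Python. The data and the helper name `sample_mean` are made up for illustration; the standard library's `statistics.mean` is used only as a cross-check.

```python
from statistics import mean

def sample_mean(values):
    """Sum the observations and divide by the sample size n."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(values) / n

scores = [2, 4, 4, 5, 10]  # hypothetical sample, n = 5
xbar = sample_mean(scores)

print(xbar)          # 5.0
print(mean(scores))  # the standard library agrees
```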

7
Q

Weighted average

A

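an average in which each observation is multiplied by a weight before summing; the sample mean is the special case where each observation gets a weight of 1/n, the proportion of the sample that it represents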
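
A minimal sketch of the idea, assuming weights that need not sum to 1 (the function divides by their total). The name `weighted_average` and the data are illustrative only.

```python
def weighted_average(values, weights):
    """Multiply each value by its weight, sum, and divide by the total weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total_weight

values = [2, 4, 4, 5, 10]    # hypothetical sample
n = len(values)
equal_weights = [1 / n] * n  # every observation represents 1/n of the sample

# With equal weights of 1/n this matches the ordinary sample mean (up to rounding).
print(weighted_average(values, equal_weights))
print(sum(values) / n)
```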

9
Q

Dummy or binary variable

A

a qualitative variable that indicates the presence or absence of an attribute; must be coded as 1=present and 0=absent; also has a mean despite being qualitative

10
Q

Mean of a dummy variable

A

the proportion of the sample with the associated attribute

11
Q

How do you describe central tendency for qualitative variables?

A

(1) Create a dummy variable for each level of the qualitative variable (2) Summarize the mean of the dummy variable
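
The two steps can be sketched as follows. The variable `colors` and its values are hypothetical; each level gets its own 0/1 dummy, and the mean of that dummy is the proportion of the sample at that level.

```python
colors = ["red", "blue", "red", "green", "red", "blue", "red", "green"]  # hypothetical data

proportions = {}
for level in sorted(set(colors)):
    # Step 1: dummy variable, 1 = attribute present, 0 = absent
    dummy = [1 if c == level else 0 for c in colors]
    # Step 2: the mean of the dummy is the share of the sample with that attribute
    proportions[level] = sum(dummy) / len(dummy)

print(proportions)  # {'blue': 0.25, 'green': 0.25, 'red': 0.5}
```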

12
Q

Percentiles

A

a way of describing how extreme a particular observation is (median is not extreme); the s-th percentile is the value of x such that s% of the data lies below it

13
Q

How do you get the median?

A

(1) Order x from smallest to largest (2) If n is odd, the median is the middle-most value. If n is even, the median is the average of the two middle-most values.
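
A short sketch of this procedure, with made-up inputs; the function name `median` is chosen here for illustration.

```python
def median(values):
    """Order the data, then take the middle value (or average the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median is undefined for an empty sample")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # n odd: middle-most value
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: average of the two middle values

print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0
```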

14
Q

Centrality of the median

A

the value that splits the ordered data into two halves: half of the observations lie below it and half lie above it

15
Q

Residual (ei)

A

a measure of variation; the difference between each actual value and the proposed “typical” value (i.e. the sample mean): ei = xi - xbar

16
Q

Centrality of the sample mean

A

the sample mean is the value that is, on average, as close to the data as possible (it minimizes the average squared distance to the observations), and it is subject to leverage by very large or very small values (i.e. outliers)
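
One way to see this closeness property is to compare the sum of squared deviations around the mean with the same sum around other candidate centers. This is a small numerical check on made-up data, not a proof; `sum_sq_dev` is a hypothetical helper.

```python
def sum_sq_dev(values, center):
    """Sum of squared deviations of the data from a candidate center."""
    return sum((x - center) ** 2 for x in values)

data = [2, 4, 4, 5, 10]        # hypothetical sample
xbar = sum(data) / len(data)   # 5.0

candidates = [3.0, 4.0, xbar, 6.0, 7.0]
for c in candidates:
    print(c, sum_sq_dev(data, c))

# Among these candidates the sample mean gives the smallest sum.
best = min(candidates, key=lambda c: sum_sq_dev(data, c))
print(best)  # 5.0
```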

17
Q

How can you deal with outliers?

A

(1) Remove them from the dataset (2) Choose statistics that are robust to outliers like the median instead of the mean
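
A quick illustration of option (2), using hypothetical incomes in thousands: adding one extreme value moves the mean a long way but barely moves the median.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]   # hypothetical values
with_outlier = incomes + [1000]  # add one extreme observation

print(mean(incomes), median(incomes))            # both are 35
print(mean(with_outlier), median(with_outlier))  # the mean jumps, the median is 36.5
```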

18
Q

How is the residual a measure of variation?

A

it gives us a measure of how dispersed the value of xi is about the center xbar

19
Q

Bessel’s correction (n-1)

A

a statistical adjustment (dividing the sum of squared residuals by n-1 instead of n) that makes the sample variance an unbiased estimator of the population variance and reduces the bias of the sample standard deviation, particularly for small values of n
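
The sketch below computes the sample variance from residuals with the n-1 divisor and compares it with dividing by n. The data and the name `sample_variance` are illustrative; `statistics.variance` (n-1) and `statistics.pvariance` (n) are standard-library functions used as cross-checks.

```python
from math import sqrt
from statistics import pvariance, variance

def sample_variance(values):
    """Sum of squared residuals divided by n - 1 (Bessel's correction)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(values) / n
    residuals = [x - xbar for x in values]       # ei = xi - xbar
    return sum(e ** 2 for e in residuals) / (n - 1)

data = [2, 4, 4, 5, 10]              # hypothetical sample
print(sample_variance(data))         # 9.0  (36 / 4)
print(variance(data))                # same divisor, n - 1
print(pvariance(data))               # 7.2  (36 / 5), divides by n instead
print(sqrt(sample_variance(data)))   # 3.0, the sample standard deviation
```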

20
Q

Interquartile range (IQR)

A

IQR = x75 - x25; robust to outliers because it is a percentile-based measure, like the median

21
Q

Sample standard deviation

A

square root of the sample variance

22
Q

Range

A

R= x100 - x0 = max(xi) - min(xi); not robust to outliers

23
Q

Percentile quintets

A

x0, x25, x50, x75, x100
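
A sketch of the five values, the range, and the IQR on made-up data. Percentile conventions differ between textbooks and libraries; this example assumes the "inclusive" method of the standard library's `statistics.quantiles`, which may not match the convention used in the chapter.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 100]  # hypothetical sample with one extreme value

# Quartiles x25, x50, x75 (assumption: "inclusive" interpolation method)
x25, x50, x75 = quantiles(data, n=4, method="inclusive")
x0, x100 = min(data), max(data)

print((x0, x25, x50, x75, x100))  # (1, 2.0, 3.0, 4.0, 100)
print(x100 - x0)                  # range = 99, driven by the extreme value
print(x75 - x25)                  # IQR = 2.0, unaffected by it
```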

24
Q

Covariance

A

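Similar to variance but with two variables (instead of one): it averages the products of the paired deviations of x and y from their means; it has no equivalent of standard deviation, and its units are the product of the units of x and y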

25
Q

Correlation or Pearson’s Correlation Coefficient (r)

A

a unitless statistic (just a number, which makes it easy to interpret) that is always between -1 and 1; the covariance between x and y divided by the product of the standard deviations of x and y
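
A from-scratch sketch of the calculation on made-up data. It assumes the sample covariance uses the n-1 divisor; the divisor cancels in r, so the correlation is the same either way. The function names are illustrative.

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Average product of paired deviations from the means (n - 1 divisor assumed)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sx = sqrt(sample_covariance(xs, xs))  # covariance of x with itself is its variance
    sy = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sx * sy)

xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]     # rises exactly in line with x
ys_down = [10, 8, 6, 4, 2]   # falls exactly in line with x

print(sample_covariance(xs, ys_up))  # 5.0
print(correlation(xs, ys_up))        # approximately 1
print(correlation(xs, ys_down))      # approximately -1
```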

26
Q

Positive correlation

A

when r > 0, higher values of x tend to go with higher values of y, and vice versa

27
Q

Negative correlation

A

when r < 0, higher values of x tend to go with lower values of y, and vice versa

28
Q

No correlation

A

when r = 0, there is no linear relationship between values of x and y

29
Q

Perfect correlation

A

when r = +/- 1, the points lie exactly on a straight line, so the values of x and y can be perfectly predicted from one another

30
Q

How are correlation (r) and dependence related?

A

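the r value is a numerical measurement of linear dependence in the data: close to -1 means strong negative dependence, close to +1 means strong positive dependence, close to 0 means a lack of linear dependence (independent variables have a correlation of 0, but a correlation of 0 does not by itself guarantee independence)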
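
Because r captures only linear dependence, a value of 0 can coexist with a perfectly predictable relationship. A small made-up example: y is exactly x squared, yet the correlation is 0. The helper `correlation` is written out here for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # y depends entirely on x, but not linearly

r = correlation(xs, ys)
print(r)  # zero: the positive and negative products cancel
```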

31
Q

Population parameters

A

important properties of the population distribution (such as its mean, variance, covariance, and correlation); sample statistics are their sample analogues, and the parameters have the same interpretation as the corresponding sample statistics

32
Q

Correspondences between sample statistics and population parameters

A

sample mean and population mean, sample variance and population variance, sample covariance and population covariance, sample correlation and population correlation

33
Q

Sampling distribution

A

shows every possible result a statistic can take in every possible (hypothetical) sample from a population and how often each result occurs; observations are unique samples and variables are statistics

34
Q

Empirical distribution

A

the distribution of the values observed in the sample; it serves as an estimate of the distribution in the population and gets closer to it if the data is representative and n is large

35
Q

Bootstrapping

A

simulating samples using the empirical distribution

36
Q

How do you implement the bootstrap?

A

(1) Randomly draw a new sample of the same size from the existing sample, with replacement (2) Compute your statistic on that bootstrap sample (3) Do this hundreds or thousands of times; the distribution of the resulting values approximates the sampling distribution of your statistic
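
A minimal sketch of these steps using only the standard library. The sample, the function name `bootstrap_statistics`, the number of resamples, and the seed are all assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

def bootstrap_statistics(sample, statistic, n_resamples=2000, seed=0):
    """Resample with replacement and collect the statistic from each resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    results = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=n)  # same size as the original, with replacement
        results.append(statistic(resample))
    return results

sample = [2, 4, 4, 5, 10, 7, 3, 6]  # hypothetical data
boot_means = bootstrap_statistics(sample, mean)

print(mean(sample))       # the original sample mean
print(mean(boot_means))   # bootstrap means are centered near it
print(stdev(boot_means))  # spread of the bootstrap means, an estimate of the standard error
```

Passing a different function as `statistic` (for example `statistics.median`) gives a bootstrap distribution for that statistic instead.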

37
Q

Asymptotic behavior

A

how statistics computed from a sample behave as n becomes large

38
Q

Asymptotic behavior of bootstrap

A

it gives us a sense of what the sampling distribution might look like and is centered around the sample statistics

39
Q

Law of Large Numbers (LLN)

A

if the sample is representative of the population (independent draws), then as n becomes large the sample mean becomes a very close approximation of the population mean
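
A small simulation can make this concrete. It assumes independent rolls of a fair six-sided die, whose population mean is 3.5; the seed and sample sizes are arbitrary choices.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible
rolls = [rng.randint(1, 6) for _ in range(100_000)]  # independent draws, population mean 3.5

# The sample mean of the first n rolls settles toward 3.5 as n grows.
for n in (10, 100, 1_000, 100_000):
    print(n, sum(rolls[:n]) / n)
```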

40
Q

Central limit theorem (CLT)

A

given a population with a finite mean µ and a finite non-zero variance σ², the sampling distribution of the sample mean approaches a normal distribution with a mean of µ and a variance of σ²/N as N, the sample size, increases
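
The theorem can be checked by simulation. This sketch assumes a uniform population on [0, 1), which has mean 0.5 and variance 1/12 and is clearly not normal; the sample size, number of samples, and seed are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)
n = 25            # size of each sample
n_samples = 2000  # number of hypothetical samples drawn

# Draw many samples from a uniform population and record each sample mean.
sample_means = [mean([rng.random() for _ in range(n)]) for _ in range(n_samples)]

theoretical_sd = sqrt((1 / 12) / n)  # square root of sigma^2 / N
observed_sd = stdev(sample_means)

# Share of sample means within one theoretical sd of 0.5 (about 68% under normality).
within_one = sum(abs(m - 0.5) <= theoretical_sd for m in sample_means) / n_samples

print(mean(sample_means))           # close to the population mean 0.5
print(observed_sd, theoretical_sd)  # close to each other
print(within_one)
```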

41
Q

Useful properties of the normal or gaussian distribution

A

(1) symmetrical about the mean (2) the mean, median, and mode coincide (3) quantiles are closely related to the standard deviations (empirical rule): about 68% of data within 1 sd, about 95% within 2 sd, about 99.7% within 3 sd

42
Q

Standard normal distribution

A

a normal distribution with a mean of 0 and a standard deviation of 1

43
Q

What is the difference between standard deviation and standard error (CLT)?

A

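SD is the variation of observations within a sample, while SE is the variation of a statistic (such as the sample mean) between samples; for the sample mean, SE = SD/√n, so SD is bigger whenever n > 1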
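
For the sample mean, the standard error is usually estimated as the sample SD divided by the square root of n. A short sketch on made-up data:

```python
from math import sqrt
from statistics import stdev

sample = [2, 4, 4, 5, 10]  # hypothetical sample
n = len(sample)

sd = stdev(sample)  # variation of observations within the sample
se = sd / sqrt(n)   # estimated variation of the sample mean between samples

print(sd)  # 3.0
print(se)  # smaller than the SD because n > 1
```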