S2016Q5 Flashcards

1
Q

What is designing in the language of statistics?

A

Setting up a hypothesis or question and deciding how to collect data

2
Q

What is describing in the language of statistics? (descriptive statistics)

A

Summarizing data with numbers and graphs.

3
Q

What is inference in the language of statistics? (inferential statistics)

A

Decisions and predictions based on the data.

  • Estimation
  • Tests
  • Confidence intervals
4
Q

The typical statistical model assumes what? (model = assumptions)

A
  • Independence of observations
  • The same underlying distribution for all observations
  • Some sort of systematic structure

(but this is not always the case)

5
Q

What is the significance level? (rule of thumb)

A

5 %.

6
Q

What is random sampling and why is it important?

A

Giving each subject in the population the same chance of being included in the sample. This matters because it makes the sample likely to be a good reflection of the population.

7
Q

What does inferential statistics refer to?

A

Methods of making decisions or predictions about a population, based on data obtained from a sample of that population.

8
Q

What is the difference between a parameter and a statistic?

A

Parameter: a numerical summary of the population

Statistic: a numerical summary of the sample taken from the population

9
Q

When is a variable categorical and when is it quantitative?

A
  • Categorical: if each observation belongs to one of a set of categories such as “Yes” and “No”
    • Ordered: “ordinal” (e.g. exam grades)
    • Unordered: “nominal” (e.g. male/female, type of business, zip codes)
  • Quantitative: if observations take numerical values that represent different magnitudes of the variable (e.g. age or annual income, but NOT area code numbers).
10
Q

What is unordered (nominal) data and what type?

A

Categorical, e.g. male/female, type of business, ZIP code etc.

11
Q

What is ordered (ordinal) data and what type?

A

Categorical, e.g. grades, Likert scales etc.

12
Q

What is a good graph?

A

Check colors: www.colorbrewer2.org

Remember to: use different lines, colors, different plotting symbols.

Remember it might be printed black/white

13
Q

What is a discrete variable and what type?

A

Numerical: Value in subset of natural numbers (typically integers)

E.g.: 0,1,2,3… (number of employees, number of companies etc.)

14
Q

What is a continuous variable?

A

Numerical: May take any value in an interval

E.g.: income, sales etc.

15
Q

When is a variable discrete and when is it continuous?

A
  • Discrete: it has separate possible values, such as the integers 0, 1, 2, …, for a variable expressed as “the number of…” (number of companies in a region, employees in a company, etc.)
  • Continuous: all possible values in an interval
16
Q

What is the median?

A

The middle observation

E.g.: 1,1,1,2,2,2,3,3,4,5,6,7,7,8,8

Median = 3
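
A minimal sketch of the median, using the 15 observations from this card and Python's standard `statistics` module. The second data set is a made-up example showing that, with an even number of observations, the two middle values are averaged.

```python
import statistics

# The 15 observations from the card, already in ascending order
observations = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 7, 8, 8]

# Odd number of observations: the median is the middle (8th) value
median_odd = statistics.median(observations)

# Even number of observations (hypothetical data): average of the two middle values
median_even = statistics.median([1, 2, 3, 4])

print(median_odd, median_even)
```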

17
Q

When is it called the modal category and when is it called the mode?

A

Modal category and mode both refer to being the most frequent answer in a data set.

Modal category ⇒ the category with the highest frequency

Mode ⇒ the numerical value (quantitative) that occurs most frequently

18
Q

What are the primary graphical displays for summarizing a categorical variable?

A
  • Pie chart
  • Bar graph: the bar graph is usually preferred as it is easier to distinguish between two categories of approximately the same size
    • When the bars are ordered by frequency, it is called a Pareto chart (after Vilfredo Pareto)
19
Q

What are the primary graphical displays for summarizing quantitative variables?

A
  • Dot plot: shows a dot for each observation, placed just above the value on the number line for that observation. Can be useful for small data sets (<50 observations)
  • Stem-and-leaf plot: Can be useful for small data sets (<50 observations)
  • Histogram: a graph with bars representing a quantitative variable, whereas “bar graph” is used for graphs of a categorical variable.
    • Gives more flexibility in defining intervals and is better for big data sets (more than 50 observations)
20
Q

What is the “mode” in a frequency table or histogram?

A

The highest point.

21
Q

What does unimodal and bimodal refer to?

A

Whether the histogram or frequency table has a single mound or two distinct mounds.

22
Q

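What do symmetric and skewed shapes refer to?

A
  • Symmetric: the left and right sides of the distribution are mirror images
  • Skewed to the left: the left tail is longer than the right
    • The mean is smaller than the median
  • Skewed to the right: the right tail is longer than the left
    • The mean is larger than the median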
23
Q

What is the “mean” of a distribution of a quantitative variable?

A

The sum of the observations divided by the number of observations.

(The average / The balance point of the distribution)

24
Q

What is the median?

A

The median is the middle value of the observations when the observations are ordered from smallest to the largest.

(If you have 20 observations, the median is the average of the 10th and 11th ordered observations.)

25
Q

What is an outlier?

A

An observation that falls well above or well below the overall bulk of the data.

26
Q

What does the “range” refer to?

A

Difference between largest and smallest observation

27
Q

What is the formula of a standard deviation?

A
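
A minimal sketch, assuming the card refers to the usual sample standard deviation: the square root of the sum of squared deviations from the mean, divided by n - 1. The data values and the function name are made up for illustration.

```python
import math

def sample_standard_deviation(values):
    """Square root of the sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))

# Hypothetical observations with mean 5
data = [2, 4, 4, 4, 5, 5, 7, 9]
s = sample_standard_deviation(data)
print(round(s, 3))
```

The standard library function `statistics.stdev` uses the same n - 1 convention and can serve as a cross-check.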
28
Q

What does a large “s” mean when working with standard deviations?

A

The larger the standard deviation, s, the greater the variability of the data.

s is a typical distance of observations from the mean (the average)

29
Q

What is the empirical rule for BELL-SHAPED data distributions? (within 1 standard deviation + within 2) fun-fact

A

About 68 % of the observations fall within 1 standard deviation of the mean and about 95 % within 2 standard deviations (about 99.7 % within 3).

30
Q

What are quartiles and how do they relate to the median?

A

Median = Quartile 2 (50th percentile)

Q1 = 25th percentile

Q2 = 50th percentile

Q3 = 75th percentile

Comparing the distance from Q1 to Q2 with the distance from Q2 to Q3 also tells you something about the shape of the distribution.

31
Q

What is the interquartile range?

A

The range between Q3 and Q1

IQR = Q3 - Q1

It is often better to use the IQR instead of the range or standard deviation to compare the variability for distributions that are very highly skewed or that have severe outliers.

32
Q

What is the 1.5 x IQR criterion and what is it for?

A

It is used to detect potential outliers.

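Calculate 1.5 × IQR (IQR = Q3 - Q1). An observation is a potential outlier if it falls more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3.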
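
A minimal sketch of the 1.5 × IQR criterion on made-up data. It assumes the quartile convention of `statistics.quantiles` (its default "exclusive" method); other textbooks and software may compute slightly different quartiles.

```python
import statistics

# Hypothetical data with one suspiciously large value
data = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 7, 8, 30]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations outside the fences are flagged as potential outliers
potential_outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, potential_outliers)
```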

33
Q

What are the five numbers used in a box plot?

A
  1. Minimum value
  2. Maximum value
  3. Q1
  4. Q2
  5. Q3
34
Q

What is the Z-score and how is it calculated?

A

The number of standard deviations that an observation falls from the mean:

z = (observation − mean) / standard deviation

A positive z-score means that the observation is above the mean; a negative z-score means that it is below the mean.

35
Q

What are examples of response variables and explanatory variables?

A
36
Q

When do we say that there is an association between two variables?

A

As soon as the value for one variable is more likely to occur with certain values of the other variable.

37
Q

If there is no association between two variables, what do we call them?

A

Independent variables.

38
Q

What is a contingency table and how does it look?

A

A display for two categorical variables.

39
Q

What are conditional proportions?

A

Proportions that are calculated within one category of another variable, e.g. within one food type.

In the contingency table example, 22.8 % and 73.3 % are conditional proportions, while the total 73.1 % is not a conditional proportion, as it does not depend on the type of food (it is called a marginal proportion instead).

40
Q

What are the three types of cases that exist when investigating the association between two variables?

A
  • Two categorical variables: Food type and pesticide status
  • One quantitative and one categorical: Income and gender
  • Two quantitative
41
Q

Which variable should be called x and which should be called y in a scatterplot?

A
  • Y-axis: The response variable
  • X-axis: the explanatory variable
42
Q

What is a scatterplot?

A

A graphical display for two quantitative variables using the horizontal axis (x) for the explanatory variable and the vertical axis (y) for the response variable.

It is used to study association between two quantitative variables.

43
Q

When do we call it a positive association and when do we call it a negative association?

A
  • Positive association: As x increases, y increases
  • Negative: As x increases, y decreases.

Picture of NEGATIVE association

44
Q

What do r = 1, r = -1 and r = 0.4 mean?

A

r = the correlation between two quantitative variables. It is always between -1 and 1.

r = 1: perfect positive straight-line association

r = -1: perfect negative straight-line association

The closer r is to 1 or -1, the stronger the straight-line association.

r = 0.4 shows a positive but fairly weak association between the two variables.
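
A minimal sketch of how r can be computed, assuming the usual Pearson correlation formula. The data and the helper name `correlation` are made up for illustration; the two extreme cases r = 1 and r = -1 arise when the points lie exactly on a straight line.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_positive = correlation(x, [2, 4, 6, 8, 10])   # points on a rising line
r_negative = correlation(x, [10, 8, 6, 4, 2])   # points on a falling line
r_scattered = correlation(x, [2, 4, 5, 4, 5])   # rising trend with scatter

print(r_positive, r_negative, round(r_scattered, 2))
```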

45
Q

In case you have a data set with an outlier far from the rest of the data that makes the scatterplot hard to interpret, what do you do?

A

You take the log of the numbers.

(the correlation is not the same after using log)

46
Q

What is a quadrant?

A
47
Q

Name an example of a case where the correlation, r would be inappropriate to use

A

If the relation between two variables is curved.

E.g. medical expenses and age: high at a young age, then lower, and finally higher again in old age.

The association clearly exists, but the correlation r is not appropriate for describing it (r only suits straight-line associations).

48
Q

What is a regression line and what is it used for?

A

A regression line is a straight line fitted to the data of two quantitative variables, showing their association.

It is used to predict the value of the response variable y at a given value of x.

Regression line = Prediction line

Regression equation = prediction equation

49
Q

What is a residual?

A

A residual is the prediction error: the difference between the observed y and the predicted y at a given x-value.

If the observed y is bigger than the predicted y, the residual is positive.

50
Q

How does software find the optimal regression line?

A

Using the least squares method: it picks the line with the smallest sum of squared residuals.

The regression line has some positive and some negative residuals, and the sum (and mean) of the residuals equals 0.
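
A minimal sketch of a least squares fit on made-up data, assuming the usual formulas slope = Sxy / Sxx and intercept = mean(y) - slope × mean(x). It also checks that the residuals sum to 0, as stated above.

```python
# Hypothetical data for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares estimates of slope and intercept
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed y minus predicted y
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(slope, intercept, sum(residuals))
```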

51
Q

What is the primary difference between the correlation, r and the regression method?

A
  • Regression:
    • We must identify response and explanatory variables (we get a different line if we use x to predict y and y to predict x).
    • The slope can be any real number (NOT just between -1 and 1)
    • The values of the y-intercept and slope of the regression line depend on the units
  • Correlation:
    • We get the same correlation whether we use x to predict y or y to predict x.
    • Falls between -1 and 1
52
Q

What are the classic pitfalls when analyzing associations?

A
  • Extrapolations are unreliable - especially when it is far into the future
  • Influential outliers: observations that fall far from the trend and have an influence on your regression line (especially with small data sets)
    • The best way to handle them is to plot the data and be aware that they are part of your data set.
  • Thinking that correlation implies causation
    • E.g. higher education levels are correlated with higher crime rates. The two are not causally connected, but both are connected to a higher urbanization rate ⇒ more highly educated people live in cities, where the crime rate also tends to be higher.
  • Simpson’s paradox
  • Confounding
53
Q

What must hold before you can call an observation an influential outlier?

A
  • Its x value is relatively low or high compared to the rest of the data
  • The observation is a regression outlier, falling quite far from the trend that the rest of the data follow

It is always a good idea to remove the outliers from the data set and plot the data again to see whether the regression line changes a lot or not.

54
Q

What is a lurking variable?

A

A variable, usually unobserved, that influences the association between the variables of primary interest.

E.g. the two variables, number of people drowning on Cold Coast in a given month and number of ice creams sold in a given month.

The lurking variable is the number of people using the beaches in the given months (could also be correlated to the monthly mean temperature).

55
Q

What is Simpson’s paradox?

A

That the direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of that variable.

Smoking example:

Was smoking actually beneficial for your health since a lower percentage of the smokers in the study died over the 20-year period?

No. Not when we took the age of the women studied at the beginning of the study into account.

The smokers were younger, and therefore less likely to die.

Correlation example: the correlation is positive (0.85) if you take all the data together, which is misleading. However, if you split the data into two groups, you get two negative correlations (one is -0.9), which gives the correct picture. ALWAYS PLOT THE DATA.
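
A minimal sketch of the correlation version of the paradox, using small made-up data (not the data behind the figures above). Each group has a negative correlation, yet pooling the two groups gives a positive one.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: within each group, y falls as x rises
group_a_x, group_a_y = [1, 2, 3], [5, 4, 3]
group_b_x, group_b_y = [6, 7, 8], [10, 9, 8]

r_a = correlation(group_a_x, group_a_y)
r_b = correlation(group_b_x, group_b_y)

# Pooling ignores the group variable and reverses the direction
r_all = correlation(group_a_x + group_b_x, group_a_y + group_b_y)

print(r_a, r_b, round(r_all, 2))
```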

56
Q

What is confounding?

A

When two explanatory variables are both associated with a response variable but are also associated with each other.

  • Smokers had a greater survival rate than nonsmokers
  • However, AGE was a confounding variable
  • Older subjects were less likely to be smokers, and older subjects were more likely to die.
  • Within each age group, smokers had a lower survival rate than non-smokers.
  • In conclusion, age had a dramatic influence on the association between smoking and survival status
57
Q

What is the difference between confounding and lurking variables?

A

It is essentially the same, BUT a lurking variable is not measured in the study whereas the confounding variable is.

In other words, the lurking variable is a potential confounding variable that has not been taken into account.

58
Q

What is an experimental study and what is an observational study?

A
  • Experimental: the researcher conducts an experiment by assigning subjects to certain experimental conditions and then observing the outcomes on the response variable.
    • The experimental conditions, which correspond to assigned values of the explanatory variable, are called treatments
  • Observational: non-experimental; The researcher observes values of the response variable and explanatory variables for the sampled subjects, without anything being done to the subjects.
59
Q

What kind of study is most reliable in terms of explanatory variables?

A

An experiment. Because it is easier to adjust for lurking variables in an experiment than in an observational study, we can study the effect of an explanatory variable on a response variable more accurately with an experiment than with an observational study.

60
Q

What are good places to collect available data?

A
61
Q

What is simple random sampling?

A

A sampling method in which each possible sample of a given size has the same chance of being selected.

62
Q

What are the ways of collecting data in sample surveys?

A
  • Personal interviews: allow longer questions, but subjects are unlikely to give honest answers on sensitive topics such as drugs, alcohol, sex etc.
  • Telephone interviews: like a personal interview but less costly; subjects might not be as patient as in a personal interview
    • Usually the method used for national surveys by GSS, Gallup etc.
  • Self-administered questionnaire: cheap, but many might fail to participate
63
Q

What are the primary sources of potential bias in sample surveys?

A
  • Sampling bias: the use of an inappropriate sampling method, for example one that does not cover the entire population or that favors a certain group of people
  • Nonresponse bias: e.g. a survey reported that 70 % of American women married for 5+ years had had an affair, but only 4,500 out of 100,000 women replied. The data is simply useless, as it might mainly be the ones who had an affair that replied.
  • Response bias: e.g. if the interviewer asks a question in a leading way, such that subjects are more likely to respond a certain way. It can also be that subjects don’t give honest answers, as the true answer might not be ethically correct or socially acceptable.
64
Q

What are some classic surveys that are unreliable?

A

Surveys made with:

  • Convenience samples: e.g. talking to people coming out of a shopping mall. It is unlikely that these people are representative of the entire population, due to time, interest, etc.
  • Volunteer samples: when people voluntarily answer questionnaires online
65
Q

What are some elements of good experiments?

A
  • Comparison control groups (often using a placebo treatment to make sure it seems identical)
  • Randomization
  • Blinding the study: making sure that the two groups don’t know whether they get the placebo or the actual pill for example
  • Replicating: doing the studies again to make sure that you get approximately the same result each time.
66
Q

What does it mean that a study has statistically significant results?

A

The number of observations/subjects is large enough that chance alone is unlikely to explain the difference between the results.

For example: before = 44 % and after = 52 %. If the number of subjects chosen by randomization is large enough, chance does not explain the 8 percentage point difference.

67
Q

What is cluster random sampling?

A

As simple random sampling can often be hard, you can divide the population into a large number of clusters, such as city blocks.

Then you select a simple random sample of the clusters and use the subjects in those clusters as the sample.

68
Q

What is stratified random sampling?

A

You divide the population into separate groups, called strata, and then select a simple random sample from each stratum.
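
A minimal sketch of simple, cluster and stratified random sampling with Python's `random` module. The population of 100 subjects in 10 city blocks is made up, and the same blocks are used once as clusters and once as strata purely for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 subjects, 10 subjects in each of 10 city blocks
population = [(subject, subject // 10) for subject in range(100)]
blocks = sorted({block for _, block in population})

# Simple random sample: every possible sample of 10 subjects is equally likely
simple_sample = random.sample(population, 10)

# Cluster random sample: pick 2 whole blocks at random and keep everyone in them
chosen_blocks = random.sample(blocks, 2)
cluster_sample = [person for person in population if person[1] in chosen_blocks]

# Stratified random sample: pick 2 subjects at random from every block (stratum)
stratified_sample = []
for block in blocks:
    stratum = [person for person in population if person[1] == block]
    stratified_sample.extend(random.sample(stratum, 2))

print(len(simple_sample), len(cluster_sample), len(stratified_sample))
```

Note the difference: the cluster sample contains only the chosen blocks, while the stratified sample contains some subjects from every block.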

69
Q

What are the advantages and disadvantages of the 3 different sampling methods?

A
70
Q

What does retrospective and prospective refer to?

A
  • Retrospective: backward looking (looks into the past)
  • Prospective: forward looking (takes a group of people and observes them in the future)
71
Q

What is a cross-over design?

A

A study in which the two groups switch treatments during the study.

This helps ensure that lurking variables do not affect the results.

72
Q

What does cumulative proportion mean?

A

When doing trials, you focus on the percentage of times a certain outcome happens in the total of trials you have made.

Trial = simulation of something (simulation of die rolls)
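
A minimal sketch of a cumulative proportion, simulating die rolls and tracking the proportion of sixes after each trial. The seed, the number of rolls and the function name are arbitrary choices for illustration.

```python
import random

random.seed(2016)  # fixed seed so the sketch is repeatable

def cumulative_proportion_of_sixes(number_of_rolls):
    """Proportion of sixes among the first 1, 2, ..., number_of_rolls rolls."""
    sixes = 0
    proportions = []
    for roll_number in range(1, number_of_rolls + 1):
        if random.randint(1, 6) == 6:
            sixes += 1
        proportions.append(sixes / roll_number)
    return proportions

proportions = cumulative_proportion_of_sixes(20000)

# Early values jump around; later values settle close to 1/6
for n in (10, 100, 1000, 20000):
    print(n, round(proportions[n - 1], 3))
```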

73
Q

What does this graph illustrate?

A

That random phenomena are unpredictable in the short run, when only a low number of trials is made.

However, in the long run, things get very predictable.

This (together with people’s ludomania) is what makes casinos a good business. Even though a gambler might be lucky in the short run, the casino will win in the long run.

74
Q

What is probability?

A

Probability of a particular outcome is the proportion of times that the outcome would occur in a long run of observations.

75
Q

What does sample space mean?

A

The sample space is the set of all possible outcomes.

An event is a subset of the sample space.

An event corresponds to a particular outcome or a group of possible outcomes.

76
Q

What does the complement of an event mean?

A
77
Q

What does it mean when two events are disjoint?

A

They do not have any common outcomes ⇒ They cannot happen at the same time.
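
A minimal sketch that models events as Python sets, assuming a single roll of a die as the sample space. It uses the standard set-theory meanings: the complement is everything in the sample space outside the event, the intersection is "A and B", and the union is "A or B". The event names are made up.

```python
# Sample space: all possible outcomes of one die roll
sample_space = {1, 2, 3, 4, 5, 6}

# Events are subsets of the sample space
even = {2, 4, 6}
odd = {1, 3, 5}
at_least_five = {5, 6}

complement_of_even = sample_space - even      # outcomes not in the event
intersection = even & at_least_five           # outcomes in both events
union = even | at_least_five                  # outcomes in at least one event

# Disjoint events share no outcomes, so they cannot happen at the same time
even_and_odd_disjoint = even.isdisjoint(odd)
even_and_high_disjoint = even.isdisjoint(at_least_five)

print(complement_of_even, intersection, union)
```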

78
Q

What does intersection and union of two events refer to and what is the difference?

A
79
Q

If the chance of a professional basketball player making a free throw is 80 %, what is the chance of making two in a row if we assume that the free throws are independent of one another?

A

0.8 × 0.8 = 0.64 = 64 %

80
Q

What does conditional probability refer to?

A

The conditional probability refers to the probability of event A, given that event B has occurred.
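
A minimal sketch, assuming the standard definition P(A | B) = P(A and B) / P(B) and equally likely outcomes of one die roll. The events A and B are made up for illustration. Rearranging the definition gives the multiplication rule P(A and B) = P(A | B) × P(B).

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die

def probability(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

a = {2, 4, 6}   # A: the roll is even
b = {4, 5, 6}   # B: the roll is greater than 3

# Conditional probability of A given B
p_a_given_b = probability(a & b) / probability(b)

# Multiplication rule recovers P(A and B)
p_a_and_b = p_a_given_b * probability(b)

# A and B are independent only if knowing B does not change the probability of A
independent = p_a_given_b == probability(a)

print(p_a_given_b, p_a_and_b, independent)
```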

81
Q

What is the multiplication rule for evaluating P(A and B)?

A
82
Q

What does sampling without replacement refer to?

A

Imagine you are playing lotto. The same number can only be picked once, as the lotto ball with the given number is not put back into the bowl.

83
Q

What are the ways that you can check whether two events are independent of each other?

A
84
Q

What is sensitivity and specificity regarding probability?

A

The probability that the test result is correct, given the true state.

Sensitivity: the probability that the test is positive, given that you have the disease.

Specificity: the probability that the test is negative, given that you do not have the disease.
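
A minimal sketch of both quantities as conditional proportions, using made-up counts from a hypothetical diagnostic test (none of these numbers come from a real study).

```python
# Hypothetical test results
true_positives = 90     # have the disease, test positive
false_negatives = 10    # have the disease, test negative
true_negatives = 950    # no disease, test negative
false_positives = 50    # no disease, test positive

# Sensitivity: P(positive test | disease)
sensitivity = true_positives / (true_positives + false_negatives)

# Specificity: P(negative test | no disease)
specificity = true_negatives / (true_negatives + false_positives)

print(sensitivity, specificity)
```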

85
Q

What is the weighted average?

A

An average in which each value is weighted by how likely it is to happen. It is used when the values are not equally likely.

86
Q

What are the three conditions for binomial distribution?

A
  1. Each trial has two possible outcomes
  2. Each trial has same probability of success
  3. All trials are independent
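
When the three conditions hold, probabilities can be computed with the binomial formula. The following is a minimal sketch assuming that standard formula, applied to the free-throw example from the earlier card (success probability 0.8, two independent throws); the function name is made up.

```python
import math

def binomial_probability(successes, trials, p):
    """P(exactly `successes` successes in `trials` independent trials)."""
    ways = math.comb(trials, successes)
    return ways * p ** successes * (1 - p) ** (trials - successes)

# Two free throws, each made with probability 0.8
both_made = binomial_probability(2, 2, 0.8)
one_made = binomial_probability(1, 2, 0.8)
none_made = binomial_probability(0, 2, 0.8)

print(both_made, one_made, none_made)
```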
87
Q

With a normal distribution (bell-shaped) what are the percentages of observations within 1, 2, and 3 standard deviations?

A
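  • Within 1 standard deviation: about 68 %
  • Within 2 standard deviations: about 95 % (exactly 95 % within 1.96 standard deviations)
  • Within 3 standard deviations: about 99.7 %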
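
These percentages can be checked with `statistics.NormalDist` from the Python standard library. A minimal sketch; the helper name `proportion_within` is made up.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(k, round(proportion_within(k), 4))
```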
88
Q

What is the formula for calculating the z-score?

A
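
A minimal sketch, assuming the usual definition z = (observation - mean) / standard deviation. The numbers are made up for illustration.

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations the observation falls from the mean."""
    return (observation - mean) / standard_deviation

# Hypothetical example: mean 100, standard deviation 15
z_above = z_score(130, 100, 15)   # positive: above the mean
z_below = z_score(85, 100, 15)    # negative: below the mean

print(z_above, z_below)
```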
89
Q

How is a good statistics designed?

A
90
Q

How do you calculate the standard deviation of the sampling distribution?

A

Knowing the standard deviation, you can check whether the poll still holds true within 3 standard deviations (covering 99.7 % according to the empirical rule) and make your predictions from that.
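
A minimal sketch, assuming the card is about a sample proportion (as in a poll), for which the standard deviation of the sampling distribution is usually taken as sqrt(p(1 - p) / n). The poll numbers are made up.

```python
import math

def sd_of_sample_proportion(p, n):
    """Standard deviation of the sampling distribution of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical poll: 52 % support among 1000 respondents
p_hat = 0.52
n = 1000
sd = sd_of_sample_proportion(p_hat, n)

# Range covering about 99.7 % of sample proportions (3 standard deviations)
lower = p_hat - 3 * sd
upper = p_hat + 3 * sd

print(round(sd, 4), round(lower, 3), round(upper, 3))
```

In this made-up poll the range still includes 50 %, so a lead of 52 % would not be safe to call.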

91
Q

What does sampling distribution mean?

A
92
Q

What is the Central Limit Theorem (CLT)?

A

That the sampling distribution of the sample mean x̄ often has approximately a normal distribution (bell-shaped).

For relatively large sample sizes, the sampling distribution is bell-shaped even if the population distribution is highly discrete or highly skewed.

x̄ is approximately N(mean = population mean, standard deviation = standard error).
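
A minimal simulation sketch of the theorem. It assumes a highly skewed population (exponential with mean 1 and standard deviation 1), samples of size 30 and 5000 repetitions; all of these are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(92)  # fixed seed so the sketch is repeatable

sample_size = 30
repetitions = 5000

# Draw many samples from a skewed population and record each sample mean
sample_means = []
for _ in range(repetitions):
    sample = [random.expovariate(1.0) for _ in range(sample_size)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Theory: mean = population mean (1), sd = population sd / sqrt(n)
expected_sd = 1 / math.sqrt(sample_size)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_sd, 3))
```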

93
Q

How big does your sample size need to be before you can expect a bell-shape?

A

Around 30.

94
Q

Playing the roulette, are you more likely to come out ahead playing only 1 time or by playing 40 times?

A
  • Playing roulette for red 1 time gives you a winning chance of 18/38 = 47.4 %
  • Playing roulette for red 40 times gives you a chance of coming out ahead of 37.07 %

Playing once gives you a bigger chance of coming out ahead, but splitting your money up into 40 bets will give you a reasonable chance of getting out ahead while not losing ALL your money.
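
A minimal sketch of the 40-bet calculation, assuming equal bets on red, so that you come out ahead with at least 21 wins. It computes the exact binomial probability and, as an assumption about where a figure near 37 % could come from, a normal approximation of the probability of more than 20 wins.

```python
import math
from statistics import NormalDist

p_red = 18 / 38   # chance of winning a single bet on red
plays = 40

# Exact binomial probability of 21 or more wins out of 40
exact_ahead = sum(
    math.comb(plays, wins) * p_red ** wins * (1 - p_red) ** (plays - wins)
    for wins in range(21, plays + 1)
)

# Normal approximation using the binomial mean and standard deviation
mean_wins = plays * p_red
sd_wins = math.sqrt(plays * p_red * (1 - p_red))
approx_ahead = 1 - NormalDist(mean_wins, sd_wins).cdf(20)

print(round(p_red, 3), round(exact_ahead, 3), round(approx_ahead, 3))
```

Either way, the chance of coming out ahead after 40 bets is lower than after a single bet.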

95
Q

What is the difference between sampling distribution and population distribution?

A
  • Sampling: probability distribution of a sample statistic
  • Population: probability distribution from which we take the sample
96
Q

What do these symbols mean?

A
97
Q

What are the two types of statistical inference methods?

A
  • Estimation of population parameters
    • Most informative estimation method constructs an interval of numbers ⇒ Confidence interval
  • Testing hypotheses about the parameter values
98
Q

What is the difference between a point estimate and an interval estimate?

A
  • Point estimate: best guess for the unknown parameter value. Point estimate of the population mean would be the sample mean.
  • Interval estimate/Confidence interval: the most plausible values for a parameter using a margin of error (typically focused on 95 % using the z-score 1.96).
99
Q

Which number will you take to estimate population mean, population median and true probability? (point estimators)

A
100
Q

What are interval estimators?

A

Another word for confidence intervals.

“An interval within which the parameter is believed to fall” ⇒ In other words, the most believable values for a parameter.

101
Q

How do you construct a confidence interval?

A
  1. Take a point estimate
  2. Add and subtract a margin of error (the margin of error is based on the standard deviation of the sampling distribution of that point estimate)
  3. For a 95 % confidence interval, the margin of error equals 1.96 of these standard deviations
102
Q

How do you get a 95 %-interval estimator?

A

Using the estimator plus/minus 1.96 × the standard error.

z-score of 1.96

103
Q

How do you get a 99 %-interval estimator?

A

Using the estimator plus/minus 2.58 × the standard error.

Z-score of 2.58

104
Q

How do you get a 90 %-interval estimator?

A

Using the estimator plus/minus 1.645 × the standard error.

Z-score of 1.645.
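
The three z-scores on these cards can be reproduced with `statistics.NormalDist`. A minimal sketch; the helper names and the poll numbers (52 % of 1000 respondents) are made up, and the standard error of a proportion is assumed to be sqrt(p(1 - p) / n).

```python
import math
from statistics import NormalDist

def z_score_for(confidence_level):
    """z-score that leaves (1 - confidence_level) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)

def confidence_interval(estimate, standard_error, confidence_level=0.95):
    """Estimate plus/minus z-score times standard error."""
    margin_of_error = z_score_for(confidence_level) * standard_error
    return estimate - margin_of_error, estimate + margin_of_error

# Hypothetical poll: sample proportion 0.52 from n = 1000
p_hat, n = 0.52, 1000
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)

for level in (0.90, 0.95, 0.99):
    low, high = confidence_interval(p_hat, standard_error, level)
    print(level, round(z_score_for(level), 3), round(low, 3), round(high, 3))
```

A higher confidence level uses a larger z-score and therefore gives a wider interval.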

105
Q

What do we want from an interval estimator?

A
  • Coverage: so that it covers the true value 95 % of the time (thus 1.96 standard deviations)
  • Length: the shorter the interval, the more precise the estimate (smaller standard errors give shorter intervals)

Interval = estimator ± quantile × standard error

(the quantile depends on what we estimate, the number of observations and the total number of parameters estimated)

106
Q

When is an estimator called unbiased?

A

When it has a distribution centered at the true value of the parameter we are trying to estimate.

“Center” in this case means the mean of that sampling distribution.

107
Q

What are the two properties of a good point estimate?

A
  • Being unbiased (sampling distribution that is centered at the parameter)
  • Small standard deviation
108
Q

What is a standard error?

A

Standard error = the estimated standard deviation of a statistic.

A low standard error = a less variable estimate = more reliable.

109
Q

What are the properties of the T-distribution and what is it used for?

A

Used for constructing confidence intervals that apply even in small sample sizes.

110
Q

Is the method of creating interval estimations with t-distribution robust when data is not bell-shaped?

A

The sampling distribution of the sample mean can be expected to be approximately bell-shaped as long as the sample size is above about 30.

Thus, it comes down to the size of the sample: above 30, the t-distribution method is still robust.

111
Q

Are the z-score and t-score ever the same?

A

Yes.

When df = infinity.

Already from df > 30, the t-score is similar to the z-score.

112
Q

How do you determine the general sample size for estimating a population proportion?

A
113
Q

What is the sample size formulas for estimating means and proportions?

A
114
Q

What affects the choice of the sample size?

A
  • The desired precision, as measured by the margin of error, m
  • The confidence level, which determines the z-score or t-score in the sample size formulas

Other factors:

  • The variability in the data: the smaller the variability (as seen from the standard deviation), the smaller the sample size needed
  • Cost (resource constraints)
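
A minimal sketch of how these factors interact, assuming the usual textbook sample size formulas n = z² p(1 - p) / m² for a proportion and n = z² σ² / m² for a mean. The function names, the default p = 0.5 (the most cautious guess) and the example numbers are assumptions for illustration.

```python
import math

def sample_size_for_proportion(margin_of_error, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion within the margin of error."""
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

def sample_size_for_mean(margin_of_error, standard_deviation, z=1.96):
    """Sample size needed to estimate a mean within the margin of error."""
    return math.ceil(z ** 2 * standard_deviation ** 2 / margin_of_error ** 2)

# 95 % confidence, margin of error of 3 percentage points
n_proportion = sample_size_for_proportion(0.03)

# 95 % confidence, margin of error 2, guessed standard deviation 15
n_mean = sample_size_for_mean(2, 15)

print(n_proportion, n_mean)
```

A smaller margin of error, a higher confidence level or more variable data all increase the required sample size.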
115
Q

What is bootstrap?

A

A relatively recent computational method whose purpose is to obtain a standard error (SE) or a confidence interval by simulating resamples from the observed data.

It treats the data distribution as if it were the population distribution.
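
A minimal sketch of a bootstrap standard error for the sample mean. The data, the seed, the number of resamples and the function name are made up; resampling is done with replacement from the observed data, as described above.

```python
import random
import statistics

random.seed(115)  # fixed seed so the sketch is repeatable

# Hypothetical observed data
data = [2, 4, 4, 4, 5, 5, 7, 9]

def bootstrap_standard_error(values, statistic, resamples=2000):
    """Standard deviation of a statistic across resamples drawn with replacement."""
    estimates = []
    for _ in range(resamples):
        resample = random.choices(values, k=len(values))
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

se_mean = bootstrap_standard_error(data, statistics.fmean)

# For comparison: the formula-based standard error s / sqrt(n)
se_formula = statistics.stdev(data) / len(data) ** 0.5

print(round(se_mean, 3), round(se_formula, 3))
```

The same function works for statistics without a simple standard error formula, for example `statistics.median`.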

116
Q

What are the steps of a significance test?

A

Uses z-test statistic

  1. Assumptions (typically randomization, plus assumptions about the sample size or the shape of the population distribution)
  2. Hypotheses
    • A null hypothesis ⇒ Particular value
    • An alternative hypothesis ⇒ Some alternative range of values
  3. Test statistic
    • How far the point estimate falls from the parameter value given in the null hypothesis
  4. P-value
    • The probability that the test statistic equals the observed value or a value even more extreme.
      • Calculated by presuming that the null hypothesis is true.
  5. Conclusion
    • Reject or do not reject the null hypothesis
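
The steps above can be sketched for a two-sided test about a population proportion, assuming the usual z statistic (sample proportion minus the null value, divided by the standard error under the null hypothesis). The function name and the example counts are made up.

```python
import math
from statistics import NormalDist

def z_test_for_proportion(successes, n, p0):
    """Two-sided z-test of H0: p = p0 against Ha: p != p0."""
    p_hat = successes / n
    # Standard error calculated as if the null hypothesis were true
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Two-sided p-value: probability in both tails beyond |z|
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 60 successes in 100 trials, H0: p = 0.5
z, p_value = z_test_for_proportion(60, 100, 0.5)

significance_level = 0.05
reject_null = p_value < significance_level

print(round(z, 2), round(p_value, 4), reject_null)
```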
117
Q

When do we call it t-test and when do we use z-score?

A

T-tests ⇒ Mean

Z-score ⇒ Proportion

118
Q

What is the p-value?

A

The probability, calculated assuming that the null hypothesis is true, of getting a result at least as inconsistent with the null hypothesis as the one observed.

  • Events with small probabilities are unlikely to occur
  • So a small p-value leads us to reject the null hypothesis, namely when the p-value is smaller than the chosen significance level (typically 5 %)
119
Q

What is the p-value the sum of?

A

For a two-sided test, the p-value is the sum of both tail probabilities

120
Q

What is the formula of T in test statistics of hypotheses?

A
121
Q

Given the p-value you find, what can you conclude when the p-value is above and when it is below the chosen significance level?

A
  • P-value below; reject the null hypothesis (and conclude in favor of the alternative)
  • P-value above; do not reject the null hypothesis (and be careful what you conclude)
122
Q

What is the difference between one-sided and two-sided alternative hypotheses?

A
  • One-sided: the values fall only on one side of the null hypothesis value
  • Two-sided: the values fall on both sides of the null hypothesis value
123
Q

How do you write up the alternative hypothesis with letters/symbols for right-tail, left-tail and two-tail probability?

A
124
Q

What is a significance level?

A
125
Q

What are the 5 steps of a significance test about a population mean?

A

Uses t-test statistic

126
Q

What are the two types of errors in test decisions?

A
  • Type I error: when we reject a hypothesis (H0) when it is actually true
  • Type II error: when we fail to reject a hypothesis (H0) when it is actually false
127
Q

What is a Type I error?

A

Reject H0 when it is true

Sending innocent guy to prison

128
Q

What is a Type II error?

A

Fail to reject H0 when it is false

Letting guilty guy go free

129
Q

What is the probability of making a type 1 error?

A

The significance level of the test. With a 5 % significance level, there is a 5 % risk of rejecting H0 when it is actually true.

130
Q

Why do most people find confidence intervals better than significance tests?

A
  • A significance test merely indicates whether the particular parameter value in H0 (the null hypothesis) is plausible
  • A confidence interval is more informative because it displays the entire set of believable values.
131
Q

What are some of the classic misinterpretations of results of significance test?

A
  • “Do not reject H0” (null hypothesis) does not mean “Accept H0”
    • It only indicates whether a particular value is plausible (a confidence interval shows a range of plausible values - not just a single value)
  • Statistical significance does not mean practical significance
    • A small p-value does not tell us if the parameter value differs by much in practical terms from the value in H0
  • The p-value cannot be interpreted as the probability that H0 is true
    • we calculate probabilities about test statistic values, not about parameters
  • It is misleading to report results only if they are “statistically significant”
    • If only studies with a p-value < 0.05 are published, a study may be repeated enough times that one of them obtains significance by chance. That published result is then a type 1 error, and nobody sees the studies that found nothing.
  • Some tests may be statistically significant just by chance
  • True effects may not be as large as initial estimates reported by the media
    • The studies that get the most attention tend to be the most extreme ones.
132
Q

What is the likelihood of a type 2 error? (not rejecting H0 even though it’s false)

A

Power of a test: 1 - P(Type II error) = probability of correctly rejecting a false H0

Thus, the likelihood of a type 2 error is 1 - power.

Probability of being correct increases as:

  • The true parameter value moves farther into the Ha (alternative hypothesis) values and away from the H0 value
  • As the sample size increases
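
To make the two bullets concrete, here is a hedged sketch for one specific case that is not on the cards: a one-sided z-test of H0: μ = μ0 against Ha: μ > μ0 with a known standard deviation. The function name and all numbers are assumptions chosen for illustration.

```python
from statistics import NormalDist

def power_one_sided_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of the one-sided z-test (Ha: mu > mu0) when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)            # reject H0 when z > z_crit
    shift = (mu_true - mu0) / (sigma / n ** 0.5)      # distance from H0 in standard errors
    return 1 - std_normal.cdf(z_crit - shift)

power_small_effect = power_one_sided_z(mu0=0, mu_true=0.2, sigma=1, n=25)
power_large_effect = power_one_sided_z(mu0=0, mu_true=0.5, sigma=1, n=25)
power_larger_n = power_one_sided_z(mu0=0, mu_true=0.2, sigma=1, n=100)
type_2_error = 1 - power_small_effect                 # P(Type II error) = 1 - power
```

When the true mean equals the H0 value, the function returns the significance level itself, since any rejection is then a Type I error.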
133
Q

What is the P-value?

A
134
Q

What are the steps in significance test for population proportions and means? (comparison)

A
135
Q

What do these symbols mean?

A
136
Q

What is the formula of finding the standard error (of a proportion)?

A
137
Q

What is the formula for making a z-test (in relation with testing proportions)?

A

(notice that the denominator is the standard error)

138
Q

What is a binary variable?

A

A variable that has two possible outcomes.

When comparing two groups, it is used as the explanatory variable.

E.g. gender regarding binge drinking (male and female)

139
Q

How do you calculate the standard error for comparing two proportions?

A
140
Q

How do you find the confidence interval for the difference between two population parameters?

A
141
Q

How do you interpret a confidence interval for a difference in proportions?

A
  • If 0 falls in the confidence interval, it is plausible (but not certain) that the population proportions are equal.
  • Is the entire interval below or above 0?
  • Is the lower or upper part of the interval very close to 0? ⇒ the difference may be relatively small in practical terms.
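
The interpretation above can be tried on a small example. This sketch uses the usual large-sample interval (p̂1 - p̂2) ± z·SE with SE = sqrt(p̂1(1 - p̂1)/n1 + p̂2(1 - p̂2)/n2); the counts and the function name are invented for illustration.

```python
from statistics import NormalDist

def diff_proportions_ci(x1, n1, x2, n2, level=0.95):
    """Large-sample confidence interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

low, high = diff_proportions_ci(x1=120, n1=400, x2=80, n2=400)
zero_is_plausible = low <= 0 <= high   # False here: the whole interval lies above 0
```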
142
Q

What are the steps of comparing two population proportions with a two-sided significance test?

A
143
Q

How can you find the p-value graphically?

A
144
Q

What are the steps of comparing two population means with a two-sided significance test?

A
145
Q

What is the pooled standard deviation and what is it used for?

A

It is an estimate that combines information from two samples to provide a single estimate of variability ⇒ A common standard deviation using weighted average of the squares of the two sample standard deviations.

Used to compare population means, ASSUMING equal population standard deviations.
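
A short sketch of the weighted average described above, using the standard formula s = sqrt(((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2)). The function names and example values are assumptions for illustration.

```python
def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation: weights each sample variance by its degrees of freedom."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return pooled_var ** 0.5

def se_difference_pooled(s1, n1, s2, n2):
    """Standard error of (mean1 - mean2) when equal population SDs are assumed."""
    return pooled_sd(s1, n1, s2, n2) * (1 / n1 + 1 / n2) ** 0.5

s_pooled = pooled_sd(s1=7, n1=30, s2=4, n2=30)   # lies between 4 and 7
se_diff = se_difference_pooled(s1=7, n1=30, s2=4, n2=30)
```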

146
Q

What are dependent samples and matched pairs?

A

Dependent samples: each observation in one sample has a matched observation in the other sample.

The observations are called matched pairs.

A typical case of pairing: we measure twice on the same individual, once before and once after an intervention.

147
Q

How do you make a t-test on paired numeric data?

A

You just make an ordinary one-sample t-test based on the differences (within pairs, ie. “after” - “before”).
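
A minimal sketch of that recipe: compute the within-pair differences, then the one-sample t statistic t = mean(d) / (sd(d)/√n) with n - 1 degrees of freedom. The before/after values and the function name are made up, and the p-value lookup in a t-table is left out.

```python
import statistics

def paired_t_statistic(before, after):
    """One-sample t statistic on the differences (after - before), testing H0: mean difference = 0."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    se = statistics.stdev(diffs) / n ** 0.5   # standard error of the mean difference
    t = statistics.mean(diffs) / se
    return t, n - 1                           # t statistic and degrees of freedom

before = [10, 12, 9, 14, 11]   # made-up measurements
after = [12, 15, 10, 15, 14]
t, df = paired_t_statistic(before, after)
```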

148
Q

How do you compare means of dependent samples?

A

You make a significance test and a confidence interval using the single sample of difference scores.

149
Q

When can you assume that two groups of observations have the same standard deviation and use the formula for such cases?

A

If the larger of the two sample standard deviations is less than twice the size of the smaller.

7 vs. 4 can be used. 11 vs. 4 can not be used.

150
Q

What is a control variable and how is it used?

A

A control variable is a variable that is held constant in a multivariate analysis.

To analyze whether an association can be explained by a third variable, we treat that third variable as a control variable. We hold the control variable constant, such that whatever association that occurs cannot be due to effects of the control variable because in each part of the analysis it is not allowed to vary.

151
Q

When given an exercise, what are good questions to ask yourself to figure out which method to use to solve the exercise?

A
  • Means or proportions? (quantitative or categorical variable)
  • Independent samples or dependent samples?
  • Confidence interval or significance test?
  • Large n1 and n2 or not?

Most exercises have large independent samples, and confidence intervals are more useful than tests.

152
Q

What is a conditional distribution?

A

The conditional percentages of X, computed within a given category of Y, form the conditional distribution of X for that category.

Basically the list of conditional percentages in a given row.

153
Q

What are the conditional probabilities?

A

Again, basically the list of conditional percentages CONVERTED into probabilities/proportions.

154
Q

When are two categorical variables dependent and independent?

A
  • Independent; if the population conditional distributions for one of them are identical at each category of the other
  • Dependent; if the conditional distributions are not identical
155
Q

How do you calculate the expected cell count and how can the expected cell count be helpful?

A

Useful as we can quickly test whether two categorical variables are independent or not.

If independent; P (A and B) = P(A) x P(B) ⇒ The argument behind the formula.
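
The independence argument gives the standard formula: expected cell count = (row total × column total) / total sample size. The sketch below applies it to a made-up 2x2 table; the function name is an assumption for illustration.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

observed = [[30, 20],
            [20, 30]]          # made-up contingency table
expected = expected_counts(observed)
```

Here every expected count is 25, so the observed counts of 30 and 20 each sit 5 away from what independence predicts.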

156
Q

What is the chi-squared statistic?

A

A statistic that summarizes how far the observed cell counts in a contingency table fall from the expected cell counts.

If the null hypothesis should hold true, the observed and expected cell counts should be close in each cell.

157
Q

What is the formula for the chi-squared statistic?

A
158
Q

What are the main properties of chi-squared distribution?

A
  • Never negative (as we square the numbers)
  • The shape is characterized by the degrees of freedom dependent on the number of rows and columns: DF = (r - 1) x (c - 1)
  • The mean = degrees of freedom
  • As df increases, the distribution becomes more bell-shaped
    • In the beginning, it is skewed to the right. The skew lessens as df increases
  • Large chi-square = evidence against independence
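
As a sketch, the statistic itself is X² = Σ (observed - expected)² / expected, summed over all cells, with df = (r - 1) x (c - 1). The table and the function name below are made up for illustration, and the right-tail probability still has to be looked up in a chi-squared table.

```python
def chi_squared_statistic(observed):
    """Return (X^2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n   # expected count under independence
            x2 += (obs - exp) ** 2 / exp
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return x2, df

x2, df = chi_squared_statistic([[30, 20], [20, 30]])
```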
159
Q

What are the 5 steps of a chi-squared test of independence?

A

4) Use tables provided on learn. If the calculated X^2 falls above the critical value in the table, we know that its right-tail probability (the P-value) is smaller than the right-tail probability you are looking at.

160
Q

How do you make a chi-squared test comparing proportions in a 2x2 table?

A

Same steps as for a normal chi-squared test EXCEPT for the TEST STATISTIC, which instead of X^2 uses Z:

161
Q

What are some of the most common misuses of the Chi-squared test?

A
162
Q

Do large X^2 values imply strong association?

A

No.
The large X^2 value can be due to a large sample size.

163
Q

What is the standardized residual and what is its formula?

A

A standardized residual reports the number of standard errors that an observed count falls from its expected count.
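
A hedged sketch of the usual textbook formula, (observed - expected) / sqrt(expected × (1 - row proportion) × (1 - column proportion)), applied to a made-up table. The function name is an assumption for illustration.

```python
def standardized_residuals(observed):
    """Standardized residual for every cell of a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(observed):
        out = []
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            # Standard error of (obs - exp) when the variables are independent
            se = (exp * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)) ** 0.5
            out.append((obs - exp) / se)
        result.append(out)
    return result

residuals = standardized_residuals([[30, 20], [20, 30]])
```

In this table the residuals are +2 and -2: each observed count lies two standard errors from its expected count.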

164
Q

What does a standardized residual of +2 or -2 imply?

A

The observed count falls about two standard errors from the expected count. If the variables were independent, a deviation that large would occur only about 5 % of the time, so the cell gives evidence against independence.

165
Q

Why is it important to distinguish between the regression line slope and the correlation?

A

The correlation, r, does not depend on the units of measurement, whereas the slope does. For example, the line is steep if the response is measured in grams but flat if it is measured in kilos.

166
Q

How far is the predicted y from its mean?

A

At any given x value, the x value is a certain number of standard deviations from its mean, and then the predicted y is r times that many standard deviations from its mean.
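
This property can be checked numerically. The sketch below fits a least-squares line to made-up data, using slope = r·s_y/s_x and intercept = ȳ - slope·x̄, and then looks at an x value that is k = 2 standard deviations above its mean. All variable names and numbers are chosen for illustration.

```python
import statistics

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Correlation r, then the least-squares slope and intercept
r = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
slope = r * s_y / s_x
intercept = mean_y - slope * mean_x

def predict(x_value):
    return intercept + slope * x_value

k = 2
x_value = mean_x + k * s_x                        # x is k standard deviations above its mean
z_predicted = (predict(x_value) - mean_y) / s_y   # predicted y, in standard deviations of y
```

`z_predicted` comes out as k times r, so the prediction is pulled toward the mean whenever r is below 1 in absolute value.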

167
Q

What does a regression line show?

A

The estimated means of y at the various x-values.

168
Q

What does the residual standard deviation describe?

A

The typical size of the residuals.

169
Q

Why does the residual standard deviation differ from the standard deviation?

A

Because the STANDARD DEVIATION refers to the variability of all the y values around their mean, not just those at a fixed x value.

170
Q

What is the difference between a confidence interval for μy and a prediction interval for y?

A
  • Confidence interval for μy: inference about where a population mean falls
  • Prediction interval for y: inference about where individual observations fall
171
Q

How do you find a confidence interval for μy and a prediction interval for y?

A
172
Q

What is MSE?

A

The Mean Square Error.

173
Q

What is the ANOVA F statistic and how do you calculate it?

A
174
Q

What are the steps of making a two-sided significance test about a POPULATION SLOPE?

A
175
Q

When do we know that an exponential regression model is appropriate?

A

When the log of the response has an approximate straight-line relationship with the explanatory variable.
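
A small sketch of that check, assuming the exponential model is written as y = a·b^x so that log(y) = log(a) + x·log(b). A straight line is fitted to log(y) by least squares and transformed back. The data are made up (they follow y = 3·2^x exactly) and the function name is hypothetical.

```python
import math

def fit_exponential(x, y):
    """Fit y = a * b**x by least squares on log(y); returns (a, b)."""
    log_y = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_ly = sum(log_y) / n
    sxy = sum((xi - mean_x) * (li - mean_ly) for xi, li in zip(x, log_y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx                        # estimates log(b)
    intercept = mean_ly - slope * mean_x     # estimates log(a)
    return math.exp(intercept), math.exp(slope)

x = [0, 1, 2, 3, 4]
y = [3, 6, 12, 24, 48]      # doubles for every unit increase in x
a, b = fit_exponential(x, y)
```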

176
Q

Which of prediction and confidence interval has a wider interval and why?

A

Prediction interval.

A prediction interval says where a single observation would fall, which is more uncertain than where a population parameter lies (which is what a confidence interval shows).

177
Q

What are confidence intervals and prediction intervals called in JMP?

A
178
Q

What are the model assumptions regarding residuals and how do you check for them?

A
  • Residuals are (approximately) independent
    • No useful tool to check this; independence must be asserted from knowledge of the data
  • There is no relationship between residuals and x-values/covariates
    • You can plot residuals against covariates - look for non-linear relationships (see p. 654)
  • All the residuals have (approximately) the same standard deviation
    • Plot of residuals against fitted values; look for “trumpet shape” (p. 654)
      • If the assumption does not hold true, you can fix the data by transforming with logarithms
  • Residuals are normally distributed
    • QQ-plot of residuals. If normally distributed, points should be approximately on the line (Go to Analyze ⇒ Distribution ⇒ Normal Quantile Plot)
      • If the assumption does not hold true, you can fix the data by transforming with logarithms
179
Q

How do you find the confidence interval for a POPULATION SLOPE?

A
180
Q

What is the idea of multiple regression?

A

To use more than one explanatory variable to predict a response variable.

For example, predicting the price of a house just by its size in square feet would not be appropriate. We have to take multiple explanatory variables into account.

181
Q

What is a good rule of thumb regarding the relation between number of explanatory variables used in multiple regression models and the sample size, n?

A

A rough guideline is that the sample size, n should be at least 10 times the number of explanatory variables.

2 variables = n minimum of 20

3 variables = n minimum of 30

4 variables = n minimum of 40

182
Q

What does R mean?

A

The multiple correlation: the correlation between the observed y values and the values predicted by the multiple regression model. The ordinary bivariate correlation is denoted by r.

  • R falls between 0 and 1 (the predictions cannot correlate negatively with y - otherwise, they would be worse than merely using the mean of y to predict y, and thus we would not use multiple regression).
  • r falls between -1 and +1
183
Q

What are some properties of the F-distribution?

A
  • Can assume only nonnegative values
  • Is skewed to the right
184
Q

What does Bj mean?

A
185
Q

What is the full process of multiple regression? (steps)

A
186
Q

What does RSS mean?

A

Residual Sum of Squares

187
Q

What are t-ratios and how are they calculated?

A

Estimate / Standard error = t-ratio

188
Q

What is model DF?

A

Simply the number of explanatory variables you have in a multiple regression model.

189
Q

What is C. total sum of squares?

A

Model sum of squares + error sum of squares

190
Q

What is C. total DF?

A

Model DF + Error DF

191
Q

What is the F-ratio?

A

F = Model mean square / Error mean square

So F = 296,382 / 5,315
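
The ANOVA table quantities from the last few cards fit together as in this sketch, where p is the number of explanatory variables and n the sample size. The sums of squares, n and p are made-up inputs, not the numbers from the exercise above, and the function name is hypothetical.

```python
def anova_table(ss_model, ss_error, n, p):
    """ANOVA quantities for a multiple regression with p explanatory variables."""
    df_model = p                    # model DF = number of explanatory variables
    df_error = n - p - 1            # error DF
    ms_model = ss_model / df_model  # mean square = sum of squares / DF
    ms_error = ss_error / df_error
    return {
        "df_model": df_model,
        "df_error": df_error,
        "df_total": df_model + df_error,     # C. total DF = n - 1
        "ss_total": ss_model + ss_error,     # C. total sum of squares
        "F": ms_model / ms_error,            # F-ratio
        "r_squared": ss_model / (ss_model + ss_error),
    }

table = anova_table(ss_model=600, ss_error=200, n=23, p=2)
```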

192
Q

What are the parameter t-tests used for and what is the formula?

A

Estimate / Standard error = t-ratio

It is t-distributed with n - 1 - p degrees of freedom (assuming the null hypothesis)

Remember to check if the F-test is significant before you use the t-ratios.

193
Q

Which of these values are irrelevant for our or any other statistician's analysis?

A

T-ratio for the intercept - not interesting. EVER.

194
Q

What are common student mistakes regarding multiple regression analysis?

A
  • Commenting on the intercept t-ratio (it is not relevant)
  • Concluding that one parameter/effect/covariate is insignificant (from its p-value) and then using the p-values of the other effects from the same fit to conclude directly - WRONG ⇒ You have to re-fit the model without the covariate/effect/parameter that was not significant before you analyze the p-values of the other covariates/effects/parameters.
195
Q

What is the formula of the adjusted R^2 and why is it better?

A

It takes the size of the model into account, whereas the normal R^2 will give you a higher value (degree of explanation of the data) just by adding more variables.

However, the adjusted R^2 still doesn’t say much.
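
For reference, a sketch using the standard formula adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), with n observations and p explanatory variables. The function name and the example numbers are assumptions for illustration.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2: penalizes R^2 for the number of explanatory variables p."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

adj_two_vars = adjusted_r_squared(r_squared=0.75, n=23, p=2)
# Same R^2 with one more explanatory variable gives a lower adjusted value
adj_three_vars = adjusted_r_squared(r_squared=0.75, n=23, p=3)
```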

196
Q
A