S2016Q5 Flashcards
What is designing in the language of statistics?
Setting up a hypothesis or question and deciding how to collect data
What is describing in the language of statistics? (descriptive statistics)
Summarizing data with numbers and graphs.
What are inferences in the language of statistics? (inferential statistics)
Decisions and predictions based on the data.
- Estimation
- Tests
- Confidence intervals
The typical statistical model assumes what? (model = assumptions)
- Independence of observations
- The same underlying distribution for all observations
- Some sort of systematic structure
(but this is not always the case)
What is the significance level? (rule of thumb)
5 %.
What is random sampling and why is it important?
Making sure that each subject in the population has the same chance of being in the sample, so that the sample is a good reflection of the population.
What does inferential statistics refer to?
Methods of making decisions or predictions about a population, based on data obtained from a sample of that population.
What is the difference between a parameter and a statistic?
Parameter: a numerical summary of the population
Statistic: a numerical summary of the sample taken from the population
When is a variable categorical and when is it quantitative?
- Categorical: if each observation belongs to one of a set of categories such as “Yes” and “No”
- Ordered: “ordinal” (e.g. exam grades)
- Unordered: “nominal” (e.g. male/female, type of business, zip codes)
- Quantitative: if observations take numerical values that represent different magnitudes of the variable (e.g. age or annual income, but NOT area code numbers).
What is unordered (nominal) data and what type?
Categorical: e.g.: Male/female, type of business, ZIP code etc.
What is ordered (ordinal) data and what type?
Categorical: e.g.: Grades, Likert scales etc.
What is a good graph?
Check colors: www.colorbrewer2.org
Remember to: use different lines, colors, different plotting symbols.
Remember it might be printed black/white
What is a discrete variable and what type?
Numerical: the value lies in a set of separate numbers, typically the non-negative integers
E.g.: 0,1,2,3… (number of employees, number of companies etc.)
What is a continuous variable?
Numerical: May take any value in an interval
E.g.: income, sales etc.
When is a variable discrete and when is it continuous?
- Discrete: it has separate possible values such as the integers 0, 1, 2, … for a variable expressed as “the number of…” (number of companies in a region, employees in a company etc.)
- Continuous: all possible values in an interval
What is the median?
The middle observation
E.g.: 1,1,1,2,2,2,3,3,4,5,6,7,7,8,8
Median = 3
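The example above can be checked with a short Python sketch. The first list is the data from the card; the second, even-sized list is a hypothetical addition to show that the median is then the average of the two middle observations.

```python
from statistics import median

# Data from the card: 15 observations, so the 8th sorted value is the median.
odd_data = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 7, 8, 8]
odd_median = median(odd_data)

# Hypothetical even-sized data set: the median is the mean of the two middle values.
even_data = [1, 2, 3, 4]
even_median = median(even_data)

print(odd_median, even_median)
```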
When is it called modal category and when is it called mode?
Modal category and mode both refer to being the most frequent answer in a data set.
Modal category ⇒ the category with the highest frequency
Mode ⇒ the numerical value (quantitative) that occurs most frequently
What are the primary graphical displays for summarizing a categorical variable?
- Pie chart
- Bar graph: the bar graph is usually preferred as it is easier to distinguish between two categories of approximately the same size
- When the bars are ordered by frequency, it is called a Pareto chart (after Vilfredo Pareto)
What are the primary graphical displays for summarizing quantitative variables?
- Dot plot: A dot plot shows a dot for each observation, placed just above the value on the number line for that observation. Can be useful for small data sets (<50 observations)
- Stem-and-leaf plot: Can be useful for small data sets (<50 observations)
- Histogram: The word is used for a graph with bars representing quantitative variables whereas bar graph is used for graphs with a categorical variable.
- Gives more flexibility in defining intervals and is better for big data sets (50+ observations)
What is the “mode” in a frequency table or histogram?
The highest point.
What do unimodal and bimodal refer to?
Whether the histogram or frequency table has a single mound or two distinct mounds.
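What do symmetric and skewed shapes refer to?
- Symmetric if the side of the distribution below a central value is a mirror image of the side above it
- Skewed to the left if the left tail is longer than the right
- The mean is then typically smaller than the median
- Skewed to the right if the right tail is longer than the left
- The mean is then typically larger than the median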
What is the “mean” of a distribution of a quantitative variable?
The sum of the observations divided by the number of observations.
(The average / The balance point of the distribution)
What is the median?
The median is the middle value of the observations when the observations are ordered from smallest to the largest.
(if you have 20 observations, the median is the average of the 10th and 11th observations)
What is an outlier?
An observation that falls well above or well below the mean.
What does the “range” refer to?
Difference between largest and smallest observation
What is the formula of a standard deviation?
What does a large “s” mean when working with standard deviations?
The larger the standard deviation, s, the greater the variability of the data.
s is a typical distance of observations from the mean (the average)
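For a sample, the usual formula is s = square root of [ sum of (x - mean)^2 / (n - 1) ]. A minimal sketch of that calculation, using hypothetical data and checked against Python's built-in `statistics.stdev`:

```python
from math import sqrt
from statistics import stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

n = len(data)
mean = sum(data) / n
# Sum of squared deviations from the mean, divided by n - 1 (sample version).
s = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

print(round(s, 3), round(stdev(data), 3))
```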
What is the empirical rule for BELL-SHAPED data distributions? (within 1 standard deviation + within 2) fun-fact
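For roughly bell-shaped data, about 68 % of the observations fall within 1 standard deviation of the mean, about 95 % within 2, and nearly all within 3. The sketch below checks this on simulated data; the normal distribution, the seed and the sample size are assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed only so the sketch is reproducible
data = [random.gauss(0, 1) for _ in range(50_000)]  # simulated bell-shaped data
m, s = mean(data), stdev(data)

def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    return sum(abs(x - m) <= k * s for x in data) / len(data)

within_1, within_2, within_3 = share_within(1), share_within(2), share_within(3)
print(within_1, within_2, within_3)
```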
What are quartiles and how do they relate to the median?
Median = Quartile 2 (50th percentile)
Q1 = 25th percentile
Q2 = 50th percentile
Q3 = 75th percentile
By comparing the distance between Q1 and Q2 with the distance between Q2 and Q3, you can also tell something about the shape of the distribution.
What is the interquartile range?
The range between Q3 and Q1
IQR = Q3 - Q1
It is often better to use the IQR instead of the range or standard deviation to compare the variability for distributions that are very highly skewed or that have severe outliers.
What is the 1.5 x IQR criterion and what is it for?
It is used to detect potential outliers.
An observation is a potential outlier if it falls more than 1.5 x IQR below Q1 or more than 1.5 x IQR above Q3 (IQR = Q3 - Q1).
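A minimal sketch of the criterion in Python. The data are hypothetical, and quartile conventions differ slightly between tools, so the exact values of Q1 and Q3 may not match other software.

```python
from statistics import quantiles

data = [10, 12, 12, 13, 14, 15, 15, 16, 18, 40]  # hypothetical; 40 looks suspicious

q1, q2, q3 = quantiles(data, n=4)  # the three quartiles (default "exclusive" method)
iqr = q3 - q1

# Observations beyond these fences are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

potential_outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(potential_outliers)
```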
What are the five numbers used in a box plot?
- Minimum value
- Maximum value
- Q1
- Q2
- Q3
What is the Z-score and how is it calculated?
The number of standard deviations that an observation falls from the mean.
z = (observation - mean) / standard deviation
A positive Z-score means that the observation is above the mean; a negative Z-score means that it is below the mean
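The calculation can be sketched as follows, using the sample mean and standard deviation of a hypothetical data set. The helper name `z_score` is made up for this illustration.

```python
from statistics import mean, stdev

data = [50, 60, 70, 80, 90]  # hypothetical exam scores

m = mean(data)   # sample mean
s = stdev(data)  # sample standard deviation

def z_score(x):
    """Number of standard deviations that x falls from the mean."""
    return (x - m) / s

z_high = z_score(90)  # above the mean, so positive
z_low = z_score(50)   # below the mean, so negative
print(round(z_high, 2), round(z_low, 2))
```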
What are examples of response variables and explanatory variables?
When do we say that there is an association between two variables?
As soon as the value for one variable is more likely to occur with certain values of the other variable.
If there is no association between two variables, what do we call them?
Independent variables.
What is a contingency table and how does it look?
A display for two categorical variables.
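What are conditional proportions?
Proportions that are calculated conditional on the category of another variable, e.g. the proportion with pesticides present within each food type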
Contingency table:
22.8% and 73.3% are conditional proportions, while the total 73.1% is not a conditional proportion, as it does not depend on the type of food (it is called a marginal proportion instead).
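The difference between conditional and marginal proportions can be sketched in Python. The table itself is not shown on the card, so the counts below are an assumption, chosen so that they reproduce the three percentages mentioned above.

```python
# Assumed counts per food type: (pesticide present, pesticide not present).
table = {
    "organic": (29, 98),
    "conventional": (19485, 7086),
}

# Conditional proportions: share with pesticides present, given the food type.
conditional = {
    food: present / (present + absent)
    for food, (present, absent) in table.items()
}

# Marginal proportion: share with pesticides present over all food types combined.
total_present = sum(present for present, _ in table.values())
total = sum(present + absent for present, absent in table.values())
marginal = total_present / total

print(conditional, marginal)
```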
What are the three types of cases that exist when investigating the association between two variables?
- Two categorical variables: Food type and pesticide status
- One quantitative and one categorical: Income and gender
- Two quantitative
Which variable should be called x and which should be called y in a scatterplot?
- Y-axis: The response variable
- X-axis: the explanatory variable
What is a scatterplot?
A graphical display for two quantitative variables using the horizontal axis (x) for the explanatory variable and the vertical axis (y) for the response variable.
It is used to study association between two quantitative variables.
When do we call it a positive association and when do we call it a negative association?
- Positive association: As x increases, y increases
- Negative: As x increases, y decreases.
Picture of NEGATIVE association
What do r = 1, r = -1 and r = 0.4 mean?
r = the correlation between two quantitative variables. It is always between -1 and 1.
r = 1: a perfect positive straight-line association
r = -1: a perfect negative straight-line association
The closer r is to 1 or -1, the stronger the straight-line association.
r = 0.4 shows a positive but fairly weak straight-line association.
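A minimal sketch of how r can be computed, with hypothetical data that lie exactly on a straight line. The function name `correlation` is made up for this illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_pos = correlation(x, [2, 4, 6, 8, 10])  # points on a rising straight line
r_neg = correlation(x, [10, 8, 6, 4, 2])  # points on a falling straight line
print(r_pos, r_neg)
```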
In case you have a data set with an outlier far from the rest of the data that makes the scatterplot hard to interpret, what do you do?
You take the log of the numbers.
(the correlation is not the same after using log)
What is a quadrant?
Name an example of a case where the correlation, r would be inappropriate to use
If the relation between two variables is curved.
E.g. medical expenses and age: expenses are high at a young age, then lower, and finally higher again in old age.
The association clearly exists, but the correlation r is not appropriate for describing it (r only describes straight-line associations).
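A small sketch of this pitfall. The ages and the U-shaped expense pattern are invented for illustration: the two variables are perfectly related by a curve, yet r comes out as (essentially) 0.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

ages = [0, 10, 20, 30, 40, 50, 60, 70, 80]
# Assumed U-shaped pattern: expenses are high for the very young and the old.
expenses = [(age - 40) ** 2 for age in ages]

r_curved = correlation(ages, expenses)
print(r_curved)
```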
What is a regression line and what is it used for?
A regression line is a straight line fitted to the data of two quantitative variables that describes their association.
It is used to predict the value of the response variable, y, at a given value of x.
Regression line = Prediction line
Regression equation = prediction equation
What is a residual?
A residual is the prediction error. In other words, the vertical distance between the observed y and the predicted y at a given x-value (observed y minus predicted y).
If the observed y is bigger than the predicted y, the residual is positive
How does software find the optimal regression line?
Using the least squares method.
The regression line has some positive and some negative residuals, and the sum (and mean) of the residuals equals 0.
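A minimal sketch of the least squares calculation with hypothetical data. It uses the standard formulas for the slope and intercept and confirms that the residuals sum to (practically) 0.

```python
x = [1, 2, 3, 4, 5]              # hypothetical explanatory values
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical responses

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares estimates for the line y_hat = a + b * x.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx             # slope
a = mean_y - b * mean_x   # y-intercept

predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # observed minus predicted

print(round(a, 2), round(b, 2), round(sum(residuals), 10))
```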
What is the primary difference between the correlation, r and the regression method?
- Regression:
- We must identify response and explanatory variables (we get a different line if we use x to predict y than if we use y to predict x).
- The slope can be any real number (NOT just between -1 and 1)
- The values of the y-intercept and slope of the regression line depend on the units
- Correlation:
- We get the same correlation regardless of which variable is treated as x and which as y.
- Falls between -1 and 1
What are the classic pitfalls when analyzing associations?
- Extrapolations are unreliable, especially when they reach far into the future
- Influential outliers: observations that fall far from the trend and have a large influence on your regression line (especially with small data sets)
- The best way to avoid being misled by them is to plot the data, so that you notice they are part of your data set.
- Thinking that correlation implies causation
- E.g. higher education levels are correlated with higher crime rates. The two are not directly connected, but both are connected to a higher urbanization rate ⇒ more highly educated people live in cities, where the crime rate also tends to be higher.
- Simpson’s paradox
- Confounding
What must hold before you can call an observation an influential outlier?
- Its x value is relatively low or high compared to the rest of the data
- The observation is a regression outlier, falling quite far from the trend that the rest of the data follow
It is always a good idea to remove the outliers from the data set, plot the data again and see whether the regression line changes a lot or not.
What is a lurking variable?
A variable, usually unobserved, that influences the association between the variables of primary interest.
E.g. the two variables number of people drowning on Cold Coast in a given month and number of ice creams sold in a given month.
The lurking variable is the number of people using the beaches in the given months (could also be correlated to the monthly mean temperature).
What is Simpson’s paradox?
That the direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of that variable.
Smoking example:
Was smoking actually beneficial for your health since a lower percentage of the smokers in the study died over the 20-year period?
No. Not when we took the age of the women studied at the beginning of the study into account.
The smokers were younger, and therefore less likely to die.
Correlation example: the correlation is +0.85 if you take all the data together, which is clearly misleading. However, if you split the data into two groups, you get two negative correlations (one is -0.9), which looks correct. ALWAYS PLOT THE DATA
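The reversal of direction can be reproduced with a tiny sketch. The two groups below are invented for illustration and do not reproduce the 0.85 and -0.9 from the card: within each group the association is negative, but pooling the groups gives a strongly positive correlation.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical groups: y falls as x rises inside each group.
group_a_x, group_a_y = [1, 2, 3], [3, 2, 1]
group_b_x, group_b_y = [11, 12, 13], [13, 12, 11]

r_a = correlation(group_a_x, group_a_y)
r_b = correlation(group_b_x, group_b_y)
# Pooling the groups hides the third variable (group membership).
r_all = correlation(group_a_x + group_b_x, group_a_y + group_b_y)

print(r_a, r_b, round(r_all, 3))
```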
What is confounding?
When two explanatory variables are both associated with a response variable but are also associated with each other.
- Smokers had a greater survival rate than nonsmokers
- However, AGE was a confounding variable
- Older subjects were less likely to be smokers, and older subjects were more likely to die.
- Within each age group, smokers had a lower survival rate than non-smokers.
- In conclusion, age had a dramatic influence on the association between smoking and survival status
What is the difference between confounding and lurking variables?
It is essentially the same, BUT a lurking variable is not measured in the study whereas the confounding variable is.
In other words, the lurking variable is a potential confounding variable that has not been taken into account.
What is an experimental study and what is an observational study?
- Experimental: the researcher conducts an experiment by assigning subjects to certain experimental conditions and then observing the outcomes on the response variable.
- The experimental conditions, which correspond to assigned values of the explanatory variable, are called treatments
- Observational: non-experimental; The researcher observes values of the response variable and explanatory variables for the sampled subjects, without anything being done to the subjects.
What kind of study is most reliable in terms of explanatory variables?
An experimental study. Because it is easier to adjust for lurking variables in an experiment than in an observational study, we can study the effect of an explanatory variable on a response variable more accurately with an experiment than with an observational study.
What are good places to collect available data?
What is simple random sampling?
A method of sampling in which each possible sample of a given size has the same chance of being selected.
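A minimal sketch using Python's `random.sample`, which draws without replacement so that every possible sample of the requested size is equally likely. The population of 100 subject IDs and the seed are assumptions for illustration.

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 subject IDs

random.seed(42)  # fixed seed only so the sketch is reproducible
sample = random.sample(population, k=10)  # simple random sample of size 10

print(sorted(sample))
```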
What are the ways of collecting data in sample surveys?
- Personal interviews: allow longer questions, but you are unlikely to get honest answers on sensitive topics such as drugs, alcohol, sex etc.
- Telephone interviews: like a personal interview but less costly; however, subjects might not be as patient as in a personal interview
- Usually, the one used for national surveys by GSS and Gallup etc.
- Self-administered questionnaire: cheap but many might fail to participate
What are the primary sources of potential bias in sample surveys?
- Sampling bias: the use of an inappropriate sampling method, e.g. one that does not cover the entire population or that favors a certain group of people
- Nonresponse bias: e.g. a survey reported that 70 % of American women married for 5+ years had had an affair, but only 4,500 out of 100,000 women replied. The data are essentially useless, as it might mainly be the ones who had an affair that replied.
- Response bias: e.g. if the interviewer asks a question in a leading way, such that subjects are more likely to respond a certain way. Subjects may also give dishonest answers when the true answer is not ethically correct or socially acceptable.
What are some classic surveys that are unreliable?
Surveys made with:
- Convenience samples: e.g. talking to people coming out of a shopping mall. It is unlikely that these people are representative of the entire population due to time, interest, etc.
- Volunteer samples: when people voluntarily answer questionnaires online
What are some elements of good experiments?
- Comparison control groups (often using a placebo treatment to make sure it seems identical)
- Randomization
- Blinding the study: making sure that the subjects don’t know whether they get the placebo or the actual treatment (e.g. a pill)
- Replication: repeating the study to make sure that you get approximately the same result each time.
What does it mean that a study has statistically significant results?
The number of observations/subjects is large enough that chance alone is unlikely to explain the difference between the results.
For example: before = 44 % and after = 52 %. If the number of subjects chosen by randomization is large enough, chance alone is unlikely to explain the 8 percentage point difference.
What is cluster random sampling?
As simple random sampling can often be hard, you can divide the population into a large number of clusters, such as city blocks.
Then you select a simple random sample of the clusters and use the subjects in those clusters as the sample.
What is stratified random sampling?
You divide the population into separate groups, called strata, and then select a simple random sample from each stratum.
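Cluster and stratified sampling can be contrasted in a short sketch. The population, the block and stratum labels, and the sample sizes are all hypothetical.

```python
import random

random.seed(0)  # fixed seed only so the sketch is reproducible

# Hypothetical population: 20 city blocks (clusters) with 10 residents each;
# every resident also belongs to a stratum ("young" or "old").
population = [
    {"id": block * 10 + i, "block": block, "stratum": "young" if i < 4 else "old"}
    for block in range(20)
    for i in range(10)
]

# Cluster sampling: simple random sample of blocks, then keep everyone in them.
chosen_blocks = set(random.sample(range(20), k=3))
cluster_sample = [p for p in population if p["block"] in chosen_blocks]

# Stratified sampling: simple random sample of subjects within each stratum.
stratified_sample = []
for stratum in ("young", "old"):
    members = [p for p in population if p["stratum"] == stratum]
    stratified_sample += random.sample(members, k=5)

print(len(cluster_sample), len(stratified_sample))
```

In the cluster sample only a few blocks are represented, whereas the stratified sample is guaranteed to contain subjects from every stratum.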
What are the advantages and disadvantages of the 3 different sampling methods?
What do retrospective and prospective refer to?
- Retrospective: backward looking (looks into the past)
- Prospective: forward looking (takes a group of people and observes them in the future)
What is a cross-over design?
A study in which the two groups switch treatments during the study.
This helps ensure that lurking variables do not affect the results.
What does cumulative proportion mean?
When doing trials, it is the proportion of times a certain outcome has happened out of the total number of trials made so far.
Trial = simulation of something (simulation of die rolls)
What does this graph illustrate?
That random phenomena are unpredictable in the short run, when only a small number of trials have been made.
However, in the long run, things get very predictable.
This (together with people’s gambling addiction) is what makes casinos a good business. Even though a gambler might be lucky in the short run, the casino will win in the long run.
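The short-run versus long-run behavior can be simulated with die rolls. This is a sketch under assumptions: the outcome tracked (rolling a six), the seed and the number of trials are chosen only for illustration.

```python
import random

random.seed(7)  # fixed seed only so the sketch is reproducible

sixes = 0
cumulative = []  # cumulative proportion of sixes after each roll
for trial in range(1, 10_001):
    if random.randint(1, 6) == 6:
        sixes += 1
    cumulative.append(sixes / trial)

early = cumulative[9]   # after 10 rolls: can be far from 1/6
late = cumulative[-1]   # after 10,000 rolls: close to 1/6
print(early, late)
```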
What is probability?
Probability of a particular outcome is the proportion of times that the outcome would occur in a long run of observations.
What does sample space mean?
The sample space is the set of all possible outcomes.
An event is a subset of the sample space.
An event corresponds to a particular outcome or a group of possible outcomes.
What does the complement of an event mean?
What does it mean when two events are disjoint?
They do not have any common outcomes ⇒ They cannot happen at the same time.
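Events can be represented as Python sets, which makes the idea easy to check. The die-roll events below are hypothetical examples.

```python
sample_space = {1, 2, 3, 4, 5, 6}  # all outcomes of one roll of a die

# Each event is a subset of the sample space.
even = {2, 4, 6}
odd = {1, 3, 5}
at_least_five = {5, 6}

even_odd_disjoint = even.isdisjoint(odd)             # no common outcomes
even_high_disjoint = even.isdisjoint(at_least_five)  # 6 belongs to both events

print(even_odd_disjoint, even_high_disjoint)
```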
What do the intersection and union of two events refer to and what is the difference?