terms Flashcards

1
Q

Standard error

A

The standard error is a measure of how closely the sample mean estimates the population mean: it is the standard deviation of the sampling distribution of the sample mean.
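For illustration, a minimal sketch (hypothetical sample values; NumPy is assumed, not part of the original card) of computing the standard error of the mean:

    import numpy as np

    x = np.array([2.1, 2.5, 1.9, 2.8, 2.3])  # hypothetical sample
    n = len(x)
    s = x.std(ddof=1)                        # sample standard deviation
    se = s / np.sqrt(n)                      # standard error of the mean: SE = s / sqrt(n)
    print(se)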

2
Q

Normal distribution

A

a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. The normal distribution appears as a “bell curve” when graphed.

3
Q

Adjusted R-squared

A

Adjusted R-squared is a modified version of R-squared that accounts for the number of predictors in the model. It increases when a new predictor improves the model more than would be expected by chance, and decreases when a predictor improves the model by less than would be expected by chance.
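As a worked sketch of the adjustment (hypothetical values for R-squared, n observations and k predictors):

    def adjusted_r_squared(r2, n, k):
        # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r_squared(r2=0.75, n=100, k=5))  # hypothetical example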

4
Q

p-value

A

The p-value is the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than the one actually observed. The "P" stands for probability; a small p-value indicates that the observed difference between groups would be unlikely if it were due to chance alone.
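A minimal sketch of obtaining a p-value (hypothetical data; SciPy's two-sample t-test is just one of many possible tests):

    from scipy import stats

    group_a = [5.1, 4.9, 5.4, 5.0, 5.2]  # hypothetical measurements
    group_b = [5.8, 6.0, 5.7, 6.1, 5.9]

    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(p_value)  # probability of a difference at least this extreme if the null hypothesis is true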

5
Q

heteroskedasticity

A

A violation of the assumption of homogeneous variance of the residuals (homoskedasticity): the variance of the residuals changes with the level of the predictors or fitted values.
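One common diagnostic is the Breusch-Pagan test; a minimal sketch on hypothetical data (statsmodels assumed, not part of the original card):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    # hypothetical data in which the residual variance grows with the first predictor
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = X @ np.array([1.0, -0.5]) + rng.normal(scale=np.abs(X[:, 0]) + 0.1)

    model = sm.OLS(y, sm.add_constant(X)).fit()
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
    print(lm_pvalue)  # a small p-value suggests heteroskedasticity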

6
Q

standard normal distribution

A

The standard normal distribution is a specific type of normal distribution that has a mean of 0 and a standard deviation of 1. It is often referred to as the Z-distribution. This distribution is a special case of the general normal distribution, which can have any mean and standard deviation.

7
Q

residuals

A

Residuals in data analysis are the differences between observed values and the values predicted by a statistical or machine learning model. They are a crucial diagnostic measure used to assess the quality and validity of a model.
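A minimal sketch (hypothetical observed and predicted values) of how residuals are computed:

    import numpy as np

    observed = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical y values
    predicted = np.array([2.8, 5.3, 6.9, 9.4])  # hypothetical model predictions

    residuals = observed - predicted            # residual = observed - predicted
    print(residuals)                            # approximately [0.2, -0.3, 0.1, -0.4]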

8
Q

null hypothesis

A

The null hypothesis represents a default or baseline assumption that there is no effect, no difference, or no relationship between the variables being studied. It serves as the starting point for hypothesis testing.

9
Q

statistical significance

A

Statistical significance is a concept used in hypothesis testing to determine whether the observed data in a study are unlikely to have occurred by chance alone. It helps researchers decide whether to reject the null hypothesis, which typically posits that there is no effect or no difference between groups.

10
Q

standard deviations

A

Standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of data values. It is a crucial concept in statistics, used to understand how spread out the values in a data set are around the mean (average).
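As a worked sketch (hypothetical data), the sample standard deviation spelled out:

    import numpy as np

    x = np.array([4.0, 7.0, 6.0, 5.0, 8.0])                # hypothetical data
    mean = x.mean()
    sd = np.sqrt(((x - mean) ** 2).sum() / (len(x) - 1))   # sample standard deviation
    print(sd)                                              # same result as x.std(ddof=1)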

11
Q

parameter

A

A parameter in statistics is a numerical value that describes a characteristic of an entire population. It is a fixed value that summarizes or describes an aspect of the population, such as the mean, standard deviation, or proportion.

12
Q

leverage point

A

A leverage point is an observation with predictor values that are far from the mean of the predictor values for all observations. This makes the fitted regression model likely to pass close to these points.
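A minimal sketch (hypothetical design matrix) of computing leverage as the diagonal of the hat matrix H = X(X'X)^-1 X':

    import numpy as np

    # hypothetical design matrix: intercept column plus one predictor with an extreme value
    X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 20.0])])

    H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
    leverage = np.diag(H)                 # leverage of each observation
    print(leverage)                       # the last observation has much higher leverage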

13
Q

logistic regression

A
Logistic regression is a technique that enables us to model dummy (binary) variables on the left-hand side of the regression equation.
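A minimal sketch of fitting one (hypothetical binary outcome; statsmodels is assumed, not part of the original card):

    import numpy as np
    import statsmodels.api as sm

    # hypothetical data: binary outcome y, single predictor x
    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))         # true probabilities
    y = rng.binomial(1, p)

    model = sm.Logit(y, sm.add_constant(x)).fit()  # models the log-odds of P(y = 1) as linear in x
    print(model.params)                            # estimated intercept and slope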
14
Q

interaction effect

A

An interaction effect describes a situation where the effect of one variable on the outcome variable changes depending on the value of another variable.
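A minimal sketch of fitting an interaction term (hypothetical data; the statsmodels formula API is assumed, not part of the original card):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # hypothetical data in which the effect of x1 depends on x2
    rng = np.random.default_rng(2)
    df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
    df["y"] = 1.0 + 0.5 * df["x1"] + 0.3 * df["x2"] + 0.8 * df["x1"] * df["x2"] + rng.normal(size=300)

    model = smf.ols("y ~ x1 * x2", data=df).fit()  # x1 * x2 expands to x1 + x2 + x1:x2
    print(model.params["x1:x2"])                   # estimated interaction coefficient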

15
Q

normal distribution

A

A continuous probability distribution that is symmetrical around its mean.

16
Q

Multicollinearity

A

Multicollinearity is a statistical phenomenon in which two or more independent variables in a multiple regression model are highly correlated. This high correlation means that the variables share similar information, which can lead to problems in estimating the relationship between each independent variable and the dependent variable.

17
Q

Z-scores

A

A z-score measures how many standard deviations an observation lies above or below the mean of its distribution. It is computed by subtracting the mean from the value and dividing by the standard deviation.
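A minimal sketch (hypothetical data) of converting values to z-scores:

    import numpy as np

    x = np.array([50.0, 60.0, 70.0, 80.0, 90.0])  # hypothetical data
    z = (x - x.mean()) / x.std(ddof=1)            # z = (value - mean) / standard deviation
    print(z)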

18
Q

chi-squared test

A

The chi-squared test is a statistical hypothesis test used to determine whether there is a significant association between categorical variables.
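A minimal sketch of the test (hypothetical 2x2 contingency table; SciPy assumed, not part of the original card):

    from scipy.stats import chi2_contingency

    # hypothetical contingency table: rows = gender, columns = voting preference
    observed = [[30, 10],
                [20, 40]]

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(chi2, p_value)  # a small p-value suggests an association between the variables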

19
Q

interquartile range

A

The interquartile range (IQR) is a measure of statistical dispersion, which indicates the spread of the middle 50% of a dataset. It is particularly useful for understanding the variability of data and is less affected by outliers and extreme values compared to other measures like the range.
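A minimal sketch (hypothetical data) of computing the IQR:

    import numpy as np

    x = np.array([1, 2, 4, 5, 6, 8, 9, 50])  # hypothetical data with an outlier
    q1, q3 = np.percentile(x, [25, 75])      # first and third quartiles
    iqr = q3 - q1                            # spread of the middle 50% of the data
    print(iqr)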

20
Q

When should you use logistic regression?

A

* When the dependent variable is binary (categorical)
* When the relationship between Y and X is non-linear

21
Q

What are some advantages of logistic regression over Ordinary Least Squares regression?

A

Logistic regression solves some of the problems of using OLS as a linear probability model (LPM):
i. Nonsensical predictions (predicted probabilities outside 0-1)
ii. Heteroskedasticity
iii. Constant effects across all values of X
iv. Residuals that are not normally distributed
v. Inaccurate R-squared
* The effect of X is not constant: the impact of X on Y depends on the value of X
* The impact is greatest around the mean of X, but much lower at the extremes of X

22
Q

What is the problem with multicollinearity?

A

Problems caused by multicollinearity:
* Inflated standard errors: multicollinearity increases the standard errors of the coefficients. This makes the estimates less precise and the confidence intervals wider, reducing the statistical power to detect significant effects.
* Unreliable estimates: coefficients can become highly sensitive to small changes in the model. Adding or removing a variable can cause large changes in the coefficients of the remaining variables.
* Difficulty in assessing individual predictor effects: it becomes hard to determine the individual effect of each predictor on the dependent variable because the predictors are not independent of each other.
* Reduced interpretability: the interpretation of regression coefficients becomes problematic. High multicollinearity makes it unclear which variables are actually influencing the dependent variable and to what extent.

23
Q

When might you find multicollinearity in your data?

A

Multicollinearity occurs when two or more predictor variables in a multiple regression model are highly correlated, meaning they contain similar information about the variance in the dependent variable.

24
Q

What is an appropriate diagnostic test for identifying multicollinearity in your data?

A

Variance Inflation Factor (VIF): VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity. A VIF value greater than 10 is often considered indicative of high multicollinearity.
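A minimal sketch of computing VIFs (hypothetical predictors; statsmodels assumed, not part of the original card):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # hypothetical predictors where x3 is almost a copy of x1 (high collinearity)
    rng = np.random.default_rng(3)
    x1 = rng.normal(size=200)
    x2 = rng.normal(size=200)
    x3 = x1 + rng.normal(scale=0.05, size=200)

    X = sm.add_constant(np.column_stack([x1, x2, x3]))
    vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
    print(vifs)  # VIFs for x1 and x3 will be large; the VIF for x2 will be close to 1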

25
Q

What does it mean if the estimates are unbiased?

A
26
Q

What should you do once you have found multicollinearity?

A

Addressing multicollinearity:
* Remove highly correlated predictors: if two predictors are highly correlated, consider removing one of them from the model.
* Combine predictors: combine correlated variables into a single predictor through techniques such as principal component analysis (PCA).
* Regularization techniques: use regression methods that can handle multicollinearity, such as ridge regression or lasso regression. These add a penalty to the regression to reduce the impact of multicollinearity.
* Centering predictors: standardizing or centering the predictors by subtracting the mean can sometimes help mitigate the effects of multicollinearity.

27
Q

What are some strengths and limitations of using bivariate measures of association for analysing associations between variables? Provide examples.

A

Strengths of bivariate measures of association:
* Simplicity and ease of interpretation: bivariate measures are straightforward to calculate and interpret. Example: Pearson's correlation coefficient (r) indicates the strength and direction of a linear relationship between two continuous variables; an r value of 0.8 suggests a strong positive relationship.
* Identification of relationships: these measures help identify whether a relationship exists between two variables and the nature of this relationship. Example: the chi-square test can determine if there is a significant association between gender (male, female) and voting preference (yes, no).
* Foundation for further analysis: bivariate analysis often serves as a preliminary step before more complex multivariate analysis. Example: identifying a strong correlation between study hours and exam scores can lead to a deeper investigation of additional factors affecting exam performance.
* Versatility: various measures are available for different types of data (nominal, ordinal, interval, and ratio). Example: Spearman's rank correlation is used for ordinal data, providing insights into the association between ranked variables.

Limitations of bivariate measures of association:
* Ignores the multivariate context: bivariate measures do not account for the influence of other variables, potentially leading to misleading conclusions. Example: a positive correlation between ice cream sales and drowning incidents ignores the lurking variable of temperature (hot weather increases both).
* Limited to two variables: these measures only analyse the relationship between two variables at a time, missing more complex interdependencies. Example: examining the relationship between income and education without considering the impact of age or experience.
* Assumption of linear relationships: some bivariate measures, like Pearson's r, assume a linear relationship, which may not capture the true nature of the association if it is non-linear. Example: the relationship between stress and performance may be curvilinear (an inverted U-shape), which Pearson's r would not adequately capture.
* Sensitivity to outliers: measures like Pearson's r are sensitive to outliers, which can distort the true strength and direction of the relationship. Example: a single extreme data point in the relationship between hours studied and test scores can significantly affect the correlation coefficient.
* Cannot establish causation: bivariate measures can indicate association but cannot prove causation. Example: a high correlation between shoe size and reading ability in children does not imply that larger feet cause better reading skills; both are related to age.

28
Q

What are two common transformations used in regression analysis, and when and why would we use them?

A
29
Q

What problems are we likely to encounter when we have specification error in our regression analysis?

A
30
Q

Why might the identification of large outliers in our analysis alert us to specification error?

A
31
Q

What is an appropriate diagnostic test for identifying omitted variables in your model?

A
32
Q

Estimates from a regression analysis that suffers from specification error can be biased. What does it mean if estimates are biased?

A
33
Q

What good reason might a researcher have for including an irrelevant variable in her model?

A