Finals Flashcards

1
Q

What does the Central Limit Theorem prove?

A

The sampling distribution of the sample mean is approximately normally distributed when the sample size n is sufficiently large, regardless of the shape of the population distribution.
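As an illustrative sketch (not part of the original card, data simulated), the CLT can be demonstrated by drawing repeated samples from a skewed exponential population: the distribution of the sample means ends up centred on the population mean with spread σ/√n.

```python
import numpy as np

# Simulate the sampling distribution of the mean for a skewed
# (exponential) population: by the CLT it should be approximately
# normal for sufficiently large n.
rng = np.random.default_rng(0)
n, reps = 50, 10_000

# Exponential(scale=1) has population mean 1 and population sd 1.
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# The mean of the sample means is close to the population mean (1.0),
# and their spread is close to sigma / sqrt(n) = 1 / sqrt(50) ≈ 0.141.
print(round(sample_means.mean(), 2))
print(round(sample_means.std(), 2))
```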

2
Q

What is a problem about the Central Limit Theorem? What can we use instead?

A

The standard deviation of the population (σ), which we need in order to apply the CLT in a z-test, is often not known. Instead we can perform a one-sample t-test, which only requires the sample standard deviation (s).

3
Q

What is a one-sample t-test? When should you use it? +Formula

A

It is used to compare the mean of a single sample to an expected (hypothesised) value.
You should use this test when:
- You do not know the population standard deviation.
- You have one sample and a reference value to compare it against.
Formula: t = ( x̄ – μ ) / ( s / √n )
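A minimal sketch of the test in Python (the sample values and μ below are made up for illustration); the hand-computed formula agrees with scipy's `ttest_1samp`:

```python
import numpy as np
from scipy import stats

# One-sample t-test: compare a sample mean against an expected value mu.
sample = np.array([5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.0, 5.3])
mu = 5.0  # hypothesised population mean

# Manual formula: t = (x̄ - mu) / (s / sqrt(n)), with s the sample sd.
n = len(sample)
t_manual = (sample.mean() - mu) / (sample.std(ddof=1) / np.sqrt(n))

# The same test via scipy
t_scipy, p_value = stats.ttest_1samp(sample, mu)

print(round(t_manual, 4) == round(t_scipy, 4))  # the two agree
```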

4
Q

What can you tell me about Exploratory data analysis (EDA)? (2 bulletpoints)

A
  • EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis-testing task.
  • It is also the best paradigm for making statements about both validity and reliability.
5
Q

What scales of data are there? Explain and give an example. (a lot of text)

A
  1. Categorical (Nominal)
    ○ uses labels to classify cases into classes
    ○ Examples: gender, nationality, residence, car brand
  2. Ordinal
    ○ permissible transformations: any monotonically increasing function
    ○ if X > Y then log(X) > log(Y)
    ○ PRESERVES ORDER, NOT MAGNITUDE
    ○ ratings and rankings
    ○ Example: not at all, slightly, fairly, much, very much
  3. Interval
    ○ permissible transformations: Y = aX + b
    ○ Example: What is the exact temperature in your city?
  4. Ratio
    ○ permissible transformations: Y = aX
    ○ difference to ordinal: conveys not only order but also the size of the differences between values, along with a true zero
    ○ Example: How many children do you have? (0, 1, 2, …)
    ○ IT HAS A NATURAL ZERO POINT (total absence of the variable of interest, i.e. not having any children)
6
Q

What are the properties of a reliable research tool?

A

A reliable research tool is consistent, stable, predictable and accurate.

7
Q

What does the parallel forms reliability do? When do you use it?

A

It measures the correlation between two equivalent versions of a test. You use it when you have two different assessment tools or sets of questions designed to measure the same thing.

8
Q

What does the test - retest reliability do?

A

It measures the consistency of a test by administering the same test to the same group at different points in time and correlating the scores.

9
Q

What is the split half technique? What would ensure an acceptable level of reliability in the measurments?

A

It is a method for checking a measuring instrument in which the scores from one half of the items are correlated against the scores from the other half. A correlation coefficient of 0.9 would ensure an acceptable level of reliability in the measurements.
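A minimal sketch of the technique (the respondent-by-item scores below are made up for illustration): split the items into two halves and correlate the half totals.

```python
import numpy as np

# Split-half reliability sketch: scores of 6 respondents on 8 test items.
# The items are split into two halves (odd vs even items) and the half
# totals are correlated.
scores = np.array([
    [4, 5, 4, 5, 3, 4, 5, 4],
    [2, 1, 2, 2, 1, 2, 1, 2],
    [5, 5, 4, 5, 5, 4, 5, 5],
    [3, 3, 2, 3, 3, 3, 2, 3],
    [1, 2, 1, 1, 2, 1, 1, 2],
    [4, 4, 5, 4, 4, 5, 4, 4],
])

half_a = scores[:, ::2].sum(axis=1)   # items 1, 3, 5, 7
half_b = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8

r = np.corrcoef(half_a, half_b)[0, 1]
print(r > 0.9)  # consistent respondents give a high split-half correlation
```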

10
Q

What is the inter - rater reliability? How is it calculated?

A

It is the extent to which two or more raters agree. It is calculated with COHEN'S KAPPA. (Formula: κ = (po − pe) / (1 − pe), where po is the observed agreement and pe the agreement expected by chance.)
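A minimal sketch of the kappa computation for two raters (the labels below are made up for illustration):

```python
from collections import Counter

# Cohen's kappa for two raters labelling the same 10 items.
rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
rater2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]

n = len(rater1)

# Observed agreement p_o: fraction of items both raters label identically.
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

# Expected chance agreement p_e from the raters' marginal label frequencies.
c1, c2 = Counter(rater1), Counter(rater2)
p_e = sum(c1[label] * c2[label] for label in c1) / n**2

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 3))  # 0.583 for this data
```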

11
Q

How is the standard normal distribution curved? Give its 2 parameters and their values.

A

It is bell curved.

The parameters are the mean ( μ = 0 ) and the standard deviation ( σ = 1 ).

12
Q

What is the difference between a T-distribution and a normal distribution?

A

A t-distribution has the same bell shape as a normal distribution but heavier tails; as the degrees of freedom increase, it approaches the normal distribution.

13
Q

What do you know about the Monte Carlo method?

A

● Any problem that is deterministic in principle can also be solved by MC. It relies on repeated random sampling in order to obtain a good estimate or approximation of the exact p-value.
● MC is used when the data set does not meet the requirements necessary for parametric or asymptotic methods.
● Computing an exact p-value is possible via exact tests and randomisation tests, but only for small data sets. MC also works with large data sets.
● The Monte Carlo method tells you:
○ all of the possible events that could happen,
○ the probability of each possible outcome.
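A minimal sketch of the idea of repeated random sampling (the dice example is my own, not from the card): estimating a probability whose exact value (1/6) is known, so we can check the approximation.

```python
import random

# Monte Carlo sketch: estimate P(sum of two dice == 7) by repeated
# random sampling; the exact value is 6/36 = 1/6.
random.seed(42)
trials = 100_000
hits = sum(
    random.randint(1, 6) + random.randint(1, 6) == 7
    for _ in range(trials)
)
estimate = hits / trials
print(abs(estimate - 1 / 6) < 0.01)  # close to the exact probability
```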

14
Q

What are regularisation techniques? + Examples

A

They are techniques used to manage the bias–variance trade-off: they deliberately introduce a small amount of bias by shrinking the regression coefficients (slightly changing the slope of the regression line), which reduces variance. Lasso and ridge regression are examples.

15
Q

Explain “Bias Variance Trade-Off”? What is the name of the techniques used here?

A

● Bias–variance trade-off is the property of a model that the variance of the parameter estimates across samples can be reduced by increasing the bias in the estimated parameters.
● Regularization techniques are used.

16
Q

What is the standard error?

A

It is the standard deviation of the sampling distribution.

17
Q

Explain type 1 and type 2 error.

A

Type I error (false positive): rejecting the null hypothesis although it is actually true; its probability is the significance level α. Type II error (false negative): failing to reject the null hypothesis although it is actually false; its probability is β.
18
Q

What is sampling bias? Where does it come from? What non-random sample types are there? (important) Explain them. (not crucial I guess, but interesting)

A

Sampling bias is a type of selection bias and involves systematic error due to a non random sample of a population:

  • Convenience sampling is a method where market research data is collected from a conveniently available pool of respondents. (remember WEIRD from the cognitive science lectures).
  • Snowball sampling is a technique in which existing participants recruit further participants from among their acquaintances, e.g. in studies of hard-to-reach populations such as drug addicts or gamblers. The sample's origin is hard to trace and verify: where does the snowball come from?
  • Quota sampling is a technique in which researchers choose individuals (for the sample) according to specific traits or qualities.
19
Q

Explain the different variables there are in an experiment.

A
  • Independent variable: a variable the experimenter changes or controls, assumed to have a direct effect on the dependent variable.
  • Dependent variable: a variable being tested and measured in an experiment; it is “dependent” on the independent variable.
  • Extraneous variables: all variables that are not the independent variable but could still affect the results of the experiment.
20
Q

What is “residual”?

A

It is the difference between the observed value and the value that a supervised learning model predicts for that observation. In other words, it is a measure of how much a regression line vertically misses a data point.
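A minimal sketch in Python (the data points are made up for illustration): fit a least-squares line and compute observed minus predicted.

```python
import numpy as np

# Residuals sketch: fit a regression line and compute how much the
# line vertically misses each data point.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted  # observed minus predicted

# For an ordinary least-squares line with an intercept,
# the residuals sum to (about) zero.
print(abs(residuals.sum()) < 1e-9)
```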

21
Q

Explain prevalence, sensitivity, specificity, Positive Predictive Value, Negative Predictive Value and accuracy and give their formulas.

A

● Prevalence: total number of cases of a disease existing in a population divided by the total population. P(Z) = ( TP + FN ) / ( TP + TN + FP + FN )
● Sensitivity: the proportion of people with the disease who will have a positive test result; P(T|Z) = TP / ( TP + FN ) [denominator: people with the disease]
● Specificity: the proportion of people without the disease who will have a negative result; P(-T|-Z) = TN / ( TN + FP ) [denominator: people without the disease]
● Positive Predictive Value: the probability that patients with a positive test result actually have the disease; P(Z|T) = TP / ( TP + FP ) [denominator: people with a positive test]
● Negative Predictive Value: the probability that people with a negative test result truly do not have the disease; P(-Z|-T) = TN / ( TN + FN ) [denominator: people with a negative test]
● Accuracy: it measures the overall correctness of a diagnostic test on a condition; ( TP + TN ) / total
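A minimal worked example of these formulas (the confusion-matrix counts below are made up for illustration):

```python
# Diagnostic-test metrics from confusion-matrix counts.
TP, FN, FP, TN = 90, 10, 50, 850  # 100 diseased, 900 healthy people
total = TP + FN + FP + TN

prevalence  = (TP + FN) / total   # diseased / everyone        -> 0.10
sensitivity = TP / (TP + FN)      # P(T|Z)                     -> 0.90
specificity = TN / (TN + FP)      # P(-T|-Z)                   -> ~0.944
ppv         = TP / (TP + FP)      # P(Z|T)                     -> ~0.643
npv         = TN / (TN + FN)      # P(-Z|-T)                   -> ~0.988
accuracy    = (TP + TN) / total   # overall correctness        -> 0.94

print(prevalence, sensitivity, round(specificity, 3), round(ppv, 3))
```

Note how PPV (0.643) is far below sensitivity (0.9) here: with low prevalence, even a sensitive test produces many false positives.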

22
Q

What happens if the standard error of the mean gets decreased?

A

It will also decrease the difference between the lower and the upper bound of the confidence interval, allowing for more accurate, precise conclusions. In other words, the narrower the confidence interval, the more precise the conclusions.

23
Q

What are computer intensive techniques? What techniques / concepts are considered as CIT’s?

A

● Sometimes called resampling
● Involve intensive use of computers to compute thousands of new samples and derived statistics or other values of interest, in order to do inferential statistics in an “empirical” way, to improve system performance and to validate models
● Examples of CIT:
○ Bootstrapping
○ Monte Carlo Methods
○ Randomisation Test, Permutation Tests, Exact Tests

24
Q

A contingency table allows for …

A

… exact probability statements.

25
Q

What do you know about the Z-score? Give the formulas and the known P-values with their Z-scores.

A

● A z-score gives you an idea of how far from the mean a data point is. More technically, it is a measure of how many standard deviations below or above the population mean a raw score is. In order to use a z-score, you need to know the mean μ and the population standard deviation σ. Formula: Z = ( x – μ ) / σ
● When μ and σ are unknown, the z-score may be calculated using the sample mean (x̄) and sample standard deviation (s) as estimates of the population values. Formula: Z = ( x – x̄ ) / s
● A z-score of 2.5 means that the score lies 2.5 standard deviations above the mean. A score of -0.75 means that the score lies 0.75 standard deviations below the mean.
● P(Z > 1.96) = 2.5%
● P(Z > 1.645) = 5%
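A minimal sketch of the formula and the two tail probabilities (μ, σ and the raw score below are made up for illustration):

```python
from scipy import stats

# Z-score sketch: exam scores with population mean 70 and sd 10.
mu, sigma = 70.0, 10.0
x = 95.0
z = (x - mu) / sigma
print(z)  # 2.5 -> the score lies 2.5 sd above the mean

# Upper-tail probabilities of the standard normal distribution:
print(round(stats.norm.sf(1.96), 3))   # 0.025 -> P(Z > 1.96) = 2.5%
print(round(stats.norm.sf(1.645), 3))  # 0.05  -> P(Z > 1.645) = 5%
```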

26
Q

What is cross validation?

A

● A way to implement the bias–variance trade-off and the train–test paradigm
● A sophisticated way to partition the data
● It is one of the techniques used to test the effectiveness of machine learning models; it is also a resampling procedure used to evaluate a model when we have limited data.
● The sophisticated way of cross-validation: all data alternately play the role of training and test data.
● Less sophisticated way: use 75% of the data as training data and 25% as test data.
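A minimal sketch of the "sophisticated" partitioning (k-fold), showing only the index bookkeeping, no model; the helper name `kfold_indices` is my own:

```python
import numpy as np

# k-fold cross-validation sketch: every observation alternately plays
# the role of test data exactly once.
def kfold_indices(n_samples: int, k: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)  # shuffle to avoid order bias
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# With 10 samples and 5 folds, each fold tests on 2 samples.
all_test = np.concatenate([test for _, test in kfold_indices(10, 5)])
print(sorted(all_test) == list(range(10)))  # each index is tested exactly once
```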

27
Q

Explain overfitting. Which properties does it have?

A
  • Occurs when we fit the model perfectly to the data at hand (zero bias) but it will perform poorly and unpredictably on new data, across different samples (high variance)
  • It has 0 or low bias and high variance.
28
Q

What is bias?

A

It is the difference between a model's expected (average) prediction and the true value it tries to predict.

29
Q

What is the Mann Whitney’s test?

A

It’s a nonparametric alternative to the t-test for two independent samples. It tests the null hypothesis stating that for randomly selected values X and Y from two populations, the probability of X being greater than Y is equal to the probability of Y being greater than X.
P( X > Y ) = P( Y > X )
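A minimal sketch using scipy (the two samples are made up for illustration; since every value in Y exceeds every value in X, the null hypothesis is clearly violated):

```python
from scipy import stats

# Mann–Whitney U test on two independent samples; no normality assumed.
group_x = [310, 295, 342, 328, 301, 336]
group_y = [405, 388, 422, 397, 431, 410]

u_stat, p_value = stats.mannwhitneyu(group_x, group_y, alternative="two-sided")
print(u_stat)           # 0.0: no pair with x > y
print(p_value < 0.05)   # completely separated groups -> significant
```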

30
Q

Explain the Train-Test paradigm.

A

● Training Dataset: The sample of data used to fit the model.
● Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
● Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
● Both training and test data are created (from the dataset/sample that we started with) by sophisticated partitioning of the data

31
Q

Explain randomisation tests.

A

● Also called permutation tests or exact tests
● They compute exact p-values by enumerating all of the possible outcomes that could occur in some reference set besides the outcome that was actually obtained.
● Used only on small data sets

32
Q

Which types of regression exist?

A

Simple regression
Multiple regression (with an interaction term)
Logistic regression

33
Q

Explain simple regression.

A

○ Simple linear regression is used to estimate the relationship between two quantitative variables.
○ Yi = ß0 + ß1 * Xi + ei
○ Interpretation of the formula: if we increase variable X by 1 unit, then variable Y increases by ß1
○ For numeric dependent variable Y
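A minimal sketch of the card's formula on simulated data (the true coefficients ß0 = 2 and ß1 = 3 are my own choice): the fit recovers them from noisy observations.

```python
import numpy as np

# Simple linear regression: estimate ß0 and ß1 in Yi = ß0 + ß1*Xi + ei.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=x.size)  # ß0=2, ß1=3 + noise

b1, b0 = np.polyfit(x, y, deg=1)  # polyfit returns highest degree first
print(b0, b1)  # estimates roughly ß0 ≈ 2 and ß1 ≈ 3

# Interpretation: increasing X by 1 unit increases the predicted Y by b1.
```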

34
Q

Explain multiple regression and what happens with an interactive term.

A

○ Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable Y
○ Yi = ß0 +ß1 * X1i + ß2 * X2i + ei
○ Interpretation: If we increase variable X1 by 1 unit and control for X2, then variable Y increases by ß1
○ With interaction: Yi = ß0 +ß1 * X1 + ß2 * X2 + ß3 * X1 * X2 + ei
○ Without interaction term, the effect of X1 on Y is measured by ß1, with the interaction term, the effect of X1 and X1 * X2 on Y is measured by ß1 and ß3
○ If ß3 is significant, you may keep the interaction term in your model.
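A minimal sketch of the card's interaction model on simulated data (the true coefficients ß0 = 1, ß1 = 2, ß2 = −1, ß3 = 0.5 are my own choice), fitted by ordinary least squares on the design matrix [1, X1, X2, X1*X2]:

```python
import numpy as np

# Multiple regression with an interaction term:
# Yi = ß0 + ß1*X1 + ß2*X2 + ß3*X1*X2 + ei
rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(0, 5, n)
x2 = rng.uniform(0, 5, n)
y = 1 + 2 * x1 - 1 * x2 + 0.5 * x1 * x2 + rng.normal(0, 0.3, n)

# Design matrix with an explicit interaction column X1*X2.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)  # estimates close to the true [1, 2, -1, 0.5]
```

With the interaction column included, the effect of X1 on Y depends on the level of X2 (it is ß1 + ß3*X2 rather than a constant ß1).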

35
Q

Explain logistic regression

A

○ Formula: logit( P(Yi = 1) ) = ß0 + ß1 * X1i + … + ßk * Xki, where logit(p) = ln( p / (1 – p) )

○ For a BINARY dependent variable Y

36
Q

What is selection bias?

A

● Occurs when individuals or groups in a study differ systematically from the population of interest leading to a systematic error in an association or outcome.
● Selection bias can arise in studies because groups of participants may differ in ways other than the interventions or exposures under investigation. When this is the case, the results of the study are biased by confounding.

37
Q

Give examples of selection bias and explain them.

A

○ Sampling bias
■ Sampling bias is a type of selection bias and involves systematic error due to a non-random sample of a population (convenience sampling, snowball sampling, quota sampling)
○ Performing poor cross-validation
■ Naive split of the data set into 70% training data and 30% test data without randomising the (order of the) data first.
○ Publication bias
■ Occurs when only significant results are published in journals. If one tries to combine all these results in a follow-up study (for example a meta-analysis that combines the results of dozens of studies), one finds cases that are not representative.
○ Attrition bias
■ Caused by attrition (loss of participants): discounting trial subjects/tests that did not run to completion. It may lead to all kinds of missing values that may distort the quality of the research.
■ Lost to follow-up
● A form of attrition bias, mainly occurring in medical studies over a lengthy time period. Non-response or retention bias can be influenced by a number of factors, such as wealth, education, altruism, and initial understanding of the study and its requirements.