DAT Techniques Flashcards

1
Q

How do SQL, R and Python handle missing values?

A

SQL - NULL
R - NA
Python - None or NaN
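
A minimal sketch of the Python case, using only the standard library. `None` is an object checked with `is`, while NaN is a float that never equals itself, so it needs `math.isnan`. The `readings` list and the `is_missing` helper are made up for illustration.

```python
import math

# Hypothetical readings; None and NaN both stand in for missing values.
readings = [1.5, None, float("nan"), 4.0]

def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

# Keep only the values that are actually present.
present = [v for v in readings if not is_missing(v)]

# NaN is not equal to itself, so an equality check cannot detect it.
nan_equals_itself = float("nan") == float("nan")
```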

2
Q

What are the different types of Distributions?

A

Normal (aka Gaussian) - Follows a symmetrical bell curve, mean=median=mode.

Left/negative skew - Mean is typically less than the median, tail is in the negative direction.

Right/positive skew - Mean is typically greater than the median, tail is in the positive direction.
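
A small sketch of the mean/median relationship under skew, using the standard `statistics` module. The two data sets are invented for illustration: one has a few large values (right tail), the other is its mirror image (left tail).

```python
import statistics

# Made-up data with a long tail to the right (one large value).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Mirror image: long tail to the left.
left_skewed = [-x for x in right_skewed]

right_mean = statistics.mean(right_skewed)
right_median = statistics.median(right_skewed)
left_mean = statistics.mean(left_skewed)
left_median = statistics.median(left_skewed)

# The tail drags the mean further than the median.
mean_above_median_when_right_skewed = right_mean > right_median
mean_below_median_when_left_skewed = left_mean < left_median
```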

3
Q

Define Selection Bias

A

The way the data is gathered produces a sample that is not representative of the population.

4
Q

Define Reporting Bias

A

Certain occurrences are under-represented due to respondents not sharing certain info.

5
Q

Define Sampling Bias

A

A subset of selection bias, where the sample is not drawn at random from the population.

6
Q

What is Pearson’s Correlation Coefficient?

A

A statistic that measures the linear relationship between two continuous variables. Ranges from -1 to 1, giving magnitude and direction.
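
A minimal sketch of the calculation in plain Python: the covariance of the two variables divided by the product of their spreads. The function name `pearson_r` and the sample data are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations (numerator) and sums of squared deviations.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)

# Made-up data: y rises exactly in line with x, then falls exactly in line.
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

The sign gives the direction and the absolute value gives the strength of the linear relationship.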

7
Q

What is Spearman’s Rank Correlation Coefficient?

A

A statistic that measures how well a relationship can be described using a monotonic function (it can handle non-linear relationships, as long as they are solely increasing or solely decreasing).
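
One way to sketch this is to replace each value with its rank and then compute an ordinary linear correlation on the ranks. The helper names below are hypothetical, and the ranking assumes there are no tied values (ties would need averaged ranks).

```python
import math

def linear_correlation(xs, ys):
    """Pearson-style correlation, used here on raw values and on ranks."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)

def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's coefficient as the correlation of the ranks."""
    return linear_correlation(ranks(xs), ranks(ys))

# Made-up data: y = x cubed is non-linear but always increasing.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
rho = spearman_rho(xs, ys)            # monotonic, so the ranks agree perfectly
linear = linear_correlation(xs, ys)   # lower, since the curve is not a line
```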

8
Q

How does setting a Statistical Significance Level work?

A

A significance level, α, is set before running the test (commonly 0.05). The p-value is the probability that the data would be at least as extreme as the observed, given that H0 is correct. If the p-value falls below the significance level, you can reject H0.
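
A worked sketch with a coin: H0 is that the coin is fair, and the p-value is the probability, under H0, of a head count at least as far from half as the one observed. The 0.05 level, the function name and the flip counts are assumptions chosen for illustration.

```python
from math import comb

ALPHA = 0.05  # significance level, fixed before looking at the data (assumed)

def two_sided_p_value(heads, flips):
    """Exact p-value under H0: the coin is fair.

    Adds up the probability of every head count that is at least as far
    from flips / 2 as the observed count.
    """
    observed_distance = abs(heads - flips / 2)
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_distance
    )
    return extreme_outcomes / 2 ** flips

p_value = two_sided_p_value(heads=9, flips=10)
reject_h0 = p_value < ALPHA
```

With 9 heads in 10 flips the extreme outcomes are 0, 1, 9 and 10 heads, so the p-value is 22/1024 (about 0.021), which is below 0.05.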

9
Q

Explain the difference between Permutations and Combinations

A

Order matters in permutations. For combinations, the same elements ordered differently aren’t counted separately.
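
A short sketch with the standard library, choosing 2 letters from 3. The `letters` list is made up for illustration.

```python
from itertools import combinations, permutations
from math import comb, perm

letters = ["A", "B", "C"]

# Permutations: ("A", "B") and ("B", "A") are counted separately.
ordered_pairs = list(permutations(letters, 2))

# Combinations: ("A", "B") appears once, whatever the order.
unordered_pairs = list(combinations(letters, 2))

# The counts match the closed-form formulas nPr and nCr.
counts_match = (
    len(ordered_pairs) == perm(3, 2) and len(unordered_pairs) == comb(3, 2)
)
```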

10
Q

What is Linear Regression?

A

Using a linear function to estimate a continuous value output. The input can be continuous or discrete (ideally limited options or binary).
The model is typically fitted by Ordinary Least Squares (OLS), which minimises the sum of squared residuals.
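
A minimal sketch of a least-squares fit for one input variable, using the closed-form slope and intercept. The function name `fit_line` and the data points are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-deviation of x and y over the squared deviation of x.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted_at_5 = slope * 5 + intercept
```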

11
Q

What are Residuals?

A

The differences between the observed data and the model's estimated outcomes, which serve as estimates of the error terms. In linear regression, the errors are assumed to be normally distributed with mean zero and constant variance.
Under this assumption, the larger a residual is in magnitude, the less likely it is.
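
A small sketch of the calculation: each residual is the observed value minus the value the model predicts. The line y = 2x + 1 and the observations are made up, and the line is assumed to have been fitted already.

```python
# Hypothetical observations and an already-fitted line y = 2x + 1.
xs = [1, 2, 3]
observed = [3.5, 4.0, 7.5]

def predict(x):
    """Assumed fitted model, for illustration only."""
    return 2 * x + 1

# Residual = observed value minus predicted value.
residuals = [y - predict(x) for x, y in zip(xs, observed)]
sum_of_squares = sum(r ** 2 for r in residuals)
```

A positive residual means the model under-predicted that point, a negative one means it over-predicted.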

12
Q

What is Hold-Out Data?

A

Omitting a set of data when building a model, so this can later be used to test the model.
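
A minimal sketch of a hold-out split using the standard `random` module. The function name, the 20% test fraction and the fixed seed are assumptions for illustration.

```python
import random

def hold_out_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train_rows, held_out_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Ten made-up rows: eight go to training, two are held out for testing.
train_rows, held_out_rows = hold_out_split(range(10))
```

The held-out rows are never shown to the model during training, so performance on them is a fairer estimate of how it behaves on new data.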

13
Q

What is N-Fold Cross Validation?

A

A method for when there isn’t enough data to hold out.

  1. Split dataset into n separate subsets.
  2. Train model against n-1 subsets. Test against the one left out.
  3. Do this for each subset and calculate the mean performance.
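
The steps above can be sketched in plain Python. The "model" here is a deliberately trivial stand-in that predicts the mean of its training values, scored by mean absolute error; both it and the function names are assumptions for illustration. In practice the rows would usually be shuffled before splitting.

```python
def n_fold_scores(rows, n, train_and_score):
    """Run n-fold cross validation and return (mean_score, per_fold_scores)."""
    # Step 1: split into n subsets (every n-th row goes to the same fold).
    folds = [rows[i::n] for i in range(n)]
    scores = []
    for i, test_fold in enumerate(folds):
        # Step 2: train on the other n-1 folds, test on the one left out.
        train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(train_and_score(train_rows, test_fold))
    # Step 3: average the performance across all folds.
    return sum(scores) / n, scores

def mean_model_error(train_rows, test_rows):
    """Toy model: predict the training mean, score by mean absolute error."""
    prediction = sum(train_rows) / len(train_rows)
    return sum(abs(y - prediction) for y in test_rows) / len(test_rows)

mean_score, fold_scores = n_fold_scores([1, 2, 3, 4, 5, 6], 3, mean_model_error)
```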
14
Q

What is R²?

A

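A statistical measure of how close the data is to the regression line: the proportion of the variance in the output that the model explains. It typically runs from 0 to 1, with 1 being a perfect fit. AKA the Coefficient of Determination.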
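
A minimal sketch of the usual formula, 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name and the sample values are assumptions for illustration.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    # Variation left over after the model's predictions.
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total variation of the observations around their mean.
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total

# Made-up observations and three sets of predictions.
observed = [3.5, 4.0, 7.5]
fit = r_squared(observed, [3.0, 5.0, 7.0])           # a reasonable line
perfect = r_squared(observed, observed)              # predictions match exactly
mean_only = r_squared(observed, [5.0, 5.0, 5.0])     # always predict the mean
```

Perfect predictions give 1, and always predicting the mean gives 0, which is why the measure reads as "share of variance explained".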
