DAT Techniques Flashcards

Question 1

Q

How do SQL, R and Python handle missing values?

Answer

A

SQL - NULL
R - NA
Python - None or NaN
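
A minimal sketch of the Python case, using only the standard library. The helper name `is_missing` and the sample list are made up for illustration.

```python
import math

missing_object = None          # generic "no value" marker
missing_float = float("nan")   # IEEE 754 "not a number", used in numeric data

# None is a singleton, so test for it with `is`.
print(missing_object is None)            # True

# NaN never compares equal to anything, including itself.
print(missing_float == missing_float)    # False
print(math.isnan(missing_float))         # True


def is_missing(value):
    """Return True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


readings = [1.5, None, float("nan"), 4.0]
clean = [r for r in readings if not is_missing(r)]
print(clean)  # [1.5, 4.0]
```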

Question 2

Q

What are the different types of Distributions?

Answer

A

Normal (aka Gaussian) - Follows a symmetrical bell curve, mean=median=mode.

Left/negative skew - Mean is less than the median, tail is in the negative direction.

Right/positive skew - Mean is greater than the median, tail is in the positive direction.
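
A small illustration of the mean/median relationship, using made-up toy lists. The relationship is a rule of thumb that holds for these examples, not a guarantee for every dataset.

```python
from statistics import mean, median

# Toy data (invented for illustration).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]    # long tail toward large values
left_skewed = [-14, 2, 3, 4, 5]    # long tail toward small values

for name, data in [
    ("symmetric", symmetric),
    ("right skew", right_skewed),
    ("left skew", left_skewed),
]:
    # A tail drags the mean toward it, while the median stays put.
    print(name, "mean:", mean(data), "median:", median(data))
```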

Question 3

Q

Define Selection Bias

Answer

A

Data is gathered in a way that produces a sample not representative of the population.

Question 4

Q

Define Reporting Bias

Answer

A

Certain occurrences are under-represented due to respondents not sharing certain info.

Question 5

Q

Define Sampling Bias

Answer

A

A type of selection bias, where the sample is not drawn at random from the population, so some members are more likely to be included than others.

Question 6

Q

What is Pearson’s Correlation Coefficient?

Answer

A

A statistic, from -1 to 1, that measures the linear relationship between two continuous variables. Gives magnitude and direction.
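
A sketch of the calculation in plain Python: the covariance of the two variables divided by the product of their spreads. The function name `pearson_r` and the data are illustrative assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations, and the sums of squared deviations for each variable.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(dev_x * dev_y)


hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]     # rises in a straight line with hours
score_down = [10, 8, 6, 4, 2]   # falls in a straight line with hours

print(pearson_r(hours, score_up))    # 1.0  (perfect positive)
print(pearson_r(hours, score_down))  # -1.0 (perfect negative)
```

The sign gives the direction and the absolute value gives the magnitude.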

Question 7

Q

What is Spearman’s Rank Correlation Coefficient?

Answer

A

A statistic that measures how well a relationship can be described using a monotonic function (it can handle non-linear relationships, as long as they are consistently increasing or consistently decreasing).
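
A sketch of the idea: replace each value with its rank, then compare the ranks. This version uses the shortcut formula that is valid only when there are no tied values; the helper names and data are illustrative assumptions.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no-ties case only."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


x = [1, 2, 3, 4, 5]
cubes = [1, 8, 27, 64, 125]    # non-linear, but always increasing
reversed_x = [5, 4, 3, 2, 1]   # always decreasing

print(spearman_rho(x, cubes))       # 1.0
print(spearman_rho(x, reversed_x))  # -1.0
```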

Question 8

Q

How does setting a Statistical Significance Level work?

Answer

A

A significance level, α, is set in advance (commonly 0.05). The p-value is the probability that the data would be at least as extreme as the observed data, given that H0 is correct. If the p-value falls below the significance level, you can reject H0.
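
A worked sketch with an invented scenario: H0 says a coin is fair, and 9 heads are observed in 10 flips. The threshold (called `alpha` here) of 0.05 and the one-sided test are assumptions chosen for illustration.

```python
from math import comb


def p_value_at_least(k, n, p_heads=0.5):
    """Probability of k or more heads in n flips if H0 (p_heads) is true."""
    return sum(
        comb(n, i) * p_heads ** i * (1 - p_heads) ** (n - i)
        for i in range(k, n + 1)
    )


alpha = 0.05                        # significance level, chosen before looking at data
p_value = p_value_at_least(9, 10)   # P(9 or 10 heads) = 11/1024
reject_h0 = p_value < alpha

print(round(p_value, 4))  # 0.0107
print(reject_h0)          # True
```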

Question 9

Q

Explain the difference between Permutations and Combinations

Answer

A

Order matters in permutations. For combinations, the same elements ordered differently aren’t counted separately.
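
A quick check with the standard library, choosing 2 items from 3 example letters:

```python
from itertools import combinations, permutations
from math import comb, perm

letters = ["A", "B", "C"]

# Permutations: ("A", "B") and ("B", "A") are counted separately.
ordered = list(permutations(letters, 2))
# Combinations: ("A", "B") appears once, whatever the order.
unordered = list(combinations(letters, 2))

print(ordered)
print(unordered)
print(len(ordered), perm(3, 2))    # 6 6
print(len(unordered), comb(3, 2))  # 3 3
```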

Question 10

Q

What is Linear Regression?

Answer

A

Using a linear function to estimate a continuous output value. The input can be continuous or discrete (ideally limited options or binary).
The model is typically fitted by Ordinary Least Squares (OLS), which minimises the sum of the squared residuals.
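
A minimal sketch of a least-squares fit with one input variable, using the closed-form slope and intercept. The function names and the toy data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Any other line, such as one with a slightly steeper slope, has a larger error.
best = sum_squared_error(xs, ys, slope, intercept)
other = sum_squared_error(xs, ys, slope + 0.1, intercept)
print(best < other)  # True
```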

Question 11

Q

What are Residuals?

Answer

A

The differences between the observed data and the outcomes estimated by the model, which serve as estimates of the error terms. In linear regression, they’re assumed to be normally distributed, with a mean of zero and constant variance.
Under this assumption, the larger the magnitude of a residual, the less likely it is.
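
A short sketch of the calculation. The data are invented, and the line y = 0.6x + 2.2 is taken as given (it is the least-squares line for these five points).

```python
xs = [1, 2, 3, 4, 5]
observed = [2, 4, 5, 4, 5]

# Assumed fitted line for this toy data.
slope, intercept = 0.6, 2.2
predicted = [slope * x + intercept for x in xs]

# Residual = observed value minus the model's estimate.
residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
print([round(r, 2) for r in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]

# For a least-squares line with an intercept, the residuals sum to (about) zero.
print(abs(sum(residuals)) < 1e-9)  # True
```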

Question 12

Q

What is Hold-Out Data?

Answer

A

A set of data omitted when building a model, so that it can later be used to test the model.
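
A sketch of a simple split. The function name `hold_out_split`, the 20% test fraction and the fixed seed are assumptions for illustration, not a standard.

```python
import random


def hold_out_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (training_rows, held_out_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_held_out = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


rows = list(range(10))
training, held_out = hold_out_split(rows)

print(len(training), len(held_out))  # 8 2
# The model is built on `training` only; `held_out` is kept for testing it.
```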

Question 13

Q

What is N-Fold Cross Validation?

Answer

A

A method for when there isn’t enough data to hold out.

Split dataset into n separate subsets.
Train model against n-1 subsets. Test against the one left out.
Do this for each subset and calculate the mean performance.
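
The steps above can be sketched as follows. The "model" here is a deliberately trivial stand-in (predict the mean of the training values) and the score is mean absolute error. Both are assumptions for illustration, as are the function names. Shuffling before splitting is omitted for brevity.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, test_indices), one pair per fold."""
    folds = [list(range(i, n_rows, n_folds)) for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


def cross_validate(values, n_folds):
    """Mean score across folds for a toy 'predict the training mean' model."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(values), n_folds):
        # "Train" on n-1 subsets.
        prediction = sum(values[j] for j in train_idx) / len(train_idx)
        # Test against the subset left out.
        errors = [abs(values[j] - prediction) for j in test_idx]
        scores.append(sum(errors) / len(errors))
    # Mean performance across all folds.
    return sum(scores) / len(scores)


values = [2, 4, 6, 8, 10, 12]
print(cross_validate(values, n_folds=3))
```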

Question 14

Q

What is R²?

Answer

A

A statistical measure, usually from 0 to 1, of how close the data is to the regression line: the proportion of the variance in the outcome that the model explains. AKA the Coefficient of Determination.
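
A sketch of the calculation as 1 minus the ratio of unexplained variation to total variation. The function name and the toy numbers are illustrative assumptions; the predictions come from an assumed fitted line y = 0.6x + 2.2 for x = 1..5.

```python
def r_squared(observed, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(observed) / len(observed)
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_residual / ss_total


observed = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

print(round(r_squared(observed, predicted), 2))  # 0.6

# Predictions that match the data exactly give an R² of 1.
print(r_squared(observed, observed))  # 1.0
```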

(14 cards)