DAT Techniques Flashcards
How do SQL, R and Python handle missing values?
SQL - NULL
R - NA
Python - None or NaN
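A minimal sketch of spotting both Python forms using only the standard library. The helper name `is_missing` and the sample list are illustrative assumptions, not part of any library.

```python
import math

# Hypothetical sample containing both kinds of missing value.
values = [3.0, None, float("nan"), 7.5]

def is_missing(x):
    """Return True for None or a float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))

clean = [x for x in values if not is_missing(x)]
print(clean)  # [3.0, 7.5]

# NaN is not equal to itself, so == cannot be used to detect it.
nan = float("nan")
print(nan == nan)  # False
```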
What are the different types of Distributions?
Normal (aka Gaussian) - Follows a symmetrical bell curve, mean=median=mode.
Left/negative skew - Mean is less than the median, tail is in the negative direction.
Right/positive skew - Mean is greater than the median, tail is in the positive direction.
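A quick check of the mean/median rule on made-up data (the two lists below are illustrative assumptions). One large value drags the mean towards the tail while the median barely moves.

```python
import statistics

# Hypothetical data: one large value creates a long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Mirror image of the same data: long left tail.
left_skewed = [-x for x in right_skewed]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(name, "mean:", mean, "median:", median)
# right mean: 4.75 median: 3.0
# left mean: -4.75 median: -3.0
```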
Define Selection Bias
When gathering data, the sample produced is not representative of the population.
Define Reporting Bias
Certain occurrences are under-represented due to respondents not sharing certain info.
Define Sampling Bias
A subset of selection bias, where the population sample is not random.
What is Pearson’s Correlation Coefficient?
A statistic that measures the strength and direction of the linear relationship between two continuous variables. Ranges from -1 to 1.
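A minimal sketch of the calculation, written out by hand rather than taken from a library. The function name `pearson_r` and the sample data are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Covariance of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # perfect positive line, close to +1
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # perfect negative line, close to -1
```

The sign gives the direction and the absolute value gives the magnitude.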
What is Spearman’s Rank Correlation Coefficient?
A test statistic that measures how well a relationship can be described using a monotonic function (can handle non-linear, as long as it’s solely increasing/decreasing).
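One way to see this: Spearman's coefficient is Pearson's coefficient applied to the ranks of the data. The sketch below assumes there are no tied values (ties would need averaged ranks); the function names and data are illustrative.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank from 1 (smallest) upwards. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(xs, ys):
    return pearson_r(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]  # monotonic but not linear

rho = spearman_rho(xs, ys)  # close to 1: the ranks agree exactly
r = pearson_r(xs, ys)       # below 1: the relationship is not a straight line
```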
How does setting a Statistical Significance Level work?
A significance level, α (commonly 0.05), is set before testing. The p-value is the probability that the data would be at least as extreme as that observed, given that H0 is correct. If the p-value falls below the significance level, you can reject H0.
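A worked sketch using an exact one-sided binomial test. The scenario (9 heads in 10 flips, H0: the coin is fair) and the threshold of 0.05 are assumptions chosen for illustration.

```python
from math import comb

def binomial_p_value(heads, flips, p_heads=0.5):
    """One-sided p-value: probability of at least `heads` heads under H0."""
    return sum(
        comb(flips, k) * p_heads ** k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

threshold = 0.05                     # significance level, chosen before looking at the data
p_value = binomial_p_value(9, 10)    # (10 + 1) / 1024, roughly 0.0107
reject_h0 = p_value < threshold
print(round(p_value, 4), reject_h0)  # 0.0107 True
```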
Explain the difference between Permutations and Combinations
Order matters in permutations. For combinations, the same elements ordered differently aren’t counted separately.
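A small illustration with the standard library, choosing 2 items from 3 (the item labels are arbitrary).

```python
from itertools import combinations, permutations
from math import comb, perm

items = ["A", "B", "C"]

# Permutations: ("A", "B") and ("B", "A") are counted separately.
perms = list(permutations(items, 2))
# Combinations: only ("A", "B") appears.
combs = list(combinations(items, 2))

print(len(perms), perm(3, 2))  # 6 6
print(len(combs), comb(3, 2))  # 3 3
```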
What is Linear Regression?
Using a function to estimate a continuous value output. The input can be continuous or discrete (ideally limited options or binary).
The model is usually fitted by Ordinary Least Squares (OLS), which minimises the sum of squared residuals.
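A minimal sketch for the one-input case, using the closed-form least-squares solution. The function name `fit_line` and the four data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 8]
slope, intercept = fit_line(xs, ys)  # about 2.3 and 0.8

# Estimate a continuous output for a new input.
estimate = slope * 4 + intercept     # about 10.0
```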
What are Residuals?
The differences between the observed values and the model’s estimates, which approximate the error terms. In linear regression, they’re assumed to be normally distributed with mean zero and constant variance.
Under this assumption, the larger a residual’s magnitude, the less likely it is.
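A short sketch of computing residuals. The data and the line y = 2.3x + 0.8 are assumed for illustration (the line is the least-squares fit to these four points).

```python
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 8]

# Assumed, already-fitted model: y = 2.3 * x + 0.8
slope, intercept = 2.3, 0.8

predictions = [slope * x + intercept for x in xs]
# Residual = observed value minus estimated value.
residuals = [y - y_hat for y, y_hat in zip(ys, predictions)]
# Roughly [0.2, -0.1, -0.4, 0.3]

# For a least-squares fit with an intercept, the residuals sum to about zero.
total = sum(residuals)
```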
What is Hold-Out Data?
Omitting a set of data when building a model, so that it can later be used to test the model.
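A minimal sketch of a random hold-out split. The function name `hold_out_split`, the 20% test fraction, and the fixed seed are assumptions for illustration.

```python
import random

def hold_out_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then set aside a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for real records
train, test = hold_out_split(rows)
print(len(train), len(test))  # 8 2
```

The model is built on `train` only; `test` is touched once, at evaluation time.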
What is N-Fold Cross Validation?
A method for when there isn’t enough data to hold out.
- Split dataset into n separate subsets.
- Train model against n-1 subsets. Test against the one left out.
- Do this for each subset and calculate the mean performance.
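The steps above can be sketched as follows. To keep it short, the "model" here just predicts the mean of the training targets and is scored by mean absolute error; both choices, along with the function names and data, are assumptions for illustration.

```python
def n_fold_indices(n_rows, n_folds):
    """Split row indices 0..n_rows-1 into n_folds disjoint subsets."""
    return [list(range(i, n_rows, n_folds)) for i in range(n_folds)]

def cross_validate(ys, n_folds):
    """Mean score across folds for a toy predict-the-mean model."""
    scores = []
    for held_out in n_fold_indices(len(ys), n_folds):
        held = set(held_out)
        # Train against the n-1 subsets that are not held out.
        train = [y for i, y in enumerate(ys) if i not in held]
        prediction = sum(train) / len(train)
        # Test against the one subset left out.
        test = [ys[i] for i in held_out]
        scores.append(sum(abs(y - prediction) for y in test) / len(test))
    # Average performance over all folds.
    return sum(scores) / len(scores)

ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
mean_error = cross_validate(ys, n_folds=3)
```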
What is R²?
A statistical measure, typically from 0 to 1, of how closely the data fit the regression line: the proportion of the variance in the output that the model explains. AKA the Coefficient of Determination.
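A minimal sketch of the usual formula, R² = 1 - SS_res / SS_tot. The function name `r_squared` and the sample values are assumptions for illustration.

```python
def r_squared(ys, predictions):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

ys = [1, 3, 5, 8]
predictions = [0.8, 3.1, 5.4, 7.7]  # assumed output of a fitted line

r2 = r_squared(ys, predictions)     # 1 - 0.30 / 26.75, roughly 0.989

# Reference points: perfect predictions give 1, always predicting the mean gives 0.
perfect = r_squared(ys, ys)
baseline = r_squared(ys, [4.25] * 4)
```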