Module 2: Chapter 1 - Tools & Techniques Flashcards

1
Q

What are the four advantages of using machine learning techniques over traditional econometric methods?

A

1) Handling Big Data
2) Handling non-linearity
3) Reducing dimensionality
4) Handling missing data

2
Q

What is the most crucial difference between classic econometric models and machine learning models in their approach to understanding data?

A

Standard econometric models assume that the data-generating process can be approximated through a model and parameters are tested for statistical significance.
ML techniques make no assumptions about the data-generating process.

3
Q

What is regularization in the context of ML techniques?

A

With regularization, the number of parameters is kept low by introducing a penalty when parameters are included that do not significantly improve the predictive power of the ML algorithm.

4
Q

What is the crucial philosophy behind ML techniques?

A

Model selection and fine-tuning of parameters to produce reliable predictions out-of-sample

5
Q

What do ML tests focus on?

A

1) Out-of-sample prediction accuracy
2) Bias-variance trade-off

6
Q

What is the bias-variance trade-off?

A

The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data, but also generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities (i.e. underfit) in the data.
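
A minimal sketch of the trade-off, assuming NumPy and scikit-learn are available: a degree-1 polynomial (high bias) underfits a noisy sine curve, while a degree-15 polynomial (high variance) fits the training points closely but predicts poorly out-of-sample. The simulated data and polynomial degrees are illustrative choices, not from the source.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 40)   # noisy "true" relationship
X_tr, y_tr, X_te, y_te = X[:30], y[:30], X[30:], y[30:]    # simple hold-out split

for degree in (1, 15):   # degree 1: high bias (underfit); degree 15: high variance (overfit)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          round(mean_squared_error(y_tr, model.predict(X_tr)), 3),   # training error
          round(mean_squared_error(y_te, model.predict(X_te)), 3))   # test error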

7
Q

Translate these statistical terms into ML parlance
- Data point
- Dependent variable
- Independent variable
- Estimation
- Estimator

A

ML parlance:
- Example, instance
- Output, outcome, label, target, response var., ground truth
- Feature, signal, input, attribute
- Training, learning
- Learner (classifier), algorithm

8
Q

What are the four machine-learning methodologies?

A

1) Unsupervised learning
2) Supervised learning
3) Semi-supervised learning
4) Reinforcement learning

9
Q

Provide an overview of the logic behind supervised learning and the data structures used

A

Supervised learning is about prediction problems (e.g. the oil price next week) or classification problems (e.g. identifying an animal in a picture).

For the estimation of the model we use a vector of attributes and associated output/labels.

Can be used with time-series or cross-sectional data.

Example: credit risk

10
Q

Provide an overview of the logic behind semi-supervised learning and the data structures used

A

The goal of semi-supervised learning is the same as for supervised learning (prediction/classification).

The difference from supervised learning is that only part of the training data is labeled. The unlabeled part is used to:
- find patterns in explanatory variables
- fit unlabeled data with pseudo-labels based on the algorithm trained on the labeled data

11
Q

Provide an overview of the logic behind unsupervised learning and the data structures used

A

The goal of unsupervised learning is to recognize patterns in data. We have a vector of attributes but no output/classification to predict. Hence, prediction is not a goal of unsupervised learning.

Problems to solve:
- Clustering of data
- Find small number of factors that explain the data

Goals:
- Characterize datasets
- Learn data structures

Use cases:
- Anomaly detection
- Grouping of data (e.g. stocks based on some features)
- Dimensionality reduction

12
Q

Provide an overview of the logic behind reinforcement learning and the data structures used

A

The goal is to make a series of decisions to achieve an objective. The environment can be static or dynamic (changing rules, adapting opponents, etc.). The final result is a recommended action (not a prediction, classification, or cluster).

Feedback is provided through “rewards” (or sanctions if negative) that indicate that a certain type of behavior is desired but no specific instructions are provided (trial-and-error approach).

13
Q

Provide an overview of the logic behind parametric vs. non-parametric methods

A

Parametric methods: Modeler makes an assumption about the functional relationship between features and outputs/labels (mapping)

Non-parametric methods: no assumption on functional relationship between features and outputs/labels. Problem: large data sets are needed

14
Q

What is the purpose of EDA?

A

EDA = Exploratory Data Analysis
Collecting, cleaning, visualizing, and analyzing data

Collecting = data gathering
Cleaning = correct errors, remove duplicates, etc.
Visualization = Explore relationships between attributes and labels/outcomes

15
Q

Explain the differences between:
1) Structured, semi-structured, and unstructured data
2) Numerical and Categorical data
3) Longitudinal and cross-sectional data
4) Textual and other data

A

Structured data = tabular form
Unstructured data = data that is not structured according to a preset data model
Semi-structured data = Mixture of structured/unstructured data (e.g. emails, CSV files, HTML files)

Numerical data = continuous/discrete data
Categorical data = nominal/ordinal data

Longitudinal data = time-series data (repeated measurements) of the same entity over time. Can also be based on geography or network relations
Cross-sectional data = snapshot (single point-in-time) of independent entities

Textual data = data in the form of written words (e.g. news articles, reports), typically unstructured
Other data = audio, video, photographs, maps, drawings, etc.

16
Q

What is the difference between interval and ratio data?

A

Interval data = clear ordering with consistent and meaningful intervals but no true zero point

Ratio data = interval data with true zero point

Zero point = absence of a value

17
Q

Which five factors necessitate data cleaning?

A

1) Inconsistent recording
(recordings must be done in the same way)
2) Unwanted observations
(observations collected that should not be in our sample)
3) Duplicate observations
4) Outlier (treatment)
(for some analysis purposes, we want to keep outliers, e.g. fraud detection)
5) Missing data
(can be structurally caused, i.e. impossible to get certain data or recording errors. Important to understand the reason why data is missing. Informative missingness exists when missingness is not random and can lead to bias. Imputation or estimation can be used to fill in missing data)
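
A minimal cleaning/imputation sketch with pandas; the column names and the imputation choices (median for numeric, mode for categorical) are illustrative assumptions, and whether imputation is appropriate at all depends on why the data are missing.

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000, np.nan, 72_000, 61_000],
                   "region": ["N", "S", np.nan, "S"]})

df = df.drop_duplicates()                                    # remove duplicate observations
df["income"] = df["income"].fillna(df["income"].median())    # numeric: impute the median
df["region"] = df["region"].fillna(df["region"].mode()[0])   # categorical: impute the mode
print(df)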

18
Q

When is a distribution of data not symmetric?

A

When the probability of data occurring is not equal to the left and right of the mean, the data sample is skewed, with a longer tail on one side.
Negatively skewed = left tail is longer; positively skewed = right tail is longer
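
A quick numerical check of skewness, assuming SciPy; the simulated sample is illustrative.

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(0.0, 1.0, 10_000)   # right-skewed sample
print(skew(x))    # > 0: positive skew (longer right tail)
print(skew(-x))   # < 0: negative skew (longer left tail)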

19
Q

When are Quantile-Quantile (QQ) Plots used?

A

QQ plots are used to verify that data follow an assumed distribution. The sample quantiles (y-axis) are plotted against the theoretical quantiles (x-axis). They are typically used to check whether data are normally distributed.
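
A minimal QQ-plot sketch, assuming SciPy and matplotlib; the fat-tailed Student-t sample is an illustrative choice and will visibly deviate from the normal reference line in the tails.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
returns = rng.standard_t(df=3, size=1_000)       # fat-tailed sample

stats.probplot(returns, dist="norm", plot=plt)   # sample quantiles vs. theoretical normal quantiles
plt.title("QQ plot against the normal distribution")
plt.show()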

20
Q

What is a fat tail distribution?

A

A distribution whose tails contain more probability mass than the tails of a normal distribution: extreme (abnormal) events are rare, but far less improbable than the normal distribution would suggest.

21
Q

Which visualization diagram is typically used to detect outliers?

A

Box-and-Whiskers Plot
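
A minimal box-and-whiskers sketch with matplotlib; the injected outliers are illustrative.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), [6.0, -5.5]])   # two injected outliers

plt.boxplot(x)   # points beyond the whiskers are flagged as potential outliers
plt.show()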

22
Q

What visualization diagram is typically used to identify relationships between attributes?

A

Scatterplots

23
Q

What is Feature Extraction / Feature Engineering?

A

Translate features collected in the dataset to features used for analysis.

Qualitative data has to be transformed to numerical information (encoding). E.g. categorical variables used for qualitative information (e.g. geographic region).

Categorical features are represented with dummy variables (binary) because the differences between encoded values (e.g. 1 = single, 2 = married) have no meaningful interpretation when there is no natural ordering.

24
Q

What is the process of creating dummy variables from qualitative data called?

A

One-hot encoding / binarization
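
A minimal one-hot encoding sketch with pandas; the "region" column is a hypothetical example.

import pandas as pd

df = pd.DataFrame({"region": ["Europe", "Asia", "Americas", "Asia"]})
dummies = pd.get_dummies(df["region"], prefix="region")   # one binary column per category
print(dummies)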

25
Q

What is the Dummy Variable Trap?

A

Regression models fail because of perfect multicollinearity: when a full set of dummy variables is included together with the constant term, the dummies sum to one and become perfectly correlated with the constant.

Consequence: inaccurate calculations of regression coefficients and standard errors

26
Q

What can be done against the Dummy Variable Trap?

A

1) Omitting one dummy variable
2) Setting the constant term (bias) of the equation to zero
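
A sketch of option 1 with pandas, using a hypothetical "status" column: dropping one dummy leaves a reference category and avoids perfect multicollinearity with the constant.

import pandas as pd

df = pd.DataFrame({"status": ["single", "married", "married", "single"]})
dummies = pd.get_dummies(df["status"], prefix="status", drop_first=True)   # omit one dummy
print(dummies)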

27
Q

If there is a natural ordering in a categorical variable, how is it categorized?

A

It is an ordinal variable

28
Q

Why is data scaling needed?

A

Many machine-learning approaches require all the variables to be measured on the same scale; otherwise, the technique will not be able to determine the parameters appropriately and the results will be dominated by the feature with the largest magnitude

29
Q

What two data scaling techniques are used?

A

1) Standardization
Effect: mean of 0 and standard deviation of 1
2) Normalization (min-max transformation)
Effect: variable is bounded between 0 and 1

Standardization is preferred over normalization in the presence of outliers, because normalization would compress much of the distance between the remaining data points.
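
A minimal sketch of both techniques with scikit-learn; the tiny dataset with one outlier is illustrative and shows how normalization squeezes the non-outlying points together.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])       # note the outlier

print(StandardScaler().fit_transform(X).ravel())   # standardization: mean 0, std 1
print(MinMaxScaler().fit_transform(X).ravel())     # normalization: bounded in [0, 1]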

30
Q

What are the three reasons for data scaling?

A

(1) Numerical stability of the learning algorithm. The difference in the scale of the variables could cause overflow/underflow easily

(2) Ease of interpretation of the model parameter estimates. If the scales of the variables differ significantly, so do those of the parameter estimates and it would make evaluation of the importance of each estimate difficult

(3) Determining whether the out-of-sample prediction is within the range of the training data. If the out-of-sample data corresponds to a normalized value greater than 1 or smaller than 0, that means the prediction is an extrapolation that may correspond to a large prediction error.

Underflow: Underflow is a type of rounding error that can be extremely damaging. It occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is zero instead of a small positive number.

Overflow: Overflow is another highly harmful type of numerical error. It occurs when numbers of very large magnitude are approximated as -infinity or +infinity. Further arithmetic frequently turns these infinite values into NaN values.
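
A small NumPy illustration of both effects (the specific values are arbitrary):

import numpy as np

print(np.exp(-1000.0))                   # underflow: rounds to 0.0
print(np.exp(1000.0))                    # overflow: inf (NumPy emits a RuntimeWarning)
print(np.exp(1000.0) - np.exp(1000.0))   # further arithmetic on inf produces nan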

31
Q

What is the rule of thumb to determine if a dataset is highly skewed?

A

If the ratio of the highest value to the lowest value is larger than 10 (in absolute value)

32
Q

What is a common practice to transform highly skewed data? What are the alternatives?

A

Transforming via the natural logarithm

Alternatives:
- Square root
- Inverse transformations
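
A minimal sketch of the transformations with NumPy; the simulated lognormal sample is illustrative, and the natural log requires strictly positive values.

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(0.0, 1.0, 10_000)   # highly right-skewed, strictly positive

x_log = np.log(x)     # natural logarithm (most common choice)
x_sqrt = np.sqrt(x)   # alternative: square root
x_inv = 1.0 / x       # alternative: inverse transformation
print(x.max() / x.min())   # ratio far above 10: highly skewed before transformation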

33
Q

What is the most commonly used dimensionality reduction technique?

A

Principal Component Analysis

34
Q

What is the logic behind PCA?

A

Find linear combinations of the original predictors that summarize most of the variability of the data.

Linear combinations = Principal Components

PC1 = linear combination of the original predictors that captures the most variability in the data among all possible linear combinations

PC2 = linear combination of the predictors that captures the most variability that is not already explained by the first PC

This second PC is constructed to be orthogonal to the first PC, that is, the two PCs are uncorrelated
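
A minimal PCA sketch with scikit-learn on standardized, correlated predictors (the simulated data are illustrative): the explained-variance ratios show how much variability each PC captures, and the PC scores are uncorrelated.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
X = np.column_stack([z[:, 0], 0.9 * z[:, 0] + 0.1 * z[:, 1], z[:, 1]])   # correlated predictors

pca = PCA()
scores = pca.fit_transform(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_)                   # variability captured by each PC
print(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])   # ~0: the PCs are uncorrelated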

35
Q

When is PCA most useful?

A

With highly correlated data and data experiencing multicollinearity.

Improves numerical stability of models that require low correlation among the predictors

36
Q

Why is it important to scale data before using PCA?

A

PCA seeks to find linear combinations to explain the variability in the predictors without any understanding of the predictors’ measurement scale or their distribution (e.g., if they are skewed)

PCA will first summarize the predictors with the most variation, that is, those measured on the scale with the largest magnitude

37
Q

How can we find an appropriate number of Principal Components?

A

Plot the variance explained per PC and check where we see a drop (elbow)
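
A minimal elbow-plot sketch with scikit-learn and matplotlib; the data are simulated from two underlying factors, so the explained variance should drop sharply after the second PC.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))                                        # two true underlying factors
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(300, 10))  # ten observed predictors

ratios = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_
plt.plot(range(1, len(ratios) + 1), ratios, marker="o")   # look for the elbow
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.show()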

38
Q

Why should we train the model / algorithm only on a part of the available data?

A

The analyst wants the model to be able to generalize well to data that have not been used to estimate it

This leads to a distinction between in-sample (model estimation) and out-of-sample (sometimes known as a hold-out sample) parts of the data.

39
Q

Why is it especially in the case of ML important to have a hold-out sample?

A

There is typically little economic or financial intuition behind the modeling assumptions and the risk of choosing a complex model that accurately fits the dataset at hand but does not generalize well to unseen data is high

40
Q

What are the two common problems when a model is trained on a dataset?

A

Overfitting
Estimation of a model that is too complex and captures the noise in the dataset at hand rather than the true nature of relationships between the features and the output(s)

Underfitting
Significant patterns in data are not captured by the model

41
Q

Once a dataset is ready to be used for model estimation, in which parts does the data have to be divided?

A

(1) Training set
–> estimate model parameters

(2) Validation set
–> select between competing models

(3) Test set
–> determine the final chosen model’s effectiveness

42
Q

What is the rule-of-thumb for splitting portions of data?

A

2/3 for training
1/6 for validation
1/6 for testing
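
A minimal sketch of the 2/3 - 1/6 - 1/6 split using scikit-learn's train_test_split; the dataset is a placeholder.

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(600).reshape(-1, 1), np.arange(600)

# First hold out 1/3, then split that hold-out half-and-half into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=1/3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 400, 100, 100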

43
Q

Why should the training sample not be too small?

A

Introduction of bias in the parameter estimation

44
Q

Why should the validation sample not be too small?

A

Model evaluation can become too inaccurate so that it is hard to identify the best specification

45
Q

Which method to use to split cross-sectional data?

A

Randomly assign observations to the training, validation, and test sets according to the chosen proportions

46
Q

Which method to use to split time-series data?

A

Training on older data, testing and validation on more recent data
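
A minimal chronological split sketch, applying the 2/3 - 1/6 - 1/6 rule of thumb to an ordered series; the data are a placeholder.

import numpy as np

series = np.arange(1_000)                      # chronologically ordered observations
n = len(series)
train = series[: int(2 / 3 * n)]               # oldest 2/3 for training
val = series[int(2 / 3 * n): int(5 / 6 * n)]   # next 1/6 for validation
test = series[int(5 / 6 * n):]                 # most recent 1/6 for testing
print(len(train), len(val), len(test))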

47
Q

What is cross-validation? Which technique is used?

A

Cross-validation involves combining the training and validation data into a single sample, with only the test data held back. Then the combined data are split into equally sized sub-samples, with the estimation being performed repeatedly and one of the sub-samples left out each time

k-fold cross validation technique

The technique splits the combined training and validation data (of size n) into k equally sized sub-samples, with the test data excluded from the combined sample
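
A minimal k-fold sketch with scikit-learn; the ridge regression, k = 5, and simulated data are illustrative choices.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.normal(0, 0.1, 120)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # k folds, one held out in each round
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="neg_mean_squared_error")
print(scores.mean())                                   # average out-of-fold performance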

48
Q

What are the most used Python packages for ML estimation?

A

NumPy, SciPy, pandas, scikit-learn, TensorFlow, and Keras

49
Q

What is a popular package for ML estimation in R?

A

A popular package for building and evaluating machine learning models in R is caret (which is short for Classification And REgression Training)

50
Q

For each of the following terms used in classical statistics, provide the equivalent term in machine learning parlance:

Intercept
Slope
Explanatory variable
Dependent variable
In-sample period
Out-of-sample period

A

Intercept – bias
Slope – weight
Explanatory variable – feature
Dependent variable – output or label
In-sample period – training data
Out-of-sample period – test data

51
Q

For what kinds of problems would machine learning likely be more suitable than conventional econometric modeling?

A

Machine-learning techniques have advantages when applied to problems where there is little theory regarding the nature of a relationship or which features are relevant. It is used when the number of data points and the number of features are large (big data or wide data, as opposed to tall data where the number of predictors is strictly smaller than the number of observations). Machine learning might also be preferable when the relationships between features (and targets) are nonlinear.