Module 2: Chapter 1 - Tools & Techniques Flashcards
What are the four advantages of using machine learning techniques over traditional econometric methods?
1) Handling Big Data
2) Handling non-linearity
3) Reducing dimensionality
4) Handling missing data
What is the most crucial difference between classic econometric models and machine learning models in their approach to understanding data?
Standard econometric models assume that the data-generating process can be approximated through a model and parameters are tested for statistical significance.
ML techniques make no assumptions about the data-generating process.
What is regularization in the context of ML techniques?
With regularization, the number of parameters is kept low by introducing a penalty for including parameters that do not significantly improve the predictive power of an ML algorithm
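As a minimal numpy sketch of the idea (the data, the five-feature setup, and the penalty weight are all hypothetical), an L2 (ridge) penalty shrinks coefficients that add little predictive power:

```python
import numpy as np

# Hypothetical data: 5 features, but only the first truly drives the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

def ridge(X, y, lam):
    """Closed-form ridge regression: beta = (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)    # lam = 0: no penalty (plain OLS)
beta_reg = ridge(X, y, 10.0)   # lam > 0: coefficients are shrunk toward 0
```

With the penalty active, the overall size of the coefficient vector shrinks, and the uninformative features contribute less to the prediction.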
What is the crucial philosophy behind ML techniques?
Model selection and fine-tuning of parameters to produce reliable predictions out-of-sample
What are ML tests focusing on?
1) Out-of-sample prediction accuracy
2) Bias-variance trade-off
What is the bias-variance trade-off?
The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities in the data (i.e. underfit).
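A toy numpy illustration of the trade-off (simulated data; the polynomial degrees are arbitrary choices): a too-simple model underfits out-of-sample, while an overly flexible one chases the noise:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1, 1, 100))
y = np.sin(3 * x) + rng.normal(scale=0.2, size=100)
x_tr, y_tr = x[::2], y[::2]     # training half
x_te, y_te = x[1::2], y[1::2]   # held-out half

def oos_mse(degree):
    coef = np.polyfit(x_tr, y_tr, degree)   # fit on training data only
    return np.mean((np.polyval(coef, x_te) - y_te) ** 2)

mse_underfit = oos_mse(1)    # high bias: a straight line underfits sin(3x)
mse_balanced = oos_mse(5)    # flexible enough to capture the signal
mse_overfit = oos_mse(15)    # high variance: flexible enough to chase noise
```

The balanced model achieves the lowest out-of-sample error, which is exactly the quantity ML tests focus on.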
Translate these statistical terms into ML parlance
- Data point
- Dependent variable
- Independent variable
- Estimation
- Estimator
ML parlance:
- Example, instance
- Output, outcome, label, target, response var., ground truth
- Feature, signal, input, attribute
- Training, learning
- Learner (classifier), algorithm
What are the four machine-learning methodologies?
1) Unsupervised learning
2) Supervised learning
3) Semi-supervised learning
4) Reinforcement learning
Provide an overview behind the logic of supervised learning and used data structures
Supervised learning is about prediction problems (e.g. the oil price next week) or classification problems (e.g. identifying an animal in a picture)
For the estimation of the model we use a vector of attributes and associated output/labels.
Can be used with time-series or cross-sectional data.
Example: credit risk
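A hedged sketch of the credit-risk example in plain numpy — the features, coefficients, and data are invented for illustration. Logistic regression (fitted by gradient descent) maps a labeled attribute vector to a default probability:

```python
import numpy as np

# Hypothetical credit data: features = standardized income and a debt ratio,
# label = default (1) or no default (0), drawn from an assumed logistic model.
rng = np.random.default_rng(2)
n = 200
income_z = rng.normal(size=n)                 # standardized income
debt = rng.uniform(0, 1, n)                   # debt-to-income ratio
X = np.column_stack([np.ones(n), income_z, debt])
true_logit = -1.0 - income_z + 3.0 * debt     # assumed "true" relationship
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Training: logistic regression fitted by gradient descent on the log-loss.
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))              # predicted default probability
    w -= 0.1 * X.T @ (p - y) / n              # gradient step

accuracy = np.mean(((X @ w) > 0) == y)        # in-sample classification accuracy
```

The key supervised-learning ingredient is visible in the loop: every update compares the prediction `p` against the known label `y`.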
Provide an overview behind the logic of semi-supervised learning and used data structures
The goal of semi-supervised learning is the same as for supervised learning (prediction/classification).
The difference to supervised learning is that only a part of the training data is labeled. The unlabeled part is used to:
- find patterns in explanatory variables
- assign pseudo-labels to the unlabeled data based on the algorithm trained on the labeled data
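A minimal sketch of the pseudo-labeling loop, assuming a toy nearest-centroid classifier and simulated two-cluster data (only 10 of 100 points are labeled):

```python
import numpy as np

rng = np.random.default_rng(3)
# Two hypothetical clusters; only 10 of the 100 points carry labels.
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
labeled = np.concatenate([np.arange(5), np.arange(50, 55)])

def centroids(X, y, idx):
    # Class centroids computed from the rows in idx only.
    return np.array([X[idx][y[idx] == c].mean(axis=0) for c in (0, 1)])

# Step 1: train on the labeled part (nearest-centroid classifier).
c = centroids(X, y, labeled)
# Step 2: assign pseudo-labels to every point via the trained classifier.
dist = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
pseudo = dist.argmin(axis=1)
# Step 3: retrain on all data using the pseudo-labels.
c_final = centroids(X, pseudo, np.arange(len(X)))
accuracy = np.mean(pseudo == y)
```

The unlabeled points refine the centroid estimates even though their true labels were never observed.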
Provide an overview behind the logic of unsupervised learning and used data structures
The goal of unsupervised learning is to recognize patterns in data. We have a vector of attributes but no output/classification to predict. Hence, prediction is not a goal of unsupervised learning.
Problems to solve:
- Clustering of data
- Find small number of factors that explain the data
Goals:
- Characterize datasets
- Learn data structures
Use cases:
- Anomaly detection
- Grouping of data (e.g. stocks based on some features)
- Dimensionality reduction
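The grouping use case can be sketched with a minimal k-means (k = 2) on simulated, unlabeled data — note that no labels appear anywhere; the structure is inferred from the attribute vectors alone:

```python
import numpy as np

rng = np.random.default_rng(4)
# Unlabeled data only: two hypothetical groups of observations in feature space.
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(3.0, 0.5, (30, 2))])

# Minimal k-means (k = 2), initialized on the first and last data point.
centers = X[[0, -1]].copy()
for _ in range(20):
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)   # point-to-center distances
    assign = d.argmin(axis=1)                                # nearest-center assignment
    centers = np.array([X[assign == k].mean(axis=0) for k in (0, 1)])
```

The algorithm recovers the two group centers, characterizing the dataset's structure without any output variable to predict.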
Provide an overview behind the logic of reinforcement learning and used data structures
The goal is to make a series of decisions to achieve an objective. The environment can be static or dynamic (changing rules, adapting opponents, etc.). The final result is a recommended action (not a prediction, classification, or cluster)
Feedback is provided through “rewards” (or sanctions if negative) that indicate that a certain type of behavior is desired but no specific instructions are provided (trial-and-error approach).
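A minimal trial-and-error sketch: an epsilon-greedy multi-armed bandit with hypothetical reward probabilities. The learner receives no instructions, only rewards, and ends with a recommended action:

```python
import numpy as np

rng = np.random.default_rng(5)
true_means = np.array([0.2, 0.5, 0.8])   # hypothetical reward probability per action
counts = np.zeros(3)
values = np.zeros(3)                     # running estimate of each action's reward

for _ in range(2000):
    # Trial and error: explore a random action 10% of the time, else exploit.
    a = int(rng.integers(3)) if rng.uniform() < 0.1 else int(values.argmax())
    reward = float(rng.uniform() < true_means[a])   # feedback arrives only as a reward
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]   # incremental mean update

best_action = int(values.argmax())
```

After enough interactions the learner recommends the action with the highest estimated reward, having discovered it purely through reward feedback.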
Provide an overview behind the logic of parametric vs. non-parametric methods
Parametric methods: Modeler makes an assumption about the functional relationship between features and outputs/labels (mapping)
Non-parametric methods: no assumption on functional relationship between features and outputs/labels. Problem: large data sets are needed
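A toy numpy comparison (simulated non-linear data; the sample size and k are arbitrary): the parametric model imposes a functional form, while a k-nearest-neighbour average assumes none but needs many data points:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 4, 300))
y = np.sin(x) + rng.normal(scale=0.1, size=300)   # true relation is non-linear

# Parametric: assume y = a + b*x (a deliberately wrong functional form here).
b, a = np.polyfit(x, y, 1)
mse_lin = np.mean((a + b * x - y) ** 2)

# Non-parametric: k-nearest-neighbour average, no functional form assumed.
def knn_predict(x0, k=15):
    idx = np.argsort(np.abs(x - x0))[:k]   # the k closest sample points
    return y[idx].mean()

pred_knn = np.array([knn_predict(x0) for x0 in x])
mse_knn = np.mean((pred_knn - y) ** 2)
```

Because the assumed linear mapping is wrong, the non-parametric method fits much better here — but its local averages only work because 300 observations are available.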
What is the purpose of EDA?
EDA = Exploratory Data Analysis
Collecting, cleaning, visualizing, and analyzing data
Collecting = data gathering
Cleaning = correct errors, remove duplicates, etc.
Visualization = Explore relationships between attributes and labels/outcomes
Explain the differences between:
1) Structured, semi-structured, and unstructured data
2) Numerical and Categorical data
3) Longitudinal and cross-sectional data
4) Textual and other data
Structured data = tabular form
Unstructured data = data that is not structured according to a preset data model
Semi-structured data = Mixture of structured/unstructured data (e.g. emails, CSV files, HTML files)
Numerical data = continuous/discrete data
Categorical data = nominal/ordinal data
Longitudinal data = time-series data (repeated measurements) of the same entity over time. Can also be based on geography or network relations
Cross-sectional data = snapshot (single point-in-time) of independent entities
Other data = audio, video, photographs, maps, drawings, etc.
What is the difference between interval and ratio data?
Interval data = clear ordering with consistent and meaningful intervals but no true zero point
Ratio data = interval data with true zero point
True zero point = complete absence of the measured quantity (e.g. 0 kg means no mass)
Which five factors necessitate data cleaning?
1) Inconsistent recording
(recordings must be done in the same way)
2) Unwanted observations
(observations collected that should not be in our sample)
3) Duplicate observations
4) Outlier (treatment)
(for some analysis purposes, we want to keep outliers, e.g. fraud detection)
5) Missing data
(can be structurally caused, i.e. impossible to get certain data or recording errors. Important to understand the reason why data is missing. Informative missingness exists when missingness is not random and can lead to bias. Imputation or estimation can be used to fill in missing data)
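A minimal imputation sketch (the values are hypothetical): missing entries, encoded as NaN, are filled with the mean of the observed values:

```python
import numpy as np

# Hypothetical income column with missing entries encoded as NaN.
incomes = np.array([42.0, np.nan, 55.0, 61.0, np.nan, 48.0])

# Simple mean imputation: fill gaps with the mean of the observed values.
observed_mean = np.nanmean(incomes)                           # ignores NaNs
imputed = np.where(np.isnan(incomes), observed_mean, incomes)
```

Mean imputation is only reasonable when missingness is (close to) random; under informative missingness it can itself introduce bias.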
When is a distribution of data not symmetric?
When the probability of data occurring is not equal left and right to the mean, then the data sample is skewed with a longer tail on one side.
Negatively skewed = left tail is longer
Positively skewed = right tail is longer
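A quick numpy check of skewness (the third standardized moment) on simulated symmetric vs. right-skewed samples:

```python
import numpy as np

rng = np.random.default_rng(7)
sym = rng.normal(size=10_000)           # symmetric sample
right = rng.exponential(size=10_000)    # long right tail -> positively skewed

def skewness(x):
    # Third standardized moment: ~0 for symmetric data.
    return float(np.mean(((x - x.mean()) / x.std()) ** 3))

skew_sym = skewness(sym)       # close to 0
skew_right = skewness(right)   # clearly positive
```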
When are Quantile-versus-Quantile (QQ) Plots used?
QQ plots are used to verify that data follows an assumed distribution, plotting the empirical quantiles (y-axis) against the theoretical quantiles (x-axis). Typically used to check whether data is normally distributed
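The QQ logic can be sketched without any plotting (simulated data; the theoretical normal quantiles are approximated from a large reference sample rather than an inverse CDF): for normal data the QQ points fall on a straight line whose slope and intercept recover the sample's scale and location:

```python
import numpy as np

rng = np.random.default_rng(8)
data = rng.normal(loc=5, scale=2, size=1_000)   # hypothetical sample to check

# Empirical quantiles (y-axis) vs. standard-normal quantiles (x-axis).
probs = np.linspace(0.01, 0.99, 99)
emp_q = np.quantile(data, probs)
theo_q = np.quantile(rng.normal(size=100_000), probs)

# For normal data the QQ points lie on a line: emp_q ~ loc + scale * theo_q.
slope, intercept = np.polyfit(theo_q, emp_q, 1)
```

Deviations from the fitted line, especially in the tails, would indicate departure from normality.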
What is a fat tail distribution?
A distribution with heavier tails than the normal distribution, i.e. extreme ("abnormal") events are still rare, but considerably more probable than a normal distribution would suggest