Unit 1 Flashcards
☆
How do Explicit Models work?
Use explicit knowledge to design model deductively
☆
What are the pros of Explicit Models?
Pros:
□ Knowledge about behavior of model and environment/problem
□ Knowledge about restrictions of model and reasons for design choices
☆
What are the cons of Explicit Models?
Cons:
□ Sometimes problem is too complex to model
□ Consequences of simplifications of problem/model hard to assess
□ Insufficient knowledge about problem/environment
☆
How do Inductive Models work?
Machine Learning: Use previously observed data to create model inductively
☆
What are the pros of Inductive Models?
Pros:
□ Problem can be solved without (exhaustive) knowledge about problem
□ Predictions/Insights are created directly from data
□ Can handle complex problems and benefits from big data
☆
What are the cons of Inductive Models?
Cons:
□ Data is required (sometimes a lot of data!)
□ Complex models (deep learning) can end up being a black box
□ Naive application might lead to biases
☆☆
How does Supervised Machine Learning work?
■ Learning a function that maps an input to an output (target value).
■ Learning is based on example input values with corresponding target values (also called supervisory signals)
□ E.g. image + object type, DNA sequence + phenotype, …
■ Typical usage: predictive modeling
□ Train model on dataset with input+target values
□ Use trained model to predict target values for other (new) inputs
■ Classification: target value is class label (discrete attribute, e.g. integer, letter, word)
■ Regression: target value is numerical value (real number)
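A minimal sketch of this workflow, assuming scikit-learn is available; the tiny synthetic dataset and the choice of logistic regression are purely illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Example inputs with corresponding target values (supervisory signals):
    # each row of X_train is an input vector, each entry of y_train a class label.
    X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
    y_train = np.array([0, 0, 1, 1])

    # Train a model on the dataset with input + target values ...
    model = LogisticRegression().fit(X_train, y_train)

    # ... then use the trained model to predict targets for other (new) inputs.
    print(model.predict(np.array([[1.5, 1.5], [8.5, 8.5]])))  # expected: [0 1]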
☆☆
What is a Model?
parameterized function/method with specific parameter values (e.g. a trained neural network)
☆☆
What is a Model Class?
the class of models in which we search for the model (e.g. neural networks, SVMs, etc)
☆☆
What are Parameters?
representations of concrete models inside the given model class (e.g. network weights)
☆☆
What are Hyperparameters?
parameters controlling model complexity or the training procedure (e.g. network learning rate, the number of hidden layers, etc)
☆☆
What is Model selection/training?
process of finding a model from the model class
☆
How does the Feature Selection process work?
■ What data do we have?
■ Removal of redundant features
■ Removal of features the model class cannot utilize
■ (Deep Learning: Feature selection mainly done by neural network)
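One hedged sketch of redundancy removal, here via pairwise correlation with pandas (the 0.95 threshold is an arbitrary illustrative choice, not from the lecture):

    import numpy as np
    import pandas as pd

    def drop_redundant(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
        # Absolute pairwise correlations between numerical features.
        corr = df.corr().abs()
        # Keep only the upper triangle so each feature pair is checked once.
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        # Drop one feature of every highly correlated (i.e. redundant) pair.
        redundant = [c for c in upper.columns if (upper[c] > threshold).any()]
        return df.drop(columns=redundant)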
☆
What is done during Preprocessing?
■ Contrast and brightness correction
■ Segmentation
■ Alignment
■ Normalization
■ …
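For example, normalization (here plain z-score standardization) could look like this minimal numpy sketch, assuming samples are rows and features are columns:

    import numpy as np

    def standardize(X: np.ndarray) -> np.ndarray:
        # Shift each feature (column) to mean 0 and scale it to std 1.
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        return (X - mean) / (std + 1e-12)  # epsilon guards against constant features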
☆
How does Input Representation work?
■ We can represent each object by a vector of feature values (i.e. a feature vector) of length d: x = (x(1), …, x(d))^T
■ An object described by a feature vector is also referred to as sample
■ Individual x(j) may be
□ group descriptions: categorical variables/features (e.g. x(3) = name of the boat with which the fish was caught)
□ numbers: numerical variables/features (e.g. fish length in cm)
■ Assume our dataset consists of l objects with feature vectors x1,…,xl
■ Each feature vector is of length d
■ Then we can write the feature vectors of all objects as the rows of a feature matrix X = (x1, …, xl)^T of shape l × d
■ Assume we are given a target value yi ∈ R for each sample xi
■ Then all target values constitute the target/label vector y = (y1, …, yl)^T
■ Often we write our dataset, including input features and targets, as data matrix Z with rows zi^T = (xi^T, yi), i.e. X and y side by side
■ Note: Target of each sample can be a vector, then we get a target value matrix Y (multi-label classification). Don’t confuse this with multi-class classification (more than 2 possible label values).
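A tiny numpy sketch of these objects (all values made up):

    import numpy as np

    # l = 3 samples, d = 2 features: the rows are the feature vectors x1, ..., xl.
    X = np.array([[5.1, 3.5],
                  [4.9, 3.0],
                  [6.2, 3.4]])            # feature matrix, shape (l, d)
    y = np.array([0.0, 0.0, 1.0])         # target/label vector, shape (l,)

    # Data matrix Z: input features and targets side by side.
    Z = np.hstack([X, y.reshape(-1, 1)])  # shape (l, d + 1)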
☆☆
How does the Loss function work?
■ Assume we have a model g, parameterized by w
■ g(x;w) maps an input vector x to a predicted output value ŷ
■ We want the prediction ŷ to be as close as possible to the true target value y
■ We can use a loss (cost) function L(y, g(x;w)) to measure how close our prediction is to the true target for a given sample z = (x^T, y)^T
■ The smaller the loss (cost), the better our prediction
■ Many loss functions available with different justifications
■ Not every loss function is suitable for every task
■ Choice of loss function depends on data, task, and model class
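Two common examples as plain Python functions, a sketch only (squared loss is typical for regression, zero-one loss for classification):

    def squared_loss(y_true: float, y_pred: float) -> float:
        # Penalizes deviations from the true target quadratically.
        return (y_true - y_pred) ** 2

    def zero_one_loss(y_true: int, y_pred: int) -> float:
        # 1 if the predicted class label is wrong, 0 if it is correct.
        return float(y_true != y_pred)

    print(squared_loss(2.0, 1.5))  # 0.25
    print(zero_one_loss(1, 0))     # 1.0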
☆☆
What is the generalization error/risk?
The generalization error or risk is the expected loss on future data for a given model g(.;w):
R(g(.;w)) = E_{(x,y)~p(x,y)}[L(y, g(x;w))]
■ In practice, we hardly have any knowledge about p(x,y)
■ → We have to estimate the generalization error
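A toy illustration of the definition: if p(x,y) were known (simulated below), the risk of a fixed model could be approximated by the average loss on a huge fresh sample; all distributions and values here are assumptions for the demo:

    import numpy as np

    rng = np.random.default_rng(0)

    def g(x, w=1.8):
        return w * x  # a fixed model g(x; w)

    # Simulated "true" p(x, y): y = 2x + Gaussian noise.
    x = rng.normal(size=1_000_000)
    y = 2.0 * x + rng.normal(scale=0.5, size=x.size)

    # Risk = expected squared loss, approximated on a huge sample.
    print(np.mean((y - g(x)) ** 2))  # about 0.04 * E[x^2] + 0.25 = 0.29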
☆☆
What is Empirical Risk Minimization (ERM) ?
Empirical Risk Minimization (ERM) is a fundamental principle in machine learning: minimize the error (or "risk") of a model by optimizing its performance on a given training dataset.
■ We do not know the true p(x,y), but we have access to l data samples drawn from it (i.e. our dataset)
■ We estimate the (true) risk by the empirical risk Remp on our dataset:
Remp(g(.;w)) = (1/l) Σ_{i=1}^{l} L(yi, g(xi;w))
■ Assume that the data points are i.i.d. (independent and identically distributed)
■ Strong law of large numbers: Remp(g(.;w)) → R(g(.;w)) for l → ∞
■ Goal: Empirical Risk Minimization (ERM), i.e. find the parameters w that minimize Remp(g(.;w))
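A minimal ERM sketch, assuming a linear model class and squared loss; plain gradient descent on Remp (learning rate and iteration count are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))               # l = 100 samples, d = 3 features
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + rng.normal(scale=0.1, size=100)

    w = np.zeros(3)
    for _ in range(500):
        residual = X @ w - y                    # g(xi; w) - yi for all samples
        grad = (2.0 / len(y)) * X.T @ residual  # gradient of Remp w.r.t. w
        w -= 0.1 * grad                         # descend on the empirical risk

    print(w)  # close to w_true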
☆☆
What is the problem of overfitting?
If the model is too complex, it might memorize the training data rather than generalize to unseen data.
■ With ERM we can optimize our model by minimizing the risk on our (training) dataset
■ Problem: We might fit our parameters to noise specific to our training dataset (i.e. overfitting)
■ → We need to get a better estimate of the (true) risk
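A compact numpy illustration of the effect: with 15 noisy training points, a degree-14 polynomial (enough parameters to interpolate) drives the training loss to almost zero while the loss on fresh data from the same distribution grows; all numbers are made up for the demo:

    import numpy as np

    rng = np.random.default_rng(1)
    x_tr = rng.uniform(-1, 1, size=15)
    y_tr = np.sin(3 * x_tr) + rng.normal(scale=0.2, size=15)
    x_te = rng.uniform(-1, 1, size=1000)
    y_te = np.sin(3 * x_te) + rng.normal(scale=0.2, size=1000)

    for degree in (3, 14):
        coef = np.polyfit(x_tr, y_tr, degree)  # ERM on the training set
        mse_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
        mse_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
        print(degree, mse_tr, mse_te)  # degree 14: near-zero train MSE, larger test MSE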
☆☆
What are the 3 subsets used in Machine Learning?
Training set: subset used to train a model, i.e. to optimize/fit model parameters
Validation set: subset used to find the best hyperparameters
Test set: subset used to estimate risk
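A sketch of a random split with numpy; the 60/20/20 proportions are an arbitrary example, not a prescription:

    import numpy as np

    def train_val_test_split(X, y, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))  # shuffle sample indices
        n_tr = int(0.6 * len(X))
        n_val = int(0.2 * len(X))
        tr, val, te = idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]
        return (X[tr], y[tr]), (X[val], y[val]), (X[te], y[te])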
☆☆
What is the purpose of a training set?
Training set: A subset with m samples we perform ERM on (i.e. optimize parameters on) to train a model
☆☆
What is the purpose of a test set?
Test set: A subset with l − m samples we use to estimate the risk. Used neither for model selection, nor for hyperparameter search, nor for training.
■ Our estimate Remp on the test set will show whether we overfit to noise in the training set
☆
How can we avoid overlaps between training and test sets?
■ Solution: Cross Validation (CV)
□ Split dataset into n disjoint folds
□ Use n−1 folds as training set, left-out fold as test set
□ Train n times, every time leaving out a different fold as test set
□ Average over n estimated risks on test sets to get better estimate of generalization capability
■ Nested Cross Validation
□ We can apply another (inner) CV procedure within each training set of the original (outer) CV → allows for evaluation of the model selection procedure
■ Getting a risk estimate on the selected model:
1. Apply cross validation on the training set (withhold the test set)
2. Use the test set to estimate the risk of the model selected via CV
■ In practice, the found model is often trained further or re-trained on the complete dataset for best performance
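A plain n-fold CV sketch in numpy; fit (returning a trained model) and loss (returning the average loss of a model on a set) are hypothetical placeholders:

    import numpy as np

    def cross_validate(X, y, fit, loss, n_folds=5, seed=0):
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), n_folds)  # n disjoint folds
        risks = []
        for k in range(n_folds):
            test = folds[k]                  # left-out fold as test set
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            model = fit(X[train], y[train])  # train on the other n-1 folds
            risks.append(loss(model, X[test], y[test]))
        return np.mean(risks)                # average over the n risk estimates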
☆
What are some common pitfalls in Machine Learning?
Underfitting: model is too simple/coarse to fit training or test data (too low model complexity)
Overfitting: model fits (too) well to training data but not well to future/test data (too high model complexity)
Unbalanced datasets: datasets biased toward a single class need to be evaluated properly (balanced accuracy, ROC AUC, loss weighting, …)
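For instance, balanced accuracy (the mean of the per-class recalls) can be computed directly; a small sketch with made-up labels:

    import numpy as np

    def balanced_accuracy(y_true, y_pred):
        # Average the recall of each class so the majority class cannot dominate.
        classes = np.unique(y_true)
        return np.mean([np.mean(y_pred[y_true == c] == c) for c in classes])

    # 90% negatives: always predicting 0 scores 0.9 plain accuracy,
    # but only 0.5 balanced accuracy.
    y_true = np.array([0] * 9 + [1])
    y_pred = np.zeros(10, dtype=int)
    print(balanced_accuracy(y_true, y_pred))  # 0.5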