Knowledge Discovery Flashcards
Regression
Predict a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
based on a set of training data of tuples (x, y), where x is a vector and y is a continuous-valued variable, find a function f such that f(x) = y
Classification
Predict a given discrete-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
based on a set of training data of tuples (x, y), where x is a vector and y is a discrete-valued variable, find a function f such that f(x) = y
Clustering
Goal is to find patterns/structures in the data which are previously unknown
does not use labeled data
based on a set of training data x, find a function such that homogeneous objects are aggregated into the same groups
- high homogeneity within clusters
- high heterogeneity between clusters
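A minimal k-means sketch in NumPy as one concrete clustering algorithm; not from the source, and the toy data, seed, and function name are illustrative only:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: group homogeneous points into k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center (homogeneity within clusters)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points (keep old center if empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# toy data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
```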
methods against underfitting
underfitting means high bias
feature selection:
get additional features
add polynomial features … (see the sketch below)
Regularization:
decrease lambda
NN:
larger NN, increase complexity
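A small NumPy sketch of the "add polynomial features" measure, assuming a 1-D input; `add_polynomial_features`, the toy data, and the degree are illustrative, not from the source:

```python
import numpy as np

def add_polynomial_features(x, degree=3):
    """Expand a 1-D input into [x, x^2, ..., x^degree] to increase model capacity."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.linspace(-1, 1, 20)
y = np.sin(3 * x)                                    # clearly non-linear target
X_poly = add_polynomial_features(x, degree=5)
design = np.column_stack([np.ones_like(x), X_poly])  # add a bias column
w, *_ = np.linalg.lstsq(design, y, rcond=None)       # a plain linear fit can now capture the curvature
```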
methods against overfitting
overfitting means high variance
get more data!
feature selection
* smaller sets of features
Regularization
* increase lambda (see the ridge sketch below)
NN:
- dropout
- smaller NN
- early stopping
- increase weight decay
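One way to see "increase lambda" concretely: a hedged sketch of closed-form ridge (L2-regularized) regression in NumPy, where lambda shrinks the weights; the data and lambda values are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form L2-regularized least squares: w = (X^T X + lam * I)^-1 X^T y.
    Increasing lam shrinks the weights, i.e. reduces variance/overfitting."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] + 0.1 * rng.normal(size=30)
w_weak = ridge_fit(X, y, lam=0.01)    # close to ordinary least squares
w_strong = ridge_fit(X, y, lam=10.0)  # heavily shrunk weights
```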
Evaluation methods
Hold-out
* split into two sets (e.g. 80% train, 20% test)
+ evaluation on data which is not used for training
+ easy to implement
- not all data is used for training
- evaluation results strongly depend on choice of training and test set
k-fold cross-validation
* split data into k folds, obtain average performance
+ approximation of actual model performance
+ all data used for training
- expensive, k models learned
- choice of k
leave-one-out
* cross-validation with k = N
+ makes best use of the data
- very expensive (N models learned)
- high variance of the individual models
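A minimal NumPy sketch of k-fold cross-validation; `fit` and `score` are assumed user-supplied callables (hypothetical, not from the source), and leave-one-out is the special case k = N:

```python
import numpy as np

def k_fold_cv(X, y, k, fit, score, seed=0):
    """Split the data into k folds; for each fold, train on the other k-1 folds,
    evaluate on the held-out fold, and return the average score (k models learned)."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(score(model, X[test_idx], y[test_idx]))
    return np.mean(scores)
```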
Gradient Descent vs Normal Equation
Gradient Descent:
- need to choose alpha
- needs many iterations
- Works well even when M is large
Normal Equation
- no need to choose alpha
- don’t need to iterate
- need to compute (X^T X)^-1, which is computationally expensive
- slow, if M is very large
Use GD if
- X^T X is not invertible
- number of features is large (≥ 10,000)
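A sketch of both approaches for linear regression in NumPy, assuming a least-squares cost; alpha, the iteration count, and the toy data are illustrative:

```python
import numpy as np

def fit_normal_equation(X, y):
    """w = (X^T X)^-1 X^T y: no alpha, no iterations, but cubic cost in the
    number of features and fails if X^T X is not invertible."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_gradient_descent(X, y, alpha=0.1, n_iter=2000):
    """Repeatedly step against the gradient of the mean squared error;
    needs a learning rate alpha and many iterations, but scales to many features."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / len(y)
        w -= alpha * grad
    return w

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=100)
w_ne = fit_normal_equation(X, y)
w_gd = fit_gradient_descent(X, y)   # should agree with w_ne up to noise
```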
Multiclass methods: one-vs-rest / one-vs-one
one-vs-rest:
- more easily interpretable
- possibility to gain knowledge about the class by inspecting its classifier
- less computationally expensive: O(C) classifiers
- recommended as default choice
one-vs-one:
- less sensitive to imbalanced data sets
- advantageous for algorithms such as kernel algorithms which don't scale well with M
- much more computationally expensive: O(C^2)
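A hedged one-vs-rest sketch, assuming scikit-learn's LogisticRegression as the base binary classifier; the helper names are illustrative, not from the source:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # any binary classifier works

def one_vs_rest_fit(X, y):
    """Train one binary 'class c vs. all other classes' classifier per class (O(C) models)."""
    return {c: LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
            for c in np.unique(y)}

def one_vs_rest_predict(models, X):
    """Predict the class whose 'c vs. rest' classifier is most confident."""
    classes = sorted(models)
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return np.asarray(classes)[scores.argmax(axis=1)]
```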
batch gradient descent pros and cons
pro:
+ fewer updates
+ computationally more efficient than SGD
+ more stable convergence
con:
- entire training data must fit into the RAM
- updates and training speed may become slow for large data sets
stochastic gradient descent pros and cons
pro:
+ the frequent updates immediately give an insight into the performance of the model
+ the increased model update frequency can result in faster learning
+ the noisy update process can allow the model to avoid local minima
con:
- updating the model so frequently is more computationally expensive, taking significantly longer to train models on large datasets
- the frequent updates can result in a noisy gradient signal
mini-batch GD pros and cons
pro:
+ the model update frequency is higher than batch gradient descent, which allows for a more robust convergence, avoiding local minima
+ the batched updates provide a computationally more efficient process than stochastic gradient descent
+ the batching combines the efficiency of not having all training data in memory with efficient (vectorized) algorithm implementations
con:
- requires the additional hyperparameter b (batch size)
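A minimal mini-batch GD sketch for linear regression in NumPy; the batch size b, learning rate, and epoch count are illustrative hyperparameters:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.05, batch_size=32, n_epochs=50, seed=0):
    """Mini-batch GD for linear regression: update on small random batches,
    a compromise between batch GD (b = len(X)) and SGD (b = 1)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        idx = rng.permutation(len(X))             # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
            w -= alpha * grad
    return w
```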
kernel trick
the (implicit) transformation of data into a higher-dimensional space so that they can be separated linearly
Kernel Methods
- linear model
- representer theorem
- dual representation
- kernel function
Linear Model:
- classification = side of the hyperplane
- training = estimation of “good” parameters w and b
Representer Theorem:
optimal weight vector w is a linear combination of training instances
Dual Representation:
During learning and classification, it is sufficient to access only dot products of data points
Kernel Function:
K(x, z) = ⟨φ(x), φ(z)⟩
* similarity function with an interpretation as a dot product after mapping to a vector representation
“Kernel-Trick”
Rewrite the learning algorithm such that any reference to the input data happens only via inner products
Replace any such inner product by the kernel function of your choice
work with the (linear) algorithm as usual
* non-linearity enters learning through kernels while the training algorithm may remain linear
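As an illustration of the dual representation and the kernel trick (not from the source), a sketch of kernel ridge regression with an RBF kernel: training and prediction touch the data only through K(x, z); the regularization and gamma values are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2): a dot product in an implicit feature space."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
    """Dual representation: only the Gram matrix of the training data is needed.
    alpha = (K + lam * I)^-1 y; by the representer theorem the solution is a
    linear combination of the (mapped) training instances."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """f(x) = sum_i alpha_i * K(x_i, x): prediction also needs only kernel evaluations."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```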
SVMs
Generalization through margin maximization
- intuitive motivation
- unique optimal solution
maximal margin = minimal norm
dual formulation
- makes it possible to use kernel functions
- makes the problem a quadratic programming problem for which fast algorithms exist
hard margin vs soft margin
many well known implementations available
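A usage sketch assuming scikit-learn's SVC (a soft-margin SVM with an RBF kernel); the toy data and the value of C are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy 2-class problem that is not linearly separable in the input space
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# soft-margin SVM: C trades off margin width against margin violations,
# the RBF kernel enters through the dual formulation
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))   # training accuracy
```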