Xijia Flashcards
What is a kernel function and its role in an algorithm?
A function that’s used when non-linear data is projected into higher dimensions ( x => Ø(x) ). The kernel function performs this computation implicitly.
Given x1 and x2 it returns < Ø(x1) , Ø(x2) > without calculating Ø(x).
One is therefore able to transform the data without the need for excessive computing power.
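A minimal sketch of the kernel trick (my own illustration, not from the course material): for the degree-2 polynomial kernel the explicit feature map Ø is known, so we can check that the kernel value equals < Ø(x1), Ø(x2) > without the kernel ever forming Ø(x).

```r
# Explicit feature map for 2-D input and the degree-2 polynomial kernel
phi <- function(x) {
  c(1, sqrt(2) * x[1], sqrt(2) * x[2],
    x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
}

poly_kernel <- function(x, z, d = 2) (sum(x * z) + 1)^d

x1 <- c(1, 2)
x2 <- c(3, -1)

sum(phi(x1) * phi(x2))   # inner product computed in the feature space
poly_kernel(x1, x2)      # same value, computed without ever building phi(x)
```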
What is the basic property of a kernel function? (Gram matrix)
The kernel function has one central component, the Gram matrix, which is made up from the:
- kernel
- training set
The Gram matrix contains the evaluation of the kernel function on all pairs of data points. All the information passed to the algorithm must go through it.
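A minimal sketch (toy data; the degree-2 polynomial kernel from the previous card is assumed) of building the Gram matrix K[i, j] = k(xi, xj) over all pairs of training points.

```r
poly_kernel <- function(x, z, d = 2) (sum(x * z) + 1)^d

set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)   # 10 training points in 2-D
n <- nrow(X)

K <- matrix(0, n, n)
for (i in 1:n) {
  for (j in 1:n) {
    K[i, j] <- poly_kernel(X[i, ], X[j, ])    # kernel on every pair
  }
}

dim(K)          # n x n: grows with the number of data points, not dimensions
isSymmetric(K)  # a valid Gram matrix is symmetric (and positive semi-definite)
```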
Polynomial Kernel
F(x, xj) = (x · xj + 1)^d
A generalized version of the linear kernel; often not the preferred choice.
+ more powerful than linear kernel
+ the degree d gives direct control over the model's complexity
- more hyperparameters
- a high polynomial degree easily leads to overfitting
RBF kernel
F(x, xj) = exp( -||x - xj||^2 / (2σ^2) )
σ => controls the complexity of the model
It is one of the most preferred and widely used kernel functions. It is usually chosen for non-linear data and gives a proper separation even when there is no prior knowledge of the data.
+ only one hyperparameter
+ less computational effort
- so powerful and flexible that it can easily overfit
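A minimal sketch of the RBF kernel as a plain R function (σ is the single hyperparameter); the calls illustrate how σ controls how local, and therefore how complex, the resulting model is.

```r
rbf_kernel <- function(x, z, sigma = 1) {
  exp(-sum((x - z)^2) / (2 * sigma^2))
}

x  <- c(0, 0)
xj <- c(1, 1)

# Small sigma: the kernel value drops off quickly with distance (very local,
# flexible model). Large sigma: almost constant (smooth, simple model).
rbf_kernel(x, xj, sigma = 0.5)
rbf_kernel(x, xj, sigma = 1)
rbf_kernel(x, xj, sigma = 5)
```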
What is the basic idea of kernel methods?
Functions that, given x1 and x2, return < Ø(x1), Ø(x2) > without calculating Ø(x).
Transform feature vectors into (even infinite-dimensional) spaces without extra computational burden.
Why is the kernel method viewed as the memory-based method?
Memory-based methods keep the training samples and use them during the prediction phase.
Kernel functions store everything in the Gram matrix (the kernel evaluated on the training set).
Hence they are memory-based.
What is the basic idea of ensemble methods?
Use multiple weak learners together to get better predictive performance.
ex:
{bagging => random forest}
{boosting => AdaBoost}
What is the difference between kernel methods and Ensemble methods?
The main goal of kernel methods is to help classification by adding dimensions to the data.
Ensemble methods use multiple methods and combine them to one classification.
Why can bagging (bootstrap aggregation) help us to improve the predictions?
Bagging is when we draw many random bootstrap samples (with replacement) from the training data, train a model on each, and take the average:
+ Raises stability of the model
+ Reduces overfitting
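A minimal bagging sketch (assuming the rpart package and the built-in iris data, both my choice): bootstrap samples, one tree per sample, majority vote over the trees.

```r
library(rpart)

set.seed(1)
n_models <- 25
trees <- vector("list", n_models)

for (m in 1:n_models) {
  idx <- sample(nrow(iris), replace = TRUE)            # bootstrap sample
  trees[[m]] <- rpart(Species ~ ., data = iris[idx, ]) # one model per sample
}

# Majority vote over the individual trees
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged_pred == iris$Species)   # accuracy of the bagged ensemble
```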
What is the difference between bagging and adaboost?
Both are ensemble methods that use random sampling from the training data, but boosting redistributes its sample weights after each training step.
Understand how the Perceptron algorithm works
- Needs linearly separable data
- Input: w = (w0,w1,w2)^T {Dimension +1} {w0 = bias}
sign( w0*x0 + w1*x1 + w2*x2 ) = (+) or (-)
If misclassification:
wi = wi + N*d*xi
(new weight = old weight + learning-rate step)
N = learning rate ( how fast we are going to step towards the line ), d = {+1 if the missed point should be above the line} {-1 if the missed point should be below the line}
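A minimal perceptron sketch (my own illustration on assumed linearly separable 2-D toy data; eta plays the role of the learning rate N above).

```r
set.seed(1)
X <- rbind(matrix(rnorm(40, mean =  2), ncol = 2),
           matrix(rnorm(40, mean = -2), ncol = 2))
y <- c(rep(1, 20), rep(-1, 20))          # class labels d = +1 / -1

X1  <- cbind(1, X)                       # prepend x0 = 1 so w0 acts as bias
w   <- rep(0, 3)                         # w = (w0, w1, w2)
eta <- 0.1                               # learning rate

for (epoch in 1:100) {
  for (i in 1:nrow(X1)) {
    pred <- sign(sum(w * X1[i, ]))
    if (pred != y[i]) {                  # misclassification:
      w <- w + eta * y[i] * X1[i, ]      # wi = wi + N * d * xi
    }
  }
}

mean(sign(X1 %*% w) == y)                # should reach 1 on separable data
```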
Features of Perceptron algorithm?
- {supervised learning}
- {optimal weight coefficients are automatically learned}
Understand how to kernelize the linear PCA method.
- The first axis is the most important one (x1 is not on the same scale as x2)
- PCA’s covariance matrix scales with the number of input dimensions
- Kernel PCA’s kernel matrix scales with the number of datapoints.
Be able to implement kernel PCA from scratch in R
- Pick kernel function
- Calculate the kernel matrix
- Center the kernel matrix
- Solve the Eigenproblem
- Project the data onto the principal components (eigenvectors)
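A sketch of these steps from scratch in R (assumed RBF kernel with sigma = 1 and the built-in iris data; the eigenvector normalization shown is one common convention).

```r
set.seed(1)
X <- as.matrix(iris[, 1:4])
n <- nrow(X)

# 1. Pick a kernel function
rbf <- function(x, z, sigma = 1) exp(-sum((x - z)^2) / (2 * sigma^2))

# 2. Calculate the kernel (Gram) matrix
K <- matrix(0, n, n)
for (i in 1:n) for (j in 1:n) K[i, j] <- rbf(X[i, ], X[j, ])

# 3. Center the kernel matrix
one_n <- matrix(1 / n, n, n)
K_c <- K - one_n %*% K - K %*% one_n + one_n %*% K %*% one_n

# 4. Solve the eigenproblem
eig <- eigen(K_c, symmetric = TRUE)

# 5. Project the data onto the first two kernel principal components
alphas <- eig$vectors[, 1:2] %*% diag(1 / sqrt(eig$values[1:2]))
proj   <- K_c %*% alphas
plot(proj, col = iris$Species, xlab = "KPC1", ylab = "KPC2")
```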
Understand how to kernelize the ridge regression method.
Ridge regression is regression where the cost function is altered by adding a penalty term on the magnitude of the coefficients.
With the kernel alteration xi –> Ø(xi) we can work in (even infinitely) high-dimensional feature spaces.
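A minimal kernel ridge regression sketch (assumed RBF kernel on 1-D toy data; lambda is the regularization constant and sigma the kernel parameter, i.e. the two hyperparameters of the next card).

```r
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.2)

rbf <- function(a, b, sigma = 1) exp(-(a - b)^2 / (2 * sigma^2))

lambda <- 0.1
K <- outer(x, x, rbf)                             # Gram matrix on training inputs
alpha <- solve(K + lambda * diag(length(x)), y)   # dual coefficients

# Predict at new points: f(x*) = sum_i alpha_i * k(x_i, x*)
x_new <- seq(0, 10, length.out = 200)
y_hat <- outer(x_new, x, rbf) %*% alpha

plot(x, y)
lines(x_new, y_hat, col = "blue")
```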
Understand the meaning of the hyperparameters in KRR.
- Kernel Parameter
- Regularization constant
formulation of maximum margin classifier (MMC)
Maximum margin classifier
min {b,w} 1/2 w^T*w
s.t. yi(w^T*xi + b) >= 1 for all i = 1, …, N
formulation of soft margin classifier (SMC)
- Give up some high noise cases
- Introduce slack variables ξi
min {b,w,ξ} 1/2 w^T*w + C * sum of ξi, i = 1 to N
s.t. yi(w^T*xi + b) >= 1 - ξi and ξi >= 0 for every i
Hyperparameter: large C => less noise tolerance, high cost for slack
Understand the difference and connections between MMC, SMC, and SVM
(SVM is referred to as a kernelized SMC)
SVM = SMC + kernel trick
SMC = MMC + penalty on slackness parameter
formulation of SVM
- Works like logistic regression or the perceptron
- among the multiple hyperplanes that separate the data, the maximum-margin one is chosen
Why SVM is called a sparse kernel method?
Only the “outliers” (the support vectors) matter for our model. We can add new data points behind the margin and the model won’t change. That’s why it is called a sparse model.
What are the hyper-parameters in RBF-kernel SVM? What is the meaning of them?
- Gamma defines how far the influence of a single training example reaches.
- C - trades correct classification of training examples against maximization of the decision margin.
Understand the outputs of the function ’ksvm’ from ’kernlab’ package
- alpha - the resulting support vectors (alpha vector)
- alphaindex - index of the resulting support vectors
- coef - the corresponding coefficients
- b - the negative intercept
- nSV - the number of support vectors
- obj - the value of the objective function
- error - training error
- cross - cross-validation error
- prob.model - width of the Laplacian fitted
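A minimal usage sketch (assuming the built-in iris data and an RBF kernel) of fitting ksvm and reading those outputs through kernlab's accessor functions.

```r
library(kernlab)

fit <- ksvm(Species ~ ., data = iris,
            kernel = "rbfdot", kpar = list(sigma = 0.1),
            C = 1, cross = 5)

nSV(fit)          # number of support vectors
b(fit)            # negative intercept(s)
alphaindex(fit)   # indices of the support vectors in the training data
coef(fit)         # their corresponding coefficients
error(fit)        # training error
cross(fit)        # cross-validation error
```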
Know how to train a random forest
- Random sampling with replacement when choosing data, n times (bootstrap)
- Randomly select features and build a decision tree for each dataset (n)
- Use majority voting over the decision trees
- bootstrapping + aggregation = bagging
Look this one up so that you understand it
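A minimal training sketch (assuming the randomForest package and the built-in iris data) using the hyperparameters from the card further down.

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,    # number of trees to grow
                   mtry = 2,       # variables tried at each node split
                   nodesize = 1)   # minimum number of observations in a leaf

print(rf)   # shows the OOB error estimate and the confusion matrix
```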
Know how to train a decision tree
- Look this up in more detail so that you can describe it
Connection between decision tree and RF?
A random forest is made up of decision trees that use voting to classify the sample.
What are the hyper-parameters in RF? What is the meaning for them?
mtry - the number of variables tried at each node split (only two variables?)
nodesize - the minimum number of observations in the leaves
ntree - the number of trees to grow.
Why Random forest is self-validated?
Random forest is built on bagging. In each round of bootstrap roughly 2/3 of the samples will be included for training, so each decision tree can be evaluated on the remaining 1/3.
What is OOB error? (Random forest)
Out-of-bag error - a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models utilizing bootstrap aggregating (bagging).
How to apply Random forest for features selection?
Random forests are good for feature selection since the features are naturally ranked by how well they improve the purity of the nodes. This is called mean decrease in impurity. Nodes with the greatest decrease in impurity are at the top of the tree.
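A minimal sketch (assuming the randomForest package and the built-in iris data) of ranking features by importance for feature selection.

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

importance(rf)    # per-feature mean decrease in accuracy / Gini impurity
varImpPlot(rf)    # plot the ranking; keep the top-ranked features
```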
Understand the outputs from ’randomForest’ package
The fitted object contains, among other things:
- predicted - the OOB-predicted values for the training data
- err.rate - OOB error rate, one row per tree (classification)
- confusion - OOB confusion matrix (classification)
- votes - proportion of OOB votes per class for each observation
- oob.times - number of times each observation was out-of-bag
- importance - variable importance measures
- mtry, ntree - the hyperparameters used
- forest - the list of fitted trees
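A minimal inspection sketch (assuming the same iris fit as in the training card above) showing how to read these components off a randomForest object.

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris)

names(rf)                 # all components of the fitted object
rf$confusion              # OOB confusion matrix
head(rf$votes)            # OOB class-vote proportions per observation
rf$err.rate[rf$ntree, ]   # final OOB error rate (overall and per class)
```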
Understand the basic idea of the Adaboost algorithm.
- Combines a lot of weak learners
- Some stumps/trees get more say in the final classification
- Each new stump is made by taking the previous stump's mistakes into account
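A minimal hand-rolled AdaBoost sketch (my own illustration with 1-D decision stumps on toy data, not from the course material) showing the two mechanisms above: the amount of say (alpha) and the reweighting of the previous stump's mistakes.

```r
set.seed(1)
n <- 100
x <- runif(n)
y <- ifelse(x > 0.5, 1, -1)
flip <- sample(n, 10); y[flip] <- -y[flip]   # add some label noise

w <- rep(1 / n, n)                           # equal sample weights to start
stumps <- list()

for (m in 1:10) {
  # Weak learner: the threshold stump with the lowest weighted error
  thresholds <- sort(unique(x))
  errs <- sapply(thresholds, function(t) sum(w * (ifelse(x > t, 1, -1) != y)))
  t_best <- thresholds[which.min(errs)]
  pred   <- ifelse(x > t_best, 1, -1)

  err   <- sum(w * (pred != y)) / sum(w)
  alpha <- 0.5 * log((1 - err) / max(err, 1e-10))   # low error => more say

  w <- w * exp(-alpha * y * pred)              # up-weight the mistakes
  w <- w / sum(w)

  stumps[[m]] <- list(threshold = t_best, alpha = alpha)
}

# Final prediction: weighted vote of all stumps
F_x <- rowSums(sapply(stumps, function(s) s$alpha * ifelse(x > s$threshold, 1, -1)))
mean(sign(F_x) == y)
```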