Knowledge Discovery Flashcards
Regression
Predict a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
based on a set of training data of tuples (x, y), where x is a vector and y is a continuous-valued variable, find a function f such that f(x) = y
Classification
Predict a given discrete-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
based on a set of training data of tuples (x, y), where x is a vector and y is a discrete-valued variable, find a function f such that f(x) = y
Clustering
Goal is to find patterns/structures in the data which are previously unknown
does not use labeled data
based on a set of training data x, find a function such that homogeneous objects are aggregated into the same groups
- high homogeneity within clusters
- high heterogeneity between clusters
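A minimal k-means sketch in NumPy as one concrete clustering algorithm; not from the source, and the toy data, seed, and function name are illustrative only:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: group homogeneous points into k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center (homogeneity within clusters)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points (keep old center if empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# toy data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
```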
methods against underfitting
underfitting means high bias
feature selection:
get additional features
add polynomial features … (see the sketch below)
Regularization:
decrease lambda
NN:
larger NN, increase complexity
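A small NumPy sketch of the "add polynomial features" measure, assuming a 1-D input; `add_polynomial_features`, the toy data, and the degree are illustrative, not from the source:

```python
import numpy as np

def add_polynomial_features(x, degree=3):
    """Expand a 1-D input into [x, x^2, ..., x^degree] to increase model capacity."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.linspace(-1, 1, 20)
y = np.sin(3 * x)                                    # clearly non-linear target
X_poly = add_polynomial_features(x, degree=5)
design = np.column_stack([np.ones_like(x), X_poly])  # add a bias column
w, *_ = np.linalg.lstsq(design, y, rcond=None)       # a plain linear fit can now capture the curvature
```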
methods against overfitting
overfitting means high variance
get more data!
feature selection
* smaller sets of features
Regularization
* increase lambda (see the ridge sketch below)
NN:
- dropout
- smaller NN
- early stopping
- increase weight decay
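One way to see "increase lambda" concretely: a hedged sketch of closed-form ridge (L2-regularized) regression in NumPy, where lambda shrinks the weights; the data and lambda values are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form L2-regularized least squares: w = (X^T X + lam * I)^-1 X^T y.
    Increasing lam shrinks the weights, i.e. reduces variance/overfitting."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] + 0.1 * rng.normal(size=30)
w_weak = ridge_fit(X, y, lam=0.01)    # close to ordinary least squares
w_strong = ridge_fit(X, y, lam=10.0)  # heavily shrunk weights
```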
Evaluation methods
Hold-out
* split into two sets (e.g. 80% train, 20% test)
+ evaluation on data which is not used for training
+ easy to implement
- not all data is used for training
- evaluation results strongly depend on choice of training and test set
k-fold cross-validation
* split data into k folds, obtain average performance
+ approximation of actual model performance
+ all data used for training
- expensive, k models learned
- choice of k
leave-one-out
* cross-validation with k = N
+ makes best use of the data
- very expensive (N models learned)
- high variance of the individual models
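A minimal NumPy sketch of k-fold cross-validation; `fit` and `score` are assumed user-supplied callables (hypothetical, not from the source), and leave-one-out is the special case k = N:

```python
import numpy as np

def k_fold_cv(X, y, k, fit, score, seed=0):
    """Split the data into k folds; for each fold, train on the other k-1 folds,
    evaluate on the held-out fold, and return the average score (k models learned)."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(score(model, X[test_idx], y[test_idx]))
    return np.mean(scores)
```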
Gradient Descent vs Normal Equation
Gradient Descent:
- need to choose alpha
- needs many iterations
- Works well even when M is large
Normal Equation
- no need to choose alpha
- don’t need to iterate
- need to compute (X^T X)^-1, which is computationally expensive
- slow, if M is very large
Use GD if
- X^T X is not invertible
- number of features is large (≥ 10,000)
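A sketch of both approaches for linear regression in NumPy, assuming a least-squares cost; alpha, the iteration count, and the toy data are illustrative:

```python
import numpy as np

def fit_normal_equation(X, y):
    """w = (X^T X)^-1 X^T y: no alpha, no iterations, but cubic cost in the
    number of features and fails if X^T X is not invertible."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_gradient_descent(X, y, alpha=0.1, n_iter=2000):
    """Repeatedly step against the gradient of the mean squared error;
    needs a learning rate alpha and many iterations, but scales to many features."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / len(y)
        w -= alpha * grad
    return w

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=100)
w_ne = fit_normal_equation(X, y)
w_gd = fit_gradient_descent(X, y)   # should agree with w_ne up to noise
```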
Multiclass methods: one-vs-rest / one-vs-one
one-vs-rest:
- more easily interpretable
- possibility to gain knowledge about the class by inspecting its classifier
- less computationally expensive: O(C) classifiers
- recommended as default choice
one-vs-one:
- less sensitive to imbalanced data sets
- advantageous for algorithms such as kernel algorithms which don't scale well with M
- much more computationally expensive: O(C^2)
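A hedged one-vs-rest sketch, assuming scikit-learn's LogisticRegression as the base binary classifier; the helper names are illustrative, not from the source:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # any binary classifier works

def one_vs_rest_fit(X, y):
    """Train one binary 'class c vs. all other classes' classifier per class (O(C) models)."""
    return {c: LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
            for c in np.unique(y)}

def one_vs_rest_predict(models, X):
    """Predict the class whose 'c vs. rest' classifier is most confident."""
    classes = sorted(models)
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return np.asarray(classes)[scores.argmax(axis=1)]
```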
batch gradient descent pros and cons
pro:
+ fewer updates
+ computationally more efficient than SGD
+ more stable convergence
con:
- entire training data must fit into the RAM
- updates and training speed may become slow for large data sets
stochastic gradient descent pros and cons
pro:
+ the frequent updates immediately give an insight into the performance of the model
+ the increased model update frequency can result in faster learning
+ the noisy update process can allow the model to avoid local minima
con:
- updating the model so frequently is more computationally expensive, taking significantly longer to train models on large datasets
- the frequent updates can result in a noisy gradient signal
mini-batch GD pros and cons
pro:
+ the model update frequency is higher than batch gradient descent, which allows for a more robust convergence, avoiding local minima
+ the batched updates provide a computationally more efficient process than stochastic gradient descent
+ the batching combines the efficiency of not having all training data in memory with efficient (vectorized) algorithm implementations
con:
- requires the additional hyperparameter b (batch size)
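A minimal mini-batch GD sketch for linear regression in NumPy; the batch size b, learning rate, and epoch count are illustrative hyperparameters:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.05, batch_size=32, n_epochs=50, seed=0):
    """Mini-batch GD for linear regression: update on small random batches,
    a compromise between batch GD (b = len(X)) and SGD (b = 1)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        idx = rng.permutation(len(X))             # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
            w -= alpha * grad
    return w
```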
kernel trick
the (implicit) transformation of data into a higher-dimensional space so that they can be separated linearly
Kernel Methods
- linear model
- representer theorem
- dual representation
- kernel function
Linear Model:
- classification = side of the hyperplane
- training = estimation of “good” parameters w and b
Representer Theorem:
optimal weight vector w is a linear combination of training instances
Dual Representation:
During learning and classification, it is sufficient to access only dot products of data points
Kernel Function:
K(x, z) = ⟨φ(x), φ(z)⟩
* similarity function with an interpretation as a dot product after mapping to a vector representation
“Kernel-Trick”
Rewrite the learning algorithm such that any reference to the input data happens only via inner products
Replace any such inner product by the kernel function of your choice
work with the (linear) algorithm as usual
* non-linearity enters learning through kernels while the training algorithm may remain linear
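As an illustration of the dual representation and the kernel trick (not from the source), a sketch of kernel ridge regression with an RBF kernel: training and prediction touch the data only through K(x, z); the regularization and gamma values are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2): a dot product in an implicit feature space."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
    """Dual representation: only the Gram matrix of the training data is needed.
    alpha = (K + lam * I)^-1 y; by the representer theorem the solution is a
    linear combination of the (mapped) training instances."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """f(x) = sum_i alpha_i * K(x_i, x): prediction also needs only kernel evaluations."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```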
SVMs
Generalization through margin maximization
- intuitive motivation
- unique optimal solution
maximal margin = minimal norm
dual formulation
- makes it possible to use kernel functions
- makes the problem a quadratic programming problem for which fast algorithms exist
hard margin vs soft margin
many well known implementations available
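A usage sketch assuming scikit-learn's SVC (a soft-margin SVM with an RBF kernel); the toy data and the value of C are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy 2-class problem that is not linearly separable in the input space
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# soft-margin SVM: C trades off margin width against margin violations,
# the RBF kernel enters through the dual formulation
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))   # training accuracy
```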