Lecture 2 Flashcards
What is a learning example
- decompose objects into features (from all characteristics of an instance, pick the important ones)
- decompose inputs into features (unclear from the lecture what exactly is meant here)
- a feature is a measurable aspect of an object/instance
- features are extracted before learning
- some learning algorithms can extract features themselves from some types of input (e.g. images or text)
a feature
a measurable aspect of an object
feature transformation
New features of X often include transformations of existing features. These transformations are part of pre-processing, but can make a problem much easier.
PCA, Neural Networks, scaling and normalisation
example: from Cartesian coordinates to polar coordinates
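A minimal sketch of that example (my own illustration, not from the slides): converting Cartesian (x, y) features into polar (r, theta) features. A class that is hard to separate linearly in (x, y), e.g. points inside vs. outside a circle, becomes separable with a single threshold on r.

```python
import numpy as np

def cartesian_to_polar(X):
    """X is an (n_samples, 2) array of [x, y] pairs."""
    x, y = X[:, 0], X[:, 1]
    r = np.sqrt(x**2 + y**2)        # distance from the origin
    theta = np.arctan2(y, x)        # angle in radians
    return np.column_stack([r, theta])

X = np.array([[1.0, 0.0], [0.0, 2.0], [-3.0, -3.0]])
print(cartesian_to_polar(X))
```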
name some classifiers
logistic regression
kNN
decision tree
Decision boundaries logistic regression
In short:
- DB can be a linear or a polynomial function (both with one or more variables)
- the regression coefficients of the decision boundary are usually estimated using maximum likelihood estimation
- unlike linear regression (LR) there is no closed-form solution: fitting is an ITERATIVE PROCESS until it has converged
- the model is the composition g(f(x)); f(x), which defines the decision boundary, is a linear or multiple linear equation and CAN ALSO BE a polynomial function (but the classifier always has 2 classes, Y hat = 0 or Y hat = 1)
- function composition is an operation that takes two functions f and g and produces a function h such that h(x) = g(f(x)). In this operation, the function g is applied to the result of applying the function f to x.
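A minimal sketch (my own illustration) of logistic regression as the composition g(f(x)): f is the linear part that defines the decision boundary and g is the logistic (sigmoid) function applied to its result. The coefficients below are hypothetical; in practice they are estimated iteratively by maximum likelihood until convergence.

```python
import numpy as np

def f(x, w, b):
    return np.dot(x, w) + b          # linear part: decision boundary is f(x) = 0

def g(z):
    return 1.0 / (1.0 + np.exp(-z))  # logistic (sigmoid) function

def predict(x, w, b, threshold=0.5):
    p = g(f(x, w, b))                # h(x) = g(f(x)), a probability
    return int(p >= threshold)       # Y hat = 1 if probability >= 0.5, else 0

w, b = np.array([2.0, -1.0]), 0.5    # hypothetical coefficients
print(predict(np.array([1.0, 3.0]), w, b))
```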
logistic regression
The Y variable is binary (nominal) and the X variables can be categorical or numeric.
K-nearest Neighbor
Simple idea: similarity (distance; input can be numeric or categorical)
Given a new example Xj,
we look for the most similar example(s) in the training set (TrS)
and predict the same target for Xj.
The key component of kNN is the distance function. Depending on how you define distance you can get very different classifiers / performance.
Measurement of / defining similarity
- small distance –> similar objects
- large distance –> dissimilar objects
Distance functions
Numeric (5):
- General Lp-metric (Minkowski)
- Euclidean distance (p=2)
- Manhattan distance (p=1)
- Maximum metric (Chebyshev) (p –> infinity)
- Cosine Distance
https://www.sciencedirect.com/topics/computer-science/minkowski-distance –> see Figure 2.23.
Categorical (2):
- Hamming distance (p=0)
- Levenshtein distance
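A minimal sketch (my own addition, not from the slides) of the general Lp-metric and how Manhattan, Euclidean and Chebyshev fall out as special cases; the example points are hypothetical.

```python
import numpy as np

def minkowski(a, b, p):
    """General Lp-metric (Minkowski distance) between two numeric vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if np.isinf(p):                                  # p -> infinity: maximum metric (Chebyshev)
        return float(np.max(np.abs(a - b)))
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 1))        # Manhattan (p=1): 7.0
print(minkowski(a, b, 2))        # Euclidean (p=2): 5.0
print(minkowski(a, b, np.inf))   # Chebyshev (p -> infinity): 4.0
```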
Hamming distance
Looks at each attribute and checks whether the two values are equal or not; count the attributes that differ.
- KAROLIN and KATHRIN: Hamming distance is 3, because ROL in the first word differs from THR in the second.
- 1 0 1 1 1 0 1 and 1 0 0 1 0 0 1 hamming distance is 2
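A minimal sketch (my own illustration) that checks both examples: compare the sequences position by position and count the positions that differ.

```python
def hamming(a, b):
    assert len(a) == len(b), "Hamming distance needs sequences of equal length"
    return sum(x != y for x, y in zip(a, b))

print(hamming("KAROLIN", "KATHRIN"))                          # 3
print(hamming([1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0, 1]))  # 2
```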
Chebyshev distance
the limit of the Minkowski distance for p –> infinity.
Let’s use two objects, x1 = (1, 2) and x2 = (3, 5). The second attribute gives the greatest difference between values for the objects, which is 5 − 2 = 3. This is the Chebyshev distance.
Manhattan distance
You can only move along the axes (like walking city blocks): sideways and up/down.
Let x1 = (1, 2) and x2 = (3, 5) represent two objects. The Manhattan distance between the two is 2 + 3 = 5.
(per coordinate: |3 - 1| = 2 and |5 - 2| = 3)
Euclidean distance
In 2 dimensions:
Let x1 = (1, 2) and x2 = (3, 5) represent two objects. The Euclidean distance between the two is sqrt(2^2 + 3^2) ≈ 3.61.
(per coordinate: |3 - 1| = 2 and |5 - 2| = 3)
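The same worked examples checked with SciPy's distance functions (a sketch; `cityblock` is SciPy's name for the Manhattan distance):

```python
from scipy.spatial.distance import chebyshev, cityblock, euclidean

x1, x2 = [1, 2], [3, 5]
print(chebyshev(x1, x2))   # 3      (max of |3-1| = 2 and |5-2| = 3)
print(cityblock(x1, x2))   # 5      (2 + 3)
print(euclidean(x1, x2))   # 3.605  (sqrt(2^2 + 3^2))
```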
Finding the nearest neighbor
Learning as MEMORIZATION
Given a test point, measure the distances to all the training points and pick the k nearest ones
Their labels define the estimated label of the test point (e.g. by a majority vote among the k neighbors)
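A minimal sketch (my own illustration, with a hypothetical toy training set) of k-NN as "learning by memorization": store the training set and, for a test point, take a majority vote over the labels of its k nearest training points (Euclidean distance here).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    distances = np.linalg.norm(X_train - x_test, axis=1)   # distance to every training point
    nearest = np.argsort(distances)[:k]                    # indices of the k nearest ones
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority label

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([0.9, 1.1]), k=3))  # "A"
```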
Choose right value of k
if you pick k too LARGE –> UNDERFITTING
- everything is classified as the most probable class
if you pick k too SMALL –> OVERFITTING (variability, unstable decision boundaries)
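One common way to pick k (a sketch of my own, not from the lecture) is a cross-validated grid search over candidate values; the dataset below is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Very small k tends to overfit (unstable boundaries), very large k tends to underfit.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 11, 21, 51]},
                      cv=5)
search.fit(X, y)
print(search.best_params_)
```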
units in kNN
Units do matter in kNN
Suppose you had a dataset (m "examples" by n "features") and all but one feature dimension had values strictly between 0 and 1, while a single feature dimension had values that range from -1000000 to 1000000. When taking the Euclidean distance between pairs of "examples", the values of the feature dimensions that range between 0 and 1 become uninformative and the algorithm essentially relies on the single dimension whose values are substantially larger.
You have to transform the features to standardized units –> z-scores per dimension (just standardizing: z = (X - mean of X) / s.d. of X)
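A minimal sketch (my own illustration) of z-score standardization so that no single feature dominates the distance just because of its units or scale.

```python
import numpy as np

def zscore(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std   # each column now has mean 0 and s.d. 1

X = np.array([[0.2,  500000.0],
              [0.8, -250000.0],
              [0.5,  100000.0]])
print(zscore(X))
```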
Problems kNN
some dimensions may be more informative about the class than others
Can we take this into account in the k-NN algorithm?
Do we need to "forget" training examples when the (training) dataset keeps growing?
Advantages kNN
fast learner (since there is no abstraction)
fast classification possible (using smart indexing structures like k-d-trees)
directly provides illustrative examples (–> the k-nearest neighbors)
Cosine distance
Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter; cosine distance = 1 - cosine similarity.
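A minimal sketch (my own illustration): cosine similarity depends only on the angle between two vectors, so rescaling a vector does not change it.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                                               # same direction, much larger magnitude
print(cosine_similarity(a, b))                           # 1.0 -> cosine distance 1 - 1.0 = 0.0
print(cosine_similarity(a, np.array([3.0, 2.0, 1.0])))   # < 1.0 (different direction)
```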
Decision tree classifier
a classification scheme which generates a tree and a set of rules from a given dataset.
It takes one feature at a time and tests a condition on it. Each node tests a condition on a feature (an if-else statement).
The order of the nodes is important: the first question is the one that maximizes the information gain (drop in entropy) from the answer.
Based on information theory.
Normalisation does not help; there is no need to do it
The decision boundaries are PERPENDICULAR to the instance-space-axes
PRUNING is a technique that reduces the size of a decision tree by removing sections of the tree that provide little power to classify instances –> REDUCES OVERFITTING
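A sketch (my own illustration) of a decision tree classifier in scikit-learn; `ccp_alpha` enables cost-complexity pruning, one way to prune: larger values remove more of the tree, which reduces overfitting at the cost of some training accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
print(export_text(tree))   # the learned if-else rules, one test per node
```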
ID3 algorithm (decision tree)
Basic algorithm
- all (training) instances are assigned to the root
- the next attribute (test) is selected - splitting strategy
- the training set is partitioned using the split attribute
- proceed for all partitions recursively –> locally optimizing algorithm
Stopping criterion
- no more splitting attributes
- all instances of the node belong to exactly one class.
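A compact sketch of the ID3 idea (my own illustration, not the lecture's exact algorithm; the weather example is hypothetical): at every node pick the attribute with the highest information gain, partition the data on it, and recurse until a node is pure or no attributes are left.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attr):
    total = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:                 # stopping: all instances in one class
        return labels[0]
    if not attributes:                        # stopping: no more splitting attributes
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[(best, value)] = id3([rows[i] for i in idx],
                                  [labels[i] for i in idx],
                                  [a for a in attributes if a != best])
    return tree

rows = [{"outlook": "sunny", "windy": False}, {"outlook": "sunny", "windy": True},
        {"outlook": "rainy", "windy": False}, {"outlook": "rainy", "windy": True}]
labels = ["yes", "no", "yes", "yes"]
print(id3(rows, labels, ["outlook", "windy"]))
```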
Entropy
Tells us how pure or impure a subset is. For two classes it is a number between 0 and 1 (bits).
If the entropy is 1, you are totally uncertain: there is a 50 percent chance it is yes and 50 percent it is no.
If the entropy is 0, you are totally certain, 100% sure what the outcome label is going to be, it is always yes or always no.
Information Gain
Expected drop in entropy after the split.
If I split on this attribute, how much more certain am I going to be after the split, compared to before the split?
You want entropy to be low (entropy of 0 = 100% certain, pure) and information gain to be high (so that entropy goes down).
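A quick numeric check of these claims (my own sketch; the split proportions are hypothetical).

```python
import math

def entropy(p_yes):
    p_no = 1 - p_yes
    return -sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)

print(entropy(0.5))   # 1.0 bit  -> totally uncertain (50/50 yes/no)
print(entropy(1.0))   # 0.0 bits -> totally certain (pure subset)

# Information gain = entropy(parent) - weighted entropy of the children after the split.
parent = entropy(0.5)                                    # e.g. 8 yes / 8 no
children = 0.5 * entropy(0.75) + 0.5 * entropy(0.25)     # two equal-sized child nodes
print(parent - children)                                 # expected drop in entropy
```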
Complexity of induced model DT
The complexity of the model induced by a decision tree is determined by the depth of the tree
Increasing the depth of the tree increases the number of decision boundaries.
All DB are PERPENDICULAR to the feature axes, because at each node a decision is made about a single feature.
Advantages of DT
simple to understand and interpret
work with relatively little data
help to find which feature is most important for classification
provide a rule base (set of if-then rules)
Multiclass Classification - One-vs-all algorithm classifier
e.g. if you have 3 classes, you turn your dataset into 3 separate binary classification problems.
You get 3 LINEAR decision boundaries (even a vertical boundary, whose slope is undefined, is still linear). Data points can end up in 0 or more classes.
Train a logistic regression classifier for each class i to predict the probability that y = i.
To make a prediction on a new input x, we run the 3 classifiers on x and pick the class whose classifier is most confident that it is the right class.
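A sketch (my own illustration, using the Iris dataset as a stand-in 3-class problem) of one-vs-all: train one binary logistic regression per class (class i vs. everything else), then pick the class whose classifier assigns the highest probability to the new input.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)           # 3 classes

classifiers = []
for i in np.unique(y):
    clf = LogisticRegression(max_iter=1000).fit(X, (y == i).astype(int))
    classifiers.append(clf)                 # one "class i vs. rest" classifier per class

x_new = X[60].reshape(1, -1)                # some example input
scores = [clf.predict_proba(x_new)[0, 1] for clf in classifiers]  # estimated P(y = i)
print(int(np.argmax(scores)))               # class of the most confident classifier
```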