Lecture 2 Flashcards

1
Q

What is a learning example

A
  • decompose objects/inputs into features (from all characteristics of an instance, pick the important ones)
  • a feature is a measurable aspect of an object/instance
  • features are extracted before learning
  • some learning algorithms can extract features themselves from certain types of input (e.g. images or text)
2
Q

a feature

A

a measurable aspect of an object

3
Q

feature transformation

A

New features of X are often transformations of existing features. These transformations are part of pre-processing, but they can make a problem much easier.

Examples: PCA, neural networks, scaling and normalisation.

example: from Cartesian coordinates to polar coordinates
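
A minimal sketch of that last example, assuming NumPy (the helper name to_polar is just illustrative): converting Cartesian (x, y) features into polar (r, theta) features as a pre-processing step.

```python
import numpy as np

def to_polar(X):
    """Map an (n, 2) array of Cartesian (x, y) points to polar (r, theta)."""
    x, y = X[:, 0], X[:, 1]
    r = np.sqrt(x**2 + y**2)      # distance from the origin
    theta = np.arctan2(y, x)      # angle in radians
    return np.column_stack([r, theta])

# A class that forms a ring around the origin is hard to separate linearly
# in (x, y), but becomes separable on the single transformed feature r.
X = np.array([[1.0, 2.0], [3.0, 5.0], [-2.0, 0.5]])
print(to_polar(X))
```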

4
Q

name some classifiers

A

logistic regression
kNN
decision tree

5
Q

Decision boundaries logistic regression

In short:

  • DB can be linear function or polynomial function (both with one or more variables)
  • iterative
A
  • the regression coefficients of the decision boundary are usually estimated using maximum likelihood estimation.
  • g( f(x) ) = the decision boundary is a linear (or multiple linear) equation,
    but it CAN ALSO BE a polynomial function (the classifier still always predicts one of 2 classes: Y hat = 0 or Y hat = 1)
  • unlike linear regression, you cannot find a closed-form solution
  • ITERATIVE PROCESS until the process has converged
  • function composition is an operation that takes two functions f and g and produces a function h such that h(x) = g(f(x)). In this operation, the function g is applied to the result of applying the function f to x.
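
A hedged sketch with scikit-learn (a toy dataset, not the lecture's): the fitted coefficients define a linear decision boundary w1*x1 + w2*x2 + b = 0, and the fit is found iteratively rather than in closed form.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)        # toy binary labels

clf = LogisticRegression(max_iter=1000).fit(X, y)  # iterative maximum-likelihood fit
w, b = clf.coef_[0], clf.intercept_[0]
print(f"boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
print(clf.predict([[1.0, 1.0]]))                   # Y hat is 0 or 1
```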
6
Q

logistic regression

A

The Y variable is binary (nominal) and the X variables can be categorical or numeric.

7
Q

K-nearest Neighbor

A

Simple idea: similarity (distance, input can be numeric or categorical)

Given a new example Xj,

we look for the most similar example(s) in the training set (TrS)

and predict the same target for Xj.

The key component of kNN is the distance function. Depending on how you define distance you can get very different classifiers / performance.

8
Q

Measurement of / defining similarity

  • small distance: similar objects
  • large distance: dissimilar objects
A

Distance functions

Numeric (5):

  • General Lp-metric (Minkowski)
  • Euclidean distance (p=2)
  • Manhattan distance (p=1)
  • Maximum metric (Chebyshev) (p = infinite)
  • Cosine Distance

https://www.sciencedirect.com/topics/computer-science/minkowski-distance –> see Figure 2.23.

Categorical (2):

  • Hamming distance (p=0)
  • Levenshtein distance
9
Q

Hamming distance

A

Looks at each attribute and checks whether the values are equal or not; count the attributes that differ.

  1. KAROLIN and KATHRIN have a Hamming distance of 3, because ROL in the first word differs from THR in the second.
  2. 1 0 1 1 1 0 1 and 1 0 0 1 0 0 1 have a Hamming distance of 2.
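
A tiny illustrative sketch (plain Python, the helper name hamming is hypothetical) that reproduces both examples:

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("KAROLIN", "KATHRIN"))                          # 3
print(hamming([1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0, 1]))  # 2
```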
10
Q

Chebyshev distance

A

Generalisation of the Minkowski distance for p –> infinity.

Let’s use two objects, x1 = (1, 2) and x2 = (3, 5). The second attribute gives the greatest difference between values for the objects, which is 5 − 2 = 3. This is the Chebyshev distance.

11
Q

Manhattan distance

A

You can only move along the axes (sideways and up, like walking city blocks).

Let x1 = (1, 2) and x2 = (3, 5) represent two objects. The Manhattan distance between the two is 2 + 3 = 5

(where 2 = 3 - 1 and 3 = 5 - 2).

12
Q

Euclidean distance

A

In 2 dimensions:

Let x1 = (1, 2) and x2 = (3, 5) represent two objects. The Euclidean distance between the two is the square root of (2^2 + 3^2) ≈ 3.61

(where 2 = 3 - 1 and 3 = 5 - 2).
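
A small sketch, assuming NumPy, that reproduces the worked examples above for x1 = (1, 2) and x2 = (3, 5); Manhattan, Euclidean and Chebyshev are all members of the Minkowski family.

```python
import numpy as np

x1, x2 = np.array([1, 2]), np.array([3, 5])
diff = np.abs(x1 - x2)               # per-feature differences: [2, 3]

print(diff.sum())                    # Manhattan (p=1): 5
print(np.sqrt((diff**2).sum()))      # Euclidean (p=2): ~3.61
print(diff.max())                    # Chebyshev (p=inf): 3
```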

13
Q

Finding the nearest neighbor

A

Learning as MEMORIZATION

Given a test point, measure the distances to all the training points and pick the k nearest ones

Their labels define the estimated label of the test point (e.g. by a majority vote over the k neighbors)
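
A rough from-scratch sketch of this procedure (the names knn_predict, X_train etc. are just illustrative): store the training set, compute distances from the test point to all training points, and let the k nearest labels vote.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                         # indices of the k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority label

X_train = np.array([[1, 2], [3, 5], [0, 0], [4, 4]])
y_train = np.array([0, 1, 0, 1])
print(knn_predict(X_train, y_train, np.array([3, 4]), k=3))
```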

14
Q

Choose right value of k

A

if you pick k too LARGE –> UNDERFITTING
- everything is classified as the most probable class

if you pick k too SMALL –> OVERFITTING (variability, unstable decision boundaries)

15
Q

units in kNN

A

Units do matter in kNN

Suppose you have a dataset (m “examples” by n “features”) in which all but one feature have values strictly between 0 and 1, while a single feature has values ranging from -1000000 to 1000000. When taking the Euclidean distance between pairs of “examples”, the features whose values lie between 0 and 1 become uninformative, and the algorithm essentially relies on the single feature whose values are substantially larger.

You therefore have to transform the features to standardized units –> z-scores (standardizing: z = (X - mean of X) / s.d. of X).
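
A minimal sketch of that standardization, assuming scikit-learn's StandardScaler (it computes z = (X - mean) / s.d. per feature, which you could also do by hand):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.2,  150000.0],
              [0.9, -300000.0],
              [0.5,   20000.0]])      # the second feature would dominate raw distances

X_std = StandardScaler().fit_transform(X)       # per feature: (x - mean) / s.d.
print(X_std.mean(axis=0), X_std.std(axis=0))    # approximately 0 and 1 per feature
```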

16
Q

Problems kNN

A

some dimensions may be more informative about the class than others

Can we take this into account in the k-NN algorithm?

Do we need to “forget” training examples as the training dataset grows?

17
Q

Advantages kNN

A

fast learner (since there is no abstraction)

fast classification possible (using smart indexing structures like k-d-trees)

directly provides illustrative examples (–> the k-nearest neighbors)

18
Q

Cosine distance

A

Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter.

19
Q

Decision tree classifier

A

A classification scheme which generates a tree and a set of rules from a given dataset.

It takes one feature at a time and tests a binary condition. Each node tests a condition on a feature (an if/else statement).

The order of the nodes is important. The first question maximizes the information gained from the answer (measured with entropy).

Based on information theory.

Normalisation does not help, so there is no need to do it.

The decision boundaries are PERPENDICULAR to the instance-space axes.

PRUNING is a technique that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances –> REDUCES OVERFITTING.
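
A hedged scikit-learn sketch (the dataset and settings are illustrative): each node tests a single feature, entropy-based splits mirror information gain, and max_depth / ccp_alpha are one way to limit size or prune.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy",  # information-gain style splits
                              max_depth=3,          # cap the tree's complexity
                              ccp_alpha=0.01)       # cost-complexity pruning
tree.fit(X, y)
print(export_text(tree))   # every node is an if/else test on one feature
```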

20
Q

ID3 algorithm (decision tree)

A

Basic algorithm

  • all (training) instances are assigned to the root
  • the next attribute (test) is selected - splitting strategy
  • the training set is partitioned using the split attribute
  • proceed for all partitions recursively –> locally optimizing algorithm

Stopping criteria

  • no more splitting attributes
  • all instances of the node belong to exactly one class.
21
Q

Entropy

A

Tells us how pure or impure a subset is. Number between 0 and 1 (bits).

If the entropy is 1, you are totally uncertain: there is a 50 percent chance of yes and 50 percent of no.

If the entropy is 0, you are totally certain: you are 100% sure what the outcome label is going to be; it is always yes or always no.

22
Q

Information Gain

A

Expected drop in entropy after the split.

If I split on this attribute, how much more certain am I going to be after the split, compared to before the split?

You want entropy to be low (entropy of 0 = 100% certain, pure) and information gain to be high (so entropy goes down).
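
A small illustrative sketch (plain NumPy, names are hypothetical) of entropy and information gain for a binary label split on one categorical attribute:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the label proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, attribute):
    """Entropy before the split minus the weighted entropy after the split."""
    after = 0.0
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        after += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - after

y    = np.array([1, 1, 0, 0, 1, 0, 1, 0])                 # 4 yes, 4 no -> entropy 1.0
attr = np.array(["a", "a", "b", "b", "a", "b", "a", "b"])  # splits the classes perfectly
print(entropy(y), information_gain(y, attr))               # 1.0 1.0
```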

23
Q

Complexity of induced model DT

A

The complexity of the model induced by a decision tree is determined by the depth of the tree

Increasing the depth of the tree increases the number of decision boundaries

All DB are PERPENDICULAR to the feature axes, because at each node a decision is made about a single feature.

24
Q

Advantages of DT

A

simple to understand and interpret

work with relatively little data

help to find which feature is most important for classification

rule-based

25
Q

Multiclass Classification - One-vs-all algorithm classifier

A

e.g. if you have 3 classes, you turn your dataset into 3 separate binary classification problems.

You get 3 LINEAR decision boundaries (even a vertical boundary, whose slope is undefined, is still linear). Data points could fall in 0 or more classes.

Train a logistic regression classifier for each class i to predict the probability that y = i.

To make a prediction on a new input x, we run the 3 classifiers on x and pick the class whose classifier is most confident.
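
A hedged sketch of the one-vs-all idea with logistic regression (scikit-learn also ships a OneVsRestClassifier; this manual loop just mirrors the steps above on an assumed 3-class dataset):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                 # 3 classes

classifiers = {}
for i in np.unique(y):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, (y == i).astype(int))              # binary problem: class i vs the rest
    classifiers[i] = clf

def predict(x):
    """Run all 3 classifiers on x and pick the most confident one."""
    probs = {i: clf.predict_proba([x])[0][1] for i, clf in classifiers.items()}
    return max(probs, key=probs.get)

print(predict(X[0]), y[0])
```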

26
Q

Underfit (you can only tell by comparing a val/train set error with a test set error)

High bias, because the fitted line is far from the data points, but low variance, because future, different datasets are likely to fit this line about equally well (or poorly).

https://www.youtube.com/watch?v=fDQkUN9yw44

A
  • High bias (and low variance):
    ** High training error
    ** High CV error; approximately as high as the training error
27
Q

Overfit (you can only tell by comparing a val/train set error with a test set error)

Low bias, because the line fits the training data points almost perfectly, but high variance, because it is unlikely to fit future datasets equally well. So it is very unlikely that this line describes the underlying pattern; we cannot generalize well beyond the training data.

https://www.youtube.com/watch?v=fDQkUN9yw44

A
  • fits training data very closely
  • able to predict very well on training data
  • does not generalize well to unseen data (test set)
  • High variance (and low bias):
      • Low training error
      • High CV error; CV error >> training error

Common causes of overfit:

a) Too complex a model (too high a polynomial degree and/or too many features)
b) Noisy data, i.e. outliers and errors in the data
c) The amount of training data may not be enough

28
Q

Variance

A

Difference in fit across different test datasets –> the distance between the test data and the fitted line

29
Q

Hyperparameter

A

the variable part of the model which is not set during learning on the training data

needs to be tuned on a validation set

30
Q

Model Parameter

A
  • required by the model when making predictions
  • values define the skill of the model on your problem (we need to use some data to estimate parameters, e.g. for logistic regression we have to use some data to estimate the shape of the S-curve)
  • Are learned from data
31
Q

Model Parameter

A
  • required by the model when making predictions
  • values define the skill of the model on your problem.
  • Are learned from data
  • Often not set manually by the practitioner
  • Often saved as part of the learned model
  • We need to use some data to estimate parameters, e.g. for logistic regression we have to use some data to estimate the shape of the S-curve. Estimating parameters is called ‘training the algorithm’.
32
Q

Model Hyperparameter

A
  • Often used in processes to help estimate model parameters
  • Often specified by the practitioner
  • Can be set using heuristics
  • Often tuned for a given predictive modeling problem
    e.g. the k, the neighbor weights and the distance metric (3 hyperparameters) in k-nearest neighbors. You can control the fit in kNN by changing k: smaller k - tighter fit, larger k - looser fit.
33
Q

Grid search: hyperparameter optimization

A

Systematic search for best hyperparameter settings

Manually choose values to test for each hyperparameter.

Check validation accuracy/error for all combinations

The number of hyperparameter combinations expands quickly, but the hyperparameters often behave relatively independently of each other.

Lots of computation
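
A minimal sketch, assuming scikit-learn's GridSearchCV, of a grid search over two kNN hyperparameters (k and the neighbor weighting); it checks cross-validated accuracy for every combination.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9],
              "weights": ["uniform", "distance"]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)  # tries all combinations
search.fit(X, y)
print(search.best_params_, search.best_score_)   # best combination and its CV accuracy
```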

34
Q

Evaluating performance regression task

A

Coefficient of determination (R2) –> nested

Root Mean Squared Error –> sensitive to outliers (because you square the errors)

Mean Absolute Error –> uses the absolute value, not sensitive to outliers
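
A short sketch of the three measures with scikit-learn's metrics (toy numbers, just to show the calls):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 8.0])

print(r2_score(y_true, y_pred))                     # R2
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE (square root of MSE)
print(mean_absolute_error(y_true, y_pred))          # MAE
```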

35
Q

R2

A

How well the model predicts targets relative to the mean

Equivalent to PROPORTION OF VARIANCE EXPLAINED BY THE MODEL

The mean is not always a suitable baseline

36
Q

Evaluating performance classification task

A

Confusion matrix

Accuracy, precision, recall, F1 score

Area Under the ROC curve –> can inspect the outcome for different THRESHOLDS (criteria)

37
Q

Accuracy

A

(TN + TP) / all values

The proportion of predictions your model got right: the predicted negatives and predicted positives that were actually labeled negative or positive.

38
Q

False positive

A

The model predicted those datapoints to be positive, but the true label was negative

You have some - values in your + prediction class

39
Q

False negative

A

The model predicted those datapoints to be negative, but the true label is positive

You have some + values in your - prediction class

40
Q

Recall

A

TP / (TP + FN)

fraction of all actual positives that are correctly identified as positive

Evaluation metric to use when you want a model that rarely fails to detect a true positive case (cancer, terrorist)

41
Q

Precision

A

TP / (TP + FP)

fraction of the predicted positives that are truly positive

Evaluation metric to use when avoiding false positives is important. We accept that not all positive instances are detected, but of the instances predicted positive we want to be sure that they are truly positive.

42
Q

Specificity

A

TN / (TN + FP)

fraction of actual negatives that the classifier correctly identifies as negative

(the related quantity FP / (FP + TN) is the false positive rate: the fraction of actual negatives incorrectly identified as positive)
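
A small sketch computing these metrics from a confusion matrix (using scikit-learn's convention that .ravel() returns tn, fp, fn, tp for a binary problem; the labels are toy data):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   ", (tp + tn) / (tp + tn + fp + fn))
print("recall     ", tp / (tp + fn))
print("precision  ", tp / (tp + fp))
print("specificity", tn / (tn + fp))
```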

43
Q

Recall-oriented ML tasks

A

The consequence of not correctly identifying a positive case is high

  • legal discovery
  • tumor detection/earthquakes
  • often paired with a human expert to filter the false positives
44
Q

Precision-oriented ML tasks

A

The consequence of a false positive is high (e.g. predicting that someone likes something when they do not, as with YouTube recommendations). You don't want to recommend something that someone does not like; people are not happy with that.

45
Q

Overfit in a confusion matrix

A

https://datascience.stackexchange.com/questions/28426/train-accuracy-vs-test-accuracy-vs-confusion-matrix

  • Difference in train and test accuracy tells us IF model is overfitting.

Train accuracy is higher than test accuracy = overfit
Train accuracy is lower than test accuracy = underfit

  • Confusion matrix tells us HOW MUCH it is overfitting
46
Q

Cross validation

A

To estimate the performance of the model learned from the available data with one algorithm –> evaluating how well we trained the model (how well the estimated parameters work).

To compare the performance of two or more different algorithms and find the best algorithm for the available data –> evaluating how well the machine learning methods work (compared to each other).

47
Q

CV methods

A

Hold out

K-fold cross validation

Leave one out
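
A hedged sketch of the three methods with scikit-learn (the model and dataset are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Hold-out: a single train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(model.fit(X_tr, y_tr).score(X_te, y_te))

# k-fold: average score over k splits
print(cross_val_score(model, X, y, cv=5).mean())

# Leave-one-out: as many splits as there are examples (slowest)
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```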

48
Q

Downsampling: unbalanced classes

A
creates a balanced dataset by matching the number of samples in the minority class with a random sample from the majority class
49
Q

Upsampling: unbalanced classes

A
matches the number of samples in the majority class with resampling from the minority class
50
Q

CV

Time from high to low

A

leave one out
k fold
hold out

51
Q

CV

dataset from large to small

A

hold out
k fold
leave one out
bootstrapping

52
Q

Generalization

A

Ability to give accurate predictions for new, previously unseen data

53
Q

Assumptions for generalization

A
  • Future unseen data will have the same properties (probability distribution, spread, correlation, etc.) as the current training set
  • So if the model is accurate on the training set, it should be accurate on the test set
  • But this may not happen if the trained model is tuned too specifically to the training set
54
Q

Overfit (AMLwithP)

A

Can capture complex patterns by being great at predicting lots and lots of specific data samples or areas of local variation, but it often misses the global pattern in the training set that would help it generalize to the test set.

55
Q

Clustering

Types of clustering methods

https://youtu.be/CiA3ca7W7Eg

A
  • Discover the underlying structure of the data.
    Identify a finite set of categories, classes or groups.
  • Objects in the same cluster should be as similar as possible. Objects in different clusters should be as dissimilar as possible.

What sub-populations, groups exist in the data?

  • How many are there?
  • What are their sizes?
  • Do elements in a sub-population have any common properties?
  • Are sub-populations cohesive? Can they be further split up?
  • Are there outliers?
56
Q

Types of clustering (C) methods

https://youtu.be/CiA3ca7W7Eg

A

• Partitioning methods
  • Hyperparameters: k, distance function
  • Goal: partition into k C’s with minimal cost
  • Place k centroids at random places in the space
  • For each point i, find the nearest centroid and assign the point to that cluster j
  • Update each centroid (e.g. compute the cluster mean, as in k-means clustering; see the sketch after this list)
  • Stop when none of the C assignments change

• Hierarchical methods
  • HP: distance function for points and for C’s
  • Determines a hierarchy of C’s by repeatedly combining the respective most similar C’s

• Density-based methods
  • Parameters: minimum points per C, distance

=================================

• Categorize by goal:

    • Monothetic: members have some common property (e.g. all are males aged 15-25)
    • Polythetic: cluster members are similar to each other (distance defines membership)

• Categorize by overlap:

    • Hard clustering (elements either belong to a cluster or not)
    • Soft clustering (clusters may overlap)

• Flat & hierarchical
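
A brief sketch of the partitioning steps listed above, using scikit-learn's KMeans instead of writing the centroid loop by hand (the two-blob data is made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])              # two artificial blobs

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # k is the hyperparameter
print(km.cluster_centers_)   # one centroid (mean) per cluster
print(km.labels_[:10])       # hard cluster assignment per point
```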

57
Q

Defining similarity with distance (D) functions

A

Requirements for distance (D) functions:

  • The D between two points is non-negative.
  • The D from a point to the point itself is equal to 0.
  • The distance from point i to point j is exactly equal to the distance from point j to point i (symmetry).
  • The distance from point i to point j via a third point k is never shorter than the direct distance from i to j (triangle inequality).
58
Q
Distance functions
(See lecture 2)
A

A similarity measure measures how much alike two data objects are. In a data mining context, a similarity measure is typically based on a distance whose dimensions represent features of the objects.

If this distance is small, there is a high degree of similarity; if the distance is large, there is a low degree of similarity.

59
Q

Distance functions

A

Numeric (5):

  • General Lp-metric (Minkowski) –> generalization of the Euclidean and Manhattan distances (if p = 2, d = Euclidean distance; if p = 1, d = Manhattan distance)
  • Euclidean distance (p=2)
  • Manhattan distance (p=1)
  • Maximum metric (Chebyshev) (p = infinity)
  • Cosine Distance

https://www.sciencedirect.com/topics/computer-science/minkowski-distance –> see Figure 2.23.

Categorical (2):

  • Hamming distance (p=0)
  • Levenshtein distance
60
Q

Similarity functions

A

Although no single definition of a similarity measure exists, usually such measures are in some sense the inverse of distance metrics.

  • Cosine similarity
  • Pearson's r
  • OLS coefficient

https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

61
Q

Cosine Similarity

A
  • Similar: angle is near 0 degrees and the cosine is near 1 (i.e. 100%)
  • Unrelated: angle is near 90 degrees and the cosine is near 0 (i.e. 0%)
  • Opposite: angle is near 180 degrees and the cosine is near -1 (i.e. -100%)

Cosine distance ignores the magnitude of the vectors

  • See table for degree and cosine clarification:
    https://socratic.org/questions/how-do-you-tell-whether-the-value-of-tan-90-degrees-is-positive-negative-zero-or
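
A small sketch (plain NumPy) of cosine similarity as the cosine of the angle between two vectors, ignoring their magnitudes:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))    #  1.0 (same direction)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))    #  0.0 (orthogonal)
print(cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0])))  # -1.0 (opposite)
```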
62
Q

K means clustering

  • Polythetic
  • Hard boundaries
  • Flat
  • K is hyperparameter
A
  • Data is partitioned into K sub-populations. K is a hyperparameter and needs to be specified.
  • Associates each point with a ‘centroid’. A centroid = attribute-value representation of a cluster, a sort of prototypical individual in the sub-population.
  • Objective:
  • Minimize the pairwise squared deviations within a cluster (intra-cluster)
  • Maximize the sum of squared deviations between different C’s (inter-cluster)
63
Q

Silhouette coefficient

  • Measure how well clustering has performed
  • Helps pick the right value of K
  • Where s(i) is closest to 1
A

How well clustering has performed

  • Pick range of candidates of values for K
  • Calculate the silhouette coefficient s(i) for each point in the dataset

s(i) = (b(i) - a(i)) / max{ a(i), b(i) }

a(i) = the average distance of point i to the other points in the same cluster (intra-cluster distance) –> these points need to be as similar as possible, so a(i) should be near 0

b(i) = the average distance of point i to all the points in the nearest other cluster (between-cluster distance) –> these should be as dissimilar as possible (so that distance should be as large as possible)

  • ideally a(i) << b(i)
  • Ideal value of the silhouette = 1 and worst possible value = -1
  • Plot silhouettes for each value of K to identify outliers
  • if a(i) > b(i), it is likely that the point is misclassified. Outliers/misclassified instances have s(i) less than zero (they appear on the left side of the plots).
  • Pick the K where average silhouette is closest to 1
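
A hedged sketch of picking K with the silhouette score, assuming scikit-learn (silhouette_score averages s(i) over all points; the three-blob data is made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(40, 2)) for c in (0, 4, 8)])  # three blobs

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # pick the K whose average silhouette is closest to 1
```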