Lecture 2 Flashcards

1
Q

What is a learning example

A
  • decompose objects into features (from all characteristics, pick the important characteristics of the instances)
  • decompose inputs into features (meaning unclear from the lecture)
  • a feature is a measurable aspect of an object/instance
  • features are extracted before learning
  • some learning algorithms can extract features themselves from certain types of input (e.g. images or text)
2
Q

a feature

A

a measurable aspect of an object

3
Q

feature transformation

A

New features of X are often transformations of existing features. These transformations are part of pre-processing, but they can make a problem much easier.

Examples: PCA, neural networks, scaling and normalisation.

Example: transforming Cartesian coordinates to polar coordinates (see the sketch below).
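A minimal sketch (my own illustration, not from the slides) of the Cartesian-to-polar transformation in Python, assuming 2-D numeric features; a ring-shaped class that is not linearly separable in (x, y) becomes separable on the radius alone:

```python
import numpy as np

def cartesian_to_polar(X):
    """Transform 2-D Cartesian features (x, y) into polar features (r, theta)."""
    x, y = X[:, 0], X[:, 1]
    r = np.sqrt(x**2 + y**2)        # distance from the origin
    theta = np.arctan2(y, x)        # angle in radians
    return np.column_stack([r, theta])

# toy example: two concentric rings, the label depends only on the radius
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.where(rng.random(200) < 0.5, 1.0, 3.0)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X_polar = cartesian_to_polar(X)     # a single threshold on r now separates the two rings
```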

4
Q

name some classifiers

A

logistic regression
kNN
decision tree

5
Q

Decision boundaries logistic regression

In short:

  • the decision boundary (DB) can be a linear or a polynomial function (both with one or more variables)
  • fitting is iterative
A
  • the regression coefficients that define the decision boundary are usually estimated using maximum likelihood estimation
  • g(f(x)) = the decision boundary is a linear (or multiple linear) equation;
    it CAN ALSO BE a polynomial function (but the classifier always separates 2 classes, ŷ = 0 or ŷ = 1)
  • unlike linear regression, you cannot find a closed-form solution
  • an ITERATIVE PROCESS until it has converged
  • function composition is an operation that takes two functions f and g and produces a function h such that h(x) = g(f(x)). In this operation, the function g is applied to the result of applying the function f to x.
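A minimal sketch (my own illustration, assuming numeric features and scikit-learn, which is not named in the lecture): the coefficients are fitted iteratively by maximum likelihood, and the linear decision boundary is where the estimated probability equals 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: two Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression()          # fitted iteratively; no closed-form solution exists
clf.fit(X, y)

# decision boundary: w0*x0 + w1*x1 + b = 0, i.e. where the predicted P(y=1|x) = 0.5
w, b = clf.coef_[0], clf.intercept_[0]
print(f"boundary: {w[0]:.2f}*x0 + {w[1]:.2f}*x1 + {b:.2f} = 0")
print(clf.predict_proba(X[:3]))     # g(f(x)): the sigmoid g applied to the linear function f
```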
6
Q

logistic regression

A

The Y variable is binary (nominal) and the X variables can be categorical or numeric.

7
Q

K-nearest Neighbor

A

Simple idea: similarity (distance; the input can be numeric or categorical).

Given a new example Xj,

we look for the most similar example(s) in the training set (TrS)

and predict the same target for Xj.

The key component of kNN is the distance function. Depending on how you define distance, you can get very different classifiers / performance.

8
Q

Measurement of / defining similarity

  • small distance ; similar object
  • large distance ; dissimilar object
A

Distance functions

Numeric (5):

  • General Lp-metric (Minkowski)
  • Euclidean distance (p=2)
  • Manhattan distance (p=1)
  • Maximum metric (Chebyshev) (p = infinity)
  • Cosine Distance

https://www.sciencedirect.com/topics/computer-science/minkowski-distance –> see Figure 2.23.

Categorical (2):

  • Hamming distance (p=0)
  • Levenshtein distance
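A minimal sketch (my own illustration) of how the general Lp / Minkowski metric covers the other numeric distances; the example points (1, 2) and (3, 5) are the ones reused in the cards below:

```python
import numpy as np

def minkowski(a, b, p):
    """General Lp (Minkowski) distance between two numeric vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if np.isinf(p):                    # maximum metric (Chebyshev)
        return np.max(np.abs(a - b))
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

x1, x2 = [1, 2], [3, 5]
print(minkowski(x1, x2, 1))        # Manhattan: |3-1| + |5-2| = 5
print(minkowski(x1, x2, 2))        # Euclidean: sqrt(2^2 + 3^2) ≈ 3.61
print(minkowski(x1, x2, np.inf))   # Chebyshev: max(2, 3) = 3
```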
9
Q

Hamming distance

A

Looks at each attribute and checks whether the values are equal or not; count the attributes that differ.

  1. KAROLIN and KATHRIN: the Hamming distance is 3, because ROL in the first word differs from THR in the second.
  2. 1 0 1 1 1 0 1 and 1 0 0 1 0 0 1: the Hamming distance is 2.
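A minimal sketch (assuming sequences of equal length) that reproduces both examples:

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("KAROLIN", "KATHRIN"))                              # 3
print(hamming([1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0, 1]))      # 2
```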
10
Q

Chebyshev distance

A

The generalisation of the Minkowski distance for p → infinity.

Let’s use two objects, x1 = (1, 2) and x2 = (3, 5). The second attribute gives the greatest difference between values for the objects, which is 5 − 2 = 3. This is the Chebyshev distance.

11
Q

Manhattan distance

A

You can only move along the axes, like walking city blocks: sideways and up/down.

Let x1 = (1, 2) and x2 = (3, 5) represent two objects. The Manhattan distance between the two is 2 + 3 = 5.

(2 = 3 − 1 and 3 = 5 − 2)

12
Q

Euclidean distance

A

In 2 dimensions:

Let x1 = (1, 2) and x2 = (3, 5) represent two objects. The Euclidean distance between the two is the square root of (2^2 + 3^2) ≈ 3.61.

(2 = 3 − 1 and 3 = 5 − 2)

13
Q

Finding the nearest neighbor

A

Learning as MEMORIZATION

Given a test point, measure the distances to all the training points and pick the k nearest ones

Their labels define the estimated label of the test point (e.g. by majority vote); see the sketch below.
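A minimal from-scratch sketch of this memorization view (my own illustration, using Euclidean distance and a majority vote):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest examples
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 2], [2, 3], [8, 8], [9, 7]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))   # "A"
```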

14
Q

Choose right value of k

A

If you pick k too LARGE → UNDERFITTING:
- everything is classified as the most probable class.

If you pick k too SMALL → OVERFITTING (high variability, unstable decision boundaries).

15
Q

units in kNN

A

Units do matter in kNN.

Suppose you had a dataset (m "examples" by n "features") and all but one feature dimension had values strictly between 0 and 1, while a single feature dimension had values ranging from -1000000 to 1000000. When taking the Euclidean distance between pairs of "examples", the feature dimensions that range between 0 and 1 become uninformative and the algorithm essentially relies on the single dimension whose values are substantially larger.

You have to transform the features to standardized units → z-scores (standardizing): z = (X − mean of X) / s.d. of X (see the sketch below).
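A minimal sketch of z-score standardization before kNN (assuming numeric features; the mean and standard deviation must be computed on the training set only and then reused for test data):

```python
import numpy as np

def zscore_fit(X_train):
    """Per-feature mean and standard deviation, computed on the training set."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def zscore_transform(X, mean, std):
    """z = (X - mean) / std, applied column-wise (assumes std > 0 for every feature)."""
    return (X - mean) / std

X_train = np.array([[0.2,  500_000.0],
                    [0.8, -900_000.0],
                    [0.5,  100_000.0]])
mean, std = zscore_fit(X_train)
X_train_std = zscore_transform(X_train, mean, std)                    # features now on comparable scales
X_test_std = zscore_transform(np.array([[0.4, 50_000.0]]), mean, std)
print(X_train_std)
```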

16
Q

Problems kNN

A

some dimensions may be more informative about the class than others

Can we take this into account in the k-NN algorithm?

Do we need to "forget" training examples as the (training) dataset grows?

17
Q

Advantages kNN

A

fast learner (since there is no abstraction)

fast classification possible (using smart indexing structures like k-d-trees)

directly provides illustrative examples (–> the k-nearest neighbors)

18
Q

Cosine distance

A

Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter.

19
Q

Decision tree classifier

A

A classification scheme which generates a tree and a set of rules from a given dataset.

It takes one feature at a time and tests a binary condition. Each node tests a condition on a feature (an if-else statement).

The order of the nodes is important. The first question maximizes the information gain (drop in entropy) from the answer.

Based on information theory.

Normalisation does not help; there is no need for it.

The decision boundaries are PERPENDICULAR to the instance-space axes.

PRUNING is a technique that reduces the size of a decision tree by removing sections of the tree that provide little power to classify instances → REDUCES OVERFITTING.
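A minimal sketch with scikit-learn (my own illustration, not from the slides): splits are chosen by entropy/information gain, and limiting the depth acts against overfitting much like pruning does:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# criterion="entropy" -> splits chosen by information gain;
# max_depth limits the size of the tree, reducing overfitting
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

print(export_text(clf, feature_names=iris.feature_names))   # the learned if-else rules
```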

20
Q

ID3 algorithm (decision tree)

A

Basic algorithm

  • all (training) instances are assigned to the root
  • the next attribute (test) is selected – splitting strategy
  • the training set is partitioned using the split attribute
  • proceed for all partitions recursively → a locally optimizing algorithm

Stopping criterion

  • no more splitting attributes
  • all instances of the node belong to exactly one class
21
Q

Entropy

A

Tells us how pure or impure a subset is. For two classes it is a number between 0 and 1 (bits).

If the entropy is 1, you are totally uncertain: there is a 50 percent chance of yes or no.

If the entropy is 0, you are totally certain: you are 100% sure what the outcome label is going to be; it is always yes or always no.

22
Q

Information Gain

A

The expected drop in entropy after the split.

If I split on this attribute, how much more certain am I after the split, compared to before the split?

You want entropy to be low (entropy of 0 = 100% certain, pure) and information gain to be high (so entropy goes down). See the sketch below.
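A minimal sketch (my own illustration) of binary entropy and the information gain of a split, assuming the labels are 0/1 numpy arrays:

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a vector of binary class labels."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)                      # fraction of positive labels
    if p == 0.0 or p == 1.0:            # pure subset -> totally certain
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(y, mask):
    """Expected drop in entropy when splitting y according to the boolean mask."""
    n = len(y)
    left, right = y[mask], y[~mask]
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - after

y = np.array([1, 1, 1, 0, 0, 0])
split = np.array([True, True, True, False, False, False])   # perfectly separates the classes
print(entropy(y))                   # 1.0 -> maximally impure before the split
print(information_gain(y, split))   # 1.0 -> entropy drops to 0 after the split
```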

23
Q

Complexity of induced model DT

A

The complexity of the model induced by a decision tree is determined by the depth of the tree.

Increasing the depth of the tree increases the number of decision boundaries.

All decision boundaries are PERPENDICULAR to the feature axes, because at each node a decision is made about a single feature.

24
Q

Advantages of DT

A

simple to understand and interpret

work with relatively little data

help to find which feature is most important for classification

rule-based

25
Multiclass Classification - One-vs-all algorithm classifier
E.g. if you have 3 classes, you turn your dataset into 3 separate binary classification problems, which gives you 3 LINEAR decision boundaries (even a vertical boundary with undefined slope is still linear). Data points could fall into 0 or more classes. Train a logistic regression classifier for each class i to predict the probability that y = i. On a new input x, to make a prediction, run the 3 classifiers on x and pick the class whose classifier is most confident (see the sketch below).
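A minimal sketch of one-vs-all built from separate scikit-learn logistic regressions (my own illustration; the library can also do this internally):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # 3 classes
classes = np.unique(y)

# one binary logistic regression per class: "class i" vs "everything else"
classifiers = {c: LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
               for c in classes}

def predict_one_vs_all(x):
    """Pick the class whose classifier is most confident that x belongs to it."""
    probs = {c: clf.predict_proba(x.reshape(1, -1))[0, 1] for c, clf in classifiers.items()}
    return max(probs, key=probs.get)

print(predict_one_vs_all(X[0]), y[0])      # predicted class vs true class for one example
```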
26
Underfit (you can only tell by comparing the train/validation error with the test error). High bias, because the fit is far away from the data points, but low variance, because future, different datasets are likely to fit this line about equally well. https://www.youtube.com/watch?v=fDQkUN9yw44
- High bias (and low variance): high training error; high CV error, approximately as high as the training error.
27
Overfit (you can only tell by comparing the train/validation error with the test error). Low bias, because the training data points are fitted perfectly, but high variance, because it is unlikely to fit future datasets equally well. We cannot generalize well beyond the training data. https://www.youtube.com/watch?v=fDQkUN9yw44
- Fits the training data very closely; predicts very well on the training data
- Does not generalize well to unseen data (test set)
- High variance (and low bias): low training error, high CV error (CV error >> training error)
Common causes of overfitting:
a) Too complex a model (too high a polynomial degree and/or too many features)
b) Noise in the data, e.g. outliers and errors
c) The amount of training data may not be enough
28
Variance
Difference in fit between different test datasets --> distance between test dataset and fit line
29
Hyperparameter
The variable part of the model that is not set during learning on the training data; it needs to be tuned on a validation set.
30
Model Parameter
- Required by the model when making predictions
- Their values define the skill of the model on your problem (we need to use some data to estimate the parameters, e.g. for logistic regression we have to use some data to estimate the shape of the S-curve)
- Are learned from data
31
Model Parameter
- Required by the model when making predictions
- Their values define the skill of the model on your problem
- Are learned from data
- Often not set manually by the practitioner
- Often saved as part of the learned model
- We need to use some data to estimate the parameters, e.g. for logistic regression we have to use some data to estimate the shape of the S-curve. Estimating the parameters is called 'training the algorithm'.
32
Model Hyperparameter
- Often used in processes that help estimate the model parameters
- Often specified by the practitioner
- Can be set using heuristics
- Often tuned for a given predictive modeling problem, e.g. the k, the neighbor weights and the distance metric (3) in k-nearest neighbors. You can control the fit in kNN by changing k: smaller k = more fit, larger k = less fit.
33
Grid search: hyperparameter optimization
A systematic search for the best hyperparameter settings:
- Manually choose the values to test for each hyperparameter
- Check the validation accuracy/error for all combinations
- The number of combinations expands quickly, but hyperparameters often behave relatively independently (in parallel)
- Lots of computation
See the sketch below.
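A minimal sketch of a grid search over kNN hyperparameters with scikit-learn (my own illustration; the particular grid values are made up):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# every combination of these hyperparameter values is tried
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)   # 5-fold CV as the validation check
search.fit(X, y)
print(search.best_params_, search.best_score_)
```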
34
Evaluating performance regression task
- Coefficient of determination (R²) → how well the model does relative to a nested baseline (the mean)
- Root Mean Squared Error → sensitive to outliers (because you square the errors)
- Mean Absolute Error → uses absolute values, not sensitive to outliers
35
R2
How well the model predicts targets relative to the mean.
Equivalent to the PROPORTION OF VARIANCE EXPLAINED BY THE MODEL.
The mean is not always a suitable baseline.
36
Evaluating performance classification task
- Confusion matrix
- Accuracy, precision, recall, F1 score
- Area Under the ROC curve → lets you inspect the outcome for different THRESHOLDS (criteria)
37
Accuracy
(TP + TN) / all values. The proportion that your model predicted correctly: the predicted negatives and predicted positives were actually labeled negative and positive.
38
False positive
The model predicted those data points to be positive, but the true label was negative. You have some − values in your + prediction class.
39
False negative
The model predicted those data points to be negative, but the true label is positive. You have some + values in your − prediction class.
40
Recall
TP / (TP + FN). The fraction of all actual positives that are correctly identified as positive. The evaluation metric when you want a model that rarely fails to detect the true value (cancer, terrorist).
41
Precision
TP / (TP + FP). The fraction of predicted positives that are truly positive. The evaluation metric when avoiding false positives is important: we accept that not all positive instances are detected, but of the cases predicted positive we want to be sure that they are truly positive.
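A minimal sketch (my own illustration) computing accuracy, recall and precision directly from confusion-matrix counts:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))   # predicted +, actually +
tn = np.sum((y_pred == 0) & (y_true == 0))   # predicted -, actually -
fp = np.sum((y_pred == 1) & (y_true == 0))   # predicted +, actually -
fn = np.sum((y_pred == 0) & (y_true == 1))   # predicted -, actually +

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # fraction of actual positives that were found
precision = tp / (tp + fp)     # fraction of predicted positives that are correct
print(accuracy, recall, precision)   # 0.75 0.75 0.75
```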
42
Specificity
TN / (TN + FP). The fraction of actual negatives that the classifier correctly identifies as negative (the true negative rate). (Note: FP / (FP + TN) is the false positive rate, the fraction of actual negatives incorrectly identified as positive.)
43
Recall-oriented ML tasks
The consequence of not correctly identifying a positive case is high:
- legal discovery
- tumor detection / earthquakes
- often paired with a human expert to filter out the false positives
44
Precision-oriented ML tasks
The consequence of a false positive is high (e.g. predicting that someone likes something when they do not, as in YouTube recommendations). You don't want to recommend something that someone does not like; people are not happy with that.
45
Overfit in a confusion matrix
https://datascience.stackexchange.com/questions/28426/train-accuracy-vs-test-accuracy-vs-confusion-matrix
- The difference between train and test accuracy tells us IF the model is overfitting: train accuracy higher than test accuracy = overfit; train accuracy lower than test accuracy = underfit.
- The confusion matrix tells us HOW MUCH it is overfitting.
46
Cross validation
- To estimate the performance of the learned model from the available data, using one algorithm → estimating the parameters for machine learning methods (evaluating how well we trained the model).
- To compare the performance of two or more different algorithms and find the best algorithm for the available data → evaluating how well the machine learning methods work (compared to each other).
47
CV methods
- Hold-out
- K-fold cross validation (see the sketch below)
- Leave-one-out
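A minimal sketch of k-fold cross validation with scikit-learn (my own illustration; hold-out would be a single train/test split, and leave-one-out is k-fold with k equal to the number of examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, evaluate on the held-out fold, repeat 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)
print(scores.mean(), scores.std())   # average accuracy and its variability across folds
```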
48
Downsampling: unbalanced classes
Creates a balanced dataset by matching the number of samples in the minority class with a random sample from the majority class.
49
Upsampling: unbalanced classes
Matches the number of samples in the majority class by resampling (with replacement) from the minority class. See the sketch below.
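A minimal sketch of both techniques using sklearn.utils.resample (my own illustration; class 1 is assumed to be the minority class):

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])      # 7 majority (0) vs 3 minority (1)

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# downsampling: shrink the majority class to the size of the minority class
X_maj_dn, y_maj_dn = resample(X_maj, y_maj, replace=False, n_samples=len(y_min), random_state=0)

# upsampling: grow the minority class (sampling with replacement) to the size of the majority class
X_min_up, y_min_up = resample(X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0)

y_down = np.concatenate([y_maj_dn, y_min])
y_up = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_down), np.bincount(y_up))      # [3 3] and [7 7]
```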
50
CV Time from high to low
leave-one-out > k-fold > hold-out
51
CV dataset from large to small
hold-out > k-fold > leave-one-out > bootstrapping
52
Generalization
Ability to give accurate predictions for new, previously unseen data
53
Assumptions for generalization
- Future unseen data will have the same properties (probability distribution, spread, correlation, etc.) as the current training set.
- So if the model is accurate on the training set, it should be accurate on the test set.
- But this may not hold if the trained model is tuned too specifically to the training set.
54
Overfit (AMLwithP)
Can capture complex patterns by being great at predicting lots and lots of specific data samples or areas of local variation, but it often misses the global pattern in the training set that would help it generalize to the test set.
55
Clustering https://youtu.be/CiA3ca7W7Eg
- Discover the underlying structure of the data. Identify a finite set of categories, classes or groups.
- Objects in the same cluster should be as similar as possible; objects in different clusters should be as dissimilar as possible.
Questions to ask: What sub-populations/groups exist in the data? How many are there? What are their sizes? Do elements in a sub-population have any common properties? Are sub-populations cohesive; can they be further split up? Are there outliers?
56
Types of clustering (C) methods https://youtu.be/CiA3ca7W7Eg
• Partitioning methods
  - Hyperparameters: k, distance function
  - Goal: partition into k clusters with minimal cost
  - Place k centroids at random positions in space
  - For each point i, find the nearest centroid j and assign the point to cluster j
  - Update each centroid (e.g. compute the mean, as in k-means clustering)
  - Stop when none of the cluster assignments change
• Hierarchical methods
  - Hyperparameter: distance function for points and clusters
  - Determines a hierarchy of clusters by repeatedly combining the respective most similar clusters
• Density-based methods
  - Parameters: minimum points per cluster, distance
• Categorize by goal:
  - Monothetic: members have some common property (e.g. all are males aged 15-25)
  - Polythetic: cluster members are similar to each other (distance defines membership)
• Categorize by overlap:
  - Hard clustering (elements either belong to a cluster or not)
  - Soft clustering (clusters may overlap)
• Flat & hierarchical
57
Defining similarity with distance (D) functions
Requirements for distance (D) functions:
- The D between two points is non-negative.
- The D from a point to the point itself is equal to 0.
- The distance from point i to point j is exactly equal to the distance from point j to point i (symmetry).
- The distance from point i to point j via point k is never shorter than the direct distance from i to j (triangle inequality): d(i, j) ≤ d(i, k) + d(k, j).
58
Distance functions (see Lecture 2)
The similarity measure is a measure of how much alike two data objects are. A similarity measure in a data mining context is a distance with dimensions representing features of the objects. If this distance is small, there is a high degree of similarity; if the distance is large, there is a low degree of similarity.
59
Distance functions
Numeric (5):
- General Lp-metric (Minkowski)*
- Euclidean distance (p=2)
- Manhattan distance (p=1)
- Maximum metric (Chebyshev) (p = infinity)
- Cosine distance
https://www.sciencedirect.com/topics/computer-science/minkowski-distance --> see Figure 2.23.
Categorical (2):
- Hamming distance (p=0)
- Levenshtein distance
* The Minkowski distance is a generalization of the Euclidean and Manhattan distances (if p = 2, d = Euclidean distance; if p = 1, d = Manhattan distance).
60
Similarity functions
Although no single definition of a similarity measure exists, such measures are usually in some sense the inverse of distance metrics.
- Cosine similarity
- Pearson's r
- OLS coefficient
https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/
61
Cosine Similarity
- Similar: the angle is near 0 degrees and the cosine is near 1 (i.e. 100%)
- Unrelated: the angle is near 90 degrees and the cosine is near 0 (i.e. 0%)
- Opposite: the angle is near 180 degrees and the cosine is near -1 (i.e. -100%)
Cosine distance ignores the magnitude of the vectors.
See this table for degree/cosine clarification: https://socratic.org/questions/how-do-you-tell-whether-the-value-of-tan-90-degrees-is-positive-negative-zero-or
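A minimal sketch (my own illustration) of cosine similarity and the corresponding distance:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [2, 0]))    #  1.0 -> same direction (magnitude ignored)
print(cosine_similarity([1, 0], [0, 1]))    #  0.0 -> orthogonal / unrelated
print(cosine_similarity([1, 0], [-1, 0]))   # -1.0 -> opposite direction
# cosine distance is commonly defined as 1 - cosine similarity
```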
62
K means clustering - Polythetic - Hard boundaries - Flat - K is hyperparameter
- Data is partitioned into K sub-populations. K is a hyperparameter and needs to be specified.
- Associates each point with a 'centroid'. A centroid = the attribute-value representation of a cluster, a sort of prototypical individual of the sub-population.
- Objective:
  - Minimize the pairwise squared deviations within a cluster (intra-cluster)
  - Maximize the sum of squared deviations between different clusters
See the sketch below.
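A minimal from-scratch sketch of the k-means loop described under "Partitioning methods" above (my own illustration; assumes numeric data and that no cluster becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then recompute the centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # k random points as initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)                         # nearest centroid per point
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # stop when nothing changes anymore
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)    # roughly (0, 0) and (5, 5)
```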
63
Silhouette coefficient - measures how well the clustering has performed - helps pick the right value of K - pick the K where s(i) is closest to 1
How well the clustering has performed:
- Pick a range of candidate values for K.
- Calculate the silhouette coefficient s(i) for each point in the dataset: s(i) = (b(i) − a(i)) / max(a(i), b(i))
  - a(i) = the average distance of point i to the other points in the same cluster (within-cluster distance) → these points should be as similar as possible, so a(i) should be near 0.
  - b(i) = the average distance of point i to all points in the nearest other cluster (between-cluster distance) → these should be as dissimilar as possible, so b(i) should be as large as possible.
  - Ideally a(i) << b(i). The ideal silhouette value is 1; the worst possible value is -1.
- Plot the silhouettes for each value of K to identify outliers: if a(i) > b(i), the point is likely misclassified. Outliers/misclassified instances have s(i) less than zero (the left side of the plots).
- Pick the K where the average silhouette is closest to 1. See the sketch below.
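A minimal sketch using scikit-learn to pick K by the average silhouette (my own illustration; it uses the library's KMeans rather than the from-scratch version above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in [(0, 0), (4, 4), (0, 4)]])   # 3 blobs

# try several candidate values of K and keep the average silhouette for each
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # the average silhouette should peak around k = 3
```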