Lecture 2 Flashcards

1
Q

What is a learning example

A
  • decompose objects into features (from all characteristics, pick the important characteristics of the instances)
  • decompose inputs into features (meaning unclear from the lecture)
  • a feature is a measurable aspect of an object/instance
  • features are extracted before learning
  • some learning algorithms can extract features themselves from certain types of input (e.g. images or text)
2
Q

a feature

A

a measurable aspect of an object

3
Q

feature transformation

A

New features of X are often transformations of existing features. These transformations are part of pre-processing, but they can make a problem much easier.

Examples: PCA, neural networks, scaling and normalisation.

Example: transforming Cartesian coordinates to polar coordinates (see the sketch below).
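A minimal sketch (my own illustration, not from the slides) of the Cartesian-to-polar transformation in Python, assuming 2-D numeric features; a ring-shaped class that is not linearly separable in (x, y) becomes separable on the radius alone:

```python
import numpy as np

def cartesian_to_polar(X):
    """Transform 2-D Cartesian features (x, y) into polar features (r, theta)."""
    x, y = X[:, 0], X[:, 1]
    r = np.sqrt(x**2 + y**2)        # distance from the origin
    theta = np.arctan2(y, x)        # angle in radians
    return np.column_stack([r, theta])

# toy example: two concentric rings, the label depends only on the radius
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.where(rng.random(200) < 0.5, 1.0, 3.0)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X_polar = cartesian_to_polar(X)     # a single threshold on r now separates the two rings
```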

4
Q

name some classifiers

A

logistic regression
kNN
decision tree

5
Q

Decision boundaries logistic regression

In short:

  • the decision boundary (DB) can be a linear or a polynomial function (both with one or more variables)
  • fitting is iterative
A
  • the regression coefficients that define the decision boundary are usually estimated using maximum likelihood estimation
  • g(f(x)) = the decision boundary is a linear (or multiple linear) equation;
    it CAN ALSO BE a polynomial function (but the classifier always separates 2 classes, ŷ = 0 or ŷ = 1)
  • unlike linear regression, you cannot find a closed-form solution
  • an ITERATIVE PROCESS until it has converged
  • function composition is an operation that takes two functions f and g and produces a function h such that h(x) = g(f(x)). In this operation, the function g is applied to the result of applying the function f to x.
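A minimal sketch (my own illustration, assuming numeric features and scikit-learn, which is not named in the lecture): the coefficients are fitted iteratively by maximum likelihood, and the linear decision boundary is where the estimated probability equals 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: two Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression()          # fitted iteratively; no closed-form solution exists
clf.fit(X, y)

# decision boundary: w0*x0 + w1*x1 + b = 0, i.e. where the predicted P(y=1|x) = 0.5
w, b = clf.coef_[0], clf.intercept_[0]
print(f"boundary: {w[0]:.2f}*x0 + {w[1]:.2f}*x1 + {b:.2f} = 0")
print(clf.predict_proba(X[:3]))     # g(f(x)): the sigmoid g applied to the linear function f
```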
6
Q

logistic regression

A

The Y variable is binary (nominal) and the X variables can be categorical or numeric.

7
Q

K-nearest Neighbor

A

Simple idea: similarity (distance; the input can be numeric or categorical).

Given a new example Xj,

we look for the most similar example(s) in the training set (TrS)

and predict the same target for Xj.

The key component of kNN is the distance function. Depending on how you define distance, you can get very different classifiers / performance.

8
Q

Measurement of / defining similarity

  • small distance ; similar object
  • large distance ; dissimilar object
A

Distance functions

Numeric (5):

  • General Lp-metric (Minkowski)
  • Euclidean distance (p=2)
  • Manhattan distance (p=1)
  • Maximum metric (Chebyshev) (p = infinity)
  • Cosine Distance

https://www.sciencedirect.com/topics/computer-science/minkowski-distance –> see Figure 2.23.

Categorical (2):

  • Hamming distance (p=0)
  • Levenshtein distance
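A minimal sketch (my own illustration) of how the general Lp / Minkowski metric covers the other numeric distances; the example points (1, 2) and (3, 5) are the ones reused in the cards below:

```python
import numpy as np

def minkowski(a, b, p):
    """General Lp (Minkowski) distance between two numeric vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if np.isinf(p):                    # maximum metric (Chebyshev)
        return np.max(np.abs(a - b))
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

x1, x2 = [1, 2], [3, 5]
print(minkowski(x1, x2, 1))        # Manhattan: |3-1| + |5-2| = 5
print(minkowski(x1, x2, 2))        # Euclidean: sqrt(2^2 + 3^2) ≈ 3.61
print(minkowski(x1, x2, np.inf))   # Chebyshev: max(2, 3) = 3
```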
9
Q

Hamming distance

A

Looks at each attribute and checks whether the values are equal or not; count the attributes that differ.

  1. KAROLIN and KATHRIN: the Hamming distance is 3, because ROL in the first word differs from THR in the second.
  2. 1 0 1 1 1 0 1 and 1 0 0 1 0 0 1: the Hamming distance is 2.
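A minimal sketch (assuming sequences of equal length) that reproduces both examples:

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("KAROLIN", "KATHRIN"))                              # 3
print(hamming([1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0, 1]))      # 2
```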
10
Q

Chebyshev distance

A

The generalisation of the Minkowski distance for p → infinity.

Let’s use two objects, x1 = (1, 2) and x2 = (3, 5). The second attribute gives the greatest difference between values for the objects, which is 5 − 2 = 3. This is the Chebyshev distance.

11
Q

Manhattan distance

A

You can only move along the axes, like walking city blocks: sideways and up/down.

Let x1 = (1, 2) and x2 = (3, 5) represent two objects. The Manhattan distance between the two is 2 + 3 = 5.

(2 = 3 − 1 and 3 = 5 − 2)

12
Q

Euclidean distance

A

In 2 dimensions:

Let x1 = (1, 2) and x2 = (3, 5) represent two objects. The Euclidean distance between the two is the square root of (2^2 + 3^2) ≈ 3.61.

(2 = 3 − 1 and 3 = 5 − 2)

13
Q

Finding the nearest neighbor

A

Learning as MEMORIZATION

Given a test point, measure the distances to all the training points and pick the k nearest ones

Their labels define the estimated label of the test point (e.g. by majority vote); see the sketch below.
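A minimal from-scratch sketch of this memorization view (my own illustration, using Euclidean distance and a majority vote):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest examples
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 2], [2, 3], [8, 8], [9, 7]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))   # "A"
```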

14
Q

Choose right value of k

A

If you pick k too LARGE → UNDERFITTING:
- everything is classified as the most probable class.

If you pick k too SMALL → OVERFITTING (high variability, unstable decision boundaries).

15
Q

units in kNN

A

Units do matter in kNN.

Suppose you had a dataset (m "examples" by n "features") and all but one feature dimension had values strictly between 0 and 1, while a single feature dimension had values ranging from -1000000 to 1000000. When taking the Euclidean distance between pairs of "examples", the feature dimensions that range between 0 and 1 become uninformative and the algorithm essentially relies on the single dimension whose values are substantially larger.

You have to transform the features to standardized units → z-scores (standardizing): z = (X − mean of X) / s.d. of X (see the sketch below).
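A minimal sketch of z-score standardization before kNN (assuming numeric features; the mean and standard deviation must be computed on the training set only and then reused for test data):

```python
import numpy as np

def zscore_fit(X_train):
    """Per-feature mean and standard deviation, computed on the training set."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def zscore_transform(X, mean, std):
    """z = (X - mean) / std, applied column-wise (assumes std > 0 for every feature)."""
    return (X - mean) / std

X_train = np.array([[0.2,  500_000.0],
                    [0.8, -900_000.0],
                    [0.5,  100_000.0]])
mean, std = zscore_fit(X_train)
X_train_std = zscore_transform(X_train, mean, std)                    # features now on comparable scales
X_test_std = zscore_transform(np.array([[0.4, 50_000.0]]), mean, std)
print(X_train_std)
```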

16
Q

Problems kNN

A

some dimensions may be more informative about the class than others

Can we take this into account in the k-NN algorithm?

Do we need to "forget" training examples as the (training) dataset grows?

17
Q

Advantages kNN

A

fast learner (since there is no abstraction)

fast classification possible (using smart indexing structures like k-d-trees)

directly provides illustrative examples (–> the k-nearest neighbors)

18
Q

Cosine distance

A

Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter.

19
Q

Decision tree classifier

A

A classification scheme which generates a tree and a set of rules from a given dataset.

It takes one feature at a time and tests a binary condition. Each node tests a condition on a feature (an if-else statement).

The order of the nodes is important. The first question maximizes the information gain (drop in entropy) from the answer.

Based on information theory.

Normalisation does not help; there is no need for it.

The decision boundaries are PERPENDICULAR to the instance-space axes.

PRUNING is a technique that reduces the size of a decision tree by removing sections of the tree that provide little power to classify instances → REDUCES OVERFITTING.
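A minimal sketch with scikit-learn (my own illustration, not from the slides): splits are chosen by entropy/information gain, and limiting the depth acts against overfitting much like pruning does:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# criterion="entropy" -> splits chosen by information gain;
# max_depth limits the size of the tree, reducing overfitting
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

print(export_text(clf, feature_names=iris.feature_names))   # the learned if-else rules
```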

20
Q

ID3 algorithm (decision tree)

A

Basic algorithm

  • all (training) instances are assigned to the root
  • the next attribute (test) is selected – splitting strategy
  • the training set is partitioned using the split attribute
  • proceed for all partitions recursively → a locally optimizing algorithm

Stopping criterion

  • no more splitting attributes
  • all instances of the node belong to exactly one class
21
Q

Entropy

A

Tells us how pure or impure a subset is. For two classes it is a number between 0 and 1 (bits).

If the entropy is 1, you are totally uncertain: there is a 50 percent chance of yes or no.

If the entropy is 0, you are totally certain: you are 100% sure what the outcome label is going to be; it is always yes or always no.

22
Q

Information Gain

A

The expected drop in entropy after the split.

If I split on this attribute, how much more certain am I after the split, compared to before the split?

You want entropy to be low (entropy of 0 = 100% certain, pure) and information gain to be high (so entropy goes down). See the sketch below.
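A minimal sketch (my own illustration) of binary entropy and the information gain of a split, assuming the labels are 0/1 numpy arrays:

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a vector of binary class labels."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)                      # fraction of positive labels
    if p == 0.0 or p == 1.0:            # pure subset -> totally certain
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(y, mask):
    """Expected drop in entropy when splitting y according to the boolean mask."""
    n = len(y)
    left, right = y[mask], y[~mask]
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - after

y = np.array([1, 1, 1, 0, 0, 0])
split = np.array([True, True, True, False, False, False])   # perfectly separates the classes
print(entropy(y))                   # 1.0 -> maximally impure before the split
print(information_gain(y, split))   # 1.0 -> entropy drops to 0 after the split
```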

23
Q

Complexity of induced model DT

A

The complexity of the model induced by a decision tree is determined by the depth of the tree.

Increasing the depth of the tree increases the number of decision boundaries.

All decision boundaries are PERPENDICULAR to the feature axes, because at each node a decision is made about a single feature.

24
Q

Advantages of DT

A

simple to understand and interpret

work with relatively little data

help to find which feature is most important for classification

rule-based

25
Multiclass Classification - One-vs-all algorithm classifier
E.g. if you have 3 classes, you turn your dataset into 3 separate binary classification problems, which gives you 3 LINEAR decision boundaries (even a vertical boundary with undefined slope is still linear). Data points could fall into 0 or more classes. Train a logistic regression classifier for each class i to predict the probability that y = i. On a new input x, to make a prediction, run the 3 classifiers on x and pick the class whose classifier is most confident (see the sketch below).
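A minimal sketch of one-vs-all built from separate scikit-learn logistic regressions (my own illustration; the library can also do this internally):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # 3 classes
classes = np.unique(y)

# one binary logistic regression per class: "class i" vs "everything else"
classifiers = {c: LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
               for c in classes}

def predict_one_vs_all(x):
    """Pick the class whose classifier is most confident that x belongs to it."""
    probs = {c: clf.predict_proba(x.reshape(1, -1))[0, 1] for c, clf in classifiers.items()}
    return max(probs, key=probs.get)

print(predict_one_vs_all(X[0]), y[0])      # predicted class vs true class for one example
```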
26
Underfit (you can only tell by comparing the train/validation error with the test error). High bias, because the fit is far away from the data points, but low variance, because future, different datasets are likely to fit this line about equally well. https://www.youtube.com/watch?v=fDQkUN9yw44
- High bias (and low variance): high training error; high CV error, approximately as high as the training error.
27
Overfit (you can only tell by comparing the train/validation error with the test error). Low bias, because the training data points are fitted perfectly, but high variance, because it is unlikely to fit future datasets equally well. We cannot generalize well beyond the training data. https://www.youtube.com/watch?v=fDQkUN9yw44
- Fits the training data very closely; predicts very well on the training data
- Does not generalize well to unseen data (test set)
- High variance (and low bias): low training error, high CV error (CV error >> training error)
Common causes of overfitting:
a) Too complex a model (too high a polynomial degree and/or too many features)
b) Noise in the data, e.g. outliers and errors
c) The amount of training data may not be enough
28
Variance
Difference in fit between different test datasets --> distance between test dataset and fit line
29
Hyperparameter
The variable part of the model that is not set during learning on the training data; it needs to be tuned on a validation set.
30
Model Parameter
- Required by the model when making predictions
- Their values define the skill of the model on your problem (we need to use some data to estimate the parameters, e.g. for logistic regression we have to use some data to estimate the shape of the S-curve)
- Are learned from data
31
Model Parameter
- Required by the model when making predictions
- Their values define the skill of the model on your problem
- Are learned from data
- Often not set manually by the practitioner
- Often saved as part of the learned model
- We need to use some data to estimate the parameters, e.g. for logistic regression we have to use some data to estimate the shape of the S-curve. Estimating the parameters is called 'training the algorithm'.
32
Model Hyperparameter
- Often used in processes that help estimate the model parameters
- Often specified by the practitioner
- Can be set using heuristics
- Often tuned for a given predictive modeling problem, e.g. the k, the neighbor weights and the distance metric (3) in k-nearest neighbors. You can control the fit in kNN by changing k: smaller k = more fit, larger k = less fit.
33
Grid search: hyperparameter optimization
A systematic search for the best hyperparameter settings:
- Manually choose the values to test for each hyperparameter
- Check the validation accuracy/error for all combinations
- The number of combinations expands quickly, but hyperparameters often behave relatively independently (in parallel)
- Lots of computation
See the sketch below.
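A minimal sketch of a grid search over kNN hyperparameters with scikit-learn (my own illustration; the particular grid values are made up):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# every combination of these hyperparameter values is tried
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)   # 5-fold CV as the validation check
search.fit(X, y)
print(search.best_params_, search.best_score_)
```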
34
Evaluating performance regression task
- Coefficient of determination (R²) → how well the model does relative to a nested baseline (the mean)
- Root Mean Squared Error → sensitive to outliers (because you square the errors)
- Mean Absolute Error → uses absolute values, not sensitive to outliers
35
R2
How well the model predicts targets relative to the mean.
Equivalent to the PROPORTION OF VARIANCE EXPLAINED BY THE MODEL.
The mean is not always a suitable baseline.
36
Evaluating performance classification task
- Confusion matrix
- Accuracy, precision, recall, F1 score
- Area Under the ROC curve → lets you inspect the outcome for different THRESHOLDS (criteria)
37
Accuracy
(TP + TN) / all values. The proportion that your model predicted correctly: the predicted negatives and predicted positives were actually labeled negative and positive.
38
False positive
The model predicted those data points to be positive, but the true label was negative. You have some − values in your + prediction class.
39
False negative
The model predicted those data points to be negative, but the true label is positive. You have some + values in your − prediction class.
40
Recall
TP / (TP + FN). The fraction of all actual positives that are correctly identified as positive. The evaluation metric when you want a model that rarely fails to detect the true value (cancer, terrorist).
41
Precision
TP / (TP + FP). The fraction of predicted positives that are truly positive. The evaluation metric when avoiding false positives is important: we accept that not all positive instances are detected, but of the cases predicted positive we want to be sure that they are truly positive.
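A minimal sketch (my own illustration) computing accuracy, recall and precision directly from confusion-matrix counts:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))   # predicted +, actually +
tn = np.sum((y_pred == 0) & (y_true == 0))   # predicted -, actually -
fp = np.sum((y_pred == 1) & (y_true == 0))   # predicted +, actually -
fn = np.sum((y_pred == 0) & (y_true == 1))   # predicted -, actually +

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # fraction of actual positives that were found
precision = tp / (tp + fp)     # fraction of predicted positives that are correct
print(accuracy, recall, precision)   # 0.75 0.75 0.75
```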
42
Specificity
TN / (TN + FP). The fraction of actual negatives that the classifier correctly identifies as negative (the true negative rate). (Note: FP / (FP + TN) is the false positive rate, the fraction of actual negatives incorrectly identified as positive.)
43
Recall-oriented ML tasks
The consequence of not correctly identifying a positive case is high:
- legal discovery
- tumor detection / earthquakes
- often paired with a human expert to filter out the false positives
44
Precision-oriented ML tasks
The consequence of a false positive is high (e.g. predicting that someone likes something when they do not, as in YouTube recommendations). You don't want to recommend something that someone does not like; people are not happy with that.
45
Overfit in a confusion matrix
https://datascience.stackexchange.com/questions/28426/train-accuracy-vs-test-accuracy-vs-confusion-matrix
- The difference between train and test accuracy tells us IF the model is overfitting: train accuracy higher than test accuracy = overfit; train accuracy lower than test accuracy = underfit.
- The confusion matrix tells us HOW MUCH it is overfitting.
46
Cross validation
- To estimate the performance of the learned model from the available data, using one algorithm → estimating the parameters for machine learning methods (evaluating how well we trained the model).
- To compare the performance of two or more different algorithms and find the best algorithm for the available data → evaluating how well the machine learning methods work (compared to each other).
47
CV methods
- Hold-out
- K-fold cross validation (see the sketch below)
- Leave-one-out
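A minimal sketch of k-fold cross validation with scikit-learn (my own illustration; hold-out would be a single train/test split, and leave-one-out is k-fold with k equal to the number of examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, evaluate on the held-out fold, repeat 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)
print(scores.mean(), scores.std())   # average accuracy and its variability across folds
```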
48
Downsampling: unbalanced classes
Creates a balanced dataset by matching the number of samples in the minority class with a random sample from the majority class.
49
Upsampling: unbalanced classes
Matches the number of samples in the majority class by resampling (with replacement) from the minority class. See the sketch below.
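A minimal sketch of both techniques using sklearn.utils.resample (my own illustration; class 1 is assumed to be the minority class):

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])      # 7 majority (0) vs 3 minority (1)

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# downsampling: shrink the majority class to the size of the minority class
X_maj_dn, y_maj_dn = resample(X_maj, y_maj, replace=False, n_samples=len(y_min), random_state=0)

# upsampling: grow the minority class (sampling with replacement) to the size of the majority class
X_min_up, y_min_up = resample(X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0)

y_down = np.concatenate([y_maj_dn, y_min])
y_up = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_down), np.bincount(y_up))      # [3 3] and [7 7]
```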
50
CV Time from high to low
leave-one-out > k-fold > hold-out
51
CV dataset from large to small
hold-out > k-fold > leave-one-out > bootstrapping
52
Generalization
Ability to give accurate predictions for new, previously unseen data
53
Assumptions for generalization
- Future unseen data will have the same properties (probability distribution, spread, correlation, etc.) as the current training set.
- So if the model is accurate on the training set, it should be accurate on the test set.
- But this may not hold if the trained model is tuned too specifically to the training set.
54
Overfit (AMLwithP)
Can capture complex patterns by being great at predicting lots and lots of specific data samples or areas of local variation, but it often misses the global pattern in the training set that would help it generalize to the test set.
55
Clustering https://youtu.be/CiA3ca7W7Eg
- Discover the underlying structure of the data. Identify a finite set of categories, classes or groups.
- Objects in the same cluster should be as similar as possible; objects in different clusters should be as dissimilar as possible.
Questions to ask: What sub-populations/groups exist in the data? How many are there? What are their sizes? Do elements in a sub-population have any common properties? Are sub-populations cohesive; can they be further split up? Are there outliers?
56
Types of clustering (C) methods https://youtu.be/CiA3ca7W7Eg
• Partitioning methods
  - Hyperparameters: k, distance function
  - Goal: partition into k clusters with minimal cost
  - Place k centroids at random positions in space
  - For each point i, find the nearest centroid j and assign the point to cluster j
  - Update each centroid (e.g. compute the mean, as in k-means clustering)
  - Stop when none of the cluster assignments change
• Hierarchical methods
  - Hyperparameter: distance function for points and clusters
  - Determines a hierarchy of clusters by repeatedly combining the respective most similar clusters
• Density-based methods
  - Parameters: minimum points per cluster, distance
• Categorize by goal:
  - Monothetic: members have some common property (e.g. all are males aged 15-25)
  - Polythetic: cluster members are similar to each other (distance defines membership)
• Categorize by overlap:
  - Hard clustering (elements either belong to a cluster or not)
  - Soft clustering (clusters may overlap)
• Flat & hierarchical
57
Defining similarity with distance (D) functions
Requirements for distance (D) functions:
- The D between two points is non-negative.
- The D from a point to the point itself is equal to 0.
- The distance from point i to point j is exactly equal to the distance from point j to point i (symmetry).
- The distance from point i to point j via point k is never shorter than the direct distance from i to j (triangle inequality): d(i, j) ≤ d(i, k) + d(k, j).
58
Distance functions (see Lecture 2)
The similarity measure is a measure of how much alike two data objects are. A similarity measure in a data mining context is a distance with dimensions representing features of the objects. If this distance is small, there is a high degree of similarity; if the distance is large, there is a low degree of similarity.
59
Distance functions
Numeric (5):
- General Lp-metric (Minkowski)*
- Euclidean distance (p=2)
- Manhattan distance (p=1)
- Maximum metric (Chebyshev) (p = infinity)
- Cosine distance
https://www.sciencedirect.com/topics/computer-science/minkowski-distance --> see Figure 2.23.
Categorical (2):
- Hamming distance (p=0)
- Levenshtein distance
* The Minkowski distance is a generalization of the Euclidean and Manhattan distances (if p = 2, d = Euclidean distance; if p = 1, d = Manhattan distance).
60
Similarity functions
Although no single definition of a similarity measure exists, such measures are usually in some sense the inverse of distance metrics.
- Cosine similarity
- Pearson's r
- OLS coefficient
https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/
61
Cosine Similarity
- Similar: the angle is near 0 degrees and the cosine is near 1 (i.e. 100%)
- Unrelated: the angle is near 90 degrees and the cosine is near 0 (i.e. 0%)
- Opposite: the angle is near 180 degrees and the cosine is near -1 (i.e. -100%)
Cosine distance ignores the magnitude of the vectors.
See this table for degree/cosine clarification: https://socratic.org/questions/how-do-you-tell-whether-the-value-of-tan-90-degrees-is-positive-negative-zero-or
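A minimal sketch (my own illustration) of cosine similarity and the corresponding distance:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [2, 0]))    #  1.0 -> same direction (magnitude ignored)
print(cosine_similarity([1, 0], [0, 1]))    #  0.0 -> orthogonal / unrelated
print(cosine_similarity([1, 0], [-1, 0]))   # -1.0 -> opposite direction
# cosine distance is commonly defined as 1 - cosine similarity
```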
62
K means clustering - Polythetic - Hard boundaries - Flat - K is hyperparameter
- Data is partitioned into K sub-populations. K is a hyperparameter and needs to be specified.
- Associates each point with a 'centroid'. A centroid = the attribute-value representation of a cluster, a sort of prototypical individual of the sub-population.
- Objective:
  - Minimize the pairwise squared deviations within a cluster (intra-cluster)
  - Maximize the sum of squared deviations between different clusters
See the sketch below.
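A minimal from-scratch sketch of the k-means loop described under "Partitioning methods" above (my own illustration; assumes numeric data and that no cluster becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then recompute the centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # k random points as initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)                         # nearest centroid per point
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # stop when nothing changes anymore
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)    # roughly (0, 0) and (5, 5)
```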
63
Silhouette coefficient - measures how well the clustering has performed - helps pick the right value of K - pick the K where s(i) is closest to 1
How well the clustering has performed:
- Pick a range of candidate values for K.
- Calculate the silhouette coefficient s(i) for each point in the dataset: s(i) = (b(i) − a(i)) / max(a(i), b(i))
  - a(i) = the average distance of point i to the other points in the same cluster (within-cluster distance) → these points should be as similar as possible, so a(i) should be near 0.
  - b(i) = the average distance of point i to all points in the nearest other cluster (between-cluster distance) → these should be as dissimilar as possible, so b(i) should be as large as possible.
  - Ideally a(i) << b(i). The ideal silhouette value is 1; the worst possible value is -1.
- Plot the silhouettes for each value of K to identify outliers: if a(i) > b(i), the point is likely misclassified. Outliers/misclassified instances have s(i) less than zero (the left side of the plots).
- Pick the K where the average silhouette is closest to 1. See the sketch below.
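A minimal sketch using scikit-learn to pick K by the average silhouette (my own illustration; it uses the library's KMeans rather than the from-scratch version above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in [(0, 0), (4, 4), (0, 4)]])   # 3 blobs

# try several candidate values of K and keep the average silhouette for each
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # the average silhouette should peak around k = 3
```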