Final Flashcards

1
Q

What is a node in a decision tree?

A

A node, including the root, represents a single input feature and a split point on that feature.

2
Q

What is a leaf node in a decision tree?

A

A leaf node contains the output variable y used for prediction.

3
Q

How do we decide the best split in a DT?

A

A greedy algorithm is used, where every possible feature and split point is evaluated to minimize a cost function.

4
Q

Which cost function is commonly used for regression in a DT? Which is commonly used for classification?

A

For regression we use the sum of squared errors.

For classification we use the Gini index.

5
Q

What does Gini score measure?

A

How successful a given split is, i.e. how mixed the classes are between the two groups created by the split.

6
Q

What is the best and worst case scenario for Gini score?

Binary class problem

A

Perfect separation results in a Gini score of 0

Worst-case split of 50/50 results in a Gini score of 0.5
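
A minimal Python sketch (not from the deck; the helper name gini_index is my own) showing how the score is computed from the class labels in each group:

```python
def gini_index(groups, classes):
    """Weighted Gini impurity for the groups produced by a split."""
    n_total = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        if not group:
            continue
        # sum of squared class proportions within this group
        score = sum((group.count(c) / len(group)) ** 2 for c in classes)
        # weight the group's impurity by its relative size
        gini += (1.0 - score) * (len(group) / n_total)
    return gini

# groups are lists of class labels on each side of the split
print(gini_index([[0, 0], [1, 1]], classes=[0, 1]))  # perfect split: 0.0
print(gini_index([[0, 1], [0, 1]], classes=[0, 1]))  # 50/50 mix: 0.5
```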

7
Q

What are two hyperparameters used to determine a node is terminal?

A
  1. Max tree depth: the maximum tree depth is reached
  2. Min size: the number of training points in the node is less than or equal to a given threshold (see the sketch below)
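
In scikit-learn these two stopping criteria correspond to the max_depth and min_samples_split hyperparameters; a minimal sketch with illustrative values:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A node becomes terminal once depth 3 is reached, or when it holds
# fewer than 10 training points (both values purely illustrative).
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=10).fit(X, y)
```
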
8
Q

How do we make predictions for a classification problem using a DT?

A

Starting at the root, follow the branch whose condition evaluates to true for the data point, repeating until we reach a terminal node.

9
Q

How do we determine the final prediction at a leaf node?

A

We choose the majority class of that node.

10
Q

What is cross-entropy and how is it used in DT?

A

Cross-entropy is a measure of the purity of a collection of samples. In DTs it is used through information gain, which is the difference between the cross-entropy before a split and after it.

We try to maximize information gain when determining the best split.
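
A small Python sketch of both ideas (function names are my own); entropy is highest for a 50/50 mix, and information gain is the drop in entropy achieved by the split:

```python
import math

def entropy(labels):
    """Cross-entropy (Shannon entropy) of a collection of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    after = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - after

# splitting a 50/50 parent into two pure children gives the maximal gain
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 1.0
```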

11
Q

What is the best and worst case scenario for cross-entropy score?

A

Best is 0, when all points in a node belong to a single class

Worst is 1, when there is a 50/50 split (binary case)

12
Q

What are two ways to reduce overfitting of a DT?

A
  1. Pruning: remove leaves if doing so reduces the cost on the test set
  2. Ensembling (random forests, boosting, bagging)
13
Q

What do we mean by an ensemble technique?

A

An ensemble technique combines the results from multiple models to obtain better performance

14
Q

What is a random forest?

A

An ensemble of decision trees, generated by randomly selecting the features considered for the split at each node.

15
Q

How many features do we use when determining the random split in a random forest decision tree?

A

Generally we use a subset of size sqrt(features) for each DT.

16
Q

How do we make predictions in a random forest?

A

We take the majority vote from all decision trees in the ensemble.
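
A scikit-learn sketch tying the last few cards together; max_features='sqrt' gives each split a random subset of sqrt(n_features) candidates, and predict() aggregates the trees' votes:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each split drawn from a sqrt-sized random feature subset
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt').fit(X, y)

# the prediction is the aggregated vote across all trees
print(rf.predict(X[:3]))
```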

17
Q

What is boosting for decision trees?

A

An ensembling method where we create multiple decision trees sequentially and put more weight on misclassified samples on subsequent trees.

18
Q

What is regression?

A

A method to predict continuous output values based on a set of observations.

19
Q

What is linear regression at a high level?

A

A model that assumes a linear relationship between input variables x and a single output variable y to make predictions

20
Q

What equation defines linear regression?

A

The slope-and-intercept function h(x) = θ_0 + θ_1*x, which we call the hypothesis.

This can be expanded to higher dimensions through the dot product <1, x_1, x_2, ..., x_N> * <θ_0, θ_1, θ_2, ..., θ_N>
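
A NumPy sketch of the hypothesis as that dot product, with illustrative parameter values; note the leading 1 that pairs with the intercept θ_0:

```python
import numpy as np

theta = np.array([0.5, 2.0, -1.0])  # [theta_0, theta_1, theta_2]
x = np.array([3.0, 4.0])            # one data point with two features

# prepend 1 so theta_0 acts as the intercept in the dot product
h = np.dot(np.concatenate(([1.0], x)), theta)
print(h)  # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```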

21
Q

What are two equations that we use as cost functions for linear regression?

A

Squared error and mean squared error.

We want to minimize these cost functions when determining our linear function h(x)

22
Q

What is gradient descent for linear regression?

A

Gradient descent is a means to automatically determine the best parameters θ_i for our linear regression model, by repeatedly moving the parameters in the direction opposite the gradient of the error until the error is minimized.
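
A minimal NumPy sketch under the deck's setup (MSE cost, a bias column for θ_0); the learning rate and iteration count are illustrative:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Fit theta for h(x) = X @ theta by minimizing mean squared error."""
    X = np.column_stack([np.ones(len(X)), X])  # bias column for theta_0
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        error = X @ theta - y
        grad = (2 / len(y)) * X.T @ error      # gradient of the MSE
        theta -= lr * grad                     # step against the gradient
    return theta

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])                  # generated by y = 1 + 2x
print(gradient_descent(X, y))                  # approx [1.0, 2.0]
```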

23
Q

What are two ways that we can speed up gradient descent for multi-variable linear regression?

A
  1. Normalization
  2. Standardization
24
Q

What is normalization?

A

We scale values between 0 and 1 based on the maximum and minimum in the dataset.

X' = (X − min(X)) / (max(X) − min(X))

25
Q

What is standardization?

A

Modifies a feature so that it has a mean of 0 and a standard deviation of 1.

X' = (X − μ(X)) / σ(X)
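
Both transforms in a few lines of NumPy (the data values are illustrative):

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0])

X_norm = (X - X.min()) / (X.max() - X.min())  # normalization: values in [0, 1]
X_std = (X - X.mean()) / X.std()              # standardization: mean 0, std 1

print(X_norm)                      # [0.    0.333 0.667 1.   ]
print(X_std.mean(), X_std.std())   # ~0.0 and 1.0
```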

26
Q

What is polynomial regression?

A

In polynomial regression we use different powers of x in our hypothesis. This can lead to better results when the data is not linear.

Can cause underfitting if the degree is too low and overfitting if the degree is too high

27
Q

What is feature selection for linear regression?

A

It is based on the fact that not all features are equally important. We can run a hypothesis test per coefficient with null hypothesis H_0: θ_i = 0; the p-value is the probability of observing an estimate at least as extreme as ours if θ_i really were 0, so a small p-value is evidence that the feature matters.

28
Q

What is the main difference between logistic and linear regression?

A

The output is a discrete class label rather than a continuous value.

29
Q

What is the function used for logistic regression?

A

The sigmoid function

S(z) = 1/(1 + e^(-z)), where z = θ_0 + θ_1*x_1 + ... + θ_n*x_n

30
Q

How do we make predictions in logistic regression?

A

The sigmoid produces values between 0 and 1, so we define a threshold (e.g. 0.5) above which we predict class 1 and below which we predict class 0.
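
A NumPy sketch of this thresholding (the parameter values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, theta, threshold=0.5):
    """Predict class 1 where the sigmoid output reaches the threshold."""
    z = np.column_stack([np.ones(len(X)), X]) @ theta  # theta_0 is the bias
    return (sigmoid(z) >= threshold).astype(int)

theta = np.array([-1.0, 2.0])        # illustrative parameters
X = np.array([[0.0], [1.0], [2.0]])
print(predict(X, theta))             # [0 1 1]
```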

31
Q

What is cross-entropy or log loss?

A
  • It is a loss function that measures the difference between the actual and the predicted probability distributions.
  • It splits into two cost terms, one for the y = 1 case and one for the y = 0 case (see the sketch below)
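
A minimal sketch of the two cases (the function name is my own):

```python
import math

def log_loss(y, p):
    """Binary cross-entropy: one term for y = 1, another for y = 0."""
    return -math.log(p) if y == 1 else -math.log(1 - p)

print(log_loss(1, 0.9))  # ~0.105: confident and correct -> small loss
print(log_loss(1, 0.1))  # ~2.303: confident and wrong -> large loss
```
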
32
Q

What do we generally use to measure the performance of a prediction model?

A

A confusion matrix

33
Q

How do we do multiclass classification with logistic regression?

A

Use a one-vs-rest approach: we train k classifiers, one for each output class, and use the output of the classifier with the highest probability.

34
Q

What is the goal of regularization in logistic regression?

A

To penalize overfitting or model complexity

35
Q

What is Ridge (L2 regularization)?

A

Tries to keep the values of the model parameters (coefficients) small.

Penalization is the sum of the squares of the parameters

36
Q

What is LASSO (L1 regularization)?

A

Also tries to keep parameter values small, and drives many of them exactly to zero.

Penalization is the sum of the absolute values of the parameters
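
A scikit-learn sketch contrasting the two penalties (alpha sets the regularization strength; all values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)  # L1: drives many coefficients to zero

print(ridge.coef_)  # small but nonzero values
print(lasso.coef_)  # typically contains exact zeros
```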

37
Q

What is an ROC curve?

A

The ROC (Receiver Operating Characteristic) curve measures the performance of a classification model across various threshold settings.

38
Q

Describe the parts of an ROC curve?

A
  • TPR (true positive rate) is on the y-axis
  • FPR (false positive rate) is on the x-axis
  • The ideal point is at the top-left of the plot: no false positives (FPR = 0) and every actual positive detected (TPR = 1)
  • The area under the curve (AUC) measures the capacity to distinguish between classes (see the sketch below)
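
A scikit-learn sketch computing the curve and its AUC on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)  # one (FPR, TPR) per threshold
print(roc_auc_score(y_te, probs))              # closer to 1.0 = better separation
```
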
39
Q

What is the purpose of an SVM?

A

A support vector machine is a class of supervised model used for regression, classification and outlier detection

40
Q

How do SVMs work?

A

They find a line, curve, or plane that best separates the classes

41
Q

Why is simply drawing a line between classes not the best strategy for dividing classes? Give an alternative.

A

On the training data, many lines may divide the classes perfectly, but some will generalize better than others to the test data.

The alternative is maximizing the margin: the maximum width of a band around the line before it reaches points from either class

42
Q

How do we predict classes in an SVM?

A

If the signed distance from the decision boundary is negative, the point is in one class; if positive, it is in the other.

43
Q

What are the support vectors of an SVM?

A

The data points in a dataset that lie closest to the decision boundary

44
Q

What is the difference between a soft and hard margin for an SVM?

A
  • Hard margin: the decision boundary cannot be violated (the data must be linearly separable)
  • Soft margin: the decision boundary can be violated; misclassification is minimized

A soft margin is characterized by a slack variable

45
Q

True or False: SVMs are sensitive to scaling

A

True

46
Q

What is a kernel for an SVM model?

A

A kernel can transform data into higher dimensions to make it linearly separable.

e.g. pushing points up along the z-axis when they are near the center and down when they are further from it

47
Q

What can we do in an SVM if our domain is not linearly seperable?

A

We may need to consider a different kernel to create linear separation in higher dimensions.

48
Q

What role does the C parameter play in determining the margin width of an SVM classifier?

A

It trades off maximizing the margin width against minimizing the classification error.

  • A large C results in a smaller margin width, enforcing strict classification (hard margin)
  • A small C allows a larger margin width, permitting some misclassifications (soft margin); see the sketch below
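
A scikit-learn sketch of the trade-off (the C values are illustrative extremes):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

strict = SVC(kernel='linear', C=100.0).fit(X, y)  # large C: narrow, strict margin
loose = SVC(kernel='linear', C=0.01).fit(X, y)    # small C: wide, forgiving margin
```
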
49
Q

True or False: SVM can be used for regression

A

True

Scikit-learn provides the SVR model.

50
Q

What idea gives rise to KNN?

A

Similar data points tend to belong to the same class

51
Q

What four properties should a distance function have?

A
  1. Non-negativity: dist(a, b) ≥ 0
  2. Triangle inequality: dist(a, b) + dist(b, c) ≥ dist(a, c)
  3. Identity: dist(a, b) = 0 if and only if a = b
  4. Symmetry: dist(a, b) = dist(b, a)
52
Q

What’s the most common distance function for KNN?

A

Euclidean distance
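
A one-function Python sketch; it satisfies all four properties from the previous card:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```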

54
Q

What is Voronoi tessellation?

A

Tessellation using polygons, where each polygon represents the area that is closer to one specific data point than to any other data point in the dataset.

55
Q

What is a Voronoi region or a decision boundary?

A

It is a contiguous section of the Voronoi tessellation in which the same target is predicted.

56
Q

How does the value of k affect the KNN algorithm and its performance?

A
  • We take the majority vote among the classes of the k nearest neighbours
  • A low k value can lead to overfitting, while a higher k can result in underfitting
57
Q

What design problem can happen when we have a k that is too high? What is a solution?

A

We consider too many neighbours, including some that are very far away! In the extreme it becomes majority rules over the entire dataset.

A solution is weighted KNN

58
Q

What is weighted KNN?

A

Take the distance into account when making predictions, e.g. by weighting each neighbour's vote by the inverse of its distance.
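
In scikit-learn this is the weights='distance' option, which weights each neighbour's vote by the inverse of its distance; a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# closer neighbours count more, softening the effect of a large k
knn = KNeighborsClassifier(n_neighbors=15, weights='distance').fit(X, y)
print(knn.predict(X[:3]))
```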

59
Q

What is KMeans?

A

KMeans is a clustering algorithm which attempts to create k clusters.

60
Q

What does KMeans seek to reduce?

A

Intra-cluster variance, i.e. the within-cluster sum of squares

61
Q

What are the steps for KMeans?

A
  1. Begin with k random centroids
  2. Assign every data point to the nearest centroid
  3. Calculate the new centroid from the assigned data points
  4. Repeat steps 2 and 3 until reaching a stopping criterion (see the sketch below)
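
A NumPy sketch that mirrors these four steps (it ignores the empty-cluster edge case, so it is illustrative rather than production code):

```python
import numpy as np

def kmeans(X, k, n_iters=100):
    # 1. begin with k random centroids sampled from the data
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # 2. assign every point to the nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid from its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # 4. stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```
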
62
Q

What is PCA?

A

A technique used for dimensionality reduction. This method transforms a large set of variables into a smaller one that still contains most of the information in the large set.

63
Q

Why dimensionality reduction?

A

In many real-world applications, data comes in the form of high-dimensional vectors, which can be difficult to analyze and visualize.

Dimensionality reduction helps us visualize and analyze the data while keeping as much information as possible.

64
Q

What is feature selection?

A

Feature selection means to select a subset containing the most relevant features to use in training a model

65
Q

True or False: Feature selection is a form of dimensionality reduction?

A

True

66
Q

What are three benefits of feature selection?

A
  1. Reduces training time
  2. Reduces overfitting
  3. Improves accuracy - less misleading data
67
Q

What is a filter based method for feature selection?

A

Use statistical techniques to gauge the relevance of the input variables to the target variable, without training a model.

e.g. chi-squared, correlation and mutual information

68
Q

What are wrapper methods for feature selection?

A

Create many models with different subsets of the input features, and select the subset that produces the best model.

e.g. Recursive Feature Elimination (RFE)

69
Q

What are intrinsic or embedded methods for feature selection?

A

Feature selection performed automatically by some machine learning algorithms as part of learning the model, e.g. regularization

Penalize the model for using irrelevant features

70
Q

What is the elbow method for finding k in KMeans?

A

Plot the cost (e.g. the mean squared distance from points to their centroid) against the number of clusters; at a certain k the curve kinks and the benefit of increasing k further drops off (see the sketch below).
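
A scikit-learn sketch; inertia_ is the within-cluster sum of squares, and plotting it against k reveals the kink:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# the cost drops sharply up to the true number of clusters, then flattens
for k in range(1, 9):
    cost = KMeans(n_clusters=k, n_init=10).fit(X).inertia_
    print(k, cost)
```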

71
Q

What are the two types of hierarchical clustering?

A
  1. Agglomerative (bottom-up): start with each point as its own cluster and merge nearby clusters
  2. Divisive (top-down): start with one big cluster and split it into subclusters
72
Q

What are the steps in agglomerative clustering?

A
  1. Initialize each data point as a cluster
  2. Find distances between all clusters
  3. Merge closest two clusters into one
  4. Repeat steps 2 and 3 until a stopping criterion is reached (e.g. a distance threshold)
73
Q

What is the name of the graph we can use to visualize agglomerative clustering? What are the axes?

A

Dendrogram
* x axis represents clusters
* y axis represents cluster distance (see the sketch below)
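
A SciPy sketch covering this card and the previous one: linkage performs the agglomerative merging and dendrogram draws the merge tree (requires matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.randn(20, 2)

Z = linkage(X, method='ward')  # one row per merge, with the merge distance
dendrogram(Z)                  # clusters on the x-axis, distance on the y-axis
plt.show()
```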

74
Q

What are the two main differences between agglomerative and KMeans clustering?

A
  1. With KMeans we need to define the number of clusters beforehand
  2. With agglomerative clustering the clusters can take arbitrary shapes
75
Q

What does PCA try to minimize?

A

PCA tries to minimize the projection error when reducing the dimension from n down to a smaller number of components.
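
A scikit-learn sketch projecting 4-dimensional data down to 2 components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)             # reduce from n=4 dimensions to 2
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```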

76
Q

Which Bayes formula describes the probability of A given B?

A

P(A|B) = (P(B|A) * P(A)) / P(B)
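
A tiny worked example with hypothetical numbers, purely to illustrate the formula:

```python
# hypothetical probabilities for illustration only
p_a = 0.01          # P(A): prior, e.g. P(disease)
p_b_given_a = 0.9   # P(B|A): likelihood, e.g. P(positive test | disease)
p_b = 0.05          # P(B): evidence, e.g. P(positive test)

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.18
```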

77
Q

How do you calculate P(headache, no fever, vomiting)

A

This is a joint probability, so we count the rows where there is Headache AND No Fever AND Vomiting, and divide by the total number of rows.

78
Q

How would you calculate P(headache, no fever, vomiting | meningitis)

A

By the chain rule this is equal to P(headache | meningitis) * P(no fever | headache, meningitis) * P(vomiting | headache, no fever, meningitis)

Start from the outermost condition and narrow down: in the cases where they have headache AND meningitis, is there also no fever?

79
Q

How do we answer a Bayes query M(q)?

A

Find P(t = l | q) for every level l (in the binary case just true or false) and return the level with the largest probability.

80
Q

Why "naive" in naive Bayes?

A

Because we assume conditional independence between features.

81
Q

What is bagging for a random forest model?

A

Bagging (bootstrap aggregating) means making new datasets of the same size as the original by randomly picking data points from the original dataset with replacement, i.e. allowing duplicates.