AI Flashcards
what is an agent in a search problem?
An independent program or entity that interacts with its environment by perceiving its surroundings via sensors, then acting through actuators or effectors
what does it mean for a search problem to be observable?
The agent can perceive the complete state of the environment at any given time
what does it mean for a search problem to be discrete?
There are a finite number of actions for any state in the environment.
what does it mean for a search problem to be known?
the rules of the environment are known to the agent and the constraints and criteria are well-defined, so the agent can determine which state is reached by each action
what does it mean for a search problem to be deterministic?
Each action has exactly one outcome; there is no randomness in the transition model.
what are the main components of a search problem?
Initial state, actions, transition model (which state each action leads to from a given state), goal test (is a given state the goal or not) and path cost
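A minimal Python sketch of these components, assuming a generic interface (the class and method names are illustrative, not from the cards):

```python
class SearchProblem:
    """Illustrative container for the components of a search problem."""

    def __init__(self, initial_state):
        self.initial_state = initial_state       # initial state

    def actions(self, state):
        """The actions available in a given state."""
        raise NotImplementedError

    def result(self, state, action):
        """Transition model: the state reached by applying an action in a state."""
        raise NotImplementedError

    def is_goal(self, state):
        """Goal test: is this state the goal or not?"""
        raise NotImplementedError

    def step_cost(self, state, action, next_state):
        """One step's cost; the path cost is the sum of step costs along a path."""
        return 1
```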
what 4 characteristics help us evaluate the performance of a search algorithm?
completeness (if there is a solution, the algorithm will always find it), optimality (if there are many solutions, the algorithm will always find the best one), time complexity, space complexity
what are 2 slightly better variations of depth-first search?
DFS with less memory usage: delete fully explored subtrees from memory at runtime.
Depth-limited search: only search the top n layers of the tree (a sketch follows below).
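A minimal sketch of depth-limited search, assuming the illustrative SearchProblem interface above:

```python
def depth_limited_search(problem, limit):
    """Depth-first search that only explores the top `limit` layers of the tree."""
    def recurse(state, depth):
        if problem.is_goal(state):
            return [state]
        if depth == limit:                          # cut-off: do not go any deeper
            return None
        for action in problem.actions(state):
            child = problem.result(state, action)
            path = recurse(child, depth + 1)
            if path is not None:
                return [state] + path
        return None                                 # subtree fully explored, discard it
    return recurse(problem.initial_state, 0)
```

Because the recursion returns as soon as a subtree is fully explored, that subtree's nodes are no longer held in memory, which echoes the space-saving variation described above.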
what is best-first search?
an informed search algorithm that expands the node with the lowest estimated cost next, using an evaluation function f(n) that maps a node to a cost estimate. The evaluation function contains a heuristic, an estimate of how close the node is to the goal state.
what is a greedy best-first search?
a best-first search algorithm where f(n) = h(n): the node with the lowest heuristic value is always expanded next.
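A minimal sketch of greedy best-first search, assuming the illustrative SearchProblem interface from earlier and a heuristic function h(state):

```python
import heapq
from itertools import count

def greedy_best_first_search(problem, h):
    """Always expand the frontier node with the lowest heuristic value h(n)."""
    tie = count()                                   # tie-breaker so states are never compared
    frontier = [(h(problem.initial_state), next(tie), problem.initial_state)]
    came_from = {problem.initial_state: None}
    while frontier:
        _, _, state = heapq.heappop(frontier)       # node with the lowest h(n)
        if problem.is_goal(state):
            path = []
            while state is not None:                # trace parents back to the start
                path.append(state)
                state = came_from[state]
            return path[::-1]
        for action in problem.actions(state):
            child = problem.result(state, action)
            if child not in came_from:
                came_from[child] = state
                heapq.heappush(frontier, (h(child), next(tie), child))
    return None
```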
what is A* search algorithm?
a best-first search algorithm that keeps an open set of nodes that still need to be explored (initially just the starting node) and a closed set of nodes that have already been explored.
While the open set is not empty:
select the node of lowest cost from the open set (the current node) and move it to the closed set. If the current node is the goal, exit the loop. Otherwise generate its non-closed neighbours, calculate their costs and put them in the open set; if a neighbour is already in the open set, update it if the new cost is lower. If the open set becomes empty before the goal is reached, there is no path; otherwise trace back from the goal to find the path.
what is the evaluation function for the A* algorithm?
the A* algorithm uses the evaluation function f(n) = g(n) + h(n), where g(n) is the cost to reach the node and h(n) is the heuristic estimate of the cost from the node to the goal.
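A minimal A* sketch along the lines of the description above, again assuming the illustrative SearchProblem interface and a heuristic h(state):

```python
import heapq
from itertools import count

def a_star(problem, h):
    """A* search with an open set (to explore) and a closed set (already explored)."""
    start = problem.initial_state
    g = {start: 0}                                   # g(n): cost to reach each node
    came_from = {start: None}
    closed = set()
    tie = count()
    open_heap = [(h(start), next(tie), start)]       # ordered by f(n) = g(n) + h(n)
    while open_heap:
        _, _, current = heapq.heappop(open_heap)
        if current in closed:
            continue                                 # stale entry: a cheaper path was found later
        if problem.is_goal(current):
            path = []
            while current is not None:               # trace back to recover the path
                path.append(current)
                current = came_from[current]
            return path[::-1]
        closed.add(current)
        for action in problem.actions(current):
            child = problem.result(current, action)
            cost = g[current] + problem.step_cost(current, action, child)
            if child not in closed and (child not in g or cost < g[child]):
                g[child] = cost                      # keep the lower cost
                came_from[child] = current
                heapq.heappush(open_heap, (cost + h(child), next(tie), child))
    return None                                      # open set empty: no path exists
```

Rather than updating entries inside the heap, this sketch pushes duplicates and skips stale ones when they are popped, which behaves the same way as the "update if the cost is lower" step described above.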
How do you formulate an optimisation problem?
Design variable: a variable that represents a candidate solution to the problem, for example a list representing a path in a shortest-path problem.
Objective function: a function that takes the candidate solution (the design variable) as input and outputs its cost or quality.
Constraints: functions that must lie within a given range for a solution to be valid, such as f(x) < 100 where f(x) calculates distance and 100 is the maximum distance a lorry driver can travel in a day.
what is hill climbing?
an optimisation algorithm that starts from a random solution, then checks candidate solutions close to the current one (its neighbours). If any of them is higher quality, take it as the new current solution. Keep repeating until the current solution is the best out of its neighbours.
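A minimal sketch of hill climbing, assuming the problem supplies three illustrative functions: random_solution(), neighbours(solution) and quality(solution):

```python
def hill_climb(random_solution, neighbours, quality):
    """Move to the best neighbour until no neighbour improves on the current solution."""
    current = random_solution()                      # start from a random solution
    while True:
        best = max(neighbours(current), key=quality)
        if quality(best) <= quality(current):        # current is the best of its neighbours
            return current
        current = best                               # take the higher-quality neighbour
```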
what is the disadvantage of hill climbing?
the algorithm may find a local maximum rather than the global maximum. It can also get stuck on plateaus.
what is a greedy algorithm?
A greedy algorithm selects the best available option at each decision point without considering the overall future consequences or examining the entire search space.
what is simulated annealing?
A hill-climbing variation that compares a random neighbour to the current solution and accepts it as the new current solution with some probability even if it is lower quality (a higher-quality neighbour is always accepted). This probability decreases over time: it equals e^(ΔE/T), where ΔE is the quality of the neighbour minus that of the current solution and T is the temperature, which decreases each iteration by a factor such as 0.95. The algorithm terminates when a minimum T is reached or the current solution stops changing.
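A minimal simulated annealing sketch, assuming the same illustrative random_solution, random_neighbour and quality functions; the starting temperature, minimum temperature and cooling factor are placeholder values:

```python
import math
import random

def simulated_annealing(random_solution, random_neighbour, quality,
                        t_start=1.0, t_min=1e-3, cooling=0.95):
    """Sometimes accept a worse neighbour, with probability e^(dE/T) that shrinks as T falls."""
    current = random_solution()
    t = t_start
    while t > t_min:
        candidate = random_neighbour(current)
        delta_e = quality(candidate) - quality(current)    # neighbour quality - current quality
        if delta_e > 0 or random.random() < math.exp(delta_e / t):
            current = candidate                            # always accept an improvement
        t *= cooling                                       # temperature decays each iteration
    return current
```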
what must be defined in a hill-climbing problem?
Representation and initialisation: define the form a candidate solution takes and the boundaries it must stay within.
Define neighbours: give an algorithm that randomly generates a neighbour that still satisfies the constraints.
how does the gradient descent function work with a function of only one parameter f(x)?
the algorithm finds a local minimum of f(x) by starting at a point x0 and finding the gradient there. If the gradient is positive, the next value x1 is lower than x0 by an amount equal to the gradient times a constant called the learning rate, i.e. x1 = x0 - learning rate * f'(x0); if the gradient is negative, x1 moves in the opposite direction (the same formula applies). This is repeated until the step size (learning rate * gradient) becomes small enough.
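A minimal one-parameter gradient descent sketch; the example function, learning rate and tolerance below are illustrative choices, not from the cards:

```python
def gradient_descent_1d(f_prime, x0, learning_rate=0.1, tolerance=1e-6, max_iters=10_000):
    """Minimise f(x) given its derivative f_prime, starting from x0."""
    x = x0
    for _ in range(max_iters):
        step = learning_rate * f_prime(x)    # positive gradient -> x decreases, and vice versa
        x -= step
        if abs(step) < tolerance:            # stop once the step size gets small enough
            break
    return x

# Example: f(x) = (x - 3)^2 has derivative 2(x - 3); the minimum is at x = 3.
print(gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0))
```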
how does the gradient descent function work with functions of more than one parameter f(x,z,…)?
the gradient is a vector calculated by differentiating the function with respect to each variable x, z, …, giving [∂f(x,z,…)/∂x, ∂f(x,z,…)/∂z, …]. For example, the gradient vector for f(x,z) = x^4 - 3z is [4x^3, -3], and the new point (x1, z1) is (x0, z0) - α[4x0^3, -3], where α is the learning rate.
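A minimal multivariate sketch of the same update rule. The example function f(x, z) = x^4 + 3z^2 is chosen here because it actually has a minimum; the card's f(x, z) = x^4 - 3z illustrates how to compute a gradient vector but is unbounded, so descent on it would not converge:

```python
def gradient_descent(grad, point, learning_rate=0.1, tolerance=1e-6, max_iters=10_000):
    """Minimise a multivariate function given a function `grad` returning its gradient vector."""
    point = list(point)
    for _ in range(max_iters):
        g = grad(point)                               # [df/dx, df/dz, ...] at the current point
        step = [learning_rate * gi for gi in g]
        point = [p - s for p, s in zip(point, step)]  # (x1, z1) = (x0, z0) - alpha * gradient
        if all(abs(s) < tolerance for s in step):
            break
    return point

# f(x, z) = x^4 + 3z^2 has gradient [4x^3, 6z] and a minimum at (0, 0).
print(gradient_descent(lambda p: [4 * p[0] ** 3, 6 * p[1]], [1.0, 1.0]))
```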
how do you find an optimal linear regression line on bivariate data?
the loss function to be minimised is the sum of squared residuals (SSR), equal to Σ(actual - predicted)^2 = Σ(y - (mx + c))^2. This can be pictured as a 3-dimensional surface with m and c on the horizontal axes and the SSR on the vertical axis. The gradient vector is found by differentiating the SSR with respect to m and c. Then m and c are initialised (for example to 1 and 0) and refined using the gradient vector and a learning rate until the step size becomes small enough, at which point the fit is complete.
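A minimal sketch of fitting m and c by gradient descent on the sum of squared residuals; the learning rate, iteration count and sample data are illustrative:

```python
def fit_line(xs, ys, learning_rate=0.01, iters=5_000):
    """Fit y = m*x + c by gradient descent on the sum of squared residuals (SSR)."""
    m, c = 1.0, 0.0                                   # initial guesses, as in the card
    n = len(xs)
    for _ in range(iters):
        # Partial derivatives of SSR = sum((y - (m*x + c))^2) with respect to m and c.
        dm = sum(-2 * x * (y - (m * x + c)) for x, y in zip(xs, ys))
        dc = sum(-2 * (y - (m * x + c)) for x, y in zip(xs, ys))
        m -= learning_rate * dm / n                   # dividing by n keeps the steps stable
        c -= learning_rate * dc / n
    return m, c

# Example: points lying roughly on y = 2x + 1.
print(fit_line([0, 1, 2, 3, 4], [1.1, 2.9, 5.2, 7.1, 8.8]))
```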
what is the sigmoid function?
σ(x) = 1/(1 + e^(-x)), where x is a combination of the input features in the same number of dimensions as the data; for example, for data with two input features, x could be a + b*x1 + c*x2. The function outputs a number from 0 to 1 interpreted as the likelihood of the input having a certain characteristic (if the output is greater than 0.5 we predict 1, otherwise we predict 0). This is used for logistic regression, where 0 represents one group and 1 represents another: for example, if the data is tumours, 0 is benign and 1 is cancerous. The training points are tumours that are known to be cancerous or not and are plotted at y = 1 or y = 0.
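A minimal sketch of using the sigmoid for a logistic-regression style prediction; the weights, bias and feature values are made up for illustration:

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

def predict(features, weights, bias):
    """Predict class 1 if the sigmoid output exceeds 0.5, otherwise class 0."""
    x = bias + sum(w * f for w, f in zip(weights, features))   # e.g. a + b*x1 + c*x2
    probability = sigmoid(x)
    return (1 if probability > 0.5 else 0), probability

# Hypothetical tumour example: two features, illustrative weights and bias.
print(predict([2.5, 1.0], weights=[1.2, -0.7], bias=-1.5))
```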