Quick Fire Flashcards

1
Q

Batch Size

A

Batch size controls the accuracy of the estimate of the error gradient when training neural networks.
The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.

Batch Gradient Descent. Batch Size = Size of Training Set

Stochastic Gradient Descent. Batch Size = 1

Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set

The error gradient is a statistical estimate. The more training examples used in the estimate, the more accurate this estimate will be and the more likely that the weights of the network will be adjusted in a way that will improve the performance of the model. The improved estimate of the error gradient comes at the cost of having to use the model to make many more predictions before the estimate can be calculated, and in turn, the weights updated.

Alternately, using fewer examples results in a less accurate estimate of the error gradient that is highly dependent on the specific training examples used.
This results in a noisy estimate that, in turn, results in noisy updates to the model weights, e.g. many updates with perhaps quite different estimates of the error gradient. Nevertheless, these noisy updates can result in faster learning and sometimes a more robust model.

Smaller batch sizes are used for two main reasons:

Smaller batch sizes are noisy, offering a regularizing effect and lower generalization error.

Smaller batch sizes make it easier to fit one batch worth of training data in memory (i.e. when using a GPU).
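A minimal sketch of how batch size determines the update schedule, using a hypothetical NumPy training loop for linear regression (the data, learning rate, and function shape are illustrative assumptions, not from the card):

import numpy as np

# toy data: t = 1000 samples, 5 features (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
batch_size = 32          # 1 -> stochastic GD, 1000 -> batch GD, otherwise mini-batch
learning_rate = 0.01
epochs = 10

for epoch in range(epochs):
    order = rng.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient estimated on this batch only
        w -= learning_rate * grad                    # one weight update per batch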

2
Q

Epoch

A

A hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.

3
Q

Trade off between batch size and learning rate

A

There are three regimes of batch size: batch (b = size of training set), stochastic (b = 1), and mini-batch (1 < b < size of training set).

Let batch size = b, number of epochs = e, and size of training data = t.
Weight updates are made at the end of every batch's error calculation, so there are e*(t/b) weight updates in total. If b is 1, then for the same set of other parameters there are t*e updates. If b is t, then there are only e weight updates. For any mini-batch size the number of updates falls in between.

In an example run with full-batch training, the model converges at an accuracy of about 0.81 and the test set does slightly better; the loss curve isn't too noisy. If we set b = 1 with the same learning rate, model performance is poor and does not converge, and the loss curve shows a zigzag pattern.

This indicates that making the learning rate smaller would help with training. Doing so shows model performance hitting 0.81 in a quarter of the number of epochs we had originally, which makes sense given the quick calculations above, although the curve still looks noisy and has not converged as fully.

We can do something similar by holding the learning rate constant instead and searching for a mini-batch size b that makes the model learn quickly and well.

More noisy updates to the model require a smaller learning rate, whereas less noisy more accurate estimates of the error gradient may be applied to the model more liberally. We can summarize this as follows:

Batch Gradient Descent: Use a relatively larger learning rate and more training epochs.

Stochastic Gradient Descent: Use a relatively smaller learning rate and fewer training epochs.

Batch size is a slider on the learning process.

Small values give a learning process that converges quickly at the cost of noise in the training process.

Large values give a learning process that converges slowly with accurate estimates of the error gradient.
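A quick illustration of the update-count arithmetic above, using assumed values for t, e, and b:

import math

t, e = 1000, 10                       # training set size and epochs (assumed values)
for b in (1, 32, t):                  # SGD, mini-batch, and batch gradient descent
    updates = e * math.ceil(t / b)    # one weight update per batch, per epoch
    print(f"batch size {b:4}: {updates} weight updates")
# batch size    1: 10000 weight updates
# batch size   32: 320 weight updates
# batch size 1000: 10 weight updates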

4
Q

Batch Gradient Descent

A

Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.
One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.

Upsides

Fewer updates to the model means this variant of gradient descent is more computationally efficient than stochastic gradient descent.

The decreased update frequency results in a more stable error gradient and may result in a more stable convergence on some problems.

The separation of the calculation of prediction errors and the model update lends the algorithm to parallel processing based implementations.

Downsides

The more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.

The updates at the end of the training epoch require the additional complexity of accumulating prediction errors across all training examples.

Commonly, batch gradient descent is implemented in such a way that it requires the entire training dataset in memory and available to the algorithm.

Model updates, and in turn training speed, may become very slow for large datasets.

5
Q

Stochastic Gradient Descent

A

Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset.
The update of the model for each training example means that stochastic gradient descent is often called an online machine learning algorithm.

Upsides

The frequent updates immediately give an insight into the performance of the model and the rate of improvement.

This variant of gradient descent may be the simplest to understand and implement, especially for beginners.

The increased model update frequency can result in faster learning on some problems.

The noisy update process can allow the model to avoid local minima (e.g. premature convergence).

Downsides

Updating the model so frequently is more computationally expensive than other configurations of gradient descent, taking significantly longer to train models on large datasets.

The frequent updates can result in a noisy gradient signal, which may cause the model parameters and in turn the model error to jump around (have a higher variance over training epochs).

The noisy learning process down the error gradient can also make it hard for the algorithm to settle on an error minimum for the model.

6
Q

Mini batch Gradient Descent

A

Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.
Implementations may choose to sum the gradient over the mini-batch which further reduces the variance of the gradient.
Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.

Upsides

The model update frequency is higher than batch gradient descent which allows for a more robust convergence, avoiding local minima.

The batched updates provide a computationally more efficient process than stochastic gradient descent.

The batching allows both the efficiency of not having all training data in memory and the efficiency of batched algorithm implementations.

Downsides

Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning algorithm.

Error information must be accumulated across mini-batches of training examples like batch gradient descent.

7
Q

Gradient Descent

A

Gradient descent is an optimization algorithm often used for finding the weights or coefficients of machine learning algorithms, such as artificial neural networks and logistic regression.
It works by having the model make predictions on training data and using the error on the predictions to update the model in such a way as to reduce the error.
The goal of the algorithm is to find model parameters (e.g. coefficients or weights) that minimize the error of the model on the training dataset. It does this by making changes to the model that move it along a gradient or slope of errors down toward a minimum error value. This gives the algorithm its name of “gradient descent.”
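A minimal sketch of one gradient descent loop, assuming a mean-squared-error objective for linear regression (the names, data, and learning rate are illustrative):

import numpy as np

def gradient_descent(X, y, learning_rate=0.1, steps=500):
    """Fit weights w to minimize mean squared error ||Xw - y||^2 / n."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        error = X @ w - y                     # predictions minus targets
        grad = 2 * X.T @ error / len(y)       # slope of the error surface
        w -= learning_rate * grad             # step downhill along the gradient
    return w

# usage with toy data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5])
print(gradient_descent(X, y))                 # approximately [2.0, -1.0, 0.5]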

8
Q

K means clustering

A

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster.

Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS), i.e. the variance.

This is the same as maximizing the between-cluster sum of squares because total variance is constant

1. k initial "means" (e.g. k = 3) are randomly generated within the data domain.
2. k clusters are created by associating every observation with the nearest mean. The partitions represent the Voronoi diagram generated by the means.
3. The centroid of each of the k clusters becomes the new mean.
4. Repeat steps 2 and 3 until convergence or the maximum number of iterations has been reached.

The algorithm does not guarantee convergence to the global optimum; the result may depend on the initial clusters.

Look at formulas here…
https://en.m.wikipedia.org/wiki/K-means_clustering#:~:text=k%2Dmeans%20clustering%20is%20a,a%20prototype%20of%20the%20cluster.
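A minimal NumPy sketch of the Lloyd iteration described above (random initialization within the data domain; function and variable names are illustrative):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. random initial means within the data domain
    centers = rng.uniform(X.min(axis=0), X.max(axis=0), size=(k, X.shape[1]))
    for _ in range(iters):
        # 2. assign each observation to the nearest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. the centroid of each cluster becomes the new mean
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 4. stop when the means no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels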

9
Q

K means initialization

A

The Forgy method randomly chooses k observations from the dataset and uses these as the initial means. The Forgy method tends to spread the initial means out.

The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points. Random Partition tends to place all of the initial means close to the center of the data set.

Kmeans++

The first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.
The exact algorithm is as follows:

Choose one center uniformly at random among the data points.

For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.

Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².

Repeat steps 2 and 3 until k centers have been chosen.

Now that the initial centers have been chosen, proceed using standard k-means clustering.
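A minimal NumPy sketch of the k-means++ seeding steps above (illustrative code, not a specific library's implementation):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: first center uniformly at random
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # step 2: D(x) = distance to the nearest already-chosen center
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                                   axis=2) ** 2, axis=1)
        # step 3: pick a new center with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)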

10
Q

Drawbacks kmeans

A

- Choosing k manually. Mitigate with the elbow method, an information criterion, or the silhouette score.

- Being dependent on initial values.
For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means to pick better values of the initial centroids (called k-means seeding). For a full discussion of k-means seeding see "A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm" by M. Emre Celebi, Hassan A. Kingravi, Patricio A. Vela.

- Clustering data of varying sizes and density.
k-means has trouble clustering data where clusters are of varying sizes and density; it assumes a roughly spherical distribution. To cluster such data you need to generalize k-means, e.g. by allowing different cluster widths, or different widths per dimension, which gives elliptical instead of spherical clusters and improves the result.
http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Slides/clustering.pdf

k-means assumes that clusters are convex shaped.

-Clustering outliers.
Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.

-Scaling with number of dimensions.
As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Reduce dimensionality either by using PCA on the feature data, or by using "spectral clustering" to modify the clustering algorithm as explained below.

https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages#:~:text=Disadvantages%20of%20k%2Dmeans&text=As%20increases%2C%20you%20need%20advanced,Means%20Clustering%20Algorithm%20by%20M.

11
Q

Spectral clustering

A

Spectral clustering avoids the curse of dimensionality by adding a pre-clustering step to your algorithm:

Reduce the dimensionality of feature data by using PCA.

Project all data points into the lower-dimensional subspace.

Cluster the data in this subspace by using your chosen algorithm.
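A minimal sketch of this pre-clustering recipe with scikit-learn, assuming k-means as the chosen algorithm and an arbitrary number of PCA components (both are assumptions, not values from the card):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 50))            # toy high-dimensional data

X_low = PCA(n_components=5).fit_transform(X)                   # steps 1-2: reduce and project
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X_low)    # step 3: cluster in the subspace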

12
Q

Mini batch k means

A

Its main idea is to use small random batches of examples of a fixed size so they can be stored in memory. Each iteration, a new random sample from the dataset is obtained and used to update the clusters, and this is repeated until convergence. Each mini-batch updates the clusters using a convex combination of the values of the prototypes and the examples, applying a learning rate that decreases with the number of iterations. This learning rate is the inverse of the number of examples assigned to a cluster during the process. As the number of iterations increases, the effect of new examples is reduced, so convergence can be detected when no changes in the clusters occur in several consecutive iterations.

Algorithm 1 Mini-Batch K-Means algorithm
Given: k, mini-batch size b, iterations t, data set X
Initialize each c ∈ C with an x picked randomly from X
v ← 0
for i ← 1 to t do
  M ← b examples picked randomly from X
  for x ∈ M do
    d[x] ← f(C, x) … cache the center in C nearest to x
  end
  for x ∈ M do
    c ← d[x] … retrieve the cached nearest center for x
    v[c] ← v[c] + 1 … update the per-center count
    η ← 1/v[c] … learning rate is the inverse of the center's count, placing more importance on early samples
    c ← (1 − η)c + ηx … convex update of the cluster center toward x
  end
end

https://www.google.com/url?sa=t&source=web&rct=j&url=https://upcommons.upc.edu/bitstream/handle/2117/23414/R13-8.pdf&ved=2ahUKEwi91Zy298zvAhWSFlkFHQZ8BGIQFjANegQIDBAC&usg=AOvVaw3bEHzPtTrItiyfmfv-VCWm&cshid=1616733583017
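For reference, scikit-learn ships this algorithm as MiniBatchKMeans; a minimal usage sketch (parameter values are illustrative):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.default_rng(0).normal(size=(10000, 8))

mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, max_iter=100, n_init=3)
labels = mbk.fit_predict(X)          # centers updated from random mini-batches
print(mbk.cluster_centers_.shape)    # (5, 8)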

13
Q

Dbscan

A

Density based clustering

The algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure), and a set of non-core samples that are close to a core sample (but are not themselves core samples).

Let ε be a parameter specifying the radius of a neighborhood with respect to some point. MinPts is the minimum number of points that must be in that radius for point p to be considered a core point (the minimum number of points required to form a dense region).

A point p is a core point if at least minPts points are within distance ε of it (including p).

A point q is directly reachable from p if point q is within distance ε from core point p. Points are only said to be directly reachable from core points.

A point q is reachable from p if there is a path p1, …, pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi. Note that this implies that the initial point and all points on the path must be core points, with the possible exception of q.

All points not reachable from any other point are outliers or noise points.

Now if p is a core point, then it forms a cluster together with all points (core or non-core) that are reachable from it. Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its "edge", since they cannot be used to reach more points. See diagram here https://en.m.wikipedia.org/wiki/DBSCAN

Only core points can reach non-core points. Referring to the diagram: if a point lies within radius ε of a yellow (non-core) point but does not fall within the radius of any core point, it is not reachable by the definition above and is therefore an outlier too. So a non-core point may be reachable, but nothing can be reached from it.

DBSCAN algorithm can be abstracted into the following steps:[4]

Find the points in the ε (eps) neighborhood of every point, and identify the core points with more than minPts neighbors.

Find the connected components of core points on the neighbor graph, ignoring all non-core points.

Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor, otherwise assign it to noise.

A naive implementation of this requires storing the neighborhoods in step 1, thus requiring substantial memory. The original DBSCAN algorithm does not require this by performing these steps for one point at a time.
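A minimal scikit-learn usage sketch (the eps and min_samples values are illustrative and should be estimated from the data):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(300, 2))

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                        # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True  # True for core samples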

14
Q

Dbscan advantages and disadvantages

A

Advantages
DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means.

DBSCAN can find arbitrarily-shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.

DBSCAN has a notion of noise, and is robust to outliers.

DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.)

DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an R* tree.

The parameters minPts and ε can be set by a domain expert, if the data is well understood

Disadvantages
DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data are processed. For most data sets and domains, this situation does not arise often and has little impact on the clustering result:[4] both on core points and noise points, DBSCAN is deterministic. DBSCAN*[8] is a variation that treats border points as noise, and this way achieves a fully deterministic result as well as a more consistent statistical interpretation of density-connected components.

The quality of DBSCAN depends on the distance measure used in the function regionQuery(P, ε). The most common distance metric used is Euclidean distance. Especially for high-dimensional data, this metric can be rendered almost useless due to the so-called "curse of dimensionality", making it difficult to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on Euclidean distance.

DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters.[9]

If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.

15
Q

Parameter estimation for DBSCAN

A

MinPts: as a rule of thumb, minPts ≥ D + 1, where D is the dimensionality of the data; larger values (e.g. minPts = 2·D) are often better for noisy data.

ε: plot the distance to the k-th nearest neighbor (k = minPts) of every point, sorted in descending order (a k-distance graph), and choose ε at the "elbow" of this plot.

The distance function should also be chosen to suit the data, since it effectively defines the density.

16
Q

Parameter estimation for kmeans

A

Elbow method

Silhouette score or other metrics such as the Davies-Bouldin index…

17
Q

Affinity propagation

A

A clustering algorithm based on the concept of "message passing" between data points.

Similar to k-medoids, affinity propagation finds "exemplars," members of the input set that are representative of clusters.

Let x1 through xn be a set of data points, with no assumptions made about their internal structure, and let s be a function that quantifies the similarity between any two points, such that s(i, j) > s(i, k) iff xi is more similar to xj than to xk. For this example, the negative squared distance of two data points was used.

The diagonal of s, i.e. s(i, i), is particularly important, as it represents the instance preference, meaning how likely a particular instance is to become an exemplar.

The algorithm proceeds by alternating between two message-passing steps, which update two matrices:[1]

The "responsibility" matrix R has values r(i, k) that quantify how well-suited xk is to serve as the exemplar for xi, relative to other candidate exemplars for xi.

The "availability" matrix A contains values a(i, k) that represent how "appropriate" it would be for xi to pick xk as its exemplar, taking into account other points' preference for xk as an exemplar.

Both matrices are initialized to all zeroes, and can be viewed as log-probability tables. Iterations are performed until either the cluster boundaries remain unchanged over a number of iterations, or some predetermined number (of iterations) is reached. The exemplars are extracted from the final matrices as those whose 'responsibility + availability' for themselves is positive.

The main drawback of affinity propagation is its complexity. The algorithm has a time complexity of the order O(N²T), where N is the number of samples and T is the number of iterations until convergence. Further, the memory complexity is of the order O(N²) if a dense similarity matrix is used.

The damping parameter is to avoid oscillations during the iterative update

Preferences for each point - points with larger values of preferences are more likely to be chosen as exemplars. The number of exemplars, ie of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities.

it does not enforce equal-size clusters, and it can choose automatically the number of clusters from the data
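A minimal scikit-learn usage sketch (the damping value is illustrative; preference defaults to the median similarity as described above):

import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.random.default_rng(0).normal(size=(200, 2))

ap = AffinityPropagation(damping=0.9, preference=None, random_state=0).fit(X)
exemplars = ap.cluster_centers_indices_   # indices of the chosen exemplars
labels = ap.labels_
print(len(exemplars), "clusters found")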

18
Q

Mean shift clustering

A

Mean shift clustering aims to discover blobs in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region using the bandwidth parameter. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.

Mean shift is a procedure for locating the maxima of a density function given discrete data sampled from that function.

This is an iterative method that starts with an initial estimate x. Let a kernel function K(xi − x) be given; this function determines the weight of nearby points for re-estimation of the mean. Often a Gaussian kernel is used, K(xi − x) = exp(−c‖xi − x‖²). The mean is then estimated around x using its neighborhood N(x), the set of points for which K(xi − x) ≠ 0:

m(x) = Σ over xi ∈ N(x) of K(xi − x) xi / Σ over xi ∈ N(x) of K(xi − x)

The algorithm then sets x ← m(x) and repeats the estimate until m(x) converges or the maximum number of iterations is reached.

Procedure
To explain mean shift, consider a set of points in two-dimensional space. We begin with a circular sliding window centered at a point C (randomly selected) and having radius r as the kernel. Mean shift is a hill-climbing algorithm that involves shifting this kernel iteratively to a higher density region on each step until convergence.

At every iteration, the sliding window is shifted towards regions of higher density by shifting the center point to the mean of the points within the window (hence the name). The density within the sliding window is proportional to the number of points inside it. Naturally, by shifting to the mean of the points in the window it will gradually move towards areas of higher point density.

We continue shifting the sliding window according to the mean until there is no direction in which a shift can accommodate more points inside the kernel; that is, we keep moving the circle until we are no longer increasing the density (i.e. the number of points in the window).

This process of steps 1 to 3 is done with many sliding windows until all points lie within a window. When multiple sliding windows overlap the window containing the most points is preserved. The data points are then clustered according to the sliding window in which they reside.

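A minimal scikit-learn usage sketch; estimate_bandwidth is one common way to pick the single bandwidth parameter (the quantile value is illustrative):

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.random.default_rng(0).normal(size=(400, 2))

bandwidth = estimate_bandwidth(X, quantile=0.2)      # data-driven guess for the window size
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
print(ms.cluster_centers_, ms.labels_[:10])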

19
Q

Mean shift clustering advantages and disadvantages

A

Strengths
Mean shift is an application-independent tool suitable for real data analysis.

Does not assume any predefined shape on data clusters.

It is capable of handling arbitrary feature spaces.

The procedure relies on choice of a single parameter: bandwidth.

The bandwidth/window size ‘h’ has a physical meaning

In contrast to K-means clustering, there is no need to select the number of clusters as mean-shift automatically discovers this. That’s a massive advantage. The fact that the cluster centers converge towards the points of maximum density is also quite desirable as it is quite intuitive to understand and fits well in a naturally data-driven sense. The drawback is that the selection of the window size/radius “r” can be non-trivial

Weaknesses
The selection of a window size is not trivial.

Inappropriate window size can cause modes to be merged, or generate additional “shallow” modes.

Often requires using adaptive window size

Outliers are still given some cluster label anyway.

20
Q

Hierarchical clustering

A

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

Agglomerative is a bottom up approach. Divisive is top down.

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

Linkage
The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.

Maximum or complete-linkage clustering: max { d(a, b) : a ∈ A, b ∈ B }, used as the distance between clusters A and B when deciding whether to merge/split.
Minimum or single-linkage clustering
Unweighted average linkage clustering (or UPGMA)
Weighted average linkage clustering (or WPGMA)
Centroid linkage clustering
Minimum energy clustering

Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row, j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. On each iteration, we combine two clusters into one. The two clusters to be combined are selected as those with the smallest linkage value, i.e. according to our selected distance metric and linkage these two clusters have the smallest distance between each other and therefore are the most similar and should be combined.

Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.

Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.

Average linkage minimizes the average of the distances between all observations of pairs of clusters.

Single linkage minimizes the distance between the closest observations of pairs of clusters.

Agglomerative clustering has a "rich get richer" behavior that leads to uneven cluster sizes. In this regard, single linkage is the worst strategy, and Ward gives the most regular sizes. However, the affinity (or distance used in clustering) cannot be varied with Ward, so for non-Euclidean metrics average linkage is a good alternative. Single linkage, while not robust to noisy data, can be computed very efficiently and can therefore be useful to provide hierarchical clustering of larger datasets. Single linkage can also perform well on non-globular data.

Connectivity constraints
Connectivity constraints can be added to this algorithm (only adjacent clusters can be merged together), through a connectivity matrix that defines for each sample the neighboring samples following a given structure of the data. For instance, in the classic swiss-roll example, the connectivity constraints forbid the merging of points that are not adjacent on the swiss roll, and thus avoid forming clusters that extend across overlapping folds of the roll. This a priori information can also improve speed. Connectivity constraints with single, complete or average linkage can enhance the 'rich getting richer' aspect of agglomerative clustering.

Notes on distance metric
L1 distance is often good for sparse features, or sparse noise: i.e. many of the features are zero, as in text mining using occurrences of rare words.

Cosine distance is interesting because it is invariant to global scalings of the signal.

The guidelines for choosing a metric is to use one that maximizes the distance between samples in different classes, and minimizes that within each class
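A minimal sketch of agglomerative clustering plus a dendrogram, using scikit-learn and SciPy (the linkage choices and number of clusters are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(50, 4))

# flat clustering with Ward linkage (Ward requires the Euclidean metric)
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# full hierarchy, visualized as a dendrogram
Z = linkage(X, method="average")   # other options: "single", "complete", "ward"
dendrogram(Z)
plt.show()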

21
Q

Hierarchical clustering advantages and disadvantages

A

Don’t need to specify number of clusters

We can use the dendrogram to visually identify the clusters based on our choice of linkage and metric.

A particularly good use case of hierarchical clustering methods is when the underlying data has a hierarchical structure and you want to recover the hierarchy.

Computation time is O(n³), much worse than k-means.

22
Q

Optics clustering

A

Ordering points to identify the clustering structure
The basic idea is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density.

OPTICS requires two parameters: ε, which describes the maximum distance (radius) to consider, and MinPts, describing the number of points required to form a cluster. A point p is a core point if at least MinPts points are found within its ε-neighborhood Nε(p) (including point p itself).

OPTICS also considers points that are part of a more densely packed cluster, so each point is assigned a core distance that describes the distance to the MinPts-th closest point: core-dist(p) is undefined if p is not a core point, and otherwise equals the distance from p to its MinPts-th nearest neighbor. The reachability distance of a point o from p is then max(core-dist(p), dist(p, o)), undefined if p is not a core point. See the reachability-distance and core-distance definitions on Wikipedia.

OPTICS is a generalization of DBSCAN that relaxes the eps requirement from a single value to a value range. The OPTICS algorithm builds a reachability graph, which assigns each sample both a reachability_ distance and a spot within the cluster ordering_ attribute; these two attributes are assigned when the model is fitted, and are used to determine cluster membership.

The HDBSCAN implementation is multithreaded, and has better algorithmic runtime complexity than OPTICS, at the cost of worse memory scaling. For extremely large datasets that exhaust system memory using HDBSCAN, OPTICS will maintain O(n) (as opposed to O(n²)) memory scaling.

Unlike DBSCAN, OPTICS keeps the cluster hierarchy for a variable neighborhood radius, and is better suited for usage on large datasets than the current sklearn implementation of DBSCAN.
Clusters are then extracted using a DBSCAN-like method (cluster_method='dbscan') or an automatic technique proposed in the original OPTICS paper.

The default cluster extraction with OPTICS looks at the steep slopes within the graph to find clusters, and the user can define what counts as a steep slope using the parameter xi. There are also other possibilities for analysis on the graph itself, such as generating hierarchical representations of the data through reachability-plot dendrograms; the hierarchy of clusters detected by the algorithm can be accessed through the cluster_hierarchy_ attribute.
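A minimal scikit-learn usage sketch (the min_samples and xi values are illustrative):

import numpy as np
from sklearn.cluster import OPTICS

X = np.random.default_rng(0).normal(size=(300, 2))

opt = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.05).fit(X)
labels = opt.labels_                               # -1 marks noise, as in DBSCAN
reachability = opt.reachability_[opt.ordering_]    # values for a reachability plot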

23
Q

Clustering performance evaluation- Rand Index

A

Rand index

  • A function that measures the similarity of two assignments, ignoring permutations.
  • The two assignments are labels_true and labels_pred for each sample.
  • The Rand index does not ensure a value close to 0.0 for a random labelling. The adjusted Rand index corrects for chance and will give such a baseline.
  • Furthermore, both rand_score and adjusted_rand_score are symmetric: swapping the arguments does not change the scores. They can thus be used as consensus measures.
  • Perfect labeling is scored 1.0.
  • Poorly agreeing labels (e.g. independent labelings) have lower scores, and for the adjusted Rand index the score will be negative or close to zero. However, for the unadjusted Rand index the score, while lower, will not necessarily be close to zero.

Adjusted RI = (RI − E[RI]) / (max(RI) − E[RI])

Let there be two partitions (clusterings) X and Y of a set S of n elements, and define:

a, the number of pairs of elements in S that are in the same subset in X and in the same subset in Y

b, the number of pairs of elements in S that are in different subsets in X and in different subsets in Y

c, the number of pairs of elements in S that are in the same subset in X and in different subsets in Y

d, the number of pairs of elements in S that are in different subsets in X and in the same subset in Y

The Rand index, R, is:[1][2]
RI = (a + b) / (a + b + c + d) = (a + b) / C(n, 2)

The adjusted Rand index is normalized to take chance into account.

See example:

https://davetang.org/muse/2017/09/21/the-rand-index/
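A minimal usage sketch of the scikit-learn functions named above (the label values are illustrative):

from sklearn.metrics import rand_score, adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print(rand_score(labels_true, labels_pred))            # unadjusted RI, in [0, 1]
print(adjusted_rand_score(labels_true, labels_pred))   # ARI, ~0 for random labelings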

24
Q

Clustering performance evaluation- mutual information score

A

Basically it is the expected information gain. The expected value of the information gain is the mutual information I(X; A) of X and A – i.e. the reduction in the entropy of X achieved by learning the state of the random variable A.

The mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable.

Mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other.

MI(X; Y) = D_KL( P(X, Y) ‖ P(X) ⊗ P(Y) )

For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X is a deterministic function of Y and Y is a deterministic function of X, then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa. As a result, in this case the mutual information is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X). Moreover, this mutual information is the same as the entropy of X and as the entropy of Y.

Adjusted MI (AMI) is additionally normalized against chance.

Advantages

Random (uniform) label assignments have an AMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw mutual information or the V-measure, for instance).

Upper bound of 1: Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, an AMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).

Drawbacks

Contrary to inertia, MI-based measures require knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However MI-based measures can also be useful in purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.

NMI and MI are not adjusted against chance.

Mutual information and also the normalized variant are not adjusted for chance and will tend to increase as the number of different labels (clusters) increases, regardless of the actual amount of "mutual information" between the label assignments.
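A minimal usage sketch of the scikit-learn scores discussed here (the label values are illustrative):

from sklearn.metrics import (mutual_info_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

print(mutual_info_score(labels_true, labels_pred))             # raw MI, in nats
print(normalized_mutual_info_score(labels_true, labels_pred))  # NMI, in [0, 1], not chance-adjusted
print(adjusted_mutual_info_score(labels_true, labels_pred))    # AMI, ~0 for random labelings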

25
Q

Entropy

A

The entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent in the variable's possible outcomes.

  • H(X) = −Σ over all x in X of P(x) log P(x)

The basic idea of information theory is that the “informational value” of a communicated message depends on the degree to which the content of the message is surprising. If an event is very probable, it is no surprise (and generally uninteresting) when that event happens as expected; hence transmission of such a message carries very little new information. However, if an event is unlikely to occur, it is much more informative to learn that the event happened or will happen

Cool properties
I(p)ismonotonically decreasinginp: an increase in the probability of an event decreases the information from an observed event, and vice versa.

I(p) ≥ 0: information is anon-negativequantity.

I(1) = 0: events that always occur do not communicate information.

I(p1,p2) = I(p1) + I(p2): the information learned fromindependent eventsis the sum of the information learned from each event.

Conditional Entropy
The conditional entropy quantifies the amount of information needed to describe the outcome of a random variable Y given that the value of another random variable X is known.

H(Y | X) = −Σ over all values of x and y of p(x, y) log( p(x, y) / p(x) )
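A small sketch computing entropy from a discrete distribution, with scipy.stats.entropy as a cross-check (the probabilities are illustrative):

import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.25, 0.125, 0.125])        # a discrete distribution over 4 outcomes

H = -np.sum(p * np.log2(p))                    # H(X) = -sum p(x) log2 p(x), in bits
print(H)                                       # 1.75 bits
print(entropy(p, base=2))                      # same value via SciPy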

26
Q

Marginal probability

A

Given a known joint distribution of two discrete random variables, say X and Y, the marginal distribution of either variable – X for example – is the probability distribution of X when the values of Y are not taken into consideration. This can be calculated by summing the joint probability distribution over all values of Y. Naturally, the converse is also true: the marginal distribution can be obtained for Y by summing over the separate values of X.

p(x_i) = Σ over j = 1 to J of p(x_i, y_j), with x_i fixed
p(y_j) = Σ over k = 1 to K of p(x_k, y_j), with y_j fixed

27
Q

Joint probability

A

The joint probability mass function of two discrete random variables X and Y:
p(x, y) = P(X=x and Y=y), the probability that X is x and Y is y at the same time

Using conditional distributions:
p(x, y) = P(X=x | Y=y) × P(Y=y), the probability that X is x given Y is y, times the marginal distribution of Y

or

p(x, y) = P(Y=y | X=x) × P(X=x), the probability that Y is y given X is x, times the marginal distribution of X

For independent variables
p(x,y)=P(X=x)P(Y=y)

This means that acquiring any information about the value of one or more of the random variables leads to a conditional distribution of any other variable that is identical to its unconditional (marginal) distribution; thus no variable provides any information about any other variable.

28
Q

Difference betweenmarginal and conditional probability

A

The marginal probability is the probability of a single event occurring, independent of other events. A conditional probability, on the other hand, is the probability that an event occurs given that another specific event has already occurred. This means that the calculation for one variable is dependent on another variable.

P(Y=y | X=x) = p(x, y) / P(X=x)

Example, you had a table like this: https://en.m.wikipedia.org/wiki/Marginal_distribution

Suppose there is data from a classroom of 200 students on the amount of time studied (X) and the percent correct (Y).[4] Assuming that X and Y are discrete random variables, the joint distribution of X and Y can be described by listing all the possible values of p(xi, yj).

The conditional distribution can be used to determine the probability that a student scored 20 or below given that they studied for 60 minutes or more,

while the marginal distribution can be used to determine the proportion of students that scored 20 or below, regardless of study time.

See these images:
https://courses.cs.cornell.edu/cs2800/wiki/index.php/Conditional_probability
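A tiny worked example with an assumed 2×2 joint probability table (the values are made up for illustration):

import numpy as np

# joint = P(X=x, Y=y); rows are values of X, columns are values of Y
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

marginal_X = joint.sum(axis=1)           # P(X=x), summing over Y -> [0.4, 0.6]
marginal_Y = joint.sum(axis=0)           # P(Y=y), summing over X -> [0.3, 0.7]

# conditional P(Y=y | X=x) = P(X=x, Y=y) / P(X=x), computed row-wise
cond_Y_given_X = joint / marginal_X[:, None]
print(cond_Y_given_X)                    # each row sums to 1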

29
Q

KL Divergence

A

Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution is different from a second, reference probability distribution.

D_KL( P ‖ Q ) = Σ over all x of P(x) log( P(x) / Q(x) )

In the context of machine learning, D_KL(P ‖ Q) is often called the information gain achieved if P would be used instead of Q which is currently used. By analogy with information theory, it is called the relative entropy of P with respect to Q. In the context of coding theory, D_KL(P ‖ Q) can be constructed by measuring the expected number of extra bits required to code samples from P using a code optimized for Q rather than the code optimized for P.
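A small sketch computing the KL divergence of two assumed discrete distributions, cross-checked with scipy.stats.entropy (which returns the KL divergence when given two distributions):

import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.3, 0.2])     # "true" distribution P
q = np.array([0.4, 0.4, 0.2])     # reference distribution Q

kl = np.sum(p * np.log(p / q))    # D_KL(P || Q), in nats
print(kl)
print(entropy(p, q))              # same value via SciPy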

30
Q

Information gain and gini impurity

A

Synonymous with KL divergence or mutual information, depending on context.

The amount of information gained about a random variable or signal from observing another random variable. However, in the context of decision trees, the term is sometimes used synonymously with mutual information, which is the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one.

In general terms, the expected information gain is the change in information entropy H from a prior state to a state that takes some information as given:

IG(T, a) = H(T) − H(T | a)
H(T | a) is the conditional entropy of T given attribute a.
The expected information gain is the mutual information, meaning that on average, the reduction in the entropy of T is the mutual information.

Drawbacks - favors high-cardinality features. Example: information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. This attribute has a high mutual information, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before (overfitting).

https://en.m.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. The Gini impurity can be computed by summing the probability p_i of an item with label i being chosen times the probability Σ over k ≠ i of p_k = 1 − p_i of a mistake in categorizing that item: Gini = Σ_i p_i (1 − p_i) = 1 − Σ_i p_i².

How Gini impurity is used to decide splits in a decision tree: the goal is to reduce impurity (a measure of uncertainty, like entropy) at every split, i.e. you want to choose the feature and split with the highest information gain (the largest impurity decrease).

https://www.displayr.com/how-is-splitting-decided-for-decision-trees/
https://en.m.wikipedia.org/wiki/Information_gain_in_decision_trees (see example)
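A small sketch of Gini impurity and the impurity decrease for a candidate split, written from the formulas above (the label arrays are illustrative):

import numpy as np

def gini(labels):
    """Gini impurity = 1 - sum_i p_i^2 over the label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Weighted reduction in Gini impurity achieved by a split."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]                           # a perfect split on some feature
print(gini(parent), impurity_decrease(parent, left, right))    # 0.5, 0.5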

31
Q

Random forests and ensemble learning

A

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean/average prediction (regression) of the individual trees.

Decision tree
Decision trees are a popular method for various machine learning tasks. Tree learning "come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".

Ensemble learning
Bagging
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, …, xn with responses Y = y1, …, yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:
For b = 1, …, B:

Sample, with replacement, n training examples from X, Y; call these Xb, Yb.

Train a classification or regression tree fb on Xb, Yb.

After training, predictions for unseen samples x′ can be made by averaging the predictions from all the individual regression trees on x′:
f̂(x′) = (1/B) Σ from b = 1 to B of fb(x′)
or by taking the majority vote in the case of classification trees.
This bootstrapping procedure leads to better model performance because it decreases thevarianceof the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

RF also does "feature bagging", which selects, at each candidate split in the learning process, a random subset of the features. The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated.
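A minimal scikit-learn usage sketch (the hyperparameter values and toy dataset are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,       # B, the number of bagged trees
    max_features="sqrt",    # feature bagging: random subset of features per split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))       # training accuracy (use a held-out set in practice)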

32
Q

Clustering performance evaluation- v measure

A
Given the knowledge of the ground truth class assignments of the samples, it is possible to define some intuitive metric using conditional entropy analysis.
In particular Rosenberg and Hirschberg (2007) define the following two desirable objectives for any cluster assignment:

homogeneity: each cluster contains only members of a single class.
completeness: all members of a given class are assigned to the same cluster.

Homogeneity and completeness scores are formally given by:

h=1−H(C|K)/H(C)

c=1−H(K|C)/H(K)

where H(C|K) is the conditional entropy of the classes given the cluster assignments and is given by:

H(C|K) = −Σ from c = 1 to |C|, Σ from k = 1 to |K| of (n_ck / n) × log( n_ck / n_k )

and H(C) is the entropy of the classes and is given by:

H(C) = −Σ from c = 1 to |C| of (n_c / n) × log( n_c / n )

with n the total number of samples, n_c and n_k the number of samples respectively belonging to class c and cluster k, and finally n_ck the number of samples from class c assigned to cluster k.
The conditional entropy of clusters given class, H(K|C), and the entropy of clusters, H(K), are defined in a symmetric manner.

The V-measure is the harmonic mean of homogeneity and completeness, and it is computed very similarly to NMI.

v_measure_score is symmetric: it can be used to evaluate the agreement of two independent assignments on the same dataset.
This is not the case for completeness_score and homogeneity_score: both are bound by the relationship:

homogeneity_score(a, b) == completeness_score(b, a)

Advantages

Bounded scores: 0.0 is as bad as it can be, 1.0 is a perfect score.

Intuitive interpretation: clustering with a bad V-measure can be qualitatively analyzed in terms of homogeneity and completeness to better feel what 'kind' of mistakes is done by the assignment.

No assumption is made on the cluster structure: it can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with "folded" shapes.

Drawbacks

The previously introduced metrics are not normalized with regard to random labeling: this means that depending on the number of samples, clusters and ground truth classes, a completely random labeling will not always yield the same values for homogeneity, completeness and hence V-measure. In particular, random labeling won't yield zero scores, especially when the number of clusters is large.
This problem can safely be ignored when the number of samples is more than a thousand and the number of clusters is less than 10. For smaller sample sizes or larger numbers of clusters it is safer to use an adjusted index such as the Adjusted Rand Index (ARI).
Also requires knowledge of ground truth classes
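A minimal usage sketch of the scikit-learn functions named above (the label values are illustrative):

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 1, 1, 2]

h = homogeneity_score(labels_true, labels_pred)    # does each cluster hold a single class?
c = completeness_score(labels_true, labels_pred)   # does each class sit in a single cluster?
v = v_measure_score(labels_true, labels_pred)      # harmonic mean of h and c
print(h, c, v)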

33
Q

Cluster evaluation measure -silhouette score

A

Used when the ground truth is not known; the evaluation is performed using the model itself.

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring cluster

A higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:

a: the mean distance between a sample and all other points in the same class.
b: the mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient s for a single sample is then given as:

s=(b−a)/max(a,b)

Advantages

The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.

The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.

Drawbacks

The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.
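A minimal usage sketch, scoring a k-means clustering with scikit-learn (the data and k are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(300, 4))
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

print(silhouette_score(X, labels))   # mean coefficient over samples, in [-1, 1]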

34
Q

Cluster evaluation metric - Calinski-Harabasz index

A

If the ground truth labels are not known, the Calinski-Harabasz index (sklearn.metrics.calinski_harabasz_score) - also known as the Variance Ratio Criterion - can be used to evaluate the model, where a higher Calinski-Harabasz score relates to a model with better defined clusters.
The index is the ratio of the sum of between-clusters dispersion and of within-cluster dispersion for all clusters (where dispersion is defined as the sum of distances squared):

Advantages

The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.

The score is fast to compute.

Drawbacks

The Calinski-Harabasz index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.

Mathematical formulation

For a set of data E of size n_E which has been clustered into k clusters, the Calinski-Harabasz score s is defined as the ratio of the between-clusters dispersion mean and the within-cluster dispersion:

s = ( tr(B_k) / tr(W_k) ) × ( (n_E − k) / (k − 1) )

where tr(B_k) is the trace of the between-group dispersion matrix and tr(W_k) is the trace of the within-cluster dispersion matrix, defined by:

W_k = Σ from q = 1 to k, Σ over x ∈ C_q of (x − c_q)(x − c_q)^T

B_k = Σ from q = 1 to k of n_q (c_q − c_E)(c_q − c_E)^T

with C_q the set of points in cluster q, c_q the center of cluster q, c_E the center of E, and n_q the number of points in cluster q.
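A minimal usage sketch with the scikit-learn function named above (the data and k are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.default_rng(0).normal(size=(500, 4))
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)

print(calinski_harabasz_score(X, labels))   # higher means better-defined clusters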

35
Q

Other Cluster evaluation

A

Contingency matrix. It reports the cardinality of each class with respect to each cluster.
            CLUSTER 1   CLUSTER 2
CLASS A         2           0
CLASS B         4           3

This says that cluster 1 has 2 samples in class A and 4 in class B.
The contingency matrix provides sufficient statistics for all clustering metrics where the samples are independent and identically distributed and one doesn’t need to account for some instances not being clustered.

Comparison-pair matrix…similar to Rand index but in matrix form.
pair confusion matrix (sklearn.metrics.cluster.pair_confusion_matrix) is a 2x2 similarity matrix

C=[C00 C01
C10 C11]

between two clusterings computed by considering all pairs of samples and counting pairs that are assigned into the same or into different clusters under the true and predicted clusterings.
It has the following entries:

C00: number of pairs with both clusterings having the samples not clustered together
C10: number of pairs with the true label clustering having the samples clustered together but the other clustering not having the samples clustered together
C01: number of pairs with the true label clustering not having the samples clustered together but the other clustering having the samples clustered together
C11: number of pairs with both clusterings having the samples clustered together

Considering a pair of samples that is clustered together a positive pair, then as in binary classification the count of true negatives is C00, false negatives is C10, true positives is C11, and false positives is C01.
Perfectly matching labelings have all non-zero entries on the diagonal regardless of actual label values
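A minimal usage sketch of both matrices with scikit-learn (the label values are illustrative):

from sklearn.metrics.cluster import contingency_matrix, pair_confusion_matrix

labels_true = [0, 0, 1, 1, 1]
labels_pred = [0, 0, 0, 1, 1]

print(contingency_matrix(labels_true, labels_pred))     # classes x clusters counts
print(pair_confusion_matrix(labels_true, labels_pred))  # 2x2 matrix [[C00, C01], [C10, C11]]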

36
Q

Cluster evaluation metric - Davies bouldin index

A

If the ground truth labels are not known, the Davies-Bouldin index (sklearn.metrics.davies_bouldin_score) can be used to evaluate the model, where a lower Davies-Bouldin index relates to a model with better separation between the clusters.
This index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters themselves.
Zero is the lowest possible score. Values closer to zero indicate a better partition.

The index is defined as the average similarity between each cluster Ci, for i = 1, …, k, and its most similar cluster Cj. In the context of this index, similarity is defined as a measure Rij that trades off:

si, the average distance between each point of cluster i and the centroid of that cluster – also known as the cluster diameter. Other norms can be used as well, as long as the distance metric matches the metric used in the clustering scheme itself for the results to be meaningful. https://en.m.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index

dij, the distance between cluster centroids i and j.

A simple choice to construct Rij so that it is nonnegative and symmetric is:

Rij = (si + sj) / dij

Then the Davies-Bouldin index is defined as:

DB = (1/k) Σ from i = 1 to k of ( max Rij over j ≠ i )

Due to the way it is defined, as a function of the ratio of the within-cluster scatter to the between-cluster separation, a lower value means that the clustering is better. It is the average similarity between each cluster and its most similar one, averaged over all the clusters, where the similarity Rij is defined as above. This affirms the idea that no cluster should be similar to another, and hence the best clustering scheme essentially minimizes the Davies–Bouldin index. It can also be used to help choose k in k-means.

Advantages
The computation of Davies-Bouldin is simpler than that of Silhouette scores.

The index is computed using only quantities and features inherent to the dataset.

Drawbacks

The Davies-Bouldin index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained from DBSCAN.

The usage of centroid distance limits the distance metric to Euclidean space.

Another drawback is that a good value reported by this method does not imply the best information retrieval.
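A minimal sketch of using the Davies-Bouldin index to compare choices of k (lower is better; the data and the range of k are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.default_rng(0).normal(size=(500, 4))

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, davies_bouldin_score(X, labels))   # pick the k with the lowest score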

37
Q

Tf-idf

A

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

Term frequency- inverse document frequency

TF-IDF (term frequency-inverse document frequency) was invented for document search and information retrieval. It works by increasing proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as this, what, and if, rank low even though they may appear many times, since they don’t mean much to that document in particular.

TF-IDF for a word in a document is calculated by multiplying two different metrics:

The term frequency of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of instances a word appears in a document. Then, there are ways to adjust the frequency: by the length of a document, or by the raw frequency of the most frequent word in a document.

The inverse document frequency of the word across a set of documents. This means how common or rare a word is in the entire document set. The closer it is to 0, the more common the word is. This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm.

So, if the word is very common and appears in many documents, this number will approach 0; if the word is rare, it will be large.

Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.
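A minimal sketch of the two factors under the simplest definitions (raw term count and log of inverse document frequency); the toy documents are illustrative, and real libraries such as scikit-learn's TfidfVectorizer use slightly smoothed variants:

import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)          # raw count, normalized by document length

def idf(term, docs):
    containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / containing)        # rare words get a larger value

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("cat", docs[0], docs))   # distinctive word -> higher score
print(tfidf("the", docs[0], docs))   # appears in most docs -> lower score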

38
Q

Gower distance

A

Can be used to compute a "distance" for data containing nominal (categorical) variables, or a mix of numeric and categorical variables.

39
Q

Gaussian mixture models

A

One of the major drawbacks of K-Means is its naive use of the mean value for the cluster center. To see why this isn't the best way of doing things, consider two circular clusters with different radii centered at the same mean: K-Means can't handle this because the mean values of the clusters are very close together. K-Means also fails in cases where the clusters are not circular/convex, again as a result of using the mean as the cluster center.

Gaussian Mixture Models (GMMs) give us more flexibility than K-Means. With GMMs we assume that the data points are Gaussian distributed; this is a less restrictive assumption than saying they are circular by using the mean. That way, we have two parameters to describe the shape of the clusters: the mean and the standard deviation! Taking an example in two dimensions, this means that the clusters can take any kind of elliptical shape (since we have a standard deviation in both the x and y directions). Thus, each Gaussian distribution is assigned to a single cluster.

To find the parameters of the Gaussian for each cluster (e.g the mean and standard deviation), we will use an optimization algorithm called Expectation–Maximization (EM)

We begin by selecting the number of clusters (like K-Means does) and randomly initializing the Gaussian distribution parameters for each cluster. One can try to provide a good guesstimate for the initial parameters by taking a quick look at the data, though this isn't strictly necessary: the Gaussians start out as very poor estimates but are quickly optimized.

Given these Gaussian distributions for each cluster, compute the probability that each data point belongs to a particular cluster. The closer a point is to the Gaussian’s center, the more likely it belongs to that cluster. This should make intuitive sense since with a Gaussian distribution we are assuming that most of the data lies closer to the center of the cluster.

Based on these probabilities, we compute a new set of parameters for the Gaussian distributions such that we maximize the probabilities of data points within the clusters. We compute these new parameters using a weighted sum of the data point positions, where the weights are the probabilities of the data point belonging in that particular cluster.

Steps 2 and 3 are repeated iteratively until convergence, where the distributions don’t change much from iteration to iteration.

Ads
Firstly GMMs are a lot moreflexiblein terms ofcluster covariancethan K-Means; due to the standard deviation parameter, the clusters can take on any ellipse shape, rather than being restricted to circles. K-Means is actually a special case of GMM in which each cluster’s covariance along all dimensions approaches 0. 
Secondly, since GMMs use probabilities, they can have multiple clusters per data point. So if a data point is in the middle of two overlapping clusters, we can simply define its class by saying it belongs X-percent to class 1 and Y-percent to class 2. I.e GMMs supportmixedmembership.

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
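A minimal scikit-learn usage sketch fitting a GMM with EM and reading out the soft (mixed) memberships described above (parameter values are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
hard_labels = gmm.predict(X)          # most likely component per point
soft_labels = gmm.predict_proba(X)    # mixed membership: probability per component
print(gmm.means_, gmm.covariances_.shape)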

40
Q

CNN

A

Also known as shift invariant or space invariant artificial neural networks (SIANN), based on the shared-weight architecture of the convolution kernels that shift over input features and provide translation-equivariant responses known as feature maps.