Machine Learning Flashcards
Association Rules
Detect relationships or associations between specific values of categorical variables in large data sets.
Market basket analysis: uncover hidden patterns in large data sets, such as “customers who order product A often also order product B or C” or “employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z.”
Bayes Theorem
P(A | B) = P(B | A) * P(A) / P(B). P(A) can be estimated as the number of instances with a given value divided by the total number of instances. P(B) is often ignored, since the equation is typically used in a probability ratio comparing two different values of A, for which P(B) is the same.
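A minimal sketch of that ratio trick in Python, using made-up spam-filter counts and word likelihoods (all numbers are illustrative, not from any dataset):

```python
# Hypothetical message counts and per-class word likelihoods, purely illustrative.
total = 1000             # total messages
spam, ham = 300, 700     # messages labeled spam / ham
word_given_spam = 0.40   # P(word | spam)
word_given_ham = 0.05    # P(word | ham)

p_spam = spam / total    # P(A) for A = spam
p_ham = ham / total      # P(A) for A = ham

# Compare the two hypotheses; P(B) = P(word) cancels out of the ratio,
# which is why the card says it is often ignored.
ratio = (word_given_spam * p_spam) / (word_given_ham * p_ham)
print(f"spam is {ratio:.1f}x more likely than ham given the word")
```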
Bayesian Networks
A graphical formalism for representing the structure of a probabilistic model:
- Show the ways in which the random variables may depend on each other
- Good at representing domains with a causal structure
- Edges in the graph determine which variables directly influence which other variables
- Factorization structure of the joint probability distribution
- Encoding a set of conditional independence assumptions
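A minimal sketch of the factorization idea listed above, for a hypothetical two-edge network (Rain → WetGrass, Rain → Traffic) with invented probability tables:

```python
# Hypothetical structure: Rain -> WetGrass, Rain -> Traffic. All probabilities are invented.
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {True: {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}
p_traffic_given_rain = {True: {True: 0.7, False: 0.3},
                        False: {True: 0.3, False: 0.7}}

def joint(rain, wet, traffic):
    # the factorization encoded by the graph's edges:
    # P(rain, wet, traffic) = P(rain) * P(wet | rain) * P(traffic | rain)
    return p_rain[rain] * p_wet_given_rain[rain][wet] * p_traffic_given_rain[rain][traffic]

# the eight joint probabilities sum to 1, as they must
total = sum(joint(r, w, t) for r in (True, False)
                           for w in (True, False)
                           for t in (True, False))
print(joint(True, True, True), total)
```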
Boosting
AdaBoost can be interpreted as a sequential procedure for minimizing the exponential loss on the training set with respect to the coefficients of a particular basis function expansion. This leads to generalizations of the algorithm to different loss functions.
Clustering: Canopy
…
Classification
We are trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
Cluster Analysis
Methods to assign a set of objects into groups, called clusters, such that objects in a cluster are more similar to each other than to those in other clusters. Well-known algorithms include hierarchical clustering, k-means, fuzzy clustering, and supervised clustering.
Cluster Analysis: Distance Metrics Between Items
- Euclidean distance: the geometric (straight-line) distance between objects in multidimensional space, i.e. the shortest path between two objects. Used to obtain sphere-shaped clusters.
- City block (Manhattan) distance: the sum of distances along each dimension; less sensitive to outliers. Used to obtain diamond-shaped clusters.
- Cosine similarity: the cosine of the angle between two objects. Used mostly to compute the similarity between two sets of transaction data.
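A minimal sketch of the three measures for plain Python lists (the example vectors are arbitrary):

```python
import math

def euclidean(a, b):
    # straight-line distance; favors sphere-shaped clusters
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of per-dimension distances; less sensitive to outliers
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # cosine of the angle between the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

p, q = [1.0, 2.0, 3.0], [2.0, 0.0, 4.0]
print(euclidean(p, q), manhattan(p, q), cosine_similarity(p, q))
```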
Cluster Analysis: Distance Measures Between Clusters
In hierarchical clustering:
- Average linkage: the average distance between all pairs of points drawn from the two clusters.
- Single linkage: the distance between the nearest points in the two clusters.
- Complete linkage: the distance between the farthest points in the two clusters.
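A minimal sketch with SciPy on toy random data; the method argument selects the linkage criterion described above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)  # toy data

# 'average', 'single', and 'complete' correspond to the three linkage criteria
Z = linkage(X, method='average')
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
print(labels)
```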
Cluster Analysis: Hierarchical Clustering
…
Cluster Analysis: Gaussian Mixture Models (GMM)
An unsupervised learning technique for clustering that models the full data set as a mixture of clusters, using a Gaussian (normal) distribution for each cluster. The GMM’s output is a set of cluster attributes (mean, covariance, and mixture weight) for each cluster, thereby producing a set of characterization metadata that serves as a compact descriptive model of the full data collection.
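A minimal scikit-learn sketch on two invented Gaussian blobs; the fitted weights_, means_, and covariances_ are the compact descriptive model the card refers to:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: two Gaussian blobs
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)        # mixture weight of each component
print(gmm.means_)          # component means
print(gmm.covariances_)    # component covariances
print(gmm.predict(X)[:5])  # hard cluster assignments for the first points
```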
Cluster Analysis: K-Means Overview
What: K-means is one of the most widely used clustering techniques because of its simplicity and speed. It partitions the data into a user specified number of clusters, k. Why: Simplicity, speed. It is fast for large data sets, which are common in segmentation.
Cluster Analysis: K-Means: Cautions
The algorithm may converge to a local minimum, so the clusters that are obtained might not be the best ones. To avoid this, it can help to run the algorithm several times with different initial cluster centroids and compare the results.
Cluster Analysis: K-Means: Scaling Options
Note that each iteration needs N × k comparisons, which determines the time complexity of one iteration. The number of iterations required for convergence varies and may depend on N, but as a first cut, this algorithm can be considered linear in the dataset size. The k-means algorithm can take advantage of data parallelism: when the data objects are distributed across processors, the data assignment step can be parallelized easily by assigning each object to its nearest cluster in parallel.
Cluster Analysis: K-Means: How
- Initialization: the algorithm is initialized by picking the initial k cluster representatives or “centroids”. These initial seeds can be chosen by sampling at random from the dataset or by taking the results of clustering a small subset of the data;
- Data assignment: each data point is assigned to its closest centroid, with ties broken arbitrarily. This results in a partitioning of the data;
- Recompute and reset the location of the “means”: each cluster representative is relocated to the center (mean) of all data points assigned to it.
Repeat the assignment and relocation steps (2 and 3) until the convergence criterion is met (e.g., the assignment of objects to clusters no longer changes) or the maximum number of iterations is reached.
Cluster Analysis: K-Means: 4 Key Steps
1: Initialization of k centroids; 2: Data points assigned to nearest centroid; 3: Relocation of each mean to the center of its points; 4: Repeat steps 2 and 3 until assignments no longer change
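A minimal NumPy sketch of those four steps (not a production implementation; the toy data and the random-sampling initialization are assumptions):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialization - sample k data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (N x k distance comparisons)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Step 4: stop when assignments no longer change
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 3: relocate each centroid to the mean of its assigned points
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignments

# toy usage; rerunning with different seeds addresses the local-minimum caution above
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```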
KNN
K-Nearest Neighbors algorithm (for classification), K-Means (for clustering), K-itemsets (for association rule mining - see A, above), K-Nearest Neighbors Data Distributions (for outlier detection), KD-trees (for indexing and rapid search of high-dimensional data), and more KDD (Knowledge Discovery from Data) things.
KNN
K-NN is an algorithm that can be used when you have a bunch of objects that have been classified or labeled in some way, and other similar objects that haven’t gotten classified or labeled yet, and you want a way to automatically label them.
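A minimal scikit-learn sketch: the iris data stands in for the already-labeled objects, and a held-out split stands in for the objects still awaiting labels:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# "objects that have been labeled" vs. objects we still want to label
X_labeled, X_unlabeled, y_labeled, y_hidden = train_test_split(
    X, y, test_size=0.3, random_state=0)

# label each new object by a majority vote of its k=5 nearest labeled neighbors
knn = KNeighborsClassifier(n_neighbors=5).fit(X_labeled, y_labeled)
print(knn.predict(X_unlabeled[:5]))
```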
Conditional Random Fields
…
Confusion Matrix
A matrix showing the predicted and actual classifications. A confusion matrix is of size L×L, where L is the number of different label values: rows correspond to the actual values, cross-tabulated against columns of the predicted values.
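A minimal sketch with scikit-learn; the three-class labels are invented for illustration, and rows/columns follow the order given in labels:

```python
from sklearn.metrics import confusion_matrix

# illustrative labels for a 3-class problem (L = 3)
actual    = ["cat", "cat", "dog", "dog", "bird", "bird", "dog"]
predicted = ["cat", "dog", "dog", "dog", "bird", "cat", "dog"]

# rows = actual values, columns = predicted values
cm = confusion_matrix(actual, predicted, labels=["cat", "dog", "bird"])
print(cm)
```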
Optimization: Convex optimization
…
Correlation
The analysis of data to determine the relationship between variables; the resulting correlation coefficient ranges from -1.00 (perfect negative relationship) to +1.00 (perfect positive relationship).
Correlation: Pearson
1) Find items which can be compared; 2) Calculate the sum for each set of those items; 3) Calculate the sum of squares for each set of those items; 4) Calculate the sum of the products for each set of those items; 5) Use the values from 2-4 to calculate the coefficient
Correlation: Pearson
a measure of how well two sets of data fit on a straight line; Gives a value between -1 and 1 with 1 meaning the sets vary in an identical way; Tends to give better results than Euclidean distance scores when data isn’t well normalized, i.e. it corrects for grade inflation
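A minimal from-scratch sketch that follows the five steps listed in the first Pearson card (the sample lists are arbitrary):

```python
import math

def pearson(x, y):
    n = len(x)                                   # step 1: the paired, comparable items
    sum_x, sum_y = sum(x), sum(y)                # step 2: sums
    sum_x2 = sum(v * v for v in x)               # step 3: sums of squares
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))    # step 4: sum of products
    # step 5: combine the values above into the coefficient
    num = sum_xy - (sum_x * sum_y / n)
    den = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return num / den if den != 0 else 0.0

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   #  1.0: the sets vary in an identical way
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0: they vary in opposite ways
```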
Cost Function
A measurement of the cost to the performance task of making a prediction y′ when the actual label is y. The use of accuracy to evaluate a model assumes uniform costs of errors and uniform benefits of correct classifications. Example: the “squared error function”.
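A minimal sketch of a squared-error cost; the 1/(2n) scaling is one common convention (it simplifies the gradient) rather than the only choice:

```python
def squared_error_cost(predictions, labels):
    # mean of squared differences between predicted and actual values,
    # halved by convention so the gradient comes out cleaner
    n = len(labels)
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / (2 * n)

print(squared_error_cost([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))
```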
Cross-validation
A method for estimating the accuracy of an inducer by dividing the data into k mutually exclusive subsets, or folds, of approximately equal size. The inducer is trained and tested k times. Each time it is trained on the data set minus a fold and tested on that fold. The accuracy estimate is the average accuracy for the k folds.
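A minimal scikit-learn sketch with k = 5 folds; logistic regression on the iris data is just a stand-in for whatever inducer is being evaluated:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# k = 5 mutually exclusive folds; train on 4 folds, test on the held-out fold,
# and average the 5 accuracy scores
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores, scores.mean())
```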
Decision Trees
A tree of questions to guide an end user to a conclusion based on values from a single vector of data. The classic example is a medical diagnosis based on a set of symptoms for a particular patient. A common problem in data science is to automatically or semi-automatically generate decision trees based on large sets of data coupled to known conclusions. Example algorithms are CART and ID3. (Submitted by Michael Malak)
Decision Trees
each node in the tree tests an attribute; decisions are represented by the edges leading away from that node, with leaf nodes representing the final decision for all instances that reach that leaf node
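A minimal scikit-learn sketch; export_text prints the node tests, the edges leading away from each node, and the leaf decisions for a small tree fit on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# each printed node tests one attribute; the indented branches are the edges,
# and the "class: ..." lines are the leaf decisions
print(export_text(clf, feature_names=feature_names))
```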
Decision Trees: Dealing with missing values
1) have a specific edge for no value; 2) track the number of instances that follow each path and assign the instance to the most popular edge; 3) give instances weights and split the instance evenly down each possible edge; once the instance has reached the leaf nodes, recombine the results using the weights
Decision Trees: Pruning
Since decision trees are often created by continuously splitting the instances until there is no more information gain, it is possible that some splits were made with too little information gain and result in overfitting of the model to the training data. The basic idea is to check child nodes to see whether combining them with their parent would result in an increase in entropy below some threshold; if so, the split is removed.
Decision Trees: Replicated subtree problem
because only a single attribute is tested at a node, it is possible that two copies of the same subtree will need to be placed in a tree if the attributes from that subtree were not tested in the root node
Decision Trees: Testing nominal values
if a node has edges for each possible value of that nominal value, then that value will not be tested again further down the tree. If a node groups the values into subsets with an edge per subset, then that attribute may be tested again
Decision Trees: Testing numeric values
can be tested for less than, greater than, equal to, within some range. Can result in just two edges or multiple edges. Can also have an edge that represents no value. May be tested multiple times in a single path
Deep Learning
Refers to a class of methods that includes neural networks and deep belief nets. Useful for finding a hierarchy of the most significant features, characteristics, and explanatory variables in complex data sets. Particularly useful in unsupervised machine learning of large unlabeled datasets. The goal is to learn multiple layers of abstraction for some data. For example, to recognize images, we might want to first examine the pixels and recognize edges; then examine the edges and recognize contours; examine the contours to find shapes; the shapes to find objects; and so on.
Elastic Nets
…
Ensemble Learning
Machine learning approach that combines the results from many different algorithms, whose combined vote (from the ensemble) provides a more robust and accurate predictive output than any single algorithm can muster.
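A minimal scikit-learn sketch of one kind of ensemble, a voting classifier; three different algorithms are fit on the iris data and their combined vote produces each prediction:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# three different algorithms vote on each prediction
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```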
Factor Analysis
used as a variable reduction technique to identify groups of clustered variables. (submitted by Vincent Granville)
Feature Scaling
different features will have different effects on the accuracy of classifications. A scaling factor can be used for each feature to reduce its effect on the classification, possibly to 0. These scales can be optimized to not only improve classifications, but to also show the relative importance of the features.
Feature Scaling
Mean normalization of the data to speed up gradient descent
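A minimal NumPy sketch of mean normalization: subtract each feature's mean and divide by its range (the toy matrix is invented, and a real implementation should guard against zero-range features):

```python
import numpy as np

def mean_normalize(X):
    # subtract the per-feature mean and divide by the per-feature range,
    # so every feature lands in roughly [-0.5, 0.5]
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])
print(mean_normalize(X))
```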
Feature Selection
…
Gaussian Mixture Models
…
Gini Impurity
Measures the expected error rate if one of the results (labels) from a set is randomly applied to one of the items in the set.
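A minimal sketch: Gini impurity is 1 − Σ pᵢ², where pᵢ is the fraction of items carrying label i:

```python
from collections import Counter

def gini_impurity(labels):
    # probability that a randomly drawn label, reapplied to a randomly
    # drawn item from the same set, is wrong: 1 - sum(p_i^2)
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity(["a", "a", "a", "a"]))   # 0.0: pure set
print(gini_impurity(["a", "a", "b", "b"]))   # 0.5: maximally mixed (two classes)
```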
Boosting: Gradient Boosted Decision Trees
…
Gradient Descent
An iterative optimization algorithm for finding a local minimum. At each iteration, gradient descent operates by moving the current solution in the direction of the negative gradient of the function (the direction of “steepest descent”). Also known as steepest descent.
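A minimal sketch for a one-dimensional function; the learning rate, tolerance, and the example objective f(x) = (x − 3)² are all illustrative choices:

```python
def gradient_descent(grad, x0, learning_rate=0.1, max_iter=100, tol=1e-8):
    """Move the current solution along the negative gradient at each iteration."""
    x = x0
    for _ in range(max_iter):
        step = learning_rate * grad(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3); minimum at x = 3
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))
```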
Graph Databases
They use graph structures, with nodes, edges, and properties, for data storage. They provide index-free adjacency, meaning that every element is directly linked to its neighboring elements.
HDPs or other Bayesian non-parametric model
…
Hidden Layer
The middle layer of a three-layer network; it receives signals from the input layer and performs intermediary processing before passing its results to the output layer.
Hidden Markov Models
…
Hyperplane
A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides the space into two disconnected parts. First think of the real line. Now pick a point. That point divides the real line into two parts (the part above that point, and the part below that point). The real line has 1 dimension, while the point has 0 dimensions. So a point is a hyperplane of the real line. Now think of the two-dimensional plane. Pick any line. That line divides the plane into two parts (“left” and “right”, or maybe “above” and “below”). The plane has 2 dimensions, but the line has only one. So a line is a hyperplane of the 2d plane. Notice that if you pick a point, it doesn’t divide the 2d plane into two parts. So one point is not enough. Now think of a 3d space. To divide the space into two parts, you need a plane. Your plane has two dimensions, your space has three. So a plane is the hyperplane for a 3d space.
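A minimal NumPy sketch: the hyperplane w·x + b = 0 in 3d space is a 2d plane, and the sign of w·x + b says which of the two parts a point falls in (w and b are made-up values):

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])   # hypothetical normal vector of the plane
b = -1.0                          # hypothetical offset

def side(point):
    # positive on one side of the hyperplane, negative on the other
    return np.sign(w @ point + b)

print(side(np.array([3.0, 0.0, 0.0])))   #  1.0: one side
print(side(np.array([0.0, 1.0, 0.0])))   # -1.0: the other side
```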
Kernel
replacing the dot-product function with a new function that returns what the dot product would have been had the data first been transformed to a higher-dimensional space. Usually done using the radial-basis function (RBF)
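A minimal sketch of the RBF kernel; gamma is a free width parameter and the example points are arbitrary:

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    # returns the dot product the two points would have after an implicit
    # mapping to a higher-dimensional space, without computing that mapping
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))   # 1.0: identical points
print(rbf_kernel([1.0, 2.0], [3.0, 0.0]))   # smaller: points farther apart
```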
Lagrange
…
LDA
…
Linear Algebra
Dot product: gives a single number from two vectors by multiplying each value in the first vector by the corresponding value in the second vector and adding them all together
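A minimal sketch of that definition:

```python
def dot(a, b):
    # multiply corresponding entries and add them all together
    return sum(x * y for x, y in zip(a, b))

print(dot([1, 2, 3], [4, 5, 6]))   # 1*4 + 2*5 + 3*6 = 32
```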