Starting Flashcards
A Sequential model is appropriate when?
for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
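A toy analogue of the idea in plain Python (not the Keras API): a sequential model is just function composition, which only works when every layer takes a single input and produces a single output.

```python
# Toy illustration (not Keras itself): a "sequential" model chains layers,
# feeding each layer's single output to the next layer's single input.
class Sequential:
    def __init__(self, layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:  # pass each output along the stack
            x = layer(x)
        return x

model = Sequential([lambda x: x * 2, lambda x: x + 1])
print(model(3))  # (3 * 2) + 1 = 7
```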
Bayesian Networks
A graphical formalism for representing the structure of a probabilistic model:
Show the ways in which the random variables may depend on each other
Good at representing domains with a causal structure
Edges in the graph determine which variables directly influence which other variables
Factorization structure of the joint probability distribution
Encoding a set of conditional independence assumptions
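The factorization point can be made concrete with an invented three-node chain A -> B -> C, whose edges imply P(A, B, C) = P(A) * P(B|A) * P(C|B). All probability tables below are made up for illustration.

```python
# Invented conditional probability tables for the chain A -> B -> C.
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.8, False: 0.2},
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.5, False: 0.5},
               False: {True: 0.25, False: 0.75}}

def joint(a, b, c):
    # Factorization dictated by the graph's edges.
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

total = sum(joint(a, b, c)
            for a in (True, False)
            for b in (True, False)
            for c in (True, False))
print(total)  # a valid joint distribution sums to 1.0
```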
Cluster Analysis: Distance Measures Between Clusters
In hierarchical clustering: 1. Average linkage: the average distance between all pairs of points drawn from the two clusters. 2. Single linkage: the distance between the nearest points in the two clusters. 3. Complete linkage: the distance between the farthest points in the two clusters.
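A minimal sketch of the three linkages, computed by brute force over all point pairs (1-D points and Euclidean distance; the clusters are invented):

```python
# All pairwise distances between two 1-D clusters.
def pairwise(a, b):
    return [abs(x - y) for x in a for y in b]

c1, c2 = [0.0, 1.0], [4.0, 6.0]
d = pairwise(c1, c2)          # [4.0, 6.0, 3.0, 5.0]

single   = min(d)             # nearest points   -> 3.0
complete = max(d)             # farthest points  -> 6.0
average  = sum(d) / len(d)    # mean of all pairs -> 4.5
```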
Bayes Theorem
P(A | B) = P(B | A) * P(A) / P(B); P(A) being the number of instances of a given value divided by the total number of instances; P(B) is often ignored since this equation is typically used in a probability ratio that compares two different values for A, with P(B) being the same for both
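A worked numeric example with invented numbers: a diagnostic test with 99% sensitivity and a 5% false-positive rate for a condition with 1% prevalence.

```python
p_a = 0.01               # P(A): prior (prevalence)
p_b_given_a = 0.99       # P(B|A): sensitivity
p_b_given_not_a = 0.05   # P(B|not A): false-positive rate

# P(B) by total probability, then Bayes' theorem.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.167: low prevalence dominates
```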
Generalizing E–M: Gaussian Mixture Models
A Gaussian mixture model (GMM) attempts to find a mixture of multi-dimensional Gaussian probability distributions that best model any input dataset. In the simplest case, GMMs can be used for finding clusters in the same manner as k-means.
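A minimal 1-D EM sketch for a two-component mixture (not scikit-learn's `GaussianMixture`; the weights and variances are fixed here to keep the update rules short, and the data are invented):

```python
import math

# Minimal EM for the means of a two-component 1-D Gaussian mixture
# (equal weights, unit variances, for illustration only).
def em_means(data, m1, m2, iters=50):
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point.
        r = []
        for x in data:
            p1 = math.exp(-0.5 * (x - m1) ** 2)
            p2 = math.exp(-0.5 * (x - m2) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted mean updates.
        m1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        m2 = sum((1 - ri) * x for ri, x in zip(r, data)) / (len(r) - sum(r))
    return m1, m2

data = [-2.1, -1.9, -2.0, 3.9, 4.1, 4.0]
print(em_means(data, m1=-1.0, m2=1.0))  # converges near (-2.0, 4.0)
```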
Ensemble Learning
Machine learning approach that combines the results from many different algorithms, whose combined vote (from the ensemble) provides a more robust and accurate predictive output than any single algorithm can muster.
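A minimal majority-vote illustration of the idea; the three "algorithms" here are just invented lists of predictions.

```python
from collections import Counter

# Three invented models' predictions on the same four examples.
preds = [
    [1, 0, 1, 1],   # model A
    [1, 1, 0, 1],   # model B
    [0, 0, 1, 1],   # model C
]

# Per-example majority vote across the ensemble.
vote = [Counter(col).most_common(1)[0][0] for col in zip(*preds)]
print(vote)  # [1, 0, 1, 1]
```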
Elastic Net
Elastic Net is a regularized form of regression. The penalty used is a linear combination of the L1 and L2 penalties used in LASSO and ridge regression respectively.
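The penalty can be sketched in plain arithmetic; the `alpha` and `l1_ratio` names mirror scikit-learn's convention, but the values and coefficients below are arbitrary.

```python
# L1 (LASSO) and L2 (ridge) penalties on a coefficient vector.
def l1(w):
    return sum(abs(wi) for wi in w)

def l2(w):
    return sum(wi * wi for wi in w)

# Elastic Net: a linear combination of the two.
def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    return alpha * (l1_ratio * l1(w) + (1 - l1_ratio) * 0.5 * l2(w))

w = [0.5, -1.0, 0.0]
print(elastic_net_penalty(w))  # 1.0 * (0.5*1.5 + 0.25*1.25) = 1.0625
```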
Difference between LASSO and ridge regression?
Both are regularized regressions; they differ in the penalty. LASSO uses the L1 penalty (sum of absolute coefficient values), which can shrink coefficients exactly to zero and so performs feature selection, yielding sparse models. Ridge regression uses the L2 penalty (sum of squared coefficients), which shrinks coefficients toward zero but never exactly to zero, keeping all predictors in the model.
Boosting:AdaBoost
AdaBoost can be interpreted as a sequential procedure for minimizing the exponential loss on the training set with respect to the coefficients of a particular basis function expansion. This leads to generalizations of the algorithm to different loss functions.
Hidden Layer/Calculating Layer
The second layer of a three-layer network; it receives the signals sent by the input layer and performs intermediary processing.
Cluster Analysis: K-Means: Contending with size increases
Note that each iteration needs N × k comparisons, which determines the time complexity of one iteration. The number of iterations required for convergence varies and may depend on N, but as a first cut, this algorithm can be considered linear in the dataset size. The k-means algorithm can take advantage of data parallelism. When the data objects are distributed to each processor, step 3 can be parallelized easily by doing the assignment of each object into the nearest cluster in parallel.
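The assignment step (step 3) can be sketched in 1-D: each of the N points is compared with all k centroids, which is where the N × k comparisons per iteration come from, and each point's assignment is independent of the others, which is what makes the step easy to parallelize. The points and centroids below are invented.

```python
# Assign each point to its nearest centroid (squared 1-D distance).
def assign(points, centroids):
    return [min(range(len(centroids)),
                key=lambda j: (p - centroids[j]) ** 2)
            for p in points]

print(assign([0.1, 0.9, 5.2], centroids=[0.0, 5.0]))  # [0, 0, 1]
```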
Hyperplane
A hyperplane in an n-dimensional Euclidean space is a flat, (n-1)-dimensional subset of that space that divides the space into two disconnected parts. First think of the real line. Pick a point: that point divides the real line into two parts (the part above it and the part below it). The real line has 1 dimension, while the point has 0 dimensions, so a point is a hyperplane of the real line. Now think of the two-dimensional plane and pick any line. That line divides the plane into two parts (“left” and “right”, or “above” and “below”). The plane has 2 dimensions, but the line has only one, so a line is a hyperplane of the 2-D plane. Notice that a single point does not divide the 2-D plane into two parts, so a point is not enough. Finally, think of a 3-D space: to divide it into two parts, you need a plane. The plane has two dimensions and the space has three, so a plane is the hyperplane of a 3-D space.
Association Rules
Detect relationships or associations between specific values of categorical variables in large data sets.
Market basket analysis: uncover hidden patterns in large data sets, such as “customers who order product A often also order product B or C” or “employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z.”
Linear Algebra: Dot Product
gives a single number from two vectors by multiplying each value in the first vector by the corresponding value in the second vector and adding them all together
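The operation described above, written out directly:

```python
# Dot product: multiply corresponding entries, then sum.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```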
Graph Databases
They use graph structures (nodes connected by edges, each carrying properties) for data storage. They provide index-free adjacency, meaning that every element is directly linked to its neighbouring elements.
Lagrange
Technique for turning constrained optimization problems into unconstrained ones.
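A hand-derived worked example, verified numerically below: maximize f(x, y) = xy subject to x + y = 1. The Lagrangian is L = xy - λ(x + y - 1); stationarity gives (y, x) = (λ, λ), so x = y, and the constraint then forces x = y = 1/2, where f = 1/4.

```python
# Numerical check of the hand-derived optimum along the constraint x + y = 1.
def f(x, y):
    return x * y

# Scan the constraint set: the maximum should be f = 0.25 at x = 0.5.
best = max((f(x, 1 - x), x) for x in [i / 1000 for i in range(1001)])
print(best)  # (0.25, 0.5)
```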
Bayesian Nonparametrics
Bayesian Nonparametrics is a class of models with a potentially infinite number of parameters. The high flexibility and expressive power of this approach enable better data modelling compared to parametric methods.
Bayesian Nonparametrics is used in problems where a dimension of interest grows with data, for example, in problems where the number of features is not fixed but allowed to vary as we observe more data. Another example is clustering where the number of clusters is automatically inferred from data.
Logistic Regression
A kind of regression analysis often used when the dependent variable is dichotomous and scored 0 or 1. It is usually used for predicting whether something will happen or not, such as graduation, business failure, or heart attack: anything that can be expressed as event/non-event. Independent variables may be categorical or continuous in logistic regression analysis.
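A sketch of the logistic (sigmoid) link that turns a linear combination of predictors into an event probability; the coefficients and inputs below are invented.

```python
import math

# Logistic regression prediction: linear combination, then sigmoid.
def predict_proba(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))   # squashes z into (0, 1)

p = predict_proba(x=[2.0, 1.0], w=[1.0, -0.5], b=-1.0)
# z = 2.0 - 0.5 - 1.0 = 0.5, so p = 1 / (1 + e^-0.5) ≈ 0.622
print(round(p, 3))
```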
Machine Learning
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Kernel
Replacing the dot-product function with a new function that returns what the dot product would have been if the data had first been transformed into a higher-dimensional space. Usually done with the radial-basis function (RBF) kernel.
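A sketch of the idea with the RBF kernel: the kernel value equals a dot product in an implicit higher-dimensional feature space, computed without ever mapping the points there. The sample points and `gamma` value are arbitrary.

```python
import math

# RBF kernel: k(u, v) = exp(-gamma * ||u - v||^2).
def rbf_kernel(u, v, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))  # e^-2 ≈ 0.135
```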
Hidden Markov Models
HMMs assume there is a hidden Markov process X and an observable process Y whose behavior depends on X. The goal is to learn about X by observing Y. HMMs stipulate that at each time instant, the conditional probability distribution of the observation depends only on the hidden state at that instant, not on earlier states or observations.
Hierarchical clustering:Agglomerative
“Bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.