Starting Flashcards

1
Q

When is a Sequential model appropriate?

A

For a plain stack of layers where each layer has exactly one input tensor and one output tensor.
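
A minimal sketch of such a stack, assuming TensorFlow/Keras (the layer sizes here are illustrative):

```python
# A plain stack of layers: each layer has exactly one input and one output tensor.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),  # single input tensor
    layers.Dense(64, activation="relu"),                     # each layer's output feeds the next
    layers.Dense(1, activation="sigmoid"),                   # single output tensor
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```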

2
Q

Bayesian Networks

A

A graphical formalism for representing the structure of a probabilistic model:

Show the ways in which the random variables may depend on each other
Good at representing domains with a causal structure
Edges in the graph determine which variables directly influence which other variables
Factorization structure of the joint probability distribution
Encoding a set of conditional independence assumptions

3
Q

Cluster Analysis: Distance Measures Between Clusters

A

In hierarchical clustering:
1. Average linkage: the average distance between all pairs of points in the two clusters.
2. Single linkage: the distance between the nearest points in the two clusters.
3. Complete linkage: the distance between the farthest points in the two clusters.
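
A quick sketch of the three linkages using SciPy on toy data (the data here is made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(10, 2)  # 10 random points in 2-D
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # merge history under this linkage
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)
```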

4
Q

Bayes Theorem

A

P(A | B) = P(B | A) * P(A) / P(B). Here P(A) is the prior: the number of instances with a given value divided by the total number of instances. P(B) is often ignored, since the equation is typically used in a probability ratio comparing two different values of A, for which P(B) is the same.
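
A worked numeric example (the probabilities are made up for illustration):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a = 0.01          # P(A): prior
p_b_given_a = 0.9   # P(B|A): likelihood
p_b = 0.05          # P(B): evidence

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.18
```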

5
Q

Generalizing E–M: Gaussian Mixture Models

A

A Gaussian mixture model (GMM) attempts to find a mixture of multi-dimensional Gaussian probability distributions that best model any input dataset. In the simplest case, GMMs can be used for finding clusters in the same manner as k-means.
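
A minimal clustering sketch with scikit-learn, used just as one would use k-means (the data is synthetic):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
labels = gmm.predict(X)       # hard cluster assignments, as in k-means
probs = gmm.predict_proba(X)  # soft assignments: per-cluster membership probabilities
```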

6
Q

Ensemble Learning

A

Machine learning approach that combines the results from many different algorithms, whose combined vote (from the ensemble) provides a more robust and accurate predictive output than any single algorithm can muster.

7
Q

Elastic Net

A

Elastic Net is a regularized form of regression. The penalty used is a linear combination of the L1 and L2 penalties used in LASSO and ridge regression respectively.

8
Q

Difference between LASSO and ridge regression?

A

LASSO uses an L1 penalty, which drives some coefficients exactly to zero (performing feature selection); ridge regression uses an L2 penalty, which shrinks coefficients toward zero without eliminating them. (See the Elastic Net and L1 vs. L2 regularization cards.)

9
Q

Boosting: AdaBoost

A

AdaBoost can be interpreted as a sequential procedure for minimizing the exponential loss on the training set with respect to the coefficients of a particular basis function expansion. This leads to generalizations of the algorithm to different loss functions.
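
A usage sketch with scikit-learn (the dataset is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)  # 50 sequential weak learners
print(clf.score(X, y))
```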

10
Q

Hidden Layer / Calculating Layer

A

The middle layer of a three-layer network: it receives signals from the input layer and performs intermediary processing.

11
Q

Cluster Analysis: K-Means: Contending with size increases

A

Note that each iteration needs N × k comparisons, which determines the time complexity of one iteration. The number of iterations required for convergence varies and may depend on N, but as a first cut, this algorithm can be considered linear in the dataset size. The k-means algorithm can take advantage of data parallelism. When the data objects are distributed across processors, the assignment step (step 3 of the standard algorithm) can be parallelized easily by assigning each object to its nearest cluster in parallel.
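
A vectorized sketch of that assignment step (function and variable names are illustrative); NumPy applies the same operation to every data point, which is exactly the data parallelism described above:

```python
import numpy as np

def assign_clusters(X, centroids):
    # N x k distance matrix: one row per point, one column per centroid
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)  # each point is assigned to its nearest centroid
```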

12
Q

Hyperplane

A

A hyperplane in an n-dimensional Euclidean space is a flat, (n-1)-dimensional subset of that space that divides the space into two disconnected parts.

First think of the real line. Pick a point: it divides the line into two parts (the part above that point and the part below it). The real line has 1 dimension, the point has 0. So a point is a hyperplane of the real line.

Now think of the two-dimensional plane. Pick any line: it divides the plane into two parts (“left” and “right”, or maybe “above” and “below”). The plane has 2 dimensions, but the line has only one. So a line is a hyperplane of the 2d plane. Notice that a single point does not divide the 2d plane into two parts, so one point is not enough.

Now think of a 3d space. To divide the space into two parts, you need a plane. The plane has two dimensions, the space has three. So a plane is the hyperplane for a 3d space.

13
Q

Association Rules

A

Detect relationships or associations between specific values of categorical variables in large data sets.
Market basket analysis: uncover hidden patterns in large data sets, such as “customers who order product A often also order product B or C” or “employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z.”

14
Q

Linear Algebra: Dot Product

A

The dot product gives a single number from two vectors by multiplying each value in the first vector by the corresponding value in the second vector and adding the products together.
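
A one-liner sketch of that definition:

```python
# Dot product: multiply corresponding values, then add the products.
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
dot = sum(x * y for x, y in zip(a, b))
print(dot)  # 1*4 + 2*5 + 3*6 = 32.0
```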

15
Q

Graph Databases

A

Graph databases use graph structures (nodes, edges, and properties) for data storage. They provide index-free adjacency, meaning that every element is directly linked to its neighbouring elements.

16
Q

Lagrange

A

Technique for turning constrained optimization problems into unconstrained ones.
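
A minimal worked example: maximize f(x, y) = xy subject to x + y = 1, using a Lagrange multiplier λ:

```latex
\mathcal{L}(x, y, \lambda) = xy - \lambda (x + y - 1), \qquad
\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \quad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, \quad
x + y = 1 \;\Rightarrow\; x = y = \tfrac{1}{2}.
```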

17
Q

Bayesian Nonparametrics

A

Bayesian Nonparametrics is a class of models with a potentially infinite number of parameters. The high flexibility and expressive power of this approach enable better data modelling compared to parametric methods.

Bayesian Nonparametrics is used in problems where a dimension of interest grows with data, for example, in problems where the number of features is not fixed but allowed to vary as we observe more data. Another example is clustering where the number of clusters is automatically inferred from data.

18
Q

Logistic Regression

A

A kind of regression analysis often used when the dependent variable is dichotomous and scored 0 or 1. It is usually used for predicting whether something will happen or not, such as graduation, business failure, or heart attack: anything that can be expressed as event/non-event. Independent variables may be categorical or continuous in logistic regression analysis.
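
A minimal sketch with scikit-learn (the event/non-event data is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # one continuous predictor
y = np.array([0, 0, 0, 1, 1, 1])                          # dichotomous outcome: event / non-event

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))  # predicted probabilities for classes 0 and 1
```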

19
Q

Machine Learning

A

A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.

20
Q

Kernel

A

The kernel trick: replacing the dot-product function with a new function that returns what the dot product would have been if the data had first been transformed into a higher-dimensional space. Usually done using the radial-basis function (RBF).
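
A sketch of the radial-basis function kernel itself (gamma is a width parameter chosen by the user):

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2): the dot product the data would have
    # had in an (implicit, infinite-dimensional) feature space.
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

print(rbf_kernel([1.0, 2.0], [2.0, 1.0]))  # exp(-2), about 0.135
```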

21
Q

Hidden Markov Models

A

An HMM assumes there is a hidden Markov process X and another process Y whose behavior depends on X. The goal is to learn about X by observing Y. The HMM stipulates that, at each time instant n, the conditional probability distribution of Y_n given the history depends only on the current hidden state X_n.

22
Q

Hierarchical clustering:Agglomerative

A

“Bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

23
Q

Hierarchical clustering: Divisive

A

“Top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy, normally in a greedy manner.

24
Q

Linear Discriminant Analysis(LDA)

A

A dimensionality reduction technique that finds the linear combinations of features that best separate the classes of the dependent variable.

25
Q

SVD:Singular Value Decomposition

A

Factorization of an m x n matrix A into A = U E V^T, where U and V have orthonormal columns and E is diagonal with non-negative singular values. A general-purpose way to decompose a matrix; closely related to finding eigenvalues and eigenvectors (the singular values are the square roots of the eigenvalues of A^T A).
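
A sketch with NumPy showing the factorization and the reconstruction:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])  # m = 3, n = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(S)                                    # singular values, largest first
print(np.allclose(A, U @ np.diag(S) @ Vt))  # True: A = U E V^T reconstructs A
```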

26
Q

Neural nets: Zero initialization vs Random initialization

A

Zero initialization is pointless: every neuron in a layer computes the same output and receives the same gradient, so the neurons never differentiate. Random initialization breaks this symmetry.

27
Q

Five types of clustering methods:

A

Partitioning Clustering

Density-Based Clustering

Distribution Model-Based Clustering

Hierarchical Clustering

Fuzzy Clustering

28
Q

Optimization: when do optimization algorithms fail?

A

when there is no clear slope in the change in cost function as the values are varied. This is because these algorithms all attempt to follow that slope toward a global minimum.

29
Q

Optimization: random-restart hill climbing

A

running a hill climbing algorithm several times using new random initial input to attempt to reach the global minimum
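
A toy sketch of the idea on a 1-D cost function (everything here is illustrative):

```python
import random

def cost(x):
    return (x - 3) ** 2  # global minimum at x = 3

def hill_climb(x, step=0.1, iters=1000):
    # Greedily follow the slope: keep a neighbor only if it lowers the cost.
    for _ in range(iters):
        candidate = x + random.choice([-step, step])
        if cost(candidate) < cost(x):
            x = candidate
    return x

# Random restarts: run hill climbing from several random starting points
# and keep the best result found.
best = min((hill_climb(random.uniform(-10, 10)) for _ in range(10)), key=cost)
print(best)  # close to 3
```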

30
Q

Types of Linkages in Clustering

A

Single Linkage: For two clusters R and S, the single linkage returns the minimum distance between two points i and j such that i belongs to R and j belongs to S.

Complete Linkage: For two clusters R and S, the complete linkage returns the maximum distance between two points i and j such that i belongs to R and j belongs to S.

Average Linkage: For two clusters R and S, first the distance between every data point i in R and every data point j in S is calculated, and then the arithmetic mean of these distances is taken. Average linkage returns this mean.
31
Q

Confusion matrix

A

A matrix showing the predicted and actual classifications. A confusion matrix is of size L x L, where L is the number of different label values: one row for each actual value, cross-tabulated against columns of predicted values.
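
A small sketch with scikit-learn (labels made up; rows are actual, columns predicted):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(confusion_matrix(y_true, y_pred))
# [[2 1]    row 0: actual 0 -> predicted 0 twice, predicted 1 once
#  [1 2]]   row 1: actual 1 -> predicted 0 once, predicted 1 twice
```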

32
Q

Sigmoid Function

A

An S-shaped mathematical curve, often used to describe the activation function of a neuron.
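
The standard logistic sigmoid, as a quick sketch:

```python
import numpy as np

def sigmoid(x):
    # S-shaped curve: sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))  # 0.5, the midpoint of the S
print(sigmoid(6.0))  # ~0.998, saturating toward 1
```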

33
Q

Parametric vs non-parametric

A

Parametric models assume a fixed, finite number of parameters regardless of how much data is observed; non-parametric models allow the number of effective parameters to grow with the data (compare the Bayesian Nonparametrics card).

34
Q

Quadratic vs linear

A

?

35
Q

Unsupervised Learning

A

Allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.

36
Q

Supervised

A

We are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output. Categorized into “regression” and “classification” problems.

37
Q

Discriminative vs. Generative

A

A generative approach models how the data is generated (the joint distribution P(x, y)); a discriminative approach models the decision boundary directly (the conditional P(y | x)).

38
Q

Frequentist vs. Bayesian

A

A frequentist treats parameters as fixed and relies on long-run frequencies, typically needing a lot of data; a Bayesian treats parameters as random variables and assumes a prior that is updated as data arrives.

39
Q

Deterministic vs stochastic

A

The output of the model is fully determined by the parameter values and inputs.

40
Q

stochastic

A

possess some inherent randomness

41
Q

Correlation: Pearson vs Spearman

A

Pearson measures linear correlation; Spearman is rank-based and captures monotonic (including nonlinear) relationships.
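
A sketch with SciPy on a monotonic but nonlinear relationship:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 20, dtype=float)
y = x ** 3                 # monotonic but nonlinear in x
print(pearsonr(x, y)[0])   # < 1: a straight line does not fit perfectly
print(spearmanr(x, y)[0])  # 1.0: the ranks agree perfectly
```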

42
Q

Cosine similarity

A

Mostly used to measure the similarity between two vectors (e.g., documents): the cosine of the angle between them.
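
A minimal sketch of the usual formula, dot(a, b) / (||a|| * ||b||):

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [0, 1]))  # 0.0: orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))  # 1.0: same direction
```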

43
Q

Minkowski Distance

A

Minkowski distance is a generalized distance metric:

D(x, y) = ( Σ_i |x_i − y_i|^p )^(1/p)

By substituting different values of p, we can calculate the distance between two data points in different ways (p = 1 gives Manhattan distance, p = 2 gives Euclidean distance). Thus, Minkowski distance is also known as the Lp norm distance.

A normed vector space is a vector space on which a norm is defined. If A is a vector space, a norm on A is a real-valued function ||a|| which satisfies the conditions below:
Zero vector: the zero vector has zero length.
Scalar factor: multiplying a vector by a positive number changes its length but not its direction.
Triangle inequality: ||a + b|| <= ||a|| + ||b||; the direct path between two points is never longer than a detour.
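
A sketch of the formula, with the two special cases noted above:

```python
def minkowski(x, y, p):
    # Lp norm distance: (sum_i |x_i - y_i|^p)^(1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))  # 7.0: p = 1, Manhattan distance
print(minkowski(x, y, 2))  # 5.0: p = 2, Euclidean distance
```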

44
Q

Euclidean distance

A

The Euclidean distance formula can be used to calculate the straight-line distance between two data points in a plane.

45
Q

Manhattan Distance

A

Manhattan distance is used to calculate the distance between two data points along a grid-like path.

46
Q

Hamming distance

A

Hamming distance is one of several string metrics for measuring the difference between two equal-length sequences: the number of positions at which they differ, i.e., how many attributes must be changed for one to match the other.
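
A minimal sketch:

```python
def hamming(s, t):
    # Number of positions at which two equal-length sequences differ.
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

print(hamming("karolin", "kathrin"))  # 3
```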

47
Q

What is factor analysis use for?

A

Factor analysis isn’t a single technique, but a family of statistical methods that can be used to identify the latent factors driving observable variables. Factor analysis is commonly used in market research, as well as other disciplines like technology, medicine, sociology, field biology, education, psychology and many more.

48
Q

Key concepts in factor analysis

A

One of the most important ideas in factor analysis is variance – how much your numerical values differ from the average. When you perform factor analysis, you’re looking to understand how the different underlying factors influence the variance among your variables. Every factor will have an influence, but some will explain more variance than others, meaning that the factor more accurately represents the variables it comprises.

The amount of variance a factor explains is expressed in an eigenvalue. If a factor solution has an eigenvalue of 1 or above, it explains more variance than a single observed variable – which means it can be useful to you in cutting down your number of variables. Factor solutions with eigenvalues less than 1 account for less variability than a single variable and are not retained in the analysis. In this sense, a solution would contain fewer factors than the original number of variables.

Another important metric is factor score. This is a numerical measure that describes how strongly a variable from the original research data is related to a given factor. Another term for this association or weighting towards a certain factor is factor loading.

49
Q

Gini impurity

A

measures the expected error rate if one of the results from a set is randomly applied to one of the items in the set.
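
A sketch of that definition for a set of labels:

```python
from collections import Counter

def gini_impurity(labels):
    # Probability that a randomly drawn label, applied to a randomly drawn
    # item from the same set, is wrong: 1 - sum_c p_c^2.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(["a", "a", "b", "b"]))  # 0.5
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0: a pure set has no expected error
```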

50
Q

Kernel

A

A kernel is a shortcut that helps us do certain calculations faster which would otherwise involve computations in a higher-dimensional space.
Mathematical definition: K(x, y) = <f(x), f(y)>. Here K is the kernel function; x and y are n-dimensional inputs; f is a map from the n-dimensional to an m-dimensional space; and <a, b> denotes the dot product. Usually m is much larger than n.
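
The classic demonstration of the shortcut, for the polynomial kernel K(x, y) = (x · y)^2, whose explicit map is f(x) = (x1^2, sqrt(2)·x1·x2, x2^2):

```python
import math

def f(v):
    # Explicit map from 2-D input space to 3-D feature space.
    return (v[0] ** 2, math.sqrt(2) * v[0] * v[1], v[1] ** 2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

x, y = (1.0, 2.0), (3.0, 4.0)
print(dot(x, y) ** 2)   # 121.0: the kernel shortcut, computed in 2-D
print(dot(f(x), f(y)))  # 121.0: the same dot product, computed in 3-D
```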

51
Q

Explain the difference between L1 and L2 regularization.

A

L2 regularization tends to spread error among all the terms, while L1 is more binary/sparse, with many variables either being assigned a 1 or 0 in weighting. L1 corresponds to setting a Laplacian prior on the terms, while L2 corresponds to a Gaussian prior.

52
Q

What’s the difference between Type I and Type II error?

A

Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn’t, while Type II error means that you claim nothing is happening when in fact something is.

A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn’t carrying a baby.

53
Q

Describe a linear function.

A
y = w*x + b
y = predicted value (predicted label)
x = input variable (feature)
w = weight (weight vector; gives the slope)
b = bias (the intercept)
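
As a sketch:

```python
def predict(x, w, b):
    # Linear function: y = w*x + b
    return w * x + b

print(predict(x=2.0, w=3.0, b=1.0))  # 7.0: slope w = 3, intercept b = 1
```
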
54
Q

Diffusion maps

A

A dimensionality reduction or feature extraction algorithm introduced by Coifman and Lafon[1][2][3][4] which computes a family of embeddings of a data set into Euclidean space (often low-dimensional) whose coordinates can be computed from the eigenvectors and eigenvalues of a diffusion operator on the data. The Euclidean distance between points in the embedded space is equal to the “diffusion distance” between probability distributions centered at those points. Unlike linear dimensionality reduction methods such as principal component analysis (PCA), diffusion maps belong to the family of nonlinear dimensionality reduction methods, which focus on discovering the underlying manifold that the data has been sampled from. By integrating local similarities at different scales, diffusion maps give a global description of the data set. Compared with other methods, the diffusion map algorithm is robust to noise perturbation and computationally inexpensive.

55
Q

Cluster Analysis: Gaussian Mixture Models (GMM)

A

An unsupervised learning technique for clustering that generates a mixture of clusters from the full data set using a Gaussian (normal) data distribution model for each cluster. The GMM’s output is a set of cluster attributes (mean, variance, and centroid) for each cluster, thereby producing a set of characterization metadata that serves as a compact descriptive model of the full data collection.

56
Q

HMM advantages and disadvantage

A

Advantages
HMM is a well-studied probabilistic graphical model, with established algorithms for (approximate) learning and inference.
HMMs capture the dependency between successive measurements, as defined in the switch continuity principle.
HMMs represent the variance of appliances’ power demands via probability distributions.
Disadvantages
An HMM cannot represent dependencies between appliances (though a conditional HMM can capture them).
Because of its Markovian nature, an HMM does not consider the sequence of states leading into any given state.
HMMs do not explicitly capture the time spent in a given state, again due to their Markovian behavior; the hidden semi-Markov model can capture that kind of behavior.

57
Q

What are hyperparameters?

A

Hyperparameters are the configuration settings used to tune how the model is trained.