Final Flashcards

1
Q

hyperparameter and examples

A

a parameter whose value is set before training and is used to control the learning process
ex) batch size, number of epochs

2
Q

parameter grid

A

specifies the search space: the set of hyperparameter combinations to try
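A minimal sketch of a parameter grid with scikit-learn's grid search (a later card notes grid search is used to optimize hyperparameters); the model, dataset, and grid values here are illustrative assumptions:

```python
# Illustrative sketch: exhaustive search over a small hyperparameter grid.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The parameter grid defines the search space: every combination is tried.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```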

3
Q

probabilistic graphical models

A

graphical representations of probability distributions in which nodes are random variables and edges encode how variables depend on other variables

4
Q

what are the benefits of graphical models?

A

learning dependencies, visualizing a probability model, graphical manipulations over latent variables, obtaining insights (like conditional independence)

5
Q

conditional independence

A

2 events A and B are conditionally independent given a 3rd event C if, once C is known, the occurrence of A provides no information about the occurrence of B: P(A and B | C) = P(A | C) P(B | C)

6
Q

How many types of probabilistic graphical models are there and what are they?

A

2 types: Bayesian Networks, Markov Networks

7
Q

What is the difference between Bayesian Networks and Markov Networks?

A

Bayesian Networks have directed graphs and Markov Networks have undirected graphs

8
Q

Bayesian network

A

directed edges between nodes that describe conditional dependencies
ex) sprinkler, rain, grass wet
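A minimal sketch of the sprinkler/rain/grass-wet example in plain Python; all conditional probability values below are illustrative assumptions, not from the source:

```python
from itertools import product

# Illustrative CPTs for the rain -> sprinkler -> grass-wet network.
# Every probability value here is assumed, for demonstration only.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},   # P(sprinkler | rain)
               False: {True: 0.4, False: 0.6}}
P_wet = {(True, True): 0.99, (True, False): 0.8,  # P(wet | sprinkler, rain)
         (False, True): 0.9, (False, False): 0.0}

# The joint probability factorizes along the directed edges (chain rule):
# P(R, S, W) = P(R) * P(S | R) * P(W | S, R)
def joint(r, s, w):
    p_w = P_wet[(s, r)] if w else 1 - P_wet[(s, r)]
    return P_rain[r] * P_sprinkler[r][s] * p_w

# Marginalize: P(wet) = sum over all rain and sprinkler states.
p_wet = sum(joint(r, s, True) for r, s in product([True, False], repeat=2))
print(f"P(grass wet) = {p_wet:.3f}")
```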

9
Q

joint probability

A

Probability of 2 or more events happening at the same time; computed with the product/chain rule: P(A and B) = P(A | B) P(B)
ex) Probability that a card drawn is red and 4

10
Q

marginal probability

A

probability of an event irrespective of the outcome of another variable (unconditional probability). This is the probability of a single event and it uses the sum rule: P(A) = sum over b of P(A and B = b).
ex) Probability that a card drawn is red

11
Q

conditional probability

A

probability of one event given some relationship to one or more other events
ex) given that we drew a red card, what is the probability that the card is a 4
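A quick check of the three card-deck examples (joint, marginal, conditional) using exact fractions:

```python
from fractions import Fraction

# Standard 52-card deck: 26 red cards, two of which are red 4s.
red = Fraction(26, 52)            # marginal: P(red)
red_and_4 = Fraction(2, 52)       # joint: P(red and 4)
four_given_red = red_and_4 / red  # conditional: P(4 | red) = P(red and 4) / P(red)

print(red, red_and_4, four_given_red)  # 1/2 1/26 1/13
```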

12
Q

Bayesian Networks

A

directed acyclic graphs (graphs having no cycles) that model dependencies between the variables of the data set. Vertices are variables and edges represent conditional dependencies. This lets us capture variable dependencies within the data, which we can’t capture with linear or logistic regression. Bayesian networks use Bayesian inference.

13
Q

Inference

A

Process of using a trained machine learning algorithm to make a prediction.

14
Q

Posterior Probability

A

Probability of A (the hypothesis) occurring given that event B (the evidence) has already occurred: P(A | B)

15
Q

Likelihood

A

Probability of B (the evidence) being true given that A is true: P(B | A)

16
Q

Prior

A

Probability of A (the hypothesis) being true: P(A)

17
Q

Evidence

A

Probability of B (the evidence) being true: P(B)
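These four quantities combine via Bayes' rule: posterior = likelihood × prior / evidence. A minimal sketch with hypothetical numbers for a diagnostic test (all values assumed for illustration):

```python
# Hypothetical values: P(disease), P(positive | disease), P(positive | no disease).
prior = 0.01           # P(A): prior probability of the hypothesis
likelihood = 0.95      # P(B | A): probability of the evidence given the hypothesis
false_positive = 0.05  # P(B | not A)

# Evidence P(B) by total probability, then the posterior P(A | B).
evidence = likelihood * prior + false_positive * (1 - prior)
posterior = likelihood * prior / evidence
print(f"P(A | B) = {posterior:.3f}")  # ~0.161
```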

18
Q

Probability Density Function

A

Gives the relative likelihood of each outcome of a continuous random variable; probabilities are obtained by integrating it over a range of outcomes

19
Q

What are two ways to build a classifier?

A

1) Calculate posterior probabilities for a sample and assign it to a class that has the highest probability
2) create a discriminant function

20
Q

What would you use for a continuous random variable?

A

gaussian naive bayes

21
Q

What would you use for a categorical random variable?

A

categorical naive bayes

22
Q

What would you use for a multinomial distribution?

A

multinomial naive bayes

23
Q

What would you use for a binary random variable?

A

Bernoulli naive bayes
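The four variants above map onto scikit-learn classes with the same fit/predict interface; a minimal sketch with illustrative data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, CategoricalNB

# Continuous features -> Gaussian naive Bayes.
X = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])
y = np.array([0, 0, 1, 1])
print(GaussianNB().fit(X, y).predict([[1.1, 2.0]]))  # -> [0]

# Count features -> MultinomialNB; binary features -> BernoulliNB;
# categorical codes -> CategoricalNB (all share the same interface).
```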

24
Q

discriminant function

A

a per-class scoring function; a sample is assigned to the class with the highest score. Because the evidence P(data) is the same for every class, we don’t need to calculate it.

25
What is the difference between a bayesian network and a naive bayes classifier?
a Bayesian network models dependencies between variables, whereas a naive Bayes classifier assumes there is no dependency between variables (the input features are independent)
26
If the formula for Naive Bayes is given by P(class|data) = [P(data|class) P(class)] / P(data), then which of the components makes this algorithm "naive": (a) P(class|data) (b) P(data|class) (c) P(class) (d) P(data)
(b) P(data|class)
27
True or False: Naive Bayes assumes that the input features are independent
True
28
For continuous data, which formulation of Naive Bayes is appropriate: (a) Binomial Naive Bayes (b) Multinomial Naive Bayes (c) Gaussian Naive Bayes (d) None of the above
(c) Gaussian Naive Bayes
29
In the above problem, how would smoothing change the probability that Document 5 is Spam? (a) Increase the probability (b) Decrease the probability (c) No change
(a) Increase the probability
30
According to the lecture material, which of these is the typical loss function used for SVMs? (a) Mean Squared Error (b) Hinge Loss (c) Gini Coefficient (d) Cross Entropy
(b) Hinge Loss
31
SVMs are less effective when: (a) The data is linearly separable (b) The data is clean and ready to use (c) The data is noisy and contains overlapping points
(c) The data is noisy and contains overlapping points
32
In SVM what is the meaning of a hard margin? (a) The SVM allows very low error in classification (b) The SVM allows high amount of error in classification (c) None of the above
(a) The SVM allows very low error in classification
33
True or False: SVM uses the kernel trick to classify non-linear data
True
34
True or False: Grid search can be used to optimize hyperparameters of a machine learning algorithm
True
35
What does a kernel function do? (a) Transforms linearly inseparable data into separable data by transforming to a higher dimension (b) Transforms linearly inseparable data into separable data by transforming to a lower dimension
(a) Transforms linearly inseparable data into separable data by transforming to a higher dimension
36
True or False: The decision boundary in non-linear SVM must be linear.
True (in the kernel-transformed, higher-dimensional feature space the boundary is a linear hyperplane; it only appears non-linear in the original input space)
37
When fitting an SVM, we attempt to optimize: (a) The normal vector of the decision boundary (b) The margin between the decision boundary and the data (c) The density of the data on either side of the decision boundary (d) None of the above
(b) The margin between the decision boundary and the data
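Tying the SVM cards together, a minimal scikit-learn sketch: XOR-style toy data (an illustrative choice) that is not linearly separable, classified with an RBF kernel and a large C (approximating a hard margin):

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# A linear kernel cannot separate this; the RBF kernel implicitly maps
# the data to a higher dimension where a separating hyperplane exists.
clf = SVC(kernel="rbf", C=100.0)  # large C ~ hard margin (little slack allowed)
clf.fit(X, y)
print(clf.predict(X))  # [0 1 1 0]
```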
38
Given the following decision trees grown from the same dataset, which is the most likely to be overfit? (a) f1-score 0.8, leaf count = 50 (b) f1-score 0.9, leaf count = 20 (c) f1-score 0.7, leaf count = 10
(a) f1-score 0.8, leaf count = 50 All else held equal, a decision tree with more leaves will have more complicated decision boundaries, thus increasing our expectation that it will overfit to the data.
39
Why is pruning of decision trees required? (a) To avoid underfitting (b) To avoid overfitting and make the model more generalized (c) To eliminate unnecessary structure (d) Both b and c
(d) Both b and c
40
What does entropy of 0 on an attribute in the dataset mean? (a) The data can be divided on the basis of this attribute (b) The data is homogeneous across this attribute (c) Neither of the above
(b) The data is homogeneous across this attribute
41
True or False: Information gain measures the increase in uncertainty obtained by splitting the dataset on a particular attribute
False (information gain measures the decrease in uncertainty)
42
Bagging of a decision tree involves (a) Generating B different decision trees (b) Bootstrapping B different datasets from existing data (c) Taking an average of predictions across generated decision trees (d) All of the above
(d) All of the above
43
Using Information Gain, determine which of the following splits is most optimal: (a) E(parent) = 0.9, E(child1) = 0.1, E(child2) = 0.7 (b) E(parent) = 0.9, E(child1) = 0.7, E(child2) = 0.7 (c) E(parent) = 0.4, E(child1) = 0.1, E(child2) = 0.3
(a) E(parent) = 0.9, E(child1) = 0.1, E(child2) = 0.7 The difference between the entropy of the parent and the average entropy of the children is greatest in the first answer, which makes this the optimal split.
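A quick check of that arithmetic, assuming both children of each split are the same size (an assumption; with unequal children the child entropies would be weighted by size):

```python
# Information gain = E(parent) - weighted average of child entropies.
splits = {"a": (0.9, 0.1, 0.7), "b": (0.9, 0.7, 0.7), "c": (0.4, 0.1, 0.3)}
for name, (parent, c1, c2) in splits.items():
    gain = parent - (c1 + c2) / 2  # equal-sized children assumed
    print(name, round(gain, 2))    # a: 0.5, b: 0.2, c: 0.2
```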
44
What is the impact of not scaling features prior to fitting a decision tree? (a) Poor classification performance (b) Slow fitting time (c) Overfitting (d) None of the above
(d) None of the above
45
Calculate the centroid of a cluster with the following data points: {(18,19), (17,10), (19,13), (15,17), (16,11)} (a) (17,14) (b) (17.35,14.2) (c) (18,12) (d) (16,14)
(a) (17,14)
46
Under what conditions can we say a K-means clustering model has converged? (a) When the inter-cluster distance is less than a threshold (b) When the number of iterations has reached a certain value (c) When a wall-time has been reached (d) All of the above
(d) All of the above
47
True or False: Exclusive clustering stipulates that each data point can only exist in one cluster
True
48
True or False: K-means clustering is robust against outliers
False
49
What is the relation between the distance between clusters and the corresponding class discriminability? (a) Proportional (b) Inversely-proportional (c) No relation
(a) Proportional
50
If we had a dog picture dataset that had an unknown number of dog breeds depicted in it, which of these is not a good use case for k-means clustering: (a) Discovering similarities between dog pictures (b) Discriminating between different breeds (c) Identifying outlier dog pictures (d) None of the above
(b) Discriminating between different breeds
51
True or False: Principal Component Analysis, like Min-Max Normalization, is a preprocessing step used to scale data prior to machine learning. (a) True (b) False
(b) False
52
True or False: Principal Component Analysis can be used to reduce the dimensionality from 100 features down to a single feature. (a) True (b) False
(a) True
53
Which of the following techniques would perform better for reducing dimensions of a data set? (a) Removing columns which have too many missing values (b) Removing columns which have high variance in data (c) Removing columns with dissimilar data trends (d) None of these
(a) Removing columns which have too many missing values
54
True or False: Dimensionality reduction algorithms are one of the possible ways to reduce the computation time required to build a model.
True
55
In PCA, the largest Eigenvector gives the direction of the (a) Maximum scatter of the data (b) Minimum scatter of the data (c) No such information can be interpreted (d) Second largest Eigenvector which is in the same direction.
(a) Maximum scatter of the data
56
Which of the following is/are true about PCA? 1) PCA is an unsupervised method 2) It searches for the directions that data have the largest variance 3) Maximum number of principal components >= number of features 4) All principal components are orthogonal to each other (a) 1 and 2 (b) 1 and 3 (c) 1, 2 and 4 (d) All of the above
(c) 1, 2 and 4
57
Which of the following statements is correct for t-SNE and PCA? (a) t-SNE is linear whereas PCA is non-linear (b) t-SNE and PCA both are linear (c) t-SNE and PCA both are nonlinear (d) t-SNE is nonlinear whereas PCA is linear
(d) t-SNE is nonlinear whereas PCA is linear
58
Principal Component Analysis has a closed form solution. (a) True (b) False
(a) True
59
Distance Measure
- Defines how the similarity of two elements (x, y) is calculated.
- Determines the shape of the clusters.
60
What are the different distance measures?
Manhattan, correlation-based, Euclidean, Hamming, cosine
61
Euclidean Distance
The straight-line distance between two points; used in common clustering algorithms such as k-means.
62
Correlation-based Distance
- Examples: Eisen cosine correlation distance, Kendall correlation distance, Pearson correlation (sensitive to outliers), Spearman correlation distance (not sensitive to outliers)
- Used in gene expression data analysis
63
Hamming Distance
The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ (the minimum number of substitutions needed to turn one string into the other). Used in information retrieval.
64
Cosine Distance (Cosine Similarity Measure)
The cosine similarity is the cosine of the angle between two feature vectors in a multidimensional space (vector space model). Used in text and image processing applications.
65
Manhattan Distance
distance between two points in a grid based on a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal distance.
66
Pearson Correlation
measure of correlation (or linear dependence) between two variables; values range from -1 to +1
67
difference between similarity measure and distance metric?
a similarity measure doesn't have to meet any constraints; a distance metric must satisfy the triangle inequality, symmetry, and identity
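Minimal NumPy sketches of the distance measures listed above (hand-rolled for illustration, not library code):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))  # straight-line distance

def manhattan(x, y):
    return np.sum(np.abs(x - y))          # grid-line (taxicab) distance

def hamming(x, y):
    return np.sum(x != y)                 # mismatched positions (equal lengths)

def cosine_distance(x, y):
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1 - cos_sim                    # 1 minus the cosine of the angle

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 5.0])
print(euclidean(x, y), manhattan(x, y), cosine_distance(x, y))
```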
68
Hierarchical Clustering
repeatedly merges (unions) the two nearest clusters; can be built either divisively (top-down) or agglomeratively (bottom-up)
69
single-linkage cluster defines distance as
minimum distance between elements of each cluster
70
complete-linkage clustering defines distance as
maximum distance between elements of each cluster
71
average-linkage clustering defines distance as
mean distance between all elements in each cluster
72
centroid linkage clustering defines distance as
distance between centroids of each cluster
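The four linkage criteria above map directly onto the `method` argument of SciPy's agglomerative clustering; a minimal sketch with illustrative data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# 'single' = min distance, 'complete' = max, 'average' = mean,
# 'centroid' = distance between cluster centroids.
Z = linkage(X, method="single")
print(fcluster(Z, t=2, criterion="maxclust"))  # two clusters, e.g. [1 1 2 2]
```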
73
Biclustering (or co-clustering)
because feature selection doesn't always identify the important features, biclustering helps us find clusters within different subspaces; it groups samples and features at the same time
74
Principal Component Analysis
- Used for dimensionality reduction and feature selection.
- A matrix factorization approach that preserves the directions of maximal variance.
- A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
- Sensitive to outliers and missing values; must be used with standardized data.
- Preserves global distances.
75
how do you find pca?
standardize the data (z-scores), compute the covariance matrix, then solve for its eigenvalues and eigenvectors; the eigenvectors with the largest eigenvalues are the principal components
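A minimal NumPy sketch of that recipe (standardize, covariance, eigendecomposition), on random data for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# 1) Standardize each feature (z-scores).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2) Covariance matrix of the standardized data.
C = np.cov(Z, rowvar=False)

# 3) Eigendecomposition; sort components by descending eigenvalue (variance).
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]     # principal directions
projected = Z @ components[:, :2]  # keep the top 2 components
print(projected.shape)             # (100, 2)
```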
76
Covariance matrix
captures the pairwise correlations between the variables of the dataset; highly correlated variables contain redundant information
77
Pearson Correlation Coefficient
can be used to reduce dimensionality in data: preserves the maximum information represented by the features while eliminating redundant information (highly correlated features)
78
t-distributed Stochastic Neighbor Embedding (t-SNE)
non-linear dimensionality reduction technique for high-dimensional data; converts a multi-dimensional dataset into a lower-dimensional one
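A minimal usage sketch with scikit-learn's implementation, using the digits dataset as an illustrative input:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)  # 64-dimensional inputs
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X_2d.shape)                    # (1797, 2)
```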
79
decision trees
used for classification and regression, considers variable dependency
80
greedy search for decision trees
uses a measure of purity for each node; at each step we select the attribute that generates the purest child nodes
81
entropy
measure of the impurity (uncertainty) of a node: 0 = pure node, 1 = equally divided node (for two classes)
82
how do we minimize entropy?
by maximizing information gain
83
information gain
measure of the decrease in uncertainty obtained by splitting a dataset based on some additional attribute
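Minimal sketches of entropy and information gain for label arrays (illustrative, not library code):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy in bits: 0 for a pure node, 1 for a 50/50 binary node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Decrease in entropy from splitting `parent` into the `children` subsets.
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

y = np.array([0, 0, 1, 1, 1, 1])
print(entropy(y))                           # ~0.918
print(information_gain(y, [y[:2], y[2:]]))  # ~0.918: a perfect split
```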
84
what are the approaches to pruning
1. pre-pruning: happens while the tree is being grown; it doesn't look at combinations of attributes
2. post-pruning: after building the tree, we prune from the leaves upward
85
difference between decision trees and naive bayes
decision trees model dependencies between features, which naive Bayes does not (it assumes the features are independent)
86
what are the variations of decision trees that can help prevent overfitting?
bagging: 1 parameter (train each tree on a bootstrap sample of n observations); random forest: 2 parameters (additionally randomizes the features considered at each split)
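Both variants are available in scikit-learn; a minimal sketch (dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: B trees, each fit on a bootstrap sample of the observations.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)

# Random forest: also randomizes the features considered at each split.
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt").fit(X, y)
print(bag.score(X, y), rf.score(X, y))
```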
87
unsupervised learning
technique to find the groupings in a set of unlabeled data
88
k-means clustering
partitions data into k clusters
89
association rule discovery
discover rules that describe large portions of the data
90
fuzzy c-means (overlapping clustering)
each sample can belong to more than one cluster (all weights add up to 1)
91
what is the difference between bic and aic?
both penalize complex models; BIC applies a larger penalty (it grows with the log of the sample size), so it favors simpler models than AIC
92
elbow technique
the optimal number of clusters is the point where the distortion/inertia stops decreasing sharply (the "elbow" of the curve)
93
distortion
the average of the squared distances from the points to their assigned cluster centers
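A minimal elbow-technique sketch using scikit-learn's KMeans and its `inertia_` attribute, with synthetic blobs chosen so the elbow should appear near k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs, so the elbow should appear near k = 3.
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia drops sharply until the elbow
```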
94
support vector machine
supervised learning method; aims to maximize the margin between the support vectors and the decision boundary
95
hinge loss
helps to maximize the margin by penalizing misclassified samples and samples that fall inside the margin
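A minimal NumPy sketch of the hinge loss for labels y in {-1, +1} and raw decision scores f(x):

```python
import numpy as np

def hinge_loss(y, scores):
    # Zero loss for correct predictions beyond the margin (y * score >= 1);
    # misclassified or within-margin samples are penalized linearly.
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

y = np.array([1, -1, 1, -1])
scores = np.array([2.0, -0.5, 0.3, 1.0])  # the last sample is misclassified
print(hinge_loss(y, scores))  # (0 + 0.5 + 0.7 + 2.0) / 4 = 0.8
```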