Final Flashcards

1
Q

hyperparameter and examples

A

a parameter whose value is used to control the learning process
ex) batch size, number of epochs

2
Q

parameter grid

A

specifies the search space: the set of hyperparameter values whose combinations will be tried
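
A minimal sketch of how a parameter grid drives a search, assuming scikit-learn's GridSearchCV with an SVC classifier on a toy dataset (the specific values are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The parameter grid defines the search space: every combination is tried.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)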

3
Q

probabilistic graphical models

A

graphical representations of probability distributions in which variables can depend on other variables

4
Q

what are the benefits of graphical models?

A

learning dependencies, visualizing a probability model, graphical manipulations over latent variables, obtaining insights (like conditional independence)

5
Q

conditional independence

A

Two events A and B are conditionally independent given a third event C if, once C is known to have occurred, the occurrence of A and the occurrence of B are independent events.
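
In symbols, assuming standard notation: A and B are conditionally independent given C when P(A ∩ B | C) = P(A | C) · P(B | C), or equivalently P(A | B, C) = P(A | C).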

6
Q

How many types of probabilistic graphical models are there and what are they?

A

2 types: Bayesian Networks, Markov Networks

7
Q

What is the difference between Bayesian Networks and Markov Networks?

A

Bayesian Networks have directed graphs and Markov Networks have undirected graphs

8
Q

Bayesian network

A

directed edges between nodes that describe conditional dependencies
ex) sprinkler, rain, grass wet
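
In the classic version of this example (rain influences both the sprinkler and the wet grass, and the sprinkler also influences the wet grass), the joint distribution factorizes along the directed edges: P(Grass, Sprinkler, Rain) = P(Grass | Sprinkler, Rain) · P(Sprinkler | Rain) · P(Rain).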

9
Q

joint probability

A

Probability of 2 or more events happening at the same time. This uses the product/chain rule.
ex) Probability that a card drawn is red and 4
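
Worked example of the product rule: P(red and 4) = P(4 | red) · P(red) = (2/26) · (26/52) = 2/52 = 1/26.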

10
Q

marginal probability

A

probability of an event irrespective of the outcome of another variable (unconditional probability). This is the probability of a single event and this uses the sum rule.
ex) Probability that a card drawn is red
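
Worked example of the sum rule: P(red) = P(red, hearts) + P(red, diamonds) = 13/52 + 13/52 = 26/52 = 1/2.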

11
Q

conditional probability

A

probability of one event given that one or more related events have occurred
ex) given that we drew a red card, what is the probability that the red card has a 4
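
Worked example: P(4 | red) = P(4 and red) / P(red) = (2/52) / (26/52) = 2/26 = 1/13.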

12
Q

Bayesian Networks

A

a directed acyclic graph (a graph having no cycles) that models dependencies between the variables of the data set. Vertices are variables and edges represent conditional dependencies. It allows us to capture variable dependencies within the data which we can’t capture with linear and logistic regression. Bayesian networks use Bayesian inference.

13
Q

Inference

A

Process of using a trained machine learning algorithm to make a prediction.

14
Q

Posterior Probability

A

Probability of A (the hypothesis) occurring given that event B (the evidence) has already occurred

15
Q

Likelihood

A

Probability of B (the evidence) being true given that A is true

16
Q

Prior

A

Probability of A (the hypothesis) being true

17
Q

Evidence

A

Probability of B (the evidence) being true
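
The four pieces above fit together in Bayes’ theorem: P(A | B) = P(B | A) · P(A) / P(B), i.e. posterior = likelihood · prior / evidence.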

18
Q

Probability Density Function

A

A function that gives the relative likelihood of a continuous random variable taking on a particular value

19
Q

What are two ways to build a classifier?

A

1) Calculate posterior probabilities for a sample and assign it to a class that has the highest probability
2) create a discriminant function

20
Q

What would you use for a continuous random variable?

A

gaussian naive bayes

21
Q

What would you use for a categorical random variable?

A

categorical naive bayes

22
Q

What would you use for a multinomial distribution?

A

multinomial naive bayes

23
Q

What would you use for a binary random variable?

A

bernoulli naive bayes
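
A minimal sketch of two of the variants above, assuming scikit-learn's naive Bayes classes and made-up toy data:

import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

y = np.array([0, 0, 1, 1])

# Continuous features -> Gaussian naive Bayes
X_cont = np.array([[1.2, 3.4], [0.8, 2.9], [5.1, 7.2], [4.9, 6.8]])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 3.0]]))

# Binary features -> Bernoulli naive Bayes (MultinomialNB and CategoricalNB follow the same pattern)
X_bin = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
print(BernoulliNB().fit(X_bin, y).predict([[1, 0]]))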

24
Q

discriminant function

A

a function used to assign a sample to a class directly; because the evidence P(data) is the same for every class, we don’t need to calculate it

25
Q

What is the difference between a bayesian network and a naive bayes classifier?

A

a bayesian network assumes that there’s dependency between variables, whereas a naive bayes classifier assumes there’s no dependency between variables (the input features are independent)

26
Q

If the formula for Naive Bayes is given by P(class|data) = [P(data|class)P(class)] / P(data),
then which of the components makes this algorithm "naive":
(a) P(class|data)
(b) P(data|class)
(c) P(class)
(d) P(data)

A

(b) P(data|class)

27
Q

True or False: Naive Bayes assumes that the input features are independent

A

True

28
Q

For continuous data, which formulation of Naive Bayes is appropriate:

(a) Binomial Naive Bayes
(b) Multinomial Naive Bayes
(c) Gaussian Naive Bayes
(d) None of the above

A

(c) Gaussian Naive Bayes

29
Q
In the above problem, how would smoothing change the probability that Document 5 is Spam?
(a) Increase the probability
(b) Decrease the probability
(c) No change
A

(a) Increase the probability

30
Q
According to the lecture material, which of these is the typical loss function used for SVMs?
(a) Mean Squared Error
(b) Hinge Loss
(c) Gini Coefficient
(d) Cross Entropy
A

(b) Hinge Loss

31
Q

SVMs are less effective when:

(a) The data is linearly separable
(b) The data is clean and ready to use
(c) The data is noisy and contains overlapping points

A

(c) The data is noisy and contains overlapping points

32
Q

In SVM what is the meaning of a hard margin?

(a) The SVM allows very low error in classification
(b) The SVM allows high amount of error in classification
(c) None of the above

A

(a) The SVM allows very low error in classification

33
Q

True or False: SVM uses the kernel trick to classify non-linear data

A

True

34
Q

True or False: Grid search can be used to optimize hyperparameters of a machine learning
algorithm

A

True

35
Q

What does a kernel function do?
(a) Transforms linearly inseparable data into separable data by transforming to a higher dimension
(b) Transforms linearly inseparable data into separable data by transforming to a lower dimension

A

(a) Transforms linearly inseparable data into separable data by transforming to a higher dimension
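
As one concrete case (assuming the RBF kernel, one of several common choices): K(x, x') = exp(-γ ||x - x'||²) implicitly maps the data into a much higher-dimensional space where a linear separator can be found.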

36
Q

True or False: The decision boundary in non-linear SVM must be linear.

A

True (the separating hyperplane is linear in the kernel-transformed feature space, even though it appears non-linear in the original input space)

37
Q

When fitting an SVM, we attempt to optimize:

(a) The normal vector of the decision boundary
(b) The margin between the decision boundary and the data
(c) The density of the data on either side of the decision boundary
(d) None of the above

A

(b) The margin between the decision boundary and the data

38
Q
Given the following decision trees grown from the same dataset, which is the most likely to be overfit?
(a) f1-score 0.8, leaf count = 50
(b) f1-score 0.9, leaf count = 20
(c) f1-score 0.7, leaf count = 10
A

(a) f1-score 0.8, leaf count = 50
All else held equal, a decision tree with more leaves will have more complicated
decision boundaries, thus increasing our expectation that it will overfit to the data.

39
Q

Why is pruning of decision trees required?

(a) To avoid underfitting
(b) To avoid overfitting and make the model more generalized
(c) To eliminate unnecessary structure
(d) Both b and c

A

(d) Both b and c

40
Q

What does entropy of 0 on an attribute in the dataset mean?

(a) The data can be divided on the basis of this attribute
(b) The data is homogeneous across this attribute
(c) Neither of the above

A

(b) The data is homogeneous across this attribute

41
Q

True or False: Information gain measures the increase in uncertainty obtained by splitting
the dataset on a particular attribute

A

False

42
Q

Bagging of a decision tree involves

(a) Generating B different decision trees
(b) Bootstrapping B different datasets from existing data
(c) Taking an average of predictions across generated decision trees
(d) All of the above

A

(d) All of the above

43
Q

Using Information Gain, determine which of the following splits is most optimal:
(a) E(parent) = 0.9, E(child1) = 0.1, E(child2) = 0.7
(b) E(parent) = 0.9, E(child1) = 0.7, E(child2) = 0.7
(c) E(parent) = 0.4, E(child1) = 0.1, E(child2) = 0.3

A

(a) E(parent) = 0.9, E(child1) = 0.1, E(child2) = 0.7
The difference between the entropy of the parent and the average entropy of the
children is greatest in the first answer, which makes this the optimal split.
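
Worked check, assuming equal-sized children so their entropies are simply averaged: IG(a) = 0.9 - (0.1 + 0.7)/2 = 0.5, IG(b) = 0.9 - (0.7 + 0.7)/2 = 0.2, IG(c) = 0.4 - (0.1 + 0.3)/2 = 0.2.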

44
Q

What is the impact of not scaling features prior to fitting a decision tree?

(a) Poor classification performance
(b) Slow fitting time
(c) Overfitting
(d) None of the above

A

(d) None of the above

45
Q
Calculate the centroid of a cluster with the following data points: {(18,19), (17,10), (19,13), (15,17), (16,11)}
(a) (17,14)
(b) (17.35,14.2)
(c) (18,12)
(d) (16,14)
A

(a) (17,14)
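
Worked arithmetic: x̄ = (18 + 17 + 19 + 15 + 16)/5 = 85/5 = 17 and ȳ = (19 + 10 + 13 + 17 + 11)/5 = 70/5 = 14, giving (17, 14).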

46
Q

Under what conditions can we say a K-means clustering model has
converged?
(a) When the inter-cluster distance is less than a threshold
(b) When the number of iterations has reached a certain value
(c) When a wall-time has been reached
(d) All of the above

A

(d) All of the above

47
Q

True or False: Exclusive clustering stipulates that each data point can only exist in one cluster

A

True

48
Q

True or False: K-means clustering is robust against outliers

A

False

49
Q
What is the relation between the distance between clusters and the corresponding class discriminability?
(a) Proportional
(b) Inversely-proportional
(c) No relation
A

(a) Proportional

50
Q

If we had a dog picture dataset that had an unknown number of dog breeds
depicted in it, which of these is not a good use case for k-means clustering:
(a) Discovering similarities between dog pictures
(b) Discriminating between different breeds
(c) Identifying outlier dog pictures
(d) None of the above

A

(b) Discriminating between different breeds

51
Q

True or False: Principal Component Analysis, like Min-Max Normalization, is a
preprocessing step used to scale data prior to machine learning.
(a) True
(b) False

A

(b) False

52
Q

True or False: Principal Component Analysis can be used to reduce the dimensionality from
100 features down to a single feature.
(a) True
(b) False

A

(a) True

53
Q

Which of the following techniques would perform better for reducing
dimensions of a data set?
(a) Removing columns which have too many missing values
(b) Removing columns which have high variance in data
(c) Removing columns with dissimilar data trends
(d) None of these

A

(a) Removing columns which have too many missing values

54
Q

True or False: Dimensionality reduction algorithms are one of the possible ways to reduce
the computation time required to build a model.

A

True

55
Q

In PCA, the largest Eigenvector gives the direction of the

(a) Maximum scatter of the data
(b) Minimum scatter of the data
(c) No such information can be interpreted
(d) Second largest Eigenvector which is in the same direction.

A

(a) Maximum scatter of the data

56
Q

Which of the following is/are true about PCA?

1) PCA is an unsupervised method
2) It searches for the directions that data have the largest variance
3) Maximum number of principal components >= number of features
4) All principal components are orthogonal to each other
(a) 1 and 2
(b) 1 and 3
(c) 1, 2 and 4
(d) All of the above

A

(c) 1, 2 and 4

57
Q

Which of the following statements is correct for t-SNE and PCA?

(a) t-SNE is linear whereas PCA is non-linear
(b) t-SNE and PCA both are linear
(c) t-SNE and PCA both are nonlinear
(d) t-SNE is nonlinear whereas PCA is linear

A

(d) t-SNE is nonlinear whereas PCA is linear

58
Q

Principal Component Analysis has a closed form solution.

(a) True
(b) False

A

(a) True

59
Q

Distance Measure

A
  • Defines how the similarity of two elements (x,y) is calculated.
  • Determines the shape of the clusters.
60
Q

What are the different distance measures?

A

Manhattan, correlation-based, Euclidean, Hamming, cosine

61
Q

Euclidean Distance

A

• The straight-line distance between two points; used in common clustering algorithms.

62
Q

Correlation-based Distance

A
  • Eisen Cosine Correlation Distance, Kendall Correlation Distance, Pearson’s correlation (sensitive to outliers), Spearman Correlation Distance (not sensitive to outliers)
  • used in gene expression data analysis
63
Q

Hamming Distance

A

• The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different (i.e., the minimum number of substitutions required to change one string into the other).
• Used in information retrieval.

64
Q

Cosine Distance (Cosine Similarity Measure)

A
• The cosine similarity is the cosine function of the angle between the two feature vectors in a multidimensional space (vector space model).
• Used in text and image processing applications.
65
Q

Manhattan Distance

A

distance between two points in a grid based on a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal distance.

66
Q

Pearson Correlation

A

measure of correlation (or linear dependence) between two variables. Values range from -1 to +1.
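
A minimal sketch of several of these measures, assuming SciPy's scipy.spatial.distance module and made-up vectors:

from scipy.spatial import distance

x, y = [1, 0, 2], [2, 1, 0]
print(distance.euclidean(x, y))                # sqrt((1-2)^2 + (0-1)^2 + (2-0)^2)
print(distance.cityblock(x, y))                # Manhattan: |1-2| + |0-1| + |2-0| = 4
print(distance.cosine(x, y))                   # 1 - cosine similarity
print(distance.hamming([1, 0, 1], [1, 1, 0]))  # fraction of differing positions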

67
Q

difference between similarity measure and distance metric?

A

a similarity measure doesn’t have to meet any constraints, while a distance metric must satisfy the triangle inequality, symmetry, and identity

68
Q

Hierarchical Clustering

A

builds a hierarchy of clusters by repeatedly taking the union of the two nearest clusters

- can be built either Divisive (top-down) or Agglomerative (bottom-up)

69
Q

single-linkage clustering defines distance as

A

minimum distance between elements of each cluster

70
Q

complete-linkage clustering defines distance as

A

maximum distance between elements of each cluster

71
Q

average-linkage clustering defines distance as

A

mean distance between all elements in each cluster

72
Q

centroid linkage clustering defines distance as

A

distance between centroids of each cluster
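
A minimal sketch of agglomerative clustering with the linkage criteria above, assuming SciPy's hierarchy module and random toy data:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.rand(10, 2)
Z = linkage(X, method="average")  # also: "single", "complete", "centroid"
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)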

73
Q

Biclustering (or co-clustering)

A

because feature selection doesn’t always identify the important features, this helps us find clusters within different subspaces
- we can group samples and features at the same time

74
Q

Principal Component Analysis

A
  • used for dimensionality reduction and feature selection
  • matrix factorization approach which preserves the direction with maximal variance
  • a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
  • sensitive to outliers and missing values
  • has to be used with standardized data
  • preserves global distances
75
Q

how do you find pca?

A

standardize the data (z-scores), compute the covariance matrix, then solve for its eigenvalues and eigenvectors; the eigenvectors with the largest eigenvalues are the principal components
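
A minimal sketch of those steps, assuming scikit-learn's StandardScaler and PCA on made-up data (PCA solves the eigenproblem internally):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)
X_std = StandardScaler().fit_transform(X)  # z-score standardization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)       # variance captured by each component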

76
Q

Covariance matrix

A

captures how the variables of the dataset vary together, revealing high correlations between them. Highly correlated variables contain redundant information.

77
Q

Pearson Correlation Coefficient

A
Used to reduce dimensionality in data.
• Preserves maximum information represented by the features while eliminating redundant information represented by the features.
78
Q

T-distributed Stochastic Neighbor Embedding (t-SNE)

A

non-linear dimension reduction technique for higher-dimensional data.
• t-SNE converts a multi-dimensional dataset into a lower-dimensional dataset.

79
Q

decision trees

A

used for classification and regression, considers variable dependency

80
Q

greedy search for decision trees

A

uses a measure of purity for each node; at each split we select the attribute that generates the purest child nodes

81
Q

entropy

A

measure of the purity of a node (measures uncertainty)
0 = pure node
1 = equally divided node
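
For a two-class node, assuming base-2 logarithms: H = -p·log2(p) - (1-p)·log2(1-p), so a pure node (p = 1) gives H = 0 and an equally divided node (p = 0.5) gives H = 1.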

82
Q

how do we minimize entropy?

A

by maximizing information gain

83
Q

information gain

A

measure of the decrease in uncertainty obtained by splitting a dataset based on some additional attribute

84
Q

what are the approaches to pruning?

A
  1. pre-pruning: happens while we are growing the tree; it doesn’t look at combinations of attributes
  2. post-pruning: after building the full tree, we prune back from the leaves
85
Q

difference between decision trees and naive bayes

A

decision trees model dependencies between the features, which we don’t do in naive bayes (it assumes the features are independent)

86
Q

what are the variations of decision trees that can help prevent overfitting?

A

bagging: 1 parameter (train each tree on a bootstrapped sample of n observations)

random forest: 2 parameters (we can also randomize the features considered at each split)

87
Q

unsupervised learning

A

technique to find the groupings in a set of unlabeled data

88
Q

k-means clustering

A

partition data into k clusters
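
A minimal sketch, assuming scikit-learn's KMeans and a tiny made-up dataset:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each sample
print(km.cluster_centers_)  # the two centroids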

89
Q

association rule discovery

A

discover rules that describe large portions of the data

90
Q

fuzzy c-means (overlapping clustering)

A

each sample can belong to more than one cluster (all weights add up to 1)

91
Q

what is the difference between bic and aic?

A

both penalize complex models; BIC applies a larger penalty (it grows with the number of samples), so it favors simpler models than AIC

92
Q

elbow technique

A

the optimal number of clusters is the point where the distortion/inertia stops decreasing sharply (the elbow of the curve)

93
Q

distortion

A

the average of the squared distances from the points to their assigned cluster centers
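
A minimal sketch of the elbow technique from the previous card, assuming scikit-learn's KMeans and its inertia_ attribute (the within-cluster sum of squared distances) as the distortion-like quantity to track:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 10)]
# Plot k against inertia; the "elbow" where the curve flattens suggests k.
print(inertias)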

94
Q

support vector machine

A

supervised learning

want to maximize margin between support vectors and decision boundary

95
Q

hinge loss

A

helps to maximize the margin by penalizing misclassified samples
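
In symbols, for a label y in {-1, +1} and model score f(x): hinge loss = max(0, 1 - y·f(x)); it is zero for points correctly classified outside the margin and grows linearly for margin violations and misclassifications.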