Final Flashcards

1
Q

hyperparameter and examples

A

a parameter whose value is used to control the learning process
ex) batch size, number of epochs

2
Q

parameter grid

A

specifies the search space: the set of hyperparameter values whose combinations will be tried
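
A minimal sketch of how a parameter grid drives a search, assuming scikit-learn's GridSearchCV with an SVC classifier on a toy dataset (the specific values are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The parameter grid defines the search space: every combination is tried.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)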

3
Q

probabilistic graphical models

A

graphical representations of probability distributions in which variables can depend on other variables

4
Q

what are the benefits of graphical models?

A

learning dependencies, visualizing a probability model, graphical manipulations over latent variables, obtaining insights (like conditional independence)

5
Q

conditional independence

A

Two events A and B are conditionally independent given a third event C if, once C is known to have occurred, the occurrence of A and the occurrence of B are independent events.
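
In symbols, assuming standard notation: A and B are conditionally independent given C when P(A ∩ B | C) = P(A | C) · P(B | C), or equivalently P(A | B, C) = P(A | C).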

6
Q

How many types of probabilistic graphical models are there and what are they?

A

2 types: Bayesian Networks, Markov Networks

7
Q

What is the difference between Bayesian Networks and Markov Networks?

A

Bayesian Networks have directed graphs and Markov Networks have undirected graphs

8
Q

Bayesian network

A

directed edges between nodes that describe conditional dependencies
ex) sprinkler, rain, grass wet
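
In the classic version of this example (rain influences both the sprinkler and the wet grass, and the sprinkler also influences the wet grass), the joint distribution factorizes along the directed edges: P(Grass, Sprinkler, Rain) = P(Grass | Sprinkler, Rain) · P(Sprinkler | Rain) · P(Rain).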

9
Q

joint probability

A

Probability of 2 or more events happening at the same time. This uses the product/chain rule.
ex) Probability that a card drawn is red and 4
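
Worked example of the product rule: P(red and 4) = P(4 | red) · P(red) = (2/26) · (26/52) = 2/52 = 1/26.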

10
Q

marginal probability

A

probability of an event irrespective of the outcome of another variable (unconditional probability). This is the probability of a single event and this uses the sum rule.
ex) Probability that a card drawn is red
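
Worked example of the sum rule: P(red) = P(red, hearts) + P(red, diamonds) = 13/52 + 13/52 = 26/52 = 1/2.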

11
Q

conditional probability

A

probability of one event given that one or more related events have occurred
ex) given that we drew a red card, what is the probability that the red card has a 4
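
Worked example: P(4 | red) = P(4 and red) / P(red) = (2/52) / (26/52) = 2/26 = 1/13.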

12
Q

Bayesian Networks

A

a directed acyclic graph (a graph having no cycles) that models dependencies between the variables of the data set. Vertices are variables and edges represent conditional dependencies. It allows us to capture variable dependencies within the data which we can’t capture with linear and logistic regression. Bayesian networks use Bayesian inference.

13
Q

Inference

A

Process of using a trained machine learning algorithm to make a prediction.

14
Q

Posterior Probability

A

Probability of A (the hypothesis) occurring given that event B (the evidence) has already occurred

15
Q

Likelihood

A

Probability of B (the evidence) being true given that A is true

16
Q

Prior

A

Probability of A (the hypothesis) being true

17
Q

Evidence

A

Probability of B (the evidence) being true
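
The four pieces above fit together in Bayes’ theorem: P(A | B) = P(B | A) · P(A) / P(B), i.e. posterior = likelihood · prior / evidence.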

18
Q

Probability Density Function

A

A function that gives the relative likelihood of a continuous random variable taking on a particular value

19
Q

What are two ways to build a classifier?

A

1) Calculate posterior probabilities for a sample and assign it to a class that has the highest probability
2) create a discriminant function

20
Q

What would you use for a continuous random variable?

A

gaussian naive bayes

21
Q

What would you use for a categorical random variable?

A

categorical naive bayes

22
Q

What would you use for a multinomial distribution?

A

multinomial naive bayes

23
Q

What would you use for a binary random variable?

A

bernoulli naive bayes
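
A minimal sketch of two of the variants above, assuming scikit-learn's naive Bayes classes and made-up toy data:

import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

y = np.array([0, 0, 1, 1])

# Continuous features -> Gaussian naive Bayes
X_cont = np.array([[1.2, 3.4], [0.8, 2.9], [5.1, 7.2], [4.9, 6.8]])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 3.0]]))

# Binary features -> Bernoulli naive Bayes (MultinomialNB and CategoricalNB follow the same pattern)
X_bin = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
print(BernoulliNB().fit(X_bin, y).predict([[1, 0]]))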

24
Q

discriminant function

A

a function used to assign a sample to a class directly; because the evidence P(data) is the same for every class, we don’t need to calculate it

25
Q

What is the difference between a bayesian network and a naive bayes classifier?

A

a bayesian network assumes that there’s dependency between variables, whereas a naive bayes classifier assumes there’s no dependency between variables (the input features are independent)

26
Q

If the formula for Naive Bayes is given by P(class|data) = [P(data|class)P(class)] / P(data),
then which of the components makes this algorithm "naive":
(a) P(class|data)
(b) P(data|class)
(c) P(class)
(d) P(data)

A

(b) P(data|class)

27
Q

True or False: Naive Bayes assumes that the input features are independent

A

True

28
Q

For continuous data, which formulation of Naive Bayes is appropriate:

(a) Binomial Naive Bayes
(b) Multinomial Naive Bayes
(c) Gaussian Naive Bayes
(d) None of the above

A

(c) Gaussian Naive Bayes

29
Q
In the above problem, how would smoothing change the probability that Document 5 is Spam?
(a) Increase the probability
(b) Decrease the probability
(c) No change
A

(a) Increase the probability

30
Q
According to the lecture material, which of these is the typical loss function used for SVMs?
(a) Mean Squared Error
(b) Hinge Loss
(c) Gini Coefficient
(d) Cross Entropy
A

(b) Hinge Loss

31
Q

SVMs are less effective when:

(a) The data is linearly separable
(b) The data is clean and ready to use
(c) The data is noisy and contains overlapping points

A

(c) The data is noisy and contains overlapping points

32
Q

In SVM what is the meaning of a hard margin?

(a) The SVM allows very low error in classification
(b) The SVM allows high amount of error in classification
(c) None of the above

A

(a) The SVM allows very low error in classification

33
Q

True or False: SVM uses the kernel trick to classify non-linear data

A

True

34
Q

True or False: Grid search can be used to optimize hyperparameters of a machine learning
algorithm

A

True

35
Q

What does a kernel function do?
(a) Transforms linearly inseparable data into separable data by transforming to a higher dimension
(b) Transforms linearly inseparable data into separable data by transforming to a lower dimension

A

(a) Transforms linearly inseparable data into separable data by transforming to a higher dimension
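
As one concrete case (assuming the RBF kernel, one of several common choices): K(x, x') = exp(-γ ||x - x'||²) implicitly maps the data into a much higher-dimensional space where a linear separator can be found.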

36
Q

True or False: The decision boundary in non-linear SVM must be linear.

A

True (the separating hyperplane is linear in the kernel-transformed feature space, even though it appears non-linear in the original input space)

37
Q

When fitting an SVM, we attempt to optimize:

(a) The normal vector of the decision boundary
(b) The margin between the decision boundary and the data
(c) The density of the data on either side of the decision boundary
(d) None of the above

A

(b) The margin between the decision boundary and the data

38
Q
Given the following decision trees grown from the same dataset, which is the most likely to be overfit?
(a) f1-score 0.8, leaf count = 50
(b) f1-score 0.9, leaf count = 20
(c) f1-score 0.7, leaf count = 10
A

(a) f1-score 0.8, leaf count = 50
All else held equal, a decision tree with more leaves will have more complicated
decision boundaries, thus increasing our expectation that it will overfit to the data.

39
Q

Why is pruning of decision trees required?

(a) To avoid underfitting
(b) To avoid overfitting and make the model more generalized
(c) To eliminate unnecessary structure
(d) Both b and c

A

(d) Both b and c

40
Q

What does entropy of 0 on an attribute in the dataset mean?

(a) The data can be divided on the basis of this attribute
(b) The data is homogeneous across this attribute
(c) Neither of the above

A

(b) The data is homogeneous across this attribute

41
Q

True or False: Information gain measures the increase in uncertainty obtained by splitting
the dataset on a particular attribute

A

False

42
Q

Bagging of a decision tree involves

(a) Generating B different decision trees
(b) Bootstrapping B different datasets from existing data
(c) Taking an average of predictions across generated decision trees
(d) All of the above

A

(d) All of the above

43
Q

Using Information Gain, determine which of the following splits is most optimal:
(a) E(parent) = 0.9, E(child1) = 0.1, E(child2) = 0.7
(b) E(parent) = 0.9, E(child1) = 0.7, E(child2) = 0.7
(c) E(parent) = 0.4, E(child1) = 0.1, E(child2) = 0.3

A

(a) E(parent) = 0.9, E(child1) = 0.1, E(child2) = 0.7
The difference between the entropy of the parent and the average entropy of the
children is greatest in the first answer, which makes this the optimal split.
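
Worked check, assuming equal-sized children so their entropies are simply averaged: IG(a) = 0.9 - (0.1 + 0.7)/2 = 0.5, IG(b) = 0.9 - (0.7 + 0.7)/2 = 0.2, IG(c) = 0.4 - (0.1 + 0.3)/2 = 0.2.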

44
Q

What is the impact of not scaling features prior to fitting a decision tree?

(a) Poor classification performance
(b) Slow fitting time
(c) Overfitting
(d) None of the above

A

(d) None of the above

45
Q
Calculate the centroid of a cluster with the following data points: {(18,19), (17,10), (19,13), (15,17), (16,11)}
(a) (17,14)
(b) (17.35,14.2)
(c) (18,12)
(d) (16,14)
A

(a) (17,14)
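
Worked arithmetic: x̄ = (18 + 17 + 19 + 15 + 16)/5 = 85/5 = 17 and ȳ = (19 + 10 + 13 + 17 + 11)/5 = 70/5 = 14, giving (17, 14).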

46
Q

Under what conditions can we say a K-means clustering model has
converged?
(a) When the inter-cluster distance is less than a threshold
(b) When the number of iterations has reached a certain value
(c) When a wall-time has been reached
(d) All of the above

A

(d) All of the above

47
Q

True or False: Exclusive clustering stipulates that each data point can only exist in one cluster

A

True

48
Q

True or False: K-means clustering is robust against outliers

A

False

49
Q
What is the relation between the distance between clusters and the corresponding class discriminability?
(a) Proportional
(b) Inversely-proportional
(c) No relation
A

(a) Proportional

50
Q

If we had a dog picture dataset that had an unknown number of dog breeds
depicted in it, which of these is not a good use case for k-means clustering:
(a) Discovering similarities between dog pictures
(b) Discriminating between different breeds
(c) Identifying outlier dog pictures
(d) None of the above

A

(b) Discriminating between different breeds

51
Q

True or False: Principal Component Analysis, like Min-Max Normalization, is a
preprocessing step used to scale data prior to machine learning.
(a) True
(b) False

A

(b) False

52
Q

True or False: Principal Component Analysis can be used to reduce the dimensionality from
100 features down to a single feature.
(a) True
(b) False

A

(a) True

53
Q

Which of the following techniques would perform better for reducing
dimensions of a data set?
(a) Removing columns which have too many missing values
(b) Removing columns which have high variance in data
(c) Removing columns with dissimilar data trends
(d) None of these

A

(a) Removing columns which have too many missing values

54
Q

True or False: Dimensionality reduction algorithms are one of the possible ways to reduce
the computation time required to build a model.

A

True

55
Q

In PCA, the largest Eigenvector gives the direction of the

(a) Maximum scatter of the data
(b) Minimum scatter of the data
(c) No such information can be interpreted
(d) Second largest Eigenvector which is in the same direction.

A

(a) Maximum scatter of the data

56
Q

Which of the following is/are true about PCA?

1) PCA is an unsupervised method
2) It searches for the directions that data have the largest variance
3) Maximum number of principal components >= number of features
4) All principal components are orthogonal to each other
(a) 1 and 2
(b) 1 and 3
(c) 1, 2 and 4
(d) All of the above

A

(c) 1, 2 and 4

57
Q

Which of the following statements is correct for t-SNE and PCA?

(a) t-SNE is linear whereas PCA is non-linear
(b) t-SNE and PCA both are linear
(c) t-SNE and PCA both are nonlinear
(d) t-SNE is nonlinear whereas PCA is linear

A

(d) t-SNE is nonlinear whereas PCA is linear

58
Q

Principal Component Analysis has a closed form solution.

(a) True
(b) False

A

(a) True

59
Q

Distance Measure

A
  • Defines how the similarity of two elements (x,y) is calculated.
  • Determines the shape of the clusters.
60
Q

What are the different distance measures?

A

Manhattan, correlation-based, Euclidean, Hamming, cosine

61
Q

Euclidean Distance

A

• The straight-line distance between two points; used in common clustering algorithms.

62
Q

Correlation-based Distance

A
  • Eisen Cosine Correlation Distance, Kendall Correlation Distance, Pearson’s correlation (sensitive to outliers), Spearman Correlation Distance (not sensitive to outliers)
  • used in gene expression data analysis
63
Q

Hamming Distance

A

• The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different (i.e., the minimum number of substitutions required to change one string into the other).
• Used in information retrieval.

64
Q

Cosine Distance (Cosine Similarity Measure)

A
• The cosine similarity is the cosine function of the angle between the two feature vectors in a multidimensional space (vector space model).
• Used in text and image processing applications.
65
Q

Manhattan Distance

A

distance between two points in a grid based on a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal distance.

66
Q

Pearson Correlation

A

measure of correlation (or linear dependence) between two variables. Values range from -1 to +1.
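
A minimal sketch of several of these measures, assuming SciPy's scipy.spatial.distance module and made-up vectors:

from scipy.spatial import distance

x, y = [1, 0, 2], [2, 1, 0]
print(distance.euclidean(x, y))                # sqrt((1-2)^2 + (0-1)^2 + (2-0)^2)
print(distance.cityblock(x, y))                # Manhattan: |1-2| + |0-1| + |2-0| = 4
print(distance.cosine(x, y))                   # 1 - cosine similarity
print(distance.hamming([1, 0, 1], [1, 1, 0]))  # fraction of differing positions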

67
Q

difference between similarity measure and distance metric?

A

a similarity measure doesn’t have to meet any constraints, while a distance metric must satisfy the triangle inequality, symmetry, and identity

68
Q

Hierarchical Clustering

A

builds a hierarchy of clusters by repeatedly taking the union of the two nearest clusters

- can be built either Divisive (top-down) or Agglomerative (bottom-up)

69
Q

single-linkage clustering defines distance as

A

minimum distance between elements of each cluster

70
Q

complete-linkage clustering defines distance as

A

maximum distance between elements of each cluster

71
Q

average-linkage clustering defines distance as

A

mean distance between all elements in each cluster

72
Q

centroid linkage clustering defines distance as

A

distance between centroids of each cluster
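
A minimal sketch of agglomerative clustering with the linkage criteria above, assuming SciPy's hierarchy module and random toy data:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.rand(10, 2)
Z = linkage(X, method="average")  # also: "single", "complete", "centroid"
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)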

73
Q

Biclustering (or co-clustering)

A

because feature selection doesn’t always identify the important features, this helps us find clusters within different subspaces
- we can group samples and features at the same time

74
Q

Principal Component Analysis

A
  • used for dimensionality reduction and feature selection
  • matrix factorization approach which preserves the direction with maximal variance
  • a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
  • sensitive to outliers and missing values
  • has to be used with standardized data
  • preserves global distances
75
Q

how do you find pca?

A

standardize the data (z-scores), compute the covariance matrix, then solve for its eigenvalues and eigenvectors; the eigenvectors with the largest eigenvalues are the principal components
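
A minimal sketch of those steps, assuming scikit-learn's StandardScaler and PCA on made-up data (PCA solves the eigenproblem internally):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)
X_std = StandardScaler().fit_transform(X)  # z-score standardization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)       # variance captured by each component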

76
Q

Covariance matrix

A

captures how the variables of the dataset vary together, revealing high correlations between them. Highly correlated variables contain redundant information.

77
Q

Pearson Correlation Coefficient

A
Used to reduce dimensionality in data.
• Preserves maximum information represented by the features while eliminating redundant information represented by the features.
78
Q

T-distributed Stochastic Neighbor Embedding (t-SNE)

A

non-linear dimension reduction technique for higher-dimensional data.
• t-SNE converts a multi-dimensional dataset into a lower-dimensional dataset.

79
Q

decision trees

A

used for classification and regression, considers variable dependency

80
Q

greedy search for decision trees

A

uses a measure of purity for each node; at each split we select the attribute that generates the purest child nodes

81
Q

entropy

A

measure of the purity of a node (measures uncertainty)
0 = pure node
1 = equally divided node
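
For a two-class node, assuming base-2 logarithms: H = -p·log2(p) - (1-p)·log2(1-p), so a pure node (p = 1) gives H = 0 and an equally divided node (p = 0.5) gives H = 1.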

82
Q

how do we minimize entropy?

A

by maximizing information gain

83
Q

information gain

A

measure of the decrease in uncertainty obtained by splitting a dataset based on some additional attribute

84
Q

what are the approaches to pruning?

A
  1. pre-pruning: happens while we are growing the tree; it doesn’t look at combinations of attributes
  2. post-pruning: after building the full tree, we prune back from the leaves
85
Q

difference between decision trees and naive bayes

A

decision trees model dependencies between the features, which we don’t do in naive bayes (it assumes the features are independent)

86
Q

what are the variations of decision trees that can help prevent overfitting?

A

bagging: 1 parameter (train each tree on a bootstrapped sample of n observations)

random forest: 2 parameters (we can also randomize the features considered at each split)

87
Q

unsupervised learning

A

technique to find the groupings in a set of unlabeled data

88
Q

k-means clustering

A

partition data into k clusters
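
A minimal sketch, assuming scikit-learn's KMeans and a tiny made-up dataset:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each sample
print(km.cluster_centers_)  # the two centroids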

89
Q

association rule discovery

A

discover rules that describe large portions of the data

90
Q

fuzzy c-means (overlapping clustering)

A

each sample can belong to more than one cluster (all weights add up to 1)

91
Q

what is the difference between bic and aic?

A

both penalize complex models; BIC applies a larger penalty (it grows with the number of samples), so it favors simpler models than AIC

92
Q

elbow technique

A

the optimal number of clusters is the point where the distortion/inertia stops decreasing sharply (the elbow of the curve)

93
Q

distortion

A

the average of the squared distances from the points to their assigned cluster centers
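
A minimal sketch of the elbow technique from the previous card, assuming scikit-learn's KMeans and its inertia_ attribute (the within-cluster sum of squared distances) as the distortion-like quantity to track:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 10)]
# Plot k against inertia; the "elbow" where the curve flattens suggests k.
print(inertias)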

94
Q

support vector machine

A

supervised learning

want to maximize margin between support vectors and decision boundary

95
Q

hinge loss

A

helps to maximize the margin by penalizing misclassified samples
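
In symbols, for a label y in {-1, +1} and model score f(x): hinge loss = max(0, 1 - y·f(x)); it is zero for points correctly classified outside the margin and grows linearly for margin violations and misclassifications.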