ML Flashcards

Question 1

Q

What is the difference between KNN and k-means clustering?

Answer

A

K-means is a clustering algorithm that tries to partition a set of points into K sets (clusters) such that the points in each cluster tend to be near each other. It is unsupervised because the points have no external classification.

K-nearest neighbours is a classification (or regression) algorithm that determines the label of a point by combining the labels of its K nearest points: a majority vote for classification, an average for regression. It is supervised because it labels a point based on the known labels of other points.
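
The contrast can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the toy points, and the naive "first k points" centroid initialisation are all assumptions made for the example. Note that `knn_predict` needs labels, while `kmeans` never sees any.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: majority vote among the labels of the k nearest training points."""
    nearest = sorted(range(len(train_points)),
                     key=lambda i: dist(train_points[i], query))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def kmeans(points, k, n_iter=10):
    """Unsupervised: alternate assignment and centroid update; no labels involved."""
    # Assumption for the sketch: initialise centroids from the first k points.
    centroids = [points[i] for i in range(k)]
    assignments = []
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [min(range(k), key=lambda c: dist(p, centroids[c]))
                       for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return assignments, centroids


# Hypothetical toy data: two well-separated groups in 2D.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
labels = ["a", "b", "a", "b", "a", "b"]

predicted = knn_predict(points, labels, query=(1.1, 1.0), k=3)
assignments, centroids = kmeans(points, k=2)
```

Here `predicted` comes from the known labels of the neighbours, whereas `assignments` are cluster indices that k-means discovers on its own and that carry no meaning beyond "same group".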

Question 2

Q

Define: 1) Dataset:
2) Data Wrangling:
3) Data Visualization:
4) Outliers:
5) Data Imputation:

Answer

A

1) A dataset is a particular instance of data that is used for analysis or model building at any given time.

2) Data Wrangling is the process of converting data from its raw form to a tidy form ready for analysis. It is an important step in data pre-processing and includes several processes like data importing, data cleaning, data structuring, string processing, HTML parsing, handling dates and times, handling missing data and text mining.

3) Data visualization is the main tool for analyzing and studying relationships between different variables. It is used in data preprocessing and analysis, feature selection, model building, model testing and model evaluation.

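4) An outlier is a data point that is very different from the rest of the dataset. A common way to detect one is a box plot: points that fall beyond the whiskers are flagged as outliers. A more advanced method for dealing with outliers is RANSAC, which fits a model to the inlier subset of the data.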

5) Data imputation estimates the missing values in a dataset using different interpolation techniques. One such technique is mean imputation, wherein a missing value is replaced with the mean of the entire feature column.
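
Definitions 4 and 5 can be made concrete with a short standard-library sketch. The function names and sample values are hypothetical, and the 1.5 × IQR whisker length is the usual box-plot convention rather than something stated on the card.

```python
from statistics import mean, quantiles


def iqr_outliers(values, whisker=1.5):
    """Flag values outside the box-plot whiskers [Q1 - w*IQR, Q3 + w*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


def mean_impute(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


# Hypothetical feature column with one extreme value.
readings = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(readings)

# Hypothetical feature column with one missing value.
imputed = mean_impute([4.0, None, 6.0, 8.0])
```

Mean imputation is sensitive to outliers, so in practice it may be worth checking a column for extreme values before filling it.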

Question 3

Q

VGG 16 Implementation:

Answer

A

Unique factor: Instead of having a large number of hyperparameters, VGG 16 was designed with convolution layers of 3×3 filters with stride 1 and always the same padding, and max pool layers of 2×2 with stride 2.
Has 2 FC layers of 4096 units followed by a 1000-way FC layer with softmax for output.
The 16 in VGG16 refers to its 16 layers that have weights (13 convolutional and 3 fully connected).
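
The layer count can be checked with a small shape-tracing sketch. It assumes the standard VGG16 layout (channel widths 64 to 512 in five blocks) and a 224×224×3 input, neither of which is spelled out on the card; it only traces shapes and counts layers, it does not build a trainable network.

```python
# Assumption: standard VGG16 layout on a 224x224x3 input.
# Integers are the output channels of a 3x3 conv (stride 1, same padding);
# "M" is a 2x2 max pool with stride 2.
VGG16_CONV_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                  512, 512, 512, "M", 512, 512, 512, "M"]
FC_UNITS = [4096, 4096, 1000]  # the last FC layer feeds the softmax


def trace_vgg16(height=224, width=224, channels=3):
    """Return (final feature-map shape, number of layers that have weights)."""
    weighted_layers = 0
    for item in VGG16_CONV_CFG:
        if item == "M":
            # Pooling has no weights; it halves height and width.
            height, width = height // 2, width // 2
        else:
            # Same padding with stride 1 keeps height and width unchanged.
            channels = item
            weighted_layers += 1
    weighted_layers += len(FC_UNITS)
    return (height, width, channels), weighted_layers


shape, n_weighted = trace_vgg16()
```

Under these assumptions the five pooling layers reduce 224 to 7, so the first FC layer sees a 7×7×512 feature map, and the 13 conv layers plus 3 FC layers give the 16 in the name.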

Question 4

Q

What is the difference between covariance and correlation?

Answer

A

Cov(X,Y)=E[(X−E[X])(Y−E[Y])]=E[XY]−E[X]E[Y]
Covariance indicates the direction of the linear relationship between X and Y and can take any value from negative infinity to infinity. Its units are the product of the units of X and Y, which may differ.

ρ(X,Y)= Cov(X,Y)/ sqrt(Var(X)Var(Y))

Correlation is the normalized version of covariance. Since it results from scaling covariance, it is dimensionless (unlike covariance) and is always between -1 and 1.
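
A quick numeric check of the two formulas, as a sketch using population (divide-by-n) statistics. The height and weight values are made up for illustration. Changing the units of X rescales the covariance but leaves the correlation unchanged.

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance: E[(X - E[X])(Y - E[Y])]."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance scaled by sqrt(Var(X) Var(Y)); Var(X) is Cov(X, X)."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: the same heights expressed in metres and in centimetres.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weights_kg = [50.0, 58.0, 63.0, 72.0, 80.0]
heights_cm = [h * 100 for h in heights_m]

cov_m = covariance(heights_m, weights_kg)    # units: m * kg
cov_cm = covariance(heights_cm, weights_kg)  # units: cm * kg, 100 times larger
rho_m = correlation(heights_m, weights_kg)   # dimensionless
rho_cm = correlation(heights_cm, weights_kg)
```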