Machine Learning revision Flashcards

1
Q

What are the benefits of preprocessing data / data wrangling?

A
  • Transform raw data into a form the machine can understand and interpret easily.
  • Remove redundant information (noise).
  • Spot outliers and deal with them so that training is effective and not skewed.
  • Real-world data is rarely perfect, so it needs cleaning before modelling.
2
Q

What is a feature?

A

A feature is an individual measurable property or characteristic of a phenomenon being observed. Features can be categorical or numerical.

3
Q

Name some common steps for data preprocessing in DS/ML?

A
  • Taking care of the missing data
  • Encoding Categorical Data
  • Feature Scaling
4
Q

Which library to use for taking care of missing data? Recall the entire code.

A

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer.fit(X[:, a:b])                     # a:b = the numeric columns containing missing values
X[:, a:b] = imputer.transform(X[:, a:b])

5
Q

What are some strategies to take care of missing data apart from averaging?

A
  1. Mean, median, or mode imputation
  2. Deleting missing data (drop the full row, or drop the whole variable (not recommended)).
  3. Time-series-specific methods (Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB), linear interpolation (not good for data with seasonal oscillations), seasonal adjustment + linear interpolation).
  4. Regression imputation
  5. Multiple imputation (see the next card).
  6. k-NN imputation for categorical variables (see the sketch below).
  7. Imputation using deep learning - Datawig
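A minimal sketch of k-NN-based imputation with scikit-learn's KNNImputer (numeric columns only; the variable name X_num and n_neighbors = 5 are illustrative assumptions). For a categorical column, the analogous approach is to train a KNeighborsClassifier on the rows where the value is present and predict it for the rows where it is missing.

from sklearn.impute import KNNImputer

# X_num: numeric feature matrix with NaN marking the missing entries (illustrative name)
knn_imputer = KNNImputer(n_neighbors = 5)   # each missing value = mean of that feature in the 5 nearest rows
X_num = knn_imputer.fit_transform(X_num)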
6
Q

What is multiple imputation method for taking care of missing data?

A

Multiple Imputation:

  • Imputation: impute the missing entries of the incomplete data set m times (e.g. m = 3). The imputed values are drawn from a distribution; simple random draws do not capture the uncertainty in the model parameters, so a better approach is Markov Chain Monte Carlo (MCMC) simulation. This step results in m complete data sets.
  • Analysis: analyse each of the m completed data sets.
  • Pooling: integrate the m analysis results into a final result (see the sketch below).
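A minimal sketch of the imputation step with scikit-learn's experimental IterativeImputer, using sample_posterior = True so that each of the m imputations is a draw from a distribution (m = 3, the variable names, and the pooling comment are illustrative assumptions).

from sklearn.experimental import enable_iterative_imputer  # enables the experimental API
from sklearn.impute import IterativeImputer

m = 3
completed_sets = []
for seed in range(m):
    imp = IterativeImputer(sample_posterior = True, random_state = seed)
    completed_sets.append(imp.fit_transform(X))   # one complete data set per draw
# Analysis: fit the model on each completed set; Pooling: combine (e.g. average) the m results.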
7
Q

Types of missing data?

A

Three types: MCAR, MAR, NMAR.

Missing completely at random (MCAR): the probability that a value is missing has nothing to do with its own (hypothetical) value or with the values of the other variables.

Missing at random (MAR): the propensity for a data point to be missing is not related to the missing value itself, but it is related to some of the observed data.

Not missing at random (NMAR): the missingness depends on the unobserved value itself, e.g. people with high salaries often don't want to reveal them.

Difference between MCAR and MAR:

For example, if high school GPA data is missing randomly across all schools in a district, that data is MCAR. However, if data is missing only for students in specific schools of the district, then the data is MAR.

8
Q

How to encode categorical data? Which library to use? Provide code:-

A

OneHotEncoding - transforms n categories into an n-dimensional (one-hot) vector.

Encoding independent variables:-

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers = [('encoder', OneHotEncoder(), [0])],   # replace [0] with the categorical column indices
    remainder = 'passthrough')                             # keep the remaining columns unchanged
X = np.array(ct.fit_transform(X))

Encoding a yes-no (binary) dependent variable:-

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)

9
Q

What are the two types of feature scaling? Give their formulas. Code one of them.

A

Standardisation and normalisation.
Standardisation: X = (X - mean(X)) / s.d.(X)
Normalisation: X = (X - min(X)) / (max(X) - min(X))

Standardisation:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)

10
Q

Benefits of splitting the dataset into train and test sets

A

To check that the model hasn’t overfit: it is evaluated on data (the test set) that it has never seen during training.

11
Q

Code to split the dataset into train and test sets

A

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)   # test_size: any fraction between 0 and 1; random_state is optional

12
Q

What are some types of Regression?

A
  1. Simple Linear
  2. Multiple Linear
  3. Polynomial Linear
  4. Support Vector
  5. Decision Tree
  6. Random Forest
13
Q

Train Linear regression

A

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

NOTE: the sklearn class is LinearRegression (the same one used for multiple linear regression); there is no LinearRegressor class.

14
Q

What are the five methods of building models?

A
  1. All in
  2. Backward Elimination
  3. Forward Selection
  4. Bidirectional Elimination
  5. Score Comparison
15
Q

Steps for Backward Elimination

A
  1. Select a significance level to stay in the model (default: 5%).
  2. Fit the full model with all possible predictors.
  3. Consider the predictor with the highest p-value. If p > significance level, go to step 4; otherwise finish.
  4. Remove that predictor.
  5. Fit the model without this variable and go back to step 3 (a minimal sketch of this loop follows below).
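A minimal sketch of this loop with statsmodels OLS p-values (the names X and y, the added intercept column, and the 0.05 threshold are illustrative assumptions):

import numpy as np
import statsmodels.api as sm

SL = 0.05                                # significance level to stay in the model
X_opt = sm.add_constant(X)               # add the intercept column to the predictor matrix
while True:
    pvalues = np.asarray(sm.OLS(y, X_opt).fit().pvalues)
    if pvalues.max() <= SL:              # every remaining predictor is significant: finish
        break
    X_opt = np.delete(X_opt, pvalues.argmax(), axis = 1)   # remove the worst predictor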
16
Q

Train Multiple linear Regression

A

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

NOTE: same class as simple linear regression - sklearn's LinearRegression handles multiple predictors automatically (there is no LinearRegressor class).

17
Q

Train Polynomial Linear Regression

A

Idea: treat polynomial regression as multiple linear regression on powered features (x2 = x^2, x3 = x^3, …), so we first build a matrix of those powered features.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)

18
Q

reshape a 1D array to a 2D array

A

Let y be a 1D array.

y = y.reshape(len(y), 1)

19
Q

Train the Support Vector Regression Model

A

Note: standard scaling of the data is needed for SVR.

from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')   # select your own kernel
regressor.fit(X, y)

20
Q

train the Decision Tree Regression model

A

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor.fit(X, y)

21
Q

Which regression model(s) requires the data to be standardised?

A

SVR

22
Q

train the Random Forest Regression model

A

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10)
regressor.fit(X, y)

23
Q

Evaluating the Regression models

A
R squared (R^2)
A better one is Adjusted R^2 (formula below)
Code:-
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
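For reference, the adjusted form (n = number of samples, k = number of predictors):

Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)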
24
Q

Train Logistic Regression

A

Data should be standardised.

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

25
Q

Why do we standardise data?

A

We standardise data to ensure that no single variable has a biased influence on the result from the start, just because its values are on a larger scale than all the other variables.

26
Q

What can we use to evaluate the results of a classification model? Write code as well.

A

Confusion matrix, accuracy score, and the CAP curve.

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

27
Q

Code to visualize Classification models

A

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# sc: the fitted StandardScaler used on the training features
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(
    np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
    np.arange(start = X_set[:, 1].min() - 10, stop = X_set[:, 1].max() + 10, step = 0.25))
plt.contourf(X1, X2,
             classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75,
             cmap = ListedColormap(['red', 'green']))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(['red', 'green'])(i), label = j)
plt.title('…')
plt.xlabel('…')
plt.ylabel('…')
plt.legend()
plt.show()

28
Q

Train the K-NN Classification model

A

Data should be standardised

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5,
                                  metric = 'minkowski',
                                  p = 2)   # p = 2 => Euclidean distance
classifier.fit(X_train, y_train)

29
Q

Train the Support Vector Machine (SVM) Classification model

A

Data should be standardised

from sklearn.svm import SVC
classifier = SVC(kernel = 'linear')   # 'poly'/'rbf'/'sigmoid' etc.
classifier.fit(X_train, y_train)

30
Q

code for the Naive-Bayes Classification model

A

Data should be standardised

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

31
Q

Code for the Decision Tree Classification model

A

Data should be standardised

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy')
classifier.fit(X_train, y_train)

32
Q

Code for the Random Forest Classification model

A

Data should be standardised

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10,
                                    criterion = 'entropy')
classifier.fit(X_train, y_train)
33
Q

List some types of Classification Models

A
  • Logistic Regression model
  • K-NN
  • Support Vector Machine Classification
  • Naive-Bayes
  • Decision Tree
  • Random Forest
34
Q

Types of Clustering

A
  • K-Means Clustering (K-Means++ avoids random initialisation trap).
  • Hierarchical Clustering
    a. Agglomerative (bottom-up approach)
    b. Divisive (top-down approach)
35
Q

K-Means++ Clustering Logic

A
  1. Choose the number K of clusters.
  2. Select K initial centroids (not necessarily points from your dataset); K-Means++ spreads these initial centroids apart instead of picking them purely at random, avoiding the random initialisation trap.
  3. Assign each data point to the closest centroid -> forming K clusters.
  4. Compute and place the new centroid of each cluster.
  5. Reassign each data point to the new closest centroid. If any reassignment took place, go back to step 4; otherwise FINISH.
36
Q

Is clustering a supervised or an unsupervised model?

A

Clustering is an unsupervised model!

37
Q

Train K-Means++ Model

A

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 5, init = 'k-means++')
y_kmeans = kmeans.fit_predict(X)

38
Q

Identify the right number of clusters using WCSS (elbow method)

A
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):  #11 not fixed, depends on the data.
     kmeans = KMeans(n_clusters = i, init = 'k-means++')
     kmeans.fit(X)
     wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title(...)
plt.xlabel(...)
plt.ylabel(...)
plt.show()
39
Q

Logic for Agglomerative (bottom-up) Clustering

A
  1. Make each data point a single-point cluster, forming N clusters.
  2. Take the two closest data points and make them one cluster => N-1 clusters.
  3. Take the two closest clusters and make them one cluster => N-2 clusters.
  4. Repeat step 3 till only one cluster is left. FINISH.
  5. Use a dendrogram to determine the number of clusters.
40
Q

Train Hierarchical Agglomerative Clustering

A

'ward' linkage is the best choice to minimise within-cluster variance.

from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 5,
                             affinity = 'euclidean',
                             linkage = 'ward')
y_hc = hc.fit_predict(X)

41
Q

Code for Dendrograms

A
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X,method='ward'))
plt.title(...)
plt.xlabel(...)
plt.ylabel(...)
plt.show()
42
Q

Ways to evaluate clustering models

A

sklearn clustering performance evaluation:-

  1. Adjusted Rand Index (ARI; requires knowledge of the ground-truth classes)
  2. Silhouette Coefficient (when the ground truth is unknown)

There are several metrics for both cases; check sklearn's clustering-evaluation documentation before choosing one (a minimal sketch of these two follows below).
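A minimal sketch of both metrics with sklearn (variable names are illustrative; ground-truth labels y_true are only available in the ARI case):

from sklearn.metrics import adjusted_rand_score, silhouette_score

ari = adjusted_rand_score(y_true, y_kmeans)   # compares predicted clusters with the known classes
sil = silhouette_score(X, y_kmeans)           # needs only the data and the predicted clusters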

43
Q

Types of Association Rule Learning

A

People who did A, also did B.

  1. APRIORI
  2. ECLAT
44
Q

APRIORI logic

A
  1. Set a minimum support and confidence
  2. Take all the subsets in transactions having higher support than minimum support.
  3. Take all the rules of these subsets having higher confidence than minimum confidence
  4. Sort the rules by decreasing lift.
#note
Support(A) = (transactions containing A) / (total number of transactions)

Confidence(A -> B) = (transactions containing both A and B) / (transactions containing A)

Lift(A -> B) = Confidence(A -> B) / Support(B)

45
Q

write the formulas for support, confidence, and lift in association learning.

A

Support(A) = (transactions containing A) / (total number of transactions)

Confidence(A -> B) = (transactions containing both A and B) / (transactions containing A)

Lift(A -> B) = Confidence(A -> B) / Support(B)
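A worked toy example (made-up numbers): out of 100 transactions, 10 contain bread, 20 contain butter, and 8 contain both.

Support(bread) = 10/100 = 0.10
Confidence(bread -> butter) = 8/10 = 0.80
Lift(bread -> butter) = 0.80 / (20/100) = 4.0 (well above 1, so the rule is informative)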

46
Q

train APRIORI

A

transactions = []
for i in range(0, len(dataset)):                     # rows = individual transactions
    transactions.append([str(dataset.values[i, j])
                         for j in range(0, len(dataset.columns))])   # columns = items

from apyori import apriori
rules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)

47
Q

ECLAT

A

ECLAT uses support only.

It can be implemented with the apyori library as well, by passing only min_support as an argument.

48
Q

Ways to implement Reinforcement Learning

A
  1. Upper Confidence Bound
  2. Thompson Sampling

Code implemented on my own - see the downloaded documents.

49
Q

Natural Language Processing Steps

A
  1. Clean the text
    i. keep only the relevant words
    ii. get rid of stop words like 'the', 'a', 'on', etc.
    iii. numbers can be removed unless they have a significant impact.
    iv. Stemming (transform words to their roots, e.g. singing -> sing).
  2. Bag-of-words model
  3. Use classification (common models: Naive-Bayes and Random Forest). Standard scaling isn't needed as the features are mostly 0s and 1s. (A minimal sketch follows below.)
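A minimal sketch of steps 2-3, assuming a list of already cleaned and stemmed review strings called corpus and a label vector y (the names and the max_features value are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

cv = CountVectorizer(max_features = 1500)     # bag of words: one count column per kept word
X = cv.fit_transform(corpus).toarray()        # GaussianNB needs a dense array
classifier = GaussianNB()
classifier.fit(X, y)                          # no feature scaling needed: entries are small counts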
50
Q

Types and subtypes of Dimensionality Reduction

A
  1. Feature selection:-
    • backward elimination
    • forward selection
    • bidirectional elimination
    • score comparison etc.
  2. Feature Extraction
    • Principal Component Analysis (PCA).
    • Linear Discriminant Analysis (LDA).
    • Kernel PCA.
    • Quadratic Discriminant Analysis (QDA).
51
Q

Feature Extraction: PCA

A
  • PCA is an unsupervised model.
  • It is used for noise filtering, visualisation, feature extraction, stock-market prediction, gene data analysis.
  • PCA identifies patterns in data and detects the correlation between the variables.
  • GOAL: reduce the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k < d).
52
Q

Feature Extraction: PCA Steps

A
  1. Standardise the data.
  2. Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition.
  3. Sort the eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues, where k is the number of dimensions of the new feature subspace (k <= d).
  4. Construct the projection matrix W from the selected k eigenvectors.
  5. Transform the original dataset X via W to obtain the k-dimensional feature subspace Y (a minimal numpy sketch follows below).
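A minimal numpy sketch of these steps (assumes X is an already standardised n x d array; k = 2 and the variable names are illustrative):

import numpy as np

cov = np.cov(X, rowvar = False)           # d x d covariance matrix of the standardised data
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh suits symmetric matrices
order = np.argsort(eigvals)[::-1]         # eigenvalue indices, largest first
k = 2
W = eigvecs[:, order[:k]]                 # projection matrix from the top-k eigenvectors
Y = X @ W                                 # n x k representation in the new subspace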
53
Q

Code for PCA

A

Apply PCA generally after feature scaling and before training the model

Code:-
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

…train the model…

54
Q

Feature Extraction: LDA

A
  • commonly used
  • supervised
  • Goal of LDA is to project a feature space onto a small subspace k while maintaining the class-discriminatory information.
55
Q

PCA vs LDA

A
  • Both PCA and LDA are linear transformation techniques used for dimensionality reduction.
  • LDA is supervised because it uses the relation to the dependent variable (the class labels); PCA is unsupervised.
56
Q

Feature Extraction: LDA Steps

A
  1. Compute the d-dimensional mean vectors for the different classes from the dataset.
  2. Compute the scatter matrices (between-class and within-class scatter matrices).
  3. Compute the eigenvectors (e1, e2, …, ed) and corresponding eigenvalues (λ1, λ2, …, λd) for the scatter matrices.
  4. Sort the eigenvectors by decreasing eigenvalues and choose the k eigenvectors with the largest eigenvalues to form a d x k-dimensional matrix W (where every column represents an eigenvector).
  5. Use this d x k eigenvector matrix to transform the samples onto the new subspace: Y = X x W (where X is the n x d matrix of the n samples, and Y is the transformed n x k matrix of the samples in the new subspace).
57
Q

Code for LDA

A

After feature scaling:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

…train the model…

58
Q

Code for Kernel PCA

A

Apply Kernel PCA after feature scaling and before training the model.

Code:-
from sklearn.decomposition import KernelPCA
pca = KernelPCA(n_components = 2, kernel = 'rbf')
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

…train the model…

59
Q

k-Fold Cross Validation

A
  • To avoid the case where one particular train/test split just got lucky.
  • Splits the training set into k smaller folds and measures the accuracy on each of them.

- After training the model

Code:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(
    estimator = NAME_OF_MODEL,
    X = X_train,
    y = y_train,
    cv = 10)   # 10 folds is most common
print("Accuracy: {:.2f} %".format(accuracies.mean() * 100))
print("Std Dev: {:.2f} %".format(accuracies.std() * 100))

60
Q

Grid Search (for hyper-parameters)

A

After training the model and applying k-Fold Cross Validation:-

For Kernel SVM:-
from sklearn.model_selection import GridSearchCV
# parameters = the hyper-parameters we want to tune
parameters = [{'C': [0.25, 0.5, 0.75, 1],
               'kernel': ['linear']},
              {'C': [0.25, 0.5, 0.75, 1],
               'kernel': ['rbf'],
               'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}]

grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)

grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy * 100))
print("Best Parameters:", best_parameters)