Machine Learning for Classification Flashcards
What is it called when a customer leaves a company for a competitor?
Churn. Identifying such customers is called churn prediction. If we know a customer is about to churn, we can send them discounts so that they do not leave for the competitor.
Binary Classification
Our target variable can either be 0 (false) or 1 (true).
g(X) ~ y
g(x) is the predicted probability, between 0 and 1, that the customer churns.
X is the feature matrix.
y is the binary target variable.
How can you look at all columns at the same time?
df.head().T
What can you do in data preparation step?
1) Standardize column names, e.g. lowercase all columns, replace spaces with underscores, etc.
2) Check the dtypes of the columns, e.g. numeric columns should be integers or floats. Is there a column that should be a number but is actually not? Maybe it is stored as a string.
3) Check for missing values and replace them.
4) Check the target variable. In machine learning we need numeric values, so we convert a string target, e.g. yes/no, to 1/0.
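The steps above can be sketched in pandas; the dataframe and column names here are illustrative, not the actual dataset:

```python
import pandas as pd

# Hypothetical dataframe standing in for the churn dataset
df = pd.DataFrame({
    "Customer ID": ["a-1", "b-2"],
    "Total Charges": ["29.85", " "],   # numeric column stored as strings
    "Churn": ["Yes", "No"],
})

# 1) Standardize column names: lowercase, spaces -> underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")

# 2) Fix a column that should be numeric but is a string;
#    non-parsable entries become NaN instead of raising an error
df.total_charges = pd.to_numeric(df.total_charges, errors="coerce")

# 3) Replace missing values
df.total_charges = df.total_charges.fillna(0)

# 4) Convert the string target (yes/no) to 1/0
df.churn = (df.churn == "Yes").astype(int)
```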
How can you see the docs of a function within a Jupyter notebook?
Append a question mark to the function name: function_name?
How to use scikit-learn to set up the validation framework?
from sklearn.model_selection import train_test_split
# This function is used to split a dataset into two parts
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=11)
Here, to get validation data that is 20% of the original dataset, we need to split off a bit more from df_train_full, which is why we choose 0.25: 20% / 80% = 1/4.
What steps need to be taken to set up validation framework?
1) Split the dataset into 60% training, 20% validation and 20% test sets.
2) Store the target outcomes in separate variables.
3) Delete the target variable from the training, validation and test feature dataframes.
4) Shuffle the dataset, which scikit-learn does automatically.
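These steps can be sketched end to end; the toy dataframe below stands in for the real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe standing in for the full dataset (columns are illustrative)
df = pd.DataFrame({"tenure": range(10), "churn": [0, 1] * 5})

# 1) 60/20/20 split: carve off 20% for test, then 25% of the rest for validation
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=11)

# 2) Store the target outcomes in separate variables
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

# 3) Delete the target column from the feature dataframes
del df_train["churn"]
del df_val["churn"]
del df_test["churn"]

# 4) Shuffling happens automatically inside train_test_split
```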
What steps need to be taken in exploratory data analysis?
1) Check for missing values.
2) Look at the distribution of the target variable, e.g. df_train_full.churn.value_counts()
3) Look at the numerical and categorical variables, e.g. the number of unique values per variable.
4) Look at the value counts or percentages of the target outcome.
5) Check feature importance w.r.t. the target variable. If a feature appears to affect the target variable, it may be an important feature.
How can you see percentages of the target variable?
e.g. you can use the value_counts function with the parameter normalize=True.
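A minimal sketch, using a made-up target column:

```python
import pandas as pd

# Hypothetical binary target column
churn = pd.Series([0, 0, 0, 1])

# Absolute counts per class
counts = churn.value_counts()

# Share of each class (the values sum to 1)
shares = churn.value_counts(normalize=True)
```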
How to measure feature importance?
1) Difference: take the difference between the global mean of the target and the group-specific mean. If (global_mean - group_mean) > 0, the group is less likely than average to have the target outcome; if it is < 0, the group is more likely. We are interested in groups with large (absolute) differences, since those features matter most.
2) Risk ratio: divide the group mean by the global mean. If it is greater than 1, the group is more likely to have the target outcome; if it is less than 1, the group is less likely to have the target outcome.
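Both measures can be computed with a groupby; the contract/churn data below is made up for illustration:

```python
import pandas as pd

# Toy data: churn per contract type (values are illustrative)
df = pd.DataFrame({
    "contract": ["monthly", "monthly", "yearly", "yearly"],
    "churn":    [1, 1, 0, 1],
})

global_mean = df.churn.mean()                  # overall churn rate: 0.75
grouped = df.groupby("contract").churn.mean()  # churn rate per group

# Difference: > 0 means the group churns less than average, < 0 means more
difference = global_mean - grouped

# Risk ratio: > 1 means the group is more likely to churn, < 1 less likely
risk_ratio = grouped / global_mean
```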
How can you display rich output (e.g. a dataframe) from within a for loop?
from IPython.display import display
Mutual Information
We can measure the importance of a feature using mutual information. It is a concept from information theory: it tells us how much we learn about one variable by knowing the value of another. It can be calculated with scikit-learn:
from sklearn.metrics import mutual_info_score
The higher the mutual information, the more we learn about the target variable by observing the value of the other variable.
Mutual information is a way to check the relative importance of categorical variables.
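A small sketch of mutual_info_score, with made-up categorical columns; here contract determines churn completely while gender is independent of it:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy categorical features and binary target (values are illustrative)
df = pd.DataFrame({
    "contract": ["monthly", "monthly", "yearly", "yearly"],
    "gender":   ["f", "m", "f", "m"],
    "churn":    [1, 1, 0, 0],
})

# Mutual information of each categorical column with the target
mi_contract = mutual_info_score(df.contract, df.churn)  # informative feature
mi_gender = mutual_info_score(df.gender, df.churn)      # uninformative feature
```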
How can you convert sort_values() to a dataframe?
Using to_frame().
Correlation
It’s a way to learn feature importance for numerical variables.
The correlation coefficient here is the Pearson correlation; it measures the (linear) dependency between two variables.
Its values lie in the range -1 <= r <= 1.
Negative correlation: an increase in one variable comes with a decrease in the other variable.
Positive correlation: an increase in one variable comes with an increase in the other variable.
Absolute values higher than about 0.5, and especially those closer to 1, indicate strong correlation.
Zero correlation means the variable has no (linear) effect on the target outcome.
e.g. x has values between 0 and 200, i.e. tenure;
y has values 0 or 1, i.e. churn.
Positive correlation: more tenure, higher churn.
Negative correlation: more tenure, less churn.
Zero correlation: tenure has no effect on churn.
Correlation can be calculated in pandas as df[cols].corrwith(df.target_col).to_frame('correlation')
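A runnable version of that one-liner, on made-up numeric columns where churn is concentrated in the low-tenure rows:

```python
import pandas as pd

# Toy numeric features and binary target (values are illustrative)
df = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5],
    "monthlycharges": [70, 80, 60, 90, 50],
    "churn": [1, 1, 0, 0, 0],
})

numerical = ["tenure", "monthlycharges"]

# Pearson correlation of each numeric column with the target,
# wrapped into a one-column dataframe for readability
corr = df[numerical].corrwith(df.churn).to_frame("correlation")
```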
One-hot encoding
How can we encode categorical features before giving it to the machine learning algorithm?
Converting a categorical variable into a set of binary (0/1) variables, one per category, is called one-hot encoding, and it can be done with scikit-learn.
from sklearn.feature_extraction import DictVectorizer
1) Convert the dataframe into a records-based list of dicts, e.g. df.to_dict(orient='records').
2) Call the fit method of DictVectorizer on the dicts.
3) Use the fitted vectorizer to transform the dicts into a feature matrix.
Note: DictVectorizer is smart enough to know which columns are numerical and does not convert them.