Machine Learning for Classification Flashcards
What is it called when a customer leaves a company for a competitor?
Churn. Identifying such customers is called churn prediction. If we know a customer is about to churn, we can send them discounts so that they do not leave for the competitor.
Binary Classification
Our target variable can either be 0 (false) or 1 (true).
g(X) ~ y
g(x) is the predicted probability, between 0 and 1, that the customer churns.
X is the feature matrix.
y is the binary target variable.
How can you look at all columns at the same time?
df.head().T
What can you do in data preparation step?
1) Standardize column names, e.g. lowercase all columns, replace spaces with underscores, etc.
2) Check the dtypes of the columns, e.g. numeric columns should be integers or floats. Is there a column that should be a number but is actually not? Maybe it is stored as a string.
3) Check for missing values and replace them.
4) Check the target variable. In machine learning we need numeric values, so we convert a string target, e.g. yes/no, to 1/0.
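The steps above can be sketched in pandas; the dataframe and column names here are illustrative, not the actual dataset:

```python
import pandas as pd

# Hypothetical dataframe standing in for the churn dataset
df = pd.DataFrame({
    "Customer ID": ["a-1", "b-2"],
    "Total Charges": ["29.85", " "],   # numeric column stored as strings
    "Churn": ["Yes", "No"],
})

# 1) Standardize column names: lowercase, spaces -> underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")

# 2) Fix a column that should be numeric but is a string;
#    non-parsable entries become NaN instead of raising an error
df.total_charges = pd.to_numeric(df.total_charges, errors="coerce")

# 3) Replace missing values
df.total_charges = df.total_charges.fillna(0)

# 4) Convert the string target (yes/no) to 1/0
df.churn = (df.churn == "Yes").astype(int)
```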
How can you see the docs of a function within a Jupyter notebook?
Append a question mark to the function name: function_name?
How to use scikit-learn to set up the validation framework?
from sklearn.model_selection import train_test_split
# This function is used to split a dataset into two parts
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=11)
Here, to get validation data that is 20% of the original dataset, we need to split off a bit more from df_train_full, which is why we choose 0.25: 20% / 80% = 1/4.
What steps need to be taken to set up validation framework?
1) Split the dataset into 60% training, 20% validation and 20% test sets.
2) Store the target outcomes in separate variables.
3) Delete the target variable from the training, validation and test feature dataframes.
4) Shuffle the dataset, which scikit-learn does automatically.
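These steps can be sketched end to end; the toy dataframe below stands in for the real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe standing in for the full dataset (columns are illustrative)
df = pd.DataFrame({"tenure": range(10), "churn": [0, 1] * 5})

# 1) 60/20/20 split: carve off 20% for test, then 25% of the rest for validation
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=11)

# 2) Store the target outcomes in separate variables
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

# 3) Delete the target column from the feature dataframes
del df_train["churn"]
del df_val["churn"]
del df_test["churn"]

# 4) Shuffling happens automatically inside train_test_split
```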
What steps need to be taken in exploratory data analysis?
1) Check for missing values.
2) Look at the distribution of the target variable, e.g. df_train_full.churn.value_counts()
3) Look at the numerical and categorical variables, e.g. the number of unique values per variable.
4) Look at the value counts or percentages of the target outcome.
5) Check feature importance w.r.t. the target variable. If a feature appears to affect the target variable, it may be an important feature.
How can you see percentages of the target variable?
e.g. you can use the value_counts function with the parameter normalize=True.
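A minimal sketch, using a made-up target column:

```python
import pandas as pd

# Hypothetical binary target column
churn = pd.Series([0, 0, 0, 1])

# Absolute counts per class
counts = churn.value_counts()

# Share of each class (the values sum to 1)
shares = churn.value_counts(normalize=True)
```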
How to measure feature importance?
1) Difference: take the difference between the global mean of the target and the group-specific mean. If (global_mean - group_mean) > 0, the group is less likely than average to have the target outcome; if it is < 0, the group is more likely. We are interested in groups with large (absolute) differences, since those features matter most.
2) Risk ratio: divide the group mean by the global mean. If it is greater than 1, the group is more likely to have the target outcome; if it is less than 1, the group is less likely to have the target outcome.
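Both measures can be computed with a groupby; the contract/churn data below is made up for illustration:

```python
import pandas as pd

# Toy data: churn per contract type (values are illustrative)
df = pd.DataFrame({
    "contract": ["monthly", "monthly", "yearly", "yearly"],
    "churn":    [1, 1, 0, 1],
})

global_mean = df.churn.mean()                  # overall churn rate: 0.75
grouped = df.groupby("contract").churn.mean()  # churn rate per group

# Difference: > 0 means the group churns less than average, < 0 means more
difference = global_mean - grouped

# Risk ratio: > 1 means the group is more likely to churn, < 1 less likely
risk_ratio = grouped / global_mean
```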
How can you display rich output (e.g. a dataframe) from within a for loop?
from IPython.display import display
Mutual Information
We can measure the importance of a feature using mutual information. It is a concept from information theory: it tells us how much we learn about one variable by knowing the value of another. It can be calculated with scikit-learn:
from sklearn.metrics import mutual_info_score
The higher the mutual information, the more we learn about the target variable by observing the value of the other variable.
Mutual information is a way to check the relative importance of categorical variables.
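A small sketch of mutual_info_score, with made-up categorical columns; here contract determines churn completely while gender is independent of it:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy categorical features and binary target (values are illustrative)
df = pd.DataFrame({
    "contract": ["monthly", "monthly", "yearly", "yearly"],
    "gender":   ["f", "m", "f", "m"],
    "churn":    [1, 1, 0, 0],
})

# Mutual information of each categorical column with the target
mi_contract = mutual_info_score(df.contract, df.churn)  # informative feature
mi_gender = mutual_info_score(df.gender, df.churn)      # uninformative feature
```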
How can you convert sort_values() to a dataframe?
Using to_frame().
Correlation
It’s a way to learn feature importance for numerical variables.
The correlation coefficient here is the Pearson correlation; it measures the (linear) dependency between two variables.
Its values lie in the range -1 <= r <= 1.
Negative correlation: an increase in one variable comes with a decrease in the other variable.
Positive correlation: an increase in one variable comes with an increase in the other variable.
Absolute values higher than about 0.5, and especially those closer to 1, indicate strong correlation.
Zero correlation means the variable has no (linear) effect on the target outcome.
e.g. x has values between 0 and 200, i.e. tenure;
y has values 0 or 1, i.e. churn.
Positive correlation: more tenure, higher churn.
Negative correlation: more tenure, less churn.
Zero correlation: tenure has no effect on churn.
Correlation can be calculated in pandas as df[cols].corrwith(df.target_col).to_frame('correlation')
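A runnable version of that one-liner, on made-up numeric columns where churn is concentrated in the low-tenure rows:

```python
import pandas as pd

# Toy numeric features and binary target (values are illustrative)
df = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5],
    "monthlycharges": [70, 80, 60, 90, 50],
    "churn": [1, 1, 0, 0, 0],
})

numerical = ["tenure", "monthlycharges"]

# Pearson correlation of each numeric column with the target,
# wrapped into a one-column dataframe for readability
corr = df[numerical].corrwith(df.churn).to_frame("correlation")
```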
One-hot encoding
How can we encode categorical features before giving it to the machine learning algorithm?
Converting a categorical variable into a set of binary (0/1) variables, one per category, is called one-hot encoding, and it can be done with scikit-learn.
from sklearn.feature_extraction import DictVectorizer
1) Convert the dataframe into a records-based list of dicts, e.g. df.to_dict(orient='records').
2) Call the fit method of DictVectorizer on the dicts.
3) Use the fitted vectorizer to transform the dicts into a feature matrix.
Note: DictVectorizer is smart enough to know which columns are numerical and does not convert them.