A-kassen Flashcards
What is a label?
What we want to predict.
What is the label in the paper?
Churn/loyal members.
Why did you use Jupyter?
Jupyter is a notebook environment that supports different programming languages, Python being one of them. It is a powerful tool that enabled us to do the data mining by specifying and programming our own process steps.
What was the main purpose of the data preparation?
To take a critical look at the dataset and make it as clean and small as possible, because this gives more predictive power.
What is the first step in the data preparation?
Exploration phase
What does the exploration phase entail?
A three-step process: missing data, data types, and outliers.
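The three exploration steps above can be sketched in a notebook with pandas. This is a minimal illustration on an invented toy dataset; the column names (`age`, `tenure_months`, `churn`) are hypothetical, not the paper's actual attributes.

```python
import pandas as pd

# Hypothetical member data; columns are illustrative only.
df = pd.DataFrame({
    "age": [27, 34, None, 45, 29],
    "tenure_months": [12, 60, 3, 120, 8],
    "churn": ["yes", "no", "yes", "no", "yes"],
})

# 1) Missing data: count nulls per column
missing = df.isna().sum()

# 2) Data types: inspect how each column was parsed
dtypes = df.dtypes

# 3) Outliers: flag values more than 1.5*IQR outside the quartiles
q1, q3 = df["tenure_months"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["tenure_months"] < q1 - 1.5 * iqr) |
              (df["tenure_months"] > q3 + 1.5 * iqr)]

print(missing["age"])   # one missing age value
print(len(outliers))
```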
Why was it important to make predictions about customers' intentions or tendencies to churn?
- It enables efficiency in planning how to retain customers, for example through targeted promotional activities
- Comparing predictions with actual results allows us to spot meaningful indicators, useful for improving performance
How are predictive modelling and data mining useful for companies?
They are instruments for improving the decision-making process within companies.
The more accurate and timely the knowledge is, the more likely the company is to improve its business performance and industry position.
Why do companies use machine learning algorithms and big data?
To reduce uncertainty about, for example, unforeseen market changes or customer behaviour.
What did you do in your paper?
We analysed customer data from Akademikernes A-kasse in an effort to create a predictive model of which attributes are relevant when customers choose to churn. However, the model we created does not predict how likely a customer is to churn; it gives a binary churn/no churn result.
What do you mean by a binary churn/no churn result?
It means that we assigned each member of a given set to one of two classes, churn or no churn, on the basis of a classification rule. The rule was built on the attribute selection in the predictive model.
What could be the next step for Akademikernes A-kasse?
To use regression to calculate the likelihood of churn within the predicted group of churners we have identified. A heat map, for example, could visualise which customers and which attributes were most associated with future churn.
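The suggested next step could be sketched with logistic regression, which outputs a churn probability per member instead of a binary label. This is a toy illustration on synthetic data, not the paper's model; the three features are invented.

```python
# Sketch of ranking members by churn probability with logistic regression.
from sklearn.linear_model import LogisticRegression
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                          # three hypothetical member attributes
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # 1 = churn, toy rule

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]   # P(churn) per member, not just a 0/1 label

# Members sorted from most to least likely to churn
ranking = np.argsort(proba)[::-1]
```

A retention team could then focus campaigns on the top of this ranking rather than treating all predicted churners equally.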
Why did you choose to use the CRISP model in your paper?
We decided to use it as a guiding principle for the paper, as we saw parallels between what the model presents and what we wanted to do in our paper. However, we were aware that the model is an exploratory tool that emphasises approaches and strategy rather than software design.
What are the six stages in the data mining process that the CRISP model defines?
Business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
What does the business understanding stage concern?
It concerns the practical business goals that the organisation wants to achieve. This goal was converted into a problem that the data mining seeks to solve: AK seeks to uncover the reasoning behind the churn of their members and, in turn, ultimately reduce the number of churners.
Could you mention some critical perceptions of your paper?
To begin with, we thought we would use decision trees, as they are high-quality models that generate simple rules and make it easy to understand the impact of each attribute.
Decision trees become very complex and large when there are many attributes, so we were strict in the attribute selection because we wanted few attributes. Without this initial plan, the number of attributes might have been higher.
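The trade-off above, few attributes and a shallow tree in exchange for readable rules, can be illustrated with scikit-learn. The data and attribute names (`attr_a`, `attr_b`) are made up; the paper's actual attributes are not shown here.

```python
# Minimal sketch: a small, depth-limited decision tree yields simple,
# human-readable rules, which motivated the harsh attribute selection.
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))            # two selected attributes
y = (X[:, 0] > 0.5).astype(int)          # 1 = churn, toy rule

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=["attr_a", "attr_b"])
print(rules)                             # a handful of if/else rules
```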
How can AK use your results?
Knowledge of the customers. We identified attributes that are linked to previous churners, and these tendencies can be used to spot current members with the same attributes.
What is data mining?
A process used for discovering meaningful relationships, trends, and patterns in large amounts of data collected in a dataset.
What is clustering?
Clustering (also called unsupervised learning) is the process of dividing a dataset into a number K of groups such that the members of each group are as similar (close) to one another as possible, and different groups are as dissimilar (far) from one another as possible. For example, it can gather people with common attributes into K clusters.
How can clustering create value for AK?
The goal is to group together similar instances using some metric of similarity, i.e. to create groupings where the members of a given group are similar to each other. AK could, for example, group similar customers together and design different campaigns for each group.
It is akin to classification, but the groupings are not predefined.
It could offer a way to group similar customers together, which may or may not relate to the churn question.
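A minimal sketch of the idea with K-means, assuming two invented member features (think tenure vs. engagement, suitably scaled); this is illustrative, not something AK has actually done.

```python
# Cluster "members" into K groups with no predefined labels.
from sklearn.cluster import KMeans
import numpy as np

rng = np.random.default_rng(2)
# Two loose blobs of synthetic members in a 2-feature space
members = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(members)
labels = kmeans.labels_   # group id per member; groups were not predefined
```

Each resulting group could then get its own campaign, exactly as described above.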
What can you do to present results from data mining as informatively as possible?
You can sacrifice detail; this is a subjective decision.
For example, switching from ROC (receiver operating characteristic) curves with AUC (area under the curve) to lift curves or cumulative response curves.
ROC curves are not the most intuitive visualisation for many business stakeholders who really ought to understand the results. One of the most common alternatives is the cumulative response curve, which is more intuitive.
What are lift curves?
A visualisation framework that might not have all of the nice properties of ROC curves, but is more intuitive. Conceptually, as we move down the list of instances ranked by the model, we target increasingly larger proportions of all the instances.
Why is your data affected by selection bias?
We only had access to Akademikernes A-kasse's dataset, and not those of other A-kasse organisations. The dataset therefore only comprises information about AK's members and is not a representative sample of the entire population.
What makes machine learning algorithms supervised?
We know the target class, and more specifically what we are looking for. The opposite is unsupervised learning, which does not provide a purpose or target information.
Why did you choose to solely look at supervised machine learning algorithms?
We knew the target class and what we wanted to look for, namely churner/no churner.
How do you measure the performance of the classification?
Through classification accuracy. However, accuracy alone does not always capture what is important for the problem at hand.
Why did you choose classification?
As we had a label, classification was appropriate: classification involves selecting which label, out of a set of labels, should be assigned to some data according to a classification rule.
What is binary classification?
It means that we are working with two options.
Why did you disregard some attributes?
Because we found some of them to be irrelevant to predictions of the target, "will churn" versus "will not churn". In the attribute selection section of the paper, we assessed the degree to which the chosen attributes affect the target.
Why is it not enough to look at the accuracy?
Because a dataset can be imbalanced, which was the case for our dataset, and then accuracy can be misleading. In our example, 71.5% of the members were non-churners.
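Why 71.5% non-churners makes accuracy misleading can be shown with simple arithmetic: a degenerate "model" that always predicts "no churn" scores 71.5% accuracy while finding zero churners. The counts below are illustrative, scaled to the paper's class balance.

```python
# Accuracy of an always-"no churn" predictor on a 71.5% imbalanced set.
n_total = 1000
n_non_churn = 715                       # 71.5% majority class, as in the paper
y_true = [0] * n_non_churn + [1] * (n_total - n_non_churn)
y_pred = [0] * n_total                  # always predict "no churn"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
churners_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(accuracy)        # 0.715 — looks decent
print(churners_found)  # 0 — yet not a single churner is identified
```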
What did you do to cope with the accuracy “problem”?
We decided to use stratified cross validation in order to evaluate how generalisable our model was.
What is stratified cross validation?
Stratified cross-validation is a way of splitting the entire dataset into k bins (folds) of equal size, where each fold preserves the overall class proportions, so that both training data and test data get a fair share of data points from each class.
Why is it efficient to use stratified cross validation?
It utilises the dataset more. By using stratified cross-validation, you get more out of the training and test data, that is, the best possible validation and learning results.
If we were to use only a single hold-out split, we would train on just part of the dataset and leave the rest behind for testing.
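A sketch of this with scikit-learn's `StratifiedKFold`, using synthetic labels scaled to the paper's 71.5% / 28.5% class balance: every test fold mirrors the overall churn rate.

```python
# Stratified 5-fold CV: each fold keeps the churner/non-churner proportions.
from sklearn.model_selection import StratifiedKFold
import numpy as np

y = np.array([0] * 715 + [1] * 285)      # ~71.5% non-churners, as in the paper
X = np.arange(len(y)).reshape(-1, 1)     # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [y[test].mean() for _, test in skf.split(X, y)]
print(fold_rates)                        # every fold is ~28.5% churners
```

Across the 5 folds, every data point is used for testing exactly once and for training four times, which is the "utilise the dataset more" point above.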
Why is your dataset imbalanced?
Because the class churn_loyal has a high concentration of one value (71.5% non-churners), and as the class distribution is not equal, it can be argued that the dataset is imbalanced.
What do you need to take into consideration when your dataset is imbalanced?
It is a very normal situation, but we knew we needed to be aware that accuracy could be misleading, and therefore we chose not to focus much on it.