AM1 Presentation Flashcards
How did I choose which model to use?
First I made sure I understood the problem. Because we needed to group products and had no training data, an unsupervised method such as clustering was the natural route. I then had to decide which clustering method to use and, after some research, chose four to test. I picked these four because they were well established, there was plenty of documentation and worked examples online, and they were fairly quick and easy to implement. I compared the performance of the models using the same metrics (silhouette score and Calinski-Harabasz index) so the results were fair and transparent.
What metrics did I use to assess model performance and why?
I used the silhouette score because it assesses both the compactness of each cluster and the separation between the k clusters, giving a comprehensive evaluation of clustering quality. The score is also easy to interpret since it ranges from -1 to 1.
I used the Calinski-Harabasz (CH) index for similar reasons (it also assesses compactness and separation) and because it is computationally efficient. I wanted two metrics because, while the silhouette score focuses on individual data points and their placement within clusters, the CH index measures the overall clustering structure. Both are industry standard, and using both gives a more robust assessment: one metric can highlight issues the other misses, providing a fuller picture of clustering performance.
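As a minimal sketch of how the two metrics can be compared side by side (assuming scikit-learn, with synthetic data standing in for the real weekly-sales features):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Synthetic stand-in for the scaled weekly-sales feature matrix
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

scores = {}
for k in (3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = (
        silhouette_score(X, labels),         # -1 to 1, higher is better
        calinski_harabasz_score(X, labels),  # unbounded, higher is better
    )

for k, (ss, ch) in scores.items():
    print(f"k={k}: silhouette={ss:.3f}, CH={ch:.1f}")
```

Running both metrics over the same candidate values of k is what makes the comparison fair: each model is judged on identical data with identical criteria.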
If the two scores had conflicted, I would have brought in a third performance metric and/or done more research into which of the two to trust.
What were the ethical considerations of the project?
I made sure to bring in only the data that was relevant and needed for the algorithm to work. In particular, I excluded PII such as customer information, in line with data protection regulations such as GDPR. Using a rolled-up view of sales per week per product (i.e. omitting any grouping at store, location or demographic level) meant the data could not be traced back to a particular customer or demographic, which guarded against the algorithm reinforcing biases and leading to discriminatory outcomes. For example, if sales data were segmented by region and certain lower-income regions appeared not to buy many expensive leather jackets, the model might send less or no stock to those regions, excluding a portion of the population from higher-priced products. I was also transparent about the workings of the clustering algorithm with both technical and non-technical stakeholders, since transparency is a key pillar of most AI/data science ethics frameworks, such as the UK Government Data Ethics Framework (2018).
How did you choose the project management method (waterfall, why not agile?)
I chose a waterfall approach because I had well-defined, static requirements, so I didn't need continuous input from stakeholders. Waterfall allowed for detailed planning and design upfront, giving me a clear structure with well-defined stages and deliverables. It also places a strong emphasis on documentation at each stage of the project, which I found beneficial when recording the model and the thinking, assumptions and caveats behind it, and which fed directly into the end document containing instructions for rerunning the model. It also made it easy to track the project against pre-defined plans. While a more agile approach would have offered flexibility, iterative development and closer client collaboration, waterfall was ideal for a project with well-defined requirements, a fixed scope and documentation needs.
How did you stay focused and motivated throughout?
Having a clear end goal, a tangible outcome and a sense of the project's benefit kept me focused. I knew what I was aiming for and what the business stood to gain, which motivated me when I encountered obstacles. Seeing each milestone achieved was also a motivator, and a benefit of choosing the waterfall project management method. I also found conversations with my CS data mentor a good way to get back on track; they offered advice when needed and discussed ideas for improving the model. On a personal level, I avoided burnout by taking regular breaks (both to rest and to switch to other work-based projects) so as to maintain a good balance.
How did you determine what would go into the first iteration?
Before starting any planning or building, I held initial meetings with the stakeholders to understand their immediate needs and priorities. This gave me an MVP to work towards, as well as some 'nice to haves' to include in subsequent iterations. The MVP was a model that could group products based on selling patterns; the nice to haves were further refinements to the model (for example, potentially using decomposition to account for seasonality).
What were you most proud of?
I’m most proud of the successful communication and engagement of stakeholders in the project. Initially, I didn’t expect them to be receptive to the idea of using machine learning. However, through clear and direct communication and transparency, I was able to explain the entire process and bring them along on the journey. This not only secured their engagement but also paved the way for future data science and machine learning projects, as stakeholders are now more accustomed to and open to these approaches.
What was the hardest part of the project?
I think the hardest part of the project was translating the skills and techniques I had learned throughout the course to a real-world problem. In our CS workshops we were taught in neat, compartmentalised modules, where you knew which skills you were meant to apply in each scenario. Our assignments were graded and you could check your score as you went to confirm you were on the right track. In real life it is harder to quantify whether you are applying the correct techniques in the right places or are on the right lines (e.g. was there a better model I could have used? Could I have tuned parameters or cleaned the data better to get a more performant model?). I believe that as I carry out more data science projects I will become more comfortable with this as I gain experience.
How did you manage conflict or challenges to resource?
Regarding external resource (i.e. people other than me), I left plenty of time in the plan to facilitate discussions and get answers from these people (e.g. data experts), so I could be confident of well-considered, thought-out answers rather than quick, brush-off responses. Regarding myself, I had other competing priorities on top of this project. When planning the timelines, I built in buffer time to absorb unexpected, urgent work tasks as they came up. I also clearly set out my capacity and workload, so everyone who could potentially give me tasks knew what I was working on and how much bandwidth I had. Ultimately I had to be flexible, and building buffer time into my plan helped with this.
How did the data available influence your modelling considerations?
I didn't have labelled data (not even a small sample), so I needed an unsupervised learning technique. I removed outliers during preparation, so I didn't need to worry too much about choosing a clustering algorithm that was robust to outliers (e.g. DBSCAN). I had a fairly large dataset (~1m rows), so I needed an algorithm that could handle this, which k-means could. I also needed performance metrics that weren't too computationally intensive; the CH index was good for this.
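The scalability constraints can be sketched as follows (an illustration with scikit-learn and synthetic data; MiniBatchKMeans stands in here as one way k-means scales to large datasets, and the sampling trick for silhouette is a common workaround, not something the project necessarily used):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Synthetic stand-in for a large product-by-week feature matrix
X, _ = make_blobs(n_samples=20000, centers=5, random_state=0)

# Mini-batch k-means keeps fitting cheap on large datasets
labels = MiniBatchKMeans(n_clusters=5, n_init=3, random_state=0).fit_predict(X)

# The CH index scales roughly linearly with n, so it runs on the full data
ch = calinski_harabasz_score(X, labels)

# Silhouette needs pairwise distances (O(n^2)), so a sample keeps it tractable
ss = silhouette_score(X, labels, sample_size=5000, random_state=0)
```

This contrast is why the CH index was attractive for a ~1m-row dataset: it stays cheap where the silhouette score's pairwise-distance computation becomes the bottleneck.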
How did you tailor your communications to different audiences?
For technical audiences I didn't have to adapt much, as I knew they would understand the jargon I was using. I emphasised technical details and requirements and spoke as specifically as possible about my work, giving them the opportunity to suggest improvements or point out things to be careful of. For example, I was initially using a min-max scaler on the data until my data mentor warned me that this would remove the scale of the data (i.e. it would make Christmas look like any other week).
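A toy illustration of my mentor's point, on the assumption that the scaler was applied per feature as scikit-learn's MinMaxScaler does (the numbers here are invented for the example):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: rows are products, columns are weekly sales;
# the third column is a "Christmas" week with a large shared spike
sales = np.array([[10.0, 12.0, 200.0, 11.0],
                  [ 5.0,  6.0, 120.0,  4.0],
                  [ 8.0,  9.0, 150.0,  7.0]])

# MinMaxScaler rescales each column (week) to [0, 1] independently,
# so the Christmas week spans the same range as any quiet week
scaled = MinMaxScaler().fit_transform(sales)
print(scaled.max(axis=0))  # every week's max is now 1.0 — the spike is gone
```

After scaling, the between-week magnitude that distinguishes seasonal peaks has been erased, which is exactly the information the clustering needed to keep.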
When communicating with non-technical stakeholders, although I wanted to explain the technical side of things to increase engagement, I didn't want to put them off by overloading them with language and information they did not understand. So I avoided technical jargon and unfamiliar abbreviations, and incorporated visualisations (tables, charts, graphs and images) to explain key messages and concepts. I encouraged questions and gave people time to go away and digest the information before coming back with follow-up questions.
What were your recommendations coming out of the project?
I recommended to the business that we could use clustering to define a new attribute grouping to assign to products. This would provide a different way to manage stock levels, e.g. do we have the correct mix of seasonal products; looking at July/August, do we have enough summer products coming in? It also opened up other ways to define product attributes, e.g. using online customer reviews and sentiment analysis to derive attributes relating to quality, comfort, style and overall customer perception.
What will you focus on in future iterations?
I will keep on top of developments in clustering and apply any available enhancements to the model. For example, I have recently seen articles about Deep Embedded Clustering, which combines deep learning with clustering techniques for enhanced clustering performance; this is something I could trial and compare against the performance of my k-means model. I will also try to raise the silhouette scores so the clusters are better defined.
Another scope extension is the difference between online and in-store sales and how this affects the clustering. Because online shopping is 'always on', it allows customers to react immediately to external factors such as sudden changes in the weather. In-store shoppers, on the other hand, are generally more likely to make impulse purchases, influenced by appealing displays and the 'I've travelled here, I might as well buy something' mentality. Both characteristics could affect the timing of each cluster's sales peak, or the classification of products within each cluster. We could therefore hold different stock profiles for the same product depending on whether it was destined for a store or for a warehouse serving online sales. However, this would need another round of scoping to check it would be useful from the CP team's perspective, as I don't want to increase their workload for little gain.
If you were to do it again, what would you do differently?
I would consider seasonality to ensure the clusters take seasonal variation into account: decompose the data into seasonal, trend and residual (statistical noise) components, then use the seasonal and de-seasonalised components as features in the clustering.
I would also potentially use ensemble methods to get better results. Although they are typically used in supervised learning, I have seen cases where they have been adapted for clustering: get multiple clustering algorithms up to an acceptable standard, then use a 'majority rule' to assign each data point to the cluster that most of the clustering results agree on.
How did you work with other data scientists?
The data science team at M&S is small and new, so I did not have much formal interaction with them. I was, however, able to draw on their expertise informally for the best ways to engage non-technical stakeholders, based on their own learnings: they emphasised the need to break things down into simple terms, avoid jargon and keep asking questions to make sure people were following. I was also fortunate to have other people on my team who had completed the L7 apprenticeship, so I used these 'data science proxies' to consult on things like model selection and data sourcing.