Work Based Project Flashcards
What method did you use to combine different data sources?
Full Outer Join: keeps all the information from both tables, regardless of whether the records match
What other methods of joining are there?
Left (Outer): keeps all the information from the left table and brings in only the matching values from the right table
Inner: keeps only the matching values from both tables
What is a join?
Combines two or more tables based on a related column, allowing data to be reviewed and analysed together
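The three join types above can be sketched in a few lines of plain Python. This is only an illustration: the tables, the `code` keys and the `join` helper are hypothetical, and it assumes each key appears at most once per table.

```python
# Hypothetical tables keyed by a shared "code" column (illustrative data only).
left = {"A": 10, "B": 20, "C": 30}
right = {"B": "x", "C": "y", "D": "z"}

def join(left, right, how):
    """Return rows as (key, left_value, right_value); None marks a missing side."""
    if how == "inner":
        # Only keys present in both tables.
        keys = [k for k in left if k in right]
    elif how == "left":
        # Every key from the left table, matched or not.
        keys = list(left)
    elif how == "outer":
        # Every key from either table.
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(k, left.get(k), right.get(k)) for k in keys]

inner_rows = join(left, right, "inner")
left_rows = join(left, right, "left")
outer_rows = join(left, right, "outer")
```

The inner join returns the fewest rows and the full outer join the most; the `None` values in the left and outer results are the unmatched records that later need handling in analysis.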
What does combining data refer to?
- Process of integrating 🔄
- and merging information 🧩
- from various sources 📚
- into a unified dataset 1️⃣
- for analysis or management purposes📊
What are the risks of combining data?
- Selecting the wrong join type ❌🔗 could lead to data loss (e.g. an inner join dropping unmatched records)
- Security risks: table w/PII & final outcome not having right protection🛡️
- Large/complex joins = affect performance
- Data Quality - consistency/terminology
List in order all the stages of the lifecycle
- Plan
- Data Prep
- Analysis
- Modelling
- Refine & Compare
- Communicate & Implement
Provide an example of what you have done for refine & compare for a model you created
- Error Metrics (RMSE/MAE) 📉
- Changed models and parameters to achieve lowest RMSE/MAE values🧬⚙️
- Confidence Levels📏
- Project Sponsor Feedback💬
- Domain context: knew flat line was not realistic🧭
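As a reminder of what the error metrics measure, here is a minimal sketch of comparing two candidate models by MAE and RMSE. The actual values and both sets of predictions are made up for illustration and are not the project's data.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: penalises large errors more heavily than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical weekly usage and two hypothetical model forecasts.
actual = [10, 12, 9, 14]
forecasts = {
    "model_a": [11, 12, 8, 13],
    "model_b": [10, 15, 9, 10],
}

# Score every model, then pick the one with the lowest RMSE.
scores = {name: (mae(actual, pred), rmse(actual, pred)) for name, pred in forecasts.items()}
best = min(scores, key=lambda name: scores[name][1])
```

A lower score is better on both metrics, but the lowest number alone is not the whole answer: sponsor feedback and domain context still decide whether a forecast is believable.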
What is Privacy by Design?
- Embeds Privacy Protection 🛏️🛡️
- as part of the design/implementation 🛠️
- of Systems, Products, Business Practices📦
- from the start, not an afterthought 🏁
How have you applied privacy by design?
- BOBI access: 🔑
- Only used data necessary for project➖
- Google Sheets - JLP access only 👥
What DQ risks did you come across? (3 examples)
- Completeness: records present in a dataset e.g. missing records against dates, missing matches between the 2 sources
- Consistency: format of PK different between sources
- Accuracy: represents the truth e.g. risk of mistypes and I removed zeros that were not genuine
How did you resolve each DQ risk example?
- Completeness: domain context meant I knew the data was meant to have gaps (weekly usage). Removed the NULLs produced by the join from the analysis
- Consistency: during data prep I changed the formatting so they were consistent and I could continue the join
- Accuracy: I removed the zeros from the analysis
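The three fixes above could look something like the sketch below. Everything in it is an assumption made for illustration: the field names, the sample records, the key-formatting rule (trim whitespace and uppercase) and the choice to treat every zero as not genuine.

```python
def normalise_key(raw):
    """Assumed consistency rule: trim whitespace and uppercase the key."""
    return str(raw).strip().upper()

# Hypothetical source A: usage records with inconsistently formatted keys.
source_a = [
    {"code": " ab125 ", "usage": 4},
    {"code": "AB134", "usage": 0},   # zero assumed to be a mistype, not real usage
    {"code": "ab200", "usage": 7},   # has no match in source B
]
# Hypothetical source B: lookup keyed by the cleanly formatted code.
source_b = {"AB125": "G1", "AB134": "G2"}

# Consistency: format the keys the same way, then left join A to B.
joined = []
for row in source_a:
    code = normalise_key(row["code"])
    joined.append({"code": code, "usage": row["usage"], "group": source_b.get(code)})

# Completeness: drop rows where the join produced no match (None stands in for NULL).
# Accuracy: drop the zero usage values.
cleaned = [row for row in joined if row["group"] is not None and row["usage"] != 0]
```

Without the key normalisation step, `" ab125 "` would not match `"AB125"` and the record would be lost as an unmatched row.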
Provide an example in which you acted logically and analytically
- S - Forecast Code Selection for Streamline
- T - Enough data 📊✅, used infrequently 🌜⏳ to provide best opportunity for success
- A - Analysed codes for sufficient data/similar for consolidation
- R - Selecting a code that could be consolidated, streamlining codeset and meeting project requirement
What was the conclusion of the analysis? (5 things)
- 125 - low use - removal or consolidation
- 134 - increase - not put forward for removal
- Histogram emphasised need for code review
- Bar charts - several codes not used at all - put forward for removal
- Model was not good for seasonal predictions, need more data, but did help to inform for 125 and 134
What alternative methods or tools did you suggest for the project to be successful?
- Identified low usage code 🔭
- Completed time series forecast to understand future pattern📉
- NLP📝
- Dashboard📊
What were your customer requirements and how did you define them? (2 answers)
- Met with project sponsor to go through their challenges⛰️, strategy🎯 and project objectives🎯
- Drafted KPIs and ensured clarity via email📏📧