AM3 - Exam Flashcards
What does IDENTIFY mean
Select and state a choice or piece of information
What does DESCRIBE mean
Give an account of something by saying what it is, what it does, what it looks like, its size and scale, or how it relates to something else.
What does COMPARE mean
Identify differences and similarities between two or more options.
What does ANALYSE mean
Provide a breakdown of the topic to show your understanding
What does EXPLAIN mean
Set out the reasons for, showing understanding of the process and reasoning behind it.
What does JUSTIFY mean
Show the validity of a choice or point of view by discussing and discounting alternatives and weighing positives and negatives.
What is uncertainty
The concept of working with imperfect or incomplete data
Name three types of uncertainty
Irreducible, reducible and prediction
What are some examples of error in data and how can they be mitigated?
Missing data, duplicate entries, inconsistent formats and erroneous entries. Can be mitigated by data cleaning, data imputation and data validation.
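As an illustration, a minimal stdlib-Python sketch of those three mitigations on hypothetical sensor readings (the sensor names, values and the -999 sentinel are made-up assumptions):

```python
# Sketch of cleaning a list of sensor readings (hypothetical data):
# validation drops erroneous entries, de-duplication removes repeats,
# and imputation fills missing values with the mean.
raw = [("s1", 20.5), ("s1", 20.5), ("s2", None), ("s3", 21.0), ("s4", -999)]

# Validation: discard clearly erroneous entries (sentinel value -999)
valid = [(k, v) for k, v in raw if v != -999]

# De-duplication: keep only the first occurrence of each sensor id
seen, deduped = set(), []
for k, v in valid:
    if k not in seen:
        seen.add(k)
        deduped.append((k, v))

# Imputation: replace missing values with the mean of the known ones
known = [v for _, v in deduped if v is not None]
mean = sum(known) / len(known)
cleaned = [(k, v if v is not None else mean) for k, v in deduped]
print(cleaned)  # s2 is imputed with the mean of 20.5 and 21.0
```

In practice a library such as pandas would handle each of these steps, but the logic is the same.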
What are 3 types of bias
Sampling, algorithmic and confirmation.
What is sampling bias
A sample that is too small, or one that oversamples a particular group (e.g. one gender), so the data does not represent the population.
What is algorithmic bias
The wrong choice of algorithm can lead to systematic bias in the model's predictions.
What is confirmation bias
Once we train our model and evaluate its predictions, we may tend to retain information that affirms our preconceived notions, and exclude or remove data that goes against our theory. This leads to bias in the data, and therefore in the application's predictions. While this may satisfy us as developers, it can significantly reduce the application's usability.
What is irreducible uncertainty?
An inherent property of any dataset: there will always be some noise and randomness present in the data, as is reflected in reality. Examples include measurement noise (imprecise measurements), intrinsic variability (variation in biological systems or unpredictable human behaviour) and environmental factors (e.g. weather conditions affecting sensor readings).
Can irreducible uncertainty be removed/reduced
No: it cannot be reduced, but it can be managed by building models that are robust to noise.
What is reducible uncertainty
Uncertainty that arises from incomplete domain coverage in the data, i.e. uncertainty in the model due to a lack of data. Alternatively, we could be data rich but information poor (a high quantity of low-quality data). It can be reduced (e.g. by collecting more or better data, or by improving model training through cross-validation or regularisation), although it cannot be removed entirely.
What is prediction uncertainty
Encompasses both reducible and irreducible uncertainty; it represents the total uncertainty in the model's predictions.
What is the difference between uncertainty in data collection and analysis
- Data collection – accuracy, reliability and representativeness of the raw data
- Data analysis – focuses on the model's ability to correctly interpret and predict from the data (e.g. is the choice of model correct for the data? Has the model overfit to the training data, and therefore won't generalise to new data?)
What is data exposure
When sensitive information is accessible to unintended or unauthorised parties. It indicates missing security controls or processes (e.g. a lack of encryption mechanisms). The exposed data may include PII (personally identifiable information).
What is data linking
Combining data from different sources or datasets to create a more comprehensive, enriched dataset. Involves identifying and merging records that refer to the same entity (e.g. the same person or product). Matching can be exact or fuzzy; fuzzy matching helps when names are not in the same format (e.g. "first last" vs "first middle last" vs "first initial and last name").
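As a sketch of fuzzy matching, stdlib Python's difflib can score name similarity; the `similar` helper, the example names and the 0.75 threshold are illustrative assumptions, not a standard API:

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.75) -> bool:
    """Fuzzy match: treat two names as the same entity if their
    character-level similarity ratio meets the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Exact string comparison fails across name formats; fuzzy matching links them.
print(similar("Jane Smith", "Jane A. Smith"))  # True  (same person, middle initial added)
print(similar("Jane Smith", "John Doe"))       # False (different entity)
```

Real record-linkage tools use more robust measures (e.g. token-based or phonetic matching), but the idea of scoring near-matches rather than requiring equality is the same.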
What are 3 different types of data storage
Relational database, data lake, data lakehouse
What are 4 different types of data storage LOCATIONS?
Local, cloud, remote, temporary
What is local storage and what are the advantages and disadvantages?
Storing data on physical devices (hard drives, SSDs) physically located within the organisation. Advantages: full control over data and storage devices; can be highly secure if proper measures are in place, since data is not transmitted over the internet; high-speed access without reliance on the internet. Disadvantages: limited by physical space and hardware capacity (scaling up can be costly and cumbersome); high upfront hardware costs; data is only accessible from the locations where the storage devices are located.
What is cloud storage and what are the advantages and disadvantages?
Storing data on servers managed by third-party providers and accessed over the internet. Advantages: scalable on demand; accessible from anywhere with an internet connection; pay-as-you-go models reduce upfront costs. Disadvantages: data breaches can occur, and security depends on the provider's measures; dependent on an internet connection, so outages can affect availability; less control over the physical storage infrastructure.
What is remote storage and what are the advantages and disadvantages?
Storing data on servers located offsite, typically within the same organisation but at a different physical location, accessed via network connections. Advantages: backup processes can provide disaster recovery in case of site-specific failures; security can be controlled by the organisation; often less expensive than local storage for large datasets. Disadvantages: higher latency compared to local storage due to network transmission; complex, requiring robust networking infrastructure and management; dependent on network reliability and bandwidth.
What is temporary storage and what are the advantages and disadvantages
Storage used for short-term data retention, often in memory (RAM) or cache solutions. Advantages: extremely fast access times; cost-effective for short-term data needs; reduces the load on persistent storage by offloading transient data. Disadvantages: volatile (data can be lost when power is turned off or the system restarts); limited in size compared with other storage methods; not suitable for long-term data retention.
What are the differences between supervised and unsupervised learning?
Supervised - input data is labelled
Unsupervised - data not labelled
Supervised - used for prediction
Unsupervised - used for analysis
Supervised - new data classified based on a labelled training set
Unsupervised - data grouped into clusters discovered from the data itself
Supervised - Divided into regression and classification
Unsupervised - mainly clustering
Name 4 types of supervised learning (2 for regression and 2 for classification)
Regression - linear and ridge
Classification - Logistic and decision tree
Name 2 types of unsupervised learning
K-means clustering, hierarchical clustering
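To make the K-means idea concrete, here is a minimal 1-D sketch in stdlib Python (toy data and a hand-picked initialisation; real use would call a library such as scikit-learn):

```python
# Minimal 1-D k-means sketch: assign each point to its nearest centroid,
# move each centroid to the mean of its cluster, repeat until stable.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [1.0, 10.0]  # k = 2, simple initialisation

while True:
    # Assignment step: index of the nearest centroid for each point
    labels = [min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
              for p in points]
    # Update step: each centroid becomes the mean of its assigned points
    new_centroids = [
        sum(p for p, l in zip(points, labels) if l == i) / labels.count(i)
        for i in range(len(centroids))
    ]
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(centroids)  # converges to [1.5, 10.5] for this toy data
```

The two steps (assign, then update) are exactly the loop an exam answer on K-means should describe; hierarchical clustering instead builds a tree of merges (a dendrogram).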
Mean absolute error
average of the absolute differences between predicted values and actual values
Mean squared error
average of the squared differences between predicted values and actual values
R-Squared
goodness of fit of the regression model (perfect fit would be 1)
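The three regression metrics above can be computed by hand; a short sketch with hypothetical predictions (the values are made up for illustration):

```python
# MAE, MSE and R-squared computed directly from their definitions.
actual = [3.0, 5.0, 7.0, 9.0]
pred   = [2.5, 5.0, 7.5, 9.0]
n = len(actual)

# MAE: average absolute difference between predictions and actuals
mae = sum(abs(a - p) for a, p in zip(actual, pred)) / n

# MSE: average squared difference (penalises large errors more)
mse = sum((a - p) ** 2 for a, p in zip(actual, pred)) / n

# R-squared: 1 - (residual sum of squares / total sum of squares);
# a perfect fit gives exactly 1.
mean_a = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
ss_tot = sum((a - mean_a) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)  # 0.25, 0.125, 0.975
```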
Accuracy
How often the model is correct overall
Precision
How often the model is correct when predicting the target class (i.e. of all +ve predictions, how many are truly +ve)
Recall
Of all real positive cases, how many are predicted positive?
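The classification metrics follow directly from the confusion counts; a sketch with toy labels (1 = positive class, values invented for illustration):

```python
# Accuracy, precision and recall from true/false positive/negative counts.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
pred   = [1, 1, 0, 0, 1, 0, 0, 0]

tp = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 0)  # true negatives

accuracy  = (tp + tn) / len(actual)  # how often correct overall
precision = tp / (tp + fp)           # of all +ve predictions, how many truly +ve
recall    = tp / (tp + fn)           # of all real +ve cases, how many were found
print(accuracy, precision, recall)
```

Note how the three can disagree: here the model is cautious, so precision (2/3) exceeds recall (1/2).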
Inertia
sum of squared distances between each data point and the centroid of its assigned cluster
Silhouette score
measures how similar each point is to its own cluster compared with other clusters (compactness vs separation); ranges from -1 (poor) to 1 (well clustered)
CH-Index
ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate more compact, better-separated clusters
Cophenetic correlation coefficient
measures how faithfully the dendrogram preserves the pairwise distances between the original data points
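Of these clustering metrics, inertia is the simplest to compute by hand; a sketch on a toy 1-D clustering (points, labels and centroids are invented for illustration):

```python
# Inertia: sum of squared distances from each point to the centroid
# of its assigned cluster. Lower inertia = tighter clusters.
points    = [1.0, 2.0, 10.0, 11.0]
labels    = [0, 0, 1, 1]       # cluster assignment for each point
centroids = [1.5, 10.5]        # centroid of each cluster

inertia = sum((p - centroids[l]) ** 2 for p, l in zip(points, labels))
print(inertia)  # four squared distances of 0.25 each -> 1.0
```

Silhouette, CH-index and the cophenetic coefficient are normally taken from a library, as they compare distances across clusters rather than within one.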
What are 3 principles of ethics
- Contribute to society and human well-being – all people are stakeholders in computing.
- Avoid harm, be honest and trustworthy, be fair and take action not to discriminate.
- Respect privacy and honour confidentiality
What is GDPR
Governs the personal data of EU citizens and aims to safeguard it by enhancing privacy rights and imposing rules on data handling and processing by organisations, e.g. requiring explicit consent for data collection. Non-compliance can result in hefty fines (up to €20m or 4% of annual global turnover) and reputational damage.
What are the principles of the Government Data Ethics Framework (2018)
- Transparency – be transparent about data sources, methods, and intentions. Communicate clearly with the public and stakeholders.
- Accountability - Establish clear responsibilities and accountability mechanisms. Ensure all team members understand their ethical obligations
- Use data ethically - Share best practices and resources. Provide guidance and support to others on ethical data use
What are the Asilomar AI Principles
A set of guidelines developed to ensure that artificial intelligence (AI) technologies are beneficial and safe for humanity. They focus on three areas: research; ethics and values; and longer-term issues (e.g. risks, self-improvement and the common good).
What is the BCS Code of Conduct
A set of professional standards and ethical guidelines that members of BCS, The Chartered Institute for IT, are expected to follow. It outlines the principles and standards of behaviour required to maintain the integrity and reputation of the profession. The key elements are:
1. Public Interest: Members must have due regard for public health, privacy, security, and the well-being of others and the environment. They should avoid harm and ensure their work contributes to the public good.
2. Professional Competence and Integrity: They must keep their skills and knowledge up-to-date and act with honesty and integrity at all times.
3. Duty to Relevant Authority: Members must respect the rules and procedures of their employer or any relevant authority.
4. Duty to the Profession: Members should act in a manner that promotes trust and confidence in the profession.
What are the differences between agile and waterfall PM styles (for approach, flexibility, documentation and project size and scope)?
- Approach:
o Waterfall: A linear and sequential approach where each phase (requirements, design, implementation, verification, maintenance) is completed before the next one begins.
o Agile: An iterative and incremental approach where projects are divided into small cycles called sprints, allowing for continuous feedback and adjustments.
- Flexibility:
o Waterfall: Inflexible; changes are difficult to implement once the project is underway as each phase must be completed before moving to the next.
o Agile: Highly flexible; changes and improvements can be made throughout the project based on ongoing feedback.
- Documentation:
o Waterfall: Extensive documentation is required upfront and throughout each phase.
o Agile: Emphasises working software over comprehensive documentation, focusing more on collaboration and responsiveness to change.
- Project Size and Scope:
o Waterfall: Better suited for projects with well-defined scope and requirements that are unlikely to change.
o Agile: Ideal for projects with evolving requirements and scope, allowing for frequent reassessment and adjustment.
What are necessary resources and architecture that may be required for a ML project?
The necessary computational resources (CPU, memory, storage) and software tools (libraries, frameworks, development environments) for the machine learning models. These may need to scale with the size of the dataset and the complexity of the model, so consider increasing computational demand as the system grows. Distributed computing, parallel processing and optimisation techniques may be needed to improve performance and reduce latency. Appropriate security measures are also needed to protect sensitive data, ensure model robustness against adversarial attacks and comply with privacy regulations (GDPR); techniques such as encryption and access controls can enhance security and privacy.
What are the benefits of integrating new solutions with existing technology?
Maximizes investments, streamlines processes, and enhances data accessibility. Compatibility ensures seamless integration, leveraging existing infrastructure and workflows while minimising disruptions. By embedding machine learning capabilities into familiar interfaces, user adoption is enhanced, leading to improved efficiency. Leveraging existing resources also boosts scalability and performance, distributing computational tasks efficiently. Maintenance and support are simplified through centralized management, reducing operational overhead.
What to include in Q1
Uncertainty (different types), error and bias
What to include in Q2
Data Exposure, data Linkage, data Storage, data Analysis and data Visualisation (ELSAV)
What to include in Q3
Ethical, legal, regulatory and professional constraints, and the business impact of non-compliance
What to include in Q4
Resources and architecture needed to solve the problem. Project management workflows (compare 2) , UX.
What legal issues should be considered (Q3)
- Consent from people to use their data (GDPR)
- Anonymise data to ensure privacy
- Does the business have the right to use 3rd party data (data sharing agreement etc)
What social issues should be considered (Q3)
- Incorrect predictions could lead to customer dissatisfaction
- Is there an impact on employees (e.g. increased workload?)
- Need to be transparent about how data is used and predictions are made to maintain public trust in company
What ethical issues should be considered (Q3)
- Model should be free of bias (should not perpetuate inequalities)
- Clear accountability for model development, deployment and for monitoring impact
- Transparency about the model's limitations and the uncertainty in its predictions
What professional issues should be considered (Q3)
- Engage correct people in model development
- Adhere to any professional standards
- Regularly update model to reflect changes
- Provide training on how to use model and outputs