AM3 - Exam Flashcards

1
Q

What does IDENTIFY mean?

A

Select and state a choice or piece of information

2
Q

What does DESCRIBE mean?

A

Give an account of by saying what something is, does, looks like, size and scale, or how it relates to something else.

3
Q

What does COMPARE mean?

A

Identify differences and similarities between two or more options.

4
Q

What does ANALYSE mean?

A

Provide a breakdown of the topic to show your understanding

5
Q

What does EXPLAIN mean?

A

Set out the reasons for, showing understanding of the process and reasoning behind it.

6
Q

What does JUSTIFY mean?

A

Show validity in a choice or point of view by discussing and discounting alternatives and considering positives and negatives

7
Q

What is uncertainty?

A

The concept of working with imperfect or incomplete data

8
Q

Name three types of uncertainty.

A

Irreducible, reducible and prediction

9
Q

What are some examples of error in data and how can they be mitigated?

A

Examples: missing data, duplicate entries, inconsistent formats and erroneous entries. They can be mitigated by data cleaning, data imputation and data validation.
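A minimal sketch of the three mitigations, using only the Python standard library. The records, the `age` field and the 0 to 120 validity range are made up for illustration; mean imputation is just one of several possible imputation strategies.

```python
from statistics import mean

# Hypothetical raw records: one duplicate, one missing age, one invalid age.
raw = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # missing data
    {"id": 2, "age": None},   # duplicate entry
    {"id": 3, "age": -5},     # erroneous entry
    {"id": 4, "age": 40},
]

def is_valid_age(age):
    """Data validation: an age must be a number in a plausible range."""
    return isinstance(age, (int, float)) and 0 <= age <= 120

def clean(records):
    # Data cleaning: drop duplicate ids, keeping the first occurrence.
    seen, unique = set(), []
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            unique.append(dict(rec))
    # Treat entries that fail validation as missing.
    for rec in unique:
        if not is_valid_age(rec["age"]):
            rec["age"] = None
    # Data imputation: fill missing ages with the mean of the valid ones.
    fill = mean(r["age"] for r in unique if r["age"] is not None)
    for rec in unique:
        if rec["age"] is None:
            rec["age"] = fill
    return unique

cleaned = clean(raw)
```

After cleaning, four records remain and the two unusable ages are replaced by the mean of the valid ones.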

10
Q

What are 3 types of bias?

A

Sampling, algorithmic and confirmation.

11
Q

What is sampling bias?

A

Bias that arises when the sample does not represent the population, e.g. the sample is too small or one group (such as a particular gender) is over-sampled.

12
Q

What is algorithmic bias?

A

Bias introduced by the algorithm itself: the wrong choice of algorithm can systematically skew predictions.

13
Q

What is confirmation bias?

A

Once we start to train our model and evaluate its predictions, we may tend to retain information that affirms our preconceived notions. We might start to exclude or remove data that goes against our theory in the process. This will lead to a certain bias in the data, and therefore our application’s predictions. While this may satisfy us as developers, it can significantly reduce the application’s usability

14
Q

What is irreducible uncertainty?

A

An inherent property of any dataset: some noise and randomness will always be present in the data, reflecting reality. Examples include measurement noise (imprecise measurements), intrinsic variability (variation in biological systems or unpredictable human behaviour) and environmental factors (e.g. weather conditions affecting sensor readings).

15
Q

Can irreducible uncertainty be removed or reduced?

A

No. It cannot be reduced, but it can be managed by building models that are robust to noise.

16
Q

What is reducible uncertainty?

A

Uncertainty in the model that arises from a lack of data, i.e. incomplete domain coverage. It can also arise when we are data rich but information poor (a high quantity of low-quality data). It can be reduced, e.g. by collecting more or better data or by improving model training through cross-validation or regularisation, but it cannot be removed entirely.

17
Q

What is prediction uncertainty?

A

The total uncertainty in the model's predictions. It encompasses both reducible and irreducible uncertainty.

18
Q

What is the difference between uncertainty in data collection and uncertainty in data analysis?

A
  • Data collection – accuracy, reliability and representativeness of the raw data
  • Data analysis – the model's ability to correctly interpret and predict based on the data (e.g. is the choice of model correct for the data, or has the model overfitted to the training data so that it won't generalise to new data?)
19
Q

What is data exposure?

A

This is when sensitive information, which may include personally identifiable information (PII), is accessible to unintended or unauthorised parties. It indicates that proper security controls or processes are missing, e.g. a lack of encryption.

20
Q

What is data linking?

A

Combining data from different sources or datasets to create a more comprehensive, enriched dataset. It involves identifying and merging records that refer to the same entity (e.g. the same person or product). Matching can be exact or fuzzy, e.g. when names are not in the same format (first last vs first middle last vs first initial and last name).
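A minimal sketch of fuzzy matching on names, using `difflib.SequenceMatcher` from the Python standard library. The two name lists and the 0.8 threshold are assumptions chosen for illustration; a real linkage task would normally compare several fields, not just the name.

```python
from difflib import SequenceMatcher

def normalise(name):
    """Lower-case and collapse whitespace so formatting differences don't count."""
    return " ".join(name.lower().split())

def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalised strings are identical."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def link(left, right, threshold=0.8):
    """Pair each name in `left` with its best match in `right`, if similar enough."""
    links = {}
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        if similarity(name, best) >= threshold:
            links[name] = best
    return links

# Hypothetical records from two systems that format names differently.
crm = ["Jane Smith", "Robert Brown"]
billing = ["jane  smith", "Robert A. Brown", "Alice Jones"]
matches = link(crm, billing)
```

Here `Jane Smith` links to `jane  smith` after normalisation, and `Robert Brown` links to `Robert A. Brown` despite the middle initial. Lowering the threshold finds more links but raises the risk of false matches.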

21
Q

What are 3 different types of data storage?

A

Relational database, data lake, data lakehouse

22
Q

What are 4 different types of data storage LOCATIONS?

A

Local, cloud, remote, temporary

23
Q

What is local storage and what are the advantages and disadvantages?

A

Storing data on physical devices (e.g. hard drives, SSDs) that are located within the organisation.
Advantages:
  • Full control over data and storage devices
  • Can be highly secure if proper measures are in place, since data is not transmitted over the internet
  • High-speed access to data without reliance on the internet
Disadvantages:
  • Limited by physical space and hardware capacity, so scaling up can be costly and cumbersome
  • High upfront costs for hardware
  • Data is only accessible from locations where the storage devices are located

24
Q

What is cloud storage and what are the advantages and disadvantages?

A

Storing data on servers managed by third-party providers and accessed over the internet.
Advantages:
  • Scalable based on demand
  • Accessible from anywhere with an internet connection
  • Pay-as-you-go models reduce upfront costs
Disadvantages:
  • Data breaches can occur, and security depends on the provider's measures
  • Dependent on an internet connection, so outages can affect availability
  • Less control over the physical storage infrastructure

25
Q

What is remote storage and what are the advantages and disadvantages?

A

Storing data on servers located offsite, typically within the same organisation but at different physical locations, and accessed via network connections.
Advantages:
  • Data backup processes can provide disaster recovery in case of site-specific failures
  • Security can be controlled by the organisation
  • Often less expensive than local storage for large datasets
Disadvantages:
  • Higher latency compared to local storage due to network transmission
  • Complex, requiring robust networking infrastructure and management
  • Dependent on network reliability and bandwidth

26
Q

What is temporary storage and what are the advantages and disadvantages?

A

Storage used for short-term data retention, often in memory (RAM) or cache storage solutions.
Advantages:
  • Extremely fast access times
  • Can be cost effective for short-term data needs
  • Reduces load on persistent storage systems by offloading transient data
Disadvantages:
  • Volatile: data can be lost when power is turned off or the system restarts
  • Limited in size compared to other storage methods
  • Not suitable for long-term data retention
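As one illustration of temporary storage, an in-memory cache can sit in front of a slower persistent store. The sketch below uses `functools.lru_cache` from the standard library; `load_customer` and the `calls` counter are hypothetical stand-ins for a real database lookup.

```python
from functools import lru_cache

calls = 0  # counts trips to the (simulated) persistent store

@lru_cache(maxsize=128)  # limited size: older entries are evicted when full
def load_customer(customer_id):
    """Pretend to read a record from slow persistent storage."""
    global calls
    calls += 1
    return {"id": customer_id, "name": f"customer-{customer_id}"}

load_customer(1)  # miss: goes to the store
load_customer(1)  # hit: served from memory
load_customer(2)  # miss: goes to the store
```

Only two of the three lookups reach the store. The cache lives in the process's memory, so it is lost when the program exits, which matches the volatility disadvantage above.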

27
Q

What are the differences between supervised and unsupervised learning?

A

Supervised - input data is labelled
Unsupervised - data not labelled
Supervised - used for prediction
Unsupervised - used for analysis
Supervised - data classified based on training set
Unsupervised - data assigned to groups discovered by the algorithm
Supervised - Divided into regression and classification
Unsupervised - mainly clustering

28
Q

Name 4 types of supervised learning (2 for regression and 2 for classification)

A

Regression - linear and ridge
Classification - Logistic and decision tree

29
Q

Name 2 types of unsupervised learning

A

K-means clustering, hierarchical clustering

30
Q

Mean absolute error

A

average of the absolute differences between predicted values and actual values

31
Q

Mean squared error

A

average of the squared differences between predicted values and actual values

32
Q

R-Squared

A

goodness of fit of the regression model, i.e. the proportion of the variance in the actual values that the model explains (a perfect fit would be 1)
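The three regression metrics on cards 30 to 32 can be sketched in plain Python as follows. The `actual` and `predicted` lists are made-up values for illustration, and `r_squared` assumes the actual values are not all identical.

```python
def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average of (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical values: two predictions are off by 1, two are exact.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
```

For these values MAE and MSE are both 0.5 and R-squared is 0.9. MSE penalises large errors more heavily than MAE because the differences are squared.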

33
Q

Accuracy

A

How often the model is correct overall

34
Q

Precision

A

How often the model is correct when it predicts the target class (i.e. of all +ve predictions, how many are truly +ve)

35
Q

Recall

A

Of all real positive cases, how many are predicted positive?
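Accuracy, precision and recall (cards 33 to 35) can all be derived from the counts of true/false positives and negatives. The sketch below uses made-up binary labels, with 1 as the positive class, and assumes there is at least one positive prediction and one real positive so that no division by zero occurs.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(actual, predicted):
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + fp + fn + tn)

def precision(actual, predicted):
    tp, fp, _, _ = confusion_counts(actual, predicted)
    return tp / (tp + fp)  # of all positive predictions, how many are right

def recall(actual, predicted):
    tp, _, fn, _ = confusion_counts(actual, predicted)
    return tp / (tp + fn)  # of all real positives, how many were found

# Hypothetical labels: 4 real positives, 6 real negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
```

Here accuracy is 0.7, precision is 2/3 and recall is 0.5, which shows that a model can look reasonable on accuracy while still missing half of the real positive cases.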

36
Q

Inertia

A

sum of squared distances between each data point and the centroid of its assigned cluster
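A minimal sketch of the calculation, assuming each centroid is the mean of the points assigned to that cluster (as in K-means). The points and labels are made up for illustration.

```python
def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def inertia(points, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum(sum((x - cx) ** 2 for x, cx in zip(p, c)) for p in members)
    return total

# Hypothetical 2-D points in two clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
```

Cluster 0 contributes 1 + 1 = 2 and cluster 1 contributes 4 + 4 = 8, so the inertia is 10. Lower inertia means tighter clusters, but it always falls as the number of clusters increases.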

37
Q

Silhouette score

A

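measure of compactness and separation of clusters: compares how close each point is to its own cluster with how close it is to the nearest other cluster (ranges from -1 to 1, higher is better)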
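A minimal sketch of the usual definition: for each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the point's score is (b - a) / max(a, b). The code assumes Euclidean distance, at least two clusters and no duplicate points; scoring single-point clusters as 0 is a common convention. The data is made up.

```python
from math import dist  # Euclidean distance, Python 3.8+

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's distance to itself is 0, so it drops out of the sum)
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical points forming two well-separated pairs.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # sensible clustering
bad = silhouette_score(points, [0, 1, 0, 1])   # pairs split across clusters
```

The sensible labelling scores close to 1, while the labelling that splits each pair scores below 0.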

38
Q

CH-Index

A

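Calinski-Harabasz index: measure of compactness and separation of clusters, calculated as the ratio of between-cluster dispersion to within-cluster dispersion (higher is better)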
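Assuming CH-Index refers to the Calinski-Harabasz index, it is usually written as below for n points in k clusters. Here C_q is cluster q with n_q points and centroid c_q, and c is the centroid of the whole dataset; these symbols are introduced for illustration.

```latex
\mathrm{CH} = \frac{\operatorname{tr}(B_k) / (k - 1)}{\operatorname{tr}(W_k) / (n - k)},
\qquad
\operatorname{tr}(W_k) = \sum_{q=1}^{k} \sum_{x \in C_q} \lVert x - c_q \rVert^2,
\qquad
\operatorname{tr}(B_k) = \sum_{q=1}^{k} n_q \, \lVert c_q - c \rVert^2
```

The within-cluster term tr(W_k) is the same quantity as inertia, so a high CH value means clusters are tight relative to how far apart their centroids are.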

39
Q

Cophenetic correlation coefficient

A

measures how faithfully the dendrogram preserves the pairwise distances between the original data points

40
Q

What are 3 principles of ethics?

A
  • Contribute to society and human well-being – all people are stakeholders in computing.
  • Avoid harm, be honest and trustworthy, be fair and take action not to discriminate.
  • Respect privacy and honour confidentiality
41
Q

What is GDPR?

A

Governs the processing of personal data of individuals in the EU. It aims to safeguard individuals' personal data by enhancing privacy rights and imposing rules on data handling and processing by organisations, e.g. gaining explicit consent for data collection. Non-compliance can result in hefty fines (up to €20m or 4% of annual global turnover, whichever is higher) and reputational impact.
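The fine ceiling is simple arithmetic. The sketch below assumes the higher fine tier, where the cap is the greater of €20m and 4% of annual global turnover; the function name and turnover figures are hypothetical, and actual fines are set case by case and can be far below the cap.

```python
def max_gdpr_fine(annual_global_turnover_eur):
    """Upper bound on the fine: the greater of EUR 20m and 4% of turnover."""
    four_percent = annual_global_turnover_eur * 4 / 100
    return max(20_000_000, four_percent)

small_company_cap = max_gdpr_fine(100_000_000)    # 4% is EUR 4m, so cap is EUR 20m
large_company_cap = max_gdpr_fine(1_000_000_000)  # 4% is EUR 40m, so cap is EUR 40m
```

For any organisation with turnover below €500m, the €20m figure is the binding cap.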

42
Q

What are the principles of the Government Data Ethics Framework (2018)?

A
  • Transparency – be transparent about data sources, methods, and intentions. Communicate clearly with the public and stakeholders.
  • Accountability - Establish clear responsibilities and accountability mechanisms. Ensure all team members understand their ethical obligations
  • Use data Ethically - Share best practices and resources. Provide guidance and support to others on ethical data use
43
Q

What are the Asilomar AI Principles?

A

The Asilomar AI Principles are a set of guidelines developed to ensure that artificial intelligence (AI) technologies are beneficial and safe for humanity. They focus on three areas: research, ethics and values, and longer-term issues (e.g. risks, self-improvement and the common good).

44
Q

What is the BCS Code of Conduct?

A

A set of professional standards and ethical guidelines that members of BCS, The Chartered Institute for IT, are expected to follow. It outlines the principles and standards of behaviour required to maintain the integrity and reputation of the profession. The key elements are:
1. Public Interest: Members must have due regard for public health, privacy, security, and the well-being of others and the environment. They should avoid harm and ensure their work contributes to the public good.
2. Professional Competence and Integrity: Members must keep their skills and knowledge up to date and act with honesty and integrity at all times.
3. Duty to Relevant Authority: Members must respect the rules and procedures of their employer or any relevant authority.
4. Duty to the Profession: Members should act in a manner that promotes trust and confidence in the profession.

45
Q

What are the differences between agile and waterfall PM styles (for approach, flexibility, documentation and project size and scope)?

A
  • Approach:
    o Waterfall: A linear and sequential approach where each phase (requirements, design, implementation, verification, maintenance) is completed before the next one begins.
    o Agile: An iterative and incremental approach where projects are divided into small cycles called sprints, allowing for continuous feedback and adjustments.
  • Flexibility:
    o Waterfall: Inflexible; changes are difficult to implement once the project is underway as each phase must be completed before moving to the next.
    o Agile: Highly flexible; changes and improvements can be made throughout the project based on ongoing feedback.
  • Documentation:
    o Waterfall: Extensive documentation is required upfront and throughout each phase.
    o Agile: Emphasizes working software over comprehensive documentation, focusing more on collaboration and responsiveness to change.
  • Project Size and Scope:
    o Waterfall: Better suited for projects with well-defined scope and requirements that are unlikely to change.
    o Agile: Ideal for projects with evolving requirements and scope, allowing for frequent reassessment and adjustment.
46
Q

What resources and architecture may be required for an ML project?

A

The project needs the necessary computational resources (CPU, memory, storage) and software tools (libraries, frameworks, dev environments) for machine learning models. These may need to be scaled depending on the size of the dataset and the complexity of the model, and computational demand will increase as the system grows. Distributed computing, parallel processing and optimisation techniques may be needed to improve performance and reduce latency. Security measures must also be in place to protect sensitive data, ensure model robustness against adversarial attacks and comply with privacy regulations (GDPR). Techniques such as encryption and access controls can be employed to enhance security and privacy.

47
Q

What are the benefits of integrating new solutions with existing technology?

A

Maximizes investments, streamlines processes, and enhances data accessibility. Compatibility ensures seamless integration, leveraging existing infrastructure and workflows while minimising disruptions. By embedding machine learning capabilities into familiar interfaces, user adoption is enhanced, leading to improved efficiency. Leveraging existing resources also boosts scalability and performance, distributing computational tasks efficiently. Maintenance and support are simplified through centralized management, reducing operational overhead.

48
Q

What to include in Q1

A

Uncertainty (different types), error and bias

49
Q

What to include in Q2

A

Data Exposure, data Linkage, data Storage, data Analysis and data Visualisation (ELSAV)

50
Q

What to include in Q3

A

Ethical, legal, regulatory and professional constraints, and the business impact of non-compliance

51
Q

What to include in Q4

A

Resources and architecture needed to solve the problem. Project management workflows (compare 2), UX.

52
Q

What legal issues should be considered (Q3)?

A
  • Consent from people to use their data (GDPR)
  • Anonymize data to ensure privacy
  • Does the business have the right to use 3rd party data (data sharing agreement etc)
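As a rough illustration of the privacy bullet above, a direct identifier can be replaced with a keyed hash before analysis. Note the assumption: this is pseudonymisation rather than full anonymisation, because anyone holding the key can re-create the link, and other fields may still identify a person. The records, field names and key are hypothetical.

```python
import hashlib
import hmac

def pseudonymise(value, secret_key):
    """Replace an identifier with a keyed SHA-256 digest (hex string)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical key: in practice it would be stored securely, not in code.
SECRET_KEY = b"replace-with-a-securely-stored-key"

records = [
    {"email": "jane@example.com", "spend": 120},
    {"email": "raj@example.com", "spend": 75},
    {"email": "jane@example.com", "spend": 30},
]

# Drop the email and keep a stable token so records can still be grouped.
safe_records = [
    {"customer": pseudonymise(r["email"], SECRET_KEY), "spend": r["spend"]}
    for r in records
]
```

The same email always maps to the same token, so per-customer analysis still works without the analyst seeing the address itself.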
53
Q

What social issues should be considered (Q3)?

A
  • Incorrect predictions could lead to customer dissatisfaction
  • Is there an impact on employees (e.g. increased workload?)
  • Need to be transparent about how data is used and predictions are made to maintain public trust in company
54
Q

What ethical issues should be considered (Q3)?

A
  • Model should be free of bias (should not perpetuate inequalities)
  • Clear accountability for model development, deployment and for monitoring impact
  • Transparency about the model's limitations and uncertainty in predictions
55
Q

What professional issues should be considered (Q3)?

A
  • Engage correct people in model development
  • Adhere to any professional standards
  • Regularly update model to reflect changes
  • Provide training on how to use model and outputs