AM3 - Exam Flashcards
What does IDENTIFY mean?
Select and state a choice or piece of information.
What does DESCRIBE mean?
Give an account of something by saying what it is, what it does, what it looks like, its size and scale, or how it relates to something else.
What does COMPARE mean?
Identify the differences and similarities between two or more options.
What does ANALYSE mean?
Provide a breakdown of the topic to show your understanding.
What does EXPLAIN mean?
Set out the reasons for something, showing understanding of the process and the reasoning behind it.
What does JUSTIFY mean?
Show the validity of a choice or point of view by discussing and discounting alternatives and considering the positives and negatives.
What is uncertainty?
The concept of working with imperfect or incomplete data.
Name three types of uncertainty.
Irreducible, reducible and prediction uncertainty.
What are some examples of error in data and how can they be mitigated?
Examples include missing data, duplicate entries, inconsistent formats and erroneous entries. They can be mitigated by data cleaning, data imputation and data validation.
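As a rough illustration of the three mitigations, here is a minimal Python sketch. The rows, the field names (`id`, `age`, `country`), the plausible age range of 0 to 120 and the choice of mean imputation are all assumptions made up for this example, not part of the course material.

```python
from statistics import mean

# Hypothetical raw data showing each kind of error
raw_rows = [
    {"id": 1, "age": "34", "country": "UK"},
    {"id": 2, "age": None, "country": "uk "},     # missing age, inconsistent format
    {"id": 2, "age": None, "country": "uk "},     # duplicate entry
    {"id": 3, "age": "-5", "country": "France"},  # erroneous entry
    {"id": 4, "age": "52", "country": "france"},
]

def clean(rows):
    # Data cleaning: drop duplicate ids and normalise formats
    seen, cleaned = set(), []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        age = int(row["age"]) if row["age"] is not None else None
        country = row["country"].strip().upper()
        cleaned.append({"id": row["id"], "age": age, "country": country})

    # Data validation: treat ages outside an assumed plausible range as missing
    for row in cleaned:
        if row["age"] is not None and not 0 <= row["age"] <= 120:
            row["age"] = None

    # Data imputation: fill missing ages with the mean of the valid ages
    fill_value = round(mean(r["age"] for r in cleaned if r["age"] is not None))
    for row in cleaned:
        if row["age"] is None:
            row["age"] = fill_value
    return cleaned

cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Mean imputation is only one option; the median or a model-based estimate may suit skewed data better.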
What are 3 types of bias?
Sampling, algorithmic and confirmation.
What is sampling bias?
Bias introduced when the sample is too small or oversamples a particular group (e.g. one gender).
What is algorithmic bias?
Bias in predictions caused by choosing the wrong algorithm.
What is confirmation bias?
Once we start to train our model and evaluate its predictions, we may tend to retain information that affirms our preconceived notions. We might start to exclude or remove data that goes against our theory in the process. This will lead to a certain bias in the data, and therefore in our application’s predictions. While this may satisfy us as developers, it can significantly reduce the application’s usability.
What is irreducible uncertainty?
An inherent property of any dataset: there will always be some noise and randomness in the data, reflecting reality. Examples include measurement noise (imprecise measurements), intrinsic variability (variations in biological systems or unpredictable human behaviour) and environmental factors (e.g. weather conditions affecting sensor readings).
Can irreducible uncertainty be removed or reduced?
No. It cannot be reduced, but it can be managed by building models that are robust to noise.
What is reducible uncertainty?
Uncertainty in the model that arises from incomplete domain coverage in the data, i.e. a lack of data. It can also arise when we are data rich but information poor (a high quantity of low-quality data). It can be reduced (e.g. by collecting more or better data, or by improving model training through cross-validation or regularisation), although it cannot be removed entirely.
What is prediction uncertainty?
The total uncertainty in the model’s predictions, encompassing both reducible and irreducible uncertainty.
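One way to see the three types side by side is a small simulation. This is a sketch under stated assumptions: the "model" is just an estimated mean of noisy sensor readings, and `summarise`, `true_mean` and `noise_sd` are hypothetical names chosen for the illustration. The spread of individual readings stands in for irreducible uncertainty, the standard error of the estimated mean stands in for reducible uncertainty, and the two are combined into a total for predicting one new reading.

```python
import random
import statistics

def summarise(n, true_mean=10.0, noise_sd=2.0, seed=0):
    """Simulate n noisy readings and split the uncertainty into two parts."""
    rng = random.Random(seed)
    readings = [rng.gauss(true_mean, noise_sd) for _ in range(n)]
    noise = statistics.stdev(readings)   # irreducible: spread of individual readings
    estimate_error = noise / n ** 0.5    # reducible: standard error of the estimated mean
    # Prediction uncertainty for one new reading combines both parts
    total = (noise ** 2 + estimate_error ** 2) ** 0.5
    return noise, estimate_error, total

for n in (20, 2000, 20000):
    noise, estimate_error, total = summarise(n)
    print(f"n={n:>6}  noise={noise:.2f}  estimate_error={estimate_error:.3f}  total={total:.2f}")
```

Collecting more data shrinks the estimate error towards zero, while the noise term stays close to the assumed `noise_sd`, which is why the total never drops below the irreducible part.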
What is the difference between uncertainty in data collection and uncertainty in data analysis?
- Data collection – concerns the accuracy, reliability and representativeness of the raw data.
- Data analysis – concerns the model’s ability to correctly interpret and predict based on the data (e.g. is the choice of model correct for the data, or has the model overfitted to the training data so that it won’t generalise to new data?).
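The overfitting check mentioned above can be sketched by comparing error on training data with error on unseen data. Everything here is an assumption for illustration: the true relationship `y = 2x`, the noise level, and the deliberately overfitted `memorising_model` (a 1-nearest-neighbour lookup that simply memorises the training points).

```python
import random

rng = random.Random(0)

def make_data(xs):
    # Assumed true relationship y = 2x, plus measurement noise
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]

train = make_data([i / 2 for i in range(40)])        # x = 0.0, 0.5, 1.0, ...
test = make_data([i / 2 + 0.25 for i in range(40)])  # unseen x values

def memorising_model(x):
    # Predict the y value of the closest training point (1-nearest neighbour)
    return min(train, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    # Mean squared error of the model's predictions on the given data
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train_error = mse(memorising_model, train)
test_error = mse(memorising_model, test)
print(f"train MSE = {train_error:.2f}, test MSE = {test_error:.2f}")
```

A training error of zero alongside a clearly higher test error suggests the model has fitted the noise in the training data and will not generalise well.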
What is data exposure?
This is when sensitive information is accessible to unintended or unauthorised parties. It indicates that proper security controls or processes are missing, e.g. a lack of encryption. The exposed data may include personally identifiable information (PII).
What is data linking?
Combining data from different sources or datasets to create a more comprehensive, enriched dataset. It involves identifying and merging records that refer to the same entity (e.g. the same person or product). Matching can be exact or fuzzy, e.g. when names are not in exactly the same format (first last vs first middle last vs first initial and last name).
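A minimal sketch of fuzzy matching, using `difflib.SequenceMatcher` from the Python standard library. The two sources (`crm`, `billing`), their field names and the 0.8 similarity threshold are hypothetical choices for this example; real linking work often uses dedicated record-linkage tools and more than one field.

```python
from difflib import SequenceMatcher

def normalise(name):
    # Lowercase, drop punctuation and collapse repeated whitespace
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def similarity(a, b):
    # Ratio between 0 and 1; 1.0 means the normalised names are identical
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def link_records(source_a, source_b, threshold=0.8):
    """Pair each record in source_a with its best match in source_b, if good enough."""
    links = []
    for rec_a in source_a:
        best = max(source_b, key=lambda rec_b: similarity(rec_a["name"], rec_b["name"]))
        if similarity(rec_a["name"], best["name"]) >= threshold:
            links.append((rec_a["id"], best["ref"]))
    return links

# Hypothetical datasets that describe some of the same people
crm = [
    {"id": 1, "name": "Jane Doe"},
    {"id": 2, "name": "John Smith"},
    {"id": 3, "name": "Alice Wong"},
]
billing = [
    {"ref": "A", "name": "jane a. doe"},
    {"ref": "B", "name": "Jon Smith"},
]

links = link_records(crm, billing)
print(links)
```

An exact match would miss both pairs here because of the middle initial and the spelling variation. The threshold is a trade-off: lower values link more records but risk merging different people.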
What are 3 different types of data storage?
Relational database, data lake and data lakehouse.
What are 4 different types of data storage LOCATIONS?
Local, cloud, remote, temporary