Data and Data Governance Flashcards
Welcome to Section 2: Data and Data Governance.
Data is fundamental to any AI model. When dealing with this data, it is vitally important that legal and ethical considerations are taken into account. As highlighted in the RCR report, “Overcoming Barriers to AI Implementation in Imaging” (June 2023), data governance is one of the main barriers facing AI vendors and those working in the AI field.1 It is therefore imperative that you learn how to handle data and ensure data governance principles are followed.
This section takes you through the key questions and answers related to current data governance procedures: preparing data, techniques for utilising data, and sources of data.
Key resources and helpful papers are referenced throughout, which provide further detail.
https://www.rcr.ac.uk/our-services/artificial-intelligence-ai/overcoming-barriers-to-ai-implementation-in-imaging/
Key Questions to Ask About your Data
Before starting any project, it is important to ask questions of your data: who it pertains to, why you need it, where it is coming from, and so on.
Why?
Why is each part of the data collected needed (e.g. personal identifiers, types of cancer, machine vendor) and is each part necessary?
Who?
Who will have access to the data in an identifiable and de-identified form? For example, medical physicists, PhD students, companies.
Who are the patients that will be included in this dataset (inclusion / exclusion criteria) and how many people will be included?
Where?
Where will the data be stored at each step? For example, on an NHS site, a university site, a Trusted Research Environment (TRE), a research PACS, or cloud-based storage. Is the data being processed as part of routine care, for example within the NHS firewall?
Is data going outside the UK or the EU? For example, is data being stored in the cloud, and where is that cloud infrastructure based? Will the data be shared with third parties, for example commercial companies or academics?
Where is the backup of the data located? Is there a backup?
What?
What is the data going to be used for? For example, training or testing or both.
What is the task the AI algorithm is carrying out?
What level of anonymisation / de-identification will be used? For example, anonymisation, pseudonymisation or synthetic data (which parts of the data need de-identifying, e.g. dates, addresses, DOB).
What form of consent for data will be used? For example, opt in or opt out.
What data is being accessed and on what systems? For example, picture archiving and communication system (PACS), electronic health record (EHR).
What route will be used to move the data if it needs to be moved off site?
What format does the data need to be in? For example, comma separated values (CSV), Digital Imaging and Communications in Medicine (DICOM).
When?
When is the time frame the data is from? For example, 01/01/2010-31/12/2020.
How long will the data be stored for? For example, 5 years.
How to Answer Questions About your Data
Now that you know what kind of questions you need to ask of your data, you will learn how to answer them.
The following three areas will help you to answer key questions about the data you need access to:
1 Type and amount of data
2 Data input structure for AI algorithms
3 Outcome measures from data
Type and Amount of Data
The diagram shown below can be used as an aid in deciding the type and amount of data that will be used in AI algorithm development and testing. In this lesson, you will get to grips with the five core components to consider when it comes to the type and amount of data you will need: use of data, generalisability, sources, type, and size.
You may also wish to refer to the paper Preparing Medical Imaging Data for Machine Learning by Willemink et al., which details key steps for data preparation. Please note that the paper is written with the US system in mind, and variations for different countries must be considered.
https://pubs.rsna.org/doi/10.1148/radiol.2020192224
- Use of data
How the data is going to be used is an important step to establish, as this will drive the amount of data that is going to be required as well as what information needs to be collected.
For example: will the data be used as part of the AI algorithm development process only, or as part of the testing evaluation pathway only, or both? What is the task the AI algorithm is carrying out?
This relates to the train, validate, and test steps of building an AI algorithm because the data collected, and the data governance principles applied to that data, will directly impact how the AI algorithm is built.
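For example, a simple way to implement such a split in Python might look like the sketch below; the 70/15/15 proportions and the use of scikit-learn are illustrative assumptions, not a prescribed method.

    # Sketch: splitting cases into train / validation / test sets.
    # The 70/15/15 proportions and scikit-learn usage are illustrative assumptions.
    from sklearn.model_selection import train_test_split

    def split_dataset(cases, labels, seed=42):
        # Hold out 30% of cases, then split that portion half-and-half
        # into validation and test sets (70/15/15 overall).
        x_train, x_rest, y_train, y_rest = train_test_split(
            cases, labels, test_size=0.30, random_state=seed, stratify=labels)
        x_val, x_test, y_val, y_test = train_test_split(
            x_rest, y_rest, test_size=0.50, random_state=seed, stratify=y_rest)
        return (x_train, y_train), (x_val, y_val), (x_test, y_test)

Keeping the test set completely separate from the training and validation data is what allows the final evaluation to remain unbiased.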
- Generalisability
Generalisability describes how well an algorithm can perform on different populations and machine vendors. When an AI algorithm has good levels of generalisability, it means it can adapt to different environments and is reliable in its performance.
It is therefore important that training and test sets are representative of the population in terms of multiple factors (for example, disease distribution, ethnicity, age range), as well as the different machine vendors and parameters used. This makes it more likely that an AI algorithm will have good levels of generalisability when it is tested on real data and when it is applied to new datasets.
It is therefore important to have multi-centre and multi-national collaborations, as these help with the diversity of the data. Likewise, as treatment protocols and radiological imaging technologies advance, changes over time necessitate ongoing monitoring: an algorithm developed and validated on historic data may not perform the same on current data. Prospective trials may also provide an opportunity for generalisability testing, because they often involve diverse populations, allowing AI algorithms to be tested on how well they adapt to that diversity.
- Sources
Local datasets are those from within an NHS trust, which must go through the appropriate local and national data governance procedures.
The data must also go through data collection and curation steps if being used off-site.
Open datasets include those from other sites (e.g. from a previous trial) that are available to access and share. Some open datasets also require certain approvals to be in place.
Before moving on to type of data, you will first have the opportunity to take a deep dive into sources of data by looking at open-source datasets.
Download an open-source datasets resource below to learn more.
- Type
Imaging data is stored in DICOM format within Picture Archiving and Communication Systems (PACS).
DICOM data contains both the image and DICOM meta-data (information held about the image). Therefore, it is important to make sure both the image and the DICOM meta-data are de-identified, if required.
Imaging data can also be annotated to add further detail about lesion location and size. Most extracts from Electronic Health Records (EHRs) are in CSV format, a data table format similar to an Excel sheet. Most imaging reports are stored as free text (like a Word document containing text). As well as annotating data, you can also note the type of data; for instance, whether it is anonymised, pseudonymised or synthetic.
DICOM
As previously mentioned, DICOM images are typically stored in Picture Archiving and Communication Systems (PACS). DICOM is a specific image format for medical imaging. However, there are multiple other image formats, such as Neuroimaging Informatics Technology Initiative (NIfTI), Portable Network Graphics (PNG), Tag Image File Format (TIFF), and Joint Photographic Experts Group (JPEG), which might be used by AI algorithms. It is important to check the image format required for each AI algorithm.
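If an AI algorithm expects PNG rather than DICOM input, a conversion step is needed. Below is a minimal sketch using the pydicom, NumPy and Pillow libraries; the file names are placeholders, and a real pipeline would also need to handle multi-frame and colour images, windowing and photometric interpretation.

    # Sketch: converting a single-frame greyscale DICOM image to PNG.
    # "scan.dcm" is a placeholder path; requires pydicom, numpy and Pillow.
    import numpy as np
    import pydicom
    from PIL import Image

    ds = pydicom.dcmread("scan.dcm")
    pixels = ds.pixel_array.astype(np.float32)

    # Rescale pixel values into the 0-255 range expected by an 8-bit PNG.
    pixels -= pixels.min()
    if pixels.max() > 0:
        pixels /= pixels.max()
    Image.fromarray((pixels * 255).astype(np.uint8)).save("scan.png")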
DICOM meta-data is stored within the same DICOM file as the image. This meta-data includes patient identifiers and machine characteristics (held in the headers and tags). Therefore, when applying any de-identification protocol, care should be taken to remove unnecessary and/or identifiable information while retaining the machine characteristic information that is needed (e.g. the scanner manufacturer, model and acquisition parameters).
It is also important to note that different machine vendors will apply different private tags. Various tools are available for handling DICOM meta-data (such as MATLAB, DICOM Cleaner, and Python DicomAnonymizer packages). However, it is best to check with your local PACS and medical physics teams first to see if there is an in-house tool or if these tools meet the required specification.1
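As a minimal illustration of the de-identification step described above, the sketch below uses the pydicom library to blank a few common patient identifiers while leaving acquisition information untouched. The tags removed here are examples only; a real project should follow an agreed de-identification protocol and local governance requirements.

    # Sketch: blanking basic patient identifiers in a DICOM file with pydicom.
    # Tag selection is illustrative; follow your approved de-identification protocol.
    import pydicom

    IDENTIFYING_TAGS = ["PatientName", "PatientID", "PatientBirthDate",
                        "PatientAddress", "OtherPatientIDs"]

    ds = pydicom.dcmread("original.dcm")      # "original.dcm" is a placeholder path
    for tag in IDENTIFYING_TAGS:
        if tag in ds:
            ds.data_element(tag).value = ""   # blank the value, keep the element
    ds.remove_private_tags()                  # drop vendor-specific private tags
    # Machine characteristics such as Manufacturer and acquisition parameters
    # are left in place.
    ds.save_as("deidentified.dcm")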
Annotations
Numerous tools have been developed for DICOM image annotation and the creation of regions of interest (ROIs); examples include ITK-SNAP and 3D Slicer. These annotations can be saved in multiple formats (e.g. DICOM-SC, TIFF, JPEG).
It is important to note who is doing the annotations (e.g. their level of experience) and how many people have annotated, so that inter-reader variability can be checked.
These annotations allow for a model’s accuracy to be tested for tasks such as automated segmentation and locating the site of cancer on an image.2
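One common way to quantify agreement between readers (or to score an automated segmentation against a reference annotation) is the Dice similarity coefficient. A minimal sketch, assuming the annotations have already been loaded as binary NumPy masks of the same shape:

    # Sketch: Dice similarity coefficient between two binary segmentation masks.
    import numpy as np

    def dice_coefficient(mask_a, mask_b):
        # Returns 1.0 for identical masks and 0.0 for no overlap.
        a = mask_a.astype(bool)
        b = mask_b.astype(bool)
        denominator = a.sum() + b.sum()
        if denominator == 0:
            return 1.0  # both masks empty: treat as perfect agreement
        return 2.0 * np.logical_and(a, b).sum() / denominator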
- Size
Medical images can vary in size. Thus, the size of the data being used is an important factor when considering where the data will be stored and whether it needs to be moved to a different location to be used. If necessary, files can be compressed to reduce storage requirements.
It is also important to check that transfer mechanisms (for example, a network’s bandwidth when transferring large amounts of data from one location to another) can handle the size of data and how long this process will take.
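A quick back-of-the-envelope calculation can help when checking whether a transfer mechanism will cope. The sketch below is purely illustrative; the dataset size and bandwidth figures are assumptions, and real transfers will be slower due to protocol overhead and network contention.

    # Sketch: rough estimate of transfer time for a dataset over a network link.
    def transfer_time_hours(dataset_gb, bandwidth_mbps):
        # Convert GB to megabits (decimal units), then divide by the link speed.
        dataset_megabits = dataset_gb * 1000 * 8
        return dataset_megabits / bandwidth_mbps / 3600

    # e.g. a 500 GB imaging dataset over a 100 Mbps link:
    print(f"{transfer_time_hours(500, 100):.1f} hours")   # roughly 11 hours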
It is also important to check that a suitable storage drive with a backup of the data is available, as well as what this will cost.
The pyramid to the left shows different units of data size, from the bit (the smallest) at the top down to the yottabyte (the largest) at the bottom.
5.1 Creative data solutions – what to do when you don’t have enough data
So, what do you do when you haven't got enough data?
There are several techniques you can employ when there is not enough data or when the data is not diverse enough.
One way is through the use of transfer learning.
Transfer learning is the application of knowledge gained from completing one task to help solve a different, but related, problem.
This helps when there is not enough data because knowledge gained from previous, larger datasets can be applied to smaller datasets.
For example, an algorithm originally trained on the ImageNet dataset can be applied to a medical dataset.
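A minimal sketch of this idea using PyTorch and torchvision is shown below; the choice of ResNet-18, two output classes and freezing the whole backbone are illustrative assumptions rather than a recommended recipe.

    # Sketch: transfer learning from ImageNet weights to a small medical dataset.
    # Requires a recent torchvision; the architecture and class count are illustrative.
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

    # Freeze the ImageNet-trained backbone so only the new head is updated.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final layer with a new head for the medical task (e.g. 2 classes);
    # this new layer is trainable and can be fine-tuned on the smaller dataset.
    model.fc = nn.Linear(model.fc.in_features, 2)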
You may also use Synthetic data to overcome data issues.
This is where AI algorithms are used to generate artificial data from examples of real data.
There are a few different ways of producing synthetic data: for example, the Synthetic Minority Oversampling Technique (SMOTE), Generative Adversarial Networks (GANs), and diffusion models.
Synthetic Minority Oversampling Technique is a method used to balance class distribution in a dataset by generating synthetic examples for the minority class.
Generative Adversarial Networks are a type of machine learning model in which two neural networks, a generator and a discriminator, compete against each other to create realistic-looking data from random input.
Diffusion models generate new data by learning to reverse a gradual noising process, starting from random noise and progressively refining it into realistic examples.
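As an illustration of the first of these techniques, the imbalanced-learn library provides an implementation of SMOTE for tabular data. The sketch below uses a synthetic toy dataset purely for demonstration.

    # Sketch: oversampling a minority class with SMOTE from imbalanced-learn.
    # The toy dataset below is purely illustrative.
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # Toy imbalanced dataset: roughly 95% of samples in one class, 5% in the other.
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

    X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
    print("Before:", Counter(y))            # heavily imbalanced
    print("After: ", Counter(y_resampled))  # classes balanced by synthetic examples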
Lastly, federated learning is a machine learning approach where, instead of collecting data in one central research institute, the model is distributed to multiple centres that hold data (such as hospitals).
At each centre, the model is trained locally and returned to the original research institute, resulting in one model per centre.
These models are then aggregated into one global model combining the learned information from all centres.
This is important because it removes the need to transfer sensitive data outside of hospital environments.
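A highly simplified sketch of the aggregation step (often called federated averaging) is shown below, using NumPy arrays to stand in for model weights; real federated learning frameworks also handle communication, scheduling and security, none of which is shown here.

    # Sketch: federated averaging of model weights returned by several centres.
    # The per-centre weights and dataset sizes are illustrative stand-ins.
    import numpy as np

    def federated_average(centre_weights, centre_sizes):
        # Weighted average of per-centre weights, weighted by local dataset size.
        fractions = np.array(centre_sizes) / sum(centre_sizes)
        stacked = np.stack(centre_weights)            # shape: (n_centres, n_params)
        return (stacked * fractions[:, None]).sum(axis=0)

    # e.g. three hospitals with differently sized local datasets:
    local_models = [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.3, 0.9])]
    global_model = federated_average(local_models, centre_sizes=[500, 1200, 300])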
File structure
File structure relates to how data is stored in folders.
It is important to know how data is expected to be received by the AI algorithm when designing the folder structure.
It is also important to use unique identifiers that allow data to be broken down by case / exam / image. The image below shows an example of a chest x-ray dataset file structure.1
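A minimal sketch of creating such a structure with Python's pathlib is shown below; the dataset name, numbers of cases and exams, and the naming pattern are illustrative assumptions rather than a prescribed standard.

    # Sketch: building a case/exam-level folder structure with unique identifiers.
    # The dataset name and naming pattern are illustrative assumptions.
    from pathlib import Path

    root = Path("chest_xray_dataset")
    for case_number in range(1, 4):            # e.g. three cases
        for exam_number in range(1, 3):        # e.g. two exams per case
            exam_dir = root / f"case_{case_number:04d}" / f"exam_{exam_number:02d}"
            exam_dir.mkdir(parents=True, exist_ok=True)
    # Images can then be saved as, e.g., case_0001/exam_01/image_001.dcm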
Data structure
Snake case and Camel case are examples of naming conventions used in databases.1
Spaces between words often cause problems in computer languages, as they can break the path the computer is trying to read and make names harder to read.
There are different ways of resolving this issue. For example, in snake case, an underscore (_) is added between words; in camel case, the second and subsequent words are capitalised.
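A small sketch showing both conventions applied to the same name (the helper functions and example are illustrative):

    # Sketch: converting a human-readable name to snake case and camel case.
    def to_snake_case(name):
        return "_".join(word.lower() for word in name.split())

    def to_camel_case(name):
        words = name.split()
        return words[0].lower() + "".join(word.capitalize() for word in words[1:])

    print(to_snake_case("Patient Age"))  # patient_age
    print(to_camel_case("Patient Age"))  # patientAge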
File naming conventions
Use clear names that identify what a field shows, and ensure there is consistency between data under the same column name.
Save any data with easy-to-understand file names, e.g. TEST_RUN_1_YYYY-MM-DD.csv (note that characters such as / cannot be used in file names).
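One simple way to generate such names consistently is to build them in code; the sketch below uses an ISO-style date so the name contains no characters that are invalid in file paths (the prefix is an illustrative assumption).

    # Sketch: building a consistent, date-stamped file name for an output CSV.
    # The "TEST_RUN_1" prefix is an illustrative assumption.
    from datetime import date

    file_name = f"TEST_RUN_1_{date.today().isoformat()}.csv"
    print(file_name)   # e.g. TEST_RUN_1_2024-06-01.csv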
Data type naming conventions
Data cleaning (where data is ‘tidied up’ to ensure consistency) is the most time-intensive element of data studies.
It is advised that you use consistent naming conventions for both project and classification names. In the example shown here, the following key has been used: B = benign, C = cancer, N = normal. See the difference between Table 1 (red) and Table 2 (green).
Table 2 is a lot easier to work with and will ensure cases are classified correctly.
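As a minimal sketch of the kind of cleaning step that turns Table 1 into Table 2, the example below maps inconsistent free-text labels onto the agreed key using pandas; the example spellings and column names are assumptions for illustration.

    # Sketch: harmonising inconsistent classification labels onto an agreed key
    # (B = benign, C = cancer, N = normal). Example spellings are illustrative.
    import pandas as pd

    LABEL_KEY = {"benign": "B", "b": "B",
                 "cancer": "C", "malignant": "C", "c": "C",
                 "normal": "N", "n": "N"}

    df = pd.DataFrame({"diagnosis": ["Benign", "CANCER ", "normal", "b", "Malignant"]})
    df["diagnosis_clean"] = (df["diagnosis"]
                             .str.strip()      # remove stray whitespace
                             .str.lower()      # case-insensitive matching
                             .map(LABEL_KEY))  # map onto the single-letter key
    print(df)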