Module 3: Understanding the AI Development Lifecycle: Design Flashcards
What are the elements of the design phase (data gathering)?
- Implement a data strategy.
- Determine what data is required.
- Determine how much data is needed.
- Determine how data will be collected and stored.
- Consider whether pre-trained models or existing data sets should be used.
- Decide whether to use internal data only and/or external data.
- Consider the quality of the data.
Name some different formats of data.
- Structured data (data that is labeled and categorized in a predefined format, e.g. rows and columns in a database table)
- Unstructured data (data that is not labeled or categorized, e.g. images, videos, audio)
- Semi-structured data (does not have a rigid structure, but has properties that make it easier to process and analyze than unstructured data, e.g. an email with a standard format plus free-form text)
- Static data (does not change, e.g. past sales)
- Streaming data (generated continuously and changes over time, e.g. customers visiting a website)
Which part of development takes the most effort?
- Preparing the data (approx. 80% of the cycle)
What are the 5 V’s of data preparation?
1) Volume of the data
2) Velocity of the data
3) Variety of the data
4) Veracity of the data
5) Value of the data
What is the definition of cleansing data?
Removing erroneous or irrelevant data (helps to avoid privacy issues later).
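Example (a minimal sketch, not a complete cleansing pipeline): the function below drops records that are missing a required field and removes exact duplicates. The name `cleanse` and the field names `id` and `amount` are hypothetical and chosen only for illustration.

```python
def cleanse(records, required_fields):
    """Return records with incomplete entries and exact duplicates removed."""
    seen = set()
    cleaned = []
    for rec in records:
        # Treat a missing, None, or empty required field as erroneous data.
        if any(rec.get(field) in (None, "") for field in required_fields):
            continue
        # Use the sorted (key, value) pairs as a fingerprint to spot duplicates.
        fingerprint = tuple(sorted(rec.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(rec)
    return cleaned


# Hypothetical sample data: one duplicate and one incomplete record.
records = [
    {"id": 1, "amount": 10},
    {"id": 1, "amount": 10},
    {"id": 2, "amount": None},
]
cleaned = cleanse(records, ["id", "amount"])
```

Real cleansing rules depend on the data set; range checks, type checks, and relevance filters would be added in the same way.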
What is involved in labeling data?
Tagging or annotating data.
How do you anonymize data?
Remove identifiers from the data.
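Example (a minimal sketch): removing direct identifiers from a record before it is used for training. The set `DIRECT_IDENTIFIERS` and the field names are assumptions for illustration; which fields count as identifiers depends on the data set, and removing direct identifiers alone may not be enough if the remaining fields can be combined to single out a person.

```python
# Hypothetical list of fields that directly identify a person.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def strip_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the listed identifier fields."""
    return {key: value for key, value in record.items() if key not in identifiers}


customer = {"name": "A. Example", "email": "a@example.com", "age": 41, "region": "North"}
anonymized = strip_identifiers(customer)
```

The original record is left unchanged; only the copy passed on for training lacks the identifier fields.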
What is the concept of data minimization?
If data is not needed in the application, do not use it to train the model.
List some privacy-enhancing technologies (PETs) that can be applied to AI systems.
- Anonymization
- Minimization
- Differential privacy
- Federated learning
During which phase is the AI architecture and model chosen?
The Design Phase
What is Differential Privacy?
Blurs the data by using an algorithm that adds calibrated random noise to the data or to query results. The data is still usable in aggregate but non-specific (individuals cannot be identified from the output).
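Example (a sketch of one common approach, the Laplace mechanism, applied to a counting query): noise drawn from a Laplace distribution with scale 1/epsilon is added to the true count, so any single answer is approximate while the average over many answers stays close to the truth. The function names, the `ages` data, and the choice of epsilon are assumptions for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values that satisfy the predicate.

    A count changes by at most 1 when one person is added or removed,
    so the noise scale is 1 / epsilon. Smaller epsilon means more noise.
    """
    rng = rng or random.Random()
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: how many people are aged 60 or over?
ages = [25, 40, 61, 70, 33]
noisy = private_count(ages, lambda age: age >= 60, epsilon=1.0)
```

The true answer here is 2; each call returns a slightly different value around it.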
What is Federated Learning?
One central model is sent to each local site and trained there on the local data, so sensitive data never leaves its location. Only the results of the training (model updates) are sent back to the central location, where they are aggregated into the central model.
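Example (a toy sketch of weighted federated averaging, assuming a one-parameter model y = weight * x): each client trains on its own data and returns only its updated weight, and the central location averages those weights in proportion to each client's data size. The function names, learning rate, and client data are hypothetical.

```python
def local_update(weight, data, lr=0.01, epochs=5):
    """Train the model y = weight * x on one client's local data only."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_round(global_weight, client_datasets):
    """Send the global weight out, then average the returned local weights."""
    updates = [(local_update(global_weight, data), len(data)) for data in client_datasets]
    total = sum(size for _, size in updates)
    # Only weights and data set sizes reach this point, never the raw data.
    return sum(weight * size for weight, size in updates) / total


# Hypothetical local data sets; both follow y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]

weight = 0.0
for _ in range(50):
    weight = federated_round(weight, clients)
```

After the rounds, `weight` is close to 2 even though neither client's records were shared with the central location.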
What are elements of implementing a data strategy?
- Data gathering
- Data wrangling
- Data cleansing
- Data labeling