AWS ML Eng Assoc Data Eng and Storage Flashcards
AWS S3
Amazon S3 is an object storage service that offers industry-leading scalability; durability; availability; security; and performance. Customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases; such as data lakes; websites; mobile applications; backup and restore; archive; enterprise applications; IoT devices; and big data analytics.
S3 Buckets
S3 stores data as objects within buckets. A bucket can hold any type of object; such as text files; photos; or videos. Each bucket is created in a specific AWS Region; its name must be globally unique; and it can be accessed from anywhere; subject to permissions.
S3 Object Key
The name of the object is known as the key. The combination of bucket name and key uniquely identifies the object.
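As a rough illustration of how a bucket name and a key together address one object, the sketch below splits an `s3://` URI into those two parts. The helper name `split_s3_uri` and the example bucket are hypothetical, and no AWS call is made.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is everything up to the first slash; the rest is the key.
    # Slashes inside the key are ordinary characters, not real directories.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, key


# Hypothetical bucket and key, for illustration only.
bucket, key = split_s3_uri("s3://example-bucket/raw/2024/events.json")
print(bucket)  # example-bucket
print(key)     # raw/2024/events.json
```

The key here looks like a path, but S3 treats it as one flat name. The "folders" shown in the console are only a display convention based on the slashes.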
EBS Volumes
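EBS (Elastic Block Store) volumes are network drives that can be attached to and detached from EC2 instances in the same Availability Zone. Data on a volume persists independently of the instance; even after the instance is terminated (note that the root volume is deleted on termination by default unless that setting is changed).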
EBS Volumes Use Case
EBS volumes allow you to persist data; even after the instance is terminated. This is helpful when you need to recreate an instance: launch the replacement and attach the same EBS volume to get your data back.
EFS (Elastic File System)
EFS is a managed NFS (Network File System) that can be mounted on many EC2 instances across multiple Availability Zones (AZs). It’s highly available; scalable; and expensive (roughly 3x the cost of gp2 EBS volumes). Use cases include content management; web serving; data sharing; and WordPress.
Data Ingestion and Storage
All machine learning starts with potentially large amounts of data that need to be stored in a central repository in a scalable; secure manner. This section covers types of data; properties of data; storage strategies like data warehouses; data lakes; and data lakehouses; as well as pipelines for extracting; transforming; and loading data.
3 Types of Data
Structured (organized in a defined schema; found in relational databases); Unstructured (data without a predefined schema; like raw text files; video; audio); Semi-structured (has some structure like tags or hierarchies; but needs work to extract; like XML; JSON; log files).
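The three types can be contrasted with a small sketch using only the Python standard library. All sample values below are made up for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, so every row can be read the same way.
structured = io.StringIO("customer_id,amount\n1,9.99\n2,24.50\n")
rows = list(csv.DictReader(structured))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: keys and nesting give some structure, but fields
# may vary per record and must be dug out explicitly.
semi_structured = '{"customer_id": 1, "tags": ["new"], "address": {"city": "Lyon"}}'
record = json.loads(semi_structured)
city = record["address"]["city"]

# Unstructured: no schema at all; any meaning has to be derived.
unstructured = "Great product, arrived late but support was helpful."
word_count = len(unstructured.split())

print(round(total, 2), city, word_count)
```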
3 V’s of Data Properties
Volume (amount/size of data); Velocity (speed at which data is generated/processed); Variety (different types/sources of data).
Data Warehouse
A centralized repository optimized for analysis; where data from different sources is stored in a structured format. Designed for complex queries and analytics. Data is cleaned; transformed and loaded using an ETL process. Typically uses a star or snowflake schema.
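A star schema can be sketched with an in-memory SQLite database: one fact table holds the measurements and points at a dimension table that describes them. The table and column names below are hypothetical, and SQLite only stands in for a real warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive attributes.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT
    );
    -- Fact table: numeric measures plus keys into the dimensions.
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 20.0);
""")

# A typical analytic query: join the fact table to a dimension and aggregate.
revenue_by_category = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

print(revenue_by_category)  # [('books', 15.0), ('games', 20.0)]
```

A snowflake schema would further split `dim_product` into normalized sub-dimensions, for example a separate category table.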
Data Lake
Stores a vast amount of raw data in its native format (structured; semi-structured; unstructured). Data is loaded as-is without a predefined schema; the schema is applied when the data is read (schema-on-read). Supports batch; real-time and stream processing for data transformation and exploration.
Data Lakehouse
A hybrid architecture combining features of data lakes and data warehouses. Supports structured and unstructured data; schema-on-write and schema-on-read; detailed analytics and machine learning. Built on cloud/distributed architectures.
Data Mesh
An organizational paradigm for decentralized data management. Individual teams own their data and offer it as data products to others in the organization. Promotes domain ownership; federated governance; and self-service data infrastructure.
ETL (Extract; Transform; Load)
A process in data integration that extracts data from sources; transforms it into a desired format; and loads it into a target data repository. Typically used for data warehouses. Contrast with ELT (typically used for data lakes).
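A minimal ETL sketch, assuming a tiny inline CSV as the source and an in-memory SQLite table as the target. The function names, column names, and cleaning rules are illustrative assumptions. The point is the ordering: data is cleaned before it reaches the target.

```python
import csv
import io
import sqlite3

# Made-up source data with one malformed row and inconsistent casing.
RAW_CSV = "order_id,amount_usd,country\n1, 10.50 ,us\n2,not-a-number,fr\n3,7.25,FR\n"


def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with invalid amounts and normalize country codes."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount_usd"])
        except ValueError:
            continue  # skip rows that do not fit the target schema
        clean.append((int(row["order_id"]), amount, row["country"].strip().upper()))
    return clean


def load(rows, conn):
    """Load: write the already-clean rows into the target table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount_usd REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)  # [(1, 10.5, 'US'), (3, 7.25, 'FR')]
```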
ELT (Extract; Load; Transform)
A process that extracts data from sources; loads it into a data repository in its raw format; and transforms it later as needed. Used for data lakes.
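For comparison, a minimal ELT sketch: the raw records are loaded untouched first, and structure is imposed only when a question is asked of them. The event payloads, table name, and `clicks_per_user` helper are hypothetical, and SQLite stands in for a data lake store.

```python
import json
import sqlite3

# Made-up raw events; note the records do not all share the same fields.
RAW_EVENTS = [
    '{"user": "ana", "action": "click", "ms": 120}',
    '{"user": "ben", "action": "view"}',
    '{"user": "ana", "action": "click", "ms": 80}',
]

conn = sqlite3.connect(":memory:")

# Load: store each payload exactly as received, with no schema enforced.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?)", [(e,) for e in RAW_EVENTS])


def clicks_per_user(conn):
    """Transform (later, on demand): parse payloads and count clicks per user."""
    counts = {}
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        if event.get("action") == "click":
            counts[event["user"]] = counts.get(event["user"], 0) + 1
    return counts


click_counts = clicks_per_user(conn)
print(click_counts)  # {'ana': 2}
```

Because the raw payloads stay in place, a different transformation can be written later without re-extracting from the source.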