Basics Flashcards
What is data engineering?
- Gather data from different sources
- Optimize database
- Clean / Transform data
ETL (extract, transform, load) process, plus:
- Maintain large-scale data processing systems that prepare structured and unstructured data for analytical modeling
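Example (a minimal sketch of the ETL process, not a production pipeline): the snippet below extracts rows from CSV text, cleans and transforms them, and loads them into an in-memory SQLite table. The sample data, the `payments` table, and all function names are assumptions made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, API, or queue.
RAW_CSV = """user_id,country,amount
1,us,10.50
2,DE,
3,us,4.25
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and normalize country codes."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["user_id"]), row["country"].upper(), float(row["amount"]))
        )
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl():
    """Run all three steps against an in-memory database and return it."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(RAW_CSV)), conn)
    return conn


if __name__ == "__main__":
    conn = run_etl()
    print(conn.execute("SELECT user_id, country, amount FROM payments").fetchall())
```

Keeping extract, transform, and load as separate functions lets each step be tested or swapped out on its own.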
The task of the data engineer
4 + bonus
- develop scalable data architecture
- streamline data acquisition
- set up processes to bring data together
- clean corrupt data
- bonus: also well versed in cloud technologies
Data Integration
Ingests, transforms, integrates, and delivers structured and unstructured data to a data warehouse platform.
Combines data from disparate sources.
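Example (a rough sketch, assuming two made-up sources): combining data from disparate sources usually means mapping each source onto a shared key and schema. Here a hypothetical CRM export in CSV and a hypothetical billing payload in JSON are merged into one record per customer; the field names and the `integrate` function are illustrative only.

```python
import csv
import io
import json

# Hypothetical sources: a CSV export from a CRM and a JSON payload from a billing API.
CRM_CSV = "customer_id,name\n1,Ada\n2,Grace\n"
BILLING_JSON = '[{"customer": 1, "total": 120.0}, {"customer": 3, "total": 15.5}]'


def integrate(crm_csv, billing_json):
    """Merge both sources into one record per customer, keyed by customer id."""
    merged = {}
    for row in csv.DictReader(io.StringIO(crm_csv)):
        cid = int(row["customer_id"])
        merged[cid] = {"customer_id": cid, "name": row["name"], "total": None}
    for item in json.loads(billing_json):
        cid = item["customer"]
        # Customers known only to billing still get a record, with no name.
        record = merged.setdefault(
            cid, {"customer_id": cid, "name": None, "total": None}
        )
        record["total"] = item["total"]
    return [merged[cid] for cid in sorted(merged)]


if __name__ == "__main__":
    for record in integrate(CRM_CSV, BILLING_JSON):
        print(record)
```

Fields missing from one source are left as `None` rather than dropped, so gaps stay visible to whatever loads the result into the warehouse.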
Data Engineer Tools
and examples
- Databases (MySQL, PostgreSQL, etc.)
- Parallel Processing Tools (Spark, Hive, etc.)
- Scheduling Tools (Airflow, cron [Linux batch scheduler], etc.)
Types of NoSQL Databases
6 types, definitions, and use cases
Key-Value: the simplest type of database, where each data item is stored as a key-value pair and the key uniquely identifies the record. Comparable to a dictionary or hash table in programming. Used for simple queries and caching, with use cases such as gaming, ad tech, and IoT. Example: DynamoDB.
Document: data is stored as an object or JSON-like document containing pairs of fields and values. Uses the same document model as the objects in application code, which makes it easier for developers and highly flexible. Use cases: catalogs, user profiles, and content management systems. Example: MongoDB.
Wide-Column Store: two-dimensional key-value storage, where the names and formats of columns can vary from row to row. Used for IoT, user profile data, geographic data, reporting systems, and time-series data. Joins are not supported, so each query is backed by its own table, which leads to a lot of data duplication.
Graph: stores data as nodes and edges. Used where you need to traverse relationships and look for patterns: social networking, recommendation engines, fraud detection, knowledge graphs. Examples: Neo4j and Giraph.
In-Memory: relies primarily on main memory for storage, making it faster than other types of databases. Used where application response time is critical: gaming, telecom, real-time analytics. Example: Amazon ElastiCache.
Search: built for near-real-time visualization and analytics of machine-generated data, as well as full-text search use cases. Example: Amazon Elasticsearch Service (Amazon ES), used by Expedia.
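Example (a toy sketch in plain Python, not how any of the products above are implemented): the data models behind several of these database types can be compared side by side with ordinary dictionaries and lists. All keys, names, and values below are invented for illustration.

```python
from collections import defaultdict, deque

# Key-value: one opaque value per unique key.
kv_store = {"session:42": "logged_in"}

# Document: nested, JSON-like fields; documents need not share a schema.
documents = [
    {"_id": 1, "name": "Ada", "tags": ["admin"]},
    {"_id": 2, "name": "Grace", "address": {"city": "NYC"}},
]

# Wide-column: row key -> {column name -> value}; columns vary from row to row.
wide_rows = {
    "sensor-1": {"2024-01-01": 20.5, "2024-01-02": 21.0},
    "sensor-2": {"2024-01-01": 18.2},
}

# Graph: nodes plus edges, stored here as an adjacency list.
follows = {"ada": ["grace"], "grace": ["linus"], "linus": []}


def reachable(graph, start):
    """Breadth-first traversal: every node reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        for neighbor in graph[queue.popleft()]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}


def build_inverted_index(docs):
    """Search: map each lowercase word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


if __name__ == "__main__":
    print(kv_store["session:42"])
    print(sorted(reachable(follows, "ada")))
    index = build_inverted_index({1: "Cheap flights to Rome", 2: "Rome hotels"})
    print(sorted(index["rome"]))
```

The graph traversal and the inverted index show why those two types get their own engines: both answer questions (who is connected to whom, which documents contain a word) that a plain key lookup cannot.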