Basics Flashcards

Question 1

Q

What is data engineering?

Answer

A

Gather data from different sources
Optimize databases
Clean / Transform data

The ETL process, plus:

Builds and maintains large-scale data processing systems that prepare structured and unstructured data for analytical modeling
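The ETL (extract, transform, load) steps named above can be sketched in a few lines of Python. This is a minimal illustration only: the sample CSV text, the field names, and the list standing in for a warehouse table are all assumptions made for the example, not part of any particular tool.

```python
import csv
import io

# Hypothetical raw input; in practice this would come from a file, API, or database.
RAW_CSV = """user_id,country,amount
1,us,10.50
2,DE,
3,us,4.25
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount, normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append({
            "user_id": int(row["user_id"]),
            "country": row["country"].upper(),
            "amount": float(row["amount"]),
        })
    return cleaned

def load(rows, target):
    """Load: append cleaned rows to the target (a list standing in for a warehouse table)."""
    target.extend(rows)
    return len(rows)

warehouse_table = []
loaded = load(transform(extract(RAW_CSV)), warehouse_table)
print(loaded, warehouse_table)
```

Keeping the three stages as separate functions makes each one easy to test and replace on its own, which is roughly how larger pipelines are organized.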

Question 2

Q

Tasks of a data engineer

4 tasks + 1 bonus

Answer

A

Develop scalable data architecture
Streamline data acquisition
Set up processes to bring data together
Clean corrupt data

Bonus: also well versed in cloud technologies

Question 3

Q

Data Integration

Answer

A

Ingests, transforms, integrates, and delivers structured and unstructured data to a data warehouse platform.

Combines data from disparate sources.
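As a rough sketch of combining disparate sources, the example below joins rows from a relational-style table with semi-structured JSON events on a shared customer id. The source data, the field names (`customer_id`, `cid`), and the `integrate` helper are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical source 1: rows from a relational "customers" table.
crm_rows = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]

# Hypothetical source 2: semi-structured JSON events from a web application.
events_json = (
    '[{"cid": 1, "page": "/home"},'
    ' {"cid": 1, "page": "/pricing"},'
    ' {"cid": 3, "page": "/docs"}]'
)

def integrate(customers, raw_events):
    """Join both sources on the customer id into one unified record per customer."""
    unified = {c["customer_id"]: {"name": c["name"], "pages": []} for c in customers}
    for event in json.loads(raw_events):
        # Events for customers missing from source 1 are kept with an unknown name.
        record = unified.setdefault(event["cid"], {"name": None, "pages": []})
        record["pages"].append(event["page"])
    return unified

unified = integrate(crm_rows, events_json)
print(unified)
```

Note that the two sources use different names for the same key; reconciling such mismatches is a large part of integration work.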

Question 4

Q

Data Engineer Tools

and examples

Answer

A

Databases (MySQL, PostgreSQL, etc.)
Parallel Processing Tools (Spark, Hive, etc.)
Scheduling Tools (Airflow, cron [the Unix/Linux job scheduler], etc.)

Question 5

Q

Types of NoSQL Databases

6 types, definitions, and use cases

Answer

A

Key-Value: the simplest type of database, where each data item is stored as a key and value pair and the key uniquely identifies the record. The same structure is known in programming as a dictionary or hash table. Used for simple queries or caching. Use cases: gaming, ad tech, and IoT. Example: Amazon DynamoDB.
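A toy key-value store can be sketched on top of a Python dict, which is itself a hash table. The class name, the `session:42` key format, and the stored values are assumptions for illustration; real key-value databases add persistence, replication, and partitioning on top of this idea.

```python
class KeyValueStore:
    """A toy in-process key-value store backed by a dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The key uniquely identifies the record, so writing it again overwrites.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", {"user": "ada", "score": 1200})
store.put("session:42", {"user": "ada", "score": 1350})  # same key, new value
print(store.get("session:42"))
```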

Document: data is stored as an object or JSON-like document containing pairs of fields and values. It uses the same document model that application code uses for its objects, which makes it easier for developers and highly flexible. Use cases: catalogs, user profiles, and content management systems. Example: MongoDB.
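The document model can be illustrated with plain Python dicts: two documents in the same collection carry different fields, and each maps directly to JSON. The `products` collection, its fields, and the `find` helper are hypothetical; this is not the API of any particular document database.

```python
import json

# Two documents in the same hypothetical "products" collection with different fields.
products = [
    {"_id": 1, "name": "Laptop", "specs": {"ram_gb": 16}, "tags": ["work"]},
    {"_id": 2, "name": "Novel", "author": "U. Example"},
]

def find(collection, **criteria):
    """Return documents whose top-level fields match all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

matches = find(products, name="Laptop")
serialized = json.dumps(products[0])  # a document serializes directly to JSON
print(matches, serialized)
```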

Wide-Column Store: two-dimensional key-value storage, where the names and formats of columns can vary from row to row. Use cases: IoT, user profile data, geographic data, reporting systems, and time-series data. Joins are not supported, so each query is typically backed by its own table, which leads to a lot of data duplication.
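The "two-dimensional key-value" idea can be sketched as a dict of dicts: a row key maps to a set of columns, and each row may hold different columns. The sensor names, timestamps, and helper functions below are made up for illustration and do not reflect how any specific wide-column database stores data on disk.

```python
from collections import defaultdict

# row key -> {column name -> value}; each row may carry different columns.
sensor_readings = defaultdict(dict)

def write(row_key, column, value):
    """Set one column of one row, creating the row if needed."""
    sensor_readings[row_key][column] = value

def read_row(row_key):
    """Return a row's (column, value) pairs sorted by column name."""
    return sorted(sensor_readings.get(row_key, {}).items())

# Time-series style row: one column per timestamp.
write("sensor-1", "2024-01-01T00:00", 21.5)
write("sensor-1", "2024-01-01T00:05", 21.7)
# A different row with a completely different column.
write("sensor-2", "firmware", "v2")

print(read_row("sensor-1"))
```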

Graph: stores data in nodes and edges. Used where you need to traverse relationships and look for patterns. Use cases: social networking, recommendation engines, fraud detection, and knowledge graphs. Examples: Neo4j and Giraph (Apache Giraph is a graph processing framework rather than a database).
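Traversing relationships can be sketched with an adjacency map and a breadth-first search. The user names, the edges, and the "friends of friends" suggestion rule are assumptions for illustration; graph databases provide query languages for this rather than hand-written traversals.

```python
from collections import deque

# Nodes are users; edges are undirected friendships (hypothetical data).
edges = [("ada", "bob"), ("bob", "cy"), ("cy", "dee"), ("ada", "eve")]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def within_hops(start, max_hops):
    """Breadth-first traversal: nodes reachable from start in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {n for n, d in distance.items() if d > 0}

# Friends of friends who are not already direct friends.
suggestions = within_hops("ada", 2) - graph["ada"]
print(suggestions)
```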

In-Memory: relies primarily on main memory for storage, making it faster than disk-based databases. Used where application response time is critical. Use cases: gaming, telecom, and real-time analytics. Example: Amazon ElastiCache.

Search: built to provide near-real-time visualizations and analytics of machine-generated data, as well as full-text search. Example: Amazon Elasticsearch Service (Amazon ES, since renamed Amazon OpenSearch Service), used by Expedia.
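Full-text search is commonly built on an inverted index, which maps each term to the documents that contain it. The sketch below is a simplified illustration with made-up log lines and AND-only queries; it is not how any particular search service is implemented.

```python
import re
from collections import defaultdict

# Hypothetical machine-generated log lines, keyed by document id.
logs = {
    1: "ERROR payment service timeout",
    2: "INFO payment accepted",
    3: "ERROR search service unavailable",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Inverted index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc_id, text in logs.items():
    for term in tokenize(text):
        index[term].add(doc_id)

def search(query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

print(search("error service"))
```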

(5 cards)