Basics Flashcards

1
Q

What is data engineering?

A
  1. Gather data from different sources
  2. Optimize databases
  3. Clean / Transform data

ETL process, plus:

Maintaining large-scale data processing systems that prepare structured and unstructured data for analytical modeling.
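The three steps above (gather, clean/transform, load) can be sketched as a tiny ETL pipeline. This is only an illustrative sketch using the Python standard library; the CSV sample, the `sales` table, and the function names are assumptions made up for this example, not part of the card.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: stray whitespace and one missing amount.
RAW_CSV = """id,name,amount
1, Alice ,10.5
2,Bob,
3,Carol,7
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, cast types, drop rows with no amount."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        clean.append((int(row["id"]), row["name"].strip(), float(amount)))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(text, conn):
    """Run extract, transform, load, then return what was stored."""
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT id, name, amount FROM sales ORDER BY id"
    ).fetchall()
```

Calling `run_etl(RAW_CSV, sqlite3.connect(":memory:"))` stores only the two complete rows; the row for Bob is dropped because its amount is missing.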

2
Q

The task of the data engineer

4 + bonus

A
  • develop scalable data architecture
  • streamline data acquisition
  • set up processes to bring data together
  • clean corrupt data

Bonus: also well versed in cloud technologies

3
Q

Data Integration

A

Ingests, transforms, integrates, and delivers structured and unstructured data to a data warehouse platform.

Combines data from disparate sources.
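As a rough sketch of combining disparate sources, the example below merges records from two hypothetical systems that use different field names for the same customer key. The source names (`crm_rows`, `billing_rows`) and the field names are invented for illustration.

```python
# Two hypothetical sources that describe the same customers differently.
crm_rows = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
]
billing_rows = [
    {"cust": 1, "total": 30.0},
    {"cust": 3, "total": 12.5},
]

def integrate(crm, billing):
    """Merge both sources into one record per customer id.

    Fields that a source does not provide are left as None.
    """
    merged = {}
    for row in crm:
        merged[row["customer_id"]] = {
            "customer_id": row["customer_id"],
            "name": row["name"],
            "total": None,
        }
    for row in billing:
        record = merged.setdefault(
            row["cust"],
            {"customer_id": row["cust"], "name": None, "total": None},
        )
        record["total"] = row["total"]
    # Return records in a stable order, ready to load into a warehouse table.
    return [merged[key] for key in sorted(merged)]
```

Customer 2 appears only in the CRM source and customer 3 only in billing, so each keeps a `None` for the field the other source would have supplied.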

4
Q

Data Engineer Tools

and examples

A
  1. Databases (MySQL, PostgreSQL, etc.)
  2. Parallel Processing Tools (Spark, Hive, etc.)
  3. Scheduling Tools (Airflow, cron (the Unix/Linux job scheduler), etc.)
5
Q

Types of NoSQL Databases

6 types, definitions, and use cases

A

Key-Value: the simplest type of database, where each data item is stored as a key-value pair and the key uniquely identifies the record. Also known as a dictionary or hash table (by analogy with the programming data structures); used for simple queries or caching. Use cases include gaming, ad tech, and IoT. Example: DynamoDB.
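The key-value model maps closely onto a Python dictionary. The class below is a minimal in-process sketch of the idea, not how DynamoDB or any real store is implemented; `KeyValueStore` and its method names are made up for this example.

```python
class KeyValueStore:
    """Toy key-value store: every record is looked up by its unique key."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # Writing to an existing key overwrites the previous value.
        self._items[key] = value

    def get(self, key, default=None):
        # Lookups go by key only; there is no query on the value.
        return self._items.get(key, default)

    def delete(self, key):
        # Deleting a missing key is a no-op.
        self._items.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "alice", "score": 1200})
```

A typical caching use looks up `store.get("session:42")` first and falls back to a slower source only when the result is `None`.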

Document: data is stored as an object or JSON-like document containing pairs of fields and values. Uses the same object and document model as application code, which makes it easier for developers and highly flexible. Use cases: catalogs, user profiles, and content management systems. Example: MongoDB.
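To make the document model concrete, the sketch below keeps a small product catalog as JSON-like documents whose fields differ from one document to the next. The `products` data and the `find` helper are hypothetical and only imitate the general idea of querying by field; they are not MongoDB's API.

```python
import json

# Documents in one collection do not have to share the same fields.
products = [
    {"_id": 1, "name": "Laptop", "specs": {"ram_gb": 16}, "tags": ["electronics"]},
    {"_id": 2, "name": "Novel", "author": "A. Writer"},
]

def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

# The same structure the application uses serializes directly to JSON.
as_json = json.dumps(products)
```

For instance, `find(products, name="Novel")` returns the second document, while a query on a field that no document has simply returns an empty list.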

Wide-Column Store: two-dimensional key-value storage in which the names and formats of columns can vary from row to row. Used for IoT, user profile data, geographic data, reporting systems, and time-series data. Joins are not supported, so each query is typically backed by its own table, which leads to a lot of data duplication.
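The "two-dimensional key-value" idea can be pictured as a dictionary of dictionaries: a row key selects a row, and a column name selects a value within it. The `WideColumnTable` class below is a hypothetical illustration of that shape only, not the storage layout of any real wide-column database.

```python
from collections import defaultdict

class WideColumnTable:
    """Toy wide-column table: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Each row can hold its own set of columns.
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def get_row(self, row_key):
        # Return a copy so callers cannot mutate the stored row.
        return dict(self._rows.get(row_key, {}))


readings = WideColumnTable()
readings.put("sensor-1", "temp_c", 21.5)
readings.put("sensor-1", "humidity", 40)
readings.put("sensor-2", "temp_c", 19.0)
readings.put("sensor-2", "battery_pct", 87)  # column that sensor-1 lacks
```

Here `sensor-1` and `sensor-2` share only the `temp_c` column, which is the row-to-row variation the card describes.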

Graph: stores data as nodes and edges. Used where you need to traverse relationships and look for patterns: social networking, recommendation engines, fraud detection, knowledge graphs. Examples: Neo4j and Giraph (Apache Giraph is a graph processing framework rather than a database).
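Relationship traversal can be sketched with an adjacency list and a breadth-first search. The `follows` graph and the "recommend accounts two hops away" rule below are assumptions chosen for illustration; a real graph database would express this in its own query language.

```python
from collections import deque

# Hypothetical social graph: node -> list of nodes it follows (edges).
follows = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan", "eve"],
    "dan": [],
    "eve": [],
}

def within_hops(graph, start, max_hops):
    """Breadth-first search: map each reachable node to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {node: hops for node, hops in distance.items() if node != start}

def recommend(graph, user):
    """Suggest accounts exactly two hops away (friends of friends)."""
    hops = within_hops(graph, user, 2)
    return sorted(node for node, d in hops.items() if d == 2)
```

With this data, `recommend(follows, "ann")` suggests `dan` and `eve`, the accounts followed by the people `ann` already follows.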

In-Memory: relies primarily on main memory for storage, making it faster than disk-based databases. Used where application response time is critical: gaming, telecom, real-time analytics. Example: Amazon ElastiCache.

Search: built for full-text search and for near-real-time visualization and analytics of machine-generated data. Example: Amazon Elasticsearch Service (Amazon ES, since renamed Amazon OpenSearch Service), used by Expedia.
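Full-text search is commonly built on an inverted index, which maps each word to the documents that contain it. The sketch below shows that idea in plain Python with made-up sample documents; it leaves out the ranking, stemming, and distribution that real search engines add.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Build an inverted index: word -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        matches &= index.get(token, set())
    return sorted(matches)


# Hypothetical sample documents.
docs = {
    1: "Cheap flights to Paris",
    2: "Paris hotel deals",
    3: "Cheap hotel in Rome",
}
index = build_index(docs)
```

A query such as `search(index, "cheap hotel")` intersects the document sets for both words, so only document 3 matches.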
