Core data concepts (15% - 20%) Flashcards
What is data?
A collection of facts such as numbers, descriptions and observations used in decision making. Values used to record information.
What are the three types of data?
Structured (tabular, csv), semi-structured (JSON) and unstructured.
What is structured data?
Structured data is typically tabular data, represented by columns and rows in a database.
What are relational databases?
Databases that hold tables.
What is stream processing?
Processing data as it arrives.
What is batch processing?
The processing of groups (or batches) of data.
What are the differences between batch and stream processing?
Batch processing can process all the data in a dataset. Stream processing only processes the newest piece of data, or data within a rolling window (30 seconds).
Batch processing is more suitable for processing large datasets. Stream processing is for micro batches containing only a few records.
Batch processing can have a take a few hours to complete, so its latency is pretty high. Stream processing is relatively instantaneous with latency in the milliseconds or seconds.
Batch processing is typically used for complex analytic workloads. Stream processing is used for simple response functions, aggregates and calculations like rolling averages.
How are batch processes run? Give an example.
A batch process is usually triggered by an event such as a certain amount of data collected, a scheduled time (e.g. 3pm every Tuesday) or some other trigger.
What is a transactional system?
A system that records transactions. (E.g. an online store has a transactional system that records orders and payments.)
What is a transaction?
A small, discrete, unit of work.
What is data ingestion?
The processing of capturing raw data for immediate use or storage in a database.
Depending on the source, the data may arrive as a a stream or in batches.
What is data processing?
The cleaning and conversion of raw data into a more useful format (tables, graphs, documents, and so on).
What is data transformation?
The process of transforming data from one format to another.
What are seven of the common tools Database Administrators use?
SSMS (SQL Server Management Studio) pgAdmin for PostgreSQL systems MySQL Workbench for MySQL databases. Azure Data Studio for a number of database engines. The Azure portal The Azure CLI sqlcmd
What are some of the common tools a Data Engineer uses?
SQL
Azure Synapse Studio
Azure CLI
What are some of the tools Data Analysts use?
Power BI, which includes: Power BI Desktop, Power BI Services, Power BI Embedded and Power BI Report.
What does a Database Administrator do?
A database administrator (DBA) manages, maintains and optimises databases, manages security, grants users and apps access, backups the database and monitors database performance.
What is an entity?
An entity is a thing about which information needs to be known or held.
What are the main characteristics of a relational database?
1) All data is tabular
2) All rows in the same table have the same set of columns
3) A table can contain any number of rows.
4) A primary key uniquely identifies each row in a table. No two rows can share the same PK.
5) A foreign key references rows in another, related table. For every value in the FK column, there should be a row with the same value is the corresponding PK column in the other table.
What is normalisation?
Normalisation is the process of organising data in a database.
This includes creating tables and establishing relationships between those tables according to rules designed to protect the data and making the database more flexible by eliminating redundancy and inconsistent dependency.