Data Repositories and File Formats Flashcards
Data Repositories
- Databases.
- Relational Databases.
- Non-Relational Databases (NoSQL).
- Data Warehouses.
- Data Marts.
- Data Lakes.
- Big Data Stores.
- Cloud-based Relational Databases.
File Formats
- Delimited Text file format, or CSV.
- Microsoft Excel Open XML Spreadsheet, or XLSX.
- Extensible Markup Language, or XML.
- Portable Document Format, or PDF.
- JavaScript Object Notation, or JSON.
Databases
It is an organized collection of data used for storing and managing information for specific purposes (transactions, queries, or reports). It is controlled by a database management system (DBMS).
Data Warehouse
It is a central repository that merges information coming from disparate sources. After gathering the data, it consolidates it through the extract, transform, and load process (the “ETL process”) into one comprehensive database for analytics and business intelligence.
Data Marts
It is a subsection of the data warehouse. There can be multiple data marts in one data warehouse, each holding specific data that users can work with depending on their business function, purpose, or community.
Data Lakes
It is a storage repository that can store large amounts of structured, semi-structured, and unstructured data in its native (raw) format.
Big Data Stores
It is a storage system designed to efficiently store, retrieve, and analyze massive amounts of differently structured data that are not held in traditional relational databases.
ETL process
Extract, Transform, and Load.
- Extracts data from different data sources.
- Transforms the data into a clean and usable state.
- Loads the data into the data repository (in this case, the data warehouse).
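The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a production ETL tool: the sample data in RAW_CSV, the sales table, and the function names are all made up for this card, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data; in practice it would come from files, APIs, or databases.
RAW_CSV = """name,amount
alice, 10.5
BOB,3
carol,
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, convert amounts to numbers, drop incomplete rows."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount:
            continue  # skip rows with a missing amount
        cleaned.append((row["name"].strip().title(), float(amount)))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target repository."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
```

Here the row for carol is dropped during the transform step because its amount is missing, so only clean rows reach the repository.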
Data Pipeline
It is a term that encompasses moving data from one system to another, including the ETL process. Unlike ETL, a data pipeline does not have to transform the data; it may move the data unchanged or transform it only after loading.
Delimited Text (CSV)
It is a file format that stores data in rows and columns, with the values in each row separated by a delimiter character.
Delimiter Character
It can be a comma “,”, tab “\t”, colon “:”, vertical bar “|”, or space “ ”. In the CSV format the delimiter is a comma “,”, and in the TSV format the delimiter is a tab “\t”.
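As a small illustration of how the delimiter changes the way the same table is read, the sketch below parses comma-, tab-, and vertical-bar-delimited text with Python's standard csv module. The sample data and the helper name parse are made up for this card.

```python
import csv
import io

# The same two-column table written with three different delimiters.
csv_text = "id,city\n1,Lima\n2,Quito\n"
tsv_text = "id\tcity\n1\tLima\n2\tQuito\n"
pipe_text = "id|city\n1|Lima\n2|Quito\n"


def parse(text, delimiter):
    """Split delimited text into a list of rows, each row a list of values."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


csv_rows = parse(csv_text, ",")
tsv_rows = parse(tsv_text, "\t")
pipe_rows = parse(pipe_text, "|")
```

All three calls yield the same rows; only the delimiter passed to the reader differs.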
Extensible Markup Language (XML)
It is a markup language and file format for storing, transmitting, and reconstructing arbitrary data (data of any kind).
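A minimal sketch of storing data as XML and reconstructing it, using Python's standard xml.etree.ElementTree module. The library, book, and title element names are invented for this example.

```python
import xml.etree.ElementTree as ET

# Store: build a small XML tree from ordinary Python data.
root = ET.Element("library")
for book_id, title in [("1", "Dune"), ("2", "Emma")]:
    book = ET.SubElement(root, "book", id=book_id)
    ET.SubElement(book, "title").text = title

# Transmit: serialize the tree to text that can be saved or sent.
xml_text = ET.tostring(root, encoding="unicode")

# Reconstruct: parse the text back and recover the original data.
parsed = ET.fromstring(xml_text)
books = {b.get("id"): b.findtext("title") for b in parsed.findall("book")}
```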
JavaScript Object Notation (JSON)
It is a text-based format for storing and exchanging data that is both human-readable and machine-parsable.
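To illustrate the "human-readable and machine-parsable" point, the sketch below writes a Python dictionary as JSON text and reads it back with the standard json module. The record contents are made up for this card.

```python
import json

# A hypothetical record with a string, a list, and a boolean.
record = {"name": "Ada", "skills": ["math", "code"], "active": True}

# Serialize to indented text that a person can read directly.
text = json.dumps(record, indent=2)

# Parse the text back into Python data.
restored = json.loads(text)
```

The text form can be opened in any editor, and the parsed result equals the original record.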
Relational Databases
It is a database that stores and provides access to data organized into a table structure (rows and columns), where the tables can be linked, or related, to other tables that share common information.
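A small sketch of two related tables, using Python's built-in sqlite3 module with an in-memory database. The customers and orders tables and their contents are hypothetical; the common information linking them is the customer id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- link to customers
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, 40.0);
    """
)

# Relate the two tables through the shared customer id.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

The join combines each order with its customer, so the totals can be reported per customer name even though names and amounts live in different tables.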
Non-Relational Databases
It is a database design that provides flexible schemas for the storage and retrieval of structured, semi-structured, and unstructured data.
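The idea of a flexible schema can be sketched with plain Python dictionaries. This is not a real NoSQL database, only an illustration: the documents and the find helper are made up, and each document is free to carry different fields.

```python
# Each document has its own shape; no shared column layout is required.
documents = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Grace", "tags": ["navy", "compiler"]},
    {"_id": 3, "title": "Untitled note", "body": "free-form text"},
]


def find(docs, field):
    """Return the documents that contain the given field."""
    return [doc for doc in docs if field in doc]


named_ids = [doc["_id"] for doc in find(documents, "name")]
```

In a relational table, every row would need the same columns; here the third document simply omits the name field and is skipped by the query.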