Course1-M3 Flashcards
Data Platforms, Data Stores, and Security
What are some of the tools used for data ingestion, supporting both batch and streaming modes?
- Google Cloud DataFlow
- IBM Streams
- IBM Streaming Analytics on Cloud
- Amazon Kinesis
- Apache Kafka
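The cards above distinguish batch and streaming ingestion modes. As a minimal, stdlib-only sketch (not tied to any of the tools listed, and with invented event names), the difference is whether records are delivered in accumulated groups or one at a time:

```python
def batch_ingest(source_records, batch_size=3):
    """Collect records into fixed-size batches before handing them on."""
    batch = []
    for record in source_records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # deliver a whole batch at once
            batch = []
    if batch:                    # flush the final partial batch
        yield batch

def stream_ingest(source_records):
    """Hand each record on as soon as it arrives."""
    for record in source_records:
        yield record             # deliver one record at a time

events = ["e1", "e2", "e3", "e4", "e5"]
print(list(batch_ingest(events)))   # [['e1', 'e2', 'e3'], ['e4', 'e5']]
print(list(stream_ingest(events)))  # ['e1', 'e2', 'e3', 'e4', 'e5']
```

Real tools add buffering, delivery guarantees, and partitioning on top of this basic distinction.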
The Storage and Integration layer in a data platform needs to: store data for processing and long-term use; transform and merge extracted data, either logically or physically; and make data available for processing in both streaming and batch modes. The storage layer needs to be reliable, scalable, high-performing, and cost-efficient. What are some of the popular relational databases?
- IBM DB2
- Microsoft SQL Server
- MySQL
- Oracle Database
- PostgreSQL
Cloud-based relational databases, also referred to as Database-as-a-Service, have gained great popularity over recent years. What are some examples?
- IBM DB2 on Cloud
- Amazon Relational Database Service (RDS)
- Google Cloud SQL
- SQL Azure

What are some of the NoSQL, or non-relational, database systems on the cloud?
- IBM Cloudant
- Redis
- MongoDB
- Cassandra
- Neo4J
What are some of the tools for data integration?
- IBM’s Cloud Pak for Data and Cloud Pak for Integration
- Talend’s Data Fabric
- Open Studio

Open-source tools such as __?__ are also very popular integration tools.
- Dell Boomi
- SnapLogic

There are a number of vendors offering cloud-based Integration Platform as a Service (or iPaaS). For example:
- Adeptia Integration Suite
- Google Cloud’s Cloud Data Fusion
- IBM’s Application Integration Suite on Cloud
- Informatica’s Integration Cloud
Once the data has been ingested, stored, and integrated, it needs to be processed. Data validations, transformations, and applying business logic to the data are some of the things that need to happen in this layer.
There are a host of tools available for performing these transformations on data, selected based on the data's size and structure and the specific capabilities of each tool. What are some of them?
- Spreadsheets
- OpenRefine
- Google DataPrep
- Watson Studio Refinery
- Trifacta Wrangler
- Python and R also offer several libraries and packages that are explicitly created for processing data.
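The processing-layer tasks named above (validation, transformation, applying business logic) can be sketched in plain Python. The field names, currencies, fixed EUR rate, and the "high value" rule below are illustrative assumptions, not from the course:

```python
def validate(record):
    """Keep only records with a positive amount and a known currency."""
    return record.get("amount", 0) > 0 and record.get("currency") in {"USD", "EUR"}

def transform(record):
    """Normalize amounts to USD (illustrative fixed rate for EUR)."""
    rate = 1.1 if record["currency"] == "EUR" else 1.0
    return {"id": record["id"], "amount_usd": round(record["amount"] * rate, 2)}

def apply_business_logic(record):
    """Example business rule: flag high-value transactions."""
    record["high_value"] = record["amount_usd"] >= 100
    return record

raw = [
    {"id": 1, "amount": 120.0, "currency": "USD"},
    {"id": 2, "amount": -5.0, "currency": "USD"},   # dropped by validation
    {"id": 3, "amount": 100.0, "currency": "EUR"},
]
processed = [apply_business_logic(transform(r)) for r in raw if validate(r)]
print(processed)
```

Libraries like pandas (Python) or dplyr (R) perform the same validate/transform steps at scale.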
Are storage and processing always performed in separate layers?
It’s important to note that storage and processing may not always be performed in separate layers. For example, in relational databases, storage and processing can occur in the same layer. In Big Data systems, data can first be stored in the Hadoop Distributed File System (HDFS) and then processed in a data processing engine like Spark. The data processing layer can also precede the data storage layer, with transformations applied before the data is loaded, or stored, in the database.
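The two orderings (transform before load vs. load before transform) can be sketched as a hypothetical ETL/ELT pair. The `extract` data, the `clean` step, and the list standing in for the storage layer are all invented for illustration:

```python
def extract():
    """Stand-in for pulling raw records from a source system."""
    return [" Alice ", "BOB", " carol"]

def clean(names):
    """Stand-in transformation: trim whitespace, normalize case."""
    return [n.strip().title() for n in names]

def run_etl(store):
    """ETL: transform first, then load into the store."""
    store.extend(clean(extract()))
    return store

def run_elt(store):
    """ELT: load raw data first, transform inside the storage layer later."""
    store.extend(extract())       # raw load
    store[:] = clean(store)       # transformation applied after storage
    return store

print(run_etl([]))  # ['Alice', 'Bob', 'Carol']
print(run_elt([]))  # ['Alice', 'Bob', 'Carol']
```

Both orderings yield the same cleaned data; what differs is where the transformation work happens.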
Note:
The Analysis and User Interface Layer delivers processed data to data consumers. Data consumers can include:
- Business Intelligence Analysts and business stakeholders, who consume this data through interactive visual representations such as dashboards and analytical reports.
- Data Scientists and Data Analysts, who further process this data for specific use cases.
- Other applications and services that may need this data as input for further use.

The Analysis and UI Layer needs to support:
- Querying tools and programming languages: for example, SQL for querying relational databases, and SQL-like querying tools for non-relational databases, such as CQL for Cassandra.
- Programming languages such as Python, R, and Java.
- APIs that can be used to run reports on data for both online and offline processing.
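As a small illustration of SQL-based querying in the analysis layer, the following uses Python's stdlib `sqlite3` with an invented `sales` table to run a dashboard-style aggregate report:

```python
import sqlite3

# In-memory relational store standing in for a reporting database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("south", 250.0), ("north", 50.0)],
)

# A dashboard-style aggregate expressed in SQL.
report = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(report)  # [('north', 150.0), ('south', 250.0)]
```

Tools such as CQL for Cassandra offer a similar query syntax over non-relational stores.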
Overlaying the Data Ingestion, Data Storage and Integration, and Data Processing layers is the __?__ layer with the Extract, Transform, and Load tools. This layer is responsible for implementing and maintaining a continuously flowing data pipeline.
Data Pipeline
There are a number of data pipeline solutions available, most popular among them being ____ and ____.
- Apache Airflow
- DataFlow
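Pipeline tools such as Apache Airflow model the pipeline as a dependency graph of tasks. This is not Airflow's API; it is a minimal stdlib sketch of running tasks in dependency order, over a hypothetical four-task ETL graph:

```python
# Hypothetical task graph: each task lists the tasks it depends on.
tasks = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load"],
}

def run_order(graph):
    """Topologically order tasks so each runs after its dependencies."""
    ordered, seen = [], set()
    def visit(name):
        if name in seen:
            return
        for dep in graph[name]:
            visit(dep)          # schedule dependencies first
        seen.add(name)
        ordered.append(name)
    for name in graph:
        visit(name)
    return ordered

print(run_order(tasks))  # ['extract', 'transform', 'load', 'report']
```

A real pipeline tool adds scheduling, retries, and monitoring on top of this ordering.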
Some of the primary considerations for designing a data store are: ?
- The type of data you want to store
- Volume of data
- Intended use of data
- Storage considerations
- Privacy, security, and governance needs of your organization
Intended use of data: The number of transactions, frequency of updates, type of operations performed on the data, response time, and backup and recovery requirements all need to be provisioned for in the design of a data store.
Storage considerations: Whether you need to use the data store for recording high-volume transactional data or executing complex queries for analytical purposes, your processing and storage needs will differ.
Non-relational databases, based on the type of data and how you want to query the data, are of four different types: ?
- key-value
- document
- column
- graph-based
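The four types can be illustrated with plain Python structures; the keys and records below are invented examples of how each model organizes data:

```python
# Key-value: an opaque value looked up by key
kv_store = {"session:42": "user=amy"}

# Document: self-describing nested records looked up by id
doc_store = {"u1": {"name": "Amy", "orders": [101, 102]}}

# Column-oriented: values for one column stored together
col_store = {"name": ["Amy", "Ben"], "city": ["Oslo", "Lima"]}

# Graph: nodes plus explicit relationships between them
graph_store = {"nodes": {"Amy", "Ben"}, "edges": [("Amy", "follows", "Ben")]}

print(kv_store["session:42"])                               # fast point lookup
print(doc_store["u1"]["orders"])                            # query inside a document
print(col_store["city"])                                    # scan one column cheaply
print([e for e in graph_store["edges"] if e[0] == "Amy"])   # traverse relationships
```

Redis, MongoDB, Cassandra, and Neo4J from the earlier list map roughly onto these four models, in that order.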
Transactional systems, that is systems used for capturing high-volume transactions, need to be designed for ____, ____ and ____ operations.
Analytical systems, on the other hand, need complex queries to be applied to large amounts of historical data aggregated from transactional systems. They need faster ____ to complex queries.
high-speed read, write, and update
response times
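The contrast can be sketched with a toy ledger (account names and amounts invented): `write_txn` is the fast point update a transactional system optimizes for, while the aggregate at the end is the kind of query over accumulated history an analytical system optimizes for:

```python
from statistics import mean

ledger = {}    # account -> current balance (the transactional view)
history = []   # every individual transaction (the analytical view)

def write_txn(account, delta):
    """OLTP-style point operation: touch one record, return fast."""
    ledger[account] = ledger.get(account, 0) + delta
    history.append(delta)

# High-volume point writes
for _ in range(3):
    write_txn("acct-1", 10)
write_txn("acct-2", 5)

# OLAP-style query: an aggregate over all historical transactions
print(ledger["acct-1"])   # 30
print(mean(history))      # 8.75
```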
Normalization of the database is another important consideration at the design stage. Normalization is ____. Done right, it helps in the optimal use of storage space, makes database maintenance easier, and provides faster access to data. Normalization is important for systems that handle ____ data. But for systems designed for handling analytical queries, normalization can lead to ____ issues.
the process of efficiently organizing data in a database.
transactional
performance
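A toy sketch of normalization, with invented customer and order records: the customer details repeated on every order row are factored out into their own table, saving storage and easing maintenance, at the cost of a join on read (the performance issue noted above for analytical queries):

```python
# Denormalized: customer details repeated on every order row
orders_flat = [
    {"order": 1, "cust_id": "c1", "cust_name": "Amy", "cust_city": "Oslo"},
    {"order": 2, "cust_id": "c1", "cust_name": "Amy", "cust_city": "Oslo"},
    {"order": 3, "cust_id": "c2", "cust_name": "Ben", "cust_city": "Lima"},
]

# Normalized: each customer stored once, orders reference it by key
customers = {r["cust_id"]: {"name": r["cust_name"], "city": r["cust_city"]}
             for r in orders_flat}
orders = [{"order": r["order"], "cust_id": r["cust_id"]} for r in orders_flat]

# Reads now need a join back to the customers table
joined = [{**o, **customers[o["cust_id"]]} for o in orders]
print(len(customers))     # 2 customer rows instead of 3 repeated copies
print(joined[0]["city"])  # 'Oslo'
```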
The architecture of a data platform can be seen as a set of layers, or functional components, each one performing a set of specific tasks. These layers include:
- Data Ingestion or Data Collection Layer, responsible for bringing data from source systems into the data platform.
- Data Storage and Integration Layer, responsible for storing and merging extracted data.
- Data Processing Layer, responsible for validating, transforming, and applying business rules to data.
- Analysis and User Interface Layer, responsible for delivering processed data to data consumers.
- Data Pipeline Layer, responsible for implementing and maintaining a continuously flowing data pipeline.