Course1-M2 Flashcards
The different types of Data Repositories include:
Databases:?
Data Warehouses:?
Data Marts:?
Data Lakes:?
Big Data Stores:?
- which can be relational or non-relational, each differing in its organizational principles, the types of data it can store, and the tools that can be used to query, organize, and retrieve data.
- that consolidate incoming data into one comprehensive storehouse.
- that are essentially sub-sections of a data warehouse, built to isolate data for a particular business function or use case.
- that serve as storage repositories for large amounts of structured, semi-structured, and unstructured data in their native format.
- that provide distributed computational and storage infrastructure to store, scale, and process very large data sets.
The ELT, or Extract, Load, and Transform, Process is a variation of the ETL Process. In this process, extracted data is loaded into the target system before the transformations are applied. This process is ideal for ____ and working with Big Data.
Data Lakes
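The load-before-transform ordering can be illustrated with a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the target system; the table names, the sample rows, and the `run_elt` helper are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical raw records, standing in for data extracted from a source system.
raw_orders = [
    ("1001", " alice ", "19.99"),
    ("1002", "BOB", "5.00"),
    ("1003", "alice", "7.50"),
]


def run_elt(rows):
    """Load raw rows unchanged, then transform them inside the target."""
    conn = sqlite3.connect(":memory:")  # stand-in for the target system

    # Load: store the extracted rows as-is, with no cleaning applied yet.
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

    # Transform: runs afterwards, using the target's own SQL engine.
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT LOWER(TRIM(customer)) AS customer,
               ROUND(SUM(CAST(amount AS REAL)), 2) AS total
        FROM raw_orders
        GROUP BY LOWER(TRIM(customer))
        """
    )
    result = conn.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


totals = run_elt(raw_orders)
print(totals)
```

In an ETL version of the same sketch, the trimming, lowercasing, and type casting would happen in Python before the insert, and only the cleaned rows would reach the target.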
What is a data pipeline?
Data Pipeline, sometimes used interchangeably with ETL and ELT, encompasses the entire journey of moving data from its source to a destination data lake or application, using the ETL or ELT process.
What is IBM Db2?
IBM Db2 is a popular relational database from IBM that is used for both OLTP and Data Warehousing workloads.
OLTP or Online Transaction Processing is a type of data processing that consists of executing a number of transactions occurring concurrently—online banking, shopping, order entry, or sending text messages, for example.
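As a rough illustration of a single transaction in this sense, the sketch below moves money between two accounts so that both updates succeed together or not at all. It assumes an in-memory SQLite database rather than a production OLTP system, runs the transactions one after another rather than concurrently, and the `accounts` table and `transfer` function are hypothetical.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False  # the debit above was rolled back


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)]
)
conn.commit()

ok = transfer(conn, "alice", "bob", 30)         # succeeds
rejected = transfer(conn, "bob", "alice", 500)  # fails and is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, rejected, balances)
```

The second call leaves both balances untouched, which is the behavior transaction processing relies on: many short, write-heavy operations, each either fully applied or fully undone.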
Hadoop is a collection of tools that provides ____ of big data. Hive is a ____ built on top of Hadoop. Spark is a distributed data analytics framework designed to perform ____.
distributed storage and processing
data warehouse for data query and analysis
complex data analytics in real-time
- Hadoop provides distributed storage and processing of large datasets across clusters of computers. One of its main components, the Hadoop Distributed File System, or HDFS, is a storage system for big data.
- Hive is data warehouse software for reading, writing, and managing large datasets.
- Spark is a general-purpose data processing engine designed to extract and process large volumes of data.
Is Hive suitable for OLTP applications? Why?
Hadoop is intended for long sequential scans and, because Hive is based on Hadoop, queries have very high latency—which means Hive is less appropriate for applications that need very fast response times. Also, Hive is read-based, and therefore not suitable for transaction processing that typically involves a high percentage of write operations. Hive is better suited for data warehousing tasks such as ETL, reporting, and data analysis and includes tools that enable easy access to data via SQL.