Databases Flashcards

1
Q

what is DBMS

A

DBMS stands for Database Management System. It is software that is designed to define, create, manage, and control databases. The primary purpose of a DBMS is to provide an efficient and effective way to store, retrieve, and manipulate data. DBMS software serves as an interface between the database and its users or application programs, ensuring that data is consistently organized and remains easily accessible.

Key functions and features of DBMS software include (a minimal code sketch follows this list):

Data Storage Management: Efficiently stores data in a structured format and manages the physical and logical data storage.

Data Retrieval: Provides mechanisms for querying and retrieving data using languages like SQL (Structured Query Language).

Data Manipulation: Allows for the insertion, update, and deletion of data in the database.

Transaction Management: Ensures that all database transactions are processed reliably and adhere to ACID properties (Atomicity, Consistency, Isolation, Durability).

Concurrency Control: Manages concurrent access to the database by multiple users, ensuring data integrity and consistency.

Security Management: Implements security measures to control access to data and protect against unauthorized access.

Backup and Recovery: Provides tools and processes for backing up data and recovering it in case of data loss or corruption.

Data Integrity: Ensures the accuracy and consistency of data through constraints and rules.

Data Administration: Provides tools for database administrators to manage the database, monitor performance, and optimize system resources.

Data Abstraction: Separates the physical storage of data from the logical representation, allowing for easier data management and manipulation.
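
To make these functions concrete, here is a minimal sketch using Python's built-in sqlite3 module, which embeds a small relational DBMS; the table and data are purely illustrative.

```python
import sqlite3

# Connect to an embedded SQLite database (the file is created on first use).
conn = sqlite3.connect("example.db")
cur = conn.cursor()

# Data storage and integrity: define a table with constraints.
cur.execute("""
    CREATE TABLE IF NOT EXISTS employees (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        salary REAL CHECK (salary >= 0)
    )
""")

# Transaction management: group related changes so they commit atomically.
try:
    cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)",
                ("Ada", 95000.0))
    cur.execute("UPDATE employees SET salary = salary * 1.05 WHERE name = ?",
                ("Ada",))
    conn.commit()   # both statements take effect together...
except sqlite3.Error:
    conn.rollback() # ...or neither does

# Data retrieval: query the table with SQL.
for row in cur.execute("SELECT id, name, salary FROM employees"):
    print(row)

conn.close()
```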

Examples of popular DBMS software include:

Oracle Database: A widely used relational DBMS known for its robustness and scalability.
MySQL: An open-source relational DBMS commonly used in web applications.
Microsoft SQL Server: A relational DBMS developed by Microsoft, used in enterprise environments.
PostgreSQL: An open-source relational DBMS known for its advanced features and standards compliance.
MongoDB: A NoSQL, document-oriented DBMS commonly used for applications with flexible or evolving schemas.

2
Q

what is FCA?

A

FCA typically stands for “Full Copy Access.” It refers to a backup or replication technique where full copies of databases or data sets are made available for various purposes, such as disaster recovery, testing, or analytics.

3
Q

Apache Iceberg

A

Apache Iceberg is an open-source table format for large-scale analytic datasets. It is designed to work with distributed data processing engines like Apache Spark, Trino (formerly PrestoSQL), Apache Flink, and more. Iceberg aims to improve the performance, scalability, and reliability of handling large datasets in a data lake environment.

Key Features of Apache Iceberg:

Schema Evolution:
Supports schema changes like adding, dropping, or renaming columns without the need to rewrite the entire dataset.
Handles schema evolution in a way that is backward and forward compatible.

Partitioning:
Enables efficient querying by supporting hidden partitioning, which allows query engines to prune unnecessary data partitions automatically.
Users can define partitions at a logical level, and Iceberg manages the physical layout.

Snapshot Isolation:
Provides a consistent view of the data by using snapshot isolation, which helps in managing concurrent read and write operations.
Supports time travel queries, allowing users to query data as it existed at a specific point in time.

Metadata Management:
Stores metadata in a structured format, making it easier to manage and query large datasets.
Metadata is kept separate from data files, enabling efficient operations and reducing the overhead of maintaining data files.

Data Versioning:
Allows for safe, atomic commits and rollbacks, ensuring data integrity and consistency.
Supports branching and tagging of data, enabling experimentation and versioning without affecting the main dataset.

Performance:
Optimizes query performance by enabling efficient data skipping and pruning.
Supports vectorized reads and writes, improving the performance of data processing operations.

Compatibility:
Integrates with various data processing engines like Apache Spark, Apache Flink, Trino, and Hive, providing flexibility in choosing the right tools for different workloads.
Compatible with existing data formats like Parquet, Avro, and ORC.

Use Cases:

Data Lakes:
Ideal for building and managing large-scale data lakes, ensuring efficient storage, retrieval, and management of big data.

Analytics and BI:
Supports complex analytical queries and business intelligence use cases, providing fast and reliable access to large datasets.

Data Warehousing:
Can be used in conjunction with data warehousing solutions to handle large volumes of data and complex transformations.

Data Science and Machine Learning:
Enables data scientists and machine learning engineers to work with consistent and versioned datasets, facilitating reproducible experiments and models.
Example Scenario:
Suppose you have a large dataset stored in a data lake, and you need to perform analytics and machine learning on this data. With Apache Iceberg, you can do the following (sketched in code after this list):

Efficiently Query Data: Use partition pruning and metadata management to speed up queries.
Handle Schema Changes: Add new features to your dataset without rewriting the entire dataset.
Time Travel: Query data as it was at a specific point in time, useful for debugging and audits.
Concurrent Operations: Perform read and write operations concurrently without worrying about data consistency issues.
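
A minimal PySpark sketch of the first three operations above, assuming a Spark session (3.3+ for the time-travel syntax) already configured with an Iceberg catalog named demo; the table, columns, and timestamp are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes Iceberg extensions and a catalog named "demo" are configured
# in the Spark environment; getOrCreate() just picks up that session.
spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# Hidden partitioning: partition by day(ts) without a separate date column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (source STRING)")

# Time travel: read the table as it existed at an earlier point in time.
spark.sql("""
    SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()

# Snapshot metadata kept by Iceberg, useful for audits and rollbacks.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
```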

4
Q

ISM

A

ISM (Integrated Storage Management):
Purpose:
ISM is an umbrella term referring to various IBM mainframe tools and technologies for managing storage resources. It encompasses a range of functions, including data backup, recovery, hierarchical storage management, and more.
Components:
DFSMS (Data Facility Storage Management Subsystem): A key component of ISM, DFSMS automates storage management tasks such as data classification, backup, and space management.
DFSMShsm (Hierarchical Storage Manager): Manages data migration between different storage tiers based on usage patterns and policies, helping optimize storage utilization and cost.
DFSMSrmm (Removable Media Manager): Manages removable media such as tapes, tracking their usage, location, and lifecycle.
Key Features:
Automation of storage management tasks, reducing manual intervention and administrative overhead.
Improved data availability, protection, and recovery.
Efficient utilization of storage resources through hierarchical storage management and data migration.
Use Cases:
Managing large-scale storage environments in mainframe systems, ensuring data integrity, availability, and efficient use of storage resources.

5
Q

VSAM

A

VSAM (Virtual Storage Access Method):
Purpose:
VSAM is a high-performance data management system used on IBM mainframes. It provides access methods for organizing and accessing data on direct access storage devices (DASD).
Data Organization:
VSAM organizes data into datasets, which are collections of records. The primary types of VSAM datasets are (mimicked in the toy sketch after this list):
KSDS (Key-Sequenced Data Set): Records are stored and accessed using a key field, allowing for efficient indexing and retrieval.
ESDS (Entry-Sequenced Data Set): Records are stored in the order they are written, similar to a sequential file.
RRDS (Relative Record Data Set): Records are accessed based on their relative position within the dataset.
LDS (Linear Data Set): A byte-stream dataset with no record boundaries, often used for databases.
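
VSAM itself is accessed from mainframe languages such as COBOL or PL/I rather than Python, but the access patterns are easy to mimic; the toy sketch below only illustrates how keyed (KSDS), arrival-order (ESDS), and relative-position (RRDS) retrieval differ.

```python
# Toy stand-ins for VSAM access patterns (illustrative only).

# KSDS: records are located by a key field, like a dict keyed on account id.
ksds = {
    "ACCT-001": {"name": "Ada",   "balance": 100},
    "ACCT-002": {"name": "Grace", "balance": 250},
}
record = ksds["ACCT-002"]        # direct keyed lookup

# ESDS: records are kept in the order written, like an append-only list.
esds = []
esds.append({"event": "login"})  # new records always go at the end
esds.append({"event": "logout"})
first_written = esds[0]          # read back in arrival order

# RRDS: records are addressed by relative record number in fixed slots.
rrds = [None] * 10               # pre-allocated slots
rrds[3] = {"name": "slot three"} # store at relative record number 3
record_3 = rrds[3]               # retrieve by position, not by key
```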
Key Features:
High performance for data access and management.
Support for complex data structures and large volumes of data.
Integration with various IBM mainframe applications and systems.
Use Cases:
Commonly used for transactional data processing, database management, and applications requiring fast and efficient data access.

6
Q

VSAM vs ISM

A

Summary of Differences:

VSAM:
Focuses on high-performance data access and organization.
Provides specific access methods for different types of datasets (KSDS, ESDS, RRDS, LDS).
Integral to transactional data processing and database applications.

ISM:
Encompasses a suite of tools and technologies for comprehensive storage management.
Includes automation of storage tasks, hierarchical storage management, and data protection.
Ensures efficient storage utilization, data availability, and recovery in mainframe environments.

7
Q

Cloudera, Snowflake, Teradata, Databricks, Netezza, Oracle

A

Cloudera:
Overview: Cloudera is a software company that provides a cloud-based platform for data engineering, data warehousing, machine learning, and analytics. It is known for its enterprise data cloud.
Key Features: Supports multi-function analytics, secure data access, and management across hybrid and multi-cloud environments.
Use Cases: Data engineering, data warehousing, operational database management, and machine learning.

Snowflake:
Overview: Snowflake is a cloud-based data warehousing company. It offers a unique architecture to handle both structured and semi-structured data, providing a single platform for data warehousing, data lakes, and data sharing.
Key Features: Fully managed SaaS, multi-cluster shared data architecture, automatic scaling, and secure data sharing.
Use Cases: Data warehousing, data lakes, data sharing, and data engineering.

Teradata:
Overview: Teradata provides database and analytics-related products and services. It focuses on enterprise-scale data warehousing and analytics solutions.
Key Features: High-performance data warehousing, advanced analytics, and scalable data management.
Use Cases: Enterprise data warehousing, big data analytics, and business intelligence.

Databricks:
Overview: Databricks is a unified data analytics platform that provides data engineering, collaborative data science, and machine learning. It is built on Apache Spark.
Key Features: Unified analytics engine, collaborative notebooks, machine learning lifecycle management, and scalable processing.
Use Cases: Big data processing, machine learning, data engineering, and collaborative data science.

Netezza:
Overview: Netezza, part of IBM, offers a data warehousing and analytics appliance. It is known for its simplicity, speed, and scalability in handling large volumes of data.
Key Features: High-speed processing, in-database analytics, and data compression.
Use Cases: Enterprise data warehousing, big data analytics, and data integration.

Oracle:
Overview: Oracle offers a comprehensive suite of cloud applications and platform services, including Oracle Autonomous Database, which leverages machine learning to provide self-driving, self-securing, and self-repairing capabilities.
Key Features: Autonomous database, robust security, high availability, and integrated cloud applications.
Use Cases: Data warehousing, transaction processing, enterprise resource planning (ERP), and customer relationship management (CRM).

8
Q

Apache Hive

A

Apache Hive is a data warehouse infrastructure built on top of Hadoop, designed for managing and querying large datasets stored in Hadoop’s distributed storage using a SQL-like language called HiveQL. It enables users to perform data analysis on big data without requiring detailed knowledge of Hadoop’s underlying mechanisms.

Key Features of Apache Hive
HiveQL (Hive Query Language):
SQL-like Language: HiveQL is similar to SQL, making it easy for users with a SQL background to write queries.
Extensibility: Supports user-defined functions (UDFs) to handle complex data transformations.
Data Storage:
HDFS Integration: Integrates with Hadoop Distributed File System (HDFS) for distributed storage.
Support for Various Formats: Compatible with various data formats, including text files, Parquet, ORC, Avro, and more.
Schema Management:
Metastore: Maintains metadata about the databases, tables, partitions, and columns. The metastore is typically backed by a relational database.
Query Optimization:
Cost-based Optimizer: Uses statistics to optimize query execution plans.
Indexing and Partitioning: Supports indexing, partitioning, and bucketing to enhance query performance (see the sketch after this feature list).
Integration with Other Tools:
Hadoop Ecosystem: Seamlessly integrates with other Hadoop ecosystem components like Pig, HBase, and MapReduce.
Data Processing Engines: Can use different processing engines like Apache Tez, Apache Spark, and MapReduce for executing queries.
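
As a rough sketch of partitioned storage and HiveQL querying, the snippet below uses the third-party PyHive client and assumes a reachable HiveServer2 endpoint at localhost:10000; the database, table, and columns are hypothetical.

```python
from pyhive import hive  # third-party client for HiveServer2

# Connection details are assumptions; adjust for your cluster.
conn = hive.connect(host="localhost", port=10000, database="default")
cur = conn.cursor()

# Partitioned table: queries that filter on view_date only read
# the files belonging to the matching partitions.
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id BIGINT,
        url     STRING
    )
    PARTITIONED BY (view_date STRING)
    STORED AS ORC
""")

# A HiveQL aggregation that touches a single partition.
cur.execute("""
    SELECT url, COUNT(*) AS hits
    FROM page_views
    WHERE view_date = '2024-01-01'
    GROUP BY url
""")
for row in cur.fetchall():
    print(row)
```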
Use Cases for Apache Hive
Data Warehousing:
ETL Processes: Commonly used for Extract, Transform, Load (ETL) processes to transform raw data into a structured format for analysis.
Reporting and Analytics: Provides a SQL-like interface for querying large datasets, suitable for generating reports and performing data analysis.
Big Data Processing:
Handling Large Datasets: Designed to efficiently manage and query petabytes of data stored in Hadoop.
Batch Processing: Well-suited for batch processing and aggregating large volumes of data.
Business Intelligence:
Data Integration: Integrates data from various sources into a unified schema for comprehensive analysis.
Ad-hoc Queries: Allows users to run ad-hoc queries on large datasets, enabling quick insights and decision-making.
Advantages and Limitations

Advantages:
Scalability: Scales horizontally to handle large datasets.
Ease of Use: SQL-like language makes it accessible to users familiar with SQL.
Flexibility: Supports a wide range of data formats and storage options.

Limitations:
Latency: Typically used for batch processing, so not suitable for low-latency, real-time queries.
Complexity: Performance tuning and optimization can be complex, especially for very large datasets.
