Cloud offerings for Data Warehouses - AWS Redshift Flashcards
What is a Hive table?
A Hive table is a logical structure in Apache Hive that represents data stored in Hadoop Distributed File System (HDFS) or other compatible storage systems.
It organizes data into rows and columns, similar to a traditional relational database table.
How is the schema defined for a Hive table?
The schema of a Hive table is defined using HiveQL (Hive Query Language), which is similar to SQL.
It includes column names, data types, and optional constraints.
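For illustration, a minimal sketch of defining a schema with a HiveQL CREATE TABLE statement, issued through Hive's JDBC driver. The HiveServer2 endpoint, table name, and columns are hypothetical examples, not part of the flashcard material.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DefineHiveSchema {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and credentials; adjust for your cluster.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL DDL: the column list defines the table's schema.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS page_views ("
                + " user_id BIGINT COMMENT 'viewer id',"
                + " url STRING,"
                + " view_time TIMESTAMP)");
        }
    }
}
```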
What storage formats are supported by Hive tables?
Hive supports various storage formats, including plain text (TEXTFILE), SequenceFile, ORC (Optimized Row Columnar), Parquet, Avro, and others.
Each storage format has its own advantages in terms of performance, compression, and compatibility with different types of data.
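A sketch of how the storage format is chosen with the STORED AS clause at table-creation time; the connection details and table names are invented for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ChooseStorageFormat {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // STORED AS selects the on-disk format; the logical schema is the same.
            stmt.execute("CREATE TABLE events_orc (id BIGINT, payload STRING) STORED AS ORC");
            stmt.execute("CREATE TABLE events_parquet (id BIGINT, payload STRING) STORED AS PARQUET");
            stmt.execute("CREATE TABLE events_avro (id BIGINT, payload STRING) STORED AS AVRO");
        }
    }
}
```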
What is partitioning in Hive tables?
Partitioning allows data in Hive tables to be divided into subdirectories based on the values of one or more columns.
It improves query performance by limiting the amount of data scanned during query execution.
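A sketch of a partitioned table, again against a hypothetical HiveServer2 endpoint; each distinct combination of partition-column values becomes its own subdirectory under the table's location.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PartitionedTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Partition columns are declared separately from regular columns.
            stmt.execute(
                "CREATE TABLE logs (msg STRING, level STRING) "
                + "PARTITIONED BY (event_date STRING, country STRING) "
                + "STORED AS ORC");
            // A filter on the partition column lets Hive scan only the matching
            // subdirectories, e.g.:
            //   SELECT count(*) FROM logs WHERE event_date = '2024-01-01'
        }
    }
}
```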
What is the difference between external and managed tables in Hive?
External tables reference data files stored outside the Hive warehouse directory, allowing the data to be shared with other systems; dropping an external table removes only the metadata and leaves the files in place.
Managed tables, also known as internal tables, store their data in the Hive warehouse directory under Hive's control, providing tighter integration; dropping a managed table deletes both the metadata and the underlying data.
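A sketch contrasting the two kinds of DDL; the shared HDFS location for the external table is an invented example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ExternalVsManaged {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Managed (internal) table: Hive owns the files under its warehouse
            // directory; DROP TABLE removes both metadata and data.
            stmt.execute("CREATE TABLE sales_managed (id BIGINT, amount DOUBLE) STORED AS ORC");

            // External table: only the metadata is registered; the files at
            // LOCATION are left untouched when the table is dropped.
            stmt.execute(
                "CREATE EXTERNAL TABLE sales_external (id BIGINT, amount DOUBLE) "
                + "STORED AS PARQUET "
                + "LOCATION '/data/shared/sales'");
        }
    }
}
```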
What is Hive metadata?
Hive metadata refers to the information about Hive tables, partitions, columns, storage properties, and other metadata objects stored in the Hive Metastore.
It includes schema definitions, data statistics, storage locations, and other metadata attributes.
What is the Hive Metastore?
The Hive Metastore is a central repository that stores metadata for Hive tables and partitions.
It acts as a catalog or directory for Hive, allowing users to query, access, and manage metadata information.
Where is Hive metadata stored?
Hive metadata is typically stored in a relational database management system (RDBMS) such as MySQL, PostgreSQL, or Derby.
The connection to this database is configured in Hive's configuration files (hive-site.xml), for example through the javax.jdo.option.ConnectionURL property.
How is metadata managed in Hive?
Metadata in Hive is managed through the use of the Hive Metastore service, which provides APIs for creating, updating, and querying metadata objects.
Users can interact with metadata through the Hive CLI, Beeline, or other Hive client applications.
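A minimal sketch of programmatic metadata access through the Metastore's Thrift client API. The Metastore URI shown is a hypothetical example; in practice HiveConf usually picks it up from hive-site.xml on the classpath.

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class InspectMetastore {
    public static void main(String[] args) throws Exception {
        // HiveConf reads hive-site.xml if present; the URI can also be set explicitly.
        HiveConf conf = new HiveConf();
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://localhost:9083");

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            // List the tables registered in the 'default' database and print
            // the storage location recorded for each one.
            for (String name : client.getAllTables("default")) {
                Table t = client.getTable("default", name);
                System.out.println(name + " -> " + t.getSd().getLocation());
            }
        } finally {
            client.close();
        }
    }
}
```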
Why is Hive metadata important?
Hive metadata contains critical information about the structure, location, and properties of Hive tables and partitions.
It enables query optimization, data lineage, schema evolution, and other metadata-driven operations in Hive.
What is Apache Flink?
Apache Flink is an open-source stream processing framework for distributed, high-throughput, and fault-tolerant data processing.
What is Hive?
Hive is an open-source data warehouse infrastructure built on top of Hadoop for querying and analyzing large datasets stored in Hadoop Distributed File System (HDFS) or other compatible storage systems.
What processing models do Apache Flink and Hive support?
Apache Flink supports both batch and stream processing with support for event time processing, windowing, and stateful computations.
Hive primarily focuses on batch processing, using SQL-like queries written in HiveQL that are compiled into batch jobs (for example, on MapReduce, Tez, or Spark).
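A minimal Flink DataStream sketch of event-time processing, windowing, and a keyed (stateful) aggregation over an invented stream of (word, event-time millis, count) tuples; the watermark delay and window size are arbitrary choices for illustration.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Invented input: (word, event-time in millis, count of 1 per event).
        env.fromElements(
                Tuple3.of("click", 1_000L, 1L),
                Tuple3.of("click", 4_000L, 1L),
                Tuple3.of("view", 12_000L, 1L))
           // Event time plus watermarks: tolerate events up to 5 seconds late.
           .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple3<String, Long, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                 .withTimestampAssigner((e, ts) -> e.f1))
           // Keyed, stateful aggregation over 10-second event-time windows.
           .keyBy(e -> e.f0)
           .window(TumblingEventTimeWindows.of(Time.seconds(10)))
           .sum(2)
           .print();

        env.execute("event-time word count");
    }
}
```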
Which framework is more suitable for real-time processing: Apache Flink or Hive?
Apache Flink is more suitable for real-time processing scenarios due to its native support for event time processing and stream processing capabilities.
Hive, while capable of processing large datasets, is primarily designed for batch processing and may not be as efficient for real-time use cases.
Can Apache Flink integrate with Hive?
Yes, Apache Flink integrates with Hive through its Hive connector and the HiveCatalog, allowing Flink to query and process data stored in Hive tables.
This integration enables Flink to leverage existing Hive metadata and data stored in Hive for processing.
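A sketch of registering a HiveCatalog in Flink's Table API; the catalog name, the hive-site.xml directory, and the page_views table are hypothetical examples.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class FlinkHiveIntegration {
    public static void main(String[] args) {
        TableEnvironment tableEnv =
            TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Hypothetical catalog name, default database, and hive-site.xml directory.
        HiveCatalog hive = new HiveCatalog("myhive", "default", "/opt/hive/conf");
        tableEnv.registerCatalog("myhive", hive);
        tableEnv.useCatalog("myhive");

        // Flink can now resolve tables defined in the Hive Metastore.
        tableEnv.executeSql("SELECT count(*) FROM page_views").print();
    }
}
```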
What is the Sequence File format?
The Sequence File format is a binary file format used for storing key-value pairs in Hadoop Distributed File System (HDFS) or other distributed file systems.
What is the structure of a Sequence File?
A Sequence File consists of a header (recording the key and value class names, compression settings, and a sync marker) followed by a series of records, where each record contains a key-value pair.
Periodic sync markers allow readers to split the file, and with block compression enabled, records are additionally grouped into blocks for efficient storage and retrieval.
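A sketch of writing and reading a small Sequence File with Hadoop's SequenceFile.Writer and SequenceFile.Reader; the path and the key-value pairs are invented.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/pairs.seq");   // hypothetical local/HDFS path

        // Write a few key-value records; the header (key/value class names,
        // compression info, sync marker) is written automatically.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("alpha"), new IntWritable(1));
            writer.append(new Text("beta"), new IntWritable(2));
        }

        // Read the records back in order.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            Text key = new Text();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value);
            }
        }
    }
}
```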
Does Sequence File support compression?
Yes, Sequence File supports both record-level and block-level compression to reduce storage space and improve data transfer efficiency.
Common compression codecs such as Gzip, Snappy, and LZO can be used with Sequence File.
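A sketch of enabling block compression with the Gzip codec when creating the writer; the path and record are again invented.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/pairs-gz.seq");   // hypothetical path

        // Instantiate the codec with the configuration applied.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // BLOCK compression groups many records together before compressing,
        // which usually compresses better than per-record compression.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec))) {
            writer.append(new Text("alpha"), new IntWritable(1));
        }
    }
}
```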
What are some common use cases for Sequence File?
Sequence File is commonly used as an intermediate format for chaining MapReduce jobs, where the output of one job is read directly as key-value input by the next.
It is also used for storing large amounts of structured or semi-structured data efficiently in Hadoop.
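A sketch of a MapReduce driver that writes its output as a compressed Sequence File. No Mapper or Reducer is set, so Hadoop's identity classes pass (LongWritable, Text) records straight through; the input and output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceFileJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "write sequence file output");
        job.setJarByClass(SequenceFileJobDriver.class);

        // With no Mapper/Reducer set, the identity classes forward the
        // (LongWritable offset, Text line) pairs produced by TextInputFormat.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Emit the results as a compressed SequenceFile so a downstream job
        // can read them directly as key-value pairs.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(job, true);

        FileInputFormat.addInputPath(job, new Path("/data/in"));        // hypothetical paths
        SequenceFileOutputFormat.setOutputPath(job, new Path("/data/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```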
Is Sequence File compatible with other Hadoop ecosystem tools?
Yes, Sequence File is compatible with various Hadoop ecosystem tools such as HBase, Hive, Pig, and Spark.
These tools can read from and write to Sequence Files, making it a versatile and widely adopted file format in the Hadoop ecosystem.
What is the Avro format?
The Avro format is a binary serialization format developed within the Apache Hadoop project.
It is used for efficient data serialization and is schema-based: data is always accompanied by a schema, itself written in JSON, that describes its structure.
How does Avro handle schema evolution?
Avro supports schema evolution through schema resolution: data written with one schema can later be read with a newer, compatible schema.
New fields can be added (provided they declare default values), fields can be removed, and fields can be renamed via aliases, all without rewriting the existing data.
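A sketch of schema resolution using two versions of an invented User record: a datum written with the old schema is read back with the newer one, and the missing field is filled from its default.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroSchemaEvolution {
    // Version 1 of a hypothetical 'User' record.
    static final Schema V1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}");

    // Version 2 adds an 'age' field with a default, so old data stays readable.
    static final Schema V2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");

    public static void main(String[] args) throws Exception {
        // Write a record with the old (writer) schema.
        GenericRecord user = new GenericData.Record(V1);
        user.put("name", "Ada");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(V1).write(user, encoder);
        encoder.flush();

        // Read it back with the new (reader) schema: schema resolution
        // fills the missing 'age' field from its default value.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord evolved = new GenericDatumReader<GenericRecord>(V1, V2).read(null, decoder);
        System.out.println(evolved);
    }
}
```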
How does Avro serialize data?
Avro serializes data into a compact binary format, making it efficient for storage and transmission.
It also supports JSON serialization for human-readable data interchange.
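A sketch encoding the same invented User record twice, once with Avro's binary encoder and once with its JSON encoder.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class AvroEncodings {
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    public static void main(String[] args) throws Exception {
        GenericRecord user = new GenericData.Record(SCHEMA);
        user.put("name", "Ada");
        user.put("age", 36);
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(SCHEMA);

        // Compact binary encoding: no field names, just values in schema order.
        ByteArrayOutputStream binary = new ByteArrayOutputStream();
        Encoder binaryEncoder = EncoderFactory.get().binaryEncoder(binary, null);
        writer.write(user, binaryEncoder);
        binaryEncoder.flush();
        System.out.println("binary bytes: " + binary.size());

        // JSON encoding of the same record, for human-readable interchange.
        ByteArrayOutputStream json = new ByteArrayOutputStream();
        Encoder jsonEncoder = EncoderFactory.get().jsonEncoder(SCHEMA, json);
        writer.write(user, jsonEncoder);
        jsonEncoder.flush();
        System.out.println("json: " + json.toString("UTF-8"));
    }
}
```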
Is Avro compatible with other data processing frameworks?
Yes, Avro is compatible with various data processing frameworks and tools, including Apache Hadoop, Apache Spark, Apache Kafka, and others.
These tools can read from and write to Avro files, making it a widely adopted format in the big data ecosystem.