Cloud offerings for Data Warehouses - AWS Redshift Continued Flashcards

1
Q

What is columnar storage?

A

Columnar storage is a data storage format that stores data by column rather than by row.
Each column is stored separately, allowing for more efficient compression and query processing.
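
To make the two layouts concrete, here is a toy sketch in plain Python (the table and its values are invented for illustration):

# The same three-row table in both layouts.
# Row-based: each record keeps all of its columns together.
rows = [
    {"id": 1, "city": "Berlin", "sales": 100},
    {"id": 2, "city": "Berlin", "sales": 250},
    {"id": 3, "city": "Paris", "sales": 175},
]

# Columnar: each column is stored (and scanned) on its own.
columns = {
    "id": [1, 2, 3],
    "city": ["Berlin", "Berlin", "Paris"],
    "sales": [100, 250, 175],
}

# Aggregating one column touches only that column's values.
total_sales = sum(columns["sales"])
print(total_sales)  # 525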

2
Q

What are other storage formats commonly used in databases?

A

Other storage formats include row-based storage and hybrid storage formats.
Row-based storage stores data in rows, where each row contains all the columns of the table.
Hybrid storage formats combine elements of both columnar and row-based storage, optimizing storage based on data characteristics and query patterns.

3
Q

How does columnar storage differ from row-based storage in terms of compression?

A

Columnar storage allows for more efficient compression because data within each column tends to have similar values, making compression algorithms more effective.
Row-based storage typically achieves lower compression ratios because adjacent values in a row (for example, an integer next to a string) often differ in type and value.
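
A rough way to see the effect, compressing synthetic data with Python's zlib (exact byte counts will vary):

import zlib

# A column of low-cardinality values compresses very well...
city_column = ",".join(["Berlin"] * 500 + ["Paris"] * 500).encode()

# ...while row-oriented data interleaves unlike values.
rows = ",".join(f"{i},Berlin,{i * 7 % 997}" for i in range(1000)).encode()

print(len(zlib.compress(city_column)), "bytes for the column")
print(len(zlib.compress(rows)), "bytes for the interleaved rows")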

4
Q

How does query performance compare between columnar and row-based storage?

A

Columnar storage is typically more efficient for analytical queries that involve aggregations, filtering, and selecting a subset of columns.
Row-based storage may be more suitable for transactional workloads or queries that involve accessing entire rows.

5
Q

How does data retrieval differ between columnar and row-based storage?

A

In columnar storage, only the columns relevant to the query need to be accessed, leading to less I/O and faster query processing.
In row-based storage, entire rows must be retrieved, which may result in higher I/O and slower query performance, especially for queries that involve accessing multiple columns.
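
A minimal sketch using the pyarrow library (the file name and column names are hypothetical):

import pyarrow.parquet as pq

# Read only the columns the query needs; the remaining columns
# in the file are never loaded from disk.
table = pq.read_table("events.parquet", columns=["user_id", "amount"])
print(table.column("amount").to_pylist())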

6
Q

What is the Parquet file format?

A

Parquet is an open-source columnar storage format that originated in the Hadoop ecosystem and is now widely supported across data processing tools.
It is designed to store and process large amounts of data efficiently, especially for analytics workloads.

7
Q

How does Parquet store data?

A

Parquet stores data in a columnar format, where each column is stored separately.
This allows for efficient compression and encoding techniques to be applied to each column independently, reducing storage space and improving query performance.
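
A small sketch of writing a Parquet file with pyarrow (assuming pyarrow is installed; the data is invented):

import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory columnar table and persist it as Parquet.
table = pa.table({
    "id": [1, 2, 3],
    "city": ["Berlin", "Berlin", "Paris"],
    "sales": [100, 250, 175],
})
pq.write_table(table, "sales.parquet")

# Inspect the file's columnar layout (row groups, per-column chunks).
print(pq.ParquetFile("sales.parquet").metadata)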

8
Q

Does Parquet support compression?

A

Yes, Parquet supports various compression codecs such as Snappy, Gzip, and LZ4.
Compression is applied at the column level, further reducing storage requirements and improving query performance.
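
A sketch of choosing codecs with pyarrow, which accepts a single codec for the whole file or a per-column mapping (assuming a recent pyarrow):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"city": ["Berlin"] * 1000, "sales": list(range(1000))})

# One codec for every column...
pq.write_table(table, "snappy.parquet", compression="snappy")

# ...or a different codec per column.
pq.write_table(table, "mixed.parquet",
               compression={"city": "gzip", "sales": "snappy"})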

9
Q

What is predicate pushdown in Parquet?

A

Predicate pushdown is a feature of Parquet that allows query engines to push filtering conditions down to the Parquet reader.
Using per-column statistics (such as min/max values) stored in the file metadata, the reader can skip entire row groups or pages that cannot satisfy the filter, improving query performance.
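
A sketch of predicate pushdown via pyarrow's filters argument, reusing the hypothetical sales.parquet from the earlier sketch (assumes a recent pyarrow):

import pyarrow.parquet as pq

# The filter is pushed down to the reader: row groups whose
# min/max statistics rule out sales > 200 are skipped entirely.
table = pq.read_table("sales.parquet", filters=[("sales", ">", 200)])
print(table.num_rows)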

10
Q

How does Parquet handle schema evolution?

A

Parquet supports schema evolution, allowing schema changes over time without breaking compatibility.
New columns can be added to the schema over time, and files written with older schemas remain readable, with missing columns returned as nulls; whether columns can safely be renamed, retyped, or dropped depends on the query engine reading the files.
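
One way to observe this with pyarrow's dataset API (a sketch; exact behavior varies by engine and version):

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Two files written at different times with different schemas.
pq.write_table(pa.table({"id": [1, 2]}), "v1.parquet")
pq.write_table(pa.table({"id": [3], "city": ["Paris"]}), "v2.parquet")

# Reading with the newer schema: the older file's missing
# column comes back as nulls.
schema = pa.schema([("id", pa.int64()), ("city", pa.string())])
print(ds.dataset(["v1.parquet", "v2.parquet"], schema=schema).to_table())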

11
Q

How are columnar storage formats used in OLAP systems?

A

OLAP (Online Analytical Processing) systems often use columnar storage formats like Parquet, ORC (Optimized Row Columnar), and others to efficiently process analytical queries on large datasets.
Columnar storage enables faster query performance, better compression, and improved data retrieval compared to row-based storage formats.

12
Q

What types of queries benefit most from columnar storage?

A

Analytical queries that involve aggregations, filtering, and analyzing subsets of columns benefit most from columnar storage.
These queries are common in data analysis, reporting, and data visualization tasks.

13
Q

How does columnar storage improve query performance?

A

Columnar storage allows query engines to access only the columns relevant to the query, minimizing I/O and improving query processing speed.
Since data within each column tends to have similar values, compression algorithms can be more effective, further enhancing query performance.

14
Q

What is serialization?

A

Serialization is the process of converting data structures or objects into a format that can be easily stored, transmitted, or reconstructed later.
It typically involves converting complex data types into a byte stream or another format that preserves the data’s structure and content.
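
A minimal Python sketch using the standard json module (the record is invented):

import json

record = {"id": 7, "city": "Berlin", "tags": ["a", "b"]}

# Serialize: an in-memory object becomes a byte stream that can
# be stored on disk or sent over a network.
payload = json.dumps(record).encode("utf-8")
print(payload)  # b'{"id": 7, "city": "Berlin", "tags": ["a", "b"]}'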

15
Q

What is deserialization?

A

Deserialization is the process of reconstructing data structures or objects from a serialized format back into their original form.
It involves interpreting the serialized data and reconstructing the original data structures or objects.
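
The matching deserialization step for the sketch above, again using Python's json module:

import json

payload = b'{"id": 7, "city": "Berlin", "tags": ["a", "b"]}'

# Deserialize: the byte stream is parsed back into a live object.
record = json.loads(payload.decode("utf-8"))
print(record["city"])  # Berlin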

16
Q

Why is serialization used?

A

Serialization is used to transfer data between different systems or processes that may have different architectures, programming languages, or communication protocols.
It enables data to be persisted to storage systems, transmitted over networks, or passed between different components of a distributed system.

17
Q

Why is deserialization necessary?

A

Deserialization is necessary to reconstruct data structures or objects from a serialized format so that they can be used by applications or processes.
It reverses the process of serialization, allowing data to be accessed, manipulated, and processed in its original form.

18
Q

What are some examples of serialized formats?

A

Serialized formats include binary formats such as Protocol Buffers, Apache Avro, and Apache Thrift.
They also include text-based formats such as JSON (JavaScript Object Notation) and XML (eXtensible Markup Language).

19
Q

What is the MapReduce execution engine?

A

The MapReduce execution engine is a distributed data processing framework, originally developed at Google and popularized by Apache Hadoop's open-source implementation, for processing and generating large datasets in parallel across a cluster of commodity hardware.

20
Q

What are the key phases in the MapReduce execution model?

A

The MapReduce model consists of two main phases: the Map phase and the Reduce phase.
In the Map phase, input data is processed and transformed into intermediate key-value pairs.
In the Reduce phase, intermediate key-value pairs with the same key are grouped together and processed to produce the final output.
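
A toy single-process word count sketching the two phases in Python (a real MapReduce engine distributes each step across nodes):

from collections import defaultdict

documents = ["the quick fox", "the lazy dog", "the fox"]

# Map phase: emit intermediate (key, value) pairs.
intermediate = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: combine each key's values into the final output.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}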

21
Q

How does MapReduce ensure fault tolerance?

A

MapReduce achieves fault tolerance through task re-execution, speculative execution, and data replication.
Failed tasks are automatically rescheduled on other nodes, while speculative execution launches duplicate copies of slow ("straggler") tasks so that one slow machine does not hold up the whole job.
Data replication (for example, in HDFS) keeps input data available in case of node failures.

22
Q

How does MapReduce achieve scalability?

A

MapReduce achieves scalability by distributing data and computation across multiple nodes in a cluster.
As data volume and processing requirements increase, additional nodes can be added to the cluster to scale out horizontally.

23
Q

What are some common use cases for the MapReduce execution engine?

A

MapReduce is commonly used for batch processing of large datasets, including log processing, data warehousing, ETL (Extract, Transform, Load), and data analytics tasks.
It is particularly well-suited for tasks that can be parallelized and require processing massive amounts of data efficiently.

24
Q

What is Apache Spark?

A

Apache Spark is an open-source distributed data processing framework designed for speed, ease of use, and advanced analytics.
It provides in-memory processing capabilities and supports various programming languages, including Scala, Java, Python, and R.

25
Q

How does the execution model of MapReduce and Apache Spark differ?

A

MapReduce follows a disk-based processing model, where intermediate data is written to disk between Map and Reduce phases.
Apache Spark, on the other hand, follows an in-memory processing model, where intermediate data is stored in memory between operations, leading to faster processing.
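
A PySpark sketch of keeping an intermediate dataset in memory between operations (assumes pyspark is installed; logs.txt is a placeholder path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
words = (spark.sparkContext.textFile("logs.txt")
         .flatMap(str.split)
         .map(lambda w: (w, 1))
         .cache())  # keep this RDD in memory after it is first computed

# Both actions below reuse the cached data instead of re-reading the file.
print(words.count())
print(words.reduceByKey(lambda a, b: a + b).take(5))
spark.stop()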

26
Q

How do MapReduce and Apache Spark ensure fault tolerance?

A

MapReduce achieves fault tolerance by re-executing failed tasks on other nodes and by relying on replicated input data.
Apache Spark achieves fault tolerance through resilient distributed datasets (RDDs), which track lineage information to reconstruct lost data partitions in case of failures.

27
Q

Which framework is considered easier to use: MapReduce or Apache Spark?

A

Apache Spark is generally considered easier to use than MapReduce due to its high-level APIs, interactive shell (Spark Shell), and support for multiple programming languages.
Spark’s rich set of libraries and built-in support for SQL, streaming, machine learning, and graph processing also contribute to its ease of use.