Data Warehousing with Apache Hive Continued Flashcards
What is Presto?
Presto is an open-source distributed SQL query engine designed for interactive analytic queries against data sources ranging in size from gigabytes to petabytes.
How does Presto work?
Presto allows querying data where it lives, including in Hadoop, S3, Cassandra, relational databases, or even proprietary data stores. It executes queries using a distributed architecture where a coordinator node manages worker nodes that are responsible for executing parts of the query.
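As a concrete illustration, the sketch below submits a query to a Presto coordinator from Python using the presto-python-client package (one option among several clients such as the CLI or JDBC). The hostname, port, user, catalog, and schema are placeholders.

```python
# Minimal sketch: submit a SQL query to a Presto coordinator.
# Assumes the presto-python-client package is installed; host, user,
# catalog, and schema below are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host='coordinator.example.com',  # the coordinator parses and plans the query
    port=8080,
    user='analyst',
    catalog='hive',
    schema='default',
)
cur = conn.cursor()

# The coordinator schedules the work across worker nodes;
# system.runtime.nodes lists the nodes in the cluster.
cur.execute('SELECT node_id, coordinator, state FROM system.runtime.nodes')
for row in cur.fetchall():
    print(row)
```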
What are some key features of Presto?
Key features include:
Ability to query across multiple data sources within a single query.
In-memory processing for fast query execution.
Support for standard ANSI SQL including complex joins, window functions, and aggregates.
Extensible architecture via plugins.
What types of data sources can Presto query?
Presto can query a variety of data sources including, but not limited to, Hadoop HDFS, Amazon S3, Microsoft Azure Storage, Google Cloud Storage, MySQL, PostgreSQL, Cassandra, Kafka, and MongoDB.
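Because each connected data source is exposed as a catalog, a single statement can join tables that live in different systems. The sketch below is illustrative only: the catalog, schema, and table names are made-up placeholders, and the connection pattern follows the earlier presto-python-client sketch.

```python
# Illustrative federated query joining a Hive table with a MySQL table
# in one statement. All catalog/schema/table names are placeholders.
import prestodb

cur = prestodb.dbapi.connect(
    host='coordinator.example.com', port=8080, user='analyst',
    catalog='hive', schema='default',
).cursor()

cur.execute("""
    SELECT c.country, count(*) AS views
    FROM hive.web.page_views AS v
    JOIN mysql.crm.customers AS c ON v.customer_id = c.id
    GROUP BY c.country
    ORDER BY views DESC
""")
print(cur.fetchall())
```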
How is Presto different from Hadoop?
Unlike Hadoop, which is primarily geared towards batch processing with high-latency data access, Presto is designed for low-latency, interactive query performance and can query data directly from data sources without requiring data movement or transformation.
What is the difference between Trino and Presto?
Trino is the rebranded version of PrestoSQL, which was forked from the original PrestoDB by the original creators. Trino and PrestoDB are now separate projects, both evolving independently.
How do you scale Presto for large datasets?
Presto scales horizontally; adding more worker nodes to the Presto cluster can enhance its ability to handle larger datasets and more concurrent queries.
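In practice, adding a worker usually means starting another Presto server whose configuration marks it as a worker and points it at the coordinator's discovery URI. The etc/config.properties sketch below uses illustrative values; memory settings depend on the hardware and workload.

```
# etc/config.properties on a newly added worker node (illustrative values)
coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=8GB
discovery.uri=http://coordinator.example.com:8080
```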
What kind of SQL operations can you perform with Presto?
Presto supports a wide range of SQL operations including SELECT, JOIN (even across different data sources), sub-queries, most SQL standard functions, and even complex aggregations and window functions.
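As an example of the window-function support, the query below ranks each customer's orders by amount without collapsing rows; table and column names are placeholders, and the connection reuses the earlier presto-python-client pattern.

```python
# Illustrative window-function query (placeholder table/column names).
import prestodb

cur = prestodb.dbapi.connect(
    host='coordinator.example.com', port=8080, user='analyst',
    catalog='hive', schema='sales',
).cursor()

# Rank each customer's orders by amount, keeping every row.
cur.execute("""
    SELECT customer_id, order_id, amount,
           rank() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
    FROM orders
""")
print(cur.fetchall())
```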
What are the use cases for using Presto?
Common use cases include:
Interactive analytics at scale.
Data lake analytics.
Multi-database queries.
Real-time analytics.
What are some best practices for optimizing Presto performance?
Best practices include:
Storing data in efficient columnar formats like Parquet or ORC.
Properly sizing Presto clusters based on data size and query complexity.
Using appropriate indexing and partitioning on data sources to speed up query processing.
Tuning memory settings for optimal performance.
What is Apache Hive?
Apache Hive is a data warehousing tool in the Hadoop ecosystem that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL-like queries.
How does Apache Hive work?
Hive lets users write SQL-like queries in HiveQL (Hive Query Language), which are translated into MapReduce, Tez, or Spark jobs under the hood to process the data.
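For instance, a HiveQL statement can be submitted to HiveServer2 from Python; the sketch below uses the PyHive client (one option among many, alongside Beeline or JDBC), with host, database, and table names as placeholders. Hive then compiles the statement into MapReduce, Tez, or Spark tasks.

```python
# Minimal sketch: run a HiveQL query through HiveServer2 using PyHive.
# Host, database, and table names are placeholders.
from pyhive import hive

conn = hive.connect(host='hiveserver2.example.com', port=10000,
                    username='analyst', database='default')
cur = conn.cursor()

# Hive compiles this HiveQL into MapReduce/Tez/Spark jobs behind the scenes.
cur.execute("SELECT page, count(*) AS hits FROM web_logs GROUP BY page")
print(cur.fetchall())
```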
What are the key components of Apache Hive’s architecture?
Key components include:
Hive Metastore: Stores metadata for Hive tables and partitions.
Hive Driver: Manages the lifecycle of a HiveQL statement.
Compiler: Parses HiveQL queries, optimizes them, and creates execution plans.
Execution Engine: Executes the tasks using MapReduce, Tez, or Spark.
What is HiveQL?
HiveQL (Hive Query Language) is a SQL-like query language used with Hive to analyze data in the Hadoop ecosystem. It provides familiar SQL constructs for data summarization, querying, and analysis over data stored in Hadoop.
How is data stored and managed in Hive?
Data in Hive is stored in tables, which can be either managed (internal) tables, where Hive controls both the data and the metadata, or external tables, where Hive stores only the metadata and the data lives outside Hive's control. Dropping a managed table deletes the underlying data, while dropping an external table removes only the metadata.
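The difference shows up in the DDL: an EXTERNAL table points at data that already lives at a location Hive does not own. The statements below are illustrative (table names, columns, and paths are placeholders), submitted with a PyHive cursor as in the earlier sketch.

```python
# Illustrative managed vs. external table DDL (placeholders throughout).
from pyhive import hive

cur = hive.connect(host='hiveserver2.example.com', port=10000).cursor()

# Managed (internal) table: Hive owns both the metadata and the data files.
cur.execute("""
    CREATE TABLE clicks_managed (user_id BIGINT, url STRING)
    STORED AS ORC
""")

# External table: Hive stores only the metadata; the data stays at LOCATION.
cur.execute("""
    CREATE EXTERNAL TABLE clicks_external (user_id BIGINT, url STRING)
    STORED AS ORC
    LOCATION '/data/landing/clicks'
""")
```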
What are partitions and buckets in Hive?
Partitions: Hive tables can be partitioned on one or more keys; each partition is stored in its own directory, so queries that filter on the partition key can skip irrelevant data and run faster on large datasets.
Buckets: Data in Hive can be clustered into buckets based on a hash function of a column in a table, useful for efficient sampling and other operations.
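A single CREATE TABLE can combine both ideas; the sketch below uses placeholder names and is submitted with a PyHive cursor as in the earlier sketches.

```python
# Illustrative HiveQL combining partitioning and bucketing (placeholder names).
from pyhive import hive

cur = hive.connect(host='hiveserver2.example.com', port=10000).cursor()

cur.execute("""
    CREATE TABLE page_views (user_id BIGINT, url STRING)
    PARTITIONED BY (view_date STRING)       -- one directory per date value
    CLUSTERED BY (user_id) INTO 32 BUCKETS  -- hash of user_id assigns each row to a bucket file
    STORED AS ORC
""")
```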
What file formats does Hive support?
Hive supports several file formats, including:
TextFile (default)
SequenceFile
RCFile
ORC (Optimized Row Columnar)
Parquet
Avro
What are some use cases for Apache Hive?
Common use cases include:
Data warehousing and large-scale data processing.
Ad-hoc querying over large datasets.
Data mining tasks.
Log processing and analysis.
What are some limitations of Apache Hive?
Limitations include:
Not designed for online transaction processing; it is optimized for batch processing.
Higher latency for Hive queries compared to traditional databases due to the overhead of MapReduce jobs.
Limited subquery support.
How can you optimize Hive performance?
Optimization techniques include:
Using appropriate file formats like ORC to improve compression and I/O efficiency.
Implementing partitioning and bucketing to enhance data retrieval speeds.
Configuring Hive to use Tez or Spark instead of MapReduce as the execution engine to improve performance.
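Some of these settings can be applied per session. The sketch below shows a couple of illustrative examples (engine selection and an ORC copy of a table), not a complete tuning recipe; names are placeholders and it reuses the PyHive cursor pattern from the earlier sketches.

```python
# Illustrative session-level tuning and an ORC conversion (placeholder names).
from pyhive import hive

cur = hive.connect(host='hiveserver2.example.com', port=10000).cursor()

cur.execute("SET hive.execution.engine=tez")         # use Tez instead of MapReduce
cur.execute("SET hive.exec.dynamic.partition=true")  # allow dynamic partition inserts

# Store a copy of the table in ORC so reads benefit from columnar layout
# and compression.
cur.execute("""
    CREATE TABLE web_logs_orc STORED AS ORC
    AS SELECT * FROM web_logs
""")
```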
What is Hadoop Distributed File System (HDFS)?
HDFS is a distributed file system designed to run on commodity hardware. It has high fault tolerance and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications with large data sets.
How does HDFS work?
HDFS stores large files across multiple machines. It operates by breaking down files into blocks (default size is 128 MB in Hadoop 2.x, 64 MB in earlier versions), which are stored on a cluster of machines. It manages the distribution and replication of these blocks to ensure redundancy and fault tolerance.
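As a quick worked example of the numbers involved, assuming the 128 MB default block size and the default replication factor of 3:

```python
# Worked example: how a 1 GB file is laid out in HDFS with default settings.
file_size_mb = 1024          # 1 GB file
block_size_mb = 128          # default block size in Hadoop 2.x and later
replication = 3              # default replication factor

blocks = -(-file_size_mb // block_size_mb)   # ceiling division -> 8 blocks
raw_storage_mb = file_size_mb * replication  # 3072 MB of raw capacity consumed

print(f"{blocks} blocks, {raw_storage_mb} MB of raw storage across the cluster")
```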
What are the key components of HDFS architecture?
The two main components of HDFS are:
NameNode: The master server that manages the file system namespace and regulates access to files by clients.
DataNode: The worker nodes that manage the storage attached to the machines they run on and serve read and write requests from the file system’s clients.
What is the role of the NameNode in HDFS?
The NameNode manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree. This metadata is stored in memory, which allows the NameNode to rapidly respond to queries from clients.
How does HDFS ensure data reliability and fault tolerance?
HDFS replicates blocks of data to multiple nodes in the cluster (default is three replicas across nodes). This replication ensures that even if one or more nodes go down, data is still available from other nodes.
What is the role of DataNodes in HDFS?
DataNodes are responsible for serving read and write requests from the file system’s clients. They also perform block creation, deletion, and replication upon instruction from the NameNode.
What is a block in HDFS?
A block in HDFS is a single unit of data, and it is the minimum amount of data that HDFS reads or writes. Blocks are distributed across multiple nodes to ensure reliability and fast data processing.
What happens when a DataNode fails?
When a DataNode fails, the NameNode is responsible for detecting the failure. Based on the block replication policy, the NameNode will initiate replication of blocks stored on the failed DataNode to other DataNodes, thus maintaining the desired level of data redundancy.
Can HDFS be accessed from outside the Hadoop ecosystem?
Yes, HDFS can be accessed in multiple ways, including the Hadoop command-line interface, the Java API, and over HTTP via the WebHDFS REST API. Tools like Apache Hive, HBase, and others built on top of Hadoop can also access HDFS data directly.
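For example, the WebHDFS REST API exposes file system operations over plain HTTP; the sketch below lists a directory. The NameNode host, port, and path are placeholders, and the web UI port differs between versions (typically 50070 on Hadoop 2.x, 9870 on Hadoop 3.x).

```python
# Minimal sketch: list an HDFS directory over the WebHDFS REST API.
# NameNode host/port and the path are placeholders.
import requests

resp = requests.get(
    "http://namenode.example.com:9870/webhdfs/v1/user/analyst",
    params={"op": "LISTSTATUS"},
)
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"], entry["length"])
```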
What are the limitations of HDFS?
Limitations of HDFS include:
Not suitable for low-latency data access: HDFS is optimized for high throughput and high capacity storage.
Not a fit for small files: Storing a large number of small files can be inefficient because each file, directory, and block in HDFS is represented as an object in the NameNode’s memory.
What is MapReduce?
MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. It consists of two main phases: the Map phase, which filters and transforms input records into intermediate key-value pairs, and the Reduce phase, which performs a summary operation over the values grouped by key.
How does the MapReduce model work?
In the MapReduce model, the input data is divided into independent chunks, which the map tasks process in a completely parallel manner. The map outputs are then sorted and shuffled by key, and the grouped results are passed to the reduce tasks for aggregation.
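The flow can be sketched without Hadoop at all; the pure-Python word count below mimics the three stages (map, shuffle/sort by key, reduce) on an in-memory list and is only a conceptual illustration of the model.

```python
# Conceptual word count illustrating the MapReduce flow (no Hadoop involved).
from itertools import groupby
from operator import itemgetter

documents = ["big data big ideas", "data moves fast"]  # stand-in input splits

# Map phase: turn each record into intermediate (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: group the intermediate pairs by key, as the framework would.
mapped.sort(key=itemgetter(0))

# Reduce phase: summarize the values for each key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'big': 2, 'data': 2, 'fast': 1, 'ideas': 1, 'moves': 1}
```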