Data Warehousing with Apache Hive Continued Flashcards

1
Q

What is Presto?

A

Presto is an open-source distributed SQL query engine designed for interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

2
Q

How does Presto work?

A

Presto allows querying data where it lives, including in Hadoop, S3, Cassandra, relational databases, or even proprietary data stores. It executes queries using a distributed architecture where a coordinator node manages worker nodes that are responsible for executing parts of the query.

3
Q

What are some key features of Presto?

A

Key features include:
Ability to query across multiple data sources within a single query (see the example below).
In-memory processing for fast query execution.
Support for standard ANSI SQL including complex joins, window functions, and aggregates.
Extensible architecture via plugins.
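
For instance, a single Presto query can join tables exposed by two different connectors. A minimal sketch, assuming catalogs named hive and mysql are configured; the schemas, tables, and columns are made up:

  -- Hypothetical federated query joining a Hive table with a MySQL table.
  SELECT c.customer_name, count(*) AS views
  FROM hive.web.page_views v
  JOIN mysql.crm.customers c ON v.customer_id = c.id
  GROUP BY c.customer_name
  ORDER BY views DESC
  LIMIT 10;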

4
Q

What types of data sources can Presto query?

A

Presto can query a variety of data sources including, but not limited to, Hadoop HDFS, Amazon S3, Microsoft Azure Storage, Google Cloud Storage, MySQL, PostgreSQL, Cassandra, Kafka, and MongoDB.

5
Q

How is Presto different from Hadoop?

A

Unlike Hadoop MapReduce, which is geared towards high-latency batch processing, Presto is designed for low-latency, interactive query performance and can query data directly where it lives, without requiring data movement or transformation.

6
Q

What is the difference between Trino and Presto?

A

Trino is the rebranded version of PrestoSQL, which was forked from the original PrestoDB by the original creators. Trino and PrestoDB are now separate projects, both evolving independently.

7
Q

How do you scale Presto for large datasets?

A

Presto scales horizontally; adding more worker nodes to the Presto cluster can enhance its ability to handle larger datasets and more concurrent queries.

8
Q

What kind of SQL operations can you perform with Presto?

A

Presto supports a wide range of SQL operations including SELECT, JOIN (even across different data sources), sub-queries, most SQL standard functions, and even complex aggregations and window functions.
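
As an illustration, the sketch below combines a filter, a window function, and standard SQL syntax; the hive.sales.orders table and its columns are hypothetical:

  -- Rank each customer's orders by value using a window function (illustrative names).
  SELECT order_id,
         customer_id,
         order_total,
         rank() OVER (PARTITION BY customer_id ORDER BY order_total DESC) AS spend_rank
  FROM hive.sales.orders
  WHERE order_date >= DATE '2024-01-01';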

9
Q

What are the use cases for using Presto?

A

Common use cases include:
Interactive analytics at scale.
Data lake analytics.
Multi-database queries.
Real-time analytics.

10
Q

What are some best practices for optimizing Presto performance?

A

Best practices include:
Ensuring data is formatted in efficient formats like Parquet or ORC.
Properly sizing Presto clusters based on data size and query complexity.
Using appropriate indexing and partitioning on data sources to speed up query processing.
Tuning memory settings for optimal performance.
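
As a sketch of the formatting and partitioning points, the DDL below creates a Parquet-backed, date-partitioned table through Presto's Hive connector; the schema, table, and column names are made up, and the exact table properties can vary by connector version:

  -- Hypothetical Parquet table, partitioned by ds (the partition column comes last).
  CREATE TABLE hive.analytics.events (
      event_id   bigint,
      event_type varchar,
      ds         varchar
  )
  WITH (
      format = 'PARQUET',
      partitioned_by = ARRAY['ds']
  );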

11
Q

What is Apache Hive?

A

Apache Hive is a data warehousing tool in the Hadoop ecosystem that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL-like queries.

12
Q

How does Apache Hive work?

A

Hive lets users write SQL-like queries in HiveQL (Hive Query Language, also called HQL), which are translated into MapReduce, Tez, or Spark jobs under the hood to process the data.

13
Q

What are the key components of Apache Hive’s architecture?

A

Key components include:
Hive Metastore: Stores metadata for Hive tables and partitions.
Hive Driver: Manages the lifecycle of a HiveQL statement.
Compiler: Parses HiveQL queries, optimizes them, and creates execution plans.
Execution Engine: Executes the tasks using MapReduce, Tez, or Spark.

14
Q

What is HiveQL?

A

HiveQL (Hive Query Language) is a SQL-like scripting language used with Hive to analyze data in the Hadoop ecosystem. It extends SQL for easy data summarization, query, and analysis.
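
A minimal HiveQL example; the web_logs table and its columns are hypothetical:

  -- Count requests per status code for one day.
  SELECT status_code, count(*) AS hits
  FROM web_logs
  WHERE log_date = '2024-01-01'
  GROUP BY status_code
  ORDER BY hits DESC;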

15
Q

How is data stored and managed in Hive?

A

Data in Hive is stored in tables, which can be either managed (internal) tables where Hive manages both data and schema, or external tables where Hive manages only the schema and the data is managed outside of Hive.
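
A short sketch of the difference, using made-up table names and an illustrative HDFS path:

  -- Managed (internal) table: Hive owns both the metadata and the data files;
  -- DROP TABLE deletes the data as well.
  CREATE TABLE page_views (user_id BIGINT, url STRING)
  STORED AS ORC;

  -- External table: Hive owns only the metadata; DROP TABLE leaves the files in place.
  CREATE EXTERNAL TABLE raw_events (event STRING)
  LOCATION '/data/raw/events';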

16
Q

What are partitions and buckets in Hive?

A

Partitions: Hive tables can be partitioned based on one or more keys to improve performance on large datasets.
Buckets: Data in Hive can be clustered into buckets based on a hash function of a column in a table, useful for efficient sampling and other operations.
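
Both can be combined in a single table definition; a sketch with hypothetical names:

  -- Partitioned by date and clustered into 32 buckets on user_id.
  CREATE TABLE events (
      user_id BIGINT,
      action  STRING
  )
  PARTITIONED BY (event_date STRING)
  CLUSTERED BY (user_id) INTO 32 BUCKETS
  STORED AS ORC;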

17
Q

What file formats does Hive support?

A

Hive supports several file formats, including:
TextFile (default)
SequenceFile
RCFile
ORC (Optimized Row Columnar)
Parquet
Avro
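
The format is chosen per table with a STORED AS clause; for example (table names are illustrative):

  CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
  CREATE TABLE logs_orc  (line STRING) STORED AS ORC;
  CREATE TABLE events_pq (id BIGINT, payload STRING) STORED AS PARQUET;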

18
Q

What are some use cases for Apache Hive?

A

Common use cases include:
Data warehousing and large-scale data processing.
Ad-hoc querying over large datasets.
Data mining tasks.
Log processing and analysis.

19
Q

What are some limitations of Apache Hive?

A

Limitations include:
Not designed for online transaction processing; it is optimized for batch processing.
Higher latency for Hive queries compared to traditional databases due to the overhead of MapReduce jobs.
Limited subquery support.

20
Q

How can you optimize Hive performance?

A

Optimization techniques include:
Using appropriate file formats like ORC to improve compression and I/O efficiency.
Implementing partitioning and bucketing to enhance data retrieval speeds.
Configuring Hive to use Tez or Spark instead of MapReduce as the execution engine to improve performance.
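
As a sketch, these are commonly adjusted session settings; the property names are standard Hive configuration keys, but appropriate values depend on the cluster and Hive version:

  SET hive.execution.engine=tez;               -- run queries on Tez instead of MapReduce
  SET hive.vectorized.execution.enabled=true;  -- process rows in batches (works best with ORC)
  SET hive.exec.parallel=true;                 -- run independent query stages in parallel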

21
Q

What is Hadoop Distributed File System (HDFS)?

A

HDFS is a distributed file system designed to run on commodity hardware. It has high fault tolerance and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications with large data sets.

22
Q

How does HDFS work?

A

HDFS stores large files across multiple machines. It operates by breaking down files into blocks (default size is 128 MB in Hadoop 2.x, 64 MB in earlier versions), which are stored on a cluster of machines. It manages the distribution and replication of these blocks to ensure redundancy and fault tolerance.

23
Q

What are the key components of HDFS architecture?

A

The two main components of HDFS are:
NameNode: The master server that manages the file system namespace and regulates access to files by clients.
DataNode: The worker nodes that manage storage attached to the nodes that they run on and serve read and write requests from the file system’s clients.

24
Q

What is the role of the NameNode in HDFS?

A

The NameNode manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree. This metadata is stored in memory, which allows the NameNode to rapidly respond to queries from clients.

25
Q

How does HDFS ensure data reliability and fault tolerance?

A

HDFS replicates blocks of data to multiple nodes in the cluster (default is three replicas across nodes). This replication ensures that even if one or more nodes go down, data is still available from other nodes.

26
Q

What is the role of DataNodes in HDFS?

A

DataNodes are responsible for serving read and write requests from the file system’s clients. They also perform block creation, deletion, and replication upon instruction from the NameNode.

27
Q

What is a block in HDFS?

A

A block in HDFS is a single unit of data, and it is the minimum amount of data that HDFS reads or writes. Blocks are distributed across multiple nodes to ensure reliability and fast data processing.

28
Q

What happens when a DataNode fails?

A

When a DataNode fails, the NameNode is responsible for detecting the failure. Based on the block replication policy, the NameNode will initiate replication of blocks stored on the failed DataNode to other DataNodes, thus maintaining the desired level of data redundancy.

29
Q

Can HDFS be accessed from outside the Hadoop ecosystem?

A

Yes, HDFS can be accessed in multiple ways, including through the Hadoop command line interface, the Java API, and over HTTP via the WebHDFS REST API. Tools like Apache Hive, HBase, and others built on top of Hadoop can also access HDFS data directly.

30
Q

What are the limitations of HDFS?

A

Limitations of HDFS include:
Not suitable for low-latency data access: HDFS is optimized for high throughput and high capacity storage.
Not well suited to small files: storing a large number of small files is inefficient because each file, directory, and block in HDFS is represented as an object in the NameNode’s memory.

31
Q

What is MapReduce?

A

MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. It consists of two main phases: the Map phase, which filters and sorts data, and the Reduce phase, which performs a summary operation.

32
Q

How does the MapReduce model work?

A

In the MapReduce model, the input data is divided into independent chunks that are processed by the map tasks in parallel. The framework then sorts the map outputs and passes them as input to the reduce tasks.

33
Q

What are the main components of a MapReduce job?

A

The main components include:
Mapper: Processes input data and generates intermediate key-value pairs.
Reducer: Aggregates intermediate data tuples into a smaller set of tuples or a unique result.
Combiner (optional): A mini-reducer that performs local aggregation to reduce the amount of data transferred from Mapper to Reducer.
Partitioner: Determines how the outputs of the map tasks are distributed to the reduce tasks.

34
Q

What are the roles of the Mapper and Reducer in MapReduce?

A

The Mapper reads the input data, processes it, and produces intermediate key-value pairs. The Reducer takes the output from a Mapper or multiple Mappers and combines those data tuples into a smaller set of tuples or a final result.

35
Q

Can you explain the concept of shuffling in MapReduce?

A

Shuffling in MapReduce is the process of transferring the Mapper output to the Reducers. It involves sorting and grouping the intermediate key-value pairs by keys, which ensures that all values associated with a particular key are sent to the same Reducer.

36
Q

What is a practical example of a MapReduce application?

A

A common example is word count, where the goal is to count the occurrences of each word in a large set of documents. The Mapper processes each document, emitting a key-value pair for each word with the count of 1, and the Reducer sums these counts for each word.
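
The classic implementation is written against the MapReduce API in Java; as a rough sketch in the query language used elsewhere in this deck, the same logic can be expressed in HiveQL over a hypothetical docs table with a single line column, which Hive compiles into a map stage (tokenizing lines) and a reduce stage (summing counts):

  -- Word count expressed in HiveQL (illustrative table and column names).
  SELECT word, count(*) AS occurrences
  FROM docs
  LATERAL VIEW explode(split(line, ' ')) tokens AS word
  GROUP BY word;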

37
Q

What are the advantages of using MapReduce?

A

Advantages include:
Scalability: Handles vast amounts of data by distributing the workload across many machines.
Fault-tolerance: Automatically handles failures by re-executing failed tasks.
Flexibility: Can process data stored in various formats and coming from different sources.

38
Q

What are the limitations of MapReduce?

A

Limitations include:
Not suitable for all types of tasks, especially those that are not easily divisible into independent units.
Overhead of managing inter-process communication.
High latency, making it unsuitable for real-time processing.

39
Q

How has the use of MapReduce evolved with the advent of new technologies?

A

With the development of newer data processing frameworks like Apache Spark and Apache Flink, the use of MapReduce has declined. These newer technologies offer better performance for iterative and real-time processing tasks.

40
Q

What is a Combiner in MapReduce, and when should it be used?

A

A Combiner function can be used to aggregate intermediate map output locally on each machine where the map task ran, which can significantly reduce the amount of data transferred across the network to the Reducers. It is effective for associative operations like sum or max.

41
Q

What is YARN in the context of Hadoop?

A

YARN (Yet Another Resource Negotiator) is the resource management layer of Apache Hadoop. It allows multiple data processing engines to work efficiently with data stored on the same platform, and it is responsible for managing computing resources in clusters and scheduling user applications onto them.

42
Q

What are the main components of YARN?

A

The main components of YARN include:
ResourceManager (RM): The master daemon of YARN that manages the use of resources across the cluster.
NodeManager (NM): The per-node agent that manages the user processes and resource usage on that node.
ApplicationMaster (AM): Per-application component that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the tasks.

43
Q

How does YARN improve the Hadoop computational capabilities?

A

YARN enhances Hadoop by separating the resource management capabilities from the data processing capabilities, which allows Hadoop to support more varied processing approaches and a broader array of applications. YARN provides better cluster utilization and supports multiple processing models rather than being limited to MapReduce.

44
Q

What is the role of the ResourceManager in YARN?

A

The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. It has two main components:
Scheduler: Responsible for allocating resources to various running applications subject to constraints of capacities, queues, etc.
ApplicationsManager: Accepts job submissions, negotiates the first container for each application’s ApplicationMaster, and restarts the ApplicationMaster on failure.

45
Q

What is the role of the NodeManager?

A

The NodeManager is a per-machine framework agent responsible for containers: it launches them, monitors their resource usage (CPU, memory, disk, network), and reports that usage to the ResourceManager/Scheduler.

46
Q

What are Containers in YARN?

A

In YARN, a container represents a collection of physical resource capabilities (such as memory and CPU) on a single node. The NodeManager oversees containers and manages user processes in these containers.

47
Q

How does YARN handle application failures?

A

The ApplicationMaster has the responsibility for negotiating appropriate resource containers from the Scheduler, tracking their status, and adjusting its requests based on job needs. If an ApplicationMaster fails, it can be restarted by the ResourceManager.

48
Q

What are the benefits of using YARN in a big data environment?

A

Benefits include improved cluster utilization, greater scalability, support for various data processing engines (like MapReduce, Spark, and Tez), and flexibility in managing distributed applications.

49
Q

How does YARN support different data processing engines?

A

YARN is designed to be agnostic to the specific processing framework, providing a common infrastructure to run diverse big data tools. This decouples Hadoop job management and resource management from the specific data processing frameworks, allowing other services beyond MapReduce to use Hadoop resources.

50
Q

How can administrators manage and monitor a YARN cluster?

A

Administrators can manage and monitor YARN clusters using the YARN ResourceManager web interface, which provides information about the current state of the cluster, applications, and resources. CLI tools like yarn application -list and yarn node -list also provide useful operational insights.

51
Q

What is Apache Hive?

A

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

52
Q

What is the role of the Metastore in Hive architecture?

A

The Metastore stores metadata about Hive tables, including their schema, location, and partitions, which is crucial for query planning and execution.

53
Q

Name three execution engines supported by Apache Hive.

A

MapReduce, Tez, and Spark are three execution engines supported by Apache Hive.

54
Q

What is the primary storage layer for data processed by Hive?

A

The primary storage layer for data processed by Hive is the Hadoop Distributed File System (HDFS).

55
Q

What query language does Hive use?

A

Hive uses HiveQL, a SQL-like query language, for querying and analyzing data stored in Hive.

56
Q

What is the purpose of HiveServer2 in Hive architecture?

A

HiveServer2 is a service that enables remote clients to execute queries against Hive, handling client connections, query parsing, optimization, and execution.

57
Q

Which component of Hive architecture provides a RESTful interface for submitting and monitoring Hive jobs programmatically?

A

WebHCat (formerly Templeton) provides a RESTful interface for submitting and monitoring Hive jobs programmatically.

58
Q

Name two commonly used storage formats supported by Hive.

A

ORC (Optimized Row Columnar) and Parquet are two commonly used storage formats supported by Hive.

59
Q

What does MPP stand for in the context of data warehousing?

A

MPP stands for Massively Parallel Processing.

60
Q

Describe the primary characteristic of MPP data warehouses.

A

MPP data warehouses distribute processing tasks across multiple nodes in a cluster, enabling parallel processing for improved performance and scalability.

61
Q

How does MPP architecture differ from SMP (Symmetric Multiprocessing) architecture?

A

MPP architecture uses multiple nodes with independent processing power and memory, while SMP architecture relies on shared resources within a single server.

62
Q

What are the benefits of using an MPP data warehouse?

A

Benefits include improved performance, scalability to handle large data volumes, and the ability to support complex analytical workloads.

63
Q

Name a popular MPP data warehouse solution.

A

Amazon Redshift, Google BigQuery, Snowflake, and Teradata are examples of popular MPP data warehouse solutions.

64
Q

How does MPP architecture contribute to query performance?

A

MPP architecture divides queries into smaller tasks that can be processed in parallel across multiple nodes, leading to faster query execution times.

65
Q

What is the role of the master node in an MPP data warehouse cluster?

A

The master node in an MPP data warehouse cluster coordinates query execution, distributes data and tasks among compute nodes, and manages cluster resources.

66
Q

How does MPP architecture support scalability?

A

MPP data warehouses can easily scale out by adding more compute nodes to the cluster, allowing organizations to accommodate growing data volumes and user demands.

67
Q

What is Apache Hive?

A

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

68
Q

What is Apache Beeline?

A

Beeline is a command-line interface (CLI) for running Hive queries. It ships with Hive, connects to HiveServer2 over JDBC, and offers improved functionality and compatibility compared to the traditional Hive CLI.

69
Q

How does Hive differ from Beeline?

A

Hive is a data warehouse infrastructure that includes various components like Metastore, execution engines, and query language, while Beeline is a specific CLI tool designed for interacting with Hive.

70
Q

What are some advantages of using Beeline over the traditional Hive CLI?

A

Beeline offers features like better JDBC compatibility, support for multiple authentication mechanisms, improved scriptability, and more robust error handling compared to the traditional Hive CLI.

71
Q

In what scenarios might you choose to use Hive instead of Beeline?

A

You might choose to use Hive directly when you need access to the full set of Hive functionalities, including interacting with the Metastore, managing resources, or integrating with other components in the Hive ecosystem.

72
Q

How does Beeline enhance JDBC compatibility?

A

Beeline connects to HiveServer2 through Hive’s standard JDBC driver, so the same connection strings, authentication options, and tooling that work with other JDBC clients also work with Beeline.

73
Q

Can Beeline be used for non-interactive scripting?

A

Yes, Beeline supports non-interactive scripting mode, allowing users to run Hive queries from scripts or automate tasks.

74
Q

What is static partitioning in Apache Hive?

A

Static partitioning in Hive involves manually specifying the partitioning columns and their values while inserting data into a partitioned table. Partitions are predefined and do not change unless explicitly altered.
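
A sketch of a static-partition insert, with made-up table and column names:

  -- The partition values are fixed in the statement itself.
  INSERT INTO TABLE sales PARTITION (country='US', sale_date='2024-01-01')
  SELECT order_id, amount
  FROM staging_sales
  WHERE country = 'US' AND sale_date = '2024-01-01';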

75
Q

What is dynamic partitioning in Apache Hive?

A

Dynamic partitioning in Hive involves automatically determining partition values based on the data being inserted into a partitioned table. Hive dynamically generates partitions based on the values of specified partition columns.
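
A sketch of a dynamic-partition insert, with made-up names; the two SET statements that enable the feature use standard Hive configuration keys:

  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  -- Partition values are taken from the trailing columns of the SELECT, in partition order.
  INSERT INTO TABLE sales PARTITION (country, sale_date)
  SELECT order_id, amount, country, sale_date
  FROM staging_sales;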

76
Q

How does static partitioning differ from dynamic partitioning?

A

Static partitioning requires users to specify partition values explicitly during data insertion, while dynamic partitioning automatically generates partitions based on data values without the need for explicit partition specification.

77
Q

What are the benefits of static partitioning?

A

Static partitioning can be more efficient for managing predefined partitioning schemes and allows users to control partitioning based on specific requirements or business rules.

78
Q

What are the benefits of dynamic partitioning?

A

Dynamic partitioning simplifies the data insertion process by automatically creating partitions based on the values present in the data, reducing manual effort and making it easier to handle large datasets with varying partition values.

79
Q

What is bucketing in Apache Hive?

A

Bucketing in Hive involves dividing data into a fixed number of buckets based on the hash value of a specified column. It’s used to evenly distribute data across buckets for efficient querying and sampling.
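
A sketch with hypothetical names, including a TABLESAMPLE query that reads only one of the 32 buckets:

  CREATE TABLE users_bucketed (user_id BIGINT, name STRING)
  CLUSTERED BY (user_id) INTO 32 BUCKETS
  STORED AS ORC;

  -- Efficient sampling: scan roughly 1/32 of the data.
  SELECT * FROM users_bucketed
  TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);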

80
Q

How does bucketing differ from partitioning?

A

Partitioning divides data into separate directories or subdirectories based on partition keys, while bucketing divides data into a fixed number of files based on the hash value of a specified column, regardless of partition keys.

81
Q

In which scenarios might you choose to use bucketing instead of partitioning?

A

Bucketing is useful for evenly distributing data across a fixed number of files, which can improve query performance, especially for join operations and sampling, compared to partitioning.