4. Hadoop-Related Projects Flashcards
Hive was originally developed by:
A) Google
B) Facebook
C) Apache Software Foundation
D) IBM
B) Facebook
Which of the following is NOT a characteristic of Hive?
A) It supports real-time data processing
B) It uses a SQL-like language called HiveQL
C) It is built on top of Hadoop
D) It is used for data warehousing
A) It supports real-time data processing
What is the main purpose of Spark?
A) To provide a more efficient alternative to MapReduce
B) To support online transaction processing
C) To manage Hadoop clusters
D) To store large datasets
A) To provide a more efficient alternative to MapReduce
Resilient Distributed Datasets (RDDs) in Spark are:
A) Mutable collections of data items
B) Fault-tolerant and can be operated on in parallel
C) Stored on disk by default
D) Only accessible in Scala
B) Fault-tolerant and can be operated on in parallel
Which of the following is an advantage of Spark over MapReduce?
A) Spark cannot handle large datasets
B) Spark writes intermediate results to disk
C) Spark can cache intermediate results in memory
D) Spark supports only batch processing
C) Spark can cache intermediate results in memory
In which language was Spark originally developed?
A) Java
B) Python
C) R
D) Scala
D) Scala
In early versions of Hive (before 0.14), which of the following was a limitation of HiveQL compared to ANSI SQL?
A) It supported “insert into” for existing tables
B) It did not support the equality operator in join predicates
C) It did not support “update” or “delete” operations
D) It was fully ANSI-compliant
C) It did not support “update” or “delete” operations
Spark’s ability to cache intermediate results in memory is particularly useful for:
A) Online transaction processing
B) Iterative algorithms
C) Long-term data storage
D) Reducing network traffic
B) Iterative algorithms
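To see why caching helps iterative algorithms, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: `ToyDataset`, `source_reads`, and `iterative_mean` are hypothetical names made up for illustration. It only counts how often the data is re-read from its source when a loop scans the same dataset repeatedly.

```python
# Toy model (not Spark's API): count source reads with and without a cache.

class ToyDataset:
    """Hypothetical stand-in for a dataset backed by slow storage."""

    def __init__(self, values):
        self._values = list(values)
        self._cached = None      # in-memory copy, filled on first read if enabled
        self.use_cache = False
        self.source_reads = 0    # how many times we went back to "storage"

    def cache(self):
        """Ask for the data to be kept in memory after the first read."""
        self.use_cache = True
        return self

    def read(self):
        if self.use_cache and self._cached is not None:
            return self._cached  # served from memory, no source read
        self.source_reads += 1   # simulate an expensive read from storage
        data = list(self._values)
        if self.use_cache:
            self._cached = data
        return data


def iterative_mean(dataset, iterations):
    """Gradient-descent-style loop: every iteration scans the full dataset."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.read()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= 0.5 * gradient
    return estimate


uncached = ToyDataset([1.0, 2.0, 3.0, 4.0])
cached = ToyDataset([1.0, 2.0, 3.0, 4.0]).cache()

result_uncached = iterative_mean(uncached, iterations=10)
result_cached = iterative_mean(cached, iterations=10)

print(uncached.source_reads)  # 10: one source read per iteration
print(cached.source_reads)    # 1: read once, then served from memory
```

Both runs produce the same estimate; only the number of trips to the source differs, which is the effect the card describes.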
In Hive, which command is used to load data files from a file system path into a table?
A) INSERT INTO
B) LOAD DATA INPATH
C) UPDATE TABLE
D) SET DATA
B) LOAD DATA INPATH
Which of the following is NOT a feature of Spark’s RDDs?
A) They are mutable
B) They are distributed across the cluster
C) They are resilient
D) They can be cached in memory
A) They are mutable
Which of the following operations is an action in Spark?
A) map()
B) filter()
C) reduce()
D) flatMap()
C) reduce()
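The difference between transformations and actions can be sketched with a small single-machine model. This is an assumption-laden illustration, not Spark's implementation: `ToyRDD` is a hypothetical class, and only `map`, `filter`, `reduce`, and `count` are modeled. Transformations just record a step and return a new object; an action runs the recorded steps and returns a value.

```python
from functools import reduce as functools_reduce


class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # callable returning an iterable
        self._steps = tuple(steps)   # recorded transformations, not yet run

    # Transformations: return a NEW ToyRDD; nothing is computed yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _compute(self):
        data = iter(self._source())
        for kind, fn in self._steps:
            # Inside a method, map/filter refer to the Python built-ins.
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: run the recorded steps and return a plain value.
    def reduce(self, fn):
        return functools_reduce(fn, self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


evaluations = []

def square(x):
    evaluations.append(x)   # side effect so we can see when work happens
    return x * x

numbers = ToyRDD(lambda: range(1, 6))
squares = numbers.map(square)                        # transformation: lazy
even_squares = squares.filter(lambda x: x % 2 == 0)  # transformation: lazy

print(len(evaluations))   # 0: no work has been done yet
total = even_squares.reduce(lambda a, b: a + b)      # action: triggers the work
print(total)              # 20 (4 + 16)
print(len(evaluations))   # 5
```

Note that `map` and `filter` never modify the object they are called on, which also reflects the immutability of RDDs.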
HiveQL is designed primarily for which of the following?
A) Real-time processing
B) Transactional updates
C) Ad-hoc querying
D) In-memory computations
C) Ad-hoc querying
In Spark, an RDD can be created from:
A) Only HDFS files
B) Only local files
C) Both HDFS files and local files
D) Neither HDFS files nor local files
C) Both HDFS files and local files
Which of the following was a limitation of HiveQL in versions before Hive 0.8?
A) It did not support JOIN operations
B) It could not handle large datasets
C) It did not support “insert into” for existing tables
D) It required data to be structured
C) It did not support “insert into” for existing tables
Spark’s ability to cache data in memory is beneficial for:
A) Long-term data storage
B) Real-time transaction processing
C) Iterative algorithms
D) Disk-based data processing
C) Iterative algorithms
Which of the following is a transformation in Spark?
A) count()
B) saveAsTextFile()
C) groupByKey()
D) take()
C) groupByKey()
HiveQL’s “insert overwrite” command:
A) Appends data to an existing table
B) Deletes the existing data before inserting new data
C) Updates existing data with new data
D) Inserts data without affecting existing data
B) Deletes the existing data before inserting new data
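The overwrite semantics can be contrasted with appending in a few lines of Python. This is only a toy model of a table as a list of rows: `insert_overwrite` and `insert_into` are hypothetical helper functions, not HiveQL, and the rows are made up for illustration.

```python
# Toy model of a table as a list of rows (hypothetical helpers, not HiveQL).

def insert_overwrite(table, new_rows):
    """Discard whatever the table held, then store only the new rows."""
    table.clear()
    table.extend(new_rows)


def insert_into(table, new_rows):
    """Keep the existing rows and append the new ones."""
    table.extend(new_rows)


# Made-up example rows: (date, amount)
sales = [("2023-01-01", 100), ("2023-01-02", 150)]

insert_into(sales, [("2023-01-03", 90)])
print(len(sales))  # 3: the original two rows plus the appended one

insert_overwrite(sales, [("2023-02-01", 200)])
print(sales)       # [('2023-02-01', 200)]: the earlier rows are gone
```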
Which of the following accurately describes Spark’s RDD lineage?
A) A record of all actions performed on an RDD
B) A history of transformations applied to an RDD
C) The distribution of an RDD across the cluster
D) The sequence of RDDs created during a Spark job
B) A history of transformations applied to an RDD
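Lineage is what makes recovery possible without replicating the data: a lost partition can be rebuilt by replaying the recorded transformations from the source. The sketch below is a simplified model under stated assumptions, not Spark's internals. `LineageRDD`, `transform`, and `compute_partition` are hypothetical names, and the source partitions are assumed to be always re-readable.

```python
class LineageRDD:
    """Hypothetical toy RDD that remembers how it was derived (not Spark's API)."""

    def __init__(self, partitions=None, parent=None, name="source", fn=None):
        self.parent = parent            # RDD this one was derived from, if any
        self.name = name                # label of the transformation
        self.fn = fn                    # per-partition function (None for the source)
        self._partitions = partitions   # only the source holds actual data

    def transform(self, name, fn):
        """Record a transformation; returns a new RDD pointing at its parent."""
        return LineageRDD(parent=self, name=name, fn=fn)

    def lineage(self):
        """Transformation labels from the source down to this RDD."""
        chain = [] if self.parent is None else self.parent.lineage()
        return chain + [self.name]

    def compute_partition(self, index):
        """Rebuild one partition by replaying the lineage from the source."""
        if self.parent is None:
            return list(self._partitions[index])
        return self.fn(self.parent.compute_partition(index))


source = LineageRDD(partitions=[[1, 2, 3], [4, 5, 6]])
doubled = source.transform("map(x * 2)", lambda part: [x * 2 for x in part])
large = doubled.transform("filter(x > 4)", lambda part: [x for x in part if x > 4])

print(large.lineage())             # ['source', 'map(x * 2)', 'filter(x > 4)']

# If partition 1 were lost, only that partition needs to be recomputed.
print(large.compute_partition(1))  # [8, 10, 12]
print(large.compute_partition(0))  # [6]
```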
Hive is primarily used for:
A) Online transaction processing
B) Real-time data analysis
C) Data warehousing and batch processing
D) In-memory data processing
C) Data warehousing and batch processing
The main advantage of using Spark over MapReduce is:
A) Spark’s support for SQL queries
B) Spark’s ability to process real-time data
C) Spark’s faster data processing due to in-memory computation
D) Spark’s compatibility with Hadoop’s HDFS
C) Spark’s faster data processing due to in-memory computation
Which of the following is NOT one of the basic ways to create an RDD in Spark?
A) From an existing RDD
B) From a local file system
C) From a remote database
D) From an HDFS file
C) From a remote database
In Hive versions before 2.2.0, HiveQL’s support for “join” operations:
A) Is limited to equality joins
B) Excludes full outer joins
C) Allows for non-equi joins
D) Is not available in Hive
A) Is limited to equality joins
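One way to see why equality joins fit a MapReduce-style engine: an equality predicate gives every record a single key to be routed by, so matching rows from both tables meet at the same reducer. A predicate such as `a.x < b.y` offers no such key. The following is a minimal sketch of a reduce-side inner join on in-memory lists. It is an illustration of the idea, not Hive's actual execution plan, and `reduce_side_equi_join` is a hypothetical function name.

```python
from collections import defaultdict


def reduce_side_equi_join(left, right):
    """Inner-join two lists of (key, value) pairs on key equality.

    Map/shuffle: group records from both sides by their join key.
    Reduce: for each key, pair every left value with every right value.
    """
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)   # records from the left table
    for key, value in right:
        groups[key][1].append(value)   # records from the right table

    joined = []
    for key in sorted(groups):
        left_values, right_values = groups[key]
        for lv in left_values:
            for rv in right_values:
                joined.append((key, lv, rv))
    return joined


# Made-up example data: (user_id, name) and (user_id, item)
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "mug")]

print(reduce_side_equi_join(users, orders))
# [(1, 'ann', 'book'), (1, 'ann', 'pen'), (3, 'cy', 'lamp')]
```

Keys that appear on only one side (user 2, order key 4) produce no output, which is the inner-join behavior.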