4. Hadoop Related Projects Flashcards

Question 1

Q

Hive was originally developed by:
A) Google
B) Facebook
C) Apache Software Foundation
D) IBM

Answer

A

B) Facebook

Question 2

Q

Which of the following is NOT a characteristic of Hive?
A) It supports real-time data processing
B) It uses a SQL-like language called HiveQL
C) It is built on top of Hadoop
D) It is used for data warehousing

Answer

A

A) It supports real-time data processing

Question 3

Q

What is the main purpose of Spark?
A) To provide a more efficient alternative to MapReduce
B) To support online transaction processing
C) To manage Hadoop clusters
D) To store large datasets

Answer

A

A) To provide a more efficient alternative to MapReduce

Question 4

Q

Resilient Distributed Datasets (RDDs) in Spark are:
A) Mutable collections of data items
B) Fault-tolerant and can be operated on in parallel
C) Stored on disk by default
D) Only accessible in Scala

Answer

A

B) Fault-tolerant and can be operated on in parallel

Question 5

Q

Which of the following is an advantage of Spark over MapReduce?
A) Spark cannot handle large datasets
B) Spark writes intermediate results to disk
C) Spark can cache intermediate results in memory
D) Spark supports only batch processing

Answer

A

C) Spark can cache intermediate results in memory

Question 6

Q

In which language was Spark originally developed?
A) Java
B) Python
C) R
D) Scala

Answer

A

D) Scala

Question 7

Q

Which of the following was a limitation of HiveQL in early Hive releases compared to ANSI SQL?
A) It supports “insert into” for existing tables
B) It does not support the equality operator in join predicates
C) It does not support “update” or “delete” operations
D) It is fully ANSI-compliant

Answer

A

C) It does not support “update” or “delete” operations

Question 8

Q

Spark’s ability to cache intermediate results in memory is particularly useful for:
A) Online transaction processing
B) Iterative algorithms
C) Long-term data storage
D) Reducing network traffic

Answer

A

B) Iterative algorithms
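
Note: the sketch below is plain Python, not Spark's API, and is only meant to illustrate why keeping data in memory helps iterative work. `CountingSource`, `iterate_without_cache`, and `iterate_with_cache` are hypothetical names; the source object stands in for an expensive read from disk.

```python
class CountingSource:
    """Hypothetical stand-in for an expensive input read (e.g. from disk)."""

    def __init__(self, values):
        self.values = list(values)
        self.loads = 0  # how many times the input was read

    def load(self):
        self.loads += 1
        return list(self.values)


def iterate_without_cache(source, steps):
    # Each iteration re-reads the input, as a chain of separate jobs would.
    total = 0
    for _ in range(steps):
        total += sum(source.load())
    return total


def iterate_with_cache(source, steps):
    # Read once, keep the data in memory, and reuse it in every iteration.
    data = source.load()
    total = 0
    for _ in range(steps):
        total += sum(data)
    return total
```

Both functions return the same total; the difference is how often the input is read, which is the cost that in-memory caching avoids.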

Question 9

Q

In Hive, which command is used to load data into a table?
A) INSERT INTO
B) LOAD DATA INPATH
C) UPDATE TABLE
D) SET DATA

Answer

A

B) LOAD DATA INPATH

Question 10

Q

Which of the following is NOT a feature of Spark’s RDDs?
A) They are mutable
B) They are distributed across the cluster
C) They are resilient
D) They can be cached in memory

Answer

A

A) They are mutable

Question 11

Q

Which of the following operations is an action in Spark?
A) map()
B) filter()
C) reduce()
D) flatMap()

Answer

A

C) reduce()
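
Note: a toy model in plain Python may help separate the two kinds of operation. `ToyRDD` is a hypothetical class, not Spark's RDD; it assumes only that transformations record work to do later, while an action runs that work and returns an ordinary value.

```python
from functools import reduce as fold


class ToyRDD:
    """Toy model of lazy evaluation; not Spark's actual RDD class."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # pending transformations, in order

    def map(self, fn):
        # Transformation: returns a new ToyRDD and computes nothing yet.
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, also returns a new ToyRDD.
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _compute(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def reduce(self, fn):
        # Action: runs the pending steps and returns a plain value.
        return fold(fn, self._compute())
```

In this model, calling `map` or `filter` never touches the data; only `reduce` does.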

Question 12

Q

HiveQL supports which of the following operations?
A) Real-time processing
B) Transactional updates
C) Ad-hoc querying
D) In-memory computations

Answer

A

C) Ad-hoc querying

Question 13

Q

In Spark, an RDD can be created from:
A) Only HDFS files
B) Only local files
C) Both HDFS files and local files
D) Neither HDFS files nor local files

Answer

A

C) Both HDFS files and local files

Question 14

Q

Which of the following was a limitation of HiveQL in early Hive releases?
A) It does not support JOIN operations
B) It cannot handle large datasets
C) It does not support “insert into” for existing tables
D) It requires data to be structured

Answer

A

C) It does not support “insert into” for existing tables

Question 15

Q

Spark’s ability to cache data in memory is beneficial for:
A) Long-term data storage
B) Real-time transaction processing
C) Iterative algorithms
D) Disk-based data processing

Answer

A

C) Iterative algorithms

Question 16

Q

Which of the following is a transformation in Spark?
A) count()
B) saveAsTextFile()
C) groupByKey()
D) take()

Answer

A

C) groupByKey()
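
Note: the sketch below is plain Python and only shows the shape of the result a group-by-key step produces: one list of values per key. The function name `group_by_key` is hypothetical. In Spark itself the operation is a transformation, so it is lazy and yields a new RDD rather than a dictionary.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect the values of (key, value) pairs into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)
```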

Question 17

Q

HiveQL’s “insert overwrite” command:
A) Appends data to an existing table
B) Deletes the existing data before inserting new data
C) Updates existing data with new data
D) Inserts data without affecting existing data

Answer

A

B) Deletes the existing data before inserting new data
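
Note: the difference between appending and overwriting can be modelled in a few lines of plain Python. This is not HiveQL; `insert_into` and `insert_overwrite` are hypothetical helpers, a table is assumed to be a simple list of rows, and partitioned tables are ignored.

```python
def insert_into(table, new_rows):
    """Append semantics: existing rows are kept, new rows are added."""
    return list(table) + list(new_rows)


def insert_overwrite(table, new_rows):
    """Overwrite semantics: existing rows are discarded, only new rows remain."""
    return list(new_rows)
```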

Question 18

Q

Which of the following accurately describes Spark’s RDD lineage?
A) A record of all actions performed on an RDD
B) A history of transformations applied to an RDD
C) The distribution of an RDD across the cluster
D) The sequence of RDDs created during a Spark job

Answer

A

B) A history of transformations applied to an RDD
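
Note: lineage is what lets lost data be rebuilt instead of replicated. The toy class below is plain Python, not Spark's implementation; `LineageRDD` and its methods are hypothetical names. It assumes the base data is still available and that each transformation is a pure function.

```python
class LineageRDD:
    """Toy model: remembers how it was derived, so results can be rebuilt."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)   # immutable base data
        self.lineage = tuple(lineage)  # ordered (name, function) pairs
        self._cached = None

    def map(self, name, fn):
        # Never modifies this object; returns a new one with a longer lineage.
        return LineageRDD(self._source, self.lineage + ((name, fn),))

    def collect(self):
        if self._cached is None:
            # Replay the recorded transformations against the base data.
            items = list(self._source)
            for _name, fn in self.lineage:
                items = [fn(x) for x in items]
            self._cached = items
        return list(self._cached)

    def lose_cached_data(self):
        # Simulates losing computed results, e.g. when a node fails.
        self._cached = None

    def describe(self):
        return [name for name, _fn in self.lineage]
```

After `lose_cached_data()`, a second `collect()` replays the recorded steps and returns the same values.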

Question 19

Q

Hive is primarily used for:
A) Online transaction processing
B) Real-time data analysis
C) Data warehousing and batch processing
D) In-memory data processing

Answer

A

C) Data warehousing and batch processing

Question 20

Q

The main advantage of using Spark over MapReduce is:
A) Spark’s support for SQL queries
B) Spark’s ability to process real-time data
C) Spark’s faster data processing due to in-memory computation
D) Spark’s compatibility with Hadoop’s HDFS

Answer

A

C) Spark’s faster data processing due to in-memory computation

Question 21

Q

Which of the following is NOT a way to create an RDD in Spark?
A) From an existing RDD
B) From a local file system
C) From a remote database
D) From an HDFS file

Answer

A

C) From a remote database

Question 22

Q

HiveQL’s support for “join” operations:
A) Is limited to equality joins
B) Includes support for full outer joins
C) Allows for non-equi joins
D) Is not available in Hive

Answer

A

A) Is limited to equality joins
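
Note: one likely reason equality joins suit a MapReduce-style engine is that rows with the same key can be routed to the same place and matched there. The plain-Python hash join below sketches that idea; `equi_join` is a hypothetical name and the code is not Hive's join implementation.

```python
from collections import defaultdict


def equi_join(left, right):
    """Inner-join two lists of (key, value) pairs on equal keys."""
    by_key = defaultdict(list)
    for key, value in right:
        by_key[key].append(value)  # bucket the right side by join key
    return [
        (key, left_value, right_value)
        for key, left_value in left
        for right_value in by_key.get(key, [])
    ]
```

A predicate such as `a.x < b.y` cannot be answered by a single key lookup like this, which is why non-equi joins need a different strategy.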

(22 cards)