Week 5 - practice quiz Flashcards
Which company created the MapReduce framework as a concept?
1) Amazon
2) Oracle
3) Microsoft
4) Google
4) Google
Which company implemented Hadoop as an open-source version of MapReduce?
1) Google
2) Amazon
3) Microsoft
4) Yahoo
4) Yahoo
Which of the following is true about the Hadoop file system?
1) Files are append-only
2) Files are split into 1 GB blocks
3) Meta node stores metadata
4) Each node stores distinct data blocks
1) Files are append-only
What does HDFS stand for?
1) Highly Distributed File System
2) Highly Disturbed File System
3) High Definition File System
4) Hadoop File System
4) Hadoop File System
Hadoop Distributed File System
What is the data type used by Hadoop for a MapReduce process?
1) Column-based
2) Document-based
3) Graph-based
4) Key-value
4) Key-value
What is the output of the Map function in a MapReduce process?
1) List of graph nodes
2) List of key-value pairs.
3) List of table columns
4) List of network nodes
2) List of key-value pairs.
Where do mapper nodes save their outputs before serving to reducer nodes?
1) Local disk
2) Another node
3) Central node
4) Master node
1) Local disk
What does Hadoop do with a task that crashes in a node?
1) The task is retried on another node.
2) The node is rebooted.
3) The task is failed.
4) The node is shut down.
1) The task is retried on another node.
Apache Spark sorts its data processing operations, such as collect, filter, and sort, by building a graph called DAG. What does DAG stand for?
1) Derived Apache Graph
2) Distributed Apache Graph
3) Directed Acyclic Graph
4) Distributed Asymmetric Graph
3) Directed Acyclic Graph
Which of the following statements about the difference between Hadoop and Spark is true?
1) Hadoop supports in-memory cluster computing.
2) Hadoop is faster than Spark.
3) Both Hadoop and Spark can load data from Hadoop File System (HDFS)
4) Hadoop provides multiple built-in data processing operations such as filter and join.
3) Both Hadoop and Spark can load data from Hadoop File System (HDFS)
What is the input for the Reduce function in a MapReduce process?
1) Keys and their corresponding list of values.
2) Keys and their corresponding maps.
3) Keys and their corresponding nodes.
4) Maps and their corresponding values.
1) Keys and their corresponding list of values.
What is the output of the Reduce function in a MapReduce process?
1) List of key-value pairs
2) List of key-node pairs.
3) List of key-reducer pairs.
4) List of key-mapper pairs.
1) List of key-value pairs
Which of the following is the correct sequence of phases in a MapReduce process?
1) Input, Splitting, Shuffling, Mapping, Reducing, Output
2) Input, Splitting, Mapping, Reducing, Shuffling, Output
3) Input, Splitting, Mapping, Shuffling, Reducing, Output
4) Input, Mapping, Splitting, Shuffling, Reducing, Output
3) Input, Splitting, Mapping, Shuffling, Reducing, Output
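The phase order in this card can be traced with a small word-count sketch. This is a single-process illustration in plain Python, not Hadoop code; the function names (`split_input`, `map_phase`, `shuffle_phase`, `reduce_phase`) and the sample text are assumptions made for the example.

```python
from collections import defaultdict

def split_input(text, num_splits):
    """Splitting: divide the input lines into num_splits chunks."""
    lines = text.splitlines()
    return [lines[i::num_splits] for i in range(num_splits)]

def map_phase(chunk):
    """Mapping: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle_phase(mapped_chunks):
    """Shuffling: group values by key across every mapper's output."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducing: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(text, num_splits=2):
    chunks = split_input(text, num_splits)      # Input -> Splitting
    mapped = [map_phase(c) for c in chunks]     # Mapping
    grouped = shuffle_phase(mapped)             # Shuffling
    return reduce_phase(grouped)                # Reducing -> Output

counts = word_count("deer bear river\ncar car river\ndeer car bear")
```

Note that the reduce step receives each key with its list of values, and both the map and reduce steps produce key-value pairs, which matches the earlier cards.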
What does Hadoop do with a task that repeatedly crashes in a MapReduce system?
1) The task is failed.
2) The task is retried on another system.
3) The system is rebooted.
4) The system is shut down.
1) The task is failed.
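The two failure cards (retry a crashed task on another node, fail a task that keeps crashing) can be sketched as a bounded retry loop. This is a toy model, not Hadoop's scheduler; `run_with_retries`, the node names, and the limit of 4 attempts are illustrative assumptions (the real limit is configurable).

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Try the task on successive nodes; give up after max_attempts."""
    failures = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # move on to a different node
        try:
            return task(node)
        except Exception as exc:
            failures.append((node, str(exc)))
    # Repeated crashes: the task itself is marked as failed.
    raise RuntimeError(f"task failed after {max_attempts} attempts: {failures}")

def flaky_task(node):
    """Hypothetical task that crashes only on one bad node."""
    if node == "node-a":
        raise IOError("disk error")
    return f"done on {node}"

def broken_task(node):
    """Hypothetical task that crashes everywhere (e.g. a bad record)."""
    raise ValueError("bad record")

cluster = ["node-a", "node-b", "node-c"]
result = run_with_retries(flaky_task, cluster)

try:
    run_with_retries(broken_task, cluster)
    final_status = "succeeded"
except RuntimeError:
    final_status = "failed"
```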
What does Hadoop do when a node crashes during a MapReduce process?
1) Ignores all of the maps created on all of the nodes.
2) Ignores all of the maps created on the node crashed.
3) Re-launches any maps the node previously ran.
4) Re-launches any maps all of the nodes previously ran.
3) Re-launches any maps the node previously ran.
Which of the following data operators requires implementation of a reduce function in a MapReduce process?
1) GROUP BY
2) SELECT
3) PROJECT
4) SORT
1) GROUP BY
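A rough sketch of why GROUP BY needs a reduce step while a SELECT-style filter does not: the filter looks at one record at a time, but an aggregate per group needs all values for a key brought together. The record layout (`region`, `amount`) and function names are hypothetical.

```python
from collections import defaultdict

def select_map_only(records, min_amount):
    """SELECT-style filter: each record is handled alone, so no reduce."""
    return [r for r in records if r["amount"] >= min_amount]

def map_sale(record):
    """Map: emit the grouping column as key and the measure as value."""
    return (record["region"], record["amount"])

def reduce_sum(key, values):
    """Reduce: aggregate every value that shares the same key."""
    return (key, sum(values))

def group_by_sum(records):
    grouped = defaultdict(list)
    for key, value in map(map_sale, records):
        grouped[key].append(value)  # stands in for the shuffle
    return sorted(reduce_sum(k, vs) for k, vs in grouped.items())

sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 7},
    {"region": "east", "amount": 5},
]
totals = group_by_sum(sales)
large = select_map_only(sales, 7)
```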
What is the output of a JOIN operation in a MapReduce process?
1) Key-column pairs
2) Key-node pairs
3) Key-map pairs
4) Key-value pairs
4) Key-value pairs
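One common way to express a join in MapReduce is a reduce-side join: mappers tag each record with its source table and emit the join key, and the reducer pairs up records that share a key. The sketch below assumes two hypothetical inputs (`users`, `orders`) and is not taken from the quiz material.

```python
from collections import defaultdict

def map_user(user):
    """Emit (join key, tagged value) for a user record."""
    return (user["id"], ("user", user["name"]))

def map_order(order):
    """Emit (join key, tagged value) for an order record."""
    return (order["user_id"], ("order", order["item"]))

def reduce_join(key, tagged_values):
    """Pair every user value with every order value for this key."""
    names = [v for tag, v in tagged_values if tag == "user"]
    items = [v for tag, v in tagged_values if tag == "order"]
    return [(key, (name, item)) for name in names for item in items]

def join(users, orders):
    grouped = defaultdict(list)
    for key, value in map(map_user, users):
        grouped[key].append(value)
    for key, value in map(map_order, orders):
        grouped[key].append(value)
    result = []
    for key in sorted(grouped):
        result.extend(reduce_join(key, grouped[key]))
    return result

users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "cup"},
]
joined = join(users, orders)
```

The output is again a list of key-value pairs, where each value holds the matched fields from both sides.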
What is Apache Spark?
1) A cloud-based spreadsheet software.
2) Interconnected computing nodes.
3) A cluster of server computers.
4) A distributed data-processing software.
4) A distributed data-processing software.
Apache Spark relies on a database concept called RDD. What does RDD stand for?
1) Relational Dynamic Database
2) Recoverable Distributed Database
3) Resilient Distributed Dataset
4) Rigorous Distributed Database
3) Resilient Distributed Dataset
There are two types of RDD operations in Apache Spark: transformation and action. Which of the following is an action operation?
1) Count
2) Map
3) Filter
4) Join
1) Count
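The transformation/action split can be illustrated with a toy class in plain Python. This is an analogy only and not the Spark API: `ToyDataset` is a hypothetical stand-in whose `map` and `filter` merely record work to do, while `count` actually runs it.

```python
class ToyDataset:
    """Toy stand-in for an RDD: transformations are lazy, actions run."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also just recorded.
        return ToyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def count(self):
        # Action: triggers evaluation of all recorded steps.
        return sum(1 for _ in self._run())

calls = []

def noisy_square(x):
    calls.append(x)  # records when the function is really invoked
    return x * x

ds = ToyDataset(range(10)).map(noisy_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)
even_squares = ds.count()
calls_after_action = len(calls)
```

Nothing is computed until `count` is called, which is the behavior the card attributes to actions.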
Which of the following was written on top of the Apache Spark software?
1) Python
2) GraphX
3) Java
4) Scala
2) GraphX
Which of the following big data software includes an implementation of PageRank, the algorithm Google popularized for ranking websites?
1) Oracle
2) MySQL
3) Spark SQL
4) GraphX
4) GraphX
What is the method implemented by Apache Spark to process live streaming data?
1) Real time processing
2) Batch processing
3) Binary processing
4) On-demand processing
2) Batch processing
Which of the following is an example of live streaming data?
1) Student grades submitted by an instructor.
2) An online banking statement for an individual.
3) A Wikipedia article about a historical figure.
4) A Twitter hashtag containing a company name.
4) A Twitter hashtag containing a company name.
During the processing of live streaming data by Apache Spark, what does each batch correspond to?
1) RDD (Resilient Distributed Dataset)
2) Node
3) Query
4) Second
1) RDD (Resilient Distributed Dataset)
How is spatial data different from traditional data?
1) Spatial data is tied to physical space.
2) Spatial data represents simple spaces.
3) Spatial data has one dimension.
1) Spatial data is tied to physical space.
What is the best definition of a KNN query?
1) A KNN query is a query that is nested inside a SQL statement and is embedded in the where clause.
2) A KNN query retrieves all records where a value is between an upper and lower boundary.
3) A KNN (k-nearest-neighbor) query takes a query point q and finds the k objects closest to q based on their spatial distance.
3) A KNN (k-nearest-neighbor) query takes a query point q and finds the k objects closest to q based on their spatial distance.
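A minimal brute-force sketch of a KNN query, assuming 2-D points and Euclidean distance. Real spatial systems typically use an index rather than scanning every point; `knn_query` and the sample points are made up for illustration.

```python
import heapq
import math

def knn_query(points, q, k):
    """Return the k points closest to query point q (Euclidean distance)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

points = [(0, 0), (1, 1), (5, 5), (2, 2), (9, 0)]
nearest = knn_query(points, q=(1.0, 1.5), k=2)
```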
How do Hadoop and MapReduce relate to each other?
1) MapReduce is the framework used by Hadoop software.
2) Hadoop is a computer operating system while MapReduce is a software application.
3) Hadoop is the framework used by MapReduce software.
4) Hadoop is a server-side application while MapReduce runs on client computers.
1) MapReduce is the framework used by Hadoop software.
Which of the following is the correct order of functions in the typical processing of big data?
1) Map and Reduce functions can run in parallel.
2) Map and Reduce functions can run simultaneously.
3) The Map function has to finish before the Reduce function starts.
4) The Reduce function has to finish before the Map function starts.
3) The Map function has to finish before the Reduce function starts.
What is the name of the transitional phase between the Map and Reduce phases in a big data process?
1) Data mapping
2) Data mining
3) Data scrubbing
4) Data shuffling
4) Data shuffling
What happens during the data shuffling phase in a typical big data process?
1) Data generated during the reduce phase is encrypted to make it secure.
2) Data generated during the reduce phase is routed to different nodes in the cluster.
3) Data generated during the map phase is routed to different nodes in the cluster.
4) Data generated during the map phase is encrypted to make it secure.
3) Data generated during the map phase is routed to different nodes in the cluster.
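The routing described in this card is often done by hashing each key to pick a reducer, so that every pair with the same key lands on the same node. The sketch below is an assumption-laden illustration: `partition` uses `zlib.crc32` only to get a stable hash in Python, and the reducer count of 3 is arbitrary.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer index for a key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route each (key, value) pair to the bucket of its reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_outputs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

map_outputs = [("car", 1), ("deer", 1), ("car", 1), ("bear", 1), ("deer", 1)]
buckets = shuffle(map_outputs, num_reducers=3)
```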
Which of the following phases in a typical Hadoop process provide full programming control to users?
1) Map and Reduce
2) Map and Shuffling
3) Reduce and Shuffling
4) Compress and Shuffling
1) Map and Reduce
How many copies of a piece of data are generated by the Hadoop File System (HDFS) in order to allow for fault tolerance?
1) 3
2) 8
3) 10
4) 64
1) 3
In which programming language are Map and Reduce functions written?
1) HTML
2) Java
3) C++
4) Python
2) Java
What is the size of each data block in the Hadoop file system?
1) 128 MB
2) 1 GB
3) 100 MB
4) 1 MB
1) 128 MB
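Combining the 128 MB block size from this card with the replication factor of 3 from the previous one gives a quick storage estimate. The helper below is a back-of-the-envelope sketch; `hdfs_footprint` is a hypothetical name, and both constants are defaults that a cluster may override.

```python
import math

BLOCK_SIZE_MB = 128   # block size from the card (configurable in practice)
REPLICATION = 3       # copies per block from the previous card

def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    # The last block only occupies the bytes it holds, so raw storage
    # scales with the file size rather than with the block count.
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_storage_mb

one_gb = hdfs_footprint(1024)
small = hdfs_footprint(300)
```

For a 1 GB (1024 MB) file this gives 8 blocks, 24 block replicas, and 3072 MB of raw storage.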
What does each node correspond to in a Hadoop cluster?
1) Data center
2) A data block
3) A computing machine
4) A data cloud
3) A computing machine
What is the name of the special node in a Hadoop cluster that stores metadata of the entire cluster?
1) Name node
2) Master node
3) Hub node
4) Meta node
2) Master node