Hadoop Flashcards
hadoop: The first phase in a MapReduce program is the
map phase
hadoop: The map job basically
distributes the search for relevant data across multiple nodes; the relevant results are then collected back together on one node
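The map step can be sketched in plain Python using word count, the canonical MapReduce example. This is a stand-in for illustration, not Hadoop's actual API:

```python
# Sketch of the "map" step from word count. In Hadoop, a function like this
# would run in parallel on each node against its local slice of the input.
def map_phase(line):
    # Emit a (key, value) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

pairs = map_phase("the quick brown fox the fox")
# pairs == [("the", 1), ("quick", 1), ("brown", 1), ("fox", 1), ("the", 1), ("fox", 1)]
```

The reduce phase would then group these pairs by key and sum the counts.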
hadoop: Apache Spark is
an open-source cluster computing framework that lets you load data into a cluster's memory and query it repeatedly, which makes it well suited to iterative workloads such as machine learning
hadoop: ETL stands for
Extraction, Transformation, and Loading
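The three ETL steps can be sketched in plain Python; the record format and field names here are illustrative, not part of any specific ETL tool:

```python
# Extract: raw CSV-style records pulled from a source system.
source = ["1,alice,30", "2,bob,25"]

def transform(record):
    # Transform: parse each record and reshape it for the target schema.
    rid, name, age = record.split(",")
    return {"id": int(rid), "name": name.title(), "age": int(age)}

# Load: write the transformed rows into the target store.
target = [transform(r) for r in source]
```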
Big Data: The three Vs that define big data are
velocity, variety, volume
hadoop: The cluster is usually made up of
mid-range, rack-mounted servers
hadoop: Hive allows you to
write SQL-style queries (HiveQL) that are translated into MapReduce jobs, instead of having to code the MapReduce yourself
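As an illustration of the idea, the aggregation below is expressed as one SQL query rather than hand-written map and reduce code. It uses Python's sqlite3 as a stand-in for Hive; HiveQL syntax for a query like this is essentially the same:

```python
import sqlite3

# In-memory database standing in for a Hive table over HDFS data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("west", 200), ("east", 50)])

# One declarative query replaces a whole map/shuffle/reduce pipeline.
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
# totals == {"east": 150, "west": 200}
```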
hadoop: Impala is
a way to query data in HDFS using SQL directly, without going through MapReduce
hadoop: Sqoop is used to
move data between relational databases and HDFS
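A typical Sqoop import invocation might look like the following sketch; the connection string, credentials, table, and HDFS path are all hypothetical:

```shell
# Import the "customers" table from a (hypothetical) MySQL database into HDFS.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user \
  --table customers \
  --target-dir /data/customers
```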
spark: Spark is
much faster than MapReduce, largely because it keeps intermediate data in memory instead of writing it to disk between stages
spark: RDD stands for
Resilient Distributed Dataset