Big Data and Machine Learning Flashcards
What is Big Data?
Big Data is the field that deals with data sets so large and complex (ranging up to petabytes and even exabytes) that conventional software cannot store or process them.
For example, Netflix gathers behaviour data from its more than 100 million subscribers and analyzes it to recommend movies and shows.
Big Data categorizes data into three types:
- Structured: data arranged in a fixed, relational row-column format, such as data in relational databases or spreadsheets (e.g. Excel), and readings gathered from medical devices, GPS etc.
- Semi-structured: data having some level of structure but no rigid schema, such as log files, emails etc.
- Unstructured: data with no predefined structure, such as satellite images, content uploaded by users on social media etc.
What is Hadoop?
It is an open source software platform for distributed storage and processing of very large data sets on computer clusters built from commodity hardware.
It is essentially a framework for processing and analyzing Big Data and is written mainly in Java. It is not a single project but a set of related projects (an ecosystem).
With the default replication value of 3, data is stored on three nodes: two on the same rack, and one on a different rack.
Hadoop is made up of two parts: MapReduce and Hadoop Distributed File System (HDFS). MapReduce handles processing and HDFS handles storage of data.
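As a rough illustration of the replication setting above, here is a minimal Java sketch, assuming the standard Hadoop client libraries are on the classpath; `dfs.replication` is the real configuration property behind the default value of 3.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml if present
        conf.setInt("dfs.replication", 3);          // keep three copies of every block (the default)
        FileSystem fs = FileSystem.get(conf);       // handle to the configured file system
        System.out.println("Default replication: " + fs.getDefaultReplication());
    }
}
```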
What are the four Vs of Big Data?
Volume: deals with the scale of data.
Velocity: deals with the speed at which data is generated and the analysis of streaming data.
Veracity: deals with the degree of accuracy and trustworthiness of a data set.
Variety: deals with different forms of data, e.g. medical data (patient information, X-rays), social media data (posts, images, videos), data generated by GPS etc.
What is the need for Big Data solutions like Hadoop?
With traditional databases we would scale vertically to store the data, but vertical scaling brings problems such as longer disk seek and processing times and vulnerability to hardware failures.
Also, traditional databases store relational/structured data, whereas Big Data is largely semi-structured or unstructured.
Because Hadoop supports storage and processing of huge volumes of unstructured data with horizontal scaling, it is a good solution.
Explain the two parts of Hadoop.
Hadoop consists of two parts: MapReduce and Hadoop Distributed File System (HDFS). These two exist on every machine on which we are storing the data.
MapReduce is the processing part of Hadoop.
HDFS stores all the data in files and directories and scales out to many petabytes. Users interact with HDFS through shell commands (and programmatic APIs).
The MapReduce daemon on each machine is called the TaskTracker and is responsible for launching tasks on that machine. The HDFS daemon on each machine is called the DataNode; it stores blocks of data and provides access to them.
A single worker machine runs one TaskTracker and one DataNode. To build a cluster, we replicate this pattern across several machines, each adding to our storage and compute capacity. A JobTracker and a NameNode co-ordinate all the TaskTrackers and DataNodes in the cluster.
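To make the division of labour concrete, here is a minimal sketch of the classic word-count example using the Hadoop Java MapReduce API; the class names are illustrative, but the Mapper/Reducer base classes and types are the standard ones from org.apache.hadoop.mapreduce.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in an input line read from HDFS.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts emitted for each word and write the totals back to HDFS.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```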
Explain JobTracker.
MapReduce needs a co-ordinator to co-ordinate all the TaskTrackers on multiple machines.
This co-ordinator is called the JobTracker and is responsible for accepting users' jobs, dividing them into tasks and assigning each task to an individual TaskTracker. The TaskTrackers then run the tasks and report their status back to the JobTracker.
The JobTracker is also responsible for noticing if a TaskTracker disappears because of a software/hardware failure. It then automatically reassigns its tasks to another TaskTracker.
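A minimal sketch of the driver a client runs to submit such a job, assuming the word-count mapper and reducer classes sketched earlier; in classic MapReduce this submission is accepted by the JobTracker.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: packages the job description and submits it to the cluster co-ordinator.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // mapper from the earlier sketch
        job.setReducerClass(WordCountReducer.class);      // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // block until the tasks finish
    }
}
```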
Explain NameNode.
HDFS needs a co-ordinator to co-ordinate between all the DataNodes on multiple machines.
This co-ordinator is called NameNode and is responsible for keeping the location information of stored data.
When a client writes to HDFS, it talks to the NameNode, is told which DataNodes to store the data on, and then writes the data directly to those DataNodes.
When a client reads from HDFS, it talks to the NameNode, is told where the data is stored, and then talks directly to those DataNodes.
The actual data never flows through the NameNode, only the information about where the data is located.
Similar to the JobTracker, the NameNode is responsible for noticing when a DataNode has disappeared and automatically re-replicating its data onto other DataNodes.
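A minimal sketch of this read/write path from the client side, using the Hadoop FileSystem Java API (the path is hypothetical); the NameNode lookup happens under the hood, and the bytes flow directly to and from DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up fs.defaultFS for the cluster
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.txt");     // hypothetical HDFS path

        // Write: the client asks the NameNode where to place the blocks,
        // then streams the bytes directly to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read: the client asks the NameNode for block locations,
        // then reads the bytes directly from the DataNodes.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}
```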
What are the key characteristics of Hadoop?
- Reliable: Data is held on multiple DataNodes and will be automatically re-replicated if a machine fails. Tasks are reassigned if they fail.
- Scalable: We can have one application running on multiple machines, scaling linearly. If we need more computational power, we can add more machines.
- Simple: provides simple APIs for data access and computation.
- Powerful: massive amounts of data can be processed in parallel.
- Economical: because it runs on commodity hardware, it is not expensive. Commodity hardware is low-end, broadly compatible hardware such as a standard-issue PC with no outstanding features.
Explain Hadoop ecosystem.
Hadoop is not a single project, but a set of multiple projects such as:
- Pig: it is a high level scripting language that translates to MapReduce. Instead of writing a whole chain of MapReduce jobs to get the processing done, we can write a high level description of the processing and provide it to Pig, which converts it into MapReduce jobs and runs them.
- Hive: it is a SQL-like interface which works similar to Pig, translating high level description to MapReduce job. It uses Hive Query Language (HQL), similar to SQL, to write the queries.
- HBase: HBase is a top-level project that allows for real time read/write access to distributed data, rather than traditional batch processing; a short sketch of its client API appears after this list. HBase can be accessed by Pig, Hive and MapReduce and stores its information in HDFS so that it is guaranteed to be reliable.
- Zookeeper: it is used for cluster management and provides co-ordination between different servers.
- HCatalog: it is the metadata server for Hive which stores the data about tables that Hive can process. It was initially part of Hive but, in order for Pig and MapReduce to access it, it was pulled out into a separate project. It can access data in HDFS or HBase.
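As referenced in the HBase item above, here is a minimal sketch of real-time writes and reads through the HBase Java client API; the table, column family and qualifier names ("users", "info", "name") are made up for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml if present
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Real-time write: update a single row immediately, no batch job needed.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Real-time read: fetch the row straight back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```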
Explain YARN.
Yet Another Resource Negotiator (YARN) is a Hadoop module introduced in Hadoop 2.0 that provides resource management and is responsible for scheduling jobs and monitoring the resources they use. It is also called the operating system of Hadoop.
There are 4 components of YARN:
- Client: submits MapReduce jobs.
- ResourceManager: manages resources across the cluster.
- NodeManager: responsible for launching and monitoring compute containers on the machines in the cluster.
- MapReduce Application Master: co-ordinates the tasks of a single MapReduce job, negotiating resources from the ResourceManager and working with the NodeManagers to run and monitor the tasks. A minimal configuration sketch follows this list.
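A minimal sketch (assuming a Hadoop 2.x client on the classpath) of the client-side configuration that sends MapReduce jobs through YARN; `mapreduce.framework.name` and `yarn.resourcemanager.hostname` are real Hadoop properties, while the hostname value is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;

public class YarnConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");                 // submit jobs to YARN instead of the local runner
        conf.set("yarn.resourcemanager.hostname", "rm.example.com");  // hypothetical ResourceManager host
        // A driver like the word-count one above, built with this conf,
        // is accepted by the ResourceManager; a NodeManager then launches
        // a container for the MapReduce Application Master, which in turn
        // requests containers for the map and reduce tasks.
        System.out.println("MapReduce framework: " + conf.get("mapreduce.framework.name"));
    }
}
```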