Intro to Hadoop Flashcards
What is Distributed Computing?
A model in which components of a software system are shared among multiple computers to improve EFFICIENCY and PERFORMANCE
What is Distributed Processing?
The execution of a process across multiple computers connected by a computer network
What is a Distributed Object?
A software module designed to work with other distributed objects stored on other computers
What is Distributed File System?
A file system that offers simplified, scalable, highly available access to storing, analysing and processing data
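A minimal sketch of what "access to storing, analysing and processing data" looks like in code for HDFS (Hadoop's distributed file system), assuming a Hadoop client library on the classpath; the NameNode address and file path below are hypothetical.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt");

        // Write: the client asks the NameNode for metadata, then streams blocks to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and copy its contents to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}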
Why do we need Distributed Computing?
Scalability, performance, processing power, efficiency to analyse large data sets for ROI realisation
In a Hadoop cluster, what types of nodes are there and what daemons do they run?
Master Node: Name Node & Job Tracker
Slave Node: Data Nodes & Task Trackers
Describe how a Hadoop Cluster has fault tolerance built in?
Data & tasks are replicated across the Hadoop cluster nodes in case of failure
This means that if any machine goes down, the cluster will still carry on as normal!
Despite multiple copies of the same data or tasks co-existing inside the cluster at the same time, Hadoop is smart enough to avoid duplicate results/outputs
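A hedged sketch of the replication behind this fault tolerance: HDFS keeps several copies of every block on different DataNodes, controlled cluster-wide by the dfs.replication property and adjustable per file through the Java API. The file path below is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default replication factor; normally set once in hdfs-site.xml.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/important-data.csv");

        // Ask for extra copies of one particular file so its blocks
        // survive the loss of more DataNodes.
        fs.setReplication(file, (short) 5);

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
    }
}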
Describe some challenges to Distributed Computing?
Heterogeneity – “Describes a system consisting of multiple distinct components”
In many systems, a software layer known as Middleware is used to overcome heterogeneity by hiding the differences among the components' underlying layers.
Openness – “Property of each subsystem to be open for interaction with other systems”
So once something has been published it cannot be taken back or reversed. Furthermore, in open distributed systems there is often no central authority, as different systems may have their own intermediaries.
Security – The issues surrounding security are those of: Confidentiality, Integrity & Availability
To combat these issues, cryptographic techniques such as encryption can help, but they are not absolute. Denial-of-Service attacks can still occur, where a server or service is bombarded with false requests, usually by botnets (zombie computers).
Scalability – Increasing scalability leads to increased cost and physical resources. It is also important to avoid performance bottlenecks by using caching and replication.
Fault Tolerance:
Detect Failures – Various mechanisms can be employed, such as checksums (see the checksum sketch after this list)
Mask Failures – Retransmit upon failure to receive an acknowledgement
Recover from Failures – If a server crashes, roll back to a previous state
Build Redundancy – Redundancy is the best way to deal with failures. It is achieved by replicating data so that if one subsystem crashes, another may still be able to provide the required information.
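A minimal, Hadoop-independent sketch of the checksum idea used to detect failures: the sender computes a CRC-32 over the payload, the receiver recomputes it, and a mismatch flags corruption so the data can be re-read from another replica. The payload and the corruption below are simulated.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumCheck {
    // Compute a CRC-32 checksum over a byte payload.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] sent = "block-0042-payload".getBytes(StandardCharsets.UTF_8);
        long sentChecksum = checksum(sent);

        // Simulate corruption in transit by flipping one byte.
        byte[] received = sent.clone();
        received[3] ^= 0x01;

        // The receiver recomputes the checksum; a mismatch means the data is corrupt.
        boolean corrupt = checksum(received) != sentChecksum;
        System.out.println(corrupt ? "Corruption detected" : "Block OK");
    }
}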
Concurrency issues arise when several clients attempt to request a shared resource at the same time.
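A small sketch of that concurrency problem, assuming the shared resource is a request counter: two threads update it at once, and an atomic integer is used so the interleaved updates do not lose increments.

import java.util.concurrent.atomic.AtomicInteger;

public class SharedCounter {
    // Shared resource accessed by several "clients" (threads) at the same time.
    private static final AtomicInteger requestsServed = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        Runnable client = () -> {
            for (int i = 0; i < 10_000; i++) {
                // Atomic update: a plain int++ here could lose increments
                // when two threads interleave their read-modify-write.
                requestsServed.incrementAndGet();
            }
        };

        Thread a = new Thread(client);
        Thread b = new Thread(client);
        a.start();
        b.start();
        a.join();
        b.join();

        System.out.println("Requests served: " + requestsServed.get()); // always 20000
    }
}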
Access Transparency – where resources are accessed in a uniform manner regardless of location
Location Transparency – the physical location of a resource is hidden from the user
Failure Transparency – failures are hidden from users wherever possible, so the system appears to carry on working
What are the 2 main types of Data ingestion?
“Real-time” or Streaming and Batched
What are the 4 main considerations when ingesting data?
- Where the data source resides
- How the data will be received
- The data format and structure
- Data quality
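A hedged sketch of the batched style of ingestion from the cards above, using the Hadoop FileSystem API to copy a finished local file into HDFS in one go; the local and HDFS paths are hypothetical. Streaming ingestion would instead use a tool such as Kafka or Flume.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BatchIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: a nightly export on local disk,
        // and a landing directory inside HDFS.
        Path source = new Path("file:///data/exports/sales-2024-01-01.csv");
        Path target = new Path("/ingest/sales/");

        // Batch ingestion: move the completed file into the cluster in a single step.
        fs.copyFromLocalFile(source, target);
        System.out.println("Ingested " + source + " into " + target);
    }
}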
Explain the 6 terms: Data cleansing, Data collection, Data custodian, Data-directed decision making, Data integrity and Data integration
Data Cleansing: The act of reviewing and revising data to remove duplicate entries, correct misspellings, add missing data and provide more consistency (a small cleansing sketch follows these definitions)
Data collection: Any process that captures any type of data
Data custodian: A person responsible for the database structure and the technical environment, including the storage of data
Data-directed decision making: Using data to support making crucial decisions
Data integrity: The measure of trust an organisation has in the accuracy, completeness, timeliness, and validity of the data.
Data integration: The process of combining data from different sources and presenting it in a single view.
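A minimal sketch of one data-cleansing step from the definition above (trimming, normalising case and removing duplicate entries), using plain Java collections on made-up sample records.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class CleanseDemo {
    public static void main(String[] args) {
        List<String> rawEmails = List.of(
                "alice@example.com",
                "ALICE@example.com ",   // duplicate with different case and trailing whitespace
                "bob@example.com",
                "bob@example.com");     // exact duplicate

        // Normalise (trim + lower-case) and drop duplicates while keeping insertion order.
        Set<String> cleaned = rawEmails.stream()
                .map(e -> e.trim().toLowerCase())
                .collect(Collectors.toCollection(LinkedHashSet::new));

        System.out.println(cleaned); // [alice@example.com, bob@example.com]
    }
}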
What is an Exabyte?
One million terabytes or one billion gigabytes of information (10^18 bytes)
What is Hadoop?
Hadoop is an open source framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model
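The "simple programming model" referred to is MapReduce. Below is a hedged sketch of the mapper and reducer from the classic word-count example (job/driver setup omitted), which counts how often each word appears across a large set of input files.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts emitted for each word across the cluster.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}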
What is “Failover” and why is it useful?
The automatic switching to another computer or node should one fail. It is part of fault tolerance and keeps our jobs, tasks and data running/accessible should a node go down.
What is meant by the term “scalability”?
The ability of a system or process to maintain acceptable performance levels as workload or scope increases