Exam Practice Flashcards
What is Big Data?
Four pillars:
1) Information
2) Technology
3) Impact
4) Methods
Big Data itself is the information asset
What dimensions underlie Big Data?
1) Volume: quantity of available data
2) Velocity: rate at which data is collected/recorded
3) Veracity: quality and applicability of data
4) Variety: different types of data available
What is Gartner's description of Big Data?
Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making
Why do we call it “Big” Data?
Because the data's storage and processing demands exceed the capabilities of traditional computing environments
What are the drivers of Big Data?
Non-exhaustive list:
- Increased data volumes
- Rapid acceleration of data growth
- Growing variation in data types for analysis
- Alternate and unsynchronized methods for facilitating data delivery
- Rising demand for real-time integration of analytical results
What is the process for piloting technologies to determine their feasibility and business value, and for engaging business sponsors and socializing the benefits of a selected technique?
1) channel the energy and effort of test-driving big data technologies
2) determine whether those technologies add value
3) devise a communication strategy for sharing the message with the right people within the organisation
What must happen to bring Big Data analytics into an organisation's system development life cycle and enable their use?
1) develop tactics for technologists, data management professionals and business stakeholders to work together
2) migrate the Big Data projects into the production environment in a controlled and managed way
How to assess value of Big Data?
1) feasibility: does the organisational setup permit new and emerging technologies?
2) reasonability: are the resource requirements within capacity?
3) value: do the results warrant the investment?
4) integrability: are there any impediments within the organisation?
5) sustainability: are the maintenance costs manageable?
What is Hadoop? And mention the three important layers.
Apache Hadoop is a collection of open-source software utilities for the distributed storage and processing of Big Data using the MapReduce programming model (a toy sketch of MapReduce follows the list below)
Important layers:
1) Hadoop Distributed File System (HDFS)
2) MapReduce
3) YARN: job scheduling and cluster management
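A toy, in-memory sketch of the MapReduce model, assuming a classic word count: map emits (key, value) pairs, a shuffle groups them by key, and reduce aggregates each group. On Hadoop this logic would run distributed across a cluster; the documents and names here are illustrative.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: aggregate all the values emitted for one key.
    return (word, sum(counts))

documents = ["Big Data demands new tools", "big data big clusters"]

# Shuffle: group the mapped pairs by key before reducing.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)  # {'big': 3, 'data': 2, 'demands': 1, ...}
```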
How can organisations plan to support Big Data?
Get the people right (Business Evangelists, Technical Evangelists, Business Analysts, Big Data Application Architect, Application Developer, Program Manager, Data Scientists)
What is parallel computing?
Type of computation where many calculations are carried out simultaneously. Problems can be broken into pieces and solved at the same time.
Parallelism has long been employed in high-performance computing (e.g. on multi-core processors).
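A minimal sketch, assuming a CPU-bound task that is broken into pieces and solved simultaneously by worker processes; the function and data are illustrative.

```python
from multiprocessing import Pool

def sum_of_squares(chunk):
    # CPU-bound work on one piece of the problem.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Break the problem into four pieces that can be solved at the same time.
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_results = pool.map(sum_of_squares, chunks)
    # Combining the partial results gives the same answer as a sequential run.
    print(sum(partial_results))
```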
What is distributed computing?
Model in which components of a software system are shared among multiple computers to improve efficiency and performance.
For example, in the typical distribution using the 3-tier model, user interface processing is performed in the PC at the user’s location, business processing is done in a remote computer, and database access and processing is conducted in another computer that provides centralized access for many business processes. Typically, this kind of distributed computing uses the client/server communications model.
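A toy client/server sketch using Python's standard-library XML-RPC, assuming the "business processing" tier is a single function exposed over the network; in a real deployment the server would run on a remote machine rather than in a background thread. The names and port are illustrative.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def apply_discount(price, percent):
    # Business logic living in the "business processing" tier.
    return round(price * (1 - percent / 100), 2)

# Server side: register the function and serve it over the network.
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(apply_discount)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the user-interface tier calls the remote function as if
# it were local -- the client/server communications model in miniature.
business_tier = ServerProxy("http://localhost:8000")
print(business_tier.apply_discount(100.0, 15))  # -> 85.0
```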
What is the Big Data Landscape 2016?
- Infrastructure (e.g. Hadoop)
- Analytics (e.g. Statistical Computing)
- Applications (e.g. Sales and Marketing)
- Cross-Infrastructure/Analytics (e.g. Google)
- Open Source
- Data Sources (e.g. Apple Health)
What is the Big Data Landscape of 2019?
- Infrastructure
- Analytics and Machine Learning
- Applications - Enterprise
- Applications - Industry
- Cross-Infrastructure/Analytics
- Open Source
- Data Sources
- Data Resources
What is the Big Data framework?
Analytical applications that combine the means for developing and implementing algorithms, which must access, consume and manage data
What is encompassed in a technological ecosystem?
- Scalable storage
- Computing platform
- Data management environment
- Application development framework
- Scalable analytics
- Project management processes and tools
Describe row-oriented data
Traditional database systems employ a row-oriented layout: the values of each record are laid out consecutively in memory
The entire record must be read to access the required attributes
Describe column-oriented data
Values are stored by column per variable
Values can be stored separately
Reduced latency to access data, compared to a row-oriented layout
What are the key differences between row- and column-oriented data?
Four dimensions of comparison:
1) Access performance: column is faster than row
2) Speed of joins and aggregation: column has lower access latency than row
3) Suitability to compression: column data can be compressed to decrease storage needs while maintaining high performance; compression is difficult to apply to row data without losing performance
4) Data load speed: in a row layout, all of a record's values must be stored together, which prevents parallel loading; in a column layout the data can be segregated, allowing columns to be loaded in parallel (e.g. on multiple cores) with a separate thread working on each column
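A minimal sketch of the two layouts using plain Python containers, assuming a tiny illustrative table; it shows why a column aggregation touches only the values it needs.

```python
# Row-oriented: each record's values are laid out together.
rows = [
    {"id": 1, "name": "Alice", "salary": 52000},
    {"id": 2, "name": "Bob",   "salary": 48000},
    {"id": 3, "name": "Carol", "salary": 61000},
]
# Aggregating one attribute still means reading every whole record.
avg_row = sum(record["salary"] for record in rows) / len(rows)

# Column-oriented: values are stored separately, one array per variable.
columns = {
    "id":     [1, 2, 3],
    "name":   ["Alice", "Bob", "Carol"],
    "salary": [52000, 48000, 61000],
}
# The same aggregation scans a single contiguous array -- and each
# column can be compressed or loaded in parallel independently.
avg_col = sum(columns["salary"]) / len(columns["salary"])

assert avg_row == avg_col
```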
Describe tools and techniques for data management.
1) Processing capability: processing nodes often incorporate multiple cores so that tasks can run simultaneously
2) Memory: holds the data the node is currently working on and generally has an upper limit per node
3) Storage: provides persistence of data; it is where datasets and databases are kept, ready to be accessed
4) Network: the communication infrastructure between nodes, allowing for information exchange
What is a cluster in data architecture?
A collection of interconnected nodes
Mention architecture cluster types from class.
- Fully connected network topology (all-to-all)
- Common bus topology (sequence, one-to-next)
- Mesh network topology (some-to-some)
- Star network topology (one-to-many)
- Ring network topology (neighbor-to-neighbor)
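A small sketch, assuming a five-node cluster, of how three of these topologies can be written down as adjacency lists; the node labels are illustrative.

```python
nodes = ["n0", "n1", "n2", "n3", "n4"]

# Ring: each node talks only to its two neighbors.
ring = {node: [nodes[(i - 1) % 5], nodes[(i + 1) % 5]]
        for i, node in enumerate(nodes)}

# Star: a central node talks to all others (one-to-many).
star = {"n0": nodes[1:], **{node: ["n0"] for node in nodes[1:]}}

# Fully connected: every node talks to every other node (all-to-all).
full = {node: [other for other in nodes if other != node] for node in nodes}

print(ring["n2"])  # -> ['n1', 'n3']
```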
Describe in detail the three layers of Hadoop.
1) HDFS
Attempts to enable storage of large files by distributing the data among a pool of data nodes. An HDFS file appears to be a single file, even though it is broken into blocks ("chunks") that are stored on individual data nodes. HDFS provides a level of fault-tolerance through data replication (a toy sketch of this chunking and replication follows this list).
2) MapReduce
Used to write applications which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes). Like HDFS, it is fault-tolerant.
3) YARN
Yet Another Resource Negotiator: handles job scheduling and cluster resource management, allocating the cluster's compute resources among running applications.
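A toy sketch of HDFS-style chunking and replication, assuming a tiny block size and replication factor for illustration (real HDFS defaults are on the order of 128 MB blocks and 3 replicas); the node names and file contents are made up.

```python
from itertools import cycle

BLOCK_SIZE = 4          # bytes -- toy value
REPLICATION_FACTOR = 2  # toy value
data_nodes = {"node-a": [], "node-b": [], "node-c": []}

file_bytes = b"one file that looks whole to the client"
# Break the file into fixed-size blocks ("chunks").
blocks = [file_bytes[i:i + BLOCK_SIZE]
          for i in range(0, len(file_bytes), BLOCK_SIZE)]

# Place each block on REPLICATION_FACTOR different nodes, round-robin.
node_cycle = cycle(data_nodes)
for block_id, block in enumerate(blocks):
    for _ in range(REPLICATION_FACTOR):
        data_nodes[next(node_cycle)].append((block_id, block))

# Fault tolerance: if node-c fails, the other nodes still hold a copy
# of every block, so the file can be reassembled.
surviving = {bid for n in ("node-a", "node-b") for bid, _ in data_nodes[n]}
print(surviving == set(range(len(blocks))))  # -> True
```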
What is the value proposition of HDFS within Hadoop?
1) Decreasing the cost of specialty large-scale storage systems
2) Providing the ability to rely on commodity components
3) Enabling the ability to deploy using cloud-based services
4) Reducing system management costs