Big Data Flashcards
What are the FOUR dimensions of Big Data?
- Volume: refers to the quantity of available data
- Velocity: refers to the rate at which the data is recorded/collected
- Veracity: refers to the quality and applicability of the data
- Variety: refers to the different types of available data
What characterizes big data/how is it defined?
Big Data is the information asset characterized by such high Volume, Velocity, and Variety as to require specific Technology and Analytical Methods for its transformation into Value.
Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making
What are the drivers behind big data?
- Increased data volumes being captured and stored
- Rapid acceleration of data growth
- Increased data volumes pushed into the network
- Growing variation in types of data assets for analysis
- Alternate and unsynchronized methods for facilitating data delivery
- Rising demand for real-time integration of analytical results
What is NoSQL?
“Not only SQL” –> an alternative model for data management
- Provides a variety of methods for managing information to best suit specific business process needs, such as in-memory data management, columnar layouts to speed query response, and graph databases
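A minimal sketch of the key-value idea behind many NoSQL stores, using a plain Python dict (the put/get names are illustrative, not any real store's API):
```python
# Minimal sketch of the key-value idea behind many NoSQL stores:
# records are stored and fetched by key, with no fixed schema.
store = {}

def put(key, value):
    """Store an arbitrary value (no schema enforced) under a key."""
    store[key] = value

def get(key):
    """Fetch a value by key: a direct lookup rather than a SQL query."""
    return store.get(key)

put("user:42", {"name": "Ada", "interests": ["graphs", "columns"]})
print(get("user:42"))  # {'name': 'Ada', 'interests': ['graphs', 'columns']}
```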
What is MPP? and how is it related to Big Data?
Massively Parallel Processing
–> a type of computing architecture in which many processors work on a task simultaneously, utilizing high-bandwidth networks and massive I/O devices
RELATION TO BD:
- Big Data is "smarter": it couples clusters of hardware components with open-source tools and technology
What five aspects will a corporation considering incorporating Big Data need to consider?
• Feasibility: is the enterprise aligned in a way that allows new and emerging technologies to be brought into the organization?
• Reasonability: will the resource requirements exceed the capability of the existing or planned environment?
• Value: do the results warrant the investment?
• Integrability: are there any constraints or impediments within the organization from a technical, social, or political perspective?
• Sustainability: are the costs associated with maintenance, configuration, skills maintenance, and adjustments to the level of agility sustainable?
Name the 7 types of people needed for implementing Big Data?
1) Business evangelist –> understands the current limitations of the existing technology infrastructure
2) Technical evangelist –> understands the emerging technology and the science behind it
3) Business analyst –> engages the business process owners and identifies measures to quantify value
4) Big Data application architect –> experienced in high-performance computing
5) Application developer –> identifies the technical resources with the right set of skills for programming
6) Program manager –> experienced in project management
7) Data scientist –> experienced in coding and statistics/AI
What is the Big Data framework, and what key components does it consist of?
Overall picture of the Big Data landscape, consists of:
- Infrastructure (e.g. SAP, SQL)
- Analytics (e.g. Google Analytics)
- Applications (e.g. human capital, legal, security)
- Cross-infrastructure analytics (e.g. Google, Microsoft, Oracle)
- Open source (e.g. RStudio)
- Data sources and APIs (Application Programming Interfaces) (e.g. Garmin, Apple)
What is an API?
Application Programming Interface
–> a set of routines, protocols, and tools for building software applications. Basically, an API specifies how software components should interact. APIs are also used when programming graphical user interface (GUI) components.
- Enables software components to talk to each other
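A minimal illustration of the idea, using a hypothetical TrackerAPI class: callers depend only on the published routines, not on the internals behind them.
```python
# A minimal sketch: an API is the agreed-upon set of calls through which
# components interact. Here a (hypothetical) fitness-tracker module exposes
# two routines; callers rely only on these signatures, not on internals.

class TrackerAPI:
    """The 'contract': routines other software can call."""
    def __init__(self):
        self._steps = 0          # internal state, hidden from callers

    def record_steps(self, n: int) -> None:
        self._steps += n

    def total_steps(self) -> int:
        return self._steps

# Any component (a GUI, another service) talks to the tracker only via the API:
tracker = TrackerAPI()
tracker.record_steps(1200)
tracker.record_steps(800)
print(tracker.total_steps())  # 2000
```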
Which is better: row- or column-oriented data?
Column-oriented data, since it reduces latency by storing each column separately.
- Access performance: row storage is not good for many simultaneous queries (as opposed to column storage)
- Speed of aggregation: much faster with column-oriented data
- Suitability for compression: column data is better suited to compression, decreasing storage needs
- Data load speed: faster with column storage; since each column is stored separately, data can be loaded in parallel using multiple threads
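A rough Python sketch of the difference (the layouts below are illustrative only): summing one attribute in a row layout touches every record, while a column layout scans just the relevant column.
```python
# Row-oriented: each record is stored together; aggregating one attribute
# means touching every record.
rows = [
    {"id": 1, "price": 9.5, "qty": 3},
    {"id": 2, "price": 4.0, "qty": 7},
    {"id": 3, "price": 2.5, "qty": 1},
]
total_row = sum(r["price"] for r in rows)   # scans whole records

# Column-oriented: each column is stored contiguously, so an aggregate
# reads only the column it needs.
columns = {
    "id":    [1, 2, 3],
    "price": [9.5, 4.0, 2.5],
    "qty":   [3, 7, 1],
}
total_col = sum(columns["price"])           # touches only the relevant column

print(total_row, total_col)  # 16.0 16.0
```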
Hardware versus software?
Go to slides 36 and 37 and discuss
Name the four tools and techniques?
Processing capability
- Often several interconnected nodes, allowing tasks to be run simultaneously (MULTITHREADING); see the sketch after this list
Storage of data
Memory
- Holds the data the node is currently working on
Network
- Communication infrastructure between the nodes
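A minimal Python sketch of multithreading (the process_chunk task is hypothetical): several tasks run concurrently instead of one after another.
```python
import threading

def process_chunk(chunk_id: int) -> None:
    """Stand-in for a unit of work assigned to a thread."""
    print(f"processing chunk {chunk_id}")

# Launch four tasks that run simultaneously rather than serially.
threads = [threading.Thread(target=process_chunk, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait for all chunks to finish
```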
What types of architectural clusters exist? And what are the two OVERALL architectures?
Slides 42 and 43:
OVERALL: centralized and decentralized
- Fully connected network topology
- Mesh network topology
- Star network topology
- Common bus topology
- Ring network topology
What does the general architecture distinguish between, and what are the roles of each?
Management of computing resources
- oversees the pool of processing nodes, assigns tasks, and monitors activity
Management of data/storage
- oversees the data storage pool and distributes datasets across the collection of storage resources
What are the three important layers of Hadoop?
- Hadoop Distributed File System (HDFS)
- MapReduce
- YARN: a new generation framework for job scheduling and cluster management
What are the main functions of HDFS?
- Attempts to enable the storage of LARGE files by distributing the data among a pool of data nodes
- Monitoring of communication between nodes and masters
- Rebalancing of data from one block to another if free capacity is available
- Managing integrity using checksums/digital signatures
- Metadata replication to protect against corruption
- Snapshots/copying of data to establish check-points
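As a rough illustration of the integrity idea only (HDFS's actual mechanism uses per-block checksums maintained by the framework), a minimal Python sketch:
```python
import hashlib

# Sketch of the integrity principle: store a checksum with each block and
# re-verify it on read to detect silent corruption.
def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

block = b"some replicated data block"
stored_sum = checksum(block)

# Later, on read: recompute and compare.
assert checksum(block) == stored_sum, "block corrupted"
```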
What are the four advantages of using HDFS?
1) decreasing the cost of specialty large-scale storage systems
2) providing the ability to rely on commodity components
3) enabling the ability to deploy using cloud-based services
4) reducing system management costs
What is MapReduce?
- It is a software framework
- Used to write applications which process vast amounts of data in parallel on large clusters
- It is fault-tolerant
- Combines both data and computational independence
(both data and computations can be distributed across nodes, which enables strong parallelization)
What are the two steps in MapReduce?
Map: Describes the computation analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs
Reduce: the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results
Example: count the number of occurrences of each word in a corpus:
key: the word
value: the number of times the word is counted
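A minimal pure-Python sketch of the two phases for this word-count example (not Hadoop's actual API; the grouping step between Map and Reduce is normally handled by the framework's shuffle):
```python
from collections import defaultdict

documents = ["big data is big", "data about data"]

# Map: emit an intermediate (word, 1) pair for every word occurrence.
intermediate = []
for doc in documents:
    for word in doc.split():
        intermediate.append((word, 1))

# Shuffle: group intermediate pairs by key (done by the framework in Hadoop).
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce: combine the values for each key into the final result.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```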
What is parallelization?
the act of designing a computer program or system to process data in parallel. Normally, computer programs compute data serially: they solve one problem, and then the next, then the next. If a computer program or system is parallelized, it breaks a problem down into smaller pieces that can each independently be solved at the same time by discrete computing resources
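A minimal Python sketch, assuming a trivially splittable problem (squaring numbers): the same pieces are solved at the same time by separate processes instead of one after another.
```python
from multiprocessing import Pool

def square(n: int) -> int:
    return n * n

if __name__ == "__main__":
    data = range(10)
    # Serial: solve one problem, then the next, then the next.
    serial = [square(n) for n in data]
    # Parallel: distribute the pieces across a pool of worker processes.
    with Pool(processes=4) as pool:
        parallel = pool.map(square, data)
    print(serial == list(parallel))  # True: same result, computed in parallel
```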
What are the four use cases for big data?
Counting: document indexing, filtering, aggregation
Scanning: sorting, text analysis, pattern recognition
Modeling: analysis and prediction
Storing: rapid access to stored large datasets
What is data mining?
The art and science of discovering knowledge, insights and patterns in data
- e.g. predicting the winning chances of a sports team
- or identifying friends and foes in warfare
- or forecasting rainfall patterns in a region
It helps recognize the hidden value in data
Describe the typical process of data mining?
- Understand the application domain
- Identify data sources and select target data
- Pre-process: cleaning, attribute selection
- Data mining to extract patterns or models
- Post-process: identifying interesting or useful patterns
- Incorporate patterns in real world tasks
OR:
Data input –> data consolidation –> data cleaning –> data transformation –> data reduction –> well-formed data
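A rough sketch of the pre-processing steps using pandas (the dataset and column names are hypothetical):
```python
import pandas as pd

# Hypothetical raw input with missing values and an irrelevant attribute.
raw = pd.DataFrame({
    "team":   ["A", "B", None, "A"],
    "goals":  [2, None, 1, 3],
    "mascot": ["lion", "bear", "fox", "lion"],  # irrelevant to the task
})

clean = raw.dropna(subset=["team"])                        # cleaning: drop broken records
clean = clean.fillna({"goals": clean["goals"].median()})   # impute missing values
well_formed = clean[["team", "goals"]]                     # attribute selection / reduction
print(well_formed)
```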
In terms of data mining, what does ETL stand for?
Extract, transform, load
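A minimal ETL sketch in Python (the records and schema are hypothetical): extract raw records, transform them into the target schema, and load them into a SQLite table.
```python
import sqlite3

# Extract: read raw records (inline here; in practice from files or APIs).
raw_records = [("2024-01-01", "42.5"), ("2024-01-02", "n/a"), ("2024-01-03", "40.0")]

# Transform: parse, clean, and reshape into the target schema.
clean = [(day, float(v)) for day, v in raw_records if v != "n/a"]

# Load: write the transformed rows into the target store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (day TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", clean)
print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])  # 2
```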