Big Data Architectures Flashcards
What is big data?
A collection of large and complex data sets that are difficult to process using common database management tools or traditional data processing applications.
What are the 4 Vs of Big data?
Volume -> Data at rest
Velocity -> Data in motion
Variety -> Data in many forms
Veracity -> Data in doubt
What are the two types of scaling? (the ability of a system to adapt to increased demands)
- Horizontal scaling (scaling out):
-> distributes the workload across many servers by adding more machines to improve processing capacity
- Vertical scaling (scaling up):
-> installs more processors, more memory, and faster hardware, typically within a single server (make it bigger instead of adding more)
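A minimal sketch of what "distributing the workload across many servers" can look like in practice. It assumes a simple hash-modulo placement rule; the function names `assign_node` and `partition` and the `user-N` keys are made up for illustration and do not come from any particular platform.

```python
import hashlib
from collections import defaultdict


def assign_node(key: str, num_nodes: int) -> int:
    """Map a record key to one of num_nodes servers using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap it
    # into the range 0 .. num_nodes - 1.
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(keys, num_nodes):
    """Group keys by the node that would store and process them."""
    shards = defaultdict(list)
    for key in keys:
        shards[assign_node(key, num_nodes)].append(key)
    return dict(shards)


# Hypothetical workload: 1000 record keys.
keys = [f"user-{i}" for i in range(1000)]

# Horizontal scaling: the same workload spread over 3 nodes, then 6 nodes.
three_nodes = partition(keys, 3)
six_nodes = partition(keys, 6)

for node_id in sorted(six_nodes):
    print(f"node {node_id}: {len(six_nodes[node_id])} keys")
```

Adding nodes shrinks the share each node has to handle. The placement logic itself lives in software, which is the "software has to handle the data distribution" cost of scaling out. With plain modulo placement, changing the node count also moves most keys to a different node, so real systems often use more careful schemes.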
What are the advantages and disadvantages of horizontal scaling?
Advantages:
- Increases performance in small steps as needed
- Financial investment is relatively small
- Can scale out as much as needed
Disadvantages:
- Software has to handle all the data distribution
- Only a limited number of software packages can take advantage of horizontal scaling
What are the advantages and disadvantages of vertical scaling?
Advantages:
- Most software can easily take advantage of vertical scaling
- Easy to install hardware within a single machine
Disadvantages:
- Requires a substantial financial investment
- The system has to be bought more powerful than currently needed so that it can handle future workloads
- Cannot keep scaling up once the hardware limit of a single machine is reached
What are horizontal scaling platforms?
Peer-to-peer networks
Apache Hadoop
What are vertical scaling platforms?
Multicore processors
High performance computing (HPC) clusters
Graphics processing units
What is a peer-to-peer network?
- Typically involves millions of machines connected in a network
- Decentralized and distributed network architecture
- Nodes communicate using the Message Passing Interface (MPI)
- Each node is capable of storing and processing data
- Scale is practically unlimited
Drawbacks:
- Communication between nodes is a major bottleneck
What is Apache Hadoop?
An open-source software framework for storing and processing large datasets across clusters of machines
What are high performance computing (HPC) clusters?
Also known as supercomputers, with thousands of processing cores
Built from powerful hardware optimized for speed and throughput
What are multicore CPUs?
A single machine with dozens of processing cores