11 - Scaling Up: Big Data Flashcards
1
Q
Parallel processing
A
- parallelism: work on separate pieces at the same time
- – challenges = coordination, mutability, blocking
- distributed computing: same idea, but spread across different CPUs or machines
- – challenges = sending instructions, fault tolerance, data storage and retrieval
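A minimal sketch of the parallelism idea, assuming a sum of squares as the task (the function names `chunk_sum` and `parallel_sum_of_squares` are illustrative, not from the course). The data is split into independent pieces so that workers share no mutable state and results are combined only at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sum(chunk):
    """Work on one independent piece: no shared mutable state."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(numbers, n_workers=4):
    # Split the data into roughly equal, independent pieces.
    size = max(1, (len(numbers) + n_workers - 1) // n_workers)
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    # Each worker handles its own chunk; partial results are merged at the end.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(chunk_sum, chunks))
    return sum(partials)


total = parallel_sum_of_squares(list(range(1000)))
```

Threads are used here only to keep the sketch simple. For CPU-bound work in CPython, `ProcessPoolExecutor` offers the same `map` interface and is the more usual choice.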
2
Q
Programming: imperative, declarative
A
- imperative = give direct orders; manual scheduling and data control; performance can be optimized by hand (C, C++, Java, Matlab)
- declarative = state the goal; data is managed and stored automatically; scheduling is automatic but not necessarily efficient (SQL; R and Python can be used this way)
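To make the contrast concrete, here is a small sketch that computes the same per-region totals twice: once imperatively with an explicit loop, once declaratively with an SQL query run through Python's built-in `sqlite3` module. The `sales` data and table name are made up for illustration.

```python
import sqlite3

sales = [("north", 120), ("south", 80), ("north", 50), ("east", 200)]

# Imperative: spell out each step (loop, check, accumulate).
totals_imperative = {}
for region, amount in sales:
    if region not in totals_imperative:
        totals_imperative[region] = 0
    totals_imperative[region] += amount

# Declarative: state the desired result and let the engine decide how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
totals_declarative = dict(rows)
conn.close()
```

Both versions give the same totals. The difference is who controls the steps: the programmer in the first case, the database engine in the second.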
3
Q
Queue computing
A
- master (or name) node(s) = main address of the cluster, the entry point for jobs
- worker node = where computation is performed
- scheduler = decides which job, which resources
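A toy sketch of what a scheduler does, under simplifying assumptions: each job needs a number of cores, each worker node has a number of free cores, and jobs are considered in submission order. The `schedule` function and the node names are hypothetical. Real cluster schedulers also handle priorities, time limits, and job completion.

```python
from collections import deque


def schedule(jobs, workers):
    """Toy scheduler: assign each job to the first worker with enough cores.

    jobs: list of (job_name, cores_needed), in submission order.
    workers: dict of worker_name -> free cores.
    Returns (assignments, still_waiting).
    """
    free = dict(workers)  # copy so the caller's dict is not mutated
    queue = deque(jobs)
    assignments = {}
    waiting = []
    while queue:
        name, cores = queue.popleft()
        # Pick the first worker node with enough free cores.
        target = next((w for w, c in free.items() if c >= cores), None)
        if target is None:
            waiting.append(name)  # no resources now: the job stays queued
        else:
            free[target] -= cores
            assignments[name] = target
    return assignments, waiting


assignments, waiting = schedule(
    jobs=[("a", 2), ("b", 4), ("c", 3), ("d", 1)],
    workers={"node1": 4, "node2": 4},
)
```

Here job `c` has to wait because no node has 3 free cores left, while the smaller job `d` submitted after it still fits on `node1`.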
4
Q
Databases
A
- SQL = Structured Query Language (relational tables)
- NoSQL = beyond SQL: non-relational stores (e.g. key-value, document, graph)
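A rough illustration of the two storage styles, using only the standard library. The relational side uses `sqlite3`. The document side is only a stand-in: a plain Python dict holding JSON strings, not a real NoSQL database. The user record is invented for the example.

```python
import json
import sqlite3

# Relational (SQL): fixed columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'London')")
sql_city = conn.execute("SELECT city FROM users WHERE id = 1").fetchone()[0]
conn.close()

# Document style (NoSQL stand-in): a key maps to a free-form JSON document.
store = {}
store["user:1"] = json.dumps({"name": "Ada", "city": "London", "tags": ["math"]})
doc = json.loads(store["user:1"])
nosql_city = doc["city"]
```

The document can carry extra fields such as `tags` without a schema change, whereas the table would need a new column or a second table.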
5
Q
Big data
A
- PageRank: rank websites by quality (Google)
- MapReduce
- – map = apply a function to every element of a list
- – reduce = aggregate the elements into a summary
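The map and reduce steps can be sketched on a single machine with Python's built-in `map` and `functools.reduce`. This is a word-count illustration with made-up documents, not a distributed implementation. In a real MapReduce system the map calls would run on different workers.

```python
from functools import reduce

documents = ["big data big ideas", "data beats ideas"]


def mapper(doc):
    """Map step: turn one document into its own word counts."""
    counts = {}
    for word in doc.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def reducer(acc, counts):
    """Reduce step: merge one partial count into the running summary."""
    merged = dict(acc)
    for word, n in counts.items():
        merged[word] = merged.get(word, 0) + n
    return merged


mapped = list(map(mapper, documents))       # one partial result per document
word_counts = reduce(reducer, mapped, {})   # aggregate into a single summary
```

Because each `mapper` call only looks at its own document, the map step needs no coordination between workers. Only the reduce step combines results.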
6
Q
Further big data
A
- distributed computing: Hadoop, Spark, Dask, DAGs
- cloud computing: Spark
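A small sketch of the DAG idea: tasks are nodes, dependencies are edges, and a task runs only after everything it depends on has finished. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+), not the Dask or Spark APIs. The task names, the `run_dag` helper, and the pipeline itself are hypothetical.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (hypothetical pipeline)
dag = {
    "load": set(),
    "clean": {"load"},
    "total": {"clean"},
    "count": {"clean"},
    "mean": {"count", "total"},
}

# What each task computes; inputs arrive in alphabetical order of dependency name.
funcs = {
    "load": lambda: [3, 1, 2, None],
    "clean": lambda data: [x for x in data if x is not None],
    "total": lambda data: sum(data),
    "count": lambda data: len(data),
    "mean": lambda count, total: total / count,
}


def run_dag(dag, funcs):
    results = {}
    # static_order() yields every task after all of its dependencies.
    for task in TopologicalSorter(dag).static_order():
        inputs = [results[dep] for dep in sorted(dag[task])]
        results[task] = funcs[task](*inputs)
    return results


order = list(TopologicalSorter(dag).static_order())
results = run_dag(dag, funcs)
```

`total` and `count` do not depend on each other, so a distributed engine would be free to run them at the same time on different workers.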