Concurrency, Parallelism, Distributed Computing Flashcards
Threads vs. Processes
Definition:
Threads share memory within the same process, while processes run independently with separate memory spaces.
- Threads: lightweight, easier sharing of data, but must handle synchronization to avoid conflicts (e.g. mutex or semaphore).
- Processes: more overhead, but safer isolation (less risk of data corruption).
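A minimal sketch of the memory-sharing difference, using Python's standard library (the `bump` helper is illustrative; the fork start method is forced, so this is POSIX-only):

```python
import multiprocessing
import threading

data = {"count": 0}

def bump():
    data["count"] += 1

# A thread runs in the same address space, so its change is visible here.
t = threading.Thread(target=bump)
t.start()
t.join()
thread_count = data["count"]  # 1

# A forked child process works on its own copy of memory,
# so the parent's dict is untouched.
p = multiprocessing.get_context("fork").Process(target=bump)
p.start()
p.join()
process_count = data["count"]  # still 1
```

The thread's increment is observed by the parent; the process's increment happens only in the child's copy and is lost when the child exits.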
Mutex vs. Semaphore
Definition:
A mutex (mutual exclusion) allows only one thread at a time to access a resource, while a semaphore can allow multiple concurrent accesses (permits).
- Mutex: typically “locked” or “unlocked” (binary).
- Semaphore: can be counting (supports multiple permits) or binary (similar to a mutex).
- Both are used to synchronize threads and avoid race conditions.
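A small sketch of a counting semaphore bounding concurrency, with a mutex (`threading.Lock`) guarding the bookkeeping counters (the `worker` name and sleep duration are illustrative):

```python
import threading
import time

sem = threading.Semaphore(2)   # at most two threads hold a permit at once
state = threading.Lock()       # mutex guarding the counters below
active = 0
peak = 0

def worker():
    global active, peak
    with sem:                  # acquire a permit; blocks if none are left
        with state:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)       # simulate work while holding the permit
        with state:
            active -= 1

threads = [threading.Thread(target=worker) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak never exceeds the semaphore's two permits
```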
Deadlock
Definition:
Occurs when two or more threads or processes block each other, each waiting for a resource that the others hold.
- Four conditions: Mutual exclusion, Hold and wait, No preemption, Circular wait.
- Prevent by careful resource ordering, lock timeouts, or deadlock detection.
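A sketch of the resource-ordering prevention strategy in Python (names are illustrative): both workers take the locks in the same global order, which breaks the circular-wait condition.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
done = []

def worker(name):
    # Every thread acquires lock_a before lock_b, so no cycle of
    # "holds one, waits for the other" can ever form.
    with lock_a:
        with lock_b:
            done.append(name)

t1 = threading.Thread(target=worker, args=("t1",))
t2 = threading.Thread(target=worker, args=("t2",))
t1.start(); t2.start()
t1.join(); t2.join()
```

If one worker instead acquired lock_b first, the two threads could each grab one lock and wait forever for the other.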
Race Condition
Definition:
Multiple threads or processes access and modify shared data without proper synchronization, leading to unpredictable or incorrect results.
- Commonly fixed via locks, atomic operations, or other synchronization mechanisms.
- Debugging can be difficult; best prevented by design.
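A sketch of the lock-based fix in Python (the `increment` helper is illustrative): `counter += 1` is a read-modify-write, so without the lock two threads can interleave and lose updates.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # The lock makes the read-modify-write atomic with respect
        # to the other threads; without it, updates can be lost.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(50_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter is exactly 200_000 because every += ran under the lock
```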
Parallelism vs. Concurrency
Definition:
Concurrency is about dealing with multiple tasks over the same time period (not necessarily simultaneously); parallelism is about executing tasks simultaneously using multiple cores/CPUs.
- Concurrency = managing lots of tasks at once (structure).
- Parallelism = running tasks at the exact same time (execution).
Threading
Definition:
Creating multiple threads within a process to perform tasks concurrently.
- Can improve throughput for I/O-bound tasks, where threads spend most of their time waiting.
- For CPU-bound tasks in CPython, the global interpreter lock (GIL) prevents true parallel speedups; processes are typically used instead.
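A sketch of the I/O-bound case using the standard-library thread pool (the `fetch` helper is a stand-in for a blocking network or disk call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(i):
    time.sleep(0.1)   # stands in for a blocking I/O wait
    return i * i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, range(8)))
elapsed = time.perf_counter() - start
# The eight 0.1 s waits overlap across threads, so elapsed stays
# well under the 0.8 s a sequential loop would take.
```

Sleeping releases the GIL, which is why the waits overlap; a CPU-bound `fetch` would not speed up this way.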
Apache Spark
Definition:
An open-source distributed computing framework for big data processing, with APIs in Scala, Java, Python, and R.
- Performs in-memory computations via resilient distributed datasets (RDDs) or DataFrames for speed.
- Includes high-level libraries for SQL, streaming, machine learning, and graph processing.
Dask
Definition:
A Python library for parallel computing that extends NumPy, Pandas, and scikit-learn APIs to larger-than-memory or distributed datasets.
- Uses “lazy” evaluation and task scheduling across multiple cores or machines.
- Ideal for scaling Python code without switching to completely different ecosystems.
In-Memory vs. Out-of-Memory Computations
Definition:
In-memory computations hold data in RAM for faster processing, while out-of-core (often loosely called out-of-memory, OOM) computations handle data larger than RAM by streaming or chunking it from disk.
- In-memory engines (e.g., Spark) greatly reduce disk I/O but need enough RAM for the working set.
- Out-of-core tools (e.g., Dask) trade speed for the ability to handle very large datasets.
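A standard-library sketch of the out-of-core idea (the file is small here for illustration, but the pattern works for files of any size): stream one row at a time and keep only a running aggregate, so memory use stays constant.

```python
import csv
import os
import tempfile

# Build a sample CSV standing in for a dataset too large for RAM.
path = os.path.join(tempfile.mkdtemp(), "numbers.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(10_000):
        writer.writerow([i])

# Out-of-core style: read row by row instead of loading the whole file,
# keeping only a running sum in memory.
total = 0
with open(path, newline="") as f:
    for row in csv.reader(f):
        total += int(row[0])
```

This is essentially what Dask's chunked readers automate: each chunk fits in RAM, and partial results are combined at the end.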
MapReduce (Concept)
Definition:
A programming model for distributed processing of large data sets across clusters (popularized by Hadoop).
- Map step: transforms or filters data into key-value pairs.
- Reduce step: aggregates or summarizes data by keys.
- Foundation for many big-data processing frameworks.
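The classic word-count example, sketched in plain Python to show the three phases (the document list and helper names are illustrative; a real cluster would run the map and reduce phases on different machines):

```python
from collections import defaultdict
from itertools import chain

docs = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: turn each document into (word, 1) key-value pairs.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for word, one in chain.from_iterable(map_phase(d) for d in docs):
    groups[word].append(one)

# Reduce: aggregate each key's values into a single count.
counts = {word: sum(ones) for word, ones in groups.items()}
```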