Concurrency, Parallelism, Distributed Computing Flashcards
Threads vs. Processes
Definition:
Threads share memory within the same process, while processes run independently with separate memory spaces.
- Threads: lightweight, easier sharing of data, but must handle synchronization to avoid conflicts (e.g. mutex or semaphore).
- Processes: more overhead, but safer isolation (less risk of data corruption).
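A minimal sketch of the memory-sharing difference, using Python's standard library (the `bump` helper is illustrative; the fork start method is forced, so this is POSIX-only):

```python
import multiprocessing
import threading

data = {"count": 0}

def bump():
    data["count"] += 1

# A thread runs in the same address space, so its change is visible here.
t = threading.Thread(target=bump)
t.start()
t.join()
thread_count = data["count"]  # 1

# A forked child process works on its own copy of memory,
# so the parent's dict is untouched.
p = multiprocessing.get_context("fork").Process(target=bump)
p.start()
p.join()
process_count = data["count"]  # still 1
```

The thread's increment is observed by the parent; the process's increment happens only in the child's copy and is lost when the child exits.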
Mutex vs. Semaphore
Definition:
A mutex (mutual exclusion) allows only one thread at a time to access a resource, while a semaphore can allow multiple concurrent accesses (permits).
- Mutex: typically “locked” or “unlocked” (binary).
- Semaphore: can be counting (supports multiple permits) or binary (similar to a mutex).
- Both are used to synchronize threads and avoid race conditions.
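A small sketch of a counting semaphore bounding concurrency, with a mutex (`threading.Lock`) guarding the bookkeeping counters (the `worker` name and sleep duration are illustrative):

```python
import threading
import time

sem = threading.Semaphore(2)   # at most two threads hold a permit at once
state = threading.Lock()       # mutex guarding the counters below
active = 0
peak = 0

def worker():
    global active, peak
    with sem:                  # acquire a permit; blocks if none are left
        with state:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)       # simulate work while holding the permit
        with state:
            active -= 1

threads = [threading.Thread(target=worker) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak never exceeds the semaphore's two permits
```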
Deadlock
Definition:
Occurs when two or more threads or processes block each other, each waiting for a resource that the others hold.
- Four conditions: Mutual exclusion, Hold and wait, No preemption, Circular wait.
- Prevent by careful resource ordering, lock timeouts, or deadlock detection.
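A sketch of the resource-ordering prevention strategy in Python (names are illustrative): both workers take the locks in the same global order, which breaks the circular-wait condition.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
done = []

def worker(name):
    # Every thread acquires lock_a before lock_b, so no cycle of
    # "holds one, waits for the other" can ever form.
    with lock_a:
        with lock_b:
            done.append(name)

t1 = threading.Thread(target=worker, args=("t1",))
t2 = threading.Thread(target=worker, args=("t2",))
t1.start(); t2.start()
t1.join(); t2.join()
```

If one worker instead acquired lock_b first, the two threads could each grab one lock and wait forever for the other.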
Race Condition
Definition:
Multiple threads or processes access and modify shared data without proper synchronization, leading to unpredictable or incorrect results.
- Commonly fixed via locks, atomic operations, or other synchronization mechanisms.
- Debugging can be difficult; best prevented by design.
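A sketch of the lock-based fix in Python (the `increment` helper is illustrative): `counter += 1` is a read-modify-write, so without the lock two threads can interleave and lose updates.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # The lock makes the read-modify-write atomic with respect
        # to the other threads; without it, updates can be lost.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(50_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter is exactly 200_000 because every += ran under the lock
```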
Parallelism vs. Concurrency
Definition:
Concurrency is about dealing with multiple tasks over the same time period (not necessarily simultaneously); parallelism is about executing tasks simultaneously using multiple cores/CPUs.
- Concurrency = managing lots of tasks at once (structure).
- Parallelism = running tasks at the exact same time (execution).
Threading
Definition:
Creating multiple threads within a process to perform tasks concurrently.
- Can improve throughput for I/O-bound tasks, where threads spend most of their time waiting.
- For CPU-bound tasks in CPython, the global interpreter lock (GIL) prevents true parallel speedups; processes are typically used instead.
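A sketch of the I/O-bound case using the standard-library thread pool (the `fetch` helper is a stand-in for a blocking network or disk call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(i):
    time.sleep(0.1)   # stands in for a blocking I/O wait
    return i * i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, range(8)))
elapsed = time.perf_counter() - start
# The eight 0.1 s waits overlap across threads, so elapsed stays
# well under the 0.8 s a sequential loop would take.
```

Sleeping releases the GIL, which is why the waits overlap; a CPU-bound `fetch` would not speed up this way.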
Apache Spark
Definition:
An open-source distributed computing framework for big data processing, with APIs in Scala, Java, Python, and R.
- Performs in-memory computations via resilient distributed datasets (RDDs) or DataFrames for speed.
- Includes high-level libraries for SQL, streaming, machine learning, and graph processing.
Dask
Definition:
A Python library for parallel computing that extends NumPy, Pandas, and scikit-learn APIs to larger-than-memory or distributed datasets.
- Uses “lazy” evaluation and task scheduling across multiple cores or machines.
- Ideal for scaling Python code without switching to completely different ecosystems.
In-Memory vs. Out-of-Memory Computations
Definition:
In-memory computations hold data in RAM for faster processing, while out-of-core (often loosely called out-of-memory, OOM) computations handle data larger than RAM by streaming or chunking it from disk.
- In-memory engines (e.g., Spark) greatly reduce disk I/O but need enough RAM for the working set.
- Out-of-core tools (e.g., Dask) trade speed for the ability to handle very large datasets.
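A standard-library sketch of the out-of-core idea (the file is small here for illustration, but the pattern works for files of any size): stream one row at a time and keep only a running aggregate, so memory use stays constant.

```python
import csv
import os
import tempfile

# Build a sample CSV standing in for a dataset too large for RAM.
path = os.path.join(tempfile.mkdtemp(), "numbers.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(10_000):
        writer.writerow([i])

# Out-of-core style: read row by row instead of loading the whole file,
# keeping only a running sum in memory.
total = 0
with open(path, newline="") as f:
    for row in csv.reader(f):
        total += int(row[0])
```

This is essentially what Dask's chunked readers automate: each chunk fits in RAM, and partial results are combined at the end.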
MapReduce (Concept)
Definition:
A programming model for distributed processing of large data sets across clusters (popularized by Hadoop).
- Map step: transforms or filters data into key-value pairs.
- Reduce step: aggregates or summarizes data by keys.
- Foundation for many big-data processing frameworks.
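The classic word-count example, sketched in plain Python to show the three phases (the document list and helper names are illustrative; a real cluster would run the map and reduce phases on different machines):

```python
from collections import defaultdict
from itertools import chain

docs = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: turn each document into (word, 1) key-value pairs.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for word, one in chain.from_iterable(map_phase(d) for d in docs):
    groups[word].append(one)

# Reduce: aggregate each key's values into a single count.
counts = {word: sum(ones) for word, ones in groups.items()}
```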