Thinking in parallel patterns

Real-World Analogy

Imagine a restaurant kitchen. Some tasks are embarrassingly parallel — every chef can peel a different potato at the same time with no coordination. Some tasks are a pipeline — one chef chops, passes to another who cooks, passes to another who plates. A thread pool is like having a fixed team of cooks who pick up the next order from the ticket rail whenever they finish — no hiring and firing for each dish.

When you have multiple CPU cores, you want them all doing useful work simultaneously. The challenge is deciding how to split the problem. Several proven patterns exist — each suits different kinds of work.

Embarrassingly parallel problems are the easiest: each unit of work is completely independent. Rendering image pixels, applying a filter to each array element, running independent simulations. No thread ever needs to talk to another.

Map pattern: Apply the same function independently to every element of a collection. Because there are no data dependencies between elements, each thread can handle a slice of the array without locks. The result is a new collection of the same size.

Reduce pattern: Combine all elements into a single value (sum, max, count). You cannot do this without any coordination — the final value depends on all partial results. The trick: each thread computes a partial result on its slice, then a final sequential step combines the partial results. This requires only O(threads) lock acquisitions instead of O(n).

Map-reduce together: First map in parallel (independent), then reduce the partial results. This is the fundamental pattern behind parallel sum, parallel search, and frameworks like Hadoop.

Thread pool: Creating and destroying OS threads is expensive (microseconds each). If you have millions of small tasks, creating one thread per task is wasteful. A thread pool pre-creates N worker threads at startup. Tasks are placed in a shared queue. Workers loop: grab a task, execute it, grab the next task. No thread creation/destruction overhead per task.

Pipeline pattern: Stage 1 produces output that stage 2 consumes. Like Unix pipes (cat | grep | sort) but implemented in-process with threads. Each stage runs on its own thread; data flows through bounded queues between stages. All stages run simultaneously on different data items.

Divide and conquer: Recursively split the problem in half, solve each half in parallel (separate thread), then combine results. Parallel mergesort is the classic example: spawn threads for left and right halves, join them, then merge.

When NOT to parallelize

Parallelism has overhead: thread creation, synchronization, cache effects. If the sequential part dominates (Amdahl's Law — see Topic 29), parallelism gives little benefit. If the problem is tiny (e.g. summing 10 numbers), the overhead of creating threads exceeds the work. Always measure before parallelizing.

Master-Worker pattern

A master thread distributes work items to a pool of worker threads. The master can push work to thread-local queues (push system) or workers can pull from a shared queue (pull system). The pull system is simpler and self-load-balancing: idle workers naturally grab more work.

Patterns in code

Map pattern — parallel array processing

/* Each thread processes its own slice — no locks needed */
struct slice { int *arr; int start; int end; };

void *double_elements(void *arg) {
    struct slice *s = arg;
    for (int i = s->start; i < s->end; i++)
        s->arr[i] *= 2;   /* no shared state — no mutex needed */
    return NULL;
}
Key insight: Each thread works on a disjoint slice. No two threads write the same memory location, so no mutex is required. This is the definition of embarrassingly parallel.

Reduce pattern — parallel sum with private accumulators

struct sum_arg { int *arr; int start; int end; long partial; };

void *partial_sum(void *arg) {
    struct sum_arg *s = arg;
    s->partial = 0;
    for (int i = s->start; i < s->end; i++)
        s->partial += s->arr[i];   /* accumulate privately */
    return NULL;
}

/* Main: after joining threads, combine partial results */
long total = 0;
for (int t = 0; t < NTHREADS; t++)
    total += args[t].partial;   /* single-threaded final reduction */
Why this is fast: Only NTHREADS (e.g. 4) additions need synchronization instead of N additions. Most work is lock-free.

Thread pool skeleton

/* Work item: a function pointer + argument */
typedef struct {
    void *(*action)(void *);   /* function to call */
    void *arg;                 /* argument to pass */
} workitem_t;

/* Worker thread: loop forever pulling work from queue */
void *worker_loop(void *arg) {
    while (1) {
        workitem_t *w = dequeue();  /* blocks if queue empty */
        (*w->action)(w->arg);     /* execute the work */
        free(w);
    }
}
Thread pool benefit: N threads are created once at startup. Thousands of work items can be submitted without ever calling pthread_create again. Thread creation cost is amortized over all work items.

Parallel patterns in action

Example 1 — Parallel sum using map-reduce Week 12 Lecture
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>

#define N 1000000
#define T 4

int arr[N];

struct arg { int start; int end; long partial; };

void *sum_slice(void *vp) {
    struct arg *a = vp;
    a->partial = 0;
    for (int i = a->start; i < a->end; i++)
        a->partial += arr[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) arr[i] = 1;  /* fill with 1s */

    pthread_t tid[T];
    struct arg args[T];
    int slice = N / T;

    /* MAP phase: each thread sums its slice */
    for (int t = 0; t < T; t++) {
        args[t].start = t * slice;
        args[t].end   = (t == T-1) ? N : (t+1)*slice;
        pthread_create(&tid[t], NULL, sum_slice, &args[t]);
    }
    for (int t = 0; t < T; t++)
        pthread_join(tid[t], NULL);

    /* REDUCE phase: combine partial sums */
    long total = 0;
    for (int t = 0; t < T; t++) total += args[t].partial;

    printf("Total: %ld\n", total);  /* expect 1000000 */
    return 0;
}
Output
Total: 1000000
Example 2 — Parallel mergesort with depth-limited parallelism Week 11b Lecture
#include <stdio.h>
#include <pthread.h>

void merge(int *A, int p, int q, int r);  /* merge two sorted halves */
void seq_sort(int *A, int p, int r);      /* sequential mergesort */

struct ms_arg { int *A; int p; int r; int depth; int max_depth; };

void *par_mergesort(void *vp) {
    struct ms_arg *a = vp;
    if (a->depth >= a->max_depth || a->p >= a->r - 1) {
        seq_sort(a->A, a->p, a->r);  /* sequential fallback */
        return NULL;
    }
    int q = (a->p + a->r) / 2;
    struct ms_arg left  = {a->A, a->p,   q,   a->depth+1, a->max_depth};
    struct ms_arg right = {a->A, q+1, a->r, a->depth+1, a->max_depth};

    pthread_t lt, rt;
    pthread_create(<, NULL, par_mergesort, &left);
    pthread_create(&rt, NULL, par_mergesort, &right);
    pthread_join(lt, NULL);
    pthread_join(rt, NULL);
    merge(a->A, a->p, q, a->r);  /* combine after both halves sorted */
    return NULL;
}
Key insight
max_depth controls parallelism: depth 0 = 1 thread, depth 1 = 2 threads, depth 2 = 4 threads. Beyond max_depth, work is sequential to avoid thread explosion.
Example 3 — Pipeline: two-stage producer/consumer Week 12 Lecture
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define BUF_SIZE 8
int buffer[BUF_SIZE];
int head = 0, tail = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
sem_t empty_slots, full_slots;

void *producer(void *arg) {
    for (int i = 0; i < 20; i++) {
        sem_wait(&empty_slots);         /* wait for space */
        pthread_mutex_lock(&m);
        buffer[tail++ % BUF_SIZE] = i;  /* stage 1: produce */
        pthread_mutex_unlock(&m);
        sem_post(&full_slots);          /* signal item ready */
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < 20; i++) {
        sem_wait(&full_slots);          /* wait for item */
        pthread_mutex_lock(&m);
        int val = buffer[head++ % BUF_SIZE];
        pthread_mutex_unlock(&m);
        sem_post(&empty_slots);         /* free a slot */
        printf("Consumed: %d\n", val);  /* stage 2: process */
    }
    return NULL;
}
Pattern
Producer and consumer run simultaneously on different data. The bounded buffer with semaphores prevents the producer from outrunning the consumer and vice versa.

Practice problems with solutions

P1 — Identify the pattern: which parallelism pattern is each task? Week 11b Lecture

For each task, name the most appropriate parallel pattern and briefly explain why: (a) Apply a brightness filter to every pixel of an image. (b) Find the maximum value in an array of 10 million integers. (c) A web server handling incoming HTTP requests using a fixed set of threads. (d) Sorting a large array by recursively splitting it in half and sorting each half in parallel.

(a) Map pattern. Each pixel can be processed independently — no pixel's brightness depends on another. Trivially parallel.

(b) Reduce pattern. Each thread finds the max of its slice (map), then a final sequential step finds the max of the partial maxes (reduce). Cannot avoid the final combination step.

(c) Thread pool / master-worker. Incoming requests are tasks queued for a fixed pool of worker threads. Avoids creating a new thread per request, which would be too expensive.

(d) Divide and conquer. The array is recursively split; each half is sorted in a new thread; results are merged. Parallel mergesort is the textbook divide-and-conquer example.
P2 — Parallel sum: what is wrong with locking inside the loop? Week 12 Lecture

A student writes a parallel sum where every thread does pthread_mutex_lock(&m); sum += arr[i]; pthread_mutex_unlock(&m); inside its loop. The answer is correct but the program is slower than the sequential version. Explain why, and describe the fix.

Why it is slow: The mutex is acquired and released once per element. With 16 million elements, there are 16 million lock/unlock operations. All threads serialize at the mutex — only one thread can add at a time. The parallel program becomes functionally sequential, plus it has the overhead of mutex operations. This is lock contention.

Fix: Each thread accumulates into a private (local) variable without any lock. After all threads finish, the main thread combines the private sums in a single sequential loop (only NTHREADS additions need synchronization). This reduces lock operations from O(N) to O(threads).
P3 — Thread pool: why is it better than creating one thread per task? Week 12 Lecture

You have a web server that receives 10,000 requests per second. Option A: create a new thread for each request and destroy it when done. Option B: maintain a pool of 16 worker threads that pick requests from a queue. Compare the two approaches in terms of overhead, resource usage, and scalability.

Option A problems: Creating a thread takes ~10–100 microseconds and consumes ~1–8 MB of stack. At 10,000 req/s that is 10,000 thread creations per second — significant CPU overhead just for management. With many simultaneous threads, the OS spends significant time context-switching. Memory usage can balloon (each thread = megabytes of stack).

Option B advantages: Threads are created once at startup. Each new request is just a queue enqueue operation (microseconds). Only 16 threads exist regardless of request rate — predictable memory use. The OS scheduler only context-switches 16 threads. Load balancing is automatic: idle threads grab the next request.
P4 — Scalable parallelism: why not create a thread for every divide step? Week 11b Lecture

In parallel mergesort, one approach creates new pthreads at every recursion level down to single-element base cases (unlimited parallelism). Another approach stops creating threads after a fixed depth (fixed or scalable parallelism). What goes wrong with unlimited parallelism?

Two problems with unlimited parallelism:

1. Thread explosion: Sorting an array of N elements with unlimited parallelism creates O(N) threads. For N = 1,000,000 that is 1,000,000 threads — each consuming megabytes of stack and requiring OS scheduling structures. The system runs out of memory and the OS is overwhelmed.

2. Overhead dominates at base cases: Near the base case (arrays of 1 or 2 elements), the work per thread is trivial (nanoseconds). But creating and joining a thread takes microseconds — the overhead is 1000x the useful work. The program is net-slower than sequential.

Solution: Switch from parallel to sequential mergesort once the recursion depth exceeds max_depth (chosen based on number of available cores). Scalable parallelism adjusts max_depth based on the actual core count at runtime.

Key concepts to memorize

Card 1 of 10
Question — click to flip
Answer
Click card to flip • Use buttons to navigate

Test your understanding

Topic 28 Quiz — Scalable Patterns Score: 0 / 6
1
Which pattern applies the same operation to every element of a collection independently, with no data dependencies between elements?LO10
multiple choice
2
True or False: A thread pool avoids the overhead of creating and destroying a new OS thread for each task.LO10
true / false
3
In a parallel sum, why do threads accumulate into a private variable rather than adding directly to a shared sum with a mutex inside the loop?LO10
multiple choice
4
Fill in: In parallel mergesort, to avoid creating too many threads, we switch to sequential sorting when the recursion depth exceeds ___.LO10
fill in the blank
5
A pipeline pattern is most analogous to which Unix concept?LO10
multiple choice
6
True or False: An embarrassingly parallel problem requires frequent synchronization between threads because tasks share data.LO10
true / false
0/6
Quiz complete!