Scalable Algorithmic Templates
Parallel design patterns for multi-threaded programs — map-reduce, work queues, and thread pools.
Thinking in parallel patterns
Imagine a restaurant kitchen. Some tasks are embarrassingly parallel — every chef can peel a different potato at the same time with no coordination. Some tasks are a pipeline — one chef chops, passes to another who cooks, passes to another who plates. A thread pool is like having a fixed team of cooks who pick up the next order from the ticket rail whenever they finish — no hiring and firing for each dish.
When you have multiple CPU cores, you want them all doing useful work simultaneously. The challenge is deciding how to split the problem. Several proven patterns exist — each suits different kinds of work.
Embarrassingly parallel problems are the easiest: each unit of work is completely independent. Rendering image pixels, applying a filter to each array element, running independent simulations. No thread ever needs to talk to another.
Map pattern: Apply the same function independently to every element of a collection. Because there are no data dependencies between elements, each thread can handle a slice of the array without locks. The result is a new collection of the same size.
Reduce pattern: Combine all elements into a single value (sum, max, count). You cannot do this without any coordination — the final value depends on all partial results. The trick: each thread computes a partial result on its slice, then a final sequential step combines the partial results. This requires only O(threads) lock acquisitions instead of O(n).
Map-reduce together: First map in parallel (independent), then reduce the partial results. This is the fundamental pattern behind parallel sum, parallel search, and frameworks like Hadoop.
Thread pool: Creating and destroying OS threads is expensive (microseconds each). If you have millions of small tasks, creating one thread per task is wasteful. A thread pool pre-creates N worker threads at startup. Tasks are placed in a shared queue. Workers loop: grab a task, execute it, grab the next task. No thread creation/destruction overhead per task.
Pipeline pattern: Stage 1 produces output that stage 2 consumes. Like Unix pipes (cat | grep | sort) but implemented in-process with threads. Each stage runs on its own thread; data flows through bounded queues between stages. All stages run simultaneously on different data items.
Divide and conquer: Recursively split the problem in half, solve each half in parallel (separate thread), then combine results. Parallel mergesort is the classic example: spawn threads for left and right halves, join them, then merge.
Parallelism has overhead: thread creation, synchronization, cache effects. If the sequential part dominates (Amdahl's Law — see Topic 29), parallelism gives little benefit. If the problem is tiny (e.g. summing 10 numbers), the overhead of creating threads exceeds the work. Always measure before parallelizing.
A master thread distributes work items to a pool of worker threads. The master can push work to thread-local queues (push system) or workers can pull from a shared queue (pull system). The pull system is simpler and self-load-balancing: idle workers naturally grab more work.
Patterns in code
Map pattern — parallel array processing
/* Each thread processes its own slice — no locks needed */ struct slice { int *arr; int start; int end; }; void *double_elements(void *arg) { struct slice *s = arg; for (int i = s->start; i < s->end; i++) s->arr[i] *= 2; /* no shared state — no mutex needed */ return NULL; }
Reduce pattern — parallel sum with private accumulators
struct sum_arg { int *arr; int start; int end; long partial; }; void *partial_sum(void *arg) { struct sum_arg *s = arg; s->partial = 0; for (int i = s->start; i < s->end; i++) s->partial += s->arr[i]; /* accumulate privately */ return NULL; } /* Main: after joining threads, combine partial results */ long total = 0; for (int t = 0; t < NTHREADS; t++) total += args[t].partial; /* single-threaded final reduction */
Thread pool skeleton
/* Work item: a function pointer + argument */ typedef struct { void *(*action)(void *); /* function to call */ void *arg; /* argument to pass */ } workitem_t; /* Worker thread: loop forever pulling work from queue */ void *worker_loop(void *arg) { while (1) { workitem_t *w = dequeue(); /* blocks if queue empty */ (*w->action)(w->arg); /* execute the work */ free(w); } }
Parallel patterns in action
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#define N 1000000
#define T 4
int arr[N];
struct arg { int start; int end; long partial; };
void *sum_slice(void *vp) {
struct arg *a = vp;
a->partial = 0;
for (int i = a->start; i < a->end; i++)
a->partial += arr[i];
return NULL;
}
int main(void) {
for (int i = 0; i < N; i++) arr[i] = 1; /* fill with 1s */
pthread_t tid[T];
struct arg args[T];
int slice = N / T;
/* MAP phase: each thread sums its slice */
for (int t = 0; t < T; t++) {
args[t].start = t * slice;
args[t].end = (t == T-1) ? N : (t+1)*slice;
pthread_create(&tid[t], NULL, sum_slice, &args[t]);
}
for (int t = 0; t < T; t++)
pthread_join(tid[t], NULL);
/* REDUCE phase: combine partial sums */
long total = 0;
for (int t = 0; t < T; t++) total += args[t].partial;
printf("Total: %ld\n", total); /* expect 1000000 */
return 0;
}
#include <stdio.h>
#include <pthread.h>
void merge(int *A, int p, int q, int r); /* merge two sorted halves */
void seq_sort(int *A, int p, int r); /* sequential mergesort */
struct ms_arg { int *A; int p; int r; int depth; int max_depth; };
void *par_mergesort(void *vp) {
struct ms_arg *a = vp;
if (a->depth >= a->max_depth || a->p >= a->r - 1) {
seq_sort(a->A, a->p, a->r); /* sequential fallback */
return NULL;
}
int q = (a->p + a->r) / 2;
struct ms_arg left = {a->A, a->p, q, a->depth+1, a->max_depth};
struct ms_arg right = {a->A, q+1, a->r, a->depth+1, a->max_depth};
pthread_t lt, rt;
pthread_create(<, NULL, par_mergesort, &left);
pthread_create(&rt, NULL, par_mergesort, &right);
pthread_join(lt, NULL);
pthread_join(rt, NULL);
merge(a->A, a->p, q, a->r); /* combine after both halves sorted */
return NULL;
}
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#define BUF_SIZE 8
int buffer[BUF_SIZE];
int head = 0, tail = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
sem_t empty_slots, full_slots;
void *producer(void *arg) {
for (int i = 0; i < 20; i++) {
sem_wait(&empty_slots); /* wait for space */
pthread_mutex_lock(&m);
buffer[tail++ % BUF_SIZE] = i; /* stage 1: produce */
pthread_mutex_unlock(&m);
sem_post(&full_slots); /* signal item ready */
}
return NULL;
}
void *consumer(void *arg) {
for (int i = 0; i < 20; i++) {
sem_wait(&full_slots); /* wait for item */
pthread_mutex_lock(&m);
int val = buffer[head++ % BUF_SIZE];
pthread_mutex_unlock(&m);
sem_post(&empty_slots); /* free a slot */
printf("Consumed: %d\n", val); /* stage 2: process */
}
return NULL;
}
Practice problems with solutions
For each task, name the most appropriate parallel pattern and briefly explain why: (a) Apply a brightness filter to every pixel of an image. (b) Find the maximum value in an array of 10 million integers. (c) A web server handling incoming HTTP requests using a fixed set of threads. (d) Sorting a large array by recursively splitting it in half and sorting each half in parallel.
(b) Reduce pattern. Each thread finds the max of its slice (map), then a final sequential step finds the max of the partial maxes (reduce). Cannot avoid the final combination step.
(c) Thread pool / master-worker. Incoming requests are tasks queued for a fixed pool of worker threads. Avoids creating a new thread per request, which would be too expensive.
(d) Divide and conquer. The array is recursively split; each half is sorted in a new thread; results are merged. Parallel mergesort is the textbook divide-and-conquer example.
A student writes a parallel sum where every thread does pthread_mutex_lock(&m); sum += arr[i]; pthread_mutex_unlock(&m); inside its loop. The answer is correct but the program is slower than the sequential version. Explain why, and describe the fix.
Fix: Each thread accumulates into a private (local) variable without any lock. After all threads finish, the main thread combines the private sums in a single sequential loop (only NTHREADS additions need synchronization). This reduces lock operations from O(N) to O(threads).
You have a web server that receives 10,000 requests per second. Option A: create a new thread for each request and destroy it when done. Option B: maintain a pool of 16 worker threads that pick requests from a queue. Compare the two approaches in terms of overhead, resource usage, and scalability.
Option B advantages: Threads are created once at startup. Each new request is just a queue enqueue operation (microseconds). Only 16 threads exist regardless of request rate — predictable memory use. The OS scheduler only context-switches 16 threads. Load balancing is automatic: idle threads grab the next request.
In parallel mergesort, one approach creates new pthreads at every recursion level down to single-element base cases (unlimited parallelism). Another approach stops creating threads after a fixed depth (fixed or scalable parallelism). What goes wrong with unlimited parallelism?
1. Thread explosion: Sorting an array of N elements with unlimited parallelism creates O(N) threads. For N = 1,000,000 that is 1,000,000 threads — each consuming megabytes of stack and requiring OS scheduling structures. The system runs out of memory and the OS is overwhelmed.
2. Overhead dominates at base cases: Near the base case (arrays of 1 or 2 elements), the work per thread is trivial (nanoseconds). But creating and joining a thread takes microseconds — the overhead is 1000x the useful work. The program is net-slower than sequential.
Solution: Switch from parallel to sequential mergesort once the recursion depth exceeds
max_depth (chosen based on number of available cores). Scalable parallelism adjusts max_depth based on the actual core count at runtime.
Key concepts to memorize
Test your understanding
sum with a mutex inside the loop?LO10