Performance of Parallel Programs
Amdahl's Law, serial fraction bottlenecks, strong vs weak scaling, false sharing, load balancing, measuring speedup with clock_gettime, and profiling with gprof / perf.
Why adding more threads does not always make things faster
Imagine a restaurant kitchen preparing a three-course dinner. Each meal has two phases: mise en place (peeling, chopping, measuring — each chef can do their own prep independently) and plating (the head chef must plate every dish personally, one at a time, before service). No matter how many prep cooks you hire, the bottleneck is the single head chef doing plating. Hiring ten prep cooks instead of two barely helps — most of them are idle while the head chef works. The head chef is the serial fraction of your kitchen. This is Amdahl's Law in a commercial kitchen.
Every parallel program is divided into two regions: a sequential part that cannot be parallelised (typically due to data dependencies or shared state), and a parallelisable part that can be divided across multiple threads or cores. Gene Amdahl observed in 1967 that the sequential part fundamentally limits how fast the whole program can ever become, regardless of how many processors you throw at it.
Why does a serial fraction limit you so severely? Say your program takes 100 seconds. 50 seconds are unavoidably sequential (the head chef). 50 seconds are parallelisable (the prep work). Even if you use infinitely many threads, the parallel portion drops to 0 seconds — but the sequential 50 seconds remains. Your total time is now 50 seconds. You have achieved at most a 2x speedup, forever, no matter how many cores you buy.
Strong scaling is what we described above: keep the total problem size fixed and add more processors. As you scale out, the amount of work per processor shrinks, and synchronisation overhead becomes a larger fraction. Curves level off and eventually adding more processors hurts performance. Weak scaling (Gustafson's Law) is different: grow the problem size proportionally with the number of processors, so each processor always has the same amount of work. Efficiency stays approximately constant — a much more optimistic scenario for large HPC workloads.
Strong scaling: Fixed problem size, more processors → measures how fast you can solve the same problem.
Weak scaling: Problem size grows with processors → measures how large a problem you can solve in the same time.
Load balancing is about making sure every core is doing useful work all the time. If one thread gets 90% of the array to sum while three threads get 10% combined, those three threads finish immediately and sit idle while the overloaded thread toils on. This load imbalance wastes CPU time and degrades speedup just as badly as a large serial fraction. Two strategies exist: static assignment (divide work at compile/design time — simple but inflexible) and dynamic assignment (a work queue from which threads pull tasks at runtime — better load balance, small per-task overhead).
False sharing is a hardware-level performance trap that looks like a correctness problem but is really a cache-coherence problem. Modern CPUs move data between cache and memory in chunks called cache lines (typically 64 bytes). If two threads each modify a different variable, but those two variables happen to sit in the same cache line, each modification forces the other core to invalidate its copy and re-fetch the whole line. The variables are logically independent, but the hardware treats them as shared — hence "false" sharing. The fix is to pad the variables so they occupy separate cache lines.
With p = 0.95 (95% parallelisable), the theoretical max speedup is 1 / (1 - 0.95) = 20x, no matter how many cores you use. With p = 0.9 the max is 10x. A mere 10% serial fraction caps you at 10x forever. Real programs almost always have more serial overhead than developers expect.
Amdahl's formula, clock_gettime timing, and load-balancing patterns
Amdahl's Law — The Formula
N — number of parallel threads / processors
(1 - p) — the serial fraction: the irreducible bottleneck
As N → ∞: Speedup → 1 / (1 - p) — the theoretical maximum
Efficiency = Speedup / N — ideal efficiency = 1 (linear speedup)
Quick Reference — Amdahl's Law Worked Values
| Serial fraction (1-p) | Parallel fraction p | N = 4 threads | N = 8 threads | N = 32 threads | N = ∞ (max) |
|---|---|---|---|---|---|
| 0% | 100% | 4.00x | 8.00x | 32.00x | ∞ |
| 5% | 95% | 3.48x | 5.93x | 11.88x | 20x |
| 10% | 90% | 3.08x | 4.71x | 7.80x | 10x |
| 25% | 75% | 2.29x | 2.91x | 3.66x | 4x |
| 50% | 50% | 1.60x | 1.78x | 1.94x | 2x |
| 90% | 10% | 1.08x | 1.09x | 1.10x | 1.11x |
Measuring Wallclock Time with clock_gettime
/* --- clock_gettime() timing pattern --- */ #include <time.h> /* struct timespec, clock_gettime, CLOCK_MONOTONIC */ #include <stdio.h> struct timespec ts_start, ts_end; // ^ two timespec structs: one before, one after clock_gettime(CLOCK_MONOTONIC, &ts_start); // ^ ^ // clock type: pointer to fill // CLOCK_MONOTONIC = never goes backwards (safe across sleep/suspend) // CLOCK_REALTIME = wall clock (can jump on NTP sync) /* ... work you want to time ... */ do_parallel_work(); clock_gettime(CLOCK_MONOTONIC, &ts_end); /* Convert to seconds (tv_sec) + nanoseconds (tv_nsec) */ double elapsed = (ts_end.tv_sec - ts_start.tv_sec) + (ts_end.tv_nsec - ts_start.tv_nsec) * 1e-9; // ^ tv_sec difference = whole seconds // tv_nsec difference / 1e9 = fractional seconds (nanosecond precision) printf("Elapsed: %.6f seconds\n", elapsed); /* Compile with: gcc -O2 foo.c -lpthread -o foo (link with -lrt on older Linux) */
tv_sec (seconds as time_t) and tv_nsec (nanoseconds 0–999,999,999). Always subtract start from end; never assume nanoseconds alone gives the answer — you must combine both fields.
Fixing False Sharing with Cache-Line Padding
/* Problem: two threads write private_sum[0] and private_sum[1]. Both fit on one 64-byte cache line → false sharing → cache thrashing. */ unsigned long long private_sum[4]; /* BAD: all 4 values share one cache line */ /* Fix: pad each element to 64 bytes (one full cache line) */ struct padded_sum { unsigned long long value; /* 8 bytes of actual data */ char padding[56]; /* 56 bytes of filler = 64 bytes total */ } private_sum[MAX_THREADS]; /* Now each thread's value occupies its own cache line. */ /* No false sharing. Threads do not invalidate each other's cache lines. */ /* Find cache line size at runtime on Linux: */ /* cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size */
unsigned long long, padding = 56 bytes. Without padding, every write by one thread forces the other thread's L1 cache to invalidate its copy of the same line — thousands of cache-coherence messages per second.
Load Balancing — Static vs Dynamic
/* Divide array evenly at design time */
int chunk = length / num_threads;
int start = id * chunk;
int end = start + chunk;
for (int i = start; i < end; i++) {
process(array[i]);
}
/* Problem: uneven work per element
→ some threads finish early, idle */
/* Threads pull work from a queue */
pthread_mutex_lock(&queue_lock);
while (next_item < total_items) {
int item = next_item++;
pthread_mutex_unlock(&queue_lock);
process(array[item]); /* work */
pthread_mutex_lock(&queue_lock);
}
pthread_mutex_unlock(&queue_lock);
/* Threads always stay busy until done */
Profiling Tools Overview
| Tool | How to use | What it shows | Cost |
|---|---|---|---|
time ./a.out |
Shell built-in, no recompile needed | real (wallclock), user, sys time | Zero overhead |
clock_gettime |
Add to source, recompile | Nanosecond-precision timer for any code region | ~10 ns per call |
gprof |
gcc -pg then gprof ./a.out gmon.out |
Call graph, function-level time, flat profile | ~30% overhead |
perf stat |
perf stat ./a.out |
HW counters: cycles, cache misses, branch mispredicts, IPC | ~1–5% overhead |
perf record/report |
perf record ./a.out && perf report |
Hotspot analysis — which functions consume the most cycles | ~5% overhead |
Intel VTune |
GUI tool, various analysis types | Threading, memory, microarchitecture analysis | Varies |
Adding measurement to a program changes its behaviour. gprof inserts instrumentation code that slows the program by ~30%, which can shift which part appears as the bottleneck. Prefer perf (hardware counters) or clock_gettime (low overhead) for performance-critical measurement. Always note: "cold start" (first run, cold caches) vs "warm start" (subsequent runs, caches warm) can differ by 2–10x.
Applying the formula and measuring real speedup
Scenario: A program takes 100 seconds total.
- Sequential (unavoidable) work: 25 + 25 = 50 seconds
- Parallelisable work: 50 seconds
- p = 50/100 = 0.50 (50% parallelisable)
With 5 threads (N = 5):
new_time = (1 - p) * T + (p * T) / N
= 0.50 * 100 + 0.50 * 100 / 5
= 50 + 10
= 60 seconds
Speedup = old_time / new_time = 100 / 60 = 1.67x
With infinite threads (theoretical max):
max_speedup = 1 / (1 - p) = 1 / 0.50 = 2.0x
Even with unlimited processors, this program can never run faster
than 50 seconds, because 50 seconds is the serial portion.
Example 2 (from lecture): p = 0.75, N = 8 threads
Speedup = 1 / ((1 - 0.75) + 0.75/8)
= 1 / (0.25 + 0.09375)
= 1 / 0.34375
= 2.91x
Theoretical max (N → ∞):
max_speedup = 1 / (1 - 0.75) = 1 / 0.25 = 4.0x
Efficiency (N=8, p=0.75):
Efficiency = Speedup / N = 2.91 / 8 = 0.36
(Only 36% of the additional hardware is used effectively)
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#define ARRAY_SIZE (16 * 1024 * 1024) /* 16 million elements */
#define MAX_THREADS 8
#define CACHE_LINE 64
/* --- Padded struct: each thread's partial sum on its own cache line --- */
struct padded_sum {
unsigned long long value;
char padding[CACHE_LINE - sizeof(unsigned long long)]; /* 56 bytes */
} private_sum[MAX_THREADS];
static unsigned char array[ARRAY_SIZE];
static int num_threads;
typedef struct { int id; int start; int end; } thread_arg_t;
/* Each thread accumulates its slice into its own private_sum */
void *sum_worker(void *arg) {
thread_arg_t *a = (thread_arg_t *)arg;
unsigned long long local = 0;
for (int i = a->start; i < a->end; i++)
local += array[i];
private_sum[a->id].value = local;
return NULL;
}
int main(int argc, char **argv) {
num_threads = (argc > 1) ? atoi(argv[1]) : 4;
/* Initialise array */
for (int i = 0; i < ARRAY_SIZE; i++) array[i] = (unsigned char)(i % 256);
/* --- Sequential baseline --- */
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
unsigned long long seq_sum = 0;
for (int i = 0; i < ARRAY_SIZE; i++) seq_sum += array[i];
clock_gettime(CLOCK_MONOTONIC, &t1);
double seq_time = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
/* --- Parallel version --- */
pthread_t threads[MAX_THREADS];
thread_arg_t args[MAX_THREADS];
int chunk = ARRAY_SIZE / num_threads;
for (int i = 0; i < num_threads; i++) {
args[i].id = i;
args[i].start = i * chunk;
args[i].end = (i == num_threads - 1) ? ARRAY_SIZE : args[i].start + chunk;
private_sum[i].value = 0;
}
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int i = 0; i < num_threads; i++)
pthread_create(&threads[i], NULL, sum_worker, &args[i]);
for (int i = 0; i < num_threads; i++)
pthread_join(threads[i], NULL);
unsigned long long par_sum = 0;
for (int i = 0; i < num_threads; i++) par_sum += private_sum[i].value;
clock_gettime(CLOCK_MONOTONIC, &t1);
double par_time = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
printf("Sequential: %.4f s (sum = %llu)\n", seq_time, seq_sum);
printf("Parallel (%d threads): %.4f s (sum = %llu)\n", num_threads, par_time, par_sum);
printf("Speedup: %.2fx Efficiency: %.2f\n",
seq_time / par_time, (seq_time / par_time) / num_threads);
return 0;
}
Parallel (4 threads): 0.0220 s (sum = 2139095040)
Speedup: 2.82x Efficiency: 0.71
#include <stdio.h>
#include <pthread.h>
#include <time.h>
#define ITERS 100000000ULL
#define NUM_THREADS 4
#define CACHE_LINE 64
/* Version A: False sharing — all counters on one cache line */
unsigned long long counters_bad[NUM_THREADS];
/* Version B: Padded — each counter on its own cache line */
struct { unsigned long long v; char pad[CACHE_LINE - 8]; } counters_good[NUM_THREADS];
void *inc_bad(void *arg) { int id = *(int*)arg; for (unsigned long long i=0;i<ITERS;i++) counters_bad[id]++; return NULL; }
void *inc_good(void *arg) { int id = *(int*)arg; for (unsigned long long i=0;i<ITERS;i++) counters_good[id].v++; return NULL; }
double run_threads(void *(*fn)(void*)) {
pthread_t t[NUM_THREADS]; int ids[NUM_THREADS];
struct timespec s, e;
clock_gettime(CLOCK_MONOTONIC, &s);
for (int i=0;i<NUM_THREADS;i++) { ids[i]=i; pthread_create(&t[i],NULL,fn,&ids[i]); }
for (int i=0;i<NUM_THREADS;i++) pthread_join(t[i],NULL);
clock_gettime(CLOCK_MONOTONIC, &e);
return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec)*1e-9;
}
int main(void) {
printf("False sharing: %.3f s\n", run_threads(inc_bad));
printf("Padded (fixed): %.3f s\n", run_threads(inc_good));
return 0;
}
Padded (fixed): 0.387 s
The padded version is ~5.5x faster — all from a struct layout change.
Practice problems with solutions
Use the formula Speedup = 1 / ((1-p) + p/N) to calculate the speedup and the theoretical maximum in each of these cases. Show your working.
(a) p = 0.90 (90% parallel), N = 4 threads
(b) p = 0.10 (10% parallel), N = 32 threads
(c) p = 0.95 (95% parallel), N = ∞ (theoretical maximum)
(a) p=0.90, N=4:
Speedup = 1 / ((1 - 0.90) + 0.90/4)
= 1 / (0.10 + 0.225)
= 1 / 0.325
= 3.08x
Theoretical max (N→∞): 1 / (1 - 0.90) = 1 / 0.10 = 10x
(b) p=0.10, N=32:
Speedup = 1 / ((1 - 0.10) + 0.10/32)
= 1 / (0.90 + 0.003125)
= 1 / 0.903125
= 1.11x
Theoretical max: 1 / 0.90 = 1.11x (already near the ceiling!)
Note: 90% serial fraction = almost no benefit from parallelism at all.
(c) p=0.95, N=∞:
Speedup = 1 / ((1 - 0.95) + 0)
= 1 / 0.05
= 20x
Even with an infinite number of cores, the 5% serial fraction
is an absolute ceiling at 20x.
The code below is an attempt to parallelise an array sum across 4 threads. It is dramatically slower than the sequential version. Identify all performance problems and explain why each hurts performance.
#define T 4
unsigned long long sum = 0;
pthread_mutex_t m;
void *worker(void *arg) {
int id = *(int *)arg;
int chunk = ARRAY_SIZE / T;
int start = id * chunk;
for (int i = start; i < start + chunk; i++) {
pthread_mutex_lock(&m);
sum = sum + array[i];
pthread_mutex_unlock(&m);
}
return NULL;
}
Problem 2 — Highly contended lock: All 4 threads compete for the same mutex. Only one thread can hold it at a time. The parallel threads now execute entirely serially — worse than a single-threaded loop because each thread also pays the lock/unlock overhead.
Problem 3 — Memory traffic on shared variables:
sum and m (the mutex) are written by all threads. Every write invalidates the cache line in the other cores, causing constant cache-coherence traffic (the "motion of sum and m" shown in the lecture).Fix: Each thread should accumulate into a private local variable, then add its partial sum to the global
sum exactly once with a mutex at the end. This reduces lock contention from N (elements per thread) acquisitions to exactly 1 per thread.
A developer writes a parallel sum using a private-sum array as follows:
#define T 4
unsigned long long private_sum[T] = {0};
void *worker(void *arg) {
int id = *(int *)arg;
/* ... each thread writes only private_sum[id] ... */
for (int i = start; i < end; i++)
private_sum[id] += array[i];
return NULL;
}
The developer is surprised to find this version is almost as slow as the mutex version. Explain what is happening at the hardware level and provide a fixed version.
/* Fix: pad each element to its own 64-byte cache line */
#define CACHE_LINE 64
struct padded_sum {
unsigned long long value; /* 8 bytes of real data */
char padding[CACHE_LINE - sizeof(unsigned long long)]; /* 56 bytes filler */
} private_sum[T];
void *worker(void *arg) {
int id = *(int *)arg;
for (int i = start; i < end; i++)
private_sum[id].value += array[i]; /* each on its own cache line */
return NULL;
}
unsigned long long private_sum[4] occupies only 32 bytes — all four 8-byte values fit on a single 64-byte cache line. When Thread 0 writes private_sum[0], the CPU must mark that entire cache line as modified ("dirty"). Thread 1's L1 cache sees the line is dirty in Thread 0's cache and must invalidate its own copy and re-fetch the line — even though Thread 1 only cares about private_sum[1]. This causes continuous cache-line bouncing between the cores' L1 caches via the cache-coherence protocol, creating the same memory traffic as true sharing. Padding separates the values onto distinct cache lines, ending the invalidation cascade.
A profiling run shows that a video-transcoding program spends 8% of its time reading input/writing output (inherently sequential I/O), 2% in a sequential initialisation phase, and 90% in the parallel frame-encoding loop.
(a) What is the theoretical maximum speedup this program can achieve, no matter how many cores are used?
(b) You want to measure the speedup of the parallel encoder on 4, 8, and 16 cores. List three precautions you should take when benchmarking to get meaningful results.
(c) How would you use perf stat to check whether the bottleneck is in the CPU or memory?
(a) Serial fraction = 8% (I/O) + 2% (init) = 10%
p = 0.90
Max speedup = 1 / (1 - p) = 1 / 0.10 = 10x
With 4 cores: 1 / (0.10 + 0.90/4) = 1 / 0.325 = 3.08x
With 8 cores: 1 / (0.10 + 0.90/8) = 1 / 0.2125 = 4.71x
With 16 cores: 1 / (0.10 + 0.90/16) = 1 / 0.15625 = 6.40x
(b) Three benchmarking precautions:
1. Use clock_gettime(CLOCK_MONOTONIC, ...) around the work region only —
not including program startup, argument parsing, or file I/O that is
not part of the parallel section being measured.
2. Run on an unloaded machine and repeat measurements multiple times
(discard first "cold" run; report mean/median of warm runs). Other
processes sharing the CPU pollute wallclock time.
3. Use the same input data and compile flags (-O2) for all thread counts.
Never compare a debug build to an optimised build.
(c) perf stat usage:
$ perf stat -e cache-misses,cache-references,cycles,instructions ./encoder
Key metrics to inspect:
- If "cache-misses" is high (e.g. >5% miss rate) → memory-bound.
- If "IPC" (instructions/cycles) is very low (< 0.5) → memory stalls.
- If IPC is near 2–4 and cache misses are low → CPU-bound, good.
$ perf stat -e LLC-load-misses ./encoder # last-level cache misses
For each scenario below, state whether it is an example of strong scaling or weak scaling, and briefly explain why.
(a) A weather simulation is sped up from 4 hours to 30 minutes by adding more nodes, while simulating the same geographic region at the same resolution.
(b) A climate model doubles the resolution (4x the data) every time the cluster doubles in size, targeting the same daily update cycle.
(c) A matrix multiply benchmark measures how long it takes to multiply a fixed 1024×1024 matrix as the thread count goes from 1 to 32.
(b) Weak scaling. As nodes double, the problem size also increases to keep work-per-node constant. The goal is constant time-to-completion despite growing problem complexity. Weak scaling asks "can I handle a bigger problem in the same time?"
(c) Strong scaling. The matrix is fixed at 1024×1024. More threads share the same fixed problem. This is a classic strong-scaling benchmark — Amdahl's Law predicts the curve will flatten as thread count grows due to synchronisation overhead and the serial merge step.
Key concepts to memorise
Test your understanding
clock_gettime(CLOCK_MONOTONIC, &ts), the field ts.tv_nsec stores the nanoseconds component (0–999,999,999). To convert a struct timespec difference to seconds as a double, you compute: elapsed = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * ___LO10