Why adding more threads does not always make things faster

Real-World Analogy — Restaurant Kitchen

Imagine a restaurant kitchen preparing a three-course dinner. Each meal has two phases: mise en place (peeling, chopping, measuring — each chef can do their own prep independently) and plating (the head chef must plate every dish personally, one at a time, before service). No matter how many prep cooks you hire, the bottleneck is the single head chef doing plating. Hiring ten prep cooks instead of two barely helps — most of them are idle while the head chef works. The head chef is the serial fraction of your kitchen. This is Amdahl's Law in a commercial kitchen.

Every parallel program is divided into two regions: a sequential part that cannot be parallelised (typically due to data dependencies or shared state), and a parallelisable part that can be divided across multiple threads or cores. Gene Amdahl observed in 1967 that the sequential part fundamentally limits how fast the whole program can ever become, regardless of how many processors you throw at it.

Why does a serial fraction limit you so severely? Say your program takes 100 seconds. 50 seconds are unavoidably sequential (the head chef). 50 seconds are parallelisable (the prep work). Even if you use infinitely many threads, the parallel portion drops to 0 seconds — but the sequential 50 seconds remains. Your total time is now 50 seconds. You have achieved at most a 2x speedup, forever, no matter how many cores you buy.

Strong scaling is what we described above: keep the total problem size fixed and add more processors. As you scale out, the amount of work per processor shrinks, and synchronisation overhead becomes a larger fraction. Curves level off and eventually adding more processors hurts performance. Weak scaling (Gustafson's Law) is different: grow the problem size proportionally with the number of processors, so each processor always has the same amount of work. Efficiency stays approximately constant — a much more optimistic scenario for large HPC workloads.

Strong vs Weak Scaling — One-Sentence Definitions

Strong scaling: Fixed problem size, more processors → measures how fast you can solve the same problem.
Weak scaling: Problem size grows with processors → measures how large a problem you can solve in the same time.

Load balancing is about making sure every core is doing useful work all the time. If one thread gets 90% of the array to sum while three threads get 10% combined, those three threads finish immediately and sit idle while the overloaded thread toils on. This load imbalance wastes CPU time and degrades speedup just as badly as a large serial fraction. Two strategies exist: static assignment (divide work at compile/design time — simple but inflexible) and dynamic assignment (a work queue from which threads pull tasks at runtime — better load balance, small per-task overhead).

False sharing is a hardware-level performance trap that looks like a correctness problem but is really a cache-coherence problem. Modern CPUs move data between cache and memory in chunks called cache lines (typically 64 bytes). If two threads each modify a different variable, but those two variables happen to sit in the same cache line, each modification forces the other core to invalidate its copy and re-fetch the whole line. The variables are logically independent, but the hardware treats them as shared — hence "false" sharing. The fix is to pad the variables so they occupy separate cache lines.

The Diminishing Returns Problem

With p = 0.95 (95% parallelisable), the theoretical max speedup is 1 / (1 - 0.95) = 20x, no matter how many cores you use. With p = 0.9 the max is 10x. A mere 10% serial fraction caps you at 10x forever. Real programs almost always have more serial overhead than developers expect.

Amdahl's formula, clock_gettime timing, and load-balancing patterns

Amdahl's Law — The Formula

Speedup(N) = 1 / ( (1 - p) + p/N )
p — fraction of the program that can run in parallel (0.0 to 1.0)
N — number of parallel threads / processors
(1 - p) — the serial fraction: the irreducible bottleneck
As N → ∞: Speedup → 1 / (1 - p) — the theoretical maximum
Efficiency = Speedup / N — ideal efficiency = 1 (linear speedup)

Quick Reference — Amdahl's Law Worked Values

Serial fraction (1-p) Parallel fraction p N = 4 threads N = 8 threads N = 32 threads N = ∞ (max)
0%100%4.00x8.00x32.00x
5%95%3.48x5.93x11.88x20x
10%90%3.08x4.71x7.80x10x
25%75%2.29x2.91x3.66x4x
50%50%1.60x1.78x1.94x2x
90%10%1.08x1.09x1.10x1.11x

Measuring Wallclock Time with clock_gettime

/* --- clock_gettime() timing pattern --- */
#include <time.h>     /* struct timespec, clock_gettime, CLOCK_MONOTONIC */
#include <stdio.h>

struct timespec ts_start, ts_end;
//    ^                               two timespec structs: one before, one after

clock_gettime(CLOCK_MONOTONIC, &ts_start);
//             ^                 ^
//             clock type:       pointer to fill
//             CLOCK_MONOTONIC = never goes backwards (safe across sleep/suspend)
//             CLOCK_REALTIME  = wall clock (can jump on NTP sync)

/* ... work you want to time ... */
do_parallel_work();

clock_gettime(CLOCK_MONOTONIC, &ts_end);

/* Convert to seconds (tv_sec) + nanoseconds (tv_nsec) */
double elapsed = (ts_end.tv_sec  - ts_start.tv_sec)
              + (ts_end.tv_nsec - ts_start.tv_nsec) * 1e-9;
//    ^    tv_sec difference = whole seconds
//         tv_nsec difference / 1e9 = fractional seconds (nanosecond precision)

printf("Elapsed: %.6f seconds\n", elapsed);
/* Compile with: gcc -O2 foo.c -lpthread -o foo  (link with -lrt on older Linux) */
struct timespec fields: tv_sec (seconds as time_t) and tv_nsec (nanoseconds 0–999,999,999). Always subtract start from end; never assume nanoseconds alone gives the answer — you must combine both fields.

Fixing False Sharing with Cache-Line Padding

/* Problem: two threads write private_sum[0] and private_sum[1].
   Both fit on one 64-byte cache line → false sharing → cache thrashing. */
unsigned long long private_sum[4];   /* BAD: all 4 values share one cache line */

/* Fix: pad each element to 64 bytes (one full cache line) */
struct padded_sum {
    unsigned long long value;   /* 8 bytes of actual data */
    char padding[56];          /* 56 bytes of filler = 64 bytes total */
} private_sum[MAX_THREADS];
/* Now each thread's value occupies its own cache line. */
/* No false sharing. Threads do not invalidate each other's cache lines. */

/* Find cache line size at runtime on Linux: */
/* cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size */
Rule: padding size = cache_line_size − sizeof(your_data). For a 64-byte cache line with an 8-byte unsigned long long, padding = 56 bytes. Without padding, every write by one thread forces the other thread's L1 cache to invalidate its copy of the same line — thousands of cache-coherence messages per second.

Load Balancing — Static vs Dynamic

Static Assignment
/* Divide array evenly at design time */
int chunk = length / num_threads;
int start = id * chunk;
int end   = start + chunk;

for (int i = start; i < end; i++) {
    process(array[i]);
}
/* Problem: uneven work per element
   → some threads finish early, idle */
Dynamic Assignment (Work Queue)
/* Threads pull work from a queue */
pthread_mutex_lock(&queue_lock);
while (next_item < total_items) {
    int item = next_item++;
    pthread_mutex_unlock(&queue_lock);
    process(array[item]);    /* work */
    pthread_mutex_lock(&queue_lock);
}
pthread_mutex_unlock(&queue_lock);
/* Threads always stay busy until done */

Profiling Tools Overview

ToolHow to useWhat it showsCost
time ./a.out Shell built-in, no recompile needed real (wallclock), user, sys time Zero overhead
clock_gettime Add to source, recompile Nanosecond-precision timer for any code region ~10 ns per call
gprof gcc -pg then gprof ./a.out gmon.out Call graph, function-level time, flat profile ~30% overhead
perf stat perf stat ./a.out HW counters: cycles, cache misses, branch mispredicts, IPC ~1–5% overhead
perf record/report perf record ./a.out && perf report Hotspot analysis — which functions consume the most cycles ~5% overhead
Intel VTune GUI tool, various analysis types Threading, memory, microarchitecture analysis Varies
The Heisenberg Effect in Profiling

Adding measurement to a program changes its behaviour. gprof inserts instrumentation code that slows the program by ~30%, which can shift which part appears as the bottleneck. Prefer perf (hardware counters) or clock_gettime (low overhead) for performance-critical measurement. Always note: "cold start" (first run, cold caches) vs "warm start" (subsequent runs, caches warm) can differ by 2–10x.

Applying the formula and measuring real speedup

Example 1 — Calculating speedup from a serial fraction (lecture example) Week 12 Lecture
Scenario: A program takes 100 seconds total.
  - Sequential (unavoidable) work: 25 + 25 = 50 seconds
  - Parallelisable work:                    50 seconds
  - p = 50/100 = 0.50   (50% parallelisable)

With 5 threads (N = 5):
  new_time = (1 - p) * T + (p * T) / N
           = 0.50 * 100 + 0.50 * 100 / 5
           = 50 + 10
           = 60 seconds

  Speedup = old_time / new_time = 100 / 60 = 1.67x

With infinite threads (theoretical max):
  max_speedup = 1 / (1 - p) = 1 / 0.50 = 2.0x

Even with unlimited processors, this program can never run faster
than 50 seconds, because 50 seconds is the serial portion.

Example 2 (from lecture): p = 0.75, N = 8 threads
  Speedup = 1 / ((1 - 0.75) + 0.75/8)
          = 1 / (0.25 + 0.09375)
          = 1 / 0.34375
          = 2.91x

  Theoretical max (N → ∞):
  max_speedup = 1 / (1 - 0.75) = 1 / 0.25 = 4.0x

Efficiency (N=8, p=0.75):
  Efficiency = Speedup / N = 2.91 / 8 = 0.36
  (Only 36% of the additional hardware is used effectively)
Key insight
The 25% serial fraction (0.25) is an absolute ceiling. Buying 100 more servers gives the same theoretical max as infinity: 4x. Buy fewer, get nearly the same result.
Example 2 — Measuring speedup with clock_gettime: parallel array sum Week 12 Lecture — case study
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

#define ARRAY_SIZE  (16 * 1024 * 1024)  /* 16 million elements */
#define MAX_THREADS 8
#define CACHE_LINE  64

/* --- Padded struct: each thread's partial sum on its own cache line --- */
struct padded_sum {
    unsigned long long value;
    char padding[CACHE_LINE - sizeof(unsigned long long)];  /* 56 bytes */
} private_sum[MAX_THREADS];

static unsigned char array[ARRAY_SIZE];
static int num_threads;

typedef struct { int id; int start; int end; } thread_arg_t;

/* Each thread accumulates its slice into its own private_sum */
void *sum_worker(void *arg) {
    thread_arg_t *a = (thread_arg_t *)arg;
    unsigned long long local = 0;
    for (int i = a->start; i < a->end; i++)
        local += array[i];
    private_sum[a->id].value = local;
    return NULL;
}

int main(int argc, char **argv) {
    num_threads = (argc > 1) ? atoi(argv[1]) : 4;

    /* Initialise array */
    for (int i = 0; i < ARRAY_SIZE; i++) array[i] = (unsigned char)(i % 256);

    /* --- Sequential baseline --- */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    unsigned long long seq_sum = 0;
    for (int i = 0; i < ARRAY_SIZE; i++) seq_sum += array[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double seq_time = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    /* --- Parallel version --- */
    pthread_t threads[MAX_THREADS];
    thread_arg_t args[MAX_THREADS];
    int chunk = ARRAY_SIZE / num_threads;

    for (int i = 0; i < num_threads; i++) {
        args[i].id    = i;
        args[i].start = i * chunk;
        args[i].end   = (i == num_threads - 1) ? ARRAY_SIZE : args[i].start + chunk;
        private_sum[i].value = 0;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < num_threads; i++)
        pthread_create(&threads[i], NULL, sum_worker, &args[i]);
    for (int i = 0; i < num_threads; i++)
        pthread_join(threads[i], NULL);

    unsigned long long par_sum = 0;
    for (int i = 0; i < num_threads; i++) par_sum += private_sum[i].value;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double par_time = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    printf("Sequential: %.4f s  (sum = %llu)\n", seq_time, seq_sum);
    printf("Parallel (%d threads): %.4f s  (sum = %llu)\n", num_threads, par_time, par_sum);
    printf("Speedup: %.2fx  Efficiency: %.2f\n",
           seq_time / par_time, (seq_time / par_time) / num_threads);
    return 0;
}
Sample output (4-core machine)
Sequential: 0.0621 s (sum = 2139095040)
Parallel (4 threads): 0.0220 s (sum = 2139095040)
Speedup: 2.82x Efficiency: 0.71
Example 3 — Demonstrating false sharing vs padded version Week 12 Lecture — pages 37-42
#include <stdio.h>
#include <pthread.h>
#include <time.h>

#define ITERS       100000000ULL
#define NUM_THREADS 4
#define CACHE_LINE  64

/* Version A: False sharing — all counters on one cache line */
unsigned long long counters_bad[NUM_THREADS];

/* Version B: Padded — each counter on its own cache line */
struct { unsigned long long v; char pad[CACHE_LINE - 8]; } counters_good[NUM_THREADS];

void *inc_bad(void *arg)  { int id = *(int*)arg; for (unsigned long long i=0;i<ITERS;i++) counters_bad[id]++;  return NULL; }
void *inc_good(void *arg) { int id = *(int*)arg; for (unsigned long long i=0;i<ITERS;i++) counters_good[id].v++; return NULL; }

double run_threads(void *(*fn)(void*)) {
    pthread_t t[NUM_THREADS]; int ids[NUM_THREADS];
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    for (int i=0;i<NUM_THREADS;i++) { ids[i]=i; pthread_create(&t[i],NULL,fn,&ids[i]); }
    for (int i=0;i<NUM_THREADS;i++) pthread_join(t[i],NULL);
    clock_gettime(CLOCK_MONOTONIC, &e);
    return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec)*1e-9;
}

int main(void) {
    printf("False sharing:  %.3f s\n", run_threads(inc_bad));
    printf("Padded (fixed): %.3f s\n", run_threads(inc_good));
    return 0;
}
Typical output on a 4-core machine
False sharing: 2.143 s
Padded (fixed): 0.387 s

The padded version is ~5.5x faster — all from a struct layout change.

Practice problems with solutions

P1 — Apply Amdahl's Law: three cases Week 12 Lecture + Tutorial

Use the formula Speedup = 1 / ((1-p) + p/N) to calculate the speedup and the theoretical maximum in each of these cases. Show your working.

(a) p = 0.90 (90% parallel), N = 4 threads
(b) p = 0.10 (10% parallel), N = 32 threads
(c) p = 0.95 (95% parallel), N = ∞ (theoretical maximum)

(a) p=0.90, N=4:
    Speedup = 1 / ((1 - 0.90) + 0.90/4)
            = 1 / (0.10 + 0.225)
            = 1 / 0.325
            = 3.08x
    Theoretical max (N→∞): 1 / (1 - 0.90) = 1 / 0.10 = 10x

(b) p=0.10, N=32:
    Speedup = 1 / ((1 - 0.10) + 0.10/32)
            = 1 / (0.90 + 0.003125)
            = 1 / 0.903125
            = 1.11x
    Theoretical max: 1 / 0.90 = 1.11x  (already near the ceiling!)
    Note: 90% serial fraction = almost no benefit from parallelism at all.

(c) p=0.95, N=∞:
    Speedup = 1 / ((1 - 0.95) + 0)
            = 1 / 0.05
            = 20x
    Even with an infinite number of cores, the 5% serial fraction
    is an absolute ceiling at 20x.
Key takeaway: Case (b) is the most dramatic demonstration of Amdahl's Law. Throwing 32 cores at a 90%-serial program gives just 1.11x speedup — barely worth the engineering effort. The serial fraction is the bottleneck, and no amount of additional parallelism helps it.
P2 — Identify the performance bottleneck in this code Week 12 Lecture — page 34

The code below is an attempt to parallelise an array sum across 4 threads. It is dramatically slower than the sequential version. Identify all performance problems and explain why each hurts performance.

#define T 4
unsigned long long sum = 0;
pthread_mutex_t m;

void *worker(void *arg) {
    int id = *(int *)arg;
    int chunk = ARRAY_SIZE / T;
    int start = id * chunk;
    for (int i = start; i < start + chunk; i++) {
        pthread_mutex_lock(&m);
        sum = sum + array[i];
        pthread_mutex_unlock(&m);
    }
    return NULL;
}
Problem 1 — Lock inside the innermost loop: The mutex is acquired and released for every single element. Each lock/unlock costs hundreds of nanoseconds due to memory barriers and cache-coherence traffic. For 16 million elements that is tens of billions of nanoseconds wasted on synchronisation alone — far exceeding the actual work.

Problem 2 — Highly contended lock: All 4 threads compete for the same mutex. Only one thread can hold it at a time. The parallel threads now execute entirely serially — worse than a single-threaded loop because each thread also pays the lock/unlock overhead.

Problem 3 — Memory traffic on shared variables: sum and m (the mutex) are written by all threads. Every write invalidates the cache line in the other cores, causing constant cache-coherence traffic (the "motion of sum and m" shown in the lecture).

Fix: Each thread should accumulate into a private local variable, then add its partial sum to the global sum exactly once with a mutex at the end. This reduces lock contention from N (elements per thread) acquisitions to exactly 1 per thread.
P3 — Explain false sharing: what goes wrong and how to fix it Week 12 Lecture — pages 39-42

A developer writes a parallel sum using a private-sum array as follows:

#define T 4
unsigned long long private_sum[T] = {0};

void *worker(void *arg) {
    int id = *(int *)arg;
    /* ... each thread writes only private_sum[id] ... */
    for (int i = start; i < end; i++)
        private_sum[id] += array[i];
    return NULL;
}

The developer is surprised to find this version is almost as slow as the mutex version. Explain what is happening at the hardware level and provide a fixed version.

/* Fix: pad each element to its own 64-byte cache line */
#define CACHE_LINE 64
struct padded_sum {
    unsigned long long value;               /* 8 bytes of real data */
    char padding[CACHE_LINE - sizeof(unsigned long long)]; /* 56 bytes filler */
} private_sum[T];

void *worker(void *arg) {
    int id = *(int *)arg;
    for (int i = start; i < end; i++)
        private_sum[id].value += array[i];  /* each on its own cache line */
    return NULL;
}
What goes wrong at the hardware level: CPUs move data between cache and RAM in cache lines of 64 bytes. unsigned long long private_sum[4] occupies only 32 bytes — all four 8-byte values fit on a single 64-byte cache line. When Thread 0 writes private_sum[0], the CPU must mark that entire cache line as modified ("dirty"). Thread 1's L1 cache sees the line is dirty in Thread 0's cache and must invalidate its own copy and re-fetch the line — even though Thread 1 only cares about private_sum[1]. This causes continuous cache-line bouncing between the cores' L1 caches via the cache-coherence protocol, creating the same memory traffic as true sharing. Padding separates the values onto distinct cache lines, ending the invalidation cascade.
P4 — Estimate max speedup and explain the measurement setup Week 12 Lecture + LO10

A profiling run shows that a video-transcoding program spends 8% of its time reading input/writing output (inherently sequential I/O), 2% in a sequential initialisation phase, and 90% in the parallel frame-encoding loop.

(a) What is the theoretical maximum speedup this program can achieve, no matter how many cores are used?
(b) You want to measure the speedup of the parallel encoder on 4, 8, and 16 cores. List three precautions you should take when benchmarking to get meaningful results.
(c) How would you use perf stat to check whether the bottleneck is in the CPU or memory?

(a) Serial fraction = 8% (I/O) + 2% (init) = 10%
    p = 0.90
    Max speedup = 1 / (1 - p) = 1 / 0.10 = 10x

    With 4 cores:  1 / (0.10 + 0.90/4)  = 1 / 0.325 = 3.08x
    With 8 cores:  1 / (0.10 + 0.90/8)  = 1 / 0.2125 = 4.71x
    With 16 cores: 1 / (0.10 + 0.90/16) = 1 / 0.15625 = 6.40x

(b) Three benchmarking precautions:
    1. Use clock_gettime(CLOCK_MONOTONIC, ...) around the work region only —
       not including program startup, argument parsing, or file I/O that is
       not part of the parallel section being measured.
    2. Run on an unloaded machine and repeat measurements multiple times
       (discard first "cold" run; report mean/median of warm runs). Other
       processes sharing the CPU pollute wallclock time.
    3. Use the same input data and compile flags (-O2) for all thread counts.
       Never compare a debug build to an optimised build.

(c) perf stat usage:
    $ perf stat -e cache-misses,cache-references,cycles,instructions ./encoder

    Key metrics to inspect:
    - If "cache-misses" is high (e.g. >5% miss rate) → memory-bound.
    - If "IPC" (instructions/cycles) is very low (< 0.5) → memory stalls.
    - If IPC is near 2–4 and cache misses are low → CPU-bound, good.
    $ perf stat -e LLC-load-misses ./encoder  # last-level cache misses
Why the max is 10x and not higher: The 10% serial fraction is the combined I/O + init time. These portions cannot be parallelised — I/O is gated by device bandwidth (a single hardware resource), and initialisation has data dependencies that prevent parallel execution. Even running the frame-encoder on a million cores cannot eliminate these 10 units of sequential work.
P5 — Strong vs weak scaling: which is which? Week 12 Lecture + Tutorial

For each scenario below, state whether it is an example of strong scaling or weak scaling, and briefly explain why.

(a) A weather simulation is sped up from 4 hours to 30 minutes by adding more nodes, while simulating the same geographic region at the same resolution.
(b) A climate model doubles the resolution (4x the data) every time the cluster doubles in size, targeting the same daily update cycle.
(c) A matrix multiply benchmark measures how long it takes to multiply a fixed 1024×1024 matrix as the thread count goes from 1 to 32.

(a) Strong scaling. The problem (same geographic region, same resolution) is fixed in size. Adding hardware reduces wall time for the same amount of work. Strong scaling measures "how fast can I solve this exact problem?"

(b) Weak scaling. As nodes double, the problem size also increases to keep work-per-node constant. The goal is constant time-to-completion despite growing problem complexity. Weak scaling asks "can I handle a bigger problem in the same time?"

(c) Strong scaling. The matrix is fixed at 1024×1024. More threads share the same fixed problem. This is a classic strong-scaling benchmark — Amdahl's Law predicts the curve will flatten as thread count grows due to synchronisation overhead and the serial merge step.

Key concepts to memorise

Card 1 of 10
Question — click to flip
Answer
Click card to flip • Use buttons to navigate

Test your understanding

Topic 29 Quiz — Parallel Performance Score: 0 / 6
1
A program is 80% parallelisable (p = 0.80). Using Amdahl's Law, what is the theoretical maximum speedup achievable with an unlimited number of cores?LO10
multiple choice — calculation
2
True or False: False sharing occurs when two threads write to different variables that happen to share the same CPU cache line, causing unnecessary cache-coherence traffic.LO10
true / false
3
Using Amdahl's Law, what is the speedup of a program with p = 0.75 running on N = 8 threads? (Select the closest answer.)LO10
multiple choice — calculation
4
Fill in the blank: In clock_gettime(CLOCK_MONOTONIC, &ts), the field ts.tv_nsec stores the nanoseconds component (0–999,999,999). To convert a struct timespec difference to seconds as a double, you compute: elapsed = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * ___LO10
fill in the blank
5
Which of the following is the best description of "strong scaling"?LO10
multiple choice
6
True or False: Dynamic load balancing (where threads pull work from a shared work queue at runtime) generally achieves better CPU utilisation than static load balancing (where work is pre-divided at design time), at the cost of small per-task synchronisation overhead.LO10
true / false
0/6
Quiz complete!