Is Python Code Sensitive to CPU Caching?

Cache-aware programming can make a huge performance difference, especially when writing code in C++ or Rust. Python is a much higher-level language and doesn't give us that level of control over the memory layout of our data structures. So does this mean that CPU caching effects aren't relevant in Python?

In this post, we'll conduct some basic experiments to answer this question by accessing list elements either in sequential order or in random order.

Results indicate that randomized access is consistently slower in Python. In particular, as soon as problem sizes outgrow the CPU caches, random access becomes multiple times slower. Thus, some degree of cache-aware programming may be relevant in interpreted environments like CPython 3.12 as well.

Disclaimers: Results may vary on different software and hardware. The software used for these tests is CPython 3.12. The CPU uses the x86-64 (amd64) architecture, Zen 2 microarchitecture. This CPU generation had a fairly complex cache hierarchy due to grouping cores into separate "core complexes" that partially share some of their cache. Each core should see up to 32KB of L1, 512KB of L2, and 16MB of L3 cache.

Background

Programs store data in memory (RAM), but this is slow. CPU caches are faster, but much smaller. The CPU has to decide which parts of the RAM to load into the caches. Of course explicitly accessed memory addresses will be loaded into the cache. But the act of loading data is comparatively slow, so the CPU will try to pre-load relevant data.

For example, we can expect the CPU to detect linear array access patterns and automatically pre-load the next chunks of an array. So foreach style loops over an array where all relevant data is directly inside the array can be quite cache-friendly, whereas random access cannot be predicted.

In practice, many arrays contain pointers instead. We must dereference the pointer to a different memory location in order to get to the actual data. Maybe the pointed-to location is already in the cache, maybe not and we will have to wait – effectively also a kind of random access.

In Java (OpenJDK) and Python (CPython), all normal objects are actually represented as pointers to a more complex data structure. In C notation, an array wouldn't contain PyObject values, but PyObject* pointers.

So clearly, Python data structures won't be as cache-friendly as data structures in C++ or Rust could be.

But Python already has a ton of interpreter overhead compared to natively compiled languages, so maybe this doesn't matter?

  • In my experience, interpreter overhead tends to be in the 2× – 100× range, typically at least 10×, and much worse if meaningful compiler optimizations are possible.
  • But typical values for the cost of accessing main memory vs L1 cached data also tend to be in the 100× range.

All of this is very rule-of-thumb, back-of-the-napkin, but it suggests that interpreter overhead wouldn't drown out cache effects.

Designing an experiment

Our working hypothesis is: cache effects are detectable in Python programs.

This is not directly testable, so let's refine it based on the array iteration example above. What we could try to observe is that the following program should take longer when indices = shuffled(range(len(values))) ("random" access) than when the elements are accessed in order (indices = list(range(len(values)))):

sum(values[i] for i in indices)

But there are many different objects at play here:

  • indices is an array containing pointers to integer objects
  • i is effectively a pointer to such an integer object
  • values is an array containing pointers to objects
  • values[i] is effectively one such pointer

To drop to C notation for a moment, this means we have four different kinds of memory locations:

  • indices[…] / i, which contains a pointer
  • *i, which contains the actual index
  • values[*i], which contains a pointer
  • *values[*i], which contains the actual value

If we want to design the experiment properly, we have to consider cache effects for all of these memory locations.

  • The indices array is likely unproblematic as it will be iterated in order.

  • Each index *i however requires a memory access. To reduce the impact of cache effects, it would be nice if these values are laid out sequentially in memory. Python doesn't let us control that, but in practice objects are likely to be allocated mostly one after the other. So we want to create those integer objects in the order in which they will be referenced. That means we must not shuffle indices within the Python process that runs the benchmark, but can read in already-shuffled test data.

  • The values array is unproblematic, as it is the main cache effect that we want to test.

  • The *values[*i] elements may or may not matter. Either the CPU is able to figure out that the array elements are actually pointers and pre-loads the pointed-to objects, or cache effects would mean that the example runs faster if the value objects are laid out in the same order in which they are referenced by the values array.

I'm not sure how to test the impact of the order of the value objects, so let's stick to the same trick of relying on allocation order to increase the chance of a cache-friendly data layout.
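As a rough sanity check of that allocation-order assumption, here is a small sketch. It relies on the CPython implementation detail that id() returns an object's memory address, so it is only a hint, not a guarantee:

# Allocate integer objects in a known order, then check whether their
# addresses mostly increase. In CPython, id() happens to be the object's
# memory address; this is an implementation detail, not a guarantee.
objs = [1_000_000 + i for i in range(100_000)]
addrs = [id(x) for x in objs]
increasing = sum(b > a for a, b in zip(addrs, addrs[1:]))
print(f"{increasing / (len(addrs) - 1):.1%} of consecutive allocations have increasing addresses")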

We can generate test data in a format as follows:

<size>,<iterations>,<expected sum>
<indices>,...
<values>,...

For example, the following would describe an in-order access of values (the order of indices matters, the order of values should not):

5,1,10
0,1,2,3,4
3,2,4,0,1
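
For comparison, a shuffled-access test case over the same values (a made-up example, not output of the generator below) might look like this; the expected sum stays the same because every element is still visited exactly once:

5,1,10
2,0,4,1,3
3,2,4,0,1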

I don't want to bother with strings and number parsing, so let's use Python's struct and array modules to pack the test data into a binary representation.

Here is the main program that we will benchmark:

import sys, struct, array
def parse_array(f, fmt, n):
    arr = array.array(fmt)
    arr.fromfile(f, n)
    return arr.tolist()

# parse the test case
header_layout = struct.Struct("=4sLLQ")
magic, size, iterations, expected_sum = \
    header_layout.unpack(sys.stdin.buffer.read(header_layout.size))
assert magic == b"cach",  f"wrong magic number {magic}"
indices = parse_array(sys.stdin.buffer, "L", size)
values = parse_array(sys.stdin.buffer, "L", size)

# run the test case
for _ in range(iterations):
    s = sum(values[i] for i in indices)
    assert s == expected_sum

Here's a program to generate a test case:

import sys, struct, array, random, click

@click.command()
@click.option("--size", type=int, required=True)
@click.option("--iterations", type=int, required=True)
@click.option("--shuffle/--no-shuffle", required=True)
@click.option("--seed", type=int, required=True)
def cli(size: int, iterations: int, shuffle: bool, seed: int):
    rng = random.Random(seed)
    values = list(range(size))
    rng.shuffle(values)  # should not affect benchmarks, but best randomize
    indices = list(range(size))
    if shuffle:
        rng.shuffle(indices)

    header_layout = struct.Struct("=4sLLQ")
    sys.stdout.buffer.write(header_layout.pack(b"cach", size, iterations, sum(values)))
    array.array("L", indices).tofile(sys.stdout.buffer)
    array.array("L", values).tofile(sys.stdout.buffer)

if __name__ == "__main__":
    cli()

By the way, why don't we just use an array instead of a Python list for our tests? After all, an array lets us control the memory layout.

  • For the values, we want to use a list because we're interested in the cache effects of typical Python data structures.
  • For the indices, an array wouldn't help much, because accessing an indices[…] element would still have to create a PyObject* on the fly (see the sketch below). It's probably better to get all of these allocations out of the way before running the actual tests.
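
To illustrate the second point, here is a small sketch. The observed behavior is a CPython implementation detail rather than a language guarantee, but it shows the difference between handing out an already-existing object and boxing a fresh one on every access:

import array

lst = [100_000]
arr = array.array("L", [100_000])

# A list stores PyObject* pointers, so indexing returns the stored object itself.
print(lst[0] is lst[0])  # True
# An array stores raw machine integers, so indexing creates a new int object
# on every access (for values outside CPython's small-integer cache).
print(arr[0] is arr[0])  # False on CPython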

How much memory will actually get used for a particular example size?

  • While the above layouts encode integers as 32-bit or 64-bit ints (4 or 8 bytes), the in-memory representation will be a PyObject*.
  • That's one pointer (8 bytes) in the list itself,
  • plus another 28 bytes (3.5 pointers' worth) for the PyObject data of a small integer, as reported by sys.getsizeof() (see the sketch after this list).
  • However, alignment considerations will probably round this up to 32 bytes (4 pointers' worth).
  • In total, that is 40 bytes (5 pointers' worth) per element in values, plus the same for the indices.
  • To observe meaningful results, our problem sizes should be a lot larger than the available cache size. For 16MB cache, this means values should contain at least 420000 entries, or half that amount if we also consider cache pressure from indices.
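
As a rough cross-check of these numbers, here is a small sketch. The 16-byte allocator alignment and the example problem size are assumptions; the exact values are CPython and platform implementation details:

import sys

size = 1_600_000                                 # hypothetical example problem size
per_object = sys.getsizeof(10**6)                # PyObject overhead of a small int, typically 28 bytes
per_object_aligned = -(-per_object // 16) * 16   # assume 16-byte allocator alignment -> 32 bytes
per_element = 8 + per_object_aligned             # one PyObject* list slot plus the int object itself
total = 2 * size * per_element                   # values plus indices
print(f"{per_element} bytes per element, ~{total / 2**20:.0f} MiB for values plus indices")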

This allows us to formulate some expectations:

  • For problem sizes below around 200000, we expect linear and randomized access to work similarly well.
  • Hypothesis: for problem sizes above 420000, we expect linear array access to be significantly faster than randomized access.

We can use the hyperfine tool (v1.18) to run the individual benchmarks. It supports parametrized benchmarks.

All experiments will use the following form, though we will vary the --size and --iterations parameters.

hyperfine \
  --setup 'python generate_data.py --seed 1234 --iterations {iterations} --size {size} --{shuffle} >example_{size}_{shuffle}.bin' \
  --command-name 'size={size} iters={iterations} {shuffle}' \
  -L size 100,1000,10000,100000 \
  -L iterations 5 \
  -L shuffle shuffle,no-shuffle \
  'python run_benchmark.py <example_{size}_{shuffle}.bin'

Results

We can now conduct a couple of experiments to investigate caching impact.

Small data sets

Our above analysis suggests that problem sizes below around 200000 should not be affected by caching. Let's ask hyperfine to compare shuffled against in-order access for size in {50_000, 100_000, 200_000}. Note that we're also adjusting the iterations so that the total number of array accesses remains constant at 10 million, which also allows us to compare results across sizes.

Results for size=50_000:
Benchmark 1: size=50000 iters=200 shuffle (iterations = 200)
  Time (mean ± σ):     462.0 ms ±  13.4 ms    [User: 456.7 ms, System: 5.2 ms]
  Range (min … max):   439.5 ms … 477.5 ms    10 runs

Benchmark 2: size=50000 iters=200 no-shuffle (iterations = 200)
  Time (mean ± σ):     374.1 ms ±  17.3 ms    [User: 369.2 ms, System: 4.8 ms]
  Range (min … max):   350.0 ms … 394.3 ms    10 runs

Summary
  size=50000 iters=200 no-shuffle (iterations = 200) ran
    1.23 ± 0.07 times faster than size=50000 iters=200 shuffle (iterations = 200)
Results for size=100_000:
Benchmark 1: size=100000 iters=100 shuffle (iterations = 100)
  Time (mean ± σ):     479.6 ms ±  19.6 ms    [User: 470.2 ms, System: 9.3 ms]
  Range (min … max):   453.7 ms … 519.8 ms    10 runs

Benchmark 2: size=100000 iters=100 no-shuffle (iterations = 100)
  Time (mean ± σ):     384.4 ms ±  14.0 ms    [User: 376.0 ms, System: 8.3 ms]
  Range (min … max):   361.4 ms … 405.8 ms    10 runs

Summary
  size=100000 iters=100 no-shuffle (iterations = 100) ran
    1.25 ± 0.07 times faster than size=100000 iters=100 shuffle (iterations = 100)
Results for size=200_000:
Benchmark 1: size=200000 iters=50 shuffle (iterations = 50)
  Time (mean ± σ):     567.8 ms ±  35.4 ms    [User: 557.7 ms, System: 9.9 ms]
  Range (min … max):   501.0 ms … 634.4 ms    10 runs

Benchmark 2: size=200000 iters=50 no-shuffle (iterations = 50)
  Time (mean ± σ):     385.2 ms ±  13.8 ms    [User: 377.3 ms, System: 7.8 ms]
  Range (min … max):   371.4 ms … 407.8 ms    10 runs

Summary
  size=200000 iters=50 no-shuffle (iterations = 50) ran
    1.47 ± 0.11 times faster than size=200000 iters=50 shuffle (iterations = 50)

Interpretation:

For all tested problem sizes, we see that randomized (shuffle) access is significantly slower than linear array access (no-shuffle). For sizes 50_000 and 100_000 we see a factor of around 1.24, but for size 200_000 we're already up to a factor of 1.47.

We can reformat the data to compare the mean elapsed times of the shuffle and no-shuffle runs across different problem sizes:

size     | no-shuffle          | shuffle             | rel. slowdown
50_000   | 374.1 ms ± 17.3 ms  | 462.0 ms ± 13.4 ms  | 1.23 ± 0.07
100_000  | 384.4 ms ± 14.0 ms  | 479.6 ms ± 19.6 ms  | 1.25 ± 0.07
200_000  | 385.2 ms ± 13.8 ms  | 567.8 ms ± 35.4 ms  | 1.47 ± 0.11

We can see that the linear no-shuffle mode (the no-shuffle column) maintains roughly constant speed across the different problem sizes: all of these timings are within each other's margin of error.

But for the shuffle mode, we see differences. While the two smaller problem sizes have speeds within each other's margin of error, the size = 200_000 runs are significantly slower.

This is broadly in line with our expectations.

  • We see that linear array access provides very consistent performance across problem sizes.
  • We see that random access is consistently slower, but the overhead isn't terrible (only a factor of about 1.24 at smaller problem sizes).
  • As we approach the region where our problem no longer fits into the cache, performance of random access starts to deteriorate.

Large data sets

Next, let's measure performance when we cross into the region of problem sizes that cannot fit into the cache. In order to maintain a nice power-of-two sequence, the iterations were increased a bit, keeping a constant volume of indexing operations at 12.8 million.

  • size= 200_000 iters=64 (size matches largest experiment from the "small data set" round)
  • size= 400_000 iters=32
  • size= 800_000 iters=16
  • size=1600_000 iters= 8
Results for size=200_000:
Benchmark 1: size=200000 iters=64 shuffle (iterations = 64)
  Time (mean ± σ):     693.9 ms ±  40.1 ms    [User: 684.6 ms, System: 9.1 ms]
  Range (min … max):   641.0 ms … 751.8 ms    10 runs

Benchmark 2: size=200000 iters=64 no-shuffle (iterations = 64)
  Time (mean ± σ):     496.8 ms ±  10.3 ms    [User: 483.1 ms, System: 13.6 ms]
  Range (min … max):   482.2 ms … 515.4 ms    10 runs

Summary
  size=200000 iters=64 no-shuffle (iterations = 64) ran
    1.40 ± 0.09 times faster than size=200000 iters=64 shuffle (iterations = 64)
Results for size=400_000:
Benchmark 1: size=400000 iters=32 shuffle (iterations = 32)
  Time (mean ± σ):      1.293 s ±  0.071 s    [User: 1.275 s, System: 0.018 s]
  Range (min … max):    1.217 s …  1.452 s    10 runs

Benchmark 2: size=400000 iters=32 no-shuffle (iterations = 32)
  Time (mean ± σ):     515.2 ms ±  16.2 ms    [User: 497.3 ms, System: 17.7 ms]
  Range (min … max):   485.7 ms … 541.0 ms    10 runs

Summary
  size=400000 iters=32 no-shuffle (iterations = 32) ran
    2.51 ± 0.16 times faster than size=400000 iters=32 shuffle (iterations = 32)
Results for size=800_000:
Benchmark 1: size=800000 iters=16 shuffle (iterations = 16)
  Time (mean ± σ):      1.855 s ±  0.036 s    [User: 1.824 s, System: 0.030 s]
  Range (min … max):    1.805 s …  1.926 s    10 runs

Benchmark 2: size=800000 iters=16 no-shuffle (iterations = 16)
  Time (mean ± σ):     550.0 ms ±  11.8 ms    [User: 521.5 ms, System: 28.4 ms]
  Range (min … max):   532.4 ms … 563.1 ms    10 runs

Summary
  size=800000 iters=16 no-shuffle (iterations = 16) ran
    3.37 ± 0.10 times faster than size=800000 iters=16 shuffle (iterations = 16)
Results for size=1600_000:
Benchmark 1: size=1600000 iters=8 shuffle (iterations = 8)
  Time (mean ± σ):      2.379 s ±  0.034 s    [User: 2.327 s, System: 0.052 s]
  Range (min … max):    2.336 s …  2.428 s    10 runs

Benchmark 2: size=1600000 iters=8 no-shuffle (iterations = 8)
  Time (mean ± σ):     628.4 ms ±  12.2 ms    [User: 570.3 ms, System: 57.9 ms]
  Range (min … max):   606.3 ms … 647.2 ms    10 runs

Summary
  size=1600000 iters=8 no-shuffle (iterations = 8) ran
    3.79 ± 0.09 times faster than size=1600000 iters=8 shuffle (iterations = 8)

Table for comparing the mean elapsed time and the relative slowdown of random access:

size      | no-shuffle          | shuffle             | rel. slowdown
200_000   | 496.8 ms ± 10.3 ms  | 693.9 ms ± 40.1 ms  | 1.40 ± 0.09
400_000   | 515.2 ms ± 16.2 ms  | 1.293 s ± 0.071 s   | 2.51 ± 0.16
800_000   | 550.0 ms ± 11.8 ms  | 1.855 s ± 0.036 s   | 3.37 ± 0.10
1600_000  | 628.4 ms ± 12.2 ms  | 2.379 s ± 0.034 s   | 3.79 ± 0.09

Interpretation:

  • For linear access (no-shuffle mode), we now see slowly deteriorating performance as the problem sizes increase. Part of this is due to the lower number of iterations, so that the initialization overhead (Python interpreter startup, parsing the input data) becomes more noticeable.
  • For random access (shuffle mode), we see dramatic slowdowns as problem sizes increase. These cannot be explained solely by initialization overhead.
  • The relative slowdown between each linear/random access pair increases with the problem size. At 1.6 million elements (a values list weighing around 64MB in total, plus the same amount of data in indices), we're up to a relative slowdown of about 3.8. That means random access is 280% slower, or that linear access only needs 26% of the time.

As the data set is equivalent between the linear-access and random-access variants, this seems to confirm our experimental hypothesis that linear access is significantly faster, which in turn provides evidence for our actual hypothesis that cache effects can have a significant impact in Python code.

Bonus round: indirection overhead

A problem with vanilla Python is the PyObject* pointer indirection, and the overhead of the PyObject structure itself. However, Python does offer facilities to circumvent this. We can use the numpy module to obtain C-style flat arrays, and perform the indexing and summing within native code:

import sys, struct, numpy as np

# parse the test case
header_layout = struct.Struct("=4sLLQ")
magic, size, iterations, expected_sum = \
    header_layout.unpack(sys.stdin.buffer.read(header_layout.size))
assert magic == b"cach",  f"wrong magic number {magic}"
indices = np.fromfile(sys.stdin.buffer, dtype="L", count=size)
values = np.fromfile(sys.stdin.buffer, dtype="L", count=size)
assert sys.stdin.buffer.read() == b""  # EOF

# run the test case
for _ in range(iterations):
    s = np.sum(values[indices], dtype=int)
    assert s == expected_sum

We can now use Hyperfine to compare the Numpy-based implementation against vanilla Python at a fixed problem size:

hyperfine \
  --setup 'python generate_data.py --seed 1234 --iterations {iterations} --size {size} --{shuffle} >example_{size}_{shuffle}.bin' \
  --command-name 'size={size} iters={iterations} {shuffle} {implementation}' \
  -L size 1600000 \
  -L iterations 8 \
  -L shuffle shuffle,no-shuffle \
  -L implementation benchmark,benchmark_numpy \
  'python run_{implementation}.py <example_{size}_{shuffle}.bin'
Results:
Benchmark 1: size=1600000 iters=8 shuffle benchmark (iterations = 8)
  Time (mean ± σ):      2.354 s ±  0.029 s    [User: 2.298 s, System: 0.055 s]
  Range (min … max):    2.315 s …  2.405 s    10 runs

Benchmark 2: size=1600000 iters=8 no-shuffle benchmark (iterations = 8)
  Time (mean ± σ):     618.1 ms ±  17.7 ms    [User: 567.3 ms, System: 50.6 ms]
  Range (min … max):   585.9 ms … 641.5 ms    10 runs

Benchmark 3: size=1600000 iters=8 shuffle benchmark_numpy (iterations = 8)
  Time (mean ± σ):     214.8 ms ±  17.5 ms    [User: 1782.5 ms, System: 19.9 ms]
  Range (min … max):   195.8 ms … 255.0 ms    14 runs

Benchmark 4: size=1600000 iters=8 no-shuffle benchmark_numpy (iterations = 8)
  Time (mean ± σ):     137.1 ms ±  11.0 ms    [User: 1726.8 ms, System: 18.6 ms]
  Range (min … max):   128.8 ms … 163.1 ms    18 runs

Summary
  size=1600000 iters=8 no-shuffle benchmark_numpy (iterations = 8) ran
    1.57 ± 0.18 times faster than size=1600000 iters=8 shuffle benchmark_numpy (iterations = 8)
    4.51 ± 0.38 times faster than size=1600000 iters=8 no-shuffle benchmark (iterations = 8)
   17.17 ± 1.39 times faster than size=1600000 iters=8 shuffle benchmark (iterations = 8)

This looks like Numpy has a considerable performance advantage, which is rather unsurprising. However, this is an apples-to-oranges comparison when it comes to cache effects. Because Numpy arrays are densely packed with no pointer indirection and no per-item object overhead, they take up much less memory. Here, we only need 8 bytes per item with Numpy, compared to about 40 bytes per item in vanilla Python – a 5× difference.
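
A quick way to check the Numpy side of that claim (assuming a 64-bit platform where dtype "L" maps to an unsigned 64-bit integer):

import numpy as np

values = np.zeros(1_600_000, dtype="L")  # same dtype as the benchmark arrays
print(values.itemsize)                   # 8 bytes per item
print(values.nbytes / 2**20)             # ~12.2 MiB for 1.6 million elements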

Instead, we can run an experiment with the same memory usage but different element counts, adjusting the iterations to achieve a comparable number of indexing operations. Since the vanilla Python representation needs roughly 5× the memory per element, the Numpy variant gets 5× as many elements. This is arguably not a valid comparison, but it's a nice illustration. Here, let's target a volume of 20 million indexing operations, divided as

  • 4 million values at 5 iterations for Numpy
  • 0.8 million values at 25 iterations for vanilla Python
python generate_data.py --seed 1234 --size 800000 --iterations 25 --shuffle    >example_vanilla_shuffle.bin
python generate_data.py --seed 1234 --size 800000 --iterations 25 --no-shuffle >example_vanilla_no-shuffle.bin
python generate_data.py --seed 1234 --size 4000000 --iterations 5 --shuffle    >example_numpy_shuffle.bin
python generate_data.py --seed 1234 --size 4000000 --iterations 5 --no-shuffle >example_numpy_no-shuffle.bin
hyperfine \
    --command-name '{implementation} {shuffle}' \
    -L implementation vanilla,numpy \
    -L shuffle shuffle,no-shuffle \
    'python run_benchmark_{implementation}.py <example_{implementation}_{shuffle}.bin'
Results:
Benchmark 1: vanilla shuffle
  Time (mean ± σ):      2.864 s ±  0.016 s    [User: 2.828 s, System: 0.035 s]
  Range (min … max):    2.832 s …  2.884 s    10 runs

Benchmark 2: numpy shuffle
  Time (mean ± σ):     336.8 ms ±  14.9 ms    [User: 1918.3 ms, System: 27.0 ms]
  Range (min … max):   319.3 ms … 369.8 ms    10 runs

Benchmark 3: vanilla no-shuffle
  Time (mean ± σ):     791.2 ms ±  21.7 ms    [User: 758.6 ms, System: 32.2 ms]
  Range (min … max):   758.3 ms … 822.7 ms    10 runs

Benchmark 4: numpy no-shuffle
  Time (mean ± σ):     167.8 ms ±   9.0 ms    [User: 1745.3 ms, System: 30.2 ms]
  Range (min … max):   152.7 ms … 187.6 ms    17 runs

Summary
  numpy no-shuffle ran
    2.01 ± 0.14 times faster than numpy shuffle
    4.71 ± 0.28 times faster than vanilla no-shuffle
   17.06 ± 0.92 times faster than vanilla shuffle

Table for pairwise comparisons of each implementation:

implementation | no-shuffle          | shuffle             | rel. slowdown
numpy          | 167.8 ms ± 9.0 ms   | 336.8 ms ± 14.9 ms  | 2.01 ± 0.14
vanilla        | 791.2 ms ± 21.7 ms  | 2.864 s ± 0.016 s   | 3.62 ± 0.10

Now we can clearly see that both the Numpy and the vanilla Python implementations are affected by cache pressure, with the cache-friendly version being 2× and 3.6× faster, respectively. That the pure-Python version suffered a greater slowdown could indicate that the more complex PyObject* layout is substantially more cache-hostile, but I have a hunch that this might not be the entire story, e.g. because these measurements also include input parsing overhead. Future work might involve using tools like Valgrind's cachegrind to gain more detailed insight into how these programs interact with CPU caches. Another option would be to use a Numpy array for both implementations and only vary whether the summing and indexing happen in C code or in Python code; a rough sketch of that variant follows below.
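
As a starting point for that last idea, here is a minimal, untested sketch that was not part of the measurements above: it keeps the data in Numpy arrays but performs the indexing and summing in interpreted Python code, so differences between the two Numpy-backed variants could be attributed more directly to interpreter overhead per access.

import sys, struct
import numpy as np

# parse the test case (same binary format as before)
header_layout = struct.Struct("=4sLLQ")
magic, size, iterations, expected_sum = \
    header_layout.unpack(sys.stdin.buffer.read(header_layout.size))
assert magic == b"cach",  f"wrong magic number {magic}"
indices = np.fromfile(sys.stdin.buffer, dtype="L", count=size)
values = np.fromfile(sys.stdin.buffer, dtype="L", count=size)

# run the test case: same flat data layout as the Numpy benchmark,
# but the indexing and summing happen in Python bytecode
for _ in range(iterations):
    s = sum(int(values[i]) for i in indices)
    assert s == expected_sum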

Conclusion

In these experiments, we demonstrated that even though Python may be slow compared to native code, cache effects do have a measurable impact on performance. We used fairly large lists to compare sequential element access to random access. We found that, depending on list size, random access was between 23% and 280% slower. That is nontrivial overhead worth optimizing for, at least for CPU-bound Python code.

The practical effect of this should be negligible, though, as CPU-bound Python code using lists with lengths in the 10⁴+ range should be fairly rare.

Further reading

Agner Fog's optimization manuals are an indispensable and fairly accessible resource on microarchitecture details. For example, the microarchitecture manual provides tables on cache latencies, and the C++ and assembly optimization manuals each contain sections on optimizing memory access patterns.

Norvig's table of Latencies Every Programmer Should Know is a neat overview of latencies. The relative orders of magnitude remain relevant, even if that resource is over a decade old. Sirupsen's Napkin Math collection provides similar tables that are more up to date, but is focused more on general system performance. The collection of further performance resources is particularly nice.