Cache-aware programming can make a huge performance difference, especially when writing code in C++ or Rust. Python is a much more high-level language and doesn't give us that level of control over the memory layout of our data structures. So does this mean that CPU caching effects aren't relevant in Python?
In this post, we'll conduct some basic experiments to answer this question by accessing list elements either in sequential order or in random order.
Results indicate that randomized access is consistently slower in Python. In particular, once problem sizes outgrow the CPU caches, random access is several times slower. Thus, some degree of cache-aware programming may be relevant in interpreted environments like CPython 3.12 as well.
Disclaimers: Results may vary on different software and hardware. The software used for these tests is CPython 3.12. The CPU uses the x86-64 (amd64) architecture, Zen 2 microarchitecture. This CPU generation had a fairly complex cache hierarchy due to grouping cores in separate "core complexes" that partially share some of their cache. Each core should see up to 32KB L1, 512KB L2, 16MB L3.
Background
Programs store data in memory (RAM), but this is slow. CPU caches are faster, but much smaller. The CPU has to decide which parts of the RAM to load into the caches. Of course explicitly accessed memory addresses will be loaded into the cache. But the act of loading data is comparatively slow, so the CPU will try to pre-load relevant data.
For example, we can expect the CPU to detect linear array access patterns and automatically pre-load the next chunks of an array. So foreach-style loops over an array where all relevant data is directly inside the array can be quite cache-friendly, whereas random access cannot be predicted.
In practice, many arrays contain pointers instead. We must dereference the pointer to a different memory location in order to get to the actual data. Maybe the pointed-to location is already in the cache, maybe not and we will have to wait – effectively also a kind of random access.
In Java (OpenJDK) and Python (CPython), all normal objects are actually represented as pointers to a more complex data structure.
In C notation, an array wouldn't contain PyObject values, but PyObject* pointers.
So clearly, Python data structures won't be as cache-friendly as data structures in C++ or Rust could be.
But Python already has a ton of interpreter overhead compared to natively compiled languages, so maybe this doesn't matter?
- In my experience, interpreter overhead tends to be in the 2× – 100× range, typically at least 10×, and much worse if meaningful compiler optimizations are possible.
- But typical values for the cost of accessing main memory vs L1 cached data also tend to be in the 100× range.
All of this is very rule-of-thumb, back-of-napkin, but it suggests that interpreter overhead wouldn't drown out cache effects.
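To make that napkin math concrete, here is a tiny sketch. All numbers are assumed ballpark figures (not measurements from this machine), merely chosen to lie in the ranges mentioned above:

# Back-of-napkin estimate with assumed latencies: ~50 ns of interpreter
# overhead per element access, ~1 ns for an L1 hit, ~100 ns for a miss
# that goes all the way to main memory.
interp_ns = 50
l1_hit_ns = 1
ram_ns = 100

cached_access = interp_ns + l1_hit_ns
uncached_access = interp_ns + ram_ns
print(f"estimated slowdown per cache miss: {uncached_access / cached_access:.1f}x")
# prints roughly 3.0x, i.e. cache misses should remain visible
# despite the interpreter overhead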
Designing an experiment
Our working hypothesis is: cache effects are detectable in Python programs.
This is too vague to test directly, so let's refine it based on the array iteration example above.
What we could try to observe is that the following program should take longer when indices = shuffled(range(len(values))) ("random" access) than when the elements are accessed in order (indices = list(range(len(values)))):

sum(values[i] for i in indices)
But there are many different objects at play here:

- indices is an array containing pointers to integer objects
- i is effectively a pointer to such an integer object
- values is an array containing pointers to objects
- values[i] is effectively one such pointer
To drop to C notation for a moment, this means we have four different kinds of memory locations:

- indices[…] / i, which contains a pointer
- *i, which contains the actual index
- values[*i], which contains a pointer
- *values[*i], which contains the actual value
If we want to design the experiment properly, we have to consider cache effects for all of these memory locations.
- The indices array is likely unproblematic, as it will be iterated in order.
- Each index *i, however, requires a memory access. To reduce the impact of cache effects, it would be nice if these values were laid out sequentially in memory. Python doesn't let us control that, but in practice objects are likely to be allocated mostly one after the other. So we want to create those integer objects in the order in which they will be referenced. That means we must not shuffle the indices within the Python process that runs the benchmark, but we can read in already-shuffled test data.
- The values array is unproblematic, as it is the main cache effect that we want to test.
- The *values[*i] elements may or may not matter. Either the CPU is able to figure out that the elements of values are pointers and can pre-load the pointed-to objects, or cache effects would mean that the example runs faster if the values are laid out in the same order in which they are referenced in the values array.

I'm not sure how to test the impact of the values order, so let's stick to the same trick of using allocation order to increase the chance of a potentially cache-friendly data layout.
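We can get a rough feeling for this allocation-order behaviour with a small, purely illustrative check. It relies on CPython implementation details: id() returns an object's memory address, values above 256 are not covered by the small-integer cache, and allocation order is not guaranteed:

# Create some non-cached integers one after the other and look at where
# they end up in memory (CPython-specific: id() is the address).
xs = [10_000 + i for i in range(8)]
print([id(x) - id(xs[0]) for x in xs])
# On CPython 3.12 this typically prints small, mostly increasing offsets,
# suggesting that consecutively created objects often end up close together.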
We can generate test data in the following format:
<size>,<iterations>,<expected sum>
<indices>,...
<values>,...
For example, the following would describe an in-order access of the values (the order of the indices matters, the order of the values should not):

5,1,10
0,1,2,3,4
3,2,4,0,1
I don't want to bother with strings and number parsing, so let's use Python's struct and array modules to pack the test data into a binary representation.
Here is the main program that we will benchmark:
import sys, struct, array

def parse_array(f, fmt, n):
    arr = array.array(fmt)
    arr.fromfile(f, n)
    return arr.tolist()

# parse the test case
header_layout = struct.Struct("=4sLLQ")
magic, size, iterations, expected_sum = \
    header_layout.unpack(sys.stdin.buffer.read(header_layout.size))
assert magic == b"cach", f"wrong magic number {magic}"
indices = parse_array(sys.stdin.buffer, "L", size)
values = parse_array(sys.stdin.buffer, "L", size)

# run the test case
for _ in range(iterations):
    s = sum(values[i] for i in indices)
    assert s == expected_sum
Here's a program to generate a test case:
import sys, struct, array, random, click

@click.command()
@click.option("--size", type=int, required=True)
@click.option("--iterations", type=int, required=True)
@click.option("--shuffle/--no-shuffle", required=True)
@click.option("--seed", type=int, required=True)
def cli(size: int, iterations: int, shuffle: bool, seed: int):
    rng = random.Random(seed)

    values = list(range(size))
    rng.shuffle(values)  # should not affect benchmarks, but best to randomize

    indices = list(range(size))
    if shuffle:
        rng.shuffle(indices)

    header_layout = struct.Struct("=4sLLQ")
    sys.stdout.buffer.write(header_layout.pack(b"cach", size, iterations, sum(values)))
    array.array("L", indices).tofile(sys.stdout.buffer)
    array.array("L", values).tofile(sys.stdout.buffer)

if __name__ == "__main__":
    cli()
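Assuming the two scripts are saved as generate_data.py and run_benchmark.py (the names used in the hyperfine invocation below), a single manual run could look like this:

python generate_data.py --size 100000 --iterations 100 --no-shuffle --seed 1234 > example.bin
python run_benchmark.py < example.bin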
By the way, why don't we just use an array instead of a Python list for our tests? After all, an array lets us control the memory layout.
- For the values, we want to use a list because we're interested in the cache effects with typical Python data structures.
- For the indices, because accessing an indices[…] element would still have to create a PyObject*. It's probably better to get all of these allocations out of the way before running the actual tests (a quick demonstration follows below).
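Here is the quick demonstration mentioned above. It is illustrative only and relies on CPython implementation details (object identity, and the fact that values above 256 are not covered by the small-integer cache):

import array

lst = [1000, 2000, 3000]
arr = array.array("L", lst)

# A list stores pointers to existing int objects,
# so repeated accesses return the very same object:
print(lst[0] is lst[0])  # True

# An array stores raw machine integers; every access has to box the value
# into a fresh PyObject, so identity is usually not preserved:
print(arr[0] is arr[0])  # typically False on CPython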
How much memory will actually get used for a particular example size?
- While the above layouts encode integers as 32-bit or 64-bit ints (4 or 8 bytes), the in-memory representation will be a PyObject*.
- That's one pointer (8 bytes) for the pointer itself,
- and another 28 bytes (3.5 pointers worth) for the PyObject data in its small-integer mode (determined via sys.getsizeof()).
- However, alignment considerations will probably round this up to 32 bytes (4 pointers worth).
- In total, that is 5 pointers worth (40 bytes) per element in values, plus the same for the indices.
- To observe meaningful results, our problem sizes should be a lot larger than the available cache size. For a 16MB cache, this means values should contain at least 420000 entries, or half that amount if we also consider cache pressure from indices (a quick sanity check follows below).
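Here is the sanity check for that estimate; the exact numbers assume a 64-bit CPython 3.12 build:

import sys

print(sys.getsizeof(123_456))      # 28 bytes for a small-ish int object
per_element = 8 + 32               # list slot pointer + int object, rounded up for alignment
cache_bytes = 16 * 1024 * 1024     # 16MB L3
print(cache_bytes // per_element)  # ≈ 419_430 elements to fill the L3 cache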
This allows us to formulate some expectations:
- For problem sizes below around 200000, we expect linear and randomized access to work similarly well.
- Hypothesis: for problem sizes above 420000, we expect linear array access to be significantly faster than randomized access.
We can use the hyperfine tool (v1.18) to run the individual benchmarks. It supports parametrized benchmarks. All experiments will use the following form, though we will vary the size and iterations parameters.
hyperfine \
--setup 'python generate_data.py --seed 1234 --iterations {iterations} --size {size} --{shuffle} >example_{size}_{shuffle}.bin' \
--command-name 'size={size} iters={iterations} {shuffle}' \
-L size 100,1000,10000,100000 \
-L iterations 5 \
-L shuffle shuffle,no-shuffle \
'python run_benchmark.py <example_{size}_{shuffle}.bin'
Results
We can now conduct a couple of experiments to investigate caching impact.
Small data sets
Our above analysis suggests that problem sizes below around 200000 should not be affected by caching.
Let's ask hyperfine to compare shuffled and sequential access for size in {50_000, 100_000, 200_000}. Note that we're also adjusting the iterations so that the total number of array accesses remains constant at 10 million, which also allows us to compare results across sizes.
Results for size=50_000:
Benchmark 1: size=50000 iters=200 shuffle (iterations = 200)
Time (mean ± σ): 462.0 ms ± 13.4 ms [User: 456.7 ms, System: 5.2 ms]
Range (min … max): 439.5 ms … 477.5 ms 10 runs
Benchmark 2: size=50000 iters=200 no-shuffle (iterations = 200)
Time (mean ± σ): 374.1 ms ± 17.3 ms [User: 369.2 ms, System: 4.8 ms]
Range (min … max): 350.0 ms … 394.3 ms 10 runs
Summary
size=50000 iters=200 no-shuffle (iterations = 200) ran
1.23 ± 0.07 times faster than size=50000 iters=200 shuffle (iterations = 200)
Results for size=100_000:
Benchmark 1: size=100000 iters=100 shuffle (iterations = 100)
Time (mean ± σ): 479.6 ms ± 19.6 ms [User: 470.2 ms, System: 9.3 ms]
Range (min … max): 453.7 ms … 519.8 ms 10 runs
Benchmark 2: size=100000 iters=100 no-shuffle (iterations = 100)
Time (mean ± σ): 384.4 ms ± 14.0 ms [User: 376.0 ms, System: 8.3 ms]
Range (min … max): 361.4 ms … 405.8 ms 10 runs
Summary
size=100000 iters=100 no-shuffle (iterations = 100) ran
1.25 ± 0.07 times faster than size=100000 iters=100 shuffle (iterations = 100)
Results for size=200_000:
Benchmark 1: size=200000 iters=50 shuffle (iterations = 50)
Time (mean ± σ): 567.8 ms ± 35.4 ms [User: 557.7 ms, System: 9.9 ms]
Range (min … max): 501.0 ms … 634.4 ms 10 runs
Benchmark 2: size=200000 iters=50 no-shuffle (iterations = 50)
Time (mean ± σ): 385.2 ms ± 13.8 ms [User: 377.3 ms, System: 7.8 ms]
Range (min … max): 371.4 ms … 407.8 ms 10 runs
Summary
size=200000 iters=50 no-shuffle (iterations = 50) ran
1.47 ± 0.11 times faster than size=200000 iters=50 shuffle (iterations = 50)
Interpretation:
For all tested problem sizes, we see that randomized (shuffle) access is significantly slower than linear array access (no-shuffle). For sizes 50_000 and 100_000 we see a factor of around 1.24, but for size 200_000 we're already up to a factor of 1.47.
We can reformat the data to compare the mean elapsed times of the shuffle and no-shuffle modes across different problem sizes:
size | no-shuffle | shuffle | rel. slowdown |
---|---|---|---|
50_000 | 374.1 ms ± 17.3 ms | 462.0 ms ± 13.4 ms | 1.23 ± 0.07 |
100_000 | 384.4 ms ± 14.0 ms | 479.6 ms ± 19.6 ms | 1.25 ± 0.07 |
200_000 | 385.2 ms ± 13.8 ms | 567.8 ms ± 35.4 ms | 1.47 ± 0.11 |
We can see that the no-shuffle (linear) mode maintains constant speed across different problem sizes; all of its timings are within each other's margin of error. But for the shuffle mode, we see differences: while the two smaller problem sizes have speeds within each other's margin of error, the size = 200_000 runs are significantly slower. This is broadly in line with our expectations.
- We see that linear array access provides very consistent performance across problem sizes.
- We see that random access is consistently slower, but the overhead isn't terrible (only a factor of about 1.24 at smaller problem sizes).
- As we approach the region where our problem no longer fits into the cache, the performance of random access starts to deteriorate.
Large data sets
Next, let's measure performance when we cross into the region of problem sizes that cannot fit into the cache.
In order to maintain a nice power-of-two sequence, the iterations were increased a bit, keeping a constant volume of indexing operations at 12.8 million:

- size= 200_000 iters=64 (size matches the largest experiment from the "small data set" round)
- size= 400_000 iters=32
- size= 800_000 iters=16
- size=1600_000 iters= 8
Results for size=200_000:
Benchmark 1: size=200000 iters=64 shuffle (iterations = 64)
Time (mean ± σ): 693.9 ms ± 40.1 ms [User: 684.6 ms, System: 9.1 ms]
Range (min … max): 641.0 ms … 751.8 ms 10 runs
Benchmark 2: size=200000 iters=64 no-shuffle (iterations = 64)
Time (mean ± σ): 496.8 ms ± 10.3 ms [User: 483.1 ms, System: 13.6 ms]
Range (min … max): 482.2 ms … 515.4 ms 10 runs
Summary
size=200000 iters=64 no-shuffle (iterations = 64) ran
1.40 ± 0.09 times faster than size=200000 iters=64 shuffle (iterations = 64)
Results for size=400_000:
Benchmark 1: size=400000 iters=32 shuffle (iterations = 32)
Time (mean ± σ): 1.293 s ± 0.071 s [User: 1.275 s, System: 0.018 s]
Range (min … max): 1.217 s … 1.452 s 10 runs
Benchmark 2: size=400000 iters=32 no-shuffle (iterations = 32)
Time (mean ± σ): 515.2 ms ± 16.2 ms [User: 497.3 ms, System: 17.7 ms]
Range (min … max): 485.7 ms … 541.0 ms 10 runs
Summary
size=400000 iters=32 no-shuffle (iterations = 32) ran
2.51 ± 0.16 times faster than size=400000 iters=32 shuffle (iterations = 32)
Results for size=800_000:
Benchmark 1: size=800000 iters=16 shuffle (iterations = 16)
Time (mean ± σ): 1.855 s ± 0.036 s [User: 1.824 s, System: 0.030 s]
Range (min … max): 1.805 s … 1.926 s 10 runs
Benchmark 2: size=800000 iters=16 no-shuffle (iterations = 16)
Time (mean ± σ): 550.0 ms ± 11.8 ms [User: 521.5 ms, System: 28.4 ms]
Range (min … max): 532.4 ms … 563.1 ms 10 runs
Summary
size=800000 iters=16 no-shuffle (iterations = 16) ran
3.37 ± 0.10 times faster than size=800000 iters=16 shuffle (iterations = 16)
Results for size=1600_000:
Benchmark 1: size=1600000 iters=8 shuffle (iterations = 8)
Time (mean ± σ): 2.379 s ± 0.034 s [User: 2.327 s, System: 0.052 s]
Range (min … max): 2.336 s … 2.428 s 10 runs
Benchmark 2: size=1600000 iters=8 no-shuffle (iterations = 8)
Time (mean ± σ): 628.4 ms ± 12.2 ms [User: 570.3 ms, System: 57.9 ms]
Range (min … max): 606.3 ms … 647.2 ms 10 runs
Summary
size=1600000 iters=8 no-shuffle (iterations = 8) ran
3.79 ± 0.09 times faster than size=1600000 iters=8 shuffle (iterations = 8)
Table for comparing the mean elapsed time and the relative slowdown of random access:
size | no-shuffle | shuffle | rel. slowdown |
---|---|---|---|
200_000 | 496.8 ms ± 10.3 ms | 693.9 ms ± 40.1 ms | 1.40 ± 0.09 |
400_000 | 515.2 ms ± 16.2 ms | 1.293 s ± 0.071 s | 2.51 ± 0.16 |
800_000 | 550.0 ms ± 11.8 ms | 1.855 s ± 0.036 s | 3.37 ± 0.10 |
1600_000 | 628.4 ms ± 12.2 ms | 2.379 s ± 0.034 s | 3.79 ± 0.09 |
Interpretation:
- For linear accesses (no-shuffle mode), we now see deteriorating performance as the problem sizes increase. Part of this is due to the lower number of iterations, so that the initialization overhead (Python interpreter startup, parsing the input data) becomes more noticeable.
- For random accesses (shuffle mode), we see dramatic slowdowns as problem sizes increase. These cannot be explained solely by initialization overhead.
- The relative slowdown between a linear-access vs random-access pair increases with the problem size. At 1.6 million elements (a values list weighing around 60MB in total, plus the same amount of data in indices), we're up to a relative slowdown of about 3.8. That means random access is 280% slower, or that linear access only needs 26% of the time.
As the data set is equivalent between the linear-access and random-access variants, this seems to confirm our experimental hypothesis that linear access is significantly faster, which in turn provides evidence for our actual hypothesis that cache effects can have a significant impact in Python code.
Bonus round: indirection overhead
A problem with vanilla Python is the PyObject* pointer indirection, and the overhead of the PyObject structure itself. However, Python does offer facilities to circumvent this. We can use the numpy module to obtain C-style flat arrays, and perform the indexing and summing within native code:
import sys, struct, numpy as np

# parse the test case
header_layout = struct.Struct("=4sLLQ")
magic, size, iterations, expected_sum = \
    header_layout.unpack(sys.stdin.buffer.read(header_layout.size))
assert magic == b"cach", f"wrong magic number {magic}"
indices = np.fromfile(sys.stdin.buffer, dtype="L", count=size)
values = np.fromfile(sys.stdin.buffer, dtype="L", count=size)
assert sys.stdin.buffer.read() == b""  # EOF

# run the test case
for _ in range(iterations):
    s = np.sum(values[indices], dtype=int)
    assert s == expected_sum
We can now use Hyperfine to compare the Numpy-based implementation against vanilla Python at a fixed problem size:
hyperfine \
--setup 'python generate_data.py --seed 1234 --iterations {iterations} --size {size} --{shuffle} >example_{size}_{shuffle}.bin' \
--command-name 'size={size} iters={iterations} {shuffle} {implementation}' \
-L size 1600000 \
-L iterations 8 \
-L shuffle shuffle,no-shuffle \
-L implementation benchmark,benchmark_numpy \
'python run_{implementation}.py <example_{size}_{shuffle}.bin'
Results:
Benchmark 1: size=1600000 iters=8 shuffle benchmark (iterations = 8)
Time (mean ± σ): 2.354 s ± 0.029 s [User: 2.298 s, System: 0.055 s]
Range (min … max): 2.315 s … 2.405 s 10 runs
Benchmark 2: size=1600000 iters=8 no-shuffle benchmark (iterations = 8)
Time (mean ± σ): 618.1 ms ± 17.7 ms [User: 567.3 ms, System: 50.6 ms]
Range (min … max): 585.9 ms … 641.5 ms 10 runs
Benchmark 3: size=1600000 iters=8 shuffle benchmark_numpy (iterations = 8)
Time (mean ± σ): 214.8 ms ± 17.5 ms [User: 1782.5 ms, System: 19.9 ms]
Range (min … max): 195.8 ms … 255.0 ms 14 runs
Benchmark 4: size=1600000 iters=8 no-shuffle benchmark_numpy (iterations = 8)
Time (mean ± σ): 137.1 ms ± 11.0 ms [User: 1726.8 ms, System: 18.6 ms]
Range (min … max): 128.8 ms … 163.1 ms 18 runs
Summary
size=1600000 iters=8 no-shuffle benchmark_numpy (iterations = 8) ran
1.57 ± 0.18 times faster than size=1600000 iters=8 shuffle benchmark_numpy (iterations = 8)
4.51 ± 0.38 times faster than size=1600000 iters=8 no-shuffle benchmark (iterations = 8)
17.17 ± 1.39 times faster than size=1600000 iters=8 shuffle benchmark (iterations = 8)
This looks like Numpy has a considerable performance advantage, which is rather unsurprising. However, this is an apples-to-oranges comparison when it comes to cache effects. Because Numpy arrays are densely packed with no pointer indirection and no per-item object overhead, they take up much less memory. Here, we only need 8 bytes per item with Numpy, compared to about 40 bytes per item in vanilla Python – a 5× difference.
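The difference is easy to check directly; the exact numbers below assume a 64-bit CPython/NumPy build:

import sys
import numpy as np

n = 1_600_000
np_values = np.arange(n, dtype="L")
py_values = list(range(n))

print(np_values.nbytes / n)   # 8.0 bytes per element
py_bytes = sys.getsizeof(py_values) + sum(sys.getsizeof(x) for x in py_values)
print(py_bytes / n)           # roughly 36 bytes per element, before allocator alignment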
Instead, we can run an experiment with the same memory usage but a different number of iterations, to achieve a comparable volume of indexing operations. This is arguably not a valid comparison, but it's a nice illustration. Here, let's target a volume of 20 million operations, divided as:

- 4 million values at 5 iterations for Numpy
- 0.8 million values at 25 iterations for vanilla Python
python generate_data.py --seed 1234 --size 800000 --iterations 25 --shuffle >example_vanilla_shuffle.bin
python generate_data.py --seed 1234 --size 800000 --iterations 25 --no-shuffle >example_vanilla_no-shuffle.bin
python generate_data.py --seed 1234 --size 4000000 --iterations 5 --shuffle >example_numpy_shuffle.bin
python generate_data.py --seed 1234 --size 4000000 --iterations 5 --no-shuffle >example_numpy_no-shuffle.bin
hyperfine \
--command-name '{implementation} {shuffle}' \
-L implementation vanilla,numpy \
-L shuffle shuffle,no-shuffle \
'python run_benchmark_{implementation}.py <example_{implementation}_{shuffle}.bin'
Results:
Benchmark 1: vanilla shuffle
Time (mean ± σ): 2.864 s ± 0.016 s [User: 2.828 s, System: 0.035 s]
Range (min … max): 2.832 s … 2.884 s 10 runs
Benchmark 2: numpy shuffle
Time (mean ± σ): 336.8 ms ± 14.9 ms [User: 1918.3 ms, System: 27.0 ms]
Range (min … max): 319.3 ms … 369.8 ms 10 runs
Benchmark 3: vanilla no-shuffle
Time (mean ± σ): 791.2 ms ± 21.7 ms [User: 758.6 ms, System: 32.2 ms]
Range (min … max): 758.3 ms … 822.7 ms 10 runs
Benchmark 4: numpy no-shuffle
Time (mean ± σ): 167.8 ms ± 9.0 ms [User: 1745.3 ms, System: 30.2 ms]
Range (min … max): 152.7 ms … 187.6 ms 17 runs
Summary
numpy no-shuffle ran
2.01 ± 0.14 times faster than numpy shuffle
4.71 ± 0.28 times faster than vanilla no-shuffle
17.06 ± 0.92 times faster than vanilla shuffle
Table for pairwise comparisons of each implementation:
implementation | no-shuffle | shuffle | rel. slowdown |
---|---|---|---|
numpy | 167.8 ms ± 9.0 ms | 336.8 ms ± 14.9 ms | 2.01 ± 0.14 |
vanilla | 791.2 ms ± 21.7 ms | 2.864 s ± 0.016 s | 3.62 ± 0.10 |
Now we can clearly see that both the Numpy and the vanilla Python implementations are affected by cache pressure, with the cache-friendly version being 2× and 3.6× faster, respectively. That the pure-Python version suffered a greater slowdown could indicate that the more complex PyObject* layout is substantially more cache-hostile, but I have a hunch that this might not be the entire story, e.g. because these measurements also include input parsing overhead.
Future work might involve using tools like Valgrind's cachegrind to gain more detailed insight into how these programs interact with CPU caches, or could use a Numpy array for both implementations and only compare whether the summing and indexing occurs in C code or in Python code.
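As a starting point for such a follow-up, a cachegrind run over one of the existing test inputs could look roughly like this (standard Valgrind usage; expect it to be far slower than the plain benchmark):

valgrind --tool=cachegrind python run_benchmark.py < example_1600000_shuffle.bin
cg_annotate cachegrind.out.<pid>   # <pid> is filled in by Valgrind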
Conclusion
In these experiments, we demonstrated that even though Python may be slow compared to native code, cache effects do have measurable impact on performance. We used fairly large lists to compare sequential element access to random access. We found that depending on list size, the random access was between 23% and 280% slower. That is nontrivial overhead worth optimizing for, at least for CPU-bound Python code.
The practical effect of this should be negligible though, as CPU-bound Python code using lists with lengths in the 10⁴+ range should be fairly rare.
Further reading
Agner Fog's optimization manuals are an indispensable and fairly accessible resource on microarchitecture details. For example, the microarchitecture manual provides tables on cache latencies, and the C++ and assembly optimization manuals each contain sections on optimizing memory access patterns.
Norvig's table of Latencies Every Programmer Should Know is a neat overview of latencies. The relative orders of magnitude remain relevant, even if that resource is over a decade old. Sirupsen's Napkin Math collection provides similar tables that are more up to date, but is focused more on general system performance. Its collection of further performance resources is particularly nice.