Notes on Struct Packing

Today I was trying to recall how C and C++ pack struct members. It’s important if you really need to use as little memory as possible. When you have 150 million records at 64 bytes each, it starts to matter that half of that is wasted space. It’s annoying to have to figure this out. Didn’t Turbo Pascal do a better job? Why the difference on modern systems and languages? This called for a little research.

So, Turbo Pascal did indeed pack its structs (“Records”) at the byte level by default. The 16 bit CPUs it usually ran on didn’t penalize unaligned access much, and space savings were worth it when you just had 256K or so. Plus it meant you could do type CustomerFile = File of Customer and it would serialize to disk with read(), write() and seek (random access) by Customer, with zero wasted space. So in my mind that’s still the “right” way to store struct data.

When I first learned to code, I was taught records (structs) were arranged in memory exactly as they appeared in your code. It’s WYSIWYG. If you serialized them to disk, you could expect a byte for byte match with the in-memory representation and you could read them directly back with no problem. Just add up bytes needed for your record type, multiply by how many you have in memory and that’s your memory footprint and the size of a potential file; the file size tells you exactly how much memory is required to load it. That’s super convenient and intuitive.

When you look at a Rust or C/C++ struct layout, on the other hand, what you see isn’t always what you get in your memory layout and memory used. C defaults to aligning each struct member to that member’s alignment requirement (for primitive types this is usually the type’s size), and the whole struct size is rounded up to a multiple of the largest member alignment – that’s usually a maximum of 8 bytes but can be more. An example of a large primitive type is the unsigned __int128 gcc type; there it’s treated as a 16 byte primitive and so struct { bool b1; unsigned __int128 uuid_couple[2];} will take 48 bytes.
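A minimal sketch of that 48-byte example, assuming gcc or clang on a 64-bit target where unsigned __int128 is available and 16-byte aligned (x86-64 Linux, for instance). The struct name FlagAndUuids is made up for illustration; the compile-time checks simply spell out where the padding goes.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignof */
#include <stdbool.h>
#include <stddef.h>    /* offsetof */

/* Assumption: unsigned __int128 exists and is 16-byte aligned on this target. */
struct FlagAndUuids {                 /* hypothetical name */
    bool b1;                          /* offset 0; 15 padding bytes follow */
    unsigned __int128 uuid_couple[2]; /* offset 16; 2 * 16 = 32 bytes */
};

static_assert(alignof(struct FlagAndUuids) == 16,
              "the struct takes the alignment of its widest member");
static_assert(offsetof(struct FlagAndUuids, uuid_couple) == 16,
              "the array starts at the next 16-byte boundary");
static_assert(sizeof(struct FlagAndUuids) == 48,
              "1 byte + 15 padding + 32 bytes = 48");
```

On a target with a different alignment for __int128 these assertions would fail at compile time, which is the point: the layout is a property of the ABI, not of the source text.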

Members reside in memory in the order they appear in the struct definition. You can use more or less efficient orders; to avoid padding between members you typically want to group members by type size from largest to smallest to save space. By placing smaller type members at the end of the struct they use space in the struct’s trailing padding that was reserved anyway.
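To illustrate the ordering effect, here is a small sketch with the same three members in two different orders. The struct and member names are hypothetical, and the sizes assume a typical 64-bit ABI where uint64_t is 8-byte aligned.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Small members scattered around a large one: padding appears twice. */
struct Scattered {      /* hypothetical name */
    uint8_t  flag;      /* offset 0, then 7 padding bytes */
    uint64_t id;        /* offset 8 */
    uint16_t kind;      /* offset 16, then 6 bytes of trailing padding */
};

/* Same members, largest first: the small ones share the tail. */
struct Grouped {        /* hypothetical name */
    uint64_t id;        /* offset 0 */
    uint16_t kind;      /* offset 8 */
    uint8_t  flag;      /* offset 10, then 5 bytes of trailing padding */
};

static_assert(sizeof(struct Scattered) == 24, "1 + 7 + 8 + 2 + 6 = 24");
static_assert(sizeof(struct Grouped) == 16, "8 + 2 + 1 + 5 = 16");
```

Both structs still end with padding, because the size has to stay a multiple of the 8-byte alignment; reordering only removes the padding in the interior.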

Rust’s default struct representation leaves the compiler free to reorder members, and in practice it reorders them that way, to pack data more efficiently without developer intervention. Whether you reorder by hand in C or let Rust do it, your struct may still carry some padding at the end though. The Rust behavior is a convenience to the developer; you write out the members in the order that makes most sense for reading the code and don’t need to concern yourself with saving space.

In C/C++ however, you do need to be aware of the order of different types. Here’s an example of an extreme case, where changing a member’s type saves a lot of space.

struct BigPersonRecord {
	unsigned __int128 uuid_pair[2];
	uint32_t person_id;
};

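The above takes 48 bytes. The “stride” size (the struct’s alignment, which its total size must be a multiple of) is 16 bytes, so the 36 bytes of members round up to 48.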

But if you don’t need to do math operations on the UUIDs and just need to move them around, try storing them in pieces.

struct PersonRecord {
	uint8_t permanent_uuid[16];
	uint8_t secondary_uuid[16];
	uint32_t person_id;	
};

This takes just 36 bytes. The uint32_t type is now the largest (you might position it as the first member) and so the struct “stride” size is 4 bytes. An array member takes the alignment of its element type, not of the array as a whole. So the alignment of each UUID member is now 1.
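The two layouts can be checked side by side. This is a sketch assuming gcc or clang on a 64-bit target with a 16-byte aligned unsigned __int128; the footprint_bytes helper is a hypothetical name added here only to show what the difference means at the 150 million record scale mentioned at the top.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignof */
#include <stddef.h>    /* offsetof */
#include <stdint.h>

struct BigPersonRecord {
    unsigned __int128 uuid_pair[2];  /* 32 bytes, 16-byte aligned (assumed) */
    uint32_t person_id;              /* offset 32, then 12 padding bytes */
};

struct PersonRecord {
    uint8_t permanent_uuid[16];      /* alignment 1 */
    uint8_t secondary_uuid[16];      /* alignment 1 */
    uint32_t person_id;              /* offset 32, no trailing padding needed */
};

static_assert(alignof(struct BigPersonRecord) == 16, "16-byte alignment");
static_assert(sizeof(struct BigPersonRecord) == 48, "36 rounds up to 48");
static_assert(alignof(struct PersonRecord) == 4, "4-byte alignment");
static_assert(sizeof(struct PersonRecord) == 36, "36 is already a multiple of 4");

/* Hypothetical helper: bytes needed for an array of `count` records. */
static uint64_t footprint_bytes(uint64_t count, size_t record_size) {
    return count * (uint64_t)record_size;
}
```

For 150 million records, footprint_bytes gives 7,200,000,000 bytes for BigPersonRecord and 5,400,000,000 bytes for PersonRecord, a saving of a quarter without any packing directive.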

If you want to avoid understanding or thinking about struct packing rules, you can use compiler directives to pack structs in C/C++ and get those benefits I ascribed to Turbo Pascal. In gcc you can use the #pragma pack(push,1) directive to pack to single byte alignment. To turn that off you have to do another pack: #pragma pack(pop).
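A short sketch of the push/pop pair in use, with hypothetical struct names. The unpacked size assumes a typical 64-bit ABI where uint64_t is 8-byte aligned; the packed size is just the sum of the member sizes.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof */
#include <stdint.h>

#pragma pack(push, 1)        /* byte-pack everything declared until the pop */
struct PackedRecord {        /* hypothetical name */
    uint8_t  flag;           /* offset 0 */
    uint64_t id;             /* offset 1, no padding in front */
    uint16_t kind;           /* offset 9 */
};
#pragma pack(pop)            /* restore the previous packing setting */

struct DefaultRecord {       /* same members, declared after the pop */
    uint8_t  flag;
    uint64_t id;
    uint16_t kind;
};

static_assert(sizeof(struct PackedRecord) == 11, "1 + 8 + 2, WYSIWYG");
static_assert(offsetof(struct PackedRecord, id) == 1, "id sits at an odd offset");
static_assert(sizeof(struct DefaultRecord) == 24, "default alignment is back");
```

Forgetting the pop is an easy mistake in a header file, since the packing then leaks into every struct declared after the include.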

Intel and x86 CPUs back to the 8086 allow unaligned access, as do most modern processors. Sun SPARC and many RISC-V microcontroller CPUs do not. The 808x’s contemporaries, the Motorola 68000 and 68010, didn’t allow unaligned word access either. If you’re building for a modern CPU you’re probably fine, except for the performance penalty you incur by byte packing. I’ll get to that next. There’s no free lunch: there are dangers beyond performance degradation when packing structs in C/C++.

Why struct packing defaults today are different from 1983

The most common processors in the early 80s to run Turbo Pascal on were the 8080 / Z80 for CP/M systems, and the 8088 for mostly PC/DOS systems. These all used an 8-bit memory bus. So the memory access naturally aligned to byte offsets! Internally the 8088 had a 16-bit arithmetic and register core. The 8086, 80186 and 80286 had 16 bit memory bus access instead, so their accesses were actually aligned to 16 bit words. Using the packed Turbo Pascal records when one or more fields had an odd offset made the machine do two accesses instead of one, effectively making it run slower, like the 8088. Not something you’d notice necessarily, since the 8088 was the dominant CPU by far for quite a few years, and advancements besides 16-bit memory access made programs on the newer x86 chips run a lot faster.

Even so, if you used an 80286, say, you might want to speed up some code where a record was a crucial part. You only needed to concern yourself with alignment in the interior of a record; the record instances / variables were word aligned by default already. So your record started on an even offset. You could then avoid the use of char and byte and odd sized interior fields and get faster memory access. It only mattered that your record members started on even offsets.

With the advent of 32 bit memory buses the chances that a packed record would not align went up – you were not getting the full power of the 32 bit wide bus. Memory size on typical systems had grown a lot and so the performance / storage size trade off shifted. Packing data into byte aligned sizes wasted time and reading a machine native 32 bit word was no slower than reading a 16 bit or 8 bit value.

Later, CPU caches got much more important to the performance story. These days, one performance downside to C/C++ packed structs shows up when an unaligned scalar member straddles a cache line boundary. Crossing a cache line – usually 64 bytes – costs a few cycles. Crossing a 4K page boundary (less common) forces a second address translation and opens the possibility of a TLB or cache miss, which can slow down the operation by 100-fold. With default alignment, a scalar member sits at a multiple of its own size, so it can’t cross a cache line or page boundary.
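The crossing condition is simple arithmetic, sketched below. The 64-byte line size and the helper names are assumptions for illustration, and the array is assumed to start on a cache line boundary (address 0 in this model).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed line size; check your target */

/* True if the `size` bytes starting at `addr` span two cache lines. */
static bool crosses_cache_line(uintptr_t addr, size_t size) {
    return (addr % CACHE_LINE) + size > CACHE_LINE;
}

/* Count the records in an array whose field straddles a line boundary.
   `stride` is sizeof(record); the array is assumed to start at address 0. */
static size_t count_crossings(size_t stride, size_t field_offset,
                              size_t field_size, size_t count) {
    size_t crossings = 0;
    for (size_t i = 0; i < count; i++) {
        if (crosses_cache_line(i * stride + field_offset, field_size)) {
            crossings++;
        }
    }
    return crossings;
}
```

For an 11-byte packed record with an 8-byte field at offset 1, count_crossings(11, 1, 8, 64) finds 7 straddling fields in 64 records, roughly one in nine. The aligned version, count_crossings(24, 8, 8, 64), finds none.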

Even more important, unaligned data gets in the way of auto-vectorization, so a compiler may not emit SIMD code where it otherwise would, slowing down some code in a loop by two to four times (a rough estimate). It’s much easier to auto-vectorize aligned code, and you may actually want to align data in the opposite direction – to 32 or 64 byte boundaries – to assist the vectorization.
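Aligning in the opposite direction might look like the sketch below, which uses C11 alignas to over-align a block of floats. The 32-byte figure is an assumption chosen to match 256-bit SIMD registers, and the names Lanes8 and scale_all are hypothetical. Whether the loop really gets vectorized depends on the compiler and its flags.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas, alignof */
#include <stddef.h>
#include <stdint.h>

/* Eight floats, over-aligned so each block starts on a 32-byte boundary. */
struct Lanes8 {                  /* hypothetical name */
    alignas(32) float lane[8];   /* 8 * 4 = 32 bytes */
};

static_assert(alignof(struct Lanes8) == 32, "member alignment raises the struct's");
static_assert(sizeof(struct Lanes8) == 32, "no padding needed: 32 is a multiple of 32");

/* A simple loop that a compiler is free to turn into SIMD code. */
static void scale_all(struct Lanes8 *blocks, size_t count, float factor) {
    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < 8; j++) {
            blocks[i].lane[j] *= factor;
        }
    }
}
```

This is the reverse of packing: the type asks for stricter alignment than its members need.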

Safety limits to struct packing in C/C++ – how far can you go?

Another problematic situation is atomic reads and writes (declaring a packed struct member as atomic<T> n). The atomic behavior depends on the value not crossing a cache line; if it does, the access either slows down a lot or faults. More generally, any place where third-party code expects references or pointers into your data structures to have standard alignment is going to break. Rust would catch this kind of issue during compilation, since the Rust equivalent, #[repr(packed)], is actually part of the type and the compiler will error when it sees you taking a reference to a potentially misaligned field of a packed struct. The C++ #pragma pack appears superficially similar, but it doesn’t affect the struct’s type at all, so the compiler can’t do any type checking where the struct gets used.
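One way to stay on the safe side in C is to never hand out a pointer to a packed member, and to read it by value or through memcpy instead. The sketch below uses hypothetical names (PackedSample, read_value, read_u32_unaligned) and is one possible approach rather than the only one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>  /* memcpy */

#pragma pack(push, 1)
struct PackedSample {    /* hypothetical name */
    uint8_t  tag;        /* offset 0 */
    uint32_t value;      /* offset 1: misaligned for a uint32_t */
};
#pragma pack(pop)

/* Risky: uint32_t *p = &s->value; yields a plain uint32_t* that has lost
   the "packed" information, so code receiving it assumes 4-byte alignment. */

/* Safe: access through the packed struct type, so the compiler knows the
   member may be misaligned and generates a suitable load. */
static uint32_t read_value(const struct PackedSample *s) {
    return s->value;
}

/* Safe: when only raw bytes are available, memcpy never forms a
   misaligned uint32_t pointer. */
static uint32_t read_u32_unaligned(const void *bytes) {
    uint32_t v;
    memcpy(&v, bytes, sizeof v);
    return v;
}
```

Recent gcc and clang versions can flag the risky pattern with the -Waddress-of-packed-member warning, but that is a diagnostic, not the type-level guarantee Rust gives.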

As mentioned earlier, other architectures besides modern Arm64 and x86 may have much less tolerance for unaligned struct members, so the rest of the advice only holds if you’re on the two mainstream architectures.

The guideline therefore is, you may use packed structs in C/C++ on a case by case basis if you don’t plan to exchange them with other libraries or applications and don’t take references or pointers to members in struct instances, and don’t plan on building for non-mainstream targets. You probably don’t want to pack structs involved in heavy computation in tight loops.

Where you are just moving large amounts of data, packing is actually a win. Even though you incur small (maybe 10%) performance penalties, the large reduction in data size more than makes up for that slowdown. I observed about a 10% reduction in throughput during testing, but the data was 35% smaller, and the net effect was less time spent copying it. Plus, you very occasionally may hit a scenario where saving 30% space might make the difference between being able to deploy an application to certain environments or not.

My conclusion is that using packed structs more than is usual practice is probably fine and even preferred in a few situations, if you keep it contained and avoid heavy math. Do use them for boring business data objects and where you have large collections of the structs; don’t use packed structs where you do intense numerical work or core real-time game logic or complex multithreaded operations or exchange data structures with libraries you didn’t write.

Perhaps a new language could benefit by reversing the default packing rules, providing a type annotation for data types like favor-compute, where the default would be favor-space. A library writer could enforce alignment on the interface where it mattered. Such a language would always present some extra friction interfacing with the vast C/C++ library universe but it’s an interesting trade-off even so.