### 64 bit architecture and programming

Hi guys,

So I'm not sure if this belongs in the longue, but it does pertain to programming and potentially to C++(maybe depending on the compiler?)

I'm currently reading William Stallings's Computer Architecture and Organisation. After just finishing reading Charles Petzold's code second edition, I have a better understanding of what's happening under the hood.

Since Charles's book walks us through the Intel 8080, a rather primitive CPU compared to today's beasts. Well, enough preamble.

A word in computer science pertains to the size of the registers and data a CPU can process(and obviously the data lines too). 64-bit processors to my knowledge have 64-bit registers and a 64-bit data bus from memory to processor; although it could be a lot more? I would imagine maybe it's more as the CPU contains a cache, but for the sake of this post, let's assume the amount of data fetched is 8 bytes or 64 bits.

There can be multi-byte fetch instructions on the intel 8080. For example, one instruction may have one operand such as move immediate(MVI A,23), one fetch for the opcode and one fetch for the byte to move to register A. Once each respective fetch has finished, the program counter will be incremented. This is quite simple for this processor as not only is memory byte addressable(still is) , but the word size is also 8 bits/1 byte. So each fetch will increment the program counter by 1. Now with a 64 bit machine it gets a little bit murky.

With a 64 bit machine, let's assume a word is 64 bits and as stated the memory still only stores one byte each. It would make sense to fetch all 64 bits from memory i.e. acesses 8 bytes of storage then increment the program counter by 8 bytes, I'm guessing this is done??

But, what happens if we have a program, really simple program such as

 ``12345678910`` `````` int main(){ double a = 3.14 // store 8 bytes in memory uint8_t b = 120; // store 1 byte in memory cout << a << endl ;// read 8 bytes from memory, perhaps place it in register and let sysout display 3.14 cout << b << endl; // read 1 byte from memory, perhaps place it in a register and let sysout display 120 }``````

If we read 8 bytes from memory we can display our double a, but next, we read 1 byte from memory to display uint8_t b. Each memory fetch is expensive(let's assume it's not cached). If we read 1 byte from memory and place it in the LSB of a register isn't that a waste of a memory fetch, considering we only retrieve one byte?

I also remember there is a #pragma align if I can remember correctly which aligns memory, I'm guessing this is in some way related?

Danke
Last edited on
Processors have built-in caches. And the size of cache-lines, i.e. the chunks of memory that are kept in the cache, is 64 byte!

So, whenever you "load" some data from the memory into a register, then either the specific cache-line (64 byte block) containing the desired address is already present in the cache, in which case the data is taken straight from the cache (very fast!), or we have a "cache miss", in which case the entire cache-line containing the desired address is loaded into the cache (from RAM). The latter case is much slower, but because the entire cache-line is now already loaded in the cache, subsequent loads of "neighboring" addresses will again be very fast. So, whether you load a single byte or 8 bytes, doesn't really make a big difference, as long as the data is served from the cache. What does makes a big difference is whether the cache-line containing the desired address(es) is already present in the cache, or not.

The "general purpose" registers are 32-Bit in size on the "x86" (32-Bit) processor, and they are 64-Bit in size on the "x86-64" (64-Bit) processor. If you need to do computations with values that are smaller than the registers (e.g. a byte or a 16-Bit word), then the "upper" bits of the register are simply filled with zeros. That is simply how things work and is not a "waste of a memory". It would be a waste if you stored your values in memory (RAM) as 32-Bit or 64-Bit words, even though you know that those values would fit into smaller types...

BTW: SSE, AVX and AVX-512 instructions use dedicated (separate) registers that are 128, 256 and 512 bits in size, respectively. Those registers are used to pack multiple values into a single register (e.g. a 128-bit registers may contain four 32-Bit integers at once) and they are used with SIMD instructions, i.e. instructions that perform the same operation separately/concurrently on all values in the register.
Last edited on
64 byte, that's a lot bigger than I thought, would have thought 64 bit. If the CPU fetches 64 bytes per fetch to cache, does the program counter get incremented by 64 bytes?

And is there another register in the processor that keeps track of where the next byte, etc, to be fetched is located in the cache?
Yes, to the best of my knowledge, the standard cache-line size is 64 bytes.

Unrelated: The program counter points to the next instruction to be executed. After an instruction was executed, it is incremented by `n`, where `n` is the size of the last instruction that was executed, in bytes. In x86/x86-64, the size of an encoded instruction is not fixed!

https://stackoverflow.com/questions/4567903/how-to-tell-the-length-of-an-x86-instruction

 And is there another register in the processor that keeps track of where the next byte, etc, to be fetched is located in the cache?

No. The program code decides which memory address (and how many bytes) it will load next. That data is then fetched from the cache, if possible. But when a "cache miss" happens, it is already determined by the requested address (the address that caused the "cache miss") which cache-line needs to be loaded into the cache now – the one containing the requested address. The real question is: If we add another cache-line into the cache, then which one will be kicked out? Well, this leads to "cache replacement policies" and they are quite complex!

https://en.wikipedia.org/wiki/Cache_replacement_policies

(a relatively simple and not too bad strategy would be to always replace the least-recently accessed cache entry)
Last edited on
 Unrelated: The program counter points to the next instruction to be executed. After an instruction was executed, it is incremented by n, where n is the size of the last instruction that was executed, in bytes. In x86/x86-64, the size of an encoded instruction is not fixed! https://stackoverflow.com/questions/4567903/how-to-tell-the-length-of-an-x86-instruction

Interesting, that I didn't know but does make sense.

 No. When a "cache miss" happens, then it is already determined by the requested address (the address that caused the "cache miss") which cache-line now needs to be loaded into the cache – the one containing the requested address. The real question is: If we add another cache-line into the cache, then which one will be kicked out? Well, this leads to "cache replacement policies" and they are quite complex!

Ah, so I'm guessing the operating system will deal a lot with which data gets swaped in and out of the CPUs cache, I mean, it wouldn't surprise me if the CPU had circuitry to deal with this but seems more of an OS implementation
 Ah, so I'm guessing the operating system will deal a lot with which data gets swaped in and out of the CPUs cache, I mean, it wouldn't surprise me if the CPU had circuitry to deal with this but seems more of an OS implementation

Nope. The CPU itself decides which cache-line gets "evicted" from the CPU cache, when a new cache-line is added to cache.

The cache is internal to the CPU and pretty much opaque to the OS.

(in fact there is a quite complex hierarchy of caches in the CPU, some of which are exclusive to each CPU core and some of which are shared by all/multiple CPU cores; usually there are at least three levels of caches in the CPU, called L1, L2 and L3)

Software can "control" the CPU caching behavior, to some degree, with the `PREFETCH` instruction:
https://c9x.me/x86/html/file_module_x86_id_252.html

The OS is responsible for swapping memory pages in and out of the RAM, to/from the disk. But that is a separate topic... 😏
Last edited on
It's mind-boggling how complex contemporary CPUs are to the Intel 8080 way back in the day.

Thanks Kigar!
Last edited on
Wait until you read about (automatic) cache prefetching, branch predication and speculative execution.

https://en.wikipedia.org/wiki/Cache_prefetching
https://en.wikipedia.org/wiki/Branch_predictor

I don't claim to be an expert on this but from what I understand the CPU tries to predict what memory that you are going to use and will try to load it into cache before it is used. This means the way you access memory can have a big performance impact. E.g. predictable sequential access is usually faster than accessing locations in memory at "random". This mostly matters when you're dealing with a lots of data that don't fit into the cache all at the same time.

Apparently the CPU will also try to execute multiple instructions at the same time, and not necessarily in the same order as the compiler have laid them out. To do this effectively it needs to do "branch prediction", e.g. guess whether the code in an if statement will be executed. If it guesses correctly the code will run faster.

Matt Godbolt has a short video that you might find interesting.
Five things you didn't realise your CPU did for you
Last edited on
 Apparently the CPU will also try to execute multiple instructions at the same time, and not necessarily in the same order as the compiler have laid them out. To do this effectively it needs to do "branch prediction", e.g. guess whether the code in an if statement will be executed. If it guesses correctly the code will run faster.

It is called out-of-order execution and speculative execution.

BTW: Because speculative execution can go wrong – and is expected to go wrong, every now and then – this case must be detected and then the "speculative" instructions need to be reverted before their effects can become "visible" outside of the CPU. Normally, this is not a problem, as long as it happens rarely. Unfortunately, even if the "speculative" instructions are fully reverted, they can leave the CPU cache (or other internal CPU data structures) in slightly different states than they would be otherwise. This is not directly "visible", but it can be detected by cache-based side channel attacks and other similar methods. This leads to CPU vulnerabilities like Meltdown and Spectre 😂
Last edited on
branch prediction is awesome when it works. An example of it working well is loop like code that does its thing 5 billion times and then does something else (exits the loop on a branch) one time, but it can't be determined ahead of time (eg, until the user enters -1, keep adding values to a vector). An example of it working poorly is a chain of nested conditions (until the user enters -1, if they entered 3.14 do this else if they entered 6.28 do that else .... on and on for 10+ choices, each of those with internal choices, and so on) where it simply cannot chase down all the paths or make any meaningful prediction etc.

It is in your best interest to write code that can exploit BP, but if you can make it work for you it is truly amazing.
I'm not sure that's even really possible. Maybe if you're targeting a specific uarch, but otherwise attempting to program around "the" branch predictor (as if there's only one) sounds like an exercise in futility.
Last edited on
Maybe so. The latest generations are able to buffer up a pretty high number as well, so the limit is a moving target; its getting close to 64 at a time last time I looked. That is much more forgiving than it was even a few years back. That is the common weakness for all the approaches... the number of active branches at once. Try to keep that low?
helios wrote:
attempting to program around "the" branch predictor (as if there's only one) sounds like an exercise in futility
Perhaps the goal is better worded as "avoid branches in hot paths when possible in the first place". Of course, avoiding branches isn't always faster, and everything should be measured (e.g. if it requires more RAM accesses to avoid branching when simply having calculated the branch would have avoided that).

https://youtu.be/rX0ItVEVjHc&t=2587 (43:07 gives examples when branching can't always be optimized by the compiler)
Last edited on
Avoiding branches is less about the branch predictor and more about avoiding pipeline stalls.
Right, well in that case an example of working with the branch predictor is to have consistent data, e.g. a pre-sorted array.

I'm sure a lot of people here already have seen posts like this, but in case people haven't:
https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array

For large N, big-O always wins. But for many applications with medium-sized N, it can be more advantageous to sort data before processing it, so that the branch predictor can make more accurate predictions.
Topic archived. No new replies allowed.