Phy's ARM Fun Facts [Episode 2: Thumb2 and ARMv8-A]
So, if any of you have been following my ARM rants lately, you have certainly heard a lot about conditional execution and how it's the basis of why I love ARM so much (compared to things like PPC and SPARC). If you tried to write some of the sample programs I've given you during that period, you may have noticed that pretty much none of them actually assemble; they all give you unknown mnemonic errors. Perhaps you have also stumbled upon this article.
[Image: sECJMSu.png]

The key here is that this "Applies to: ARMv8-A", sometimes also called A64, or better known in the Apple world as a subset of AArch64. Prior to 2011 all ARM chips were 32-bit, and ARMv7 (think Samsung processors) was incredibly widely used. The slightly better ARMv7-A ISA came out shortly afterwards, which changed some dynamics of the pipeline (think Apple processors), and then right at the end of its official development period the Cortex-Ms came out, making ARMv7-A not only the best RISC ISA ever made, but also the last true 32-bit densely packed fixed-width processor. That same year, 2011, ARM announced the ARMv8 ISA, which made a few key changes to the Apple/M customizations:
  • Extend all address bus lines to 64 bits
  • Double the 32-bit registers on chip (the new registers can't be accessed however)
  • Implement a new prefetch stage allowing an extra 5 registers to be placed on the data bus (effectively implementing 64-bit registers)
  • Move to a superscalar pipeline (now 8 stages rather than 5)
Now, this doesn't look too difficult to work with, but they had some issues to face. Up until that point, the ARM processor was effectively 2 processors on the same die: one for 32-bit instructions and one for 16-bit instructions (that's the actual length of the instruction; everything operated in 32-bit mode). When it came time to move to 64-bit computing, they had a tough choice to make.
They could either:
  1. Completely remove 32-bit functionality
  2. Say goodbye to the simplicity of their architecture
Unfortunately this wasn't really even a choice for them. They have Intel to compete with, and Intel is backwards compatible to 1978, so they had to keep 32-bit functionality. This of course meant that they could not extend the instruction width to 64 bits (A64 instructions occupy the same space that A32 instructions did). This caused some problems. Let's have a look at the ARMv7 instruction set spec:
[Image: NFcfzdA.png]
The key things I want to point out in this graphic are fields labeled
Those are all 4-bit fields, so for each of those fields that we have, we lose 4 bits of our instruction. This worked great with ARMv7 because it only had 16 registers that you could actually use. 2^4=16, so no big deal. Now there are 31 general purpose registers, so we need 5 bits. Let's look at the data processing instructions. We have both Rn and Rd in that instruction (and technically we have Rm as well, but it's hidden inside Operand2). With A32, Rn and Rd together would only take up a single byte, no big deal, but now they take up a single byte plus 2 extra bits! Those 2 bits have to come from somewhere, and the 4-bit condition field is the only thing that can be removed. Taking them leaves it with 2 bits, so at this point we've lost 75% of the conditional execution abilities.
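To make the bit counting above concrete, here's a rough Python sketch (not part of the original samples). It assumes the classic A32 data-processing layout with cond in bits 31:28, Rn in 19:16, Rd in 15:12 and Rm in 3:0; the helper names (bits, decode_a32_dp, register_field_bits) are made up for illustration, and the example word is ADD r0, r1, r2 with the "always" condition.

```python
# Sketch only: field positions assume the classic A32 data-processing layout
# (cond [31:28], Rn [19:16], Rd [15:12], Rm [3:0] in the register form of Operand2).
# All helper names here are hypothetical.

def bits(word, hi, lo):
    """Return word[hi:lo] as an integer (inclusive bit range)."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_a32_dp(word):
    """Pull the 4-bit fields out of a 32-bit A32 data-processing word."""
    return {
        "cond": bits(word, 31, 28),
        "rn": bits(word, 19, 16),
        "rd": bits(word, 15, 12),
        "rm": bits(word, 3, 0),
    }

def register_field_bits(register_count):
    """Smallest field width that can name every one of register_count registers."""
    return (register_count - 1).bit_length()

# ADD r0, r1, r2 with the AL (always) condition.
print(decode_a32_dp(0xE0810002))

for count in (16, 31):
    width = register_field_bits(count)
    print(count, "registers:", width, "bits per field,", 2 * width, "bits for Rn+Rd")
```

With 16 registers, Rn and Rd fit in 8 bits; with 31 they need 10, which is the "single byte plus 2 extra bits" from above.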
Now enter the second part of the problem: maintaining backwards compatibility. To do this, ARM had to use decode bits. You see these in the diagram as either a 1 or a 0. ARM needed more of them, specifically 1 more to handle the new fancy pipeline: they needed a bit to tell the core which decoder to actually use the result from. Now we're down to just 1 condition bit, which means 2 conditions (EQ and NE). While this isn't completely useless, it doesn't serve its intended purpose anymore, so ARM took that bit and used it for something else. Conditional execution was dead.
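Here's a small Python sketch of how the number of conditions shrinks as bits are taken away, following the argument above. The list is the standard A32 condition mnemonics in encoding order (the last slot, 0b1111, is labeled "NV" here for simplicity). Keeping only the lowest-numbered encodings when the field shrinks is my assumption for illustration, picked because it matches the EQ/NE result described above; encodable_conditions is a made-up helper.

```python
# Sketch only: assumes a shrunken condition field keeps the lowest-numbered
# encodings. encodable_conditions is a hypothetical helper for illustration.

A32_CONDITIONS = [
    "EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
    "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV",
]

def encodable_conditions(field_bits):
    """Conditions that still fit in a condition field of the given width."""
    return A32_CONDITIONS[: 1 << field_bits]

for width in (4, 2, 1):
    kept = encodable_conditions(width)
    lost = 1 - len(kept) / len(A32_CONDITIONS)
    print(width, "bits:", len(kept), "conditions,", f"{lost:.0%} lost,", kept)
```

Four bits give all 16 encodings, two bits give 4 (the 75% loss), and one bit leaves just EQ and NE.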

Fast forward to 2013: Apple releases the Cyclone series of processors (commonly called the A7). This is where things start to get interesting. Over the last 30 years, Apple has made it very apparent that if you are a CPU or ISA manufacturer, you do what they tell you. In the late 1980s Apple was designing and making processors, just as they do now, but back then they were actually doing it for the mainstream processors. In 2005, Apple announced that they were switching to Intel processors. By 2006, all of their products had been transitioned over, and in the same year PowerPC was dead. Pivoting to today, we now know why that happened: it was because IBM (part of the business alliance that was responsible for PPC) did not want to bring the processors up to date with the 3GHz barrier that Intel was getting so close to reaching. Soon after Apple left, Microsoft was forced to transition the Xbox over to Intel and AMD processors. You can draw your own conclusions from that one.

Ok, so enough about the story now, what was Apple pissed about, and what happened?
They were pissed about losing conditional execution. After all, an execute miss on a single instruction is no big deal when compared to Apple's CPU needing to refill its 16-stage pipeline. One cycle was nothing compared to 16 on a branch prediction miss. Apple deviated slightly from the ARM spec, and that eventually gave rise to ARMv8-A, which significantly expanded some of the various instructions, namely adding CSEL, which we won't talk about. In the process, they got much-needed updates to Thumb2 that can accomplish at least a good part of those goals, using those misc bits that ARM was left over with at the end of their updates. Let's talk about some of them:
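To put rough numbers on the "one cycle versus 16" argument, here's a back-of-the-envelope Python sketch. It only uses the two figures quoted above (1 wasted cycle for a conditionally skipped instruction, 16 cycles to refill the pipeline after a mispredicted branch) and ignores everything else a real core does; the function names are made up for illustration.

```python
# Sketch only: a toy cost model built from the 1-cycle and 16-cycle figures
# quoted above. It is not a model of any real core; names are hypothetical.

def expected_branch_penalty(miss_rate, refill_cycles=16):
    """Average cycles lost per branch if a fraction miss_rate is mispredicted."""
    return miss_rate * refill_cycles

def break_even_miss_rate(wasted_cycles=1, refill_cycles=16):
    """Miss rate at which branching costs as much as one wasted predicated slot."""
    return wasted_cycles / refill_cycles

for miss_rate in (0.01, 0.0625, 0.25):
    print(f"miss rate {miss_rate:.2%}: {expected_branch_penalty(miss_rate):.2f} cycles lost per branch")

print("break-even miss rate:", break_even_miss_rate())
```

Under these assumptions, a branch that mispredicts more than about 1 time in 16 already costs more on average than always eating one wasted cycle.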

Here's some ARMv7 code:
LDR R4, =var_label
CMP R4, #0x0
BEQ some_func
Now, on an ARMv7 chip, that would take 4 clock cycles before it executed the instruction, assuming that it didn't miss the branch (then it would take 7). Still no big deal. The same code for an ARMv8 CPU would take about a dozen cycles. There isn't any way to optimize this any further, unless some_func was less than 4 instructions in total, at which point we would "inline" the function and conditionally execute it. Apple found a place to pressure ARM on this one, abusing their fancy pipeline, some instruction hackery, and the new branch predictor. Here's the Thumb2 code:
LDR R4, =var_label
CBZ R4, some_func
Exact same result, but this code only needs 3 cycles to get into some_func, and it will never miss the branch. Now this was new to me, and it's awesome. Thumb code was usually much less powerful than ARM code because it only had half as much space, but now they're taking advantage of the pipeline additions and backwards-compatible circuitry to simultaneously execute a Thumb instruction and an A32 instruction. If you didn't already know, CBZ means "compare and branch if zero".
[Image: vMmi0Xj.png]
There's also CBNZ ("compare and branch if not zero"). As I said earlier in this post, we only had 1 bit, so we only got EQ and NE, which when used with a CMP against zero mean Z and NZ respectively.
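If the equivalence between the two snippets isn't obvious, here's a tiny Python model of it (a sketch, not real hardware behavior). It treats the register as a 32-bit value, models CMP Rn, #0 as setting a Z flag that BEQ/BNE then reads, and models CBZ/CBNZ as testing the register directly without touching any flags. The function names are made up for illustration.

```python
# Sketch only: a toy model of the branch decision, not of timing or encoding.
# Function names are hypothetical.

MASK32 = 0xFFFFFFFF

def cmp_then_branch(value, mnemonic):
    """Model CMP Rn, #0 followed by BEQ or BNE; returns True if the branch is taken."""
    z_flag = ((value - 0) & MASK32) == 0  # CMP sets Z when the result is zero
    if mnemonic == "BEQ":
        return z_flag
    if mnemonic == "BNE":
        return not z_flag
    raise ValueError("unsupported mnemonic: " + mnemonic)

def compare_and_branch(value, mnemonic):
    """Model CBZ / CBNZ; tests the register directly, no flags involved."""
    is_zero = (value & MASK32) == 0
    if mnemonic == "CBZ":
        return is_zero
    if mnemonic == "CBNZ":
        return not is_zero
    raise ValueError("unsupported mnemonic: " + mnemonic)

for r4 in (0, 1, 0xFFFFFFFF):
    print(hex(r4),
          "BEQ:", cmp_then_branch(r4, "BEQ"), "CBZ:", compare_and_branch(r4, "CBZ"),
          "BNE:", cmp_then_branch(r4, "BNE"), "CBNZ:", compare_and_branch(r4, "CBNZ"))
```

For every register value the CBZ decision matches CMP + BEQ and the CBNZ decision matches CMP + BNE; the difference is that the single instruction does it in one step.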

So, I just barely touched on the fanciness of the new pipeline, and that's in part because I hate it. Up until 2011, an ARM processor had 2 execution states and 2 operational encodings. These were ARM and Thumb. With AArch64 (v8-A), this got a little weird. ARM now has 2 execution states and 3 operational encodings. Thumb2 is gone. It technically does not exist anymore; to get to it, you need to use a BX instruction, and it operates entirely in 32-bit mode. This is where I started feeling like something was going on that nobody was telling me about, and I was right. While it's not an official feature in the sense that ARM won't acknowledge its existence, and no assembler will generate the code for it right now, there is indeed a secondary Thumb execution mode on ARM processors that have a pipeline depth of 11 or more, and it has near universal conditional execution. It's a feature that was introduced with the Cortex-A8 and 1156 series, called Thumb-EE, and it's a variable-length Thumb2 encoding, meaning that some instructions are 32 bits long, and those instructions do indeed have conditional execution.

I'm going to cut this one short, because I've just given you a giant wall of text about stuff you probably don't care about. Maybe next time I'll talk about ARMv8.4, which should be coming out at the end of this year.
You can read a bit about it here and here

