In almost all assembly books you’ll find some nice tricks to do fast multiplications. E.g. instead of “imul eax, ebx, 3” you can do “lea eax, [ebx+ebx*2]” (ignoring flag effects). It’s pretty clear how this works. But how can we speed up, say, a division by 3? This is quite important since division is still a really slow operation. If you never thought or heart about this problem before, get pen and paper and try a little bit. It’s an interesting problem.
Switching modes with Style
pushl $(0xcb<<24)|0x08 call .-1
What does this instruction sequence do? (This was a collaborative effort by Chuck Gray, Myria and Michael.)
How retiring segmentation in AMD64 long mode broke VMware
UNIX, Windows NT, and all the operating systems in their class rely on virtual memory, or paging, in order to provide every process on the system a complete address space of its own. An easier way to protect processes from each other is segmentation: The 4 GB address space of a 32 bit CPU is divided into segments (consisting of a physical base address and a limit), one for each process, and every process may only access their own segment. This is what the 286 did.
Strange SSE3 opcodes
Intel used some strange opcodes for the SSE3 instructions. All MMX/SSE opcodes use the 0x0f prefix (former “pop cs”). They soon noticed the the 0x0f area gets full, so they used the 0x66, 0xf2, 0xf3 prefix as modifiers. The basic rule is:
Shift oddities
Most of the x86 instructions will automatically alter the flags depending on the result. Sometimes this is rather frustrating because you actually what to preserve the flags as long as possible, and sometimes you miss a “mov eax, ecx” which alters the flags. But at least it’s guaranteed that an instruction either sets the flags or it doesn’t touch them, independent of the actual operation… Or is it?
Black Hat letdown
I went to Black Hat over Wednesday and Thursday. The presentation most people wanted to see (including me) was Joanna Rutkowska breaking the Vista x64 driver signing that I hate so much. I wanted to see what trick she’d found. I was let down, however, when she presented her technique.
Arithmetic mean of two signed integers
It’s time for a puzzle again! (submitted by sheepmaster)
Redundant SSE instructions
As we all know the x86-ISA has a lot of redundant instructions (ie. instructions with the same semantic but different opcodes). Sometimes this is unavoidable, sometimes it looks like bad design. But with SSE it gets really weird. Let’s say we want to perform xmm0 <- xmm0 & xmm1 (ie. bitwise and). Not an uncommon operation; but we have 3 different ways do archive this:
- andps xmm0, xmm1 (0f 54 c1)
- andpd xmm0, xmm1 (66 0f 54 c1)
- pand xmm0, xmm1 (66 0f db c1)
(Note that andpd/pand are SSE2 instructions)
Regarding the result in xmm0 these are really the same instructions. Now, why did Intel do this? First we’re going to inspect andps/andpd. Looking at the optimization manuals we get a hint: The ps/pd mark the target register to contain singles or doubles, so they should match the actual data you are operating on.
How Itanium messed up Intel's CPUID family IDs
Assigning internal version/family/model IDs to products is a non-trivial task, especially if there are several different families/architectures on your roadmap, and if the marketing names and target markets have no real correlation to the internal architecture.
Win32's MulDiv
In Win32, there is an API call called “MulDiv”:
FFREEP – the assembly instruction that never existed
Due to simplified instruction decoding of the Intel 80287, this CPU had opcode aliases for instructions like FXCH, FSTP, i.e. there were some additional encodings that did the same as the originals as defined by the 8087. As a side effect of this, a new instruction, FFREEP appeared, although not intented by Intel.
Virtualization: The elegant way and the x86 way
Virtualization means running one or more complete operating systems (at the same time) on one machine, possibly on top of another operating system. VMware, VirtualPC, Parallels etc. support, for example, running a complete GNU/Linux OS on top of Windows. For virtualization, the Virtual Machine Monitor (VMM) must be more powerful than kernel mode code of the guest: The guest’s kernel mode code must not be allowed to change the global state of the machine, but may not notice that its attempts fail, as it was designed for kernel mode. The VMM as the arbiter must be able to control the guest completely.
The funny page table terminology on AMD64
What’s the next word in this sequence: PT, PD, PDP, …?
The C ! operator
In C, the ! (“logical NOT”) operator used on a value x evaluates to 0 when x is not 0, and 1 when x is 0. In other words, it’s equivalent to the following C:
The real reason for driver signing in Vista x64
In Windows Vista x64, drivers are required to be signed by someone holding a VeriSign code certificate or they won’t load. There is no way to (permanently) disable this signing even if you are Administrator. The F8 startup menu has an option to disable it, but you must select it every time you boot up. Microsoft’s claimed reason for this is that it prevents Trojans from installing kernel-mode rootkits. That is a load of crap.
Microsoft changes CS value in Win64
I just found out the hard way that in 32 bit programs under Win64, the value of CS changed. In Win32, the value of CS is 0x001B. In 32 bit programs under Win64, it’s 0x0023. This will probably break some programs, especially debuggers.
Simple compiler optimization
I thought of an optimization that compilers for most CPUs could do that I think should be implemented. Let’s say you have C code like this:
Asking the kernel how to make a syscall
Imagine you’re an i386 user mode application on a modern operating system, and you want to make a syscall, for example to request some memory or create a new thread. But syscalls can be made in various ways on the i386 family of CPUs (int, call gates, sysenter, syscall), and CPUs tend to support only a subset of them. But hardcoding “int” into the kernel is a waste of resources on modern CPUs, because sysenter is a lot faster.
Why does PUSHA also push the stack pointer?
This puzzle is actually a quite easy one – but when I asked it in a university course, it kept some people busy for some time to find out the answer, so I thought it might be a good idea to ask you nevertheless: