GXemul  >  Documentation (0.6.0.1)

GXemul: Dynamic Translation



This page describes the internal workings of the GXemul dynamic translation system (usually referred to as just "dyntrans").


Static vs. dynamic:

In order to support guest operating systems, which can overwrite old code pages in memory with new code, it is necessary to translate code dynamically. It is not possible to do a static (one-pass) translation. Self-modifying code and Just-in-Time compilers running inside the emulator are other things that would not work with a static translator. GXemul is a dynamic translator. However, it does not necessarily translate into native code, like some other emulators.


Executable Intermediate Representation:

Dynamic translators usually translate from the emulated architecture (e.g. MIPS) into a kind of intermediate representation (IR), and then to native code (e.g. AMD64 or x86 code) which can be executed by the host. Since one of the main goals for GXemul is to keep everything as portable as possible, the IR is something which can be executed regardless of whether the final step (translation from IR to native code) has been implemented or not.

The IR in GXemul consists of arrays of pointers to functions, together with a few arguments which are passed along to those functions. This is usually referred to as threaded code. The functions are implemented either in hand-written C/C++ or in generated C/C++.

("Generated code" in the old C dyntrans framework means generated by special programs at compile time, which output temporary .c files; in the C++ dyntrans framework, this is implemented using templates instead. In any case, this is all statically linked into the GXemul binary at link time.)

Here is a simplified description of how these arrays work.

There is one instruction call slot for every possible program counter location. For example, in the traditional MIPS case, instruction words are 32 bits long, and pages are 4 KB large, resulting in 1024 instructions per page. In GXemul, this is represented using 1025 or 1026 instruction call slots.

After the last of the regular instruction call slots, there are one or two additional slots, which are special "end of page" slots. These do not count as executed instructions. Instead, they jump to the first instruction on the next virtual page (which might cause exceptions, etc.).
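The slot mechanism can be sketched in plain, standalone C. Note that this is an illustrative toy only: the names (toy_cpu, instr_call, instr_add, instr_end_of_page, toy_run) are invented for this example and are not GXemul's actual identifiers.

```c
#include <stddef.h>

struct toy_cpu;

/* One instruction call slot: a function pointer plus its arguments. */
struct instr_call {
	void (*f)(struct toy_cpu *, struct instr_call *);
	void *arg[3];
};

struct toy_cpu {
	struct instr_call *next_ic;	/* the next slot to execute */
	long reg[4];
	int running;
};

/* An "add" instruction: arg[0] = destination, arg[1] and arg[2] = sources. */
static void instr_add(struct toy_cpu *cpu, struct instr_call *ic)
{
	(void)cpu;
	*(long *)ic->arg[0] = *(long *)ic->arg[1] + *(long *)ic->arg[2];
}

/* An "end of page" slot. Here it simply stops the toy loop; in a real
 * dyntrans system it would look up the next virtual page's slot array.
 * It backs up next_ic so it does not count as an executed instruction. */
static void instr_end_of_page(struct toy_cpu *cpu, struct instr_call *ic)
{
	(void)ic;
	cpu->running = 0;
	cpu->next_ic--;
}

long toy_run(void)
{
	struct toy_cpu cpu = { 0 };
	struct instr_call page[3];

	/* "Translate" a tiny program: reg0 = reg1 + reg2; reg0 = reg0 + reg0. */
	page[0] = (struct instr_call){ instr_add,
	    { &cpu.reg[0], &cpu.reg[1], &cpu.reg[2] } };
	page[1] = (struct instr_call){ instr_add,
	    { &cpu.reg[0], &cpu.reg[0], &cpu.reg[0] } };
	page[2] = (struct instr_call){ instr_end_of_page, { NULL } };

	cpu.reg[1] = 3;
	cpu.reg[2] = 4;
	cpu.next_ic = page;
	cpu.running = 1;

	/* The dispatch core: fetch the slot, advance, call. */
	while (cpu.running) {
		struct instr_call *ic = cpu.next_ic++;
		ic->f(&cpu, ic);
	}
	return cpu.reg[0];	/* (3 + 4) * 2 = 14 */
}
```

The while loop above corresponds to the unrolled "I; I; I; ..." core shown below.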

The core dyntrans loop, which executes the instructions in the instruction call slots, looks something like this: (Note: This is from pre-0.6.x.)

	/*  The normal instruction execution core:  */
	#define I	ic = cpu->cd.DYNTRANS_ARCH.next_ic ++; ic->f(cpu, ic);

	...

	n_instrs = 0;

	for (;;) {
		struct DYNTRANS_IC *ic;

		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;

		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;

		cpu->n_translated_instrs += 120;
		if (cpu->n_translated_instrs >= N_SAFE_DYNTRANS_LIMIT)
			break;
	}

Why the number 120? The answer is: this number is evenly divisible by 1, 2, 3, 4, 5, 6, 8, 10, 12 (and some other numbers). Long emulated instruction loops of those lengths will thus result in the same host branch being taken from the same place multiple times, which seems to fit with some modern CPUs' branch prediction mechanisms.
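The divisibility claim is easy to verify (illustrative snippet, not GXemul code):

```c
/* 120 is evenly divisible by each of the common short loop lengths
 * mentioned above, so emulated loops of those lengths line up with
 * the 120-slot unrolled dispatch core. */
static int divides_120(int n)
{
	return 120 % n == 0;
}
```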

The complexity of individual instructions varies. A simple example of what an instruction can look like is the MIPS addu instruction. First, as implemented in the old C dyntrans framework (pre-0.6.x):

	/*
	 *  3-register arithmetic instructions:
	 *
	 *  arg[0] = ptr to rs
	 *  arg[1] = ptr to rt
	 *  arg[2] = ptr to rd
	 */
	X(addu)
	{
		reg(ic->arg[2]) = (int32_t)(reg(ic->arg[0]) + reg(ic->arg[1]));
	}

It stores the result of a 32-bit addition of the register at arg[0] with the register at arg[1] into the register at arg[2]. If the emulated CPU is a 64-bit CPU, then this will store a correctly sign-extended value. If it is a 32-bit CPU, then only the lowest 32 bits will be stored, and the high part ignored. X(addu) is expanded to mips_instr_addu in the 64-bit case, and mips32_instr_addu in the 32-bit case. Both are compiled into the GXemul executable; no code is created during run-time.
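The sign-extension behavior can be demonstrated with a small standalone snippet (illustrative only; addu64 is an invented name, not GXemul code):

```c
#include <stdint.h>

/* Mimics the addu semantics described above for a 64-bit emulated CPU:
 * add, truncate to the low 32 bits, then sign-extend the result back
 * into the 64-bit register. */
static uint64_t addu64(uint64_t rs, uint64_t rt)
{
	return (uint64_t)(int32_t)(uint32_t)(rs + rt);
}
```

For example, addu64(0x7fffffff, 1) overflows the signed 32-bit range and yields 0xffffffff80000000, the sign-extended 64-bit representation of -2147483648.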

The C++ dyntrans implementation (in GXemul 0.6.x) is similar, except that a fixed REG32 or REG64 is used instead of reg (so it is only compiled once instead of twice). Also, instructions that may be reused between emulated architectures, such as this 32-bit signed add instruction, are in CPUDyntransComponent instead of MIPS_CPUComponent.

	/*
	 * arg 0: 64-bit register
	 * arg 1: 64-bit register
	 * arg 2: 64-bit register
	 *
	 * Adds the registers in arg 1 and arg 2, and stores the result in arg 0
	 * (truncated to a signed 32-bit value).
	 */
	DYNTRANS_INSTR(CPUDyntransComponent,add_u64_u64_u64_truncS32)
	{
		REG64(ic->arg[0]) = (int32_t) (REG64(ic->arg[1]) + REG64(ic->arg[2]));
	}

Another reasonably simple example is the M88K bsr (branch to subroutine) instruction. This is a branch which also sets the return address register to the instruction following the branch. Here is the implementation for "samepage" offsets, i.e. branch offsets which are inside the same dyntrans page:

	DYNTRANS_INSTR(M88K_CPUComponent,bsr_samepage)
	{
		//  This declares a cpu variable of the correct type which we can use.
		DYNTRANS_INSTR_HEAD(M88K_CPUComponent)

		//  Set the return address.
		cpu->m_r[M88K_RETURN_REG] = (cpu->m_pc & ~((M88K_IC_ENTRIES_PER_PAGE-1)
		    << M88K_INSTR_ALIGNMENT_SHIFT)) + ic->arg[2].u32;

		//  Branch.
		cpu->m_nextIC = (struct DyntransIC *) ic->arg[0].p;
	}

Here, the M88K r1 register (the return register) is set to the return address (the current program counter plus an offset in arg[2] pointing to the next instruction) and then the next instruction call to execute is set to arg[0].
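The return-address computation can be replayed in isolation. In the sketch below, the constants are assumptions (1024 instruction call slots per page, 4-byte instructions), and return_address is an invented helper, not GXemul code:

```c
#include <stdint.h>

/* Assumed page geometry: 1024 slots per page, 4-byte (shift 2)
 * instructions. The mask is then ~((1024-1) << 2) = 0xfffff003,
 * which clears the within-page instruction index bits of the pc,
 * leaving the page's base address. */
#define IC_ENTRIES_PER_PAGE	1024
#define INSTR_ALIGNMENT_SHIFT	2

static uint32_t return_address(uint32_t pc, uint32_t next_instr_offset)
{
	uint32_t page_base =
	    pc & ~((IC_ENTRIES_PER_PAGE - 1) << INSTR_ALIGNMENT_SHIFT);
	return page_base + next_instr_offset;
}
```

With pc = 0x12345678 and a precomputed arg[2] offset of 0x67c (the next instruction's offset within the page), this yields 0x1234567c.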

Although it looks like the bsr_samepage code would result in quite a lot of host instructions, objdump reveals that the compiler was able to optimize it to a reasonably short sequence:

	On Alpha (compiled with GNU C++ 3.4.6):
	<_ZN17M88K_CPUComponent18instr_bsr_samepageEP20CPUDyntransComponentP10DyntransIC>:
	     760:       b0 00 90 a4     ldq     t3,176(a0)
	     764:       1f 04 ff 47     nop     
	     768:       03 f0 bf 20     lda     t4,-4093
	     76c:       18 00 71 a0     ldl     t2,24(a1)
	     770:       00 00 85 44     and     t3,t4,v0
	     774:       01 04 03 40     addq    v0,t2,t0
	     778:       d0 01 30 b0     stl     t0,464(a0)		; Set the return address
	     77c:       08 00 51 a4     ldq     t1,8(a1)
	     780:       38 01 50 b4     stq     t1,312(a0)		; ... and update nextIC.
	     784:       01 80 fa 6b     ret

	On AMD64 (compiled with GNU C++ 4.2.1):
	<_ZN17M88K_CPUComponent18instr_bsr_samepageEP20CPUDyntransComponentP10DyntransIC>:
	  419310:       8b 87 b0 00 00 00       mov    0xb0(%rdi),%eax
	  419316:       25 03 f0 ff ff          and    $0xfffff003,%eax
	  41931b:       03 46 18                add    0x18(%rsi),%eax
	  41931e:       89 87 d0 01 00 00       mov    %eax,0x1d0(%rdi)		; Set the return address
	  419324:       48 8b 46 08             mov    0x8(%rsi),%rax
	  419328:       48 89 87 38 01 00 00    mov    %rax,0x138(%rdi)		; ... and update nextIC.
	  41932f:       c3                      retq   


Performance:

The performance of using this kind of executable IR is obviously lower than what can be achieved by emulators using native code generation, but can be significantly higher than using a naive fetch-decode-execute interpretation loop. Using threaded code as the IR is an interesting compromise.

The overhead per emulated instruction is usually around or below 10 host instructions. This depends very much on the host architecture and on the compiler and compiler switches used. Added to this instruction count is, of course, the C/C++ code used to implement each specific instruction.


Instruction Combinations:

Short, common instruction sequences can sometimes be replaced by a "combined" instruction. An example could be a compare instruction followed by a conditional branch instruction. The advantages of instruction combinations are that

  • the amortized overhead per instruction is slightly reduced, and

  • the host's compiler can do a good job of optimizing the common instruction sequence.

The special cases where instruction combinations give the most gain are in the cores of string/memory manipulation functions such as memset() or strlen(). The core loop can then (at least to some extent) be replaced by a native call to the equivalent function.

The implementations of compound instructions still keep track of the number of executed instructions, etc. When single-stepping, these translations are invalidated, and replaced by normal instruction calls (one per emulated instruction).

An example of what such an instruction combination can look like is a 3-instruction memset loop on MIPS. The pattern to detect is a sequence of addiu, bne, and sw with certain registers. The pattern check is only attempted when certain sw instructions are translated; for all other instructions, no attempt is made to find the pattern. (Note: The sw instruction is in the delay slot of the bne instruction.)

If the sequence is detected, the instruction call slot with addiu is changed to point to the instruction combination.

	Original sequence:			Changed to:

	slot 4:  ...				slot 4:  ...
	slot 5:  addiu	r5, r3, 24		slot 5:  addiu	  r5, r3, 24
	slot 6:  addiu	r9, r9, 4		slot 6:  sw_loop  r9, r9, 4
	slot 7:  bne    r12, r9, slot 6		slot 7:  bne      r12, r9, slot 6
	slot 8:  sw	r6, -4(r9)		slot 8:  sw	  r6, -4(r9)
	slot 9:  subu	r2, r3, 32		slot 9:  subu	  r2, r3, 32      <--- (*)
	slot 10: ...				slot 10: ...
(*) If the sw_loop code is executed, then execution continues here (&ic[3]).

The sw_loop is then implemented as follows:

	/*
	 *  sw_loop:
	 *
	 *  s:	addiu	rX,rX,4			rX = arg[0] and arg[1]
	 *	bne	rY,rX,s  (or rX,rY,s)	rt=arg[1], rs=arg[0]
	 *	sw	rZ,-4(rX)		rt=arg[0], rs=arg[1]
	 */
	X(sw_loop)
	{
		MODE_uint_t rX = reg(ic->arg[0]), rZ = reg(ic[2].arg[0]);
		uint64_t *rYp = (uint64_t *) ic[1].arg[0];
		MODE_uint_t rY, bytes_to_write;
		unsigned char *page;
		int partial = 0;

		page = cpu->cd.mips.host_store[rX >> 12];

		/*  Fallback:  */
		if (cpu->delay_slot || page == NULL || (rX & 3) != 0 || rZ != 0) {
			instr(addiu)(cpu, ic);
			return;
		}

		if (rYp == (uint64_t *) ic->arg[0])
			rYp = (uint64_t *) ic[1].arg[1];

		rY = reg(rYp);

		bytes_to_write = rY - rX;
		if ((rX & 0xfff) + bytes_to_write > 0x1000) {
			bytes_to_write = 0x1000 - (rX & 0xfff);
			partial = 1;
		}

		memset(page + (rX & 0xfff), 0, bytes_to_write);

		reg(ic->arg[0]) = rX + bytes_to_write;

		cpu->n_translated_instrs += bytes_to_write / 4 * 3 - 1;
		cpu->cd.mips.next_ic = partial?
		    (struct mips_instr_call *) &ic[0] :
		    (struct mips_instr_call *) &ic[3];
	}

(Note: This is from the old C dyntrans system, not the 0.6.x dyntrans implementation.)

For large memsets, calling the native/system memset implementation will most likely be a lot faster than emulating the MIPS memset one instruction at a time.

The fallback check makes sure that the rest of the code in sw_loop will be able to run. The implementation only supports zero-fill memsets, so if rZ is non-zero (or some other dangerous condition is present), then the code falls back to addiu. This is done by explicitly calling instr(addiu)(cpu, ic); and then returning.

(The example above is 32-bit specific, and only works if a direct page address can be obtained via cpu->cd.mips.host_store.)

To find out which such instruction combinations to implement, real-world code (such as a complete guest operating system with suitable user applications) should be executed with special instruction statistics gathering enabled; the most common instruction sequences can then be retrieved from those statistics.
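A minimal sketch of such statistics gathering might count how often each (previous opcode, current opcode) pair occurs, and afterwards report the hottest pairs as candidates for combination. All names here are invented for this example; GXemul's actual statistics mechanism is not shown in this document.

```c
#include <stdint.h>

/* Assumed: opcodes have been mapped to small integers below N_OPCODES. */
#define N_OPCODES 64

static uint64_t pair_count[N_OPCODES][N_OPCODES];

/* Called once per executed instruction, with the previous and the
 * current opcode. */
static void record_instr(int prev_opcode, int opcode)
{
	pair_count[prev_opcode][opcode]++;
}

/* Returns the count of the most frequent pair, and stores the pair
 * itself in *prev and *cur. */
static uint64_t hottest_pair(int *prev, int *cur)
{
	uint64_t best = 0;
	for (int i = 0; i < N_OPCODES; i++)
		for (int j = 0; j < N_OPCODES; j++)
			if (pair_count[i][j] > best) {
				best = pair_count[i][j];
				*prev = i;
				*cur = j;
			}
	return best;
}
```

Longer sequences (such as the 3-instruction memset pattern above) can be found the same way, by extending the counters to opcode triples.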


Native Code Generation Back-ends:

In theory, it would be possible to implement native code generation, similar to what is used in high-performance emulators such as QEMU, as long as the generated code abides by the C or C++ ABI on the host.

However, so far, there is no native code generation in GXemul. There was native code generation before 0.4.x (for i386 and Alpha hosts), but that was not clean enough to work as a long-term solution. It is better to spend development resources on porting things from the old emulation framework (pre-0.6.x) to the current one.