Besides common improvements like shifts instead of multiplying integers by a power of 2, MPW Oberon uses the FPU instruction FSCALE when a REAL is multiplied with a constant power of 2. Other detail improvements include the proper use of FMOVECR to use floating-point constants built into the FPU ROM.
680x0 branch instructions come in several sizes: 8-bit, 16-bit, and 32-bit displacement, resulting in 2, 4, or 6 byte instructions [Mot1]. The selection of the appropriate jump is easy in the case of a backward jump (e.g. in loops). Forward jumps are not as simple, the distance is unknown at the time of the instruction generation. It is necessary to repeat the jump instruction selection in an iterative fashion until all jump instructions in a block are determined. We implemented a special datastructure mimicking the jumps to find the relevant instructions.
Another performance critical aspect of code generation for CISC architectures is the choice of the optimal addressing mode for a given load and store operation. This is particularly true on the 680x0 processor with so many modes to choose from. Furthermore 'optimal' is different according to the value of 'x' in the processor name: the 68000, 68020/30, and 68040 process addressing modes with different efficiency.
Several instruction sequences using different addressing modes are possible to access a variable. The choice of the best scheme is influenced by the number of bytes for the sequence and the number of cycles for the instruction sequence. Combining as many operations as possible into a single instruction used to be the optimal strategy. In the times of burst transfers from memory to CPU and multistage instruction decode, this strategy is no longer adequate. For example the memory indirect addressing mode, used in MPW Oberon to access VAR parameters on lower lexical levels, is slower on the 68040 than a two instruction sequence. The 68030 performs better with the single instruction scheme.
After several performance tests, we decided to remove the memory indirect addressing mode we originally implemented as an optimization. As outlined above, the register indexed scheme is faster on the 68040. Both sequences have the same length, complex addressing modes need more bytes to code. Indexed addressing with a register is straightforward, index registers can be scaled by 2, 4, or 8. This addressing mode has more positive effects: less registers are necessary, resulting in smaller code size and faster execution.
The single most effective means to improve code quality was to avoid recomputing values where possible. MPW Oberon keeps track of values still available in a register. This is particularly efficient considering Wirth's redefinition of the WITH statement known from Pascal or Modula-2 [Wirth3]. Programmers are forced to use the complete qualifier in every access to a structure. Since access to multiple elements of a structure in a row is quite common, the same pointer has to be fetched constantly. MPW Oberon's value tracking scheme avoids the refetch of these pointers. The compiler has to be careful with this scheme however, it can only be applied within a basic block. Since the code generator doesn't have this information, it has to clear the value cache every time a branch is encountered or a pointer is touched. Although we cannot detect common subexpressions, the benefit of this scheme is significant: the compiler for example was accelerated by 7%.
The following two fragments show the substantial savings achieved by avoiding recomputations. The first fragment highlights the initialisation of a record. Its fields are accessed relative to register A4 after the first access to field head without recomputing the record's address.
| PROCEDURE NewHead(VAR head: Object; level: LONGINT); | ||
| BEGIN | LINK | A6,#0 |
| head.mode:=Head; | MOVE.L MOVE.L MOVE.W | 12(A6),A3 (A3),A4 #20,4(A4) |
| head.a0:=level; | MOVE.L | 8(A6),10(A4) |
| head.typ:=NIL | CLR.L | 6(A4) |
| ENDNewHead; | UNLK RTD | A6 #6 |
| PROCEDURE CalcWH(win: Windows.WindowPtr); | ||
| VAR w, h: INTEGER; | ||
| BEGIN | LINK | A6,#-4 |
| w:=win.contRgn.rgnBBox.topLeft.h
-win.contRgn.rgnBBox.botRight.h; | MOVE.L MOVE.L MOVE.L MOVE.W SUB.W MOVE.W | 8(A6),A4
118(A4),A3 (A3),A2 4(A2),D7 8(A2),D7 D7,-2(A6) |
| h:=win.contRgn.rgnBBox.topLeft.v
-win.contRgn.rgnBBox.botRight.v | MOVE.W SUB.W MOVE.W | 2(A2),D6 6(A2),D6 D6,-4(A6) |
| END CalcWH; | UNLK RTD | A6 #4 |
The PO compiler manages the identifiers of a scope or an imported module in linear lists. Initially not a problem, big import files such as most Macintosh system modules with thousands of identifiers, show the limits of this strategy. We decided to use a balanced binary tree to hold the identifiers, the Red-Black-Tree [Sedge]. Their administration is simple and the resulting tree well balanced. Although an almost trivial change, the performance gain was enormous. Even the compilation of the compiler itself was speeded up significantly.