Optimizations

To debug a compiler you have to examine the object code produced and you will find parts where the code could be better. However the structure of the PO compiler prevents the use of any optimizing algorithms requiring program analysis (common subexpressions, register variables, etc.). Therefore we had to limit our optimizations to peephole strategies like the optimal selection of instructions (CLR instead of MOVE, BSR instead of JSR where possible, etc.) or the static evaluation of constants.

Besides common improvements like shifts instead of multiplying integers by a power of 2, MPW Oberon uses the FPU instruction FSCALE when a REAL is multiplied with a constant power of 2. Other detail improvements include the proper use of FMOVECR to use floating-point constants built into the FPU ROM.

680x0 branch instructions come in several sizes: 8-bit, 16-bit, and 32-bit displacement, resulting in 2, 4, or 6 byte instructions [Mot1]. The selection of the appropriate jump is easy in the case of a backward jump (e.g. in loops). Forward jumps are not as simple, the distance is unknown at the time of the instruction generation. It is necessary to repeat the jump instruction selection in an iterative fashion until all jump instructions in a block are determined. We implemented a special datastructure mimicking the jumps to find the relevant instructions.

Another performance critical aspect of code generation for CISC architectures is the choice of the optimal addressing mode for a given load and store operation. This is particularly true on the 680x0 processor with so many modes to choose from. Furthermore 'optimal' is different according to the value of 'x' in the processor name: the 68000, 68020/30, and 68040 process addressing modes with different efficiency.

Several instruction sequences using different addressing modes are possible to access a variable. The choice of the best scheme is influenced by the number of bytes for the sequence and the number of cycles for the instruction sequence. Combining as many operations as possible into a single instruction used to be the optimal strategy. In the times of burst transfers from memory to CPU and multistage instruction decode, this strategy is no longer adequate. For example the memory indirect addressing mode, used in MPW Oberon to access VAR parameters on lower lexical levels, is slower on the 68040 than a two instruction sequence. The 68030 performs better with the single instruction scheme.

After several performance tests, we decided to remove the memory indirect addressing mode we originally implemented as an optimization. As outlined above, the register indexed scheme is faster on the 68040. Both sequences have the same length, complex addressing modes need more bytes to code. Indexed addressing with a register is straightforward, index registers can be scaled by 2, 4, or 8. This addressing mode has more positive effects: less registers are necessary, resulting in smaller code size and faster execution.

The single most effective means to improve code quality was to avoid recomputing values where possible. MPW Oberon keeps track of values still available in a register. This is particularly efficient considering Wirth's redefinition of the WITH statement known from Pascal or Modula-2 [Wirth3]. Programmers are forced to use the complete qualifier in every access to a structure. Since access to multiple elements of a structure in a row is quite common, the same pointer has to be fetched constantly. MPW Oberon's value tracking scheme avoids the refetch of these pointers. The compiler has to be careful with this scheme however, it can only be applied within a basic block. Since the code generator doesn't have this information, it has to clear the value cache every time a branch is encountered or a pointer is touched. Although we cannot detect common subexpressions, the benefit of this scheme is significant: the compiler for example was accelerated by 7%.

The following two fragments show the substantial savings achieved by avoiding recomputations. The first fragment highlights the initialisation of a record. Its fields are accessed relative to register A4 after the first access to field head without recomputing the record's address.
PROCEDURE NewHead(VAR head: Object; level: LONGINT);
BEGINLINKA6,#0
head.mode:=Head; MOVE.L
MOVE.L
MOVE.W
12(A6),A3
(A3),A4
#20,4(A4)
head.a0:=level;MOVE.L8(A6),10(A4)
head.typ:=NILCLR.L6(A4)
ENDNewHead;UNLK
RTD
A6
#6
The second fragment shows the calculation of a window's width and height. The data structures in this fragment are defined by Quickdraw. Note that accessing a field within a handle (rgnBBox) is denoted exactly as if it were accessed via a pointer. Due to the complex qualifiers the savings are substantial.
PROCEDURE CalcWH(win: Windows.WindowPtr);
VAR w, h: INTEGER;
BEGINLINKA6,#-4
w:=win.contRgn.rgnBBox.topLeft.h
-win.contRgn.rgnBBox.botRight.h;
MOVE.L
MOVE.L
MOVE.L
MOVE.W
SUB.W
MOVE.W
8(A6),A4
118(A4),A3
(A3),A2
4(A2),D7
8(A2),D7
D7,-2(A6)
h:=win.contRgn.rgnBBox.topLeft.v
-win.contRgn.rgnBBox.botRight.v
MOVE.W
SUB.W
MOVE.W
2(A2),D6
6(A2),D6
D6,-4(A6)
END CalcWH;UNLK
RTD
A6
#4
It is a Macintosh convention, that data larger than 4 byte are not copied onto the stack, a heritage from the days of the good old 128 K Macs. They are passed by reference in any case. It is in the responsibility of the called procedure to make a copy of the structure in order to keep the parameter semantics. MPW Oberon avoids making that copy, when the procedure does not write into a value parameter. The code to make such a copy is in the proedure preamble. While the procedure is generated, MPW Oberon keeps track of access operations to the structure. If there were no write instructions to a particular structured parameter, the copy preamble is deleted.

The PO compiler manages the identifiers of a scope or an imported module in linear lists. Initially not a problem, big import files such as most Macintosh system modules with thousands of identifiers, show the limits of this strategy. We decided to use a balanced binary tree to hold the identifiers, the Red-Black-Tree [Sedge]. Their administration is simple and the resulting tree well balanced. Although an almost trivial change, the performance gain was enormous. Even the compilation of the compiler itself was speeded up significantly.


Previous Section, Next Section, Contents