The Design of a Distributed Oberon System

Stefan Traub

Department of Distributed Systems, University of Ulm, D-89069 Ulm, Germany

Abstract

In an ongoing research effort a distributed operating system based on the Oberon language and system is built. The basic abstraction is a global address space shared by all nodes in the system. The safe type-system of the Oberon language is crucial for efficient administration of the distributed virtual memory. Furthermore, essential parts of the Oberon system are well suited for a compact and efficient run-time environment for distributed applications. Approaches to garbage collection, relocation and distributed transactions are discussed.

Keywords: Distributed System, Oberon, Garbage Collection,
Transactions, Shared Memory

Keyword Codes: C.2.4; D.3.3; D.4.2;

1 Introduction

This report describes an ongoing research effort to build a distributed operating system based on the Oberon language and system. Our primary goal is the design of a very small but nevertheless powerful system. The design focuses on the aspects where a truly type-safe language such as Oberon can assist us in building a distributed system.

The paper is divided into two major sections. After a first architectural overview we discuss the problems of garbage collection and object relocation in a distributed environment. The second section describes a transaction model for optimistic concurrency control.

Key features of the system are:

Simplicity

Consistent view of the user workspace from any station in the distributed system.

Controlled access to a set of shared objects, particularly within telecooperation scenarios.

2 Distributed Heap Storage

The basic abstraction is a global (unique) address space shared by all nodes in the system. All executing threads are operating within a single global address space. Portions of the global virtual memory are mapped to the node's physical address space by means of a memory management unit.

The safe type-system of the Oberon language [2] is crucial for efficient administration of the distributed virtual memory. Data, stack, and code segments are allocated in a global heap structure which is common to all participating nodes. Objects in the common heap are accessed via type-bound pointers in virtual memory space. Available memory management hardware maps the logical address into local memory or onto other nodes in the system.

In addition to the translation of logical to physical address the mapping hardware also considers available access privileges, such as (write, read, execute ...). Write-access privileges to a memory page are typically granted to a single thread only at a particular point in time. Pages with read-access only (code) can be freely replicated in other nodes.

Figure 1: Distributed Heap Storage

2.1 Size of the address space

Currently, addresses into the global heap are 32 bits long. This is sufficient to demonstrate the basic functionality of our system. We expect processors to have at least 64 bits of address space in the near future [12]. Within such an address space it will be possible to allocate very large objects and even to connect many executing nodes. In addition, some currently available processors allow the address size to be extended by using segmented address models [10,11].

2.2 Garbage Collection

Several methods have been proposed to identify inaccessible memory portions in distributed systems [5,6]. In an Oberon system, garbage collection is typically based on type information provided by the compiler. Simple schemes suspend processing and collect inaccessible heap segments after following all pointers. However, in a distributed system with multiple threads, incremental collection schemes are needed which can execute asynchronously to the user threads.

The garbage collector must:

1. Find one object as a candidate for collection.
2. Find all references to this object; if none can be found, deallocate it.

Finding a candidate for collection is easy: typically, all allocated objects are chained together in the heap. The second part requires more effort and is also needed to relocate objects in the heap.

2.3 Relocation

Whenever the heap is fragmented and can no longer accommodate large contiguous objects, objects must be relocated to other logical addresses. After moving an object to its new destination, all referencing pointers are updated.

Relocation of objects is also beneficial when two objects reside in the same memory page but are continuously written from different nodes. One of the objects should then be moved to a different memory page, which can migrate separately to the requesting node. Objects must also be moved out of a logical page whenever the policy is not to transfer a complete page over the network (to reduce network load). If other valid objects remained in the same logical page, an ambiguous situation would occur within the address mapping entries of the participating nodes.

The relocator must:

1. Move the object to a new logical address.
2. Find all references to the moved object and update the pointer values.

Again, the first point is easy and the second requires more effort. Common to garbage collection and relocation is the need to find the references to an object. Several methods for finding and updating references are now presented; let us focus on the problem of finding references in a distributed system.

2.4 Backchain

For each pointer reference to an object in the heap, a backchain is built from the referenced object to the referencing pointers. The backchain is a linear list which connects all pointers referencing an object. To achieve this, it is necessary to modify the compiler's code generator. The backchain is maintained by a few additional machine instructions at run time, depending on the CPU architecture (some CPUs have built-in linked-list instructions). Consider the following example of a backchain (figure 2).

Figure 2: backchain

This data structure is ideal for a garbage collector and a relocator because it requires only a minimal amount of computing effort. This advantage, however, comes with the following drawbacks:

The compiler must be modified to generate the necessary code to maintain such a list.

The compiler must allocate storage for the backchain.

Pointer assignments become expensive.

On each pointer assignment the backchain must be maintained. The insertion operation is cheap (insertion into a linear list), but the removal operation may be expensive because a long list must be traversed. Nevertheless, we will experiment with this solution because of the following improvements.

The backchain can be extended to a doubly linked list, which reduces the number of machine instructions needed to remove a pointer from the list.

Frequently modified pointers often reside on the stack. These local pointer variables could be omitted from the backchain if the participating nodes are synchronized correctly. The compiler can distinguish local from non-local variables and generate the appropriate code. Several methods are currently under investigation.
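The backchain bookkeeping described above might look as follows. This is a minimal illustration in Python, not the Oberon runtime; the names HeapObject, BackLink, link and unlink are all hypothetical, and the doubly linked variant makes removal O(1), as suggested above.

```python
class HeapObject:
    def __init__(self):
        self.backchain = None   # head of the list of referencing pointer slots

class BackLink:
    """One node per pointer slot that currently references the object."""
    def __init__(self, slot):
        self.slot = slot        # identifies the referencing pointer variable
        self.prev = None
        self.next = None

def link(obj, slot):
    # called on pointer assignment: insert at the head, O(1)
    node = BackLink(slot)
    node.next = obj.backchain
    if obj.backchain is not None:
        obj.backchain.prev = node
    obj.backchain = node
    return node

def unlink(obj, node):
    # O(1) removal thanks to the double linking
    if node.prev is not None:
        node.prev.next = node.next
    else:
        obj.backchain = node.next
    if node.next is not None:
        node.next.prev = node.prev

def references(obj):
    # what the garbage collector / relocator walks
    refs, node = [], obj.backchain
    while node is not None:
        refs.append(node.slot)
        node = node.next
    return refs
```

With a singly linked list, unlink would have to traverse the chain to find the predecessor, which is exactly the expensive removal mentioned above.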

2.5 Supervisory Pointers

A different scheme uses supervisory pointers, i.e. pointers over pointers. A supervisory pointer is a reference to an existing pointer. For each allocated pointer an additional supervisory pointer is generated. The supervisory pointers can be used to find and to check all pointers of a particular type. This principle is again only applicable when using a truly type-safe language such as Oberon.

When a new heap object is allocated, it is preceded as usual by a type tag, i.e. a pointer to the corresponding type descriptor.

Consider the following example: there are two types of objects, "A" and "B". "A" is a record type which contains a pointer to type B. The following heap image might result.

Figure 3: objects and their type descriptors

The data structure is modified in the following manner: each type descriptor is augmented by a list of pointers, the so-called "supervisory pointers". These supervisory pointers point to all pointers which themselves point to heap objects whose type is described by this type descriptor. In other words, for each pointer in the system there is an additional supervisory pointer, inserted into the list associated with the type the pointer points to. This list is easily maintained by the memory manager.

In our example we will observe the following heap image after these modifications.

Figure 4: supervisory pointers

How can these supervisory pointers be used? If references to an object must be found, the type of this object is determined first. The type tag contains a pointer to its type descriptor, and this type descriptor has a list of pointers to all pointers which could possibly point to the observed object.

The advantage is that we do not have to check all pointers in the system for references to an object, but only those pointers which could actually point to it. Obviously, if there are many instances of one particular type, the supervisory list grows and the time needed to inspect the pointers increases.
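A minimal sketch of the supervisory-pointer scheme, assuming a memory manager that registers every pointer slot with the type descriptor of its target type. All names (TypeDescriptor, PointerSlot, references) are illustrative, not part of the actual system.

```python
class TypeDescriptor:
    def __init__(self, name):
        self.name = name
        self.supervisory = []   # supervisory pointers: all slots of this target type

class PointerSlot:
    """A pointer variable declared with a fixed target type."""
    def __init__(self, target_type):
        self.value = None                       # heap object currently referenced
        target_type.supervisory.append(self)    # register with the type descriptor

class Object:
    def __init__(self, td):
        self.typetag = td   # the type tag preceding each heap object

def references(obj):
    # Only pointers registered under obj's type descriptor can possibly
    # reference it; filter those that currently do.
    return [s for s in obj.typetag.supervisory if s.value is obj]
```

Note that the scan is restricted to pointers of one type, which is precisely the saving over examining every pointer in the system.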

2.6 Read-only Pointers

On a single-threaded system there is no interference from other nodes. In a multithreaded system, however, data objects may be changed asynchronously. This creates a problem when trying to relocate an object. Consider the case where the relocation algorithm walks along the list of pointers and another node writes a value to one of the pointers already visited: this creates a dangling pointer value and the system may crash.

To avoid this, we change every pointer visited by the relocator to a read-only value. Every action which attempts to modify a read-only pointer is either blocked until the pointer becomes read/write again or is simply aborted.

One possible way of implementing read-only pointers is to modify the compiler's code generation. The compiler can generate additional code to check for read-only values. For example, if each heap object is at least word-aligned, the least significant bit of a pointer value can be used to implement the read-only property. The check costs one or two machine instructions (test bit and jump if not zero), depending on the processor architecture; this overhead is acceptable. Another feasible solution is to temporarily change the access attribute of the page containing the pointer to read-only. The latter can only be achieved at page-level granularity.
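The LS-bit tagging can be illustrated as follows. The sketch assumes word-aligned heap objects, so bit 0 of every valid pointer value is free; mark_readonly, assign and the other names are hypothetical, and the check in assign corresponds to the one or two extra instructions mentioned above.

```python
READONLY_BIT = 0x1   # usable because word alignment keeps bit 0 of pointers zero

def mark_readonly(ptr):
    return ptr | READONLY_BIT

def clear_readonly(ptr):
    return ptr & ~READONLY_BIT

def is_readonly(ptr):
    return (ptr & READONLY_BIT) != 0

def assign(ptr, new_value):
    """Guarded pointer assignment: abort (here: raise) on a read-only pointer."""
    if is_readonly(ptr):   # the "test bit and jump" the compiler would emit
        raise PermissionError("pointer is read-only during relocation")
    return new_value
```

In a real implementation the tag bit would of course have to be masked off before the pointer is dereferenced.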

2.6.1 Content dependent read-only

Pointers are set to read-only to prevent other nodes from modifying a pointer which points to an object that is currently being moved to another location (relocation). However, the actual semantics are the following:

No other node should be allowed to modify a pointer pointing to the actual moving object.

The pointer needs to be read-only only if its value is "critical". When the system detects a read-only pointer, it can proceed if the current pointer value is not subject to change by other nodes. This "critical value" can be broadcast to every node before the relocation or garbage collection starts. A simple protocol can guarantee reliable behavior of the participating nodes.
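A sketch of this content-dependent check, under the assumption that the critical value (the old address of the object being moved) has already been broadcast to every node before relocation starts; all names are illustrative.

```python
critical_values = set()   # addresses of objects currently being relocated

def begin_relocation(old_addr):
    # in the real system this would be the broadcast received by every node
    critical_values.add(old_addr)

def end_relocation(old_addr):
    critical_values.discard(old_addr)

def assign(current_value, new_value):
    # A write may proceed unless the pointer currently holds a critical
    # value, i.e. unless it references the object being moved right now.
    if current_value in critical_values:
        raise PermissionError("pointer targets an object under relocation")
    return new_value
```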

The next section describes our transaction model, which solves the concurrency problem between any two nodes.

3 Transaction model

Traditionally, asynchronous access to common data structures is protected by synchronization mechanisms of some kind, and only one process is allowed to enter a critical section. In our system we allow several transactions to enter a critical section without providing explicit synchronization. The threads on different nodes can then proceed without excessive communication with partner nodes. Only at the end of a transaction must the system test whether a read/write collision between the executing threads has occurred. If so, all but one of the colliding transactions may be reset.

Resetting a transaction means undoing all changes to the state of the system effected by this transaction. To achieve this, a backup copy of each modified page is preserved. When a transaction starts, it owns no memory pages; it acquires access rights to memory only gradually as it progresses. When the transaction ends (TRANSACTION-END), the system checks whether the acquired rights overlap with those of another transaction; if so, a collision is detected and the changes are reset. If no collision is detected, the transaction commits and the system is ready for the next transaction.

Begin and end of transactions can be requested by the programmer in the context of a distributed algorithm, or they can be implied by the execution of an Oberon command [1]. Module Oberon calls TRANSACTION-BEGIN before executing a command and TRANSACTION-END after it. Because we assume that all Oberon commands leave data structures (piece lists etc.) in a consistent state, we can also let Oberon commands on different nodes manipulate the same text or the same object.

3.1 Pessimistic and Optimistic approach

Distributed transactions are a field of current research [3,4,9]. Two major strategies exist for implementing a transaction in a distributed context. One is the pessimistic approach, where each accessed data object is locked (e.g. with a read or write lock) before any access is made. If the transaction obtains all required locks, it runs to completion; the other transactions wait until the required locks are released. This approach is not deadlock-free, and therefore distributed deadlock detection protocols must be used [7].

The alternative is the optimistic strategy, where access is allowed a priori and, in the case of a conflict, the transactions are restarted. We use an optimistic approach similar to the distributed transaction model of Kung and Robinson [3].

When an object is accessed, the type of access must be known (i.e. read or write). In traditional database systems the type of access can be determined because objects are only ever accessed through corresponding read and write procedures. We do not have any such read or write procedures, because our accesses are performed directly by machine instructions.

3.2 Usage of the memory management unit

The hardware memory management can discriminate between read and write operations. A transaction starts with an empty set of logical to physical address mapping entries. If a transaction then accesses an object, a page-fault is generated and the requested object (i.e. the physical page containing the object) is added to the set of address mappings. The information provided by the page-fault-handler can be used to determine the type of access to an object.

3.3 Gradual acquisition of access rights

A transaction signals access to an object simply by referencing into the logical address space. Since none of the pages are mapped at the beginning of a transaction, a page fault occurs and the required page is mapped read-only into the transaction's address space in the case of a read access, or writeable in the case of a write access. Before a page is mapped writeable, a backup copy of the page is made (the before-image). The transaction works on the copy of the original page. This ensures that the transaction can be restarted or aborted, and the system will reach a state as if the aborted transaction had never been executed.

If the same logical page is accessed from another transaction on another node, that transaction also gets its own private copy of the old page. This guarantees that there is at least one transaction in the global system which can run to completion, avoiding an endless restart where no transaction ever completes.
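The gradual acquisition with before-images can be sketched as follows, modelling shared memory as a simple dictionary of pages; Transaction, fault and the other names are illustrative, not the actual page-fault handler.

```python
class Transaction:
    def __init__(self, global_pages):
        self.globals = global_pages   # shared state: page number -> content
        self.mapping = {}             # this transaction's private view
        self.before = {}              # before-images of pages mapped writeable

    def fault(self, page, write):
        """Page-fault handler: map the page read-only or writeable."""
        if write:
            if page not in self.before:
                # first write to this page: preserve the before-image
                # and work on a private copy from now on
                self.before[page] = self.globals[page]
                self.mapping[page] = self.globals[page]
        else:
            self.mapping.setdefault(page, self.globals[page])

    def write(self, page, content):
        self.fault(page, write=True)
        self.mapping[page] = content   # only the private copy changes

    def abort(self):
        self.mapping.clear()           # shared state was never touched

    def commit(self):
        # publish only the pages that were actually written
        self.globals.update({p: c for p, c in self.mapping.items()
                             if p in self.before})
```

Because every writer works on its private copy, a concurrent reader still sees the old page content until some writer commits, which is what makes restart and abort trivial.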

When a transaction completes it issues an "end-of-transaction" system call (we will call such a transaction a "terminating-transaction"). The system then checks whether the transaction can actually complete. If for example, two transactions have written to the same logical page, only one transaction can be allowed to complete. If a transaction is allowed to complete, the modified pages are committed for further transactions.

Special treatment is required if a transaction writes to an I/O device, an operation which is not restartable. Several solutions exist and are being studied.

We will now discuss the problem of detecting access conflicts and of deciding which transaction may run to completion.

3.4 Detecting access conflicts

We use the following definitions:

Let T be the set of all transactions in the system; we denote a single transaction by Ti.

Let P be the set of all logical pages.

A logical page can be "read" or "written" by any transaction. We denote these operations by

read(i,p): transaction Ti reads page p,
write(i,p): transaction Ti writes page p.

Notice that the memory management unit can only detect the first read(i,p) or write(i,p) operation, but, as we will see, this is sufficient for the algorithm to work. As the transaction proceeds, we compute two sets W and R. W is the set of all pages the transaction has written to (or, more precisely, wants to write to, since the write operation is delayed until commit time). R is the set of all pages the transaction has read from. These two sets are built up as the transaction proceeds, starting with two empty sets W = {} and R = {}.

The page fault handler knows the type of access (read or write). Each page added to the write set must also be added to the read set, because once a page is mapped writeable into the logical address space there may also be read operations, which cannot be further distinguished by the hardware.
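The resulting bookkeeping rule is small; here is a sketch with the page-fault handler reduced to a single hypothetical function, enforcing the invariant that the write set is always contained in the read set.

```python
def record_fault(R, W, page, write):
    """Record the first read or write fault on a page in the sets R and W."""
    if write:
        W.add(page)
        R.add(page)   # a writeable page may also be read undetected, so W <= R
    else:
        R.add(page)
```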

These read-sets and write-sets will now be used to compute a valid sequence of transactions.

3.5 Serial equivalence

A widely used criterion for the correctness of concurrent execution of transactions is known as "serial equivalence".

The assumption is that any transaction works correctly if its execution does not overlap with other transactions. The concurrent execution is said to be serially equivalent if there is at least one permutation of all possible serial execution schedules which produces the same result as the concurrent execution. Note that the only requirement is that such a permutation must exist.

If such a permutation exists, there is also a total ordering on the set of transactions. To establish a total order, the ordering relation between any pair of transactions in the system must be computed.

We use the following ordering-relations:

Ti > Tj if Ti starts its execution after Tj completes.

Ti < Tj if Tj starts its execution after Ti completes.

To compute these relations we use the sets Wi and Ri of the current transaction and compare them with the sets Wj and Rj of all other active transactions.

The following rules need to be observed:

1) (Ri ∪ Wi) ∩ (Rj ∪ Wj) = {} ⟹ (Ti < Tj) or (Tj < Ti)

2) Wi ∩ Wj ≠ {} ⟹ restart Ti or Tj

3) Ri ∩ Wj ≠ {} ⟹ Ti < Tj

Rule 1 states that if the read and write sets are completely disjoint, the transactions do not influence each other and can be ordered in any way. We simply choose Ti < Tj.

Rule 2 denotes a write collision, which results in the restart of either Ti or Tj. Assume Ti is restarted; then Ti > Tj follows because Tj has just completed, and vice versa.

To understand rule 3, we must recall the behaviour of our system when new pages are referenced. The important point is that if a transaction writes to a logical page, it gets a private copy of this page and writes to that private copy. The other transaction sees only the value of the page before the private copy was made. All pages of a transaction are written at the very end of its execution; this write phase is considered uninterruptable. Consider the case where Ti reads a page p and Tj writes to the same page. Since neither of the transactions has actually completed, Ti sees the value of page p before the first write of Tj to page p. Recall that at the moment Tj writes to a page, it operates on its private copy. Now assume an equivalent serial execution of both transactions.

Case 1: Ti < Tj. If Ti terminated before Tj started, it is certain that all pages read by Ti have their values from before the first write of Tj to any page.

Case 2: Ti > Tj. If Ti started after Tj completed, all pages Ti reads from would have their values from after the last write of Tj to these pages.

Our system behaves like case 1, because Tj writes to its private copy of any page; thus Ti reads the value before any modification by Tj.

We have to check the read and write sets in both directions. If rule 3 implies Ti < Tj, this can only hold if the same rule applied to transaction Tj does not imply Tj < Ti. This can be expressed in the following manner:

At commit time of a transaction Ti, the following condition must hold:

(Ri ∩ Wk = {}) or (Rk ∩ Wi = {}),

where k ranges over the set of all currently executing transactions. This can easily be guaranteed by applying rule 3 to every transaction. If any violation is detected, the violating transaction is restarted.
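The commit-time condition, together with the write-collision check of rule 2, reduces to a few set operations. A sketch, with the sets modelled as Python sets; may_commit is a hypothetical name.

```python
def may_commit(Ri, Wi, active):
    """Check whether the terminating transaction (Ri, Wi) may commit.

    active: list of (Rk, Wk) pairs of the still-running transactions.
    """
    for Rk, Wk in active:
        if Wi & Wk:                      # rule 2: write collision
            return False
        if (Ri & Wk) and (Rk & Wi):      # rule 3 violated in both directions
            return False
    return True
```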

Notice that it suffices to compare the read/write sets of a terminating transaction with the current read/write sets of the active transactions. Since the terminating transaction cannot increase its own write set after its completion, rule 3 cannot be violated in the future.

3.6 Implementation

The implementation of this scheme is straightforward because of the following properties:

The sets Ri and Wi are constructed locally, i.e. no communication overhead is needed.

The sets need only be updated when a page-fault occurs and not at every pointer reference.

When a transaction finally issues an end-of-transaction call, the computed read and write sets are broadcast to every other node, or at least to those nodes which have pages from the local read or write set. The sets of Ti are then compared in parallel with the read and write sets of any other active transaction by the other nodes.

It is interesting to see what happens when a transaction issues such a validation broadcast. Since there is no way to decide how long the other currently active transactions will run, it seems a good choice to commit the terminating transaction. As described above, there is at least one transaction which can run to completion (private copy of pages); we simply choose the terminating transaction as the "lucky one". This leads us to a further optimization.

If the communication protocol has a reliable broadcast property and there is exactly one possible broadcast at each time, then the terminating transaction can immediately run to completion.

With a slight modification to the protocol, the media access procedure can be used to serialize two terminating transactions. It is not necessary to wait for an answer from the other nodes. The transaction must merely check that the broadcast was successful and that no other equivalent broadcast arrived before its own broadcast was transmitted. This speeds up the local computation.

Upon reception of the EOT broadcast, the other nodes immediately check the received read and write sets against their own sets. If there is any conflict, those transactions abort.
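The receiving side of the EOT broadcast can be sketched in the same style, returning the local transactions that must abort; on_eot and the set representation are illustrative, under the assumption of a reliable broadcast so that the terminating transaction commits unconditionally.

```python
def on_eot(Ri, Wi, local_active):
    """Handle an EOT broadcast carrying the terminator's sets (Ri, Wi).

    local_active: dict mapping transaction name -> (Rk, Wk).
    Returns the names of the local transactions that must abort.
    """
    aborted = []
    for name, (Rk, Wk) in local_active.items():
        write_collision = bool(Wk & Wi)             # rule 2
        mutual_rw = bool((Rk & Wi) and (Ri & Wk))   # rule 3 in both directions
        if write_collision or mutual_rw:
            aborted.append(name)
    return aborted
```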

3.6.1 Fairness

One problem is that the algorithm is not fair. Assume two transactions Ti and Tj which operate on the same page. If both transactions are repeated in an endless loop (assume a command interpreter) and one of the transactions, let us say Tj, computes longer than Ti, the result will be that Ti commits every time while Tj is restarted indefinitely. This problem can be solved using an additional fairness protocol. Further research with real applications is needed to find an appropriate fairness protocol.

Dynamic transaction priorities may offer a solution to the fairness problem. Each node has an assigned transaction priority. These priorities are broadcast to all other nodes whenever they change. When a transaction finally wants to complete, it does so only if its own priority is higher than or equal to all priorities of the other nodes. If this is not the case, the local transaction may only complete if its termination is validated by a node with a higher priority.

When a transaction is restarted too often (a threshold which could be set at system generation time), its dynamic priority is increased and the transaction is restarted again. When a transaction completes, it has to restore its priority to the current base value. This guarantees that the longer transaction can also complete.
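A sketch of such a dynamic-priority scheme, assuming a restart threshold fixed at system generation time; all names and the concrete threshold value are illustrative.

```python
RESTART_LIMIT = 3    # would be fixed at system generation time
BASE_PRIORITY = 0

class NodePriority:
    def __init__(self):
        self.priority = BASE_PRIORITY
        self.restarts = 0

    def on_restart(self):
        self.restarts += 1
        if self.restarts > RESTART_LIMIT:
            self.priority += 1          # boost after too many restarts
            # (the new priority would be broadcast to the other nodes here)

    def may_commit(self, other_priorities):
        # complete only if no other node currently has a higher priority
        return all(self.priority >= p for p in other_priorities)

    def on_commit(self):
        self.priority = BASE_PRIORITY   # restore the base value
        self.restarts = 0
```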

4 Conclusion

The Oberon language and essential parts of the Oberon system are well suited for implementing a compact and efficient run-time environment for distributed application scenarios. Key features of Oberon for this purpose are the safe type system of the language and the small size of the original Oberon modules. Equally important is the fact that Oberon commands can be viewed as transactions in our distributed system. The Oberon memory model can be extended to a distributed environment by making use of standard address translation mechanisms. We believe that there will be a strong requirement for small distributed systems and with our system we provide an alternative to the ever growing layer architectures of contemporary operating systems.

References:

[1] N.Wirth, J.Gutknecht. Project Oberon, The Design of an Operating System and Compiler. ACM Press 1992

[2] N. Wirth. The Programming Language Oberon, 1990

[3] H.T. Kung, J.T. Robinson. On Optimistic Methods for Concurrency Control. ACM Transactions on Database Systems, 1981

[4] Erhard Rahm, A.Thomasian, Distributed Optimistic Concurrency Control for High Performance Transaction Processing. IBM Research Report RC 14795

[5] Marc Shapiro. A garbage detection protocol for a realistic distributed object-support system. Rapport de Recherche INRIA 1329, November 1990

[6] James O'Toole, Scott Nettles, David Gifford. Concurrent Compacting Garbage Collection of a Persistent Heap. ACM 1993 O-89791-632-8/93/0012

[7] Shing-Tsaan Huang. A Distributed Deadlock Detection Algorithm for CSP-Like Communication. ACM Transactions on Programming Languages and Systems, January 1990

[8] R.Ananthanarayanan, Sathis Menon, Ajay Mohindra. Experiences in Integrating Distributed Shared Memory with Virtual Memory Management.

[9] A. Goscinski. Distributed Operating Systems: The Logical Design. Addison-Wesley, 1991

[10] Intel. i486 Microprocessor Programmers Reference Manual. Intel Corporation 1990

[11] Motorola. PowerPC, RISC Microprocessor User's Manual

[12] Digital. Alpha Architecture Handbook. Digital Equipment Corporation, 1992