a Tool for WWW-based Teleoperation

Heiner Wolf
Department of Distributed Systems
University of Ulm
D 89069 Ulm
Konrad Froitzheim
Department of Distributed Systems
University of Ulm
D 89069 Ulm


The Internet and the WWW are a ubiquitous network with a media rich interface. WWW browsers display text, graphics, video, audio and data in open and proprietary formats. Forms convey input from the user to the server. Video and form-based feedback in the WWW allow remote control and teleoperation over the Internet taking advantage of the relatively cheap, ubiquitous bandwidth and advanced user interfaces. We present a scheme to distribute multiple realtime video and data streams from a variety of sources and the interface to the controlled equipment.

WebVideo for Windows and UNIX are server programs to send video from various sources (cameras, RGB-instrument-output, bitmap displays, computer generated animations) in many formats (RGB, RGB-RLE, JPEG, GIF) to multiple client browsers at the same time. The WebVideo server measures the available bandwidth and generates individual videostreams. To optimize performance we introduce the CE/SC (Component Encoding/Stream Composition) mechanism to build multiple individually coded videostreams while the compute intensive video compression is performed only once.


A. The Challenge

Teleoperation requires two components: observation of the objects to be operated and the transmission of control commands based on communication networks. Three challenges can be identified:

Equipment which has only been controlled locally in the past is now to be remotely controlled. Engineers want to control and observe machinery from their offices using the desktop computer. Equipment which has up to now been controlled locally must now be operated remotely or shared by several clients to save personnel and share resources.

The distance between operator and machinery can be as small as the next room or as far as another continent. An engineer may control a couple of machines distributed over the building. There are many applications where control and data evaluation are done on a single computer for a town or a whole region. And there are long distance applications where people on different continents share a single installation or where a central site needs to observe multiple plants throughout the world.

A growing number of tasks is performed through computers and computer networks not only by skilled engineers, but also by trained personnel. Therefore user interfaces must be simple, easy to understand, and maintainable.

For many remote control applications hardware and user interfaces have to be standardized, because they are to be used by people with their own private equipment. Applications where engineers log into machinery from home or where citizens watch environmental data with home computers will be common.

B. The Solution

The Internet is a ubiquitous and fairly cheap data network with easy access to corporations, academic and government institutions, and common citizens. As an open, universal network based on standards it fulfils the Anywhere and the Anyone requirement. The access to the network is scalable from shared access via an IP provider, over ISDN connections, to dedicated T1 links, and ATM high speed lines.

Current developments in the Internet include new protocols and standards to provide Quality of Service (QoS) guarantees in the Internet. Currently, QoS depends on the investment in network access and router hardware.

The World Wide Web (WWW) adds more standard components to the Internet:


Fig. 1: Teleoperation with WebVideo

Our WebVideo system adds the missing component for Internet and WWW based teleoperation of equipment, answering the 'anything' challenge (see fig 1). It provides:

Chapter II is dedicated to the command or control channel and how browsers can be used as the user interface for teleoperation. Chapter III then explains WWW media formats and our solutions to provide video within standard browsers. This chapter culminates in the presentation of the multirate CE/SC system. Chapter IV will give information on the implementation of the WebVideo system and report on a pilot application.


WWW browsers are very sophisticated applications to present hypertext documents with text, tables, and graphics. They provide many user interface elements such as buttons, check-boxes, menus, and text fields. Moreover they can be extended by applets, little program components downloaded from the server to the client when necessary.

The World Wide Web supports user interaction via forms. Forms are encoded as part of the base HTML document. They can contain different user interface elements like buttons, menus and text input fields. User input is not transmitted continuously but is collected by the client and transferred to the server on request of the user. Thus the transmission of user input fits seamlessly into the request-response mechanism of the Hypertext Transfer Protocol (HTTP, [1]). The collected results of user interaction are encoded in a Universal Resource Identifier (URI) and sent with the HTTP request. The server extracts information from the URI, processes it using a server extension to interpret the form data, and usually returns the HTTP response (see Fig. 2). The response carries a confirmation for the user input as a new WWW page which replaces the old one.

Fig. 2: HTTP servers can be extended via the common gateway interface (CGI). A URI points to an executable program (right) instead of a file.
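
The way form input travels inside the URI can be illustrated with a minimal sketch. It only shows how a server extension might split a request URI into the program path and the form fields. The path /cgi-bin/control and the field names loco, track and action are hypothetical and are not taken from the WebVideo system.

```python
from urllib.parse import urlsplit, parse_qs


def decode_form_request(uri):
    """Split a form-submission URI into the program path and its form fields."""
    parts = urlsplit(uri)
    # parse_qs maps every field name to a list of values; keep the first one.
    fields = {name: values[0] for name, values in parse_qs(parts.query).items()}
    return parts.path, fields


# Hypothetical request as a browser might send it after the user submits a form.
path, fields = decode_form_request("/cgi-bin/control?loco=1&track=3&action=go")
print(path, fields)
```

The server would hand the decoded fields to the program named by the path and return that program's output as the HTTP response.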

A. Standard User Interfaces

The regular means to present information to the WWW user are defined by the Hypertext Markup Language HTML [7] which is based on an international standard for document description SGML [8]. So-called tags define the appearance of text on a page and allow the inclusion of user interface elements like clickable links.

In the context of user interfaces for teleoperation HTML has many benefits:

B. Advanced User Interfaces

The recent introduction of portable code for the WWW, which is transferred through the HTTP protocol from the server to the client computer, has made advanced user interfaces in the WWW possible. So-called applets written in Java [2] or Plug-Ins offer the possibility to execute full-featured programs within the client's browser.


The most important and technically challenging component of Internet-based teleoperation is the remote visualization of the status of the controlled equipment. Several forms of status information, i.e. output exist:

A. WWW Data Formats

Since a standard WWW browser is used as user interface, data must be supplied in special formats understood by the software. The following paragraphs describe the formats in more detail.

Static Images

Although the WWW supports many media formats only a small number of formats is actually used. By far the most widely deployed image format is GIF (Graphics Interchange Format [3]). GIF has been designed for the CompuServe network and has been adopted by the Internet community. Other commonly used image formats are JFIF (JPEG File Interchange Format) of the Joint Photographic Experts Group [4] and the X Window system's text-based bitmap encoding (XBM). The JPEG encoding scheme is superior to GIF if applied to continuous-tone images.

Many data visualization applications deal with slowly changing data, e.g. environmental data. In this case the data acquisition system is able to generate static images which represent the data (GIF or JPEG depending on the image contents) and provide these images on request by means of an HTTP server. This is the way many so-called 'live cameras' are implemented.

WWW Video Formats

Many digital video formats have been standardized and implemented over the last years. The most notable ones are MPEG, H.261, AVI, and CinePak. At the time of this writing none of them is integrated into WWW browsers in a fashion suitable for teleoperation.

MPEG [6] is a movie format designed for digital TV with the necessary features for such a service:

MPEG is widely used on the WWW to transmit short video clips. Usually video clips are downloaded and presented after the transfer is finished according to the file-transfer paradigm. There are a few platform dependent WWW browser plug-ins which are able to decode MPEG while the data arrives at the client. This mode is called streaming.

Fig. 3: Streamable multi-image formats

H.261 [5] is a standard for video-telephony over circuit switched networks such as ISDN. Although H.261 does not use the MPEG key-frames structure, the coding technique of H.261 could be considered a simplified MPEG compression. H.261 adjusts the compression ratio in a way that fixed bitrates, multiples of 64 kbit/s, are produced. In its original form H.261 is restricted to certain frame sizes (CIF, QCIF).

H.261 has been adopted by the Internet community as a live video format. Some Internet video telephony tools use H.261 streams (e.g. MBONE tools). A variant, the so-called H.261 intra-mode is an infinite sequence of DCT-transform-coded and Huffman coded sub-images.

Decoders are not available in typical WWW browser installations. They cannot be downloaded as Java-based decoders for performance reasons: Java is interpreted and not suited for the computationally intensive transform coding schemes (the same applies to MPEG of course).

WWW Live Video Transmission Systems

There are several live video transmission systems for the WWW. The most popular ones are VDOLive and RealVideo from Progressive Networks. Although VDOLive has been available considerably longer than RealVideo, RealVideo could become the de-facto standard in the WWW. All live video transmission systems require a hardware dependent WWW browser plug-in. They can be used for transmission of live events or for playback of recorded videos in a streaming fashion.

WWW live video transmission systems implement a TV-style broadcast scheme to deliver video. For such an application small delays are not detrimental. Delays are explicitly introduced through buffering in the receiver in order to protect the live presentation against packet loss and jitter. But remote control applications require instantaneous feedback and do not allow for delay. Therefore VDOLive and RealVideo are not suited for remote control applications.

Dynamic Images and Video

Remote control applications require real-time, i.e. low-delay feedback. The challenge is to provide direct and continuous feedback in a way understood by WWW browser software. A WWW based user interface must use only those data formats, which are either built into browsers or distributed together with the WWW browser.

Currently only a few formats for video and animated graphics fulfil these requirements. These formats are variants of still image coding formats: GIF and JPEG as animated GIF [3] and M-JPEG.

The general principle for both schemes is similar: an image format with header and image part is used as substrate. The initial image is transmitted in the regular manner. However after this first image the transfer is not terminated. The connection is kept open and subsequent images are transmitted (compare fig. 3). The continuous replacement of the pixelmap-part of the picture with new pixelmaps leads to a stream of frames and the visual impression of a video.

To improve compression and optimize the stream for limited bandwidth connections, the update images can be restricted to sub-images only where differences between the frames have been detected. The first image can then serve as a reference frame. In the GIF standard this multiple-sub-image feature has always been present to create simple animations in a system similar to VideoTex.

If sub-images are computed in real-time and sent over the network to replace outdated picture elements, a live video stream is created. The resulting stream is a sequence of sub-images - a video with temporal compression based on frame differencing. This scheme is also called action-stream.
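
The frame differencing behind the action-stream can be sketched as follows. This is a minimal illustration, assuming frames are given as lists of pixel rows and that a single bounding rectangle per update is sufficient. The function names are hypothetical, and the actual WebVideo change detection may work differently.

```python
def changed_rectangle(reference, current):
    """Return (left, top, width, height) of the smallest rectangle covering
    all pixels that differ between two frames, or None if nothing changed."""
    rows = [y for y, (a, b) in enumerate(zip(reference, current)) if a != b]
    if not rows:
        return None
    cols = [x
            for y in rows
            for x, (p, q) in enumerate(zip(reference[y], current[y]))
            if p != q]
    left, right = min(cols), max(cols)
    top, bottom = rows[0], rows[-1]
    return left, top, right - left + 1, bottom - top + 1


def sub_image(frame, rect):
    """Cut the rectangle out of a frame; this is the part that would be encoded."""
    left, top, width, height = rect
    return [row[left:left + width] for row in frame[top:top + height]]


reference = [[0] * 4 for _ in range(4)]
current = [row[:] for row in reference]
current[1][1] = 9
current[2][2] = 9
rect = changed_rectangle(reference, current)
print(rect, sub_image(current, rect))
```

Only the pixels inside the rectangle, together with its position, have to be encoded and transmitted. The client keeps the rest of the previously displayed picture.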

The advantage of coding the action-stream in the GIF format is compatibility with existing browsers without further download and installation. After initial problems, Netscape and Internet Explorer now implement the full GIF format.

Packet loss typically has severe or even catastrophic implications for the presentation of compressed media streams. As soon as reference information is lost, the picture is severely distorted. In the context of the WWW this is not a limitation, since data transfer in the WWW is built on the HTTP which in turn uses the reliable TCP protocol.

One problem arises with the presentation of the video stream: WWW servers try to transmit documents as fast as possible. The frame rate at the client display thus depends on the quality of the transport system connection. The rate may be too high in a local environment and too low over slow links. The first case results in time-compressed presentation of the video. The latter creates a backlog of frames, defeating the realtime capabilities. The creator (stream server) has to control the frame rate such that the presentation at the client is live, i.e. all the frames are presented at their exact playout time. The WebVideo server continually estimates the available bandwidth to the client and either delays frame transmission or drops frames. Dropped frames can however not be used as reference for future images. Precise book-keeping in the server solves this missing reference problem.
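
The per-client rate control and the reference book-keeping can be sketched as follows. This is an illustration under stated assumptions, not the WebVideo implementation: the bandwidth estimate is an exponentially weighted moving average of measured throughput, a frame is dropped whenever the previous one is still estimated to be in transit, and the class and method names are hypothetical.

```python
class ClientPacer:
    """Decides per client whether a newly captured frame is sent or dropped."""

    def __init__(self, bandwidth_estimate, smoothing=0.25):
        self.bandwidth = bandwidth_estimate  # bytes per second (assumed unit)
        self.smoothing = smoothing           # weight of the newest measurement
        self.busy_until = 0.0                # time the connection becomes free
        self.reference_frame = None          # last frame this client received

    def record_transfer(self, size_bytes, seconds):
        """Fold a measured transfer into the running bandwidth estimate."""
        measured = size_bytes / seconds
        self.bandwidth += self.smoothing * (measured - self.bandwidth)

    def offer(self, frame_number, size_bytes, now):
        """Return True if the frame is sent, False if it is dropped."""
        if now < self.busy_until:
            # The previous frame is still in transit: drop this one, so it
            # never becomes a reference for later difference images.
            return False
        self.busy_until = now + size_bytes / self.bandwidth
        self.reference_frame = frame_number
        return True


pacer = ClientPacer(bandwidth_estimate=10_000.0)
print(pacer.offer(0, 5_000, now=0.0))   # sent, link busy for half a second
print(pacer.offer(1, 5_000, now=0.1))   # dropped, reference stays at frame 0
print(pacer.offer(5, 5_000, now=0.5))   # sent, reference moves to frame 5
```

Because the server remembers which frame each client actually received, the next difference image for that client can be computed against that frame rather than against a dropped one.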

Animated JPEG

GIF has been very successful in the WWW because of its simplicity and flexibility. An animated GIF sequence is just an extension of a single GIF image. But the LZW-based compression scheme in GIF is not well suited for continuous-tone images because GIF does not offer irrelevance suppression. The small sub-images in the action-stream variant of GIF are another bad match for the LZW algorithm, since LZW is an adaptive scheme with a significant 'learning phase'.

DCT-transform coding as used in JPEG provides a significantly higher compression ratio than the LZW compression scheme of GIF. Even for high quality JPEG images (e.g. 2 bit/pixel) JPEG achieves better compression than GIF.

In order to combine the advantages of temporal compression through sub-image difference coding with JPEG's performance for pictures, we created the animated JPEG proposal. It follows the spirit of GIF87a with the simple frame differencing and adds DCT-transform coding in order to achieve JPEG compression performance.

An animated JPEG sequence is organized as a sequence of JPEG-coded sub-images, which follow the global JFIF header. This header contains Huffman tables and quantization tables. After the header data the first image, the base-image, completes the classical JFIF-coded image. Instead of closing the JFIF-record with the sequence terminator, the animated JPEG-format allows sub-images after the base-image. These sub-images are coded almost identically to the base-image. Only the position marker is added to tell the decoder where to place the difference pixel-block.

Fig. 4: Animated JPEG stream format

Additional quantization tables and Huffman tables can appear in the data stream with new IDs in order to optimize compression. Existing tables would be replaced if table-IDs are reused by DQT or DHT chunks. The size of sub-images is not restricted to macroblock boundaries.

It should be noted that this format is not standardized. We will supply reference implementations to foster rapid adoption of this simple and efficient stream format in the WWW.

B. Multiple Clients

A teleoperation system should allow for several observers of the equipment. Applications could be control of the equipment by several employees, an experiment watched by several students and the teacher, or a measuring instrument used by many scientists around the world. The generic scenario for feedback, i.e. WebVideo distribution, is a multipoint topology. Serving multiple clients with the same video stream poses a significant problem, however.

Heterogeneous Bandwidth

The Internet is a very heterogeneous network in terms of QoS, particularly with respect to bandwidth. Available throughput ranges from 10 kbit/s (wireless connection) up to 5 Mbit/s in local area networks, depending on the Internet connection and the network congestion. Since the bandwidth is not statically predictable, dynamic bandwidth adaptation to multiple connections is necessary. Another, albeit smaller problem is the decompression performance in the client computer.

In order to satisfy the realtime requirements in heterogeneous throughput topologies, multiple very similar video streams must be encoded concurrently in the server. Additionally the data rate from the video source (camera, graphics generator, simulator, display RAM) is different from the rate to the clients. The WebVideo system has to decouple the source from the client rates.

A third constraint is the available computational performance in the video server. Since workstation class computers such as SparcStation, Power Macintosh, and Pentium-class PCs have to suffice as compression and distribution servers, processing power is limited. And even when more processing power becomes available in the future, more demanding coding schemes will be desirable in order to increase video resolution in space and time.

The Solution: CE/SC

In the WebVideo system we use the Component Encoding/Stream Composition (CE/SC) scheme to serve multiple clients from one server with limited processing power.

In the component encoding element the input image stream is processed in two logically parallel tracks:

The stream construction part is performed by several, conceptually standalone software tasks. One such output agent is assigned to each connected client. When the bandwidth estimation for a particular connection decides that a client should receive the next frame, the respective output agent is called. This stream handler first searches the generation matrix for the parts of the picture to be updated. It then retrieves the pre-processed encoded components and combines them into a stream.


Fig. 5: CE/SC software architecture
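
The division of labour between component encoding and the output agents can be sketched as follows. This is an illustration under assumptions: the generation matrix is modelled as a mapping from tile position to the number of the frame in which the tile last changed, only the newest encoding of each tile is kept, and bytes() stands in for real compression. All class and method names are hypothetical.

```python
class ComponentStore:
    """Shared by all clients: generation matrix plus encoded components."""

    def __init__(self):
        self.generation = {}   # tile -> frame number of its last change
        self.encoded = {}      # tile -> newest encoded component
        self.encode_calls = 0  # counts the expensive compression steps

    def update(self, frame_number, changed_tiles):
        """Component encoding: compress every changed tile exactly once."""
        for tile, pixels in changed_tiles.items():
            self.generation[tile] = frame_number
            self.encoded[tile] = self._encode(pixels)

    def _encode(self, pixels):
        self.encode_calls += 1
        return bytes(pixels)   # placeholder for LZW or DCT coding


class OutputAgent:
    """One per client: composes a stream from pre-encoded components."""

    def __init__(self, store):
        self.store = store
        self.last_sent = -1    # frame number of the last update sent

    def next_update(self, frame_number):
        """Collect all tiles that changed since this client's last update."""
        stale = sorted(tile for tile, gen in self.store.generation.items()
                       if gen > self.last_sent)
        self.last_sent = frame_number
        return [(tile, self.store.encoded[tile]) for tile in stale]


store = ComponentStore()
fast, slow = OutputAgent(store), OutputAgent(store)
store.update(0, {(0, 0): [1, 2], (0, 1): [3, 4]})
fast.next_update(0)
store.update(1, {(0, 0): [5, 6]})
print(fast.next_update(1))   # only the tile changed in frame 1
store.update(2, {(0, 1): [7, 8]})
print(slow.next_update(2))   # everything the slow client has missed
print(store.encode_calls)    # number of compressions, independent of clients
```

The fast client receives small updates often, the slow client receives larger updates rarely, and both are served from the same encoded tiles.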

The stream structure depends on the required coding format. If it is GIF, the output agent fills GIF headers before it sends LZW encoded pixel rectangles. If the output is an MPEG video, appropriate picture/slice headers are written to the network connection before the DCT coded macroblocks are sent. Several types of output agents exist for the different stream formats. Not only is this a reasonable separation of functionality, it also makes the implementation of new coding schemes simple.

Although several types of components may have to be computed depending on the diversity of supported stream formats, we still achieve a high degree of synergy since the complex generation matrix computation is performed only once. Moreover component reuse for different encoding schemes such as DCT-blocks in the case of JPEG, H.261, and MPEG is possible with carefully chosen data structures.

The components of the CE/SC format are:

- DCT-coded macroblocks are stored in an intermediate format with an absolute DC-coefficient value. The final code for the main coefficient is again a difference to be computed at stream construction time. These components can be used for animated JPEG, MPEG and H.261 although the stream syntax differs.

- For the animated GIF stream all encoded sub-images are stored. Though GIF-sub-images have arbitrary size and location, they are kept for later reuse by other streams updating the same area. We have not yet explored a grid approach, where GIF-tiles are restricted to certain positions and sizes in order to increase reusability. Encoded sub-images expire when all streams have passed the time for which this tile applies.
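
Why the DC coefficient is stored as an absolute value can be seen from a small sketch. It assumes, as in baseline JPEG, that each block's DC value is coded as the difference to the DC value of the block preceding it in the same stream. The function name and the sample values are hypothetical.

```python
def dc_differences(absolute_dc_values, predictor=0):
    """Turn stored absolute DC values into the differences a stream carries."""
    differences = []
    for dc in absolute_dc_values:
        differences.append(dc - predictor)
        predictor = dc   # the next block is predicted from this one
    return differences


# Three stored blocks with absolute DC values 50, 60 and 55.
print(dc_differences([50, 60, 55]))   # a stream that carries all three blocks
print(dc_differences([50, 55]))       # a stream that skips the middle block
```

The same stored block yields a different difference value depending on which block precedes it in a particular stream, so this last step can only be done at stream construction time.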

Stream constructors keep only information pertinent to the particular stream such as current bandwidth, the number of the frame transmitted last, sequence numbers, etc.
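
The expiry rule for cached GIF sub-images can be sketched in a few lines. The sketch assumes that each cache entry is keyed by the frame number it applies to and that each stream constructor reports the number of the frame it transmitted last. The data layout and names are hypothetical.

```python
def expire_sub_images(cache, last_frame_per_stream):
    """Keep only the sub-images that at least one stream may still need."""
    slowest = min(last_frame_per_stream.values())
    return {key: data for key, data in cache.items() if key[0] > slowest}


rect = (1, 1, 2, 2)   # left, top, width, height
cache = {(1, rect): b"a", (2, rect): b"b", (3, rect): b"c"}
print(sorted(expire_sub_images(cache, {"fast": 3, "slow": 1})))
```

The slowest stream determines how long an encoded sub-image has to be kept.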

It should be emphasized that CE/SC is independent of the image encoding. The scheme can support GIF, JPEG, animated JPEG, MPEG, H.261, Wavelet-based schemes, and future compression techniques. Due to the limited processing power of CE/SC servers, the compression formats may not be fully exploited. In the case of MPEG for example, only simple motion vectors, i.e. no motion, are used. Complex vectors and so-called forward references are not applied.


To test the concepts and as an application showcase we have implemented the Interactive Model Railroad (IMRR). The IMRR allows every WWW user in the world to control a model railroad which physically resides in our laboratory in Ulm, Germany (see fig. 6). The IMRR is located at http://rr-vs.informatik.uni-ulm.de/rr.


Fig. 6: Remote WWW users can operate the model railroad and watch it in realtime. The interface has been modified since this screenshot was taken.

The model railroad layout is very simple. Two locomotives can be directed by the user to one of three tracks in the station. The locomotives themselves are controlled by digital commands received through the regular power feed. The command control scheme used is the so-called Motorola-format. The commands are generated by a PC with custom software. This software in turn is fed with commands from a CGI-script in the WWW-server according to the user input.

The user input is obtained from a graphical interface based on HTML pages: each time the user clicks on a locomotive or track icon, the respective page is fetched from the server. When the user finally pushes the 'GO' button, the page on which 'GO' was pushed determines the command to be sent (see fig. 6).

A video camera mounted over the layout then sends a composite signal to a frame-grabber. The digitized video is then fed into the WebVideo server software, in this case on a UNIX system, a SparcStation.


WebVideo is a flexible video transmission system for the Internet. It can form the basis of Internet and WWW based teleoperation for existing and new equipment. WebVideo is adaptable to many kinds of status output of the controlled instruments. The Interactive Model Railroad has proven the concept to more than 700,000 visitors.

The core component of WebVideo, the CE/SC framework for the compression of scaleable media streams has been successfully applied to video. The use of CE/SC is not limited to video however. It could be applied to any data which is transmitted in a compressed form to multiple clients at different rates concurrently.

The WebVideo system was implemented in 1995 and has been in use in various applications ever since. We started with the GIF-stream and JPEG-sequences. Lately MPEG stream encoding has been added. The next major step will be the public release of the animated JPEG scheme as proposed above.


[1] Tim Berners-Lee; Hypertext Transfer Protocol - A Stateless Search, Retrieve and Manipulation Protocol; 1993; http://www.w3.org/hypertext/WWW/Protocols/Overview.html

[2] v.Hoff, A., Shaio, S., Starbuck, O.: Hooked on Java; Reading, MA, 1996.

[3] CompuServe, Incorporated: CompuServe GIF 87a,

[4] International Organization for Standardization: Information Technology - Digital Compression and Coding of Continuous-tone Still Images; ISO/IEC DIS 10918-1; ISO 1991.

[5] ITU (International Telecommunications Union), Recommendation H.261 - Line Transmission of non-Telephone Signals: Video Codec for Audiovisual Services at px64 kbit/s; CCITT Recommendation H.261, Geneva, 1990.

[6] International Organization for Standardization: Information Technology - Coding of moving pictures and associated audio for digital storage up to about 1.5 Mbit/s; ISO/IEC DIS 11172; ISO 1992.

[7] Specification of the HyperText Markup Language; http://www.w3.org/pub/WWW/MarkUp/

[8] SGML Document Type Definition of the HyperText Markup Language; http://www.w3.org/pub/WWW/MarkUp/html3/html3.dtd