Interactive Video and Remote Control
via the World Wide Web

Klaus H. Wolf <wolf@informatik.uni-ulm.de>
Konrad Froitzheim <frz@informatik.uni-ulm.de>
Michael Weber <weber@informatik.uni-ulm.de>

Department of Distributed Systems
Computer Science Faculty
University of Ulm
89069 Ulm, Germany

  1. Introduction
    1. The World Wide Web
    2. Media in the WWW
    3. Image and Movie Formats in the WWW
  2. Pushing the Limits
    1. Moving Images
    2. Sequence of Still Images
    3. Image Streams
  3. A Prototype Implementation
    1. Client Software
    2. Server Software
    3. Server Extension
      1. Solution of Related Projects
      2. The WebVideo Approach
    4. Supporting Multiple Clients
  4. An Interactive Application
    1. User Interaction in WWW
    2. A Sample Scenario
    3. Real Application
  5. Future Work
  6. Conclusions
  7. References

Abstract:

The World Wide Web (WWW) is the most widely accepted service in the internet, providing a globally distributed hypermedia information system. Despite its tremendous usage growth, the WWW lacks functionality beyond its basic request-response, i.e. download, paradigm, especially in the area of conveying information carried by stream-oriented realtime media. An important development direction for the WWW is therefore the integration of such media into WWW pages. This paper considers methods to include moving images in WWW pages without changing existing standards or protocols. They are based on the multi-image capabilities of many image encoding standards. In combination with dialog elements the system provides interactive video in the World Wide Web. Video as visual feedback to user commands allows for a far more immediate user interface than explicit confirmations. To demonstrate the concept's feasibility, a prototype has been implemented and evaluated in an application scenario featuring remote control accompanied by remote visualization.

Keywords:

World Wide Web, remote control, interactive video, GIF

1. Introduction

1.1 The World Wide Web

The World Wide Web (WWW) [WWW] is a globally distributed hypertext and hypermedia system. Hyperdocuments are non-linear documents which contain links to other documents. The hyperdocuments served by the WWW are not restricted to text. Other media such as images, graphics, sounds, and movies can be referenced. Documents are assembled to form the layout of WWW-pages. This layout is controlled by an HTML-type document. HTML, the Hypertext Markup Language [HTML], is a text based document description, defined by an SGML document type definition (DTD) [DTD].

From a system architecture point of view the World Wide Web is a client-server system. Documents are exchanged between WWW-clients and WWW-servers using the Hypertext Transfer Protocol (HTTP) [Berne93]. The currently used protocol version, HTTP 1.0, defines a stateless request-response mechanism. To retrieve a WWW-page, a transport system connection is established for each document referenced on this page. The client requests the document by its name and the server responds by transferring the document's data. Documents containing non-continuous media are typically downloaded first and presented afterwards. Even interactive elements, such as forms, use the same request-response mechanism. There is no provision for continuous connections between client and server.
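The stateless request-response exchange can be sketched with two small helpers, one building an HTTP 1.0 GET request and one splitting a response into status, headers and body. This is an illustrative sketch only; the host and path names are placeholders, and a real client would of course send the request over a transport system connection.

```python
def build_request(path, host):
    """Build a minimal HTTP/1.0 GET request for a single document."""
    return ("GET %s HTTP/1.0\r\n"
            "Host: %s\r\n"
            "\r\n" % (path, host)).encode("ascii")

def parse_response(raw):
    """Split an HTTP response into status code, header dict and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    status = int(lines[0].split()[1])        # status line, e.g. "HTTP/1.0 200 OK"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body
```

One connection carries exactly one such request and one response; retrieving a page with several inline documents therefore repeats this exchange once per document.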

The World Wide Web is rapidly growing into the largest hypertext information system world-wide. Such extensive usage demands advanced features and more flexibility than the system provides today. The system has been intentionally designed as a distributed information system which links locally available documents into a global hypertext document. But many applications, especially in the commercial domain, require enhanced features like security, interactivity, transport efficiency, improved layout capabilities, and server control over the client display. Some of these features, e.g. interactive elements and security, have already been added to the World Wide Web. Others, e.g. HTTP 2.0, are still in the design phase or under discussion. Most enhancements, however, are based on extensions to the standards in use, or even replacements for them, and therefore face a problematic deployment phase.

1.2 Media in the WWW

WWW mechanisms are media independent. Since the HTTP protocol is transparent to the type of documents it is handling, an HTTP request accesses documents only by name. Client and server negotiate media types using the MIME (Multipurpose Internet Mail Extensions) model. The server decides on the document type and returns a MIME type specification with the HTTP response. It is then the client's responsibility to select the proper presentation method according to the document type.

WWW pages currently contain text, graphics and dialog elements. Presentation of other data and media is accomplished by external programs, so-called viewers. WWW clients do not directly support synchronous playback of stream oriented media such as audio and video. Audio and video are treated as files which are processed in three phases: retrieve, store in the file system, and present through a viewer. Usually, the presentation of such data begins after the document has been retrieved completely from a server. Thus continuous media cannot be displayed directly (inline) on a WWW-page.

Applications which require continuous updates of the client's view have to force the user to repeatedly retrieve a document [Crocker94]. Discussion on the inclusion of inline media other than graphics, e.g. audio [Uhler94] and video, is currently very lively. How this integration can be achieved is subject to active research. New protocols, protocol elements and extensions to HTML have been proposed [Soo94], [KaasPinTaub94]. However, changes to existing standards widely and intensively used should be done very carefully, and if possible, they should be avoided altogether.

1.3 Image and Movie Formats in the WWW

Although the WWW supports many media formats, only a small number of formats is actually used. By far the most widely deployed image format is Compuserve's GIF (Graphics Interchange Format [GIF87]). GIF has been designed for the Compuserve network and has been adopted by the internet community. Software on the internet, however, usually exploits only a subset of the standard's capabilities, namely the features required to encode single still images. Other commonly used image formats are JFIF (JPEG File Interchange Format) of the Joint Photographic Experts Group [JPEG] and the X Window system's text-based bitmap encoding (XBM). The JPEG encoding scheme is superior to GIF if applied to continuous-tone images. Its usage is growing fast since the major WWW clients support inline JPEG images.

A standard movie format is the ISO standard MPEG (Moving Pictures Expert Group) [MPEG]. Quicktime and Video for Windows are other movie formats in use, but they are not vendor and platform independent. Nevertheless they can be decoded on nearly all platforms depending on the availability of the proper decoder software on each client system. The decoders are external programs because these data types are not decoded directly by WWW clients. Thus movies are presented off-line after the movie files have been retrieved and stored locally.

2. Pushing the Limits

2.1 Moving Images

All components of the World Wide Web are evolving rapidly. One direction of this evolution is the integration of stream-oriented media into WWW pages. Movie delivery via the WWW tends to develop towards inline decoding of MPEG within WWW clients. On the one hand it is not foreseeable when such clients will become available. On the other hand, realtime encoding of MPEG streams from live sources requires dedicated hardware at the server sites.

However, there are application scenarios now which require moving images based on other formats and mechanisms. Remote control and remote visualization are applications making use of realtime generated animated computer graphics or live video. In contrast to MPEG, still image decoders are readily available in the WWW clients. They can be used to show sequences of images which are perceived as videos if the images replace each other at a reasonable rate.

Regardless of the actually chosen video stream format a client has to support continuous decoding explicitly. The WWW client has to be able to retrieve documents concurrently and present them continuously while successive data arrives. The implementation architecture of the WWW clients must be event and data driven. Clients must not block while retrieving a document.

2.2 Sequence of Still Images

One method for animated graphics or video is the consecutive transmission of individual images. Instead of terminating the transport system connection after transmitting an image, the server continues to send images until the client terminates the connection. Not only images can be animated this way, but also other document types, e.g. text. This leads to a new communication model of the HTTP connection where more than just one document is transferred during the lifetime of a connection.

However, this animation method has some disadvantages. Combining a series of documents into a single structure which fits into an HTTP response requires a description of the contents of the response. This means the addition of a new layer of control structures between the document layer and the HTTP protocol layer. Existing WWW clients and servers have to be modified in order to deal with streams of documents instead of single documents.

The method allows the existing decoder software of WWW clients to be exploited. It supports all image formats including GIF and JPEG. The major disadvantage of such a pseudo animation, if used for moving images, is the fact that entire images always have to be encoded and transmitted. The encoded animation contains considerable redundancy. All images are encoded completely independently of each other. They all contain header information and often encode the same unchanged image parts over and over again.

2.3 Image Streams

Some image formats are not restricted to single images. They describe the encoding of a sequence of consecutive images. Examples are the PNG [PNG95] and GIF formats. Recent specifications of GIF allow for an infinite sequence of images. The images can be of different size and depth within a global rectangle. The possibility to encode differently sized image parts permits simple frame differencing as optimization method. Of course, the extent of data reduction due to frame differencing depends on the encoded sequence. But a large number of applications, especially in the remote control and process visualization area, rely on animated graphics displays with only slight changes from image to image. Many publicly available GIF-viewers support this multi-image feature. It is indeed possible to present a movie with these viewers.

An image stream encoded in a multi-image format is accessed via hyper links as any other document. The WWW client opens the transport system connection, sends the HTTP request and waits for the response. The response contains HTTP header information and the image stream as HTTP body. The HTTP header indicates the document type to the client encoded in a string consisting of type and subtype, e.g. image/gif. This document type description does not have to be changed in order to support multiple images. A GIF encoded image sequence will have the same document type as single images. A single image is regarded as a special case of a sequence containing only one image.

The main advantage of image stream formats compared to sequences of separate images is the possibility to exploit format-specific optimization methods like frame differencing. In addition, image streams are backward compatible with WWW clients which do not support moving images. Clients which do not support the multi-image feature of such formats but tolerate it will show the first image of the sequence. They will terminate the connection after having decoded the first image or just stop decoding.

3. A Prototype Implementation

3.1 Client Software

WWW clients supporting the multi-image option of multi-image graphics formats will automatically present a video when decoding an image sequence. This implies for the most widely used image format, GIF, that clients have to support at least the GIF87a specification. Most WWW clients, however, do not support this specification entirely, i.e. they particularly ignore the multi-image feature. This is due to the fact that GIF has been used in the internet only for the encoding of still images.

Figure 1: The format of a GIF image (upper) and a GIF stream (lower).

However, treatment of multiple images can easily be added to existing GIF decoder software. A GIF sequence consists of a global header and a series of separately encoded images (Figure 1). The difference between a GIF stream and a GIF image is only the number of images following the global header. There is no additional information or control structure accompanying the existence of more than one image. The global header does not contain information about the number of subsequent images.
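The block-level scan a multi-image decoder needs can be illustrated by a function that counts the images following the global header of a GIF stream. This is a sketch of the framing layer only, assuming the block layout of the GIF87a specification; it skips pixel data by walking the length-prefixed sub-blocks without performing any LZW decoding.

```python
def count_gif_images(data):
    """Count the image descriptors in a GIF byte stream."""
    assert data[:3] == b"GIF"
    pos = 6                                   # skip signature and version
    packed = data[pos + 4]                    # flags in the logical screen descriptor
    pos += 7                                  # skip the logical screen descriptor
    if packed & 0x80:                         # global color table present?
        pos += 3 * (2 << (packed & 0x07))     # 3 bytes per entry
    images = 0
    while pos < len(data):
        block = data[pos]; pos += 1
        if block == 0x3B:                     # trailer: end of stream
            break
        if block == 0x21:                     # extension block
            pos += 1                          # skip the extension label
        elif block == 0x2C:                   # image descriptor: one more image
            images += 1
            lpacked = data[pos + 8]
            pos += 9                          # position, size and flags
            if lpacked & 0x80:                # local color table present?
                pos += 3 * (2 << (lpacked & 0x07))
            pos += 1                          # LZW minimum code size
        # extensions and image data both end in length-prefixed sub-blocks
        while True:
            size = data[pos]; pos += 1
            if size == 0:                     # zero-length block terminates
                break
            pos += size
    return images
```

A stream decoder in a WWW client needs exactly this loop, continued for as long as data arrives: each image descriptor it encounters is decoded and displayed in place of its predecessor.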

An investigation of available GIF image decoder software showed that the required changes for an upgrade from an image decoder to a stream decoder are very small. Thus we modified the GIF decoders of two publicly available WWW clients (Chimera, Mosaic) in such a way that they continue decoding as long as a GIF data stream does not terminate. The modified versions present moving images on a WWW page until the WWW server stops sending images.

3.2 Server Software

A video stream is accessed like any other document via its universal resource identifier or locator (URI/URL). The server transmits documents regardless of type and size. It does not notice if a document is written to disk or directly decoded by the client. Therefore a video stream will be transmitted like any other document. For the simplest case of a pre-recorded video which has been encoded in a multi-image capable format there are no changes required to the server system at all.

Usually WWW servers try to transmit documents as fast as possible. Hence the frame rate at the client display depends on the quality of the transport system connection. The rate may be too high in a local environment and too low over slow links. The first case results in a time-compressed presentation of the video. The latter creates a backlog of frames, defeating the realtime capabilities. Our experiments showed that some changes to the server system are very useful in order to provide controlled and smooth delivery of image sequences.

We identified three different approaches to show image sequences. In the first approach, called the 'best effort' method, transmission and decoding are performed as fast as possible, with the assumption that the connection is either slower than or just fast enough for realtime display. If a transport system connection is not fast enough for realtime display, the frame rate will be low. In any case such a system will show each image of a sequence; no frames will be skipped. The second method is time synchronization. A video transmission system which provides time synchronization tries to transmit only those images of a sequence which fit into the time scale of the video. It will skip images if transmission or display are too slow, and it will delay playback at the receiver in the opposite case. A combination of the previous approaches, called 'best effort with upper rate limit', tries to reach realtime display but limits the frame rate to an upper bound.
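The skip-or-delay decision of time synchronization reduces to a small scheduling rule: from the wall-clock time elapsed since the start of playback, compute which frame is due, skip everything between it and the frame last sent, and delay if the sender is ahead of schedule. The sketch below illustrates this rule; the function name and interface are our own, not part of any standard.

```python
def next_frame(elapsed, fps, last_sent):
    """Return the index of the frame due at `elapsed` seconds into
    playback, or None if the last frame sent is still current."""
    due = int(elapsed * fps)     # frame that belongs at this instant
    if due <= last_sent:
        return None              # ahead of schedule: delay playback
    return due                   # frames last_sent+1 .. due-1 are skipped
```

The 'best effort with upper rate limit' variant follows from the same rule by clamping `fps` to the configured bound while never delaying longer than the time budget of one frame.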

The latter two methods require a component which controls playback at the client. In addition, this component needs a feedback mechanism between client and server in order to avoid overflowing the client's storage space. Synchronization by the client, however, would require a major change to the client's software. We therefore propose time synchronization by the server. A software module in the server controls the transmission speed between the video source and the WWW client.

3.3 Server Extension

The World Wide Web software provides a standardized method to include server-side extensions, called the Common Gateway Interface (CGI). It has been designed to allow access to information systems other than the WWW, e.g. WAIS or Gopher. But the CGI can be used to feed all kinds of data into the WWW system. In principle the CGI is an extension of the WWW name space of Universal Resource Identifiers to cover not only files but also the output of executable programs (Figure 2).

3.3.1 Solutions in Related Projects

Currently the Netscape server-push specification [Netscape95] provides the easiest way to include live video in WWW pages. A CGI program feeds a sequence of single images into the HTTP connection. These images are usually retrieved from the WWW server's file system, i.e. its hard disk. The file system serves as a mediator between the frame grabber and multiple clients. It decouples the transmission rates of the HTTP connections from the data rate of the frame grabber. But the detour through the file system proves disadvantageous at higher frame rates or higher system load. The number of simultaneous connections is limited by the I/O bandwidth of the file system.
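The wire format of server-push, as specified by Netscape, is a multipart/x-mixed-replace response in which each part replaces the one before it. The following sketch shows the framing a CGI program has to emit; the boundary string is an arbitrary choice and the frame bytes are placeholders.

```python
BOUNDARY = b"frame-boundary"   # any string not occurring in the image data

def push_header():
    """Response header opening a server-push stream."""
    return (b"HTTP/1.0 200 OK\r\n"
            b"Content-type: multipart/x-mixed-replace;boundary=" + BOUNDARY
            + b"\r\n\r\n--" + BOUNDARY + b"\r\n")

def push_part(image_bytes, subtype=b"gif"):
    """One frame of the stream; the client replaces the previous frame."""
    return (b"Content-type: image/" + subtype + b"\r\n\r\n"
            + image_bytes
            + b"\r\n--" + BOUNDARY + b"\r\n")
```

A CGI program writes the header once and then one part per frame for as long as the connection lasts, which is exactly where the file-system detour criticized above enters: each `image_bytes` is typically read from disk.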

3.3.2 The WebVideo Approach

We designed and implemented a CGI program as a video extension to the server, which works entirely in the main memory of the server. It establishes an image database in memory in order to avoid unnecessary file system accesses. This synchronizing CGI program feeds video streams to HTTP servers. It is invoked by the server in response to a client's request for an image stream. The CGI program is highly flexible, allowing different synchronization mechanisms, video sources and output formats.

Figure 2: HTTP servers can be extended via the common gateway interface (CGI). A URI points to an executable program (right) instead of a file.

Supported synchronization mechanisms are:

- best effort,
- time synchronization,
- best effort with an upper rate limit.

Synchronization mechanism, frame rate and upper rate bound are adjustable by the provider of hypertext documents. They can either be encoded into the URI or be fixed to a certain value. Encoding into the URI allows flexible adjustment for different streams.

Supported video sources are:

- pre-recorded image sequences read from files,
- live images retrieved from a shared system queue.

Besides synchronized playback of pre-recorded sequences, the video extension is able to retrieve an image sequence from a system queue. The system queue is a named shared memory space which can be fed by any live video source, e.g. a camera/frame grabber combination. In addition to synchronized streams, the system supports extraction of single images. Single images can be used to provide snapshots for directories of streams, or to build WWW pages which contain up-to-date images without including stream-oriented documents.

Supported output formats are:

- GIF image streams,
- server-push sequences of GIF or JPEG images.

The encoding of sequences of still images follows Netscape's specification for server-push animation. The default output format chosen by the server is GIF. However, the CGI program automatically recognizes whether the connected client supports the JPEG format. In this case the still images contained in the sequence are encoded as JPEG rather than GIF, saving transmission bandwidth due to JPEG's higher compression rates.
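The format decision can be made from the Accept header the client sends with its request. The helper below is a sketch of that decision under the assumption that JPEG support is announced as a media type in the Accept header; the function name is our own.

```python
def pick_format(accept_header):
    """Choose the output image format from the client's Accept header."""
    accept = accept_header.lower()
    if "image/jpeg" in accept or "image/*" in accept:
        return "jpeg"        # smaller frames at the same image size
    return "gif"             # default understood by all graphical clients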

For performance and availability reasons the World Wide Web relies on caching at different levels. Clients maintain local caches and institutions use caching WWW servers, called proxies, to reduce remote accesses. In the case of live video and animated graphics caching has to be avoided. Playback of a pre-recorded video stream from file will not be synchronized if the file is retrieved from a cache rather than from the synchronizing server extension. To avoid caching the server therefore marks stream oriented documents as already expired.
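Marking a stream document as already expired takes one additional response header. The sketch below shows the idea; the concrete date is an arbitrary moment in the past, and the helper name is ours.

```python
def no_cache_header():
    """Response header for a stream document that caches must not store."""
    return (b"HTTP/1.0 200 OK\r\n"
            b"Content-type: image/gif\r\n"
            b"Expires: Thu, 01 Jan 1970 00:00:00 GMT\r\n"   # already expired
            b"\r\n")
```

A cache or proxy honouring the Expires header will revalidate such a document on every access, so requests always reach the synchronizing server extension.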

3.4 Supporting Multiple Clients

A shared system queue as stream source allows the CGI program to support multiple WWW clients at the same time without significant performance degradation (Figure 3).

A video stream from a live source is encoded once in one of the supported image or stream formats. Each encoded image of this stream is put into a shared memory space to be accessible for the synchronizing CGI programs. Many instances of stream synchronizers may retrieve the encoded images simultaneously from the shared space of the image queue. They may even retrieve different images at the same time to keep up with the state of their client connections. Encoding the stream only once allows many clients to connect to a live source at the same time without overloading the server computationally.
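The essential property of the shared image queue is that frames are encoded exactly once while every synchronizing reader keeps its own position. The sketch below models this with per-frame sequence numbers; a plain list stands in for the shared memory space, and the class and method names are illustrative.

```python
class FrameQueue:
    """Bounded queue of encoded frames with independent readers."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.frames = []            # pairs of (sequence number, encoded bytes)
        self.next_seq = 0

    def put(self, encoded):
        """Called by the stream server: append one freshly encoded frame."""
        if len(self.frames) == self.capacity:
            self.frames.pop(0)      # drop the oldest frame
        self.frames.append((self.next_seq, encoded))
        self.next_seq += 1

    def get_after(self, last_seq):
        """Called by a stream synchronizer: the oldest frame newer than
        last_seq, or None if the reader is fully caught up."""
        for seq, data in self.frames:
            if seq > last_seq:
                return seq, data
        return None
```

A slow client's synchronizer simply falls behind and, once old frames have been dropped, resumes at the oldest frame still in the queue, which is exactly the frame-skipping behaviour described above.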

Figure 3: Many instances of synchronizing CGI programs retrieve the image stream simultaneously. Each of them serves one remote WWW client.

Figure 4: The stream server converts between the image format of a live source and the stream/image format used in the World Wide Web. The converted images are put into a shared memory queue.

An image stream from a live source is encoded by a stream server which fills the shared memory queue for the synchronizing CGI programs (Figure 4). The stream server's front-end connects directly to the live video source or digitizer. Its back-end serves a number of stream synchronizers via the shared memory queue described above. The main purpose of the stream server is conversion from the image data format of the video source to the target stream format for transmission to the WWW clients. The front-end comprises image decoder modules which accept different image formats. The back-end is currently equipped with GIF and JPEG encoders.

4. An Interactive Application

Live images are often produced by monitoring systems, either as video taken from a camera or as computer graphics generated by software. Many remote monitoring systems allow user interaction. Such remote control requires a user command channel from the display to the monitored system. The controlled system's feedback is then visually presented as a stream of computer graphics or live video from a camera.

4.1 User Interaction in WWW

The World Wide Web supports user interaction via forms. Forms are encoded as part of the base HTML document. They can contain different user interface elements like buttons, menus and text input fields. User input is not transmitted continuously but is collected by the client and transferred to the server at the user's request. Thus the transmission of user input fits seamlessly into the request-response mechanism of HTTP. The collected results of user interaction are encoded in a URI and sent with the HTTP request. The server extracts the information from the URI, processes it using a server extension which interprets the form data, and usually returns an HTTP response. The response carries a confirmation of the user input as a new WWW page which replaces the old one.

Explicit command confirmations are not necessary if feedback is given visually as a graphics stream. A remote control WWW page contains animated graphics showing the state of the remote system and forms for user input. The page is not exchanged to show a command confirmation. Instead, the WWW server is forced by its forms-evaluating extension to return an empty HTTP response (using the 'No Content' response code 204) to the client. The forms-evaluating extension forwards the user input in an adequate format to the controlled system, which in turn reflects its new behaviour through a video or graphics stream. The WWW client stays on the same page and shows the effect on the controlled system through the animated graphics or live video parts of the page.
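A forms-evaluating extension of this kind decodes the URI-encoded input, hands it to the controlled system, and answers with the empty 204 response. The sketch below illustrates the flow; the field names `train` and `dest` are hypothetical examples, and the controller hand-off is abstracted into a callback.

```python
from urllib.parse import parse_qs

def handle_command(query, send_to_controller):
    """Evaluate a form submission and keep the client on its page.
    `query` is the URI-encoded form data, e.g. "train=2&dest=station3"."""
    fields = parse_qs(query)            # decode the user input
    send_to_controller(fields)          # forward it to the controlled system
    return b"HTTP/1.0 204 No Content\r\n\r\n"   # empty response: page stays
```

Because the response body is empty, the only feedback the user sees is the changed behaviour of the controlled system in the inline video.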

Figure 5: User commands evaluated by server extensions control system parameters. Visual feedback is fed into the WWW by another server extension.

4.2 A Sample Scenario

We built a model railroad layout in our laboratory as a remote-controlled system, to serve as an example demonstrating the capabilities of inline video within the proposed WWW environment.

Encoding and decoding of GIF is fast enough to allow about 5 QCIF-sized frames per second in our set-up. We tested the performance in a local environment with a Sun workstation as server and Sun and Macintosh clients. The limiting factors in our demonstration scenario are the speed of the available frame grabber and the color conversions (dithering, colormap merging) for 8-bit pseudo-color X Window displays.

We do not yet exploit the frame differencing capabilities of GIF streams. We will add this feature to both the GIF encoder and the WWW clients in the near future. We expect higher frame rates because encoder and decoder will then have to process only the changed image parts. This will result in a speedup which is proportional to the ratio of static to dynamic image parts.
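The core of such a frame differencing encoder is finding the bounding rectangle of the changed pixels, since GIF allows each image of a sequence to cover only a sub-rectangle of the global screen. The sketch below assumes frames represented as lists of rows of pixel values; it is an illustration of the idea, not the planned encoder itself.

```python
def dirty_rect(prev, cur):
    """Return (left, top, width, height) of the region that changed
    between two frames, or None if the frames are identical."""
    rows = [y for y in range(len(cur)) if prev[y] != cur[y]]
    if not rows:
        return None                      # nothing changed: skip the frame
    cols = [x for y in rows
              for x in range(len(cur[y])) if prev[y][x] != cur[y][x]]
    left, top = min(cols), min(rows)
    return left, top, max(cols) - left + 1, max(rows) - top + 1
```

Only the returned rectangle has to be LZW-encoded and transmitted, which is where the expected speedup proportional to the static image portion comes from.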

Figure 6: Remote WWW users can operate the model railroad and watch it in realtime. The HTTP request shown in the picture is issued in order to get the contents of the inline image which is referenced by the base HTML document. The URI points to a stream synchronizing server extension. This server extension gets images from a live camera. The response to the HTTP request is an infinite image stream displayed at the client as video.

An HTML form is used to submit commands to the WWW server. The server forwards the commands to the model railroad controller via a serial interface. After hitting the 'Go!'-button the chosen train begins to move to the selected destination. An additional confirmation is not necessary.

4.3 Real Application

We are currently integrating the system with the World Wide Web front-end of a biochemical synthesis laboratory. Up to now, clients of the laboratory have requested the synthesis of oligonucleotides remotely via the WWW front-end. The synthesis has been performed offline by a robot, and the results, together with printed descriptions, have been sent back via the postal service.

Soon clients will operate the robot remotely and watch the synthesis. The status will be displayed as live video showing the equipment and as a graphics animation of the changing absorption spectrum. A client can dynamically modify synthesis parameters or even stop the process in case of problems. Of course the product still has to be sent via the postal service, but the client knows about its quality instantly.

5. Future Work

GIF uses the Lempel-Ziv-Welch algorithm to compress bitmaps. Unisys's patent on the LZW algorithm may lead to another common image format in the Web. But it is probable that a successor will support optimized encoding of image sequences as well. Candidates for a replacement of GIF are the Planetary Data System (PDS) format and Portable Network Graphics (PNG), which is currently under development. Both specifications mention multi-image extensions.

Upcoming WWW clients which support inline presentation of MPEG streams will allow integration of video into the WWW at much lower bandwidth than the current solutions. This is especially true for pre-encoded movies. But transmission of live video from a camera requires realtime encoding. Due to the computational cost of motion compensation, software encoding on desktop computers is currently not able to deliver MPEG streams with very high compression rates. The bandwidth requirements of such MPEG streams are smaller, but still comparable to a sequence of JPEG coded images. Nevertheless, we will add an MPEG back-end to the stream server mentioned above. This means more than just integrating available MPEG encoder software. The MPEG software has to be adapted to the stream server system in order to support multiple clients at different transmission rates at the same time while encoding only once.

6. Conclusions

We presented a scheme to include moving images in ordinary WWW pages without changing existing standards or protocols. The mechanism is based on the multi-image capabilities present in many image encoding standards and on the HTTP protocol. In combination with the already standardized dialog elements of HTML the system provides interactive video for the World Wide Web. Visual feedback to commands through animated graphics or live video provides a much smoother and more comprehensible user interface than explicit confirmations. We validated the concepts with a prototype implementation and the successful operation of a remote control scenario with inline video.

References

[Berne93] Tim Berners-Lee; Hypertext Transfer Protocol - A Stateless Search, Retrieve and Manipulation Protocol; 1993; http://www.w3.org/hypertext/WWW/Protocols/Overview.html

[Crocker94] G. Crocker: web2mush: Serving Interactive Resources to the Web, 1994; http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/DDay/crocker/tech.html

[Uhler94] S. Uhler: Incorporating real-time audio on the Web, 1994; http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/DDay/uhler/uhler.html

[GIF87] CompuServe, Incorporated: CompuServe GIF 87a, http://icib.igd.fhg.de/icib/it/defacto/company/compuserve/gif87a/gen.html

[Soo94] J. C. Soo: Live Multimedia over HTTP, 1994; http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/DDay/soo/www94a.html

[KaasPinTaub94] M. F. Kaashoek, T. Pinckney, J. A. Tauber: Dynamic Documents: Extensibility and Adaptability in the WWW, 1994; http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/DDay/pinckney/dd.html

[JPEG] International Organization for Standardization: Information Technology - Digital Compression and Coding of Continuous-tone Still Images; ISO/IEC DIS 10918-1; ISO 1991.

[MPEG] International Organization for Standardization: Information Technology - Coding of moving pictures and associated audio for digital storage up to about 1.5 Mbit/s; ISO/IEC DIS 11172; ISO 1992.

[PNG95] T. Boutell, M. Adler, L. D. Crocker, T. Lane: PNG (Portable Network Graphics) Specification, 1995; http://sunsite.unc.edu/boutell/png.html

[Netscape95] Netscape Communications Corporation: An Exploration of Dynamic Documents; 1995; http://home.netscape.com/assist/net_sites/dynamic_docs.html

[WWW] The World Wide Web Consortium; http://www.w3.org/hypertext/WWW/

[HTML] Specification of the HyperText Markup Language; http://www.w3.org/pub/WWW/MarkUp/MarkUp.html

[DTD] SGML Document Type Definition of the HyperText Markup Language; http://www.w3.org/pub/WWW/MarkUp/html3/html3.dtd