A Knowledge-Based Approach
to JPEG Acceleration

  1. JPEG Overview
       1. Compression
       2. Decompression
       3. Acceleration Techniques for the DCT
       4. Our Implementation
  2. Statistical Analysis
       1. Sparse Population
       2. Empty Columns and Rows
       3. Shape of non-zero parts
  3. A Knowledge-Based IDCT Algorithm
       1. Computation of structural information
       2. IDCT
            1. Improving the first 1-dimensional step
            2. Optimizing the second 1-dimensional step
            3. Choosing the best first step
  4. Results
  5. Conclusions
  6. References
Dr. Konrad Froitzheim, Heiner Wolf
Department of Distributed Systems
University of Ulm, Germany
Oberer Eselsberg
D 89069 Ulm / Germany
internet: frz@informatik.uni-ulm.de , wolf@informatik.uni-ulm.de

Abstract

JPEG picture compression and related algorithms are used not only for still picture compression, but to a growing degree also for moving picture compression in telecommunication applications. Real-time JPEG compression and decompression are crucial in these scenarios. We present a method to significantly improve the performance of software-based JPEG decompression. Key to these performance gains are adequate knowledge of the structure of the JPEG coded picture information and the transfer of structural information between consecutive processing steps. Our implementation achieved an 80% performance increase decompressing typical JPEG video streams.

Keywords:

Video, compression, decompression, JPEG, DCT, telecommunications

1. JPEG Overview

JPEG (Joint Photographic Experts Group) [ISO] is a compression scheme to reduce the size of continuous-tone digital images. It is particularly useful to save space on storage devices, where compression ratio and image quality are more important than compression speed or delay. A growing class of multimedia telecommunication applications such as video conferencing requires real-time compression and decompression of moving pictures. Many of those applications use JPEG or a related image compression method.

JPEG's 'lossy' baseline process is based on the Discrete Cosine Transform (DCT). It packs the transformed data in sequential order, using zero suppression in combination with Huffman symbol coding. We will focus on the IDCT related steps to show where our improvements apply.

Figure 1: JPEG block diagram

Compression

The baseline algorithm first separates the picture into color component planes, e.g. RGB. A subsequent color space conversion transforms RGB pixels into luminance and chrominance (YUV) values. The resulting planes are then decomposed into 8x8 pixel component blocks. These blocks are transformed by a forward DCT (FDCT) into the domain of 2-dimensional DCT (2D-DCT) basis images [RaoYip]. The FDCT concentrates the energy or information content in the upper left corner of the 8x8 block. Figure 2 shows an example. Note that large numbers appear only in the upper left corner.

Figure 2: FDCT example, only the upper left corner contains significant numbers

The 2-dimensional DCT and its inverse are orthogonal transforms. A DCT can be expressed as matrix multiplications:

FDCT:   C = Q * P * Q^T

where P denotes the color component block, Q the transformation matrix, and C the coefficient matrix.
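
For illustration, here is a minimal, unoptimized C sketch of this matrix formulation (a naive reference, not the fast factorizations discussed below; all names are ours):

    #include <math.h>

    #define N 8

    /* Q[k][j] = (c_k / 2) * cos((2j+1) k pi / 16): the 8-point DCT
     * matrix, with c_0 = 1/sqrt(2) and c_k = 1 otherwise. */
    static double Q[N][N];

    static void init_dct_matrix(void)
    {
        const double pi = 3.14159265358979323846;
        for (int k = 0; k < N; k++) {
            double ck = (k == 0) ? 1.0 / sqrt(2.0) : 1.0;
            for (int j = 0; j < N; j++)
                Q[k][j] = 0.5 * ck * cos((2 * j + 1) * k * pi / 16.0);
        }
    }

    /* FDCT as two matrix products: C = Q * P * Q^T. */
    static void fdct(const double P[N][N], double C[N][N])
    {
        double T[N][N];
        for (int i = 0; i < N; i++)          /* T = Q * P */
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += Q[i][k] * P[k][j];
                T[i][j] = s;
            }
        for (int i = 0; i < N; i++)          /* C = T * Q^T */
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += T[i][k] * Q[j][k];
                C[i][j] = s;
            }
    }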

The quantization prepares the blocks for efficient coding. Each DCT coefficient block is divided element-wise by an 8x8 quantization matrix. Typical quantization matrices used in telecommunication applications turn small coefficients and coefficients representing finer detail into zeros. The compression rate can be adjusted by changing the quantization matrix. Figure 3 shows the sample block after quantization.

Figure 3: Quantization: most coefficients are eliminated
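
A minimal sketch of this element-wise step in C (rounding convention simplified; names are ours):

    #include <math.h>

    /* Quantize: divide each DCT coefficient element-wise by the
     * corresponding entry of the quantization matrix, rounding to
     * the nearest integer. */
    static void quantize(const double C[8][8], const int Qm[8][8],
                         int Cq[8][8])
    {
        for (int i = 0; i < 8; i++)
            for (int j = 0; j < 8; j++)
                Cq[i][j] = (int)floor(C[i][j] / Qm[i][j] + 0.5);
    }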

The quantized array is then linearized in a zigzag fashion, zeros are suppressed, and the result is further reduced in size by an appropriate Huffman encoder. For further detail see [ISO]. In our example we achieve good compression; only 7 code symbols are required to represent the 64 coefficients:

(10) (0, -2) (-1) (-1) (-1) (0, 0, -1) (End_of_Block)

Decompression

The decoder reconstructs each block from the Huffman codes and dequantizes it by multiplying the result by the quantization factors used in the compression step. However, coefficients suppressed during compression remain zero. The resulting coefficient matrices are sparsely populated in a distinctive fashion: a few relatively large values are concentrated in the upper left corner, with many zeros in the right and lower parts. Chapter 2 quantifies the extent of this effect.

The inverse DCT (IDCT) then computes a pixel block P from the coefficient matrix C. This step is again a matrix multiplication with the DCT matrix Q:

IDCT:   P = Q^T * C * Q

The color space conversion back to RGB is the final step in the decompression process.

Acceleration Techniques for the DCT

Many attempts have been made to develop fast algorithms for the computationally intensive parts of JPEG compression and decompression. The biggest effort has been devoted to the optimization of the two-dimensional DCT.

A two-dimensional DCT, like any separable transform, can be implemented as two successive sets of 1D-DCTs [RaoYip]. Many different algorithms for 1D-DCTs have been proposed over the last 15 years. Chen's algorithm, based on sparse matrix factorization [ChenHarFra], reduced the number of arithmetic operations to 16 multiplications and 26 additions for an 8-point 1D-DCT. Others gave algorithms using even fewer multiplications [Wang] [Lee]. The theoretical lower bound for the number of multiplications is 11 [Duhamel]. [LoeLigMo] presented a class of such DCT algorithms.

In addition to the two consecutive series of 1D-DCTs, more sophisticated true 2-dimensional methods for the 2D-DCT have been published [KamRao] [ChoYunLee] [Vetterli]. They decrease the number of multiplications by a factor of 2, but increase the number of additions slightly.

With the arrival of advanced processor architectures, the determining factors for the computational cost have changed. An increasing number of general purpose processors are designed for very effective floating-point calculations. Some of them even provide single-cycle multiply-accumulate instructions (MAC). Such processors are best used by algorithms with well balanced multiplications and additions. Exploitations of special instructions such as the one-step MAC [LinFeig] focus on reducing the number of instructions rather than the number of multiplication operations.

Other DCT based compression schemes use frequency filtering [ThoBlinHias] to guarantee zero coefficients in well defined areas of the matrix, thus enabling tailored algorithms without calculations on coefficients of suppressed frequencies. The JPEG baseline algorithm as used in most telecommunication applications, however, does not produce well defined 'empty' areas in the coefficient matrices. We propose a fully JPEG-compatible decompression method based on dynamically generated knowledge about the matrix occupancy, without explicit frequency filtering in the compression step.

Our Implementation

In the course of the JVTOS multimedia collaboration project, we implemented the processing of multiple JPEG coded video streams in software. The target platform was the Apple Macintosh Quadra 840av. The 840av is equipped with a 40 MHz 68040 and the AT&T 3210 digital signal processor (DSP).

We decided on a parallel implementation with functional decomposition: the 68040 parses the data stream, converts from YUV to RGB, and moves the data into the display RAM, because the DSP does not have direct access to the frame buffer. The DSP performs the matrix reconstruction (Huffman) and IDCT steps. We chose Chen's 1-dimensional DCT scheme as the basis for our implementation. Although it might be somewhat slower than more recent DCT algorithms, it proved advantageous for the DSP-typical multiply/accumulate architecture: it maintains a good balance between multiplications and additions.

Initially we achieved decompression rates of 6 to 8 frames per second for 256x192 frames. In-depth analysis of the compression scheme and typical data configurations led to algorithmic improvements resulting in an 80% higher frame rate.

2. Statistical Analysis

Our analysis of JPEG compressed image sequences concentrated on typical scenarios found in the teleconferencing area. Of particular interest for the reduction of computational effort are the sparse population of the coefficient matrices, the occurrence of empty columns and rows, and the shape of the non-zero parts. This chapter presents the statistical analysis of five JPEG compressed image sequences from different video sources. They are of a short (20-40 seconds, 470-1000 frames), but statistically sufficient length: each 256x192 frame provides 768 values per coefficient position.

The analyzed image sequences are 'News woman', 'News anchor', 'Office', 'Movie', and 'Text'.

Figure 4: One frame of the 'News anchor' sequence.

Sparse Population

Telecommunication applications usually select high compression ratios (0.5 to 0.7 bit per pixel). This is obviously not good for picture quality, but it reduces bandwidth. Due to the high compression ratio the number of non-zero coefficients in the reconstructed matrix is small. Table 1 lists the percentage of non-zero coefficients.
             News       News       Office     Movie      Text
             woman      anchor
Y-plane       7.8 %      9.7 %      8.0 %     12.6 %     22.4 %
U/V-planes    2.2 %      2.3 %      2.4 %      2.1 %      1.5 %

Table 1: Percentage of non-zero coefficients in reconstructed matrices. Note that the chrominance planes contain fewer high frequency components, resulting in fewer non-zero coefficients.

Figure 5 visualizes the distribution of coefficients in the matrices for luminance (Y) and chrominance (U/V) planes. As expected most of the non-zero entries are found in the low frequency coefficient area.

Figure 5: Position of non-zero coefficients in reconstructed matrices. Data from 1000 frames; image sequence 'Office'.

Empty Columns and Rows

Another aspect of the sparse matrix population is the existence of empty columns and rows. Figure 5 suggests that many rows and columns in the coefficient matrices are empty, i.e. all components are zero. Figure 6 shows the percentage of matrices with a maximum of i occupied columns. Most matrices contain no more than 3 occupied columns; 95% of the U/V-plane matrices contain at most 2 occupied columns.

Figure 6: Percentage of matrices with a maximum of i (1 <= i <= 8) occupied columns, for the Y plane and the U/V planes.

The graphs for the rows look similar. Note that the only image sequence with many fully occupied matrices is 'Text'. The reason is that JPEG is not very effective for images full of tiny detail, for example characters. They produce lots of high frequency components in the luminance plane.

Shape of non-zero parts

The third interesting - and useful - fact is that most matrices with more than one non-zero coefficient exhibit an asymmetrically shaped distribution of these coefficients. They have either more occupied rows than columns or vice versa (figure 7).

Figure 7: Distribution of occupancy shapes in compressed image sequences, separated into vertically dominated (left), symmetric (middle) and horizontally dominated (right) matrices.

Figure 8 shows examples of typical distributions of non-zero coefficients.

Figure 8: Examples: Typical matrices show asymmetric distributions of non-zero coefficients. Dark fields hold non-zero values.

The asymmetric characteristic of coefficient matrices can be exploited by adaptive transform coding methods to achieve better compression ratios [HoGersho]. For the case of the JPEG baseline algorithm, knowledge about the shape of non-zero parts in coefficient matrices can be used to improve decompression speed.

Numerical mathematicians have developed effective techniques for sparsely populated matrices. These typically exploit geometric properties of the non-empty regions such as band or triangular shape. Other important preconditions for specialized algorithms are certain mathematical properties of the matrices, e.g. regularity, det(M) != 0, or positive definiteness [Locher]. As shown above, IDCT input matrices typically have empty columns or rows (=> det(C) = 0) and they are not positive definite. Nor are they band-shaped or triangular in the mathematical sense (see figure 8).

3. A Knowledge-Based IDCT Algorithm

The basis of our algorithm is the fact that zero coefficients do not contribute to the pixel values during the IDCT. We can therefore restrict the algorithm to the non-trivial values as far as possible without creating too much bookkeeping overhead.

The idea of utilizing the sparsity of coefficient matrices for speed-up has been presented before [HungMeng]. That work focused on reducing the number of multiplications needed for the IDCT: zero values in the coefficient matrices are identified and counted in order to select the best of several proposed IDCT algorithms. It neither explains how to obtain the structural information efficiently, nor does it optimize for processor architectures with multiply-accumulate instructions. Efficient extraction of information about the coefficient matrices and its exploitation in terms of instruction count is important and not trivial: recent microprocessors need about as many instructions for a calculation with zero as for a test against zero, and selective execution following such tests needs even more instructions. Our objective is therefore to avoid touching zero coefficients at all, not merely to suppress calculations with zero values.

Many input matrices for the IDCT have only one non-zero coefficient in the upper left corner (the DC value). Special treatment of this case, which avoids the IDCT altogether, achieved a speedup of 30%. This special case is simple to implement; a generic algorithm, however, needs to know dynamically where it can skip operations.
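
A sketch of this special case in C, assuming the decoder noted during matrix reconstruction that only the DC position is occupied (here via the row/column occupancy bitsets introduced in the next section). With the matrix form of the IDCT given above, a DC-only block transforms to a constant block of value DC/8; all names are ours:

    /* Fast path: if only row 0 and column 0 are occupied, the single
     * possible non-zero coefficient is the DC value, and the 8x8 IDCT
     * collapses to a constant block with every pixel equal to DC/8.
     * Returns 1 if the shortcut applied. */
    static int idct_dc_shortcut(const short coef[8][8],
                                unsigned row_bits, unsigned col_bits,
                                short out[8][8])
    {
        if (row_bits != 0x01 || col_bits != 0x01)
            return 0;                  /* more than the DC coefficient */

        short v = coef[0][0] / 8;      /* rounding simplified */
        for (int i = 0; i < 8; i++)
            for (int j = 0; j < 8; j++)
                out[i][j] = v;
        return 1;
    }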

Computation of structural information

An important (and time-consuming) task in JPEG decompression is the reconstruction of the coefficient matrices. Code symbols include a count of leading zeros (in the zig-zag scheme) and the value of the next non-zero element. The latter has to be written into the appropriate place in the matrix. We use this index to generate two bitsets, one for the rows and one for the columns. A bit is set if the corresponding row (column) contains at least one non-zero coefficient. The bitset data structure was chosen to optimize the loops over the rows (columns). Treating the bitset as a number provides a fast way to compare the number of occupied rows and columns.
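
A sketch of how such bitsets might be maintained during matrix reconstruction (the zig-zag table is the standard ISO order; function and variable names are ours):

    /* Natural (raster) position of each zig-zag index. */
    static const unsigned char zz[64] = {
         0,  1,  8, 16,  9,  2,  3, 10, 17, 24, 32, 25, 18, 11,  4,  5,
        12, 19, 26, 33, 40, 48, 41, 34, 27, 20, 13,  6,  7, 14, 21, 28,
        35, 42, 49, 56, 57, 50, 43, 36, 29, 22, 15, 23, 30, 37, 44, 51,
        58, 59, 52, 45, 38, 31, 39, 46, 53, 60, 61, 54, 47, 55, 62, 63
    };

    /* Place a decoded non-zero coefficient into the (raster-order)
     * matrix m and record its row and column in two occupancy bitsets
     * (bit i set <=> row/column i contains a non-zero coefficient).
     * zzi is the running zig-zag index after the leading-zero run. */
    static void place_coef(short m[64], int zzi, short value,
                           unsigned *row_bits, unsigned *col_bits)
    {
        int pos = zz[zzi];
        m[pos] = value;
        *row_bits |= 1u << (pos >> 3);   /* row    = pos / 8 */
        *col_bits |= 1u << (pos & 7);    /* column = pos % 8 */
    }

The decoder advances the zig-zag index by the zero run of each code symbol before calling place_coef(); at the end of a block, row_bits and col_bits describe the matrix occupancy at very little extra cost.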

The structural information gained in the matrix reconstruction is then used to control the adaptive IDCT algorithm. It is important to note that structural information already available in the matrix reconstruction phase is recorded for further use in the IDCT phase.

IDCT

Before we show how to use the structural information, we have to look at the IDCT one more time: the 2-dimensional IDCT is mathematically equivalent to two successive sets of 1-dimensional transforms. This property is exploited in many IDCT implementations including ours. The 1-dimensional inverse DCT is defined as follows:

    v_j = 1/2 * sum(k=0..7) c_k u_k cos((2j+1) k pi / 16),   c_0 = 1/sqrt(2), c_k = 1 for k > 0

where u_k are the IDCT coefficients and v_j are the output values of the transformation.

In the first step a 1-dimensional transform is applied to each column, the second step is a loop that performs the 1-dimensional transform on each row (figure 9).

Figure 9: Separation of the 2D-DCT into two 1-dimensional transforms. The first series results in a semi-transformed matrix. The second series completes the 2D transform.
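
A sketch of this separation in C, assuming a 1D routine idct_1d() (e.g. Chen's scheme) that transforms 8 values in place; names are ours:

    /* Assumed elsewhere: 8-point 1D IDCT, in place. */
    static void idct_1d(short v[8]);

    /* 2D IDCT by separation: a 1D IDCT on every column, then a
     * 1D IDCT on every row of the semi-transformed block. */
    static void idct_2d(short m[8][8])
    {
        short col[8];

        for (int j = 0; j < 8; j++) {            /* first series: columns */
            for (int i = 0; i < 8; i++) col[i] = m[i][j];
            idct_1d(col);
            for (int i = 0; i < 8; i++) m[i][j] = col[i];
        }
        for (int i = 0; i < 8; i++)              /* second series: rows */
            idct_1d(m[i]);
    }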

Improving the first 1-dimensional step

The separation into 1-dimensional transforms on columns allows for a first improvement: skip empty columns. A 1D-DCT on a column containing only zeros as coefficients results in a column of zeros as output values. Therefore the transform is solely applied to occupied columns, i.e. columns containing at least one non-zero coefficient. The column-loop tests our bitset to decide whether a transformation is necessary or not. After this step all the elements in the occupied columns are filled with unpredictable, mostly non-zero values (figure 10).

Figure 10: Transformation of non-zero columns
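
A sketch of the column loop with the bitset test, reusing the assumed idct_1d() routine from above:

    /* First series with empty-column skipping: a column is transformed
     * only if its bit is set in the occupancy bitset (bit j <=> column j). */
    static void idct_columns(short m[8][8], unsigned col_bits)
    {
        short col[8];

        for (int j = 0; j < 8; j++) {
            if (!(col_bits & (1u << j)))
                continue;                        /* all-zero column: skip */
            for (int i = 0; i < 8; i++) col[i] = m[i][j];
            idct_1d(col);
            for (int i = 0; i < 8; i++) m[i][j] = col[i];
        }
    }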

Optimizing the second 1-dimensional step

As shown above, every row contains at least one non-zero element after the column transformation (unless we encounter the trivial case of a completely zero matrix). We are able to use the column bitset a second time to choose a reduced transform. The rows are usually not completely filled. Therefore it is sufficient to perform reduced transformations, where only the content of occupied columns is used as input. A reduced transform avoids calculations with coefficients which are known to be zero. We define a reduced 1-dimensional inverse DCT of degree r as follows:

    v_j = 1/2 * sum(k=0..r-1) c_k u_k cos((2j+1) k pi / 16)

where again u_k are the IDCT coefficients and v_j are the output values of the transformation. This reduced IDCT produces the same results as the regular IDCT over 8 coefficients, with fewer calculations. A reduced IDCT of 8th degree is identical to the regular 8-point 1D-IDCT. Table 2 shows the number of instructions for different degrees of the reduced transformation.

degree          1     2     3     4     5     6     7     8
instructions   12    27    33    37    42    44    46    48

Table 2: Instruction count for the reduced 1-dimensional IDCT of r-th degree. The lower degrees show by far the best reduction of the instruction count.
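
For clarity, a direct-form sketch of the reduced transform (the instruction counts of table 2 refer to specialized fixed-degree routines in the spirit of Chen's factorization, not to this naive loop; names are ours):

    #include <math.h>

    /* Reduced 1D IDCT of degree r: only the first r coefficients are
     * used as input, the remaining ones are known to be zero. */
    static void idct_1d_reduced(const double u[8], double v[8], int r)
    {
        const double pi = 3.14159265358979323846;

        for (int j = 0; j < 8; j++) {
            double s = 0.0;
            for (int k = 0; k < r; k++) {
                double ck = (k == 0) ? 1.0 / sqrt(2.0) : 1.0;
                s += 0.5 * ck * u[k] * cos((2 * j + 1) * k * pi / 16.0);
            }
            v[j] = s;
        }
    }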

In the second step after the column transformation a reduced IDCT of r-th degree is applied to a row containing r non-zero elements. Figure 11 gives an example with a reduced IDCT of 5th degree in the second step. The thicker part of the arrows on the right side shows the input values used.

Figure 11: Reduced row transform

Note that we did not tailor our algorithm to ignore individual zeros. The administrative overhead would certainly defeat the purpose of such an action.

Choosing the best first step

The analysis shows that the matrices are usually asymmetric (compare figure 7). Most of them have either more occupied rows or more occupied columns. The first step fills every element in each occupied column with a non-zero value. Transforming a matrix with many non-zero columns and only a few non-zero rows, we build an almost completely filled matrix (see fig. 12a). The matrix stays almost empty in the opposite case, i.e. many rows and only a few columns.

Figure 12a/b: Distribution of non-zero entries in the semi-transformed matrix after the first series of 1D-IDCTs. The number of non-zero entries depends on the direction of the 1D-IDCT series.

The IDCT involves two matrix multiplications, which are of course associative:

    P = Q^T * (C * Q) = (Q^T * C) * Q

This associativity enables the use of the occupation bitsets to choose the best first step, i.e. to perform the first series of transforms in the direction that fills the semi-transformed matrix the least: if fewer rows than columns are occupied, the rows are transformed first; otherwise the columns.

Choosing the better direction not only reduces the number of entities (rows/columns) to process in the first step: it also enables a faster reduced IDCT in the second step by creating fewer non-zero values in the semi-transformed matrix.

Figure 12a shows a typical block of coefficients with more occupied columns than rows. Transformations on the 6 occupied columns result in 48 of 64 fields with non-zero values. The second step then requires 8 reduced transformations of 6th degree over the rows. In contrast, if the transforms of the first step are applied to the 3 occupied rows, we get only 24 of 64 fields with non-zero values. In this case the second step requires reduced transforms of only 3rd degree (figure 12b).
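
A sketch of the direction choice; idct_rows(), idct_rows_reduced() and idct_columns_reduced() are hypothetical counterparts of the idct_columns() routine above:

    /* Portable population count over the 8 occupancy bits. */
    static int bits_set(unsigned b)
    {
        int n = 0;
        while (b) { n += (int)(b & 1u); b >>= 1; }
        return n;
    }

    /* Adaptive 2D IDCT: run the first 1D series in the direction with
     * fewer occupied lines; the other direction is then handled with
     * reduced transforms whose degree is bounded by the same bitset. */
    static void idct_2d_adaptive(short m[8][8],
                                 unsigned row_bits, unsigned col_bits)
    {
        if (bits_set(row_bits) < bits_set(col_bits)) {
            idct_rows(m, row_bits);              /* few rows occupied */
            idct_columns_reduced(m, row_bits);
        } else {
            idct_columns(m, col_bits);           /* few columns occupied */
            idct_rows_reduced(m, col_bits);
        }
    }

The explicit bit count can be replaced by the even cheaper bitset-as-number comparison mentioned in the matrix reconstruction section.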

4. Results

We tested our accelerated algorithm in a teleconference system, where we experienced a drastically improved decompression speed. Moving from 7 frames/sec to 13 frames/sec makes a big difference for human perception.
 frame rate    basic       basic      occupied   +reduced   best         best row/column
[frames/sec]   algorithm   +DC only   columns    IDCT       row/column   +DC only
 News woman      7.6         9.9        9.7       11.8       12.6         14.1 (86%)
 News anchor     7.3         9.0        9.2       10.9       11.5         12.5 (70%)
 Office          7.5        10.8        9.6       12.0       12.6         14.5 (92%)
 Movie           7.0         8.9        8.6        9.9       10.6         11.5 (65%)
 Text            6.1         7.9        7.0        8.0        8.1          8.8 (45%)

Table 3: Frame rates for the test videos with frame size 256x192 pixels (speedup relative to the basic algorithm in parentheses).

The subjective experience was verified experimentally in a test scenario measuring the performance of the different versions of the decompression algorithm.

The 'basic algorithm' uses the standard method of two series of Chen IDCTs without other algorithmic improvements. However, the IDCT kernel is hand-optimized and the code is the same as used in the best version (Table 3, last column). The 'occupied columns' version uses the proposed fast information extraction mechanism in the Huffman step. The publicly available JPEG code version 5 of the Independent JPEG Group (IJG) performs in between the 'basic algorithm' and the 'occupied columns' version: it avoids the IDCT on columns containing only zero coefficients, but the decision whether to skip an IDCT or execute it takes significant time on processors with multiply-accumulate architectures.

Table 3 gives the performance in frames per second, figure 13 shows the time for the IDCT. The percentages in the last column reflect the improvement compared to the basic algorithm. The total overhead to compute and use the structural information is about 9% of the overall decompression time in our implementation.

Figure 13: Time for IDCT computation with different acceleration techniques (256x192 pixel per frame).

5. Conclusions

We have shown that JPEG decompression can be improved drastically with knowledge about the IDCT input. This knowledge is already present in the matrix reconstruction phase. We developed and implemented means to retain it and exploit it during the IDCT.

Our algorithm almost doubles the decompression frame rate for typical telecommunication video streams. We expect that this result and other algorithmic advances, for example in Huffman decoding, will pave the way for software based real time JPEG or MPEG processing in every workstation.

6. References

[ChenHarFra] W.A. Chen, C. Harrison, S.C. Fralick: A Fast Computational Algorithm for the Discrete Cosine Transform; IEEE Transactions on Communications, Vol. COM-25, No. 9, Sept. 1977, page 1004.

[ChoYunLee] Nam Ik Cho, Il Dong Yun, Sang Uk Lee: A Fast Algorithm for 2D-DCT; ICASSP '91; 1991; page 2197.

[Duhamel] P. Duhamel, H. H'Mida: New 2^n DCT Algorithms suitable for VLSI Implementation; Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP-87, Dallas, Apr. 1987, page 1805.

[HoGersho] Y.S. Ho, A. Gersho: Classified transform coding of images using vector quantization; ICASSP '89, Glasgow, May 1989; page 1890.

[HungMeng] Andy C. Hung, Teresa H.-Y. Meng: Statistical Inverse Discrete Cosine Transforms for Image Compression; Proceedings SPIE vol. 2187, page 196.

[ISO] International Organization for Standardization: Information Technology - Digital Compression and Coding of Continuous-tone Still Images; ISO/IEC DIS 10918-1; ISO 1991.

[KamRao] F.A. Kamangar, K.R. Rao: Fast Algorithms for the 2-D Discrete Cosine Transform; IEEE Transactions on Computers, Vol. C-31, No. 9, Sept. 1982, page 899.

[Lee] Byeong Lee: A New Algorithm to Compute the Discrete Cosine Transform; IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 6, Dec. 1984, page 1243.

[LinFeig] E. Linzer, E. Feig: New Scaled DCT Algorithms for Fused Multiply/Add Architectures; IEEE Intl. Conference on Acoustics, Speech, and Signal Processing '91; 1991, page 2201.

[Locher] F. Locher: Numerische Mathematik für Informatiker; Berlin, 1992.

[LoeLigMo] Ch. Loeffler, A. Ligtenberg, G.S. Moschytz: Practical Fast 1-D DCT Algorithms with 11 Multiplications; IEEE Intl. Conference on Acoustics, Speech, and Signal Processing '89; 1989, page 988.

[MaChSwWo] L. Matterne, D. Chong, B. McSweeney, R. Woudsma: A flexible high performance 2-D discrete cosine transform IC; ISCAS 89, International Symposium Circuits and Systems; May 1989, page 618.

[RaoYip] K.R. Rao, P. Yip: Discrete Cosine Transform, Academic Press, 1990.

[ThoBlinHias] V. Thomas, J.L. Blin, M. Hias: Experiments on visibility thresholds of DCT Coefficients; Picture Coding Symposium 88; Italy, Sept. 1988; page 3.1-1.

[Vetterli] M. Vetterli: Fast 2-D Discrete Cosine Transform; ICASSP '85; 1985, page 1538.

[Wang] Z. Wang: Fast Algorithms for the Discrete W-Transform and for the Discrete Fourier Transform; IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 4, Aug. 1984, page 803.