ACM Multimedia 97 - Electronic Proceedings

November 8-14, 1997

Crowne Plaza Hotel, Seattle, USA


Optimizing the Data Cache Performance of a Software MPEG-2 Video Decoder

Peter Soderquist
School of Electrical Engineering
Cornell University
Ithaca, New York 14853
(607) 255-0314
pgs@cs.cornell.edu
http://www.ee.cornell.edu/~pgs/

Miriam Leeser
Dept. of Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts 02115
(617) 373-3814
mel@ece.neu.edu
http://www.ece.neu.edu/




Abstract

Multimedia functionality has become an established component of core computer workloads. MPEG-2 video decoding represents a particularly important and computationally demanding example. Instruction set extensions like Intel's MMX significantly reduce the computational challenges of this and other multimedia algorithms. However, partly as a consequence of this improved CPU performance, memory subsystem deficiencies have now become the major barrier to increased performance. Decoding MPEG-2 video data in software makes significant bandwidth demands on memory subsystems, demands which are seriously aggravated by cache inefficiencies. Conventional data caches generate many times more cache-memory traffic than required; even the best configurations produce double the minimum necessary to support decoding. Improving efficiency requires understanding the behavior of the decoder and the composition of its data set. We provide an analysis of the memory and cache behavior of software MPEG-2 video decoding, and lay out a set of cache-oriented architectural enhancements which offer relief for the problem of excess cache-memory bandwidth. Our results show that cache-sensitive handling of different data types can reduce traffic by 50 percent or more.

Keywords

MPEG-2, digital video, caches, computer architecture


Table of Contents

1   Introduction
2   Background
3   MPEG Overview
4   Related Work
5   Experimental Setup
6   Cache Simulation Results
7   Decoder Internal Analysis
8   Cache-Oriented Enhancements
9   Conclusions
10   Acknowledgments
References


1   Introduction

Multimedia computing has become a practical reality, and computer architectures are changing in response. Extensions for continuous media processing are a current or planned part of every market-leading instruction set architecture, as shown in Table 1. The primary feature of these extensions is SIMD-style processing of small data types. While contemporary microprocessor datapaths are 32 or 64 bits wide, multimedia data usually consists of 16-bit or 8-bit integers. Furthermore, multimedia algorithms commonly feature an abundance of data parallelism. Therefore, modifying wide datapaths to process multiple small data types in a single instruction promises significant performance improvements.

Architecture   ISA Extension
x86            MMX (MultiMedia eXtensions)
SPARC          VIS (Visual Instruction Set)
PA-RISC        MAX-2 (Multimedia Acceleration eXtensions)
Alpha          MVI (Motion Video Instruction)
MIPS           MDMX (MIPS Digital Media eXtensions)
PowerPC        VMX (Video and Multimedia eXtensions)*

Table 1. ISA multimedia extensions for popular architectures (* = proposed)

All of these extensions are targeted, at least partially, at MPEG-2 video decoding. Collectively, they mount a formidable assault on its computational complexity. Unfortunately, they largely fail to address the significant challenge that MPEG processing poses for memory subsystems, which have become the primary performance bottleneck. The high data rate, large data set sizes, and distinctive memory access patterns of MPEG exert a particular strain on caches. Manipulating standard parameters (cache size, associativity, and line size) fails to cost-effectively reduce excess memory traffic. While miss rates remain acceptable, standard caches, even very large and highly associative ones, generate significant excess cache-memory traffic. Multimedia instruction extensions actually exacerbate this problem by enabling the CPU to consume and produce more data in fewer cycles.

This cache inefficiency is seriously limiting for desktop multitasking systems. A foregrounded video process can potentially cripple communications, print spooling, system management, network agent, and other background tasks. The video playback may also suffer from dropped frames, jitter, blocking, or other annoying artifacts. For low-power/cost systems and information appliances, cache inefficiency can have a direct cost impact, requiring the use of faster or higher capacity components than strictly necessary to achieve specified functionality at the required quality level. This can drive up system cost, increase power consumption, or even prevent implementation.

To explore the cache efficiency problem and its solutions, we examine the traffic and data cache behavior of an MPEG-2 decoder running Main Level streams on a general-purpose microprocessor. Trace-driven cache simulations of typical sequences reveal the memory behavior of the decoder. We show that, contrary to some predictions, there is a benefit to even relatively small, simple caches. However, even very large and complex caches generate significant excess traffic. An analysis of the data types utilized by the decoder and their patterns of access provides the basis for proposed architectural enhancements. We also show that even relatively simple measures, such as selective caching of specific data types, can dramatically improve efficiency, reducing cache-memory traffic by 50% or more for a wide range of cache sizes.



2   Background

Consider the problem of decoding MPEG-2 in software on a general-purpose computing platform. The machine type of primary interest is a typical desktop PC or workstation, as envisioned in Figure 1. The diagram also illustrates the typical flow of video data in the system. Imagine the user is viewing, for example, an entry in a multimedia encyclopedia. Compressed data is transferred from the file system (e.g. CD-ROM or hard disk) by direct memory access (DMA) and buffered in the main memory. The CPU reads the data through its cache hierarchy and writes the decoded, dithered RGB video, one frame at a time, to main memory where it is stored in an image of the displayed video window. Finally, another DMA transfer brings the video data to the frame buffer where it eventually appears on the display.

[IMG]
Figure 1. Typical desktop machine with software MPEG-2 decoding traffic

It is clear that there is a large amount of data, much of it time-sensitive, being transferred through the system. Most of this is concentrated on the main system bus. At the very least, there are two streams each of encoded and decoded video being concurrently transferred in and out of main memory, amounting to a minimum sustained load of 63 Mbytes/s. This ignores the bandwidth required by other applications running on the system and sharing interconnect, main memory, and peripherals. Any excess memory traffic generated by cache inefficiency will further exacerbate this situation. Unfortunately, our simulations indicate that standard caches generate at least twice the minimum required level of cache-memory traffic, typically many times more, amounting to a significant strain on system capacity. The result is some combination of dropped frames, reduced frame rate, and degradation of overall system performance.
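
The 63 Mbytes/s figure can be roughly reconstructed from the stream parameters used in our experiments (Section 5). The following is an approximate accounting, assuming 3 bytes per displayed 4:4:4 RGB pixel and counting each stream twice, once into and once out of main memory:

    \[
    \underbrace{2\,(704 \cdot 480 \cdot 3~\mathrm{B} \cdot 30~\mathrm{frames/s})}_{\text{decoded video: written, then read}}
    \;+\;
    \underbrace{2\,(6~\mathrm{Mbits/s}\,/\,8)}_{\text{encoded video: DMA in, then CPU read}}
    \;\approx\; 60.8 + 1.5 \;\approx\; 62~\mathrm{Mbytes/s},
    \]

in line with the quoted figure; the exact value depends on rounding and on modest overheads not counted in this sketch.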



3   MPEG Overview

Appreciating the challenges of supporting video decoding requires some understanding of the MPEG standard [4]. MPEG attacks both the spatial and temporal redundancy of video signals to achieve compression. Video data is broken down into 8 by 8 pixel Blocks and passed through a discrete cosine transform (DCT). The resulting spatial frequency coefficients are quantized, run-length encoded, and then further compressed with an entropy coding algorithm. To exploit temporal redundancy, MPEG encoding uses motion compensation with three different types of frames. I (intra) frames contain a complete image, compressed for spatial redundancy only. P (predicted) frames are built from 16 by 16 fragments known as macroblocks. These consist primarily of pixels from the closest previous I or P frame (the reference frame), translated as a group from their location in the source. This information is stored as a vector representing the translation, and a DCT-encoded difference term, requiring far fewer bits than the original image fragment. B (bidirectional) frames can use the closest two I or P pictures - one before and one after in temporal order - as reference frames. Information not present in reference frames is encoded spatially on a block-by-block basis. All of the data in P and B frames is also subject to run-length and entropy coding.
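
To make the reconstruction step concrete, the following C sketch shows the motion-compensated reconstruction of a single forward-predicted luminance macroblock. It is a simplified illustration of the technique, not the reference decoder's code: it assumes full-pel motion vectors and omits half-pel interpolation, bidirectional averaging, chrominance, and frame-edge handling.

    #define MB_SIZE 16  /* macroblock dimension in pixels */

    /* Reconstruct one forward-predicted 16x16 luminance macroblock:
       copy the motion-translated patch from the reference frame, then
       add the IDCT-decoded difference term, clamping to [0,255]. */
    void reconstruct_mb(unsigned char *cur, const unsigned char *ref,
                        int width,           /* frame width in pixels */
                        int mb_x, int mb_y,  /* macroblock origin     */
                        int mv_x, int mv_y,  /* decoded motion vector */
                        const short diff[MB_SIZE][MB_SIZE])
    {
        int x, y;
        for (y = 0; y < MB_SIZE; y++) {
            const unsigned char *src =
                ref + (long)(mb_y + y + mv_y) * width + (mb_x + mv_x);
            unsigned char *dst =
                cur + (long)(mb_y + y) * width + mb_x;
            for (x = 0; x < MB_SIZE; x++) {
                int v = src[x] + diff[y][x];  /* prediction + residual */
                dst[x] = (unsigned char)(v < 0 ? 0 : v > 255 ? 255 : v);
            }
        }
    }

Note how each macroblock touches two large data structures at widely separated addresses: the reference frame, which is read, and the frame under construction, which is written. This access pattern drives the cache behavior analyzed in Section 7.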

[IMG]
Figure 2. Typical sequence of MPEG frames, showing interframe dependencies

Figure 2 shows the interframe dependencies of the different frame types, superimposed on the displayed frame order. As a result of these dependencies, the frames must be decoded in the order 1, 4, 2, 3, 7, 5, 6, 10, 8, 9 rather than in display order. Interframe dependencies and the properties and sequence of frame types determine in critical ways the flow pattern of MPEG data and the nature of the hardware support required.

[IMG]
Figure 3. MPEG bitstream structure

The decoder reads the MPEG data as a stream of bits. Unique bit patterns, known as startcodes, mark the divisions between different sections of the data. The bitstream has a hierarchical structure, shown in simplified form in Figure 3. A Sequence (video clip) consists of groups of pictures (GOP's). A GOP contains at least one I frame and typically a number of dependent P and B frames; Figure 2 shows a possible GOP of 9 frames (pictures) followed by the I frame of the subsequent GOP. Pictures consist of collections of macroblocks called slices. The ``headers'' shown at each level contain parameters relevant to each bitstream type but no actual image data.
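
Startcodes are byte-aligned 32-bit patterns beginning with the prefix 0x000001; the fourth byte identifies the unit that follows. A minimal scanner in C (the function name and reporting format are our own choices):

    #include <stdio.h>

    /* Scan a buffer for MPEG startcodes: the byte-aligned prefix
       00 00 01 followed by a code byte identifying the unit
       (e.g. 0xB3 = sequence header, 0xB8 = GOP, 0x00 = picture,
       0x01-0xAF = slice). */
    void scan_startcodes(const unsigned char *buf, size_t len)
    {
        size_t i;
        for (i = 0; i + 3 < len; i++) {
            if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01) {
                printf("startcode 0x%02X at offset %lu\n",
                       buf[i + 3], (unsigned long)i);
                i += 2;  /* continue scanning at the code byte */
            }
        }
    }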



4   Related Work

Deficiencies in the MPEG-2 decoding performance of systems are being actively addressed. This section discusses several significant methods for performance enhancement under investigation by researchers both in industry and academia. Some of these are merely proposals, while others have already been implemented in hardware or are forthcoming. We explore the benefits and disadvantages of each approach, and assess its impact on the problem of excess cache-memory bandwidth.

4.1   Media Processors

A new breed of microprocessor, the so-called ``media processor,'' has emerged in just the past few years. These chips combine programmability with specialized hardware to support multiple concurrent multimedia operations, including MPEG-2 decoding, and usually reside on peripheral cards in desktop systems. Their manufacturers, which include Chromatic Research, Fujitsu, Mitsubishi, Philips, and Samsung, claim that media processors, not the host CPU, are the best present and long-term solution for providing multimedia functionality in computer systems.

To their advantage, media processors reduce memory bandwidth requirements, since they deal in compressed data, and they take a significant computational load off the host. They are also good at handling the real-time constraints of multimedia, while present-day operating systems and memory hierarchies are not. Finally, present-day multimedia functions consist of a limited set of operations tied to international standards, making them an efficient target for highly specialized solutions.

However, performing multimedia functions on the host processor is inherently less expensive than adding specialized hardware. Memory systems are also improving to meet the needs of multimedia, as illustrated by Intel's AGP and Direct RDRAM initiatives. Operating systems are likely to become more real-time oriented as they evolve. More fundamentally, multimedia data constitute a new first-order data type, on the same order as floating-point. These functions are becoming central to what computers do, not ``peripheral'' like mass storage or printing. As applications evolve, media and other operations will become more and more interleaved - someday viewing a video clip will be as seamless as reading e-mail or editing a spreadsheet. Host multimedia processing best supports the construction of highly integrated, fully featured, yet portable applications. Hardware integration trends and the powerful vested interest of CPU manufacturers also favor this outcome.

Nevertheless, parts of video processing, such as color space conversion, scaling, and dithering, are likely to remain in specialized hardware for quite some time. These operations are straightforward to implement in hardware, and are largely data-independent and therefore not subject to pipeline hazards and other impediments. They are more akin to truly ``peripheral'' operations.


4.2   SIMD Instruction Extensions

Multimedia instruction extensions such as MMX, VIS, MAX-2, and the others in Table 1 achieve speedup primarily by performing SIMD-style operations on multimedia data types. These ISA enhancements are an essential step toward improving microprocessor performance on multimedia applications - as well as ensuring the continued viability of host processing. However, these operations tend to increase rather than relieve the strain on memory bandwidth. By consuming more operands per unit time, multimedia instructions expose the underlying weakness of the cache, system bus, and main memory. The caches in the MMX-enhanced Pentium are double the size of those in the earlier Pentium for this very reason. Intel researchers even attribute the lower-than-expected speedup of MPEG under MMX to memory and I/O system deficiencies [9].
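
As a small illustration, a saturating add of eight pixels, which would otherwise require eight scalar add-and-clamp sequences, reduces to a single MMX PADDUSB operation. The sketch below uses the C intrinsics declared in mmintrin.h by MMX-aware compilers, and assumes 8-byte-aligned rows:

    #include <mmintrin.h>  /* MMX intrinsics (MMX-aware C compilers) */

    /* Add two rows of 8 unsigned 8-bit pixels with saturation at 255.
       A single PADDUSB replaces eight scalar add-and-clamp steps. */
    void add_saturate_row8(unsigned char *dst,
                           const unsigned char *a, const unsigned char *b)
    {
        __m64 va = *(const __m64 *)a;   /* load 8 pixels at once */
        __m64 vb = *(const __m64 *)b;
        *(__m64 *)dst = _mm_adds_pu8(va, vb);
        _mm_empty();  /* EMMS: release the shared FP/MMX register state */
    }

Precisely because such code consumes and produces eight bytes per instruction, it draws data through the cache that much faster, sharpening the bandwidth problem described above.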


4.3   Video/Graphics Data Buses

There have been proposals for implementing special interconnect to transfer image data from the host processor to the frame buffer, bypassing slower general-purpose I/O buses. The Intel Accelerated Graphics Port (AGP) is the most prominent such effort, intended to provide a direct link between main memory and the graphics subsystem. While primarily meant to support the high bandwidth of 3D graphics rendering on the host processor, it is a definite boon to video as well. This method provides another significant way to boost the capabilities of host multimedia processing relative to custom hardware solutions. However, it does nothing in and of itself to alleviate wasted cache-memory traffic or reduce overall memory-system bandwidth needs. The undiminished requirement for extremely high-bandwidth memory is reflected in Intel's support of the Rambus architecture for future PC memory systems.


4.4   Prefetching

Hardware prefetching has been suggested as a remedy for inadequate MPEG performance [11]. Yet it is primarily promoted as a means to reduce miss rates, without much consideration of cache-memory traffic, which tends to increase with prefetching, especially under the more aggressive schemes. Prefetching is essentially a form of latency hiding; for memory-bound problems, it merely exposes underlying bandwidth problems, and tends to generate excess memory traffic of its own. Selectively applied, prefetching may provide a boost to compute-intensive phases of MPEG and other multimedia algorithms. However, the emphasis of our research is not so much on getting values into the cache early as on keeping them around for as long as they are needed.


4.5   Our Approach: Cache-Oriented Enhancements

The methods discussed above largely bypass the problem of excess cache-memory traffic; some of them even exacerbate it. None of them directly targets cache inefficiency, and even prefetching focuses on reducing miss rates at the possible expense of memory traffic. We suggest a different type of solution, driven by the internal dynamics of the decoder itself. Exploiting knowledge of individual data types - their sizes, access types, and access patterns - can lead to effective architectural solutions to cache inefficiency.



5   Experimental Setup

Solving the problem of excess cache-memory traffic requires first establishing its extent, and how critical system parameters affect this behavior. Our results come from trace-driven cache simulations of video stream decoding. The clips themselves are progressive MPEG-2 Main Level sequences, with a resolution of 704 by 480 pixels at 30 frames/s and a compressed bitrate of 6 Mbits/s. These are comparable in perceptual quality to conventional analog broadcast (e.g. NTSC, PAL). The sequences are chosen to be representative of different types of programming.

5.1   Machine and Software Environment

For experimental purposes we use a simplified machine model, limiting the microprocessor to a single level of caching. We further assume that MPEG-2 decoding is the only active process. The decoder is a version of mpeg2decode from the MPEG Software Simulation Group [1], running on a SuperSPARC processor under Solaris. This decoder has the advantage of portability and relative system independence, with the disadvantage that performance is suboptimal for any given architecture. Since it originates with the body responsible for MPEG, it complies closely with the standard. Finally, it is available in source code form, an essential feature for research purposes.

We have chosen to treat the decoder implementation as a given, making only minimal modifications to support integration with our simulation tools. Trace generation is performed by QPT2 (the Quick Profiler and Tracer) [6], a component of the Wisconsin Architectural Research Toolkit (WARTS). We have implemented the cache simulator, Pacino, to support the unique qualities of continuous media applications, which differ considerably from conventional benchmarks of the SPEC variety in their consumption and production of data and their interaction with the operating system.


5.2   Steady-State Simulation

To limit the considerable storage and CPU cycle requirements of trace-driven simulation, we restrict each run to one GOP, the largest logical bitstream unit below a sequence. Simulations run as if at steady state - i.e. preceded and followed by a large number of other GOP's, as if plucked from the middle of the sequence. This eliminates the side effects of program startup and termination. To achieve this, the results of running initialization code are not logged, the cache is primed with data to avoid cold-start effects, and dirty lines are not flushed at the end of the GOP.


5.3   Color Space and Bandwidth

The simulation framework implements one significant performance optimization beyond the generic hardware model. Decoded video is sent to the frame buffer in 4:2:0 YCrCb form, the native representation of MPEG, rather than 4:4:4 RGB. YCrCb is a color representation with one luminance (grayscale) and two chrominance components per pixel, rather than one each of red, green, and blue. In 4:2:0 YCrCb, the chrominance components are subsampled in both dimensions. We assume that upsampling, color space conversion, and any dithering are done on the fly by the video display subsystem. These features are becoming common in workstations and budget PC video accelerators alike. According to prior work on MPEG-1 decoders, this processing can account for almost 25% of total execution time if performed in software [8]. In addition, the 4:2:0 representation of an image is only half the size of the 4:4:4 version, reducing the minimum sustained system bus load to 31.5 Mbytes/s. Finally, these conversion operations account for so many memory and instruction references that they make trace-driven simulation prohibitive - a sufficiently compelling reason to avoid them.
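
The halving follows directly from the sampling ratios: in 4:2:0, each 2 by 2 block of pixels shares a single Cr and a single Cb sample, so

    \[
    1 + \tfrac{1}{4} + \tfrac{1}{4} = 1.5~\text{bytes/pixel},
    \]

half the 3 bytes per pixel of 4:4:4 RGB.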



6   Cache Simulation Results

Trace-driven cache simulations clarify how data requests from the MPEG-2 decoder translate into what traffic is seen by the memory. First of all, there is a distinct benefit to caching video decoder data, even using naive, generic schemes. It has occasionally been suggested that caches are critically inefficient for video data. Accordingly, several media processors dispense with data caches altogether in favor of SRAM banks - essentially large register files - managed by software [5,3,2]. Yet we have found that despite the large set sizes and the magnitude of data consumed and discarded, there is sufficient re-use of values for caching to significantly reduce the required memory bandwidth.

For example, decoding one Group of Pictures from a typical MPEG-2 stream generates 2.1 x 10^8 memory requests. A 16 Kbyte, direct-mapped cache generates 3.9 x 10^7 words of memory traffic. If we estimate conservatively that each word of cache data represents a memory request, then even a relatively small, simple cache keeps almost 85% of traffic off the memory bus. In fact, caches typically load several words at a time, while many CPU memory requests are for data types smaller than a word. Therefore, data caches can make far more efficient use of available memory bandwidth than registers alone. Even very clever register management would have trouble competing with anything but the smallest caches, so even information appliance-type platforms might benefit from implementing an actual cache.

The remainder of this section examines in more detail how the basic parameters of a simple cache (cache size, set associativity, and line size) affect memory traffic as well as misses. All of the caches in our simulations implement a write-back policy: a modified line is written to memory only upon replacement. Not only is this the most popular write policy in actual implementations, but it typically generates far less memory traffic than write-through, the principal alternative. Write-allocate, where a write miss causes the loading of the associated line, is also a common feature of all simulated caches.
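
The core of a trace-driven simulator implementing these policies is compact. The following direct-mapped sketch is our own simplified illustration, not Pacino itself; it counts cache-memory traffic in words, charging a full line for each fetch and each dirty writeback:

    #include <stdlib.h>

    #define LINE_BYTES 32
    #define WORDS_PER_LINE (LINE_BYTES / 4)

    typedef struct {
        unsigned long tag;
        int valid, dirty;
    } line_t;

    typedef struct {
        line_t *lines;
        unsigned long nlines;         /* cache size / line size     */
        unsigned long traffic_words;  /* words moved to/from memory */
    } cache_t;

    cache_t *cache_new(unsigned long size_bytes)
    {
        cache_t *c = malloc(sizeof *c);
        c->nlines = size_bytes / LINE_BYTES;
        c->lines = calloc(c->nlines, sizeof(line_t));
        c->traffic_words = 0;
        return c;
    }

    /* One reference against a direct-mapped, write-back,
       write-allocate cache. */
    void cache_access(cache_t *c, unsigned long addr, int is_write)
    {
        unsigned long lineno = addr / LINE_BYTES;
        line_t *l = &c->lines[lineno % c->nlines];
        unsigned long tag = lineno / c->nlines;

        if (!l->valid || l->tag != tag) {           /* miss */
            if (l->valid && l->dirty)
                c->traffic_words += WORDS_PER_LINE; /* write back victim */
            c->traffic_words += WORDS_PER_LINE;     /* fetch line, on
                                                       reads and writes  */
            l->valid = 1;
            l->dirty = 0;
            l->tag = tag;
        }
        if (is_write)
            l->dirty = 1;
    }

Feeding the decoder's address trace through cache_access() and comparing traffic_words against the perfect-cache minimum described in the next paragraph yields the ratios reported in this section.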

If we assume a ``perfect'' cache, the traffic required to support decoding is equal to the size of the encoded stream, which is read in from memory, plus that of the decoded video data, which is written back. This absolute minimum cache-memory traffic is used as a yardstick for comparing the performance of different configurations in our simulations.

6.1   Cache Size

The size of a cache is its most significant design parameter, certainly from a cost standpoint. Because cache size usually increases or decreases by factors of two, the decision of how large a cache to implement in a system is pivotal. For Main Level MPEG-2 decoding, the cache-memory traffic as a function of cache size and associativity shows little variation between video sequences. Figure 4 shows a representative plot, assuming a standard line size of 32 bytes. This is superimposed on a surface showing the minimum possible cache-memory traffic for the sequence, consisting of the encoded and decoded video streams combined.

[IMG]
Figure 4. Typical cache-memory traffic for MPEG-2 Main Level decoding, over minimum possible traffic level

The most prominent feature of the traffic function is a large plateau, at a level of 6.3 times the minimum value. Except for direct-mapped caches, which produce a peak of 16 times the minimum, cache-memory traffic changes little with increasing cache size for most of the range. Traffic starts to roll off at 1 Mbyte, and plummets at 2 Mbytes. This reflects the 2 Mbyte size of the decoder data set. Larger caches show negligible improvement, with the additional space providing no extra benefit. However, the smallest measured value is still almost 2 times higher than the minimum possible value.

These data imply that a 32 Kbyte cache is just as good - or bad - for MPEG-2 Main Level decoding as a 64 Kbyte or 128 Kbyte one, for example. Improving on cache-memory traffic requires a 1 Mbyte cache or larger, and even a 2 Mbyte or larger cache only brings traffic down to double the absolute minimum, leaving plenty of room for improvement.


6.2   Associativity

Increasing set associativity is a popular method for getting higher performance out of smaller caches. The memory traffic effects of varying set associativity are also visible in Figure 4. Going from a direct-mapped cache to a 2-way set-associative one can reduce memory traffic by as much as 50% for small caches. Increasing associativity to 4 can squeeze out almost another 10% improvement over the direct-mapped case. Set sizes greater than 4, however, show minimal benefit across all cache sizes. Since increasingly higher levels of associativity add considerably to cache cost, complexity, and access time, such enhancements are not justified.

While memory traffic is our primary focus, we would prefer not to adversely affect miss rates. As a function of cache size and set associativity, miss rates show the same behavioral pattern as memory traffic. Unlike for cache-memory traffic, however, the performance is more than satisfactory, with an average miss rate below 0.5%.


6.3   Line Size

Line size, another fundamental parameter, is less costly to experiment with than cache size. Subblock placement can help decouple the size of cache lines and that of the memory bus. Unfortunately, there is contention between miss rate and memory traffic minimization. Low miss rates call for larger lines than the typical 32 bytes, as illustrated in Figure 5. Larger lines tend to provide superior spatial locality, but require more data to be read and possibly written back on a miss. For this reason, minimal memory traffic occurs with the smaller lines. This relationship is readily apparent in Figure 6, where for the smallest caches, the largest line sizes lead to cache-memory traffic almost 200 times the absolute minimum.
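
A first-order model makes the tension explicit. With line size L in words, miss count M(L), and dirty-victim fraction d, the cache-memory traffic is approximately

    \[
    T(L) \;\approx\; M(L) \cdot L \cdot (1 + d).
    \]

While spatial locality lasts, M(L) falls roughly in proportion to 1/L and T stays flat; once lines outgrow the reuse granularity of streaming video data, M(L) falls more slowly than 1/L, and T grows with L even as the miss rate continues to improve.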

[IMG]
Figure 5. Typical miss rates for MPEG-2 Main Level decoding over different cache sizes

[IMG]
Figure 6. Typical cache-memory traffic for MPEG-2 Main Level decoding over different cache sizes

Balancing the demands of miss rates and memory traffic requires further investigation. For the time being, maintaining the ubiquitous 32 byte line size seems sensible. The interests of memory traffic reduction argue against a switch to larger values.


6.4   Summary

It is clear that conventional cache techniques provide limited relief from excess cache-memory traffic. The simplest way to improve memory traffic and miss rates is to use a very large cache - but a large cache is not always feasible, and our simulations show that improving MPEG-2 decoding performance this way requires very large caches indeed. In any case, one would prefer to extract better performance from smaller, lower-cost resources, improving performance through increased efficiency rather than brute force.



7   Decoder Internal Analysis

This section explores the internal functioning of a software-based MPEG-2 video decoder, as a first step to explaining and rectifying its suboptimal use of cache resources. First, we dissect the different kinds of data used by the decoder, including compressed input data, image output data, and all of the intermediate types. Second, we examine how these data objects are used in the process of decoding, and how this affects the performance of the cache.

7.1   Data Set Composition

The data set of an MPEG-2 decoder - the information that must be available to the program in the course of its execution - consists of several different data types which serve distinct purposes. As a result, they are quite heterogeneous with respect to the types and patterns of access utilized by the decoder, as well as the amount of memory space required. The following classes account for the global and static local values in user space:

- Input: the buffered portion of the compressed bitstream currently being decoded.
- Output: the completed picture being written out for display.
- Tabular: read-only quantization matrices and variable-length-code lookup tables.
- Reference: the decoded I and P frames needed to reconstruct motion-compensated pictures.
- Block: working storage for the block and macroblock currently being decoded, including DCT coefficients.
- State: bitstream parameters, position counters, and other decoder bookkeeping.

This partitioning of data types may not be explicit within the decoder source, but represents a higher-level abstraction of the data needed for decoder operation. Most of these data types are both read and written. The exceptions are the tabular data, which are read-only, and the output data, which are write-only. Within a memory hierarchy, reading and writing are not symmetric operations; read misses and write misses in caches have very different latencies. The fact that some MPEG decoder data types are only read or only written can be exploited for performance optimization.

The essential properties of the different data types are summarized in Table 2. Note how, in terms of size, the reference and output types completely dominate the others. With respect to caching, this means that their presence will tend to repeatedly expel the other types from the cache, except in very large data caches. No matter how many times the other values are re-used, capacity limitations alone will ensure that reference and output data, as they are updated, throw the other data types out.

Data Type    Access Type   Size       Fraction of References
Input        read/write    2 KB        2.7%
Output       write only    500 KB      3.9%
Tabular      read only     5 KB        5.5%
Reference    read/write    1500 KB    23.7%
Block        read/write    1.5 KB     31.4%
State        read/write    0.5 KB     25.6%

Table 2. Summary of decoder data types, sizes, access types, and proportion of memory references accounted for

Ranking the data types in terms of the number of references rather than by size gives a very different picture. Block and state data account for by far the most memory requests. Reference data is next, but output data is in next to last place, accounting for 8 times fewer memory requests than block data. Note that the percentages don't add up to 100%; approximately 7.2% of data references are due to temporary variables and library functions.


7.2   Decoder Program Flow

The hierarchical composition of the MPEG bitstream and its intrinsic sequentiality constrain the structure of the decoder program. Figure 7 shows the phases of operation for a single GOP from the perspective of data accesses. As the diagram makes evident, execution proceeds as a set of nested loops corresponding to the different data levels in the bitstream shown in Figure 3. The ``initialization'' and ``termination'' blocks represent the operations excluded by steady-state simulation.

[IMG]
Figure 7. Model of MPEG-2 software decoder showing phases of operation for a GOP

Table 3 shows which data types are accessed in the different phases of operation. Parsing header information and reading in block data involves operating on relatively small data types, and results in a small active cache set. However, during reconstruction, the decoder accesses portions of the reference frames and copies them into the frame currently under construction. The Inverse DCT (IDCT) phase focuses on smaller data types, but merging the decoded pixels back into the new picture requires once again accessing large data types. Writing the completed frame requires traversing significant portions of reference and output data. Notice that the decoder accesses state data on a fairly continuous basis.

Phase of Operation    input   output   tabular   reference   block   state
read headers            X                 X                             X
read block              X                 X                     X       X
reconstruct MB                                        X                 X
IDCT                                      X                     X       X
merge MB data                                         X         X       X
write frame                      X                    X                 X

Table 3. Data types accessed in each phase of decoder operation

Each iteration through the macroblock loop accesses a new portion of one or two reference frames. Each new P or B frame repeats this traversal through its reference frame(s), recalling significant fractions of each frame back into the cache. During the ``write frame'' phase, the decoder spends most of its time copying the current frame from the reference space to the output space. For all but the largest caches, this has the effect of flushing out almost everything except video data for the next displayed picture. The other data types must then be reloaded for each new picture, or possibly several times per picture for small caches, resulting in wasted cache-memory traffic.


7.3   System and Audio Interaction

Our simulations assume that MPEG video decoding is the only active process. In reality, video is usually viewed in conjunction with audio data, which requires the concurrent execution of an MPEG audio codec and an MPEG system parser to de-multiplex and synchronize the two media. We expect that running these other processes will not dramatically affect our results, since the amount of data they operate on at any given time is relatively small. The net effect is that of having a slightly smaller cache, and our simulations show that memory traffic levels are fairly insensitive to small changes in cache size. Nevertheless, the presence of these other programs is a good argument for keeping the cache footprint of the video data as small as possible.



8   Cache-Oriented Enhancements

Our analysis of decoder behavior and data composition motivates a specific approach to performance optimization, where the goal is to improve cache efficiency for video decoding without adversely impacting other performance metrics or other applications. In particular, there are great potential benefits in treating different data types distinctly, guided by their different sizes, access types, and access patterns.

For example, write-only values, like video output data, are clearly a waste of cache space. Likewise, while block and state data account for most of the memory requests, their relatively small size means that larger data types, like reference data, systematically crowd them out of the cache. Preventing blatant cache pollution and the predation of small data types with high re-use can yield considerable cache-memory traffic savings. To this end, we are evaluating the following traffic-reduction techniques: selective caching, cache locking, scratch memory, data reordering, and cache partitioning.

Many of these methods are relatively simple to implement, and most have precedents in other architectures, since they have performance benefits beyond video decoding. Their efficacy for MPEG-2 processing might encourage their broader adoption. In any case, the main challenge is to apply these techniques in a manner that enhances decoder performance while avoiding excessive cost and complexity. The remainder of this section considers each method in turn.

8.1   Selective Caching

Data objects with no possibility of re-use steal cache space from other data, which leads to excess memory traffic when the displaced data are loaded in once again. Video output data is a perfect example, since it is written but never read by the CPU, and occupies a fairly large amount of space. Bypassing the cache entirely in favor of direct storage to main memory promises more efficient use of cache resources.

In our simulations, we have found that excluding video output data alone from the cache reduces cache-memory traffic by up to 50%. Across the configurations considered, not caching video output yields a 25% reduction on average. The improvements in cache-memory traffic are global, yet the shape of the curve is not substantially modified: the plateau out to 512 Kbytes persists, as does the swift drop at 2 Mbytes. Excluding output data is even more helpful for miss rates, which drop by a maximum of 85% from earlier levels, and 60% on average.
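
In simulation, modeling this exclusion requires only an address-range check in front of the cache. The sketch below builds on the cache_access() illustration from Section 6 and is, like it, purely illustrative; the bounds of the output buffer are assumed known to the decoder:

    /* Illustrative bounds of the video output buffer. */
    static unsigned long out_base, out_limit;
    static unsigned long uncached_traffic_words;

    /* Route write-only output data around the cache: stores go
       straight to memory, so they neither allocate lines nor evict
       the decoder's reusable block, state, and tabular data. */
    void access_selective(cache_t *c, unsigned long addr, int is_write)
    {
        if (is_write && addr >= out_base && addr < out_limit) {
            uncached_traffic_words += 1;  /* one word direct to memory */
            return;
        }
        cache_access(c, addr, is_write);  /* all other data as before */
    }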

There are several distinct ways of implementing this feature. In the PA-RISC architecture, load and store instructions can carry ``locality hints'' which signal that the data referred to are unlikely to be re-used; if supported by the particular implementation, the processor can elect to bypass the cache [7]. The UltraSPARC has block load and store instructions, which move groups of data directly between registers and main memory at high speed [10], allowing potentially more efficient use of processor bandwidth. Finally, the PowerPC architecture provides for marking areas of memory as non-cacheable; accesses to such regions bypass the cache transparently and automatically. The cache hint and block load/store approaches require more explicit programming effort.


8.2   Cache Locking

One way to protect small but frequently reused data types, like input, state, and tabular values, from victimization by raw video data is to lock the parts of the cache which contain the vulnerable data. While locked, these cache lines are invisible to the replacement algorithm, and the contents will not be thrown out only to be re-loaded when needed again.

This is accomplished by special machine instructions which execute the locking and unlocking. There are two basic variations on this technique. Static locking simply freezes the tag and contents of the affected line, allowing for the writing of values but not replacement; the line is associated with the same portion of main memory until unlocked. Dynamic locking is somewhat more flexible, treating locked lines as an extension of the register set, with special instructions to copy contents directly to and from main memory. This scheme is used in the Cyrix MediaGX processor to speed the software emulation of legacy PC hardware standards.
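
Hypothetical primitives make the intended usage concrete. Neither function below corresponds to a real instruction set; they are stand-ins for whatever lock and unlock operations a platform might provide:

    /* Hypothetical locking primitives: placeholders for
       platform-specific lock/unlock instructions. */
    static void cache_lock(const void *addr, unsigned long bytes)
    { (void)addr; (void)bytes; }
    static void cache_unlock(const void *addr, unsigned long bytes)
    { (void)addr; (void)bytes; }

    static short quant_tables[2][64];                  /* tabular data  */
    static struct { int picture_type, mb_row; } state; /* decoder state */

    void decode_gop_locked(void)
    {
        /* Pin the small, high-reuse data for the duration of one GOP
           so that reference and output traffic cannot evict it. */
        cache_lock(quant_tables, sizeof quant_tables);
        cache_lock(&state, sizeof state);

        /* ... decode all pictures in the GOP ... */

        cache_unlock(&state, sizeof state);
        cache_unlock(quant_tables, sizeof quant_tables);
    }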


8.3   Scratch Memory

Another means of providing a safe haven for vulnerable data is to put them in a separate memory. Many processors today consist to a large extent not of logic but of SRAM, in the form of caches. It is easy to imagine implementing a portion of the memory space itself on the die - not as a cache, but as addressable memory. Data which are subject to premature ejection from the cache could then be kept in this space. The components of the MPEG-2 data set which would most benefit from this are relatively small, so only a few Kbytes would be required. This ``scratch memory'' would be relatively inexpensive to implement in hardware, smaller than most caches, and far simpler than any of them due to the lack of tags and logic for lookup, replacement, etc. Similar memories have been a feature of Intel microcontrollers for quite some time, for storing Page 0 of the address space.

A scratch memory provides potentially very fast access to important data, since there is no possibility of a read or write miss, and the delay of searching tags can be eliminated. Intel microcontrollers even use special addressing modes to access Page 0 memory which are more concise and faster than those used for off-chip addresses. Unlike with cache locking, cache performance and capacity are not compromised since the scratch memory is off to the side. Of course, getting a hardware feature used in embedded systems, where operating systems and even compilers are optional, to work effectively in an interactive, multitasking environment will take careful consideration, but the potential benefits are considerable.
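
On a hypothetical implementation, the compiler and linker would do most of the work. For instance, using GCC's section attribute (real syntax) with a ".scratch" section mapped to on-die memory (our assumption), the vulnerable data simply land in the protected space:

    /* Place small, high-reuse decoder data in a linker section that a
       hypothetical platform maps to on-die scratch memory. The section
       name and its mapping are assumptions of this sketch; the
       attribute syntax is GCC's. */
    #define SCRATCH __attribute__((section(".scratch")))

    SCRATCH static short block[6][64];         /* ~1.5 KB block data   */
    SCRATCH static short quant_tables[2][64];  /* tabular data         */
    SCRATCH static struct {
        int picture_type, mb_row, mb_col;      /* decoder state (part) */
    } state;

Accesses to these objects could never miss and would require no tag lookup, at the cost of making the allocation decision at build time rather than dynamically.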


8.4   Data Reordering

This category of enhancement is a departure from our standard procedure of not tampering with the decoder itself. Our goal is not to improve computational efficiency, a task which we leave to the many others working on that problem. Rather, we suggest that modifications to how the decoder software manages and addresses memory can provide significant improvements in cache efficiency - some of which may require hardware support to implement properly.

For example, there is, strictly speaking, no essential need for the output data type. B frame data, considered separately, is never used by the decoder to construct other frames; the only time it is read is when it is copied to the output buffer space. I and P pictures need to remain available to serve as reference frames, yet it turns out that in the natural flow of decoder data, by the time these frames are copied to the output space they are no longer needed. If pictures could be transferred by DMA directly from the reference frame space where they are stored, rather than routed through the output data, the result would be greater efficiency. It would eliminate all of the cycle-consuming copying of output data, and transform the B frame data into write-only values which could be efficiently excluded from the cache.
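
A sketch of the buffer management this implies: instead of copying completed pictures into a distinct output area, the decoder rotates pointers among three frame stores and hands finished pictures directly to the display transfer. The dma_display() call is a placeholder for whatever interface initiates the DMA; start-of-sequence handling is omitted, so the rotation below is correct only in the steady state our simulations assume:

    /* Three frame stores suffice: forward reference, backward
       reference, and the picture under construction. */
    static unsigned char *fwd_ref, *bwd_ref, *current;

    /* Placeholder for the mechanism that transfers a finished
       picture from its frame store to the display subsystem. */
    void dma_display(const unsigned char *frame);

    void finish_picture(int is_reference)
    {
        if (is_reference) {
            /* New I or P picture: recycle the old forward reference,
               promote the backward reference, install the new picture,
               then display the new forward reference - it precedes the
               B pictures that will be decoded next. */
            unsigned char *recycled = fwd_ref;
            fwd_ref = bwd_ref;
            bwd_ref = current;
            current = recycled;
            dma_display(fwd_ref);
        } else {
            /* B picture: displayed immediately, never referenced
               again; it could equally be written around the cache. */
            dma_display(current);
        }
    }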


8.5   Cache Partitioning

It is possible to take the basic idea of cache locking in a slightly different direction. With cache partitioning, particular data types are relegated to specific sections of the cache, which continue to behave like cache. However, rather than locking in the physical address associated with a line, or treating cache lines as extensions of the register set, it is as if each data type has its own separate, smaller cache space for storing its values.

This method provides very precise control over cache behavior, and an extremely powerful tool for the optimal management of cache resources. However, it requires a level of hardware complexity not demanded by the methods discussed previously. Also, the task of efficiently exploiting the available capabilities in software is considerably more challenging.


8.6   Summary

None of the methods proposed are mutually exclusive, although some of them, like cache locking and scratch memory, are arguably redundant if used in conjunction. Whether one of these techniques or some combination of them provides the most efficient solution remains to be seen. Also, each one raises issues of implementation cost and complexity, interaction with the operating system and other processes, and software interfacing and portability which need to be more thoroughly addressed. We will investigate these issues and perform a more detailed evaluation of these architectural enhancements in future work.



9   Conclusions

The effectiveness of computer memory subsystems, including caches, is currently a significant barrier to improved performance on multimedia applications. We have documented the memory traffic and data cache behavior of MPEG-2 video decoding on a general-purpose computer, and demonstrated that standard caches produce a significant excess of cache-memory traffic. While almost any cache is superior to none - even simple caches can reduce required memory bandwidth considerably - experimenting with basic cache parameters like cache size, associativity, and line size has a limited ability to reduce cache-memory traffic. The best value that can be achieved is double the absolute minimum traffic required. However, cache-oriented architectural enhancements, driven by an understanding of decoder behavior, can dramatically improve cache efficiency. These enhancements include selective caching, cache locking, scratch memory, data reordering, and cache partitioning. All have in common an emphasis on treating different elements of the data distinctly, accommodating their unique properties - size, usage patterns, temporal locality, etc. This approach provides for optimal management of available cache resources. We expect that further refinement of these techniques will improve the video decoding performance of general-purpose microprocessors, and encourage the broader proliferation of digital video platforms and applications.



10   Acknowledgments

This research was supported in part by NSF grant CCR-9696196. Miriam Leeser is the recipient of an NSF Young Investigator Award. The research is also supported by generous donations from Sun Microsystems. The authors would like to thank Shantanu Tarafdar and Brian Smith for useful discussions.



References

[1]
Stefan Eckart et al. ISO/IEC MPEG-2 software video codec. In Proc. Digital Video Compression: Algorithms and Technologies 1995, pages 100-109. SPIE, Jan. 1995.
[2]
Peter N. Glaskowsky. Fujitsu aims media processor at DVD. Microprocessor Report, 10:11-13, November 1996.
[3]
Linley Gwennap. Mitsubishi designs DVD decoder. Microprocessor Report, 10:1,6-9, December 1996.
[4]
International Organization for Standardization. ISO/IEC JTC1/SC29/WG11/602 13818-2 Committee Draft (MPEG-2), Nov. 1993.
[5]
Paul Kalapathy. Hardware-software interactions on MPACT. IEEE Micro, 17(2):20-26, Mar./Apr. 1997.
[6]
James R. Larus. Efficient program tracing. IEEE Computer, 26(5):52-61, May 1993.
[7]
Ruby B. Lee. Subworld parallelism with MAX-2. IEEE Micro, 16(4):51-59, August 1996.
[8]
Ketan Patel et al. Performance of a software MPEG video decoder. In Proc. 1st ACM Intl. Conf. on Multimedia, pages 75-82. ACM, Aug. 1993.
[9]
Alex Peleg et al. Intel MMX for multimedia PCs. Comm. of the ACM, 40(1):25-38, Jan. 1997.
[10]
Marc Tremblay et al. VIS speeds new media processing. IEEE Micro, 16(4):10-20, August 1996.
[11]
Daniel F. Zucker et al. A comparison of hardware prefetching techniques for multimedia. Technical Report CSL-TR-95-683, Stanford University Depts. of Electrical Engineering and Computer Science, Stanford, CA, Dec. 1995.
