HD Video Transcoding Strategies using Multicore Media Processors: Part 2 – Flexible Architecture

Delivering video across a variety of platforms involving multiple codecs can be efficiently handled by multicore media processors. Part Two explains the architectural requirements for flexible processing.

By Bahman Barazesh, Senior Technical Manager, and George Kustka, Senior Video Architect, LSI Corporation

Video/Imaging DesignWire
(4/12/2010 8:30:02 AM)

Multicore Encoder Architecture

A typical video transcoder implementation requires a video decoder (SD, 720p, or 1080p), possibly resizing or distributing the YUV output to another core or device, and a video encoder producing CIF, SD, 720p, or 1080p resolution. Turning our attention to full decode/encode transcoding, we will see that the same principles also apply to efficient transcoders in which decoder parameters, such as motion vectors, are reused by the encoder to reduce encoder complexity.

As we learned in part one, H.264 and MPEG-4 Part 10 are the same standard. It defines its output in terms of a Network Abstraction Layer, or NAL. A given picture in a base layer or enhancement layer may consist of a single NAL Unit (NALU) or be partitioned into multiple NALUs to meet system requirements. Reasons for partitioning into multiple NALUs include reduction of NALU size to meet network requirements, an increased number of synchronization points for error resilience, and buffer size constraints in the system. Numerous error-recovery techniques were described as part of the H.264 development process, and these techniques apply to other codec standards as well.
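To make the NALU structure concrete, the sketch below splits an H.264 Annex B byte stream into NAL units and reads the five-bit nal_unit_type field from the first payload byte. The helper names are our own; the start-code pattern (0x000001) and the header bit layout (forbidden_zero_bit, nal_ref_idc, nal_unit_type) come from the H.264 specification.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: find the next Annex B start code (0x000001) at or
 * after offset `pos`. Returns the offset of the first NALU payload byte,
 * or -1 if no further start code is found. */
static long next_nalu_start(const uint8_t *buf, size_t len, size_t pos)
{
    for (size_t i = pos; i + 3 <= len; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1)
            return (long)(i + 3);
    }
    return -1;
}

/* The five-bit nal_unit_type field in the first payload byte; e.g. type 7
 * is a sequence parameter set and type 5 is an IDR slice. */
static int nalu_type(uint8_t first_byte)
{
    return first_byte & 0x1F;
}
```

A multi-NALU picture simply produces several such units back to back, each an independent synchronization point for the decoder.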

Some applications require a single NALU per picture for compatibility with existing equipment and for coding efficiency. In single-NALU applications, the issues of network bandwidth and error resiliency are handled in other ways. When coding a picture as a single NALU, several data dependencies exist that propagate throughout the picture. While partitioning into multiple NALUs has been proposed as a simple way to break these data dependencies and distribute work among multiple processors, it is also possible to handle the dependencies in-line, with minimal effect on coding efficiency and processor requirements. A flexible multiple-processor video codec architecture can address both sets of requirements.

In this article, we’ll focus on the multiple-NALU architecture, which defines slices that can be encoded separately with minimal dependency and that are better suited to the robust transmission required by video-conferencing applications. High-definition video encoding at 1080p (1920×1080) and 720p (1280×720) resolutions requires several media-processing cores for real-time implementation at 30 to 60 frames per second (fps). The processing cores may even span several multicore digital signal processor (DSP) devices. The principles of operation that we’ll describe in this article for the H.264 encoder also apply to H.263 and MPEG-4 encoders.
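A quick back-of-the-envelope calculation shows why several cores are needed. The macroblock counts follow directly from the resolutions above; the per-core throughput figure in the example is an illustrative assumption, not a number from the article.

```c
/* Macroblocks per frame for a given resolution; H.264 macroblocks are
 * 16x16, with picture dimensions rounded up to whole macroblocks. */
static long mbs_per_frame(int width, int height)
{
    return (long)((width + 15) / 16) * ((height + 15) / 16);
}

/* Cores needed for real-time encoding, given an assumed per-core
 * throughput in macroblocks per second (illustrative only). */
static int cores_needed(int width, int height, int fps,
                        long mbs_per_sec_per_core)
{
    long load = mbs_per_frame(width, height) * fps;
    return (int)((load + mbs_per_sec_per_core - 1) / mbs_per_sec_per_core);
}
```

For example, 1080p is 120 × 68 = 8,160 macroblocks per frame, or 244,800 macroblocks per second at 30 fps; if one core could sustain, say, 50,000 macroblocks per second, five cores would be required.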

Task partitioning between DSP cores can be accomplished in several different ways. One traditional approach is functional partitioning, which consists of allocating the computational load as evenly as possible among several DSP cores.
Figure 1: Block Diagram of H.264 Encoder


For example, the inter prediction and intra prediction in Figure 1 can be assigned to one core, while another core implements transforms and quantization, and a third core runs de-blocking filtering and entropy coding. This approach has the advantage of generality and, hence, supports single-slice as well as multiple-slice implementations. However, there are some drawbacks. To use all DSP cores, a pipelined architecture is required, in which each DSP core performs its task on a new frame while the next-stage DSP core performs another task on the previous frame. This results in higher latency due to the pipelined nature of the architecture. Also, it is more difficult to balance the computational load evenly among cores, because each functional block has a different level of complexity and processor load.
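The latency cost is easy to quantify. In a K-stage functional pipeline, throughput stays at the frame rate (one frame leaves the pipeline every frame interval), but a given frame is not finished until it has passed through all K stages. A minimal sketch:

```c
/* End-to-end latency of a K-stage functional pipeline: each stage holds a
 * frame for one frame interval, so a frame completes K intervals after it
 * enters, even though one frame still finishes every interval. */
static double pipeline_latency_ms(int stages, double fps)
{
    return stages * 1000.0 / fps;
}
```

At 30 fps, splitting the encoder across three functional stages adds 100 ms of end-to-end delay before any stall is even considered, which matters for conversational applications such as video conferencing.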

Figure 2: Functional Partitioning Pipeline Architecture


In the functional partitioning shown in Figure 2, let us assume that the function of motion estimation is performed over two time intervals. It is then followed by motion compensation, quantization, and reconstruction, each in its own time interval. This figure illustrates one of the drawbacks of functionally pipelined partitions. Data required by a process is not always available when needed, causing it to stall. In this case, the reconstructed picture needed for prediction of the next picture is not available when needed for motion estimation on that picture. One workaround is performing motion estimation on the source pictures, not reconstructed pictures. This workaround provides decent motion estimation in many cases, but the encoder cannot jointly optimize motion compensation and residual quantization error.
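To make the motion-estimation step concrete, the sketch below is a basic full-search motion estimator using the sum of absolute differences (SAD). It works the same way whether `ref` points at a reconstructed reference picture or, as in the workaround above, at a source picture; the block size, search range, and function names are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* Sum of absolute differences between a current block and a candidate
 * reference block; both are 8-bit luma planes with the same stride. */
static long sad_block(const uint8_t *cur, const uint8_t *ref,
                      int stride, int bw, int bh)
{
    long sad = 0;
    for (int y = 0; y < bh; y++)
        for (int x = 0; x < bw; x++)
            sad += labs((long)cur[y * stride + x] - (long)ref[y * stride + x]);
    return sad;
}

/* Full-search motion estimation for one block at (bx,by): evaluate every
 * integer displacement within +/-range and keep the lowest-SAD vector. */
static void full_search(const uint8_t *cur, const uint8_t *ref,
                        int stride, int w, int h,
                        int bx, int by, int bsize, int range,
                        int *best_mx, int *best_my)
{
    long best = LONG_MAX;
    *best_mx = *best_my = 0;
    for (int my = -range; my <= range; my++) {
        for (int mx = -range; mx <= range; mx++) {
            int rx = bx + mx, ry = by + my;
            if (rx < 0 || ry < 0 || rx + bsize > w || ry + bsize > h)
                continue;   /* keep the candidate inside the picture */
            long sad = sad_block(cur + by * stride + bx,
                                 ref + ry * stride + rx,
                                 stride, bsize, bsize);
            if (sad < best) {
                best = sad;
                *best_mx = mx;
                *best_my = my;
            }
        }
    }
}
```

The inner SAD loop is exactly the kind of regular, data-parallel workload that DSP cores accelerate well, which is why motion estimation dominates the load-balancing discussion.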

Another solution is to include all of the processing needed to reconstruct the reference picture in the first one or two tasks. If the motion search range is restricted, motion estimation can proceed as soon as a portion of the reference picture is available.
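The dependency created by a restricted search range can be stated precisely. With a vertical search range of ±range pixels, motion estimation for a given macroblock row only reads reference rows a bounded distance below it, so it can start as soon as that many reconstructed macroblock rows are available. A minimal sketch of the readiness condition (the formula is our reading of the constraint, not taken from the article):

```c
/* With a vertical search range of +/-range pixels, motion estimation for
 * macroblock row `mb_row` reads reference pixels down to row
 * (mb_row + 1) * 16 - 1 + range, i.e. it may start once the first
 * mb_row + 1 + ceil(range / 16) reconstructed macroblock rows are done
 * (capped, in practice, at the total number of rows in the picture). */
static int mb_rows_required(int mb_row, int range)
{
    return mb_row + 1 + (range + 15) / 16;
}
```

For example, with a ±32-pixel range, motion estimation for the top macroblock row can begin once the first three reconstructed macroblock rows of the reference picture are complete, letting the reconstruction and motion-estimation stages overlap.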

It is possible to make acceptable compromises that permit good quality with functional partitioning, but it becomes progressively more difficult to implement a system using more than a few processors.

NEXT: Spatial Partitioning
