
Video compression involves encoding/decoding of pixel information in 16×16 pixel macroblocks. H.264 allows the use of 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4 sub-macroblock partitions. A de-blocking filter is applied to every decoded macroblock edge to reduce the blocking artifacts produced by block-based coding. Filtering is done macroblock-wise, with each 16×16 pixel macroblock selected in raster-scan order. The filter is applied between every two 4×4 pixel sub-blocks in both directions, resulting in horizontal filtering of the vertical edges followed by vertical filtering of the horizontal edges.
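The traversal order described above can be sketched as follows. This is an illustrative outline only, assuming a caller-supplied `filter_edge` callback; the names `deblock_frame`, `MB`, and `SB` are not from the standard.

```python
MB = 16   # a macroblock is 16x16 pixels
SB = 4    # filtered edges lie between 4x4 sub-blocks

def deblock_frame(width_mbs, height_mbs, filter_edge):
    """Visit macroblocks in raster-scan order; for each one, filter the
    vertical edges horizontally first, then the horizontal edges vertically."""
    order = []
    for mby in range(height_mbs):          # raster scan: row by row
        for mbx in range(width_mbs):
            # horizontal filtering of the 4 vertical edges
            for edge in range(MB // SB):
                order.append((mbx, mby, "vertical", edge))
                filter_edge(mbx, mby, "vertical", edge)
            # then vertical filtering of the 4 horizontal edges
            for edge in range(MB // SB):
                order.append((mbx, mby, "horizontal", edge))
                filter_edge(mbx, mby, "horizontal", edge)
    return order
```

Each macroblock thus contributes eight edge-filtering passes (four vertical, then four horizontal), and the order of visits is fully determined by the raster scan.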

The filtering operation may affect up to three pixels on each side of the boundary, depending on the Boundary Strength (Bs). The filtering ranges from (a) Bs = 0, where *no pixels are filtered*, to (b) Bs = 4, strong filtering, *where all three pixels on each side (p0, p1, p2 and q0, q1, q2) are filtered*.
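The relationship between Bs and the number of modified pixels can be summarized in a small sketch. This simplification (the helper name is hypothetical, and the real standard applies further per-pixel threshold tests) only captures the upper bound on how many pixels per side each mode may touch:

```python
def pixels_filtered_per_side(bs):
    """Upper bound on pixels modified on each side of the edge
    for a given Boundary Strength (simplified; not the full
    H.264 decision logic, which also applies threshold tests)."""
    if bs == 0:
        return 0   # Bs = 0: no filtering across this edge
    if bs == 4:
        return 3   # Bs = 4: strong filter, p0,p1,p2 / q0,q1,q2
    return 2       # Bs = 1..3: normal filter, up to p0,p1 / q0,q1
```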

Because a SIMD (Single Instruction Multiple Data) processor can solve similar problems in parallel on different sets of local data, it can be up to n times faster than a single compute unit processor, where n is the number of compute units in the SIMD machine.

This benefit is easily achieved on vector-oriented/parallel problems, where each compute unit works on the same problem using independent local data. Examples of such problems include FIR filtering, where each compute unit generates a different output point, as well as the FFT, DCT, and IDCT.
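The FIR example can be made concrete: every output point y[n] = Σₖ h[k]·x[n+M−1−k] depends only on the input and the taps, never on other outputs, so all outputs can be computed simultaneously. A minimal sketch, using NumPy's `convolve` as a stand-in for a data-parallel (SIMD-style) evaluation:

```python
import numpy as np

def fir_scalar(x, h):
    """Reference sequential FIR: one output point at a time."""
    m = len(h)
    n_out = len(x) - m + 1
    return [sum(h[k] * x[n + m - 1 - k] for k in range(m))
            for n in range(n_out)]

def fir_vectorized(x, h):
    """The same independent output points, computed as one
    data-parallel operation (mirroring the SIMD view, where
    each compute unit would own one output point)."""
    return np.convolve(x, h, mode="valid")
```

The two functions produce identical results; the vectorized form simply exposes the independence of the output points that a SIMD machine exploits.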

This article explains how to convert the H.264 loop filter problem, which is sequential in nature, into a vector-oriented/parallel process that efficiently uses the SIMD architecture.

