H.264 De-Blocking Filter Acceleration with SIMD Processor

How to convert the H.264 loop filter problem, which is sequential in nature, into a vector-oriented, parallel process that makes efficient use of a SIMD architecture.

By Yosi Stein, DSP Principal System Architect/Advanced Technologies Manager, Analog Devices Inc.

Video/Imaging DesignWire
(8/31/2009 3:20:46 PM)

A Generic Instruction Stream
This problem can be solved by recognizing that, even though different operations must be performed, SIMD processing can still proceed in two or more compute units. Equations such as (3) and (6) are converted into a single generic instruction stream that covers both equations. Each compute unit then applies its own locally stored coefficients to that stream to produce its localized solution.

For example, equations (3) and (6) for P0 can be generalized as follows.

For ap = 1, equation (3) can be rewritten as

$$P_0 = \frac{p_2 + 2p_1 + 2p_0 + 2q_0 + q_1}{8}$$ (7)

and for ap = 0, equation (6) can be rewritten as

$$P_0 = \frac{2p_1 + p_0 + q_1}{4}$$ (8)

Equation (7) can then be generalized to:

$$P_0 = \tfrac{1}{8}p_2 + \tfrac{2}{8}p_1 + \tfrac{2}{8}p_0 + \tfrac{2}{8}q_0 + \tfrac{1}{8}q_1$$ (9)

and equation (8) can be generalized to:

$$P_0 = \tfrac{2}{4}p_1 + \tfrac{1}{4}p_0 + \tfrac{1}{4}q_1$$ (10)

Equations (9) and (10) have the same form, except that equation (10), for P0 with ap = 0, has no p2 or q0 term. The generic instruction stream can therefore be represented as

$$P_0 = K_0 p_0 + K_1 p_1 + K_2 p_2 + K_3 q_0 + K_4 q_1$$ (11)

where all the terms of equations (9) and (10) are represented (p0, p1, p2, q0, q1), each weighted by one of the coefficients K0 to K4. The coefficient set is selected according to ap and aq. When it is applied, each compute unit generates its own localized solution from the same generic instruction stream.
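
The idea can be illustrated with a minimal Python sketch for P0 only. It is not the author's implementation. It assumes that equations (3) and (6), which are not shown on this page, correspond to the two strong-filter forms for p0 in the H.264 standard, and that both coefficient sets are scaled to eighths so one rounding offset and one shift serve both cases. The names COEFFS, generic_p0, reference_p0 and filter_lanes are hypothetical, and selection by aq is not shown.

```python
# Coefficient sets for the generic stream, ordered (p0, p1, p2, q0, q1).
# Assumption: both sets are scaled to eighths so every lane can use the
# same rounding offset (+4) and the same shift (>> 3).
COEFFS = {
    1: (2, 2, 1, 2, 1),  # ap = 1: (p2 + 2*p1 + 2*p0 + 2*q0 + q1) / 8
    0: (2, 4, 0, 0, 2),  # ap = 0: (2*p1 + p0 + q1) / 4, rescaled to eighths
}


def generic_p0(samples, k):
    """Generic stream: five multiply-accumulates, then round and shift.

    The instruction sequence is identical for every lane; only the
    coefficient values in k differ.
    """
    acc = 4  # rounding offset for the divide by 8
    for sample, coeff in zip(samples, k):
        acc += sample * coeff
    return acc >> 3


def reference_p0(p0, p1, p2, q0, q1, ap):
    """Branching reference using the two separate equations (assumed forms)."""
    if ap:
        return (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
    return (2 * p1 + p0 + q1 + 2) >> 2


def filter_lanes(lanes):
    """Model of several compute units running the same stream.

    lanes is a list of ((p0, p1, p2, q0, q1), ap) pairs; each lane looks
    up its local coefficient set and runs generic_p0 unchanged.
    """
    return [generic_p0(samples, COEFFS[ap]) for samples, ap in lanes]


if __name__ == "__main__":
    pixels = (100, 90, 80, 120, 130)  # p0, p1, p2, q0, q1
    print(filter_lanes([(pixels, 1), (pixels, 0)]))
```

Both lanes execute exactly the same multiply-accumulate sequence. The zero coefficients in the ap = 0 set switch off the p2 and q0 terms, so no branch is needed inside the stream.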
