Ways to accelerate H.264 De-blocking Filter on a parallel processor with multiple compute units
According to the H.264 specification, the filter sample decisions ap (for the left side of the filter) and aq (for the right side) depend on the pixel gradient across the block boundary and are defined as:
(1)
(2)
where β is the slice threshold.
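As a rough illustration of such a sample decision, the sketch below assumes the gradient test has the form |x2 - x0| < β used by the standard, where x0 is the sample at the edge and x2 is the sample two positions away on the same side. The function name sample_decision and its parameters are hypothetical, and the standard's full strong-filter condition includes a further test on |p0 - q0| that is left out here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 when the gradient between the edge sample
 * (x0) and the sample two positions away on the same side (x2) is below
 * the threshold beta, and 0 otherwise.
 * Assumption: the decision has the form |x2 - x0| < beta. */
static int sample_decision(uint8_t x2, uint8_t x0, int beta)
{
    return abs((int)x2 - (int)x0) < beta;
}
```

Under this assumption, ap would be sample_decision(p2, p0, beta) and aq would be sample_decision(q2, q0, beta).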

Figure 1: Boundary Filtering
Assuming the boundary strength at the boundary shown in Fig. 1 is Bs = 4, according to the H.264 specification, the processor that executes the de-blocking filter has two choices. If ap = 1, the processor must carry out three filters to update P0, P1, and P2, as shown in equations (3), (4), and (5).
(3)
(4)
(5)
If ap = 0, only one filter needs to be carried out to update P0, as shown in equation (6), leaving P1 = p1 and P2 = p2 unchanged.
(6)
An identical set of equations, selected by aq = 0 or aq = 1, is used to process the samples q0 - q3 on the Q side.
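The two cases for the P side can be sketched in C as follows. This is a minimal illustration that assumes the Bs = 4 luma filter taps of the H.264 standard. The function name filter_p_bs4 and the array layout (p[0..3] holding p0..p3, q[0..1] holding q0 and q1) are choices made for this sketch, not part of the original text.

```c
#include <stdint.h>

/* Sketch of the Bs = 4 update for the P side of one boundary.
 * p[0..3] = p0..p3 and q[0..1] = q0..q1 are the unfiltered samples;
 * out[0..2] receives P0..P2.
 * Assumption: filter taps follow the H.264 strong luma filter. */
static void filter_p_bs4(const uint8_t p[4], const uint8_t q[2],
                         int ap, uint8_t out[3])
{
    if (ap) {
        /* ap = 1: three filters update P0, P1, and P2 */
        out[0] = (uint8_t)((p[2] + 2 * p[1] + 2 * p[0] + 2 * q[0] + q[1] + 4) >> 3);
        out[1] = (uint8_t)((p[2] + p[1] + p[0] + q[0] + 2) >> 2);
        out[2] = (uint8_t)((2 * p[3] + 3 * p[2] + p[1] + p[0] + q[0] + 4) >> 3);
    } else {
        /* ap = 0: one filter updates P0; P1 and P2 keep their values */
        out[0] = (uint8_t)((2 * p[1] + p[0] + q[1] + 2) >> 2);
        out[1] = p[1];
        out[2] = p[2];
    }
}
```

Because the filter is symmetric, the same routine should also serve the Q side if it is called with the q samples as the first argument, p0 and p1 as the second, and aq as the decision.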
If the boundary strength is Bs = 4 and both sample decisions ap and aq equal 1, the filtering for the P side and the Q side could be carried out by a dual-compute-unit processor, such as the ADI Blackfin, using the SIMD model: both compute units step through operations (3), (4), and (5) in parallel. However, this cannot be assured. The p0 - p3 sample decision might be ap = 1 while the q0 - q3 sample decision is aq = 0. In this case, one compute unit must perform operations (3), (4), and (5), while the other performs only the single operation (6). That is, the two units are no longer following the SIMD processing model.
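One common way to avoid this kind of mismatch, shown here only as an illustration and not necessarily the approach this article develops, is to have each side compute both candidate results and then pick one with a mask, so that the instruction sequence is the same whatever the sample decision is. The names select_int and filter_side_uniform are hypothetical, and the filter taps are again assumed to be those of the H.264 Bs = 4 luma filter.

```c
#include <stdint.h>

/* Branch-free select: returns a when flag is 1, b when flag is 0. */
static inline int select_int(int flag, int a, int b)
{
    int mask = -flag;               /* 1 -> all ones, 0 -> all zeros */
    return (a & mask) | (b & ~mask);
}

/* Sketch: one side of the boundary, same operations for either decision.
 * s[0..3] = samples on this side (nearest the edge first),
 * o[0..1] = the two nearest samples on the opposite side,
 * a       = sample decision for this side (0 or 1),
 * out[0..2] = filtered samples for this side. */
static void filter_side_uniform(const uint8_t s[4], const uint8_t o[2],
                                int a, uint8_t out[3])
{
    /* Results that apply when the decision is 1 */
    int strong0 = (s[2] + 2 * s[1] + 2 * s[0] + 2 * o[0] + o[1] + 4) >> 3;
    int strong1 = (s[2] + s[1] + s[0] + o[0] + 2) >> 2;
    int strong2 = (2 * s[3] + 3 * s[2] + s[1] + s[0] + o[0] + 4) >> 3;

    /* Result that applies when the decision is 0 */
    int weak0 = (2 * s[1] + s[0] + o[1] + 2) >> 2;

    /* Both candidates are always computed; the mask picks one */
    out[0] = (uint8_t)select_int(a, strong0, weak0);
    out[1] = (uint8_t)select_int(a, strong1, s[1]);
    out[2] = (uint8_t)select_int(a, strong2, s[2]);
}
```

The trade-off is extra arithmetic on the side whose decision is 0, in exchange for both compute units executing an identical sequence of operations.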



