Dual Compute Units
As Fig. 2 shows, the generic instruction stream of equation (11) drives the dual compute units in parallel, and together the compute units perform the operations needed to obtain P0 and Q0. In this particular example, ap = 0 for the left compute unit, so the left coefficient set, which includes 0s for K2 and K3, is applied to the generic instruction stream and equation (8) is carried out. In the right compute unit, where aq = 1, the right (nonzero) coefficient set is applied so that equation (7) is carried out.

Figure 2: Parallel Processing Using the Generic Instruction Stream
By creating coefficient tables for all four possible conditions of ap and aq (00, 01, 10, and 11), not only can the filtered values P0 and Q0 be processed, but the SIMD processing model can also be extended to obtain P1, Q1, P2, and Q2 for the same boundary strength Bs.
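The table lookup described above can be sketched in C. This is only an illustration under stated assumptions, not the article's equation (11): it assumes equations (7) and (8) are the standard H.264 Bs = 4 luma filters for P0 and Q0, treats ap and aq as 0/1 flags, and doubles the 3-tap weights so both filters share one rounding term and one shift. The names CoefSet, COEF, PAIRS, generic_stream, and filter_edge_row, and the sample ordering, are hypothetical, so the positions of the zero weights need not match the article's K2 and K3.

```c
#include <assert.h>

/* Hypothetical coefficient set: weights for the five samples one unit reads.
   P unit sample order: (p2, p1, p0, q0, q1)
   Q unit sample order: (q2, q1, q0, p0, p1)  (mirror image) */
typedef struct { int k[5]; } CoefSet;

/* Assumed weights, indexed by the unit's flag (ap or aq). */
static const CoefSet COEF[2] = {
    {{0, 4, 2, 0, 2}},  /* flag 0: 3-tap filter, doubled to share the >> 3 */
    {{1, 2, 2, 2, 1}},  /* flag 1: 5-tap filter */
};

/* One (left, right) pair per condition; index = ap * 2 + aq. */
typedef struct { const CoefSet *left, *right; } CoefPair;
static const CoefPair PAIRS[4] = {
    {&COEF[0], &COEF[0]},  /* ap aq = 00 */
    {&COEF[0], &COEF[1]},  /* ap aq = 01 */
    {&COEF[1], &COEF[0]},  /* ap aq = 10 */
    {&COEF[1], &COEF[1]},  /* ap aq = 11 */
};

/* The same operations for every unit; only samples and weights differ. */
static int generic_stream(const int s[5], const CoefSet *c) {
    int acc = 4;                        /* shared rounding term */
    for (int i = 0; i < 5; i++)
        acc += c->k[i] * s[i];
    return acc >> 3;
}

/* p[0..2] = p0, p1, p2 and q[0..2] = q0, q1, q2 for one row.
   The two calls stand in for the left and right compute units. */
static void filter_edge_row(const int p[3], const int q[3],
                            int ap, int aq, int *P0, int *Q0) {
    assert((ap == 0 || ap == 1) && (aq == 0 || aq == 1));
    const CoefPair pair = PAIRS[ap * 2 + aq];
    const int left[5]  = {p[2], p[1], p[0], q[0], q[1]};
    const int right[5] = {q[2], q[1], q[0], p[0], p[1]};
    *P0 = generic_stream(left, pair.left);
    *Q0 = generic_stream(right, pair.right);
}
```

Because the selection happens entirely in the data (the coefficient pair), generic_stream contains no branch on ap or aq, which is the property that lets one instruction stream serve both units.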
While the method has been demonstrated with only two compute units, the approach extends to processors with many more. For example, using a processor with eight compute units, one could process the filtering of P0 and Q0 not only for the first row, as indicated in Fig. 1, but for all four rows between the two 4×4 blocks.
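A minimal C sketch of the eight-unit case follows. It models each compute unit as a lane holding its own samples and coefficients, with two lanes per row (one for P0, one for Q0), and runs each step of the stream across all lanes before moving to the next step. The Lane and Row types, the lane assignment, and the weights are assumptions made for illustration: the weights are the standard H.264 Bs = 4 luma filters with the 3-tap case doubled so every lane uses the same rounding and shift, and a real SIMD processor would execute the inner lane loop in hardware rather than serially.

```c
#include <assert.h>

#define LANES 8   /* assumed number of compute units */
#define ROWS  4   /* rows along one 4x4 block edge */
#define TAPS  5

/* One row of the edge: p[0..2] = p0, p1, p2; q[0..2] = q0, q1, q2;
   ap and aq are treated as 0/1 flags. */
typedef struct { int p[3]; int q[3]; int ap; int aq; } Row;

/* Per-lane state: the data one compute unit works on. */
typedef struct { int sample[TAPS]; int coef[TAPS]; int out; } Lane;

/* Assumed weights, indexed by the lane's flag. */
static const int COEF[2][TAPS] = {
    {0, 4, 2, 0, 2},  /* flag 0: 3-tap filter, doubled */
    {1, 2, 2, 2, 1},  /* flag 1: 5-tap filter */
};

/* Lane 2r computes P0 of row r; lane 2r + 1 computes Q0 of row r. */
static void load_lanes(Lane lane[LANES], const Row rows[ROWS]) {
    for (int r = 0; r < ROWS; r++) {
        const Row *w = &rows[r];
        assert((w->ap == 0 || w->ap == 1) && (w->aq == 0 || w->aq == 1));
        const int sp[TAPS] = {w->p[2], w->p[1], w->p[0], w->q[0], w->q[1]};
        const int sq[TAPS] = {w->q[2], w->q[1], w->q[0], w->p[0], w->p[1]};
        for (int i = 0; i < TAPS; i++) {
            lane[2 * r].sample[i]     = sp[i];
            lane[2 * r].coef[i]       = COEF[w->ap][i];
            lane[2 * r + 1].sample[i] = sq[i];
            lane[2 * r + 1].coef[i]   = COEF[w->aq][i];
        }
    }
}

/* Each step of the stream is applied to every lane before the next step. */
static void run_stream(Lane lane[LANES]) {
    for (int l = 0; l < LANES; l++)
        lane[l].out = 4;                               /* rounding term */
    for (int i = 0; i < TAPS; i++)
        for (int l = 0; l < LANES; l++)
            lane[l].out += lane[l].coef[i] * lane[l].sample[i];
    for (int l = 0; l < LANES; l++)
        lane[l].out >>= 3;
}
```

After load_lanes and run_stream, lane[2 * r].out and lane[2 * r + 1].out hold the filtered P0 and Q0 of row r. Each row can carry different ap and aq flags without any change to the instruction sequence.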
In many instances, the SIMD signal processor may contain more than eight compute units. In that case, not only can the boundary between two 4×4 blocks be processed, but a full vertical or horizontal line of 4×4 blocks can be processed simultaneously. This presents a new problem, however: although the filter strength parameter Bs is the same for every row within a 4×4 block, the next 4×4 block along the same vertical or horizontal direction can have an entirely different Bs.



