Mamidi Paper 3
Microelectronics Journal
journal homepage: www.elsevier.com/locate/mejo
ARTICLE INFO

Keywords: Entropy Coding; Context-Adaptive Binary Arithmetic Coding (CABAC); Binarization; High-Efficiency Video Coding (HEVC); Versatile Video Coding (VVC); Application Specific Integrated Circuit (ASIC); Field Programmable Gate Array (FPGA); Ultra-High-Definition (UHD)

ABSTRACT

Video compression imposes a high throughput requirement on the latest video encoders (HEVC and VVC), and CABAC is the primary entropy coding method used in such applications. Binarization is the first and a vital sub-block of CABAC, and it requires high throughput and better performance. A high-throughput and area-efficient Binarization architecture is proposed in this work. A parallel processing hardware architecture is used, which processes five input symbols per clock cycle and uses probability estimation to obtain high throughput. The resource-sharing technique is used to optimize the utilization of hardware resources. Further, a pairing-SEs scheme and storage buffers are incorporated in the data path for better performance. The Binarization architecture is designed in Verilog HDL and verified on an Artix-7 FPGA. It is also implemented as an ASIC using 90 nm technology. The proposed design achieves a maximum throughput of 3.14 Gbin/s at 282 MHz and consumes considerably less hardware area than other architectures. The proposed Binarization is adaptive, scalable, and versatile in functionality.
1. Introduction

In recent years, video processing has become more significant due to the vast volume of video data being transmitted or stored. According to Cisco’s Visual Networking Index (VNI), 75% of all internet traffic was video data in 2017, and it is predicted to increase to 82% by 2022 [1]. Advanced video compression techniques are therefore being used to reduce this vast amount of data. However, next-generation video compression technology is expected to support UHD and beyond formats (4K–16K, 360° videos) at high frames per second (fps). To store and transmit video data, high-throughput and area-efficient compression algorithms are required.

In this effort, the International Telecommunication Union–Telecommunication (ITU-T)/Video Coding Experts Group (VCEG), the International Standardization Organization/International Electrotechnical Commission (ISO/IEC), and the Moving Picture Experts Group (MPEG) are the international organizations involved in the development of the video coding standards. The H.264 or AVC (Advanced Video Coding) [2], HEVC/H.265 [3], and VVC/H.266 [4] video standards were jointly developed by ITU-T and ISO/IEC. HEVC delivers 50% higher coding efficiency than its predecessor AVC/H.264. VVC is the successor of HEVC and is expected to achieve 30% more coding efficiency than HEVC. Presently, all these standards are being used in various consumer video applications. However, the basic structure of all these standards (H.264/H.265/H.266) is the same. The general block diagram of the HEVC encoder is shown in Fig. 1.

In general, video encoding is done on the basis of basic blocks known as coding tree units (CTUs), which come in different sizes (4 × 4 to 256 × 256) and are obtained by dividing each frame. These CTUs are processed through several internal coding tools, viz. Transform and Quantization, Motion estimation, Intra-frame prediction, Deblocking, Sample Adaptive Offset (SAO) filters, and Context-based Adaptive Binary Arithmetic Coding (CABAC). The final stage of the encoder is CABAC [5], which performs entropy coding. It is a lossless compression technique that eliminates statistical redundancy and helps to increase compression efficiency. CABAC is used to compress symbols known as syntax elements (SEs) into encoded bits. It is one of the entropy methods that resemble Context-based Adaptive Variable Length Coding (CAVLC) [6], but it provides about 9–14% higher coding efficiency than CAVLC [7]. The development of video standards, their entropy coding, and their major support is shown in Table 1.

CABAC has emerged as the most efficient entropy coding approach for next-generation video standards. Due to restricted parallelization opportunities, CABAC is regarded as one of the key throughput bottlenecks in the video encoder, especially for high-resolution and high-frame-rate videos [7]. This is because of the CABAC algorithm’s
* Corresponding author.
E-mail addresses: rel1651@mnnit.ac.in (M. Nagaraju), skg@mnnit.ac.in (S.K. Gupta), vijaya@mnnit.ac.in (V. Bhadauria).
https://doi.org/10.1016/j.mejo.2022.105425
Received 15 October 2021; Received in revised form 27 January 2022; Accepted 13 March 2022
Available online 16 March 2022
0026-2692/© 2022 Elsevier Ltd. All rights reserved.
M. Nagaraju et al. Microelectronics Journal 123 (2022) 105425
extremely serial data dependencies, which are produced by multiple feedback loops. The schematic diagram of CABAC in HEVC is shown in Fig. 2, with the essential functions indicated in boxes and their data dependencies (red pointer). The serial processing of binary symbols (bins), context adaptation, and bin-by-bin dependence make it difficult to process multiple bits concurrently.

Table 1
Video standards and their entropy coding.
Standards | H.264/AVC (2003) | H.265/HEVC (2013) | H.266/VVC (2020)
Entropy type | CAVLC & CABAC | CABAC | CABAC
Main support | 2K/4K@30fps | 4K/8K@60fps | beyond 8K@120fps

CABAC is typically made up of three primary processing blocks: (i) Binarization (BN), (ii) Context Modeling (CM), and (iii) Binary Arithmetic Encoding (BAE). The CABAC input data symbols, such as Encoder control, Quantized transform coefficients, Intra-frame prediction, filter control data, and Motion data, originate from the previous encoding processes as SEs (Fig. 1).

The BN block is used to convert the data of non-binary SEs into bins. These bins are divided into two primary categories, viz. (i) regular and (ii) bypass bins. The regular bins with higher data dependency, i.e., higher probability, undergo the CM, whereas bins having lower probability are categorized as bypass bins, which skip the CM. The bins are processed using one of two data paths: the regular bin path or the bypass bin path. Finally, the BAE is used to compress the bins based on the estimated probabilities using a regular arithmetic encoder or bypass
Table 3
List of SEs as input to the CABAC block [7,8].
Coding Units Type of data SE_type
Fig. 5. Distribution of SEs and average bins per pixel (a–d) [18].
hardware architecture of CABAC-Binarization has been proposed in this work, which offers better performance and meets UHD and beyond requirements. This paper presents an architecture for CABAC-BN to achieve the highest possible throughput using massive parallelism with balanced storage buffers. At the same time, using the resource-sharing technique helps in saving hardware area.

3. Probability analysis of SEs/Bins

The workload of the BN in terms of SE types is a concerning aspect
that needs to be investigated to design efficient hardware. Table 3 shows the different types of SEs at the BN stage. All these SEs are specified in the standard [8], and the same names are used in this paper. The readable SE names, written with underscores, reflect their respective functionalities.

As per the work reported in Ref. [18], an experimental test and analysis were carried out on the number of bins per pixel using the HEVC Model (HM) reference software.

This data also gives information about the distribution of SEs and bins per pixel at the BN block for the test sequence. The experimental data is taken from Ref. [18] and reproduced here in Fig. 5. Two different configurations are considered, viz., Low Delay (LD) and Random Access (RA), at two quantization parameters (22 and 37). Table 4 and Fig. 5 show that the most frequent and significant SEs are related to the transform coefficients.

Table 4
Significance of bin contributions in the HEVC encoder [19].
Bin types | Common Test Condition: Low Delay (LD) | Common Test Condition: Random Access (RA) | Worst Case
Coding tree unit (CTU) | 14.2% | 10.7% | 0.6%
Prediction unit (PU) | 20.3% | 18.7% | 5.0%
Transform Unit (TrU) | 64.5% | 69.9% | 94.0%
Loop Filter (LF) | 1.1% | 0.7% | 0.8%

Furthermore, the Transform Unit (TrU) related bins occupy a notable share of the CABAC data, viz., approximately 67% on average and 94% in the worst case, as shown in Table 4. Hence, it is clear that the SEs related to the transform coefficients contribute significantly to the final generated bitstream. Therefore, the throughput of such SEs dramatically influences the whole CABAC throughput.

The standard specifies that the transform coefficients are assembled from the 4 × 4 block size [7]. For instance, a typical 4 × 4 Transform Block (TrB) consisting of an array of signed 16-bit integer decimal coefficients can be seen in Fig. 6 (a to e).

The SEs extracted from the TrB are as follows:

• last_sig_coeff_x_prefix (LASTx) and last_sig_coeff_y_prefix (LASTy): These SEs represent the position of the last significant coefficient within the TrB along the x and y axes, respectively.
Fig. 7. Test sequences used for the probability occurrence estimation of SEs.
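As a small arithmetic check of the "approximately 67% on average" figure quoted from Table 4, the sketch below averages the TrU share over the two common test conditions. It assumes that the quoted average is the plain mean of the LD and RA columns; the variable and function names are illustrative only.

```python
# Transform Unit (TrU) share of bins, copied from Table 4 (percent).
tru_share = {"LD": 64.5, "RA": 69.9, "worst_case": 94.0}


def average_share(shares, keys=("LD", "RA")):
    """Arithmetic mean of the shares for the selected test conditions."""
    return sum(shares[k] for k in keys) / len(keys)


# Assumption: "on average" means the mean of the LD and RA columns.
avg_tru = average_share(tru_share)
print(f"Average TrU share over LD and RA: {avg_tru:.1f}%")
```

Under that assumption the mean is about 67.2%, which is consistent with the rounded value used in the text.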
Fig. 8. The consecutive occurrence of (a) regular and (b) bypass bins.
Hence, the throughput of CABAC is limited for regular bins due to the data dependencies in context modeling. In contrast, it is easier to process bypass bins in parallel since they do not contain data dependencies [21]. Thus, there is a potential to increase throughput if multiple SEs related to regular/bypass bins are combined and processed concurrently. This analysis helps to find the probability of consecutive occurrence of SEs and guides the hardware design.

The minimum bitrate for 8K UHD@60fps, according to the HEVC standard, is 800 Mbits/s at level 6.2 high tier, which equals an average bin rate (symbol rate) of about 1 Gbin/s [9,19]. In reality, the number of bins fluctuates heavily between frames. As a result, a real-time CABAC encoder may be required to deliver bin rates exceeding 1 Gbin/s. All of these things alleviate the workload at the BN block, which enhances the CABAC throughput.

The work in this section may be summarized as follows:

(i) The SEs, which occur consecutively as regular and bypass bins at the BN stage, have been grouped and processed concurrently.
(ii) The probability of an average number of regular/bypass bins has been analyzed.
(iii) The minimum workload at the BN block has been realized.

4. Proposed architecture for Binarization (BN)

The processing of more SEs becomes the bottleneck due to the inherently serial nature of processing at the BN, and thus it limits the CABAC throughput. On the other hand, a one-by-one SE binarization could not support the throughput of the subsequent blocks (CM & BAE), which have an encoding capability of 4–4.95 BPCC [17]. Therefore, acceleration at the binarization block for multiple SEs is required to enhance the throughput. The proposed design features a massively parallel architecture to achieve the highest possible throughput, and it embeds the resource-sharing technique for area efficiency. The proposed BN architecture (dotted box) is considered a hardware accelerator for the CABAC, as shown in Fig. 9. It consists of the five-core BN blocks, the pairing-SEs scheme, the storage buffers, and a data router (Multiplexer). The following sub-sections describe the proposed design in detail.

The main controller takes the SE_type and generates parameters such as binarization type, bin index (bin_idx), cRiceParam, and cMax. Then one of the processing cores for the EGk, TR, FL, TU, and custom formats is activated to convert the SE value to a sequence of bins. The five-core BN architecture consists of five independent single-core BN blocks connected in parallel by the main controller. The five-core BN parallel architecture can process up to five SEs in a single clock cycle. A pairing-SEs scheme has been introduced before the five-core BN to process multiple SEs in parallel. These SEs from the other processes in the HEVC architecture (transform coefficients, filter parameters, prediction data, etc.) have to be buffered at the input and output of the BN. Hence, two storage buffers are used before and after the five-core BN to balance data flow between processes. The Serial In Parallel Out (SIPO) buffer is used before the five-core BN and provides five buffered SEs by converting serial into parallel, while the Parallel In Parallel Out (PIPO) buffer is used to fetch the binarized SEs.

Once binarization is finished, the bins are split into two streams composed of regular and bypass bins, as shown in Fig. 9. The output bin strings and their bin lengths are temporarily stored at the PIPO. Then, depending on the bin type, regular or bypass, the multiplexer separates and routes them to the regular bin path or bypass bin path, respectively. Encoding of bypass bins is simple, since estimation of their probability is not necessary; however, for regular bins, appropriate probability models for encoding are required. These output bins are passed into the Bit generator to form the final output bitstream of the complete encoder.

4.1. Proposed single-core BN

The proposed single-core BN architecture is shown in Fig. 10. The BN module inputs are SE_value of 16 bits, SE_type of 9 bits, and mode of 2 bits. It consists of different BN methods such as U, TU, TR, TB, EGk, and FL. All the BN methods map the SEs into bins. The single format four
BN methods (TB, U, FL, and EGk) are connected in parallel and work
independently, while two interdependent methods (TU and TR) are
derived from U and (TU + FL), respectively.
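The relationship between these methods can be illustrated in software. The following is a minimal sketch that assumes the usual HEVC-style definitions of U, TU, FL, EGk, and TR; it is not the paper's RTL, it omits TB and the custom table mappings, and all function names are hypothetical. Bin strings are returned as text for readability.

```python
def unary(value):
    """U: `value` ones followed by a terminating zero."""
    return "1" * value + "0"


def truncated_unary(value, c_max):
    """TU: like U, but the terminating zero is dropped when value == c_max."""
    if value < c_max:
        return "1" * value + "0"
    return "1" * c_max


def fixed_length(value, c_max):
    """FL: plain binary, MSB first, using ceil(log2(c_max + 1)) bits."""
    n_bits = c_max.bit_length()  # equals ceil(log2(c_max + 1)) for c_max >= 0
    return format(value, "b").zfill(n_bits) if n_bits else ""


def exp_golomb(value, k):
    """EGk: prefix of ones, a zero separator, then the remaining suffix bits."""
    bins = []
    while value >= (1 << k):
        bins.append("1")
        value -= 1 << k
        k += 1
    bins.append("0")
    for shift in range(k - 1, -1, -1):
        bins.append(str((value >> shift) & 1))
    return "".join(bins)


def truncated_rice(value, c_max, rice):
    """TR: TU prefix on (value >> rice), then an FL suffix of `rice` bits."""
    prefix = truncated_unary(value >> rice, c_max >> rice)
    if c_max > value and rice > 0:
        suffix_val = value - ((value >> rice) << rice)
        return prefix + fixed_length(suffix_val, (1 << rice) - 1)
    return prefix


if __name__ == "__main__":
    for v in range(5):
        print(v, truncated_unary(v, 4), truncated_rice(v, 4, 1), exp_golomb(v, 0))
```

In this sketch TR reuses the TU and FL routines directly, which mirrors the idea that TU is derived from U and TR from TU + FL, so only a small set of primitive blocks needs to exist.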
The main controller module activates BN methods based on the given input SE_type. This SE_type determines which BN method has to convert the input SE_value into a bin string (bin_string). For instance, referring to Table 127 of [8], SE_type = 0 denotes the SE_value of “end_of_slice_one_bit”, which utilizes the FL method to convert into bins; additionally, the value of cMax is set to 1. Likewise, cMax and cRiceParam are derived from SE_type, and these parameters are required for most of the SEs. The BN formats are activated depending on the format signal from the main controller. All formats, such as single, combined, and custom, are utilized based on the format signal. The output of this BN process is in the form of a bin_string and bin_length.

4.2. Resource-sharing technique for area efficiency in single-core BN

The resource-sharing technique is utilized in the single-core BN architecture. It allows the same hardware resource to be shared for executing two or more BN methods/formats. The following resource-sharing technique is proposed at the single-core BN architecture level to reduce hardware area. The BN methods U and TU are for unsigned SEs and share the same hardware resource, as shown in Fig. 11. The bins of these BN methods are represented as a sequence of ‘1’s terminated by a ‘0’, which depends on the cMax constraint, as shown in Table 2. The TU is a reduced form of U that generates a bin string of ‘1’s followed by a terminating ‘0’ at the end of the bin string for the case SE_value < cMax; when SE_value reaches cMax, the terminating ‘0’ is removed from the bin string. The bin string length is calculated as SE_value + 1 for all cases except for the case SE_value = cMax (bin string length = SE_value).

Another resource-sharing technique employed at the BN block builds TR from the combination of TU and FL, as shown in Fig. 12.

The TR bin string consists of two parts, viz., a prefix and a suffix. The prefix part invokes the TU method, whereas the suffix part invokes the FL method. The operators “>>” and “<<” are bit-wise right and left shifts. For cRiceParam = 0, TR remains the same as TU. This BN process is also selected based on the SE_type. In some cases, BN is adaptable, depending on previously processed SEs.

Fig. 13. Adaptable BN process for CALR [10].

4.3. Adaptive BN

The CALR SE_type is binarized using TR coding for the prefix and EGk coding for the suffix. However, TR is a combination of TU and FL. The number of FL bins depends on the prefix value [10]. The number of FL bins also depends on the parameter cRiceParam, which adaptively changes based on the value of the previous nonzero coefficient level. The bin
length is adaptively updated in accordance with the cRiceParam. It introduces a dependency between coefficient levels. The main controller generates parameters including context index (bin_idx), cRiceParam, and cMax. The flowchart for the CALR process is shown in Fig. 13.

In this process, the total number of bins has been reduced by modifying the choice of the binarization algorithm for CALR [10]. H.264/AVC utilizes the TU and EGk methods, while HEVC uses TR (TU + FL) and EGk, generating fewer bins [17]. It accounts for a significant portion, on average 15–25%, of the total bins.

4.4. Multi-SEs parallel processing for high-throughput

CABAC has a required throughput of higher than 1 Gbin/s for 8K UHD and beyond applications. This throughput depends on the operating frequency and the number of bins processed per clock cycle (BPCC), and is given by:

\[ \text{Max. Throughput (bins/s)} = \text{Max. Operating Frequency (MHz)} \times \text{Avg. BPCC} \tag{3} \]

where the average BPCC is estimated on the basis of functional simulation of the test videos. The average throughput of CABAC is 0.99 bin/cycle, which is approximately 1 bin/cycle [22]. However, a well-designed BAE block is capable of a maximum throughput of 4.94 bin/cycle [17] and is mainly influenced by the performance of the previous BN stage. Hence, the BPCC of the BN block is targeted at twice that value (more than 10 bins/cycle) to keep its processing capability higher than that of the BAE at all times. Since BN is a serial (SE/bin) dependent operation, fixing its throughput in BPCC is difficult [9].

A straightforward approach is to binarize multiple SEs in parallel. Hence, to achieve the maximum throughput, on average at least 4–5 SEs should be processed at a time. Here, the BN block is designed using five single-core BN blocks in parallel to process five SEs simultaneously, as shown in Fig. 14. The main controller controls the core stages. The resource-sharing technique has also been maintained within each core stage.

4.5. Pairing-SEs and SIPO/PIPO buffers for high performance

It is observed that SEs of the same regular/bypass bin type tend to arrive next to each other (Section 3). The bin distribution analysis reveals
that frequently occurring SEs may be paired before BN [13]. Hence, some SEs that appear together are paired and binarized jointly [7,21]. This scheme is named “pairing-SEs,” in which two SEs are paired together in a single cycle (Fig. 15 (a & b)). Fig. 15(a) shows the order in which SEs are delivered to the CABAC encoding process, and Fig. 15(b) shows the pairing-SEs (P1 & P2) and the corresponding processing BN core. Considering the data from Table 4 and Figs. 5, 6 and 8, SE types such as last_sig_coeff_x/y_prefix (LAST) and coeff_abs_level_greater1/2_flag (COEFF) appear frequently. The considered test configurations (LD_QP@22, LD_QP@37, RA_QP@22, and RA_QP@37) also show a significantly frequent occurrence of LAST and COEFF SE_types, accounting for approximately 30%

Table 6
The different standards support.
Mode | Standards | BN methods | Combination | Custom
00 | Default (HEVC) | TU, TR, FL, and EGk | TR + EGk for CALR | Table mapping and QP_Delta
01 | H.264/AVC | U, TU, FL, and EGk | TU + EGk for CALR | Table mapping
10 | H.265/HEVC | TU, TR, FL, and EGk | TR + EGk for CALR | Table mapping and QP_Delta
11 | H.266/VVC | TR, TB, FL, and EGk | TR + EGk for CALR | Table mapping and QP_Delta
Table 7
The performance of the proposed architecture in BPCC.
Test video sequences | Encoder Configuration | Bins per clock cycle (BPCC) at QP@22 | Bins per clock cycle (BPCC) at QP@37
Basketball | LD | 9, 4, 13 | 7, 3, 10
Basketball | RA | 5, 5, 10 | 5, 4, 9
PeopleOnStreet | LD | 8, 4, 12 | 7, 3, 10
PeopleOnStreet | RA | 7, 4, 11 | 5, 3, 8
Traffic | LD | 6, 4, 10 | 5, 4, 9
Traffic | RA | 6, 5, 11 | 4, 3, 7
Average BPCC | | 6.83, 4.33, 11.16 | 5.50, 3.33, 8.83
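The averages in the last row of Table 7 can be recomputed from the per-sequence entries. The sketch below assumes that the three values reported per QP are the regular, bypass, and total bins per clock cycle, in that order; the column labels are an inference (the third value always equals the sum of the first two), not something stated in the table itself.

```python
# Per-sequence BPCC triples copied from Table 7.
# Assumption: each triple is (regular, bypass, total) bins per clock cycle.
table7 = {
    ("Basketball", "LD"): {"QP22": (9, 4, 13), "QP37": (7, 3, 10)},
    ("Basketball", "RA"): {"QP22": (5, 5, 10), "QP37": (5, 4, 9)},
    ("PeopleOnStreet", "LD"): {"QP22": (8, 4, 12), "QP37": (7, 3, 10)},
    ("PeopleOnStreet", "RA"): {"QP22": (7, 4, 11), "QP37": (5, 3, 8)},
    ("Traffic", "LD"): {"QP22": (6, 4, 10), "QP37": (5, 4, 9)},
    ("Traffic", "RA"): {"QP22": (6, 5, 11), "QP37": (4, 3, 7)},
}


def column_means(table, qp):
    """Mean of each of the three columns over all sequences for one QP."""
    triples = [row[qp] for row in table.values()]
    return tuple(sum(t[i] for t in triples) / len(triples) for i in range(3))


qp22_means = column_means(table7, "QP22")
qp37_means = column_means(table7, "QP37")
print("QP@22 means:", [round(m, 2) for m in qp22_means])
print("QP@37 means:", [round(m, 2) for m in qp37_means])
```

The recomputed QP@22 total is 67/6, which rounds to 11.17; the tabulated 11.16 matches the sum of the rounded column averages (6.83 + 4.33), so the small difference appears to be a rounding effect.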
standalone single-core BN block. However, in the present design, only a single main controller has been designed to control the entire five-core BN. The main controller block issues an encoding request to one or more BN methods based on the SE_type signals for binarization, as shown in Fig. 10. The controller logic, in the form of a finite state machine (FSM) that controls the five-core BN, is shown in Fig. 17. The main controller also generates a selector (sel) signal that chooses the BN block’s output using a 5:1 multiplexer after the PIPO buffer stage to give the final output bin string.

4.7. Multi-standard support

Finally, the HEVC, VVC, and H.264/AVC formats are supported by using a two-bit mode signal at the top level of the five-core BN. Comparing HEVC with H.264/AVC, the basic BN methods are the same except for those used for the combination and custom methods, while VVC requires an extra TB BN method. In addition, the proposed five-core BN architecture supports multiple standards, as shown in Table 6. The BN processes SEs in the order defined in the standards.

5. Experimental test setup and hardware implementation

The proposed BN architecture has been described in Verilog RTL. Each BN method has been tested using a test bench of a 4 × 4 TrB (Fig. 6) in the Xilinx ISE 14.7 Simulator. The functional validity of the design has been verified and compared with the reference software HM-16.2 [20].

5.1. Experimental test setup and RTL simulation with test sequences

An illustration of the experimental test setup and verification process, including the reference software and the proposed hardware, is shown in Fig. 18. The system-level structure of the interfacing between hardware and software, along with the tools environment, is also presented.

Further, to analyze the correctness of the proposed design, the RTL has been simulated with the same test video sequences used in Section 3 (Fig. 7) for the probability analysis of SEs, and the obtained results are presented in Table 7. Here, the regular and bypass bins have been recorded separately for the three UHD video sequences, with the encoder configurations of LD and RA, at the QPs of 22 and 37.

An average BPCC of 11.16 and 8.83 has been achieved for QP@22 and QP@37, respectively, as the proposed architecture is able to process five SEs per clock cycle (SECC). The timing analysis of the parallel process is also provided to demonstrate its impact on the throughput. Fig. 19 shows the timing diagram of the proposed five-core BN for a 4 × 4 TrB (Fig. 15) in terms of input SECC and output BPCC. It explains the process of converting SEs into bin strings by the proposed
BN architecture.

Fig. 19. The timing diagram of the proposed five-core BN design. A striped blue block denotes an idle state, whereas a white block represents the working state. The output symbols in white blocks specify the status of the bin.

5.2. FPGA prototype

The proposed design has been synthesized, implemented, and prototyped on the 28 nm Artix7 FPGA (Nexys4 DDR board), and on-chip hardware debugging has been performed using the logic analyzer tool (Xilinx Chip-Scope pro) [23]. The FPGA post Place and Route (P&R) results of each BN method in terms of the occupied area (No. of Slices) and maximum operating frequency are shown in Table 8. Two versions of the proposed BN architecture have been implemented: single-core (Fig. 10) and five-core (Fig. 14). The FL method is used by the majority of the SEs having equal probability. The synthesis netlist shows that the FL block is a combinational circuit, and it has also been noticed that the five-core BN consumes approximately five times the area of the single-core BN.

Table 8
FPGA post-implementation results.
Parameter | U/TU | TR | TB | FL | EGk | Single-core | Five-core
No. of Slices | 72 | 56 | 48 | 32 | 119 | 327 | 1694
Max. Operating Frequency (MHz) | 466 | 644 | 852 | 920 | 343 | 384 | 124

5.3. ASIC implementation

The ASIC synthesis has been done with the Cadence Genus tool using 90 nm CMOS technology. Here, four BN architecture versions have been considered for synthesis, viz., (I) a baseline single-SE BN (Single-core) without resource sharing, named ‘SBN’, (II) a single-SE BN with the resource-sharing approach, named ‘SBN_RS’, (III) a baseline multi-SE BN (Five-core) without resource sharing, called ‘MBN’, and (IV) a multi-SE BN with resource sharing, called ‘MBN_RS’.

Table 9 presents the synthesis results of these architectures in terms of maximum SECC, maximum operating frequency, maximum throughput, logic gate count, and obtained average BPCC. The proposed MBN_RS architecture has been synthesized using five SBN_RS, while other core stages have also been considered. The design has achieved an
Table 9
ASIC synthesis of the four BN architectures.
Versions | BN Architectural Features | Max. SEs/cycle (SECC) | Max. Operating frequency (MHz) | Max. throughput (Gbin/s) | Gate count (Kgates) | Avg. Bins/cycle (BPCC) @QP = 22
Table 10
ASIC P&R implementation of the five-core BN architecture.
Technology (nm) 90
average of 1.40 BPCC for single-core and 11.16 BPCC for five-core BN with resource sharing. Here, the five-core architecture with resource sharing has been implemented in the proposed design for the ASIC layout using Cadence Innovus 16.8 (RTL-GDSII flow).

The post P&R implementation results are presented in Table 10. The total gate count is 7090 gates, including the storage buffers. The maximum throughput of 3.14 Gbin/s has been obtained at an operating frequency of 282 MHz.

Furthermore, several built-in optimization techniques provided by EDA tools, such as logic balancing, clock scheduling, and retiming, were also used to achieve better performance and low hardware area. The RTL-GDSII flow has no timing errors and passed the DRC/LVS checks. Fig. 20 shows the layout view of the proposed design, which has positive slack at the GDSII stage. The Cadence Virtuoso tool has been used for the chip-level layout implementation by creating the input-output (IO) pad rings. The post-layout checks such as LVS (Layout vs. Schematic) have been passed. The total chip area is 2.01 mm² with extra space to integrate other CABAC sub-blocks in the future.

effective way to increase the throughput, but the hardware area rises accordingly. In CABAC, it is essential to consider the possibility of avoiding the main potential bottleneck at the CM/BAE block. Hence, the primary BN block should cater to the high throughput requirements by processing multiple SECC.

The proposed work mainly focuses on improving the throughput by processing multiple SECC at the BN stage. The proposed five-core BN architecture processes multiple SEs in parallel and generates multiple bins to enhance throughput. The resource-sharing technique in the architecture has been used to reduce hardware area. In addition, the pairing-SEs scheme has been used based on the probability of consecutive occurrences of SEs (Section 3), which can process two or more SEs per clock cycle. The storage buffers used for the SEs distribute the workload in parallel processing.
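The reported post-P&R throughput can be related to Eq. (3) with a short calculation. The sketch below is only an arithmetic illustration: it takes the 282 MHz operating frequency and the 11.16 average BPCC quoted in the text, and the function names are hypothetical.

```python
def max_throughput_gbin(freq_mhz, avg_bpcc):
    """Eq. (3): clock frequency times average bins per cycle, in Gbin/s."""
    return freq_mhz * 1e6 * avg_bpcc / 1e9


def min_bpcc_for(target_gbin, freq_mhz):
    """Average BPCC needed to sustain a target bin rate at a given clock."""
    return target_gbin * 1e9 / (freq_mhz * 1e6)


# Five-core BN with resource sharing: 282 MHz and 11.16 BPCC (from the text).
five_core = max_throughput_gbin(282, 11.16)

# BPCC that would be needed at the same clock for the 1 Gbin/s 8K UHD target.
needed = min_bpcc_for(1.0, 282)

print(f"Five-core throughput: {five_core:.3f} Gbin/s")
print(f"BPCC needed for 1 Gbin/s at 282 MHz: {needed:.2f}")
```

The product comes to roughly 3.147 Gbin/s, in line with the reported 3.14 Gbin/s, while about 3.5 bins per cycle would already cover the 1 Gbin/s requirement at this clock.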
The throughput exploration has been done for different core stage architectures. The synthesis results (Table 9) help in understanding the core stage efficiency required to process UHD and beyond. The relationship between SECC, maximum throughput, and operating frequency is shown in Fig. 21. It has been observed that any increment in the SECC allows higher throughput.

Furthermore, high throughput is not always correlated with the level of parallelism. It also depends strongly on the balance of the SE workload between parallel cores. If the workload is not equally distributed, some cores will be idle, and the throughput is reduced (i.e., N parallel hardware blocks may not increase the throughput N times).

Fig. 22 illustrates the levels of parallelism in the proposed architecture at the system level. Moreover, the proposed architecture also provides a flexible choice of selecting the appropriate number of core stages for the required resolution, as shown in Table 11.

The maximum throughput of all the proposed design cores lies between 1 and 3.14 Gbin/s. Hence, this architecture is able to meet the requirements for UHD videos (more than 1 Gbin/s). In practice, the BPCC fluctuates heavily between CTUs and frames. Therefore, a real-time CABAC encoder may have to deliver even higher throughput than the actual requirement [9].

Generally, the throughput is proportional to the operating frequency (Eq. (3)). In the proposed architecture, the operating frequency decreases as the number of core stages increases (Fig. 21); however, the system’s overall throughput increases due to processing multiple bins per clock cycle (Table 9). The impact of operating frequency on throughput is shown in Table 12.

Table 12
Impact of operating frequency on throughput.
BN core stages | Operating frequency decrement (%) | Overall throughput increment (%)

This happens due to the delay caused by the storage buffers used in the processing path of the BN process. The critical path delay was optimized by using the multi-cycle technique. In this technique, the simple BN methods such as FL and TU use a single cycle, while other critical methods such as TR, TB, and EGk have multi-cycle constraints [12]. The processing delay of each SE (16 bits) at the SIPO buffer is shown in Fig. 23. The SIPO buffer stage requires 16 clock cycles to process a single SE in a single-core BN. This delay keeps increasing along with the degree of parallel processing (five cores). The PIPO stage, which is used to separate regular and bypass bins, also causes a delay. Thus, each parallel path receives data separately (Fig. 19).

Consequently, the proposed storage buffers limit the performance of the proposed system to a certain extent. Therefore, there is a trade-off between high throughput and high performance over increased SECC/BPCC. However, the requirements may vary with the application. For instance, high throughput is a critical factor in UHD applications and requires higher SECC/BPCC rather than a higher operating frequency [17].

Fig. 24. The hierarchy of the resources sharing in the single-core BN.

Table 13
Performance parameters of the proposed architecture.
Proposed four BN architectures | SBN (Single-core) | SBN_RS (Single-core) | MBN (Five-core) | MBN_RS (Five-core)
Area Saving (%) | 34% (spans SBN and SBN_RS) | | 25% (spans MBN and MBN_RS) |
FoM | 72.9 | 112 | 117 | 157
Design efficiency | 0.59 | 0.80 | 0.43 | 0.46

6.2. Hardware area and performance analysis

A 4-level hierarchy of the resource-sharing technique has been employed in the single-core architecture, as shown in Fig. 24. Hence, a significant impact on the area has been noticed between SBN and SBN_RS (Table 9). It is also observed that the single-SE versions (SBN and SBN_RS) have almost five times fewer gates than the multi-SE versions (MBN and MBN_RS). Nevertheless, the multi-SE version architectures can process up to five SEs/cycle.
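The 16-cycle SIPO latency per 16-bit SE mentioned above can be pictured with a small behavioral model. This is a minimal sketch under stated assumptions, not the paper's RTL: it assumes one bit is shifted in per clock, MSB first, and the class and function names are hypothetical.

```python
class SipoBuffer:
    """Serial-in, parallel-out shift register for one fixed-width word."""

    def __init__(self, width=16):
        self.width = width
        self.shift_reg = 0
        self.count = 0

    def clock(self, bit):
        """Shift in one bit (MSB first); return the word once it is complete."""
        mask = (1 << self.width) - 1
        self.shift_reg = ((self.shift_reg << 1) | (bit & 1)) & mask
        self.count += 1
        if self.count == self.width:
            word, self.shift_reg, self.count = self.shift_reg, 0, 0
            return word
        return None


def load_se(buffer, se_value):
    """Feed one SE serially; return (parallel word, clock cycles used)."""
    word, cycles = None, 0
    for shift in range(buffer.width - 1, -1, -1):
        word = buffer.clock((se_value >> shift) & 1)
        cycles += 1
    return word, cycles


word, cycles = load_se(SipoBuffer(16), 0xABCD)
print(hex(word), cycles)
```

In this model the parallel word only becomes available on the sixteenth clock, which is the kind of buffering delay that has to be weighed against the gain from processing several SEs per cycle.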
Fig. 25. The timing diagram of the proposed BN design. A striped blue block denotes an idle state, whereas a white block represents the working state. The output
symbols in white blocks specify the status of the output.
The data in Table 9 for the proposed versions help in understanding the impact of the proposed resource-sharing (RS) technique on area and throughput, and the trade-off between them. The percentage area saving, Figure of Merit (FoM), and design efficiency (DE) of the proposed system are computed below, and Table 13 shows the performance parameters of the proposed architecture.

$$\text{Area Saving (\%)} = \frac{\text{Without RS} - \text{With RS}}{\text{Without RS}} \times 100 \tag{4}$$

$$\text{FoM} = \frac{\text{No. of bins processed per clock cycle (BPCC)}}{\text{Total gate count (Kgates)}} \times 100 \tag{5}$$

$$\text{Design Efficiency} = \frac{\text{Max. Throughput (Gbin/s)}}{\text{Total gate count (Kgates)}} \tag{6}$$

At the BN system top level, an average of 25% of the hardware area is saved using the resource-sharing technique. The MBN_RS achieves the highest FoM value (157) among all the above architectures. The design efficiency of the SBN_RS version is 0.80, which is 1.74 times that of the MBN_RS version (0.46). Therefore, any further increment in the core stage results in a decrease in the design efficiency.

6.3. Overall design performance evaluation

Furthermore, in this work, the proposed architecture and techniques have been analyzed on the statistics and characteristics of TrU SEs. These SEs account for a notable share of the total bins of CABAC (Fig. 1): 75% on average and 94% in the worst case [7,16]. Thus, efficient coding of this type of SE would contribute to the whole CABAC throughput. The proposed pairing-SEs scheme re-arranges the SEs to support the parallel architecture at the BN stage. In the present implementation, the 4 × 4 TrB contains seven types of SEs [16], and these are processed in two cycles rather than the original seven cycles, as illustrated in Fig. 25(a–e). The SIPO provides buffered SEs to the BN in a clock cycle. In the proposed architectures, the baseline single-core BN processes 1 SE/cycle, whereas the five-core BN processes 5 SEs/cycle in parallel. Therefore, the overall throughput is 5 SEs/cycle instead of 1 SE/cycle when the five-core BN is considered as the final version.

The timing diagram is shown in Fig. 25(a–e). The TrB (4 × 4) block is shown in Fig. 25(a), and the baseline single-core binarization table is shown in Fig. 25(b). The proposed five-core binarization table is shown in Fig. 25(c). Fig. 25(d) shows the timing diagram of a single-core BN with 1 SE/cycle, while Fig. 25(e) shows the timing diagram of a five-core BN with 5 SEs/cycle. The proposed five-core BN design can process five SEs in a clock cycle, which are fetched at the 2nd clock cycle, as
Table 14
Comparison of proposed work FPGA results with prior works.

| Parameter | [24] (2006) | [12] (2010) | [10] (2013) | [25] (2013) | [26] (2014) | [15] (2016) | [27] (2020) | This work (Single-SE) | This work (Multi-SE) |
|---|---|---|---|---|---|---|---|---|---|
| FPGA device | Virtex-II | Virtex-II | Virtex 6 | Virtex-II | Virtex 7 | Spartan 6 | Arria II | Artix 7 | Artix 7 |
| Slices used | 403 | 212 | 1196^a | 400 | 394 | 376 | 9833^a | 327 | 1694 |
| Max. frequency (MHz) | 185 | 247.5 | – | 145.15 | 267 | 287 | – | 384 | 124 |
| Throughput: max. SECC at the input | 1 | 1 | 4×4 | 1 | 1 | 1 | 4×4 | 1 | 5 |
| Throughput: avg. BPCC at the output | 2 | 0.42 | 1.18 | – | – | – | 13.08 | 1.40 | 11.16 |
| CABAC block | BN | BN | BN + CM | BN | BN | BN | BN + CM | BN | BN |
| Standard support | H.264 | H.264 | HEVC | H.264 | H.264 | H.264/HEVC | HEVC | H.264/HEVC/VVC | H.264/HEVC/VVC |

^a BN and CM combined.
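The metrics of Eqs. (4)–(6) and the relation between frequency, BPCC, and throughput can be checked with a short calculation. The sketch below assumes that throughput is the product of the clock frequency and the average BPCC, which is consistent with the reported 3.14 Gbin/s at 282 MHz for an average of 11.16 BPCC. The function names are hypothetical, and the gate counts passed to `area_saving_percent` are illustrative inputs chosen only to reproduce the stated 25% average saving; they are not values from Table 9.

```python
def throughput_gbin_per_s(freq_mhz: float, avg_bpcc: float) -> float:
    """Assumed relation: bins per second = clock frequency x average BPCC."""
    return freq_mhz * 1e6 * avg_bpcc / 1e9


def area_saving_percent(without_rs: float, with_rs: float) -> float:
    """Eq. (4): percentage of area saved by resource sharing."""
    return (without_rs - with_rs) / without_rs * 100


def figure_of_merit(bpcc: float, kgates: float) -> float:
    """Eq. (5): bins per clock cycle per Kgate, scaled by 100."""
    return bpcc / kgates * 100


def design_efficiency(throughput_gbin: float, kgates: float) -> float:
    """Eq. (6): maximum throughput (Gbin/s) per Kgate."""
    return throughput_gbin / kgates


# ASIC operating point: 282 MHz with the multi-SE average of 11.16 BPCC.
print(round(throughput_gbin_per_s(282, 11.16), 2))  # 3.15, close to the reported 3.14 Gbin/s

# FPGA operating points from Table 14 (single-SE and multi-SE versions).
print(round(throughput_gbin_per_s(384, 1.40), 2))   # 0.54
print(round(throughput_gbin_per_s(124, 11.16), 2))  # 1.38

# Hypothetical gate counts (Kgates), for illustration of Eq. (4) only.
print(area_saving_percent(10.0, 7.5))               # 25.0

# Ratio of the reported design efficiencies (SBN_RS vs. MBN_RS).
print(round(0.80 / 0.46, 2))                        # 1.74
```

On the FPGA figures, the multi-SE version trades a lower clock frequency for a higher BPCC and still comes out ahead in bins per second under this assumed relation.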
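Table 15
Comparison of proposed work ASIC with literature.

| Design parameters | [24] (2006) | [28] (2007) | [12] (2010) | [29] (2010) | [13] (2011) | [30] (2011) | [10] (2013) | [31] (2014) | [9] (2015) | [32] (2016) | [19] (2019) | [33] (2020) | [27] (2020) | This work (Single-SE) | This work (Multi-SE) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|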
CM blocks are combined and processed in parallel in those architectures. The design in [19] also offers a better BPCC, but its maximum throughput is lower than that of the proposed work. In terms of hardware, the proposed work is compared with works [13,30] because these works implemented the BN architecture at the 90 nm technology node (Fig. 26).

It is observed that the proposed design occupies less hardware area and achieves a better operating frequency due to the effective use of the resource-sharing technique and storage buffers. Sometimes, the extra hardware area depends on memories, which is common to all architectures. Using the storage buffers with parallelism is more efficient than the existing architectures. Furthermore, for better comparison, all the works are normalized using the design efficiency equation (Eq. (6)). The proposed design outperforms all the existing architectures. In addition, the proposed multi-SE architecture seamlessly supports more standards than the other architectures. To summarize, the present work achieves a smaller hardware area while meeting the throughput constraints of 8K UHD.

7. Conclusion

A high-throughput and area-efficient hardware architecture is proposed in this work to meet the requirements of UHD and beyond. The throughput improvement is achieved at the architectural level through parallel processing of multiple SEs, while area efficiency is achieved using the multi-level resource-sharing technique. The storage buffers and pairing SEs are used to support the parallel processing. In addition, the computational demand of the binarization, the probability of occurrence of SEs, and the need for hardware acceleration are also explored. The design uses an adaptive binarization method and scalability in the core stages, which supports the CABAC process used in the popular video coding standards (H.264/HEVC/VVC). The proposed binarization block is designed in Verilog and implemented on both FPGA and ASIC platforms. The proposed architecture offers a throughput of 3.14 Gbin/s at an operating frequency of 282 MHz. The hardware area is reduced significantly in comparison to state-of-the-art architectures. The proposed binarization architecture is suitable for UHDTV applications.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was carried out with the resources of the “Special Manpower Development Program for Chip to System Design (SMDP-C2SD)” project funded by the Ministry of Electronics and Information Technology (MeitY), Government of India.

References

[1] Cisco Visual Networking Index (VNI), Forecast and Trends, 2017–2022, 2019 white paper Feb;17:13.
[2] T. Wiegand, G.J. Sullivan, G. Bjøntegaard, A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Trans. Circ. Syst. Video Technol. 13 (2003) 560–576, https://doi.org/10.1109/TCSVT.2003.815165.
[3] G.J. Sullivan, J.R. Ohm, W.J. Han, T. Wiegand, Overview of the high efficiency video coding (HEVC) standard, IEEE Trans. Circ. Syst. Video Technol. 22 (2012) 1649–1668, https://doi.org/10.1109/TCSVT.2012.2221191.
[4] B. Bross, Y.K. Wang, Y. Ye, S. Liu, J. Chen, G.J. Sullivan, J.R. Ohm, Overview of the versatile video coding (VVC) standard and its applications, IEEE Trans. Circ. Syst. Video Technol. 31 (2021) 3736–3764, https://doi.org/10.1109/TCSVT.2021.3101953.
[5] D. Marpe, H. Schwarz, T. Wiegand, Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard, IEEE Trans. Circ. Syst. Video Technol. 13 (2003) 620–636, https://doi.org/10.1109/TCSVT.2003.815173.
[6] M. Orlandić, K. Svarstad, An efficient hardware architecture of CAVLC encoder based on stream processing, Microelectron. J. 67 (2017) 43–49, https://doi.org/10.1016/j.mejo.2017.07.013.
[7] V. Sze, M. Budagavi (Eds.), High-Efficiency Video Coding (HEVC)-Algorithms and Architectures, 2014, https://doi.org/10.1007/978-3-319-06895-4. G.J.S. Editors.
[8] VVC/H.266: Versatile video coding standard ISO/IEC 23090-3, ISO/IEC JTC 1, H.266 (08/2020). https://www.itu.int/rec/T-REC-H.266-202008-I/en.
[9] D. Zhou, J. Zhou, W. Fei, S. Goto, Ultra-high-throughput VLSI architecture of H.265/HEVC CABAC encoder for UHDTV applications, IEEE Trans. Circ. Syst. Video Technol. 25 (2015) 497–507, https://doi.org/10.1109/TCSVT.2014.2337572.
[10] B. Peng, D. Ding, X. Zhu, L. Yu, A hardware CABAC encoder for HEVC, in: Proc. IEEE Int. Symp. Circ. Syst., 2013, pp. 1372–1375, https://doi.org/10.1109/ISCAS.2013.6572110.
[11] B. Vizzotto, V. Mazui, S. Bampi, Area efficient and high throughput CABAC encoder architecture for HEVC, in: Proc. IEEE Int. Conf. Electron. Circuits, Syst. 2016-March, 2016, pp. 572–575, https://doi.org/10.1109/ICECS.2015.7440381.
[12] A.L.D.M. Martins, V. Rosa, S. Bampi, A low-cost hardware architecture binarizer design for the H.264/AVC CABAC entropy coding, in: 2010 IEEE Int. Conf. Electron. Circuits, Syst. ICECS 2010 - Proc., 2010, pp. 392–395, https://doi.org/10.1109/ICECS.2010.5724535.
[13] Y. Liu, T. Song, T. Shimamoto, High performance binarizer for H.264/AVC CABAC, in: 2011 Int. Conf. Electr. Inf. Control Eng. ICEICE 2011 - Proc, 2011, pp. 2237–2240, https://doi.org/10.1109/ICEICE.2011.5777972.
[14] C. De Matos Alonso, F.L. Livi Ramos, B. Zatt, M. Porto, S. Bampi, Low-power HEVC Binarizer architecture for the CABAC block targeting UHD video processing, in: Proc. - 30th Symp. Integr. Circuits Syst. Des. Chip Sands, SBCCI, 2017, pp. 30–35, https://doi.org/10.1145/3109984.3109988.
[15] N. Neji, M. Jridi, A. Alfalou, N. Masmoudi, FPGA implementation of improved binarizer design for context-based adaptive binary arithmetic coder, in: IPAS 2016 - 2nd Int. Image Process. Appl. Syst. Conf., 2017, pp. 1–4, https://doi.org/10.1109/IPAS.2016.7880123.
[16] V. Sze, M. Budagavi, High throughput CABAC entropy coding in HEVC, IEEE Trans. Circ. Syst. Video Technol. 22 (2012) 1778–1791, https://doi.org/10.1109/TCSVT.2012.2221526.
[17] D.-L. Tran, V.-H. Pham, H.K. Nguyen, X.-T. Tran, A survey of high-efficiency context-adaptive binary arithmetic coding hardware implementations in high-efficiency video coding standard, VNU J. Sci. Comput. Sci. Commun. Eng. 35 (2019) 1–22, https://doi.org/10.25073/2588-1086/vnucsce.233.
[18] J. Lainema, K. Ugur, A. Hallapuro, in: Single Entropy Coder for HEVC with a High Throughput Binarization Mode, JCTVC-G569, 7th Joint Collaborative Team on Video Coding (JCT-VC) Meeting, JCT-VC Document Management System, Geneva, Switzerland, 2011. Nov. http://phenix.it-sudparis.eu/jct/. (Accessed 5 October 2021).
[19] F.L.L. Ramos, A.V.P. Saggiorato, B. Zatt, M. Porto, S. Bampi, Residual syntax elements analysis and design targeting high-throughput HEVC CABAC, IEEE Trans. Circuits Syst. I Regul. Pap. 67 (2020) 475–488, https://doi.org/10.1109/TCSI.2019.2932891.
[20] F. Bossen, Common test conditions and software reference configurations, JCTVC-L1100 12 (7) (2013). Jan 23, https://hevc.hhi.fraunhofer.de/. (Accessed 5 October 2021).
[21] N. Mamidi, S.K. Gupta, V. Bhadauria, Design and implementation of parallel bypass bin processing for CABAC encoder, Adv. Electr. Electron. Eng. 19 (2021) 243–257, https://doi.org/10.15598/aeee.v19i3.4010.
[22] X. Tian, T.M. Le, Y. Lian, Entropy Coders of the H.264/AVC Standard: Algorithms and VLSI Architectures, Springer Science & Business Media, 2010, https://doi.org/10.1007/978-3-642-14703-6. Oct 17.
[23] DigilentInc, Nexys4 DDR TM FPGA Board Reference Manual, Nexys 4 DDR Artix-7 FPGA Train. Board Recomm. ECE Curric.- Digilent, 2014, pp. 1–29. https://reference.digilentinc.com/_media/nexys4-ddr:nexys4ddr_rm.pdf. (Accessed 5 October 2021).
[24] R.R. Osorio, J.D. Bruguera, High-throughput architecture for H.264/AVC CABAC compression system, IEEE Trans. Circ. Syst. Video Technol. 16 (2006) 1376–1384, https://doi.org/10.1109/TCSVT.2006.883508.
[25] N. Jarray, S. Dhahri, M. Elhaji, A. Zitouni, A high level hardware architecture binarizer for H.264/AVC CABAC encoder, in: Proceedings Engineering & Technology- vol. 3, 2013, pp. 216–218.
[26] A. Ben Hmida, S. Dhahri, A. Zitouni, A hardware architecture binarizer design for the H.264/AVC CABAC entropy coding, 2014, in: 2014 Int. Conf. Electr. Sci. Technol. Maghreb, Cist., 2014, pp. 1–4, https://doi.org/10.1109/CISTEM.2014.7076749.
[27] G. Pastuszak, Multisymbol architecture of the entropy coder for H.265/HEVC video encoders, IEEE Trans. Very Large Scale Integr. Syst. 28 (2020) 2573–2583, https://doi.org/10.1109/TVLSI.2020.3016386.
[28] P.S. Liu, J.W. Chen, Y.L. Lin, A hardwired context-based adaptive binary arithmetic encoder for H.264 advanced video coding, in: 2007 Int. Symp. VLSI Des. Autom. Test, VLSI-DAT 2007 - Proc. Tech. Pap., 2007, p. 936349, https://doi.org/10.1109/VDAT.2007.373239.
[29] J.W. Chen, L.C. Wu, P.S. Liu, Y.L. Lin, A high-throughput fully hardwired CABAC encoder for QFHD H.264/AVC main profile video, IEEE Trans. Consum. Electron. 56 (2010) 2529–2536, https://doi.org/10.1109/TCE.2010.5681137.
[30] W. Fei, D. Zhou, S. Goto, A 1 Gbin/s CABAC encoder for H.264/AVC, in: 2011 19th Eur. Signal Process. Conf., 2011, pp. 1524–1528, https://doi.org/10.5281/zenodo.42535.
[31] D.H. Pham, J. Moon, S. Lee, Hardware implementation of HEVC CABAC binarizer, J. IKEEE. 18 (2014) 356–361, https://doi.org/10.7471/ikeee.2014.18.3.356.
[32] D. Kim, J. Moon, S. Lee, Hardware implementation of HEVC CABAC encoder, ISOCC 2015 - Int. SoC Des. Conf. SoC Internet Everything (2016) 183–184, https://doi.org/10.1109/ISOCC.2015.7401779.
[33] D.L. Tran, X.T. Tran, D.H. Bui, C.K. Pham, An efficient hardware implementation of residual data binarization in HEVC CABAC encoder, Electronics 9 (4) (2020) 684, https://doi.org/10.3390/electronics9040684. Apr.
[34] Grois D, Nguyen T, Marpe D. Performance comparison of AV1, JEM, VP9, and HEVC encoders. In Applications of Digital Image Processing XL 2018 Feb 8 (vol. 10396, p. 103960L). International Society for Optics and Photonics.
[35] F.L.L. Ramos, B. Zatt, M.S. Porto, S. Bampi, Novel multiple bypass bin scheme and low-power approach for HEVC CABAC binary arithmetic encoder, J. Integr. Circuits Syst. 13 (2018) 1–11, https://doi.org/10.29292/JICS.V13I3.3.