Mamidi Paper 3

The document presents a high-throughput and area-efficient hardware architecture for the Binarization process in Context-Adaptive Binary Arithmetic Coding (CABAC) used in video compression for UHD applications. It utilizes parallel processing to enhance throughput, achieving 3.14 Gbin/s at 282 MHz while minimizing hardware area through resource sharing techniques. The architecture is implemented in Verilog HDL and verified on FPGA and ASIC, demonstrating adaptability and scalability for various video standards.


Microelectronics Journal 123 (2022) 105425

Contents lists available at ScienceDirect

Microelectronics Journal
journal homepage: www.elsevier.com/locate/mejo

High-throughput, area-efficient hardware architecture of CABAC-Binarization for UHD applications
Mamidi Nagaraju *, Santosh Kumar Gupta, Vijaya Bhadauria
Department of Electronics and Communication Engineering, Motilal Nehru National Institute of Technology Allahabad, Prayagraj, 211004, India

ARTICLE INFO

Keywords:
Entropy Coding
Context-Adaptive Binary Arithmetic Coding (CABAC)
Binarization
High-Efficiency Video Coding (HEVC)
Versatile Video Coding (VVC)
Application Specific Integrated Circuit (ASIC)
Field Programmable Gate Array (FPGA)
Ultra-High-Definition (UHD)

ABSTRACT

Video compression imposes a high throughput requirement on the latest video encoders (HEVC and VVC), and CABAC is the entropy coding method primarily used in such applications. Binarization is the first and a vital sub-block of CABAC, and it requires high throughput and better performance. A high-throughput and area-efficient Binarization is proposed in this work. A parallel-processing hardware architecture that processes five input symbols per clock cycle is used, together with probability estimation, to obtain high throughput. The resource sharing technique is used to optimize the utilization of hardware resources. Further, a pairing-SEs scheme and storage buffers have been incorporated in the data path for better performance. The Binarization architecture is designed in Verilog HDL and verified on an Artix-7 FPGA. It is also implemented as an ASIC using 90 nm technology. The proposed design achieves a maximum throughput of 3.14 Gbin/s at 282 MHz and consumes considerably less hardware area than other architectures. The proposed Binarization is adaptive, scalable, and versatile in functionality.

1. Introduction

In recent years, video processing has become more significant due to the vast volume of video data being transmitted or stored. According to Cisco's Visual Networking Index (VNI), video accounted for 75% of all internet traffic in 2017, and this share is predicted to increase to 82% by 2022 [1]. Advanced video compression techniques are therefore being used to reduce this vast amount of data. However, next-generation video compression technology is expected to support UHD and beyond formats (4K–16K, 360° videos) at high frames per second (fps). To store and transmit video data, high-throughput and area-efficient compression algorithms are required.

In this effort, the International Telecommunication Union–Telecommunication (ITU-T)/Video Coding Experts Group (VCEG), the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC), and the Moving Picture Experts Group (MPEG) are the international organizations involved in the development of the video coding standards. The H.264 or AVC (Advanced Video Coding) [2], HEVC/H.265 [3], and VVC/H.266 [4] video standards were jointly developed by ITU-T and ISO/IEC. HEVC delivers 50% higher coding efficiency than its predecessor AVC/H.264. VVC is the successor of HEVC and is expected to achieve 30% more coding efficiency than HEVC. Presently, all these standards are being used in various consumer video applications. However, the basic structure of all these standards (H.264/H.265/H.266) is the same. The general block diagram of the HEVC encoder is shown in Fig. 1.

In general, video encoding is done on the basis of basic blocks known as coding tree units (CTUs), which come in different sizes (4 × 4 to 256 × 256) and are obtained by dividing each frame. These CTUs are processed through several internal coding tools, viz. Transform and Quantization, Motion estimation, Intra-frame prediction, Deblocking, Sample Adaptive Offset (SAO) filters, and Context-based Adaptive Binary Arithmetic Coding (CABAC). The final stage of the encoder is CABAC [5], which performs entropy coding. It is a lossless compression technique that eliminates statistical redundancy and helps to increase compression efficiency. CABAC is used to compress symbols known as syntax elements (SEs) into encoded bits. It is an entropy coding method that resembles Context-based Adaptive Variable Length Coding (CAVLC) [6], but it provides about 9–14% higher coding efficiency than CAVLC [7]. The development of video standards, their entropy coding, and their major support are shown in Table 1.

CABAC has emerged as the most efficient entropy coding approach for next-generation video standards. Due to restricted parallelization opportunities, CABAC is regarded as one of the key throughput bottlenecks in the video encoder, especially for high-resolution and high-frame-rate videos [7]. This is because of the CABAC algorithm's

* Corresponding author.
E-mail addresses: rel1651@mnnit.ac.in (M. Nagaraju), skg@mnnit.ac.in (S.K. Gupta), vijaya@mnnit.ac.in (V. Bhadauria).

https://doi.org/10.1016/j.mejo.2022.105425
Received 15 October 2021; Received in revised form 27 January 2022; Accepted 13 March 2022
Available online 16 March 2022
0026-2692/© 2022 Elsevier Ltd. All rights reserved.
M. Nagaraju et al. Microelectronics Journal 123 (2022) 105425

Fig. 1. The block diagram of the HEVC encoder [3].

extremely serial data dependencies, which are produced by multiple feedback loops. The schematic diagram of CABAC in HEVC is shown in Fig. 2, with the essential functions indicated in boxes and their data dependencies (red pointers). The serial processing of binary symbols (bins), the context adaptivity, and the bin-by-bin dependence make it difficult to process multiple bits concurrently.

Table 1
Video standards and their entropy coding.

| Standards | H.264/AVC (2003) | H.265/HEVC (2013) | H.266/VVC (2020) |
| --- | --- | --- | --- |
| Entropy type | CAVLC & CABAC | CABAC | CABAC |
| Main support | 2K/4K@30fps | 4K/8K@60fps | beyond 8K@120fps |

The CABAC is typically made up of three primary processing blocks: (i) Binarization (BN), (ii) Context Modeling (CM), and (iii) Binary Arithmetic Encoding (BAE). The CABAC input data symbols, such as encoder control data, quantized transform coefficients, intra-frame prediction data, filter control data, and motion data, originate from the previous encoding processes as SEs (Fig. 1).

The BN block is used to convert the data of non-binary SEs into bins. These bins are divided into two primary categories, viz. (i) regular and (ii) bypass bins. The regular bins, which have higher data dependency, i.e., higher probability, undergo the CM, whereas bins with lower probability are categorized as bypass bins and skip the CM. The bins are processed using one of two data paths: the regular bin path or the bypass bin path. Finally, the BAE is used to compress the bins based on the estimated probabilities using a regular arithmetic encoder or bypass

Fig. 2. Essential components of the CABAC process [5,7].


Fig. 3. General block diagram of the BN block.

arithmetic encoder using the recursive interval division principle [7].

The BN is the first block in the CABAC encoding technique. Because the BAE can only compress binary data, the BN is responsible for turning each non-binary SE into a bin format. This step is required because the incoming SEs are mostly non-binary values. However, the BN becomes a critical throughput bottleneck due to its serial processing of the input SEs (Fig. 2). Since the BN is the only input source of the CM and BAE, their performance strongly depends on it, and it influences the overall throughput of the CABAC. Hence, the BN is considered a significant and critical part of the CABAC. The present work focuses on the BN hardware architecture, design, and implementation. It targets high throughput, better performance, and area efficiency for specific UHD applications. The throughput improvement is achieved at the architectural level with multiple-SE (Multi-SEs) parallel processing, and the area efficiency with the resource sharing (RS) technique. The work also analyzes the SEs/bin process to arrive at the hardware architecture. The main contributions are as follows:

(i) Probability analysis of SEs at the BN block: The consecutive occurrence (probability) of SEs and bins at the BN stage has been investigated using UHD videos.

(ii) Development of a high-throughput and area-efficient architecture: Multi-SEs processing (to improve the throughput) and a resource sharing technique (to reduce the hardware area) are proposed. A pairing-SEs scheme and storage buffers are added for better performance. The BN has been designed to be multi-standard, adaptable, and scalable.

(iii) FPGA prototype and ASIC implementation: The design is modeled in Verilog HDL, simulated with a test bench, and prototyped on the 28 nm Artix-7 FPGA. The ASIC has been implemented using 90 nm technology.

The rest of the paper is organized as follows: Section 2 provides a brief summary of BN methods and a literature review. The probability analysis of SEs and the bin distribution using test sequences are presented in Section 3. In Section 4, the proposed hardware architecture is described. Section 5 describes the experimental test setup and hardware implementation. Section 6 presents the results and discussion, and finally, Section 7 brings the paper to a conclusion.

Table 2
Basic binarization methods [7,8].

| N | U | TU (cMax = 7) | FL (cMax = 7) | TR (cMax = 7, cRiceParam = 1) | TB (cMax = 7) | EGk (k = 0) |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 000 | 00 | 000 | 1 |
| 1 | 10 | 10 | 001 | 01 | 001 | 010 |
| 2 | 110 | 110 | 010 | 100 | 010 | 011 |
| 3 | 1110 | 1110 | 011 | 101 | 011 | 00100 |
| 4 | 11110 | 11110 | 100 | 1100 | 100 | 00101 |
| 5 | 111110 | 111110 | 101 | 1101 | 101 | 00110 |
| 6 | 1111110 | 1111110 | 110 | 1110 | 110 | 00111 |
| 7 | 11111110 | 1111111 | 111 | 1111 | 111 | 0001000 |

2. Binarization methods and literature survey

The BN takes an SE_value as input and converts it into bins using a pre-defined standard function [8]. The video standards use several BN methods depending on the SE_type to be binarized. The primary purpose of binarization is to generate a shorter bit representation for the SEs that occur more frequently. The total bins generated for a given SE are called a "bin string."

2.1. Binarization methods

Fig. 3 shows the general representation of the BN process commonly used in the CABAC encoder. It consists of a controller and single, combined, and custom format blocks. The single format block contains Unary (U), Truncated Unary (TU), Fixed-Length (FL), Truncated Rice (TR), Truncated Binary (TB), and k-th order Exp-Golomb (EGk) coding. The combined format uses a combination of one or more BN methods where the prefix and suffix are binarized differently.

For instance, 'coeff_abs_level_remaining' (CALR) uses (TU + FL) (prefix) + EG0 (suffix), and 'cu_qp_delta_abs' (QP Delta) uses TU (prefix) + EG0 (suffix). The custom format block covers the Intra Prediction mode (intra_chroma_pred_mode), Inter Prediction mode (inter_pred_idc), and Part Mode (part_mode) SEs, which use pre-defined look-up tables. The bins are combined by the controller block to generate the bin string [7].
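As a software illustration of the single-format methods, the following Python sketch reproduces the bin strings of Table 2. It is only a behavioral model under stated assumptions, not the Verilog design of this work: the function names are hypothetical, the TR suffix is always emitted (which matches the TR column of Table 2), TB follows the textbook truncated-binary construction, and EG0 uses a prefix of "0" bins terminated by a "1", as in the EGk column of Table 2.

```python
from math import ceil, log2


def unary(n):
    """U: n bins of '1' followed by a terminating '0'."""
    return "1" * n + "0"


def truncated_unary(n, c_max):
    """TU: same as U, but the terminator is dropped when n equals c_max."""
    return "1" * n if n == c_max else unary(n)


def fixed_length(n, c_max):
    """FL: n written as an unsigned binary number on ceil(log2(c_max + 1)) bins."""
    width = ceil(log2(c_max + 1))
    return format(n, "0{}b".format(width))


def truncated_rice(n, c_max, rice_param):
    """TR: TU prefix of (n >> rice_param), then rice_param low-order bins (assumed always present)."""
    prefix = truncated_unary(n >> rice_param, c_max >> rice_param)
    if rice_param == 0:
        return prefix
    low_bits = n & ((1 << rice_param) - 1)
    return prefix + format(low_bits, "0{}b".format(rice_param))


def truncated_binary(n, c_max):
    """TB: textbook truncated binary code for an alphabet of c_max + 1 symbols (assumption)."""
    size = c_max + 1
    k = size.bit_length() - 1            # floor(log2(size))
    short_codes = (1 << (k + 1)) - size  # number of symbols coded with k bins
    if n < short_codes:
        return format(n, "0{}b".format(k)) if k else ""
    return format(n + short_codes, "0{}b".format(k + 1))


def exp_golomb0(n):
    """EG0: M zeros, a '1' separator, then (n + 1 - 2**M) on M bins, M = floor(log2(n + 1))."""
    m = (n + 1).bit_length() - 1
    info = n + 1 - (1 << m)
    return "0" * m + "1" + (format(info, "0{}b".format(m)) if m else "")


if __name__ == "__main__":
    # Print the rows of Table 2 for cMax = 7 and cRiceParam = 1.
    for n in range(8):
        print(n, unary(n), truncated_unary(n, 7), fixed_length(n, 7),
              truncated_rice(n, 7, 1), truncated_binary(n, 7), exp_golomb0(n))
```

In this model, TU reuses the U function and TR reuses TU, so the software structure loosely mirrors the idea of deriving one method from another.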


Table 3
List of SEs as input to the CABAC block [7,8].

| Coding units | Type of data | SE_type |
| --- | --- | --- |
| Coding Tree Unit (CTU) | Block structure and quantization | part_mode, pcm_flag, skip_flag, split_cu_flag, cu_transquant_bypass_flag, pred_mode_flag, cu_qp_delta_abs, end_of_slice_flag, cu_qp_delta_sign. |
| Prediction Unit (PU) | Intra mode coding | mpm_idx, prev_intra_luma_pred_flag, intra_chroma_pred_mode, rem_intra_luma_pred_mode. |
| Prediction Unit (PU) | Motion data | merge_idx, ref_idx_l0, ref_idx_l1, inter_pred_idc, abs_mvd_greater0_flag, abs_mvd_greater1_flag, merge_flag, abs_mvd_minus2, mvp_l0_flag, mvp_l1_flag, mvd_sign_flag. |
| Transform Unit (TrU) | Transform coefficient coding | split_transform_flag, no_residual_syntax_flag, cbf_luma, transform_skip_flag, cbf_cb, cbf_cr, last_significant_coeff_y_prefix, last_significant_coeff_x_prefix, last_significant_coeff_y_suffix, last_significant_coeff_x_suffix, coded_sub_block_flag, significant_coeff_flag, coeff_abs_level_greater2_flag, coeff_abs_level_greater1_flag, coeff_sign_flag, coeff_abs_level_remaining. |
| Loop Filter (LF) | Sample adaptive offset (SAO) parameters | sao_type_idx_luma, sao_merge_left_flag, sao_merge_up_flag, sao_offset_abs, sao_type_idx_chroma, sao_offset_sign, sao_eo_class_chroma, sao_band_position, sao_eo_class_luma. |

Fig. 4. TB Binarization flowchart [8].

A specific BN method is used for each SE, and the resulting bin strings have different lengths. The standard defines which type of binarization has to be used for a particular SE [8]. Table 2 shows examples of the primary BN methods used in AVC, HEVC, and VVC. Let N be an unsigned value to be encoded as an SE. The U method maps N to N bins of "1" followed by a last bin "0" as the terminator. The TU is similar to U, with the difference that when N = cMax, all bins are "1". Here, "cMax" is a variable that represents the maximum number of bins that are allowed to be generated. The U and TU offer short initial codes with linear growth for larger SEs. The FL uses a fixed number of bins, which has to be known on both the encoder and decoder sides and is given by:

$$\mathrm{FL} = \lceil \log_2(\mathrm{cMax} + 1) \rceil \tag{1}$$

This method is beneficial when the converted SE is a flag or is uniformly distributed. The majority of SEs undergo the FL method [8]. The TR method uses a parameterized Rice code composed of a prefix and a suffix. The prefix is a TU string of the value (N >> k), where k indicates the Rice parameter (cRiceParam). The suffix is an FL binary representation of the least significant bits of N. The TB process that produces the bin string of an SE is shown in Fig. 4.

The TB is used for only a few SEs, such as "palette_idx_idc". The EGk is a variable-length code (VLC), and the bin length of the codeword increases exponentially with the 'k' value. It consists of prefix and suffix parts. The codeword is led by M prefix bits and a separator bit ("1" or "0"), followed by the suffix code word that contains the information. It is given by:

$$\underbrace{0, 0, 0, \ldots}_{M}\; 1\; [\mathrm{Info}] \tag{2}$$

where $M = \lfloor \log_2(N + 1) \rfloor$ and $\mathrm{Info} = N + 1 - 2^{M}$. The length of each EGk codeword is $2M + 1$. The EGk is mainly utilized for the SEs of motion vector differences (MVDs).

2.2. Literature survey

The CABAC algorithm has data dependencies due to its serial processing; hence, it is difficult to process more bins per unit time to meet real-time UHD constraints. In some cases, performance is achieved at the cost of increased computational complexity. Thus, CABAC throughput enhancement has been the main focus of researchers for several years. CABAC has been implemented on different hardware platforms, including ASICs and FPGAs. The parallel processing approach has been used to improve performance and enhance throughput at the expense of increased hardware costs. The state-of-the-art design criteria are throughput, hardware cost, and power consumption. However, these criteria conflict with each other and have to be traded off when designing for specific applications [7]. The prior works utilized several design strategies, such as exploiting pipelining, parallelism, and multi-core architectures at the CABAC system level. Some researchers improved the CABAC throughput by optimizing at the sub-block level, including the BN, CM, and BAE. The BN is the first component in the CABAC architecture, where SEs from previous stages are analyzed and converted into bins that feed the BAE. As a result, the BN's performance significantly impacts the entire CABAC.

In fact, only a few authors have proposed hardware architectures and implementation techniques exclusively for the BN sub-block [9–16]. The work in Ref. [9] designed a multi-BN architecture to keep the average processing capability of the BN higher than that of the BAE at all times. An improvement for the CALR SE_type, obtained by parameterizing its calculation with cRiceParam, which adaptively updates after each coded value, was reported in Ref. [10]. A heterogeneous BN architecture was reported to support UHD applications and to reduce the area by 10 Kgates compared with other architectures [11]. A multi-cycle BN architecture was proposed in Ref. [12]. A scheme for grouping SEs at the BN, which can process 2 SEs per clock cycle (SECC), was reported in Ref. [13]. Ref. [14] proposed a parallel architecture to meet the throughput requirements of UHD video sequences, which included the operand isolation low-power technique and saved roughly 22% of the power consumption. Ref. [15] proposed a BN architecture with regularity and modularity features, which can support multiple standards (HEVC/H.264). Some researchers [16] also focused on improving the throughput of the Transform Unit coefficients, which account for ~94% (worst case) of the total bins.

However, most of the above works offer high throughput and high performance at high hardware costs, which is inefficient for UHD applications. Ideally, the CABAC-BN design should optimize all three parameters, viz., throughput, performance, and hardware area. In reality, different applications demand specific requirements for the implementation. Focusing primarily on high throughput for UHD results in large hardware area consumption [17]. The correct approach is a joint optimization that balances the trade-offs subject to application constraints. Therefore, a high throughput, area-efficient


Fig. 5. Distribution of SEs and average bins per pixel (a–d) [18].

hardware architecture of CABAC-Binarization has been proposed in this work, which offers better performance and meets the requirements of UHD and beyond. This paper presents an architecture for CABAC-BN that achieves the highest possible throughput using massive parallelism with balanced storage buffers. At the same time, the resource sharing technique helps save hardware area.

3. Probability analysis of SEs/Bins

The workload of the BN in terms of SE types is a concerning aspect


that needs to be investigated to design efficient hardware. Table 3 shows the different types of SEs at the BN stage. All these SEs are specified in the standard [8], and the same names are used in this paper. The readable names of the SEs, written with underscores, reflect their respective functionalities.

In the work reported in Ref. [18], an experimental test and analysis were carried out on the number of bins per pixel using the HEVC Model (HM) reference software. This data also gives information about the distribution of SEs and bins per pixel at the BN block for the test sequence. The experimental data is taken from Ref. [18] and reproduced here in Fig. 5. Two different configurations are considered, viz., Low Delay (LD) and Random Access (RA), at two quantization parameters (22 and 37). Table 4 and Fig. 5 show that the most frequent and most significant SEs are related to the transform coefficients.

Table 4
Significance of the bin contributions in the HEVC encoder [19].

| Bin types | Low Delay (LD) | Random Access (RA) | Worst case |
| --- | --- | --- | --- |
| Coding tree unit (CTU) | 14.2% | 10.7% | 0.6% |
| Prediction unit (PU) | 20.3% | 18.7% | 5.0% |
| Transform Unit (TrU) | 64.5% | 69.9% | 94.0% |
| Loop Filter (LF) | 1.1% | 0.7% | 0.8% |

The LD and RA columns are common test conditions.

Furthermore, the Transform Unit (TrU) related bins occupy a notable share of the CABAC data, viz., approximately 67% on average and 94% in the worst case, as shown in Table 4. Hence, it is clear that the SEs related to the transform coefficients contribute significantly to the final generated bitstream. Therefore, the throughput of such SEs dramatically influences the whole CABAC throughput.

The standard specifies that the transform coefficients are assembled from 4 × 4 blocks [7]. For instance, a typical 4 × 4 Transform Block (TrB) consisting of an array of signed 16-bit integer decimal coefficients can be seen in Fig. 6 (a to e).

The SEs extracted from the TrB are as follows:

• last_sig_coeff_x_prefix (LASTx) and last_sig_coeff_y_prefix (LASTy): These SEs represent the position of the last significant coefficient within the TrB along the x and y axes, respectively.

Fig. 6. General processing of transform coefficient and SEs generation (a to e).

Fig. 7. Test sequences used for the probability occurrence estimation of SEs.


Fig. 8. The consecutive occurrence of (a) regular and (b) bypass bins.

• sig_coeff_flag (SIG): This SE indicates whether a coefficient is significant, yes '1' or no '0', considering only the coefficients that occur before the LAST.
• coeff_abs_level_greater1_flag (COEFF1): For every significant transformed residue before the LAST, this SE indicates whether its absolute value is greater than one, yes '1' or no '0'.
• coeff_abs_level_greater2_flag (COEFF2): For a single transformed residue before the LAST, this SE indicates whether its absolute value is greater than 2 or not.
• coeff_sign_flag (SIGN): For every significant coefficient, this SE indicates whether it is positive or negative.
• coeff_abs_level_remaining (CALR): All significant transformed residues undergo a decrement by a base value, which depends on whether they have a valid associated COEFF1 and/or COEFF2.

The occurrence of most bins may vary according to the properties of a given video sequence. However, using the HM reference software, a probability analysis of the SE-to-bin occurrence in the standard test sequences is possible [7].

This section presents the average and worst-case throughput analysis for the HEVC transform coefficient coding method. Three standard test sequences [34] are taken in two different HEVC configurations, (i) Low Delay (LD) and (ii) Random Access (RA) [20], at the bottom (22) and top (37) of the recommended Quantization Parameters (QPs) [35]. These two configurations (LD and RA) are chosen because they are used in UHDTV applications in the consumer electronics industry [7,9]. Test sequences with different properties in the image content are considered, as shown in Fig. 7. The processing is conducted on a high-end workstation (Intel Xeon CPU E5–1607 v3 @ 3.10 GHz, six cores, 48 GB RAM). The results presented in Fig. 8(a and b) show the occurrence of regular and bypass bins.

It is observed that most of the SEs are encoded by the regular bin path. On average, 12.12 bins occur consecutively (6.62 regular and 5.50 bypass) at a QP of 22, whereas 8.99 bins (5.16 regular and 3.83 bypass) occur at a QP of 37. Furthermore, due to the standard CABAC data processing (Fig. 6), it is noticed that the regular bins occur more often, and more often in groups, than the bypass bins. The work in Ref. [17] reported the total percentage of bin types in the bitstream, as shown in Table 5.

Table 5
Statistics of bin types in the CABAC process [17].

| Standard | Common configuration | Regular bin (%) | Bypass bin (%) |
| --- | --- | --- | --- |
| HEVC | Low delay (LD) | 78.2 | 20.8 |
| HEVC | Random access (RA) | 73.0 | 26.4 |

Nevertheless, the occurrence of the majority of the bins (regular or bypass) may vary according to the properties of a given video sequence.
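The transform-coefficient SEs listed earlier in this section (LAST, SIG, COEFF1, COEFF2, SIGN, and CALR) can be illustrated with a small Python sketch. It is a simplified reading of the bullet definitions, not the normative HEVC procedure. Several points are assumptions made only for illustration: the function name `extract_trb_ses` is hypothetical, a raster scan stands in for the real coefficient scan, the significant coefficients are visited from the LAST position backwards, a single COEFF2 flag is attached to the first coefficient whose magnitude exceeds one, and any per-block limit on the number of flags is ignored.

```python
def extract_trb_ses(block):
    """Derive illustrative SE values from a 4x4 block of signed coefficients."""
    # Assumed raster scan; a real encoder uses diagonal, horizontal, or vertical scans.
    scan = [(y, x) for y in range(4) for x in range(4)]
    levels = [block[y][x] for y, x in scan]
    significant = [i for i, c in enumerate(levels) if c != 0]
    if not significant:
        return None  # no significant coefficient, so no SEs in this sketch

    last = significant[-1]
    last_y, last_x = scan[last]
    ses = {
        "LASTx": last_x,
        "LASTy": last_y,
        # One SIG flag for every position that occurs before the LAST.
        "SIG": [int(c != 0) for c in levels[:last]],
        "COEFF1": [], "COEFF2": [], "SIGN": [], "CALR": [],
    }

    coeff2_pending = True  # only a single COEFF2 flag is produced
    for i in reversed(significant):
        magnitude = abs(levels[i])
        ses["SIGN"].append(int(levels[i] < 0))
        ses["COEFF1"].append(int(magnitude > 1))
        base = 2  # what the significance and COEFF1 flags can express
        if magnitude > 1 and coeff2_pending:
            ses["COEFF2"].append(int(magnitude > 2))
            coeff2_pending = False
            base = 3  # COEFF2 extends the expressible range by one
        if magnitude >= base:
            # Remaining level after the decrement by the base value.
            ses["CALR"].append(magnitude - base)
    return ses


if __name__ == "__main__":
    example = [[9, -3, 1, 0],
               [2, 0, 0, 0],
               [-1, 0, 0, 0],
               [0, 0, 0, 0]]
    print(extract_trb_ses(example))
```

In this toy model, SIG, COEFF1, and COEFF2 correspond to the flag-type SEs, while SIGN and CALR carry the values that the flags cannot express, so one TrB produces a run of several SEs of the same kind.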

Fig. 9. CABAC top-level architecture with the proposed BN block.


Fig. 10. Proposed single-core BN architecture.

Hence, the throughput of CABAC is limited for regular bins due to the data dependencies in context modeling. In contrast, it is easier to process bypass bins in parallel since they do not contain data dependencies [21]. Thus, there is a potential to increase throughput if multiple SEs related to regular/bypass bins are combined and processed concurrently. This analysis helps to find the probability of the consecutive occurrence of SEs and guides the hardware design.

The minimum bitrate for 8K UHD@60fps, according to the HEVC standard, is 800 Mbits/s at level 6.2 high tier, which corresponds to an average bin rate (symbol rate) of about 1 Gbin/s [9,19]. In reality, the number of bins fluctuates heavily between frames. As a result, a real-time CABAC encoder may be required to deliver bin rates exceeding 1 Gbin/s. All of these measures alleviate the workload at the BN block, which enhances the CABAC throughput.

The work of this section may be summarized as follows:

(i) The SEs that occur consecutively as regular and bypass bins at the BN stage have been grouped and processed concurrently.
(ii) The probability of the average number of regular/bypass bins has been analyzed.
(iii) The minimum workload at the BN block has been realized.

4. Proposed architecture for Binarization (BN)

The processing of more SEs becomes the bottleneck due to the inherently serial processing at the BN, which limits the CABAC throughput. On the other hand, a one-by-one SE binarization cannot support the throughput of the subsequent blocks (CM and BAE), which have an encoding capability of 4–4.95 BPCC [17]. Therefore, acceleration at the binarization block for multi-SEs is required to enhance the throughput. The proposed design features a massively parallelized architecture to achieve the highest possible throughput, and it embeds the resource-sharing technique to use resources area-efficiently. The proposed BN architecture (dotted box) is considered as a hardware accelerator for the CABAC, as shown in Fig. 9. It consists of the five-core BN blocks, the pairing-SEs scheme, the storage buffers, and a data router (multiplexer). The following sub-sections describe the proposed design in detail.

The main controller takes the SE_type and generates parameters such as the binarization type, bin index (bin_idx), cRiceParam, and cMax. Then one of the processing cores for the EGk, TR, FL, TU, and custom formats is activated to convert the SE value into a sequence of bins. The five-core BN architecture consists of five independent single-core BNs connected in parallel by the main controller. The five-core parallel BN architecture can process up to five SEs in a single clock cycle. A pairing-SEs scheme has been introduced before the five-core BN to process multiple SEs in parallel. The SEs from the other processes in the HEVC architecture (transform coefficients, filter parameters, prediction data, etc.) have to be buffered at the input and output of the BN. Hence, two storage buffers are used before and after the five-core BN to balance the data flow between processes. The Serial In Parallel Out (SIPO) buffer is used before the five-core BN and provides five buffered SEs by converting the serial input into parallel, while the Parallel In Parallel Out (PIPO) buffer is used to fetch the binarized SEs.

Once binarization is finished, the bins are split into two streams composed of regular and bypass bins, as shown in Fig. 9. The output bin strings and their bin lengths are temporarily stored in the PIPO. Then, depending on the bin type, regular or bypass, the multiplexer separates the bins and routes them to the regular bin path or the bypass bin path, respectively. Encoding of bypass bins is simple because estimation of their probability is not necessary; for regular bins, however, appropriate probability models are required for encoding. These output bins are passed to the bit generator to form the final output bitstream of the complete encoder.

4.1. Proposed single-core BN

The proposed single-core BN architecture is shown in Fig. 10. The BN module inputs are SE_value of 16 bits, SE_type of 9 bits, and mode of 2 bits. It consists of different BN methods such as U, TU, TR, TB, EGk, and FL. All the BN methods map the SEs into bins. The single format four


Fig. 11. Proposed architecture of resource-sharing between U and TU methods.

Fig. 13. Adaptable BN process for CALR [10].

BN methods (TB, U, FL, and EGk) are connected in parallel and work independently, while two interdependent methods (TU and TR) are derived from U and (TU + FL), respectively.

The main controller module activates the BN methods based on the given input SE_type. This SE_type determines which BN method has to convert the input SE_value into a bin string (bin_string). For instance, referring to Table 127 of [8], SE_type = 0 denotes the SE_value of "end_of_slice_one_bit", which utilizes the FL method for conversion into bins; additionally, the value of cMax is set to 1. Likewise, cMax and cRiceParam are derived from the SE_type, and these parameters are required for most of the SEs. The BN formats are activated depending on the format signal from the main controller. All formats, i.e., single, combined, and custom, are utilized based on the format signal. The output of this BN process takes the form of a bin_string and a bin_length.

4.2. Resource-sharing technique for area-efficient in single-core BN

The resource-sharing technique is utilized in the single-core BN architecture. It allows the same hardware resource to be shared for executing two or more BN methods/formats. The following resource sharing technique is proposed at the single-core BN architecture level to reduce hardware area. The BN methods U and TU, which are used for unsigned SEs, share the same hardware resource, as shown in Fig. 11. The bins of these BN methods are represented as a sequence of '1's terminated by '0', depending on the cMax constraint, as shown in Table 2. The TU is a reduced form of U: it generates a bin string of '1's followed by a terminating '0' for the case SE_value < cMax, and when SE_value reaches cMax, the terminator '0' is removed from the bin string. The bin string length is calculated as SE_value + 1 for all cases except the case SE_value = cMax (bin string length = SE_value).

Another resource-sharing technique employed at the BN block builds TR from the combination of TU and FL, as shown in Fig. 12. The TR bin string consists of two parts, viz., a prefix and a suffix. The prefix part invokes the TU method, whereas the suffix part invokes the FL method. The operators ">>" and "<<" are the bit-wise right and left shifts. For cRiceParam = 0, TR remains the same as TU. This BN process is also selected based on the SE_type. In some cases, the BN is adaptable, depending on previously processed SEs.

4.3. Adaptive BN

The CALR SE_type is binarized using TR coding for the prefix and EGk coding for the suffix. However, TR is a combination of TU and FL. The number of FL bins depends on the prefix value [10]. The number of FL bins also depends on the parameter cRiceParam, which adaptively changes based on the value of the previous nonzero coefficient level. The bin

Fig. 12. Proposed TR module with resource sharing technique.
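As a software illustration of the TU/FL sharing behind the TR module, the following Python sketch models the bin strings only, not the hardware. The function names are hypothetical, and the rule that the FL suffix is emitted only when cRiceParam > 0 and SE_value < cMax is an assumption based on the usual HEVC definition of TR rather than something stated in this paper.

```python
def truncated_unary(value: int, c_max: int) -> str:
    """TU: `value` ones plus a terminating '0', dropped when value == c_max."""
    if not 0 <= value <= c_max:
        raise ValueError("value must satisfy 0 <= value <= c_max")
    return "1" * value + ("0" if value < c_max else "")


def fixed_length(value: int, n_bits: int) -> str:
    """FL: `value` as an unsigned binary string of exactly n_bits bits."""
    return format(value, f"0{n_bits}b") if n_bits > 0 else ""


def truncated_rice(value: int, c_max: int, c_rice_param: int) -> str:
    """TR: a TU-coded prefix followed by an FL-coded suffix."""
    prefix_val = value >> c_rice_param
    prefix = truncated_unary(prefix_val, c_max >> c_rice_param)
    suffix = ""
    # Assumption: the suffix exists only if cRiceParam > 0 and value < cMax.
    if c_rice_param > 0 and value < c_max:
        suffix = fixed_length(value - (prefix_val << c_rice_param), c_rice_param)
    return prefix + suffix


if __name__ == "__main__":
    print(truncated_unary(3, 5))    # three ones and a terminator
    print(truncated_rice(5, 8, 1))  # prefix from 5 >> 1, then one suffix bit
```

With cRiceParam = 0 the suffix is empty, so `truncated_rice` reduces to `truncated_unary`, which mirrors the statement that TR remains the same as TU in that case.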


Fig. 14. Proposed parallel architecture.

Fig. 15. Proposed pairing-SEs scheme.

length is adaptively updated in accordance with the cRiceParam. It introduces a dependency between coefficient levels. The main controller generates parameters including the context index (bin_idx), cRiceParam, and cMax. The flowchart for the CALR process is shown in Fig. 13.
In this process, the total number of bins has been reduced by modifying the choice of the binarization algorithm for CALR [10]. H.264/AVC utilizes the TU and EGk methods, while HEVC uses TR (TU + FL) and EGk, generating fewer bins [17]. It accounts for a significant portion, on average 15–25% of the total bins.

4.4. Multi-SEs parallel processing for high-throughput

CABAC has a required throughput of higher than 1 Gbin/s for 8K UHD and beyond applications. This throughput depends on the operating frequency and the number of bins processed per clock cycle (BPCC), and is given by:

\[
\text{Max. Throughput}\left(\frac{\text{bins}}{\text{s}}\right) = \text{Max. Operating Frequency (MHz)} \times \text{Avg. BPCC} \tag{3}
\]

where the average BPCC is estimated from the functional simulation of the test videos. The average throughput of CABAC is 0.99 bin/cycle, which is approximately 1 bin/cycle [22]. However, a well-designed BAE block is capable of a maximum throughput of 4.94 bin/cycle [17] and is mainly influenced by the performance of the previous BN stage. Hence, the BPCC of the BN block is targeted at about twice that figure (more than 10 bins/cycle) to keep its processing capability higher than that of the BAE at all times. Since BN is a serial, data-dependent (SE/bin) operation, fixing its throughput in BPCC is difficult [9].
A straightforward approach is to binarize multiple SEs in parallel. Hence, to achieve the maximum throughput, on average at least 4-5 SEs should be processed at a time. Here, the BN block is designed using five single-core BN blocks in parallel to process five SEs simultaneously, as shown in Fig. 14. The main controller controls the core stages. The resource-sharing technique has also been maintained within each core stage.

4.5. Pairing-SEs and SIPO/PIPO buffers for high performance

It is noticed that SEs arrive next to each other, ordered by regular/bypass bin type (Section 3). The bin distribution analysis reveals


Fig. 16. SIPO architecture.

Fig. 17. Proposed FSM-based main controller.

Table 6
The different standards support.
Mode | Standards | BN methods | Combination | Custom
00 | Default (HEVC) | TU, TR, FL, and EGk | TR + EGk for CALR | Table mapping and QP_Delta
01 | H.264/AVC | U, TU, FL, and EGk | TU + EGk for CALR | Table mapping
10 | H.265/HEVC | TU, TR, FL, and EGk | TR + EGk for CALR | Table mapping and QP_Delta
11 | H.266/VVC | TR, TB, FL, and EGk | TR + EGk for CALR | Table mapping and QP_Delta

that frequently occurring SEs may be paired before BN [13]. Hence, some SEs that occur together are paired and binarized jointly [7,21]. This scheme is named "pairing-SEs," in which two SEs are paired together in a single cycle (Fig. 15(a & b)). Fig. 15(a) shows the SE order of delivery to the CABAC encoding process, and Fig. 15(b) shows the pairing-SEs (P1 & P2) and the corresponding processing BN core.
Considering the data from Table 4 and Figs. 5, 6 and 8, SE types such as last_sig_coeff_x/y_prefix (LAST) and coeff_abs_level_greater1/2_flag (COEFF) appear frequently. The considered test configurations (LD_QP@22, LD_QP@37, RA_QP@22, and RA_QP@37) also show a frequent occurrence of the LAST and COEFF SE_types, accounting for approximately 30% of total bins (Figs. 5 and 8). Therefore, a specific acceleration process is needed to speed up the processing of such SE types [17]. For fast encoding, the SEs coeff_abs_level_greater1_flag and coeff_abs_level_greater2_flag are paired together (P1) as "COEFF" and pass through a dedicated single-core BN. The SEs for the horizontal (x) and vertical (y) components, last_sig_coeff_x_prefix and last_sig_coeff_y_prefix, are grouped (P2) as "LAST" and pass through another dedicated single-core BN. Thus, the pairing-SEs scheme improves the BN processing and allows parallelism in the SEs parsing order specified in the HEVC standard. The processing of the paired SEs is taken care of by the synchronous SIPO buffer, as shown in Fig. 16. A 16-bit shift register is used to form a single SE. After the binarization process, the output is placed in a PIPO buffer.

4.6. Main controller

Conventionally, the controller block is associated with the


Fig. 18. The depiction of the experimental test setup.

Table 7
The performance of the proposed architecture in bins per clock cycle (BPCC).
Test video sequence | Encoder configuration | QP@22 Regular | QP@22 Bypass | QP@22 Total | QP@37 Regular | QP@37 Bypass | QP@37 Total
Basketball | LD | 9 | 4 | 13 | 7 | 3 | 10
Basketball | RA | 5 | 5 | 10 | 5 | 4 | 9
PeopleOnStreet | LD | 8 | 4 | 12 | 7 | 3 | 10
PeopleOnStreet | RA | 7 | 4 | 11 | 5 | 3 | 8
Traffic | LD | 6 | 4 | 10 | 5 | 4 | 9
Traffic | RA | 6 | 5 | 11 | 4 | 3 | 7
Average BPCC | | 6.83 | 4.33 | 11.16 | 5.50 | 3.33 | 8.83
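The averages in the last row of Table 7 can be rechecked with a few lines of Python. This is only a sanity-check sketch; the list layout and the function name `average_bpcc` are illustrative assumptions, while the numbers are copied from the table rows.

```python
from statistics import mean

# (regular, bypass) bins per clock cycle, in the row order of Table 7:
# Basketball LD/RA, PeopleOnStreet LD/RA, Traffic LD/RA.
QP22 = [(9, 4), (5, 5), (8, 4), (7, 4), (6, 4), (6, 5)]
QP37 = [(7, 3), (5, 4), (7, 3), (5, 3), (5, 4), (4, 3)]


def average_bpcc(rows):
    """Return (regular, bypass, total) BPCC averaged over all rows."""
    regular = mean(r for r, _ in rows)
    bypass = mean(b for _, b in rows)
    return regular, bypass, regular + bypass


if __name__ == "__main__":
    for name, rows in (("QP@22", QP22), ("QP@37", QP37)):
        reg, byp, total = average_bpcc(rows)
        print(f"{name}: regular={reg:.2f} bypass={byp:.2f} total={total:.2f}")
```

Averaging the QP@22 totals directly gives 67/6, about 11.17, whereas the table lists 11.16, which is what one gets by adding the two rounded column averages (6.83 + 4.33).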

standalone single-core BN block. However, in the present design, only a single main controller has been designed to control the entire five-core BN. The main controller block initiates an encoding request for one or more BN methods based on the SE_type signals for binarization, as shown in Fig. 10. The controller logic, in the form of a finite state machine (FSM) that controls the five-core BN, is shown in Fig. 17. The main controller also generates a selector (sel) signal that chooses the BN block's output using a 5:1 multiplexer after the PIPO buffer stage to give the final output bin string.

4.7. Multi-standard support

Finally, the HEVC, VVC, and H.264/AVC formats are supported by using a two-bit mode signal at the top level of the five-core BN. Comparing HEVC with H.264/AVC, the basic BN methods are the same except for those used in the combination and custom methods, while VVC requires an extra TB BN method. The multi-standard support of the proposed five-core BN architecture is shown in Table 6. The BN processes SEs in the order defined in the standards.

5. Experimental test setup and hardware implementation

The proposed BN architecture has been described in Verilog RTL. Each BN method has been tested using a test bench of a 4 × 4 TrB (Fig. 6) in the Xilinx ISE 14.7 Simulator. The functional validity of the design has been verified and compared with the reference software HM-16.2 [20].

5.1. Experimental test setup and RTL simulation with test sequences

An illustration of the experimental test setup and verification process, including reference software and proposed hardware, is shown in Fig. 18. The system-level structure of the interfacing between hardware and software with the tools environment has also been presented.
Further, to analyze the correctness of the proposed design, the RTL has been simulated with the same test video sequences used in Section 3 (Fig. 7) for the probability analysis of SEs, and the obtained results are presented in Table 7. Here, the regular and bypass bins have been recorded separately for the three UHD video sequences, with the encoder configurations of LD and RA, at the QPs of 22 and 37.
An average BPCC of 11.16 and 8.83 has been achieved for QP@22 and QP@37, respectively, as the proposed architecture has the ability to process five SEs in a clock cycle (SECC). The timing analysis of the parallel process has also been provided to demonstrate its impact on the throughput. Fig. 19 shows the timing diagram of the proposed five-core BN for a 4 × 4 TrB (Fig. 15) in terms of input SECC and output BPCC. It explains the process of converting SEs into bin strings by the proposed


Fig. 19. The timing diagram of the proposed five-core BN design. A striped blue block denotes an idle state, whereas a white block represents the working state. The
output symbols in white blocks specify the status of the bin.

Table 8
FPGA post-implementation results.
Parameter | U/TU | TR | TB | FL | EGk | Single-core | Five-core
No. of Slices | 72 | 56 | 48 | 32 | 119 | 327 | 1694
Max. Operating Frequency (MHz) | 466 | 644 | 852 | 920 | 343 | 384 | 124

BN architecture.

5.2. FPGA prototype

The proposed design has been synthesized, implemented, and prototyped on the 28 nm Artix7 FPGA (Nexys4 DDR board), and on-chip hardware debugging has been performed using the logic analyzer tool (Xilinx Chip-Scope pro) [23]. The FPGA post Place and Route (P&R) results of each BN method in terms of the occupied area (No. of Slices) and maximum operating frequency are shown in Table 8. Two versions of the proposed BN architecture have been implemented: a single-core (Fig. 10) and a five-core (Fig. 14). The FL method is used by the majority of the SEs having equal probability. The synthesis netlist shows that the FL block is a combinational circuit, and it has also been noticed that the five-core BN consumes approximately five times the area of the single-core BN.

5.3. ASIC implementation

The ASIC synthesis has been done with the Cadence Genus tool using 90 nm CMOS technology. Here, four BN architecture versions have been considered for synthesis, viz., (I) a baseline single-SE BN (single-core) without resource sharing, named 'SBN', (II) a single-SE BN with the resource-sharing approach, named 'SBN_RS', (III) a baseline multi-SE BN (five-core) without resource sharing, named 'MBN', and (IV) a multi-SE BN with resource sharing, named 'MBN_RS'.
Table 9 presents the synthesis results of these architectures in terms of maximum SECC, maximum operating frequency, maximum throughput, logic gate count, and obtained average BPCC. The proposed MBN_RS architecture has been synthesized using five SBN_RS, while other core stages have also been considered. The design has achieved an

Table 9
ASIC synthesis of the four BN architectures.
Versions | BN architectural features | Max. SEs/cycle (SECC) | Max. operating frequency (MHz) | Max. throughput (Gbin/s) | Gate count (Kgates) | Avg. bins/cycle (BPCC) @QP = 22
Without resource-sharing technique | Single-core (SBN) | 1 | 732 | 1.13 | 1.92 | 1.40
Without resource-sharing technique | Five-core (MBN) | 5 | 302 | 3.37 | 9.48 | 11.16
With resource-sharing technique | Single-core (SBN_RS) | 1 | 720 | 1.00 | 1.25 | 1.40
With resource-sharing technique | Two-core | 2 | 603 | 1.54 | 2.91 | 2.56
With resource-sharing technique | Three-core | 3 | 410 | 2.32 | 4.35 | 5.68
With resource-sharing technique | Four-core | 4 | 352 | 3.01 | 5.83 | 8.56
With resource-sharing technique | Five-core (MBN_RS) | 5 | 282 | 3.14 | 6.61 | 11.16
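Eq. (3) can be applied to the rows of Table 9 as a quick consistency check. The sketch below is a minimal illustration, not part of the reported flow; the dictionary layout and the function name are assumptions, and the values are copied from the table. The baseline SBN row is left out because 732 MHz × 1.40 BPCC comes to about 1.02 Gbin/s rather than the 1.13 Gbin/s listed.

```python
# name: (max. operating frequency in MHz, avg. BPCC @QP=22, reported Gbin/s)
TABLE9 = {
    "MBN": (302, 11.16, 3.37),
    "SBN_RS": (720, 1.40, 1.00),
    "Two-core": (603, 2.56, 1.54),
    "Three-core": (410, 5.68, 2.32),
    "Four-core": (352, 8.56, 3.01),
    "MBN_RS": (282, 11.16, 3.14),
}


def max_throughput_gbin(freq_mhz: float, avg_bpcc: float) -> float:
    """Eq. (3): bins per second = clock frequency x bins per clock cycle."""
    bins_per_second = freq_mhz * 1e6 * avg_bpcc
    return bins_per_second / 1e9


if __name__ == "__main__":
    for name, (freq, bpcc, reported) in TABLE9.items():
        computed = max_throughput_gbin(freq, bpcc)
        print(f"{name}: computed {computed:.3f} Gbin/s, reported {reported:.2f}")
```

For the rows included here, the computed values agree with the reported throughput to within about 0.01 Gbin/s.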


Table 10
ASIC P&R implementation of the five-core BN architecture.
Technology (nm) | 90
Max. operating frequency (MHz) | 282
Min-Max. processing SE per clock cycle (SECC) | 1–5
Min-Max. processing bins per clock cycle (BPCC) | 1.40–11.16
Max. Throughput (Gbin/s) | 3.14
Gate count module-wise (No. of gates) | Five-core BN: 6615; SIPO: 243; PIPO (16b × 80): 232
Total gate count | 7090

Fig. 21. The trade-off between SECC, Throughput, and Frequency.

Fig. 20. Chip layout of the proposed five-core BN.


Fig. 22. Parallelism levels in the architecture.

average of 1.40 BPCC for the single-core and 11.16 BPCC for the five-core BN with resource sharing. Here, the five-core architecture with resource sharing has been implemented in the proposed design for ASIC layout using Cadence Innovus 16.8 (RTL-GDSII flow).
The post P&R implementation results are presented in Table 10. The total gate count is 7090 gates, including the storage buffers. The maximum throughput of 3.14 Gbin/s has been obtained at an operating frequency of 282 MHz.
Furthermore, several in-built optimization techniques provided by the EDA tools, such as logic balancing, clock scheduling, and retiming, were also used to achieve better performance and low hardware area. The RTL-GDSII flow has no timing errors and passed the DRC/LVS checks. Fig. 20 shows the layout view of the proposed design, which has positive slack at the GDSII stage. The Cadence Virtuoso tool has been used for the chip-level layout implementation by creating the input-output (IO) pad rings. The post-layout checks such as LVS (Layout vs. Schematic) have been passed. The total chip area is 2.01 mm² with extra space to integrate other CABAC sub-blocks in the future.

6. Results and discussions

CABAC is a well-known throughput bottleneck and a computationally intensive block in the video encoder. It needs several clock cycles to process binary symbols. Therefore, the throughput of the CABAC architecture is relatively low (average ~1 BPCC [22]). Hence, optimization is needed to obtain high-throughput and area-efficient hardware architectures. Hardware acceleration techniques may be used to achieve higher throughput for UHD applications. Also, parallelism is an effective way to increase the throughput, but the hardware area rises accordingly. In CABAC, it is essential to consider the possibility of avoiding the main potential bottleneck at the CM/BAE block. Hence, the primary BN block should cater to the high throughput requirements by processing multiple SECC.
The proposed work mainly focuses on improving the throughput by processing multiple SECC at the BN stage. The proposed five-core BN architecture processes multiple SEs in parallel and generates multiple bins to enhance throughput. The resource-sharing technique in the architecture has been used to reduce hardware area. In addition, the pairing-SEs scheme has been used based on the probability of consecutive SE occurrences (Section 3), which can process two or more SEs per clock cycle. The storage buffers used for the SEs distribute the workload in parallel processing.

Table 11
Proposed architecture scalability with core stages and supported resolution.
BN core stages | Max. Throughput (Gbin/s) | Supported resolution (a)
Single-core | 1.00 | 2K/4K
Two-core | 1.54 | 4K/8K
Three-core | 2.32 | 8K
Four-core | 3.01 | 16K
Five-core | 3.14 | 16K and Beyond
(a) 4K and above: UHD.
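The scalability in Table 11 suggests a simple way to pick a configuration: choose the smallest number of core stages whose maximum throughput meets a target. The following Python sketch illustrates that idea under stated assumptions; the function name `min_cores_for` and the selection rule itself are hypothetical and are not described in the paper, while the throughput figures are taken from Table 11.

```python
from typing import Optional

# Number of core stages -> max. throughput in Gbin/s (Table 11).
CORE_THROUGHPUT_GBIN = {1: 1.00, 2: 1.54, 3: 2.32, 4: 3.01, 5: 3.14}


def min_cores_for(required_gbin: float) -> Optional[int]:
    """Smallest core count that reaches the required throughput, else None."""
    for cores in sorted(CORE_THROUGHPUT_GBIN):
        if CORE_THROUGHPUT_GBIN[cores] >= required_gbin:
            return cores
    return None


if __name__ == "__main__":
    for target in (1.0, 2.0, 3.1, 3.5):
        print(f"{target} Gbin/s -> {min_cores_for(target)} core(s)")
```

A target above 3.14 Gbin/s returns `None`, since no listed configuration reaches it.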


Table 12
Impact of operating frequency on throughput.
BN core stages | Operating frequency decrement (%) | Overall throughput increment (%)
(Single-core) vs. (Five-core) | 59.16 | 69.51

6.1. Throughput and operating frequency analysis

The throughput exploration has been done for different core stage architectures. The synthesis results (Table 9) help in understanding the core stages' efficiency required to process UHD and beyond. The relationship between SECC, maximum throughput, and operating frequency is shown in Fig. 21. It has been observed that any increment in the SECC allows higher throughput.
Furthermore, high throughput is not always correlated with the level of parallelism. It also depends strongly on the SE workload balance between parallel cores. If the workload is not equally distributed, some cores will be idle, and the throughput is reduced (i.e., N parallel hardware blocks may not increase the throughput N times).
Fig. 22 illustrates the levels of parallelism in the proposed architecture at the system level. Moreover, the proposed architecture also provides a flexible choice of selecting the appropriate number of core stages for the required resolution, as shown in Table 11.
The maximum throughput of the proposed design cores lies between 1 and 3.14 Gbin/s. Hence, this architecture is able to reach the requirements for UHD videos (more than 1 Gbin/s). In practice, the BPCC fluctuates heavily between CTUs and frames. Therefore, a real-time CABAC encoder may have to deliver even higher throughput than the actual requirement [9].
Generally, the throughput is proportional to the operating frequency (Eq. (3)). In the proposed architecture, the maximum operating frequency decreases as the number of core stages increases (Fig. 21); however, the system's overall throughput still increases because multiple bins are processed per clock cycle (Table 9). The impact of the operating frequency on throughput is shown in Table 12.
This happens due to the delay caused by the storage buffers used in the processing path of the BN process. The critical path delay was optimized by using the multi-cycle technique. In this technique, simple BN methods such as FL and TU use a single cycle, while other critical methods such as TR, TB, and EGk have multi-cycle constraints [12]. The processing delay of each SE (16 bits) at the SIPO buffer is shown in Fig. 23. The SIPO buffer stage requires 16 clock cycles to process a single SE in a single-core BN. This delay keeps increasing with the degree of parallel processing (five cores). The PIPO stage, used to separate regular and bypass bins, also causes a delay. Thus, each parallel path receives data separately (Fig. 19).
Consequently, the proposed storage buffers limit the performance of the proposed system to a certain extent. Therefore, there is a trade-off between high throughput and high performance over increased SECC/BPCC. However, the requirements may vary with the application. For instance, high throughput is a critical factor in UHD applications and requires more SECC/BPCC rather than a higher operating frequency [17].

6.2. Hardware area and performance analysis

A 4-level hierarchy of the resource-sharing technique has been employed in the single-core architecture, as shown in Fig. 24. Hence, a significant impact on the area has been noticed between SBN and SBN_RS (Table 9). It is also observed that the single-SE versions (SBN and SBN_RS) have almost five times fewer gates than the multi-SE versions (MBN and MBN_RS). Nevertheless, the multi-SE version architectures can process up to five SEs/cycle.

Fig. 24. The hierarchy of the resources sharing in the single-core BN.

Table 13
Performance parameters of the proposed architecture.
Proposed four BN architectures | SBN (Single-core) | SBN_RS (Single-core) | MBN (Five-core) | MBN_RS (Five-core)
Area Saving (%) | – | 34% | – | 25%
FoM | 72.9 | 112 | 117 | 157
Design efficiency | 0.59 | 0.80 | 0.43 | 0.46

Fig. 23. The timing diagram of SEs processing delay at SIPO.
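The 16-cycle SIPO delay can be pictured with a small behavioural model. The sketch below is written in Python rather than the Verilog used for the actual design, and it is only an illustration: the class name `Sipo16`, the MSB-first bit order, and the automatic reset after a word is delivered are assumptions, not details taken from the paper.

```python
class Sipo16:
    """Behavioural model of a 16-bit serial-in parallel-out shift register."""

    WIDTH = 16

    def __init__(self) -> None:
        self.reg = 0
        self.count = 0

    def clock(self, bit: int):
        """Shift in one bit per clock; return the 16-bit SE word when full."""
        # Assumption: bits arrive MSB first.
        self.reg = ((self.reg << 1) | (bit & 1)) & 0xFFFF
        self.count += 1
        if self.count == self.WIDTH:
            word, self.reg, self.count = self.reg, 0, 0
            return word
        return None


def serialize(word: int):
    """Split a 16-bit word into a list of bits, MSB first."""
    return [(word >> i) & 1 for i in range(15, -1, -1)]


if __name__ == "__main__":
    sipo = Sipo16()
    outputs = [sipo.clock(b) for b in serialize(0xA5C3)]
    print(outputs.index(0xA5C3) + 1, "clock cycles to assemble one SE")
```

In this model a complete SE word appears only on the sixteenth clock, which corresponds to the per-SE latency attributed to the SIPO stage.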


Fig. 25. The timing diagram of the proposed BN design. A striped blue block denotes an idle state, whereas a white block represents the working state. The output
symbols in white blocks specify the status of the output.
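The cycle counts compared in Fig. 25 for single-core and five-core operation can be approximated with an idealized model in which every core accepts one SE per clock cycle and buffering delays are ignored. Both simplifications are assumptions made for this sketch, and the function name `bn_cycles` is hypothetical.

```python
import math


def bn_cycles(num_ses: int, ses_per_cycle: int) -> int:
    """Idealized number of clock cycles to binarize num_ses syntax elements."""
    if ses_per_cycle < 1:
        raise ValueError("ses_per_cycle must be at least 1")
    return math.ceil(num_ses / ses_per_cycle)


if __name__ == "__main__":
    # Seven SE types of a 4 x 4 TrB on a single-core and a five-core BN.
    print(bn_cycles(7, 1), bn_cycles(7, 5))
```

For seven SEs this gives seven cycles at 1 SE/cycle and two cycles at 5 SEs/cycle.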

The data in Table 9 for the proposed versions help in understanding the impact of the proposed resource-sharing (RS) technique on area and throughput, and their trade-off. The percentage area saving, Figure of Merit (FoM), and Design Efficiency (DE) of the proposed system are computed below, and Table 13 shows the performance parameters of the proposed architecture.

\[
\text{Area Saving (\%)} = \frac{\text{Without RS} - \text{With RS}}{\text{Without RS}} \times 100 \tag{4}
\]

\[
\text{FoM} = \frac{\text{No. of bins processed per clock cycle (BPCC)}}{\text{Total gate count (Kgates)}} \times 100 \tag{5}
\]

\[
\text{Design Efficiency} = \frac{\text{Max. Throughput (Gbin/s)}}{\text{Total gate count (Kgates)}} \tag{6}
\]

At the BN system top level, an average of 25% hardware area is saved using the resource-sharing technique. The MBN_RS achieves the highest FoM value (157) among all the above architectures. The design efficiency of the SBN_RS version is 0.80, which is 1.74 times the efficiency of the MBN_RS version (0.46). Therefore, any further increment in the core stages results in a decrease in the design efficiency.

6.3. Overall design performance evaluation

Furthermore, in this work, the proposed architecture and techniques have been analyzed on the statistics and characteristics of TrU SEs. These data (Fig. 1) account for a notable share of the SEs, 75% on average and 94% in the worst case of the total bins of CABAC [7,16]. Thus, efficient coding of this type of SEs contributes to the whole CABAC throughput. The proposed pairing-SEs scheme rearranges the SEs to support the parallel architecture at the BN stage. In the present implementation, the 4 × 4 TrB contains seven types of SEs [16], and these have been processed in two cycles rather than the original seven cycles, as illustrated in Fig. 25(a–e). The SIPO provides buffered SEs to the BN in a clock cycle. In the proposed architectures, the baseline single-core BN processes 1 SE/cycle, whereas the five-core BN processes 5 SEs/cycle in parallel. Therefore, the overall throughput is 5 SEs/cycle instead of 1 SE/cycle when the five-core BN is considered as the final version.
The timing diagram is shown in Fig. 25(a–e). The TrB (4 × 4) block is shown in Fig. 25(a), and the baseline single-core binarization table is shown in Fig. 25(b). The proposed five-core binarization table is shown in Fig. 25(c). Fig. 25(d) shows the timing diagram of a single-core BN with 1 SE/cycle, while Fig. 25(e) shows the timing diagram of a five-core BN with 5 SEs/cycle. The proposed five-core BN design can process five SEs in a clock cycle, which are fetched at the 2nd clock cycle, as

Table 14
Comparison of proposed work FPGA results with prior works.
Parameter | [24] 2006 | [12] 2010 | [10] 2013 | [25] 2013 | [26] 2014 | [15] 2016 | [27] 2020 | This work (Single-SE) | This work (Multi-SE)
FPGA Device | Virtex-II | Virtex-II | Virtex 6 | Virtex-II | Virtex 7 | Spartan 6 | Arria II | Artix 7 | Artix 7
Slices used | 403 | 212 | 1196 (a) | 400 | 394 | 376 | 9833 (a) | 327 | 1694
Max. Frequency (MHz) | 185 | 247.5 | – | 145.15 | 267 | 287 | – | 384 | 124
Throughput: Max. SECC at the input | 1 | 1 | 4×4 | 1 | 1 | 1 | 4×4 | 1 | 5
Throughput: Avg. BPCC at the output | 2 | 0.42 | 1.18 | – | – | – | 13.08 | 1.40 | 11.16
CABAC Block | BN | BN | BN + CM | BN | BN | BN | BN + CM | BN | BN
Standard support | H.264 | H.264 | HEVC | H.264 | H.264 | H.264/HEVC | HEVC | H.264/HEVC/VVC | H.264/HEVC/VVC
(a) BN and CM combined.


Table 15
Comparison of proposed work ASIC with literature.
Design [24] [28] [12] [29] [13] [30] [10] [31] [9] [32] [19] [33] [27] This work
Parameters 2006 2007 2010 2010 2011 2011 2013 2014 2015 2016 2019 2020 2020
Single- Multi-SE
SE

Max. SEs/cycle 2 1 1 1 4 3 1 1 4 1 4×4 3.5 4×4 1 5


(SECC)
Avg. bins/cycle 2.30 0.67 0.42 1.42 4.11 5.60 1.18 1 18.00 1 16 3.05 13.08 1.40 11.16
(BPCC)
Max. Frequency 343 200 370 222 250 279 357 200 420a 158a 191 500 570a 720 282
(MHz)
Throughput 0.78 0.13 0.15 0.31 1.02 1.56 0.42 0.20 – 0.15 3.05 1.52 – 1.00 3.14
(Gbin/s)
Gate Count 6.50 12.01 1.80 26.16 19.30 9.18 23.21 1.67 23.50 3.40 13.82 6.41 56.37 1.25 7.09
(Kgates)
Design 0.12 0.01 0.08 0.01 0.05 0.16 0.01 0.11 – 0.04 0.22 0.23 – 0.80 0.46
Efficiency
(Gbin/Kgate)
Storage type FIFO PISO ROM PIPO RAM FIFO PISO – PIPO FIFO FIFO FIFO FIFO – SIPO &
& PIPO
PISO
Implemented BN BN & BN BN & BN BN BN & BN BN & BN BN BN BN & BN BN
CABAC block CM CM CM CM CM
Support HD HDTV Full QFHD HD HDTV 2K 1920 UHD 1920 2560 UHD 8K 2K Beyond
resolution HD × × × 8K
1080 1080 1600
Support H.264 H.264 H.264 H.264 H.264 H.264 HEVC HEVC H.264/ HEVC HEVC HEVC HEVC H.264/HEVC/VVC
Standard HEVC
Technology 350 130 180 130 90 90 130 45 90 180 65 45 90 90
(nm)
a
CABAC system level.
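Eqs. (4)–(6) are simple ratios, so the derived figures for this work can be rechecked from the gate counts, BPCC, and throughput values in Tables 9, 10 and 15. The Python sketch below is a worked check only; the function names are hypothetical. It assumes that the five-core figures use the 7.09 Kgate total that includes the buffers, because that value reproduces the listed 25% saving and the FoM of 157.

```python
def area_saving_pct(without_rs: float, with_rs: float) -> float:
    """Eq. (4): relative gate-count reduction from resource sharing, in percent."""
    return (without_rs - with_rs) / without_rs * 100


def figure_of_merit(bpcc: float, kgates: float) -> float:
    """Eq. (5): bins per clock cycle per Kgate, scaled by 100."""
    return bpcc / kgates * 100


def design_efficiency(throughput_gbin: float, kgates: float) -> float:
    """Eq. (6): maximum throughput in Gbin/s per Kgate."""
    return throughput_gbin / kgates


if __name__ == "__main__":
    print(f"Single-core saving: {area_saving_pct(1.92, 1.25):.1f} %")
    print(f"Five-core saving:   {area_saving_pct(9.48, 7.09):.1f} %")
    print(f"FoM, MBN_RS:        {figure_of_merit(11.16, 7.09):.1f}")
    print(f"DE, SBN_RS:         {design_efficiency(1.00, 1.25):.2f}")
    print(f"DE, MBN_RS:         {design_efficiency(3.14, 7.09):.2f}")
```

Under these inputs the single-core design efficiency matches the listed 0.80, while the five-core value comes out near 0.44 rather than the listed 0.46; the exact inputs behind the listed figure are not stated in this section.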

Fig. 26. Comparison considering area and operating frequency.

shown in Fig. 25(e). As a result, multiple SEs are processed in parallel in order to save clock cycles.

6.4. Comparison with the related works

Many authors have reported different architectures and implementations of CABAC on different platforms (FPGA and ASIC) in the past. However, it is unfair to compare all these architectures because they have been implemented on different FPGA platforms or ASIC technologies. Nevertheless, the number of input SECC processed and the BPCC obtained are comparable, as these are independent of the implementation platform. The use of an FPGA for implementing BN methods is the right option, as it supports parallelism well and has the advantage of being reconfigurable. The proposed work mainly utilizes parallel processing, and the FPGA implementation gives a significant advantage due to its inherent hardware nature and optimized on-chip registers for the SIPO and PIPO buffers. It helps to assess performance and throughput with very reasonable hardware resources. Here, the works related to CABAC-BN for multiple standards have been compared with the proposed architecture, as shown in Table 14.
The primary parameters considered for comparison are hardware area (FPGA Slices), maximum operating frequency (MHz), and throughput (average BPCC at output). The works presented in Refs. [12,15,24–26] are Single-SE implementations, whereas the architectures reported in Refs. [10,27] are Multi-SE implementations. Thus, two versions of the proposed architecture, viz. single-SE (single-core) and multi-SE (five-core), have been listed in Table 15. Compared with the single-SE works, the proposed design has a higher operating frequency and considerably fewer hardware slices, except for [12]. The average BPCC of the proposed design is better than all except [24]. However, these works [12,24] do not support all BN methods or multiple standards like the proposed design.
Among the Multi-SE works, [27] has a better BPCC than the proposed work. However, it has been reported at the CABAC level, and the BN was combined with CM to process in parallel. Nevertheless, the proposed design supports more BN methods and multiple standards as compared to other works. Overall, the proposed architecture FPGA result outperforms all other reported works.
The ASIC implementation of the proposed work is compared with prior works in Table 15. The works [12,13,19,24,30–32], and [33] have reported exclusively BN architectures, and the other works [9,10,27–29] have reported a combination of BN + CM blocks. In these works also, both Single-SE and Multi-SE designs were reported. The proposed work has been compared with Single-SE architectures [10,12,28,29,31,32] and Multi-SE architectures [9,13,19,24,27,30,33]. The proposed Single-SE design has achieved better performance (720 MHz) with fewer hardware resources (1.25 Kgates) than the others. It also supports the various BN methods used in popular video compression standards (H.264/HEVC/VVC).
Among the multi-SE architectures, the proposed BN obtained higher throughput (3.14 Gbin/s). However, the works reported in Refs. [9,27] offer better Avg. BPCC than the proposed work, but BN +


CM blocks are combined and processed in parallel in those architectures. [6] M. Orlandić, K. Svarstad, An efficient hardware architecture of CAVLC encoder
based on stream processing, Microelectron. J. 67 (2017) 43–49, https://doi.org/
The result [19] also offers a better BPCC, but its maximum throughput is
10.1016/j.mejo.2017.07.013.
less than the proposed work. In terms of hardware, the proposed work is [7] V. Sze, M. Budagavi (Eds.), High-Efficiency Video Coding (HEVC)-Algorithms and
compared with works [13,30] because these works implemented the BN Architectures, 2014, https://doi.org/10.1007/978-3-319-06895-4. G.J.S. Editors.
architecture at the 90 nm technology node (Fig. 26). [8] VVC/H.266: Versatile video coding standard ISO/IEC 23090-3, ISO/IEC JTC 1,
H.266 (08/2020). https://www.itu.int/rec/T-REC-H.266-202008-I/en.
It is observed that the proposed design occupies less hardware area [9] D. Zhou, J. Zhou, W. Fei, S. Goto, Ultra-high-throughput VLSI architecture of
and achieved better operating frequency due to the effective usage of H.265/HEVC CABAC encoder for UHDTV applications, IEEE Trans. Circ. Syst.
resource sharing technique and storage buffers. Sometimes, the extra Video Technol. 25 (2015) 497–507, https://doi.org/10.1109/
TCSVT.2014.2337572.
hardware area depends on memories, which is common for all archi­ [10] B. Peng, D. Ding, X. Zhu, L. Yu, A hardware CABAC encoder for HEVC, in: Proc.
tectures. Using the storage buffers in parallelism becomes more efficient IEEE Int. Symp. Circ. Syst., 2013, pp. 1372–1375, https://doi.org/10.1109/
than the existing architecture. Furthermore, for better comparison, all ISCAS.2013.6572110.
[11] B. Vizzotto, V. Mazui, S. Bampi, Area efficient and high throughput CABAC
the works are scaled down using the design efficiency equation (Eq. (6)). encoder architecture for HEVC, in: Proc. IEEE Int. Conf. Electron. Circuits, Syst.
The proposed design outperforms all the existing architectures. Never­ 2016-March, 2016, pp. 572–575, https://doi.org/10.1109/ICECS.2015.7440381.
theless, the proposed multi-SEs architecture seamlessly supports the [12] A.L.D.M. Martins, V. Rosa, S. Bampi, A low-cost hardware architecture binarizer
design for the H.264/AVC CABAC entropy coding, in: 2010 IEEE Int. Conf.
multiple standards than other architectures. To summarize, the present Electron. Circuits, Syst. ICECS 2010 - Proc., 2010, pp. 392–395, https://doi.org/
work has achieved a smaller hardware area to meet the throughput 10.1109/ICECS.2010.5724535.
constraints of 8K UHD. [13] Y. Liu, T. Song, T. Shimamoto, High performance binarizer for H.264/AVC CABAC,
in: 2011 Int. Conf. Electr. Inf. Control Eng. ICEICE 2011 - Proc, 2011,
pp. 2237–2240, https://doi.org/10.1109/ICEICE.2011.5777972.
7. Conclusion [14] C. De Matos Alonso, F.L. Livi Ramos, B. Zatt, M. Porto, S. Bampi, Low-power HEVC
Binarizer architecture for the CABAC block targeting UHD video processing, in:
Proc. - 30th Symp. Integr. Circuits Syst. Des. Chip Sands, SBCCI, 2017, pp. 30–35,
A high-throughput and area-efficient hardware architecture is pro­ https://doi.org/10.1145/3109984.3109988.
posed in this work, to meet UHD and beyond requirements. The [15] N. Neji, M. Jridi, A. Alfalou, N. Masmoudi, FPGA implementation of improved
throughput improvement is achieved at the architectural level with the binarizer design for context-based adaptive binary arithmetic coder, in: IPAS 2016
- 2nd Int. Image Process. Appl. Syst. Conf., 2017, pp. 1–4, https://doi.org/
multiple SEs parallel processing, while area-efficiency is achieved using
10.1109/IPAS.2016.7880123.
the multi-level resource-sharing technique. The storage buffers and [16] V. Sze, M. Budagavi, High throughput CABAC entropy coding in HEVC, IEEE Trans.
pairing SEs are used to support the parallel processing. In addition, the Circ. Syst. Video Technol. 22 (2012) 1778–1791, https://doi.org/10.1109/
TCSVT.2012.2221526.
computational demand of the binarization, probability occurrence of
[17] D.-L. Tran, V.-H. Pham, H.K. Nguyen, X.-T. Tran, A survey of high-efficiency
SEs, and the need for hardware acceleration are also explored. The context-adaptive binary arithmetic coding hardware implementations in high-
design has used an adaptive binarization method and scalability in the efficiency video coding standard, VNU J. Sci. Comput. Sci. Commun. Eng. 35
core stages, which supports the CABAC process used in the popular video (2019) 1–22, https://doi.org/10.25073/2588-1086/vnucsce.233.
[18] J. Lainema, K. Ugur, A. Hallapuro, in: Single Entropy Coder for HEVC with a High
coding standards (H.264/HEVC/VVC). The proposed binarization block Throughput Binarization Mode, JCTVC-G569, 7th Joint Collaborative Team on
is designed in Verilog and implemented on both FPGA and ASIC plat­ Video Coding (JCT-VC) Meeting, JCT-VC Document Management System, Geneva,
forms, respectively. The proposed architecture offers a throughput of Switzerland, 2011. Nov. http://phenix.it-sudparis.eu/jct/. (Accessed 5 October
2021).
3.14 Gbin/s at the operating frequency of 282 MHz. The hardware area [19] F.L.L. Ramos, A.V.P. Saggiorato, B. Zatt, M. Porto, S. Bampi, Residual syntax
has saved significantly in comparison to state-of-the-art architecture. elements analysis and design targeting high-throughput HEVC CABAC, IEEE Trans.
The proposed architecture of binarization is suitable for UHDTV Circuits Syst. I Regul. Pap. 67 (2020) 475–488, https://doi.org/10.1109/
TCSI.2019.2932891.
application. [20] F. Bossen, Common test conditions and software reference configurations, JCTVC-
L1100 12 (7) (2013). Jan 23, https://hevc.hhi.fraunhofer.de/. (Accessed 5 October
2021).
Declaration of competing interest

The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence
the work reported in this paper.

Acknowledgment

This work was carried out with the resources of the “Special
Manpower Development Program for Chip to System Design (SMDP-
C2SD)” project funded by the Ministry of Electronics and Information
Technology (MeitY), Government of India.

References the H.264/AVC CABAC entropy coding, 2014, in: 2014 Int. Conf. Electr. Sci.
Technol. Maghreb, Cist., 2014, pp. 1–4, https://doi.org/10.1109/
CISTEM.2014.7076749.
[1] Cisco Visual Networking Index (VNI), Forecast and Trends, 2017–2022, 2019 white
[27] G. Pastuszak, Multisymbol architecture of the entropy coder for H.265/HEVC video
paper Feb;17:13.
encoders, IEEE Trans. Very Large Scale Integr. Syst. 28 (2020) 2573–2583, https://
[2] T. Wiegand, G.J. Sullivan, S. Member, G. Bjøntegaard, A. Luthra, S. Member,
doi.org/10.1109/TVLSI.2020.3016386.
Overview of the H.264/AVC video coding standard, IEEE Trans. Circ. Syst. Video
[28] P.S. Liu, J.W. Chen, Y.L. Lin, A hardwired context-based adaptive binary arithmetic
Technol. 13 (2003) 560–576, https://doi.org/10.1109/TCSVT.2003.815165.
encoder for H.264 advanced video coding, in: 2007 Int. Symp. VLSI Des. Autom.
[3] G.J. Sullivan, J.R. Ohm, W.J. Han, T. Wiegand, Overview of the high efficiency
Test, VLSI-DAT 2007 - Proc. Tech. Pap., 2007, p. 936349, https://doi.org/
video coding (HEVC) standard, IEEE Trans. Circ. Syst. Video Technol. 22 (2012)
10.1109/VDAT.2007.373239.
1649–1668, https://doi.org/10.1109/TCSVT.2012.2221191.
[29] J.W. Chen, L.C. Wu, P.S. Liu, Y.L. Lin, A high-throughput fully hardwired CABAC
[4] B. Bross, Y.K. Wang, Y. Ye, S. Liu, J. Chen, G.J. Sullivan, J.R. Ohm, Overview of the
encoder for QFHD H.264/AVC main profile video, IEEE Trans. Consum. Electron.
versatile video coding (VVC) standard and its applications, IEEE Trans. Circ. Syst.
56 (2010) 2529–2536, https://doi.org/10.1109/TCE.2010.5681137.
Video Technol. 31 (2021) 3736–3764, https://doi.org/10.1109/
[30] W. Fei, D. Zhou, S. Goto, A 1 Gbin/s CABAC encoder for H . 264/AVC Wei Fei,
TCSVT.2021.3101953.
Dajiang Zhou, and Satoshi Goto, 2011 19th, Eur. Signal Process. Conf. (2011)
[5] D. Marpe, H. Schwarz, T. Wiegand, Context-based adaptive binary arithmetic
1524–1528, https://doi.org/10.5281/zenodo.42535.
coding in the H.264/AVC video compression standard, IEEE Trans. Circ. Syst.
[31] D.H. Pham, J. Moon, S. Lee, Hardware implementation of HEVC CABAC binarizer,
Video Technol. 13 (2003) 620–636, https://doi.org/10.1109/
J. IKEEE. 18 (2014) 356–361, https://doi.org/10.7471/ikeee.2014.18.3.356.
TCSVT.2003.815173.

18
M. Nagaraju et al. Microelectronics Journal 123 (2022) 105425

[32] D. Kim, J. Moon, S. Lee, Hardware implementation of HEVC CABAC encoder, [34] Grois D, Nguyen T, Marpe D. Performance comparison of AV1, JEM, VP9, and
ISOCC 2015 - Int. SoC Des. Conf. SoC Internet Everything (2016) 183–184, https:// HEVC encoders. In Applications of Digital Image Processing XL 2018 Feb 8 (Vol.
doi.org/10.1109/ISOCC.2015.7401779. vol. 10396, p. 103960L). International Society for Optics and Photonics.
[33] D.L. Tran, X.T. Tran, D.H. Bui, C.K. Pham, An efficient hardware implementation of [35] F.L.L. Ramos, B. Zatt, M.S. Porto, S. Bampi, Novel multiple bypass bin scheme and
residual data binarization in HEVC CABAC encoder, Electronics 9 (4) (2020) 684, low-power approach for HEVC CABAC binary arithmetic encoder, J. Integr.
https://doi.org/10.3390/electronics9040684. Apr. Circuits Syst. 13 (2018) 1–11, https://doi.org/10.29292/JICS.V13I3.3.
