
Pattern Recognition 159 (2025) 111079


Fine-grained Automatic Augmentation for handwritten character recognition


Wei Chen, Xiangdong Su ∗, Hongxu Hou
College of Computer Science, Inner Mongolia University, China
National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, Huhhot, China

ARTICLE INFO

Keywords:
Data automatic augmentation
Handwritten character recognition
Bézier curve
Bayesian optimization

ABSTRACT

With the advancement of deep learning-based character recognition models, training data size has become a crucial factor in improving the performance of handwritten text recognition. For languages with low-resource handwriting samples, data augmentation methods can effectively scale up the data size and improve the performance of handwriting recognition models. However, existing data augmentation methods for handwritten text face two limitations: (1) methods based on global spatial transformations typically augment the training data by transforming each word sample as a whole, ignoring the potential to generate fine-grained transformations from local word areas and limiting the diversity of the generated samples; (2) it is challenging to adaptively choose a reasonable augmentation parameter when applying these methods to different language datasets. To address these issues, this paper proposes Fine-grained Automatic Augmentation (FgAA) for handwritten character recognition. Specifically, FgAA views each word sample as a composition of multiple strokes and achieves data augmentation by performing fine-grained transformations on those strokes. Each word is automatically segmented into strokes, and each stroke is fitted with a Bézier curve. On this basis, we define an augmentation policy over the fine-grained transformations and use Bayesian optimization to select the optimal policy automatically, thereby achieving automatic augmentation of handwriting samples. Experiments on seven handwriting datasets of different languages demonstrate that FgAA achieves the best augmentation effect for handwritten character recognition. Our code is available at https://github.com/IMU-MachineLearningSXD/Fine-grained-Automatic-Augmentation

1. Introduction

In recent years, deep learning-based handwritten text recognition (HTR) models [1–3] have made significant progress and shown impressive recognition accuracy. One key factor contributing to the success of these models is the large scale of available training data. However, the data sparsity problem [4,5] of HTR persists due to the high cost of collection and annotation, as well as the limited digital resources of some minority languages. When there are not enough training samples, recognition accuracy drops significantly, since deep learning-based HTR models usually require large-scale training data to optimize their parameters.

To alleviate this problem, researchers have proposed various data augmentation methods [6,7] to expand the training dataset for HTR tasks. For instance, the standard affine transformation [8] introduces morphological changes to text images through global matrix operations. It typically augments the training data by transforming each word sample as a whole, but ignores the potential to generate fine-grained transformations from local word areas, limiting the diversity of the generated samples. Pippi et al. [9] propose a finer-grained warping approach via the Thin Plate Spline transformation to introduce variance between word samples. Although these basic methods are simple and effective to apply, selecting the optimal magnitude of operation for a specific dataset requires time-consuming ablation experiments and high manual cost [10]. Even for some recently proposed pretrained HTR models [1], determining optimal augmentation parameters is still challenging. To address this issue, AutoAugment (AA) [11] proposes obtaining adaptive data augmentation policies from a policy library using reinforcement learning [12], but it takes thousands of GPU hours to train a model due to the expansive search space. Fig. 1 illustrates the augmentation effects of AA's policy library on handwritten text. Samples generated using the color space adjustment operations are incapable of fitting various handwriting styles, making them nearly ineffective in enhancing the performance of the recognizer.

∗ Corresponding author.
E-mail addresses: chenw@mail.imu.edu.cn (W. Chen), cssxd@imu.edu.cn (X. Su), cshhx@imu.edu.cn (H. Hou).

https://doi.org/10.1016/j.patcog.2024.111079
Received 5 September 2024; Accepted 9 October 2024
Available online 22 October 2024
0031-3203/© 2024 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

Fig. 1. Augmented samples of the handwritten word ‘my’. The augmentation operations used are all from the AutoAugment policy library [11].

To overcome the abovementioned limitations, this paper draws inspiration from AA and proposes Fine-grained Automatic Augmentation (FgAA) for HTR based on Script-level Augmentation (SlA) [10]. In contrast to conventional spatial transformation-based methods [13], we use Bézier curves [14] to achieve fine-grained augmentation for handwritten text. Specifically, FgAA automatically segments each word sample into multiple strokes, each of which is fitted with Bézier curves. By manipulating the control points of these curves, all strokes are reconstructed and reassembled to create new handwritten samples. This process allows us to generate fine-grained variations for each word stroke rather than employing a coarse global processing approach. Building upon this foundation, we introduce an iterative optimization framework to autonomously learn the optimal policy, which is used to constrain the movement radius of the control points. In each iteration, FgAA samples a policy from the policy library based on historical exploration results. Subsequently, FgAA generates training data using the sampled policy. The recognition network is trained on this data and outputs the recognition loss, which is used to update the policy with a Bayesian-based policy optimizer. After training, FgAA automatically and adaptively learns the optimal augmentation policy for a specific dataset without any manual intervention. Experiments conducted on seven handwriting datasets of different languages demonstrate that FgAA achieves the best augmentation effects for handwritten character recognition.

There are four main differences between our proposed FgAA and AA. The most significant difference lies in the augmentation methods within the policy library. As shown in Fig. 1, AA directly incorporates commonly used image augmentation methods, which mainly rely on global processing and color space transformations. In contrast, FgAA leverages the free transformation properties of Bézier curves for fine-grained augmentation of text strokes, enabling augmented samples to fit a broader range of handwriting styles. The second is the size of the search space. In the policy library of AA, each policy is a triplet comprising (1) one of 16 augmentation operations, (2) the application probability, and (3) the magnitude of the operation. The complexity of this search space significantly hampers the convergence speed of AA. In contrast, the policy library of FgAA comprises only one text augmentation, and its sampling policy includes solely the magnitude of operation, excluding other factors. Consequently, FgAA is relatively lightweight as an automated search framework. The third is the search algorithm: instead of the reinforcement learning employed by AA, we use Bayesian optimization to search for the optimal augmentation policy with fewer iterations and less time. Finally, most existing work [15–17], including AA, primarily focuses on image classification; our proposed framework fills the gap that handwriting recognition leaves in the automated augmentation paradigm.

In summary, the key contributions of this paper are as follows:

• We propose Fine-grained Automatic Augmentation (FgAA) for handwritten character recognition. FgAA can automatically learn an optimal augmentation policy tailored to different handwritten datasets and generate augmented samples with richer diversity for handwritten text, which benefits handwritten character recognition.
• We propose to employ Bézier curves to mimic diversified handwriting styles in handwritten text augmentation, introducing fine-grained variations in the strokes of the text. This approach fundamentally differs from traditional spatial transformation-based augmentations, such as rotation and scaling.
• We employ Bayesian optimization to search for the optimal augmentation policy in handwritten text augmentation, reducing the number of iterations of the model search. Simultaneously, the simplified search space accelerates the convergence of FgAA.
• Extensive experiments conducted on seven different handwritten datasets demonstrate that FgAA significantly improves the performance of text recognizers compared to the baselines.

The rest of the paper is organized as follows. In Section 2, we review the three categories of handwritten text augmentation techniques proposed in recent years. In Section 3, we introduce the details of the proposed approach. In Section 4, we present the experimental setup. In Section 5, we conduct extensive experiments to demonstrate the effectiveness of FgAA and show some augmented samples generated by the baselines and FgAA. Finally, in Section 6, we conclude the paper.

2. Related work

2.1. Basic methods and automated augmentations

Researchers have proposed numerous methods for data augmentation [7]. Basic methods primarily rely on matrix and spatial transformations to generate images. The most widely used is the affine transformation [8], which combines rotation, scaling, translation, and shearing to globally transform images. There are also methods for fine-grained transformations, such as TPS [9], which alters the image with part-specific transformations. Additionally, SlA [10] generates handwritten samples by curve fitting and reassembling strokes, and is the basis of our proposed approach. While effective, it lacks the ability to automatically determine the optimal manipulation magnitude, requiring extensive ablation experiments. In general, although these basic methods are effective, it is time-consuming to determine an appropriate approach and operation magnitude for a particular dataset. Automated augmentation offers a solution to this problem. AA [11] is a pioneering effort in automated augmentation, utilizing iterative reinforcement learning [12] to train search networks. However, its main drawback is that it takes thousands of GPU hours to converge on an augmentation strategy. To address this, Fast AutoAugment (Fast AA) [15] employs a density-matching strategy and Bayesian optimization to exponentially reduce the search time. RandAugment [16] proposes to preserve the randomness of the augmentation method and reduce the search space. TrivialAugment (TA) [17] is a simple, parameter-free auto-augmentation baseline that applies a single augmentation to each image. While automated augmentations exhibit exciting performance, their policy libraries are characterized by large parameter quantities and fail to meet the specific augmentation requirements of handwritten images. In contrast, our policy library has few parameters and is specifically designed for handwritten text.
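The global, whole-image character of these basic methods is easy to see in code. Below is a minimal sketch of an affine augmentation baseline using torchvision; the parameter ranges are our own illustrative assumptions, not the settings used in this paper's experiments.

```python
# Minimal sketch of a global affine augmentation baseline (torchvision).
# The degrees/translate/scale/shear ranges below are illustrative only.
from PIL import Image
from torchvision import transforms

affine = transforms.RandomAffine(
    degrees=5,               # small rotations
    translate=(0.05, 0.05),  # up to 5% shift in x and y
    scale=(0.9, 1.1),        # mild global rescaling
    shear=10,                # shear in degrees
    fill=255,                # pad with white background
)

word_image = Image.open("word.png").convert("L")
augmented = affine(word_image)  # one globally transformed sample
```

Note that every pixel of the word undergoes the same matrix transformation, which is exactly the coarse, global behavior that the fine-grained methods below try to move beyond.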

2.2. Deep generative models

Samples generated by the existing methods above are constrained by the rules of the methods themselves. Therefore, researchers have turned to deep generative models [6] for data augmentation. ScrabbleGAN [18] utilizes semi-supervised training to synthesize handwritten text images, alleviating the burden of labeling data. Learn to Augment (L2A) [19] is the first automated augmentation paradigm for handwriting recognition, using joint adversarial training. Alonso et al. [20] propose a network architecture based on a GAN and bidirectional LSTMs to generate realistic images of French and Arabic. GANwriting [21] produces reliable handwritten word images by conditioning the generative process on calligraphic style features and textual content. HiGAN [22] generates variable-length handwritten text based on arbitrary textual content, not bound by a predefined corpus or extra-lexical words. WordStylist [23] proposes a latent diffusion-based method and uses class-index styles and text content prompts to generate realistic word image samples for different writer styles. However, these methods often require a large amount of data to train the generator. To address the issue of few-shot handwritten text generation, HWT [24] proposes a transformer-based network for handwritten text generation. Moreover, VATr [25] generates samples by employing a novel text content representation and extracting styles from pre-training on a large synthetic dataset. Additionally, to generate samples of unseen character types, GC-DDPM [26] utilizes a denoising diffusion probabilistic model on Chinese datasets to convert font-generated Chinese character images into handwritten ones. Ding et al. [27] propose a progressive data filtering strategy to alleviate the mismatch between the text content and the glyph-conditional images synthesized by GC-DDPM. Zdenek et al. [28] use a style encoder based on a vision transformer network that encodes handwriting style from reference images and allows the generator to imitate it.

3. Methodology

To generate more diverse handwritten samples through fine-grained transformation and to automatically select appropriate augmentation parameters, we design the Fine-grained Automatic Augmentation (FgAA) paradigm, which generates augmented samples through Bézier curves and automatically selects the optimal augmentation policy through Bayesian optimization. Section 3.1 briefly outlines the overall framework of FgAA. Section 3.2 introduces the Bézier curve-based text augmentation method employed in FgAA. Finally, Section 3.3 provides a detailed explanation of how the policy optimizer searches for the optimal policy throughout the entire search process.

3.1. Overall framework

The proposed FgAA belongs to the Sequential Model-Based Optimization (SMBO) [29] family of frameworks. As illustrated in Fig. 2, FgAA is primarily composed of a policy library and four operations: sampling a policy, text augmentation, fine-tuning the recognizer, and updating the policy optimizer. The number of parameters of each policy P in the policy library depends solely on the number N of augmentations applied to the training data, as shown in Eq. (1).

$$\mathcal{P} = \{r_1, \dots, r_i, \dots, r_N\}, \quad r \in [0, 1] \tag{1}$$

The parameter r controls the movement radius of the control points in text augmentation. The policy P sampled in each iteration is employed for text augmentation. We utilize Bézier curves to mimic various handwriting styles at the stroke level; the details are elaborated in Section 3.2. The text recognizer, which is trained on the augmented samples, feeds back its recognition performance on the validation set to the policy optimizer. The Bayesian-based policy optimizer adaptively updates the optimal policy P from the policy library according to the historical performance of the recognizer.

FgAA aims to find the optimal policy P within a limited number of iterations T, which can be used to generate diverse handwriting samples and improve the performance of the recognizer. At iteration t, the policy optimizer calculates the maximum expected improvement (EI) based on the historical exploration results to select the policy P_t from the policy library that most merits exploration. The text augmentation then employs the policy P_t to augment the training set, thereby obtaining the augmented data. Subsequently, the augmented data is combined with the original training set and fed into the recognizer for fine-tuning. We utilize CRNN [30] as the text recognizer and pre-train it on the original training set. Finally, the policy optimizer updates the sampling policy P_{t+1} for the next iteration based on the historical recognition loss of the recognizer on the validation set. The specific details of the policy optimization process are described in Section 3.3. After T iterations of search, FgAA provides an optimal augmentation policy tailored to the specific handwritten dataset.

3.2. Text augmentation

Since the variety of handwriting styles and the scarcity of training data are the performance bottlenecks of character recognizers, we aim to generate more diverse training samples to train the recognition models. Given the strong controllability and high flexibility of Bézier curves, we propose to segment each word sample into several Bézier curves and generate more diverse samples by moving the control points of those curves, thus improving the performance of recognizers. The controllability and flexibility of Bézier curves have also been proven in HTR [10], text detection [14], and lane detection [31]. Our proposed text augmentation algorithm consists of three main stages: stroke segmentation, control field computation, and transformation. Fig. 2 provides the pipeline and one example of the text augmentation, and Algorithm 1 summarizes the overall process.

First, we segment handwritten text into multiple strokes based on its skeleton and connected domains, with each stroke fitted as a quadratic Bézier curve. Subsequently, we compute the control points and control field for each curve. Afterward, we allow each control point to move randomly within its control field. Finally, we reconstruct the Bézier curves based on the new control point positions and concatenate these curves to obtain a new handwritten sample. To assess the efficiency of our text augmentation, we conduct tests on the CVL [32] public dataset using an Intel(R) Xeon(R) Gold 5218R CPU; the average time required to augment one sample is only 3.2 ms. The following sections provide a detailed explanation of the three stages of our text augmentation.

3.2.1. Stroke segmentation

To apply Bézier curve deformation to handwritten text, we first need to segment the handwritten text into multiple strokes. In the initial step, the handwritten text image I undergoes convolution with a (3 × 3) blur kernel to smooth out the sharp edges of the text. Subsequently, we apply the image thinning algorithm [33] to the blurred image I_b. This step iteratively erodes the handwritten text until the skeleton I_ske is obtained. Next, we calculate the sum of the eight neighborhood pixels for each point P_i on the skeleton. Points that conform to Eq. (2) are defined as branch points.

$$\sum_{i=1}^{n=8} P_i \geq 4 \tag{2}$$

We thus obtain a collection S_b of branch points from the skeleton, which serve as the segmentation points for stroke segmentation. Next, we remove the branch points S_b from the skeleton, so that the skeleton I'_ske is no longer a single connected domain. Subsequently, we employ the Two-Pass algorithm [34] to label the different strokes in the skeleton I'_ske. This algorithm identifies and labels all the connected domains in the image during two scanning passes; pixels belonging to the same connected domain are assigned the same label. We adopt the 8-neighborhood pattern of the Two-Pass algorithm. Through the algorithm above, we obtain a pixel subset for each stroke, with each subset containing all pixels of a segmented stroke. These subsets form a set S_ske. As shown in the text segmentation stage in Fig. 2, the handwritten word 'my' is divided into 6 strokes, corresponding to 6 subsets, and each stroke is represented with a different color.
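The stroke segmentation stage maps onto standard image-processing primitives. The following is a minimal sketch under our own assumptions (dark ink on a light background, a fixed binarization threshold); here `scipy.ndimage.label` stands in for the Two-Pass connected-component labeling, and this is not the authors' released implementation.

```python
# Minimal sketch of the stroke segmentation stage (Section 3.2.1).
import cv2
import numpy as np
from skimage.morphology import skeletonize
from scipy.ndimage import convolve, label

img = cv2.imread("word.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.blur(img, (3, 3))                    # 3x3 blur kernel
binary = (blurred < 128).astype(np.uint8)          # ink = 1 (threshold assumed)
skeleton = skeletonize(binary.astype(bool)).astype(np.uint8)

# Branch points: skeleton pixels whose 8-neighborhood sum satisfies Eq. (2).
kernel = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])
neighbor_sum = convolve(skeleton, kernel, mode="constant")
branch_points = (skeleton == 1) & (neighbor_sum >= 4)

# Remove branch points, then label the remaining connected components with
# 8-connectivity; each component corresponds to one stroke subset in S_ske.
strokes = skeleton.copy()
strokes[branch_points] = 0
structure = np.ones((3, 3), dtype=int)             # 8-neighborhood pattern
labels, num_strokes = label(strokes, structure=structure)
stroke_pixels = [np.argwhere(labels == k + 1) for k in range(num_strokes)]
```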

Fig. 2. Overview of the proposed framework. In each iteration, the policy optimizer samples a policy based on the historical search results. Subsequently, text augmentation is
performed on the training data. The recognizer then undergoes fine-tuning on the augmented data and the training set. Finally, the policy optimizer updates the sampled policy
for the next iteration based on the historical recognition loss. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this
article.)

Algorithm 1 Text Augmentation
Input: Handwritten image I; sampling policy P = {r_1, ..., r_N}; augmentation times N
Output: Augmented sample set S_aug
1:  I_b ← Blur(I)                                          ⊳ Step 1: Stroke segmentation
2:  I_ske ← Thinning(I_b)
3:  S_b ← Calculate branch point set                       ⊳ Eq. (2)
4:  I'_ske ← Delete S_b in I_ske
5:  S_ske ← Two-Pass(I'_ske)
6:  for i = 1 to len(S_ske) do                             ⊳ Step 2: Control field computation
7:      P_b1, P_b2 ← Calculate the two end control points  ⊳ Eq. (3)
8:      P_f ← Calculate the point on S_ske_i farthest from P_b1 and P_b2  ⊳ Eq. (4)
9:      P_b3 ← Calculate the third control point           ⊳ Eq. (5)
10:     F_b1, F_b2, F_b3 ← Calculate control fields        ⊳ Eqs. (6)–(8)
11: end for
12: S_field ← Get the control field set
13: for i = 1 to N do                                      ⊳ Step 3: Transformation
14:     for j = 1 to len(S_field) do
15:         P'_b1, P'_b2, P'_b3 ← Move control points within their control fields  ⊳ Eq. (9)
16:         P_t ← Calculate the points on the new Bézier curve  ⊳ Eq. (10)
17:     end for
18:     I_aug_i ← Assemble curves and generate an augmented sample
19: end for
20: S_aug ← Get augmented sample set {I_aug_1, ..., I_aug_N}

3.2.2. Control field computation

When using Bézier curves to mimic strokes and generate varied handwritten characters, it is necessary to obtain the positions of the control points and to define their movement regions. If the movement of the control points were unconstrained, grotesquely shaped curves would be generated, ultimately affecting the quality and correctness of the augmented samples. It is therefore essential to define an appropriate control field. The shape of a quadratic Bézier curve is controlled by three control points: P_b1, P_b2, and P_b3. P_b1 and P_b2 are the endpoints, and P_b3 is the third control point.

As for P_b1 and P_b2, we compute the sum of the eight neighborhood pixels of each point in the stroke subset in turn; the points that conform to Eq. (3) are the curve's endpoints.

$$\sum_{i=1}^{n=8} P_i = 3 \tag{3}$$

Then, we calculate the third control point P_b3 of the curve based on the obtained endpoints. According to the definition of the Bézier curve, the position of P_b3 is related to the positions of P_b1, P_b2, and P_f, the farthest point of the curve. We find P_f on the skeleton using the point-to-line distance formula of Eq. (4), applied to the line connecting the two endpoints:

$$P_f(x_f, y_f) = \arg\max_{x_3, y_3} \frac{\left| (x_2 - x_1)\,y_3 + (y_1 - y_2)\,x_3 + x_1 y_2 - x_2 y_1 \right|}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}} \tag{4}$$

where (x_1, y_1) and (x_2, y_2) are the coordinates of P_b1 and P_b2, respectively, and (x_3, y_3) ranges over the coordinates of the remaining points of the stroke. After that, the third control point P_b3 is calculated according to Eq. (5):

$$P_{b3}(x, y) = 2 P_f(x_f, y_f) - \frac{1}{2}\left(P_{b1}(x_1, y_1) + P_{b2}(x_2, y_2)\right) \tag{5}$$

The three control points P_b1, P_b2, and P_b3 of the quadratic Bézier curve are obtained through the above process. Next, the control field of each control point is calculated. Each control field is a circle centered on its control point. First, for the control fields F_b1 and F_b2 of P_b1 and P_b2, in order to introduce as few erroneous samples as possible, we stipulate that the movement of an endpoint must not cross the position of another endpoint. Therefore, we define the distance between P_b1 (or P_b2) and its nearest other endpoint as the maximum radius of F_b1 (or F_b2). F_b1 and F_b2 are calculated as shown in Eqs. (6) and (7):

$$F_{b1} = \min_{x_c, y_c} \left( \sqrt{(x_1 - x_c)^2 + (y_1 - y_c)^2} \right) \tag{6}$$

$$F_{b2} = \min_{x_c, y_c} \left( \sqrt{(x_2 - x_c)^2 + (y_2 - y_c)^2} \right) \tag{7}$$

where (x_c, y_c) ranges over the coordinates of the other endpoints. Next, we define the control field F_b3 of the third control point P_b3, restricting its maximum radius to not exceed the distance from P_b3 to the line connecting P_b1 and P_b2. The radius of F_b3 is calculated in Eq. (8):

$$F_{b3} = \frac{\left| (x_2 - x_1)\,y + (y_1 - y_2)\,x + x_1 y_2 - x_2 y_1 \right|}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}} \tag{8}$$

where (x, y) are the coordinates of P_b3. After traversing all strokes of S_ske, we obtain the control fields of all strokes.

3.2.3. Transformation

Since we mimic handwritten strokes using Bézier curves, in the transformation stage we need to generate strokes by moving control points and then assemble these strokes to form handwritten samples. The movement of the control points is governed by the policy P = {r_1, ..., r_i, ..., r_N} described in Section 3.1. Policy P includes as many parameters as there are augmentations, with r_i being a coefficient that determines the effective control field radius for the control points.

We specify that control points move within their corresponding control fields. The movement of a control point is parameterized by its direction and distance. Inspired by RandAugment, we retain randomness in the movement direction of the control points: preserving a certain level of randomness not only reduces the search space of FgAA but also increases the diversity of the augmented samples. We then use the parameter r_i to define the distance a control point moves within its control field. In the actual calculation, we iterate through the set of control fields S_field and use Eq. (9) to compute the three new control points, P'_b1, P'_b2, and P'_b3, for each stroke:

$$\{P'_{b1}, P'_{b2}, P'_{b3}\} = r_i \times \{F_{b1}, F_{b2}, F_{b3}\}, \quad r \in [0, 1] \tag{9}$$

As mentioned earlier, r_i is the coefficient determining the radius of the control domain for the control points. A conservative r_i may restrict the diversity of augmented samples, while an aggressive r_i can lead to the generation of incorrect samples; Section 3.3 describes the process by which FgAA searches for the optimal policy P = {r_1, ..., r_i, ..., r_N}. Subsequently, we use the newly generated control points P'_b1, P'_b2, and P'_b3 to calculate the points P_t on the new Bézier curve, as shown in Eq. (10):

$$P_t = (1 - t)^2 P'_{b1} + 2t(1 - t) P'_{b3} + t^2 P'_{b2}, \quad t \in [0, 1] \tag{10}$$

After traversing all strokes, we generate all the new Bézier curves. Finally, these curves are concatenated to form a new handwritten sample I_aug. If multiple samples are augmented, only the transformation stage needs to be looped; the first two stages need not be repeated, which significantly reduces the generation time. The transformation stage in Fig. 2 demonstrates the influence of moving control points on the shape of the curve and shows the augmented sample of 'my'. Additionally, since our method is based on image reconstruction, the proposed text augmentation allows arbitrary changes to the font color, stroke weight, and background color of the image according to the user's needs.
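To make Eqs. (4)–(10) concrete, the sketch below computes the control points and control-field radii for a single stroke and then generates a perturbed quadratic Bézier curve. It is a simplified illustration under stated assumptions (the stroke endpoints are taken as given, and the movement direction is drawn uniformly at random, as described above); it is not the authors' released code.

```python
# Minimal sketch of control field computation and transformation
# (Sections 3.2.2-3.2.3) for one stroke, given its skeleton pixels.
import numpy as np

def point_line_distance(p, a, b):
    """Distance from point p to the line through a and b (Eqs. (4)/(8))."""
    (x1, y1), (x2, y2), (x3, y3) = a, b, p
    num = abs((x2 - x1) * y3 + (y1 - y2) * x3 + x1 * y2 - x2 * y1)
    return num / np.hypot(x2 - x1, y2 - y1)

def fit_quadratic_bezier(stroke, p_b1, p_b2):
    """Third control point P_b3 from the farthest stroke point (Eqs. (4)-(5))."""
    p_f = max(stroke, key=lambda p: point_line_distance(p, p_b1, p_b2))
    p_b3 = 2 * np.asarray(p_f) - 0.5 * (np.asarray(p_b1) + np.asarray(p_b2))
    return np.asarray(p_f), p_b3

def control_fields(p_b1, p_b2, p_b3, other_endpoints):
    """Control field radii F_b1, F_b2, F_b3 (Eqs. (6)-(8));
    other_endpoints is assumed non-empty."""
    f_b1 = min(np.hypot(*(np.asarray(p_b1) - e)) for e in other_endpoints)
    f_b2 = min(np.hypot(*(np.asarray(p_b2) - e)) for e in other_endpoints)
    f_b3 = point_line_distance(p_b3, p_b1, p_b2)
    return f_b1, f_b2, f_b3

def transform(points, radii, r_i, rng):
    """Move each control point by r_i * F in a random direction (Eq. (9)),
    then sample the new quadratic Bezier curve (Eq. (10))."""
    moved = []
    for p, f in zip(points, radii):
        theta = rng.uniform(0.0, 2.0 * np.pi)  # random movement direction
        moved.append(np.asarray(p, dtype=float)
                     + r_i * f * np.array([np.cos(theta), np.sin(theta)]))
    p1, p2, p3 = moved                         # P'_b1, P'_b2, P'_b3
    t = np.linspace(0.0, 1.0, 50)[:, None]
    return (1 - t) ** 2 * p1 + 2 * t * (1 - t) * p3 + t ** 2 * p2  # Eq. (10)
```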

3.3. Augmentation policy optimization

Augmentation policy optimization is a crucial step in achieving automated augmentation. The reinforcement learning utilized by AA often requires thousands of iterations, and evaluating candidate augmentation policies is the most expensive computation in the entire search framework. Consequently, we incorporate a Bayesian optimization algorithm into our policy optimizer for efficient searching. In this section, we elaborate on the detailed optimization process of the proposed method.

The optimization goal of FgAA is to find the optimal policy P that maximizes recognition accuracy. To achieve this goal, we construct our FgAA model on the Sequential Model-Based Optimization (SMBO) framework. For ease of explanation, we formalize the FgAA framework in Algorithm 2.

Algorithm 2 Fine-grained Automatic Augmentation Algorithm
Input: (M_0, f, T, D_train, D_val)
1:  Randomly select 5 policies P to build M_0 and H   ⊳ Initialization
2:  for t = 1 to T do
3:      P_t ← arg max_P EI(P)                         ⊳ Eq. (15)
4:      Get D_aug(P_t) from D_train and P_t
5:      Train f_t on D_aug(P_t)                       ⊳ Eq. (11)
6:      Evaluate L on D_val                           ⊳ Eq. (12)
7:      H ← H ∪ (f_t, P_t, L)
8:      Fit a new model M_t to H
9:  end for
10: P* = arg min_{P ∈ H} L                            ⊳ Eq. (13)
11: return P*

In the beginning, we randomly select 5 policies to initialize the mathematical model M_0 and the observation history H. The data employed in this process consists of the training set D_train and the validation set D_val. Moreover, we define the number of iterations T and introduce the recognition model f. Subsequently, we introduce a policy denoted as P = {r_1, ..., r_i, ..., r_N}, as mentioned in Eq. (1). Here, r_i is the coefficient determining the radius of the control domain for the control points, as indicated in Eq. (9), and N represents the number of augmentations. After that, FgAA generates an augmented dataset using text augmentation and the policy P. Simultaneously, we combine the original dataset with the augmented dataset to create D_aug(P). At this point, we fine-tune a pre-trained submodel f using D_aug(P). We employ CRNN as our text recognition network, and the loss function L_ctc during the training stage is shown in Eq. (11):

$$L_{ctc} = -\sum_{(x, z) \in D_{aug}(\mathcal{P})} \ln p(l \mid x) \tag{11}$$

Here, x denotes the input sequence, z represents the target sequence, and l denotes the label text.

Since we aim to find the optimal policy P, we can transform the augmentation policy optimization problem into a hyper-parameter optimization problem [29]. To achieve this goal, we introduce L as a metric to evaluate the model's performance after augmentation; L indirectly reflects the quality of the augmented data, as shown in Eq. (12):

$$L := L(f, D_{aug}(\mathcal{P}), D_{val}) \tag{12}$$

Here, L represents the recognition loss that the model f achieves on the validation set D_val after training on the augmented set D_aug. Assuming 𝔓 is the policy search space, we can obtain the optimal augmentation policy P* by minimizing L:

$$\mathcal{P}^* = \arg\min_{\mathcal{P} \in \mathbb{P}} L(f, D_{aug}(\mathcal{P}), D_{val}) \tag{13}$$

It is therefore essential to select an appropriate policy P for each iteration based on the historical loss L; efficient exploration expedites the convergence of FgAA. To model the relationship between the policy P and the historical loss L, as depicted in Eq. (12), we employ a Tree-structured Parzen Estimator (TPE) [29] as the surrogate model M. TPE defines a Gaussian distribution by computing the covariance matrix among data points within the training dataset, thereby establishing a joint probability distribution between the input P and the output L. To use the Expected Improvement (EI) optimization scheme, TPE uses two density functions to define p(P | L), as outlined in Eq. (14):

$$p_M(\mathcal{P} \mid L) = \begin{cases} l(\mathcal{P}) & \text{if } L < L^* \\ g(\mathcal{P}) & \text{if } L \geq L^* \end{cases} \tag{14}$$

Here, L* is a quantile of the observed losses L, and l(P) models the distribution of previously sampled policies whose loss is below the threshold L*, while g(P) is the density formed by the remaining observations. TPE models the conditional probability distribution p(P | L) and the marginal probability distribution p(L) separately, and then employs Bayes' rule to calculate the posterior probability distribution p(L | P). From this, we can derive the formula for EI(P), as illustrated in Eq. (15):

$$\begin{aligned}
EI_{L^*}(\mathcal{P}) &= \int_{-\infty}^{L^*} (L^* - L)\, p_M(L \mid \mathcal{P})\, dL
= \int_{-\infty}^{L^*} (L^* - L)\, \frac{p_M(\mathcal{P} \mid L)\, p(L)}{p(\mathcal{P})}\, dL \\
&= \frac{l(\mathcal{P})}{p(\mathcal{P})} \int_{-\infty}^{L^*} (L^* - L)\, p(L)\, dL
= \frac{l(\mathcal{P})\left[\gamma L^* - \int_{-\infty}^{L^*} L\, p(L)\, dL\right]}{\gamma\, l(\mathcal{P}) + (1 - \gamma)\, g(\mathcal{P})}
\propto \left( \gamma + \frac{g(\mathcal{P})}{l(\mathcal{P})} (1 - \gamma) \right)^{-1}
\end{aligned} \tag{15}$$

where γ = p(L < L*), so that p(P) = γ l(P) + (1 − γ) g(P). Therefore, TPE maximizes EI(P) by maximizing l(P)/g(P): an ideal policy P has high probability under l(P) and low probability under g(P). In each iteration, TPE updates the probability model M_t of the objective function based on the historical observations and returns the policy with maximum EI(P).
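Since the framework is built on Optuna (see Section 4.3), Algorithm 2 reduces to a short objective function. The sketch below is a hedged illustration: `augment_dataset`, `finetune_crnn`, and `validation_loss` are hypothetical stand-ins for the text augmentation of Section 3.2, the CTC fine-tuning of Eq. (11), and the evaluation of Eq. (12), and the iteration budget is illustrative.

```python
# Minimal sketch of the FgAA search loop (Algorithm 2) on top of Optuna.
# train_set, val_set, pretrained_crnn and the three helper functions are
# assumed to be defined elsewhere.
import optuna

N_AUG = 10   # number of augmentations per sample (illustrative)
T_ITER = 50  # search iterations (the paper reports ~48 needed on average)

def objective(trial: optuna.Trial) -> float:
    # One policy P = {r_1, ..., r_N}, each r_i in [0, 1] as in Eq. (1).
    policy = [trial.suggest_float(f"r_{i}", 0.0, 1.0) for i in range(N_AUG)]
    d_aug = augment_dataset(train_set, policy)                # Section 3.2
    model = finetune_crnn(pretrained_crnn, d_aug, epochs=15)  # Eq. (11), CTC
    return validation_loss(model, val_set)                    # Eq. (12)

# TPESampler implements the l(P)/g(P) criterion of Eq. (15); swapping in
# RandomSampler, NSGAIISampler, or CmaEsSampler reproduces the ablation
# of Section 5.3.1. n_startup_trials=5 mirrors the 5 random init policies.
study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(n_startup_trials=5),
)
study.optimize(objective, n_trials=T_ITER)
best_policy = [study.best_params[f"r_{i}"] for i in range(N_AUG)]  # P*, Eq. (13)
```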

4. Experimental settings

4.1. Datasets

To comprehensively validate the performance and generalization of FgAA, we conduct experiments on seven handwritten datasets. Table 1 presents information about the datasets used in the experiments. CVL [32] is a public handwritten database for writer retrieval, writer identification, and word spotting; it was produced by 311 different writers and consists of 7 different handwritten texts (1 German and 6 English). IFN/ENIT [35] is a handwritten Arabic dataset containing the names of 26,459 cities and villages; a total of 411 authors contributed to its production. HKR [36] is the first version of a database of Russian and Kazakh words for offline handwriting recognition, produced by approximately 200 different writers. The Mongolian dataset is a handwritten Mongolian dataset from Inner Mongolia University; the way Mongolian words are built makes them particularly difficult to recognize. IAM [37] is an English handwritten dataset based on the Lancaster-Oslo/Bergen (LOB) corpus; it includes 1066 forms produced by approximately 400 different writers. The Saint Gall [38] dataset contains 60 pages of a handwritten historical manuscript written in Latin by a single author at the end of the 9th century. The RIMES [39] database consists of 12,723 pages of scanned letters written in French by 1300 different authors.

Table 1
The information of the experimental datasets.

| Dataset    | Languages      | Level | Authors | Charset | # Train/Val/Test  |
|------------|----------------|-------|---------|---------|-------------------|
| CVL        | English/German | Word  | Many    | 52      | 12.1k/5.5k/80.3k  |
| IFN/ENIT   | Arabic         | Word  | Many    | 297     | 6.5k/6.0k/6.7k    |
| HKR        | Russian/Kazakh | Word  | Many    | 84      | 6.2k/3.5k/55.2k   |
| Mongolian  | Mongolian      | Word  | Many    | 79      | 10.0k/5.0k/10.0k  |
| IAM        | English        | Word  | Many    | 79      | 53.8k/16.5k/17.6k |
| IAM        | English        | Line  | Many    | 79      | 6.5k/1.0k/2.9k    |
| Saint Gall | Latin          | Line  | One     | 49      | 468/235/707       |
| RIMES      | French         | Line  | Many    | 95      | 10.2k/1.1k/0.8k   |

4.2. Baselines

To compare FgAA with previous augmentation methods, we select nine representative data augmentation baselines. We use the official implementations and weights released by the authors and evaluate all models in the same setups. Affine transformation¹ [8] uses rotation, translation, scaling, and shearing to generate new samples; it is one of the most widely used data augmentation methods and is able to generate a rich diversity of samples. SlA² [10] is a stroke-level augmentation approach and the basis of our proposed FgAA. TPS³ [9] is a fine-grained image warping method that alters the image with part-specific transformations. VATr⁴ [25] is a transformer-based network that generates samples by employing a novel text content representation and extracting styles from pre-training on a large synthetic dataset. ScrabbleGAN⁵ [18] is a semi-supervised generative adversarial network for generating handwritten text of arbitrary length; its semi-supervised training helps reduce the amount of labeled data required for the task. L2A⁶ [19] is based on similarity transformations; it employs an agent network that learns from the output of the recognition model and manipulates fiducial points to generate more adversarial augmentation parameters, rather than directly generating handwritten images. AA⁷ [11] uses reinforcement learning to search augmentation policies; these policies consist of an image processing operation, such as rotation or coloring, together with the probability and magnitude with which the operation is applied. Fast AA⁸ [15] uses a density-matching strategy to reduce the training cost of AA and improve the generalization performance of a given network. TA⁹ [17] applies an augmentation operation in a simple and random way, with an operation amplitude randomly sampled from {0, …, 30}; it does not need to search for any parameters.

4.3. Metrics and implementation details

This study utilizes the Word Accuracy Rate (WAR), Character Error Rate (CER), and Word Error Rate (WER) as evaluation metrics to validate the effects of the different augmentation methods. We use both the CRNN and VAN [2] models to experimentally validate the proposed method. For the VAN model, our data processing and model parameter settings strictly follow the original paper, and we do not use the data augmentation module provided by the VAN model. Due to the size requirements of the CRNN input layer, all images are rescaled to 160 × 32. If the aspect ratio of the original handwritten image is greater than 160/32, we first zero-pad the top and bottom of the image and then resize it to 160 × 32; if the aspect ratio is less than 160/32, we first zero-pad the left and right of the image and then resize it to 160 × 32 (a minimal sketch of this procedure is given at the end of this section). Besides, to ensure a fair comparison, the learning rate and batch size of the CRNN are uniformly set to 1e−4 and 32. In each search iteration, the pre-trained CRNN is fine-tuned for 15 epochs on the augmented data before being used for validation; the pre-training data is the original training set. All baselines are implemented according to their original papers and open-source code. Moreover, our framework is based on Optuna,¹⁰ an open-source hyperparameter optimization framework that allows us to easily switch between search algorithms. Due to the low search cost of the proposed FgAA, we do not adopt distributed training, and we conduct all experiments on a single NVIDIA Tesla P100.

¹ https://pytorch.org/vision/stable/index.html
² https://github.com/IMU-MachineLearningSXD/script-level_aug_ICFHR2022
³ https://docs.opencv.org/3.4/dc/d18/classcv_1_1ThinPlateSplineShapeTransformer.html
⁴ https://github.com/aimagelab/VATr
⁵ https://github.com/amzn/convolutional-handwriting-gan
⁶ https://github.com/Canjie-Luo/Text-Image-Augmentation
⁷ https://github.com/DeepVoltaire/AutoAugment
⁸ https://github.com/kakaobrain/fast-autoaugment
⁹ https://github.com/automl/trivialaugment
¹⁰ https://optuna.org/
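As a concrete illustration of the preprocessing in Section 4.3, the following minimal sketch zero-pads a word image to the 160/32 aspect ratio before resizing. It reflects our reading of the described procedure, not the authors' exact code.

```python
# Minimal sketch of the aspect-ratio-preserving resize to 160x32 (Section 4.3).
import cv2
import numpy as np

def resize_with_padding(img: np.ndarray,
                        target_w: int = 160,
                        target_h: int = 32) -> np.ndarray:
    h, w = img.shape[:2]
    target_ratio = target_w / target_h  # 160/32 = 5.0
    if w / h > target_ratio:
        # Image is too wide: zero-pad the top and bottom.
        new_h = int(round(w / target_ratio))
        pad = new_h - h
        img = cv2.copyMakeBorder(img, pad // 2, pad - pad // 2, 0, 0,
                                 cv2.BORDER_CONSTANT, value=0)
    else:
        # Image is too tall (or narrow): zero-pad the left and right.
        new_w = int(round(h * target_ratio))
        pad = new_w - w
        img = cv2.copyMakeBorder(img, 0, 0, pad // 2, pad - pad // 2,
                                 cv2.BORDER_CONSTANT, value=0)
    return cv2.resize(img, (target_w, target_h))
```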

5. Results and discussion

5.1. Comparison with baselines

To comprehensively evaluate FgAA, in this section we compare it against nine baselines across five datasets. These baselines include classical methods based on matrix transformations, methods employing fine-grained local transformations, deep generative models, and automatic augmentations. We report the mean and standard deviation of five repeated experiments.

We first report the experimental results for the VAN recognizer without augmentation, as shown in Table 2. We find that the model's performance on low-resource datasets is not satisfactory. For example, the CER on the Mongolian dataset only reaches 16.3%, due to the fact that there is only one training sample per word in the Mongolian dataset. It is therefore necessary to apply data augmentation to such data-scarce datasets.

Table 2
Word-level recognition results of VAN without augmentation.

| Dataset   | WER (%)↓     | CER (%)↓     |
|-----------|--------------|--------------|
| CVL       | 22.74 ± 0.45 | 9.16 ± 0.21  |
| IFN/ENIT  | 36.97 ± 0.37 | 10.16 ± 0.26 |
| HKR       | 33.07 ± 0.63 | 13.52 ± 0.31 |
| Mongolian | 55.96 ± 0.77 | 16.30 ± 0.34 |
| IAM       | 15.36 ± 0.28 | 5.51 ± 0.15  |

In addition, we report the comparison with the baselines in Tables 3, 4, and 5. SlA consistently achieves the best performance among the baselines; it is the foundation of FgAA but cannot automatically acquire the optimal operation magnitude, requiring extensive ablation experiments. FgAA outperforms SlA on average by 6.5% in WER and 5.4% in CER. Among the deep generative models (ScrabbleGAN, VATr, L2A), L2A consistently achieves the best performance; FgAA outperforms L2A on average by 8.7% in WER and 7.6% in CER. Among the automatic augmentations (AA, Fast AA, TA), AA performs best in most cases; FgAA outperforms AA on average by 23.5% in WER and 21.3% in CER. These results indicate that our proposed method achieves the best performance across the five datasets and the various augmentation times, and that the augmented samples generated by FgAA better capture a broad range of handwriting styles, significantly improving the performance of the recognition models. This result stems from FgAA's fine-grained augmentation and its ability to obtain the optimal augmentation policy.

Table 3
Word-level recognition results of FgAA compared to the baselines on the CVL dataset (augmentation times/# augmented samples: 10/133.6k and 15/194.4k).

| Method           | WER (%)↓ (10×) | CER (%)↓ (10×) | WER (%)↓ (15×) | CER (%)↓ (15×) |
|------------------|----------------|----------------|----------------|----------------|
| Affine [8]       | 12.21 ± 0.20   | 4.87 ± 0.12    | 11.58 ± 0.16   | 4.62 ± 0.09    |
| SlA [10]         | 10.95 ± 0.31   | 4.37 ± 0.21    | 10.65 ± 0.24   | 4.25 ± 0.05    |
| TPS [9]          | 19.67 ± 0.30   | 7.99 ± 0.19    | 18.88 ± 0.12   | 7.67 ± 0.11    |
| ScrabbleGAN [18] | 21.54 ± 0.43   | 8.75 ± 0.27    | 21.32 ± 0.36   | 8.66 ± 0.19    |
| VATr [25]        | 20.99 ± 0.37   | 8.53 ± 0.20    | 20.72 ± 0.24   | 8.42 ± 0.16    |
| L2A [19]         | 11.73 ± 0.18   | 4.68 ± 0.08    | 11.01 ± 0.31   | 4.39 ± 0.09    |
| AA [11]          | 12.98 ± 0.14   | 5.19 ± 0.11    | 12.95 ± 0.20   | 5.34 ± 0.12    |
| Fast AA [15]     | 12.93 ± 0.22   | 5.32 ± 0.17    | 11.92 ± 0.08   | 5.24 ± 0.21    |
| TA [17]          | 13.23 ± 0.16   | 5.33 ± 0.05    | 12.64 ± 0.19   | 5.03 ± 0.11    |
| FgAA (ours)      | 10.15 ± 0.23   | 4.05 ± 0.13    | 9.74 ± 0.30    | 3.89 ± 0.15    |

Table 4
Word-level recognition results of FgAA compared to the baselines on the IFN/ENIT and HKR datasets (augmentation times/# augmented samples: IFN/ENIT 10/71.9k; HKR 10/67.9k).

| Method           | IFN/ENIT WER (%)↓ | IFN/ENIT CER (%)↓ | HKR WER (%)↓ | HKR CER (%)↓ |
|------------------|-------------------|-------------------|--------------|--------------|
| Affine [8]       | 28.41 ± 0.31      | 6.34 ± 0.11       | 17.56 ± 0.21 | 8.36 ± 0.12  |
| SlA [10]         | 24.99 ± 0.13      | 5.94 ± 0.03       | 15.80 ± 0.29 | 7.67 ± 0.14  |
| TPS [9]          | 30.07 ± 0.36      | 7.45 ± 0.14       | 29.05 ± 0.32 | 11.88 ± 0.17 |
| ScrabbleGAN [18] | 36.01 ± 0.30      | 8.56 ± 0.22       | 31.13 ± 0.18 | 12.75 ± 0.15 |
| VATr [25]        | 34.62 ± 0.27      | 8.23 ± 0.19       | 30.64 ± 0.30 | 12.53 ± 0.23 |
| L2A [19]         | 25.70 ± 0.26      | 6.11 ± 0.18       | 16.11 ± 0.16 | 7.92 ± 0.20  |
| AA [11]          | 26.82 ± 0.16      | 6.13 ± 0.07       | 21.22 ± 0.42 | 9.49 ± 0.17  |
| Fast AA [15]     | 25.91 ± 0.28      | 6.25 ± 0.24       | 21.83 ± 0.33 | 9.86 ± 0.21  |
| TA [17]          | 27.54 ± 0.32      | 6.57 ± 0.18       | 20.59 ± 0.39 | 9.38 ± 0.31  |
| FgAA (ours)      | 24.23 ± 0.15      | 5.76 ± 0.08       | 15.09 ± 0.24 | 7.42 ± 0.11  |

Table 5
Word-level recognition results of FgAA compared to the baselines on the Mongolian and IAM datasets (augmentation times/# augmented samples: Mongolian 10/110.0k; IAM 10/592.2k).

| Method           | Mongolian WER (%)↓ | Mongolian CER (%)↓ | IAM WER (%)↓ | IAM CER (%)↓ |
|------------------|--------------------|--------------------|--------------|--------------|
| Affine [8]       | 48.05 ± 0.37       | 12.83 ± 0.11       | 12.71 ± 0.21 | 4.52 ± 0.07  |
| SlA [10]         | 46.10 ± 0.22       | 12.31 ± 0.13       | 11.80 ± 0.11 | 4.23 ± 0.09  |
| TPS [9]          | 52.27 ± 0.31       | 13.97 ± 0.16       | 14.16 ± 0.18 | 5.12 ± 0.11  |
| ScrabbleGAN [18] | 54.68 ± 0.36       | 14.88 ± 0.18       | 15.18 ± 0.28 | 5.43 ± 0.10  |
| VATr [25]        | 54.33 ± 0.47       | 14.46 ± 0.22       | 14.90 ± 0.24 | 5.35 ± 0.15  |
| L2A [19]         | 45.87 ± 0.18       | 12.63 ± 0.09       | 12.14 ± 0.15 | 4.35 ± 0.09  |
| AA [11]          | 46.57 ± 0.25       | 12.72 ± 0.19       | 13.31 ± 0.29 | 4.76 ± 0.09  |
| Fast AA [15]     | 46.54 ± 0.40       | 12.95 ± 0.31       | 13.72 ± 0.22 | 4.87 ± 0.10  |
| TA [17]          | 47.51 ± 0.33       | 13.10 ± 0.28       | 13.58 ± 0.14 | 4.81 ± 0.12  |
| FgAA (ours)      | 42.82 ± 0.24       | 11.79 ± 0.12       | 11.10 ± 0.10 | 3.97 ± 0.06  |

Moreover, we observe that increasing the augmentation times does not always improve performance and sometimes harms it. For example, on the CVL dataset, the CER of AA increased by 2.9% when the augmentation times grew from 10 to 15. The reason is that aggressive augmentation parameters may lead some methods to generate erroneous samples; for instance, translation or masking may cause the loss of handwritten characters, resulting in omissions in character recognition. This phenomenon implies that ineffective or incorrect augmented samples fail to assist the model in training and can sometimes further harm it. In contrast, FgAA consistently generates highly diverse augmented samples without encountering these issues.

Furthermore, we find that the number of augmentation operation classes in a method does not fully reflect its ability to mimic handwriting styles; in other words, a greater variety of augmentation operation types does not necessarily yield better augmentation performance for HTR. For instance, the policy libraries of AA, Fast AA, and TA contain 16 augmentation operations, while FgAA includes only one augmentation operation based on Bézier curves. Experimental results on the five datasets indicate that FgAA outperforms AA, Fast AA, and TA in WER by 23.5%, 22.3%, and 23.9%, respectively, and in CER by 21.3%, 23.3%, and 22.3%, respectively. This is because the augmentation operations in the baselines are not specifically designed or selected for HTR; operations based on color space transformations, for example, cannot mimic various handwriting styles. This also demonstrates that FgAA can generate samples that are more conducive to HTR.

Finally, as for generative adversarial networks, they often require a substantial amount of data to train a well-performing generator, which involves high data collection and annotation costs; their performance in low-resource scenarios is therefore not satisfying. L2A is the most stable and effective among these baselines: it relies on similarity transformations and generates the parameters of augmentation operations rather than the augmented images themselves. The experimental results of FgAA and L2A demonstrate that methods capable of inducing morphological transformations in handwritten text are more effective at improving the performance of HTR.

5.2. Policy library analysis

After the optimization process, we obtain the learned augmentation policies of FgAA. Table 6 shows the optimal policies learned on four of the experimental datasets. Each policy contains 20 values, r_1 to r_20; the number of r values in each policy equals the augmentation times used in the experiments, and r_i is the augmentation parameter that determines the movement of the control points when generating the i-th sample. The distribution of r is mainly concentrated between 0.1 and 0.5, with slight differences in the upper and lower limits across the different language datasets. A few values fall between 0 and 0.1 and between 0.5 and 0.7, and none lie above 0.7. This result shows that moderate deformation is more likely to generate high-quality augmented samples, whereas radical parameters tend to produce incorrect samples, which is ineffective and can even be harmful to the recognition model.

Table 6
Policy library for FgAA. Each column lists the 20 learned values r_1–r_20 for one dataset (two values per row).

| r_1–r_20 | CVL          | IFN/ENIT     | HKR          | Mongolian    |
|----------|--------------|--------------|--------------|--------------|
|          | 0.290, 0.336 | 0.641, 0.471 | 0.070, 0.237 | 0.106, 0.282 |
|          | 0.284, 0.256 | 0.346, 0.442 | 0.486, 0.170 | 0.128, 0.274 |
|          | 0.250, 0.185 | 0.292, 0.261 | 0.129, 0.118 | 0.296, 0.157 |
|          | 0.302, 0.134 | 0.128, 0.102 | 0.138, 0.215 | 0.179, 0.264 |
|          | 0.126, 0.252 | 0.165, 0.142 | 0.249, 0.261 | 0.185, 0.140 |
|          | 0.256, 0.114 | 0.185, 0.173 | 0.447, 0.272 | 0.223, 0.389 |
|          | 0.344, 0.291 | 0.145, 0.226 | 0.129, 0.186 | 0.086, 0.196 |
|          | 0.290, 0.379 | 0.240, 0.315 | 0.278, 0.357 | 0.151, 0.256 |
|          | 0.324, 0.137 | 0.424, 0.443 | 0.142, 0.124 | 0.316, 0.222 |
|          | 0.322, 0.253 | 0.101, 0.132 | 0.119, 0.130 | 0.116, 0.443 |

5.3. Ablation study

5.3.1. Ablation on search algorithms

As described in Section 3.3, we use the Tree-structured Parzen Estimator (TPE) [29] optimization algorithm in FgAA. Other optimization algorithms can also be used in the policy optimization stage of FgAA, including Random search [40], NSGA-II [41], and CMA-ES [42]. In this section, we compare the performance of these four optimization algorithms. Table 7 reports the recognition accuracies on the test set after fine-tuning the CRNN recognizer for 15 epochs under each optimization algorithm; the training data is augmented 5 times.

Table 7
Comparison of different search algorithms on FgAA (recognition accuracy, %).

| Search algorithm   | CVL   | IFN/ENIT | HKR   | Mongolian |
|--------------------|-------|----------|-------|-----------|
| Random search [40] | 58.61 | 64.69    | 55.57 | 11.30     |
| NSGA-II [41]       | 58.79 | 64.87    | 55.74 | 11.34     |
| CMA-ES [42]        | 58.82 | 64.86    | 55.74 | 11.39     |
| TPE (ours)         | 58.84 | 64.86    | 55.76 | 11.44     |

Firstly, the uncertainty and lack of direction of Random search lead to the worst performance. CMA-ES achieves suboptimal performance in most cases; its advantage is the ability to select sampling points for the next iteration based on historical observations. NSGA-II obtains the best performance on the IFN/ENIT dataset. TPE obtains the best performance on the CVL, HKR, and Mongolian datasets and suboptimal performance on IFN/ENIT, and outperforms Random search by an average of 0.38%. It precisely searches the optimal movement region for the control points and demonstrates excellent convergence and exploration ability. We therefore use the TPE algorithm in FgAA.

5.3.2. Ablation on number of iterations

To explore the search cost of FgAA, we conduct ablation experiments on the number of iterations and the GPU hours needed to search for the optimal augmentation policy. To avoid missing the optimal policy, we set the maximum number of iterations to 150 and choose a moderate augmentation times of 8. We report the experimental results every 25 epochs in Table 8. On the CVL, IFN/ENIT, HKR, and Mongolian datasets, FgAA obtains the optimal policy at the 74th, 39th, 60th, and 19th iteration epoch, respectively. On average, FgAA requires only 48 iterations to complete the entire training process on a single NVIDIA Tesla P100.

5.4. Evaluation on single-author and large datasets

To evaluate our method's performance on single-author and large datasets, we conduct experiments on the Saint Gall and RIMES datasets. The Saint Gall dataset consists of handwritten text lines from a single author, while the RIMES dataset contains 10,188 handwritten text lines in its training set. We perform the same proportional data augmentation on both datasets using FgAA and the two other top-performing baselines. The augmented datasets are then used to train the VAN recognizer, and we report its line-level recognition results on the test sets of Saint Gall and RIMES in Table 9.

For the Saint Gall dataset, without augmentation the CER and WER are 9.53% and 33.15%, respectively. The scarcity of data in the original training set limits the performance of the recognizer, making data augmentation significantly beneficial. The CER of L2A, SlA, and FgAA decreases by 1.73%, 1.64%, and 1.88%, respectively, but the improvements are similar across methods. The reason is that a single-author dataset provides only one handwriting style, which restricts the diversity that any augmentation method can add.

The RIMES dataset's CER and WER are 3.70% and 11.03%, respectively, without augmentation. The ample data provided by the large dataset allows the recognizer to be fully trained and to achieve excellent recognition performance, so data augmentation does not improve recognition much: the CER of L2A, SlA, and FgAA decreases by 0.31%, 0.34%, and 0.47%, respectively. Although our method performs best, the improvement on large datasets is limited. Additionally, augmenting large datasets greatly increases the training burden of the model: we augmented the RIMES dataset only twice, yet the training time of the recognizer increased from 2.4 days to 8.7 days.

5.5. Runtime of FgAA

To investigate the relationship between the number of control points and augmentation time, we conduct tests on the CVL word dataset using an Intel(R) Xeon(R) Gold 5218R CPU. We employ multiprocessing to augment each training sample 40 times and record both the number of control points and the augmentation time for each sample. The results are shown in Fig. 3: panel (a) gives the average time consumed to augment a sample once with different numbers of control points, and panel (b) gives the proportion of samples with different numbers of control points in the dataset. Each additional control point increases the average augmentation time for one image by approximately 0.06 ms. Furthermore, most samples have between 5 and 30 control points, corresponding to augmentation times of 2.42 to 3.92 ms. Through statistical analysis, the average time to augment one image of the CVL dataset with our method is 3.2 ms.

Table 8
Ablation experiments on FgAA iterations. Optimal epochs represent the iteration epoch in which the optimal policy
is found within the maximum epochs.
CVL IFN/ENIT
Maximum epochs 25 50 75 100 25 50 75 100
Optimal epochs 20 41 74 74 9 39 39 39
Accuracy 62.69 62.75 62.83 62.83 68.08 68.21 68.21 68.21
HKR Mongolian
Maximum epochs 25 50 75 100 25 50 75 100
Optimal epochs 1 28 60 60 19 19 19 19
Accuracy 62.69 62.95 63.13 63.13 15.21 15.21 15.21 15.21

Fig. 3. (a) The average time consumed for augmenting one image of an original sample with different numbers of control points. (b) The proportion of samples with different
numbers of control points in the CVL word dataset.

Table 9 Table 10
Line-level recognition results of FgAA compared to the baselines on single-author and Comparison of runtime for FgAA and AA across five datasets, reported in days.
large datasets. CVL IFN/ENIT HKR Mongolian IAM Average
Dataset Method CER (%)↓ WER (%)↓ #train Train.time
AA [11] 41.8d 35.7d 37.4d 32.2d 39.1d 36.8d
No aug 9.53 33.15 468 0.2d FgAA (ours) 2.3d 1.1d 1.7d 0.7d 2.6d 1.7d
L2A [19] 7.80 26.69 2,808 0.9d
Saint Gall
SIA [10] 7.89 27.44 2,808 0.9d
FgAA (ours) 7.68 26.30 2,808 0.9d
No aug 3.70 11.03 10,188 2.4d performance on three different training datasets of the same size at line-
L2A [19] 3.39 10.07 30,564 8.7d level and word-level, including IAM + synthetic data, IAM + real data,
RIMES
SIA [10] 3.36 9.92 30,564 8.7d
and IAM + our augmented data. The synthetic data is generated by the
FgAA (ours) 3.23 9.66 30,564 8.7d
publicly available TRDG11 model. The real data of line-level is from
the training dataset of the RIMES, ICFHR14,12 and NorHand13 datasets.
The real data of word-level is the CVL dataset. Our augmented data is
Additionally, to compare the runtime of the proposed FgAA with the baseline AA in searching for the optimal policies, we conduct experiments on five datasets for both FgAA and AA. For AA, we follow the experimental settings from the original paper. All the experiments are conducted on an NVIDIA Tesla P100. We report the runtime of the two methods in Table 10. AA requires 41.8, 35.7, 37.4, 32.2, and 39.1 days to search for the optimal policy on the CVL, IFN/ENIT, HKR, Mongolian, and IAM datasets, respectively, whereas FgAA requires 2.3, 1.1, 1.7, 0.7, and 2.6 days on the same datasets. The average search times for AA and FgAA are 36.8 and 1.7 days, respectively. The longer search time for AA is due to its larger search space and its use of a reinforcement learning algorithm with a higher number of iterations. In contrast, FgAA only learns the range of control point movements, resulting in significantly fewer parameters, and we use the faster TPE algorithm for searching. The experimental results demonstrate that FgAA has a substantial advantage over AA in terms of runtime.

5.6. Comparison with synthetic and real data


To compare the effects among our augmented data, synthetic data, and real data, we select the VAN model [2] and compare the models' performance on three different training datasets of the same size at line level and word level: IAM + synthetic data, IAM + real data, and IAM + our augmented data. The synthetic data is generated by the publicly available TRDG model (https://github.com/Belval/TextRecognitionDataGenerator). The line-level real data is drawn from the training sets of the RIMES, ICFHR14 (https://zenodo.org/records/44519), and NorHand (https://zenodo.org/records/6542056) datasets, and the word-level real data is the CVL dataset. Our augmented data is generated from the IAM. All experiments are done on a single NVIDIA P100 GPU. We report the CER values on the IAM test set, the sizes of the above three training datasets, and the convergence time of each HTR model on these datasets in Table 11.

From Table 11, we find that enlarging the IAM training set with any of the three datasets (synthetic data, real data, or augmented data) improves HTR performance. The CER at line level is reduced by 0.77%, 1.06%, and 0.90% for the three datasets, respectively, and the CER at word level is reduced by 0.76%, 1.39%, and 1.17%, respectively. The results indicate that FgAA improves the recognizer more than synthetic data does. This is because our augmentation method generates more diverse augmented samples by fitting handwriting strokes with Bézier curves and automatically learning the operation magnitude through Bayesian optimization, whereas the diversity of the synthetic data is limited. In addition, the improvement from our augmented data is less than that from the real data, because the real data is written by hundreds of different authors and matches the real distribution, whereas our augmented data is generated from the IAM training set alone. In summary, our augmented data brings an obvious performance improvement for representative HTR models.


Fig. 4. Augmented samples generated by our model for handwritten text lines.

Fig. 5. Augmented samples generated by our model on handwritten words in different languages.

Table 11
Word-level and line-level recognition performance with different data added to the IAM dataset, including synthetic data, real data, and our augmented data. The test dataset is the test set of IAM.

HTR level  Training data     #train    GPU hours   CER (%)↓
Line       IAM               6,482     33          4.98
           IAM+Synthetic     45,521    250         4.21
           IAM+Real          45,521    228         3.92
           IAM+FgAA (ours)   45,521    283         4.08
Word       IAM               53,839    35          5.51
           IAM+Synthetic     151,843   84          4.75
           IAM+Real          151,843   81          4.12
           IAM+FgAA (ours)   151,843   85          4.34

5.7. Case study

To better visualize the effect of the proposed augmentation method, we present augmented samples generated by FgAA for handwritten text lines and handwritten words in Figs. 4 and 5, respectively. By manipulating control points and reconstructing strokes using Bézier curves, the proposed FgAA can generate samples that reflect a broader range of handwriting styles. FgAA tends to induce local deformations in each stroke of the handwritten text rather than global scaling or rotation. Additionally, it employs the TPE algorithm to automatically learn optimal augmentation policies, ensuring the controllability and rationality of sample generation. These augmented samples validate the effectiveness of FgAA and its universality across multiple languages. More cases are available at our GitHub repository.
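As a rough sketch of this stroke-level deformation (the helper names, the cubic-curve assumption, and the magnitude below are illustrative choices, not our exact implementation), a fitted stroke can be deformed by jittering its Bézier control points and resampling the curve:

# Sketch: deform one stroke, assumed to be fitted as a cubic Bezier curve
# with four 2-D control points, by moving each control point within
# +/- magnitude and resampling the curve. Names here are illustrative.
import numpy as np

def bezier_points(ctrl, n=50):
    # Evaluate a cubic Bezier curve (ctrl has shape (4, 2)) at n values of t.
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * ctrl[0] + 3 * (1 - t) ** 2 * t * ctrl[1]
            + 3 * (1 - t) * t ** 2 * ctrl[2] + t ** 3 * ctrl[3])

def augment_stroke(ctrl, magnitude, rng):
    # Local deformation: jitter the control points, keep the stroke intact.
    jitter = rng.uniform(-magnitude, magnitude, size=ctrl.shape)
    return bezier_points(ctrl + jitter)

rng = np.random.default_rng(0)
stroke = np.array([[0.0, 0.0], [10.0, 20.0], [25.0, 18.0], [40.0, 5.0]])
deformed = augment_stroke(stroke, magnitude=3.5, rng=rng)
print(deformed.shape)  # (50, 2): points to re-render onto the word image

Because only the control points move, the deformation stays local to each stroke, which matches the behavior visible in Figs. 4 and 5.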


6. Conclusion

This paper designs Fine-grained Automatic Augmentation (FgAA) to generate diverse, high-quality training samples for handwritten character recognition. FgAA employs Bézier curves to simulate handwritten script and introduces fine-grained deformations through control-point movement. It uses the TPE optimization algorithm to automatically adjust the range of control-point movements and obtain the optimal augmentation policy, ensuring controlled and reasonable sample generation. FgAA completes the search process in an average of 1.7 days on a single NVIDIA Tesla P100, requiring no manual intervention. The experimental results on the CVL, IFN/ENIT, HKR, Mongolian, and IAM datasets indicate that FgAA can generate samples that reflect a broader range of handwriting styles. FgAA outperforms various types of baselines, providing the largest improvement to the performance of HTR models. Additionally, the experimental results demonstrate that the TPE algorithm exhibits superior performance on FgAA compared to other optimization algorithms.

Although our method has achieved promising results on various benchmarks, the proposed FgAA still has some limitations. For instance, unclear or noisy handwritten text may affect the extraction of control points, thereby reducing the quality of augmented samples. In future research, we will first improve the image preprocessing steps. Additionally, although we have implemented an automatic search for the optimal operation magnitude of FgAA, we cannot yet dynamically evaluate the generated samples to select those that are most helpful in improving the performance of the HTR model. Through further exploration, we have found that the influence function can be used to measure the impact of samples on model parameters. Therefore, the influence function may help us explore the importance of different generated samples from the perspective of model interpretability. This is also our future research direction.

CRediT authorship contribution statement

Wei Chen: Methodology, Validation, Writing – original draft, Writing – review & editing. Xiangdong Su: Funding acquisition, Investigation, Methodology, Writing – review & editing. Hongxu Hou: Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was funded by the National Natural Science Foundation of China (Grant No. 62366036), the National Education Science Planning Project, China (Grant No. BIX230343), the Key R&D and Achievement Transformation Program of Inner Mongolia Autonomous Region, China (Grant No. 2022YFHH0077), the Central Government Fund for Promoting Local Scientific and Technological Development, China (Grant No. 2022ZY0198), the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region, China (Grant No. NJYT24033), the Inner Mongolia Autonomous Region Science and Technology Planning Project, China (Grant No. 2023YFSH0017), the Fund of Supporting the Reform and Development of Local Universities (Disciplinary Construction), and the Special Research Project of First-class Discipline of Inner Mongolia A. R. of China (Grant No. YLXKZX-ND-036).

Data availability

Data will be made available on request.

References

[1] V. Pippi, S. Cascianelli, C. Kermorvant, R. Cucchiara, How to choose pretrained handwriting recognition models for single writer fine-tuning, in: Proceedings of the ICDAR, Vol. 14188, Springer, 2023, pp. 330–347.
[2] D. Coquenet, C. Chatelain, T. Paquet, End-to-end handwritten paragraph text recognition using a vertical attention network, IEEE Trans. Pattern Anal. Mach. Intell. 45 (1) (2023) 508–524.
[3] A.K. Bhunia, A. Sain, P.N. Chowdhury, Y. Song, Text is text, no matter what: Unifying text recognition using knowledge distillation, in: Proceedings of the ICCV, 2021, pp. 963–972.
[4] L. Kang, P. Riba, M. Rusiñol, A. Fornés, M. Villegas, Content and style aware generation of text-line images for handwriting recognition, IEEE Trans. Pattern Anal. Mach. Intell. 44 (12) (2022) 8846–8860.
[5] M.A. Souibgui, A.F. Biten, S. Dey, A. Fornés, Y. Kessentini, L. Gómez, D. Karatzas, J. Lladós, One-shot compositional data generation for low resource handwritten text recognition, in: Proceedings of the WACV, IEEE, 2022, pp. 2563–2571.
[6] A.F.d. Neto, B.L.D. Bezerra, G.C.D. de Moura, A.H. Toselli, Data augmentation for offline handwritten text recognition: A systematic literature review, SN Comput. Sci. 5 (2) (2024) 258.
[7] M. Xu, S. Yoon, A. Fuentes, D.S. Park, A comprehensive survey of image augmentation techniques for deep learning, Pattern Recognit. 137 (2023) 109347.
[8] Q. Lin, C. Luo, L. Jin, S. Lai, STAN: a sequential transformation attention-based network for scene text recognition, Pattern Recognit. 111 (2021) 107692.
[9] V. Pippi, S. Cascianelli, L. Baraldi, R. Cucchiara, Evaluating synthetic pre-training for handwriting processing tasks, Pattern Recognit. 172 (2023) 44–50.
[10] W. Chen, X. Su, H. Zhang, Script-level word sample augmentation for few-shot handwritten text recognition, in: Proceedings of the ICFHR, Vol. 13639, Springer, 2022, pp. 316–330.
[11] E.D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, Q.V. Le, AutoAugment: Learning augmentation strategies from data, in: Proceedings of the CVPR, 2019, pp. 113–123.
[12] X. Du, H. Chen, C. Wang, Y. Xing, J. Yang, P.S. Yu, Y. Chang, L. He, Robust multi-agent reinforcement learning via Bayesian distributional value estimation, Pattern Recognit. 145 (2024) 109917.
[13] R. Atienza, Data augmentation for scene text recognition, in: Proceedings of the ICCV, 2021, pp. 1561–1570.
[14] Y. Liu, H. Chen, C. Shen, T. He, L. Jin, L. Wang, ABCNet: Real-time scene text spotting with adaptive Bezier-curve network, in: Proceedings of the CVPR, IEEE, 2020, pp. 9806–9815.
[15] S. Lim, I. Kim, T. Kim, C. Kim, S. Kim, Fast AutoAugment, Adv. Neural Inf. Process. Syst. 32 (2019) 6662–6672.
[16] E.D. Cubuk, B. Zoph, J. Shlens, Q.V. Le, RandAugment: Practical automated data augmentation with a reduced search space, in: Proceedings of the CVPR Workshops, 2020, pp. 702–703.
[17] S.G. Müller, F. Hutter, TrivialAugment: Tuning-free yet state-of-the-art data augmentation, in: Proceedings of the ICCV, 2021, pp. 774–782.
[18] S. Fogel, H. Averbuch-Elor, S. Cohen, S. Mazor, R. Litman, ScrabbleGAN: Semi-supervised varying length handwritten text generation, in: Proceedings of the CVPR, 2020, pp. 4324–4333.
[19] C. Luo, Y. Zhu, L. Jin, Y. Wang, Learn to augment: Joint data augmentation and network optimization for text recognition, in: Proceedings of the CVPR, 2020, pp. 13746–13755.
[20] E. Alonso, B. Moysset, R. Messina, Adversarial generation of handwritten text images conditioned on sequences, in: Proceedings of the ICDAR, IEEE, 2019, pp. 481–486.
[21] L. Kang, P. Riba, Y. Wang, M. Rusiñol, A. Fornés, M. Villegas, GANwriting: Content-conditioned generation of styled handwritten word images, in: Proceedings of the ECCV, Vol. 12368, Springer, 2020, pp. 273–289.
[22] J. Gan, W. Wang, HiGAN: Handwriting imitation conditioned on arbitrary-length texts and disentangled styles, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 7484–7492.
[23] K. Nikolaidou, G. Retsinas, V. Christlein, M. Seuret, G. Sfikas, E.B. Smith, H. Mokayed, M. Liwicki, WordStylist: Styled verbatim handwritten text generation with latent diffusion models, in: Proceedings of the ICDAR, Vol. 14188, 2023, pp. 384–401.
[24] A.K. Bhunia, S.H. Khan, H. Cholakkal, R.M. Anwer, F.S. Khan, M. Shah, Handwriting transformers, in: Proceedings of the ICCV, IEEE, 2021, pp. 1066–1074.
[25] V. Pippi, S. Cascianelli, R. Cucchiara, Handwritten text generation from visual archetypes, in: Proceedings of the CVPR, IEEE, 2023, pp. 22458–22467.
[26] D. Gui, K. Chen, H. Ding, Q. Huo, Zero-shot generation of training data with denoising diffusion probabilistic model for handwritten Chinese character recognition, in: Proceedings of the ICDAR, Vol. 14188, 2023, pp. 348–365.
[27] H. Ding, B. Luan, D. Gui, K. Chen, Q. Huo, Improving handwritten OCR with training samples generated by glyph conditional denoising diffusion probabilistic model, in: Proceedings of the ICDAR, Vol. 14190, 2023, pp. 20–37.


[28] J. Zdenek, H. Nakayama, Handwritten text generation with character-specific encoding for style imitation, in: Proceedings of the ICDAR, Vol. 14188, Springer, 2023, pp. 313–329.
[29] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: Advances in Neural Information Processing Systems, 2011, pp. 2546–2554.
[30] B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE Trans. Pattern Anal. Mach. Intell. 39 (11) (2017) 2298–2304.
[31] Z. Feng, S. Guo, X. Tan, K. Xu, M. Wang, L. Ma, Rethinking efficient lane detection via curve modeling, in: Proceedings of the CVPR, IEEE, 2022, pp. 17041–17049.
[32] F. Kleber, S. Fiel, M. Diem, R. Sablatnig, CVL-Database: An off-line database for writer retrieval, writer identification and word spotting, in: Proceedings of the ICDAR, IEEE, 2013, pp. 560–564.
[33] K. Saeed, M. Tabedzki, M. Rybnik, M. Adamski, K3M: A universal algorithm for image skeletonization and a review of thinning techniques, Int. J. Appl. Math. Comput. Sci. 20 (2) (2010) 317–335.
[34] K. Wu, E.J. Otoo, K. Suzuki, Optimizing two-pass connected-component labeling algorithms, Pattern Anal. Appl. 12 (2) (2009) 117–135.
[35] M. Pechwitz, S.S. Maddouri, V. Märgner, N. Ellouze, H. Amiri, et al., IFN/ENIT-database of handwritten Arabic words, in: Proc. CIFED, Vol. 2, Citeseer, 2002, pp. 127–136.
[36] D.B. Nurseitov, K. Bostanbekov, D. Kurmankhojayev, A. Alimova, A. Abdallah, R. Tolegenov, Handwritten Kazakh and Russian (HKR) database for text recognition, Multim. Tools Appl. 80 (21) (2021) 33075–33097.
[37] U. Marti, H. Bunke, The IAM-database: an English sentence database for offline handwriting recognition, Int. J. Document Anal. Recognit. 5 (1) (2002) 39–46.
[38] A. Fischer, V. Frinken, A. Fornés, H. Bunke, Transcription alignment of Latin manuscripts using hidden Markov models, in: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, ACM, 2011, pp. 29–36.
[39] E. Augustin, M. Carré, E. Grosicki, J.-M. Brodin, E. Geoffrois, F. Prêteux, RIMES evaluation campaign for handwritten mail processing, in: International Workshop on Frontiers in Handwriting Recognition, IWFHR'06, 2006, pp. 231–235.
[40] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, J. Mach. Learn. Res. 13 (2012) 281–305.
[41] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2) (2002) 182–197.
[42] N. Hansen, The CMA evolution strategy: a comparing review, in: Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms, Springer, 2006, pp. 75–102.

Wei Chen received his B.E. degree in computer science and B.E. degree in information engineering from the University of Emergency Management (China) in 2020. He is currently pursuing a Ph.D. in computer science at Inner Mongolia University (China). His research focuses on OCR, few-shot learning, and data augmentation.

Xiangdong Su received his B.E. degree and Ph.D. degree from Inner Mongolia University (China) in 2007 and 2016, respectively. He is currently an associate professor at the College of Computer Science, Inner Mongolia University. He mainly focuses on OCR, medical visual question answering, and knowledge graphs. He has authored more than 20 papers in peer-reviewed journals and international conferences since 2019.

Hongxu Hou received his B.E. degree and Master's degree from Inner Mongolia University (China) in 1993 and 2000, respectively. He received his Ph.D. degree from the University of Chinese Academy of Sciences in 2008. At present, he is a professor and doctoral supervisor in the College of Computer Science of Inner Mongolia University. His research interests include natural language processing, information retrieval, and machine translation.

