Handwriting Augmentation
Keywords: Data automatic augmentation; Handwritten character recognition; Bézier curve; Bayesian optimization

Abstract: With the advancement of deep learning-based character recognition models, the training data size has become a crucial factor in improving the performance of handwritten text recognition. For languages with low-resource handwriting samples, data augmentation methods can effectively scale up the data size and improve the performance of handwriting recognition models. However, existing data augmentation methods for handwritten text face two limitations: (1) methods based on global spatial transformations typically augment the training data by transforming each word sample as a whole but ignore the potential to generate fine-grained transformations from local word areas, limiting the diversity of the generated samples; (2) it is challenging to adaptively choose a reasonable augmentation parameter when applying these methods to different language datasets. To address these issues, this paper proposes Fine-grained Automatic Augmentation (FgAA) for handwritten character recognition. Specifically, FgAA views each word sample as composed of multiple strokes and achieves data augmentation by performing fine-grained transformations on the strokes. Each word is automatically segmented into various strokes, and each stroke is fitted with a Bézier curve. On such a basis, we define the augmentation policy related to the fine-grained transformation and use Bayesian optimization to select the optimal augmentation policy automatically, thereby achieving the automatic augmentation of handwriting samples. Experiments on seven handwriting datasets of different languages demonstrate that FgAA achieves the best augmentation effect for handwritten character recognition. Our code is available at https://github.com/IMU-MachineLearningSXD/Fine-grained-Automatic-Augmentation
∗ Corresponding author.
E-mail addresses: chenw@mail.imu.edu.cn (W. Chen), cssxd@imu.edu.cn (X. Su), cshhx@imu.edu.cn (H. Hou).
https://doi.org/10.1016/j.patcog.2024.111079
Fig. 1. Augmented samples of the handwritten word ‘my’. The augmentation operations used are all from the AutoAugment policy library [11].
handwritten text. Specifically, FgAA automatically segments each word sample into multiple strokes, each of which is fitted with Bézier curves. By manipulating the control points of these curves, all strokes are reconstructed and reassembled to create new handwritten samples. The process above allows us to generate fine-grained variations for each word stroke rather than employing a coarse global processing approach. Building upon this foundation, we introduce an iterative optimization framework to autonomously learn the optimal policy, which is utilized to constrain the movement radius of control points. In each iteration, FgAA samples a policy from the policy library based on historical exploration results. Subsequently, FgAA generates training data using the sampled policy. The recognition network utilizes this data for training and outputs the recognition loss, which is used to update the policy with the Bayesian-based policy optimizer. After training, FgAA automatically and adaptively learns the optimal augmentation policy for specific datasets without any manual intervention. Experiments conducted on seven handwriting datasets of different languages demonstrate that FgAA achieves the best augmentation effects for handwritten character recognition.

There are four main differences between our proposed FgAA and AA. The most significant difference lies in the augmentation methods within the policy library. As shown in Fig. 1, AA directly incorporates commonly used image augmentation methods, which mainly rely on global processing and color space transformations. In contrast, FgAA leverages the free transformation properties of Bézier curves for fine-grained augmentation of text strokes, enabling augmented samples to fit a broader range of handwriting styles. The second difference is the size of the search space. In the policy library of AA, each policy is a triplet comprising (1) one of 16 augmentation operations, (2) the application probability, and (3) the magnitude of the operation. The complexity of this search space significantly hampers the convergence speed of AA. In contrast, the policy library of FgAA comprises only one text augmentation, and its sampling policy includes solely the magnitude of the operation, excluding other factors. Consequently, FgAA is relatively lightweight as an automated search framework. Moreover, instead of using the reinforcement learning employed by AA as the search algorithm, we employ Bayesian optimization to search for the optimal augmentation policy with fewer iterations and less time. Finally, most existing work [15–17], including AA, primarily focuses on image classification. Our proposed framework fills the gap in handwriting recognition in the automated augmentation paradigm.

In summary, the key contributions of this paper are as follows:

• We propose Fine-grained Automatic Augmentation (FgAA) for handwritten character recognition. FgAA can automatically learn an optimal augmentation policy tailored to different handwritten datasets and generate augmented samples with richer diversity for handwritten text, which benefits handwritten character recognition.
• We propose to employ Bézier curves to mimic diversified handwriting styles in handwritten text augmentation, introducing fine-grained variations in the strokes of the text. This approach fundamentally differs from traditional spatial transformation-based augmentations, such as rotation and scaling.
• We employ Bayesian optimization to search for the optimal augmentation policy in handwritten text augmentation, reducing the number of iterations of the model search. Simultaneously, the simplified search space accelerates the convergence of FgAA.
• Extensive experiments conducted on seven different handwritten datasets demonstrate that FgAA significantly improves the performance of text recognizers compared to the baselines.

The rest of the paper is organized as follows. In Section 2, we review the three categories of handwritten text augmentation techniques proposed in recent years. In Section 3, we introduce the details of the proposed approach. In Section 4, we present the experimental setup. In Section 5, we conduct extensive experiments to demonstrate the effectiveness of FgAA and show some augmented samples generated by baselines and FgAA. Finally, in Section 6, we conclude the paper.

2. Related work

2.1. Basic methods and automated augmentations

Researchers have proposed numerous methods for data augmentation [7]. Basic methods primarily rely on matrix and spatial transformations to generate images. The most widely used is the Affine transformation [8], which involves a combination of rotations, scaling, translation, and shearing to globally transform images. There are also methods for fine-grained transformations, such as TPS [9], which alters the image with part-specific transformations. Additionally, SlA [10] generates handwritten samples by curve fitting and reassembling strokes, which is the basis of our proposed approach. While effective, it lacks the ability to automatically determine the optimal manipulation magnitude, requiring extensive ablation experiments. Although these basic methods are effective, it is time-consuming to determine an appropriate approach and operation magnitude for a particular dataset. Automated augmentation offers a solution to this problem. AA [11] is a pioneering effort in automated augmentation, utilizing iterative reinforcement learning [12] to train search networks. However, its main drawback is that it takes thousands of GPU hours to converge and obtain the augmentation strategy. To address this, Fast AutoAugment (Fast AA) [15] employs a density-matching strategy and Bayesian optimization to exponentially reduce the search time. RandAugment [16] proposes to preserve the randomness of the augmentation method and reduce the search space. TrivialAugment (TA) [17] is a simple, parameter-free auto-augment baseline, applying a single augmentation to each image. While automated augmentations exhibit exciting performance, their policy libraries are characterized by large parameter quantities and fail to meet the specific augmentation requirements for handwritten images. In contrast, our policy library has few parameters and is specifically designed for handwritten text.
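As a concrete reference point for the global methods discussed above, a minimal affine baseline can be sketched with torchvision (the library behind the Affine baseline in Section 4.2). The parameter ranges below are illustrative assumptions, not settings from any cited work, and the input path is hypothetical.

```python
# Illustrative only: a global affine augmentation baseline of the kind
# described above. Parameter ranges are arbitrary examples.
from PIL import Image
from torchvision import transforms

affine_aug = transforms.RandomAffine(
    degrees=5,               # small random rotation
    translate=(0.05, 0.05),  # up to 5% shift in x and y
    scale=(0.9, 1.1),        # mild random scaling
    shear=5,                 # random shear in degrees
    fill=255,                # pad with white background
)

word_image = Image.open("word.png").convert("L")  # hypothetical input path
augmented = affine_aug(word_image)  # transforms the word as a whole
```

Note that such a transform acts on the whole word image at once; it cannot vary individual strokes, which is precisely the limitation FgAA targets.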
2.2. Deep generative models

Samples generated by existing methods are constrained by the rules of the method itself. Therefore, researchers turned to deep generative models [6] for data augmentation. ScrabbleGAN [18] utilizes semi-supervised training to synthesize handwritten text images, alleviating the burden of labeling data. Learn to Augment (L2A) [19] is the first automated augmentation paradigm for handwriting recognition, using joint adversarial training. Alonso et al. [20] propose a network architecture based on GAN and bidirectional LSTMs to generate realistic images of French and Arabic. GANwriting [21] produces reliable handwritten word images by conditioning the generative process with calligraphic style features and textual content. HiGAN [22] generates variable-length handwritten text based on arbitrary textual content, not bound by a predefined corpus or extra-lexical words. WordStylist [23] proposes a latent diffusion-based method and uses class index styles and text content prompts to generate realistic word image samples for different writer styles. However, these methods often require a large amount of data to train the generator. To address the issue of few-shot handwritten text generation, HWT [24] proposes to use a transformer-based network for handwritten text generation. Moreover, VATr [25] generates samples by employing a novel text content representation and extracting styles from pre-training on a large synthetic dataset. Additionally, to generate samples of unseen character types, GC-DDPM [26] utilizes a denoising diffusion probabilistic model on Chinese datasets to convert font-generated Chinese character images into handwritten ones. Ding et al. [27] propose a progressive data filtering strategy to alleviate the mismatch between the text content and the glyph conditional images synthesized by GC-DDPM. Zdenek et al. [28] use a style encoder based on a vision transformer network that encodes handwriting style from reference images and allows the generator to imitate it.

3. Methodology

To generate more diverse handwritten samples through fine-grained transformation and automatically select appropriate augmentation parameters, we design the Fine-grained Automatic Augmentation (FgAA) paradigm, which generates augmented samples through Bézier curves and automatically selects the optimal augmentation policy through Bayesian optimization. Section 3.1 briefly outlines the overall framework of FgAA. Section 3.2 introduces the Bézier curve-based text augmentation method employed in FgAA. Finally, Section 3.3 provides a detailed explanation of how the policy optimizer searches for the optimal policy throughout the entire search process.

3.1. Overall framework

The proposed FgAA belongs to the Sequential Model-Based Optimization (SMBO) [29] framework. As illustrated in Fig. 2, FgAA is primarily composed of a policy library and four operations: sampling policy, text augmentation, fine-tuning recognizer, and updating policy optimizer. The number of parameters for each policy in the policy library depends solely on the number $N$ of augmentations applied to the training data, as shown in Eq. (1).

$$\mathcal{P} = \{r_1, \dots, r_i, \dots, r_N\}, \quad r_i \in [0, 1] \tag{1}$$

The parameter $r$ is used to control the movement radius of the control points in text augmentation. The policy sampled in each iteration is employed for text augmentation. We utilize Bézier curves to mimic various handwriting styles at the stroke level in text augmentation; the details will be elaborated in Section 3.2. The text recognizer, which is trained on the augmented samples, feeds back the recognition performance on the validation set to the policy optimizer. The Bayesian-based policy optimizer is used to adaptively update the optimal policy from the policy library according to the historical performance of the recognizer.

FgAA aims to find the optimal policy within a limited number of iterations $T$, which can be used to generate diverse handwriting samples and improve the performance of the recognizer. At iteration $t$, the policy optimizer calculates the maximum expected improvement (EI) based on the historical exploration results to select the policy $\mathcal{P}_t$ from the policy library that is most expected to be explored. Then, the text augmentation employs the policy $\mathcal{P}_t$ to augment the training set, thereby obtaining the augmented data. Subsequently, the augmented data is combined with the original training set and fed into the recognizer for fine-tuning. We utilize CRNN [30] as the text recognizer and pre-train it with the original training set. Finally, the policy optimizer updates the sampling policy $\mathcal{P}_{t+1}$ for the next iteration based on the historical recognition loss of the recognizer on the validation set. The specific details of the policy optimization process will be described in Section 3.3. After $T$ iterations of search, FgAA can provide an optimal augmentation policy tailored to the specific handwritten dataset.

3.2. Text augmentation

Since the variety of handwriting styles and the scarcity of training data are the performance bottlenecks of character recognizers, we aim to generate more diverse training samples to train the recognition models. Given the strong controllability and high flexibility of Bézier curves, we propose to segment the word samples into several Bézier curves and generate more diverse samples by moving the control points of the Bézier curves, thus improving the performance of recognizers. The controllability and flexibility of Bézier curves have also been proved in HTR [10], text detection [14] and lane detection [31]. Our proposed text augmentation algorithm consists of three main parts: stroke segmentation, control field computation, and transformation. In Fig. 2, we provide the pipeline and one example of the text augmentation and summarize the overall process in Algorithm 1.

Firstly, we segment handwritten text into multiple strokes based on its skeleton and connected domains, with each stroke fitted as a quadratic Bézier curve. Subsequently, we compute the control points and control field for each curve. Afterward, we allow each control point to move randomly within its control field. Finally, we reconstruct the Bézier curves based on the new control point positions and concatenate these curves to obtain a new handwritten sample. To assess the efficiency of our text augmentation, we conduct tests on the CVL [32] public dataset using an Intel(R) Xeon(R) Gold 5218R CPU; the average time required to augment one sample is only 3.2 ms. The following sections provide a detailed explanation of the three parts of our text augmentation.

3.2.1. Stroke segmentation

To apply Bézier curve deformation to handwritten text, we first need to segment the handwritten text into multiple strokes. In the initial step, the handwritten text image $I$ undergoes convolution with a $(3 \times 3)$ blur kernel to smooth out the sharp edges of the text. Subsequently, we apply the image thinning algorithm [33] to the blurred image $I_b$. This step iteratively erodes the handwritten text until the skeleton $I_{ske}$ is obtained. Next, we calculate the sum of the eight neighborhood pixels for each point $P_i$ on the skeleton. Points that conform to Eq. (2) are defined as branch points.

$$\sum_{i=1}^{n=8} P_i \ge 4 \tag{2}$$

Now, we obtain a collection $S_b$ of branch points from the skeleton, which serve as the segmentation points for stroke segmentation. Next, we remove the branch points $S_b$ from the skeleton, so that the skeleton $I_{ske}$ is no longer a single connected domain. Subsequently, we employ the Two-Pass [34] algorithm to label the different strokes in the skeleton $I'_{ske}$. This algorithm identifies and labels all the connected domains in the image during two scanning passes. Pixels belonging to the same connected domain are assigned the same label. We adopt the 8-neighborhood pattern of the Two-Pass algorithm.
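For illustration, the branch-point test of Eq. (2) can be sketched as a small neighborhood convolution. This is our reading of the equation, assuming a binary 0/1 skeleton and an 8-neighbor sum that excludes the center pixel; the paper's exact pixel convention is not spelled out in this excerpt.

```python
# A minimal sketch of the branch-point test in Eq. (2), assuming the
# skeleton is a binary (0/1) numpy array. The neighborhood convention
# (center excluded) is our assumption.
import numpy as np
from scipy.ndimage import convolve, label

def branch_points(skeleton: np.ndarray) -> np.ndarray:
    """Boolean mask of skeleton points whose 8-neighbor sum meets
    the Eq. (2) threshold (>= 4)."""
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],   # center excluded: sum over 8 neighbors
                       [1, 1, 1]])
    neighbor_sum = convolve(skeleton.astype(np.uint8), kernel, mode="constant")
    return (skeleton == 1) & (neighbor_sum >= 4)

def segment_strokes(skeleton: np.ndarray):
    """Delete branch points, then label the remaining 8-connected
    components, one per stroke (a two-pass-style labeling)."""
    pruned = skeleton.copy()
    pruned[branch_points(skeleton)] = 0
    segments, n_strokes = label(pruned, structure=np.ones((3, 3)))
    return segments, n_strokes
```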
Fig. 2. Overview of the proposed framework. In each iteration, the policy optimizer samples a policy based on the historical search results. Subsequently, text augmentation is performed on the training data. The recognizer then undergoes fine-tuning on the augmented data and the training set. Finally, the policy optimizer updates the sampled policy for the next iteration based on the historical recognition loss.
Through the algorithm above, we obtain a pixel subset for each stroke, with each subset containing all pixels of a segmented stroke. These subsets form a set $S_{ske}$. As shown in the text segmentation section of Fig. 2, the handwritten word 'my' is divided into 6 strokes, corresponding to 6 subsets, and each stroke is represented with a different color.

Algorithm 1 Text Augmentation
Input: Handwritten image: I; Sampling policy: P = {r_1, ..., r_N}; Augment times: N;
Output: Augmented sample set S_aug
1:  I_b ← Blur I;                                        ⊳ Step 1: Stroke segmentation
2:  I_ske ← Thinning I_b;
3:  S_b ← Calculate branch point set;                    ⊳ Eq. (2)
4:  I'_ske ← Delete S_b in I_ske;
5:  S_ske ← Two-Pass I'_ske;
6:  for i = 1 to len(S_ske) do                           ⊳ Step 2: Control field computation
7:    P_b1, P_b2 ← Calculate two-end control points;     ⊳ Eq. (3)
8:    P_f ← Calculate the point furthest from P_b1 and P_b2 on S_ske_i;  ⊳ Eq. (4)
9:    P_b3 ← Calculate the third control point;          ⊳ Eq. (5)
10:   F_b1, F_b2, F_b3 ← Calculate control field;        ⊳ Eq. (6)–(8)
11: end for
12: S_field ← Get the control field set;
13: for i = 1 to N do                                    ⊳ Step 3: Transformation
14:   for j = 1 to len(S_field) do
15:     P'_b1, P'_b2, P'_b3 ← Move control points within control fields;
16:     P_t ← Calculate the points on the new Bézier curve;  ⊳ Eq. (9)
17:   end for
18:   I_aug_i ← Assemble curves and generate an augmented sample;
19: end for
20: S_aug ← Get augmented sample set {I_aug_1, ..., I_aug_N};

3.2.2. Control field computation

It is necessary to obtain the positions of the control points and define their moving regions when using Bézier curves to mimic the strokes and generate various handwritten characters. If the movement of the control points is unconstrained, a grotesquely shaped curve will be generated, ultimately affecting the quality and correctness of the augmented sample. Therefore, it is essential to define an appropriate control field. The shape of a quadratic Bézier curve is controlled by three control points: $P_{b1}$, $P_{b2}$, and $P_{b3}$. $P_{b1}$ and $P_{b2}$ are endpoints, and $P_{b3}$ is the third control point.

As for $P_{b1}$ and $P_{b2}$, we compute the sum of the eight neighborhood pixels of each point in the subset in turn, and the points that conform to Eq. (3) are the curve's endpoints.

$$\sum_{i=1}^{n=8} P_i = 3 \tag{3}$$

Then, we calculate the third control point $P_{b3}$ of the curve based on the obtained endpoints. According to the definition of the Bézier curve, the position of $P_{b3}$ is related to the positions of $P_{b1}$, $P_{b2}$, and $P_f$, the farthest point of the curve. We calculate $P_f$ on the skeleton according to Eq. (4), the point-to-line distance formula for the line connecting the two endpoints,

$$P_f(x_f, y_f) = \arg\max_{x_3, y_3} \frac{\left|(x_2 - x_1)\, y_3 + (y_1 - y_2)\, x_3 + x_1 y_2 - x_2 y_1\right|}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}} \tag{4}$$

where $(x_1, y_1)$ and $(x_2, y_2)$ represent the coordinates of $P_{b1}$ and $P_{b2}$ respectively, and $(x_3, y_3)$ represents the coordinates of the remaining points in the stroke. After that, the third control point $P_{b3}$ is calculated according to Eq. (5).

$$P_{b3}(x, y) = 2 P_f(x_f, y_f) - \frac{1}{2}\left(P_{b1}(x_1, y_1) + P_{b2}(x_2, y_2)\right) \tag{5}$$

The three control points $P_{b1}$, $P_{b2}$, and $P_{b3}$ of the quadratic Bézier curve are obtained through the above process. Moreover, the control field of the control points is calculated below. Each control field is set to a circle centered on the control point. First of all, for the control fields $F_{b1}$ and $F_{b2}$ of $P_{b1}$ and $P_{b2}$, in order to introduce as few erroneous samples as possible, we stipulate that the movement of an endpoint should not cross the position of other endpoints. Therefore, we define the distance between $P_{b1}$, $P_{b2}$ and their nearest control points as the maximum radius of $F_{b1}$, $F_{b2}$. $F_{b1}$ and $F_{b2}$ are calculated as shown in Eqs. (6) and (7).

$$F_{b1} = \min_{x_c, y_c}\left(\sqrt{(x_1 - x_c)^2 + (y_1 - y_c)^2}\right) \tag{6}$$

$$F_{b2} = \min_{x_c, y_c}\left(\sqrt{(x_2 - x_c)^2 + (y_2 - y_c)^2}\right) \tag{7}$$

Here $(x_c, y_c)$ represents the coordinates of the other endpoints. Next, we define $F_{b3}$ of the third control point $P_{b3}$, and we restrict the maximum radius of $F_{b3}$ to not exceed the distance from $P_{b3}$ to the line connecting $P_{b1}$ and $P_{b2}$. The radius of $F_{b3}$ is calculated in Eq. (8),

$$F_{b3} = \frac{\left|(x_2 - x_1)\, y + (y_1 - y_2)\, x + x_1 y_2 - x_2 y_1\right|}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}} \tag{8}$$

where $(x, y)$ represents the coordinates of $P_{b3}$. After traversing all strokes of $S_{ske}$, we get the control fields of all strokes.
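The transformation step (Step 3 of Algorithm 1) can be sketched as follows. Eq. (9) itself falls outside this excerpt, so the sketch assumes the textbook quadratic Bézier parameterization with $P_{b1}$, $P_{b2}$ as endpoints and $P_{b3}$ as the middle control point, scaling each control field radius by the policy parameter $r$ of Eq. (1).

```python
# A sketch of the transformation step: jitter each control point inside
# its control field and re-evaluate the quadratic Bezier. The curve
# formula is the textbook quadratic Bezier; Eq. (9) is not reproduced
# in this excerpt, so its exact form is our assumption.
import numpy as np

def jitter(point, radius, r, rng):
    """Move a control point uniformly within a disc of radius r * radius."""
    angle = rng.uniform(0.0, 2.0 * np.pi)
    rho = rng.uniform(0.0, r * radius)
    return point + rho * np.array([np.cos(angle), np.sin(angle)])

def transform_stroke(p_b1, p_b2, p_b3, f_b1, f_b2, f_b3, r, rng, n=50):
    """Return n points on the deformed quadratic Bezier of one stroke."""
    p1 = jitter(np.asarray(p_b1, float), f_b1, r, rng)
    p2 = jitter(np.asarray(p_b2, float), f_b2, r, rng)
    p3 = jitter(np.asarray(p_b3, float), f_b3, r, rng)
    t = np.linspace(0.0, 1.0, n)[:, None]
    # Quadratic Bezier: endpoints p1, p2; middle control point p3.
    return (1 - t) ** 2 * p1 + 2 * t * (1 - t) * p3 + t ** 2 * p2

rng = np.random.default_rng(0)
curve = transform_stroke((0, 0), (20, 4), (10, 12), 5.0, 5.0, 6.0, r=0.5, rng=rng)
```

Rendering the returned point sets of all strokes onto a blank canvas then yields one augmented word sample.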
distribution $p(\mathcal{L} \mid \mathcal{P})$. From this, we can derive the formula for $EI(\mathcal{P})$, as illustrated in Eq. (15).

$$
\begin{aligned}
EI(\mathcal{P}) &= \int_{-\infty}^{\mathcal{L}^*} (\mathcal{L}^* - \mathcal{L})\, p_M(\mathcal{L} \mid \mathcal{P})\, d\mathcal{L} \\
&= \int_{-\infty}^{\mathcal{L}^*} (\mathcal{L}^* - \mathcal{L})\, \frac{p_M(\mathcal{P} \mid \mathcal{L})\, p(\mathcal{L})}{p(\mathcal{P})}\, d\mathcal{L} \\
&= \frac{l(\mathcal{P})}{p(\mathcal{P})} \int_{-\infty}^{\mathcal{L}^*} (\mathcal{L}^* - \mathcal{L})\, p(\mathcal{L})\, d\mathcal{L} \\
&= \left[\gamma \mathcal{L}^* - \int_{-\infty}^{\mathcal{L}^*} \mathcal{L}\, p(\mathcal{L})\, d\mathcal{L}\right] \frac{l(\mathcal{P})}{p(\mathcal{P})} \\
&= \left[\gamma \mathcal{L}^* - \int_{-\infty}^{\mathcal{L}^*} \mathcal{L}\, p(\mathcal{L})\, d\mathcal{L}\right] \frac{l(\mathcal{P})}{\gamma\, l(\mathcal{P}) + (1 - \gamma)\, g(\mathcal{P})}
\propto \left(\gamma + \frac{g(\mathcal{P})}{l(\mathcal{P})}(1 - \gamma)\right)^{-1}
\end{aligned}
\tag{15}
$$

where $\gamma = p(\mathcal{L} < \mathcal{L}^*)$ and $p(\mathcal{P}) = \gamma\, l(\mathcal{P}) + (1 - \gamma)\, g(\mathcal{P})$.

Therefore, TPE maximizes $EI(\mathcal{P})$ by maximizing $l(\mathcal{P})/g(\mathcal{P})$. The ideal policy is characterized by a high probability under $l(\mathcal{P})$ and a low probability under $g(\mathcal{P})$. In each iteration, TPE updates the probability model $M_t$ of the objective function based on historical observations and returns the policy with the maximum $EI(\mathcal{P})$.

4. Experimental settings

4.1. Datasets

To comprehensively validate the performance and generalization of FgAA, we conduct experiments on seven handwritten datasets. Table 1 presents information about the datasets used in the experiments. The CVL [32] is a public handwritten database for writer retrieval, writer identification and word spotting. It is produced by 311 different writers and consists of 7 different handwritten texts (1 German and 6 English texts). The IFN/ENIT [35] is a handwritten Arabic dataset, which contains the names of 26,459 cities and villages. A total of 411 authors contributed to the production of this dataset. The HKR [36] is the first version of a database that contains Russian and Kazakh words for offline handwriting recognition. It is produced by approximately 200 different writers. The Mongolian is a handwritten Mongolian dataset from Inner Mongolia University. The way words are built in Mongolian makes them particularly difficult to recognize. The IAM [37] is an English handwritten dataset, which is based on the Lancaster-Oslo/Bergen (LOB) corpus. It includes 1066 forms produced by approximately 400 different writers. The Saint Gall [38] dataset contains 60 pages of a handwritten historical manuscript written in Latin by a single author at the end of the 9th century. The RIMES [39] database consists of 12,723 pages of scanned letters written in French by 1300 different authors.

Table 1
The information of the experimental datasets.

Dataset     Languages        Level  Authors  Charset  # Train/Val/Test
CVL         English/German   Word   Many     52       12.1k/5.5k/80.3k
IFN/ENIT    Arabic           Word   Many     297      6.5k/6.0k/6.7k
HKR         Russian/Kazakh   Word   Many     84       6.2k/3.5k/55.2k
Mongolian   Mongolian        Word   Many     79       10.0k/5.0k/10.0k
IAM         English          Word   Many     79       53.8k/16.5k/17.6k
IAM         English          Line   Many     79       6.5k/1.0k/2.9k
Saint Gall  Latin            Line   One      49       468/235/707
RIMES       French           Line   Many     95       10.2k/1.1k/0.8k

4.2. Baselines

To compare FgAA with previous augmentation methods, we select nine representative baselines of data augmentation methods. We use the official implementations and weights released by the authors and evaluate all the models in the same setups. Affine transformation¹ [8] uses rotation, translation, scaling, and shearing to generate new samples. It is one of the most widely used data augmentation methods and is able to generate a rich diversity of samples. SlA² [10] is a stroke-level augmentation approach, which is the basis of our proposed FgAA. TPS³ [9] is a fine-grained image warping method, which alters the image with part-specific transformations. VATr⁴ [25] is a transformer-based network, which generates samples by employing a novel text content representation and extracting styles from pre-training on a large synthetic dataset. ScrabbleGAN⁵ [18] is a semi-supervised generative adversarial network for generating handwritten text of arbitrary length. It adopts a semi-supervised approach to train the model, which helps reduce the amount of labeled data required for the task. L2A⁶ [19] is based on similarity transformations. It employs an agent network to learn from the output of the recognition model and manipulate the fiducial points to generate more adversarial augmentation parameters, rather than directly generating handwritten images. AA⁷ [11] uses reinforcement learning to search augmentation policies. These policies consist of an image processing operation such as rotation or coloring, and the probability and magnitude with which the operation is applied. Fast AA⁸ [15] uses a density-matching strategy to reduce the training cost of AA and improve the generalization performance of a given network. TA⁹ [17] applies an augmentation operation in a simple and random way, with an operation amplitude randomly sampled from {0, …, 30}. It does not need to search for any parameters.

1. https://pytorch.org/vision/stable/index.html
2. https://github.com/IMU-MachineLearningSXD/script-level_aug_ICFHR2022
3. https://docs.opencv.org/3.4/dc/d18/classcv_1_1ThinPlateSplineShapeTransformer.html
4. https://github.com/aimagelab/VATr
5. https://github.com/amzn/convolutional-handwriting-gan
6. https://github.com/Canjie-Luo/Text-Image-Augmentation
7. https://github.com/DeepVoltaire/AutoAugment
8. https://github.com/kakaobrain/fast-autoaugment
9. https://github.com/automl/trivialaugment

4.3. Metrics and implementation details

This study utilizes the Word Accuracy Rate (WAR), Character Error Rate (CER), and Word Error Rate (WER) as evaluation metrics to validate the effects of different augmentation methods in the experiments. We use both the CRNN and VAN [2] models to experimentally validate the proposed method. For the VAN model, our data processing and model parameter settings strictly follow the original paper; moreover, we do not use the data augmentation module provided by the VAN model. Due to the size requirements of the CRNN input layer, all images are rescaled to 160 × 32. If the aspect ratio of the original handwritten image is greater than 160/32, we first zero-pad the top and bottom of the image and then resize it to 160 × 32. If the aspect ratio is less than 160/32, we first zero-pad the left and right of the image and then resize it to 160 × 32. Besides, to ensure the fairness of comparison, the learning rate and batch size of the CRNN are uniformly set to 1e−4 and 32. In each search iteration, the pre-trained CRNN is fine-tuned for 15 epochs on the augmented data before being used for validation. The pre-training data is the original training set. All the baselines are implemented according to their original papers and the open-source code. Moreover, our framework is based on Optuna,¹⁰ an open-source hyperparameter optimization framework that allows us to easily switch between search algorithms. Due to the low search cost of the proposed FgAA, we do not adopt distributed training in the experiments, and we conduct all the experiments on a single NVIDIA Tesla P100.

10. https://optuna.org/
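Since the framework is built on Optuna with a TPE sampler, the search loop described in Sections 3.1 and 4.3 can be sketched as below. `augment_training_set` and `finetune_and_validate` are hypothetical stand-ins for the Section 3.2 text augmentation and the 15-epoch CRNN fine-tuning; only the Optuna calls are real API, and the trial budget and N = 10 are illustrative.

```python
# A sketch of the FgAA search loop on top of Optuna's TPE sampler.
import optuna

N = 10  # augmentation passes per sample; illustrative value

def augment_training_set(policy):
    # Stand-in for the stroke-level text augmentation of Section 3.2.
    return policy

def finetune_and_validate(augmented, epochs=15):
    # Stand-in: fine-tune the pre-trained CRNN and return the
    # validation recognition loss. Dummy value keeps the sketch runnable.
    return float(sum(augmented))

def objective(trial: optuna.Trial) -> float:
    # Policy P = {r_1, ..., r_N}, r_i in [0, 1]  (Eq. (1))
    policy = [trial.suggest_float(f"r_{i}", 0.0, 1.0) for i in range(N)]
    augmented = augment_training_set(policy)
    return finetune_and_validate(augmented, epochs=15)

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(seed=0),
)
study.optimize(objective, n_trials=100)  # iteration budget T
best_policy = study.best_params
```

Swapping `TPESampler` for another Optuna sampler is how alternative optimization algorithms can be compared within the same framework.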
Table 2
Word-level recognition results of VAN without augmentation.

Dataset     WER (%)↓       CER (%)↓
CVL         22.74 ± 0.45   9.16 ± 0.21
IFN/ENIT    36.97 ± 0.37   10.16 ± 0.26
HKR         33.07 ± 0.63   13.52 ± 0.31
Mongolian   55.96 ± 0.77   16.30 ± 0.34
IAM         15.36 ± 0.28   5.51 ± 0.15

Table 3
Word-level recognition results of FgAA compared to the baselines on the CVL dataset.

CVL — Augmentation times/# augmented samples: 10/133.6k and 15/194.4k

Table 4
Word-level recognition results of FgAA compared to the baselines on the IFN/ENIT and HKR datasets.

Augmentation times/# augmented samples: IFN/ENIT 10/71.9k; HKR 10/67.9k

                  IFN/ENIT                       HKR
Method            WER (%)↓       CER (%)↓       WER (%)↓       CER (%)↓
Affine [8]        28.41 ± 0.31   6.34 ± 0.11    17.56 ± 0.21   8.36 ± 0.12
SlA [10]          24.99 ± 0.13   5.94 ± 0.03    15.80 ± 0.29   7.67 ± 0.14
TPS [9]           30.07 ± 0.36   7.45 ± 0.14    29.05 ± 0.32   11.88 ± 0.17
ScrabbleGAN [18]  36.01 ± 0.30   8.56 ± 0.22    31.13 ± 0.18   12.75 ± 0.15
VATr [25]         34.62 ± 0.27   8.23 ± 0.19    30.64 ± 0.30   12.53 ± 0.23
L2A [19]          25.70 ± 0.26   6.11 ± 0.18    16.11 ± 0.16   7.92 ± 0.20
AA [11]           26.82 ± 0.16   6.13 ± 0.07    21.22 ± 0.42   9.49 ± 0.17
Fast AA [15]      25.91 ± 0.28   6.25 ± 0.24    21.83 ± 0.33   9.86 ± 0.21
TA [17]           27.54 ± 0.32   6.57 ± 0.18    20.59 ± 0.39   9.38 ± 0.31
FgAA (ours)       24.23 ± 0.15   5.76 ± 0.08    15.09 ± 0.24   7.42 ± 0.11
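The WER and CER figures in the tables above follow the standard edit-distance definitions; a minimal sketch is given below under that assumption (this is not the authors' evaluation script).

```python
# Standard edit-distance metrics, as commonly used for HTR evaluation.
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def cer(refs, hyps):
    """Character Error Rate: summed char edits / total reference chars."""
    edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)

def wer(refs, hyps):
    """Word Error Rate: summed word edits / total reference words."""
    edits = sum(levenshtein(r.split(), h.split()) for r, h in zip(refs, hyps))
    return edits / sum(len(r.split()) for r in refs)
```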
Table 8
Ablation experiments on FgAA iterations. Optimal epochs represent the iteration epoch in which the optimal policy is found within the maximum epochs.

CVL
Maximum epochs   25      50      75      100
Optimal epochs   20      41      74      74
Accuracy         62.69   62.75   62.83   62.83

IFN/ENIT
Maximum epochs   25      50      75      100
Optimal epochs   9       39      39      39
Accuracy         68.08   68.21   68.21   68.21

HKR
Maximum epochs   25      50      75      100
Optimal epochs   1       28      60      60
Accuracy         62.69   62.95   63.13   63.13

Mongolian
Maximum epochs   25      50      75      100
Optimal epochs   19      19      19      19
Accuracy         15.21   15.21   15.21   15.21
Fig. 3. (a) The average time consumed for augmenting one image of an original sample with different numbers of control points. (b) The proportion of samples with different
numbers of control points in the CVL word dataset.
Table 9
Line-level recognition results of FgAA compared to the baselines on single-author and large datasets.

Dataset     Method       CER (%)↓  WER (%)↓  # train  Train. time
Saint Gall  No aug       9.53      33.15     468      0.2d
            L2A [19]     7.80      26.69     2,808    0.9d
            SlA [10]     7.89      27.44     2,808    0.9d
            FgAA (ours)  7.68      26.30     2,808    0.9d
RIMES       No aug       3.70      11.03     10,188   2.4d
            L2A [19]     3.39      10.07     30,564   8.7d
            SlA [10]     3.36      9.92      30,564   8.7d
            FgAA (ours)  3.23      9.66      30,564   8.7d

Table 10
Comparison of runtime for FgAA and AA across five datasets, reported in days.

Method       CVL    IFN/ENIT  HKR    Mongolian  IAM    Average
AA [11]      41.8d  35.7d     37.4d  32.2d      39.1d  36.8d
FgAA (ours)  2.3d   1.1d      1.7d   0.7d       2.6d   1.7d

Additionally, to compare the runtime of the proposed FgAA with the baseline AA in searching for the optimal policies, we conduct experiments on five datasets for both FgAA and AA. For AA, we follow the experimental settings from the original paper. All the experiments are conducted on the NVIDIA Tesla P100. We report the runtime of these two methods in Table 10. AA requires 41.8, 35.7, 37.4, 32.2, and 39.1 days to search for the optimal policy on the CVL, IFN/ENIT, HKR, Mongolian, and IAM datasets, respectively. FgAA requires 2.3, 1.1, 1.7, 0.7, and 2.6 days for the same datasets. The average search times for AA and FgAA are 36.8 and 1.7 days, respectively. The longer search time for AA is due to its larger search space and the use of a reinforcement learning algorithm with a higher number of iterations. In contrast, FgAA only learns the range of control point movements, resulting in significantly fewer parameters. Additionally, we use the faster TPE algorithm for searching. The experimental results demonstrate that FgAA has a substantial advantage over AA in terms of runtime.

performance on three different training datasets of the same size at line-level and word-level, including IAM + synthetic data, IAM + real data, and IAM + our augmented data. The synthetic data is generated by the publicly available TRDG¹¹ model. The real data at line-level is from the training sets of the RIMES, ICFHR14,¹² and NorHand¹³ datasets. The real data at word-level is the CVL dataset. Our augmented data is generated from the IAM. All experiments are done on a single NVIDIA P100 GPU. We report the CER values on the IAM test set, the sizes of the above three training datasets, and the convergence time of each HTR model on these three datasets in Table 11.

From Table 11, we find that increasing the training dataset by adding one of the three datasets (synthetic data, real data, and augmented data) to the IAM training set brings improvement to HTR performance. The CER at the line level is reduced by 0.77%, 1.06%, and 0.90% for the three datasets, respectively. The CER at the word level is reduced by 0.76%, 1.39%, and 1.17% for the three datasets, respectively. The experimental results indicate that FgAA improves the recognizer more than synthetic data. This is because our augmentation method can generate more diverse augmented samples by using Bézier curves to fit handwriting strokes and the Bayesian optimization approach to automatically learn the operation magnitude, whereas there are limits to the diversity of the synthetic data.
Fig. 4. Augmented samples generated by our model for handwritten text lines.
Fig. 5. Augmented samples generated by our model on handwritten words in different languages.
6. Conclusion

This paper designs Fine-grained Automatic Augmentation (FgAA) to generate diverse, high-quality training samples for handwritten character recognition. FgAA employs Bézier curves to simulate handwritten script and introduces fine-grained deformations through control-point movement. It employs the TPE optimization algorithm to automatically adjust the range of control point movements and obtain the optimal augmentation policy, ensuring controlled and reasonable sample generation. FgAA completes the search process in an average of 1.7 days on a single NVIDIA Tesla P100, requiring no manual intervention. The experimental results on the CVL, IFN/ENIT, HKR, Mongolian and IAM datasets indicate that FgAA can generate samples that reflect a broader range of handwriting styles. FgAA outperforms various types of baselines, providing the largest improvement to the performance of HTR models. Additionally, the experimental results demonstrate that the TPE algorithm exhibits superior performance on FgAA compared to other optimization algorithms.

Although our method has achieved promising results on various benchmarks, the proposed FgAA still has some limitations. For instance, unclear or noisy handwritten text may affect the extraction of control points, thereby reducing the quality of augmented samples. In future research, we will first improve the image preprocessing steps. Additionally, although we have implemented an automatic search for the optimal operation magnitude of FgAA, we cannot dynamically evaluate the generated samples to select those that are more helpful in improving the performance of the HTR model. Through further exploration, we have found that the influence function can be used to measure the impact of samples on model parameters. Therefore, the influence function may help us explore the importance of different generated samples from the perspective of model interpretability. This is also our future research direction.

CRediT authorship contribution statement

Wei Chen: Methodology, Validation, Writing – original draft, Writing – review & editing. Xiangdong Su: Funding acquisition, Investigation, Methodology, Writing – review & editing. Hongxu Hou: Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was funded by the National Natural Science Foundation of China (Grant No. 62366036), the National Education Science Planning Project, China (Grant No. BIX230343), the Key R&D and Achievement Transformation Program of Inner Mongolia Autonomous Region, China (Grant No. 2022YFHH0077), the Central Government Fund for Promoting Local Scientific and Technological Development, China (Grant No. 2022ZY0198), the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region, China (Grant No. NJYT24033), the Inner Mongolia Autonomous Region Science and Technology Planning Project, China (Grant No. 2023YFSH0017), the Fund of Supporting the Reform and Development of Local Universities (Disciplinary Construction) and the Special Research Project of First-class Discipline of Inner Mongolia A. R. of China (Grant No. YLXKZX-ND-036).

Data availability

Data will be made available on request.

References

[1] V. Pippi, S. Cascianelli, C. Kermorvant, R. Cucchiara, How to choose pretrained handwriting recognition models for single writer fine-tuning, in: Proceedings of the ICDAR, 2023, Vol. 14188, Springer, 2023, pp. 330–347.
[2] D. Coquenet, C. Chatelain, T. Paquet, End-to-end handwritten paragraph text recognition using a vertical attention network, IEEE Trans. Pattern Anal. Mach. Intell. 45 (1) (2023) 508–524.
[3] A.K. Bhunia, A. Sain, P.N. Chowdhury, Y. Song, Text is text, no matter what: Unifying text recognition using knowledge distillation, in: Proceedings of the ICCV, 2021, pp. 963–972.
[4] L. Kang, P. Riba, M. Rusiñol, A. Fornés, M. Villegas, Content and style aware generation of text-line images for handwriting recognition, IEEE Trans. Pattern Anal. Mach. Intell. 44 (12) (2022) 8846–8860.
[5] M.A. Souibgui, A.F. Biten, S. Dey, A. Fornés, Y. Kessentini, L. Gómez, D. Karatzas, J. Lladós, One-shot compositional data generation for low resource handwritten text recognition, in: Proceedings of the WACV, 2022, IEEE, 2022, pp. 2563–2571.
[6] A.F.d. Neto, B.L.D. Bezerra, G.C.D. de Moura, A.H. Toselli, Data augmentation for offline handwritten text recognition: A systematic literature review, SN Comput. Sci. 5 (2) (2024) 258.
[7] M. Xu, S. Yoon, A. Fuentes, D.S. Park, A comprehensive survey of image augmentation techniques for deep learning, Pattern Recognit. 137 (2023) 109347.
[8] Q. Lin, C. Luo, L. Jin, S. Lai, STAN: a sequential transformation attention-based network for scene text recognition, Pattern Recognit. 111 (2021) 107692.
[9] V. Pippi, S. Cascianelli, L. Baraldi, R. Cucchiara, Evaluating synthetic pre-training for handwriting processing tasks, Pattern Recognit. 172 (2023) 44–50.
[10] W. Chen, X. Su, H. Zhang, Script-level word sample augmentation for few-shot handwritten text recognition, in: Proceedings of the ICFHR, 2022, Vol. 13639, Springer, 2022, pp. 316–330.
[11] E.D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, Q.V. Le, Autoaugment: Learning augmentation strategies from data, in: Proceedings of the CVPR, 2019, pp. 113–123.
[12] X. Du, H. Chen, C. Wang, Y. Xing, J. Yang, P.S. Yu, Y. Chang, L. He, Robust multi-agent reinforcement learning via Bayesian distributional value estimation, Pattern Recognit. 145 (2024) 109917.
[13] R. Atienza, Data augmentation for scene text recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1561–1570.
[14] Y. Liu, H. Chen, C. Shen, T. He, L. Jin, L. Wang, ABCNet: Real-time scene text spotting with adaptive bezier-curve network, in: Proceedings of the CVPR, 2020, Computer Vision Foundation / IEEE, 2020, pp. 9806–9815.
[15] S. Lim, I. Kim, T. Kim, C. Kim, S. Kim, Fast autoaugment, Adv. Neural Inf. Process. Syst. 32 (2019) 6662–6672.
[16] E.D. Cubuk, B. Zoph, J. Shlens, Q.V. Le, Randaugment: Practical automated data augmentation with a reduced search space, in: Proceedings of the CVPR Workshops, 2020, pp. 702–703.
[17] S.G. Müller, F. Hutter, Trivialaugment: Tuning-free yet state-of-the-art data augmentation, in: Proceedings of the ICCV, 2021, pp. 774–782.
[18] S. Fogel, H. Averbuch-Elor, S. Cohen, S. Mazor, R. Litman, Scrabblegan: Semi-supervised varying length handwritten text generation, in: Proceedings of the CVPR, 2020, pp. 4324–4333.
[19] C. Luo, Y. Zhu, L. Jin, Y. Wang, Learn to augment: Joint data augmentation and network optimization for text recognition, in: Proceedings of the CVPR, 2020, pp. 13746–13755.
[20] E. Alonso, B. Moysset, R. Messina, Adversarial generation of handwritten text images conditioned on sequences, in: Proceedings of the ICDAR, 2019, IEEE, 2019, pp. 481–486.
[21] L. Kang, P. Riba, Y. Wang, M. Rusiñol, A. Fornés, M. Villegas, GANwriting: Content-conditioned generation of styled handwritten word images, in: Proceedings of the ECCV, 2020, Vol. 12368, Springer, 2020, pp. 273–289.
[22] J. Gan, W. Wang, HiGAN: Handwriting imitation conditioned on arbitrary-length texts and disentangled styles, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 7484–7492.
[23] K. Nikolaidou, G. Retsinas, V. Christlein, M. Seuret, G. Sfikas, E.B. Smith, H. Mokayed, M. Liwicki, WordStylist: Styled verbatim handwritten text generation with latent diffusion models, in: Proceedings of the ICDAR, 2023, Vol. 14188, 2023, pp. 384–401.
[24] A.K. Bhunia, S.H. Khan, H. Cholakkal, R.M. Anwer, F.S. Khan, M. Shah, Handwriting transformers, in: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, IEEE, 2021, pp. 1066–1074.
[25] V. Pippi, S. Cascianelli, R. Cucchiara, Handwritten text generation from visual archetypes, in: Proceedings of the CVPR, 2023, IEEE, 2023, pp. 22458–22467.
[26] D. Gui, K. Chen, H. Ding, Q. Huo, Zero-shot generation of training data with denoising diffusion probabilistic model for handwritten Chinese character recognition, in: Proceedings of the ICDAR, 2023, Vol. 14188, 2023, pp. 348–365.
[27] H. Ding, B. Luan, D. Gui, K. Chen, Q. Huo, Improving handwritten OCR with training samples generated by glyph conditional denoising diffusion probabilistic model, in: Proceedings of the ICDAR, 2023, Vol. 14190, 2023, pp. 20–37.
[28] J. Zdenek, H. Nakayama, Handwritten text generation with character-specific encoding for style imitation, in: Proceedings of the ICDAR, 2023, Vol. 14188, Springer, 2023, pp. 313–329.
[29] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: Advances in Neural Information Processing Systems, 2011, pp. 2546–2554.
[30] B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE Trans. Pattern Anal. Mach. Intell. 39 (11) (2017) 2298–2304.
[31] Z. Feng, S. Guo, X. Tan, K. Xu, M. Wang, L. Ma, Rethinking efficient lane detection via curve modeling, in: Proceedings of the CVPR, 2022, IEEE, 2022, pp. 17041–17049.
[32] F. Kleber, S. Fiel, M. Diem, R. Sablatnig, Cvl-database: An off-line database for writer retrieval, writer identification and word spotting, in: Proceedings of the ICDAR, 2013, IEEE, 2013, pp. 560–564.
[33] K. Saeed, M. Tabedzki, M. Rybnik, M. Adamski, K3M: A universal algorithm for image skeletonization and a review of thinning techniques, Int. J. Appl. Math. Comput. Sci. 20 (2) (2010) 317–335.
[34] K. Wu, E.J. Otoo, K. Suzuki, Optimizing two-pass connected-component labeling algorithms, Pattern Anal. Appl. 12 (2) (2009) 117–135.
[35] M. Pechwitz, S.S. Maddouri, V. Märgner, N. Ellouze, H. Amiri, et al., IFN/ENIT-database of handwritten Arabic words, in: Proc. CIFED, 2, Citeseer, 2002, pp. 127–136.
[36] D.B. Nurseitov, K. Bostanbekov, D. Kurmankhojayev, A. Alimova, A. Abdallah, R. Tolegenov, Handwritten kazakh and Russian (HKR) database for text recognition, Multim. Tools Appl. 80 (21) (2021) 33075–33097.
[37] U. Marti, H. Bunke, The IAM-database: an english sentence database for offline handwriting recognition, Int. J. Document Anal. Recognit. 5 (1) (2002) 39–46.
[38] A. Fischer, V. Frinken, A. Fornés, H. Bunke, Transcription alignment of latin manuscripts using hidden Markov models, in: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, ACM, 2011, pp. 29–36.
[39] E. Augustin, M. Carré, E. Grosicki, J.-M. Brodin, E. Geoffrois, F. Prêteux, RIMES evaluation campaign for handwritten mail processing, in: International Workshop on Frontiers in Handwriting Recognition, IWFHR'06, 2006, pp. 231–235.
[40] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, J. Mach. Learn. Res. 13 (2012) 281–305.
[41] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2) (2002) 182–197.
[42] N. Hansen, The CMA evolution strategy: a comparing review, in: Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms, Springer, 2006, pp. 75–102.

Wei Chen received his B.E. degree in computer science and B.E. degree in information engineering from the University of Emergency Management (China) in 2020. He is currently pursuing a Ph.D. in computer science at Inner Mongolia University (China). His research focuses on OCR, few-shot learning and data augmentation.

Xiangdong Su received his B.E. degree and Ph.D. degree from Inner Mongolia University (China) in 2007 and 2016, respectively. He is currently an associate professor at the College of Computer Science, Inner Mongolia University. He mainly focuses on OCR, medical visual question answering and knowledge graphs. He has authored more than 20 papers in peer-reviewed journals and international conferences since 2019.

Hongxu Hou received his B.E. degree and Master's degree from Inner Mongolia University (China) in 1993 and 2000, respectively. He received his Ph.D. degree from the University of Chinese Academy of Sciences in 2008. At present, he is a professor and doctoral supervisor in the College of Computer Science of Inner Mongolia University. His research interests include natural language processing, information retrieval, and machine translation.