Efficient Image Colorization for Limited Data
B. Generator Architecture
   The generator in this model is designed to convert 32 × 32
grayscale images into 32 × 32 RGB images. It consists of
an encoder-decoder structure with skip connections similar to
the Pix2Pix framework. The generator network begins with
an input grayscale image and progressively downsamples the
image to a low-dimensional latent space and then upsamples
it back to the original resolution while adding the necessary
color information.

Fig. 2: Transformation from Original to Grayscale to Generated Color Image

   The architecture of the generator is as follows:
   Encoder:
   • Conv2D (64 filters, kernel size = 4, strides = 2) + LeakyReLU
   • Conv2D (128 filters, kernel size = 4, strides = 2) + Batch
     Normalization + LeakyReLU
   Bridge:
   • Conv2D (256 filters, kernel size = 4, strides = 1) + Batch
     Normalization + LeakyReLU
   Decoder:
   • Conv2DTranspose (128 filters, kernel size = 4, strides = 2) +
     Batch Normalization + ReLU
   • Concatenate with encoder output

   Given a grayscale input image, the generator outputs a
32 × 32 × 3 color image. During training, the generator loss
combines an adversarial loss with an L1 loss in order to drive
the generator to produce images close to the target color images:

   $$L_G = \mathbb{E}_{x,y}[\log D(x, G(x))] + \lambda \cdot \mathbb{E}_{x,y}[\lVert y - G(x) \rVert_1]$$

where D(x, G(x)) is the discriminator’s output for the
generated image, and λ = 100 controls the weight of the L1
loss. The discriminator loss is

   $$L_D = -\mathbb{E}_{x,y}[\log D(x, y)] - \mathbb{E}_{x,z}[\log(1 - D(x, G(x)))]$$

where D(x, y) is the discriminator’s output for a real image,
and D(x, G(x)) is its output for a generated image.

   Pixel values are normalized as

   $$x' = \frac{x}{127.5} - 1$$

where x represents the original pixel values of the image.
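As a rough numerical illustration of the generator and discriminator objectives given earlier in this section, the sketch below evaluates them for a single image pair in plain Python. It assumes the minimized (non-saturating) form of the adversarial term, −log D(x, G(x)), and averages the L1 difference over pixels, as many implementations do. Both choices, and all function names, are assumptions made for illustration and are not confirmed by the paper.

```python
import math

LAMBDA_L1 = 100.0  # weight of the L1 term, as stated in the text


def l1_distance(target, generated):
    """Mean absolute difference between target and generated pixel values.

    Averaging (rather than summing) is an assumption of this sketch.
    """
    return sum(abs(t - g) for t, g in zip(target, generated)) / len(target)


def generator_loss(d_fake, target, generated, lam=LAMBDA_L1, eps=1e-12):
    """Adversarial term plus weighted L1 term for one sample.

    d_fake is D(x, G(x)), a probability in (0, 1). The adversarial term is
    written as -log D(x, G(x)), which is an assumed sign convention.
    """
    adversarial = -math.log(d_fake + eps)
    return adversarial + lam * l1_distance(target, generated)


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """-log D(x, y) - log(1 - D(x, G(x))) for one real/generated pair."""
    return -math.log(d_real + eps) - math.log(1.0 - d_fake + eps)
```

For an undecided discriminator (D = 0.5 on both the real and the generated image), `discriminator_loss(0.5, 0.5)` evaluates to 2 ln 2 ≈ 1.386, and with λ = 100 the L1 term dominates the generator loss unless the generated pixels are already close to the target.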
   Each grayscale image is paired with its corresponding color
image from the dataset. This grayscale image is passed to the
generator, and the discriminator receives both grayscale
(input) and color (target) images.
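A minimal sketch of how one grayscale/color training pair could be prepared is shown below. It applies the x/127.5 − 1 normalization from the text to both images. The grayscale conversion uses the ITU-R BT.601 luma weights, which is an assumption: the paper does not state which conversion was used, and the helper names are hypothetical.

```python
def to_grayscale(rgb_pixel):
    """Convert one (r, g, b) pixel in 0..255 to a single luminance value.

    The BT.601 weights below are an assumption, not taken from the paper.
    """
    r, g, b = rgb_pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def normalize(value):
    """Map a pixel value from [0, 255] to [-1, 1], as in the text."""
    return value / 127.5 - 1.0


def make_pair(rgb_image):
    """Build a (grayscale input, color target) pair from one RGB image.

    rgb_image is a list of rows, each row a list of (r, g, b) tuples.
    """
    gray = [[normalize(to_grayscale(p)) for p in row] for row in rgb_image]
    color = [[tuple(normalize(c) for c in p) for p in row] for row in rgb_image]
    return gray, color
```

The generator would receive `gray` as input, while the discriminator would see `gray` together with either `color` (real pair) or the generator's output (generated pair).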
   Table III gives detailed SSIM and PSNR results for each
sample size, reported at epoch 1 and then every 10 epochs up
to epoch 100. SSIM ranges between 0.36 and 0.90, and PSNR
ranges between 14.43 dB and 24.35 dB. Both metrics generally
improve as the number of epochs increases, and larger sample
sizes tend to provide better overall performance.
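For reference, the two metrics can be sketched in plain Python as follows. PSNR follows its standard definition, 10 · log10(MAX² / MSE). The SSIM function is a simplified single-window variant computed over the whole image, whereas the usual SSIM averages the same expression over local windows; the constants 0.01 and 0.03 are the commonly used defaults. This is an illustrative sketch under those assumptions, not the evaluation code used in the paper.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def global_ssim(reference, generated, max_value=255.0):
    """Single-window SSIM over the whole image (a simplification)."""
    n = len(reference)
    mu_x = sum(reference) / n
    mu_y = sum(generated) / n
    var_x = sum((r - mu_x) ** 2 for r in reference) / n
    var_y = sum((g - mu_y) ** 2 for g in generated) / n
    cov = sum((r - mu_x) * (g - mu_y) for r, g in zip(reference, generated)) / n
    c1 = (0.01 * max_value) ** 2  # stabilizes the luminance term
    c2 = (0.03 * max_value) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical images give an infinite PSNR and an SSIM of 1; both values fall as the generated image departs from the reference.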
   The SSIM and PSNR values for the different sample sizes
are visualized in Figures 5 and 6, respectively. These plots
illustrate the trends in model performance as training
progresses.

Fig. 6: PSNR across sample sizes and epochs.

  •   Figure 5 demonstrates how the SSIM values increase
      significantly after the first epoch and then stabilize across
      different sample sizes. For all sample sizes, SSIM reaches
      roughly 0.8 or higher by about epoch 10, indicating that
      the generated images become increasingly similar to the
      ground truth images over time.
  •   Figure 6 shows a similar trend for the PSNR values,
      where performance improves rapidly in the early stages
      of training and stabilizes as the number of epochs
      increases. Larger sample sizes achieve slightly higher
      PSNR values, with 5000 and 10000 samples reaching
      PSNR values above 24 dB by the 100th epoch. This suggests
      that the reconstruction quality of the generated images
      improves with more training data.

  Both figures emphasize the effectiveness of increasing the
dataset size and training duration. As the figures show, larger
datasets and more epochs generally result in higher similarity
(SSIM) and better reconstruction quality (PSNR).

                 V. FINDINGS AND DISCUSSION

  Analysis of Conditional Generative Adversarial Networks
applied to the CIFAR-10 dataset as a test bed for colorization
gives several important insights into the relationship between
dataset size, number of epochs, and model performance.
The key takeaways from the SSIM and PSNR results across
sample sizes and epochs are as follows.

A. Impact of Dataset Size on Model Performance

  Both SSIM and PSNR tend to improve as the sample size
increases from 1000 to 10,000, especially at later epochs.
Results remain robust on the smaller datasets, but larger
datasets generally give higher performance:
  •   For 1000 samples, the SSIM improves from 0.3618 at
      epoch 1 to 0.8773 at epoch 100, a relative improvement
      of 142.5%, while the PSNR improves from 14.43 dB to
      22.66 dB, a 57.0% increase.
TABLE III: SSIM and PSNR (dB) results for different sample sizes across epochs.

        1000 samples   2000 samples   3000 samples   4000 samples   5000 samples  10000 samples
Epoch    SSIM   PSNR    SSIM   PSNR    SSIM   PSNR    SSIM   PSNR    SSIM   PSNR    SSIM   PSNR
    1  0.3618  14.43  0.4690  15.29  0.5413  16.05  0.6631  17.70  0.7380  19.27  0.7447  19.47
   10  0.7915  20.33  0.8365  21.63  0.8128  20.51  0.8554  21.91  0.8226  20.70  0.8701  22.17
   20  0.8135  21.24  0.8395  21.19  0.7937  19.22  0.8616  22.33  0.8625  22.01  0.8564  21.70
   30  0.8182  21.14  0.8595  22.17  0.7976  20.17  0.8151  21.17  0.8693  22.31  0.8326  21.75
   40  0.8539  22.24  0.8326  21.62  0.8408  21.19  0.8631  22.26  0.8669  22.27  0.8747  22.28
   50  0.8253  20.92  0.8502  22.09  0.8629  22.21  0.8235  21.09  0.8683  22.50  0.8756  22.64
   60  0.8334  21.45  0.8743  22.88  0.8691  22.49  0.8912  23.17  0.8768  22.64  0.8635  22.09
   70  0.8639  22.40  0.8679  22.39  0.8675  22.29  0.8888  23.02  0.8889  23.12  0.8892  23.08
   80  0.8535  22.35  0.8595  22.14  0.8832  22.95  0.8864  23.20  0.8899  23.52  0.8540  23.17
   90  0.8766  22.61  0.8855  22.90  0.8862  23.14  0.8926  23.33  0.8912  23.88  0.8219  23.98
  100  0.8773  22.66  0.8885  22.98  0.8977  23.63  0.8931  23.50  0.8975  24.31  0.7878  24.35
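The relative improvements quoted in the discussion can be recomputed directly from the epoch 1 and epoch 100 rows of Table III. The sketch below does this in plain Python; the dictionary layout and function names are illustrative choices, and the numbers are copied from the table.

```python
# (SSIM, PSNR in dB) at epoch 1 and at epoch 100, copied from Table III.
TABLE_III = {
    1000: ((0.3618, 14.43), (0.8773, 22.66)),
    2000: ((0.4690, 15.29), (0.8885, 22.98)),
    3000: ((0.5413, 16.05), (0.8977, 23.63)),
    4000: ((0.6631, 17.70), (0.8931, 23.50)),
    5000: ((0.7380, 19.27), (0.8975, 24.31)),
    10000: ((0.7447, 19.47), (0.7878, 24.35)),
}


def relative_improvement(start, end):
    """Percentage change from start to end, relative to start."""
    return (end - start) / start * 100.0


def summarize(table):
    """Return {sample size: (SSIM gain %, PSNR gain %)}, rounded to 0.1."""
    summary = {}
    for size, ((ssim_1, psnr_1), (ssim_100, psnr_100)) in table.items():
        summary[size] = (
            round(relative_improvement(ssim_1, ssim_100), 1),
            round(relative_improvement(psnr_1, psnr_100), 1),
        )
    return summary
```

Calling `summarize(TABLE_III)` gives one pair of percentages per sample size, which makes it easy to check each quoted figure against the table.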
                             TABLE IV: PSNR and SSIM results for different image colorization methods.
       Citation / Source                 Dataset Size (Images)    Image Resolution      Total Pixels (approx.)    Epochs     PSNR (dB)        SSIM
       Zhang et al., 2016 [19]                   1,000                256x256                 65,536,000           100          24.8           0.86
       Iizuka et al., 2016 [9]                   3,000                224x224                150,528,000            50          24.5           0.85
       Isola et al., 2017 [3]                     400                 256x256                 26,214,400           200          22.9           0.81
       Nazeri et al., 2018 [8]                   2,000                128x128                 32,768,000            50          23.1           0.79
       Vitoria et al., 2020 [15]                 1,500                256x256                 98,304,000           100          24.3           0.83
       Su et al., 2019 [16]                       600                 256x256                 39,321,600           150          23.8           0.82
       Sartaj et al., 2021 [17]                  1,200                128x128                 19,660,800            80          23.5           0.84
       Bhattacharjee et al., 2022 [5]             800                 128x128                 13,107,200           120          24.1           0.81
       Proposed Methodology                      5,000                 32x32                   5,120,000           100         24.31          0.8975
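The "Total Pixels" column of Table IV is the product of the number of images and the image resolution. The short sketch below recomputes it for a few rows; the row selection and names are illustrative only, and the figures are taken from the table.

```python
def total_pixels(num_images, width, height):
    """Number of pixels seen in one pass over the dataset."""
    return num_images * width * height


# (images, width, height) for selected rows of Table IV.
ROWS = {
    "Zhang et al., 2016": (1000, 256, 256),
    "Iizuka et al., 2016": (3000, 224, 224),
    "Proposed Methodology": (5000, 32, 32),
}

TOTALS = {name: total_pixels(*shape) for name, shape in ROWS.items()}

# How many times fewer pixels the proposed setup uses than the Zhang et al. row.
RATIO_VS_ZHANG = TOTALS["Zhang et al., 2016"] / TOTALS["Proposed Methodology"]
```

Although the proposed setup uses the most images, its 5,120,000 total pixels are 12.8 times fewer than the 65,536,000 pixels of 1,000 images at 256 × 256.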
  The corresponding figures for the other sample sizes in
Table III are:
  •   With 2000 samples, the SSIM increases from 0.4690 to
      0.8885 (89.4%) over 100 epochs, and the PSNR from
      15.29 dB to 22.98 dB, a 50.3% increase.
  •   For 3000 samples, the SSIM rises from 0.5413 to 0.8977
      (65.8%), and the PSNR improves from 16.05 dB to 23.63
      dB, a 47.2% gain.
  •   When the dataset size is 4000, the SSIM improves from
      0.6631 to 0.8931 (34.7%), while the PSNR increases
      from 17.70 dB to 23.50 dB, a 32.8% increase.
  •   At 5000 samples, the SSIM starts at 0.7380 and reaches
      0.8975 (21.6%), while the PSNR increases from 19.27
      dB to 24.31 dB, a 26.2% rise.
  •   For the largest dataset size (10,000 samples), SSIM
      improves from 0.7447 to 0.7878 (5.8%), while the PSNR
      rises from 19.47 dB to 24.35 dB, a 25.1% increase.

B. Impact of Epochs on Model Performance

  Across all dataset sizes, increasing the number of training
epochs results in an overall improvement in both SSIM and
PSNR. The model shows rapid improvement within the first
10 epochs, with diminishing returns at later epochs:
  •   For 1000 samples, SSIM increases by 118.8% from
      epoch 1 to 10, while PSNR improves by 40.9%.
      However, the change from epoch 10 to epoch 100 is
      more modest, with SSIM improving by 10.8% and
      PSNR by 11.5%.
  •   Similar trends are observed for other dataset sizes, with
      rapid initial improvements followed by slower gains. For
      instance, for 5000 samples, SSIM improves by 9.1%
      between epochs 10 and 100, while PSNR improves by
      17.4%.

C. Comparison Between Small and Large Datasets

  At the same number of epochs, the final PSNR generally
increases with dataset size, whereas SSIM does not always
follow. For instance, at epoch 100, PSNR for 10,000 samples
is 24.35 dB versus 22.66 dB for 1000 samples, while the SSIM
for 10,000 samples is 0.7878 versus 0.8773 for 1000 samples.
It is notable that, even when the dataset is small, the model
achieves reasonably competitive performance, which shows
that CGANs can perform reasonably well on small datasets.

D. Overall Findings

  The conducted experiments demonstrate that, in general,
increasing the size of a dataset improves performance,
yet important gains can also be achieved by longer training
on smaller datasets. For example, at epoch 100 the PSNR for
5000 and 10,000 samples is barely distinguishable (24.31 dB
versus 24.35 dB, a difference of 0.04 dB), and SSIM does not
improve with the larger dataset (0.8975 versus 0.7878). These
results show that even with relatively small datasets,
CGAN-based colorization can be quite efficient, and such an
approach appears viable when data are scarce.

E. Comparison with Related Work

  When compared with the results of previous studies, some
major differences can be noticed. Zhang et al. use a dataset
of 1,000 images at 256 × 256 resolution and achieve 24.8 dB
PSNR and 0.86 SSIM. The proposed method reaches a slightly
lower PSNR of 24.31 dB but a higher SSIM of 0.8975, although
it is trained on much smaller 32 × 32 images with a sample
size of 5,000. This shows
how the approach can produce perceptually better images,
even at lower resolutions.
   Along the same lines, Iizuka et al. achieve a PSNR of
24.5 dB and an SSIM of 0.85 from a dataset of 3,000 images
at a higher resolution of 224 × 224. Although the proposed
approach is trained on far fewer pixels, it achieves a higher
SSIM (0.8975). This implies that the CGAN-based approach
can attain very competitive results even from lower-resolution
images.
   In comparison with Vitoria et al. [15], who trained their
model on 1,500 images at 256 × 256 resolution for 100 epochs,
the proposed method attains a marginally higher PSNR
(24.31 dB vs. 24.3 dB) and a higher SSIM (0.8975 vs. 0.83),
using a larger number of images but significantly fewer pixels.
This further supports the scalability of the proposed method
when training on lower-resolution data.
   Moreover, Nazeri et al. [8] and Isola et al. [3], using
128 × 128 and 256 × 256 images, achieved comparable results
at a comparable number of epochs. However, the efficiency of
the proposed model with 32 × 32 images demonstrates that
CGANs are also capable of creating high-quality outputs even
if the image resolution is lower, which gives an opportunity
to extend this work to a wider range of image sizes. As an
example, Sartaj et al. [17] use images of size 128 × 128 and
80 epochs, yet the proposed model at epoch 100 still
outperforms theirs, as shown in Table IV.

               VI. POTENTIAL IMPROVEMENTS

A. Data Preprocessing
  •   Normalization Range: Currently, the preprocessing step
      normalizes images to the range [−1, 1]. A potential
      improvement could be trying different normalization
      ranges, such as [0, 1], depending on the activation
      functions used in the generator and discriminator
      models.
  •   Data Augmentation: Introduce data augmentation
      techniques such as random flips, rotations, and crops.
      This would increase the diversity of the training data
      and might improve the generalization of the model.
  •   Larger Dataset: The number of samples for both
      training and testing is restricted (3,000 for training and
      100 for testing). A larger dataset could improve
      performance, given that models like GANs typically
      benefit from more data.

B. Model Architecture
  •   Skip Connections in Generator: The skip connections
      currently only link certain layers. Using more detailed
      skip connections, such as in the U-Net architecture,
      might improve image restoration as they preserve
      high-resolution details from earlier layers.
  •   Deeper Generator Network: Increasing the depth of the
      generator by adding more convolutional and transpose
      convolutional layers might help capture more complex
      features from the input images.
  •   Residual Blocks: Introducing residual blocks into the
      generator could help improve image reconstruction.
      Residual connections allow better gradient flow and may
      accelerate training convergence.

C. Loss Functions and Optimization
  •   Perceptual Loss: Instead of relying solely on L1 loss,
      consider adding perceptual loss (VGG-based loss), which
      compares high-level features of generated and real
      images. This could improve the visual quality of
      generated images.
  •   Gradient Penalty: A gradient penalty could be introduced
      in the discriminator, especially if the GAN suffers
      from instability during training. This could be
      implemented using Wasserstein GAN with Gradient Penalty
      (WGAN-GP) for more stable training.

D. Training Process
  •   Dynamic Learning Rate: The learning rate could be
      adjusted dynamically using learning rate schedulers. A
      decreasing learning rate over time could help the models
      converge more efficiently.
  •   Discriminator Training Frequency: Updating the
      discriminator more or less frequently relative to the
      generator might improve training stability.

E. Metrics and Evaluation
  •   FID Score: In addition to SSIM and PSNR, consider
      using the Fréchet Inception Distance (FID) score,
      which compares distributions of real and generated
      images using a pre-trained Inception network, and
      provides a more comprehensive measure of image
      quality.
  •   Visual Quality of Results: Visual inspection of the
      generated results at intermediate epochs can provide
      more intuition about the quality of generated images over
      time. Include more frequent visual checkpoints.

                     VII. CONCLUSION

   This paper introduced a lightweight Conditional GAN
model, evaluated on CIFAR-10, for the task of colorizing
grayscale images. Because its compact architecture does not
depend on large-scale or high-resolution inputs, the model
may be useful for fields such as medical imaging, where data
are usually scarce and high-quality labeled datasets are
expensive.
   Throughout the experiments, the model achieved promising
PSNR and SSIM scores despite working at low image
resolution and with a small dataset compared to almost all
prior work. This research focuses on small datasets and low
resolutions, settings in which data availability is critical or
can be the constraint in realistic applications. In these
settings, the quality of the colorized images, as measured by
PSNR and SSIM, compares favorably with prior colorization
results.
   Future improvements may include data augmentation,
perceptual loss, and deeper network architectures, which
could make the output more photorealistic and of higher
quality.
                              REFERENCES
 [1] I. Goodfellow et al., ”Generative Adversarial Nets,” NIPS, 2014.
 [2] M. Mirza and S. Osindero, ”Conditional Generative Adversarial Nets,”
     arXiv preprint arXiv:1411.1784, 2014.
 [3] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ”Image-to-image translation
     with conditional adversarial networks,” in Proc. IEEE Conf. Comput.
     Vis. Pattern Recognit. (CVPR), 2017, pp. 1125–1134.
 [4] J. Suarez et al., ”Self-Supervised Learning for Image Colorization,”
     IEEE Access, 2022.
 [5] S. Bhattacharjee, V. Singh, and D. S. Kushwaha, ”Efficient image
     colorization using conditional GANs and perceptual loss,” in Proc.
     IEEE Conf. Image Process. (ICIP), 2022.
 [6] R. Zhang, P. Isola, and A. A. Efros, ”Colorful image colorization,” in
     Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 649–666.
 [7] Y. Cao, Z. Zhu, and Z. Zhang, ”Image Colorization Using Generative
     Adversarial Networks,” International Journal of Advanced Computer
     Science and Applications, 2020.
 [8] K. Nazeri, E. Ng, T. Joseph, F. Qureshi, and M. Ebrahimi, ”Image
     colorization using generative adversarial networks,” in Proc. Int. Conf.
     Artif. Intell. Appl. (AAIA), 2018.
 [9] S. Iizuka, E. Simo-Serra, and H. Ishikawa, ”Let there be color!: Joint
     end-to-end learning of global and local image priors for automatic
     image colorization with simultaneous classification,” ACM Trans.
     Graph., vol. 35, no. 4, pp. 110:1–110:11, 2016.
[10] P. Vitoria, L. Zhang, ”ChromaGAN: Colorization with a GAN using
     chrominance and luminance color spaces,” Pattern Recognition, 2022.
[11] S. Kumar, R. Srivastava, and S. Yadav, ”Image Colorization Using
     Generative Adversarial Networks and Transfer Learning,” International
     Journal of Innovative Technology and Exploring Engineering (IJITEE),
     2021.
[12] P. Serrano, S. Bharadwaj, and R. Diaz, ”Portrait Image Colorization
     Using Conditional GANs,” WACV, 2017.
[13] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, ”Image Quality
     Assessment: From Error Visibility to Structural Similarity,” IEEE
     Transactions on Image Processing, 2004.
[14] D. Huynh-Thu and M. Ghanbari, ”Scope of Validity of PSNR in
     Image/Video Quality Assessment,” Electronics Letters, 2008.
[15] P. Vitoria, L. Sousa, and P. Quelhas, ”ChromaGAN: Adversarial picture
     colorization with semantic class distribution,” in Proc. IEEE/CVF Conf.
     Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2020, pp. 0–0.
[16] Z. Su, J. Wang, and C. Hu, ”Lightweight image colorization with
     generative adversarial networks,” IEEE Access, vol. 7,
     pp. 170804–170816, 2019.
[17] S. N. Ali, P. Kumar, and S. Jain, ”cGAN-based image colorization using
     semantic segmentation,” in Proc. Int. Conf. Pattern Recognit. Mach.
     Intell. (PRMI), 2021, pp. 334–342.
[18] A. Radford, L. Metz, and S. Chintala, ”Unsupervised Representation
     Learning with Deep Convolutional Generative Adversarial Networks,”
     ICLR, 2016.
[19] R. Zhang, P. Isola, and A. Efros, ”Colorful Image Colorization,” ECCV,
     2016.