                                 Remote sensing satellites and unmanned aerial vehicle (UAV) sensors are susceptible to atmospheric phenomena
                                 that can impair the contrast and color fidelity of the collected images, resulting in weakened image details and
                                 making it difficult to recognize information in the image. Haze, fog and smoke are very common atmospheric
                                 phenomena generated by atmospheric absorption and scattering. With the application of remote sensing tech-
                                 nology in the fields of police security, agriculture and forestry plant protection, electric power patrol inspection,
                                 land resource survey, and similar applications, it is of great significance to accurately remove haze, fog and smoke
                                 from remote sensing images (RSIs) for target detection, target tracking and UAV detection. For simplicity, the
                                 term dehazing is used uniformly to denote the removal of haze, fog and smoke.
In the image dehazing task, the following expression is widely used to describe the hazy image1–3:

$$I(x) = J(x)\,t(x) + A\,(1 - t(x)) \tag{1}$$
                                 where I(x), J(x), A and t denote the hazy image, the haze-free image, the global atmospheric light, and the
                                 transmission map, respectively. Single image dehazing is a challenging problem, which is under-constrained
                                 due to the unknown depth information. At present, numerous dehazing algorithms from several directions
                                 have been proposed.
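For illustration, a minimal NumPy sketch of Eq. (1) is given below: it synthesizes a hazy image from a haze-free image, a transmission map and a global atmospheric light. The function and argument names are ours and serve only the example.

```python
import numpy as np

def synthesize_haze(J, t, A=0.9):
    """Apply Eq. (1): I(x) = J(x) t(x) + A (1 - t(x)).

    J : haze-free image, float array in [0, 1], shape (H, W, 3)
    t : transmission map in (0, 1], shape (H, W)
    A : global atmospheric light (scalar or length-3 array)
    """
    t = t[..., np.newaxis]              # broadcast t over the color channels
    return J * t + A * (1.0 - t)
```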
Early prior-based approaches have been demonstrated to be effective. Using Eq. (1), A and t must be accurately estimated to restore clear images. One of the most representative is the dark channel prior (DCP) method4, which establishes a mapping between clear images and the atmospheric physical model and is a relatively stable dehazing algorithm. However, it tends to produce large deviations in large white areas.
Therefore, several researchers use data-driven deep learning approaches5,6 to estimate the intermediate parameters of the atmospheric scattering model and construct a mapping from the hazy image to these intermediate parameters. These deep learning algorithms are based on the atmospheric scattering model. Although they greatly improve results in the sky region and are visually more effective than traditional methods, the models are highly complex and sensitive to atmospheric lighting and scene changes, resulting in poor real-time performance and darkened brightness of the restored image. To address these problems, several
                                            algorithms directly predict the latent haze-free images in an end-to-end manner. Huang et al.7 proposed a con-
                                            ditional generative adversarial network that uses RGB and SAR images for dehazing. Mehta et al.8 developed
                                            SkyGAN specifically for removing haze in aerial images, addressing the challenge of limited hazy hyperspectral
                                            aerial image datasets.
In recent years, the Vision Transformer (ViT)9 has excelled in high-level vision tasks, focusing on modeling long-range dependencies in data. However, the earlier ViT and Pyramid Vision Transformer (PVT)10 were over-parameterized and computationally expensive. Thus, Liang et al.11, inspired by Swin-Transformer12, proposed
                                            SwinIR consisting of several Residual Swin Transformer Blocks (RSTB), each with several Swin Transformer
                                            layers and a residual connection. Uformer13 introduced a novel locally-enhanced window (LeWin) Transformer
                                            block and a learnable multi-scale restoration modulator in the form of a multi-scale spatial bias to adjust features
                                            in multiple layers of the Uformer decoder. Dong et al.14 proposed TransRA, a two-branch neural network fused
                                            with transformer and residual attention, to recover fine details of dehazing RSIs. Song et al.15 proposed Dehaze-
                                            former based on Swin-Transformer12 and U-Net16, modifying the standardization layer, activation function, and
spatial information aggregation scheme, and introducing soft constraints using a weak prior. Dehazeformer has shown superior performance compared to previous methods on the SOTS indoor dataset, while being more efficient with fewer parameters and lower computational cost. However, it is difficult to obtain sufficient paired hazy RSI datasets due to natural conditions and equipment limitations. When the training set is small and contains dense-haze images, Dehazeformer performs poorly on RSI dehazing.
To sum up, in RSI dehazing tasks both local and global features are important, and traditional image dehazing methods rest on sound theoretical foundations that can guide network learning. Thus, we design a new RGB remote sensing image dehazing model (GTMNet) based on Dehazeformer by reconstructing the model architecture and incorporating the DCP into the proposed network. Due to the down-sampling operations in the encoder of Dehazeformer, compressed spatial information may not be effectively recovered by its decoder. Therefore, in this work we use the strengthen-operate-subtract (SOS) strategy in the decoder to retrieve more of the compressed information and gradually restore the latent haze-free image. We also compare several advanced dehazing models with GTMNet and verify the applicability of the proposed model.
                                            For this paper, the main contributions are as follows: (1) A novel hybrid architecture is proposed, which is based
                                            on CNN and ViT, and combines the DCP. Compared with other referenced models, it provides better PSNR and
                                            SSIM; (2) The transmission map optimized by guided filtering and a linear transformation is smoothly introduced
                                            into the model through the spatial feature transform (SFT) layer, enabling better estimation of the haze thickness
                                            in the image and thus improving performance; (3) To gradually refine the restored image in the feature recovery
                                            module, the SOS boosted module is combined into the image dehazing task via a skip connection.
                                            Proposed method
                                            This section presents the details of GTMNet. First, we introduce the DCP. Then we estimate the transmission
                                            map. Finally, we describe the details of SFT layer, SOS boosted module and SK fusion module.
                                            Dark channel prior.       He et al.4 conducted statistical analysis on non-sky regions of more than 5,000 haze-
                                            free outdoor images, and found that there are often some pixels with very low values in at least one color channel.
Formally, the dark primary color of the haze-free image J(x) is defined as:

$$J^{\mathrm{dark}}(x) = \min_{c \in \{r,g,b\}} \Bigl( \min_{y \in \Omega(x)} J^{c}(y) \Bigr) \tag{2}$$
where c represents a channel among the R, G and B channels; Ω(x) is a local square patch centered at x; and J^c represents one color channel of J. The observation shows that, if J is a haze-free outdoor image, then except for the sky region the pixel values of J^dark tend to be 0. This statistical observation is called the DCP, or the dark primary color prior.
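As a concrete reference, the following NumPy sketch computes the dark channel of Eq. (2); the 15 × 15 patch size is the value used by He et al.4 and is an adjustable assumption here.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(image, patch_size=15):
    """Dark channel of Eq. (2): per-pixel minimum over the RGB channels,
    followed by a minimum filter over the local patch Omega(x).

    image : float array in [0, 1], shape (H, W, 3)
    """
    min_rgb = image.min(axis=2)                      # min over c in {r, g, b}
    return minimum_filter(min_rgb, size=patch_size)  # min over y in Omega(x)
```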
Estimation of transmission map. To obtain a clear haze-free image J from Eq. (1), it is necessary to solve for A and t. Equation (1) can be rewritten as:
$$J(x) = A + \frac{I(x) - A}{t(x)} \tag{3}$$
                                                According to the DCP, the dark channel of a haze image approximates the haze denseness well. Therefore, He
                                            et al.4 picked the top 0.1% brightest pixels in the dark channel of the hazy image. Among these pixels, the pixel
                                            with the highest intensity in the input image I is selected as the atmospheric light.
Assuming that the transmission in a local patch Ω(x) is constant, the patch's transmission t(x) can be defined as:

$$t(x) = 1 - \min_{y \in \Omega(x)} \Bigl( \min_{c} \frac{I^{c}(y)}{A^{c}} \Bigr) \tag{4}$$
As mentioned in the literature4, even if the weather is clear, distant objects are more or less affected by haze, so the authors control the degree of haze removal by introducing a factor ω ∈ [0, 1] to preserve a sense of depth of field. The specific expression is:

$$t(x) = 1 - \omega \min_{y \in \Omega(x)} \Bigl( \min_{c} \frac{I^{c}(y)}{A^{c}} \Bigr) \tag{5}$$
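Putting the two steps of this subsection together, the sketch below estimates the atmospheric light from the top 0.1% brightest dark-channel pixels and then computes the transmission map of Eq. (5); it reuses the dark_channel helper sketched above, and ω = 0.95 is the default suggested by He et al.4.

```python
import numpy as np

def estimate_atmospheric_light(I, dark, top_ratio=0.001):
    """Among the top 0.1% brightest dark-channel pixels, pick the pixel with
    the highest intensity in the input image I as the atmospheric light A."""
    n_top = max(1, int(dark.size * top_ratio))
    idx = np.argsort(dark.ravel())[-n_top:]          # brightest dark-channel pixels
    candidates = I.reshape(-1, 3)[idx]
    return candidates[candidates.sum(axis=1).argmax()]

def estimate_transmission(I, A, omega=0.95, patch_size=15):
    """Eq. (5): t(x) = 1 - omega * min_{y in Omega(x)} min_c I^c(y) / A^c."""
    return 1.0 - omega * dark_channel(I / A, patch_size)
```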
GTMNet. As shown in Fig. 2 and Table 1, the proposed network GTMNet is based on Dehazeformer but incorporates SFT layers18 and SOS boosted modules. The SFT layers integrate the guided transmission map (GTM) into GTMNet and effectively fuse the features of the GTM and the input image to estimate the haze thickness in the input image more accurately. The SOS boosted modules restore clear images iteratively. At the end of the decoder, a soft reconstruction layer is used to estimate the haze-free image J.
SFT layer. The SFT layer was first applied in super-resolution tasks18. It is parameter-efficient and can easily be introduced into existing dehazing network structures, with strong extensibility. As shown in Fig. 3, we use the GTM t1 as the additional input of the SFT layer, which first applies three convolutional layers to extract the conditional maps φ from the GTM; the conditional maps φ are then fed into two further convolutional layers to predict the modulation parameters γ and β, respectively; finally, the transformation is carried out by scaling and shifting the feature maps of a specific layer, and the output shifted features are obtained by:
Figure 1. Results of transmission maps on the SateHaze1k dataset: (a) input images; (b) dark channel maps; (c) transmission maps optimized by the fast guided filter; (d) guided transmission maps.

$$\mathrm{SFT}(F \mid \gamma, \beta) = \gamma \odot F \oplus \beta \tag{7}$$
where F denotes the feature maps with the same dimensions as γ and β, ⊙ denotes element-wise (Hadamard) multiplication, and ⊕ denotes element-wise addition. Since the spatial dimensions are preserved, the SFT layer performs both feature-wise manipulation and spatial-wise transformation. Because most objects in RSIs are small, capturing local features is crucial. In this paper, we utilize SFT layers with shared parameters to compensate for the Transformer's limited ability to capture local features.
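A minimal PyTorch sketch of such an SFT layer is shown below; the channel widths, hidden size and kernel sizes are our assumptions rather than the exact GTMNet configuration.

```python
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial feature transform of Eq. (7): modulate feature maps F with a
    scale gamma and shift beta predicted from the guided transmission map."""

    def __init__(self, feat_channels=24, cond_channels=1, hidden=24):
        super().__init__()
        # three convolutional layers extract the conditional maps phi from the GTM
        self.condition = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1),
        )
        # two further convolutional layers predict gamma and beta
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, gtm):
        phi = self.condition(gtm)
        gamma, beta = self.to_gamma(phi), self.to_beta(phi)
        return gamma * feat + beta    # SFT(F | gamma, beta) = gamma ⊙ F ⊕ beta
```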
SOS boosted module. The SOS boosting method19 has been mathematically proven to be effective for image denoising, restoring clear images iteratively. Dong et al.20 verified a variety of optional SOS boosted modules, and the results show that the boosted scheme given in Eq. (8) has the best effect.
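Although Eq. (8) is not reproduced here, the strengthen-operate-subtract idea can be sketched as follows: the up-sampled decoder feature is strengthened with the matching encoder feature, refined, and the up-sampled feature is then subtracted again. The refinement unit is modeled below as a plain convolutional block, which is our simplification rather than the exact module used in the paper.

```python
import torch.nn as nn

class SOSBoost(nn.Module):
    """Strengthen-operate-subtract boosting on one decoder level."""

    def __init__(self, channels=24):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, upsampled_dec, enc_skip):
        strengthened = enc_skip + upsampled_dec   # strengthen
        refined = self.refine(strengthened)       # operate
        return refined - upsampled_dec            # subtract
```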
                                  Encoder
                                  Block                 Filter size   Stride   Channel      In           Out          Input
                                  3 × 3 conv            3×3           1        3/24         (H, W)       (H, W)       Hazy image
                                  SFT                   3×3           1        (24, 1)/24   (H, W)       (H, W)       F(3 × 3 conv), GTM
                                  Dehazeformer Block1   –             –        24/24        (H, W)       (H, W)       F(SFT)
                                  Down-Sample           3×3           2        24/48        (H, W)       (H/2, W/2)   F(Dehazeformer Block1)
                                  Dehazeformer Block2   –             –        48/48        (H/2, W/2)   (H/2, W/2)   F(Down-Sample)
                                  Down-Sample           3×3           2        48/96        (H/2, W/2)   (H/4, W/4)   F(Dehazeformer Block2)
                                  Decoder
                                  Block                 Filter size   Up       Channel      In           Out          Input
                                  Dehazeformer Block3   –             –        96/96        (H/4, W/4)   (H/4, W/4)   F(Down-Sample)
                                  Up-Sample             3×3           2        96/48        (H/4, W/4)   (H/2, W/2)   F(Dehazeformer Block3)
                                  SOS2                  3×3           1        (48,48)/48   (H/2, W/2)   (H/2, W/2)   F(Up-Sample), F(Dehazeformer Block2)
                                  SK Fusion             –             –        (48,48)/48   (H/2, W/2)   (H/2, W/2)   F(SOS2), F(Up-Sample)
                                  Dehazeformer Block4   –             –        48/48        (H/2, W/2)   (H/2, W/2)   F(SK Fusion)
                                  Up-Sample             3×3           2        48/24        (H/2, W/2)   (H, W)       F(Dehazeformer Block4)
                                  SOS1                  3×3           1        (24,24)/24   (H, W)       (H, W)       F(Up-Sample), F(Dehazeformer Block1)
                                  SK Fusion             –             –        (24,24)/24   (H, W)       (H, W)       F(SOS1), F(Up-Sample)
                                  Dehazeformer Block5   –             –        24/24        (H, W)       (H, W)       F(SK Fusion)
SFT                   3×3           1        (24,1)/24    (H, W)       (H, W)       F(Dehazeformer Block5), GTM
                                  3 × 3 conv            3×3           1        24/4         (H, W)       (H, W)       F(SFT)
                                  Soft Recon            –             –        (4,3)/3      (H, W)       (H, W)       F(3 × 3 conv), hazy image
Table 1.  The architecture details of the proposed method (Up: up-sampling factor; Channel: number of input and output channels per block; In and Out: spatial resolution of the input and output per block; Input: input per block).
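To make the wiring of Table 1 explicit, the following PyTorch sketch reproduces the data flow of GTMNet with every block injected as a ready-made module (input/output convolutions, Dehazeformer blocks, SFT layers, down/up-sampling, SOS modules, SK fusions and the soft reconstruction layer). It only illustrates how features are routed, not how each block is implemented.

```python
import torch.nn as nn

class GTMNetSketch(nn.Module):
    """Feature routing of Table 1; all blocks are supplied via a module dict."""

    def __init__(self, blocks: dict):
        super().__init__()
        self.m = nn.ModuleDict(blocks)

    def forward(self, hazy, gtm):
        m = self.m
        f1 = m["block1"](m["sft_in"](m["conv_in"](hazy), gtm))   # 24 ch, (H, W)
        f2 = m["block2"](m["down1"](f1))                         # 48 ch, (H/2, W/2)
        f3 = m["block3"](m["down2"](f2))                         # 96 ch, (H/4, W/4)
        u2 = m["up1"](f3)                                        # 48 ch, (H/2, W/2)
        d2 = m["block4"](m["fuse2"](m["sos2"](u2, f2), u2))      # SOS2, then SK fusion
        u1 = m["up2"](d2)                                        # 24 ch, (H, W)
        d1 = m["block5"](m["fuse1"](m["sos1"](u1, f1), u1))      # SOS1, then SK fusion
        feat = m["conv_out"](m["sft_out"](d1, gtm))              # 4 ch, (H, W)
        return m["soft_recon"](feat, hazy)                       # haze-free estimate J
```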
SK fusion module. Song et al.22 designed a selective kernel (SK) fusion module, inspired by SKNet23, to fuse multiple branches using channel attention. We use the SK fusion module22 to fuse the SOS and decoder branches. Specifically, given two feature maps x1 and x2, a linear layer f(·) is first used to project x1 to x̂1. Then global average pooling GAP(·), a multilayer perceptron MLP(·), a softmax function and a split operation are used to obtain the fusion weights, as shown in Eq. (10):

$$\{a_1, a_2\} = \mathrm{Split}\bigl(\mathrm{Softmax}\bigl(\mathrm{MLP}\bigl(\mathrm{GAP}(\hat{x}_1 + x_2)\bigr)\bigr)\bigr) \tag{10}$$

Finally, the weights {a1, a2} are used to fuse x̂1 and x2 with an additional short residual via y = a1·x̂1 + a2·x2 + x2.
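A minimal PyTorch sketch of such a fusion module is given below; the 1 × 1 convolutions and the MLP reduction ratio are our assumptions.

```python
import torch
import torch.nn as nn

class SKFusion(nn.Module):
    """SK-style fusion of Eq. (10) with the short residual on the second input."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.project = nn.Conv2d(channels, channels, 1)             # linear layer f(.)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, 1),
        )
        self.softmax = nn.Softmax(dim=1)                            # over the two branches

    def forward(self, x1, x2):
        x1_hat = self.project(x1)
        gap = torch.mean(x1_hat + x2, dim=(2, 3), keepdim=True)     # GAP(.)
        w = self.softmax(self.mlp(gap).view(x1.size(0), 2, -1, 1, 1))
        a1, a2 = w[:, 0], w[:, 1]                                   # Split(Softmax(MLP(...)))
        return a1 * x1_hat + a2 * x2 + x2                           # y = a1*x1_hat + a2*x2 + x2
```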
                                            Experiments
                                            In this part, we first present datasets and the implementation details of GTMNet. Then, we evaluate our method
                                            on RS-Haze and SateHaze1k datasets. Finally, ablation studies and other comparative experiments are conducted
                                            to analyze the proposed approach.
Datasets. RS-Haze22 is a synthetic hazy RSI dataset synthesized from 76 RSIs containing diverse topography under good weather conditions and 108 cloudy RSIs. All images are downloaded from the Landsat-8 Level 1 data product on EarthExplorer. The final training set contains 51,300 RSI pairs, and the test set contains 2,700 RSI pairs with an image resolution of 512 × 512. Since the proposed method is optimized on the Dehazeformer model, the experimental setup is consistent with Dehazeformer22. We train the model using the L1 loss for 150 epochs, validating after every epoch. The images in the test set are the same as those in the validation set.
SateHaze1k7 is also a synthetic hazy satellite remote sensing dataset, which uses Photoshop software as an auxiliary tool to generate rich, realistic and diverse hazy images. This dataset contains 1,200 RSI pairs, and each pair includes a hazy image and a real haze-free image. These images are divided into three hazy image subsets: Thin Fog, Moderate Fog and Thick Fog, with an image resolution of 512 × 512. We select 320 pairs of images from each hazy image subset as the training set and 45 pairs as the test set. Each hazy image subset is trained and tested separately. Since the SateHaze1k dataset is small, we train GTMNet for 1000 epochs and validate it every ten epochs. Other experimental configurations are the same as those for the RS-Haze dataset.
Implementation details. We provide four variants of GTMNet (-T, -S, -B and -L for tiny, small, basic, and large, respectively), implement the proposed network structure using the PyTorch framework, and train the models on an NVIDIA GeForce RTX 3090. During training, images are randomly cropped to 256 × 256 patches. We set different mini-batch sizes for the different variants, i.e., {32, 16, 8, 4} for {-T, -S, -B, -L}. The initial learning rate is set to {4, 2, 2, 1} × 10−4 for the variants {-T, -S, -B, -L}. We use the AdamW optimizer24 with a cosine annealing strategy25 to train the model, where the learning rate gradually decreases from the initial learning rate to {4, 2, 2, 1} × 10−6.
    The proposed mechanism for GTMNet training is illustrated in Algorithm 1. All learnable parameters in GTMNet are initialized using the truncated normal distribution strategy26.
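A sketch of the corresponding training loop for the -T variant on SateHaze1k is shown below (L1 loss, AdamW, cosine annealing from 4 × 10−4 to 4 × 10−6 over 1000 epochs). build_gtmnet_t and train_loader are hypothetical placeholders, and the two-input interface (hazy patch plus GTM) follows the architecture sketch given with Table 1.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = build_gtmnet_t().cuda()                    # placeholder constructor, -T variant
optimizer = AdamW(model.parameters(), lr=4e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=1000, eta_min=4e-6)
criterion = torch.nn.L1Loss()

for epoch in range(1000):
    for hazy, clear, gtm in train_loader:          # 256 x 256 random crops and their GTMs
        hazy, clear, gtm = hazy.cuda(), clear.cuda(), gtm.cuda()
        optimizer.zero_grad()
        loss = criterion(model(hazy, gtm), clear)
        loss.backward()
        optimizer.step()
    scheduler.step()                               # cosine decay of the learning rate
```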
Evaluation. Quantitative evaluation. We use the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) as objective evaluation indicators, and compare the number of parameters of GTMNet and the other methods, as shown in Tables 2 and 3, where bold indicates the optimal value and underline indicates the suboptimal value.
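For reference, both metrics can be computed for a single image pair with scikit-image (version 0.19 or later for the channel_axis argument), as sketched below.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(gt, dehazed):
    """PSNR (dB) and SSIM for one ground-truth/dehazed pair of H x W x 3 uint8 images."""
    psnr = peak_signal_noise_ratio(gt, dehazed, data_range=255)
    ssim = structural_similarity(gt, dehazed, channel_axis=-1, data_range=255)
    return psnr, ssim
```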
RS-Haze dataset. Due to equipment limitations, training and testing are conducted only on the -T variant. We compare the proposed method with four other classical dehazing algorithms. As shown in Table 2, the PSNR of our method is slightly lower than that of Dehazeformer-T, while the SSIM of the two is the same. Since the proposed architecture has more parameters, it is more prone to overfitting, resulting in poorer generalization performance.
SateHaze1k dataset. We compare the proposed method with DCP4, DehazeNet5, Huang (SAR)7, SkyGAN8, TransRA14 and Dehazeformer22, and the results are shown in Table 3. The PSNR and SSIM of GTMNet-T on the three sub-datasets are better than those of Dehazeformer-T22; in particular, the PSNR on Thin Fog is improved by nearly 2.6%, and the SSIM is increased from 0.968 to 0.970. On Moderate Fog, the PSNR and SSIM of GTMNet-B reach 27.22 dB and 0.973, respectively, an increase of 7.2% and 7.6% compared to SkyGAN8. On Thick Fog, although the PSNR of GTMNet-B is lower than that of Huang (SAR)7 and SkyGAN8, the SSIM improves by 8.7% and 5.2%, respectively, compared to the two algorithms. On the three sub-datasets, GTMNet-T achieves better PSNR and SSIM scores than TransRA14, with a significant improvement in PSNR.
As shown in Table 3, combined with the quantitative comparison results above, the proposed model is still lightweight, although the number of parameters has increased slightly. On the Moderate Fog and Thick Fog sub-datasets, GTMNet-B performs comparably to Dehazeformer-L with only about one tenth of the parameters. However, the performance of GTMNet-L is inferior to that of Dehazeformer-L, which may be caused by two factors: first, the increased parameter count of GTMNet-L makes it more prone to overfitting; second, the generalization ability of GTMNet-L is reduced due to the small dataset.
Qualitative evaluation. A qualitative comparison of related methods was performed on the RS-Haze and SateHaze1k datasets. Since Song et al.22 have already compared existing advanced image dehazing methods on the RS-Haze dataset, we only present the dehazed images of GTMNet-T and Dehazeformer-T here. As shown in Fig. 5, there is little visual difference between GTMNet-T and Dehazeformer-T on the RS-Haze images, both showing clarity, rich feature information, realistic colors and a sense of hierarchy.
                                                  On SateHaze1k dataset, we present the qualitative comparison results of the GTMNet and state-of-the-art
                                             methods. The hazy input images include farmland, roads, buildings and vegetation, as shown in Fig. 6. We found
                                             that the DCP4 method failed, possibly due to the similarity between the colors of the atmospheric light and the
object. Although the method of Huang (SAR)7 can remove haze, the ground feature information of the restored image in dense haze areas is not rich enough, and building details are severely weakened. In general, both DehazeNet5 and SkyGAN8 fail to completely remove the haze (as shown in the result for the first hazy image in Fig. 6), resulting in unnatural image colors and weak recovery of detailed information. Dehazeformer-T22 and GTMNet-T solve the problem of incomplete dehazing. However, in areas with thick haze or cloud-like haze, the Dehazeformer algorithm suffers from serious color distortion. GTMNet improves not only the image color deviation but also the sharpness.
Ablation study. In this part, we perform ablation studies on the proposed model structure to analyze the factors that may influence the results. In these studies, apart from the component under investigation, all other settings are the same in each group of experiments.
                                            The effects of different components on the model performance. To study the influence of different components on
                                            the image dehazing effect, we take Dehazeformer-T22 as the baseline model and conduct ablation experiments
                                            on different components on SateHaze1k dataset7.
                                                As shown in Table 4, D-SOS-T refers to adding the SOS module to Dehazeformer-T. According to Table 5,
                                            we found that the PSNR and SSIM indicators of the three sub-datasets have been significantly improved, verify-
                                            ing the effectiveness of the SOS module in the image dehazing task. D-GTM-T indicates the introduction of
                                            the GTM as a prior into Dehazeformer-T through two SFT layers. The location of the SFT layer is shown in
                                            Fig. 9b. According to Table 5, the performance of adding only a prior GTM to Dehazeformer-T without using
                                            the SOS boosted strategy is better than that of Dehazeformer-T on Moderate Fog, but the effect is poor on Thin
Fog and Thick Fog. We believe this is because the method used to obtain the GTM is based on statistics of ordinary natural images, and there is a large gap between RSIs and such images. Traditional prior-based methods are more effective on images with uniform haze.
As shown in Fig. 7, the haze-free images generated by Dehazeformer-T, D-SOS-T, and D-GTM-T all show building distortion. Among all the methods, GTMNet achieves the best dehazing effect, ensuring the clarity of the restored image and better restoring its colors. On the Thin Fog and Thick Fog sub-datasets,
                                  the PSNR and SSIM indicators increase more when the two components are used together than when used
                                  separately.
The effects of different inputs of the SOS1 module on the model performance. According to Eqs. (8) and (9), we designed
                                  two different ablation models D-SOS-T and D-SOS1-T on SateHaze1k dataset. The specific configuration is
                                  shown in Table 6. According to Table 7, if S2 is directly upsampled and input to SOS1 (Fig. 2), compared with
                                  D-SOS-T, PSNR decreases from 27.09 to 26.77 dB, and the value of SSIM remains unchanged on Moderate Fog.
                                  In addition, compared with Dehazeformer-T, PSNR and SSIM increase from 26.38 dB and 0.969 to 26.77 dB and
                                  0.971, respectively.
As seen in Fig. 8, there is very little visual difference between the dehazed images of D-SOS-T and D-SOS1-T. In dense haze areas, the color distortion is severe and edge details are lost, as shown in the results of the third hazy image in Fig. 8. To sum up, Up(J2) is set as the input of the SOS1 module.
The effects of SFT layer and GTM on the model performance. According to the structure of the model, the position of the SFT layers can be categorized into four situations (as shown in Fig. 9): (a) using only one SFT layer in front of Dehazeformer block1, (b) using only one SFT layer behind Dehazeformer block5, (c) using an SFT layer in front of Dehazeformer block1 and behind Dehazeformer block5, respectively (i.e., GTMNet), and (d) using an SFT layer in front of Dehazeformer block2 and behind Dehazeformer block4, respectively.
                                  Table 5.  Quantitative comparison of different components ablation models on SateHaze1k dataset. Bold
                                  indicates the optimal value and underline indicates the suboptimal value.
As shown in Table 8, (d)-T has the highest PSNR and SSIM on Moderate Fog, but Table 9 indicates that GTMNet-B has a greater increase in PSNR and SSIM than (d)-B. Moreover, as seen from the comparison results in Fig. 10, the best dehazing result is achieved by GTMNet-T, with significantly improved image clarity and less severe color distortion, especially in the third hazy image in Fig. 10.
                                      Based on the results shown in Table 8, we conclude that adding GTM to both the encoder and decoder has
                                  a superior effect on removing haze from the Thin Fog RSIs, and adding GTM solely to the decoder has a better
                                  effect on removing haze from the Moderate Fog and Thick Fog RSIs. We believe that the effectiveness of GTM is
                                  not only related to the thickness of haze, but also depends on the presence or absence of SOS boosted modules.
                                            Table 7.  Quantitative comparison of ablation models with different inputs to the SOS1 module on SateHaze1k
                                            dataset. Bold indicates the optimal value and underline indicates the suboptimal value.
                                            Figure 8.  Qualitative comparison of ablation models with different inputs to the SOS1 module on SateHaze1k
                                            dataset.
                                            Figure 9.  Position of SFT layers: (a) In front of Dehazeformer block1; (b) Behind Dehazeformer block5; (c)
                                            In front of Dehazeformer block1 and behind Dehazeformer block5; (d) In front of Dehazeformer block2 and
                                            behind Dehazeformer block4.
                                  Table 8.  Quantitative comparison of ablation models of SFT layer and GTM on SateHaze1k dataset. Bold
                                  indicates the optimal value and underline indicates the suboptimal value.
                                  Table 9.  Quantitative comparison of ablation models and GTMNet with different variants on SateHaze1k
                                  dataset. Bold indicates the optimal value and underline indicates the suboptimal value.
Figure 10. Qualitative comparison of ablation models of SFT layer and GTM on SateHaze1k dataset.
Different transmission maps can affect the dehazing performance of a model. In our experiments, we utilized two types of transmission maps: the transmission map optimized solely by guided filtering, denoted (c)-t-T, and the GTM used in GTMNet, obtained by optimizing the estimated transmission map via guided filtering and subsequently applying a linear transformation to it. As shown in Table 8, the GTM leads to higher PSNR and SSIM on both Thin Fog and Thick Fog compared to the transmission map optimized solely by guided filtering. Moreover, the subjective visual evaluation and the objective quantitative metrics demonstrate that the GTM is also suitable for locally dense haze images and yields a remarkable dehazing effect.
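As a sketch of how such a GTM can be produced, the function below refines a raw transmission map with a guided filter17 (the plain guided filter from opencv-contrib stands in for the fast variant) and then applies a linear transformation. The filter radius, regularization term and linear coefficients are illustrative placeholders; the exact values used for GTMNet are not given here.

```python
import cv2
import numpy as np

def guided_transmission_map(hazy_bgr, t, radius=60, eps=1e-3, a=1.0, b=0.0):
    """Refine a raw transmission map t with guided filtering and apply a*t + b.
    Requires opencv-contrib-python for cv2.ximgproc.guidedFilter."""
    guide = cv2.cvtColor(hazy_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    refined = cv2.ximgproc.guidedFilter(guide, t.astype(np.float32), radius, eps)
    return np.clip(a * refined + b, 0.0, 1.0)
```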
The effects of initial learning rate on the model performance. According to the training method in Dehazeformer22, the initial learning rate of the model decreases as the batch size decreases. Following the linear scaling rule, the initial learning rate of GTMNet-B should be 1 × 10−4. We performed ablation experiments on the three sub-datasets and found that reducing the initial learning rate of GTMNet-B, as shown in Table 10, generally decreased the PSNR and SSIM values significantly, so we kept the initial learning rate constant at 2 × 10−4 even though we reduced the mini-batch size for -B.
Qualitative comparison of real-world images. To evaluate the generalization ability of GTMNet, we select two real-world hazy RSIs acquired by an unmanned aerial vehicle for testing. Overall, Dehazeformer is the suboptimal (second-best) method in the comparisons above; therefore, we only compare the results of GTMNet-T and Dehazeformer-T in this part and use the -T models trained on Moderate Fog to test the two real-world hazy images. Figure 11 shows little visual difference between the results obtained by the proposed algorithm and Dehazeformer-T. Both methods produce clear results with rich ground information and realistic colors, suggesting that both algorithms are suitable for real-world hazy remote sensing images. Additional visual comparisons showcasing the performance of our method on real-world images are included in the Supplementary Material.
                                            The impact of dehazing results on subsequent tasks. Hazy images suffer from problems like low
                                            contrast, low saturation, detail loss, and color deviation, which seriously affect image analysis tasks, such as clas-
                                            sification, positioning, detection, and segmentation. Therefore, in such cases, dehazing is crucial for generating
                                            images with good perceptual quality and improving the performance of subsequent computer vision tasks.
                                                In this section, we analyze the impact of dehazing results on RSI water body segmentation. Firstly, we trained
                                            an RSI water segmentation network inspired by the U-Net for biomedical image segmentation28 using 1500 RSIs
                                            and tested it using 300 RSIs. Secondly, we selected two images from the test set, added a moderate concentration
                                            of haze using Photoshop software, and tested the two images using the -T model trained on Moderate Fog. Finally,
                                            we qualitatively compare the results of water body segmentation for hazy inputs, dehazing results from GTMNet-
                                            T and Dehazeformer-T, and haze-free images. As shown in Fig. 12, there is very little visual difference between
the dehazed images of GTMNet-T and the haze-free images. However, the dehazed images of Dehazeformer-T lead to more errors in the water body segmentation process than the haze-free images.
Figure 11.  Qualitative comparison of Dehazeformer and GTMNet for real-world images. The hazy inputs are acquired by a DJI-Phantom 4 Pro.
                                  Figure 12.  Qualitative comparison of different dehazing results in RSIs water body segmentation task. The
                                  ground truths are acquired by a DJI-Phantom 3 Pro.
                                  Conclusions
                                  Combining the advantages of ViT and CNN, we propose a new RSI dehazing hybrid model GTMNet. The GTM
                                  is first introduced into the model using two SFT layers to improve the model’s ability to estimate the haze thick-
                                  ness. The SOS boosted module is then introduced to refine the local features of the restored image gradually. The
experimental results show that the proposed model has an excellent dehazing effect even on small-scale hazy RSI datasets, effectively compensating for the lack of training data in current low-level vision tasks and improving the model's applicability. Compared with state-of-the-art methods, GTMNet mitigates, to some extent, the color distortion on high-brightness building roofs and in dense haze areas.
                                       We found that the effectiveness of the prior GTM depends on the presence of the SOS boosted module.
Therefore, the strategy used to introduce external prior knowledge is crucial. In future work, inspired by the dynamic memory network (DMN+)29, which fuses target-related external knowledge and image features, and the multi-level features fusion network (MFFN)30, which addresses network redundancy, we will explore a self-weighted fusion strategy for auxiliary data (e.g., Synthetic Aperture Radar images, the GTM) and RSI features. In addition, we will further study strategies for combining traditional methods with deep learning-based methods and design more suitable models to avoid overfitting.
                                            Data availability
                                            All data generated or analyzed during this study are included in this published article. The version of Photoshop
                                            software for creating hazy RSIs is 24.3, which is available at https://www.adobe.com/products/photoshop.html.
                                            References
                                             1. McCartney, E. J. Optics of the Atmosphere: Scattering by Molecules and Particles (Springer, 1976).
                                             2. Nayar, S. K. & Narasimhan, S. G. Vision in bad weather. In Proceedings of the Seventh IEEE International Conference on Computer
                                                Vision, Vol. 2, 820–827 (IEEE, 1999).
                                             3. Narasimhan, S. G. & Nayar, S. K. Vision and the atmosphere. Int. J. Comput. Vis. 48, 233–254 (2002).
                                             4. He, K., Sun, J. & Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 33, 2341–2353
                                                (2010).
                                             5. Cai, B., Xu, X., Jia, K., Qing, C. & Tao, D. Dehazenet: An end-to-end system for single image haze removal. IEEE Trans. Image
                                                Process. 25, 5187–5198 (2016).
                                             6. Chavez, P. S. Jr. An improved dark-object subtraction technique for atmospheric scattering correction of multispectral data. Remote
                                                Sens. Environ. 24, 459–479 (1988).
                                             7. Huang, B., Zhi, L., Yang, C., Sun, F. & Song, Y. Single satellite optical imagery dehazing using SAR image prior based on conditional
                                                generative adversarial networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 1806–1813
                                                (2020).
                                             8. Mehta, A., Sinha, H., Mandal, M. & Narang, P. Domain-aware unsupervised hyperspectral reconstruction for aerial image dehaz-
                                                ing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 413–422 (2021).
                                             9. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 1–10 (2017).
                                            10. Wang, W. et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the
                                                IEEE/CVF International Conference on Computer Vision, 568–578 (2021).
                                            11. Liang, J. et al. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on
                                                Computer Vision, 1833–1844 (2021).
                                            12. Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International
                                                Conference on Computer Vision, 10012–10022 (2021).
                                            13. Wang, Z. et al. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on
                                                Computer Vision and Pattern Recognition, 17683–17693 (2022).
                                            14. Dong, P. & Wang, B. TransRA: Transformer and residual attention fusion for single remote sensing image dehazing. Multidimen-
                                                sion. Syst. Signal Process. 33, 1119–1138 (2022).
                                            15. Song, Y., He, Z., Qian, H. & Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 32, 1927–1941 (2023).
                                            16. Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Confer-
                                                ence on Medical Image Computing and Computer-Assisted Intervention, 234–241 (Springer, 2015).
17. He, K., Sun, J. & Tang, X. Guided image filtering. In European Conference on Computer Vision, 1–14 (Springer, 2010).
                                            18. Wang, X., Yu, K., Dong, C. & Loy, C.C. Recovering realistic texture in image super-resolution by deep spatial feature transform.
                                                In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 606–615 (2018).
                                            19. Romano, Y. & Elad, M. Boosting of image denoising algorithms. SIAM J. Imag. Sci. 8, 1187–1219 (2015).
                                            20. Dong, H. et al. Multi-scale boosted dehazing network with dense feature fusion. In Proceedings of the IEEE/CVF Conference on
                                                Computer Vision and Pattern Recognition, 2157–2167 (2020).
                                            21. Shi, W. et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In
                                                Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1874–1883 (2016).
                                            22. Song, Y., He, Z., Qian, H. & Du, X. Vision Transformers for Single Image Dehazing. http://arxiv.org/abs/2204.03883 (2022).
                                            23. Li, X., Wang, W., Hu, X. & Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and
                                                Pattern Recognition, 510–519 (2019).
                                            24. Loshchilov, I. & Hutter, F. Decoupled Weight Decay Regularization. http://arxiv.org/abs/1711.05101 (2017).
25. Loshchilov, I. & Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. http://arxiv.org/abs/1608.03983 (2016).
                                            26. Burkardt, J. The truncated normal distribution. Department of Scientific Computing Website 1, 35 (2014).
                                            27. Chen, D. et al. Gated context aggregation network for image dehazing and deraining. In 2019 IEEE Winter Conference on Applica-
                                                tions of Computer Vision (WACV) 1375–1383 (IEEE, 2019).
                                            28. Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image
                                                Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9,
                                                2015, Proceedings, Part III 18 234–241 (Springer, 2015).
                                            29. Chen, Y., Xia, R., Zou, K. & Yang, K. FFTI: Image inpainting algorithm via features fusion and two-steps inpainting. J. Vis. Com-
                                                mun. Image Represent. 91, 103776 (2023).
                                            30. Chen, Y., Xia, R., Yang, K. & Zou, K. MFFN: Image super-resolution via multi-level features fusion network. Vis. Comput. 1, 1–16
                                                (2023).
                                            Author contributions
                                            H.L.: conceptualization, software, investigation, visualization, validation, writing, revision. Y.Z.: conceptualiza-
                                            tion, methodology, writing, revision, supervision, financial support. J.L.: conceptualization, writing, revision.
                                            Y.M.: validation, resources.
                                            Funding
                                            Yaping Zhang was funded by Yunnan Provincial Agricultural Basic Research Joint Special Project (Grant No.
202101BD070001-042), and the Yunnan Ten-Thousand Talents Program.
                                            Competing interests
                                            The authors declare no competing interests.
                                  Additional information
                                  Supplementary Information The online version contains supplementary material available at https://doi.org/
                                  10.1038/s41598-023-36149-6.
                                  Correspondence and requests for materials should be addressed to Y.Z.
                                  Reprints and permissions information is available at www.nature.com/reprints.
                                  Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
                                  institutional affiliations.
                                                Open Access This article is licensed under a Creative Commons Attribution 4.0 International
                                                License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
                                  format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
                                  Creative Commons licence, and indicate if changes were made. The images or other third party material in this
                                  article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
                                  material. If material is not included in the article’s Creative Commons licence and your intended use is not
                                  permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
                                  the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.