A CNN-BASED FUSION METHOD FOR SUPER-RESOLUTION OF SENTINEL-2 DATA
Massimiliano Gargiulo¹, Antonio Mazza¹, Raffaele Gaetano², Giuseppe Ruello¹, Giuseppe Scarpa¹
¹ DIETI, University Federico II, Via Claudio 21, 80125 Naples, Italy
² UMR-TETIS Laboratory, CIRAD, 34000 Montpellier, France
ABSTRACT
Sentinel-2 data represent a rich source of information for the community thanks to their free access and to the temporal-spatial coverage they assure. However, some of the spectral bands are sensed at reduced resolution because of a compromise between technological limitations and the objectives of the Copernicus programme. For this reason, in this work we present a new super-resolution method based on Convolutional Neural Networks (CNNs) to raise the resolution of the short-wave infrared (SWIR) band from 20 to 10 meters, the highest resolution provided. This is accomplished by fusing the target band with the finer-resolution ones. The proposed solution compares favourably against several alternative methods according to different quality indexes. In addition, we have also tested the use of the super-resolved band from an application perspective, by detecting water basins through the Modified Normalized Difference Water Index (MNDWI).

Index Terms— Deep learning; convolutional neural network; normalized difference water index; Sentinel-2; pansharpening.

Fig. 1. Example of super-resolution of ρ11 (SWIR band). Left: the PAN-like guide band ρ8; middle: ρ11 upsampled bicubically; right: ρ11 upsampled with the proposed method.

1. INTRODUCTION

The European Space Agency has recently launched the twin satellites Sentinel-2, which provide global acquisitions of relatively fine spatial resolution multispectral images with a high revisit frequency. Their objective is to supply data for services such as risk management (floods, forest fires, subsidence, landslides), land monitoring, food security/early warning systems, water management, soil protection, and so forth [1]. Unfortunately, due to a balance between technological constraints and the objectives of the mission, only four out of thirteen bands are provided at the highest resolution of 10 meters; the remaining bands are given at 20 or 60 meters. One such band is the SWIR band, provided at 20 meters, which has proven to be very useful for water detection [2].

Motivated by the above considerations, in this work we propose a super-resolution method for the SWIR band of Sentinel-2, for the purpose of water monitoring at fine scale through the MNDWI. The basic approach to a super-resolution problem is to use bicubic or more general polynomial interpolation. State-of-the-art solutions resort instead to deep learning methods such as CNNs [3, 4]. However, these methods do not take into account any source of data other than the objective image. In the problem at hand we have at our disposal companion bands which are coregistered with the target and carry important information, in particular spatial details. It is therefore more effective to resort to pansharpening-like methods, which are meant to fuse a low-resolution multispectral (MS) image with a single high-resolution panchromatic (PAN) band in order to raise the resolution of the MS to that of the PAN. In our case the SWIR band plays the role of the MS, while one or more higher-resolution companion bands replace the single PAN. Following this paradigm, several pansharpening methods, based on both component substitution [5, 6] and multiresolution analysis [7, 8], have been adapted to the Sentinel-2/SWIR case and compared in [9]. Also related to this problem is the fusion of multispectral and hyperspectral images, for which a deep learning approach has already been proposed in [10].

Here, in line with this direction of research, we propose a CNN-based approach similar to [11, 12], which have proved very successful in pansharpening very high resolution data such as Ikonos, GeoEye, or WorldView. In particular, we have developed three CNN models corresponding to three input combinations. In the simplest case (M1) we feed the network only with the objective band ρ11, without high-resolution guiding bands (pure super-resolution). We then move to the pansharpening-like case, which also includes higher-resolution bands, considering the limit case where only the most correlated band (near infrared, NIR) is included (M2), as opposed to the case where all 10-m resolution bands are used (M5). A preview of the proposed solution compared with a simple bicubic interpolation is given in Fig. 1.
On the left is the "guide" band ρ8, in the middle the bicubic interpolation of ρ11, and on the right the proposed upsampling with model M2. Numerical results discussed below show that the quality of our method compares favourably against different pansharpening-like alternatives according to several indicators. In addition, we have also tested the proposal from the user's perspective by detecting water basins through the MNDWI computed at 10-m resolution using the upsampled SWIR component. The rest of the paper is organized as follows. Section 2 describes the proposed method in more detail. Section 3 summarizes experimental results, while conclusions are drawn in Section 4.

Model   Input bands            l=1       l=2       l=3             Interaction range
M1      ρ11                    3×3 (48)  3×3 (32)  3×3 (1) → ρ̂11   7×7
M2      ρ11, ρ8                3×3 (48)  3×3 (32)  3×3 (1) → ρ̂11   7×7
M5      ρ11, ρ8, ρ2, ρ3, ρ4    3×3 (48)  3×3 (32)  3×3 (1) → ρ̂11   7×7

Table 1. Hyper-parameters of the proposed networks: kernel size and (number of output features) of each convolutional layer l, and the resulting interaction range (scope).
2. PROPOSED CNN-BASED METHOD

Convolutional neural networks have been successfully applied to many image processing problems, such as super-resolution [3], pansharpening [12], and classification [13], because of several advantages: (i) the capability to approximate complex non-linear functions; (ii) the ease of training, which avoids time-consuming handcrafted filter design; (iii) the parallel computational architecture. On the downside, a large amount of "labelled" data is required for training.

In this work we propose a relatively shallow architecture, a cascade of $L = 3$ convolutional layers interleaved with Rectified Linear Unit (ReLU) activations, which ensure fast convergence of the training process [13]. Let $x \triangleq (\rho_{11}, \rho_{HR_1}, \ldots, \rho_{HR_B})$ be the input to the network¹ and $y \triangleq \hat{\rho}_{11}$ the network output, that is, the sharpened SWIR band. The $l$-th ($1 \le l \le 3$) convolutional layer, with $N$-band input $x^{(l)}$, yields an $M$-band output

$$z^{(l)} = w^{(l)} \ast x^{(l)} + b^{(l)},$$

whose $m$-th component is a combination of 2D convolutions:

$$z^{(l)}(m,\cdot,\cdot) = \sum_{n=1}^{N} w^{(l)}(m,n,\cdot,\cdot) \ast x^{(l)}(n,\cdot,\cdot) + b^{(l)}(m).$$

The tensor $w^{(l)}$ is a set of $M$ convolutional $N \times (K \times K)$ kernels with a $K \times K$ spatial support (receptive field), while $b^{(l)}$ is an $M$-vector of biases. These parameters, $\Phi_l \triangleq \{w^{(l)}, b^{(l)}\}$, are learnt during a proper training phase. As in [11], the first and second convolutional layers are followed by a pointwise ReLU activation function $g_l(\cdot) \triangleq \max(0,\cdot)$, yielding the intermediate layer outputs

$$y^{(l)} \triangleq f_l(x^{(l)}, \Phi_l) = \begin{cases} \max\!\left(0,\, w^{(l)} \ast x^{(l)} + b^{(l)}\right), & l < L \\ w^{(l)} \ast x^{(l)} + b^{(l)}, & l = L, \end{cases}$$

whose concatenation gives the overall CNN function

$$y = f(x, \Phi) = f_L(f_{L-1}(\ldots f_1(x, \Phi_1), \ldots, \Phi_{L-1}), \Phi_L),$$

where $x = x^{(1)}$, $y = y^{(L)}$, and $\Phi \triangleq (\Phi_1, \ldots, \Phi_L)$ is the whole set of parameters to learn. In this chain, each layer $l$ provides a set of so-called feature maps, $y^{(l)}$, which "activate" on local cues in the early stages (small $l$) and become more and more representative of global interactions in subsequent ones (large $l$). The network hyper-parameters are summarized in Tab. 1. Model M1 corresponds to the "pure" super-resolution of ρ11, without any additional "guiding" band; M2 uses only the most correlated band as guide, while M5 uses all available high-resolution bands. The last column reports the scope of the overall network function, readily obtained as the cumulative convolutional spread, since the nonlinear ReLU is a pointwise operator which does not increase the scope. These hyper-parameters were selected among several alternative configurations as the optimal choice in terms of complexity and accuracy. It is worth noticing that the overall scope is relatively small compared to that of the CNN pansharpening method of [11], which is 17×17. This should not come as a surprise, since [11] is conceived for a super-resolution ratio which is double ours, therefore requiring in principle major effort to work "equally" well.

¹ In CNN-based super-resolution or pansharpening it is customary to preliminarily upsample the lower-resolution input components with a standard ideal interpolator, e.g. bicubic, so as to align the input stack. For the sake of simplicity we keep the notation ρ11 for this interpolated band which actually feeds the net.
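To make the architecture concrete, the following is a minimal PyTorch sketch of the three-layer network just described (the M2 variant, with ρ11 and ρ8 stacked as input). Layer widths and kernel sizes follow Tab. 1; the class name SRNet and the zero-padding choice are our own assumptions, not part of the paper.

```python
# Minimal sketch (not the authors' code) of the 3-layer CNN of Tab. 1.
# The M2 variant takes 2 input bands (rho11, rho8); zero padding is an
# assumption made here to keep the spatial size unchanged.
import torch
import torch.nn as nn

class SRNet(nn.Module):  # hypothetical name
    def __init__(self, in_bands=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_bands, 48, kernel_size=3, padding=1),  # l=1: 3x3, 48 features
            nn.ReLU(),
            nn.Conv2d(48, 32, kernel_size=3, padding=1),        # l=2: 3x3, 32 features
            nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),         # l=3: 3x3, 1 output band
        )  # three stacked 3x3 layers give the 7x7 overall scope of Tab. 1

    def forward(self, x):
        return self.net(x)

# Usage: rho11 is first upsampled to the 10-m grid (see footnote 1), e.g.
# bicubically, then stacked with the native 10-m guide band rho8.
model = SRNet(in_bands=2)
x = torch.randn(1, 2, 256, 256)   # dummy (rho11_upsampled, rho8) stack
y = model(x)                      # estimated SWIR band, shape (1, 1, 256, 256)
```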
2.1. Learning

In order to train the network parameters $\Phi$, a sufficiently large number of input-output examples and the choice of a suitable cost function to minimize on them are required to run any learning algorithm, such as the Stochastic Gradient Descent (SGD) adopted in [12]. In the pansharpening case [11] it has been proposed to generate training examples through Wald's protocol, which consists in using properly downsampled PAN-MS pairs as input and taking the original MS as the corresponding output. This same approach has been extended to our case, where a selection of high-resolution bands plays the role of the PAN. A high-level description of the training process for model M2 is given in Fig. 2 (left). The resolution-downgraded bands are marked with a downward-arrow superscript.
Fig. 2. Top-level training (left) and inference (right) workflows for model M2. Training: the (ρ11, ρ8) stack is downscaled 2×2, fed to the network with current parameters Φ(n), and the output ρ̂11 is compared with the reference to drive the SGD update Φ(n+1). Inference: the full-resolution stack is fed to the network with the frozen parameters Φ(∞).
Once the network has reached a convergence condition, the current parameters $\Phi^{(\infty)}$ are frozen and ready to be used to perform the super-resolution of the target images (right part of Fig. 2). The training phase is carried out offline once and for all and takes a few hours using GPU cards, while the test can be run in real time. Moreover, we preferred the L1-norm over the L2-norm, as it has proven [12] to be more effective in the error backpropagation. Specifically, the loss is computed by averaging over a suitable set (mini-batch) of training examples at each updating step of the SGD process:

$$L(\Phi^{(n)}) = E\left[\left\lVert \rho_{11}^{(\downarrow)} - \hat{\rho}_{11}(\Phi^{(n)}) \right\rVert_1\right].$$
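To make the workflow of Fig. 2 concrete, here is a hedged sketch of one training step pairing Wald's protocol with the L1 loss and an SGD update. It reuses the SRNet model from the sketch in Section 2; the average-pooling downsampler and the bicubic re-upsampling are our simplifying assumptions, since the paper does not fix these operators.

```python
# Sketch of one SGD step under Wald's protocol (our assumptions: average-pool
# downsampling by 2 and bicubic re-upsampling; not the authors' exact pipeline).
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # model: SRNet above

def training_step(rho11, rho8):
    # rho11: (B, 1, H, W) SWIR at its native 20-m grid;
    # rho8:  (B, 1, 2H, 2W) NIR guide at the 10-m grid.
    rho8_lr = F.avg_pool2d(rho8, 2)    # downgrade guide: 10 m -> 20 m
    rho11_lr = F.avg_pool2d(rho11, 2)  # downgrade target: 20 m -> 40 m
    # Re-upsample the downgraded target to align the input stack (cf. footnote 1).
    rho11_up = F.interpolate(rho11_lr, scale_factor=2,
                             mode='bicubic', align_corners=False)
    x = torch.cat([rho11_up, rho8_lr], dim=1)
    loss = F.l1_loss(model(x), rho11)  # L1 loss against the original SWIR band
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```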
3. EXPERIMENTAL RESULTS

In order to build a sufficiently general dataset for training, we chose three images portraying rather different scenes: Guinea, Tunisia, and Italy (Venice). After setting apart some 450×450 clips for testing, 17×17 training patches were uniformly sampled from the remaining segments of all scenes. Overall, 19000 patches were collected and randomly grouped into mini-batches of 128 samples for the implementation of the SGD-based training. Additional patches were also extracted for the purpose of validation, completing the partition into 70% (training), 15% (validation), and 15% (test).

To assess the performance of the proposed method we resort to three full-reference numerical figures commonly used for pansharpening:

- Q-index, an image quality indicator introduced in [14];
- ERGAS, proposed in [15], which reduces to the root mean square error in the case of a single band;
- HCC, the correlation coefficient between the high-pass components of the reference and its estimate [16].

As these indicators require a reference, reduced-resolution test data are produced through Wald's protocol, just as for the training data.
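For concreteness, below is a sketch of how two of these single-band figures can be computed. It follows the standard definition of the universal image quality index [14] computed globally, and a high-pass correlation in the spirit of HCC; the box-filter high-pass and the global (rather than block-wise) Q-index are our simplifications, not the authors' exact implementation.

```python
# Sketch (our simplifications, not the paper's code) of two single-band
# full-reference quality figures: a global Q-index and a high-pass CC.
import numpy as np
from scipy.ndimage import uniform_filter

def q_index(ref, est):
    # Universal image quality index of Wang & Bovik [14], computed globally.
    mx, my = ref.mean(), est.mean()
    vx, vy = ref.var(), est.var()
    cov = ((ref - mx) * (est - my)).mean()
    return 4 * cov * mx * my / ((vx + vy) * (mx**2 + my**2))

def hcc(ref, est, size=5):
    # Correlation between high-pass components; the high-pass filter used
    # here (image minus a local box-filter mean) is an assumption.
    hp_ref = ref - uniform_filter(ref, size)
    hp_est = est - uniform_filter(est, size)
    return np.corrcoef(hp_ref.ravel(), hp_est.ravel())[0, 1]
```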
Method            Q-index  ERGAS  HCC     CER     L-CER
(ideal value)     (1)      (0)    (1)     (0)     (0)
Bicubic           0.9914   4.992  0.5366  0.0166  0.1876
M1 (proposed)     0.9970   3.036  0.7090  0.0086  0.0909
ATWT-M3 [8]       0.9873   5.949  0.5828  0.0160  0.1762
MTF-GLP-HPM [17]  0.9823   7.245  0.4509  0.0207  0.1370
HPF [7]           0.9922   4.688  0.5832  0.0138  0.1680
M2 (proposed)     0.9975   2.830  0.7718  0.0064  0.0637
M5 (proposed)     0.9983   2.354  0.8500  0.0066  0.0594

Table 2. Accuracy of ρ̂11 (Q-index, ERGAS, HCC) and of the derived water maps (CER, L-CER) at 20-m resolution.

The average numerical results obtained for the three scenes of interest are gathered in Tab. 2 (left part). The proposed pure super-resolution method M1 is compared to the standard bicubic interpolator in the top part of the table. Being average figures, the Q-index and ERGAS do not stress the gain provided by M1 as much as HCC does, since the latter deals with the high-frequency components, which are most affected by the super-resolution and mostly localized on boundaries. Moving to model M2, which takes the additional input band ρ8, it compares favourably against classical pansharpening methods adapted to the Sentinel-2/SWIR problem as suggested in [9]. The last row reports the performance of the proposed method when all four high-resolution bands are added to the input stack. As can be seen, the three additional bands, although less correlated with ρ11, provide a further gain.

The proposed models are also tested from the application point of view, by detecting water basins through the computation of the MNDWI index (I), defined as

$$I = \frac{\rho_3 - \rho_{11}^{(\downarrow)}}{\rho_3 + \rho_{11}^{(\downarrow)}} \quad\text{or}\quad \hat{I} = \frac{\rho_3 - \hat{\rho}_{11}}{\rho_3 + \hat{\rho}_{11}}$$

at 20-m or 10-m resolution, respectively. Once water (W) is detected by suitably thresholding the MNDWI (W = I > α), the classification error rate on the whole image (CER) and locally to boundaries² (L-CER) is computed and reported on the right-hand side of Tab. 2. These figures provide further confirmation of the superiority of the proposed method.

² Boundaries are detected using the morphological gradient.
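As a sketch of this last step, the MNDWI map and water mask can be obtained as follows; the default threshold α = 0 is a customary choice, assumed here since the paper leaves it open, and the CER helper is our own illustration.

```python
# Sketch of MNDWI-based water mapping from the super-resolved SWIR band.
import numpy as np

def mndwi(green, swir, eps=1e-12):
    # Modified NDWI [2]: (green - SWIR) / (green + SWIR); eps avoids 0/0.
    return (green - swir) / (green + swir + eps)

def water_mask(green, swir_sr, alpha=0.0):
    # swir_sr: the 10-m super-resolved SWIR band produced by the network.
    # alpha = 0 is an assumed, customary threshold (W = I > alpha).
    return mndwi(green, swir_sr) > alpha

def cer(mask, ref_mask):
    # Classification error rate over the whole image.
    return np.mean(mask != ref_mask)
```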
Fig. 3. MNDWI estimations over a sample detail (from the Venice image): RGB, ground truth, bicubic (top row); ATWT-M3, MTF-GLP-HPM, HPF (middle row); M1, M2, M5 (bottom row). In order to have a reference ground truth, we applied Wald's protocol (downgraded resolution).

To conclude this section, we show in Fig. 3 some sample results (at 20 m, with reference) which further confirm the effectiveness of the proposed method. An example of full-resolution (10-m) estimation of ρ̂11 is shown in Fig. 1.

4. CONCLUSION

In this work we have introduced a CNN-based fusion method to enhance the spatial resolution of the SWIR component of Sentinel-2 images. Very promising results suggest that the proposed approach deserves further investigation, in particular exploring different architectural choices and/or learning strategies, and extending its application to other spectral bands and/or sensors.

5. REFERENCES

[1] M. Drusch et al., "Sentinel-2: ESA's optical high-resolution mission for GMES operational services," Remote Sensing of Environment, vol. 120, pp. 25-36, 2012.

[2] H. Xu, "Modification of normalised difference water index (NDWI) to enhance open water features in remotely sensed imagery," International Journal of Remote Sensing, vol. 27, no. 14, pp. 3025-3033, 2006.

[3] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295-307, 2016.

[4] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, "Accurate image super-resolution using very deep convolutional networks," in CVPR, 2016, pp. 1646-1654.

[5] V. P. Shah, N. H. Younan, and R. L. King, "An efficient pan-sharpening method via a combined adaptive PCA approach and contourlets," IEEE Trans. Geosci. Remote Sens., vol. 46, no. 5, pp. 1323-1335, May 2008.

[6] Te-Ming Tu, Shun-Chi Su, Hsuen-Chyun Shyu, and Ping S. Huang, "A new look at IHS-like image fusion methods," Information Fusion, vol. 2, no. 3, pp. 177-186, 2001.

[7] P. S. Chavez and J. A. Anderson, "Comparison of three different methods to merge multiresolution and multispectral data: Landsat TM and SPOT panchromatic," Photogramm. Eng. Remote Sens., vol. 57, no. 3, pp. 295-303, 1991.

[8] T. Ranchin and L. Wald, "Fusion of high spatial and spectral resolution images: the ARSIS concept and its implementation," Photogramm. Eng. Remote Sens., vol. 66, no. 1, pp. 49-61, 2000.

[9] Y. Du, Y. Zhang, F. Ling, Q. Wang, W. Li, and X. Li, "Water bodies' mapping from Sentinel-2 imagery with modified normalized difference water index at 10-m spatial resolution produced by sharpening the SWIR band," Remote Sensing, vol. 8, no. 4, p. 354, 2016.

[10] F. Palsson, J. R. Sveinsson, and M. O. Ulfarsson, "Multispectral and hyperspectral image fusion using a 3-D-convolutional neural network," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5, pp. 639-643, 2017.

[11] G. Masi, D. Cozzolino, L. Verdoliva, and G. Scarpa, "Pansharpening by convolutional neural networks," Remote Sensing, vol. 8, no. 7, p. 594, 2016.

[12] Giuseppe Scarpa, Sergio Vitale, and Davide Cozzolino, "Target-adaptive CNN-based pansharpening," arXiv:1709.06054v2, 2017.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1106-1114.

[14] Zhou Wang and A. C. Bovik, "A universal image quality index," IEEE Signal Processing Letters, vol. 9, no. 3, pp. 81-84, March 2002.

[15] Lucien Wald, Data Fusion. Definitions and Architectures: Fusion of Images of Different Spatial Resolutions, Presses de l'Ecole, Ecole des Mines de Paris, Paris, France, 2002. ISBN 2-911762-38-X.

[16] Mohammad Fallah Yakhdani and Ali Azizi, "Quality assessment of image fusion techniques for multisensor high resolution satellite images (case study: IRS-P5 and IRS-P6 satellite images)," 2010.

[17] B. Aiazzi, L. Alparone, S. Baronti, A. Garzelli, and M. Selva, "An MTF-based spectral distortion minimizing model for pan-sharpening of very high resolution multispectral images of urban areas," in 2nd GRSS/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban Areas, May 2003, pp. 90-94.