Satellite image inpainting with deep generative adversarial neural networks

Received Aug 20, 2020; Revised Jan 29, 2021; Accepted Feb 11, 2021

This work addresses the problem of recovering lost or damaged satellite image pixels (gaps) caused by sensor processing errors or by natural phenomena like cloud presence. Such errors decrease our ability to monitor regions of interest and significantly increase the average revisit time for all satellites. This paper presents a novel neural system based on conditional deep generative adversarial networks (cGAN) optimized to fill satellite imagery gaps using surrounding pixel values and static high-resolution visual priors. Experimental results show that the proposed system outperforms traditional and neural network baselines. It achieves a normalized least absolute deviations error of L1 = 0.33 (21% and 60% decrease in error compared with the two baselines) and a mean squared error loss of L2 = 0.15 (29% and 73% decrease in error) over the test set. The model can be deployed within a remote sensing data pipeline to reconstruct missing pixel measurements for near-real-time monitoring and inference purposes, thus empowering policymakers and users to make environmentally informed decisions.


INTRODUCTION
Climate change poses serious challenges that threaten humanity's long-term safety [1]. Addressing these challenges depends on breakthroughs in environmental policy and climate science [2]. Climate research is key to understanding the long-term effects of global warming on agriculture [3][4], food security [5], air quality [6], and weather conditions [7]. One of the primary data sources that empower environmental research is satellite imagery [8]. Satellite programs such as Landsat [9], Sentinel, and Aqua/Terra, among others, provide a wealth of freely available data sets. This progress has unlocked many innovations that extract valuable insights [10][11] from satellite imagery using big data pipelines and advanced machine learning systems [12].
Satellite sensors are limited by their spatio-temporal resolution. A satellite's temporal resolution is the time between successive observations of the same point on Earth, whereas its spatial resolution specifies the surface area covered by one pixel (e.g., Sentinel-2 has an RGB spatial resolution of 10 × 10 m per pixel). For various reasons, most satellite imagery contains "holes" or "gaps" of missing pixel values. Clouds are the primary contributor to such noise. Satellite noise is challenging because it worsens the satellite's effective temporal resolution and introduces uncertainty into atmospheric monitoring pipelines. Many have resorted to using IoT sensors [13] that provide a higher-quality ground-level stream of data [14]. However, ground sensors can only give information about a specific location and, as a result, do not have the geographic coverage that satellites have (most polar satellites cover the whole Earth).
Remote sensors are the primary data source for large-scale atmospheric monitoring and, more specifically, air quality monitoring. Enhancing the spatio-temporal resolution of satellite sensors is of critical importance since it enables greater visibility over the state of the Earth. For this reason, this study focuses on inpainting satellite NO₂ images. Each NO₂ pixel measures the atmospheric NO₂ vertical density (in Dobson units) over the pixel. NO₂ is a trace gas that negatively affects air quality and the climate. It is linked to road traffic and industrial activities such as fossil fuel combustion [15]. A high NO₂ concentration can cause numerous respiratory diseases [16].
This paper proposes a generative adversarial system that fills the missing gaps in images based on the image's content (available pixel values) and high-resolution visual priors. The system's novelty lies in its use of a different data modality (higher-resolution static RGB images) encoded by a conditional layer to provide auxiliary features to the completor network. As a result, the neural system inpaints all future images and pushes the sensor's temporal resolution toward its theoretical limit (nullifying the effects of clouds or sensor errors). The paper's main contributions are:
− A cGAN-based neural system for inpainting multi-spectral satellite imagery.
− A full description of the pre-inference data preprocessing pipeline.
− A case study: the method is evaluated on NO₂ pollution images for near-real-time air quality monitoring, showcasing the potential of fusing multi-modal satellite data using neural approximators.
The rest of the paper is structured as follows. "Related works" describes the most notable research efforts that tackle image inpainting. "Research method" introduces the neural system architecture, the training algorithm, and the data set. "Results and discussion" describes synthetic noise generation, introduces the performance metrics, presents the final results, and provides an intuitive understanding of the effects of priors and their limitations. "Conclusion" summarizes the paper and describes future work.

RELATED WORKS
The existing research literature on image inpainting can be grouped into two main parts: non-learning methods, such as diffusion- and patch-based algorithms, and the relatively recent work that attempts to learn inpainting by training convolutional neural network (CNN) architectures. This section outlines the most notable efforts from both sides.

Diffusion or patch-based methods
The early success in image inpainting is attributed to techniques that propagate information through patch similarity or variational methods. Efros and Leung [17] proposed modeling image textures as Markov random fields and then using similarity search to fill the missing pixels. Other efforts [18][19] were directed toward inpainting images through their texture and structure using search-based guided propagation that synthesizes patterns resembling other image regions or other images within a searchable database.
Variational methods also appear in [20], which uses feature extractors such as patch statistics, colors, and gradients to synthesize the missing image regions. Lastly, out-of-sample inpainting was achieved by [21] using an extensive database of images. Its algorithm inpaints missing regions of an image by finding similar images and then diffusing extracted low-level features. Unlike the others, this technique can suggest multiple completions based on the chosen database item.
This class of methods works well on images that contain repeated or static patterns (for example, sand, grids, or paper) but fails on images with rich semantic content. Furthermore, automatic non-learning algorithms cannot inpaint the abstractions that make complex images cohesive in their content, and their use of out-of-sample information is limited due to their local dependencies.

Learning-based approaches
One of the earliest efforts to use representation learning for image inpainting proposed a multi-layer perceptron (MLP) architecture to fill missing pixels in gray-scale images by minimizing the reconstruction loss [22]. The paper established the importance of masking missing pixels and the potential of neural networks (NNs) in image completion. Furthermore, Xu et al. [23] used a CNN architecture to propose a general method for solving three tasks: image inpainting, denoising, and image degradation recovery.
Recently, neural networks trained using pixel-wise reconstruction error and adversarial loss reported promising results. The work of [24] introduced context encoders to fill large holes in image centers. Yang et al. [25] enabled high-resolution image inpainting by proposing joint content and texture losses. Xu et al. [26] combined local and global discriminators into one network and used convolutions and dilated convolutions to inpaint images. Yu et al. [27] improved the previous architecture by dividing the generation process into two stages. The first outputs a blurry image optimized with a spatially discounted ℒ₁ reconstruction loss, and the second refines and outputs the final image. The authors used the network's output as input to the global and local discriminators and chose Wasserstein GANs (WGAN) to train the neural system (WGAN stabilizes the overall optimization process). Finally, Xu et al. [28] improved the previous architecture by incorporating contextual attention and dilated gated convolutions into both the coarse and refinement networks.
Although the mentioned neural systems provide impressive inpaintings and predict high-quality visual semantics, none have experimented with priors or extended the generator/discriminator with a conditional layer. Furthermore, all of the mentioned methods assume one source data distribution to be modeled, as shown in Table 1. This study establishes the importance of using different data modalities and fusing them through a conditional layer to solve image inpainting in general, and air quality estimation specifically.

RESEARCH METHOD
Two CNN-based network architectures were trained within a conditional adversarial framework, as shown in Figure 1: a generator network, responsible for filling the missing gaps using contextual information and static priors, and an auxiliary discriminator network trained to distinguish between real and completed pollution patches. Both networks are conditioned on true-color imagery that corresponds to the region covering the input patch. The prior is encoded by a conditional layer (the reducer). The input to the generator consists of a damaged image (x) and its high-resolution prior (p). The reducer network compresses p to the same size as x and then stacks both for inpainting. The discriminator network takes either a healthy or a completed image together with its encoded prior and judges whether the image is real or completed.
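The data flow described above (reduce the prior to the size of x, then stack) can be sketched at the level of array shapes. This is only an illustration: the reducer in this work is a learned convolutional network, whereas the sketch substitutes fixed average pooling as a stand-in, and the names `average_pool` and `build_generator_input` are hypothetical.

```python
from typing import List

Image = List[List[float]]  # one channel, stored row by row


def average_pool(channel: Image, factor: int) -> Image:
    """Downsample one channel by averaging non-overlapping factor x factor blocks.

    Assumption: fixed pooling stands in for the learned reducer network.
    """
    height, width = len(channel), len(channel[0])
    assert height % factor == 0 and width % factor == 0
    pooled = []
    for i in range(0, height, factor):
        row = []
        for j in range(0, width, factor):
            block = [channel[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def build_generator_input(x_channels: List[Image], prior_rgb: List[Image]) -> List[Image]:
    """Reduce the prior to the spatial size of x, then stack along the channel axis."""
    factor = len(prior_rgb[0]) // len(x_channels[0])
    reduced = [average_pool(channel, factor) for channel in prior_rgb]
    return x_channels + reduced


# A 64 x 64 damaged patch with its mask channel, and a 256 x 256 RGB prior.
x = [[[0.0] * 64 for _ in range(64)] for _ in range(2)]
p = [[[0.5] * 256 for _ in range(256)] for _ in range(3)]
stacked = build_generator_input(x, p)
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # 5 64 64
```

The stacked tensor has five channels (damaged image, mask, and three reduced prior channels) at the resolution of the pollution patch.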

Convolutional neural networks
The reducer, completor, and discriminator networks are based on convolutional neural networks (CNN). CNNs are a special type of neural network that uses weight sharing to extract hierarchical visual features with minimal free parameters and maximal local connections. Kernel weights are optimized to produce activations that help in the final prediction task. CNNs are capable of progressively extracting higher-order abstractions that serve to minimize a pre-defined objective function. A specific activation is calculated using (1).
Here, X represents the input, K the kernel (a matrix of learnable weights), and s the kernel size. Dilated convolutions are used to simulate a larger receptive field without adding more parameters. To calculate dilated activations, one parameter is added to the previous definition: the dilation factor η, as in (2).
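Equations (1) and (2) are not reproduced in this text. As a minimal sketch, assuming the usual definition a(i, j) = Σₘ Σₙ X(i + ηm, j + ηn) · K(m, n) with m, n ranging over the s × s kernel, no padding, unit stride, and the bias and non-linearity omitted, the activation map can be computed as follows. Setting η = 1 gives the ordinary convolution of (1). The function name `dilated_conv2d` is illustrative.

```python
from typing import List

Matrix = List[List[float]]


def dilated_conv2d(x: Matrix, k: Matrix, dilation: int = 1) -> Matrix:
    """Valid, stride-1 cross-correlation of x with a square kernel k.

    Kernel taps are spaced `dilation` pixels apart (dilation=1 is the ordinary case).
    """
    s = len(k)                      # kernel size (s x s)
    span = dilation * (s - 1) + 1   # side length of the effective receptive field
    out_h = len(x) - span + 1
    out_w = len(x[0]) - span + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for m in range(s):
                for n in range(s):
                    acc += x[i + dilation * m][j + dilation * n] * k[m][n]
            row.append(acc)
        out.append(row)
    return out


# 5 x 5 input holding 0..24 and a 3 x 3 kernel of ones.
x = [[float(5 * i + j) for j in range(5)] for i in range(5)]
k = [[1.0] * 3 for _ in range(3)]
plain = dilated_conv2d(x, k, dilation=1)    # 3 x 3 output, each value sees a 3 x 3 window
dilated = dilated_conv2d(x, k, dilation=2)  # 1 x 1 output, the same 9 weights see a 5 x 5 window
```

With the same nine weights, the dilated kernel covers a 5 × 5 window instead of a 3 × 3 one, which is the sense in which the receptive field grows without adding parameters.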

Conditional generative adversarial networks
Generative adversarial networks (GANs) are a class of neural networks trained in an adversarial manner. A GAN consists of two networks: a generative network G(·) that learns the true data distribution (the process that generated the training data) and a discriminative network D(·) that estimates whether a sample came from the true data distribution or from G(·). Ideally, both G and D are trained simultaneously: G's parameters are adjusted to minimize log(1 − D(G(z))) (i.e., to fool the discriminator), and D's parameters are tuned to maximize log(D(x)) (i.e., to detect fake generated inputs). D and G play the following two-player minimax game with value function V(G, D) as (3).
In the context of this study, visual priors are of higher resolution than pollution patches. Downsampling visual imagery to fit pollution patches will result in losing much of its encoded information. Additionally, from the perspective of a vanilla GAN (i.e., an unconditioned GAN), there is no control over data modes during the inpainting process. Inputting pollution images without pixel-level meta-data will result in a model that mimics a general-purpose interpolator. However, by conditioning the output over its region's visual imagery, the model can produce accurate inpaintings by finding correlations between priors and pollution patches.
As a result, the generative adversarial network is extended with a conditional layer to encode the static priors. The generator and discriminator are provided with high-resolution encoded imagery (priors: p). The objective function of the two-player minimax game is updated as (4).
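Equations (3) and (4) are not reproduced in this text. For reference, the standard forms of the unconditioned and conditional GAN value functions, which agree with the description above, are sketched below. The conditioning notation D(x | p) and G(z | p) is an assumption about how the prior p enters each network, and the expectation subscripts are written informally to avoid a clash with the symbol p.

```latex
\begin{align}
\min_G \max_D V(D, G) &=
  \mathbb{E}_{x \sim \mathrm{data}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim \mathrm{noise}}\big[\log\big(1 - D(G(z))\big)\big]
  \tag{3} \\
\min_G \max_D V(D, G) &=
  \mathbb{E}_{x \sim \mathrm{data}}\big[\log D(x \mid p)\big]
  + \mathbb{E}_{z \sim \mathrm{noise}}\big[\log\big(1 - D(G(z \mid p) \mid p)\big)\big]
  \tag{4}
\end{align}
```

In both forms D is rewarded for assigning high probability to real samples and low probability to generated ones, while G is rewarded for the opposite on its own outputs.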
In a conditional generative adversarial setup, the same condition is provided to both the generator and discriminator networks. Priors are purposely used as conditions to help the generator enhance its completions. For example, one can imagine how useful vehicle traffic density images would be for a model that predicts near-real-time NO₂ concentrations. In this case, RGB images provide low-level information about urban and greenness densities. This study argues that a visual prior could be useful to the task of predicting NO₂ densities over large regions of interest.

Completion network
The completion network takes low-resolution NO₂ images that contain the gaps to be filled, along with a mask channel that indicates which pixels are missing. Each damaged patch has a corresponding high-resolution RGB image that covers the same region and provides gap-free visual information. Two networks were trained. The reducer acts as a learned down-sampler that resizes the high-resolution RGB image (the prior) to the same size as the damaged image. Table 2 presents its layers in successive order. The completor network is a fully convolutional network (FCN) that acts as the main inpainter and is optimized to fill the missing gaps in the input image. Table 3 specifies its layers. The activations of both networks were passed through batch normalization and ReLU after each layer. Without the reducer, the model would learn the unconditioned pollution image distribution, which is not optimal for location-variant patterns. Urban, land, and other visual features serve as strong priors for the completor to generate accurate patches. The completor network was trained using mean squared error (MSE) loss averaged over the masked (gap) pixels.

Discriminator network
The discriminator network is trained to detect completed NO₂ images. A ResNet-18 [30] architecture, shown in Figure 1, is used to extract the feature vector, which is mapped to the probability of the input being completed (fake) or real. Reducer network weights are frozen while optimizing the discriminator since the reducer is optimized for efficient inpainting, not discrimination. The primary role of RGB images in the context of the discriminator is to provide a useful prior that is independent of whether the input image is real or not. Hence, the discriminator is optimized to estimate P(input = completed | p).

Training
The completor network is denoted C(x, p), where x represents a batch of NO₂ images with masks M composed of 0s and 1s, with 1s marking the pixels that are missing in x, and p denotes the priors for each image in x. The priors consist of RGB images of the corresponding regions of interest (ROIs). Similarly, D(x̄, p) designates the discriminator network, where x̄ represents the pollution images (real or completed) and p the encoded priors over the ROIs of x̄.
Mean squared error loss (ℒ₂) is an inpainting loss choice that results in blurred estimations over the gaps. It averages the squared differences between gap pixel predictions and targets as (5).
On the other hand, adversarial loss can be formulated as (6).
Here, C is the completor network, D is the discriminator, x is the damaged input, x̄ is a real or completed image, and p is the prior. The ℒ₂ and adversarial losses are combined to formalize the general optimization problem as (7).
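Equations (5) to (7) are not reproduced in this text. The following is a hedged reconstruction from the definitions above, not the authors' exact formulation. It introduces two symbols that do not appear in the original: y for the healthy target image and α for a weight balancing the two loss terms. It also assumes D outputs the probability that its input is real, as in (3); if D instead outputs the probability that the input is completed, the two logarithmic terms swap roles.

```latex
\begin{align}
\mathcal{L}_2(x, p, M, y) &=
  \frac{1}{\lVert M \rVert_1}\,
  \big\lVert M \odot \big(C(x, p) - y\big) \big\rVert_2^2
  \tag{5} \\
\min_C \max_D \;& \mathbb{E}\big[\log D(y, p)
  + \log\big(1 - D(C(x, p), p)\big)\big]
  \tag{6} \\
\min_C \max_D \;& \mathbb{E}\big[\mathcal{L}_2(x, p, M, y)
  + \alpha \log D(y, p)
  + \alpha \log\big(1 - D(C(x, p), p)\big)\big]
  \tag{7}
\end{align}
```

Because M holds 1s at the missing pixels, the element-wise product restricts the squared error to the gaps, and dividing by the number of gap pixels gives the average described in the text.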
GANs are challenging to train due to the instability between the generator and discriminator networks in the early training phase. For this reason, the training loop is balanced as described in Algorithm 1.
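Algorithm 1 is not reproduced in this text. One common way to balance such a training loop, used by inpainters with global and local discriminators, is a three-phase schedule: the completor is first trained alone on the reconstruction loss, the discriminator is then trained alone against the frozen completor, and finally both are updated together. Whether Algorithm 1 follows this exact schedule is an assumption, and the function and parameter names below are hypothetical.

```python
def training_phase(step: int, completor_only_steps: int, discriminator_only_steps: int) -> str:
    """Return which networks to update at a given training step.

    Assumed three-phase schedule (not taken from the paper's Algorithm 1):
      "completor"     -> update C with the reconstruction loss only
      "discriminator" -> update D only, with C frozen
      "joint"         -> update D and C with the combined objective
    """
    if step < completor_only_steps:
        return "completor"
    if step < completor_only_steps + discriminator_only_steps:
        return "discriminator"
    return "joint"


# Example with illustrative phase lengths of 3 and 2 steps.
schedule = [training_phase(step, 3, 2) for step in range(7)]
```

Warming up the completor before the adversarial phase keeps the discriminator from winning trivially against an untrained generator, which is one source of the early instability mentioned above.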
The method proposed in [26] is chosen as the neural baseline. Its model was trained to produce visually appealing completions for a variety of natural scene images. It serves as a good benchmark because the proposed architecture is an extension of the baseline's modular design.

Data
The European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT) is an international satellite agency responsible for acquiring, preprocessing, and distributing reliable weather, climate, and environmental data. Its low-orbiting satellite series, MetOp, continuously delivers critical climate data. EUMETSAT also distributes data from other partners such as the National Oceanic and Atmospheric Administration (NOAA). EUMETSAT's offline data products are free and available for research purposes.
MetOp is a series of three polar-orbiting meteorological satellites developed by the European Space Agency (ESA) and operated by EUMETSAT. Each MetOp satellite takes about 101 minutes to orbit the Earth, completing roughly 14 orbits a day. Having three satellites enhances the temporal resolution of MetOp as a data provider. MetOp carries a payload of 11 scientific instruments. After the data is transferred, it is preprocessed into multiple levels and fed into numerical simulators for weather forecasting and environmental monitoring. Many of its available data products provide vertical density measurements.
Trace gas measurements were acquired from the Global Ozone Monitoring Experiment-2 (GOME-2) instrument, a scanning spectrometer that provides global monitoring coverage. The near-real-time total column (NTO) product provides concentration measurements for four types of atmospheric trace gases: O₃, NO₂, SO₂, and HCHO. The product has been operational since 01/12/2007, has a spectral resolution of 0.26 to 0.51 nm, and provides global geographic coverage with a spatial resolution of 40 × 40 km (MetOp-A).
Meteosat, in contrast, is a series of geostationary meteorological satellites operated by EUMETSAT. Meteosat Second Generation (MSG) provides images of the full Earth disc and data for weather forecasts. It has a temporal resolution of 15 minutes. The Spinning Enhanced Visible and Infrared Imager (SEVIRI) instrument captures the true-color images at a spatial resolution of 1 × 1 km. The imagery for the prior (p) is collected and preprocessed from MSG's SEVIRI instrument.
MSG covers the ROI of Morocco. The tiles were clipped using the region of interest and merged through pixel-averaging to store a single (mosaic) high-quality image over the ROI.
In this study, Morocco was chosen as the region of interest. All images were filtered to lie in the bounding box [(−5.39,35.54), (−5.34,35.54), (−5.34,35.59), (−5.39,35.59), (−5.39,35.54)] in (longitude, latitude) coordinates. All SEVIRI tiles were taken from 06/2017 to 01/2018. Pollution tiles were acquired and filtered for the same ROI over the period from 03/2018 to 09/2018. The priors were sampled from an earlier timeframe because they will be used to inpaint future patches. Synthetic mask generation is explained in the "RESULTS AND DISCUSSION" section.
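The spatial filter can be sketched as a bounding-box intersection test. The sketch assumes the coordinate pairs of the ring are ordered (longitude, latitude), which places the box in northern Morocco, and that each tile is described by its own bounding box; the helper names and the example tile extents are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]                 # (longitude, latitude), assumed ordering
Bounds = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)

# Closed ring taken from the text: first and last vertices coincide.
ROI_RING: List[Point] = [
    (-5.39, 35.54), (-5.34, 35.54), (-5.34, 35.59), (-5.39, 35.59), (-5.39, 35.54),
]


def ring_bounds(ring: List[Point]) -> Bounds:
    """Axis-aligned bounds of a ring of (longitude, latitude) vertices."""
    lons = [lon for lon, _ in ring]
    lats = [lat for _, lat in ring]
    return min(lons), min(lats), max(lons), max(lats)


def boxes_intersect(a: Bounds, b: Bounds) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    a_min_lon, a_min_lat, a_max_lon, a_max_lat = a
    b_min_lon, b_min_lat, b_max_lon, b_max_lat = b
    return (a_min_lon <= b_max_lon and b_min_lon <= a_max_lon
            and a_min_lat <= b_max_lat and b_min_lat <= a_max_lat)


roi = ring_bounds(ROI_RING)
# Illustrative tile extents, not real product footprints.
tiles = {"tile_a": (-6.0, 35.0, -5.0, 36.0), "tile_b": (0.0, 0.0, 1.0, 1.0)}
kept = [name for name, bounds in tiles.items() if boxes_intersect(bounds, roi)]
```

Only tiles whose extent overlaps the ROI are kept for clipping and mosaicking.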
Let T denote the set of acquired tiles for the region of interest. Each tᵢ ∈ T is processed as follows:
− tᵢ is projected onto a pre-defined static spatial grid to normalize pixel positions.

RESULTS AND DISCUSSION

Noise generation
As opposed to the task of natural image inpainting, where not much consideration is given to the shape of the gap mask, the geometry of missing regions in satellite imagery follows strong patterns (clouds, holes, and lines). The mask generator should reproduce these patterns at training time. As a result, a separate GAN, Gₙ, was trained on the isolated noisy patches to learn the natural noise distribution. Gₙ serves as a noise mask generator that is used to sample artificial gaps during training. For every healthy patch xᵢ ∈ X, a 64 × 64 gap mask is sampled from Gₙ, and a copy of xᵢ is purposely damaged using the mask. The mask is stacked on top of the damaged image; the final 2-channel image represents the input. The original image xᵢ becomes the target used in loss calculation.
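The construction of one training example can be sketched as follows. The mask here is written by hand rather than sampled from the mask generator, the fill value 0.0 for damaged pixels is an assumption, and the function name is hypothetical. Following the convention of the Training section, 1s in the mask mark missing pixels.

```python
from typing import List, Tuple

Image = List[List[float]]


def make_training_pair(healthy: Image, mask: List[List[int]],
                       fill: float = 0.0) -> Tuple[List[Image], Image]:
    """Damage a copy of a healthy patch and stack it with its mask.

    Returns (model_input, target): model_input holds two channels,
    the damaged image and the mask; target is the untouched patch.
    """
    damaged = [
        [fill if m == 1 else value for value, m in zip(value_row, mask_row)]
        for value_row, mask_row in zip(healthy, mask)
    ]
    mask_channel = [[float(m) for m in mask_row] for mask_row in mask]
    return [damaged, mask_channel], healthy


healthy = [[0.2, 0.4], [0.6, 0.8]]
mask = [[0, 1], [1, 0]]  # 1 marks a missing pixel
model_input, target = make_training_pair(healthy, mask)
```

The healthy patch is left unchanged, so it can serve directly as the target in the loss.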

Results
The proposed model was benchmarked against two algorithms: PatchMatch [29], which represents the classical state-of-the-art method, and the LocalGlobal [26] neural inpainter. For training, the model received input images of size 64 × 64 pixels and prior images of size 256 × 256 pixels. The training and testing datasets were extracted using a time-series split.
In natural scene completion, MAE and MSE are not considered good metrics because many completions can be conceptually possible. In the case of pollution images, however, sensor measurements are unique targets. Hence, the model was evaluated using two regression metrics, the first being the least absolute deviations loss as (8).
ℒ₁ averages the absolute differences between the target and predicted gap pixels. Mean squared error (ℒ₂) is also reported. Both metrics measure error as the distance between the model's predictions and the ground-truth normalized targets.
The model achieves a least absolute deviations error of ℒ₁ = 0.33 (21% and 60% decrease in error with respect to the two baselines) and an MSE of ℒ₂ = 0.15 (29% and 73% decrease in error, respectively) over the test set. These results represent a significant performance increase over the PatchMatch and LocalGlobal inpainters. When converted to Dobson units (DU), the errors are ℒ₁ = 0.56 DU and ℒ₂ = 0.41 DU. One DU corresponds to a column density of about 2.69 × 10¹⁶ molecules cm⁻²; the raw measurements range between 0 and 2.5 DU.
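Equation (8) is not reproduced in this text. As a minimal sketch, assuming both metrics are averaged over the gap pixels only (mask value 1), they can be computed as follows; the function name `gap_errors` is illustrative.

```python
from typing import List, Tuple

Image = List[List[float]]


def gap_errors(pred: Image, target: Image, mask: List[List[int]]) -> Tuple[float, float]:
    """Return (L1, L2): mean absolute and mean squared error over gap pixels."""
    abs_sum = 0.0
    sq_sum = 0.0
    count = 0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            if m == 1:  # only pixels that were missing in the input are scored
                diff = p - t
                abs_sum += abs(diff)
                sq_sum += diff * diff
                count += 1
    if count == 0:
        raise ValueError("mask contains no gap pixels")
    return abs_sum / count, sq_sum / count


pred = [[0.5, 0.2], [0.9, 0.4]]
target = [[0.5, 0.6], [0.1, 0.4]]
mask = [[0, 1], [1, 0]]
l1, l2 = gap_errors(pred, target, mask)  # errors come from the two masked pixels only
```

Pixels outside the gaps are ignored, so a model is not rewarded for copying values it was already given.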

Discussion
The main contributor to the increase in performance is the conditional layer. An ablation study was conducted by removing the conditional layer and training the GAN without visual priors. The resulting ℒ₁ and ℒ₂ scores were slightly worse than those of LocalGlobal [26] (last column of Table 4). Figure 2 showcases the effects that visual priors can have on pollution predictions. The model predicts higher NO₂ concentrations in the city of Casablanca without processing high-density neighboring pixels (top row in Figure 2). It predicted high NO₂ concentrations by relying solely on the RGB prior. Such patterns are noticeable in other urban and industrial cities in Morocco. However, in the bottom row, the model failed to predict a high pollution concentration over the region of Taourirt. That could have been the result of the temporal nature of industrial activities and the noise that is inherent to sensor measurements. The average increase in performance indicates that the model learned to associate certain visual features with high or low pollution densities. This study showcases the benefits of multi-modal learning in computer vision. Although the prior does not provide population, time-dependent industrial activity estimates, or traffic information, it increased the overall performance by simply providing high-quality visual information. The reducer also played a critical role in transforming the priors in a way that was useful for inpainting the missing regions. The proposed model can also be used as a data enhancer for downstream tasks by removing cloud effects. Downstream applications include object detection [31], land cover generation [32], land use classification [33], change detection, crop monitoring, and field and urban mapping, among others.

CONCLUSION
This paper proposed a neural image inpainter capable of filling measurement gaps using neighboring pixel values and static priors. It described the data preprocessing pipeline, the CNN-based subnetworks, and the training process. The neural system was successfully trained to fill pollution gaps over the region of Morocco. It outperformed two prior-less baselines and showed the potential of data fusion for satellite image inpainting. The described neural system can be deployed within a remote sensing pipeline to fill incoming satellite patches, resulting in greater near-real-time visibility over weather, atmospheric, and climate conditions. The system mitigates the effects of sensor perturbations and clouds. One limitation is that enhancing temporal resolution alone is not enough to offer a competitive alternative to IoT-based devices for small-scale monitoring; the limited spatial resolution of satellite sensors remains the most important open challenge in remote sensing. Nevertheless, the model can be modified to tackle super-resolution through prior learning. Such a system could enhance the satellite's spatial resolution by using low-resolution source imagery and high-resolution static priors.