Closing the domain gap: blended synthetic imagery for climate object detection

Abstract
Accurate geospatial information about the causes and consequences of climate change, including energy systems infrastructure, is critical to planning climate change mitigation and adaptation strategies. When up-to-date spatial data on infrastructure is lacking, one approach to fill this gap is to learn from overhead imagery using deep-learning-based object detection algorithms. However, the performance of these algorithms can suffer when applied to diverse geographies, which is a common case. We propose a technique to generate realistic synthetic overhead images of an object (e.g., a generator) to enhance the ability of these techniques to transfer across diverse geographic domains. Our technique blends example objects into unlabeled images from the target domain using generative adversarial networks. This requires minimal labeled examples of the target object and is computationally efficient such that it can be used to generate a large corpus of synthetic imagery. We show that including these synthetic images in the training of an object detection model improves its ability to generalize to new domains (measured in terms of average precision) when compared to a baseline model and other relevant domain adaptation techniques.


Impact Statement
Existing methods of gathering information about energy and climate-related infrastructure and their impacts rely on self-reported information from organizations and governments, large-scale surveys, or crowd-sourced information. Object detection using overhead imagery offers a fast and low-cost way of identifying climate-related objects and other characteristics visible from above. These techniques, if scaled up, could democratize access to climate and energy infrastructure information cheaply and globally. Yet, as we show, the performance of object detection models can suffer when applied to geographically different regions, hindering the wider application of these models. To aid the broader applicability of detecting important climate infrastructure, we propose a domain adaptation technique that uses easy-to-generate synthetic overhead images across a range of diverse geographies. We show that this technique outperforms many alternative domain adaptation techniques, is simple to implement, and can be widely applied to other object detection problems.

Introduction
From power plants to wildfires, many of the causes and consequences of climate change are visible from overhead imagery. Accurate geospatial information about the causes of climate change, including energy infrastructure systems, is critical to planning climate change mitigation and adaptation strategies. However, spatial data on current energy infrastructure is often lacking. The data may not be publicly available, may be incomplete, or may not be of a sufficiently high resolution (Stowell et al., 2020). Recent research has demonstrated the potential of using satellite imagery to fill the data gaps by monitoring energy systems at unprecedented frequencies and scale (Donti and Kolter, 2021; Ren et al., 2022). Two remaining challenges to detecting climate objects at scale include (1) a lack of large datasets with labeled data for relevant applications, and (2) the difficulty of applying these techniques across diverse geographic domains.
The first roadblock is the substantial number of labeled examples required to successfully train a machine learning model. Building a large dataset of overhead images can be challenging because of the labor-intensive nature of sifting through images and accurate data labeling (Tan et al., 2018). When the object of interest is rare and few instances may exist, the difficulty of detecting such an object is exacerbated. This is common for data related to energy and climate systems.
The second challenge is aligning the distribution of training images and testing images. It is frequently the case that the training and testing samples are not independently and identically distributed. The divide in the distribution of the images we use to train versus those we use to test our models is known as the domain gap. For example, we may only have training samples of an object in arid landscapes, but we might want to detect the object in mountainous or forestland landscapes as well. When the training samples come from different distributions than the testing samples, object detection performance deteriorates. Deep neural network models can be unreliable in cases where a domain gap exists (Tuia et al., 2016). Throughout this paper, we use the term geographic domain to refer to a landscape or region with an underlying distribution of visual properties that is regionally similar, but potentially distinct from other regions. When a model is trained on samples that originate from one geographic domain and tested on another, object detection or segmentation performance may deteriorate; it is this potential for deterioration that we refer to as a domain gap. Example images from diverse geographic domains can be found in Figure 2.
In their work, Tuia et al. (2016) lay out four general strategies for domain adaptation methods for remote-sensed classification: invariant feature selection, adapting the data distributions, adapting the model/classifier, and selective sampling. These strategies can be combined or used separately for domain adaptation. Our contribution is a new technique that falls under the category of adapting data distributions proposed by Tuia et al. Changing the data distributions can mean expanding or transforming the training set in a way that enhances the ability of the algorithm to perform well on alternative domains (Shorten and Khoshgoftaar, 2019). In remote sensing applications, this enables the overhead imagery in a training dataset to become more representative of the data in the testing domains. This category of techniques also includes incorporating additional synthetic imagery to help close the domain gap by exposing a model to a broader variety of data (e.g., Synthinel-1, Kong et al., 2020; SIMPL, Hu et al., 2021; Xu et al., 2022).
Image transformation techniques are often color or pixel-based transformations, which leave image content unchanged but vary the color of pixels in an image, mapping pixels of each color to a new value. For example, histogram equalization adjusts image pixel intensity values such that they follow a uniform distribution to standardize image appearance regardless of domain. These pixel-wise techniques aim to better align the training and testing image sets. These techniques have the benefit of being able to transform a dataset in place without additional labels, while alternative techniques generate unique synthetic images to add to the training set. Generative adversarial network (GAN)-based synthetic generation methods may modify both the pixel values and the information content of an image. These are neural models that learn a highly complex and expressive mapping between two domains. Examples of GAN-based models that have been used to generate synthetic training samples include CyCADA (Hoffman et al., 2018) and CycleGAN (Zhu et al., 2017).
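As an illustration of the pixel-wise mapping described above, the following is a minimal sketch of histogram equalization on a single channel of 8-bit intensities. It is not the implementation used in the experiments; the function name `equalize_histogram` and the flat-list image representation are assumptions made for the example.

```python
def equalize_histogram(pixels, levels=256):
    """Remap integer intensities so their cumulative distribution is roughly uniform.

    `pixels` is a flat list of integers in [0, levels - 1] (one image channel).
    """
    n = len(pixels)
    # Count how often each intensity occurs.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution: number of pixels with intensity <= v.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    if n == cdf_min:
        # A constant image has nothing to spread out; return it unchanged.
        return list(pixels)
    # Stretch the CDF so the darkest value maps to 0 and the brightest to levels - 1.
    scale = (levels - 1) / (n - cdf_min)
    return [round((cdf[p] - cdf_min) * scale) for p in pixels]


print(equalize_histogram([10, 20, 30, 40]))  # → [0, 85, 170, 255]
```

The mapping depends only on the rank of each intensity within the image, so pixel ordering is preserved and no labels are required, which is why such transformations can be applied to a dataset in place.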
In this work, we confine our comparisons to transformation and augmentation-based approaches. Within the subclass of color-based image transformation techniques, we compare to popular methods such as the gray world mapping (which assumes each color channel averages to gray) (Kanan and Cottrell, 2012), histogram matching (Abramov et al., 2020), and color equalization (Mustafa and Kader, 2018). Among GAN-based synthetic image generation techniques, we compare to CycleGAN and CyCADA. We chose CycleGAN and CyCADA because our experiments simulate the detection of a rare climate object with minimal access to labeled data from the source domain and no labeled data from the target domain.
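The gray world assumption mentioned above can be made concrete with a short sketch: each color channel is rescaled so that its mean matches the mean over all channels. This is a generic illustration of the idea, not the exact procedure used in the comparison; the function name `gray_world` and the list-of-tuples image format are assumptions.

```python
def gray_world(pixels):
    """Rescale each RGB channel so its mean equals the overall (gray) mean.

    `pixels` is a list of (r, g, b) tuples with values in 0-255.
    """
    n = len(pixels)
    # Mean intensity of each channel.
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    # Under the gray world assumption all three means should equal this value.
    gray = sum(means) / 3
    # Per-channel gain; leave an all-zero channel untouched.
    gains = [gray / m if m > 0 else 1.0 for m in means]
    # Apply the gains and clamp to the valid 8-bit range.
    return [
        tuple(min(255, round(p[c] * gains[c])) for c in range(3))
        for p in pixels
    ]


# An image with a strong blue cast is pulled back toward neutral gray.
print(gray_world([(100, 50, 150), (100, 50, 150)]))  # → [(100, 100, 100), (100, 100, 100)]
```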

Dataset creation
To train our object detection models and test our domain adaptation technique, we created a dataset of overhead images containing wind turbines. Wind turbines were selected as an example of climate infrastructure for three reasons. First, they are relatively homogeneous in appearance, which helps to minimize intra-class variance and aid object detection. Second, they are found in a diverse variety of geographies and contexts (mountains, fields, etc.). Finally, they are relatively rare in occurrence, as the official 2021 U.S. Wind Turbine Database (Hoen et al., 2018) contains only 68,714 turbines across the entire United States.
We collected geographically diverse images from the Northwest, Southwest, and Eastern Midwest regions of the United States. We selected these regions, shown in Figure 2, with the intention of creating visually distinct domains. Since visual distinctiveness does not guarantee the presence of a domain gap, we also verified the presence of a domain gap experimentally (see Figure 6).
In each domain, we collected a set of training and testing images containing wind turbines using coordinates from the U.S. Wind Turbine Database. These covered the three geographies mentioned earlier (Northwest, Southwest, and Eastern Midwest regions of the United States). Overhead images over the selected coordinates were collected from the National Agriculture Imagery Program dataset (NAIP, 1 meter per pixel resolution) using Google Earth Engine (Gorelick et al., 2017).
To ensure that the training and testing sets had similar levels of variation in each domain, we used stratified geographic sampling. Specifically, the coordinates of the labeled wind turbines from each region were first clustered using DBSCAN (Ester et al., 1996), and the training and testing coordinates were then selected using stratified random sampling to ensure representative sampling within each region. This avoided, for example, having a Northeast domain with training data only from Massachusetts and testing data only from New York.
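The stratified step can be sketched as follows. The example assumes that cluster labels have already been produced by a clustering algorithm such as DBSCAN, and it draws the same fraction of coordinates from every cluster for the test set. The function name `stratified_split`, the `test_fraction` value, and the toy coordinates are illustrative assumptions, not values taken from the paper.

```python
import random
from collections import defaultdict


def stratified_split(coords, cluster_ids, test_fraction=0.5, seed=0):
    """Split coordinates into train/test so every cluster contributes to both.

    `coords` is a list of (lat, lon) tuples and `cluster_ids` the matching
    cluster label for each coordinate (e.g., the output of DBSCAN).
    """
    rng = random.Random(seed)
    # Group coordinates by the cluster they belong to.
    by_cluster = defaultdict(list)
    for coord, cid in zip(coords, cluster_ids):
        by_cluster[cid].append(coord)
    train, test = [], []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        rng.shuffle(members)
        # Take the same fraction from each cluster (stratum) for testing.
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy example: two spatial clusters of four turbines each.
coords = [(45.0 + 0.01 * i, -120.0) for i in range(4)] + \
         [(46.0 + 0.01 * i, -118.0) for i in range(4)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
train, test = stratified_split(coords, labels)
print(len(train), len(test))  # → 4 4
```

Because each cluster is split separately, neither the training nor the testing set can end up drawn from a single sub-area of the region.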
Since we were using the coordinates of wind turbines for capturing overhead images, we wanted to ensure that the turbine was not in the center of every image in the dataset. To avoid this issue, we shifted the coordinate center of each image uniformly randomly up to 75 meters both horizontally and vertically. The dimensions of each of the final images are 608 × 608 pixels.
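A minimal sketch of this jitter is shown below. It reads "uniformly randomly up to 75 meters" as two independent uniform draws in [-75, 75] meters, which is an interpretation rather than a documented detail, and it converts meters to degrees with a common spherical approximation (about 111,320 meters per degree of latitude). The function name `jitter_center` and the sample coordinate are hypothetical.

```python
import math
import random

# Approximate length of one degree of latitude in meters (an approximation).
METERS_PER_DEGREE_LAT = 111_320.0


def jitter_center(lat, lon, max_shift_m=75.0, rng=random):
    """Shift an image center by independent uniform offsets in x and y."""
    dx = rng.uniform(-max_shift_m, max_shift_m)  # east-west offset in meters
    dy = rng.uniform(-max_shift_m, max_shift_m)  # north-south offset in meters
    new_lat = lat + dy / METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    new_lon = lon + dx / (METERS_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return new_lat, new_lon, dx, dy


rng = random.Random(42)
new_lat, new_lon, dx, dy = jitter_center(45.6, -120.8, rng=rng)
print(f"shifted by {dx:.1f} m east, {dy:.1f} m north")
```

At 1 meter per pixel, a 608 × 608 pixel image extends 304 meters from its center in each direction, so a shift of at most 75 meters keeps the turbine at the original coordinate inside the frame.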
Finally, each image was quality-checked manually to ensure that wind turbines were present in the image and that the training and test datasets had no overlap.
We also collected a supplementary set of images without turbines for each domain, which we call "background" images. Our synthetic image generation technique uses these unlabeled images as a canvas on which to blend the object of interest. We captured images at a distance between four and six kilometers away from a known turbine location within the same domain, which were manually inspected to ensure that no wind turbines were present. The close distance was chosen to ensure visual similarity to the domain. The total distribution of images across domains and the number of labeled wind turbines contained within those images used for training, image generation, and validation can be referenced in Table E1 in Appendix E.

Generation of synthetic images
Our image generation process aims to produce synthetic images that are as similar as possible to real labeled data from any potential target domain. We use a pre-trained GP-GAN image blending model from Wu et al. (2019) to blend background target domain images together with source domain target objects. The GP-GAN model was pre-trained on webcam images depicting diverse seasonal, weather, and lighting settings from the Transient Attributes dataset by Laffont et al. (2014). Our process consists of four steps, as shown in Figure 3:
1. Sample a random background image from the target domain.
2. Sample objects from a set of labeled source domain objects (in our case wind turbines) from any domain.
3. Randomize the location, orientation, and size of the objects.
4. Blend the randomized objects into the target domain background image using GP-GAN.
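The four steps can be sketched in code as follows. This is a simplified illustration under several assumptions: the blending model is represented by a caller-supplied `blend_fn` placeholder rather than the real GP-GAN interface, orientation is restricted to multiples of 90 degrees so that bounding boxes stay axis-aligned, object size is left unchanged, and objects are placed so that they do not overlap. All names (`overlaps`, `place_objects`, `synthesize`) are hypothetical.

```python
import random


def overlaps(a, b):
    """Return True if two axis-aligned boxes (x, y, w, h) intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def place_objects(bg_w, bg_h, sizes, rng, max_tries=1000):
    """Rejection-sample non-overlapping top-left positions for each (w, h)."""
    boxes = []
    for w, h in sizes:
        for _ in range(max_tries):
            box = (rng.randint(0, bg_w - w), rng.randint(0, bg_h - h), w, h)
            if not any(overlaps(box, other) for other in boxes):
                boxes.append(box)
                break
        else:
            raise RuntimeError("could not place all objects without overlap")
    return boxes


def synthesize(backgrounds, objects, blend_fn, n_objects=3, bg_size=608, seed=None):
    """Produce one synthetic image and the bounding boxes of the blended objects."""
    rng = random.Random(seed)
    background = rng.choice(backgrounds)                      # step 1
    chosen = [rng.choice(objects) for _ in range(n_objects)]  # step 2
    angles = [rng.choice((0, 90, 180, 270)) for _ in chosen]  # step 3: orientation
    # A 90 or 270 degree rotation swaps the width and height of the crop.
    sizes = [(o["h"], o["w"]) if a in (90, 270) else (o["w"], o["h"])
             for o, a in zip(chosen, angles)]
    boxes = place_objects(bg_size, bg_size, sizes, rng)       # step 3: location
    image = blend_fn(background, chosen, angles, boxes)       # step 4 (placeholder)
    return image, boxes


# Usage with a stand-in blender that only records what it was asked to blend.
def fake_blend(bg, objs, angles, boxes):
    return {"background": bg, "n_objects": len(objs)}

image, boxes = synthesize(["bg_0", "bg_1"], [{"w": 40, "h": 60}], fake_blend, seed=1)
print(image["n_objects"], len(boxes))  # → 3 3
```

Because the placement is chosen by the generator, the bounding boxes of the blended objects are known without any manual annotation of the synthetic image.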
One benefit of our technique is its ability to generate synthetic images while using only a few examples of the object. Namely, a small number of target object examples can be blended into as many contexts as desired. Second, the synthetic images can be stylized to look similar to any target domain given only unlabeled images from that domain. Acquiring unlabeled datasets of overhead imagery is an easier task than manually labeling a dataset from the target domain. Although we manually inspected the background images to ensure that no turbines were present, for rare objects, it may be reasonable to assume that randomly captured snapshots from the target domain do not include the objects. In such cases, images from the target domain could be directly used as backgrounds without manual inspection. A final advantage is our technique's customizability along important dimensions, such as the number of object instances blended in. This stands in contrast to fixed color-mapping techniques, which simply transform an image in place and cannot control the number of object instances. Other controllable hyper-parameters include the spacing, size, and scale of the objects. Overall, our technique has unique advantages that could make it beneficial for domain adaptation contexts where the object is rare.
In our experiments, we used the background images from each domain as our GP-GAN canvases. We blended turbine examples from the source domain into target domain backgrounds. Our turbine examples included the shadows, as we thought this could give the object detection algorithm important contextual information. We customized the size and placement of the synthetically added turbines such that the wind turbines never overlapped one another, the size of the examples remained unchanged, and each synthetic image contained three turbines. Three turbines were chosen because we wanted the synthetic images to include ample examples to learn from and, for each source domain, three was the 90th percentile of the number of turbines in our training images. Some of the generated synthetic images contain artifacts including artificially bright spots and blurred blending borders or turbines, as can be seen in Figure 4. While not all of the data are visually perfect, this work evaluates whether or not these synthetic images are effective in overcoming performance differences between domains from an object detection perspective. Additional refinements to the image blending process would likely further enhance the performance improvements achieved through this work.

Experimental setup
Our experiments trained object detection models to detect wind turbines across a variety of geographic domains and compared the performance of an object detection model trained using synthesized data with that of models trained using other image transformation and augmentation approaches. Many types of infrastructure are quite rare, so this work focused on investigating applications where the objects were also rare. These could either be objects that are rare in a global sense, if the object is unlikely to be found at a randomly sampled point on Earth (a stricter requirement), or rare regionally within one domain (a likely situation to encounter when regional, high-resolution satellite imagery is expensive to collect).
While climate- and energy-relevant infrastructure and resources are often rare in the context of imagery for any given region, we made the assumption that we would be able to acquire more images from some domain, even if that is typically not the target domain, and selected that number to be 100, which is on the high end of past studies. For example, in Martinson et al. (2021), 10-50 images are used for training in the context of "rare" objects, and Wang et al. (2019) used between 10 and 30 images in their investigations. Many few-shot techniques use as few as 1-10 instances, such as in Wang et al. (2020), while the highest that we have seen was from Xu et al. (2022), which varied from 0 to 151.
To simulate detecting a rare object, we assumed access to a small corpus of labeled wind turbine examples in the training domain and no additional wind turbine labels from the target domain. Each experiment tested all of the possible source/target domain pairings across the three domains: Northwest, Eastern Midwest, and Southwest, for nine pairings in total. For each pair of domains, we trained and tested a YOLOv3 model five separate times to account for variation in training and to estimate model variance.
For an experimental baseline, we evaluated model performance using the training data for each pair of domains without using any domain adaptation methods. The baseline experiments were trained on 100 labeled source domain images and tested on 100 labeled target domain images across all domain pairs. Then we compared our approach and other domain adaptation techniques involving image transformation or augmentation to the baseline experiments. We evaluated whether the addition of these different types of supplementary imagery (including our synthetic images) could improve wind turbine detection performance across domains. These additional experiments were trained on 100 labeled source domain images, supplemented with an additional 100 images from the domain adaptation technique under evaluation, and were tested on 100 labeled target domain images. The experimental configurations are shown in Figure 5. More information about the number of images and turbine labels used from each domain can be found in Appendix E. We differentiated between within-domain experiment trials where the detection model was trained and tested on imagery from the same geographic domain, versus cross-domain experiment trials in which the model was trained on one domain and tested on a different domain. As an example, in an experiment, a within-domain trial may be trained on images of wind turbines from the Northwest U.S. and validated on other images from the Northwest. In contrast, a cross-domain trial may be trained on images from the Northwest and tested on images from the Southwest. We expected object detection performance to suffer in cross-domain contexts due to the presence of a domain gap.

Figure 5. The experimental setup. First, a pairing of a source and a target domain is selected. For each pairing, a baseline experiment is run as a benchmark by using labeled source domain images and labeled test images from the target domain. Then, a series of domain adaptation experiments are run that augment the baseline training set with additional supplemental images produced from a domain adaptation method.

For our wind turbine object detection model, we used a YOLOv3 model architecture with spatial pyramid pooling (Redmon and Farhadi, 2018). For each experimental trial, a YOLOv3 model was trained from scratch on the available training images.
Lastly, an important note on the experimental design: we used a fixed mixed batch ratio of real-to-synthetic data for all the experiments in this work, with a mini-batch size of eight images. This allowed us to control exactly how many training and supplemental images were in each mini-batch and thus the relative influence of each image set. In a given mini-batch, seven of the images were from the baseline set while one image was from the supplementary set (the images generated through the domain adaptation method). One of the challenges we observed with this method is that synthetic data was not a perfect replacement for real data: if we used all synthetic data rather than real data, the cross-domain performance dropped substantially, as shown in Figure B1 in Appendix B. We selected a real-to-synthetic mixed batch ratio of 7-to-1 a priori to ensure the presence of synthetic data in every mini-batch. We also varied the number of synthetic images in each mixed batch for GP-GAN training and, as seen in Figure B1 in Appendix B, we found that between 1 and 7 synthetic images per mini-batch were effective, while 0 or 8 (no synthetic, or all synthetic) performed markedly worse.
In the mixed batch setup, we defined an epoch as a fixed number of images seen. Since our typical experimental setup consisted of 100 training and 100 supplemental images, we defined an epoch as having passed 200 images, or 25 mini-batches, through the training process. Each model was trained over 300 epochs.
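The mixed batch scheme and the epoch definition can be illustrated with a short sketch. The 7-to-1 ratio, the mini-batch size of eight, and the 200 images per epoch come from the text; how images were drawn within each set is not stated, so sampling without replacement inside each mini-batch is an assumption, as is the generator name `mixed_batches`.

```python
import random


def mixed_batches(real, synthetic, n_real=7, n_synth=1, images_per_epoch=200, seed=0):
    """Yield the mini-batches of one epoch with a fixed real-to-synthetic ratio."""
    rng = random.Random(seed)
    batch_size = n_real + n_synth                 # 7 + 1 = 8 images
    n_batches = images_per_epoch // batch_size    # 200 // 8 = 25 mini-batches
    for _ in range(n_batches):
        # Assumed sampling scheme: draw without replacement within a batch.
        batch = rng.sample(real, n_real) + rng.sample(synthetic, n_synth)
        rng.shuffle(batch)
        yield batch


real = [("real", i) for i in range(100)]
synthetic = [("synthetic", i) for i in range(100)]
batches = list(mixed_batches(real, synthetic))
print(len(batches), len(batches[0]))  # → 25 8
```

Under this sketch, an epoch of 25 mini-batches contains 175 draws from the real set and 25 draws from the supplementary set, so fixing the ratio per mini-batch also fixes the relative weight of the two image sets over the whole epoch.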
We evaluated model performance using average precision (AP). We also reported 95% confidence intervals (CI) constructed with t-distributions alongside AP results. Where applicable, these intervals may represent the average of multiple confidence intervals; see Appendix C for further details on variability reporting.
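As a worked illustration of a t-based interval for a small number of repeated runs, the sketch below computes a mean and a 95% half-width from five AP values. The AP values are made up for the example and are not results from the paper; the critical values are standard two-sided 95% Student-t quantiles, and the function name `mean_with_ci` is hypothetical. The exact procedure used for the reported intervals is the one described in Appendix C.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def mean_with_ci(values):
    """Return (mean, half_width) of a 95% t-interval for a small sample."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = T_CRIT_95[n - 1] * sem
    return mean, half_width


# Hypothetical AP values from five training runs of one domain pairing.
ap_runs = [0.78, 0.80, 0.79, 0.81, 0.77]
mean, half_width = mean_with_ci(ap_runs)
print(f"AP = {mean:.3f} +/- {half_width:.3f}")
```

With only five runs there are four degrees of freedom, so the critical value (2.776) is noticeably larger than the normal-approximation value of 1.96, which widens the interval accordingly.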

Experimental results
We experimentally demonstrated the presence of a domain gap through the results of the baseline experiment, which can be seen in Figure 6. On average, the cross-domain AP performance of our object detection model was 12% worse than the within-domain AP performance for each test domain. A full comparison of results between the baseline experiment and our synthetic images can be found in Table 1. Overall, baseline within-domain trials achieved an average AP of 0.901 while baseline cross-domain trials achieved an average AP of 0.791. For each domain pair, the addition of synthetic images improved AP. This was especially true in a cross-domain context: on average, synthetic cross-domain trials achieved a 0.065 higher AP than baseline cross-domain trials, while synthetic within-domain trials achieved a 0.020 higher AP than baseline within-domain trials. Detecting turbines in the Southwest was especially challenging compared to other domains; we observed many small and clustered wind turbines in this domain and hypothesize that this could have contributed to weaker object detection performance.
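The percentages and the absolute AP differences quoted in this section can be related with a small calculation. The sketch below assumes that the percentages are relative changes with respect to the baseline AP, which is an interpretation that happens to reproduce the reported figures; the helper name `relative_change` is hypothetical.

```python
def relative_change(new, old):
    """Fractional change of `new` relative to `old`."""
    return (new - old) / old


within_ap = 0.901   # baseline within-domain average AP
cross_ap = 0.791    # baseline cross-domain average AP

# Domain gap: cross-domain relative to within-domain performance.
gap = relative_change(cross_ap, within_ap)

# Gains from adding synthetic images (+0.065 and +0.020 AP, respectively).
cross_gain = relative_change(cross_ap + 0.065, cross_ap)
within_gain = relative_change(within_ap + 0.020, within_ap)

print(f"{gap:.1%} {cross_gain:.1%} {within_gain:.1%}")  # → -12.2% 8.2% 2.2%
```

Read this way, the 0.110 AP difference between within-domain and cross-domain baselines corresponds to the roughly 12% gap, and the 0.065 and 0.020 AP gains correspond to the 8.2% and 2.2% improvements reported for cross-domain and within-domain pairings.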
We also compared our synthetic image blending technique to other domain adaptation techniques including histogram equalization, histogram matching, gray world, CyCADA, and CycleGAN (see footnote 2). In addition to comparing each of the above techniques to each other and the baseline, we incorporated two further variations of the baseline experiment. First, to estimate an upper bound on the performance we might be able to achieve when training with 100 supplemental images, we supplemented the baseline training dataset with 100 additional real images from the target domain. Second, since we added 100 training samples of augmented or synthetic imagery in every experiment with a domain adaptation technique, we wanted to test whether simply adding unlabeled imagery improved performance. If this were true, then the domain adaptation techniques themselves would not be what was enhancing performance. To investigate this, we supplemented the baseline experiment with 100 target domain background images (unlabeled target domain images without wind turbines present; these were the same images used for image blending in our synthetic case, but without having blended target objects).
The results of all these experiments are shown in Figure 7, with the estimated upper and lower-bound experiments shown in light gray. These experiments collectively tested whether adding each set of supplemental images improved cross-domain model performance when faced with highly limited training data availability (as is common in rare object detection, and often the case in energy and climate applications). The results indicate that adding synthetically blended imagery produced the greatest improvement in average precision of the techniques compared in this study. On average, our synthetic image blending technique outperformed baseline trials by 8.2% in cross-domain pairings and 2.2% in within-domain pairings. A full table of results including the within-domain pairings can be found in Appendix A. Many design choices need to be made in these experiments, including the number of synthetic images to include in the analysis, the number of objects per synthetic image, and the total number of labeled objects from the source domain available to the synthetic image blending tool. We investigated the sensitivity of the results to these parameters by varying each of them for one pair of domains: the Northwest and the Southwest. Those experimental results can be found in Appendix D. We found that as long as some synthetic objects were included in the training process, we observed a performance improvement, but the results were not exceptionally sensitive to any of these parameters.
2 CycleGAN and CyCADA were trained from scratch for each domain mapping (e.g., from the Southwest to the Northwest domain). The models were trained using 100 background and 100 real images from the source domain and 100 background images from the target domain. CycleGAN and CyCADA learn a bidirectional mapping between a pair of domains. Thus, neither CycleGAN nor CyCADA is applicable beyond the specific domain pairing on which it was trained. Since these methods require data for each new application, and such data would not generally be available in the context of rare objects, we limited their training data to the same data that all of the other algorithms received to ensure a fair comparison.
Table 1 note. Each experiment was run five times and the mean trial result along with 95% confidence interval widths are shown. Bold results indicate the higher result when comparing between the baseline and synthetic experiments.

Conclusions
The automated mapping of climate-relevant infrastructure from overhead imagery can help to fill critical information gaps, but global applications have been impeded by the challenges of geographic domain adaptation and a lack of training data. To combat this, we proposed a computationally inexpensive synthetic imagery generation technique designed to work with minimal labeled examples and help downstream models become more generalizable. Our approach uses GP-GAN to blend real images of an object onto unlabeled background images from the target domain. Our experiments provide evidence that supplementing training data with synthetically blended imagery can improve domain adaptation while requiring minimal time, no proprietary software, and few labeled object examples. This may aid in scaling up automated mapping to larger applications.
Future work could further refine our synthetic image generation methodology by adjusting the parameters of the image blending process and/or applying post-processing to generate even more realistic and well-blended synthetic images. Additionally, this approach could be applied to more energy and climate objects for further evaluation. Lastly, this approach could be tested beyond one-to-one domain pairings to evaluate the technique in one-to-many, many-to-one, or many-to-many contexts.

Figure 7. Experimental results displaying the averaged 95% confidence intervals among cross-domain pairs. See Appendix C for more on how the intervals were constructed.

While many other GAN-based techniques require labeled examples from both domains (a supervised context), CycleGAN and CyCADA do not have this requirement. Example images generated using each technique we compare to can be found in Figure 1. We contribute to the field of domain adaptation in remotely sensed data by proposing a new technique to generate synthetic overhead images that requires minimal labeled examples of the target object. Our technique takes advantage of unlabeled overhead images to reduce the number of labeled examples required. Unlabeled remotely sensed images are often easily acquirable through public datasets such as the National Agriculture Imagery Program (NAIP). Using the detection of wind turbines as a case study, we run a series of experiments and benchmarks demonstrating the benefit of our technique in augmenting training datasets. Our experiments show the potential of our technique to improve downstream model performance as compared to other domain adaptation techniques in this space, especially in situations with limited training data or when applying a model to new geographies.

Figure 1. Example images from selected domain adaptation methods. For each domain mapping, the original image is shown in the first column, while each image to the right shows that same image transformed by the technique to look like it came from the target domain. A detailed mapping of the domains can be found in Figure 2.

Figure 2. Sample images and corresponding locations from our chosen geographic domains: the Northwest, Southwest, and Eastern Midwest United States.

Figure 3. Diagram depicting our process to generate GP-GAN synthetic images. In brief, the process consists of sampling target objects, randomizing their locations, selecting a background image, and blending via GP-GAN.

Figure 4. Example GP-GAN blended synthetic images. The upper original images without turbines have been transformed into the lower GP-GAN synthetic images by blending in turbine examples.

Figure 6. Baseline experimental results demonstrating evidence of a domain gap. In this plot, all domain pairings with the same test domain are grouped together and divided into the within- and cross-domain settings. The gap in performance between within-domain and cross-domain settings is shown in this figure as the distance between the red and blue points.

Table 1. Synthetic experiment results compared to the baseline experiment using average precision