
Extending scene-to-patch models: Multi-resolution multiple instance learning for Earth observation

Published online by Cambridge University Press:  04 December 2023

Joseph Early*
Affiliation:
Agents, Interaction, and Complexity Group, University of Southampton, Southampton, United Kingdom; The Alan Turing Institute, London, United Kingdom
Ying-Jung Chen Deweese
Affiliation:
School of Computer Science, Georgia Institute of Technology, Atlanta, Georgia, USA; Descartes Labs, Santa Fe, New Mexico, USA
Christine Evers
Affiliation:
Agents, Interaction, and Complexity Group, University of Southampton, Southampton, United Kingdom
Sarvapali Ramchurn
Affiliation:
Agents, Interaction, and Complexity Group, University of Southampton, Southampton, United Kingdom; The Alan Turing Institute, London, United Kingdom
*Corresponding author: Joseph Early; Email: joseph.early.ai@gmail.com

Abstract

Land cover classification (LCC) and natural disaster response (NDR) are important issues in climate change mitigation and adaptation. Existing approaches that use machine learning with Earth observation (EO) imaging data for LCC and NDR often rely on fully annotated and segmented datasets. Creating these datasets requires a large amount of effort, and a lack of suitable datasets has become an obstacle to scaling the use of machine learning for EO. In this study, we extend our prior work on Scene-to-Patch models: an alternative machine learning approach for EO that utilizes Multiple Instance Learning (MIL). As our approach only requires high-level scene labels, it enables much faster development of new datasets while still providing segmentation through patch-level predictions, ultimately increasing the accessibility of using machine learning for EO. We propose new multi-resolution MIL architectures that outperform single-resolution MIL models and non-MIL baselines on the DeepGlobe LCC and FloodNet NDR datasets. In addition, we conduct a thorough analysis of model performance and interpretability.

Information

Type
Methods Paper
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. MIL scene-to-patch overview. The model produces both instance (patch) and bag (scene) predictions but only learns from scene-level labels. Example from the DeepGlobe dataset (Section 4.1).
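
The scene-to-patch mechanism in Figure 1 can be illustrated with a minimal sketch (the layer size, mean pooling, and classification loss below are illustrative assumptions, not the paper's exact configuration): each patch embedding is classified independently, and the patch predictions are pooled into a single scene prediction, so only the scene-level label is needed for training.

import torch
import torch.nn as nn

class SceneToPatchHead(nn.Module):
    # Toy MIL head: per-patch (instance) predictions pooled into a scene (bag) prediction.
    def __init__(self, embed_dim, n_classes):
        super().__init__()
        self.instance_clf = nn.Linear(embed_dim, n_classes)  # per-patch classifier

    def forward(self, patch_embeddings):                      # (b, embed_dim): b patches from one scene
        patch_logits = self.instance_clf(patch_embeddings)    # patch-level predictions (used for segmentation)
        scene_logits = patch_logits.mean(dim=0)               # MIL mean pooling over the bag
        return patch_logits, scene_logits

# Only the scene-level label enters the loss; the patch predictions come for free.
head = SceneToPatchHead(embed_dim=128, n_classes=7)
patch_logits, scene_logits = head(torch.randn(64, 128))       # 64 patches in one scene
loss = nn.functional.cross_entropy(scene_logits.unsqueeze(0), torch.tensor([3]))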


Figure 2. S2P single-resolution model architecture. Note that some fully connected (FC) layers use ReLU and Dropout (denoted with *) while others do not, and $ b $ denotes bag size (number of patches).
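
A minimal sketch of how such a single-resolution S2P model might be assembled, assuming illustrative channel counts, embedding sizes, and dropout rate rather than the paper's exact configuration:

import torch
import torch.nn as nn

def fc_block(in_dim, out_dim, starred=True):
    # Starred FC layers in Figure 2 also apply ReLU and Dropout; unstarred layers are plain Linear.
    layers = [nn.Linear(in_dim, out_dim)]
    if starred:
        layers += [nn.ReLU(), nn.Dropout(0.25)]
    return nn.Sequential(*layers)

class S2PSingleRes(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(                         # per-patch CNN feature extractor
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.embed = fc_block(64, 128)                         # FC* (ReLU + Dropout)
        self.instance_clf = fc_block(128, n_classes, starred=False)

    def forward(self, patches):                                # (b, 3, H, W): bag of b patches from one scene
        patch_logits = self.instance_clf(self.embed(self.features(patches)))
        return patch_logits, patch_logits.mean(dim=0)          # patch-level and scene-level predictions

model = S2PSingleRes(n_classes=7)
patch_preds, scene_pred = model(torch.randn(16, 3, 28, 28))    # one scene split into 16 patches

Because each patch is encoded and classified independently before pooling, the patch-level outputs double as a coarse segmentation map even though no patch labels are seen during training.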


Figure 3. S2P multi-resolution architecture. The embedding process uses independent feature extraction modules (CNN layers, see Figure 2), allowing specialized feature extraction for each resolution. The MRMO configuration produces predictions at $ s=0 $, $ s=1 $, $ s=2 $, and $ s=m $ resolutions (indicated by the dashed box); MRSO only produces $ s=m $ predictions.
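
The difference between the two output configurations can be sketched roughly as below (embedding dimensions and class count are placeholder assumptions); the per-resolution heads are what distinguish MRMO from MRSO:

import torch.nn as nn

class MRMOHeads(nn.Module):
    # One prediction head per resolution (s=0, 1, 2) plus a combined s=m head, as in MRMO.
    # An MRSO variant would keep only the combined head.
    def __init__(self, dims=(128, 128, 128), n_classes=7):
        super().__init__()
        self.per_res_heads = nn.ModuleList([nn.Linear(d, n_classes) for d in dims])
        self.combined_head = nn.Linear(sum(dims), n_classes)

    def forward(self, res_embeds, combined_embeds):
        # res_embeds: one (num_patches_s, d_s) tensor per resolution, from independent extractors
        # combined_embeds: concatenated multi-resolution embeddings (see Figure 4)
        per_res_logits = [h(e) for h, e in zip(self.per_res_heads, res_embeds)]
        return per_res_logits, self.combined_head(combined_embeds)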


Figure 4. Multi-resolution patch extraction and concatenation. Left: Patches are extracted at different resolutions, where each patch at scale $ s=0 $ has four corresponding $ s=1 $ patches and 16 corresponding $ s=2 $ patches. Right: The $ s=0 $ and $ s=1 $ embeddings are repeated to match the number of $ s=2 $ embeddings, and then concatenated to create multi-resolution embeddings.
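
The repeat-and-concatenate step could look roughly like the snippet below, assuming placeholder embedding sizes and that patches at each resolution are ordered consistently with the spatial layout; the 4x and 16x factors follow from the patch hierarchy in the caption:

import torch

n0, d = 9, 128                       # e.g. 9 patches at s=0 with 128-dim embeddings (illustrative)
e0 = torch.randn(n0, d)              # s=0 embeddings
e1 = torch.randn(4 * n0, d)          # each s=0 patch has four corresponding s=1 patches
e2 = torch.randn(16 * n0, d)         # ... and sixteen corresponding s=2 patches

# Repeat the coarser embeddings so every s=2 patch has a matching row, then concatenate features.
multi_res = torch.cat(
    [e0.repeat_interleave(16, dim=0), e1.repeat_interleave(4, dim=0), e2], dim=1
)                                    # shape: (16 * n0, 3 * d)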


Table 1. DeepGlobe results


Table 2. FloodNet results


Figure 5. S2P resolution comparison. We compare model performance for scene RMSE (left), scene MAE (middle), and pixel mIoU (right) for both DeepGlobe (top) and FloodNet (bottom).


Figure 6. DeepGlobe model analysis. Top: Row-normalized confusion matrices. Bottom: Classwise precision and recall, where the dotted line indicates the macro-averaged performance (which is also denoted in the top right corner of each plot).


Figure 7. FloodNet model analysis. Top: Row-normalized confusion matrices. Bottom: Classwise precision and recall, where the dotted line indicates the macro-averaged performance (which is also denoted in the top left corner of each plot).


Figure 8. MRMO interpretability example. Top (from left to right): the original dataset image; the true pixel-level mask; the true patch-level mask at resolution $ s=2 $; the predicted mask from the MRMO model overlaid on the original image. Middle: Predicted masks for each of the MRMO outputs. Bottom: MRMO $ s=m $ predicted masks for each class, showing supporting (+ve) and refuting (−ve) regions.


Figure 9. Single-resolution S2P ablation study. We observe that, on average, the large configuration achieves the best performance (lowest RMSE, lowest MAE, and highest mIoU) on both datasets. Error bars represent the standard error of the mean over five repeats.

Supplementary material: File

Early et al. supplementary material
File 322.4 KB