
MoTiF: a self-supervised model for multi-source forecasting with application to tropical cyclones

Published online by Cambridge University Press:  23 July 2025

Clément Dauvilliers*
Affiliation:
ARCHES, INRIA, Paris, Île-de-France, France
Claire Monteleoni
Affiliation:
ARCHES, INRIA, Paris, Île-de-France, France; Department of Computer Science, University of Colorado, Boulder, CO, USA
*Corresponding author: Clément Dauvilliers; Email: clement.dauvilliers@inria.fr

Abstract

We present a deep learning architecture that reconstructs a source of data at given spatio-temporal coordinates using other sources. The model can be applied to multiple sources in a broad sense: the number of sources may vary between samples, the sources can differ in dimensionality and sizes, and cover distinct geographical areas at irregular time intervals. The network takes as input a set of sources that each include values (e.g., the pixels for two-dimensional sources), spatio-temporal coordinates, and source characteristics. The model is based on the Vision Transformer, but separately embeds the values and coordinates and uses the embedded coordinates as relative positional embedding in the computation of the attention. To limit the cost of computing the attention between many sources, we employ a multi-source factorized attention mechanism, introducing an anchor-points-based cross-source attention block. We name the architecture MoTiF (multi-source transformer via factorized attention). We present a self-supervised setting to train the network, in which one source chosen randomly is masked and the model is tasked to reconstruct it from the other sources. We test this self-supervised task on tropical cyclone (TC) remote-sensing images, ERA5 states, and best-track data. We show that the model is able to perform TC ERA5 fields and wind intensity forecasting from multiple sources, and that using more sources leads to an improvement in forecasting accuracy.
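The masking-and-reconstruction task described in the abstract can be illustrated with a minimal sketch. It assumes each sample is a dictionary of sources, each holding `values` and `coords`; the function name `make_masked_sample`, the dictionary layout, and the use of `None` to hide values are hypothetical choices made for illustration, not the authors' implementation.

```python
import random


def make_masked_sample(sources, rng):
    """Pick one source at random and hide its values.

    `sources` maps a source name to a dict with "values" and "coords".
    The masked source keeps its coordinates, since the model is asked to
    reconstruct values at given spatio-temporal coordinates.
    (Layout and names are assumptions for this sketch.)
    """
    # Sort the names so the choice depends only on the RNG state.
    target_name = rng.choice(sorted(sources))
    inputs = {}
    for name, src in sources.items():
        if name == target_name:
            # Values are hidden; coordinates stay available to the model.
            inputs[name] = {"coords": src["coords"], "values": None}
        else:
            inputs[name] = dict(src)
    target_values = sources[target_name]["values"]
    return inputs, target_name, target_values


# Toy sample with three sources of different sizes (made-up numbers).
toy_sample = {
    "pmw_image": {"values": [250.0, 231.5, 198.2], "coords": [(0, 0), (0, 1), (1, 0)]},
    "era5_state": {"values": [1012.0, 1009.5], "coords": [(0, 0), (1, 1)]},
    "best_track": {"values": [65.0], "coords": [(0, 0)]},
}
inputs, masked_name, target = make_masked_sample(toy_sample, random.Random(0))
```

Because the number of entries in `sources` is not fixed, the same function handles samples with one, two, or many available sources.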

Information

Type
Methods Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. Overall diagram of the architecture. (a) Overall view of the masking-and-reconstruction pipeline. In this case, Source 1 is chosen to be masked. (b) Multi-source anchored cross-attention mechanism (MSCA) is shown for an example with two 2D sources. In this case, each embedded vector is a patch from the original image, as usual in ViT-based architectures. The embedded vectors that are not anchor points remain unchanged through this layer. (c) Diagram of the backbone. IWSA, individual windowed self-attention. The MSCA block lets information travel across the sources, while the IWSA block lets information travel within each source. The embedded coordinates serve as positional encoding for the attention layers.
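The anchor-point idea in panel (b) can be sketched as follows. This is a simplified illustration under stated assumptions, not the published MSCA block: each source is assumed to be already flattened into a sequence of embedded vectors, anchors are taken every `stride` tokens (a hypothetical selection rule), attention is single-head without learned projections, and the coordinate-based positional encoding is omitted. The function name `anchored_cross_attention` is also hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def anchored_cross_attention(sources, stride=4):
    """Exchange information across sources through anchor tokens only.

    `sources` is a list of arrays of shape (n_i, d); the n_i may differ.
    Returns a list of arrays with the same shapes. Non-anchor tokens are
    returned unchanged, as stated in the caption.
    """
    d = sources[0].shape[1]
    # Assumed rule: every `stride`-th token of each source is an anchor.
    anchor_idx = [np.arange(0, s.shape[0], stride) for s in sources]
    # Gather the anchors of all sources into one short sequence.
    anchors = np.concatenate(
        [s[idx] for s, idx in zip(sources, anchor_idx)], axis=0
    )
    # Plain scaled dot-product attention among the anchors.
    scores = anchors @ anchors.T / np.sqrt(d)
    updated = softmax(scores, axis=-1) @ anchors
    # Scatter the updated anchors back into their sources.
    out, start = [], 0
    for s, idx in zip(sources, anchor_idx):
        new = s.copy()
        new[idx] = updated[start:start + len(idx)]
        start += len(idx)
        out.append(new)
    return out
```

With this layout the attention matrix has one row per anchor rather than one per token, which is the cost saving the factorization is after; attention within each source would be handled by a separate block such as the IWSA block named in panel (c).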


Figure 2. Two examples (the first two rows and the last two rows, respectively) of reconstructions from the same storm, but with different available sources and targets. The model is able to correctly identify the area in the other sources that is closest in space and time to its target.


Table 1. RMSE values for different channels and sets of input sources. Lower is better. SLP, sea-level pressure


Figure 3. Example of ERA5 surface-field reconstruction from a previous ERA5 state and PMW images. Left of the vertical bar: available sources, given as input to the model. Right of the vertical bar: targets (left) and predictions (right). Each row corresponds to a variable. The distance-to-center variable is displayed for every source, but is not given as input to the model, being only required as output. The sea-surface temperature (SST) field is only used as input. The annotations $ {\delta}_t $ = Ddhh:mm indicate the time delta between the reference time $ {t}_0 $ and the time of the observation, in the format days–hours–minutes. The forecast time is at $ {t}_0+1 $ day; thus, the predictions are at $ {\delta}_t=-1 $ day.


Figure 4. Comparison of error in forecast storm location (km). Text annotations indicate the median values.


Table 2. Overall RMSE and MAE for intensity forecasting for different input sources (kt). Lower is better


Figure 5. Comparison of mean absolute error (MAE) for different sets of input sources, as well as the non-fine-tuned model. Lower is better. Text annotations indicate the median values.


Figure 6. Distribution of the mean absolute error at the intensity forecasting task, for different subsets of input sources. Lower is better.


Figure 7. Example of reconstruction of a passive microwave image. Left of the vertical bar, top row (brightness temperature): available sources, fed to the model as input. Right, top row: target and prediction. Bottom row: “dist_to_center” channel, which is the distance between each pixel and the storm’s center according to best-track data. This channel is not given as input, but the model is required to predict it so that its ability to locate the cyclone can be judged. The annotations $ {\delta}_t $ = Ddhh:mm indicate the time delta between the reference time $ {t}_0 $ and the time of the observation, in the format days–hours–minutes. In this example, the source closest in time (41 min apart, left) does not capture the tail of the cyclone, while another image, further away chronologically (8 h 53 min apart), does show it. The model is able to merge the sources, in the sense that it fetches information from the closer image for the center and from the other one for the cyclone’s tail.
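One way a “dist_to_center” channel could be built from per-pixel coordinates and a best-track center is sketched below. The caption does not say how the distance is computed; the haversine great-circle formula, the mean Earth radius of 6371 km, and the function names `haversine_km` and `dist_to_center` are assumptions made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def dist_to_center(lats, lons, center_lat, center_lon):
    """Distance from every pixel to the storm center.

    `lats` and `lons` are 2D lists (rows of pixels) of the same shape,
    holding each pixel's latitude and longitude in degrees.
    """
    return [
        [haversine_km(lat, lon, center_lat, center_lon)
         for lat, lon in zip(lat_row, lon_row)]
        for lat_row, lon_row in zip(lats, lons)
    ]
```

A channel built this way has its minimum at the pixel nearest the best-track center, so comparing the predicted and target channels gives a direct reading of how well the storm was located.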


Figure 8. Example of reconstruction using images from six different satellites as input to reconstruct a seventh. The multi-source cross-attention lets the model process many different sources at a reasonable cost. In particular, examples such as this one can be processed in the same batch as examples with only one or two sources, both during training and inference. While cases with five or more available satellites are a minority of the training data, the model does not collapse, and shows an ability to select information from the sources that are closest in time and space. Every image has a specific aspect ratio and size.


Figure 9. Example of temporal interpolation of a passive microwave image where the image to reconstruct lies chronologically between the two other available sources. While only the low frequencies are reconstructed, the distance-to-center channel shows that the location of the storm’s core was well interpolated.


Figure 10. Example of reconstruction where the image to reconstruct is anterior to the other available images.

Supplementary material: File

Dauvilliers and Monteleoni supplementary material


Author comment: MoTiF: a self-supervised model for multi-source forecasting with application to tropical cyclones — R0/PR1

Comments

Dear Editors,

I am pleased to submit the methods manuscript “MoTiF: A self-supervised model for multi-source forecasting with application to tropical cyclones” for publication in Environmental Data Science. This paper is at the intersection of geospatial sciences and machine learning (ML), and tackles the problem of combining multiple sources of data into a single ML model. The MoTiF architecture proposed in this work considers multiple sources in a sense more general than previous works: the sources can be of different natures, have different dimensionalities, and cover distinct (even disjoint) areas in space and time. Moreover, the architecture can use a flexible number of sources, meaning it can be used without knowing a priori which sources are going to be available within a sample.

The contributions are the following: (a) introducing a very general ML architecture for combining multiple sources of data in geosciences; (b) introducing a new type of attention, based on anchor-points, that allows processing many sources at a reasonable cost; (c) applying the architecture to the case of tropical cyclones, to show that combining multiple sources helps in tasks such as forecasting the location of the storm or its intensity. As the manuscript proposes methods in ML for the specific case of geospatial data, EDS seems like a natural fit.

This work was accepted as part of Climate Informatics 2025, and has not been submitted for publication in any other venue. The paper was peer-reviewed via the conference’s reviewing system. Changes have been applied to the manuscript that was accepted at CI, but only for clarity purposes: the Related work section that was in the supplementary material was moved back into the main article, some mathematical descriptions were made less verbose, and a diagram of the architecture was added. The method and results have remained unchanged.

The authors are the following: Clément Dauvilliers (PhD student at Inria, Paris, France) and Claire Monteleoni (Professor at CU Boulder and Project-team leader at Inria). No competing interests need to be declared. The data and code are fully public and available online (for the dataset, at least at the time of writing…).

We thank you for considering our manuscript for publication in Environmental Data Science. We welcome any feedback from you or the reviewers, should there be an additional round of reviews.

Sincerely,

Clément Dauvilliers (on behalf of both authors)

Inria Paris

clement.dauvilliers@inria.fr

Review: MoTiF: a self-supervised model for multi-source forecasting with application to tropical cyclones — R0/PR2

Conflict of interest statement

Reviewer declares none.

Comments

1. Summary: In this section please explain in your own words what problem the paper addresses and what it contributes to solving it.

The paper addresses the problem of integrating multiple sources of data of different types to improve weather forecasting AI models. The paper presents a transformer-based architecture designed to merge multiple types of geospatial information from different sources, which is trained through self-supervised learning (by randomly masking different sources at different time steps). They apply this architecture to tropical cyclone remote-sensing images and show that including more sources leads to an improvement in tropical cyclone wind intensity forecasting.

2. Please select a score of relevance to climate informatics which promotes the interdisciplinary research between climate science, data science, and computer science.

Highly relevant

3. Relevance and Impact: Is this paper a significant contribution to interdisciplinary climate informatics?

The general problem of incorporating multiple sources of data into weather forecasting models is very relevant to climate informatics. The paper proposes an architecture that, while heavily based on previous work, is the first that I have seen to allow sources that are misaligned in space and at irregular time intervals, and that can work with a flexible set and number of sources. The only issue I see is the lack of comparisons with other models as benchmarks.

4. Overall recommendation of the submission.

Accept: Good paper to be accepted as is.

5. Detailed Comments

The paper tackles an important problem, properly positions the work with respect to the existing literature, and makes the contributions very clear. The proposed architecture and self-supervised learning algorithms are explained in enough detail to be reproduced if needed (although they do not make the code available). They chose one specific application to evaluate the method on (tropical cyclone wind intensity forecasting) and describe well the experimental setting and the data sets used (which are publicly available). They evaluate the effect of adding more sources of data, showing that this improves their results. However, they do not provide any external baseline models (ones that use a single source or aligned sources) for comparison, leaving it open whether the model actually improves on the state of the art for this task.

Furthermore, for a better assessment of the general applicability of this architecture beyond tropical cyclones, it would have been nice to see other applications, but this goes beyond the scope of this paper.

6. Reviewer’s confidence

The reviewer has general research experience in the relevant field and is fairly confident for the evaluation

Review: MoTiF: a self-supervised model for multi-source forecasting with application to tropical cyclones — R0/PR3

Conflict of interest statement

Reviewer declares none.

Comments

1. Summary: In this section please explain in your own words what problem the paper addresses and what it contributes to solving it.

The work describes a self-supervised learning method to reconstruct data at a given coordinate using data from other sources. This method can be used for data interpolation and prediction depending on how the task is designed. The main goal of this work is to address the limitations of current deep-learning numerical weather prediction, which is constrained by training data at a fixed resolution (ERA5) that may not provide accurate forecasts for extreme events.

2. Please select a score of relevance to climate informatics which promotes the interdisciplinary research between climate science, data science, and computer science.

Highly relevant

3. Relevance and Impact: Is this paper a significant contribution to interdisciplinary climate informatics?

The presented work highlights a novel approach of self-supervised learning to improve AI for numerical weather prediction for extreme events. This is a highly relevant work for the CI community.

4. Overall recommendation of the submission.

Minor Revision: Borderline, require minor changes.

5. Detailed Comments

The method presented in the paper is well-described and the results are promising. I have a few comments for the authors to consider.

-. The introduction describes how this method can directly leverage data from different sources with different resolutions. However, the paper further describes the experiment by resampling the data to 0.15 degrees. This experiment does not appear to show the flexibility of the method for leveraging data with different resolutions if the resampling is needed.

-. In the encoding, is it possible to also include the resolution information in the coordinates? If only latitude and longitude are used as spatial coordinates, they often just account for the centroid of an area/grid cell for the 2D data. Would it help to include the spatial resolution of the different sources, either by using the lat/lon of multiple corners of a spatial grid cell or by including the spatial resolution in the encoding of the coordinates?

-. TC PRIMED data include data sources from different satellites. However, the dataset is organized by individual cyclones, and there aren’t 11 sensors on 11 different satellites for all storms. As shown in Figure 1, there are only data from a couple of satellites for individual storms. Please clarify this in the revision process.

-. The description of the wind intensity forecasting is a bit unclear to me. How is the forecast task carried out? What are the sources of the data used to complete the forecast at t0+24h?

-. If the values from different sources differ, e.g., if multiple reanalysis datasets are used here, how does the method address the systematic differences between these data sources?

-. Some minor clarifications and corrections are listed below:

- Line 37 of Page 2: remove a duplicate “its”

- Line 15 of page 3: An element ... is composed of four elements. Are these two uses of “element” the same in this context? If not, please replace the second “element” with another word.

- Line 39 of page 5: change “normaization” to “normalization”

- Typically, the conclusion comes after the discussion of the limitations and the proposed future work.

6. Reviewer’s confidence

The reviewer has general research experience in the relevant field and is fairly confident for the evaluation

Recommendation: MoTiF: a self-supervised model for multi-source forecasting with application to tropical cyclones — R0/PR4

Comments

This article was accepted into the Climate Informatics 2025 Conference after the authors addressed the comments in the reviews provided. It has been accepted for publication in Environmental Data Science on the strength of the Climate Informatics Review Process.

Decision: MoTiF: a self-supervised model for multi-source forecasting with application to tropical cyclones — R0/PR5

Comments

No accompanying comment.