
Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework

Published online by Cambridge University Press:  13 October 2025

Valentina Blasone*
Affiliation:
Department of Mathematics, Informatics and Geosciences, University of Trieste, Trieste, Italy
Erika Coppola
Affiliation:
Earth System Physics, Abdus Salam International Centre for Theoretical Physics (ICTP), Trieste, Italy
Guido Sanguinetti
Affiliation:
Theoretical and Scientific Data Science, Scuola Internazionale Superiore di Studi Avanzati (SISSA), Trieste, Italy
Viplove Arora
Affiliation:
Theoretical and Scientific Data Science, Scuola Internazionale Superiore di Studi Avanzati (SISSA), Trieste, Italy
Serafina Di Gioia
Affiliation:
Earth System Physics, Abdus Salam International Centre for Theoretical Physics (ICTP), Trieste, Italy
Luca Bortolussi
Affiliation:
Department of Mathematics, Informatics and Geosciences, University of Trieste, Trieste, Italy
*Corresponding author: Valentina Blasone; Email: valentina.blasone@phd.units.it

Abstract

Extreme precipitation events are projected to increase both in frequency and intensity due to climate change. High-resolution climate projections are essential to effectively model the convective phenomena responsible for severe precipitation and to plan any adaptation and mitigation action. Existing numerical methods either lack accuracy in capturing the evolution of convective dynamical systems, owing to their low resolution, or are limited by the excessive computational demands required to achieve kilometre-scale resolution. To fill this gap, we propose a novel deep learning regional climate model (RCM) emulator called graph neural networks for climate downscaling (GNN4CD) to estimate high-resolution precipitation. The emulator is innovative in architecture and training strategy, using graph neural networks (GNNs) to learn the downscaling function through a novel hybrid imperfect framework. GNN4CD is initially trained to perform reanalysis to observation downscaling and then used for RCM emulation during the inference phase. The emulator is able to estimate precipitation at very high resolution both in space (3 km) and time (1 h), starting from lower-resolution atmospheric data (~25 km). Leveraging the flexibility of GNNs, we tested its spatial transferability in regions unseen during training. The model trained on northern Italy effectively reproduces the precipitation distribution, seasonal diurnal cycles, and spatial patterns of extreme percentiles across all of Italy. When used as an RCM emulator for the historical, mid-century, and end-of-century time slices, GNN4CD shows the remarkable ability to capture the shifts in precipitation distribution, especially in the tail, where changes are most pronounced.

Information

Type
Application Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open materials
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. The hybrid imperfect framework, applied to the GNN4CD emulator. Scheme of (a) training: reanalysis to observation downscaling, (b) inference: reanalysis to observation downscaling, (c) inference: RCM emulation.

Table 1. Variables used as predictors (P) and target (T), each reported with its symbol, unit, pressure levels, space and time resolutions

Figure 2. Graph conceptualisation: Low nodes (blue dots) and High nodes (red dots) and close-up of (a) Low-to-High unidirectional edges (orange), connecting Low nodes to High nodes (b) High-within-High bidirectional edges (red), linking High nodes.

Figure 3. Schematic views of (a) RC, designed as a combination of Regressor and Classifier components, (b) R-all, consisting of a single Regressor, (c) architecture, composed of four modules: an RNN-based pre-processor, a GNN-based downscaler, a GNN-based processor, and an FCNN-based predictor.

Figure 4. (a) training (northern Italy) and inference (entire Italy) areas, (b) locations of original stations used to create the GRIPHO dataset, and (c) percentage of valid time steps for each station.

Figure 5. Results in the reanalysis to observation downscaling setting and comparison with GRIPHO observations for the testing year 2016 for the PDF of hourly precipitation [mm/h] with bin size of 0.5 mm for (a) Italy (I), (b) northern Italy (N) and central-south Italy (C-S); the insets provide a magnified view of the tail of the distribution; (c) average [mm/h], (d) frequency [%] and (e) intensity [mm/h] seasonal diurnal cycles for Italy (I).

Table 2. Extreme percentiles computed for GRIPHO and the GNN4CD RC and R-all model designs for Italy (I), northern Italy (N), and central-south Italy (C-S)

Figure 6. Results in the reanalysis to observation downscaling setting and comparison with GRIPHO observations for the testing year 2016 for (a) average precipitation [mm/h] and percentage bias [%], (b) p99 [mm/h] and percentage bias [%], (c) p99.9 [mm/h] and percentage bias [%].

Table 3. Spatial correlation between the reference GRIPHO maps and the GNN4CD RC and R-all estimated maps for precipitation average, p99 and p99.9; results are shown for Italy (I), northern Italy (N), and central-south Italy (C-S)

Figure 7. Total precipitation [mm] for 10 flood events in Italy. Events 1, 4, 5, 6, 8, 10 took place in northern Italy (N), events 2, 7 in central Italy (C), events 3, 9 in southern Italy (S).

Figure 8. PDF of hourly precipitation [mm/h] with bin size of 0.5 mm for Italy (I); comparison of GRIPHO 10-years (grey) and RegCM historical (black) with (a) historical GNN4CD RC (blue), (b) mid-century RegCM (pink) and GNN4CD RC (orange), (c) end-of-century RegCM (magenta) and GNN4CD RC (dark orange), (d) historical GNN4CD R-all (blue), (e) mid-century RegCM (pink) and GNN4CD R-all (orange), (f) end-of-century RegCM (magenta) and GNN4CD R-all (dark orange); the insets provide a magnified view of the tail of the distribution.

Figure 9. Maps for GNN4CD RC, GNN4CD R-all, and RegCM showing (a) historical average hourly precipitation [mm/h] and (b) end-of-century average change [%]; (c)-(d) the same for p99 and (e)-(f) the same for p99.9.

Figure 10. Box-plots for RegCM (red) and GNN4CD RC (green) and R-all (blue), derived for Italy (I) from the spatial maps of (a) average precipitation [mm/h], (b) p99.9 [mm/h] and (c) percentage of rainy hours [%]; the lower panels show the box-plots for the relative bias maps of the same quantities.

Figure A1. Same as Figure 5 but for the R-all model.

Figure A2. Same as Figure 6 but for the R-all model.

Figure A3. Seasonal results in the reanalysis to observation downscaling setting for the testing year 2016 for the hourly average precipitation; (a) GRIPHO observational reference [mm/h], (b) GNN4CD RC percentage bias [%], (c) GNN4CD R-all percentage bias [%].

Figure A4. Same as Figure A3 but for p99.

Figure A5. Same as Figure A3 but for p99.9.

Figure A6. Seasonal results in the reanalysis to observation downscaling setting for the testing year 2016 for the PDF of hourly precipitation [mm/h] with bin size of 0.5 mm for Italy (I); comparison between GRIPHO and (a) GNN4CD RC, (b) GNN4CD R-all; the insets provide a magnified view of the tail of the distribution.

Figure A7. Same as Figure 9, but computing the change for the mid-century period.

Figure A8. Same as Figure 10 but for (a) northern Italy (N) and (b) central-south Italy (C-S).

Figure A9. Comparison between the four alternative setups for the R-all model configuration, i.e., GNN4CD $[t-24,\dots,t]$, GNN4CD $[t-12,\dots,t]$, GNN4CD $[t-6,\dots,t]$ and GNN4CD $[t]$, in terms of relative percentage bias [%] with respect to GRIPHO, considering the validation year 2007; (a) average, (b) p99 and (c) p99.9 spatial maps.

Figure A10. Comparison between the four alternative setups for the R-all model configuration, i.e., GNN4CD $[t-24,\dots,t]$, GNN4CD $[t-12,\dots,t]$, GNN4CD $[t-6,\dots,t]$ and GNN4CD $[t]$, with respect to GRIPHO, considering the validation year 2007 and Italy (I); (a) hourly precipitation PDF [mm/h] using a bin size of 0.5 mm and (b) average [mm/h], (c) frequency [%], and (d) intensity [mm/h] seasonal diurnal cycles.

Author comment: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R0/PR1

Comments

Dear Professor Monteleoni,

I am pleased to submit our manuscript entitled “Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect approach”, authored by Valentina Blasone, Erika Coppola, Guido Sanguinetti, Viplove Arora, Serafina Di Gioia and Luca Bortolussi, for consideration for publication in the Environmental Data Science journal. This paper is an original contribution; it has not been published elsewhere in its present or any other form, nor is it under consideration by any other journal. It presents a new machine learning-based climate emulator, combined with an innovative training strategy to derive high-resolution precipitation projections, and we believe it will be of significant interest to your readers.

Please feel free to contact me if you need any additional information.

Sincerely,

Valentina Blasone

Review: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R0/PR2

Conflict of interest statement

Reviewer declares none.

Comments

This study tackles the problem of RCM emulation with deep learning approaches, which is an active area of research. The setup proposed by the authors intends to be model-agnostic, and leverages a graph-based architecture, which is relatively new in this field. The paper is generally well written, easy to follow and the evaluation part is based on relevant diagnostics. However, I struggled a bit to understand the reasons behind the relatively complex setup and the added value of the graph approach. I guess these issues could be removed in large part by improving the positioning with respect to the state-of-the-art and including a comparison to existing approaches. For these reasons, I consider that the paper in its present form deserves substantial revision before publication in EDS.

Main comments.

1. My first concern is about the experimental setup and objectives of the present work. I recommend that the authors clarify the positioning of their work with respect to the state-of-the-art in the introduction. In particular, the different types of statistical downscaling can be confusing. Page 3, L38-40, perfect model framework is not equivalent to perfect prognosis and imperfect model framework is not the same as MOS. As two main approaches, I would suggest: Observation downscaling and RCM emulation, and within each method there are 2 possible options corresponding to perfect prognosis/super-resolution and perfect/imperfect frameworks respectively (see Rampal et al. 2024, especially their Figure 3, which introduces the different setups very clearly).

What I understand is that you train your model in the framework of observation downscaling with the perfect prognosis approach. The model trained under these conditions is then used to perform RCM emulation (in a perfect model framework), which is a kind of transfer learning. Is that right? This design is quite complicated, and it should be carefully explained. The terms real-world and model-world are also confusing. As an attempt to limit the vocabulary, I think real world could be replaced by downscaling and model world by emulation. I also wonder if there are similar configurations in the literature. I am aware of some observation downscaling/perfect prognosis works in studies by Bano-Medina et al. and Rampal et al. for instance, but they don’t seem to apply their models for RCM emulation. In addition, I don’t understand why your framework is named hybrid imperfect approach. Is it hybrid because you’re using as RCM emulator a model trained for observation downscaling? And why imperfect? If your main objective is to perform RCM emulation, then you should really explain why you don’t simply train an RCM emulator.

2. Another major concern relates to the motivations and added value of using a GNN rather than more standard approaches, such as convolutional networks.

GNN is a complicated architecture; its use should be carefully justified, and if possible explicitly demonstrated. Some reasons are given p7 L16-19. I understand the two different coordinate systems, but is it really a sufficient reason for a GNN approach? Could GRIPHO not also be processed on a regular lat/lon grid? Regarding domain transferability, I guess it can also be achieved with a CNN.

As the use of a GNN is presented as the main innovation of your work, comparing its performances to those of a baseline model seems essential.

3. The list of large-scale predictors is quite long for downscaling a single parameter. I guess it could be reduced without loss of quality. Did you look at predictor importance? Using time series of predictors is quite unusual (as far as I know) and less documented, and I agree this should to some extent improve the results. However, your temporal length of 24h seems very long. As it is a relatively new setup, I would recommend testing the sensitivity of the results to the size of this temporal window, or at least evaluating the gain of using time series (and RNN).

4. I agree precipitation is a highly skewed variable that deserves specific attention. Thresholding and logarithmic transformation are common pre-processing steps that often give good results. In addition, the authors propose a strategy based on 2 models. Could you clarify what you mean by ‘the regressor is trained only on targets where precipitation values exceed the threshold’? Do you mask the nodes where precipitation is below the threshold? Do you also apply the precipitation thresholding on the GNN outputs? In the end, between RC and R-all, which strategy would you recommend?

Regarding the imbalance problem, Ravuri et al. proposed importance sampling to design meaningful datasets with skewed data. It can be an avenue for future work.
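
To make the masking question in main comment 4 concrete, here is a minimal Python sketch of one possible reading: nodes whose target precipitation does not exceed the threshold are excluded from the regression loss, and the remaining targets are log-transformed. The threshold value, the `log1p` transform, the plain mean squared error and the function name `masked_log_mse` are all assumptions made for illustration; the loss and pre-processing actually used in the manuscript may differ.

```python
import math
from typing import Sequence

# Hypothetical wet/dry threshold in mm/h (illustrative, not taken from the manuscript).
THRESHOLD_MM_H = 0.1


def masked_log_mse(pred_log: Sequence[float],
                   target_mm_h: Sequence[float],
                   threshold: float = THRESHOLD_MM_H) -> float:
    """Mean squared error in log space over nodes whose target exceeds the threshold.

    pred_log holds predictions already expressed in log1p space; targets are in mm/h.
    Nodes at or below the threshold are masked out and contribute nothing to the loss.
    """
    errors = [
        (p - math.log1p(t)) ** 2
        for p, t in zip(pred_log, target_mm_h)
        if t > threshold
    ]
    # With no wet nodes there is nothing to regress on; return 0.0 by convention.
    return sum(errors) / len(errors) if errors else 0.0


targets = [0.0, 1.0, 0.05, 3.0]  # mm/h: two dry nodes, two wet nodes
# Predictions on the dry nodes are deliberately poor to show that they are ignored.
preds = [5.0, math.log1p(1.0), 5.0, math.log1p(3.0)]
print(masked_log_mse(preds, targets))
```

Under this reading the Regressor never sees dry nodes during training, so a separate rule (a Classifier, or a threshold applied to the outputs) is still needed at inference time to decide where it rains, which is what the second and third questions in the comment ask about.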

Specific comments

1. GRIPHO grid: what is the reason for choosing a Lambert grid rather than a regular lat/lon grid?

2. More details are needed on the GNN design.

- I understand the processor directly operates on the high-res grid, is that right? If yes, it can be computationally expensive; did you consider introducing a coarser-scale grid for the processor?

- Can you confirm the processor consists of 5 message passing layers? How is this number chosen?

- Is there only one mesh level (compared to some approaches that include various refinement layers)?

- P7 L12-13: I guess processor should be replaced by predictor here.

3. Carefully define α, which is used twice, in eq (3.1) and (3.3), but with different meanings and values I suppose. This should be modified (and the definition of α in eq (3.1) should be given).

4. P10 L23: could you clarify what you mean by hourly average precipitation, frequency and intensity? In my interpretation, average precipitation is computed using zero and non-zero precipitation, while intensity is computed with non-zero precipitation only, but I’m not sure. If frequency is defined as the percentage of wet hours, I don’t understand why the unit is mm/hr (Figure 5 and similar).

5. Figure 5: do you have an idea why the intensity is overestimated in JJA? Is it observed for both northern and south-central Italy?

6. Figures 5 and 6: legend is missing for panel (c).

7. Interpretation of results from Figures 5/6: do you have any ideas to explain the overestimation over complex topography (and underestimation in the plains)? Can it be related to the tuning of the QMSE loss? I don’t really see the diffusive behaviour of RC; is it also observed when comparing the power spectra?

8. Figure 7: it does not seem fair to show cases from both training and testing datasets. I recommend showing only events from the testing set.

9. P12 L7-12: the domain transferability is demonstrated to some extent. I am a bit puzzled by the correlations in Table 2, which are significantly lower in the Central-South domain, compared to the North one. Given these results, it may be worth being more cautious in the conclusions.

10. P13 L37: could you clarify the difference you make between downscaling and emulation? I have the feeling both terms are used without distinction throughout the paper, but in the end there seems to be a subtle (and maybe important) difference. This clarification is also linked to major point 1.

11. P13 L45-46: I would be more cautious here, as the downscaling is far from perfect. You could add ‘with a relatively good accuracy’ at the end of the sentence.

12. Figure 9: add legend of panel (b).
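
The three quantities queried in specific comment 4 can be written down as a short Python sketch. It follows the reviewer's interpretation only (average over all hours, frequency as the percentage of wet hours, intensity over wet hours only); the function name `precipitation_stats`, the dictionary keys and the default wet threshold of 0 mm/h are assumptions for illustration, not definitions taken from the manuscript.

```python
from typing import Dict, Sequence


def precipitation_stats(hourly_mm: Sequence[float],
                        wet_threshold: float = 0.0) -> Dict[str, float]:
    """Average [mm/h], frequency [%] and intensity [mm/h] of an hourly series.

    An hour counts as wet when its precipitation is strictly above wet_threshold.
    """
    if not hourly_mm:
        raise ValueError("hourly_mm must contain at least one value")
    wet = [p for p in hourly_mm if p > wet_threshold]
    n_hours = len(hourly_mm)
    return {
        # Mean over all hours, dry hours included.
        "average_mm_h": sum(hourly_mm) / n_hours,
        # Share of wet hours, expressed as a percentage (not in mm/h).
        "frequency_pct": 100.0 * len(wet) / n_hours,
        # Mean over wet hours only; 0.0 when no hour is wet.
        "intensity_mm_h": sum(wet) / len(wet) if wet else 0.0,
    }


stats = precipitation_stats([0.0, 0.0, 2.0, 4.0])
print(stats)
```

With these definitions the average equals frequency (as a fraction) times intensity, so a bias in the average can be attributed to one factor or the other, which is how the diurnal-cycle panels are read in the reports.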

Review: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R0/PR3

Conflict of interest statement

Reviewer declares none.

Comments

Title: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect approach

Authors: Blasone, V., Coppola, E., Sanguinetti, G., Arora, V., Di Gioia, S. and L. Bortolussi

Journal: Environmental Data Science

Nr: EDS-2024-0049

In this manuscript the authors present a novel approach to downscale climate information to local spatiotemporal scales, specifically hourly precipitation at so-called km-scales. These scales are fine enough to resolve sub-daily mesoscale characteristics of precipitation (e.g., diurnal cycles) and short duration, high impact extreme precipitation events. The novelty of the approach hinges on the chosen architecture and the training strategy. Graph Neural Networks (GNN) such as the one employed here have demonstrated considerable flexibility when transferred to domains not seen in training. Spatial transferability is a known challenge in the emerging sub-discipline of AI for climate downscaling and demonstrating this capability represents a significant advance. The training strategy, a so-called Hybrid Imperfect Approach, aims to leverage the advantages conferred by using reanalysis data and observations. The reasons for this are twofold. One, training on observations negates the potential confounding effects of model errors and biases. Two, this training approach has, in principle, better generalization properties as it is not climate model-specific. The authors demonstrate this generalization capability, albeit in a limited way, in their “model-world” analyses. This is also a significant advance as machine learning algorithms trained in the “perfect model” framework often struggle to adapt to new inputs/predictors unseen in training (whether they come from global or regional climate models). A third advance this manuscript presents is the ability of the deep learning algorithm to extrapolate to climate states unseen in training. Specifically, the GNN algorithm can capture the distribution shifts in hourly precipitation under future climate conditions. The shifts in the extremes are well represented, which is a prerequisite for employing these tools in developing ensembles of local scale climate projections.
Extrapolation is a known challenge for machine learning and a long-running and well-documented criticism of traditional empirical statistical downscaling (ESD) techniques. While the example provided here is not a comprehensive demonstration, it is a very promising first step.

While the scientific contribution of the paper is significant, I believe the writing could use some improvement. There are several unclear passages and run-on sentences that detract from its overall impact, some missing discussion points and one analysis that seems poorly justified. Further, the figures are of only middling quality. Improving them would greatly enhance the manuscript. I have several major comments/suggestions and many more minor comments/suggestions. However, I do not believe addressing these will require a major revision.

This is an important contribution to the field. As such, I believe it is worth taking the time to strengthen its impact. This can be done through improved figures, more discussion on strengths and limitations, removal of superfluous and ill-posed analyses, and clearer more concise writing. A note on syntax: in the “specific comments” section I use a P<x>L<y> notation where “P<x>” is page number and “L<y>” is line number(s).

General comments

1. The Impact Statement could be strengthened. I understand that these have strict character/word limits. Therefore, I suggest the authors shorten the first sentence. Modify the second and third sentence to emphasize the main advances (spatial transferability, future extrapolation, generalizability) and why they are important.

2. Throughout the manuscript the authors refer to ERA5 (~25km grid spacing) as “low resolution”. This is incorrect. Its effective resolution is sufficient to resolve synoptic scale circulations. State-of-the-art GCMs at ~50-250km grid spacing are appropriately referred to as “low resolution” while RCMs and CPRCMs are operating at resolutions that may be considered “high” to “very high”, respectively. What the authors effectively demonstrate is a downscaling from “moderate to high” resolution and not “low to high”. If they had shown that the algorithm effectively downscales a GCM to 3km then the latter would apply.

3. The introduction contains a helpful review of perfect model and imperfect frameworks. While it is beyond the scope to go into an in-depth discussion of strengths and weaknesses, a few additional sentences would help the reader understand the motivation for seeking a third way. For example, the perfect model framework does not, in the end, address the prohibitive cost issues that plague CPRCM simulations since it requires long future simulations for training. Also, the authors could link the discussion of bias mitigation, a few paragraphs later, more explicitly to these frameworks. Lastly, the authors provide many references for both perfect and imperfect frameworks, but they neglect to mention approaches that are similar to their own such as the approaches outlined by Hess et al., (2024) and Schmidt et al. (2025). These should be included for completeness.

4. A bit more discussion of GRIPHO is needed. What is the method used for interpolation to the common grid? How is observational uncertainty quantified (e.g., undercatch, instrument errors), if it is quantified at all? I would also suggest the authors include station locations in one of their figures (e.g., Figure 4). I suspect that areas of complex topography will also be areas with more sparse measurements, which leads to unreliable interpolation (see, Lussana et al., 2019). These areas are also where, unsurprisingly, the GNN4CD algorithm exhibits a mismatch with the observations.

5. There is a paragraph just before the description of the emulator that discusses the spatial and temporal length scale assumptions. However, it is unclear in the presentation of the emulator architecture, exactly where and how these elements are implemented. It appears the temporal issues are handled by the RNN pre-processor. The GAT layer presumably handles the spatial influence. However, details about how these layers treat the spatial and temporal scales are missing. Please provide some additional details about these elements of the emulator.

6. Is the empirical tuning of the hyperparameters a problem?

7. I worry that the fourteen years for training, one year for validation and one year for testing is quite unbalanced and may result in overfitting. General guidance in machine learning is a 60-80% share for training, 10-20% for validation, and 10-20% for testing. While I recognize and appreciate that general rules are meant to be broken, the split in this study is closer to 90-5-5. It would be helpful if the authors could defend this choice.

8. I appreciate the attempt by the authors to include an impacts analysis by examining the ability of the emulator to reproduce the precipitation signatures associated with historical flood events. However, I do not think it is appropriate to include results that come from the training period, even if some of the events are outside the training region. Considering the timing of these events (all Oct. or Nov.), even those outside the training region are not likely to be synoptically isolated. Only one of the events is completely out of training period/region. As a solution I propose the authors only include the 2016 event, which they can emphasize shows a promising first result towards events-based, impacts assessments using AI-ML emulators. The authors could then include the other events as supplementary material, noting that inclusion in the main body of the results would not be appropriate but that they largely support the use of GNN4CD in this manner and that further research can/will be done to confirm this.

9. The conclusion section needs to be a bit more measured with respect to the advances conferred by GNN4CD. This is not to detract from their significance – it is really impressive! – but rather, to more helpfully frame knowledge gaps and research directions. I save the specific suggestions for the following section but can generally say that there should be some discussion of the limitations of the present approach. For example, it downscales coarsened CPRCM for future climates, which is not the same as downscaling from an “unseen” GCM, because the “perfect model” setup using coarsened data already contains much of the information needed to reproduce the precipitation signal. Also, the reader only ever sees seasonal performance in the diurnal cycle plots. All other figures show aggregated results. As such we don’t know if the biases, for example, are uniform in sign and magnitude across different seasons or if the aggregated pattern arises from a single season. Such information can help discern where the emulator struggles and what processes it captures well.
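
As a quick check of the arithmetic behind general comment 7, the sketch below converts the fourteen/one/one year split into percentages. The function name `split_percentages` is hypothetical and the calculation assumes every year contributes the same number of samples, which ignores leap years and any missing time steps.

```python
from typing import Tuple


def split_percentages(train_years: int, val_years: int, test_years: int) -> Tuple[float, ...]:
    """Share of each subset, in percent, assuming equally sized years."""
    total = train_years + val_years + test_years
    return tuple(round(100.0 * years / total, 2) for years in (train_years, val_years, test_years))


shares = split_percentages(14, 1, 1)
print(shares)  # (87.5, 6.25, 6.25)
```

The exact figures, 87.5% / 6.25% / 6.25%, are consistent with the "closer to 90-5-5" description and sit outside the 60-80% / 10-20% / 10-20% guidance quoted in the comment.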

Specific comments

1. P1L24: spell out GNN4CD.

2. P1L29: replace “low-resolution” with “medium-resolution”.

3. P1L38: replace “effect” with “impacts”.

4. P2L30: replace “intense” with “expensive”.

5. P2L31: replace “prohibitive” with “prohibitively expensive”.

6. P2L36-38: The authors imply that ML can “improve” upon traditional models. I am unaware of any results that show emulators exhibit improvements over RCMs. I could be wrong but would encourage the authors to include a reference if this is the case.

7. P3L5: The Addison study is now on arXiv (see references below).

8. P3L19-21: The HIA approach is not well defined. It would be helpful if the authors could be more precise in its definition and describe (briefly) why it is a promising alternative to the purely “perfect” and “imperfect” training frameworks.

9. P5L34: I just wanted to state that, overall, the description of the GNN4CD emulator is excellent. Well-balanced in terms of technical detail and clearly and logically organized.

10. P7L47-48: The authors state that the scaling factor helps representation of rare events. But if the Classifier is solely focused on a binary categorization of rain/no-rain, how can it even begin to discern extreme events?

11. P9L29-32: Since the authors justify the use of a manual tuning of the hyper-parameters by invoking the “cost” of training, it would be helpful to know what this cost is. Also, it might be useful to describe the advantages/disadvantages of an automatic tuning approach?

12. P11L24-27: This is a run-on sentence, and its meaning is not entirely clear. I suggest splitting the sentence in two and taking care with the sentence structure. Here is a suggested rewording: “The RC model exhibits a larger bias in average JJA precipitation compared to other seasons (Figure 5c). This arises predominantly due to too high precipitation intensities. The R-all model also exhibits large biases in JJA average precipitation; in this case too frequent precipitation is the main contributor (Figure A1c).” Hopefully, this kind of restructuring can be helpful in other sections of the manuscript where the sentences are too long and try to include too many details.

13. P11L27-29: The peak in the JJA diurnal precipitation is quite clearly late afternoon/early evening 17-1800.

14. P11L31-34: Another confusing sentence. Precipitation is clearly overestimated over areas of complex topography. But what does “where observations are higher” mean? As an aside this section is where a more robust consideration of the limitations of the observational dataset could fit (see general comment 4).

15. P11L40: Add comma after “Conversely”.

16. P12L45-48: This is a nice result and augurs well for the emulator’s ability to extrapolate to unseen/unknown future climates. However, I think the authors need to temper this statement, as the test is performed in a perfect model setup and, as such, considerable information about the resulting precipitation distribution is already contained in the coarsened data (so-called upscaled added value). A more robust test would have been to take a GCM, or an uncoarsened RCM, as input.

17. P13L11-12: The authors need to be a bit more transparent here. The emulator is not just “wetter”, it actually exhibits opposite signed responses over the entirety of Italy for p99, and specific areas for average change. Interestingly, there is good agreement for p99.9, which raises important questions about the stability of the emulator. These inconsistencies should not be viewed as negatives. Rather they serve an important purpose in highlighting knowledge gaps.

18. P13L13: Delete “instead”.

19. P13L29: Replace “to” with “with”.

20. P13L37: Replace “low-resolution” with “medium-resolution”.

21. P13L41: “biases”.

22. P13L42-43: Re-word. I suggest, “This training strategy, which we refer to as HIA, should facilitate the ability of the emulator to generalize to climates and models unseen in training.”

23. P13L47: Replace “leaded” with “leads”.

24. P13L52: Delete “instead”.

25. P14L13-15: Another confusing run-on sentence. This is an important implication, so it is well worth re-wording. Suggestion: “This is important as spatial transferability is a unique feature of this emulator and has the potential to extend the emulator’s application to remote and/or data sparse regions of the world.”

26. P14L20-22: I’ve often wondered why large resolution jumps are a problem in ML. In dynamical downscaling the problem is clear. In traditional ESD we often make jumps from hundreds of kilometers to point scales. The authors needn’t address this in the manuscript, but I am curious. Hess et al., (2024) for example, claim (implicitly) that the resolution jump isn’t a problem; the only limitation is the resolution of the training “target” dataset.

27. P14L23-26: Run-on sentence. Consider breaking in two. Suggestion, “These future research directions will help further establish the effectiveness and reliability of GNN4CD emulator. Doing so will put high-resolution ensembles of climate projections, generated at a fraction of the cost and time compared to dynamical methods, within reach.”
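As an illustration of the check implied by comment 17, the sign agreement between the emulator's and the reference model's change signals could be tested percentile by percentile along these lines. This is only a minimal sketch, not the authors' evaluation code: the function names, the nearest-rank percentile definition, and the toy data are all assumptions made for the example.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q in 0-100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def percent_change(historical, future, q):
    """Percent change of the q-th percentile from historical to future."""
    base = percentile(historical, q)
    return 100.0 * (percentile(future, q) - base) / base


def same_sign(change_a, change_b):
    """True when both change signals are non-zero and share a sign."""
    return change_a * change_b > 0


# Toy hourly precipitation values (hypothetical, mm/h).
historical = list(range(1, 101))
reference_future = [2 * v for v in historical]  # reference becomes wetter
emulator_future = [v / 2 for v in historical]   # emulator becomes drier

ref_change = percent_change(historical, reference_future, 99)
emu_change = percent_change(historical, emulator_future, 99)
agree = same_sign(ref_change, emu_change)
```

Running the same comparison for the mean, p99, and p99.9 at every grid point would show where the two change signals disagree in sign rather than only in magnitude.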

<b>Figures</b>

The figures could use some improvement. Doing so will greatly increase the impact of the manuscript. I only comment on the figures in the main body of the text, but my points also apply to the figures in the appendix.

Figure 4. This figure takes up a lot of space yet communicates very little. I suggest adding e.g., station locations to the map in panel a. Panel b is unnecessary and can be removed. It is also misleading because 2007-2016 is in the “training” period but the caption states it is in the “inference” period.

Figure 5. The PDFs are nearly indecipherable. I suggest taking the approach of Addison et al. (2024); see their Figure 3a. This will show the separation and/or overlap more clearly. The yellow lines are very faint. Choose a color with better contrast. In panel c the frequency is shown in mm/hr. This is incorrect. Frequency should be either a fraction or a percentage. It is also unclear whether the diurnal cycles are computed over the training region or all of Italy. Lastly, the caption is missing critical details.

Figures 6,7,9. It is difficult to tell due to the small size of some of the panels, but it appears that the “rainbow” colormaps are not perceptually uniform (quite evident in Figures 6 & 9). Some alternatives can be found here: https://colorcet.com/.

Figure 8. See comments about the PDFs in Figure 1. The issue is even worse here as the authors are trying to show the shift in the future distribution by the CPRCM and how well the emulator reproduces it. It is impossible to discern this as it is currently displayed.

Figure 10. I suggest showing box-plots as they contain more information than just mean + 95% confidence interval. Also, the lines/colors in the bottom panels are unreadable. Lastly, the caption should state whether these are computed over all of Italy or just the training region.
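To make the unit point in the Figure 5 comment concrete, a diurnal cycle of precipitation frequency can be expressed as the percentage of wet time steps per hour of day. The sketch below assumes a hypothetical wet threshold of 0.1 mm/h and a hypothetical input layout of (hour, value) pairs; neither is taken from the manuscript.

```python
def wet_hour_frequency(hourly_precip, threshold=0.1):
    """Percentage of wet time steps for each hour of day (0-23).

    hourly_precip: iterable of (hour_of_day, precip_mm_per_hr) pairs.
    threshold: assumed wet threshold in mm/h (hypothetical value).
    Hours with no samples are returned as NaN.
    """
    wet = [0] * 24
    total = [0] * 24
    for hour, value in hourly_precip:
        total[hour] += 1
        if value >= threshold:
            wet[hour] += 1
    return [100.0 * w / t if t else float("nan") for w, t in zip(wet, total)]


# Toy sample: two time steps at 00h, two at 01h, one at 02h.
sample = [(0, 0.0), (0, 0.5), (1, 0.2), (1, 0.3), (2, 0.0)]
freq = wet_hour_frequency(sample)
```

The result is dimensionless (percent), whereas the corresponding intensity cycle, averaged over wet time steps only, would carry units of mm/h.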

<b>References</b>

Addison, H., Kendon, E., Ravuri, S., Aitchison, L., & Watson, P. A. (2024). Machine learning emulation of precipitation from km-scale regional climate simulations using a diffusion model (No. arXiv:2407.14158). arXiv. https://doi.org/10.48550/arXiv.2407.14158

Hess, P., Aich, M., Pan, B., & Boers, N. (2024). Fast, Scale-Adaptive, and Uncertainty-Aware Downscaling of Earth System Model Fields with Generative Foundation Models (No. arXiv:2403.02774). arXiv. https://doi.org/10.48550/arXiv.2403.02774

Lussana, C., Tveito, O. E., Dobler, A., & Tunheim, K. (2019). seNorge_2018, daily precipitation, and temperature datasets over Norway. Earth System Science Data, 11(4), 1531–1551. https://doi.org/10.5194/essd-11-1531-2019

Schmidt, J., Schmidt, L., Strnad, F., Ludwig, N., & Hennig, P. (2025). A Generative Framework for Probabilistic, Spatiotemporally Coherent Downscaling of Climate Simulation (No. arXiv:2412.15361). arXiv. https://doi.org/10.48550/arXiv.2412.15361

Recommendation: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R0/PR4

Comments

No accompanying comment.

Decision: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R0/PR5

Comments

No accompanying comment.

Author comment: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R1/PR6

Comments

Dear Editor-in-Chief,

We are pleased to submit the revision of our manuscript entitled "Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework". We believe that the Reviewers' comments have been very helpful in improving and enriching the work, and we hope that we have responded to all of them appropriately.

We confirm that this manuscript is original, has not been published, and is not currently being considered for publication elsewhere. All authors have approved the manuscript and agree with its submission.

Thank you for considering this manuscript for publication.

Sincerely,

Valentina Blasone

On behalf of all the Authors

Review: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R1/PR7

Conflict of interest statement

No competing interests.

Comments

I would like to thank the authors for their thoughtful and thorough responses to my comments. I am satisfied that the manuscript is ready for publication.

Review: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R1/PR8

Conflict of interest statement

Reviewer declares none.

Comments

I thank the authors for the revision and answers to all my comments. I’m overall happy with the revision, and I only have some residual minor comments detailed below.

• 1. P6L49 ‘The term hybrid refers to the use of different domains’: in this context ‘domain’ can be misleading as it can also refer to geographical domain (which could apply in your case).

• 2. P8L33-24 ‘time steps with only targets below the threshold are removed, reducing the dataset to approximately 50% of its original size’: you don’t apply the same rule to the R-all case?

• 3. Figure 7: for each case it would be interesting to know how extreme the event is (for instance, is it a 90th or 99th percentile event?)

• 4. P14L42 ‘with slight overestimation of the precipitation estimate’: I wouldn’t say case (2) is a slight overestimation! This case would merit more in-depth investigation (although I understand that it is beyond the scope of this paper).

• 5. Figures 8a and 8d: you should be cautious in comparing with GRIPHO. It does not cover the same period as the GNN4CD and RegCM estimates, so it cannot be considered the ground truth here.

• 6. P15 L27 than, L37 These findings

• 7. Figure 10: the legend indicates that Bias % is computed against GRIPHO, but it would make more sense to compute it against RegCM (is it just a typo?).

• 8. I would add somewhere that choosing to emulate the most extreme RCP8.5 scenario makes the task particularly challenging for GNN4CD, and the results even more remarkable.
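Regarding comment 7 above, the choice of reference matters because a percent bias is only meaningful relative to the dataset it is computed against. The following is a minimal sketch with made-up numbers; the function name and the toy series are hypothetical and do not come from the manuscript.

```python
def percent_bias(estimate, reference):
    """Percent bias of the mean of `estimate` relative to the mean of `reference`."""
    ref_mean = sum(reference) / len(reference)
    est_mean = sum(estimate) / len(estimate)
    return 100.0 * (est_mean - ref_mean) / ref_mean


# Toy values (hypothetical, mm/h); they do not come from any dataset.
emulator = [2.0, 4.0, 6.0]        # mean 4.0
driving_model = [1.0, 3.0, 8.0]   # mean 4.0
observations = [1.0, 2.0, 3.0]    # mean 2.0

bias_vs_model = percent_bias(emulator, driving_model)
bias_vs_obs = percent_bias(emulator, observations)
```

The same emulator output is unbiased against one reference and strongly biased against the other, which is why the legend should name the reference actually used.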

Recommendation: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R1/PR9

Comments

No accompanying comment.

Decision: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R1/PR10

Comments

No accompanying comment.

Author comment: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R2/PR11

Comments

Dear Editor-in-Chief,

I am pleased to submit the revised version of our manuscript. We believe that this version further improves the quality and clarity of the manuscript.

Thank you for considering this manuscript for publication in the Environmental Data Science journal.

Sincerely,

Valentina Blasone

On behalf of all the co-authors

Review: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R2/PR12

Conflict of interest statement

Reviewer declares none.

Comments

I thank the authors for their answers to my last comments.

Recommendation: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R2/PR13

Comments

No accompanying comment.

Decision: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R2/PR14

Comments

No accompanying comment.