
How reliable are retrieval-augmented and standard ChatGPT models to support flood susceptibility mapping?

Published online by Cambridge University Press:  21 April 2026

Ali Pourzangbar*
Affiliation:
Institute for Water and Environment (IWU), Karlsruhe Institute of Technology (KIT) , Karlsruhe, Germany
Mário J. Franca
Affiliation:
Institute for Water and Environment (IWU), Karlsruhe Institute of Technology (KIT) , Karlsruhe, Germany
*
Corresponding author: Ali Pourzangbar; Email: ali.pourzangbar@kit.edu

Abstract

This paper evaluates the performance of baseline and domain-augmented ChatGPT models for literature-based knowledge support in flood susceptibility mapping (FSM) using machine learning (ML) approaches. To assess this, we designed five key questions related to FSM, with benchmark responses derived from our comprehensive review article (Pourzangbar et al., Journal of Flood Risk Management 18, e70042), which analyzed 100 studies on ML applications in FSM. The same questions were posed (i) to standard ChatGPT-4 and ChatGPT-4o models without additional contextual material, and (ii) to a domain-augmented GPT-4 configuration (Chat-FSM) equipped with retrieval access to the 100 reviewed articles. The comparison highlights that GPT-based models can reasonably reproduce frequently reported machine learning models and conditioning factors from the reviewed literature, but show weaker consistency in feature selection methods, often suggesting less relevant techniques. Among the models, ChatGPT-4o showed the weakest alignment with the benchmark data, while Chat-FSM achieved the highest agreement across most evaluated questions. In terms of application-level efficiency, GPT models required substantially less time and computational effort than manual literature synthesis under the defined experimental setup. While ChatGPT-based systems can support literature-informed exploration in FSM, human expertise remains essential for critical reasoning, methodological design, and application to novel or context-specific scenarios.

Information

Type
Position Paper
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open data
Copyright
© The Author(s), 2026. Published by Cambridge University Press

Table 1. Five key questions designed in this study to assess the agreement of different ChatGPT configurations with the benchmark dataset


Figure 1. The procedure used in this study to evaluate the agreement of different ChatGPT configurations with the FSM benchmark dataset.


Table 2. Three out-of-corpus questions designed to evaluate the generalization behavior of Chat-FSM through comparison with an independent benchmark study (Zhuang et al., 2026)


Table 3. Inference statistics of different ChatGPT models considering Q1


Table 4. Inference statistics of different ChatGPT models considering Q2


Table 5. Inference statistics of different ChatGPT models considering Q3


Table 6. Inference statistics of different ChatGPT models considering Q4

Supplementary material: File

Pourzangbar and Franca supplementary material 1

Pourzangbar and Franca supplementary material
Download Pourzangbar and Franca supplementary material 1 (File)
File 56 KB
Supplementary material: File

Pourzangbar and Franca supplementary material 2

Pourzangbar and Franca supplementary material
Download Pourzangbar and Franca supplementary material 2 (File)
File 1.1 MB

Author comment: How reliable are retrieval-augmented and standard ChatGPT models to support flood susceptibility mapping? — R0/PR1

Comments

Dear Professor Monteleoni,

We would like to submit a paper for possible publication in Environmental Data Science. The manuscript is entitled:

“How reliable is ChatGPT as virtual assistant in Flood Susceptibility Mapping?”

By Dr. Ali Pourzangbar & Prof. Mário J. Franca

Institute for Water and Environment (IWU), Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

ali.pourzangbar@kit.edu; mario.franca@kit.edu

In the current contribution (with two appendices), we explored the growing role and reliability of Large Language Models, i.e., ChatGPT, in answering key questions on the use of Machine Learning models in Flood Susceptibility Mapping (FSM). We designed five key questions related to FSM, and the benchmark responses were extracted from our review article (published in the Journal of Flood Risk Management) titled “Analysis of the Utilization of Machine Learning to Map Flood Susceptibility” (https://doi.org/10.1111/jfr3.70042), which analyzed 100 articles on the application of Machine Learning in FSM. These responses serve as benchmark data. The same questions were then posed (i) directly to ChatGPT-4 and ChatGPT-4o with no further information, and (ii) to ChatGPT-4 using the articles from the review as reference material (named Chat-FSM).

Our intention with this manuscript is to contribute to the discussion on how scientists and engineers working in hydrological sciences can make professional use of AI tools such as ChatGPT. We have significantly expanded our discussion on how reliable ChatGPT’s responses are and how they can benefit flood managers and practitioners. Additionally, we analyzed the reasons behind the discrepancies between different ChatGPT models and their implications for the reliability of Flood Susceptibility Mapping (FSM).

We believe this paper will offer valuable insights to the journal’s readership, especially given the increasing interest in applying AI to flood management. Thank you for considering our manuscript for publication.

Sincerely yours,

Ali Pourzangbar

Karlsruhe Institute of Technology (KIT)

Review: How reliable are retrieval-augmented and standard ChatGPT models to support flood susceptibility mapping? — R0/PR2

Conflict of interest statement

"None"

Comments

This paper tackles an interesting and timely question about ChatGPT’s potential role in Flood Susceptibility Mapping. I appreciate the systematic comparison across different GPT versions and the energy efficiency analysis, which adds a sustainability angle often overlooked in AI discussions.

However, I have concerns about a mismatch between what the paper claims to evaluate and what the validation actually demonstrates. The title and conclusions suggest an assessment of ChatGPT’s reliability as a “virtual assistant” for FSM more broadly, but the validation design specifically tests information retrieval from a curated set of 100 papers. While related, these are quite different capabilities.

The core finding (that Chat-FSM outperforms base GPT-4 and GPT-4o) is somewhat expected given the design. Chat-FSM has direct access to the same 100 papers that generated the benchmark data, so it’s essentially being tested on its training material. This demonstrates the well-established benefit of retrieval-augmented generation (RAG), but doesn’t necessarily validate “reliability” in the broader sense of being a dependable assistant for real-world FSM applications.

Specific Recommendations:

1. Reframe the scope to match what was tested. The paper would be stronger if it clearly positioned itself as evaluating information retrieval and summarization from FSM literature, rather than making broader claims about reliability as a virtual assistant. The current framing oversells what the validation actually demonstrates.

2. Be explicit about what the comparison shows. I’d recommend stating upfront that this experiment demonstrates how domain-specific knowledge bases improve retrieval accuracy for literature-based queries (a useful finding in itself) while acknowledging the limitations for novel applications or contexts not represented in the training corpus.

3. Test generalization beyond the knowledge base. To truly assess reliability as an assistant, include at least one test case where Chat-FSM must apply principles to a scenario not explicitly covered in the 100 papers. This could be a new geographic region, an emerging data source (like social media flood reports), or a methodological challenge that has emerged since 2023.

4. Distinguish information retrieval from analytical reasoning. Questions like “What are the most common ML models?” test factual recall and ranking, which are essentially retrieval tasks. But questions like “What model should I use for flood mapping in a data-scarce urban catchment?” require synthesis, critical judgment, and context-specific recommendations. The paper conflates these different capabilities under “reliability.” Clarifying this distinction would sharpen the contribution.

Additional Technical Concerns:

Reproducibility: Please provide any system prompts or configuration details, and clarification on whether this uses OpenAI’s custom GPT feature, API-based RAG, or actual fine-tuning. The distinction matters for readers assessing the technical contribution.

Terminology: The paper uses “trained” when describing Chat-FSM’s development, which typically implies fine-tuning model weights. If this is actually document upload to a custom GPT (which seems likely given the GitHub link is actually a ChatGPT share link), please use more precise terminology to avoid confusion.

Overall Assessment

This work makes a useful contribution by demonstrating how domain-specific knowledge bases can enhance ChatGPT’s performance on literature retrieval tasks in FSM. The comparison across GPT versions is informative, and the energy efficiency analysis adds valuable perspective. However, the claims about “reliability as a virtual assistant” need to be carefully aligned with what the validation actually tested. The current design validates retrieval accuracy, not the broader analytical and application capabilities that “virtual assistant” implies.

With clearer framing of scope, acknowledgment of the circular validation design, and at least one generalization test, this would be a solid contribution to the position paper literature on AI tools in environmental science. As written, the gap between claims and validation is too large for publication without revision.

Recommendation: How reliable are retrieval-augmented and standard ChatGPT models to support flood susceptibility mapping? — R0/PR3

Comments

While the reviewer notes that the paper can be a useful contribution, there are important points that must be addressed prior to further consideration.

Decision: How reliable are retrieval-augmented and standard ChatGPT models to support flood susceptibility mapping? — R0/PR4

Comments

No accompanying comment.

Author comment: How reliable are retrieval-augmented and standard ChatGPT models to support flood susceptibility mapping? — R1/PR5

Comments

Dear Dr. Brajard,

Please find enclosed, as a re-submission for possible publication in Environmental Data Science, the manuscript entitled “How reliable are retrieval-augmented and standard ChatGPT models to support flood susceptibility mapping?” by Ali Pourzangbar and Mário J. Franca.

We sincerely thank you and the reviewer for the constructive and insightful comments provided on our manuscript. The feedback has been instrumental in strengthening the conceptual clarity, methodological rigor, and technical transparency of the study.

In response to the reviewer’s concerns, we have undertaken substantial revisions to ensure full alignment between the study’s claims and its validation design. The manuscript has been comprehensively reframed to clarify that it presents a structured benchmarking analysis of literature consistency rather than a broad assessment of virtual assistant reliability. Specifically, we revised the Abstract, Introduction, Methodology, Discussion, and Conclusions to emphasize agreement with a literature-derived benchmark using quantitative metrics (Jaccard Index and Kendall’s Tau), thereby precisely reflecting what was empirically tested. We have also explicitly clarified the technical implementation of Chat-FSM.
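For readers unfamiliar with the two agreement metrics named above, a minimal sketch of how they can be computed for this kind of comparison is shown below. The Jaccard Index measures set overlap between the items a model lists and the benchmark list, while Kendall’s Tau measures how consistently the shared items are ranked. The example lists are purely illustrative placeholders, not results from the paper.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index: overlap of two item sets, ignoring order."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def kendall_tau(rank_a, rank_b):
    """Kendall's tau over the items that appear in both ranked lists.

    List position serves as the rank; pairs ordered the same way in
    both lists are concordant, pairs ordered oppositely are discordant.
    """
    common = [x for x in rank_a if x in rank_b]
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        s = (rank_a.index(x) - rank_a.index(y)) * (rank_b.index(x) - rank_b.index(y))
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(common) * (len(common) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0

# Illustrative (hypothetical) ranked lists of ML models for FSM:
benchmark = ["Random Forest", "SVM", "ANN", "XGBoost"]
gpt_answer = ["Random Forest", "ANN", "SVM", "Decision Tree"]

print(jaccard(benchmark, gpt_answer))      # 0.6  (3 shared of 5 distinct)
print(kendall_tau(benchmark, gpt_answer))  # ~0.333 (SVM/ANN pair swapped)
```

Higher values on both metrics indicate closer agreement with the benchmark: Jaccard captures whether a model names the right items at all, and Kendall’s Tau whether it orders the shared items consistently.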

The manuscript now clarifies that Chat-FSM was developed using OpenAI’s Custom GPT interface with retrieval augmentation from a defined corpus of 100 reviewed articles. No model weight fine-tuning or API-based customization was performed. The exact prompts and full model responses are provided in the Appendices, and public access to the Chat-FSM configuration is included in the Data Availability section. To address concerns regarding potential circular validation, we additionally conducted an out-of-corpus evaluation using an independent FSM benchmark study not included in the retrieval corpus, providing a preliminary assessment of generalization beyond the configured knowledge base.

All other reviewer comments have been carefully addressed in a detailed point-by-point response accompanying this submission.

We greatly appreciate the opportunity to revise and resubmit our work and look forward to your consideration. Given that this research belongs to a fast-evolving field, we would appreciate a swift handling of this resubmission.

Sincerely yours,

Ali Pourzangbar

Karlsruhe Institute of Technology

Review: How reliable are retrieval-augmented and standard ChatGPT models to support flood susceptibility mapping? — R1/PR6

Conflict of interest statement

Reviewer declares none.

Comments

The authors have addressed all major concerns. The manuscript has now been edited appropriately to reframe “reliability as a virtual assistant” as an assessment of “literature-based knowledge support,” with revisions throughout the title, abstract, methodology, results, and conclusions.

Key improvements:

• Authors have used correct terminology by distinguishing retrieval-augmented generation from model training/fine-tuning

• Authors have added out-of-corpus generalization testing that clearly demonstrated both capabilities and limitations

• There is a clear difference between literature retrieval tasks and context-dependent analytical reasoning

• There is discussion on the fact that performance is constrained by corpus coverage

• Detailed methodology and appendix documentation will help in reproducibility.

Recommendation: How reliable are retrieval-augmented and standard ChatGPT models to support flood susceptibility mapping? — R1/PR7

Comments

No accompanying comment.

Decision: How reliable are retrieval-augmented and standard ChatGPT models to support flood susceptibility mapping? — R1/PR8

Comments

No accompanying comment.