
Toward HydroLLM: a benchmark dataset for hydrology-specific knowledge assessment for large language models

Published online by Cambridge University Press:  02 June 2025

Dilara Kizilkaya
Affiliation:
IIHR - Hydroscience and Engineering, University of Iowa, Iowa City, IA, USA; Computer Science, University of Iowa, Iowa City, IA, USA
Ramteja Sajja*
Affiliation:
IIHR - Hydroscience and Engineering, University of Iowa, Iowa City, IA, USA; Electrical and Computer Engineering, University of Iowa, Iowa City, IA, USA
Yusuf Sermet
Affiliation:
IIHR - Hydroscience and Engineering, University of Iowa, Iowa City, IA, USA
Ibrahim Demir
Affiliation:
River-Coastal Science and Engineering, Tulane University, New Orleans, LA, USA; ByWater Institute, Tulane University, New Orleans, LA, USA
Corresponding author: Ramteja Sajja; Email: ramteja-sajja@uiowa.edu

Abstract

The rapid advancement of large language models (LLMs) has enabled their integration into a wide range of scientific disciplines. This article introduces a comprehensive benchmark dataset specifically designed for testing recent LLMs in the hydrology domain. Leveraging a collection of research articles and hydrology textbooks, we generated a wide array of hydrology-specific questions in various formats, including true/false, multiple-choice, open-ended, and fill-in-the-blank. These questions serve as a robust foundation for evaluating the performance of state-of-the-art LLMs, including GPT-4o-mini, Llama3:8B, and Llama3.1:70B, in addressing domain-specific queries. Our evaluation framework employs accuracy metrics for objective question types and cosine similarity measures for subjective responses, ensuring a thorough assessment of the models’ proficiency in understanding and responding to hydrological content. The results underscore both the capabilities and limitations of artificial intelligence (AI)-driven tools within this specialized field, providing valuable insights for future research and the development of educational resources. By introducing HydroLLM-Benchmark, this study contributes a vital resource to the growing body of work on domain-specific AI applications, demonstrating the potential of LLMs to support complex, field-specific tasks in hydrology.
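
For illustration, the evaluation described above reduces to two scoring modes: exact-match accuracy for objective items and embedding cosine similarity for subjective ones. The following is a minimal Python sketch under stated assumptions; the sentence-transformers model named here is an assumption for demonstration, not necessarily the embedding model used in the study.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def exact_match_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Accuracy for objective items (true/false, multiple-choice)."""
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

def semantic_similarity_scores(predictions: list[str], answers: list[str]) -> list[float]:
    """Cosine similarity between model answers and reference answers for
    subjective items (open-ended, fill-in-the-blank)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    pred_emb = model.encode(predictions)
    ans_emb = model.encode(answers)
    return [float(cosine_similarity([p], [a])[0][0])
            for p, a in zip(pred_emb, ans_emb)]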

Information

Type
Data Paper
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Table 1. Sample questions from HydroLLM-Benchmark categorized by source and question type

Figure 1. Conceptual overview of HydroLLM-Benchmark.

Figure 2. Post-processing, model output generation, and scoring.

Figure 3. Accuracy scores for true/false Q&A.

Figure 4. Accuracy scores for multiple-choice Q&A.

Figure 5. Cosine similarity scores for fill-in-the-blanks Q&A.

Figure 6. Cosine similarity scores for open-ended Q&A.

Author comment: Toward HydroLLM: a benchmark dataset for hydrology-specific knowledge assessment for large language models — R0/PR1

Comments

Editorial Office

Environmental Data Science

Subject: Submission of Manuscript for Consideration in Environmental Data Science

Dear Editors,

We are pleased to submit our manuscript, “Towards HydroLLM: A Benchmark Dataset for Hydrology-Specific Knowledge Assessment for Large Language Models,” for consideration in Environmental Data Science. Our study introduces HydroLLM-Benchmark, a comprehensive dataset designed to evaluate the performance of large language models (LLMs) in hydrology-specific tasks. Given the increasing reliance on artificial intelligence (AI) for environmental research and decision-making, our work provides a critical resource for assessing the capabilities and limitations of LLMs in hydrological applications.

Our research addresses a significant gap in the intersection of AI and environmental sciences by developing and evaluating domain-specific benchmarks for LLMs. Using a diverse set of questions derived from research articles and textbooks, we assess the performance of state-of-the-art LLMs—including GPT-4o-mini, Llama3:8B, and Llama3.1:70B—across various hydrology-related question types. By leveraging accuracy and semantic similarity metrics, our evaluation framework highlights the strengths and weaknesses of these models, offering insights for future AI-driven hydrology applications.

We believe that our study aligns well with the scope of Environmental Data Science, particularly in the areas of AI applications in environmental research, benchmark dataset development, and domain-specific machine learning evaluations. The findings of this study contribute to the broader discourse on integrating AI technologies into environmental sciences, with potential applications in hydrological modeling, climate resilience planning, and water resource management.

We affirm that this manuscript is original, has not been published elsewhere, and is not under consideration by any other journal. All authors have approved the final version of the manuscript, and there are no conflicts of interest to disclose.

We appreciate your time and consideration, and we look forward to your feedback. Please do not hesitate to contact us if you require any further information.

Sincerely,

Ramteja Sajja (Corresponding Author)

IIHR - Hydroscience and Engineering, University of Iowa

Email: ramteja-sajja@uiowa.edu

Review: Toward HydroLLM: a benchmark dataset for hydrology-specific knowledge assessment for large language models — R0/PR2

Conflict of interest statement

Reviewer declares none.

Comments

The manuscript is well-written, well-structured, and highly relevant to the scope of Environmental Data Science Journal. The introduction of the HydroLLM-Benchmark dataset is a significant contribution to the field, addressing a critical gap in evaluating large language models (LLMs) for hydrology-specific applications. The clarity of your methodology, the robustness of the evaluation framework, and the open-source availability of the dataset make this work a valuable resource for both AI and hydrological research communities.

1) You evaluate GPT-4o-mini, LLaMA-3 8B, and LLaMA-3.1 70B “out-of-the-box.” Please explain why these three models (in particular the “mini” GPT-4o variant) were selected over other comparable open or commercial LLMs. A concise rationale (e.g., parameter size spectrum, licensing, multimodal capabilities) would help readers understand the benchmark’s scope and limits. For instance, LLaMA-3.1 has multiple variants (e.g., instruct-tuned vs. base, different context window sizes, etc.). Why was the 70B variant chosen, and what prompted the use of that specific configuration (e.g., base vs. instruction-tuned)? Were other variants considered?

2) In Section 4.3.1, you mention the bias in GPT-4o-mini towards selecting “B” for Multiple Choice questions. While the mitigation steps are well-described, it would be helpful to discuss why this bias might have occurred (e.g., model training artifacts, prompt design) and its implications for other researchers using LLMs for question generation.
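
For readers probing similar effects, one hypothetical way to separate positional bias from knowledge gaps is to re-ask each multiple-choice item with the options randomly permuted and inspect the distribution of chosen letters; a skew that survives shuffling points to answer position rather than content. A minimal Python sketch follows; the helper names are illustrative and not from the paper.

import random
from collections import Counter

LETTERS = ["A", "B", "C", "D"]

def shuffled_item(question: str, options: list[str], answer_idx: int):
    """Return the question with options in a random order and the letter
    that now labels the correct option."""
    order = random.sample(range(len(options)), len(options))
    prompt = question + "\n" + "\n".join(
        f"{LETTERS[i]}. {options[j]}" for i, j in enumerate(order))
    correct_letter = LETTERS[order.index(answer_idx)]
    return prompt, correct_letter

def letter_distribution(responses: list[str]) -> Counter:
    """Count which letters a model picked across shuffled re-asks; a heavy
    skew toward one letter suggests positional bias rather than knowledge."""
    return Counter(r.strip()[0].upper() for r in responses if r.strip())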

3) The Introduction highlights the importance of multimodal data in hydrology (e.g., textual, visual, numerical). However, the dataset and evaluation focus solely on text-based questions. Could you discuss in Section 4 or 5 whether future iterations of HydroLLM-Benchmark plan to incorporate multimodal elements (e.g., questions based on hydrological maps or satellite imagery)?

4) Section 4.2 emphasizes HydroLLM-Benchmark as a living dataset and encourages community contributions. To make this vision more actionable, consider outlining specific mechanisms for community engagement (e.g., GitHub contribution guidelines, planned workshops, or collaboration platforms). This would provide a clearer roadmap for researchers and practitioners interested in contributing to the dataset’s evolution.

5) I encourage the authors to briefly discuss how the HydroLLM-Benchmark, or a future extension, could be adapted to assess LLMs in operational hydrology studies such as flood forecasting, drought monitoring, or disaster response. This would enhance the broader relevance of the study and align with real-world hydrological decision-making needs. I also suggest the authors consider recent related work where LLMs were applied to operational tasks. For example: “FF-BERT: A BERT-based ensemble for automated classification of flash flood reports.”, “Can Large Language Models Effectively Reason About Adverse Weather Conditions?”, “A Hybrid Machine Learning Pipeline for Automated Mapping of Events and Locations From Social Media in Disasters.” These studies provide concrete use cases where LLMs are already contributing to early warning systems or disaster informatics. Comparing results or methods with such applications could help deepen the discussion in Section 4 and emphasize the translational value of HydroLLM-Benchmark.

6) The dataset is primarily sourced from Fundamentals of Hydrology (Davie, 2019) and Elsevier journals (2022–2024). While this ensures recent and authoritative coverage, have the authors considered incorporating additional textbooks or open-access hydrology resources (e.g., USGS reports, IPCC assessments) to improve domain diversity?

7) Since Meta has recently released LLaMA 3.2 and LLaMA 4, which include updated model variants and improvements in instruction-following and multilingual reasoning, I suggest the authors briefly acknowledge this in the discussion or conclusion. While I understand that the current study was completed prior to the release of recent LLaMA variants, noting it as a direction for future benchmarking efforts would help keep the paper forward-looking and relevant to ongoing developments in the LLM space.

Minor comments

- “…Leveraging a collection of research articles and hydrology textbook and, we generated…” → delete “and”.

- Figures 1 and 2: Add a legend explaining the colored blocks; enlarge text for readability.

- Specify GPU memory (e.g., “NVIDIA L40S”) for clearer hardware guidance.

- In Section 2.3, the sentence “Across multiple iterations, we developed 1,124 research article and 224 book True/False questions…” could be clarified by specifying whether “book” refers to the textbook or another source.

- Ensure consistency in terminology (e.g., “Fill-in-the-Blanks” vs. “Fill in the Blanks”) throughout the manuscript.

- The bias toward option “B” (70.4%) is interesting. Was this phenomenon observed across all models or only GPT-4o-mini? A brief exploration of potential causes (e.g., prompt structure, token probabilities) would be useful.

Overall, this is a strong and timely contribution. Addressing these points would further elevate the manuscript’s rigor and applicability.

Recommendation: Toward HydroLLM: a benchmark dataset for hydrology-specific knowledge assessment for large language models — R0/PR3

Comments

This is an interesting article, and I would like to see it progress toward publication. I would request the authors to address the reviewer comments and resubmit.

Decision: Toward HydroLLM: a benchmark dataset for hydrology-specific knowledge assessment for large language models — R0/PR4

Comments

No accompanying comment.

Author comment: Toward HydroLLM: a benchmark dataset for hydrology-specific knowledge assessment for large language models — R1/PR5

Comments

Editorial Office

Environmental Data Science

Subject: Resubmission of Revised Manuscript – “Towards HydroLLM: A Benchmark Dataset for Hydrology-Specific Knowledge Assessment for Large Language Models”

Dear Editors,

We are pleased to resubmit our revised manuscript, “Towards HydroLLM: A Benchmark Dataset for Hydrology-Specific Knowledge Assessment for Large Language Models,” for continued consideration in Environmental Data Science. We would like to extend our sincere thanks to the editors and reviewers for their thoughtful and constructive feedback, which has significantly strengthened the quality and clarity of our work.

In this revised version, we have carefully addressed all reviewer comments. Key updates include:

A clear rationale for model selection and configuration.

An expanded explanation of GPT-4o-mini’s answer bias behavior.

A forward-looking discussion of future benchmark extensions—including multimodal elements and operational hydrology use cases.

Clarification of data sources, GPU specifications, and evaluation procedures.

Consistency and precision improvements throughout the manuscript.

These revisions enhance the manuscript’s rigor and align it more closely with the journal’s mission to advance data-driven solutions for environmental challenges. We believe the HydroLLM-Benchmark offers a timely and impactful resource for evaluating large language models within hydrology and related environmental domains.

We affirm that the manuscript remains original, is not under consideration elsewhere, and that all authors have reviewed and approved the final version. There are no conflicts of interest to disclose.

We are grateful for the opportunity to revise and resubmit, and we welcome any additional feedback from the editorial team.

Sincerely,

Ramteja Sajja (Corresponding Author)

IIHR - Hydroscience and Engineering, University of Iowa

Email: ramteja-sajja@uiowa.edu

Review: Toward HydroLLM: a benchmark dataset for hydrology-specific knowledge assessment for large language models — R1/PR6

Conflict of interest statement

None

Comments

The authors have thoroughly addressed all of my previous comments and suggestions. I’m glad to see the improvements they’ve made to the manuscript.

Recommendation: Toward HydroLLM: a benchmark dataset for hydrology-specific knowledge assessment for large language models — R1/PR7

Comments

No accompanying comment.

Decision: Toward HydroLLM: a benchmark dataset for hydrology-specific knowledge assessment for large language models — R1/PR8

Comments

No accompanying comment.