An Automated Evaluation Agent for Q&A Pairs and Reticular Synthesis Conditions

10 July 2025, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

We report an automated evaluation agent that reliably assigns classification labels to Q&A pairs of both single-hop and multi-hop types, as well as to synthesis-conditions datasets. Our agent is built around a suite of large language models (LLMs) and is designed to eliminate human involvement in the evaluation process. Through extensive testing of approaches including DSPy and fine-tuning, we found that the performance of a given LLM on these Q&A and synthesis-conditions classification tasks is determined primarily by the architecture of the agent: how the inputs are parsed and processed, and how the LLMs are called, makes a significant difference. We also found that the quality of the prompt remains paramount, irrespective of the sophistication of the underlying model. Even models considered state-of-the-art, such as GPT-o1, perform poorly when the prompt lacks sufficient detail and structure. To overcome these challenges, we performed systematic prompt optimization, iteratively refining the prompt to significantly improve classification accuracy and reach human-level evaluation benchmarks. We show that while LLMs have made remarkable progress, they still fall short of human reasoning without substantial prompt engineering. The agent presented here provides a robust and reproducible tool for evaluating Q&A pairs and synthesis conditions at scale and can serve as a foundation for future developments in the automated evaluation of LLM inputs and outputs.
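
As a rough illustration of the kind of LLM-backed classifier such a DSPy workflow wraps, a minimal sketch is shown below. It assumes DSPy's standard Signature/ChainOfThought interface; the field names, label set, and model identifier are illustrative assumptions, not the paper's actual workflow (which is given in Figures S2-S4 of the Supporting Information).

import dspy

# Hypothetical sketch of an LLM-backed Q&A-pair classifier of the kind the
# evaluation agent could wrap; field names and labels are illustrative only.
class QAPairEvaluation(dspy.Signature):
    """Assign a classification label to a Q&A pair drawn from reticular chemistry text."""

    context = dspy.InputField(desc="source passage the Q&A pair was generated from")
    question = dspy.InputField(desc="single-hop or multi-hop question")
    answer = dspy.InputField(desc="proposed answer to be evaluated")
    label = dspy.OutputField(desc="one of: correct, partially correct, incorrect")

# Any OpenAI-compatible model can back the predictor; the model name here is an assumption.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

evaluator = dspy.ChainOfThought(QAPairEvaluation)
result = evaluator(
    context="MOF-5 is synthesized from zinc nitrate and terephthalic acid in DMF at 100 C.",
    question="Which solvent is used in the synthesis of MOF-5?",
    answer="DMF",
)
print(result.label)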

Keywords

LLMs
Generative AI
Evaluation
Benchmarks
Reticular Chemistry
Agentic AI
Datasets

Supplementary materials

Supporting Information
Includes the prompt templates (Figures S1-S6, S14-S16) and DSPy agent workflows (Figures S2-S4) used for RAG and LLM-based evaluation. Figures S7-S13 present the comparative performance of different LLMs across multiple evaluation iterations. Figures S17-S19 describe automated algorithms for prompt optimization, API evaluation, and data aggregation. Table S1 summarizes the iterative refinement process and evaluation results.
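
For context on the automated prompt-optimization algorithm referenced above (Figure S17), a generic sketch of an iterative refinement loop is shown below. The function names (score_prompt, propose_revision), stopping criterion, and thresholds are hypothetical placeholders, not the authors' procedure.

# Generic sketch of iterative prompt refinement against a labeled evaluation set.
# score_prompt and propose_revision are hypothetical callables supplied by the caller;
# the actual algorithm is described in Figure S17 of the Supporting Information.
def optimize_prompt(initial_prompt, labeled_pairs, score_prompt, propose_revision,
                    max_iterations=10, target_accuracy=0.95):
    """Iteratively refine a prompt until it agrees with human labels well enough."""
    best_prompt = initial_prompt
    best_accuracy = score_prompt(best_prompt, labeled_pairs)
    for _ in range(max_iterations):
        if best_accuracy >= target_accuracy:
            break
        # Ask an LLM to rewrite the prompt, using the current prompt and its
        # misclassified examples as feedback.
        candidate = propose_revision(best_prompt, labeled_pairs)
        accuracy = score_prompt(candidate, labeled_pairs)
        if accuracy > best_accuracy:
            best_prompt, best_accuracy = candidate, accuracy
    return best_prompt, best_accuracy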
