An Automated Evaluation Agent for Q&A Pairs and Reticular Synthesis Conditions

10 July 2025, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

We report an automated evaluation agent that reliably assigns classification labels to Q&A pairs of both single-hop and multi-hop types, as well as to synthesis-conditions datasets. Our agent is built around a suite of large language models (LLMs) and is designed to eliminate human involvement in the evaluation process. Through extensive testing of approaches including DSPy and fine-tuning, we found that the performance of a given LLM on these Q&A and synthesis-conditions classification tasks is determined primarily by the architecture of the agent: how the inputs are parsed and processed, and how the LLMs are called, makes a significant difference. We also found that the quality of the prompt remains paramount, irrespective of the sophistication of the underlying model. Even models considered state-of-the-art, such as GPT-o1, perform poorly when the prompt lacks sufficient detail and structure. To overcome these challenges, we performed systematic prompt optimization, iteratively refining the prompt to significantly improve classification accuracy and reach human-level evaluation benchmarks. We show that while LLMs have made remarkable progress, they still fall short of human reasoning without substantial prompt engineering. The agent presented here provides a robust and reproducible tool for evaluating Q&A pairs and synthesis conditions at scale and can serve as a foundation for future developments in the automated evaluation of LLM inputs and outputs.
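
As a rough illustration of the kind of LLM-backed classifier such a DSPy workflow wraps, a minimal sketch is shown below. It assumes DSPy's standard Signature/ChainOfThought interface; the field names, label set, and model identifier are illustrative assumptions, not the paper's actual workflow (which is given in Figures S2-S4 of the Supporting Information).

import dspy

# Hypothetical sketch of an LLM-backed Q&A-pair classifier of the kind the
# evaluation agent could wrap; field names and labels are illustrative only.
class QAPairEvaluation(dspy.Signature):
    """Assign a classification label to a Q&A pair drawn from reticular chemistry text."""

    context = dspy.InputField(desc="source passage the Q&A pair was generated from")
    question = dspy.InputField(desc="single-hop or multi-hop question")
    answer = dspy.InputField(desc="proposed answer to be evaluated")
    label = dspy.OutputField(desc="one of: correct, partially correct, incorrect")

# Any OpenAI-compatible model can back the predictor; the model name here is an assumption.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

evaluator = dspy.ChainOfThought(QAPairEvaluation)
result = evaluator(
    context="MOF-5 is synthesized from zinc nitrate and terephthalic acid in DMF at 100 C.",
    question="Which solvent is used in the synthesis of MOF-5?",
    answer="DMF",
)
print(result.label)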

Keywords

LLMs
Generative AI
Evaluation
Benchmarks
Reticular Chemistry
Agentic AI
Datasets

Supplementary materials

Supporting Information
Includes the prompt templates (Figures S1-S6, S14-S16) and DSPy agent workflows (Figures S2-S4) used for RAG and LLM-based evaluation. Figures S7-S13 present the comparative performance of different LLMs across multiple evaluation iterations. Figures S17-S19 describe automated algorithms for prompt optimization, API evaluation, and data aggregation. Table S1 summarizes the iterative refinement process and evaluation results.
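
For context on the automated prompt-optimization algorithm referenced above (Figure S17), a generic sketch of an iterative refinement loop is shown below. The function names (score_prompt, propose_revision), stopping criterion, and thresholds are hypothetical placeholders, not the authors' procedure.

# Generic sketch of iterative prompt refinement against a labeled evaluation set.
# score_prompt and propose_revision are hypothetical callables supplied by the caller;
# the actual algorithm is described in Figure S17 of the Supporting Information.
def optimize_prompt(initial_prompt, labeled_pairs, score_prompt, propose_revision,
                    max_iterations=10, target_accuracy=0.95):
    """Iteratively refine a prompt until it agrees with human labels well enough."""
    best_prompt = initial_prompt
    best_accuracy = score_prompt(best_prompt, labeled_pairs)
    for _ in range(max_iterations):
        if best_accuracy >= target_accuracy:
            break
        # Ask an LLM to rewrite the prompt, using the current prompt and its
        # misclassified examples as feedback.
        candidate = propose_revision(best_prompt, labeled_pairs)
        accuracy = score_prompt(candidate, labeled_pairs)
        if accuracy > best_accuracy:
            best_prompt, best_accuracy = candidate, accuracy
    return best_prompt, best_accuracy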
