Abstract
Decades of catalysis knowledge remain locked in unstructured prose, hindering data-driven discovery. Existing text-mining tools struggle to establish the synthesis-structure-performance relationships critical for catalyst knowledge discovery, as they rarely connect synthesis protocols in one section with the resulting material properties and performance outcomes reported elsewhere. Here, we present CATDA (Corpus-aware Automated Text-to-Graph Catalyst Discovery Agent), a long-context large language model (LLM)-driven agentic framework that reads full documents and distills them into actionable, provenance-tracked knowledge graphs linking material properties, multi-step synthesis, conditions, and testing outcomes. Applied at corpus scale, CATDA extracts data with near-human fidelity (F1 = 0.983) and a 12-fold speedup over manual curation. This structured knowledge is made accessible through two synergistic applications: a DatasetAgent for exporting machine-learning-ready tables, and a CatAgent providing a conversational, citation-linked interface for interactive discovery. The high-quality dataset enabled the training of a predictive model for ethylbenzene conversion, while simultaneously exposing systemic challenges such as feature sparsity and protocol heterogeneity in the source literature. By transforming the literature into a queryable and computable resource, CATDA offers a scalable route to accelerate large-scale data analysis, quantitative modeling, and rational catalyst design.
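The abstract describes provenance-tracked knowledge-graph records that a DatasetAgent flattens into machine-learning-ready tables. A minimal sketch of what such a record and flattening step could look like is below; all class, field, and value names here are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical provenance-tracked knowledge-graph edge of the kind CATDA
# might emit; the field names are assumptions, not the paper's schema.
@dataclass
class KGEdge:
    subject: str        # e.g. a catalyst material
    relation: str       # e.g. a synthesis condition or performance metric
    value: str          # the extracted value
    provenance: dict = field(default_factory=dict)  # doc id, section, span

def to_ml_row(edges):
    """Flatten one material's edges into a single ML-ready table row,
    keeping a citation column per field (DatasetAgent-style)."""
    row = {}
    for e in edges:
        row[e.relation] = e.value
        row[f"{e.relation}__source"] = e.provenance.get("doc_id")
    return row

# Illustrative data only (values are made up):
edges = [
    KGEdge("Pt/Al2O3", "calcination_temp_C", "550",
           {"doc_id": "doi:10.0000/example", "section": "Synthesis"}),
    KGEdge("Pt/Al2O3", "ethylbenzene_conversion_pct", "87.4",
           {"doc_id": "doi:10.0000/example", "section": "Results"}),
]
print(to_ml_row(edges))
```

Keeping a per-field source column is one simple way to preserve the citation linkage the abstract emphasizes when graph records are collapsed into a flat table.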
Supplementary materials
Supplementary Information for "Distilling Knowledge from Catalysis Literature with Long-Context LLM Agents"