Abstract
Recent advances in language models show great promise, yet their performance in highly specialized engineering fields, which demand strict precision, logical rigor, and nuanced design reasoning, remains insufficiently explored. This work presents EngiBench, a comprehensive evaluation framework designed to systematically assess model capabilities in engineering knowledge, computation, reasoning, and judgment across diverse sub-domains. To improve reliability and interpretability, we propose a Hierarchical Reasoning Prompting (HRP) strategy that mirrors the structured thinking of human engineers: problem decomposition, knowledge application, analytical reasoning, and validation. We further construct EngiBench-QA, a curated dataset of expert-annotated question–answer pairs, along with EngiBench-Mini for comparison against human experts. Experimental results show that HRP notably improves model accuracy and consistency, surpassing conventional prompting and achieving strong alignment with expert-level reasoning. Nonetheless, challenges persist in open-ended design, complex simulation, and managing overconfidence.
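To make the four HRP stages named above concrete, the sketch below shows one way such a prompt could be assembled. The stage wording, the function name build_hrp_prompt, and the example beam problem are illustrative assumptions introduced here for exposition; they are not the paper's released prompt or implementation.

```python
# Minimal sketch of a Hierarchical Reasoning Prompting (HRP) style template.
# Stage wording, names, and the example problem are assumptions, not the
# authors' actual prompts.

HRP_STAGES = [
    ("Problem decomposition", "Break the problem into well-defined sub-tasks."),
    ("Knowledge application", "State the governing principles, formulas, and standards for each sub-task."),
    ("Analytical reasoning", "Carry out the calculations or design reasoning step by step."),
    ("Validation", "Check units, magnitudes, and assumptions; flag any remaining uncertainty."),
]

def build_hrp_prompt(problem: str) -> str:
    """Wrap an engineering question in the four-stage HRP structure."""
    steps = "\n".join(
        f"{i}. {name}: {instruction}"
        for i, (name, instruction) in enumerate(HRP_STAGES, start=1)
    )
    return (
        "You are an engineer solving the following problem.\n"
        f"Problem: {problem}\n\n"
        "Answer by working through these stages in order:\n"
        f"{steps}\n\n"
        "Finish with a single line 'Final answer: ...'."
    )

if __name__ == "__main__":
    # Hypothetical EngiBench-style question, used only to show the prompt shape.
    print(build_hrp_prompt(
        "A simply supported beam of length 4 m carries a 10 kN point load at "
        "midspan. Compute the maximum bending moment."
    ))
```

In this reading, the structured prompt is what enforces the decomposition, knowledge, reasoning, and validation steps that the abstract credits for HRP's gains over conventional prompting.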


