This article explores the integration of large language models (LLMs) and AI research agents into global benchmarking frameworks, with a focus on data for the public good. Against a backdrop of shrinking funding and rising demand for scalable and reproducible assessments, we ask whether AI can assume core roles in indicator development, evidence discovery, and policy evaluation without compromising contextual nuance or democratic legitimacy. Building on pilot experiments conducted within the Global Data Barometer (GDB), we employed a phased, adaptive methodology that tested workflow-based platforms and deep research agents across tasks ranging from legal interpretation to multisource policy analysis. Preliminary findings suggest that while AI systems show strong potential for automating structured assessments, they falter on complex, fragmented, or normatively loaded indicators, which raises concerns about opacity, overinterpretation, and inclusivity. To navigate these tensions, we propose a hybrid human-AI architecture that combines standardized workflows, adaptive agent capabilities, and critical human oversight. Central to this model is the concept of a dynamic evidence infrastructure, designed to embed participatory validation and enhance transparency. By reframing automation as augmentation, the article contributes both an empirical, domain-specific assessment of the opportunities and limits of AI-assisted benchmarking and a theoretical framework for sustainable, context-aware evaluation in the age of AI. We argue that the success of AI-assisted benchmarking should be measured not only by efficiency gains but also by its ability to strengthen legitimacy, accountability, and inclusiveness in data ecosystems worldwide.