Although the joint modeling of item responses and response times (RTs) has received considerable attention, most existing approaches are limited to dichotomous items and cannot be applied to assessments with polytomous or mixed-format items. To address this limitation, this article proposes a joint modeling framework for graded item responses and RTs. Specifically, we develop a conditional RT model given item responses and combine it with a marginal response model based on Samejima’s graded response model, yielding a conditional joint model for graded item responses and RTs. The model is then embedded in a two-level hierarchical framework to account for the relationship between ability and speed at the population level. A key methodological contribution is a stochastic approximation EM (SAEM) algorithm that efficiently computes the marginal maximum likelihood estimates of the proposed model. Simulation studies show that the SAEM algorithm recovers the model parameters accurately and that the proposed model outperforms the hierarchical model that assumes conditional independence across a range of testing conditions. Finally, an empirical analysis of data from the 2022 Programme for International Student Assessment illustrates the usefulness of the graded response–response time model in large-scale assessments.
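
For readers less familiar with the marginal response component, the following is a minimal sketch of category probabilities under Samejima’s graded response model in its standard logistic form: the probability of responding in category k or higher is a logistic function of ability, and category probabilities are differences of adjacent cumulative probabilities. The function name `grm_category_probs` and the parameter values are illustrative assumptions, not taken from the article, and the sketch covers neither the conditional RT model nor the SAEM estimation.

```python
import numpy as np


def grm_category_probs(theta, a, b):
    """Category probabilities under Samejima's graded response model.

    theta : latent ability (float)
    a     : item discrimination, assumed positive (float)
    b     : increasing sequence of K-1 category thresholds
    Returns an array of length K holding P(X = k | theta), k = 0..K-1.
    """
    b = np.asarray(b, dtype=float)
    # Cumulative probabilities P(X >= k | theta) for k = 1..K-1.
    cum = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Pad with P(X >= 0) = 1 and P(X >= K) = 0.
    cum = np.concatenate(([1.0], cum, [0.0]))
    # Differences of adjacent cumulative probabilities give P(X = k).
    return cum[:-1] - cum[1:]


# Hypothetical four-category item (values chosen only for illustration).
probs = grm_category_probs(theta=0.5, a=1.2, b=[-1.0, 0.0, 1.5])
print(probs, probs.sum())
```

Because the thresholds are increasing, the cumulative probabilities are decreasing in k, so every category probability is positive and the probabilities sum to one.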