This paper addresses the current state of coreference resolution evaluation, in which different measures (notably, MUC, B3, CEAF, and ACE-value) are applied in different studies. None of them is fully adequate, and their measures are not commensurate. We enumerate the desiderata for a coreference scoring measure, discuss the strong and weak points of the existing measures, and propose the BiLateral Assessment of Noun-Phrase Coreference, a variation of the Rand index created to suit the coreference task. The BiLateral Assessment of Noun-Phrase Coreference rewards both coreference and non-coreference links by averaging the F-scores of the two types, does not ignore singletons – the main problem with the MUC score – and does not inflate the score in their presence – a problem with the B3 and CEAF scores. In addition, its fine granularity is consistent over the whole range of scores and affords better discrimination between systems.
Email your librarian or administrator to recommend adding this journal to your organisation's collection.
* Views captured on Cambridge Core between September 2016 - 25th March 2017. This data will be updated every 24 hours.