
Anisotropic span embeddings and the negative impact of higher-order inference for coreference resolution: An empirical analysis

Published online by Cambridge University Press:  25 January 2024

Feng Hou
Affiliation:
School of Natural and Computational Sciences, Massey University, Auckland, New Zealand
Ruili Wang*
Affiliation:
School of Natural and Computational Sciences, Massey University, Auckland, New Zealand
See-Kiong Ng
Affiliation:
Institute of Data Science, National University of Singapore, Singapore
Fangyi Zhu
Affiliation:
Institute of Data Science, National University of Singapore, Singapore
Michael Witbrock
Affiliation:
School of Computer Science, University of Auckland, Auckland, New Zealand
Steven F. Cahan
Affiliation:
University of Auckland Business School, Auckland, New Zealand
Lily Chen
Affiliation:
Research School of Accounting, College of Business & Economics, Australian National University, Australia
Xiaoyun Jia
Affiliation:
Institute of Governance, Shandong University, Qingdao, China
*
Corresponding author: Ruili Wang; Email: ruili.wang@massey.ac.nz

Abstract

Coreference resolution is the task of identifying and clustering mentions that refer to the same entity in a document. Based on state-of-the-art deep learning approaches, end-to-end coreference resolution considers all spans as candidate mentions and tackles mention detection and coreference resolution simultaneously. Recently, researchers have attempted to incorporate document-level context using higher-order inference (HOI) to improve end-to-end coreference resolution. However, HOI methods have been shown to have a marginal or even negative impact on coreference resolution. In this paper, we reveal the reasons for the negative impact of HOI on coreference resolution. Contextualized representations (e.g., those produced by BERT) for building span embeddings have been shown to be highly anisotropic. We show that HOI actually increases and thus worsens the anisotropy of span embeddings, making it difficult to distinguish between related but distinct entities (e.g., pilots and flight attendants). Instead of using HOI, we propose two methods, Less-Anisotropic Internal Representations (LAIR) and Data Augmentation with Document Synthesis and Mention Swap (DSMS), to learn less-anisotropic span embeddings for coreference resolution. LAIR uses a linear aggregation of the first layer and the topmost layer of contextualized embeddings. DSMS generates more diversified examples of related but distinct entities by synthesizing documents and by mention swapping. Our experiments show that less-anisotropic span embeddings improve performance significantly (+2.8 F1 on the OntoNotes benchmark) and reach new state-of-the-art performance on the GAP dataset.
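
As a concrete illustration of the LAIR aggregation described above, the sketch below linearly mixes the first and topmost encoder layers with a weight $\alpha$ (cf. the $\alpha$ ablation in Table 5). The exact mixing formula, the reading of hidden_states[1] as the "first layer", and the default $\alpha$ are assumptions for illustration, not the authors' implementation.

```python
def lair_embeddings(hidden_states, alpha=0.5):
    """Illustrative LAIR-style aggregation of contextualized layers.

    `hidden_states` is the per-layer output tuple of a Transformer
    encoder called with output_hidden_states=True:
    (input embeddings, layer 1, ..., layer L), each of shape
    (batch, seq_len, hidden_dim).

    NOTE: the mixing form and the interpretation of "first layer"
    (layer 1 here, rather than the input embeddings at index 0) are
    assumptions; the paper states only that LAIR linearly aggregates
    the first and topmost layers.
    """
    first = hidden_states[1]   # first contextual layer (assumed)
    top = hidden_states[-1]    # topmost layer
    return alpha * first + (1.0 - alpha) * top
```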

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. The histogram distributions of $f$ and $P$ values. We recorded 6,564 $f$ and $P$ values, respectively, during the last two epochs of training C2F+BERT-base+AA. $f$ is averaged across the embedding dimension; $P$ is the maximum value over antecedents.

Figure 2. The degree of anisotropy of span embeddings is measured by the average cosine similarity between uniformly randomly sampled spans. (Figures 2-6 are generated using the method of Ethayarajh (2019)).

Figure 3. For contextualized word representations, the degree of anisotropy is measured by the average cosine similarity between uniformly randomly sampled words. The higher the layer, the more anisotropic. Embeddings of layer 0 are the input layer word embeddings.
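
Following Ethayarajh (2019), the anisotropy measure in Figures 2 and 3 is the expected cosine similarity between uniformly randomly sampled representations (spans or words). A minimal sketch of that estimate is given below; the sample size and other details are assumptions.

```python
import numpy as np


def average_pairwise_cosine(embeddings, n_pairs=10000, seed=0):
    """Estimate anisotropy as the mean cosine similarity between
    uniformly sampled pairs of embeddings (Ethayarajh, 2019).

    `embeddings` is an (N, d) array of span or word vectors drawn
    from a corpus. Values near 0 indicate an isotropic space; values
    near 1 indicate strong anisotropy.
    """
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    a, b = embeddings[i], embeddings[j]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    )
    return float(cos.mean())
```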

Figure 4. Intra-sentence similarities of contextualized representations. The intra-sentence similarity is the average cosine similarity between each word representation in a sentence and their mean.

Figure 5. Self-similarities of contextualized representations. Self-similarity is the average cosine similarity between representations of the same word in different contexts.
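
Both quantities in Figures 4 and 5 reduce to averaged cosine similarities; the sketch below shows one way to compute them. The function names and input conventions are ours, not the paper's.

```python
import numpy as np


def _cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def intra_sentence_similarity(sentence_vectors):
    """Average cosine similarity between each word vector in one
    sentence and the sentence mean vector (Figure 4)."""
    mean = sentence_vectors.mean(axis=0)
    return float(np.mean([_cosine(v, mean) for v in sentence_vectors]))


def self_similarity(word_vectors):
    """Average pairwise cosine similarity between representations of
    the same word drawn from different contexts (Figure 5)."""
    n = len(word_vectors)
    sims = [
        _cosine(word_vectors[i], word_vectors[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return float(np.mean(sims))
```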

Algorithm 1: Data augmentation for learning less-anisotropic span embeddings
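
Algorithm 1 itself is not reproduced on this page. Purely as an illustration of the mention-swap idea mentioned in the abstract, the sketch below exchanges one mention between two coreference clusters; the data representation and swap policy are assumptions, not the authors' procedure.

```python
import random


def mention_swap(clusters, seed=0):
    """Illustrative mention swap: exchange one mention between two
    randomly chosen coreference clusters, yielding a harder example of
    related but distinct entities.

    `clusters` is a list of lists of mention strings, e.g.
    [["The pilots", "they"], ["The flight attendants", "them"]].
    The swap policy actually used by DSMS is not shown here.
    """
    if len(clusters) < 2:
        return clusters
    rng = random.Random(seed)
    a, b = rng.sample(range(len(clusters)), 2)
    i = rng.randrange(len(clusters[a]))
    j = rng.randrange(len(clusters[b]))
    swapped = [list(c) for c in clusters]
    swapped[a][i], swapped[b][j] = swapped[b][j], swapped[a][i]
    return swapped
```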

Table 1. Hyperparameter settings for our experiments

Table 2. Results on the test set of the OntoNotes English data from the CoNLL-2012 shared task. HOI modules are in bold. The rightmost column is the main evaluation metric, the average F1 of MUC, $B^3$, and $CEAF_{\phi_{4}}$. BERT, SpanBERT, and ELECTRA refer to the large models. The results of the models with citations are copied verbatim from the original papers

Table 3. Performance on the test set of the GAP corpus. The metrics are F1 scores on Masculine and Feminine examples, the Overall F1 score, and a Bias factor (F/M). BERT, SpanBERT, and ELECTRA refer to the large models. The models are trained on OntoNotes and tested on the test set of GAP. $\ast$ denotes a model re-implemented following Joshi et al. (2020)

Figure 6. The average cosine similarity between randomly sampled spans. The less similar the spans, the less anisotropic the span embeddings. LAIR and DSMS help learn less-anisotropic span embeddings.

Figure 7. The cosine similarity between "The pilots" and "The flight attendants", an example from the OntoNotes dev set.

Table 4. Ablation studies of higher-order inference for BERT-base, ELECTRA-base, and SpanBERT-base. The metric is the average F1 score on the OntoNotes dev and test sets using different combinations of hyperparameters. The HOI method used is Attended Antecedent (Lee et al. 2018)

Table 5. Ablation studies on different $\alpha$ values for different models