Joint learning of text alignment and abstractive summarization for long documents via unbalanced optimal transport

Abstract

Recently, neural abstractive text summarization (NATS) models based on the sequence-to-sequence architecture have drawn a lot of attention. Real-world texts that need to be summarized range from short news with dozens of words to long reports with thousands of words. However, most existing NATS models are not good at summarizing long documents, due to the inherent limitations of their underlying neural architectures. In this paper, we focus on the task of long document summarization (LDS). Based on the inherent section structures of source documents, we divide an abstractive LDS problem into several smaller-sized problems. In this setting, providing a less-biased target summary as the supervision for each section is vital for the model's performance. As a preliminary, we formally describe the section-to-summary-sentence (S2SS) alignment for LDS. Based on this, we propose a novel NATS framework for the LDS task. Our framework is built on the theory of unbalanced optimal transport (UOT) and is named UOTSumm. It jointly learns three targets in a unified training objective: the optimal S2SS alignment, a section-level NATS summarizer, and the number of aligned summary sentences for each section. In this way, UOTSumm directly learns the text alignment from summarization data, without resorting to any biased tool such as ROUGE. UOTSumm can be easily adapted to most existing NATS models, and we implement two versions of UOTSumm, with and without the pretrain-finetune technique. We evaluate UOTSumm on three publicly available LDS benchmarks: PubMed, arXiv, and GovReport. UOTSumm clearly outperforms its counterparts that use ROUGE for the text alignment. When combined with UOTSumm, the performance of two vanilla NATS models improves by a large margin. Besides, UOTSumm achieves better or comparable performance when compared with some recent strong baselines.


Introduction
Text summarization is the procedure of identifying the most important information from a source text and producing a concise and readable summary (Mani, 1999). Generally speaking, summarization models can be divided into two types: extractive and abstractive (Gambhir and Gupta, 2017). The extractive approach directly extracts snippets, such as sentences or phrases, from the original document as the summary. In contrast, the abstractive approach uses natural language generation techniques to produce fluent summaries, and it may generate expressions not directly existing in the source document. (Table 1 note: for GovReport, documents are organized in a form of multi-level sections; the method of dividing its documents into a form of one-level sections is discussed in Section 5.1.) Nowadays, neural abstractive text summarization (NATS) models (Shi et al., 2021) based on the sequence-to-sequence (Seq2Seq) architecture (Sutskever, Vinyals, and Le, 2014) are prevailing. In practice, different types of documents vary greatly in length.
In Table 1, we present the length statistics of several popular summarization datasets. It can be observed that news stories are shorter than 800 words on average. In contrast, the average length of research papers exceeds 3000 words, and the average length of government reports even exceeds 9000 words. Most existing NATS models treat the source document and its summary as two single sequences, which works well for summarizing documents of short and medium lengths. However, limited by the underlying neural architectures of NATS models, this practice leads to practical difficulties when applied to long documents. Previous studies show that the vanilla LSTM (Hochreiter and Schmidhuber, 1997) and the vanilla Transformer (Vaswani et al., 2017) can effectively handle sequences of several hundred words at most (Khandelwal et al., 2018; Dai et al., 2019). Besides, the memory and time complexity of the Transformer grows quadratically with the sequence length. This constraint also limits the application of the Transformer to long documents, since long sequences easily run out of GPU memory.
In this paper, we study the challenging setting of long document summarization (LDS), where one source document includes thousands of words and one summary includes hundreds of words. Under the length constraint of neural architectures, one common practice adopted by NATS models is to set a length limit and truncate the part that exceeds it. However, this simple practice discards useful information beyond the prescribed length limit. To handle longer sequences, one approach is to design sophisticated network structures that capture long-range dependencies, such as hierarchically encoding the discourse structure (Webber and Joshi, 2012) of documents (Cohan et al., 2018), or introducing long-span attention mechanisms (Zaheer et al., 2020). As another approach, extractive-and-abstractive methods extract some snippets first and then paraphrase them (Pilault et al., 2020). Recently, Gidiotis and Tsoumakas (2020) proposed a simple and effective method for LDS named divide-and-conquer (DANCER). DANCER decomposes an LDS problem into multiple smaller problems, reduces the computational complexity, and achieves good performance. Concretely, it breaks a long document and its summary into several pairs of document section and corresponding partial summary. A NATS model is trained to summarize the sections of a document separately, and these partial summaries are then combined into a complete summary.

Text alignment refers to the correspondence between two pieces of text. For DANCER, the alignment between sections and summary sentences is necessary to decide which sentences in the summary should be treated as the target of each section. To achieve text alignment, DANCER utilizes ROUGE (Lin, 2004). However, ROUGE only matches tokens in a superficial and exact way, which does not support synonyms or paraphrasing. And as an approach to text comparison, ROUGE deviates from human judgment (Kryscinski et al., 2019; Fabbri et al., 2021).
For these reasons, ROUGE-based text alignment also deviates from human judgment. This gap leaves room for improving the NATS models that require ROUGE at the training stage. It is natural to ask the following questions: is ROUGE indispensable for achieving text alignment? Is it possible to directly learn the text alignment from summarization data without utilizing ROUGE?
In this paper, we propose a novel framework for LDS. Our method treats summarizing a long document as an ensemble of summarizing its contained sections. As a preliminary step, we formally describe the section-to-summary-sentence (S2SS) alignment for LDS. Based on this, we propose a joint training objective that formulates LDS as an unbalanced optimal transport (UOT) (Chizat et al., 2015) problem. Accordingly, our method is named the UOT-based summarizer (UOTSumm). UOTSumm achieves multiple goals simultaneously: it jointly learns the optimal S2SS alignment, a section-level NATS summarizer, and the number of aligned summary sentences for each section. At the training stage, UOTSumm directly learns the S2SS alignment from summarization data, without utilizing any external tool such as ROUGE. In terms of concrete implementation, UOTSumm comprises two modules: a Section-to-Summary (Sec2Summ) module and an aligned summary sentence counter (ASSC) module. The Sec2Summ module takes document sections as input and outputs the corresponding abstractive summaries. The ASSC module predicts the number of generated sentences for each section. We adopt an alternating optimization technique (Bezdek and Hathaway, 2002) to train UOTSumm, such that the ASSC module and the Sec2Summ module are updated alternately. UOTSumm includes a universal training objective for LDS, and its Sec2Summ module can be any existing NATS model. In this paper, we implement UOTSumm with two popular NATS models: pointer-generator networks (PG-Net) (See, Liu, and Manning, 2017) and BART (Lewis et al., 2020). They represent two paradigms of NATS models: learning from scratch and fine-tuning from a pre-trained model. We evaluate these two UOTSumm variants on three public LDS benchmarks: PubMed, arXiv (Cohan et al., 2018), and GovReport (Huang et al., 2021). With a purely data-driven approach to text alignment, UOTSumm clearly outperforms its counterparts that are based on ROUGE.
And when combined with UOTSumm, the improved PG-Net and BART also outperform their respective vanilla models by a large margin. On PubMed and arXiv, UOTSumm fine-tuned from BART outperforms some recent strong baseline models that are specifically designed for the LDS task. On GovReport, UOTSumm fine-tuned from BART achieves performance comparable to the state-of-the-art model. Besides, to thoroughly investigate the function of each component, we introduce and study three ablation models for UOTSumm. Finally, we also study some practical cases and conduct a human evaluation to show the advantages of UOTSumm.
The contributions of this paper are as follows:
• We propose UOTSumm, which learns the S2SS text alignment directly from summarization data. With this purely data-driven approach to text alignment, UOTSumm clearly outperforms its counterparts that are based on ROUGE. Besides, UOTSumm outperforms several recent competitive baseline models that are particularly designed for LDS.
• UOTSumm includes a universal training objective for LDS, which is applicable to any existing NATS model. When equipped with UOTSumm, two popular NATS models, that is, PG-Net and BART, markedly outperform their vanilla implementations.

Abstractive summarization of long documents
Modern NATS models are built on the Seq2Seq architecture (Shi et al., 2021). The Seq2Seq architecture first aggregates information from the input text sequence with an encoder, and then generates the output text sequence with a decoder. Common neural networks serving as the encoder or decoder are LSTM (Hochreiter and Schmidhuber, 1997) and Transformer (Vaswani et al., 2017). If one source document and its summary are treated as two single sequences, then the Seq2Seq architecture can be directly applied to the abstractive summarization task. This practice works when the documents to be summarized are not too long, but it is not suitable for long documents with thousands of words. On some standard language modeling benchmarks, it is observed that an LSTM language model is capable of using about 200 tokens of context on average (Khandelwal et al., 2018), and the effective context is even shorter for the vanilla Transformer (Dai et al., 2019). One approach to solving this issue is hierarchical encoding (Nallapati et al., 2016; Cohan et al., 2018). This approach decomposes a long document into chunks, where one chunk can be one sentence or one section. Each chunk is encoded by a lower-level encoder first, and then the chunk sequence is encoded by an upper-level encoder. With a hierarchical attention mechanism, the decoder attends to a chunk first and then to a concrete word. As another approach to LDS, extractive-and-abstractive methods explicitly conduct content selection from the source text first, and then rewrite a summary based on the selected content (Jing and McKeown, 1999; Gehrmann, Deng, and Rush, 2018; Liu et al., 2018; Chen and Bansal, 2018; Pilault et al., 2020; Zhao, Saleh, and Liu, 2020).
Compared to training from scratch (Rush, Chopra, and Weston, 2015; See et al., 2017), the pretrain-finetune paradigm (Devlin et al., 2019; Lewis et al., 2020) boosts the performance of NATS models (Liu and Lapata, 2019; Raffel et al., 2020; Zhang et al., 2020a), since it utilizes knowledge transferred from large-scale external corpora. The pretrain-finetune paradigm is usually implemented based on Transformer, whose cornerstone component is the self-attention mechanism. However, the memory and computational requirements of self-attention grow quadratically with the sequence length, which limits its application to the LDS task. To tackle this quadratic characteristic, one approach is to modify the self-attention mechanism so that the quadratic complexity is reduced (Tay et al., 2022). To this end, sparse attention represents a class of methods that force each token to attend to only part of the context. For example, BigBird (Zaheer et al., 2020), Longformer (Beltagy, Peters, and Cohan, 2020), LoBART (Manakul and Gales, 2021), and Poolingformer (Zhang et al., 2021) adopt fixed attention patterns over some local contexts, while Reformer (Kitaev, Kaiser, and Levskaya, 2019) and Sinkhorn Attention (Tay et al., 2020) try to learn the attention patterns. Different from the above methods, which concentrate on the self-attention mechanism, Hepos (Huang et al., 2021) modifies the encoder-decoder attention with head-wise positional strides to pinpoint salient information from source documents. Recently, Koh et al. (2022) gave an empirical survey of datasets, models, and metrics for the LDS task.

Text alignment in summarization
The concept of text alignment widely exists in both extractive and abstractive summarization tasks. For example, in supervised extractive text summarization, due to abstractive rephrasing of summary sentences, there is no explicit signal about which sentences should be extracted.
To generate supervision signals, one common approach (Nallapati, Zhai, and Zhou, 2017) is to heuristically label the subset of sentences from the source document that has the maximum ROUGE score with the ground-truth summary. This process finds an alignment between the summary and some snippets from the document. Besides, to achieve training-stage content selection, text alignment also plays an important role in abstractive methods (Manakul and Gales, 2021) and extractive-and-abstractive methods (Liu et al., 2018; Pilault et al., 2020). To sum up, for most summarization models, ROUGE is a long-standing and common workhorse for training-stage text alignment.

Optimal transport
As the foundation of UOTSumm, related works on optimal transport (OT) (Villani, 2008; Peyré and Cuturi, 2019) are discussed in this section. The theory of OT originates from Monge's problem (Monge, 1781) of moving sand with the least effort. Kantorovich (1942, 2006) relaxed Monge's problem to a formulation of moving mass between two probability distributions. OT seeks the most efficient way of transforming one histogram into another when a cost function is given, and it provides a tool to compare empirical probability distributions. In particular, UOT (Chizat et al., 2015; Liero, Mielke, and Savaré, 2018) tackles the case when two histograms have different total masses. Recently, OT and UOT have been extensively applied to various machine learning (Frogner et al., 2015; Kolouri et al., 2017) and natural language processing (NLP) (Kusner et al., 2015; Zhang et al., 2017; Clark, Celikyilmaz, and Smith, 2019; Zhao et al., 2019) problems. In NLP applications, OT is usually used to compare two sets of embeddings and serves as a distance measure. More specifically, OT distance and its variants can be applied to measure document distance (Kusner et al., 2015; Yokoi et al., 2020), or to measure the similarity between words across multiple languages (Alvarez-Melis and Jaakkola, 2018; Xu et al., 2021). Moreover, one benefit of applying OT to NLP tasks is interpretability, as it provides an explicit alignment between tokens. In this line of research, OT is applied to improve text generation (Chen et al., 2020a), to achieve sparse and explainable text alignment (Swanson, Yu, and Lei, 2020), or to automatically evaluate machine-generated texts (Clark et al., 2019; Zhao et al., 2019; Zhang et al., 2020b; Chen et al., 2020b).

A theoretical model of text summarization
In this section, we review some concepts from a theoretical text summarization model proposed by Peyrard (2019), which are helpful for understanding UOTSumm. The theoretical model is established on the basis of information theory (Shannon, 1948). Its basic viewpoint is as follows: texts are represented by probability distributions over semantic units (Bao et al., 2011). Taking summarization data as an example, the characters, words, n-grams, phrases, sentences, and sections in documents or summaries can be treated as semantic units. Based on this viewpoint, some intuitively used concepts in summarization, such as importance, redundancy, relevance, and informativeness, are rigorously defined. This abstract model remains theoretical; how to utilize it to guide various real-world summarization tasks is an underexplored but meaningful topic.

Sequence-to-sequence learning
We first briefly review the training objective of the Seq2Seq architecture (Sutskever et al., 2014), which is a cornerstone of NATS models. Denote the input word sequence as $\mathbf{w}^{in} = (w^{in}_1, w^{in}_2, \cdots, w^{in}_I)$, and the output word sequence as $\mathbf{w}^{out} = (w^{out}_1, w^{out}_2, \cdots, w^{out}_J)$. The objective of training a Seq2Seq architecture is to maximize the probability of observing $\mathbf{w}^{out}$ on condition that $\mathbf{w}^{in}$ is observed:
$$\max_{\theta}\ P_\theta\big(\mathbf{w}^{out} \mid \mathbf{w}^{in}\big), \tag{1}$$
where $\theta$ denotes the trainable parameters of the NATS model. The typical auto-regressive training objective decomposes $P_\theta(\mathbf{w}^{out} \mid \mathbf{w}^{in})$ into a product of conditional probabilities, each predicting the next word conditioned on the current context. Maximizing it is equivalent to minimizing the negative log likelihood (NLL) loss $L_\theta(\mathbf{w}^{in}, \mathbf{w}^{out})$ as follows:
$$L_\theta\big(\mathbf{w}^{in}, \mathbf{w}^{out}\big) = -\sum_{j=1}^{J} \log P_\theta\big(w^{out}_j \mid w^{out}_{1:j-1}, \mathbf{w}^{in}\big). \tag{2}$$
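As a concrete illustration, the NLL loss in Equation (2) can be computed from per-step vocabulary logits as follows (a minimal numpy sketch under teacher forcing; the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def nll_loss(logits, target_ids):
    """NLL loss in the style of Equation (2) for one output sequence.

    logits: (J, V) unnormalized vocabulary scores, one row per decoding
            step (teacher forcing: step j conditions on gold w^out_{<j}).
    target_ids: (J,) gold next-token ids.
    """
    # numerically stable log-softmax over the vocabulary dimension
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # sum of -log P(w_j | context) over the J steps
    return -log_probs[np.arange(len(target_ids)), target_ids].sum()

# toy check: distributions sharply peaked on the gold tokens give a small loss
logits = np.array([[10.0, 0.0, 0.0],
                   [0.0, 10.0, 0.0]])
loss = nll_loss(logits, np.array([0, 1]))
```

In a real NATS model the logits come from the decoder; here they are hand-set only to show the mechanics of Equation (2).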

Unbalanced optimal transport
In this section, we provide some background knowledge on OT and UOT, which helps in understanding some technical aspects of our proposed framework. Let $\langle \cdot, \cdot \rangle$ stand for the Frobenius dot-product between two matrices of the same size. Given a cost matrix $C \in \mathbb{R}^{m \times n}_+$ and two positive histograms $a \in \mathbb{R}^m_+$ and $b \in \mathbb{R}^n_+$, Kantorovich's formulation (Kantorovich, 1942, 2006) of the OT problem is as follows:
$$\min_{P \in \Pi(a, b)}\ \langle P, C \rangle, \tag{3}$$
where $\Pi(a, b) = \{P \in \mathbb{R}^{m \times n}_+ \mid P\mathbf{1}_n = a,\ P^T\mathbf{1}_m = b\}$ is the set of all feasible transport plans. A basic requirement of the OT problem in Formula (3) is that the two histograms should have the same total mass:
$$\sum_{i=1}^{m} a_i = \sum_{j=1}^{n} b_j. \tag{4}$$
However, many practical problems do not satisfy this constraint in nature. To tackle this issue, UOT (Chizat et al., 2015; Liero et al., 2018) relaxes the hard constraint $P \in \Pi(a, b)$ in Formula (3) to allow mass variation:
$$\min_{P \in \mathbb{R}^{m \times n}_+}\ \langle P, C \rangle + \tau_1\, \mathrm{KL}\big(P\mathbf{1}_n \,\|\, a\big) + \tau_2\, \mathrm{KL}\big(P^T\mathbf{1}_m \,\|\, b\big). \tag{5}$$
Here, following the terminologies in this area, $P\mathbf{1}_n \in \mathbb{R}^m$ and $P^T\mathbf{1}_m \in \mathbb{R}^n$ are named marginal vectors. Mass variation refers to the discrepancy between a marginal of the transport plan $P$ and the mass on the corresponding side. It is measured with the (generalized) Kullback-Leibler (KL) divergence $\mathrm{KL}(\cdot \| \cdot)$, defined for positive vectors $u$ and $v$ as $\mathrm{KL}(u \,\|\, v) = \sum_i u_i \log(u_i / v_i) - u_i + v_i$. In Formula (5), $\tau_1$ and $\tau_2$ are hyper-parameters controlling how much mass variation is penalized as opposed to the transportation cost. When $\tau_1 \to +\infty$ and $\tau_2 \to +\infty$, the UOT problem in Formula (5) becomes equivalent to the standard OT problem in Formula (3).
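To make the relaxation in Formula (5) concrete, its entropy-regularized variant can be solved with simple Sinkhorn-style scaling iterations (a minimal numpy sketch; `sinkhorn_uot` and its defaults are illustrative assumptions, and practical solvers prefer a log-domain variant for numerical robustness):

```python
import numpy as np

def sinkhorn_uot(C, a, b, tau1, tau2, eps=0.1, n_iter=300):
    """Scaling iterations for entropy-regularized UOT: Formula (5) plus
    an eps-weighted entropy term. Both marginals are softly matched via
    KL; as tau1, tau2 -> infinity this approaches balanced OT (Formula (3))."""
    K = np.exp(-C / eps)                      # Gibbs kernel
    u, v = np.ones(len(a)), np.ones(len(b))
    for _ in range(n_iter):
        # the exponents tau/(tau+eps) < 1 soften the marginal constraints
        u = (a / (K @ v)) ** (tau1 / (tau1 + eps))
        v = (b / (K.T @ u)) ** (tau2 / (tau2 + eps))
    return u[:, None] * K * v[None, :]        # transport plan P

# with large tau the plan's marginals recover a and b
a = np.array([0.5, 0.5])
b = np.array([0.25, 0.75])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
P = sinkhorn_uot(C, a, b, tau1=1000.0, tau2=1000.0)
```

With large $\tau_1$ and $\tau_2$ the plan's marginals approach $a$ and $b$, while small values tolerate mass variation, which is exactly the flexibility UOTSumm exploits later.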

Section to summary sentence alignment
Section structures widely exist in long documents of various genres, since it is natural to split a long document into subdivisions to relieve the burden on readers. One section usually consists of a series of sentences. It can be a paragraph in fiction or a section in a research paper, as long as it is coherently organized and related to a single topic. Formally, the training set for text summarization is usually organized as a set of document-summary pairs. For each document-summary pair, the source long document contains $m$ sections $\{s_i\}_{i=1}^{m}$, where each section $s_i$ contains $\ell_i$ sentences $\{x_k\}_{k=1}^{\ell_i}$; and the summary contains $n$ sentences $\{y_j\}_{j=1}^{n}$. Text summarization is a lossy procedure, and the information in a summary is only part of that in its source document. Correspondingly, as indicated in Table 1, the average summary length is much shorter than the average document length.

Now, we introduce the notion of S2SS alignment for the LDS task. We assign one unit score to each summary sentence $y_j$, which represents the total amount of information contained in $y_j$. This setting ensures that all the summary sentences $\{y_j\}_{j=1}^{n}$ are treated equally. We define a score $P_{i,j} \in [0, 1]$ to measure the amount of information that summary sentence $y_j$ gets from section $s_i$. When $P_{i,j}$ is larger, sentence $y_j$ gets more information from section $s_i$. As two extreme cases, $P_{i,j} = 0$ indicates that $y_j$ is irrelevant to $s_i$, and $P_{i,j} = 1$ indicates that $y_j$ is generated exclusively from $s_i$. An arrow from $s_i$ to $y_j$ (as illustrated in Figure 1) symbolizes $P_{i,j} > 0$, that is, there exists some degree of alignment between $s_i$ and $y_j$. Some sections may be aligned to more than one summary sentence, and some sections may not be aligned to any summary sentence. One summary sentence must be aligned to at least one section. For each $y_j$, its information must come from the source document, hence we have: $\forall j,\ \sum_{i=1}^{m} P_{i,j} = 1$.
In contrast, each section $s_i$ may provide information for at most $n$ summary sentences. Besides the above explanation, $P_{i,j}$ can also be understood from the following perspectives:
1. $P_{i,j}$ measures the possibility that $y_j$ is one summary sentence of section $s_i$.
2. $P_{i,j}$ measures the degree of S2SS alignment between sentence $y_j$ and section $s_i$.
We name the matrix $P$ the S2SS alignment plan between $\{s_i\}_{i=1}^{m}$ and $\{y_j\}_{j=1}^{n}$. We give an illustration of S2SS alignment in Figure 1. In practice, it is very common that one summary sentence is based entirely on one particular section. A natural question is: is it suitable to formulate $P_{i,j}$ as a continuous variable in $[0, 1]$ instead of a discrete 0-1 variable? In what follows, we discuss two situations that justify the rationality of our formulation. First, one sentence summarizes content from several different sections; in this situation, a discrete 0-1 variable cannot measure the amount of information from different source sections. Second, different sections have overlapping information, and the content of one summary sentence is based on the overlapping part; in this situation, the summary sentence should be equally likely to be aligned to any of those sections. In Table 2, we present two cases: Case 1 illustrates the first situation, and Case 2 illustrates the second.
It should be pointed out that an oracle S2SS alignment plan exists, which can be decided by human judgment. But usually, no explicit oracle alignment is annotated for real-world documents.
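The constraints above can be checked on a toy alignment plan (the values below are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical S2SS alignment plan P for m=3 sections and n=4 summary
# sentences (values are illustrative, not learned).
P = np.array([
    [1.0, 0.7, 0.0, 0.0],   # s_1 generates y_1 fully and most of y_2
    [0.0, 0.3, 1.0, 0.5],   # s_2 contributes to y_2, y_3, y_4
    [0.0, 0.0, 0.0, 0.5],   # s_3 contributes to y_4 only
])

# Every summary sentence carries one unit of information drawn from
# the sections, so each column must sum to exactly 1 ...
assert np.allclose(P.sum(axis=0), 1.0)

# ... while row sums are unconstrained: they count how many summary
# sentences each section is (fractionally) aligned to.
row_mass = P.sum(axis=1)
```

Note how the third section has row mass 0.5: a section can be aligned to less than one full sentence, which is exactly the mass variation UOT is designed to handle.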

Divide-and-conquer approach to LDS
To summarize a long document, one direct approach is to treat the source document and summary as two single sequences and adopt the following objective for training:
$$\max_{\theta}\ P_\theta\big(\overline{y_1 y_2 \cdots y_n} \,\big|\, \overline{s_1 s_2 \cdots s_m}\big). \tag{6}$$
Here, the overline symbol denotes sequential concatenation of texts. However, existing neural architectures are not good at handling very long sequences.
To handle the LDS problem, DANCER (Gidiotis and Tsoumakas, 2020) breaks it into several smaller-sized problems. The authors assume that one summary sentence can be aligned to exactly one section. Since the oracle alignment is unavailable, they use the ROUGE tool to obtain a surrogate S2SS alignment. Concretely, ROUGE-L precision is computed between each summary sentence and each document sentence, and the summary sentence is aligned to the section containing the document sentence with the highest precision score. Then, a set of source-target pairs is constructed as $\{(s_i, \{y_{i_1}, y_{i_2}, \cdots\})\},\ i \in I$. Here, $s_i$ is a section with at least one aligned summary sentence, $I$ denotes the set of such indices, and $y_{i_1}, y_{i_2}, \cdots$ follow the order in the original summary: $i_1 < i_2 < \cdots$. Based on this surrogate S2SS alignment plan, DANCER adopts the following objective to train a NATS model:
$$\max_{\theta}\ \prod_{i \in I} P_\theta\big(\overline{y_{i_1} y_{i_2} \cdots} \,\big|\, s_i\big). \tag{7}$$
Compared with the objective in Formula (6), the sequence length involved in Formula (7) is much shorter, which is computationally easier. We point out that DANCER's approach to obtaining the surrogate S2SS alignment has some room for improvement:
1. Some studies show that ROUGE is a biased approach to text comparison (Kryscinski et al., 2019; Fabbri et al., 2021), since it relies only on superficial and exact token matching. Hence the alignment constructed by DANCER differs from human judgment, and NATS models trained on inexactly aligned source-target pairs are also biased. Besides, other approaches to text evaluation also have their respective shortcomings (Kryscinski et al., 2019; Fabbri et al., 2021). Therefore, it is interesting to consider learning the text alignment directly from data, without relying on any external tool.
2. At the training stage, both the source sections $\{s_i\}_{i=1}^{m}$ and the summary sentences $\{y_j\}_{j=1}^{n}$ are available for constructing the surrogate S2SS alignment, and this process recognizes the sections that are useful for training a NATS model. However, at the inference stage, no summary sentence is available to decide which sections should be adopted for generation. For DANCER, a heuristic is adopted to match section headings against a prepared keyword list including "introduction," "methods," "conclusion," etc.; the matched sections are recognized as important ones and adopted for generation at the inference stage. This heuristic is less rigorous, and it is hard to transfer to other domains or to long documents without section headings.
3. For DANCER, there is no way to decide the number of generated sentences for each section at the inference stage. One common stopping criterion is to set a threshold length for the generation, which neglects the differences among sections.
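For concreteness, the DANCER-style surrogate alignment described above can be sketched as follows (an illustrative reimplementation assuming ROUGE-L precision is the LCS length divided by the summary-sentence length; the function names are ours, not DANCER's):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def align_sentence(summary_sent, sections):
    """Align one summary sentence to the section whose best sentence has
    the highest ROUGE-L precision (sections: list of sentence lists)."""
    s = summary_sent.lower().split()
    best_sec, best_p = 0, -1.0
    for idx, section in enumerate(sections):
        for doc_sent in section:
            d = doc_sent.lower().split()
            # precision of the LCS relative to the summary sentence
            p = lcs_len(s, d) / max(len(s), 1)
            if p > best_p:
                best_sec, best_p = idx, p
    return best_sec
```

Because the matching is purely lexical, a paraphrased summary sentence with few exact token overlaps can easily be mis-aligned, which is precisely shortcoming 1 above.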

Joint learning of text alignment and abstractive summarization
In this section, we first briefly describe the architecture of UOTSumm, and then introduce its training objective. UOTSumm is made up of two modules: a Section-to-Summary (Sec2Summ) module with trainable parameters $\theta_1$, and an ASSC module with trainable parameters $\theta_2$. The Sec2Summ module learns to summarize each section in $\{s_i\}_{i=1}^{m}$. Any existing NATS model based on the encoder-decoder architecture can serve as the Sec2Summ module of UOTSumm. The ASSC module takes the representations of the sections $\{s_i\}_{i=1}^{m}$, that is, the section embeddings from the encoder of the Sec2Summ module, as its input. The ASSC module uses a sequence encoder, for example an LSTM, to model the context of the document, and predicts the number of aligned summary sentences $\phi^{\theta_2}_i$ for each section $s_i$. The vector $\phi^{\theta_2}$ is defined as $\phi^{\theta_2} = (\phi^{\theta_2}_1, \phi^{\theta_2}_2, \cdots, \phi^{\theta_2}_m)^T$. The architecture of UOTSumm is presented in Figure 2.
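The role of the ASSC module can be illustrated with a minimal sketch (the real module encodes section embeddings with an LSTM; here a random linear stand-in is used, and all names, shapes, and weights are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    # smooth nonnegative activation, so predicted counts are >= 0
    return np.log1p(np.exp(x))

def assc_predict(section_embs, W, w_out, b_out):
    """Predict phi_i >= 0, the expected number of aligned summary
    sentences per section, from section embeddings of shape (m, d)."""
    # stand-in for the LSTM context encoder: mix each section embedding
    # with a projection of the document-level mean before scoring
    ctx = section_embs + section_embs.mean(axis=0, keepdims=True) @ W
    return softplus(ctx @ w_out + b_out)          # shape (m,)

d = 8
section_embs = rng.normal(size=(5, d))            # 5 sections, dim 8
phi = assc_predict(section_embs,
                   rng.normal(size=(d, d)) * 0.1,
                   rng.normal(size=(d,)) * 0.1, 0.0)
```

The only property that matters downstream is that $\phi^{\theta_2}$ is a nonnegative vector computed from the source side alone, so it is available at inference time.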
We propose the following joint optimization problem w.r.t. $P$, $\theta_1$, and $\theta_2$ as the training objective for UOTSumm:
$$\min_{P,\ \theta_1,\ \theta_2}\ \langle P, C_{\theta_1} \rangle + \tau\, \mathrm{KL}\big(P\mathbf{1}_n \,\|\, \phi^{\theta_2}\big) \quad \text{s.t.}\ \ P \in \mathbb{R}^{m \times n}_+,\ \ P^T\mathbf{1}_m = \mathbf{1}_n, \tag{8}$$
where $\tau > 0$ is a hyper-parameter balancing the two terms. In Problem (8), $\mathbf{1}_n \in \mathbb{R}^n$ and $\mathbf{1}_m \in \mathbb{R}^m$ denote all-one vectors, and the cost matrix $C_{\theta_1}$ is defined as
$$[C_{\theta_1}]_{i,j} = L_{\theta_1}(s_i, y_j), \tag{9}$$
where $L_{\theta_1}(\cdot, \cdot)$ is the loss function of the Sec2Summ module; usually, the NLL loss $L_\theta(\cdot, \cdot)$ defined in Equation (2) is adopted. (Footnote a: for arXiv and PubMed, the list of section types and corresponding keywords adopted by DANCER is presented in Table 7.) Roughly speaking, the first term $\langle P, C_{\theta_1} \rangle$ in the objective of Problem (8) conducts abstractive summarization and S2SS alignment jointly, and the second term $\mathrm{KL}(P\mathbf{1}_n \| \phi^{\theta_2})$ automatically learns the number of aligned summary sentences for each section. Next, we explain Problem (8) in more detail.
1. When $C_{\theta_1}$ and $\phi^{\theta_2}$ are fixed as constants, the UOT problem in Formula (8) is a special form of Problem (5), in which only the source-document side is relaxed with a KL divergence.
2. From the constraint of Problem (8), it can be observed that the variable $P$ satisfies exactly the same requirements as the S2SS alignment plan defined in Section 4.1.
3. The term $\langle P, C_{\theta_1} \rangle$ can be written in a summation form:
$$\langle P, C_{\theta_1} \rangle = \sum_{j=1}^{n} \sum_{i=1}^{m} P_{i,j}\, L_{\theta_1}(s_i, y_j), \tag{10}$$
which has the following properties:
(a) For any summary sentence $y_j$, the constraint $\sum_{i=1}^{m} P_{i,j} = 1$ ensures that we must use one unit amount of aligned sections $\{s_i\}$ in total to minimize the term $\sum_{i=1}^{m} P_{i,j} L_{\theta_1}(s_i, y_j)$.
(b) To minimize $\sum_{i=1}^{m} P_{i,j} L_{\theta_1}(s_i, y_j)$ for each $j$, the $i$-th term $P_{i,j} L_{\theta_1}(s_i, y_j)$ with a smaller loss value $L_{\theta_1}(s_i, y_j)$ leads to a larger value of $P_{i,j}$. In contrast, an unaligned pair $(s_i, y_j)$, that is, one with $P_{i,j} = 0$, does not serve as training data for the Sec2Summ module.
(c) Minimizing the term $\langle P, C_{\theta_1} \rangle$ accomplishes two purposes: finding the aligned sections $\{s_i\}$ for each summary sentence $y_j$, and using the set of aligned pairs $\{(s_i, y_j)\}$ to train the Sec2Summ module.
4. The second term $\mathrm{KL}(P\mathbf{1}_n \| \phi^{\theta_2})$ in the training objective is explained as follows:
(a) For each $i$, $\sum_{j=1}^{n} P_{i,j}$ is the number of summary sentences that are aligned to the $i$-th section $s_i$.
(b) Minimizing the term $\mathrm{KL}(P\mathbf{1}_n \| \phi^{\theta_2})$ helps to train the parameters $\theta_2$ of the ASSC module, so that $\phi^{\theta_2}_i \geq 0$ is a good estimate of the number of aligned summary sentences for section $s_i$.
(c) Computing $\phi^{\theta_2}$ in forward propagation only requires the sections $\{s_i\}_{i=1}^{m}$ from the source-side document. Hence, at the inference stage, the ASSC module can decide the number of generated sentences for each section when the ground-truth summary is unavailable.
(d) Learning $\phi^{\theta_2}$ does not utilize any section heading, such as "introduction," "methods," or "conclusion" in scientific papers. Therefore, different from DANCER, UOTSumm can be directly applied to any type of long article organized in sections, even when section headings are unavailable.
5. From one perspective, the joint training objective in Problem (8) can be understood as requiring that the Sec2Summ module with parameters $\theta_1$, the ASSC module with parameters $\theta_2$, and the S2SS alignment plan $P$ be optimal at the same time.
6.
Problem (8) can be understood from the viewpoint of OT (Peyré and Cuturi, 2019) as follows:
(a) Using the terminologies discussed in Section 2.4, we stipulate that the section is the semantic unit of source documents, and the sentence is the semantic unit of summaries.
(b) One source document is represented as a probability distribution over its sections, whose mass is unknown in advance; hence, we parameterize its mass as a learnable vector $\phi^{\theta_2}$. One target summary is represented as a probability distribution over its sentences, where each summary sentence has an equal amount of mass.
(c) UOTSumm tries to move information from the distribution of source sections to the distribution of summary sentences in the least-effort way.
(d) The moving cost is measured by the loss values of a NATS model. This setting is reasonable, since the loss value is smaller when one sentence is more likely to be the summary of one section.
(e) The mass vector $\phi^{\theta_2}$ on the source side is learnable. We cannot guarantee that the source and target sides always have the same amount of total mass, which is prescribed by the balanced OT problem in Formula (3). In other words, $\sum_{i=1}^{m} \phi^{\theta_2}_i = \sum_{j=1}^{n} \sum_{i=1}^{m} P_{i,j} = n$ is not always guaranteed. Therefore, we adopt the unbalanced formulation in Problem (8).

Training and inference strategies
We propose to apply an alternating optimization method (Bezdek and Hathaway, 2002) to train the joint objective of UOTSumm in Formula (8). The basic idea is: we fix two of the variables $\{P, \theta_1, \theta_2\}$, optimize the remaining one, and repeat this procedure in a rotating way in each training iteration. The detailed training procedure of UOTSumm is presented in Algorithm 1, and one loop of Algorithm 1 is visualized in Figure 3. Although our method has two network modules with separate optimizers, they are trained alternately and the whole system works in an end-to-end fashion.

Algorithm 1 Training framework of UOTSumm
Require: The whole dataset of paired documents and summaries.
1: repeat
2: Get one batch of document-summary pairs.
3: For each document, encode the sections {s_i}_{i=1}^{m} with the section encoder.
4: For each pair of section and summary sentence (s_i, y_j), conduct tentative decoding with the summary decoder and get the loss value L_{θ1}(s_i, y_j). Compute L_{θ1}(s_i, y_j) for all possible combinations {(s_i, y_j)}_{1≤i≤m, 1≤j≤n}, and construct the cost matrix C_{θ1} in Formula (9).
5: Use the section embeddings from Step (3) as the input of the ASSC module, and compute the number of aligned summary sentences ϕ_{θ2,i} for each section s_i.
6: Solve UOT Problem (11) with Algorithm 2, and get the solution P*.
7: Assign each summary sentence y_j to the section s_i with the largest alignment score P*_{i,j}, and get an S2SS alignment set {(s̃_i, {ỹ_{i1}, ỹ_{i2}, ...})}. If one section is not assigned any summary sentence, it is excluded from the alignment set.
8: For the S2SS alignment set {(s̃_i, {ỹ_{i1}, ỹ_{i2}, ...})}, concatenate the aligned summary sentences for each section and get the set {(s̃_i, ỹ_{i1} ỹ_{i2} ...)}. The concatenation follows the sentence order in the original summary.
9: Fix the parameters θ2 of the ASSC module. Conduct decoding with the summary decoder for the set {(s̃_i, ỹ_{i1} ỹ_{i2} ...)}, compute the average of L_{θ1}(s̃_i, ỹ_{i1} ỹ_{i2} ...) as the training objective, and update the parameters θ1 of the Sec2Summ module by back-propagation.
10: Fix the parameters θ1 of the Sec2Summ module. Compute the average of KL(P* 1_n || ϕ_{θ2}) over the whole batch as the training objective, and update the parameters θ2 of the ASSC module by back-propagation.
11: until the termination criterion of the Sec2Summ module is satisfied on the validation set.
Ensure: The UOTSumm model with trained parameters θ1 and θ2.
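One loop of Algorithm 1 (Steps 3-10) can be sketched in plain Python. Here `sec2summ_loss`, `assc_mass`, and `solve_uot` are hypothetical stand-ins for the Sec2Summ tentative decoder, the ASSC module, and the Sinkhorn solver of Algorithm 2; the actual gradient updates of Steps (9)-(10) are omitted and replaced by returned training targets.

```python
import numpy as np

def train_step(sections, summary_sents, sec2summ_loss, assc_mass, solve_uot):
    """One loop of Algorithm 1 (Steps 3-10), with the two network
    updates replaced by the targets they would be trained on."""
    m, n = len(sections), len(summary_sents)
    # Step 4: tentative decoding -> cost matrix C_{theta1}
    C = np.array([[sec2summ_loss(s, y) for y in summary_sents] for s in sections])
    # Step 5: predicted mass (number of aligned sentences) per section
    phi = assc_mass(sections)                       # shape (m,)
    # Step 6: solve the (entropic) UOT problem for the plan P*
    P = solve_uot(C, phi)                           # shape (m, n)
    # Step 7: hard-assign each summary sentence to its argmax section
    owner = P.argmax(axis=0)
    # Step 8: concatenate aligned sentences in original summary order
    targets = {i: " ".join(summary_sents[j] for j in range(n) if owner[j] == i)
               for i in set(owner.tolist())}
    # Step 9 would update theta1 on the (section, target) pairs;
    # Step 10 would regress phi toward the row sums P @ 1_n.
    return targets, P.sum(axis=1)
```

The returned dictionary corresponds to the concatenated section-level targets of Step (8), and the row sums `P.sum(axis=1)` are the continuous labels used for the ASSC regression in Step (10).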
When θ1 and θ2 are fixed, we need to solve the UOT problem in Formula (8), where C_{θ1} and ϕ_{θ2} are constants in this case. Considering computational efficiency, we adopt the practice in Cuturi (2013) and Frogner et al. (2015) and solve the entropy-regularized UOT problem in Formula (11). Here, H(P) is an entropy regularization term defined as H(P) = −∑_{i,j} P_{i,j}(log(P_{i,j}) − 1). We choose the hyper-parameter ε as a small positive value, so that Problem (11) is a good approximation of the original UOT problem in Formula (8). We utilize the Sinkhorn algorithm in the log domain (Chizat et al., 2018; Schmitzer, 2019) to solve Problem (11); its details are presented in Algorithm 2. In Step (4) of Algorithm 1, we compute L_{θ1}(s_i, y_j) for each pair of section and summary sentence (s_i, y_j) given the fixed θ1. We name this procedure tentative decoding, since these loss values are not used for back-propagation. For document sections {s_i}_{i=1}^{m} and summary sentences {y_j}_{j=1}^{n}, there are m × n possible pairs (s_i, y_j) in total. A natural question is then: is tentative decoding very slow? Fortunately, we observe that it is not an issue in practice. For UOTSumm implemented with PyTorch (Paszke et al., 2019), in each iteration, the total time of tentative decoding is similar to the total time of the decoding step used for back-propagation. Since it is not an emphasis of our paper, we only conjecture the reason for this phenomenon: when the decoded results are not used for back-propagation, PyTorch stores fewer intermediate variables and requires less computation.
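As an illustration of the log-domain Sinkhorn step, the following is a minimal NumPy sketch of a semi-relaxed entropic UOT solver in the spirit of Chizat et al. (2018). It is not the paper's Algorithm 2: the penalty weight `rho` and the choice to relax only the source marginal (playing the role of the learnable mass ϕ_{θ2}) while keeping the target marginal hard are assumptions of this sketch.

```python
import numpy as np

def _lse(M, axis):
    """Numerically stable log-sum-exp along an axis."""
    mx = np.max(M, axis=axis, keepdims=True)
    return np.squeeze(mx, axis) + np.log(np.sum(np.exp(M - mx), axis=axis))

def uot_sinkhorn_log(C, a, b, eps=0.05, rho=1.0, n_iters=500):
    """Log-domain Sinkhorn for a semi-relaxed entropic UOT problem:
        min_P <C, P> - eps * H(P) + rho * KL(P @ 1 || a),  s.t. P.T @ 1 = b.
    The source marginal a is only softly enforced via the KL penalty;
    each target (summary-sentence) mass b_j is matched exactly."""
    f = np.zeros(C.shape[0])        # dual potential for sections
    g = np.zeros(C.shape[1])        # dual potential for summary sentences
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iters):
        # damped update induced by the KL-relaxed source marginal
        f = (eps * rho / (eps + rho)) * (log_a - _lse((g[None, :] - C) / eps, axis=1))
        # exact update enforcing the hard target marginal
        g = eps * (log_b - _lse((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)
```

Because every iteration ends with the exact target update, the returned plan satisfies the column constraint P.T @ 1 = b up to floating-point error, while the row sums are free to deviate from a in proportion to rho.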
We formulate the alignment score between summary sentence y_j and section s_i as a continuous variable P_{i,j} ∈ [0, 1]. The advantages of this formulation have been discussed in Sections 4.1 and 4.3. However, it brings difficulty for the Sec2Summ module in the procedure of alternating optimization. Consider the following situation. After Step (6) of Algorithm 1, for a summary sentence y_j, more than one alignment score P*_{i,j} may be positive. To update the Sec2Summ module, a unique target summary sequence is required for Seq2Seq learning. The difficulty is then: for y_j with positive scores on multiple sections, which section should it be aligned to? As indicated in Step (7) of Algorithm 1, we make a compromise and adopt an approximate strategy. The experimental results in Section 5.3 show that this approximation is suitable for practical usage, and we leave a more accurate algorithm for optimizing the objective in Problem (8) as future work. The inference procedure of UOTSumm for summary generation is presented in Algorithm 3. It is made up of two steps: section selection and abstractive summarization of the selected sections. It should be highlighted that computing ϕ_{θ2} only requires the sections {s_i}_{i=1}^{m} from the source-side document. Hence, at inference stage, the ASSC module can decide the number of generated sentences for each section even when the ground-truth summary is unavailable.
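The two-step inference procedure (section selection, then section-level generation) can be sketched as follows. `assc_mass`, `generate`, and the `min_mass` cut-off are hypothetical placeholders for this sketch, not details taken from Algorithm 3.

```python
def uotsumm_inference(sections, assc_mass, generate, min_mass=0.5):
    """Sketch of UOTSumm inference: (1) the ASSC module predicts how many
    summary sentences each section should contribute; (2) the decoder
    summarizes only the selected sections. `min_mass` is an assumed
    cut-off below which a section contributes no sentence."""
    summary = []
    for sec, mass in zip(sections, assc_mass(sections)):
        if mass < min_mass:           # section selection step
            continue
        n_sents = max(1, round(mass)) # requested sentence budget
        summary.append(generate(sec, n_sents))
    return " ".join(summary)
```

The key point mirrored here is that the predicted mass depends only on the source sections, so the procedure needs no ground-truth summary at test time.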

Ablation models
In this section, we introduce three ablation models of UOTSumm. As discussed in Section 4.3, for a section and summary sentence pair (s_i, y_j), a smaller loss value L_{θ1}(s_i, y_j) leads to a larger alignment score P_{i,j}. One question is then: based on this relationship, can we simplify the alignment procedure in Algorithm 1? To this end, we remove Step (6) and modify the alignment strategy in Step (7) as: assign each summary sentence y_j to the section s_i with the smallest cost C_{θ1,ij}. Besides, we replace the KL divergence term in Step (10) with a squared loss ∑_{i=1}^{m} (ϕ*_i − ϕ_{θ2,i})², where ϕ*_i is the number of aligned summary sentences for section s_i. We treat this modified procedure as one ablation model of UOTSumm and name it "simple alignment." The ASSC module learns to predict the number of generated sentences for each section. To investigate the effectiveness of this module, we further simplify simple alignment into the second ablation model. We replace the regression objective of the ASSC module with a classification objective. Concretely, we use 1 as the label if ϕ*_i > 0, use 0 as the label if ϕ*_i = 0, and adopt a binary cross-entropy loss for training. Since this ablation model does not learn the number of generated sentences, we set a threshold to restrict the length of the generated summary. In practice, the threshold value depends on the dataset; we try different values and report the best results in the experiment.
For a summarization task that requires generating multiple sentences, one common problem is that the model may generate repetitive or similar sentences. For DANCER-based methods, this problem may be more serious because summary sentences are independently produced from different sections. To handle this issue, we adopt trigram blocking (Paulus, Xiong, and Socher, 2018; Liu and Lapata, 2019) as a post-processing step in the inference procedure. The details are included in Algorithm 3. To investigate its effect, UOTSumm without trigram blocking is treated as one ablation model.
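Trigram blocking itself is straightforward; a minimal token-level sketch (not the exact implementation inside Algorithm 3) is:

```python
def trigrams(tokens):
    """All word trigrams of a token list, as a set of tuples."""
    return {tuple(tokens[k:k + 3]) for k in range(len(tokens) - 2)}

def trigram_blocking(section_summaries):
    """Drop a generated sentence if it repeats any trigram that an
    earlier kept sentence has already emitted."""
    kept, seen = [], set()
    for sent in section_summaries:
        tri = trigrams(sent.lower().split())
        if tri & seen:
            continue          # blocked: shares a trigram with kept text
        kept.append(sent)
        seen |= tri
    return kept
```

Applied to the concatenated section-level outputs, this removes near-duplicate sentences that different sections may have produced independently.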

Datasets and evaluation metrics
We adopt three popular LDS datasets for evaluation: arXiv, PubMed, and GovReport. The statistics of these three datasets are included in Table 1. Their descriptions and pre-processing steps are as follows.
• arXiv and PubMed (Cohan et al., 2018) are two datasets collected from research papers.
One research paper is treated as the source, and its abstract is treated as the summary. We adopt the same data split as Cohan et al. (2018). For arXiv, the sizes of the training/validation/testing sets are 203,037/6,436/6,440. For PubMed, the sizes of the training/validation/testing sets are 119,924/6,633/6,658. The documents in these two datasets were pre-processed into one-level sections by the original authors, and we follow the same way of dividing sections.
• GovReport (Huang et al., 2021) contains long reports published by the U.S. Government Accountability Office to fulfill requests by congressional members, and by the Congressional Research Service covering research on a broad range of national policy issues. We adopt the same data split as Huang et al. (2021), and the sizes of the training/validation/testing sets are 17,519/974/973. For GovReport, the documents are organized in a form of multi-level sections. Concretely, each document is made up of several sections, each section contains several subsections and/or several paragraphs, and each subsection contains several paragraphs.[b] To accommodate UOTSumm, we need to transform the documents into a form of one-level sections. Our primary consideration is that the length of each section should be moderate. To this end, we use "paragraphs" as the key[c] to recursively iterate over the multi-level structures, and concatenate all the paragraphs under each "paragraphs" key as one section.
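The recursive flattening of GovReport's multi-level structure can be sketched as below. Only the "paragraphs" key is taken from the paper; the "subsections" key and the exact nesting are assumptions about the dataset's dictionary layout.

```python
def flatten_sections(report):
    """Recursively collect every 'paragraphs' list from GovReport's
    nested section dictionaries; each collected list is concatenated
    into one flat, moderate-length section."""
    sections = []

    def visit(node):
        if isinstance(node, dict):
            if node.get("paragraphs"):
                sections.append(" ".join(node["paragraphs"]))
            for child in node.get("subsections", []):  # assumed key name
                visit(child)
        elif isinstance(node, list):
            for child in node:
                visit(child)

    visit(report)
    return sections
```

Each returned string then serves as one one-level section for the section encoder.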
As shown in Table 1, the documents of the BillSum dataset are relatively longer than those from news datasets. We manually checked BillSum and made the following findings. Although the average section number of its documents is 4.4, the sentences are usually concentrated in one or two sections. The documents contain many very short sections with only one uninformative sentence, whose headings are "short title," "effective date," "funding," etc. Therefore, BillSum is not suitable for DANCER-based summarization models, since the motivation of these models is to reduce the input length by splitting a long document into several moderate-length sections. We do not consider BillSum as an evaluation dataset.
[b] To avoid confusion, the italic "section," "subsection," and "paragraph" refer to the structures defined by the original dataset authors (Huang et al., 2021).
[c] The data of GovReport are stored in a multi-level dictionary data structure, and "paragraphs" is one dictionary key.
Currently, ROUGE is the default and most popular metric for evaluating summarization models. However, as discussed in Fabbri et al. (2021), ROUGE has some shortcomings, and some other evaluation metrics make up for these disadvantages. To make the comparison more complete and convincing, we adopt three automatic evaluation metrics in this paper: ROUGE, BERTScore, and MoverScore. They are described as follows.
• ROUGE (recall-oriented understudy for gisting evaluation) (Lin, 2004) measures the number of overlapping textual units, that is, n-grams and word sequences, between a generated summary and its ground-truth reference. F-measures of ROUGE-1, ROUGE-2, and ROUGE-L are reported.
• BERTScore (Zhang et al., 2020b) computes token-level similarity scores by aligning the generated summary and the ground-truth reference. Instead of exact matches, it computes token similarity using contextualized token embeddings from BERT (Devlin et al., 2019). The F1-measure of BERTScore is reported.
• MoverScore (Zhao et al., 2019) utilizes the Word Mover's Distance (Kusner et al., 2015) to compare a generated summary and its ground-truth reference. It operates over n-gram embeddings pooled from BERT representations.
SummEval (Fabbri et al., 2021) is a unified and easy-to-use toolkit which contains common evaluation metrics for text summarization. We utilize SummEval with its default settings to compute the above three metrics.

Implementations
UOTSumm provides a general-purpose training objective for the LDS task. Its implementation is made up of two modules: a Sec2Summ module and an ASSC module. The Sec2Summ module can be any existing NATS model. NATS models can be grouped into two categories: learning-from-scratch (Rush et al., 2015; See et al., 2017) and the pretrain-finetune paradigm (Liu and Lapata, 2019; Lewis et al., 2020; Zhang et al., 2020a). To demonstrate the universality and effectiveness of UOTSumm, we choose one typical NATS model from each category and adapt it to UOTSumm. Their details are described as follows.
• PG-Net (See et al., 2017) is a representative learning-from-scratch NATS model. We follow most settings of the vanilla PG-Net. The vocabulary size is set to 50,000. For arXiv and PubMed, we adopt the vocabulary provided by Cohan et al. (2018). For GovReport, since Huang et al. (2021) did not provide a vocabulary, we take the 50,000 most frequent words in the training set as the vocabulary.
• BART (Lewis et al., 2020) is a representative of the pretrain-finetune paradigm. We use the publicly released BART model fine-tuned on CNN/DM (Hermann et al., 2015) to initialize model parameters. Our implementation is based on the AllenNLP wrapper of BART. We follow most settings of its vanilla implementation except the learning rate, which is tuned from {1e-5, 1.5e-5, 3e-5, 5e-5}.
For the ASSC module, the Adam (Kingma and Ba, 2014) optimizer is adopted for training with a learning rate of 1e-5. All the experiments are conducted on one NVIDIA TITAN RTX GPU with 24 GB memory or one NVIDIA RTX A6000 GPU with 48 GB memory, depending on dataset and model sizes. At the testing stage, we adopt a beam size of 4 for all the variants of UOTSumm. We only implement the ablation models for BART-based UOTSumm, which adopt the same experimental settings as the full implementation.

Baselines and results
In this section, we compare UOTSumm with some competitive NATS baselines. We implement two variants of UOTSumm. Generally speaking, pretrain-finetune-based NATS models are more powerful than learning-from-scratch models, since the former benefit from knowledge transferred from external corpora. To ensure a fair comparison, the baselines are accordingly classified into two groups. To compare with PG-Net-based UOTSumm, we adopt the following learning-from-scratch NATS models.
• Seq2Seq (Chopra, Auli, and Rush, 2016; Nallapati et al., 2016), a Seq2Seq NATS model equipped with attention.
• PG-Net (See et al., 2017), a NATS model featuring the copying (Gu et al., 2016) and coverage (Tu et al., 2016) mechanisms.
• Discourse-Aware (Cohan et al., 2018), a NATS model equipped with a hierarchical encoder to capture the discourse structure of the document and a discourse-aware decoder.
• Ext + TLM (Pilault et al., 2020), an extractive-and-abstractive summarization model based on Transformer. Its extractive stage relies on ROUGE to produce the ground-truth extraction targets.
• Reinforce-selected sentence rewriting (RSSR) (Chen and Bansal, 2018), an extractive-and-abstractive NATS model. RSSR is made up of an extractor which extracts sentences, and an abstractor which rewrites the extracted sentences as a summary. The extractor and the abstractor are bridged together with policy-based reinforcement learning.
• DANCER + PG-Net (Gidiotis and Tsoumakas, 2020), the DANCER framework combined with PG-Net. The authors did not provide code for this version of DANCER.
To compare with BART-based UOTSumm, baselines are chosen from the following pretrain-finetune-based NATS models.
• PEGASUS (Zhang et al., 2020a), a self-supervised pre-training objective specifically designed for text summarization. Some important sentences are masked and generated as one output sequence conditioned on the remaining sentences. Pre-trained PEGASUS is often adopted by other NATS models for fine-tuning.
• BigBird + PEGASUS (Zaheer et al., 2020), fine-tuning PEGASUS for BigBird. BigBird combines sliding-window, global, and random token attentions in its encoder.
• DANCER + PEGASUS (Gidiotis and Tsoumakas, 2020), fine-tuning PEGASUS for DANCER.
• BART (Lewis et al., 2020), a denoising auto-encoder for pre-training Seq2Seq models.
• MCS + BART (Manakul and Gales, 2021), a multitask content selection model with sentence-level extractive labeling. Its training-stage content selection relies on ROUGE.
• DYLE + RoBERTa + BART (Mao et al., 2022), a dynamic latent extraction approach for abstractive LDS. DYLE is made up of an extractor initialized with RoBERTa and a generator initialized with BART.
• LED + BART (Beltagy et al., 2020), fine-tuning BART for a Longformer variant. Longformer's attention mechanism combines a local windowed attention with a task-motivated global attention.
• Stride Patterns (Child et al., 2019), a sparse factorization of the self-attention matrix which reduces the quadratic computational complexity.
• LSH (Kitaev et al., 2019), which replaces dot-product attention with locality-sensitive hashing to reduce complexity.
• Sinkhorn Attention (Tay et al., 2020), which segments a sequence into blocks and adopts a learnable Sinkhorn sorting network to reduce complexity.
• Hepos (Huang et al., 2021), an efficient encoder-decoder attention mechanism with head-wise positional strides to pinpoint salient information from the source document.
Some of the above baseline models do not have reported results on arXiv, PubMed, or GovReport. Besides, for all the existing results on these three datasets, only ROUGE scores are reported. Based on these two facts, we reproduce the following baseline models: RSSR, PEGASUS, BigBird, PEGASUS-based DANCER, BART, and LED. In this way, the BERTScore and MoverScore results for these models can be obtained, and the baselines for the three datasets are more consistent. All the pre-trained Transformer models are downloaded from Huggingface models.[k] We only reproduce DANCER on arXiv and PubMed,[l] because DANCER requires some human-designed rules for selecting sections, which are unavailable for GovReport. For PEGASUS, BigBird, and DANCER, our reproduced ROUGE scores differ slightly from the scores reported in the original papers. We follow the practice of Zaheer et al. (2020) and report all versions of the ROUGE scores. The results of UOTSumm and the baseline models on arXiv, PubMed, and GovReport are reported in Tables 3-5, respectively. For UOTSumm, we report results of four variants: the full implementation and the three ablation models introduced in Section 4.5. Baseline models specially designed for the LDS task are marked with the symbol ♣. The symbol ‡ denotes that the results are produced by us, while results without ‡ are taken from the original papers.
We can draw the following conclusions from the results.

1. On arXiv and PubMed, PG-Net-based UOTSumm outperforms PG-Net by a large margin. On all three datasets, UOTSumm finetuned from BART outperforms BART by a large margin. The performance gain partly comes from the divide-and-conquer approach, which is also adopted by DANCER.
2. On arXiv and PubMed, PG-Net-based UOTSumm outperforms PG-Net-based DANCER, and finetuned UOTSumm outperforms finetuned DANCER, in terms of all the evaluation metrics. UOTSumm directly learns the S2SS alignment from data, while DANCER achieves S2SS alignment via ROUGE. The improvements demonstrate that our purely data-driven approach captures better text alignment than ROUGE.
3. The listed pretrain-finetune-based baselines are all recent competitive NATS models. We compare BART-based UOTSumm with them and analyze the results as follows.
[k] https://huggingface.co/models.
[l] We use the repository provided by the original paper authors: https://github.com/AlexGidiotis/DANCER-summ. On this page, the authors mention that this implementation differs from the one used in their paper. Hence, the reproduced scores are slightly different from the scores reported in Gidiotis and Tsoumakas (2020).
In Tables 3-5, we use R-1, R-2, R-L, BERT-S, and Mover-S as abbreviations for ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and MoverScore, respectively. The boldface type indicates that the model achieves the best performance in terms of the corresponding evaluation metric.
To facilitate a fair comparison, the group of pretrain-finetune-based models is separately presented in the bottom part of each table.
(a) On arXiv and PubMed, UOTSumm finetuned from BART outperforms all these baselines in terms of all the evaluation metrics. To investigate the statistical significance of the comparison, we further conduct the following experiment. We use stratified random sampling (Noreen, 1989) to sample document-summary pairs from the testing sets of arXiv and PubMed. We create the subgroups (a.k.a. strata) based on the sentence number of the ground-truth summary. Concretely, three subgroups are specified: a subgroup with short summaries, a subgroup with medium-length summaries, and a subgroup with long summaries. We use proportionate sampling to get 1,000 document-summary pairs from each subgroup, and thus 3,000 document-summary pairs in total from each testing dataset. We calculate the statistical significance level based on the bootstrap test. On both arXiv and PubMed, BART-based UOTSumm is statistically significantly better than every pretrain-finetune-based baseline model for all the evaluation metrics, with significance level p < 0.05.
(b) On GovReport, UOTSumm finetuned from BART is comparable with the baseline model DYLE and outperforms all the other baseline models. It should be highlighted that DYLE is the state-of-the-art model on GovReport. It inherits knowledge from the powerful pre-trained model RoBERTa besides BART, while our method only utilizes knowledge from BART. Hence, the comparison between UOTSumm and DYLE is not entirely fair, and it favors DYLE. As shown in Table 1, the average number of sections and summary sentences is large for GovReport. Hence, the performance of UOTSumm demonstrates that it is a suitable choice in this setting.
4. Consider the ablation model simple alignment: its results are close to those of the full implementation of UOTSumm. Roughly speaking, simple alignment is a good simplification of UOTSumm. However, some details need to be investigated. Besides the way of aligning summary sentences to sections, another difference between simple alignment and UOTSumm is that the labels for training the ASSC module are different. Simple alignment uses ϕ* with integer values as labels, while UOTSumm uses P*1_n, which allows continuous values. The latter is closer to the real world, because it accommodates two situations discussed
in Section 4.1: one sentence summarizes content from several different sections, and one summary sentence is based on the overlapping information of several different sections. We manually checked some documents from the three benchmark datasets. We found that the first situation happens infrequently, while the second situation is rather common. In research papers, it is often the case that the content of one summary sentence appears in the "introduction" section, the "conclusion" section, and some other sections. Case 2 in Table 2 is one typical example. In this case, the summary sentence contributes close scores (e.g., 0.33, 0.33, and 0.34) to the scalar components corresponding to Sections 1-3 in P*1_n. In contrast, the summary sentence contributes one-hot scores (e.g., 0.0, 0.0, and 1.0) to the scalar components corresponding to Sections 1-3 in ϕ*. Then, the ASSC module of simple alignment is trained with inaccurate labels, because any of the three sections can serve as the source of the summary sentence, but only one section is chosen. We conjecture this is the reason why the performance of UOTSumm is slightly better than that of simple alignment. Besides, the formulation of UOTSumm is explainable from the viewpoint of OT, which is rather elegant. To sum up, the full implementation of UOTSumm is more advantageous than simple alignment.
5. Consider the second ablation model. When the ASSC module is removed, the performance of simple alignment degrades obviously, especially on GovReport. This observation demonstrates the importance of the ASSC module and can be easily explained. Without the ASSC module, the same number of tokens is generated for different sections at inference stage. In practice, the lengths of summaries for different sections cannot always be the same. Another obvious advantage of the ASSC module is that the number of generated sentences is automatically learned from data, which avoids human effort in choosing a threshold.
6.
In most cases, trigram blocking improves the evaluation metric scores for UOTSumm on the three datasets, but the improvements are minute. This observation suggests that sentence repetition is not a severe problem for UOTSumm. One very interesting phenomenon is that on PubMed and GovReport, trigram blocking improves the ROUGE and MoverScore scores while slightly degrading the BERTScore. Currently, only ROUGE scores are reported in most summarization papers. This phenomenon suggests that we should stay alert to the performance gain brought by trigram blocking: does it really reduce semantic repetition, or does it just improve ROUGE scores?
7. The results of BigBird on GovReport are strange, which we explain as follows. The model size of BigBird is too large; it cannot be normally finetuned on one NVIDIA RTX A6000 GPU even when the batch size is set to 1. Hence, we freeze the encoder and only tune the decoder. Besides, the Huggingface website does not provide a pre-trained BigBird model from general domains. It only provides models that are finetuned on three datasets: arXiv, PubMed, and BigPatent (Sharma, Li, and Wang, 2019). We finetuned these three versions of BigBird on GovReport and report their results. The results show that the domain of GovReport is closer to BigPatent than to arXiv or PubMed. Besides, with a GPU of larger memory size, the BigBird baseline would be expected to get better results on GovReport.
8. It should be highlighted that UOTSumm is a universal framework for the LDS task, which is applicable to any existing NATS model. It consistently improves performance when combined with PG-Net or BART. When combined with a more powerful NATS model, UOTSumm is expected to bring further performance gains. We leave this topic as future work.
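The significance test described in point 3(a) can be sketched with a paired bootstrap over per-document scores. The resampling details below are illustrative assumptions, not the paper's exact procedure.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_boot=10000, seed=0):
    """Paired bootstrap test on per-document metric scores.

    Returns an estimate of the one-sided p-value: the fraction of
    bootstrap resamples in which system A does NOT beat system B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_boot):
        # resample document indices with replacement (paired across systems)
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return 1.0 - wins / n_boot
```

In the paper's setting, `scores_a` and `scores_b` would be per-document metric values (e.g., ROUGE-1 F) for UOTSumm and a baseline over the 3,000 stratified test pairs, and p < 0.05 would indicate a significant improvement.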

Case studies and human evaluation
In this part, we investigate some practical cases to show the advantages of UOTSumm at training stage. First, we compare the S2SS alignment ability of UOTSumm and of ROUGE at training stage. For ROUGE-based S2SS alignment, we follow the practice of DANCER (Gidiotis and Tsoumakas, 2020). Specifically, for each sentence y from the summary {y_j}_{j=1}^{n}, we compute the ROUGE-L precision between y and each document sentence x: ROUGE-L precision(x, y) = LCS(x, y) / |x|, where LCS(x, y) is the length of the longest common sub-sequence (LCS) between x and y, and |x| is the length of x (Formula (12)). Then, summary sentence y is aligned to the section s_i which contains the highest-scored sentence x. For our method, we utilize the well-trained UOTSumm model. We freeze its model parameters and execute Steps (3) to (7) of Algorithm 1. Then, each summary sentence y is aligned to one section s_i by UOTSumm. We choose three document-summary pairs from the training sets of arXiv, PubMed, and GovReport, and execute the above two S2SS alignment procedures. Limited by space, we select one representative summary sentence for each case. The ROUGE-based S2SS alignment method explicitly conducts sentence-to-sentence alignment, so we present the aligned document sentences and the corresponding ROUGE-L precision scores. Since UOTSumm directly conducts section-to-sentence alignment, we manually judge the semantically related sentences from the aligned section and present them. For both methods, we also present the headings of the aligned sections. The results are presented in Table 6 (in which a yellow background highlights the longest common sub-sequence of the summary sentence and the sentence aligned by ROUGE-L precision), from which we can draw the following conclusions.
For all the cases, UOTSumm and the ROUGE-based method align the summary sentence to different sections. UOTSumm correctly conducts the S2SS alignment, while the ROUGE-based method aligns the summary sentence to a wrong source sentence. This observation suggests that ROUGE-based text alignment does not correlate with human judgment in some situations. For the ROUGE-L precision in Formula (12), the denominator is the length of sentence x. Hence, the ROUGE-based method is prone to selecting a shorter sentence as long as it has a common sub-sequence with the summary sentence. It cannot detect two pieces of text that adopt different words while preserving the same meaning. In contrast, as shown in Formula (9), UOTSumm relies on neural architectures for text alignment, which are good at understanding literally different paraphrases and disturbed word order. Next, we discuss Case 3 in more detail. We use x_ROUGE and x_UOTSumm to denote the sentences aligned by the ROUGE-based method and by UOTSumm, respectively. The correctly aligned sentence x_UOTSumm gets a lower score: ROUGE-L precision(x_UOTSumm, y) = 14.63. In contrast, the wrongly aligned sentence x_ROUGE gets a higher score: ROUGE-L precision(x_ROUGE, y) = 42.11. This is because x_ROUGE and y share a long common sub-sequence, although the rest of the two sentences is totally irrelevant. For x_UOTSumm and y, we use blue and purple fonts to highlight the clauses that are semantically equivalent. Apparently, x_UOTSumm and y swap two primary clauses. Swapping the order of two clauses while preserving the sentence meaning is a very common linguistic phenomenon. However, it greatly degrades the ROUGE-L precision, which strictly relies on the word order. To sum up, these cases show that correctly conducting S2SS alignment is one reason why UOTSumm outperforms DANCER, since UOTSumm is supervised by a less-biased target. In the following, we investigate one case to show the advantages of UOTSumm at inference stage.
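The ROUGE-L precision of Formula (12) can be reproduced with a standard dynamic-programming LCS; the sketch below returns scores on a 0-1 scale (the paper reports them multiplied by 100).

```python
def lcs_len(x, y):
    """Length of the longest common subsequence of token lists x and y."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l_precision(x_tokens, y_tokens):
    """Formula (12): LCS length normalized by the length of sentence x."""
    return lcs_len(x_tokens, y_tokens) / len(x_tokens)
```

The normalization by |x| makes the length bias discussed above concrete: a short candidate sentence that shares any sub-sequence with the summary sentence scores high, while a longer, semantically correct sentence is penalized.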
We choose one document from the test set of PubMed and utilize the trained UOTSumm model to generate its summary. In Table 8, we present the section headings of this document, one summary sentence generated by UOTSumm, and the related ground-truth sentence. In brackets after the heading names, we list the numbers of generated sentences for the corresponding sections, which are computed by Step (5) of Algorithm 3. The sentence generated by UOTSumm well captures the meaning of one ground-truth sentence, while it is generated from the section with the heading "web page." As mentioned in Section 4.2, at inference stage, DANCER relies on a heuristic of heading matching to decide which sections should be adopted for summary generation. As a reference, we replicate their heuristic matching rules in Table 7. Except for the heading "introduction," the other headings of this document cannot be matched with any section type in Table 7. Hence, the section with the heading "web page" would never be adopted by DANCER for summary generation. This case demonstrates that since UOTSumm does not rely on the headings but directly learns the number of generated sentences for each section from the content, it can utilize any document section for summary generation at inference stage.
Table 8. One summary sentence generated by UOTSumm from the source section with the heading "web page": "The prosite website was redesigned and new predictive tools were implemented to assign more detailed functional information to the scanned proteins." The related ground-truth sentence: "During the last 2 years, the documentation and the scan prosite web pages were redesigned to add more functionalities."

In this part, we analyze Case 1 in Table 2, in which UOTSumm does not work very well. We use the UOTSumm-based method to conduct S2SS alignment for this case. Sections 1-3 get alignment scores of 0.04, 0.92, and 0.03, respectively. All the other sections together get an alignment score of 0.01. The UOTSumm-based method successfully aligns the summary sentence to Sections 1-3, which are indeed the source of this summary sentence. However, the distribution of alignment scores obviously violates human judgment. We infer the reason as follows. For the UOTSumm-based method, the S2SS alignment is mainly decided by the loss values of a NATS model. Since the word sequences "theoretical status" and "phenomenological applications" are both very short compared with the length of the summary sentence, current NATS models are prone to predict the summary sentence with large loss values. As UOTSumm demonstrates its effectiveness in learning text alignment directly from data, as future work we will explore UOT formulations in other NLP settings that involve text alignment.