
Query-based summarization of discussion threads

Published online by Cambridge University Press:  16 April 2019

Suzan Verberne*
Affiliation:
Leiden Institute for Advanced Computer Science, Leiden University, Leiden, The Netherlands
Emiel Krahmer
Affiliation:
Tilburg School of Humanities, Tilburg University, Tilburg, The Netherlands
Sander Wubben
Affiliation:
Tilburg School of Humanities, Tilburg University, Tilburg, The Netherlands
Antal van den Bosch
Affiliation:
Centre for Language Studies, Radboud University, Nijmegen, The Netherlands; Meertens Institute, Amsterdam, The Netherlands
*Corresponding author. Email: s.verberne@liacs.leidenuniv.nl

Abstract

In this paper, we address query-based summarization of discussion threads. New users can profit from the information shared in the forum if they can find the previously posted information. However, discussion threads on a single topic can easily comprise dozens or hundreds of individual posts. Our aim is to summarize forum threads given real web search queries. We created a data set with search queries from a discussion forum’s search engine log and the discussion threads that were clicked by the user who entered the query. For 120 thread–query combinations, a reference summary was made by five different human raters. We compared two methods for automatic summarization of the threads: a query-independent method based on post features, and Maximum Marginal Relevance (MMR), a method that takes the query into account. We also compared four different word embedding representations as alternatives to standard word vectors in extractive summarization. We find that (1) the agreement between human summarizers does not improve when a query is provided; (2) the query-independent post features as well as a centroid-based baseline outperform MMR by a large margin; (3) combining the post features with query similarity gives a small improvement over the use of post features alone; and (4) for the word embeddings, a match in domain appears to be more important than corpus size and dimensionality. However, the differences between the models were not reflected by differences in quality of the summaries created with the help of these models. We conclude that query-based summarization with web queries is challenging because the queries are short, and a click on a result is not a direct indicator of the relevance of the result.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s) 2019

Figure 1. A screenshot of three messages in a thread on the Viva forum (left) with the translation to English (right). The bottom message contains a quote of an earlier post.

Table 1. Example queries with titles of clicked threads in the Viva query log. The queries and titles have been translated to English for the reader’s convenience. “peach1990” is a forum user name.

Figure 2. A screenshot of the post-selection interface. The query (“Zoekvraag”) is at the top of the screen and stays visible when scrolling. The left column (“Volledige topic”) shows the full thread, while the right column (“Jouw selectie”) shows the rater’s selected posts (with the first post always selected). In the blue header, the category and title of the thread are given. Each cell is one post, starting with the author name and the timestamp.

Table 2. Word2vec models that we compare for the representativeness features. The Wikipedia model was pre-trained; the Viva models were trained by us. All models were trained with a window size of 11 and a minimum count of 5 for words to be included in the model.

Table 3. Post features

Table 4. Statistics of the reference set

Figure 3. Inter-rater agreement in terms of Fleiss’ κ, Cohen’s κ, Jaccard similarity coefficient, and ROUGE-2, for four thread sets: the data published in Verberne et al. (2017) (post selection without query); a subset of that data that contains only the threads that are also in the query sample; the current data with queries; and the current data for the threads with a mean relevance score of at least 3 for the given query according to the human summarizers.

Figure 4. Dispersion of the Jaccard similarity index between individual summaries for the same thread, either both without query, both with query, or one with and one without query.
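
The Jaccard similarity index used above to compare two raters' summaries of the same thread can be illustrated with a minimal sketch. It assumes that each summary is represented as a set of selected post identifiers; the function names `jaccard` and `pairwise_jaccard` and the choice to score two empty selections as 1.0 are illustrative assumptions, not taken from the article.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets of selected post ids."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty selections count as identical.
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_jaccard(selections):
    """Jaccard index for every unordered pair of summaries of one thread."""
    return [jaccard(a, b) for a, b in combinations(selections, 2)]


# Hypothetical selections by three raters for one thread.
raters = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
print(pairwise_jaccard(raters))
```

The spread of these pairwise values over all rater pairs is what a dispersion plot of this kind would summarize.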

Table 5. Evaluation of models for representativeness: 5 different representations for posts and threads (words or word embeddings trained on two different corpora) and 2 different similarity metrics (cosine similarity or 1 minus the word mover’s distance).

Table 6. Post features that are significant predictors for the number of selected votes, sorted by the absolute value of the regression coefficient β; the independent variable with the largest effect (either positive or negative) is on top of the list.

Figure 5. Precision-Recall curve for all methods on query-based post selection. For the MMR methods, λ = 1.0. The word2vec model is Word2vec_Viva320. “Oracle ranking” is the ranking of the posts based on the number of votes by the human summarizers. Iso-F1 = 0.5 denotes the curve for all combinations of Precision and Recall for which F1 = 0.5.

Figure 6. The effect of the parameter λ in MMR on the quality of the query-based summaries, in terms of F1 and ROUGE-2, at a cutoff of 9 posts. λ = 0.0 means that only the diversity component is used; λ = 1.0 means that only the query similarity component is used.
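
The role of λ can be made concrete with a minimal sketch of greedy MMR post selection in its standard formulation: each step picks the post that maximizes λ · sim(post, query) − (1 − λ) · max sim(post, already selected). The whitespace tokenization, the bag-of-words cosine similarity, and the names `cosine` and `mmr_select` are assumptions made for illustration; the article's own representations (for example word embeddings) and implementation details are not reproduced here.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mmr_select(posts, query, k, lam):
    """Greedily select up to k post indices with Maximal Marginal Relevance.

    lam = 1.0 uses only query similarity; lam = 0.0 uses only diversity.
    """
    # Assumption: lowercase whitespace tokens as a stand-in representation.
    vecs = [Counter(p.lower().split()) for p in posts]
    q = Counter(query.lower().split())
    selected = []
    candidates = list(range(len(posts)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], q)
            # Similarity to the closest already-selected post (0 if none yet).
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical thread with a duplicated post.
posts = [
    "baby sleeps badly at night",
    "baby sleeps badly at night",
    "try a fixed bedtime routine",
    "weather is nice",
]
print(mmr_select(posts, "baby sleeps night", k=2, lam=1.0))
print(mmr_select(posts, "baby sleeps night", k=2, lam=0.5))
```

With λ = 1.0 the duplicated post is selected twice over because nothing penalizes redundancy, whereas a lower λ trades query similarity for a post that differs from what is already in the summary.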

Table 7. Precision, Recall, F1, ROUGE-2 Recall and ROUGE-2 Precision scores for all methods, with a summary length of 9 posts. All scores are means over the 5 reference summaries and the thread–query combinations.
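
As a minimal sketch of how post-level Precision, Recall, and F1 of this kind can be computed, the code below scores one automatic selection against each reference summary and averages the results. It assumes that summaries are sets of post identifiers and that averaging is a plain mean over the references; the names `prf` and `mean_prf` are hypothetical, and ROUGE-2 is not covered.

```python
def prf(selected, reference):
    """Precision, recall and F1 of selected post ids against one reference."""
    selected, reference = set(selected), set(reference)
    tp = len(selected & reference)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mean_prf(selected, references):
    """Mean precision, recall and F1 over several reference summaries."""
    scores = [prf(selected, ref) for ref in references]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Hypothetical system output and two reference summaries for one thread.
system = {1, 2, 3, 4}
references = [{3, 4, 5}, {1, 2, 3, 4}]
print(mean_prf(system, references))
```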

Figure 7. Variables that are significant predictors (P < 0.05) for the post score (number of votes) in the second-level LRM for the query-dependent data, with their beta coefficients.

Figure 8. The mean F1 scores for summaries consisting of nine posts per thread–query combination, comparing PF (query-independent post features) to Combined 2 (post features with query similarity).