MUSED: A multimedia multi-document dataset for topic segmentation

  • PEDRO MOTA (a1), MAXINE ESKENAZI (a2) and LUÍSA COHEUR (a3)
Abstract

Research on topic segmentation has recently focused on segmenting documents by taking advantage of other documents covering the same topics. To properly evaluate such approaches, a dataset of related documents is needed. However, existing datasets are limited in the number of related documents per domain. In addition, most of the available datasets do not consider documents from different media sources (PowerPoints, videos, etc.), which pose specific challenges to segmentation. We fill this gap with the MUltimedia SEgmentation Dataset (MUSED), a collection of manually segmented documents from different media sources, covering seven domains with an average of twenty related documents per domain. In this paper, we describe the process of building MUSED. A multi-annotator study is carried out to determine whether human judges agree and to characterize their disagreement patterns. In addition, we use MUSED to compare state-of-the-art topic segmentation techniques, including those that take advantage of related documents. Moreover, we study the impact of having documents from different media sources in the dataset. To the best of our knowledge, MUSED is the first dataset that allows a straightforward evaluation of both single- and multi-document topic segmentation techniques, as well as a study of how they behave in the presence of documents from different media sources. Results show that some techniques are indeed sensitive to different media sources, and that current multi-document segmentation models do not outperform previous models, pointing to a research line that deserves more attention.
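To make the notion of agreement among human judges concrete, the following is a minimal sketch of one common way to quantify it for segmentation: Cohen's kappa computed over per-sentence boundary decisions. The abstract does not state which agreement coefficient the study uses, so the choice of kappa, the function name `cohen_kappa`, and the two annotation sequences are assumptions made purely for illustration.

```python
from typing import Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two annotators' binary boundary decisions.

    Each sequence holds one label per candidate boundary position:
    1 if the annotator placed a topic boundary there, 0 otherwise.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of positions with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, from each annotator's marginal boundary rate.
    rate_a = sum(a) / n
    rate_b = sum(b) / n
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_e == 1.0:
        # Both annotators used the same single label everywhere.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations (not taken from MUSED): 1 marks a topic
# boundary after the corresponding sentence.
annotator_1 = [0, 0, 1, 0, 0, 1, 0, 0]
annotator_2 = [0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this example the second boundary is placed one sentence apart by the two annotators, and a strict position-by-position coefficient counts that near miss as two full disagreements. This is one reason segmentation studies often complement such coefficients with measures that tolerate small boundary offsets.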

Footnotes

*This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013, also under projects LAW-TRAIN (H2020-EU.3.7, contract 653587), and through the Carnegie Mellon Portugal Program under Grant SFRH/BD/51917/2012.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering