Assessing multilingual multimodal image description: Studies of native speaker preferences and translator choices

  • Stella Frank, Desmond Elliott and Lucia Specia

Abstract

Two studies on multilingual multimodal image description provide empirical evidence on two questions at the core of the task: (i) whether target-language speakers prefer descriptions generated directly in their native language over descriptions translated from another language, and (ii) whether images improve human translation of descriptions. The results guide future work in multimodal natural language processing: first, they show that, on the whole, translations are not distinguished from native-language descriptions; second, they delineate and quantify the information gained from the image during the human translation task.
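
The first finding rests on pairwise preference judgments: if translations were distinguishable from native-language descriptions, raters would pick the native description more often than chance. Below is a minimal sketch of that style of analysis, using made-up counts and a standard two-sided binomial test; it illustrates the kind of check implied by the abstract, not the authors' actual procedure.

    # A minimal sketch of a chance-level preference test, not the authors' analysis.
    # Under the null hypothesis that translations are indistinguishable from
    # native descriptions, raters should pick the native description at rate 0.5.
    # The counts below are hypothetical, chosen only for illustration.

    from scipy.stats import binomtest

    native_preferred = 260  # hypothetical: judgments favouring the native description
    total_judgments = 500   # hypothetical: total pairwise preference judgments

    result = binomtest(native_preferred, n=total_judgments, p=0.5,
                       alternative="two-sided")
    print(f"native-preference rate: {native_preferred / total_judgments:.2f}")
    print(f"two-sided binomial p-value: {result.pvalue:.4f}")
    # A large p-value fails to reject chance-level preference, consistent with
    # the finding that translations are not distinguished from native descriptions.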
