
Combining n-grams and deep convolutional features for language variety classification

  • Matej Martinc (a1) and Senja Pollak (a1) (a2)

Abstract

This paper presents a novel neural architecture capable of outperforming state-of-the-art systems on the task of language variety classification. The architecture is a hybrid that combines character-based convolutional neural network (CNN) features with weighted bag-of-n-grams (BON) features and can therefore leverage both character-level and document/corpus-level information. We tested the system on the Discriminating between Similar Languages (DSL) language variety benchmark data set from the VarDial 2017 DSL shared task, which contains data from six different language groups, as well as on two smaller data sets: the Arabic Dialect Identification (ADI) Corpus and the German Dialect Identification (GDI) Corpus, from the VarDial 2016 ADI and VarDial 2018 GDI shared tasks, respectively. In terms of weighted F1 score, the system outperformed the winning system in the DSL shared task by a margin of about 0.4 percentage points and the winning system in the ADI shared task by a margin of about 0.2 percentage points, without any language group-specific parameter tweaking. An ablation study suggests that weighted BON features contribute more to the overall performance of the system than the CNN-based features, which partially explains why deep learning approaches were uncompetitive in past VarDial DSL shared tasks. Finally, we have implemented our system as a workflow in the ClowdFlows platform to make it readily available to members of the research community who do not program.
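The results above are reported as weighted F1 scores. As a minimal sketch of how such a score is typically computed, the snippet below averages per-class F1 values weighted by each class's number of gold examples. This support-weighted definition is an assumption based on common practice, not a description of the shared-task evaluation scripts; the function name `weighted_f1` and the toy labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def weighted_f1(gold, predicted):
    """Support-weighted average of per-class F1 scores (assumed definition)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    support = Counter(gold)                # gold examples per class
    predicted_counts = Counter(predicted)  # predictions per class
    true_positives = Counter(g for g, p in zip(gold, predicted) if g == p)

    total = 0.0
    for label, n_gold in support.items():
        tp = true_positives[label]
        n_pred = predicted_counts[label]
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Each class contributes in proportion to its share of gold examples.
        total += n_gold * f1
    return total / len(gold)


# Toy example with hypothetical variety labels (not data from the paper).
gold = ["bs", "bs", "hr", "hr", "sr", "sr"]
pred = ["bs", "hr", "hr", "hr", "sr", "bs"]
score = weighted_f1(gold, pred)
print(f"weighted F1 = {score:.4f}")
```

Because every class is weighted by its support, a margin of a few tenths of a percentage point on a large, balanced test set corresponds to a small number of documents changing class.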

Copyright

This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

Corresponding author

*Corresponding author. Email: matej.martinc@ijs.si
