Article
Neural machine translation of low-resource languages using SMT phrase pair injection
- Sukanta Sen, Mohammed Hasanuzzaman, Asif Ekbal, Pushpak Bhattacharyya, Andy Way
-
- Published online by Cambridge University Press:
- 17 June 2020, pp. 271-292
-
Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and building such a corpus is not always possible because it requires time, money, and professional translators. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improve an NMT system in a low-resource scenario without using any additional data. Our approach augments the original training data with parallel phrases extracted from that same training data using a statistical machine translation (SMT) system. Our proposed approach is based on gated recurrent unit (GRU) and transformer networks. We choose Hindi–English and Hindi–Bengali datasets for the Health, Tourism, and Judicial (Hindi–English only) domains. We train our NMT models for 10 translation directions, each using only 5–23k parallel sentences. Experiments show improvements in the range of 1.38–15.36 BiLingual Evaluation Understudy (BLEU) points over the baseline systems, and that transformer models perform better than GRU models in low-resource scenarios. In addition, we find that our proposed method outperforms SMT, which is known to work better than neural models in low-resource scenarios, for some translation directions. To further show the effectiveness of our proposed model, we also apply our approach to another interesting NMT task, old-to-modern English translation, using a tiny parallel corpus of only 2.7k sentences. For this task, we use publicly available old-modern English text that is approximately 1000 years old. Evaluation on this task shows a significant improvement over the baseline NMT.
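To make the phrase pair injection idea concrete, here is a minimal sketch under stated assumptions: high-confidence phrase pairs are read from a Moses-style phrase table produced by the SMT system and appended to the parallel corpus as extra pseudo-sentence pairs. The file format, thresholds, and filtering rule are illustrative, not the paper's exact procedure.

```python
# Sketch: augment a parallel corpus with SMT-extracted phrase pairs.
def load_phrase_pairs(path, min_prob=0.5, max_len=7):
    """Yield (src, tgt) pairs from a Moses-style phrase table, whose lines
    look like 'src ||| tgt ||| p1 p2 p3 p4 ...' (format assumed here)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split(" ||| ")
            if len(fields) < 3:
                continue
            src, tgt = fields[0].strip(), fields[1].strip()
            probs = [float(p) for p in fields[2].split()]
            # Keep only short, high-confidence phrases to limit noise.
            if max(probs) >= min_prob and len(src.split()) <= max_len:
                yield src, tgt

def augment(src_sents, tgt_sents, phrase_table_path):
    """Original sentence pairs plus injected phrase pairs, for NMT training."""
    pairs = list(zip(src_sents, tgt_sents))
    pairs.extend(load_phrase_pairs(phrase_table_path))
    return pairs
```

The augmented pair list would then be fed to the GRU or transformer training pipeline exactly like ordinary sentence pairs.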
Survey Paper
A systematic review of unsupervised approaches to grammar induction
- Vigneshwaran Muralidaran, Irena Spasić, Dawn Knight
-
- Published online by Cambridge University Press:
- 27 October 2020, pp. 647-689
-
This study systematically reviews existing approaches to unsupervised grammar induction in terms of their theoretical underpinnings, practical implementations, and evaluation. Our motivation is to identify the influence of functional-cognitive schools of grammar on language processing models in computational linguistics, in an effort to bridge the gap between these theoretical schools and computational models of grammar induction. Specifically, the review aims to answer the following research questions: Which types of grammar theories have been the subjects of grammar induction? Which methods have been employed to support grammar induction? Which features have been used by these methods for learning? How were these methods evaluated? Finally, in terms of performance, how do these methods compare to one another? Forty-three studies were identified for systematic review, of which 33 described original implementations of grammar induction, three provided surveys, and seven focused on theories and experiments related to the acquisition and processing of grammar in humans. The data extracted from the 33 implementations were stratified into seven aspects of analysis: theory of grammar; output representation; how grammatical productivity is processed; how grammatical productivity is represented; features used for learning; evaluation strategy; and implementation methodology. In most of the implementations considered, grammar was treated as a generative-formal system, autonomous and independent of meaning. Parser decoding was done in a non-incremental, head-driven fashion, assuming that all words are available to the parsing model, and the output representation of the grammar learnt was hierarchical, typically a dependency or constituency tree. However, the theoretical and experimental studies considered suggest that a usage-based, incremental, sequential system of grammar is more appropriate than the formal, non-incremental, hierarchical view. This gap between the theoretical and experimental studies on the one hand and the computational implementations on the other should be addressed to enable further progress in computational grammar induction research.
Article
Text classification with semantically enriched word embeddings
- N. Pittaras, G. Giannakopoulos, G. Papadakis, V. Karkaletsis
-
- Published online by Cambridge University Press:
- 06 April 2020, pp. 391-425
-
The recent breakthroughs in deep neural architectures across multiple machine learning fields have led to the widespread use of deep neural models. These learners are often applied as black-box models that ignore or insufficiently utilize a wealth of preexisting semantic information. In this study, we focus on the text classification task, investigating methods for augmenting the input to deep neural networks (DNNs) with semantic information. We extract semantics for the words in the preprocessed text from the WordNet semantic graph, in the form of weighted concept terms that form a semantic frequency vector. Concepts are selected via a variety of semantic disambiguation techniques, including a basic, a part-of-speech-based, and a semantic embedding projection method. Additionally, we consider a weight propagation mechanism that exploits semantic relationships in the concept graph and conveys a spreading activation component. We enrich word2vec embeddings with the resulting semantic vector through concatenation or replacement and apply the semantically augmented word embeddings on the classification task via a DNN. Experimental results over established datasets demonstrate that our approach of semantic augmentation in the input space boosts classification performance significantly, with concatenation offering the best performance. We also note additional interesting findings produced by our approach regarding the behavior of term frequency–inverse document frequency (TF–IDF) normalization on semantic vectors, along with the radical dimensionality reduction potential with negligible performance loss.
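As a rough illustration of the enrichment step, the sketch below builds a concept frequency vector using the basic disambiguation strategy (first WordNet synset per word, via NLTK) and concatenates it with an averaged word2vec representation. The concept inventory, normalisation, and averaging are simplifying assumptions rather than the exact pipeline.

```python
import numpy as np
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data

def semantic_vector(tokens, concept_index):
    """Concept-frequency vector: count each token's first synset (basic strategy)."""
    vec = np.zeros(len(concept_index))
    for tok in tokens:
        synsets = wn.synsets(tok)
        if synsets and synsets[0].name() in concept_index:
            vec[concept_index[synsets[0].name()]] += 1.0
    return vec / max(vec.sum(), 1.0)  # frequency-normalise

def enrich(tokens, w2v, concept_index):
    """Concatenate the averaged word embedding with the semantic frequency vector.
    Assumes at least one token is in the word2vec vocabulary."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.concatenate([np.mean(vecs, axis=0),
                           semantic_vector(tokens, concept_index)])
```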
Mining, analyzing, and modeling text written on mobile devices
- K. Vertanen, P.O. Kristensson
-
- Published online by Cambridge University Press:
- 10 October 2019, pp. 1-33
-
We present a method for mining the web for text entered on mobile devices. Using searching, crawling, and parsing techniques, we locate text that can be reliably identified as originating from 300 mobile devices. This includes 341,000 sentences written on iPhones alone. Our data enable a richer understanding of how users type “in the wild” on their mobile devices. We compare the text and error characteristics of different device types, such as touchscreen phones, phones with physical keyboards, and tablet computers. Using our mined data, we train language models and evaluate these models on mobile test data. A mixture model trained on our mined data and on Twitter, blog, and forum data predicts mobile text better than baseline models. Using phone and smartwatch typing data from 135 users, we demonstrate that our models improve the recognition accuracy and word predictions of a state-of-the-art touchscreen virtual keyboard decoder. Finally, we make our language models and mined dataset available to other researchers.
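The mixture model idea can be sketched in a few lines: each source gets its own language model, and their probabilities are linearly interpolated with weights tuned on held-out mobile text. The unigram model and the weights below are toy placeholders.

```python
from collections import Counter

class UnigramLM:
    def __init__(self, tokens):
        self.counts = Counter(tokens)
        self.total = sum(self.counts.values())

    def prob(self, word):
        # Add-one smoothing over the observed vocabulary (a toy choice).
        return (self.counts[word] + 1) / (self.total + len(self.counts))

def mixture_prob(word, models, weights):
    """Linear interpolation of per-source models (mined, Twitter, blog, forum)."""
    return sum(w * m.prob(word) for m, w in zip(models, weights))

# The weights would be tuned (e.g. via EM) on held-out mobile text; these are
# illustrative: mixture_prob("ok", [mined, twitter, blog, forum], [.4, .3, .2, .1])
```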
Introduction
Processing negation: An introduction to the special issue
- Eduardo Blanco, Roser Morante
-
- Published online by Cambridge University Press:
- 14 December 2020, pp. 119-120
Article
Temporally anchored spatial knowledge: Corpora and experiments
- Alakananda Vempala, Eduardo Blanco
-
- Published online by Cambridge University Press:
- 20 May 2020, pp. 519-543
-
This article presents a two-step methodology to annotate temporally anchored spatial knowledge on top of OntoNotes. We first generate potential knowledge using semantic roles or syntactic dependencies and then crowdsource annotations to validate the potential knowledge. The resulting annotations indicate how long entities are or are not located somewhere and temporally anchor this spatial information. We present an in-depth corpus analysis comparing the spatial knowledge generated by manipulating roles or dependencies. Experiments show that working with syntactic dependencies instead of semantic roles allows us to generate more potential entity-related spatial knowledge and obtain better results in a realistic scenario, that is, with predicted linguistic information.
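To illustrate the first step, here is a minimal sketch of generating potential spatial knowledge from syntactic dependencies with spaCy: candidate (entity, preposition, place) triples are read off prepositional attachments and would then go to crowd workers for validation. The dependency patterns are simplified assumptions, not the paper's exact rules.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_locations(text):
    """Rough (entity, 'in/at/on', place) triples from prepositional attachments."""
    doc = nlp(text)
    triples = []
    for tok in doc:
        if tok.dep_ == "pobj" and tok.head.text.lower() in {"in", "at", "on"}:
            verb = tok.head.head  # the verb the preposition attaches to
            for subj in (c for c in verb.children
                         if c.dep_ in {"nsubj", "nsubjpass"}):
                triples.append((subj.text, tok.head.text.lower(), tok.text))
    return triples

# candidate_locations("John stayed in Paris last week.")
#   -> [("John", "in", "Paris")]  (temporal anchoring would use "last week")
```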
Nonuniform language in technical writing: Detection and correction
- Weibo Wang, Aminul Islam, Abidalrahman Moh’d, Axel J. Soto, Evangelos E. Milios
-
- Published online by Cambridge University Press:
- 06 March 2020, pp. 293-314
-
Technical writing in professional environments, such as user manual authoring, requires the use of uniform language. Nonuniform language refers to sentences in a technical document that are intended to have the same meaning within a similar context but use different words or writing styles. Addressing this nonuniformity problem requires two tasks. The first task, which we named nonuniform language detection (NLD), is detecting such sentences. We propose an NLD method that utilizes different similarity algorithms at the lexical, syntactic, semantic, and pragmatic levels. Different features are extracted and integrated by applying a machine learning classification method. The second task, which we named nonuniform language correction (NLC), is deciding which sentence among the detected ones is more appropriate for that context. To address this problem, we propose an NLC method that combines contraction removal, near-synonym choice, and text readability comparison. We tested our methods using smartphone user manuals. Finally, we compared our methods against state-of-the-art methods in paraphrase detection (for NLD) and against expert annotators (for both NLD and NLC). The experiments demonstrate that the proposed methods achieve performance that matches that of expert annotators.
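An illustrative sketch of the NLD step: pairwise similarity features feed a standard classifier. Here only a lexical-level feature (token Jaccard similarity) and a length feature stand in for the paper's richer lexical, syntactic, semantic, and pragmatic combination.

```python
from sklearn.linear_model import LogisticRegression

def jaccard(a, b):
    """Lexical-level similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def features(s1, s2):
    # A real system would add syntactic, semantic, and pragmatic similarities.
    return [jaccard(s1, s2), abs(len(s1.split()) - len(s2.split()))]

def train_nld(pairs, labels):
    """pairs: (sentence1, sentence2) tuples; labels: 1 = nonuniform pair."""
    X = [features(s1, s2) for s1, s2 in pairs]
    return LogisticRegression().fit(X, labels)
```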
Survey Paper
Recent advances in processing negation
- Roser Morante, Eduardo Blanco
-
- Published online by Cambridge University Press:
- 17 December 2020, pp. 121-130
-
Negation is a complex linguistic phenomenon present in all human languages. It can be seen as an operator that transforms an expression into another expression whose meaning is in some way opposed to the original. In this article, we survey previous work on negation with an emphasis on computational approaches. We start by defining negation and two important concepts: the scope and focus of negation. Then, we survey work in natural language processing that considers negation primarily as a means to improve results in some task. We also provide information about corpora containing negation annotations in English and other languages, which usually include a combination of annotations of negation cues, scopes, foci, and negated events. We continue the survey with a description of automated approaches to processing negation, ranging from early rule-based systems to systems built with traditional machine learning and neural networks. Finally, we conclude with some reflections on current progress and future directions.
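To make the cue/scope terminology concrete, here is a toy rule-based sketch in the spirit of the early systems the survey covers: the cue is a known negation word, and the scope is naively taken to run from the cue to the next punctuation mark. Real systems are considerably more elaborate.

```python
CUES = {"not", "no", "never", "n't", "without"}

def naive_cue_and_scope(tokens):
    """Return (cue_index, (scope_start, scope_end)) for the first cue, else None."""
    for i, tok in enumerate(tokens):
        if tok.lower() in CUES:
            end = next((j for j in range(i + 1, len(tokens))
                        if tokens[j] in {".", ",", ";"}), len(tokens))
            return i, (i + 1, end)
    return None

# naive_cue_and_scope("He did not attend the meeting .".split())
#   -> (2, (3, 6)); the focus would be a subspan of this scope.
```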
Article
Towards syntax-aware token embeddings
- Diana Nicoleta Popa, Julien Perez, James Henderson, Eric Gaussier
-
- Published online by Cambridge University Press:
- 08 July 2020, pp. 691-720
-
Distributional semantic word representations form the basis of most modern NLP systems. Their usefulness has been proven across various tasks, particularly as inputs to deep learning models. Beyond that, much work has investigated fine-tuning generic word embeddings to leverage linguistic knowledge from large lexical resources. Some work has investigated context-dependent word token embeddings motivated by word sense disambiguation, using sequential context and large lexical resources. More recently, acknowledging the need for an in-context representation of words, some work has leveraged information derived from language modelling and large amounts of data to induce contextualised representations. In this paper, we investigate Syntax-Aware word Token Embeddings (SATokE) as a way to explicitly encode, in the vectors input to a deep learning model, specific information derived from the linguistic analysis of a sentence. We propose an efficient unsupervised learning algorithm based on tensor factorisation for computing these token embeddings given an arbitrary graph of linguistic structure. Applying this method to syntactic dependency structures, we investigate the usefulness of such token representations as part of deep learning models of text understanding. We encode a sentence either by learning embeddings for its tokens and the relations between them from scratch or by leveraging pre-trained relation embeddings to infer token representations. Given sufficient data, the former is slightly more accurate than the latter, yet both provide more informative token embeddings than standard word representations, even when the word representations have been learned on the same type of context from larger corpora (namely, pre-trained dependency-based word embeddings). We use a large set of supervised tasks and two major deep learning families of models for sentence understanding to evaluate our proposal. We empirically demonstrate the superiority of the token representations compared to popular distributional representations of words for various sentence and sentence pair classification tasks.
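A bare-bones sketch of the tensor factorisation idea under simplifying assumptions: observed dependency arcs define target entries of a graph tensor, and token vectors plus relation matrices are fitted by gradient descent so that e_i^T R_r e_j scores observed arcs highly. Dimensions, loss, and optimisation are illustrative (negative examples, in particular, are omitted), not the paper's exact formulation.

```python
import numpy as np

def fit_token_embeddings(n_tokens, n_rels, arcs, dim=16, lr=0.1, epochs=200):
    """arcs: (head, relation, dependent) triples from a sentence's dependency graph."""
    rng = np.random.default_rng(0)
    E = rng.normal(scale=0.1, size=(n_tokens, dim))      # token embeddings
    R = rng.normal(scale=0.1, size=(n_rels, dim, dim))   # relation matrices
    for _ in range(epochs):
        for i, r, j in arcs:
            g = E[i] @ R[r] @ E[j] - 1.0   # squared-loss gradient factor, target 1
            e_i, e_j = E[i].copy(), E[j].copy()
            E[i] -= lr * g * (R[r] @ e_j)
            E[j] -= lr * g * (R[r].T @ e_i)
            R[r] -= lr * g * np.outer(e_i, e_j)
    return E, R   # rows of E are the token embeddings fed to the downstream model
```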
Transfer learning for Turkish named entity recognition on noisy text
- Emre Kağan Akkaya, Burcu Can
-
- Published online by Cambridge University Press:
- 28 January 2020, pp. 35-64
-
In this article, we investigate using deep neural networks with different word representation techniques for named entity recognition (NER) on Turkish noisy text. We argue that valuable latent features for NER can, in fact, be learned without using any hand-crafted features and/or domain-specific resources such as gazetteers and lexicons. In this regard, we utilize character-level, character n-gram-level, morpheme-level, and orthographic character-level word representations. Since noisy data with NER annotation are scarce for Turkish, we introduce a transfer learning model, an extension to the Bi-LSTM-CRF architecture that learns infrequent entity types by incorporating an additional conditional random field (CRF) layer trained on a larger (but formal) text and on noisy text simultaneously. This allows us to learn from both formal and informal/noisy text, further improving the performance of our model for rarely seen entity types. We experimented on Turkish as a morphologically rich language and English as a relatively morphologically poor language. We obtained an entity-level F1 score of 67.39% on Turkish noisy data and 45.30% on English noisy data, which outperforms the current state-of-the-art models on noisy text. The English scores are lower than the Turkish scores because of the intense sparsity introduced into the data by users' writing styles. The results prove that using subword information contributes significantly to learning latent features for morphologically rich languages.
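A condensed sketch of the transfer setup, with simplifications: a shared BiLSTM encoder feeds two domain-specific projection-plus-CRF output layers, and training alternates between formal and noisy batches. It relies on the third-party pytorch-crf package; the sizes are placeholders, and the architecture omits the paper's subword representations.

```python
import torch.nn as nn
from torchcrf import CRF  # third-party package: pytorch-crf

class TransferNER(nn.Module):
    def __init__(self, vocab_size, emb=100, hid=128, n_tags=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hid, bidirectional=True, batch_first=True)
        self.proj = nn.ModuleDict({d: nn.Linear(2 * hid, n_tags)
                                   for d in ("formal", "noisy")})
        self.crf = nn.ModuleDict({d: CRF(n_tags, batch_first=True)
                                  for d in ("formal", "noisy")})

    def loss(self, x, tags, domain):
        h, _ = self.lstm(self.embed(x))       # encoder shared across domains
        emissions = self.proj[domain](h)      # domain-specific projection
        return -self.crf[domain](emissions, tags)  # domain-specific CRF layer

# Training alternates model.loss(xf, yf, "formal") and model.loss(xn, yn, "noisy"),
# so the shared encoder learns from both formal and noisy text.
```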
Universal Lemmatizer: A sequence-to-sequence model for lemmatizing Universal Dependencies treebanks
- Jenna Kanerva, Filip Ginter, Tapio Salakoski
-
- Published online by Cambridge University Press:
- 27 May 2020, pp. 545-574
-
In this paper, we present a novel lemmatization method based on a sequence-to-sequence neural network architecture and morphosyntactic context representation. In the proposed method, our context-sensitive lemmatizer generates the lemma one character at a time based on the surface form characters and its morphosyntactic features obtained from a morphological tagger. We argue that a sliding window context representation suffers from sparseness, while in the majority of cases the morphosyntactic features of a word provide enough information to resolve lemma ambiguities while keeping the context representation dense and more practical for machine learning systems. Additionally, we study two different data augmentation methods utilizing autoencoder training and morphological transducers, which are especially beneficial for low-resource languages. We evaluate our lemmatizer on 52 different languages and 76 different treebanks, showing that our system outperforms all of the latest baseline systems. Compared to the best overall baseline, UDPipe Future, our system outperforms it on 62 out of 76 treebanks, reducing errors on average by 19% relative. The lemmatizer together with all trained models is made available as a part of the Turku-neural-parsing-pipeline under the Apache 2.0 license.
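A small sketch of the input encoding idea: the source sequence for the character-level sequence-to-sequence model is the word's characters followed by its morphosyntactic feature tokens, so lemma ambiguities can be resolved without a sliding window of context words. Tag names follow Universal Dependencies; the exact serialisation is an assumption.

```python
def encode_input(word, upos, feats):
    """Surface characters plus morphosyntactic features as extra input tokens."""
    return list(word) + [f"UPOS={upos}"] + \
           [f"{k}={v}" for k, v in sorted(feats.items())]

# The decoder emits the lemma character by character; the features disambiguate:
# encode_input("saw", "VERB", {"Tense": "Past"})
#   -> ['s', 'a', 'w', 'UPOS=VERB', 'Tense=Past']   (target lemma: "see")
# encode_input("saw", "NOUN", {"Number": "Sing"})
#   -> ['s', 'a', 'w', 'UPOS=NOUN', 'Number=Sing']  (target lemma: "saw")
```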
Is your document novel? Let attention guide you. An attention-based model for document-level novelty detection
- Tirthankar Ghosal, Vignesh Edithal, Asif Ekbal, Pushpak Bhattacharyya, Srinivasa Satya Sameer Kumar Chivukula, George Tsatsaronis
-
- Published online by Cambridge University Press:
- 24 April 2020, pp. 427-454
-
Detecting whether a document contains sufficient new information to be deemed novel is of immense significance in this age of data duplication. Existing techniques for document-level novelty detection mostly operate at the lexical level and are unable to address semantic-level redundancy. These techniques usually rely on handcrafted features extracted from the documents in a rule-based or traditional feature-based machine learning setup. Here, we present an effective approach based on a neural attention mechanism to detect document-level novelty without any manual feature engineering. We contend that the simple alignment of texts between the source and target document(s) can identify the state of novelty of a target document. Our deep neural architecture elicits inference knowledge from a large-scale natural language inference dataset, which proves crucial to the novelty detection task. Our approach is effective and outperforms the standard baselines and recent work on document-level novelty detection by a margin of approximately 3% in terms of accuracy.
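A simplified sketch of the alignment intuition: each target-document sentence is aligned to its best-matching source sentence, and a low average best-match similarity suggests novelty. Cosine similarity over precomputed sentence embeddings stands in for the paper's learned attention and inference components, and the threshold is illustrative.

```python
import numpy as np

def novelty_score(target_vecs, source_vecs):
    """target_vecs/source_vecs: one row of sentence embedding per sentence."""
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    s = source_vecs / np.linalg.norm(source_vecs, axis=1, keepdims=True)
    best = (t @ s.T).max(axis=1)   # best source alignment per target sentence
    return 1.0 - best.mean()       # high when little aligns -> likely novel

# is_novel = novelty_score(tgt_embs, src_embs) > 0.5  # threshold tuned on data
```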
Focus of negation: Its identification in Spanish
- Mariona Taulé, Montserrat Nofre, Mónica González, Maria Antònia Martí
-
- Published online by Cambridge University Press:
- 08 July 2020, pp. 131-152
-
This article describes the criteria for identifying the focus of negation in Spanish. This work involved an in-depth linguistic analysis of the focus of negation through which we identified some 10 different types of criteria that account for a wide variety of constructions containing negation. These criteria account for all the cases that appear in the NewsCom corpus and were assessed in the annotation of this corpus. The NewsCom corpus consists of 2955 comments posted in response to 18 different news articles from online newspapers. The NewsCom corpus contains 2965 negative structures with their corresponding negation marker, scope, and focus. This is the first corpus annotated with focus in Spanish and it is freely available. It is a valuable resource that can be used both for the training and evaluation of systems that aim to automatically detect the scope and focus of negation and for the linguistic analysis of negation grounded in real data.
Constrained BERT BiLSTM CRF for understanding multi-sentence entity-seeking questions
- Danish Contractor, Barun Patra, Mausam, Parag Singla
-
- Published online by Cambridge University Press:
- 13 February 2020, pp. 65-87
-
We present the novel task of understanding multi-sentence entity-seeking questions (MSEQs), that is, questions that may be expressed in multiple sentences and that expect one or more entities as an answer. We formulate the problem of understanding MSEQs as a semantic labeling task over an open representation that makes minimal assumptions about schema- or ontology-specific semantic vocabulary. At the core of our model, we use a BiLSTM (bidirectional LSTM) conditional random field (CRF), and to overcome the challenges of operating with low training data, we supplement it with BERT embeddings, hand-designed features, and hard and soft constraints spanning multiple sentences. We find that this results in a gain of 12–15 points over a vanilla BiLSTM CRF. We demonstrate the strengths of our work using the novel task of answering real-world entity-seeking questions from the tourism domain. The use of our labels helps answer 36% more questions with 35% more (relative) accuracy compared to baselines. We also demonstrate how our framework can rapidly enable the parsing of MSEQs in an entirely new domain with small amounts of training data and little change in the semantic representation.
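One kind of hard constraint can be sketched independently of the model: invalid tag transitions are masked out before Viterbi decoding, e.g. forbidding an I-X tag that does not follow B-X or I-X in a BIO scheme. The masking interface below is a simplification; how the mask combines with a CRF's learned transition scores depends on the implementation.

```python
import numpy as np

def transition_mask(tags):
    """0/-inf mask added to CRF transition scores: disallow I-X unless it
    follows B-X or I-X (BIO scheme)."""
    n = len(tags)
    mask = np.zeros((n, n))
    for i, prev in enumerate(tags):
        for j, cur in enumerate(tags):
            if cur.startswith("I-") and prev not in {"B-" + cur[2:], cur}:
                mask[i, j] = -np.inf
    return mask

# tags = ["O", "B-LOC", "I-LOC", "B-TIME", "I-TIME"]
# constrained = learned_transitions + transition_mask(tags)  # hard constraint
```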
Computational generation of slogans
- Khalid Alnajjar, Hannu Toivonen
-
- Published online by Cambridge University Press:
- 03 June 2020, pp. 575-607
-
In advertising, slogans are used to enhance the recall of the advertised product by consumers and to distinguish it from others in the market. Creating effective slogans is a resource-consuming task for humans. In this paper, we describe a novel method for automatically generating slogans, given a target concept (e.g., car) and an adjectival property to express (e.g., elegant) as input. Additionally, a key component in our approach is a novel method for generating nominal metaphors, using a metaphor interpretation model, to allow generating metaphorical slogans. The method for generating slogans extracts skeletons from existing slogans. It then fills a skeleton in with suitable words by utilizing multiple linguistic resources (such as a repository of grammatical relations, and semantic and language models) and genetic algorithms to optimize multiple objectives such as semantic relatedness, language correctness, and usage of rhetorical devices. We evaluate the metaphor and slogan generation methods by running crowdsourced surveys. On a five-point Likert scale, we ask online judges to evaluate whether the generated metaphors, along with three other metaphors generated using different methods, highlight the intended property. The slogan generation method is evaluated by asking crowdsourced judges to rate generated slogans from five perspectives: (1) how well is the slogan related to the topic, (2) how correct is the language of the slogan, (3) how metaphoric is the slogan, (4) how catchy, attractive, and memorable is it, and (5) how good is the slogan overall. Similarly, we evaluate existing expert-made slogans. Based on the evaluations, we analyze the method and provide insights regarding existing slogans. The empirical results indicate that our metaphor generation method is capable of producing apt metaphors. Regarding the slogan generator, the results suggest that the method has successfully produced at least one effective slogan for every evaluated input.
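A toy sketch of the skeleton-filling search: a genetic algorithm fills and mutates the open slots of a slogan skeleton, keeping candidates that score best on a combined objective. The two fitness terms below (topical overlap and alliteration) are trivial stand-ins for the paper's semantic-relatedness, language-correctness, and rhetorical-device scores.

```python
import random

def fill(skeleton, choices):
    """skeleton: fixed words with None for open slots, e.g. ['The', None, 'of', None];
    choices: candidate filler words."""
    return [w if w is not None else random.choice(choices) for w in skeleton]

def mutate(slogan, skeleton, choices):
    s = list(slogan)
    slots = [i for i, w in enumerate(skeleton) if w is None]
    s[random.choice(slots)] = random.choice(choices)
    return s

def fitness(slogan, related_words):
    topical = sum(w in related_words for w in slogan)              # stand-in objective
    allit = sum(a[0] == b[0] for a, b in zip(slogan, slogan[1:]))  # rhetorical device
    return topical + 0.5 * allit

def evolve(skeleton, choices, related_words, pop=50, gens=100):
    population = [fill(skeleton, choices) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda s: fitness(s, related_words), reverse=True)
        survivors = population[: pop // 2]
        population = survivors + [mutate(random.choice(survivors), skeleton, choices)
                                  for _ in survivors]
    return population[0]
```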
Syntax-ignorant N-gram embeddings for dialectal Arabic sentiment analysis
- Hala Mulki, Hatem Haddad, Mourad Gridach, Ismail Babaoğlu
-
- Published online by Cambridge University Press:
- 16 March 2020, pp. 315-338
-
Arabic sentiment analysis models have recently employed compositional paragraph or sentence embedding features to represent informal Arabic dialectal content. These embeddings are mostly composed via ordered, syntax-aware composition functions and learned within deep neural network architectures. Given the differences in syntactic structure and word order among the Arabic dialects, a sentiment analysis system developed for one dialect might not be efficient for the others. Here we present syntax-ignorant, sentiment-specific n-gram embeddings for sentiment analysis of several Arabic dialects. The novelty of the proposed model is illustrated through its features and architecture. In the proposed model, the sentiment is expressed by embeddings composed via an unordered additive composition function and learned within a shallow neural architecture. To evaluate the generated embeddings, they were compared with state-of-the-art word/paragraph embeddings. This involved investigating their efficiency as expressive sentiment features, based on the visualisation maps constructed for our n-gram embeddings and word2vec/doc2vec. In addition, using several Eastern/Western Arabic datasets of single-dialect and multi-dialectal content, the ability of our embeddings to recognise sentiment was investigated against word/paragraph embeddings-based models. This comparison was performed within both shallow and deep neural network architectures, with two unordered composition functions employed. The results revealed that the introduced syntax-ignorant embeddings can represent single dialects and combinations of different dialects efficiently: our shallow sentiment analysis model trained with the proposed n-gram embeddings outperformed the word2vec/doc2vec models and rivalled deep neural architectures while consuming remarkably less training time.
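The unordered additive composition at the heart of the model can be sketched in a few lines: a text's representation is the mean of its n-gram embeddings, so dialect-specific word order has essentially no effect on the result. The embedding table and dimensionality are placeholders.

```python
import numpy as np

def ngrams(tokens, n_max=3):
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def compose(tokens, emb, dim=100):
    """Unordered additive composition: reordering tokens leaves the result
    unchanged (up to which higher-order n-grams exist)."""
    vecs = [emb[g] for g in ngrams(tokens) if g in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```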
Imparting interpretability to word embeddings while preserving semantic structure
- Lütfi Kerem Şenel, İhsan Utlu, Furkan Şahinuç, Haldun M. Ozaktas, Aykut Koç
-
- Published online by Cambridge University Press:
- 09 June 2020, pp. 721-746
-
As a ubiquitous method in natural language processing, word embeddings are extensively employed to map semantic properties of words into a dense vector representation. They capture semantic and syntactic relations among words, but the vectors corresponding to the words are only meaningful relative to each other; neither the vector nor its dimensions have any absolute, interpretable meaning. We introduce an additive modification to the objective function of the embedding learning algorithm that encourages the embedding vectors of words that are semantically related to a predefined concept to take larger values along a specified dimension, while leaving the original semantic learning mechanism mostly unaffected. In other words, we align words that are already determined to be related along predefined concepts. In this way, we impart interpretability to the word embedding by assigning meaning to its vector dimensions. The predefined concepts are derived from an external lexical resource, which in this paper is chosen to be Roget's Thesaurus. We observe that alignment along the chosen concepts is not limited to words in the thesaurus and extends to other related words as well. We quantify the extent of interpretability and the assignment of meaning from our experimental results. Manual human evaluation results are also presented to further verify that the proposed method increases interpretability. We also demonstrate the preservation of semantic coherence of the resulting vector space using word-analogy and word-similarity tests and a downstream task. These tests show that the interpretability-imparted word embeddings obtained by the proposed framework do not sacrifice performance on common benchmark tests.
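The additive modification can be sketched schematically: on top of the standard embedding loss, words listed under a concept are nudged toward larger values on that concept's assigned dimension. Written here as a PyTorch-style regulariser; the weighting and the underlying embedding objective are placeholders.

```python
import torch

def interpretability_term(weights, concept_word_idx, concept_dim, strength=0.1):
    """weights: the embedding matrix (V x D); concept_word_idx: LongTensor of
    indices of words under one concept. Minimising this term pushes those
    words' values up along the concept's dedicated dimension."""
    return -strength * weights[concept_word_idx, concept_dim].mean()

# total_loss = embedding_loss + sum(
#     interpretability_term(model.embed.weight, idxs, dim_k)
#     for dim_k, idxs in enumerate(concept_index_lists))  # concepts from a thesaurus
```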
Sentiment analysis in Turkish: Supervised, semi-supervised, and unsupervised techniques
- Cem Rıfkı Aydın, Tunga Güngör
-
- Published online by Cambridge University Press:
- 17 April 2020, pp. 455-483
-
Although many studies on sentiment analysis have been carried out for widely spoken languages, this topic is still immature for Turkish. Most of the works in this language focus on supervised models, which necessitate comprehensive annotated corpora. There are a few unsupervised methods, and they utilize sentiment lexicons either built by translating from English lexicons or created from corpora. This results in improper word polarities, as the language and domain characteristics are ignored. In this paper, we develop unsupervised (domain-independent) and semi-supervised (domain-specific) methods for Turkish that are based on a set of antonym word pairs as seeds. We make a comprehensive analysis of supervised methods under several feature weighting schemes. We then form an ensemble of supervised classifiers and also combine the unsupervised and supervised methods. Since Turkish is an agglutinative language, we perform morphological analysis and use different word forms. The methods developed were tested on two Turkish datasets with different styles, and also on datasets in English to show the portability of the approaches across languages. We observed that the combination of the unsupervised and supervised approaches outperforms the other methods, and we obtained a significant improvement over the state-of-the-art results for both Turkish and English.
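The antonym-seed idea behind the unsupervised method can be sketched as follows: a word's polarity is scored by its embedding similarity to the positive versus the negative member of each seed pair. The seeds and the similarity source are illustrative.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def polarity(word, emb, seed_pairs):
    """seed_pairs: [(positive, negative), ...] antonym seeds,
    e.g. [('iyi', 'kötü')] ('good', 'bad' in Turkish)."""
    v = emb[word]
    score = sum(cosine(v, emb[p]) - cosine(v, emb[n]) for p, n in seed_pairs)
    return score / len(seed_pairs)   # > 0 leans positive, < 0 leans negative
```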
Linguistic knowledge-based vocabularies for Neural Machine Translation
- Noe Casas, Marta R. Costa-jussà, José A. R. Fonollosa, Juan A. Alonso, Ramón Fanlo
-
- Published online by Cambridge University Press:
- 02 July 2020, pp. 485-506
-
Neural networks applied to machine translation need a finite vocabulary to express textual information as a sequence of discrete tokens. The currently dominant subword vocabularies exploit statistically discovered common parts of words to achieve the flexibility of character-based vocabularies without delegating the whole learning of word formation to the neural network. However, they trade this for the inability to apply word-level token associations, which limits their use in semantically rich areas, prevents some transfer learning approaches (e.g., cross-lingual pretrained embeddings), and reduces their interpretability. In this work, we propose new hybrid, linguistically grounded vocabulary definition strategies that keep both the advantages of subword vocabularies and word-level associations, enabling neural networks to profit from the derived benefits. We test the proposed approaches on both morphologically rich and morphologically poor languages, showing that, for the former, the quality of the translation of out-of-domain texts improves with respect to a strong subword baseline.
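One hybrid, linguistically grounded strategy can be sketched as a tokenizer that emits each word's lemma plus morphological tag tokens: the lemma vocabulary stays word-level (preserving token associations), while morphology is carried by a small closed set of tag tokens. The analyzer interface is hypothetical, and the paper's concrete strategies differ in detail.

```python
def linguistic_tokenize(words, analyze):
    """analyze(word) -> (lemma, [morph tags]) is an assumed interface to a
    morphological analyzer, e.g. analyze('corriendo') -> ('correr', ['VERB', 'Ger'])."""
    tokens = []
    for w in words:
        lemma, tags = analyze(w)
        tokens.append(lemma)                     # word-level (lemma) token
        tokens.extend(f"<{t}>" for t in tags)    # morphology as special tokens
    return tokens

# linguistic_tokenize(["corriendo"], analyze) -> ["correr", "<VERB>", "<Ger>"]
```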
Investigating translated Chinese and its variants using machine learning
- Hai Hu, Sandra Kübler
-
- Published online by Cambridge University Press:
- 03 April 2020, pp. 339-372
-
Translations are generally assumed to share universal features that distinguish them from texts originally written in the same language. Thus, we can argue that these translations constitute their own variety of a language, often called translationese. However, translations are also influenced by their source languages and thus show different characteristics depending on the source language. Consequently, we argue that these variants constitute different “dialects” of translations into the same target language. Studies using machine learning techniques on Indo-European languages have investigated the universal characteristics of translationese and how translations from various source languages differ. However, for typologically very different languages such as Chinese, there are only a few corpus studies that tap into the intricate relations between translations and their originals, as well as the relations among translations themselves. In this contribution, we investigate the following questions: (1) What are the characteristics of Chinese translationese, both in general and with respect to different source languages? (2) Can we find differences not only at the lexical but also at the syntactic level? And (3) based on the characteristics found in answering the previous questions, which of the proposed laws and universals can we corroborate based on our evidence from Chinese? We use machine learning to operationalize determining the importance of different characteristics and compare their importance for our Chinese dataset with characteristics previously reported in studies on English. In addition, our methodology allows us to add syntactic features, which have rarely been used to study translations into Chinese. Our results show that Chinese translations as a whole can be reliably distinguished from non-translations, even based on only five features. More interestingly, typological traces of the source languages can often be found in their translations, creating what we call dialects of translationese. For instance, translations from two Altaic languages exhibit more noun repetition and less frequent use of pronouns. Additionally, some characteristics that are not discriminative for English work well for Chinese, possibly because the distance between Chinese and the source languages is greater than in the English studies.
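The operationalisation can be illustrated with a small sketch: a handful of lexical indicators (pronoun rate, repetition) become features for a classifier, and the fitted model's feature importances show which characteristics separate translations from originals. The features and model are simplified stand-ins for the paper's feature set.

```python
from sklearn.ensemble import RandomForestClassifier

PRONOUNS = {"他", "她", "它", "我", "你", "我们", "他们"}  # common Chinese pronouns

def doc_features(tokens):
    n = max(len(tokens), 1)
    pronoun_rate = sum(t in PRONOUNS for t in tokens) / n
    type_token_ratio = len(set(tokens)) / n   # low value -> more repetition
    return [pronoun_rate, type_token_ratio, n]

def train(docs, labels):
    """docs: token lists; labels: 1 = translated Chinese, 0 = original Chinese."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit([doc_features(d) for d in docs], labels)
    return clf, clf.feature_importances_  # importances rank the characteristics
```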