Deception detection in text and its relation to the cultural dimension of individualism/collectivism

Deception detection is a task with many applications both in direct physical and in computer-mediated communication. Our focus is on automatic deception detection in text across cultures. We view culture through the prism of the individualism/collectivism dimension and we approximate culture by using country as a proxy. Having as a starting point recent conclusions drawn from the social psychology discipline, we explore whether differences in the usage of specific linguistic features of deception across cultures can be confirmed and attributed to norms in respect to the individualism/collectivism divide. We also investigate whether a universal feature set for cross-cultural text deception detection tasks exists. We evaluate the predictive power of different feature sets and approaches. We create culture/language-aware classifiers by experimenting with a wide range of n-gram features based on phonology, morphology and syntax, other linguistic cues like word and phoneme counts, pronoun use, etc., and token embeddings. We conducted our experiments over 11 datasets from five languages (English, Dutch, Russian, Spanish and Romanian) and six countries (US, Belgium, India, Russia, Mexico and Romania), and we applied two classification methods, i.e., logistic regression and fine-tuned BERT models. The results showed that our task is fairly complex and demanding. There are indications that some linguistic cues of deception have cultural origins, and are consistent in the context of diverse domains and dataset settings for the same language. This is more evident for the usage of pronouns and the expression of sentiment in deceptive language. The results of this work show that automatic deception detection across cultures and languages cannot be handled in a unified manner, and that such approaches should be augmented with knowledge about cultural differences and the domains of interest.


Introduction
Automated deception detection builds on years of research in interpersonal psychology, philosophy, sociology, communication studies and computational models of deception detection [Vrij, 2008a, Granhag et al., 2014]. Textual data of any form, such as consumer reviews, news articles, social media comments, political speeches, witnesses' reports, etc., are currently in the spotlight of deception research [Granhag et al., 2014]. What contributed to this vivid interest is the enormous production of textual data and the advances in computational linguistics. In many cases, text is either the only available source for extracting deception cues, or the most affordable and least intrusive one, compared to approaches based on Magnetic Resonance Imaging (MRI) [Lauterbur, 1973] and Electrodermal Activity (EDA) [Critchley and Nagai, 2013]. In this work, we exploit natural language processing (NLP) techniques and tools for automated text-based deception detection, and focus on the relevant cultural and language factors. As many studies suggest, deception is an act that depends on many factors such as personality [Fornaciari et al., 2013, Levitan et al., 2015], age [Sweeney and Ceci, 2014], gender [Tilley et al., 2005, Toma et al., 2008, Fu et al., 2008], or culture [Taylor et al., 2014, Leal et al., 2018]. All these factors affect the way and the means one uses to deceive. The vast majority of works in automatic deception detection take a "one-size-fits-all" approach, failing to adapt the techniques based on such factors. Only recently have research efforts that take into account such parameters started to appear.
Culture and language are tightly interconnected, since language is a means of expression, embodiment and symbolization of cultural reality [Kramsch, 2011], and as such, differences among cultures are reflected in language usage. According to previous studies [Rotman, 2012, Taylor et al., 2014, Leal et al., 2018], this also applies to the expression of deception among people belonging to different cultures (a detailed analysis related to this point is provided in Section 2.2). The examination of the influence of cultural properties on deception detection is extremely important, since differences in social norms may lead to misjudgments and misconceptions, and consequently can impede fair treatment and justice [Jones and Newburn, 2001, Taylor et al., 2014]. The globalization of criminal activities that employ face-to-face communication (e.g., when illegally trafficking people across borders) or digital communication (e.g., phishing in e-mail or social media), as well as the increasing number of people passing interviews in customs and borders all over the world, are only some scenarios that make the incorporation of cultural aspects in the research of deception detection a necessity. Since the implicit assumption made about the uniformity of linguistic indicators of deception comes in conflict with prior work from psychological and sociological disciplines, our three research goals are: (a). Can we verify the prior body of work which states that linguistic cues of deception are expressed differently, e.g., are milder or stronger, across cultures due to different cultural norms? More specifically, we want to explore how the individualism/collectivism divide defines the usage of specific linguistic cues [Taylor et al., 2014, 2017]. Individualism and collectivism constitute a well-known division of cultures, and concern the degree to which members of a culture value individual over group goals and vice versa [Triandis et al., 1988]. Since cultural boundaries are difficult to define precisely when collecting data, we use datasets from different countries assuming that they reflect at an aggregate level the dominant cultural aspects that relate to deception in each country. In other words, we use countries as proxies for cultures, following in that respect Hofstede [2001]. We also experiment with datasets originating from different text genres (e.g., reviews about hotels and electronics, opinions about controversial topics, transcripts from radio programs, etc.).
(b). Explore which language indicators and cues are more effective for detecting deception given a piece of text, and identify whether a universal feature set that we could rely on for deception detection tasks exists. On top of that, we investigate the volatility of cues across different domains by keeping the individualism/collectivism and language factors steady, whenever we have appropriate datasets at our disposal.
(c). In conjunction with the previous research goal, we create and evaluate the performance of a wide range of binary classifiers for predicting the truthfulness and deceptiveness of text.
These three research goals have not been addressed before, at least from this point of view. Regarding the first goal, it is particularly useful to confirm some of the previously reported conclusions about deception and culture under the prism of individualism/collectivism with a larger number of samples and from populations beyond the closed environments of university campuses and small communities used by the original studies. For the other two research goals, we aim at providing an efficient methodology for the deception detection task, exploring the boundaries and limitations of the options and tools currently available for different languages.
To answer our first and second research goals, we performed statistical tests on a set of linguistic cues of deception already proposed in the literature, placing emphasis on those reported to differentiate across the individualism/collectivism divide. We conducted our analysis on datasets originating from six countries, namely the United States of America, Belgium, India, Russia, Romania, and Mexico, which are seen as proxies of cultural features at an aggregate level. Regarding the third research goal, the intuition is to explore different approaches for deception detection, ranging from methodologies that require minimal linguistic tools for each language (such as word n-grams), to approaches that require deeper feature extraction (e.g., syntactic features obtained via language-specific parsers) or language models that require training on large corpora, either in separation or in combination. One of our challenges is the difficulty to collect and produce massive and representative deception detection datasets. This problem is amplified by the diversity of languages and cultures, combined with the limited linguistic tools for under-researched languages despite recent advances [Conneau et al., 2018, Alyafeai et al., 2020, Hu et al., 2020, Hedderich et al., 2020]. To this end, we exploit various widely available related datasets for languages with adequate linguistic tools. We also create a new dataset based on transcriptions from a radio game. For each language under research, we created classifiers using a wide range of n-gram features from several levels of linguistic analysis, namely phonological, morphological and syntactic, along with other linguistic cues of deception and token embeddings. We provide the results of the experiments from logistic regression classifiers, as well as fine-tuned BERT models. Regarding BERT, we have experimented with settings specific to each particular language, based on the corresponding monolingual models, as well as with a cross-language setting using the multilingual model [Devlin et al., 2019].
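To make the classification setup concrete, the following minimal sketch illustrates the simplest of the configurations described above: word n-grams (here TF-IDF-weighted, one plausible instantiation) fed to a logistic regression classifier and evaluated with cross-validation. The file name and the "text"/"label" column names are hypothetical placeholders.

```python
# Minimal sketch of an n-gram + logistic regression deception classifier;
# "deception_corpus.csv" and its column names are hypothetical placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("deception_corpus.csv")  # one dataset; labels: 0 truthful, 1 deceptive

pipeline = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),  # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)

# Cross-validated accuracy, a common protocol for these relatively small corpora.
scores = cross_val_score(pipeline, df["text"], df["label"], cv=10)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```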
In the remainder of this paper, we first present the relevant background (Section 2), including both theoretical work and computational work relevant to deception and deception detection, with emphasis on the aspects of culture and language. We then proceed with the presentation of the datasets that we utilized (Section 3), the feature extraction process (Section 4), and the statistical evaluation of linguistic cues (Section 5). Subsequently, we present and discuss the classification schemes and the evaluation results, comparing them with related studies (Section 6). Finally, we conclude and provide some future directions for this work (Section 7).

Deception in psychology and communication
Several theories back up the observation that people speak, write and behave differently when they are lying than when they are telling the truth. Freud was the first to observe that the subconscious feelings of people about someone or something are reflected in how they behave and the word choices they make [Freud, 1914]. The most influential theory that connects specific linguistic cues with the truthfulness of a statement is the Undeutsch hypothesis [Undeutsch, 1967, 1989]. This hypothesis asserts that statements of real-life experiences derived from memory differ significantly in content and quality from fabricated ones, since the invention of a fictitious memory requires more cognitive creativity and control than remembering an actually experienced event.
On this basis, a great volume of research work examines which linguistic features are more suitable to distinguish a truthful from a deceptive statement. These linguistic features can be classified roughly into four categories: word counts, pronoun use, emotion words and markers of cognitive complexity. The results for these dimensions have been contradictory, and researchers seem to agree that cues are heavily context-dependent. More specifically, the importance of specific linguistic features tends to change based on many parameters, such as the type of text, e.g., dialogue, narrative [Picornell, 2013], the medium of the communication, e.g., face-to-face, computer-mediated [Zhou et al., 2004, Hancock et al., 2007, Zhou and Zhang, 2008, Rubin, 2010], deception type [Frank and Ekman, 1997], how motivated the deceiver is [Frank and Ekman, 1997], etc. There is also a volume of work that examines how the conditions under which the experiments were performed, e.g., sanctioned, unsanctioned, influence the accuracy results and the behaviour of the participants [Feeley and deTurck, 1998, Dunbar et al., 2015].
Given the volatility of the results even within the context of a specific language, the implicit assumption made about the universality of deception cues can lead to false alarms or misses. Differences in social norms and etiquette, and the anxiety and awkwardness that may stem from the language barrier (when speakers do not use their native languages), can distort judgments. A reasonable argument is that, since the world's languages differ in many ways, the linguistic cues identified as deceptive in one language might not be applicable to another. For example, a decrease in first-person personal pronoun use is an indicator of deception in English [Hauch et al., 2015]. What happens though in languages where personal pronoun use is not always overt, such as Italian, Spanish, Greek and Romanian (i.e., null subject languages)? In addition, modifiers (i.e., adjectives and adverbs), prepositions and verbs are also commonly examined cues. But not all languages use the same grammatical categories; for example, Russian and Polish have no articles [Newman et al., 2003, Zhou et al., 2004, Spence et al., 2012].
All psychology and communication studies that involve participants from different cultural groups, asking them to identify truths and fabrications within the same and different cultural groups, reach the same conclusion about the accuracy of the predictions. More specifically, as Table 1 indicates, the accuracy rate in all the studies dropped to chance when judgments were made across cultures, whereas for within-culture judgments it was in line with the rest of the bibliography, which places accuracy at typically slightly better than chance [DePaulo et al., 1985]. Indeed, deception detection turns out to be a very challenging task for humans. It is indicative that even in studies that involve people who have worked for years in jobs that require training in deception detection, such as investigators or customs inspectors, the results are not significantly better [Ekman and O'Sullivan, 1991]. These results are usually attributed to truth bias, i.e., the tendency of humans to actively believe or passively presume that another person is honest, even despite evidence to the contrary [DePaulo et al., 1985, Vrij, 2008b]. The further impairment of accuracy in across-culture studies is attributed to the norm violation model. According to this model, people infer deception whenever the communicator violates what the receiver anticipates as being normative behaviour, and this is evident in both verbal and non-verbal communication [Taylor et al., 2014].

Table 1: Social psychology studies on within and across culture deception detection.

Reference | Description | Within & Across culture accuracy (%)
Bond et al. [1990] | Jordanian and US undergraduate students were videotaped while telling lies and truths for the examination of a deceiver's nonverbal behaviour.
Taylor et al. [2014, 2017] examined how the degree of individualism and collectivism of each culture influences the usage of specific linguistic indicators in deceptive and truthful verbal behaviour. The participants were recruited from community and religious centres across North West England, and were self-assigned to one of the groups. The task was to write one truthful and one deceptive statement about a personal experience, or an opinion and counter-opinion, in English. In the study, the collectivist group (Black African and South Asian) decreased the usage of pronouns when lying, and used more first-person and fewer third-person pronouns to distance the social group from the deceit. In contrast, the individualistic group (White European and White British) used fewer first-person and more third-person pronouns, to distance themselves from the deceit.
In these works, Taylor stated the hypothesis that affect in deception is related to cultural differences. This hypothesis was based on previous related work that explored the relation between sentiment and deception across cultures, which is briefly summarized in Table 2. The results though refute the original hypothesis, showing that the use of positive affect while lying was consistent among all the cultural groups. More specifically, participants used more positive affect words and fewer words with negative sentiment when they were lying, compared to when they were truthful. Based on these findings, emotive language during deception may be a strategy for deceivers to maintain social harmony.

Table 2: Related work on the relation between sentiment and deception across cultures.
[1991] | Collectivists, in order to avoid conflict and to protect social harmony, may be more engaged with friendly emotions rather than with more detached emotions like anger.
Seiter et al. [2002] | Collectivists consider lying a more socially acceptable behaviour. To this end, emotions, as proposed by Vrij et al. [2010], might not emerge in the first place.
Matsumoto et al. [2008] | Individualism is connected with emotional expression, whereas collectivists are more likely to restrain their emotional expression.
According to the same study, the use of negations is a linguistic indicator of deception in the collectivist group, but is unimportant for the individualist group. Negations have been studied extensively with respect to differences among cultures and the emotions they express. Stoytcheva et al. [2014] conclude that speakers of Asian languages are more likely to use negations than English speakers, due to a preference for an indirect style of communication. Moreover, Mafela [2013] states that for South African languages the indirect style of communication leads to the usage of negation constructs for the expression of positive meanings.
Contextual detail is a cognitive factor also examined in Taylor's works. According to the related literature, contextual details, such as the spatial arrangement of people or objects, occur naturally when people describe existing events from their memory. The key finding of this study suggests that this is actually true for the relatively more individualistic participants, e.g., European ones. For the collectivist groups though, spatial details were less important while experiencing the event in the first place and, subsequently, during recall. As a result, individualist cultures tend to provide fewer perceptual details and more social details when they are lying, a trend that is reversed in collectivist cultures. Table 3 summarizes all the above findings.
It is important to mention that the discrepancies in linguistic cues between the individualist and collectivist groups were not confirmed for all types of examined lies, namely lies about opinions and lies about experiences. In more detail, the analysis showed that pronoun use and contextual embedding (e.g., the "circumstances") varied when participants lied about experiences, but not when they lied about opinions. By contrast, the affect-related language of the participants varied when they lied about opinions, but not about experiences. All the above findings indicate that it does not suffice to conceptualize liars as people motivated "to not get caught", since additional factors influence the way they lie: what they do not conceal, what they have to make up, who they want to protect, etc. Leal et al. [2018] investigate whether differences in low- and high-context culture communication styles can be incorrectly interpreted as cues of deceit in verbal communication. Through collective interviews, they studied British interviewees as representatives of low-context cultures, and Chinese and Arab interviewees as representatives of high-context cultures. The key findings of this work revealed that differences between cultures are indeed more prominent than differences between truth tellers and liars, and this can lead to communication errors.

Automated text-based deception detection
From a computational perspective, the task of deception detection that focuses on pursuing linguistic indicators in text is mainly approached as a classification task that exploits a wide range of features. In this respect, most research work combines psycholinguistic indicators drawn from prior work on deception [DePaulo et al., 1985, Porter and Yuille, 1996, Newman et al., 2003] along with n-gram features (mainly word n-grams), in order to enhance predictive performance in a specific context. As already stated, the psycholinguistic indicators seem to have a strong discriminating power in most of the studies, although the quantitative predominance in truthful or deceptive texts is extremely sensitive to parameters such as how motivated the deceiver is, the medium of communication and the overall context. The number of words that express negative and positive emotions, the number of pronouns, verbs, and adjectives, and the sentence length are among the most frequently used features. Hirschberg et al. [2005] obtain psycholinguistic indicators by using the lexical categorization program LIWC [Pennebaker et al., 2001] along with other features to distinguish between deceptive and non-deceptive speech. In the work of Gîrlea et al. [2016], psycholinguistic deception and persuasion features were used for the identification of deceptive dialogues, using as a dataset dialogues taken from the party game Werewolf (also known as Mafia) 1. For the extraction of the psycholinguistic features, the MPQA subjectivity lexicon 2 was used, as well as manually created lists. Various LIWC psycholinguistic, morphological and n-gram features for tackling the problem of the automatic detection of deceptive opinion spam 3 are examined by Ott et al. [2011, 2013]. These feature sets were tested with a linear Support Vector Machine (SVM) [Cortes and Vapnik, 1995]. In these two works, Ott et al. [2011, 2013] provide two datasets with deceptive and truthful opinions, one with positive sentiment reviews [Ott et al., 2011] and one with negative sentiment [Ott et al., 2013]. These datasets, either in isolation or combined, have been used as a gold standard in many works. Kleinberg et al. [2018] examined the hypothesis that the number of named entities is higher in truthful than in deceptive statements, by comparing the discriminative ability of named entities with a lexicon word count approach (LIWC) and a measure of sentence specificity. The results suggest that named entities may be a useful addition to existing approaches. Feng et al. [2012] investigated how syntactic stylometry can help in text deception detection. The features were obtained from Context Free Grammar (CFG) parse trees and were tested over four different datasets, spanning from product reviews to essays. The results showed improved performance compared to several baselines that were based on shallower lexico-syntactic features.

Table 3: Summary of differences in language use between truthful and deceptive statements across the four cultural groups examined in the work of Taylor et al. [2014, 2017]. Differences in pronoun usage and perceptual details were confirmed when participants lied about experiences, whereas affective language differences were confirmed when participants lied about opinions. (-) no difference, (I) individualism, (C) collectivism, (↑↑, ↓↓) suggest larger differences between truthful and deceptive statements.
Discourse and pragmatics have also been used for the task of deception detection. Rhetorical Structure Theory (RST) and Vector Space Modeling (VSM) are the two theoretical components that have been applied by Rubin and Vashchilko [2012] in order to set apart deceptive and truthful stories. The authors proposed a two-step approach: in the first step, they analyzed rhetorical structures, discourse constituent parts and their coherence relations, whereas in the second, they applied a vector space model to cluster the stories by discourse feature similarity. Pisarevskaya and Galitsky [2019] also explored the hypothesis that deception in text should be visible from its discourse structure. They formulated the task of deception detection as a classification task using discourse trees, based on RST. For evaluation purposes, they created a dataset containing 2,746 truthful and deceptive complaints about banks in English, where the proposed solution achieved a classification accuracy of 70 per cent. Related experiments with Latent Dirichlet Allocation (LDA) topic features revealed the difficulties of building a robust cross-domain classifier. More specifically, the average accuracy of 86 per cent in the one-domain setting dropped to a range of 52 to 64 per cent in a cross-domain setting, where a dataset is kept for testing and the rest are used for training. LDA was also used by Jia et al. [2018], along with term frequency and word2vec [Mikolov et al., 2013], for the feature extraction step in a supervised approach to distinguish between fake and non-fake hotel and restaurant reviews. These different feature types were examined both separately and in combination, while three classifiers were trained, namely logistic regression, SVM and multilayer perceptron (MLP) [Rumelhart and McClelland, 1987]. The evaluation was performed using the Yelp filter dataset 4 [Mukherjee et al., 2013a], and the experimental results showed that the combinations of LDA with logistic regression and LDA with MLP performed better, with 81 per cent accuracy. The work of Martinez-Torres and Toral [2019] focuses on how features may change under the influence of the nature of the text in terms of content and polarity. The proposed method examines three different feature types based on a bag-of-words representation. The first type uses all the words in a vocabulary (after a preprocessing step), the second one selects word features that are uniquely associated with each class (deceptive, truthful), while the third one further extends the classes to four by also adding the sentiment polarity factor. The dataset of Ott et al. [2011, 2013] was used for the evaluation of the six classifiers (i.e., k-NN, logistic regression, SVM, random forest, gradient boosting and MLP) that were employed. Fontanarava et al. [2017] proposed combining a large number of review-related features along with reviewer features for the detection of fake reviews. Some of the features were newly introduced for the task, inspired by relevant research on fake news. The features were fed to a random forest classifier, which was evaluated on the Yelp filter dataset. The results show that the combined features were beneficial for the task studied.
Finally, various kinds of embeddings (e.g., token, node, character, document, etc.) and deep learning approaches have been applied to the deception detection task. One of the first works is that of Ren and Ji [2017], which employs a Bidirectional Long Short-Term Memory network (BiLSTM) [Graves et al., 2013] to learn document-level representations. A semi-supervised approach is employed by Yilmaz and Durahim [2018] for the detection of spam reviews, using a combination of doc2vec [Le and Mikolov, 2014] and node2vec [Grover and Leskovec, 2016] embeddings. These embeddings are then fed into a logistic regression classifier to identify opinion spam. Another proposed deceptive review identification method uses recurrent convolutional neural networks [Liang and Hu, 2015] for opinion spam detection. The basic idea is that, since truthful reviews have been written by people in the context of a real experience while the deceptive ones have not, this contextual information can be exploited by the model. Aghakhani et al. [2018] adopted Generative Adversarial Networks (GANs) [Goodfellow et al., 2014] for the detection of deceptive reviews.

Non-English and multi-language research
Without a doubt, the English language attracts the majority of the research interest in the task of deception detection, due to the larger pool of English-speaking researchers, the interest of industry in commercial exploitation, and the abundance of linguistic resources. However, analogous approaches have also been applied to other languages.
In the work of Verhoeven and Daelemans [2014], the task of deception detection from text for the Dutch language is explored by using an SVM with unigram features. In the absence of any related dataset, the authors proceeded with the construction of their own dataset. SVMs have also been used for deception detection in opinions written in Spanish with the use of the Spanish version of the LIWC [Almela et al., 2012].
Similarly, in the work of Tsunomori et al. [2015], a dialogue corpus for the Japanese language is presented, and subsequently a binary classification based on decision trees is performed over this corpus, using acoustic/prosodic, lexical and subject-dependent features. The comparison with a similar English corpus has shown interesting results. More specifically, while for the prosodic/acoustic features there were no differences between the two languages, for the lexical features the results were greatly different. In English, noise, third-person pronouns, and features indicating the presence of "Yes" or "No" were effective. In Japanese, the lexical features used in this research were largely ineffective; only one lexical feature, the one indicating the presence of a verb base form, proved effective.
For the Chinese language, one of the first studies is that of Zhou and Sung [2008], who examined the computer-mediated communication of Chinese players engaged in the Werewolf game. Having as a starting point prior research for English, they ended up with a list of features (e.g., number of words, number of messages, average sentence length, average word length, total number of first-person and third-person singular/plural pronouns) and performed statistical analysis. The results revealed that, consistent with some studies for English speakers, the use of third-person pronouns increased during deception. In Chinese though, there were no significant differences in the proportional use of first-person pronouns.
For spam detection in Arabic opinion texts, an ensemble approach has been proposed by Saeed et al. [2019]. A stacking ensemble classifier that combines a k-means classifier with a rule-based classifier outperforms the rest of the examined approaches. Both classifiers use content-based features, like n-grams. Given the lack of datasets for fake reviews in Arabic, the authors use for evaluation purposes a translated version of the dataset of Ott et al. [2011, 2013]. They also use this dataset for the automatic labelling of a large dataset of hotel reviews in Arabic [Elnagar et al., 2018]. A supervised approach is also utilized for deceptive review detection in Persian [Basiri et al., 2019]. In this work, POS tags, sentiment-based features, and metadata (e.g., number of positive/negative feedback, overall product score, review length, etc.) are exploited to construct and compare various classifiers (e.g., naive Bayes, SVMs and decision trees). A dataset with 3,000 deceptive and truthful mobile phone reviews was gathered from customer reviews published on digikala.com. The latter dataset was labelled by majority voting over human annotators' answers to 11 questions previously designed for spam detection.
Last but not least, to the best of our knowledge, the only work towards the creation of cross-cultural deception detection classifiers is that of Pérez-Rosas and Mihalcea [2014]. Similarly to our work, country is used as a proxy for culture. Using crowdsourcing, the authors collected four deception datasets. Two of them are in English, originating from the United States and from India, one is in Spanish, obtained from speakers from Mexico, and one is in Romanian, from people from Romania. Next, they built classifiers for each language using unigram and psycholinguistic (LIWC-based) features. Then, they explored the detection of deception using training data originating from a different culture. To achieve this, they investigated two approaches: the first is based on the translation of unigram features, while the second is based on the equivalent LIWC semantic categories. The performance, as expected, dropped in comparison with the within-culture classification and was similar for both approaches. The analysis of the psycholinguistic features showed that there are word classes in LIWC that only appear in some of the cultures; e.g., classes related to time appear in English texts written by Indian people and in Spanish texts, but not in the US dataset. Lastly, they observed that deceivers in all cultures make use of negation, negative emotions, and references to others, and that truth tellers use more optimism and friendship words, as well as references to themselves.

Datasets
We experimented with eleven datasets from six countries, namely the United States, Belgium, India, Russia, Romania, and Mexico. We provide a detailed description of each dataset below, while Table 4 provides statistics and summarizes important information for each dataset. We put much effort into the collection and creation of appropriate datasets. We wanted to experiment with fairly diverse cultures in terms of the degree of individualism/collectivism, while having at our disposal basic linguistic tools and resources for the linguistic feature extraction step.
In terms of the quantification of cultural diversity, we based our work on Hofstede's long-standing research on cultural differences [Hofstede, 2001]. Hofstede defined a framework that distinguishes six dimensions (power distance, individualism/collectivism, uncertainty avoidance, masculinity/femininity, long-term/short-term orientation and indulgence/restraint) along which cultures can be characterized. In his study, as in our work, country is used as a proxy for culture. For each dimension, Hofstede provides a score for each culture. Figure 1 depicts the cultural differences among the six aforementioned countries for the individualism dimension, which is the focus of our work. The individualism scores vary significantly, with the United States possessing the highest score, and Mexico and Romania the lowest. We acknowledge that treating entire countries as single points along the individualism/collectivism dimension may be an over-simplification, especially for large countries. In the United States, for example, there is heterogeneity and diversity between regions (e.g., between the Deep South and the Mountain West), and even within the same region there may be different cultural backgrounds. However, the United States can be considered individualistic at an aggregate level, although there is measurable variation in the value of this dimension [Vandello and Cohen, 1999, Taras et al., 2016].
The creation of reliable and realistic ground truth datasets for the deception detection task is considered a difficult task on its own [Fitzpatrick and Bachenko, 2012]. In our case, the selected corpora have been created using the traditional techniques for obtaining corpora for deception detection research, namely sanctioned and unsanctioned deception. Briefly, a sanctioned lie is a lie told to satisfy the experimenter's instructions, e.g., participants are given a topic, while an unsanctioned lie is told without any explicit instruction or permission from the researcher, e.g., diary studies and surveys in which participants recall lies already uttered. Crowdsourcing platforms, e.g., Amazon Mechanical Turk 5, have also been used for the production of sanctioned content. In all sanctioned cases, a reward (e.g., a small payment) was given as motivation. In addition, apart from the already existing datasets in the bibliography, we created a new dataset (see Section 3.4) of spoken text from transcripts of a radio game show.

English -Deceptive Opinion Spam (OpSpam)
The OpSpam corpus 6 [Ott et al., 2011, 2013] was created with the aim of constituting a benchmark for deceptive opinion spam detection, and has been extensively used as such in subsequent research efforts. The authors approached the creation of the deceptive and truthful opinions in two distinct ways. Firstly, they chose hotel reviews as their domain, due to the abundance of such opinions on the Web, and focused on the 20 most popular hotels in Chicago and on positive sentiment reviews. Deceptive opinions were collected by using Amazon Mechanical Turk. Quality was ensured by applying a number of filters, such as using highly rated turkers located in the United States, and allowing only one submission per turker. Based on these restrictions, 400 deceptive positive sentiment opinions were collected. Secondly, the truthful opinions were collected from TripAdvisor 7 for the same 20 hotels, as thoroughly described in Ott et al. [2011]. Only 5-star reviews were kept in order to collect reviews with positive sentiment, eliminating all non-English reviews, all reviews with fewer than 150 characters, and reviews by authors with no other reviews. This was an effort to eliminate possible spam from the online data. Then, 400 truthful comments were sampled to create a balanced dataset. The same procedure was followed for negative sentiment reviews, by collecting 400 more deceptive opinions with negative sentiment through Amazon Mechanical Turk, and 400 truthful ones from 1- or 2-star reviews from various online sites. For more details, see Ott et al. [2013].
Human performance was assessed with the help of volunteers: the authors asked three untrained undergraduate university students to read and judge the truthfulness and deceptiveness of a subset of the acquired datasets. An observation from the results is that human deception detection performance is greater for negative (61 per cent) than for positive deceptive opinion spam (57 per cent). In both cases, however, automated classifiers outperform human performance.

English -Boulder Lies and Truth Corpus (Boulder)
The Boulder Lies and Truth corpus 8 [Salvetti et al., 2016] was developed at the University of Colorado Boulder and contains approximately 1,500 elicited English reviews of hotels and electronics for the purpose of studying deception in written language. Reviews were collected by crowdsourcing with Amazon Mechanical Turk. During data collection, a filter was used to accept US-only submissions [Salvetti, 2014]. The original corpus divides the reviews into three categories:

• Truthful: a review about an object known by the writer, reflecting the real sentiment of the writer towards the object of the review.
• Opposition: a review about an object known by the writer, reflecting the opposite sentiment of the writer towards the object of the review (i.e., if the writers liked the object they were asked to write a negative review, and the opposite if they didn't like the object).
• Deceptive (i.e., fabricated): a review written about an object unknown to the writer, either positive or negative in sentiment.
This is one of the few available datasets that distinguish different types of deception (fabrications and lies). Since the dataset was constructed via turkers, the creators of the dataset took extra care to minimize the inherent risks, mainly the tendency of turkers to speed up their work and maximize their economic benefit through cheating. More specifically, the creators implemented several methods to validate the elicited reviews, checking for plagiarism efforts and the intrinsic quality of the reviews. We unified the two subcategories of deception (fabrication and lie), since the focus of this work is to investigate deceptive cues without regard to the specific type of deception.

English -DeRev
The DeRev dataset [Fornaciari and Poesio, 2014] comprises deceptive and truthful opinions about books. The opinions have been posted on Amazon.com. This dataset provides "real life" examples of how language is used to express deceptive and genuine opinions, i.e., it is an example of a corpus of unsanctioned deception. Without a doubt, manually detecting deceptive posts in this case is a very challenging task, since it is impossible to find definite proof that a review is truthful or not. For that reason, many heuristic criteria were employed, and only a small subset of the collected dataset with a high degree of confidence was accepted into the gold standard dataset. In more detail, only 236 out of the 6,819 reviews that were collected (118 deceptive and 118 truthful) constituted the final dataset. The starting point for identifying the deceptive and genuine clues that define the heuristic criteria was a series of articles 9,10,11,12 with suggestions and advice about how to unmask a deceptive review on the Web, as well as specific incidents of fake reviews that have been disclosed. Such clues are the absence of information about the purchase of the reviewed book, the use of nicknames, reviews that have been posted for the same book in a short period of time, and a reference to a suspicious book (i.e., a book whose authors have been accused of purchasing reviews, or have admitted that they have done so). The truthfulness of the reviews was identified in a similar manner by reversing the cues. We performed a manual inspection, which confirmed that all of the 113 reviewers of the 236 reviews we used (excluding 8 reviewers whose accounts were no longer valid) had submitted at least one review marked by the platform as having been submitted in the United States. Hence, it is reasonable to assume that the vast majority of the reviewers were US-based.

8 https://catalog.ldc.upenn.edu/LDC2014T24
9 http://www.guardian.co.uk/books/2012/sep/04/sock-puppetry-publish-be-damned
10 https://www.moneytalksnews.com/3-tips-for-spotting-fake-product-reviews-%E2%80%93-from-someone-who-wrote-them
11 http://www.nytimes.com/2011/08/20/technology/finding-fake-reviews-online.html
12 http://www.nytimes.com/2012/08/26/business/book-reviewers-for-hire-meet-a-demand-for-online-raves.html

English -Bluff The Listener (Bluff)
The "Wait Wait... Don't Tell Me!" is an hour-long weekly radio news panel game show produced by Chicago Public Media and National Public Radio (NPR) 13 that airs since 1998. One of the segments of this show is called "Bluff the Listener" in which a contestant listens to three thematically linked news reports from three panelists, one of which is truthful and the rest are fictitious. Most of the stories are humorous and somewhat beyond belief, e.g., a class to teach your dog Yiddish. The listener must determine the truthful story in order to win a prize, whereas at the same time the panelist that is picked is awarded with a point to ensure the motivation for all the participants. An archive of transcripts of this show is available since 2007 in the official web page of the show. We used these transcripts and we managed to retrieve and annotate 178 deceptive and 89 truthful stories. Consequently, we collected the participant's replies to calculate the human success rate. Interestingly, the calculated rate was about 68 per cent, which is quite high since in experimental studies of detecting deception, the accuracy of humans is typically only slightly better than chance, mainly due to truth bias as previously mentioned. This might be attributed to the fact that the panelists of the show have remained almost the same, and as a result the listeners might have learned their patterns of deception over time.
In addition, we have to stress that the intent of the panelists to deceive is intertwined with their intent to entertain and amuse their audience. Hence, it is interesting to examine if the linguistic cues of deception can be distorted by this double intent, and if they still suffice to discriminate between truth and deception even in this setting.

English/Spanish/Romanian -Cross-Cultural Deception
To the best of the authors' knowledge, this is the only available multicultural dataset constructed for cross-cultural deception detection 14. It covers four different language/country settings: EnglishUS (English spoken in the US), EnglishIndia (English spoken by Indian people), SpanishMexico (Spanish spoken in Mexico), and Romanian, approximating culture with the country of origin of the dataset. Each dataset consists of short deceptive and truthful essays on three topics: opinions on abortion, opinions on the death penalty, and feelings about a best friend. The two English datasets were collected from English speakers using Amazon Mechanical Turk, with a location restriction to ensure that the contributors are from the country of interest (United States and India). The Spanish and Romanian datasets were collected from native Spanish and Romanian speakers using a web interface; these participants were recruited through contacts of the paper's authors. For all datasets, the participants were asked first to provide their truthful responses, and then their deceptive ones. In this work, we use all the available individual datasets. We detected a number of spelling errors and some systematic punctuation problems in both English datasets, with the spelling problems being more prevalent in the EnglishIndia dataset. To this end, we corrected the punctuation errors, e.g., "kill it.The person", in a preprocessing step in both datasets.
Regarding the spelling errors, we found no correlation between the errors and the type of text (deceptive, truthful), and since the misspelled words were almost evenly distributed among both types of text, we did not proceed to any correction.

Dutch -CLiPS Stylometry Investigation (CLiPS)
The CLiPS Stylometry Investigation (CSI) corpus 15 [Verhoeven and Daelemans, 2014] is a Dutch corpus containing documents of two genres, namely essays and reviews. All documents were written by students of Linguistics & Literature at the University of Antwerp 16, taking Dutch proficiency courses for native speakers, between 2012 and 2014. It is a multi-purpose corpus that serves many stylometry tasks, such as the detection of age, gender, authorship, personality, sentiment, deception, and genre. The place where the authors grew up is provided in the metadata. On this basis, it is known that only 11.2 per cent of the participants grew up outside Belgium, with the majority of them (9.7 per cent of all authors) having grown up in the neighbouring Netherlands.
The review part of the corpus contains 1,298 (649 truthful and 649 deceptive) texts. All review texts in the corpus were written by the participants as a special assignment for their course. Notice that the participants did not know the purpose of the review task. For the collection of the deceptive reviews, students were asked to write a convincing review, positive or negative, about a fictional product, while the truthful reviews reflect the author's real opinion on an existing product. All the reviews were written about products from the same five categories: smartphones, musicians, food chains, books, and movies.

13 http://www.npr.org/programs/wait-wait-dont-tell-me
14 http://web.eecs.umich.edu/~mihalcea/downloads.html#CrossCulturalDeception
15 https://www.clips.uantwerpen.be/bibliography/csi-corpus
16 The city of Antwerp is the capital of Antwerp province in the Flemish Region of the Kingdom of Belgium.

Russian -Russian Deception Bank (Russian)
For the Russian language, we used the corpus of the rusProfilingLab 17. It contains truthful and deceptive narratives written by the same individuals on the same topic ("How I spent yesterday", etc.). To minimize the effect of the observer's paradox 18, the researchers did not explain the aim of the research to the participants. Participants who managed to deceive the trained psychologist who evaluated their responses were rewarded with a cinema ticket voucher. The corpus consists of 113 deceptive and 113 truthful texts, written by 113 individuals (46 males and 67 females) who were university students and native Russian speakers. Each text in the corpus is accompanied by various metadata, such as gender, age, results of a psychological test, etc.

English -Native English (NativeEnglish)
Finally, we combined all the datasets that were created from native English speakers (i.e., OpSpam, Boulder, DeRev, Bluff, and EnglishUS) in one dataset. The idea is to create one multi-domain dataset, big enough for training, where the input is provided by native speakers.

Features
In this section, we detail the feature selection and extraction processes. Furthermore, we explicitly define the features that we exploited for pinpointing differences between cultures.

Feature extraction
We have experimented with three feature types along with their combinations, namely a plethora of linguistic cues (e.g., word counts, sentiment, etc.), various types of n-grams, and token embeddings. The linguistic indicators are extracted based on prior work, as already analyzed in Sections 2.3 and 2.4. Further, we have evaluated various types of n-grams in order to identify the most discriminative ones. The use of n-grams is among the earliest and most effective approaches for the task of deception detection. Ott et al. [2011] and Fornaciari et al. [2013] were among the first to use word n-grams for deception detection, while character n-grams and syntactic n-grams (defined below) have been used by Fusilier et al. [2015] and Feng et al. [2012], respectively. Lastly, due to the absence of a large training corpus, we tried to combine feature engineering and statistical models, in order to enhance the overall performance and get the best of both worlds. This approach is in line with recent research on deception detection that tries to leverage various types of features [Bhatt et al., 2018, Krishnamurthy et al., 2018, Siagian and Aritsugi, 2020].

Table 5: The linguistic cue features per language. Features examined in the studies of Taylor et al. [2014, 2017] are marked with an asterisk (*). The dot (•) marks non-normalized features. Absence of a tick marks the inability to extract this specific feature for this particular language. N/A indicates that the feature is not applicable for this particular language.
Features | English | Dutch | Russian | Spanish | Romanian
Table 5 presents the complete list of features for each language explored in this work. These features count specific cues in text, aiming to capture characteristics of deceptive and truthful language. The indicators have been conceptually divided into six categories, namely word counts, phoneme counts, pronoun use, sentiment, cognitive complexity, and relativity. The absence of a tick in Table 5 marks the inability to extract the specific feature, given the available linguistic tools and resources for each language, while "N/A" marks the non-existence of the particular feature in the specific language, e.g., articles in Russian.
Although we believe that most feature names are self-explanatory, we have to further describe the #hedges and #boosters features. Hedges is a term coined by the cognitive linguist George Lakoff [Lakoff, 1973] to describe words expressing some feeling of doubt or hesitancy (e.g., guess, wonder, reckon, etc.). On the contrary, boosters are words that express confidence (e.g., certainly, apparently, apparent, always). Both are believed to correlate either positively or negatively with deception, and are thus frequently used in related research work [Bachenko et al., 2008]. Regarding the important feature of pronouns, we consider first-person pronouns in singular and plural form (e.g., I vs. we, mine vs. ours, etc.), third-person pronouns (e.g., they), indefinite pronouns (e.g., someone, anyone, etc.), demonstrative pronouns (e.g., this, that, etc.), and the total number of pronouns. The linguistic tools used for the extraction of the features, e.g., POS taggers, named entity recognition tools, etc., are shown in Table 8. Some of the features were extracted with handcrafted lists authored or modified by us. Such features include filled pauses (e.g., ah, hmm, etc.), motion verbs, hedge words, boosters, etc. Table 7 lists the sentiment analysis tools used for each language. We exploited, whenever possible, language-specific sentiment lexicons used in the bibliography, and avoided the simple solution of automatically translating sentiment lexicons from American English. Related research [Mohammad et al., 2016] has shown that mistranslation (e.g., positive words translated as having neutral sentiment in the target language), cultural differences, and different sense distributions may lead to errors and may insert noise when translating sentiment lexicons. Analogously, we maintained the same practice for the rest of the features. When this was not feasible, we proceeded with just the translation of linguistic resources (mostly for the Russian language). For the #spatial words feature, which counts the number of spatial references in a text, we followed a two-step process. We employed a combination of a named entity recognition (NER) tool (see Table 8) and spatial lexicons for each language. The lexicons, principally gathered by us, contain spatially related words (e.g., under, nearby, etc.) for each language, while the named entity recognizer extracts location-related entities from the corpora (e.g., Chicago, etc.). In the case of the English language, the existence of a spatial word in the text was determined using a dependency parse, in order to reduce false positives. The final value of this feature is the sum of the two values (spatial words and location named entities). For Romanian, we had to train our own classifier based on Conditional Random Fields (CRFs) [Lafferty et al., 2001, Finkel et al., 2005] by using as training corpus RONEC [Dumitrescu and Avram, 2020a], a free, open-source resource that contains annotated named entities for 5,127 sentences in Romanian.
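As an illustration of how such cue counts can be computed, the sketch below extracts a handful of the English features using spaCy; spaCy here is an assumption for exposition (the actual per-language tools we used are those listed in Table 8), and the tiny hedge/booster lists stand in for the full handcrafted lists.

```python
# Illustrative extraction of a few of the cue features of Table 5, using
# spaCy for tagging (an assumption; the paper's per-language tools are in
# Table 8). The tiny word lists below are placeholders for the full lists.
import spacy

nlp = spacy.load("en_core_web_sm")
HEDGES = {"guess", "wonder", "reckon", "maybe", "perhaps"}
BOOSTERS = {"certainly", "always", "apparently", "definitely"}

def cue_features(text):
    doc = nlp(text)
    n_tokens = len(doc) or 1  # avoid division by zero on empty input
    return {
        "#words": sum(not t.is_punct for t in doc),  # non-normalized (•)
        "#first_person_sing": sum(
            t.tag_ == "PRP" and t.lower_ in {"i", "me", "mine"} for t in doc
        ) / n_tokens,
        "#third_person": sum(
            t.tag_ == "PRP" and t.lower_ in {"he", "she", "they", "them"}
            for t in doc
        ) / n_tokens,
        "#hedges": sum(t.lower_ in HEDGES for t in doc) / n_tokens,
        "#boosters": sum(t.lower_ in BOOSTERS for t in doc) / n_tokens,
        "#negations": sum(t.dep_ == "neg" for t in doc) / n_tokens,
    }

print(cue_features("I guess we never saw anything suspicious there."))
```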
The values of the features were normalized depending on their type. For example, the #nasals feature was normalized by dividing by the total number of characters in the document, while the #prepositions feature was divided by the number of tokens in the document. The features #words, #lemmas, #punctuation marks, average word length, mean sentence length and mean preverb length were left non-normalized. For each sentiment lexicon, except for ANEW, we computed the score by applying the following formula to each document d of |d| tokens and each sentiment s (positive or negative):

$$\mathrm{score}(d, s) = \frac{1}{|d|}\sum_{i=1}^{|d|} \mathrm{sentiment\_strength}(t_i, s)$$

where $t_i$ is the i-th token of d. The sentiment_strength for SentiWordNet is a value in the interval [0,1], while for the rest of the sentiment resources the values are either 0 or 1.
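The scoring formula above translates directly into code; the toy lexicon below is a placeholder for the resources of Table 7 (binary strengths for most lexicons, graded strengths in [0,1] for SentiWordNet).

```python
# Direct implementation of the score above: the sentiment strengths of all
# tokens are summed and normalized by the document length |d|. The toy
# lexicons stand in for the per-language resources of Table 7.
POS_LEX = {"great": 1.0, "lovely": 1.0}
NEG_LEX = {"awful": 1.0, "dirty": 1.0}

def sentiment_score(tokens, lexicon):
    # Tokens absent from the lexicon contribute a strength of 0.
    return sum(lexicon.get(t, 0.0) for t in tokens) / len(tokens)

tokens = "the room was lovely but the lobby was awful".split()
print(sentiment_score(tokens, POS_LEX))  # 1/9, from "lovely"
print(sentiment_score(tokens, NEG_LEX))  # 1/9, from "awful"
```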
For the ANEW linguistic resource [Bradley and Lang, 1999], which rates words in terms of pleasure (affective valence), arousal, and dominance with values from 0 to 10, we only considered the normalized valence rating, which expresses the degree of positivity or negativity of a word. The applied formula in this case is:

$$\mathrm{score}(d) = \frac{1}{|d|}\sum_{i=1}^{|d|} \mathrm{valence}(t_i)$$

Lastly, we included phoneme-related features in our analysis. Our hypothesis was that phonological features, captured by phonemes for text, will be more discriminative in spoken datasets, since the deceiver will take extra care to sound more truthful to the receiver, even subconsciously. This hypothesis is in line with an increasing volume of work that investigates the existence of non-arbitrary relations between phonological representation and semantics. This phenomenon is known as phonological iconicity and links a word's form with the emotion it expresses [Nastase et al., 2007, Schmidtke et al., 2014]. Table 6 summarizes such representative works.

Table 6: Representative works on phonological iconicity.
Fónagy [1961] | This early study on Hungarian poems showed that sonorants (e.g., /l/, /m/) occur more often in tender, but plosives (e.g., /k/, /t/) more often in aggressive poems.
Taylor and Taylor [1965] | Evidence that pleasantness relations are language specific.
Zajonc et al. [1989] | Passages about Hell from Milton's "Paradise Lost" were found to contain significantly more front vowels and hard consonants than passages about Eden, while the latter contained more medium back vowels.
Whissell [1999] | The analysis of phonemes in different sources (song lyrics, poetry, word lists, advertisements) shows that plosives correlate with unpleasant words.
Auracher et al. [2010] | A multilingual analysis on poems found that plosive sounds are more likely to express a pleasant mood, whereas a relatively high frequency of nasal sounds indicates an unpleasant mood. Universality is claimed, since the authors found the same trend independently of the language.
Kraxenberger and Menninghaus [2016] | This work failed to reproduce the results of Auracher et al. [2010].
Papantoniou and Konstantopoulos [2016] | The analysis of names of movie fictional characters showed, among other findings, the connection of nasals with negative sentiment.

Table 7: Sentiment analysis resources per language.
en | ANEW [Bradley and Lang, 1999] | Normative emotional ratings for 3,188 words. It provides values with respect to the pleasure, arousal and dominance of each term.
en | FBS [Hu and Liu, 2004] | 6,786 words (2,006 positive and 4,783 negative).
en | MPQA [Riloff and Wiebe, 2003, Wilson et al., 2005] | Each word annotated for intensity (strong, weak). 6,885 words (2,718 positive and 4,912 negative).
en | SentiWordNet [Baccianella et al., 2010] | All WordNet synsets automatically annotated for degrees of positivity, negativity, and neutrality/objectiveness.
nl | VU-sentiment-lexicon [Maks et al., 2014] | 9,237 words (3,314 positive and 5,923 negative).
ru | RuSentiLex [Loukachevitch and Levchik, 2016] | Lexicon generated through semi-automatic techniques, which contains 16,057 words (10,227 negative, 3,770 positive, 1,747 neutral and 291 either positive or negative based on context).
es | Spanish Sentiment Lexicon [Pérez-Rosas et al., 2012] | It provides two polarity lexicons: (a) an automatically generated one with 2,496 concepts and (b) a semi-automatically generated one with 1,347 concepts. We employed the semi-automatically generated lexicon, since it is the one with the largest reported accuracy, approximately 90 per cent.
ro | RoSentiLex | We translated the MPQA lexicon by using a bilingual Romanian-English dictionary [Mihalcea, 2014].

N-grams
We have evaluated several variations of n-grams from various levels of linguistic analysis to encode linguistic information. Given the diversity of the datasets, we used different types of n-grams to identify those that are more effective in discriminating deceptive from truthful content. For each n-gram type and for each dataset we extracted unigrams, bigrams, trigrams, unigrams+bigrams, bigrams+trigrams, and unigrams+bigrams+trigrams. Some examples are shown in Table 9; a minimal extraction sketch is also given after the list below.
• Phoneme n-grams: These features were extracted from the phonetic representation of the texts, derived by applying the spelling-to-phoneme module of the espeak-ng speech synthesizer (see Table 8). We examined phoneme n-grams at the level of words.
• Character n-grams: Consecutive characters that can also belong to different words.
• Word n-grams: We examined versions with and without stemming and stopword removal.
• POS n-grams: POS n-grams are contiguous part-of-speech tag sequences such as adjective-noun-verb, noun-verb-adverb, and so on, that provide shallow grammatical information. We extracted POS n-grams using the appropriate POS-tagger for each language (see Table 8).
• Syntactic n-grams: Syntactic n-grams (sn-grams) are constructed by following all possible paths in dependency trees and keeping the labels of the dependencies (arcs) along the paths. We used Stanford's CoreNLP syntactic parser to construct the dependency trees for the English datasets (see Table 8).
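As a rough illustration of the extraction step, the sketch below uses scikit-learn's CountVectorizer as a stand-in for our actual tooling; the documents are toy inputs. Phoneme n-grams would apply the same word-level analyzer to the phonetic transcriptions produced by espeak-ng.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The room was amazing", "We told the staff the truth"]  # toy inputs

# Word unigrams+bigrams, i.e., the (1,2) configuration in our notation.
word_vec = CountVectorizer(analyzer="word", ngram_range=(1, 2))
X_word = word_vec.fit_transform(docs)

# Character trigrams; consecutive characters may cross word boundaries.
char_vec = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X_char = char_vec.fit_transform(docs)

print(word_vec.get_feature_names_out()[:5])  # first few extracted n-grams
```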

BERT embeddings
Regarding token embeddings, we used the contextualized embeddings from the BERT [Devlin et al., 2019] model. BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language model based on a stack of Transformer encoder layers, pretrained on a large unlabelled cross-domain corpus using masked language modeling and next-sentence prediction objectives. Since its introduction, BERT has achieved state-of-the-art results in many NLP tasks. In most cases, the best results are obtained by adding a shallow task-specific layer (e.g., a linear classifier) on top of a pretrained BERT model, and fine-tuning (further training) the pretrained BERT model jointly with the task-specific layer on a labelled task-specific dataset. In effect, each encoder layer of BERT builds token embeddings (dense vectors, each representing a particular token of the input text). The token embeddings of each encoder layer are revised by the next stacked encoder layer. A special classification embedding ([CLS]) is also included in the output of each layer, to represent the entire input text. In classification tasks, typically the [CLS] embedding of the top-most encoder layer is passed on to the task-specific classifier, which in our case decides if the input text is deceptive or not. We explore this approach in Section 6.2. We note that BERT uses a WordPiece tokenizer [Schuster and Nakajima, 2012], which segments the input text into tokens corresponding to character sequences (possibly entire words, but also subwords or even single characters) that are frequent in the large corpus BERT is pretrained on. We also note that BERT's token embeddings are context-aware, i.e., different occurrences of the same token receive different embeddings when surrounded by different contexts. In Table 10 we provide details about the BERT models used. We exploit pretrained models for each language, as well as the multilingual BERT model, which is pretrained over Wikipedia in 104 languages.
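To make the fine-tuning setup concrete, the following minimal sketch uses the HuggingFace transformers library; the model name, toy texts and labels are illustrative assumptions, not our exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A binary classification head is placed on top of the pretrained encoder;
# its input is the [CLS] representation of the top-most layer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # deceptive vs. truthful

texts = ["The hotel was absolutely amazing!", "We stayed two nights in March."]
labels = torch.tensor([1, 0])  # toy labels: 1 = deceptive, 0 = truthful

batch = tokenizer(texts, padding=True, truncation=True, max_length=200,
                  return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # an optimizer step would complete one fine-tuning step
```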

Statistical evaluation of linguistic cues
In this section we conduct a statistical analysis of the linguistic cues (see Section 4.2) per dataset. In more detail, we conduct a Mann-Whitney U test to identify the statistically significant linguistic features of each dataset (the NativeEnglish dataset is the union of all native English speaker datasets). Afterwards, we apply a multiple logistic regression (MLR) analysis over the statistically significant features of each dataset. This analysis shows the distinguishing strength of the important linguistic features. We discuss the results for each dataset/culture, and try to provide some cross-cultural observations.

Statistical significance analysis
Since we cannot make any assumption about the distribution of the feature values in each dataset, we performed the non-parametric Mann-Whitney U test (two-tailed) with a 99 per cent confidence level and α = 0.01. The null hypothesis (H0) to be rejected is that there is no statistically significant difference between the mean rank of a feature for texts belonging to the deceptive class and the mean rank of the same feature for texts belonging to the truthful class.
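For illustration, the per-feature test can be run with SciPy as follows; the feature values below are synthetic.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Synthetic values of one feature (e.g., #lemmas) per class.
deceptive = rng.normal(50, 10, 100)
truthful = rng.normal(55, 10, 100)

stat, p = mannwhitneyu(deceptive, truthful, alternative="two-sided")
if p < 0.01:  # alpha = 0.01, as in our setting
    print(f"Statistically significant (U={stat:.1f}, p={p:.4f})")
```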
The results are available in the Appendix Tables 31 and 32. Below we summarize the main observations.
1. No statistically significant features were found in the Russian collection, and as a result we ignore this dataset in the rest of this analysis. This is probably due to the inappropriateness of the selected features and/or the shortage of language resources for the Russian language, or even the intrinsic properties and peculiarities of the dataset itself. Consequently, we cannot draw any conclusion about how the linguistic features are used in this dataset or compare it with the rest.

2. Statistically significant differences were found in most of the datasets for the features #lemmas, #words, and #punctuation. In more detail:
• The importance of #lemmas is observed in most of the datasets. A large number of lemmas seems to be a signal for truthful texts in most of the examined datasets, with the exception of the DeRev and Bluff datasets, where a large number of lemmas is a signal for deceptive texts. These two datasets are quite distinct from the rest, since the former is an example of unsanctioned deception, while the latter concerns transcriptions of spoken data with notable stylistic elements like humour and paralogism. Although we cannot characterize it as a universal feature, since it is not observed in the Russian dataset, it is a language-agnostic cue that seems to be employed across most cultures.
• The same observations hold for the feature #words, with the exception that it is not statistically significant for the OpSpam dataset.
• Regarding the #punctuation feature, it is important for all datasets except Bluff and DeRev.
Since Bluff is a dataset created from transcripts, the transcription process might overshadow the intonation and emotional status of the original speaker with the idiosyncrasies of the transcriber(s); e.g., there are almost no exclamations. Furthermore, the use of punctuation, except in DeRev and Bluff, is an indication of truthful text.

3. An observation of possibly cultural origin is that sentiment-related features, positive or negative, are notably important for the individualist cultures (US and Dutch). The expression of more positive sentiment vocabulary is linked with deceptive texts, while negative sentiment is linked to truthful texts, except in the EnglishUS case, where negative sentiment is related to deceitful texts. For the collectivist cultures, which are more engaged in the high-context communication style, sentiment-related features are not distinguishing. As explained earlier, the effort to restrain sentiment and maintain generally friendly feelings towards others in order to protect social harmony might be responsible for this difference. Our findings contradict Taylor's results and are in agreement with his original hypothesis and related studies like Seiter et al. [2002] (see Section 2.3).

4. Another important finding of our experiments is that in almost all datasets the formulation of sentences in past tense is correlated with truth, while in present tense with deception, independently of the individualism score of the corresponding culture. This can be attributed to the process of recalling information in the case of truthful reviews or opinions. In the case of deception, present tense might be used due to a preference for simpler forms, since the deceiver is in an already overloaded mental state. In the US datasets the only exceptions are the Bluff and OpSpam datasets, where we observe the opposite. However, in the OpSpam dataset these two features are not statistically significant.

5. Furthermore, #modal verbs is important in the US datasets. Specifically, an increased usage of modal verbs usually denotes a deceptive text.

6. Another cross-cultural observation correlated with the degree of individualism is the #spatial words feature. Specifically, for the datasets where this feature is important, we observe a difference in the frequency of spatial details between the deceptive texts in the collectivist datasets and the truthful texts in the individualistic ones. In detail, more spatial features are linked with deception for the Romanian and SpanishMexico datasets, while their frequency is balanced in the case of Dutch, and diverges to truthful texts for the NativeEnglish dataset. These observations are in agreement with Taylor (see Table 3). On top of that, discrepancies in the quantity of spatial details have also been found across modalities [Qin et al., 2005]: deceivers gave significantly fewer spatial details than truth-tellers in audio but more in text. This signifies how sensitive this linguistic cue is, not only across cultures but also when other parameters such as context or modality vary.

7. Regarding #pronouns, our results show mixed indications about their usage that do not fully agree with Taylor. Notice, though, that we had only limited tool functionality for pronoun extraction (i.e., no tools for Dutch and SpanishMexico). As a result, we created our own lists for English and used translations for the other languages. Generally, pronouns in various forms seem to be important in most datasets. Third person pronouns are correlated with deceptive texts mainly in EnglishUS and less so in the Romanian and EnglishIndia datasets, all of which belong to the same cross-cultural opinion deception detection dataset, and with truthful texts in the Boulder dataset (with a rather small difference, though). This is in partial agreement with Taylor's results, where third person pronouns are linked with deception in collectivist languages. Regarding first person pronouns, the observations show mixed results. They are linked with both truthful and deceptive text, in the latter case though only for individualistic datasets (i.e., Bluff and OpSpam). Exploring the use of singular and plural forms sheds a bit more light, since the plural form is linked with truthful text in both collectivist and individualistic cultures, except in Dutch where the plural form slightly prevails in deceptive text. Finally, indefinite and demonstrative pronouns are rarely important.

8. The #nasals feature, which counts the occurrences of /m/, /n/ and in some languages /ŋ/ in texts, is rather important for the highly collectivist SpanishMexico and Romanian datasets. It prevails in truthful texts, while we observe the opposite in the individualistic NativeEnglish. This is an interesting observation that enriches the relevant research around nasals. Generally, various studies (see Table 6) claim a relation between the occurrence of consonants and the emotion of words, based on the physiology of articulation, for various languages. Most of the studies link nasals with sadness and plosives with happiness, although other studies contradict these results (see Table 6). Furthermore, nasals have been connected with different semantic classes like iconic mappings, size and affect, as shown by Schmidtke et al. [2014]. Finally, notice that plosives are not statistically significant in our results. We believe that this is a direction that needs further research with larger datasets and more languages.
9. Finally, the #filled pauses feature, which was incorporated to showcase differences between written and oral deception cues, does not provide any remarkable insight.
A collateral observation is that most of the distinguishing features do not require complex extraction processes, but only surface processing such as counts at the token level.

EnglishUS and EnglishIndia datasets comparison
The EnglishUS and EnglishIndia datasets are ideal candidates for examining individualism-based discrepancies in linguistic deception cues while keeping the language factor fixed. These two datasets are part of the Cross-Cultural Deception dataset (see Section 3.5) and were created using the same methodology. Both contain opinions on the same topics (death penalty, abortion, best friend), in the same language, and come from two cultures with a large difference in terms of the individualism index score (91 vs 48). Initially, to explore differences in the authors' writing competence, we computed the Flesch reading-ease score [Kincaid et al., 1975] on both datasets. The scores are similar (63.0 for the EnglishIndia dataset and 63.6 for the EnglishUS dataset) and correspond to the same grade level. Notice, though, that based on Tables 31 and 32 in the Appendix, the native speakers use longer sentences and more subordinate clauses. A possible explanation is that, since Indians are not native speakers of English, they might lack language expressivity and use English similarly whether they are telling the truth or lying.
A crucial observation, though, is the limited number of statistically significant features in the case of EnglishIndia (only 3) compared to EnglishUS (15). Furthermore, pronoun usage differs considerably between the two datasets. In more detail, the individualist group employs more #1st person pronouns in truthful text, while in the case of EnglishIndia first person pronouns are not important. In the case of #3rd person pronouns, both datasets use a similar amount of pronouns with a similar behaviour. As already mentioned, this might be a difference of cultural origin, since individualist deceivers try to distance themselves from the deceit, while collectivist deceivers aim to distance their group from the deceit. Finally, we notice again the importance of the sentiment cues for the native English speakers and their insignificance in the EnglishIndia dataset, which is consistent with our previous observations. For the remaining features, it is risky to make any concluding statements in relation to cultural discrepancies.

Multiple Logistic Regression (MLR) analysis
To further examine the discriminative ability of the linguistic features and explore their relationship, we conducted a multiple logistic regression (MLR) analysis on the significant features resulting from the Mann-Whitney U test. The null hypothesis is that there is no relationship between the features and the probability of a text being deceptive; in other words, all the coefficients of the features are considered equal to zero for the dependent variable.
Since MLR presupposes uncorrelated independent variables, for each dataset we keep only the most significant feature from any set of correlated or dependent features and manually filter out the rest. For example, we keep only the single most important positive or negative sentiment feature per dataset (e.g., in English, where we use various lexicons). Also, in the case of features that are compositions of more refined features, we keep the most refined ones when all of them are important; e.g., we keep the feature pair #first person pronouns (singular) and #first person pronouns (plural) instead of the more general #first person pronouns. Overall, we cannot guarantee that there is no correlation between the features.
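A minimal sketch of this MLR step with statsmodels is given below; the synthetic matrix stands in for the retained features of one dataset.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
# Rows: documents; columns: the retained (filtered) significant features.
X = rng.normal(size=(200, 3))
# Synthetic dependent variable: 1 = deceptive, 0 = truthful.
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=200) > 0).astype(int)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(model.summary())  # coefficients, standard errors, Wald z, p-values
```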
In Tables 11 and 12 we present the results of the MLR analysis, reporting the features with p-value < 0.1, for the native English and the cross-language cases respectively. For each feature in the table we report the corresponding coefficient, the standard error, the z-statistic (Wald z-statistic), and the p-value. Higher coefficient values increase the odds of having a deceptive text in the presence of this specific feature, while lower values increase the odds of having a truthful text. The Wald statistic (or z-value) is the regression coefficient divided by its standard error; a larger magnitude (i.e., strongly positive or strongly negative) indicates that the corresponding regression coefficient is not 0 and that the corresponding feature matters. Generally, no features participate in all functions, either within the native English datasets or across different languages and cultures, an indication of how distinct the feature sets are both within and across cultures. Across different languages, it is difficult to conclude how the characteristics of each language (e.g., pronoun-drop languages) and/or the different extraction processes (e.g., sentiment lexicons) affect the analysis; a more thorough analysis is safer within the context of a single language. Below, we report some observations from this analysis.

Native English dataset observations
For the native English language datasets shown in Table 11, we observe high coefficients for the various types of pronouns (especially the first person ones). Although there is no clear indication about their direction, most of the time they are associated with truthful text. The only exceptions are the #1st person pronouns (singular) in the case of OpSpam, the #demonstrative pronouns in the case of Boulder, and the #3rd person pronouns in the case of EnglishUS. Additionally, we can observe the importance of sentiment, as already noted in the statistical analysis, especially of positive sentiment as captured by MPQA, which strongly discriminates deceptive texts in the OpSpam and Boulder datasets. Finally, the #punctuation marks feature is correlated with truthful text in many datasets, although with a lower coefficient.
Notice that in the results there are a number of features with high coefficients that appear to be strong only in one dataset, e.g., the #boosters, #nasals and #hedges features that are extremely distinguishing only in the OpSpam collection. This observation indicates differences and variations among various datasets/domains, in accordance with previous considerations in the literature on how some features can capture the idiosyncrasies of a whole domain or only of a particular use case. In the case of the OpSpam, such features might be representative of the online reviews domain or might reflect how mechanical turkers fabricate sanctioned lies [Mukherjee et al., 2013b].
Regarding the #spatial details feature, for which we made some interesting observations in the previous statistical analysis, we observe that it is important for discriminating truthful text only in OpSpam. The observation that fake reviews in OpSpam include less spatial language has already been pointed out by Ott et al. [2011, 2013] for reviews with both positive and negative sentiment. This is not the case in situations where owners bribe customers in return for positive reviews, or when owners ask their employees to write reviews [Li et al., 2014].
Finally, the #lemmas and #words features were not found to be important in this analysis. The same holds for the tenses used in most datasets, with no clear direction.

Per culture and cross-cultural observations

Table 12 reports the per culture and cross-cultural observations. Although the resulting feature sets are quite distinct, we observe some similarities around the usage of pronouns. Again, pronouns have very large coefficients in most datasets; the usage of #1st person pronouns is correlated with truthful text for the individualistic native English and collectivist Romanian speakers, while the usage of #3rd person pronouns is correlated with deception in the collectivist EnglishIndia and Romanian datasets. Positive sentiment, as already discussed, prevails in deceptive text for native English speakers, while sentiment features do not play any major role in the other cultures. Additionally, #lemmas and #words do not seem to discriminate between the different classes of text, and the usage of tenses plays a mixed and non-significant role. #nasals appears to correlate with deceptive text for native English speakers, while it is the most discriminative feature for truthful text in Spanish. For the EnglishIndia dataset, by far the most distinguishing feature is #negations. This finding agrees with the relevant bibliography in relation to the significance of negations in South Asian languages (see also Section 2.3). A final observation is the absence of features correlated with truthful text in the similarly created EnglishIndia dataset, and with deceptive text in the SpanishMexico dataset.

Classification
In this section, we evaluate the predictive performance of different feature sets and approaches for the deception detection task. First we present and discuss the results of logistic regression, then the results of fine-tuning a neural network approach based on the state-of-the-art BERT model, and finally we provide a comparison with other related works. As a general principle, and given the plethora of different types of neural networks and machine learning algorithms in general, this work does not focus on optimizing the performance of the machine learning algorithms for the specific datasets. Our focus is to explore, given the limited size of the training data, which are the most discriminative types of features in each domain and language, and, subsequently, whether combining features is beneficial to the deception detection task.
We split the datasets into training, testing, and validation subsets with a 70-20-10 ratio. We report the results on the test sets, while the validation subsets were used for tuning the hyper-parameters of the algorithms. In all cases, we report Recall, Precision, F-measure and Accuracy, calculated according to the following definitions:

$$\mathrm{Recall\ (Rec.)} = \frac{tp}{tp+fn} \qquad \mathrm{Precision\ (Prec.)} = \frac{tp}{tp+fp}$$

$$\mathrm{F\text{-}measure\ (F1)} = \frac{2 \cdot \mathrm{Prec.} \cdot \mathrm{Rec.}}{\mathrm{Prec.} + \mathrm{Rec.}} \qquad \mathrm{Accuracy\ (Accu.)} = \frac{tp+tn}{tp+tn+fp+fn}$$

where a true positive (tp) or true negative (tn) occurs when the model correctly predicts the positive or the negative class respectively, while a false positive (fp) or false negative (fn) occurs when the model incorrectly predicts the positive or the negative class respectively.
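For completeness, a small helper computing these measures from confusion-matrix counts (the counts in the example are toy values):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four reported measures from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

print(classification_metrics(tp=80, tn=70, fp=20, fn=30))
```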

Logistic regression experiments
Logistic regression has been widely applied in numerous NLP tasks, among them deception detection from text [Fuller et al., 2009, Popoola, 2017]. We experimented with several logistic regression models, including one based on linguistic features (i.e., linguistic), various n-gram features (phoneme-gram, character-gram, word-gram, POS-gram, and syntactic-gram), and the linguistic+ model, which represents the most performant model that combines linguistic features with any of the n-gram features. For our experiments we used two implementations of logistic regression from Weka [Hall et al., 2009]: simple logistic [Landwehr et al., 2005, Sumner et al., 2005] and logistic [Le Cessie and Van Houwelingen, 1992]. The simple logistic implementation has a built-in attribute selection mechanism based on LogitBoost [Friedman et al., 2000], while logistic aims to fit a model that uses all attributes. In all cases, we have two mutually exclusive classes (deceptive, truthful) and we use a classification threshold of 0.5. In the case of n-grams, a preprocessing step selects the 1000 most frequent n-gram features, while when attribute selection is enabled the CfsSubsetEval evaluator of Weka is used, which estimates the predictive power of subsets of features.
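Our experiments used Weka's implementations; as a rough analogue in another toolkit, a scikit-learn sketch of a logistic regression over the 1000 most frequent word n-grams might look as follows (toy data; the 0.5 threshold is the default of predict):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["The room was amazing", "We told the staff the truth"]
train_labels = [1, 0]  # toy labels: 1 = deceptive, 0 = truthful

# Keep the 1000 most frequent word unigrams, mirroring our preprocessing.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 1), max_features=1000),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["The staff was amazing and the luxury unreal"]))
```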
In the following tables (Tables 13-23) we present the logistic regression results. We group the native English datasets and look for differences among them, since they are written in the same language and we assume the same culture for the authors (see Section 6.1.1 and Tables 13-17). We then proceed with cross-domain experiments for the native English datasets, by iteratively keeping each native English dataset as the testing set and using the rest as the training set (see Section 6.1.2 and Table 20). Lastly, in Section 6.1.3 we present cross-culture experiments. We report only the best-performing experimental setup on the test set, based on the accuracy value for each feature type. The measures Precision, Recall and F1 refer to the deceptive class, and in all cases we report a majority baseline that classifies all instances into the most frequent class. We also report the AUC (area under the ROC curve) measure [Hanley and Mcneil, 1982]. The AUC value is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance; consequently, the closer the AUC is to 1, the better the performance of the classifier [Ling et al., 2003]. The description of the experimental setup uses the following notation: (a, b): all n-grams of size in [a, b], with a ≤ b and a, b ∈ [1, 3] (e.g., (1,2) denotes all unigrams and bigrams); stem: word stemming; attrsel: attribute selection.

Native English dataset experiments
Tables 13-15 present the results for the US datasets that concern the online reviews domain (i.e., the OpSpam, DeRev, and Boulder datasets). Each dataset consists of reviews about a particular product category or service, with the exception of the Boulder dataset, which covers the wider domain of hotels and electronic products. In the OpSpam dataset (see Section 3.1) the best performance is achieved with the combination of linguistic cues with the word-gram (unigram) configuration (86 per cent accuracy). The other configurations, although not as performant, managed to surpass the majority baseline (see Table 13). Additionally, the second best performance of the word unigram approach showcases the importance of the textual word content in this collection.
In the DeRev dataset (see Section 3.3), the word unigram configuration offers exceptional performance (an accuracy of 100 per cent). The remaining configurations achieve much lower performance. However, as in the case of the OpSpam dataset, the performance of all configurations is much better than the majority baseline. Since we were puzzled by the perfect value in all measures for the unigram configuration, we ran some additional experiments. Our results show that in this specific dataset there are words that appear only in one class. For example, the word "Stephen" is connected only with the truthful class, and the words "thriller", "Marshall", "faith", and "Alan" only with the deceptive class. After thoroughly checking how this collection was created, we found that the above observation is a result of how this dataset was constructed. Specifically, the authors used different items (i.e., books) for the deceptive and truthful cases, and as a result the classifiers learn to identify the different items. To this end, the performance of the linguistic, POS-gram and syntactic-gram configurations is more representative for this dataset, since they are more resilient to this issue.
The Boulder dataset is quite challenging, since it includes two domains under the generic genre of online reviews (hotels and electronics) and two types of deception, i.e., lies and fabrications (see Section 3.2). Given the above, we observe that the performance of all classifiers is much lower and close to the majority baseline (as shown in Table 15). The best accuracy is provided by the POS (bigrams+trigrams) configuration, which achieves 73 per cent, followed closely by the rest. Notice also the poor AUC values, which is an important observation since the dataset is not balanced.

The results for the EnglishUS dataset, which is based on deceptive and truthful essays about opinions and feelings (see Section 3.5), are presented in Table 16. In this dataset, the linguistic model offers the best performance (71 per cent accuracy). The combination of linguistic cues with word unigrams, the POS-gram (unigrams) and the phoneme-gram (unigrams) configurations provide lower but relatively close performance.

Lastly, Table 17 contains the results for the Bluff dataset, which is the only dataset that originates from spoken data and is multi-domain (see Section 3.4). All the configurations are equal to or below the majority baseline, which is 69 per cent. Notice that this is a small, unbalanced dataset, with most configurations having a low AUC performance. The inclusion of features that elicit humorous patterns could possibly improve the performance of the classifiers, since an integral characteristic of this dataset is humour, a feature that we do not examine in this work.
Tables 18 and 19 present the top ten features in terms of their estimate value for each class, for the configuration with the best performance. We observe that in single-domain datasets the content, in the form of word-grams, is prevalent and implicitly expresses deceptive patterns. This is the case for the OpSpam and DeRev datasets. For example, spatial details and verbs in past tense (i.e., told, renovated, updated, based, returned) are associated with the truthful class, while positive words (e.g., amazing, luxury, intriguing) are related to the deceptive class. In the remaining datasets, which consist of different topics (i.e., two in Boulder, three in EnglishUS and multiple in Bluff), the best performance is achieved with the use of linguistic cues, more abstract types of n-grams such as POS-grams, or the combination of linguistic cues with n-grams. We also observe the existence of the feature "priceline" in the OpSpam list. This refers to one of the sites from which the truthful reviews were collected (e.g., Yelp, Priceline, TripAdvisor, Expedia etc.). However, since this resembles the problem in the DeRev dataset, where particular features are mostly associated with one class, we verified that only a very small percentage of truthful reviews contain such a reference. As a closing remark, we would like to highlight the rather stable performance of the linguistic models in all datasets (except perhaps the Bluff dataset, where the performance of all models is hindered). As a result, the linguistic cues can be considered valuable information for such classification models, and in many cases they can provide complementary information and improve the performance of other content or non-content based models.

Cross dataset experiments for US datasets
In this part, we examine the performance of the classifiers when they are trained on different datasets than those on which they are evaluated. In more detail, we used every native English dataset once as a testing set for evaluating a model trained over the rest of the native English datasets. The setting of these experiments results in highly heterogeneous datasets, not only in terms of theme but also in terms of the collection processes, the type of text (e.g., review, essay), the deception type etc. These discrepancies seem to be reflected in the results (see Table 20). Overall, the results show that the increased training size, with instances that are connected with the notion of deception but in a different context, and without sharing many other properties, is not beneficial for the task. Note also that the configuration of these experiments results in unbalanced training and testing sets, so the comparison is fairly demanding.

The performance of the linguistic-only setting has an average accuracy of 50 per cent. These results show that there is no overlap between the distinguishing features across the datasets that could lead to an effective feature set, as already revealed in the MLR analysis (see Table 12). In the case of the All-Bluff dataset, the linguistic-cues-only configuration has the lowest accuracy of 33 per cent, which is quite below random chance. After closer inspection, we observed that the classifier identifies only the truthful texts (recall that the Bluff dataset has a 2:1 ratio in favour of the deceptive class). This could be explained by the reversed direction of important features, such as #words, compared to the rest of the datasets. Moreover, there are features that are statistically significant only in this dataset and not in the training collection, e.g., the negative sentiment FBS, and vice versa, e.g., #demonstrative pronouns. The interested reader can find the details in Table 31 in the Appendix and in Table 11. Similarly, the n-gram configurations are close to randomness in most of the cases. However, topic relatedness seems to have a small positive impact on the results for the All-OpSpam and All-Boulder datasets, as expected, since the Boulder dataset contains hotel and electronics reviews, and the OpSpam dataset also concerns hotels. The high recall values for the deceptive class in some of the classifiers reflect the low coverage and the differences between the datasets.
The POS-gram and syntactic-gram settings, which are less content dependent, fail to detect morphological and syntactic patterns of deception across the datasets. This could be attributed to the fact that such n-gram patterns might not be discriminating across different datasets, and to the fact that such types of n-grams can be implicitly influenced by the unrelated content. Overall, as future work, we plan to remove strongly domain-specific attributes from the feature space, so that the training model relies more on function words and content-independent notions. In this direction, a hint of a possible improvement is given in Tables 21 and 22, where the most performant models include function words, auxiliary verbs and so on.

Cross-culture experiments

The results of the per culture experiments are presented in Table 23. We also measured the accuracy of pairs of the available n-gram feature types, to check if different types of n-grams can provide different signals of deception. The results show only minor improvements for some languages. We provide the results in the Appendix (see Table 33).
Generally, and despite the fact that it is safer to examine results on a per dataset basis, it is evident that the word-gram and phoneme-gram setups prevail over the rest of the setups. Even when the best accuracy is achieved through a combination of feature types, word and phoneme n-grams are part of the combination. This is the case for the native English and Romanian datasets (see Tables 23 and 33 respectively). Overall, for all the examined datasets the classifiers surpass the baseline by a wide margin.
The most perplexing result was the performance of the linguistic cues in the EnglishIndia and EnglishUS datasets (results presented in Tables 16 and 23), which are part of the Cross-Cultural Deception dataset (see Section 3.5). These datasets have similar sizes, cover the same domains, and were created through an almost identical process. However, we observe that while the feature sets of EnglishUS achieve an accuracy of 71 per cent, the accuracy drops to 54 per cent for EnglishIndia. This is surprising, especially for same-genre datasets that use the same language. To ensure that this difference is not a product of the somewhat poor quality of text in the EnglishIndia dataset (due to orthographic problems), we made corrections in both datasets and repeated the experiments. However, since the differences in the results were minor, it is difficult to identify the cause of this behaviour. One hypothesis is that this difference in the performance of the feature sets may be attributed to the different expression of deception between these two cultures, given that almost all other factors are stable. The second hypothesis is that, since most Indians are non-native speakers of English, they use the language in the same way whether they are truthful or deceptive. This hypothesis is also supported by the fact that there are very few statistically significant features for EnglishIndia, e.g., #negations and #3rd person pronouns. As a result, the classifiers cannot separate the two classes and exhibit behaviour closer to randomness. Notice that we might be observing effects of both hypotheses, since #3rd person pronouns are also important while deceiving for the collectivist Romanian culture. Lastly, to gain further insight into the above results, we present the most valuable features of the configuration that achieved the best accuracy in the logistic regression experiments for all the examined datasets (see Tables 24 and 25). The features are listed by decreasing estimate value. Most of the cases include morphological and semantic information that has been explicitly defined in the linguistic cues (e.g., the use of pronouns as in "my room", tenses, spatial details, polarized words etc.). As a result, the combination of such n-gram features with linguistic cues does not work in synergy. Moreover, notice the contribution of two features to discriminating deception in SpanishMexico: the bigram "mi mejor" and the word "en", both attributed to the deceptive class. A similar behaviour, with a small resulting feature set, is also evident in the Russian dataset.

Discussion on features
Among all the variations of n-grams tested in this work, word n-grams achieve the best results across almost all the datasets. The results for the other types of n-grams seem to be a little lower and to fluctuate on a per dataset basis. More content-based n-gram types, such as character-grams and phoneme-grams, have adequate performance, while the variations that carry more abstract and generalized linguistic information, such as POS n-grams and syntactic n-grams, achieve lower performance. However, POS-grams seem to perform quite a bit better than syntactic n-grams. The difference in accuracy decreases in cross-domain experiments, in which semantic information is more diverse and, as already discussed, linguistic indications of deception change from one domain to another. Lastly, stemming, stopword removal and lowercase conversion are generally beneficial, so this is a preprocessing step that should be examined.

The experimental results show that the discriminative power of linguistic markers of deception is well above the random baseline and the expected human performance (according to the literature, slightly better than chance; see Section 2), especially in single-domain scenarios (see Tables 13 to 17). On the other hand, linguistic markers of deception struggle in cross-domain settings (see Section 6.1.2). This confirms that linguistic markers of deception vary considerably and are extremely sensitive even within the same culture, let alone across different cultures (see Table 23). Different domains, individual differences, and even the way the texts were collected seem to influence the behaviour of linguistic markers, and indicate how complex the deception detection task is. In the native English case, where the employed feature set is richer and the linguistic markers are in general better studied, we observe better results. This might signal that there are opportunities for enhancement.
Lastly, the combination of linguistic features with n-gram variations does not enhance performance in a decisive way in most of our experiments. N-grams, most often word-grams or phoneme-grams, can indirectly capture information that has been explicitly encoded in the linguistic cues. However, there are cases where this combination improves the performance of the classifier. In such cases the resulting feature space succeeds in blending content with the most valuable linguistic markers.

BERT experiments
In these experiments, we use BERT [Devlin et al., 2019] with a task-specific linear classification layer on top, using the sigmoid activation function, as an alternative to the logistic regression classifiers of the previous experiments.
As already discussed in Section 4.4, BERT is pretrained on a very large unlabelled corpus. Here it is further trained ('fine-tuned') jointly with the task-specific classifier on the deception detection datasets, to learn to predict if a text is deceptive or not. BERT produces context-aware embeddings for the tokens of the input text, and also an embedding for a special classification token ([CLS]), intended to represent the content of the entire input text. Here the input to the task-specific linear classifier is the embedding of the [CLS] token. We do not 'freeze' any BERT layers during fine-tuning, i.e., the weights of all the neural layers of BERT are updated when fine-tuning on the deception detection datasets, which is the approach that typically produces the best results in most NLP tasks. We use categorical cross entropy as the loss function during fine-tuning, and AdamW as the optimizer [Loshchilov and Hutter, 2019]. Finally, we exploit monolingual BERT models for each language (see Table 10), as well as the multilingual multiBERT model. The BERT limitation of processing texts of up to 512 wordpieces does not affect us, since the average length of the input texts in our experiments is below this boundary (see Table 4). However, due to batching and GPU memory restrictions, the upper bound of the text length used was 200 wordpieces, so there is some loss of information due to text truncation, though it is limited overall. More specifically, the truncation affects 5.6 per cent of the total number of texts of all the datasets used in our experiments (506 texts out of a total of 8,971). The effect of truncation is more severe in the Bluff, OpSpam and Russian datasets, where 41 per cent (109 out of 267), 21 per cent (332 out of 1,600) and 29 per cent (65 out of 226) of the texts were truncated, respectively; the average text length of these three datasets is 190, 148 and 160 wordpieces, respectively. In the other datasets, the percentage of truncated texts was much smaller (10 per cent or lower). We note that valuable signals may be lost when truncating long texts, and this is a limitation of our BERT experiments, especially those on Bluff and OpSpam, where truncation was more frequent. For example, truthful texts may be longer, and truncating them may hide this signal, or vice versa. Deceptive parts of long documents may also be lost when truncating. In such cases, models capable of processing longer texts can be considered, such as hierarchical RNNs [Chalkidis et al., 2019, Jain et al., 2019] or multi-instance learning as in [Jain et al., 2019]. No truncation was necessary in our logistic regression experiments, but long texts may still be a problem, at least in principle. For example, if only a few small parts of a long document are deceptive, features that average over the entire text of the document may not capture the deceptive parts. We leave a fuller investigation of this issue for future work.
In addition, we combined BERT with the linguistic features. To this end, we concatenate the embedding of the [CLS] token with the linguistic features and pass the resulting vector to the task-specific classifier. In this case, the classifier is a multilayer perceptron with one hidden layer consisting of 128 neurons with ReLU activations. The MLP also includes layer normalization [Ba et al., 2016] and a dropout layer [Srivastava et al., 2014] to avoid overfitting. Hyperparameters were tuned by randomly sampling 60 combinations of values and keeping the combination that gave the minimum validation loss. Early stopping on the validation loss with patience 4 was used to adjust the number of epochs (the maximum number of epochs was set to 20). The tuned hyperparameters were the following: learning rate (1e-5, 1.5e-5, 2e-5, 2.5e-5, 3e-5, 3.5e-5, 4e-5), batch size (16, 32), dropout rate (0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45), max token length (125, 150, 175, 200, and the average training text length in tokens), and the randomness seeds (12, 42, and a random number between 1 and 100).
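A sketch of this combined architecture in PyTorch is shown below; the layer sizes follow the description above, while the exact ordering of normalization and dropout inside the MLP is our assumption.

```python
import torch
import torch.nn as nn

class BertWithLinguistic(nn.Module):
    """[CLS] embedding concatenated with linguistic features, classified
    by an MLP with one hidden layer of 128 ReLU units."""

    def __init__(self, bert, num_linguistic, hidden=128, dropout=0.1):
        super().__init__()
        self.bert = bert  # a base HuggingFace BERT model (e.g., AutoModel)
        dim = bert.config.hidden_size + num_linguistic
        self.head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 2),  # deceptive vs. truthful
        )

    def forward(self, input_ids, attention_mask, linguistic):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] embedding of the top layer
        return self.head(torch.cat([cls, linguistic], dim=-1))
```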
Tables 26 and 27 present the results of these experiments. The former presents the results for each native English dataset, while the latter for the cross-culture datasets. For the US culture we used the unified NativeEnglish dataset. We explored both the BERT model alone and the BERT model augmented with the whole list of linguistic cues of deception studied in this work. For the native English cases we used the BERT model for the English language, while for the per culture experiments we experimented with both the monolingual models and the multilingual version of BERT. The dataset subscript declares the experimental setup, e.g., bert+linguistic,en uses the English-language BERT model along with the linguistic cues. Exactly what types of linguistic or world knowledge BERT-like models manage to capture (or not), and the extent to which they actually rely on each type of captured knowledge, is the topic of much current research [Rogers et al., 2020]. It has been reported that the layers of BERT probably capture different types of linguistic information: surface features at the bottom, syntactic features in the middle and semantic features at the top [Jawahar et al., 2019]. Fine-tuning seems to allow retaining the types of information most relevant to the end-task, in our case deception detection.

Overall, the experiments show similar, and in some cases improved, results compared to the logistic regression ones and the available related work (see Section 6.3). As shown in Table 26, this is the case for the OpSpam, Boulder, and EnglishUS datasets, while the performance drops a bit in the case of the DeRev dataset for the plain BERT model (the excellent 98 per cent accuracy drops to 94 per cent for the plain BERT model, rising again to 96 per cent when BERT is combined with the linguistic features). An interesting point is that for the Bluff dataset the plain BERT model offers better performance than the logistic classifier (83 per cent accuracy compared to 75 per cent), which drops to 77 per cent when combined with the linguistic features. This is the only case where the addition of the linguistic features lowers the performance of the classifier. The reason might be that the plain BERT model manages to capture humour, which is an intrinsic feature of this dataset and one not captured by the linguistic features.
Regarding the per culture datasets shown in Table 27, and compared to the logistic regression experiments, there are clear gains in the accuracy of most of the models for the NativeEnglish, EnglishIndia and CLiPS datasets. However, this is not the case for the SpanishMexico and Russian datasets. Especially in the case of the peculiar Russian dataset, out of the four experimental setups only the BERT-alone setup with the dedicated Russian BERT model slightly surpassed the random baseline of 50 per cent. Recall that similarly low performance is evident not only in our logistic regression experiments but also in the related work. The low performance in the case of BERT, where there are no feature extraction steps that could propagate misfires of the tools used or problematic handling on our side, indicates that this is an intrinsically problematic collection.
A rather important finding is the contribution of the linguistic features. The addition of the linguistic features to the BERT models leads to better performance in many of the experiments, as in the case of the EnglishUS, DeRev, NativeEnglish, SpanishMexico, Romanian and EnglishIndia datasets. This showcases their importance compared to the corresponding logistic regression experiments, where the linguistic cues improved the n-gram approaches only in the case of the SpanishMexico and Russian datasets. The linguistic features seem to work better when combined with the BERT classifier, which might be the result of the model learning non-linear combinations of the features. As already mentioned, in the case of the DeRev dataset the addition of the linguistic cues greatly improves the performance of the classifier, leading to almost perfect performance. Even though we have not run explicit experiments to identify which linguistic cues are helpful in the case of the BERT models, we can speculate that they are the phoneme-related features, e.g., #fricatives, #plosives, #nasals, and the punctuation feature. These are significant features which either the BERT models cannot capture, or whose explicit counts are more effective to exploit (see Tables 31 and 32 in the Appendix).
Table 28 provides a comparison between the monolingual BERT models and multiBERT. In particular, the monolingual BERT models seem to perform better, except in Dutch and Romanian. Despite the lower performance of the multiBERT model, the difference is not prohibitive. Given this adequate performance, we proceed with cross-language experiments based on multiBERT. The idea of these experiments is to fine-tune a BERT model on one language and test its performance on another, trying to exploit similarities in the morphological, semantic and syntactic information encoded in BERT layers across cultures. Our main focus is on cultures that are close in terms of the individualism dimension, and thus could possibly share similar deceptive patterns that BERT can recognize. We are also interested in cross-cultural experiments to evaluate to what extent BERT can distinguish between deceptive and truthful texts in a crosslingual setting. Finally, we have also added the EnglishUS dataset to experiment with same-domain, similarly collected, but cross-language datasets (i.e., Romanian, SpanishMexico, EnglishIndia and EnglishUS). We also performed experiments with the NativeEnglish collection minus EnglishUS, to explore the effectiveness of a large training dataset on a different domain (EnglishUS) and on different cultures (Romanian, SpanishMexico, EnglishIndia). For each experiment we trained a model on 80 per cent of a language-specific dataset, validated it on the remaining 20 per cent of the same dataset, and then tested its performance on the other datasets. Notice that these experiments are not applicable to the NativeEnglish and EnglishUS pair, since the former is a superset of the latter.
For most of the experiments the results are close to randomness. For example, this is the case when the Russian and Dutch (CLiPS) datasets are used either as testing or training sets with any other language, and when the combined NativeEnglish dataset is used for testing with models trained on any other language. For the Russian language this is quite expected, given the performance in the monolingual experiments. However, the situation is different for the Dutch dataset: the fine-tuned BERT model manages to distinguish between deceptive and truthful texts in the monolingual setting, but when multiBERT is trained on the Dutch dataset it does not perform well on the other datasets.
The Romanian, SpanishMexico, EnglishUS and EnglishIndia datasets, which are part of the Cross-Cultural Deception dataset (see Section 3.5), show a different behaviour. A model trained on one of these datasets offers an accuracy between 60 and 70 per cent on the others using multiBERT, with SpanishMexico exhibiting the best performance when it is used as the testing set for the EnglishUS-trained model. This indicates that the domain is an important factor that alleviates discrepancies in terms of culture and language in the crosslingual multiBERT setting. A reasonable explanation might be vocabulary memorization or lexical overlap, which occurs when word pieces seen during fine-tuning are also present in the language of the testing set. However, according to Pires et al. [2019], multiBERT also has the ability to learn deeper multilingual representations.
Another important observation concerns the performance whenever NativeEnglish is used as the training set. The domain similarity is rather small in this case, since NativeEnglish is a largely diverse dataset. The results show that multiBERT can possibly reveal connections in a zero-shot transfer learning setting when the training set is sufficiently large. This has also been observed in other tasks, like the multilingual and multicultural irony detection work of Ghanem et al. [2020]; in that case, instead of the multiBERT model, the authors applied an unsupervised alignment of monolingual word embedding spaces. Zero-shot transfer learning for specific tasks based on multiBERT is also the focus of other recent approaches [Pires et al., 2019, Libovický et al., 2019] that show promising results.
Removing the EnglishUS dataset from the NativeEnglish dataset considerably reduces the performance on the Romanian, SpanishMexico and EnglishIndia datasets, showcasing the importance of domain even for cross-lingual datasets. Notice though that for the SpanishMexico and Romanian datasets the performance is greater than that of a random classifier, indicating cues of the zero-shot transfer connection hypothesis, at least for these datasets. On the other hand, the random-level performance for the EnglishUS and EnglishIndia datasets, which share the language of the trained model and additionally belong to the same domain as the SpanishMexico and Romanian datasets, showcases that it is difficult to generalize.

Table 30 provides an overall comparison between our best experimental setup and results with those presented in other studies on the same corpora. The comparison was based on the accuracy scores reported in those studies. In addition, we report human accuracy whenever it is available. For comparison purposes, we set a p-value threshold of 0.01 and performed a one-tailed z-test evaluating whether the differences between two proportions are statistically significant. By comparing absolute numbers only, the comparison is not straightforward and cannot easily lead to conclusions, since the studies employed different model validation techniques and set different research goals.
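The two-proportion z-test can be illustrated with statsmodels; the counts below are toy values, not results from Table 30.

```python
from statsmodels.stats.proportion import proportions_ztest

# Toy example: 86/100 vs. 74/100 correct predictions on the same test set.
stat, p = proportions_ztest(count=[86, 74], nobs=[100, 100],
                            alternative="larger")  # one-tailed
print(f"z={stat:.2f}, p={p:.4f}, significant at 0.01: {p < 0.01}")
```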
To the best of our knowledge, the only computational work that addresses cross-cultural deception detection is the one that introduced the Cross-Cultural Deception dataset (see Section 3.5), which we also used in this work. In that work, the authors build separate deception classifiers for each examined culture, and report a performance ranging between 60 and 70 per cent. They then build cross-cultural classifiers using two alternative approaches: the first through the translation of unigrams, and the second by using equivalent LIWC semantic categories for each language. Both approaches resulted in lower performance. The treatment is different from ours, since each sub-domain dataset (death penalty, abortion, best friend) is examined separately. However, since average scores are also reported, we compare this work against those scores. In addition, since the EnglishUS dataset has been extensively used in other works in the same way, we also report the average accuracy for these cases.
The comparison in Table 30 shows that BERT outperforms the other approaches in most of the cases. BERT's performance is mostly surpassed in the relatively smaller datasets, indicating the need for fine-tuning BERT over a large number of training samples. In particular, BERT achieves state-of-the-art performance for the OpSpam dataset, which is the gold standard for opinion spam detection. In addition, for the CLiPS dataset, the BERT model outperforms the other models studied in this work, as well as another unigram approach in the bibliography [Verhoeven and Daelemans, 2014]. For the Cross-Cultural Deception dataset (see Section 3.5), BERT outperforms other approaches that are based on feature engineering for the Romanian and EnglishIndia datasets. In the case of the SpanishMexico dataset the combination of linguistic cues with word n-grams seems to have a strong discriminative power, and in the EnglishUS dataset the combination of latent Dirichlet allocation (LDA) topics with a word-space model achieves the highest accuracy. Lastly, in comparison with human judgments, for the two datasets for which we have numbers (i.e., OpSpam and Bluff), the automatic detection approaches significantly outperform human performance with respect to the accuracy measure. The recoverable entries of Table 30 (dataset; approach; accuracy; statistically significant difference from our best setup) are the following:

• OpSpam (1,600 samples): [Kennedy et al., 2019] 0.90, no; RCNN 0.88, no; Psycholinguistic+word bigrams [Ott et al., 2011, 2013] 0.87, no; Human performance [Ott et al., 2011, 2013] 0.59, yes.
• DeRev (236 samples): Word n-gram (ours) 1.00; LDA+word space model [Hernández-Castañeda et al., 2017] 0.95, yes; Various numeric features, e.g., length of reviews, frequency of n-grams etc. [Fornaciari and Poesio, 2014] 0.76, yes.
• EnglishUS (600 samples): BERT+Linguistic (ours) 0.76; Character 5-grams [Sánchez-Junquera et al., 2018] 0.73, no; LDA+word space model [Hernández-Castañeda et al., 2017] 0.85, yes; LIWC 0.69, yes; Syntax+words [Feng et al., 2012] 0.78, no; Words [Mihalcea and Strapparava, 2009] 0.66, yes.
• Russian (226 samples): Word n-gram (ours) 0.64; POS tags+POS tag bigram features [Pisarevskaya et al., 2017] 0.57, no; Rocchio classification 0.68, no.
• SpanishMexico (346 samples): Phoneme-gram (ours) 0.74; LIWC 0.68, no.
• Romanian (870 samples): BERT+Linguistic (ours) 0.71; LIWC 0.64, yes.

Conclusions
This study explores the task of automated text-based deception detection across cultures, taking into consideration cultural and language factors, as well as the limitations of NLP tools and resources for the examined cases. Our aim is to add a larger-scale computational approach to a series of recent interdisciplinary works that examine the connection between culture and deceptive language. Culture is a factor usually ignored in automatic deception detection approaches, which simplistically assume the same deception patterns across cultures. To this end, we experimented with datasets representing six cultures, using countries as culture proxies (United States, Belgium, India, Russia, Mexico and Romania), written in five languages (English, Dutch, Russian, Spanish and Romanian). The datasets cover diverse genres, ranging from reviews of products and services, to opinions in the form of short essays, and even transcripts from a radio game show. To the best of our knowledge, this is the first effort to examine, in parallel and in a computational manner, multiple and diverse cultures for the highly demanding task of deception detection in text.
We aimed at exploring to what extent conclusions drawn in the field of social psychology about the connection between deception and culture can be confirmed in our study. The basic notion demonstrated by these studies is that specific linguistic cues to deception do not appear consistently across all cultures, e.g., they change direction, or are milder or stronger between truthful and deceptive texts. Our main focus was to investigate whether these differences can be attributed to differences in cultural norms, and especially to the individualism/collectivism divide. The most closely related work is that of Taylor and colleagues [Taylor et al., 2014, 2017] from the field of social psychology, which studies the above considerations for four linguistic cues of deception, namely negations, positive affect, pronoun usage and spatial details, in texts from individualistic and collectivist cultures. Taking Taylor's work as a starting point, we performed a study with similar objectives over a larger feature set that we created, which also covers the aforementioned cues.
The outcome of our statistical analysis demonstrates that there are indeed considerable differences in the usage of pronouns between cultural groups. In accordance with Taylor's work, people from individualistic cultures employ more third-person and fewer first-person pronouns to distance themselves from the deceit when they are deceptive, whereas in the collectivist group this trend is milder, signalling the effort of the deceiver to distance the group from the deceit. Regarding the expression of sentiment in deceptive language across cultures, the original work of Taylor hypothesized that different cultures use sentiment differently while deceiving, a hypothesis that was not supported by the results of that research. The basis for this hypothesis is the observation that in high-context languages, which are associated with collectivist cultures, people tend to restrain their sentiment. Our experiments support Taylor's original hypothesis, since we observe an increased usage of positive language in deceptive texts for individualistic cultures (mostly in the US datasets), which is not observed in the more collectivist cultures. In fact, by examining the statistically significant features and the feature sets resulting from the MLR analysis, we notice that there are generally fewer discriminating deception cues in the high-context cultures. This might be attributed to the fact that the bibliography overwhelmingly focuses on individualistic cultures, and to a lesser degree on collectivist ones, leading to a smaller variation in the deceptive cues captured for the latter. It might also indicate that during deception, high-context cultures use other communication channels on top of the verbal ones, a hypothesis that needs further research. Moreover, in affirmation of the above considerations, we observed that the strongly distinguishing features differ per culture. The most characteristic examples are the #negations feature for the EnglishIndia dataset and the phoneme-related features for the SpanishMexico and Romanian datasets (#nasals and #fricatives). Both types of features have been related to the implicit expression of sentiment in previous studies. However, a more thorough analysis is needed in order for such observations to be understood and generalized to other cultures. In relation to differences in spatial details, we found that in the cross-cultural deception task the collectivist groups used a richer spatial-details vocabulary when deceiving, whereas the exact opposite holds for the individualist groups, who used more spatial details while being truthful. This result is in accordance with Taylor's work.
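To make the pronoun analysis concrete, the sketch below extracts first- and third-person pronoun rates and compares their distributions between deceptive and truthful texts. The pronoun lists, the toy corpus and the choice of a Mann-Whitney U test are illustrative simplifications of our MLR-based analysis, not a reproduction of it.

```python
from scipy.stats import mannwhitneyu

FIRST_PERSON = {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours"}
THIRD_PERSON = {"he", "him", "his", "she", "her", "hers",
                "they", "them", "their", "theirs"}


def pronoun_rates(text):
    """Fraction of tokens that are first- and third-person pronouns."""
    tokens = [t.strip(".,!?;:'\"()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    n = max(len(tokens), 1)
    return (sum(t in FIRST_PERSON for t in tokens) / n,
            sum(t in THIRD_PERSON for t in tokens) / n)


# Toy labelled corpus; a real analysis would use the labelled datasets
# described earlier in this work.
deceptive = ["They said he loved the hotel and they would return.",
             "He told them the room was perfect."]
truthful = ["I stayed there last week and my room was noisy.",
            "We booked it ourselves and I liked my stay."]

dec_third = [pronoun_rates(t)[1] for t in deceptive]
tru_third = [pronoun_rates(t)[1] for t in truthful]
# One-tailed test: do deceptive texts use more third-person pronouns?
stat, p = mannwhitneyu(dec_third, tru_third, alternative="greater")
print(f"U = {stat}, one-tailed p = {p:.3g}")
```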
These findings can be analyzed in conjunction with our second research goal, which was to investigate the existence of a universal feature set reliable enough to provide satisfactory performance across cultures and languages. Our analysis showed that no such feature set exists. Moreover, our experiments within the same culture (the United States) and over different genres revealed how volatile and sensitive deception cues are. The most characteristic example is the Bluff dataset, in which deception and humour are employed at the same time and the examined linguistic features reverse direction. Another variable in the examined datasets is the type of deception: the datasets contain multiple types, such as falsifications, oppositions and exaggerations, to name a few. In addition, the data collection process varies, ranging from user-generated content (e.g., TripAdvisor posts, Amazon reviews) and crowd-sourced workers, to volunteers in controlled environments, and even cases outside computer-mediated communication (the transcriptions from the Bluff the Listener show). Despite this diversity, we have to note that some features seem to have a broader impact. This is the case for the length of texts (the #lemmas and #words features): deceptive texts tend to be shorter. This was observed independently of culture and domain, with the single exception of the Bluff dataset. It is in accordance with previous studies, which attribute this behaviour to the reduction of cognitive/memory load during the act of deception.
Our third goal was to work towards the creation of culture/language-aware classifiers. We experimented with varying approaches and examined whether specific models and approaches can be employed in a uniform manner across cultures and languages. We explored two classification schemes: logistic regression and fine-tuned BERT. The experimentation with the logistic regression classifiers demonstrated the superiority of word and phoneme n-grams over all other n-gram variations (character, POS and syntactic). Our findings show that the linguistic cues, even when combined with n-grams, lag behind the single or combined n-gram features whenever models are trained for a specific domain and language (although their performance surpasses the baselines). In more detail, shallow features, like the various n-gram approaches, seem to be quite important for capturing the domain of a dataset, while the linguistic features perform worse. This is the case at least for the native English datasets, where we conducted experiments over various genres and found that the shallow features perform better, even across domains. On the other hand, the linguistic cues seem to be important for the collectivist cultures, especially when combined with shallow features (e.g., in the Russian, SpanishMexico and Romanian datasets). Fine-tuning the BERT models, although costly in terms of hyperparameter tuning, performed rather well. In particular, on some datasets (NativeEnglish, CLiPS and EnglishIndia) we report state-of-the-art performance. However, the most important conclusion is that the combination of BERT with linguistic markers of deception is beneficial, since it enhances performance. This is probably due to the addition of linguistic information that BERT is unable to infer, such as phoneme-related information. Indeed, phonemes play an important role in all individual parts of this study. The experimentation with the multilingual embeddings of multiBERT, as a case of zero-shot transfer learning, showed promising results that can possibly be improved by incorporating culture-specific knowledge or by taking advantage of cultural and language similarities for the least-resourced languages. Finally, we observed the importance of domain-specific deception cues across languages, which can be identified by multiBERT. Given the promising results of multiBERT, other recently introduced multilingual representations may be applied. Alternatives include, for example, MUSE [Chidambaram et al., 2019], LASER [Artetxe and Schwenk, 2019], and LaBSE [Feng et al., 2020]. XLM [Conneau and Lample, 2019] and its XLM-R extension [Conneau et al., 2020] have been reported to achieve state-of-the-art performance in zero-shot cross-lingual transfer scenarios, making them appropriate for low-resource languages [Hu et al., 2020].
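As an illustration of combining BERT representations with handcrafted linguistic cues, the sketch below concatenates frozen BERT [CLS] embeddings with a toy linguistic feature vector and trains a logistic regression head. Note that in our experiments BERT was fine-tuned; the frozen-embedding setup, the feature choices and the example texts here are simplifications for brevity.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")


def embed(texts):
    """Return the [CLS] embedding of each text as a fixed-size vector."""
    with torch.no_grad():
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return bert(**enc).last_hidden_state[:, 0, :].numpy()


def linguistic_features(text):
    """Toy stand-ins for linguistic deception cues; real features would also
    include phoneme counts, which require a grapheme-to-phoneme step."""
    tokens = text.lower().split()
    return [len(tokens),                                        # text length
            sum(t in {"i", "me", "my"} for t in tokens),        # 1st-person pronouns
            sum(t in {"not", "no", "never"} for t in tokens)]   # negations


texts = ["i never stayed in a better hotel", "the room was small but clean",
         "my stay was absolutely perfect", "we waited an hour to check in"]
labels = [1, 0, 1, 0]  # toy labels: 1 = deceptive, 0 = truthful

# Concatenate contextual embeddings with the handcrafted feature vector.
X = np.hstack([embed(texts), np.array([linguistic_features(t) for t in texts])])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```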
Although this work focuses on deception detection from text using style-based features, without being tied to a particular domain, we plan to consider additional features that have been used in other domains and in related work. Specifically, we aim to incorporate features used in discourse-level analysis, such as rhetorical relationships [Rubin et al., 2015, Karimi and Tang, 2019, Pisarevskaya and Galitsky, 2019], other properties of deception like acceptability, believability and the reception [Jankowski, 2018] of a deceptive piece of text (e.g., number of likes or dislikes), and/or source-based features such as the credibility of the medium or author, using stylometric approaches [Potthast et al., 2018, Baly et al., 2018]. Such features are used extensively in fake news detection [Zhou and Zafarani, 2020]. We also plan to examine the correlation of such features with the perceiver's culture [Seiter et al., 2002, Mealy et al., 2007].
Additionally, we plan to extend the Bluff the Listener dataset with new episodes of the game show, in order to further examine the linguistic cues of deception and humour and how they correlate, and to enrich the community with relevant gold datasets for under-studied languages, e.g., Greek. Moreover, we plan to investigate the role of phonemes and their relation to the expression of sentiment, and to incorporate and study phoneme embeddings [Haque et al., 2019]. Finally, we will apply and evaluate our models in real-life applications. This will hopefully add more evidence to the generality of our conclusions, and eventually lead to further performance improvements and reliable practical applications.

Table 33: Results per culture of logistic regression experiments on various feature types, including combinations of pairs. The accuracy measure is reported, and bold font marks the pair with the best achieved performance. The best n-gram row indicates the best accuracy for the unpaired configuration.

US Dutch India Russia Mexico Romania
Single feature types