
Polish natural language inference and factivity: An expert-based dataset and benchmarks

Published online by Cambridge University Press:  01 June 2023

Daniel Ziembicki*
Affiliation:
Department of Formal Linguistics, University of Warsaw, Warsaw, Poland
Karolina Seweryn
Affiliation:
NASK - National Research Institute, Warsaw, Poland
Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
Anna Wróblewska
Affiliation:
Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
*
Corresponding author: Daniel Ziembicki; Email: daniel.ziembicki@uw.edu.pl

Abstract

Despite recent breakthroughs in Machine Learning for Natural Language Processing, Natural Language Inference (NLI) problems still constitute a challenge. To this end, we contribute a new dataset that focuses exclusively on the factivity phenomenon; however, our task remains the same as in other NLI tasks, that is, the prediction of entailment, contradiction, or neutral (ECN). In this paper, we describe the LingFeatured NLI corpus and present the results of analyses designed to characterize the factivity/non-factivity opposition in natural language. The dataset contains entirely natural language utterances in Polish and gathers 2432 verb-complement pairs and 309 unique verbs. The dataset is based on the National Corpus of Polish (NKJP) and is a representative subcorpus with regard to the syntactic construction [V][że][cc]. We also present an extended version of the set (3035 sentences) consisting of more sentences with internal negations. We prepared deep learning benchmarks for both sets. We found that transformer BERT-based models working on sentences obtained relatively good results ($\approx 89\%$ F1 score on the base dataset). Even though better results were achieved using linguistic features ($\approx 91\%$ F1 score on the base dataset), this model requires more human labor (humans in the loop) because the features were prepared manually by expert linguists. BERT-based models consuming only the input sentences show that they capture most of the complexity of NLI/factivity. Complex cases of the phenomenon, for example, cases with entailment (E) and non-factive verbs, still remain an open issue for further research.
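The linguistic-feature baseline described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the feature set below is a simplified subset of the features the paper lists (verb type, internal negation, verb tense), and the toy verb-complement rows and labels are hypothetical examples, chosen only to show the ECN classification setup and the weighted F1 metric.

```python
# Illustrative sketch (not the paper's implementation): a feature-based
# E/C/N classifier in the spirit of the linguistic-feature baseline.
# Feature values and toy rows below are hypothetical examples.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.preprocessing import OneHotEncoder

# Toy verb-complement pairs described by categorical linguistic features:
# (verb type, internal negation, verb tense) -> E/C/N label.
X_raw = [
    ("factive", "no", "past"),       # e.g. "knew that ..."    -> entailment
    ("factive", "yes", "past"),      # negated factive         -> entailment
    ("non-factive", "no", "past"),   # e.g. "thought that ..." -> neutral
    ("non-factive", "yes", "pres"),  # negated non-factive     -> neutral
    ("factive", "no", "pres"),
    ("non-factive", "no", "pres"),
]
y = ["E", "E", "N", "N", "E", "N"]

# One-hot encode the categorical features before fitting the forest.
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(X_raw)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = clf.predict(X)
print("weighted F1:", f1_score(y, pred, average="weighted"))
```

The one-hot encoding step mirrors the fact that all the paper's linguistic features are categorical; a real experiment would of course evaluate on a held-out test split rather than on the training rows.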

Information

Type
Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press
Table 1. Inter-annotator agreement given by Cohen’s Kappa (α = 0.05). Ex—an expert who made the gold standard, A1–A4—non-expert linguists
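The inter-annotator agreement statistic in Table 1 can be reproduced for any expert/annotator pair with a short sketch. The two annotation lists below are hypothetical toy data over the E/C/N labels, used only to show the computation.

```python
# Minimal sketch: pairwise inter-annotator agreement via Cohen's kappa,
# as reported in Table 1. The two toy label lists are hypothetical.
from sklearn.metrics import cohen_kappa_score

expert  = ["E", "N", "N", "C", "E", "N", "E", "C"]  # gold-standard expert
annot_1 = ["E", "N", "E", "C", "E", "N", "E", "C"]  # one non-expert annotator

kappa = cohen_kappa_score(expert, annot_1)
print(f"Cohen's kappa: {kappa:.2f}")  # -> 0.81 for these toy lists
```

Kappa corrects raw agreement (7/8 here) for the agreement expected by chance from each annotator's label distribution, which is why it is preferred over plain percentage agreement for this kind of table.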

Table 2. Datasets on NLI and their linguistic features

Table 3. Factive verbs in CommitmentBank and their Polish equivalents in LingFeatured NLI

Table 4. Distributions of features in the LingFeatured NLI and LingFeatured NLI Extended datasets in the Polish version

Table 5. Contingency table showing the frequency distribution of two variables

Table 6. Hard utterances in our dataset

Figure 1. Relationship between the number of the most frequent verbs and the coverage of the dataset. Left: analysis of the factive subsample. Right: analysis of the non-factive subsample.

Table 7. Classification results for entailment (E), contradiction (C), and neutral (N). Linguistic features comprise the verb, the grammatical tense of the verb, the occurrence of internal negation, the grammatical tense of the complement clause, the utterance type, the verb semantic class, and the verb type (factive/non-factive). F1 score denotes the weighted F1 score. LF denotes linguistic features. The results for the baseline are calculated over the entire input dataset

Table 8. Model and training parameters

Table 9. Results for the most characteristic subsets in our base test dataset: entailment with factive verbs, neutral with non-factive verbs, and the other cases. Values show model accuracy [%]. LF denotes linguistic features

Figure 2. Impurity-based feature importance of the feature-based Random Forest trained on the base LingFeatured NLI dataset. The chart shows the English equivalents of the Polish verbs: know/wiedzieć; pretend/udawać; think/myśleć; turn out/okazać się; admit/przyznać; it is known/wiadomo; remember/pamiętać.

Table 10. Classification results for entailment (E), contradiction (C), and neutral (N) using the extended LingFeatured NLI dataset. Linguistic features comprise the verb, the grammatical tense of the verb, the occurrence of internal negation, the grammatical tense of the complement clause, the utterance type, the verb semantic class, and the verb type (factive/non-factive). F1 score denotes the weighted F1 score. LF denotes linguistic features. The results for the baseline are calculated over the entire input dataset

Table 11. Results for the most characteristic subsets in our extended LingFeatured NLI test dataset: entailment with factive verbs, neutral with non-factive verbs, and the other cases. Values show model accuracy [%]. LF denotes linguistic features

Table 12. Features in the classification of entailment, contradiction, and neutral. Random Forest results with different sets of input features on the base LingFeatured NLI dataset

Table 13. Top 10 verbs broken down into factive/non-factive subgroups. A list of all factive and non-factive verbs is available in Appendix D

Table 14. Annotation examples for non-experts. “Annot.” indicates non-expert annotations

Table 15. Classification results for entailment (E), contradiction (C), and neutral (N) using the LingFeatured NLI dataset in balanced settings. Linguistic features comprise the verb, the grammatical tense of the verb, the occurrence of internal negation, the grammatical tense of the complement clause, the utterance type, the verb semantic class, and the verb type (factive/non-factive). F1 score denotes the weighted F1 score. LF denotes linguistic features. The results for the baselines are calculated over the entire input dataset