
Supervised approach to recognise Polish temporal expressions and rule-based interpretation of timexes

  • JAN KOCOŃ and MICHAŁ MARCIŃCZUK
Abstract

A key challenge of Information Extraction in Natural Language Processing is recognising and classifying temporal expressions (timexes). Timexes are a crucial source of information about when something happens, how often it occurs or how long it lasts. Timexes extracted automatically from text play a major role in many Information Extraction systems, such as question answering or event recognition. We prepared a broad specification of Polish timexes, PLIMEX. It is based on the state-of-the-art annotation guidelines for English, mainly TIMEX2 and TIMEX3 (a part of TimeML, the Markup Language for Temporal and Event Expressions). We extended the specification with a description of the local meaning of timexes, based on the LTIMEX annotation guidelines for English. Temporal description supports further event identification and extends the event description model, focussing on anchoring events in time, ordering events and reasoning about the persistence of events. The specification is designed to address these issues, and we annotated all documents in the Polish Corpus of Wrocław University of Technology (KPWr) according to our annotation guidelines. We also adapted our Liner2 machine learning system to recognise Polish timexes, and we propose a two-phase method to select a subset of features for the Conditional Random Fields sequence labelling method. This article presents the whole process: corpus annotation, evaluation of inter-annotator agreement, extension of the Liner2 system with new features, and evaluation of the recognition models before and after feature selection, with an analysis of the statistical significance of the differences. Liner2 with the presented models is available as open source software under the GNU General Public License.

Footnotes

Work financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering
