UIMA Ruta: Rapid development of rule-based information extraction applications

  • PETER KLUEGL (a1), MARTIN TOEPFER (a2), PHILIP-DANIEL BECK (a2), GEORG FETTE (a2) and FRANK PUPPE (a2)...

Abstract

Rule-based information extraction is an important approach for processing the growing amount of unstructured data. The manual creation of rule-based applications is a time-consuming and tedious task that requires qualified knowledge engineers. The costs of this process can be reduced by providing a suitable rule language and extensive tooling support. This paper presents UIMA Ruta, a tool for rule-based information extraction and text processing applications. The system was designed with a focus on rapid development. The rule language and its matching paradigm facilitate the quick specification of comprehensible extraction knowledge. They support a compact representation while still providing a high level of expressiveness. These advantages are supplemented by the development environment UIMA Ruta Workbench. In addition to extensive editing support, it provides essential assistance for explaining rule execution, introspection, automatic validation, and rule induction. UIMA Ruta is a useful tool for academia and industry due to its open source license. We compare UIMA Ruta to related rule-based systems, especially concerning the compactness of the rule representation, the expressiveness, and the provided tooling support. The competitiveness of the runtime performance is shown in relation to a popular and freely available system. A selection of case studies implemented with UIMA Ruta illustrates the usefulness of the system in real-world scenarios.
