UIMA Ruta: Rapid development of rule-based information extraction applications

Published online by Cambridge University Press: 08 October 2014

PETER KLUEGL
Affiliation:
Comprehensive Heart Failure Center, University of Würzburg, Straubmühlweg 2a and Department of Computer Science VI, University of Würzburg, Am Hubland, Würzburg, Germany email: peter.kluegl@uni-wuerzburg.de
MARTIN TOEPFER
Affiliation:
Department of Computer Science VI, University of Würzburg, Am Hubland, Würzburg, Germany email: martin.toepfer@uni-wuerzburg.de
PHILIP-DANIEL BECK
Affiliation:
Department of Computer Science VI, University of Würzburg, Am Hubland, Würzburg, Germany email: philip.beck@uni-wuerzburg.de
GEORG FETTE
Affiliation:
Department of Computer Science VI, University of Würzburg, Am Hubland, Würzburg, Germany email: georg.fette@uni-wuerzburg.de
FRANK PUPPE
Affiliation:
Department of Computer Science VI, University of Würzburg, Am Hubland, Würzburg, Germany email: frank.puppe@uni-wuerzburg.de

Abstract

Rule-based information extraction is an important approach for processing the growing amount of unstructured data. The manual creation of rule-based applications is a time-consuming and tedious task that requires qualified knowledge engineers. The costs of this process can be reduced by providing a suitable rule language and extensive tooling support. This paper presents UIMA Ruta, a tool for rule-based information extraction and text processing applications. The system was designed with a focus on rapid development. The rule language and its matching paradigm facilitate the quick specification of comprehensible extraction knowledge. They support a compact representation while still providing a high level of expressiveness. These advantages are supplemented by the development environment UIMA Ruta Workbench. In addition to extensive editing support, it provides essential assistance for explanation of rule execution, introspection, automatic validation, and rule induction. UIMA Ruta is a useful tool for academia and industry due to its open source license. We compare UIMA Ruta to related rule-based systems, especially concerning the compactness of the rule representation, the expressiveness, and the provided tooling support. The competitiveness of the runtime performance is shown in relation to a popular and freely available system. A selection of case studies implemented with UIMA Ruta illustrates the usefulness of the system in real-world scenarios.

Information

Type: Articles
Copyright © Cambridge University Press 2014

Example 1. UIMA Ruta script pipeline for parsing bibliographic references.

Example 2. Three different notations of the same rule for detecting dates: old-fashioned (lines 3 and 4), compact (lines 6 and 7), and traditional (lines 9 and 10). Years with three digits are allowed.

Fig. 1. List of conditions and actions currently available in UIMA Ruta.

Example 3. A simple rule for copying feature values and assigning annotations to features. A new annotation of the type Container is created, which stores different information of the underlying annotations as feature values.

Algorithm 1 Pseudo-code of the rule matching algorithm in UIMA Ruta.

Example 4. Two simple rules that match on a token followed by a LastToken annotation. While the first rule has to investigate every token, the second rule starts matching at the second rule element and requires fewer index operations.

Example 5. Two equivalent rules for annotating text between two periods. While the first rule needs to match on each token (ANY), the second rule just searches for the next period, resulting in fewer UIMA index operations.
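
The difference between the two strategies in Example 5 can be illustrated outside of UIMA Ruta. The following Python sketch is not the rule pair from the example; it only contrasts visiting every token with jumping between periods. The function names and the list of period positions, which stands in for a type-specific annotation index, are assumptions made for illustration.

```python
def spans_by_scanning(tokens):
    """Visit every token and close a span whenever a period is reached."""
    spans, inspected, last_period = [], 0, None
    for i, tok in enumerate(tokens):
        inspected += 1  # one lookup per token, comparable to matching ANY repeatedly
        if tok == ".":
            if last_period is not None and i > last_period + 1:
                spans.append((last_period + 1, i))
            last_period = i
    return spans, inspected


def spans_by_period_index(period_positions):
    """Jump from one period to the next using a precomputed list of period positions."""
    spans = [(a + 1, b)
             for a, b in zip(period_positions, period_positions[1:])
             if b > a + 1]  # skip empty spans between adjacent periods
    return spans, len(period_positions)  # one lookup per period


tokens = "A b . c d . e . . f".split()
# Hypothetical stand-in for an index that already knows where all periods are.
periods = [i for i, tok in enumerate(tokens) if tok == "."]

scan_spans, scan_cost = spans_by_scanning(tokens)
index_spans, index_cost = spans_by_period_index(periods)
```

Both functions return the same spans (token ranges with an exclusive end), but the second one touches only the periods, which mirrors why the second rule needs fewer index operations.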

Example 6. A conjunction of two simple rules. The complete rule matches only if both rules are able to match independently of each other.

Example 7. Two identical rules that match on different text positions due to the changed filtering settings in the second rule. The second rule is sensitive to markup and whitespace in its sequential constraint.
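
The effect of the filtering setting on a sequential constraint can be sketched in a few lines of Python. This is an illustration of the idea only, not UIMA Ruta syntax; the token type names, the `DEFAULT_FILTER` set, and the function `match_sequence` are hypothetical.

```python
# Illustrative set of token types that are invisible to rules by default (an assumption).
DEFAULT_FILTER = frozenset({"SPACE", "BREAK", "MARKUP"})


def match_sequence(tokens, first, second, filtered=DEFAULT_FILTER):
    """Return the texts of adjacent (first, second) pairs among the visible tokens."""
    visible = [t for t in tokens if t[0] not in filtered]
    return [(a[1], b[1])
            for a, b in zip(visible, visible[1:])
            if a[0] == first and b[0] == second]


# Tokens are (type, text) pairs.
tokens = [("W", "Dr"), ("PERIOD", "."), ("SPACE", " "), ("W", "Smith")]

# Same "rule", two filtering settings.
with_filter = match_sequence(tokens, "PERIOD", "W")
without_filter = match_sequence(tokens, "PERIOD", "W", filtered=frozenset())
```

With the default filter the whitespace token is skipped and the period is directly followed by a word; with an empty filter the whitespace sits between them and the same sequence no longer matches.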

Example 8. Three rules for matching on sentences. The other rules change the filtering setting resulting in different matches on sentences.

Example 9. A conditioned statement using the block construct. The contained rules are only applied if the language of the document is set to ‘de’.

Example 10. Iteration over annotations of the type Sentence. The contained rules are applied for each sentence and only in the window of the current sentence.

Example 11. An example of an inlined rule interpreted as a postcondition. An annotation is created for each sentence if additional requirements are fulfilled.

Example 12. An example of a rule element with an inlined rule interpreted as a precondition. An annotation is created only if the sentence contains two consecutive noun phrases.

Example 13. Candidate classification with UIMA Ruta rules. The rule classifies a paragraph as a headline if it is ninety to one hundred percent covered by Bold and Underlined annotations and ends with a colon.
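
The classification criterion of Example 13 can be approximated in plain Python as follows. This is a minimal sketch, not the rule from the example: annotations are modelled as `(begin, end)` character offsets, and the coverage threshold is checked separately for Bold and for Underlined annotations, which is one possible reading of the caption. All function and variable names are hypothetical.

```python
def coverage(span, annotations):
    """Fraction of the characters in span that lie inside at least one annotation."""
    begin, end = span
    if end <= begin:
        return 0.0
    covered = set()
    for a_begin, a_end in annotations:
        # Clip each annotation to the span; an empty range adds nothing.
        covered.update(range(max(begin, a_begin), min(end, a_end)))
    return len(covered) / (end - begin)


def is_headline(text, paragraph, bold, underlined, low=0.9, high=1.0):
    """A paragraph is a headline candidate if it is mostly bold and underlined and ends with ':'."""
    begin, end = paragraph
    return (low <= coverage(paragraph, bold) <= high
            and low <= coverage(paragraph, underlined) <= high
            and text[begin:end].rstrip().endswith(":"))


text = "Diagnoses:\nHeart failure."
paragraphs = [(0, 10), (11, 25)]
bold = [(0, 10)]        # hypothetical layout annotations
underlined = [(0, 10)]

headline_flags = [is_headline(text, p, bold, underlined) for p in paragraphs]
```

Here only the first paragraph is fully covered by both layout annotations and ends with a colon, so it is the only headline candidate.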

Example 14. Bottom-up approach for labeling author sections. The first rule detects initials, the second rule identifies names, and the third rule combines names to authors.

Example 15. Boundary matching approach for labeling author sections. The first rule detects the start position, the second rule identifies the end position, and the third rule combines both for the complete annotation.
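
The final combination step of the boundary matching approach can be sketched in Python. The sketch assumes that the start and end positions have already been detected and are given as character offsets; how they are detected is not shown, and the function `combine_boundaries` is a hypothetical name, not part of UIMA Ruta.

```python
from bisect import bisect_right


def combine_boundaries(starts, ends):
    """Pair each start offset with the nearest end offset that follows it."""
    ends = sorted(ends)
    spans = []
    for start in sorted(starts):
        if spans and start < spans[-1][1]:
            continue  # this start lies inside the span found before
        i = bisect_right(ends, start)  # index of the first end strictly after start
        if i < len(ends):
            spans.append((start, ends[i]))
    return spans


# Hypothetical boundary offsets for two references in a document.
author_spans = combine_boundaries(starts=[0, 40], ends=[25, 70])
```

A start without a following end produces no span, so a missed boundary leads to a missing annotation rather than a wrong one.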

Example 16. Two examples for transformation-based rules. The first rule deletes headlines without words and the second rule includes text like ‘Mr.’ in Person annotations.

Example 17. Scoring rules for weighting different aspects of headlines. The rules create an annotation of the type Headline for paragraphs like ‘Diagnoses:’ since the first rule (line 2) and the fifth rule (line 6) increase the score, resulting in an overall score of 12. The rule in line 8 evaluates the score and creates a new annotation.
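
The scoring idea can be illustrated with a small Python sketch: several independent rules each add to or subtract from a score, and a final rule compares the accumulated score with a threshold. The conditions, weights, and threshold below are invented for illustration and do not reproduce the rules of the example; they are chosen only so that a paragraph like ‘Diagnoses:’ also reaches a score of 12.

```python
HEADLINE_THRESHOLD = 10  # hypothetical threshold

# Each scoring rule is a (condition, weight) pair; all of them are hypothetical.
SCORING_RULES = [
    (lambda p: p.rstrip().endswith(":"), 7),   # ends with a colon
    (lambda p: len(p.split()) <= 3, 5),        # very short paragraph
    (lambda p: p.rstrip().endswith("."), -5),  # ends like an ordinary sentence
]


def headline_score(paragraph):
    """Sum the weights of all scoring rules whose condition holds."""
    return sum(weight for condition, weight in SCORING_RULES if condition(paragraph))


def looks_like_headline(paragraph):
    """Final evaluation step: accept the paragraph if the score reaches the threshold."""
    return headline_score(paragraph) >= HEADLINE_THRESHOLD


score = headline_score("Diagnoses:")
accepted = looks_like_headline("Diagnoses:")
rejected = looks_like_headline("The patient was admitted with dyspnea.")
```

Separating the weighting rules from the evaluation step makes it possible to add or retune single aspects without touching the rule that finally creates the annotation.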

Fig. 2. The UIMA Ruta Workbench: (A) Script Explorer with a UIMA Ruta project. (B) Full-featured editor for specifying rules. (C) CAS Editor for visualizing the results. (D) Overview of annotations sorted by type. (E) Annotations overlapping the selected position in the active CAS Editor.

Fig. 3. The UIMA Ruta Workbench: (A+B+C+D) A selection of views for the explanation of the rule execution: (A) The Applied Rules view for a detailed report with profiling information. (B+C) Successful and failed matches. (D) The Rule Elements view displaying matches of rule elements and results of conditions. (E) The Ruta Query view for introspection in a collection of documents.

Fig. 4. The UIMA Ruta Workbench: Different views for automatic validation. (A) The Annotation Testing view for gold standard evaluation. (B) Results of constraint-driven evaluation. (C) Detailed results of specific constraints. (D) Collection of expectations.

Fig. 5. An excerpt of an exemplary JAPE macro and rule (Cunningham et al. 2000) (left) for the detection of ‘money’ entities and their UIMA Ruta equivalents (right).

Fig. 6. Average processing time for documents of different sizes.

Fig. 7. An exemplary AFST rule (Boguraev et al. 2010) (left) for vertical matching in ‘PName’ annotations and its UIMA Ruta equivalent (right). The rules match on text passages like ‘General Ulysses S. Grant’ if the corresponding annotations are present. The optional patterns for the First and Middle annotations are not necessary in UIMA Ruta.

Fig. 8. Excerpt of exemplary AQL rules (Chiticariu et al. 2010) (left) for the detection of persons and their UIMA Ruta equivalents (right). The last two UIMA Ruta rules are only necessary for the consolidate statement.

Fig. 9. Excerpts of exemplary documents processed in the case studies: German clinical letters (A), curricula vitae (B), and scientific references (C).

Table 1. Key figures of the case studies: number of rules involved, effort spent on rule development, size (#documents) of the development and test sets, and F1 score on the unseen test set. *Confidential project with an industrial partner for which effort and F1 score are not available, but the company confirmed an efficiency increase of 100%.

Example 18. Example of a transformation-based rule for correcting Author annotations depending on a type variable (EndOfAuthor), which stores the dominant ending of authors, and an additional annotation (ConflictAtEnd) that points out discrepancies with the local model.