Quite a few of the techniques described in the foregoing chapters could be said to qualify as machine-learning methods. In this chapter we consider a number of other popular machine-learning algorithms and models which either have seen limited use in gene finding, or would seem to offer possible avenues for future investigation in this arena. Most of the methods which we describe are relatively easy to implement in software, and nearly all are available in open-source implementations (see Appendix). While the current emphasis in the field of gene prediction seems to be on Markovian systems (in one form or another), an expanded role for other predictive techniques in the future is not inconceivable.
Overview of automatic classification
Perhaps the most typical setting for machine-learning applications is that of N-way classification (Figure 10.1). In this setting, a test case (i.e., a novel object) is presented to a classifier for assignment to one of a fixed number of discrete categories. The test case is typically encoded as a vector of real-valued or integer-valued attributes (i.e., random variables – section 2.6), though we will generally treat all attributes as being real-valued; thus the attributes of a single test case are drawn from ℝ^m for some integer m. The categories to which test cases are to be mapped are typically encoded as integer values in the range 0 to N − 1.
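As a purely illustrative sketch of this setting (the function names and the nearest-centroid rule below are not taken from any particular gene-finding system), the following Python fragment encodes a test case as a vector of m real-valued attributes and a classifier as a function mapping such a vector to one of N integer-labelled categories:

    # Minimal sketch of the N-way classification setting (illustrative names only).
    from typing import Callable, Sequence

    TestCase = Sequence[float]               # a vector of m real-valued attributes
    Classifier = Callable[[TestCase], int]   # maps a test case to a category in 0..N-1

    def make_nearest_centroid_classifier(centroids: Sequence[TestCase]) -> Classifier:
        """Toy classifier: assign a test case to the category (0..N-1) whose
        centroid is nearest in squared Euclidean distance."""
        def classify(x: TestCase) -> int:
            def dist2(c: TestCase) -> float:
                return sum((a - b) ** 2 for a, b in zip(x, c))
            return min(range(len(centroids)), key=lambda k: dist2(centroids[k]))
        return classify

    # Example with m = 2 attributes and N = 3 categories.
    classifier = make_nearest_centroid_classifier([(0.0, 0.0), (5.0, 5.0), (0.0, 9.0)])
    print(classifier((4.2, 4.8)))   # -> 1

The more sophisticated classifiers discussed in this chapter fit the same interface: a map from attribute vectors to category labels.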
In this chapter we describe a number of heuristics which we have found useful during the implementation, training, and/or deployment of practical gene finding systems for real genome annotation tasks.
Boosting
A well-known trick from the field of machine learning is boosting. This technique has been applied to the training of gene finders in the following way, with modest accuracy improvements being observed in a number of cases.
Suppose that while training a signal sensor for a GHMM-based gene finder we notice that a number of positive examples are assigned relatively poor scores by the newly trained sensor. One approach to boosting involves duplicating these examples in the training set and then re-training the sensor from scratch. The duplicated, low-scoring examples will now have a greater impact on the parameter estimation process due to their being present multiple times in the training set, so that the re-trained sensor is more likely to assign a higher score to those examples. Assuming that the low-scoring examples are not mislabeled training features, improvements to the accuracy of the resulting gene finder might be expected when the gene finder is later deployed on sequences having genes with similar characteristics to the duplicated examples. Care must be taken to avoid overtraining, however. To the extent that a gene finder with optimal genome-wide accuracy is desired, it is important that boosting not be allowed to bias the gene finder in way that is significantly inconsistent with the actual frequency of these difficult signals in the genome.
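The following sketch illustrates this duplication-based form of boosting; train_sensor and score are hypothetical stand-ins for whatever signal-sensor training and scoring routines the gene finder actually uses.

    # Boosting by duplicating low-scoring positive examples and re-training.
    # train_sensor and score are hypothetical stand-ins for the real sensor code.
    from typing import Callable, List, Sequence, TypeVar

    Example = TypeVar("Example")
    Sensor = TypeVar("Sensor")

    def boost_by_duplication(
        examples: Sequence[Example],
        train_sensor: Callable[[Sequence[Example]], Sensor],
        score: Callable[[Sensor, Example], float],
        threshold: float,
        copies: int = 1,
    ) -> Sensor:
        """Train once, find positive examples scoring below `threshold`,
        add `copies` extra copies of each, and re-train from scratch."""
        sensor = train_sensor(examples)
        low_scoring = [ex for ex in examples if score(sensor, ex) < threshold]
        boosted: List[Example] = list(examples)
        for ex in low_scoring:
            boosted.extend([ex] * copies)   # extra weight during parameter estimation
        return train_sensor(boosted)        # re-trained sensor

Keeping the number of extra copies small, and checking the resulting gene finder on a held-out portion of the genome, are simple ways to guard against the overtraining risk noted above.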
We investigate the density and distribution behaviors of the Chinese remainder representation pseudorank. We give a very strong approximation to density, and derive two efficient algorithms to carry out an exact count (census) of the bad pseudorank integers. One of these algorithms has been implemented, giving results in excellent agreement with our density analysis out to 5189-bit integers.
This chapter provides an introduction to Description Logics as a formal language for representing knowledge and reasoning about it. It first gives a short overview of the ideas underlying Description Logics. Then it introduces syntax and semantics, covering the basic constructors that are used in systems or have been introduced in the literature, and the way these constructors can be used to build knowledge bases. Finally, it defines the typical inference problems, shows how they are interrelated, and describes different approaches for effectively solving these problems. Some of the topics that are only briefly mentioned in this chapter will be treated in more detail in subsequent chapters.
Introduction
As sketched in the previous chapter, Description Logics is the most recent name for a family of knowledge representation (KR) formalisms that represent the knowledge of an application domain (the “world”) by first defining the relevant concepts of the domain (its terminology), and then using these concepts to specify properties of objects and individuals occurring in the domain (the world description). As the name Description Logics indicates, one of the characteristics of these languages is that, unlike some of their predecessors, they are equipped with a formal, logic-based semantics. Another distinguishing feature is the emphasis on reasoning as a central service: reasoning allows one to infer implicitly represented knowledge from the knowledge that is explicitly contained in the knowledge base.
Description Logics and related formalisms are being applied in at least five kinds of applications in medical informatics – terminology, intelligent user interfaces, decision support and semantic indexing, language technology, and systems integration. Important issues include size, complexity, connectivity, and the wide range of granularity required – medical terminologies require on the order of 250,000 concepts, some involving a dozen or more conjuncts with deep nesting; the nature of anatomy and physiology is that everything connects to everything else; and notions to be represented range from psychology to molecular biology. Technical issues for expressivity have focused on problems of part–whole relations and the need to provide “frame-like” functionality – i.e., the ability to determine efficiently what can sensibly be said about any particular concept and means of handling at least limited cases of defaults with exceptions. There are also significant problems with “semantic normalization” and “clinical pragmatics” because understanding medical notions often depends on implicit knowledge and some notions defy easy logical formulation. The two best-known efforts – Open Galen and Snomed-rt – both use idiosyncratic Description Logics with generally limited expressivity but specialized extensions to cope with issues around part–whole and other transitive relations. There is also a conflict between the needs for re-use and the requirement for easy understandability by domain expert authors. OpenGalen has coped with this conflict by introducing a layered architecture with a high level “Intermediate Representation” which insulates authors from the details of the Description Logic, which is treated as an “assembly language” rather than the primary medium for expressing the ontology.
We have previously remarked that CCS, like all other process algebras, can be used to describe both implementations of processes and specifications of their expected behaviours. A language like CCS therefore supports the so-called single-language approach to process theory – that is, the approach in which a single language is used to describe both actual processes and their specifications. An important ingredient of these languages is therefore a notion of behavioural equivalence or behavioural approximation between processes. One process description, say SYS, may describe an implementation and another, say SPEC, may describe a specification of the expected behaviour. To say that SYS and SPEC are equivalent is taken to indicate that these two processes describe essentially the same behaviour, albeit possibly at different levels of abstraction or refinement. To say that, in some formal sense, SYS is an approximation of SPEC means roughly that every aspect of the behaviour of SYS is allowed by the specification SPEC and thus that nothing unexpected can happen in the behaviour of SYS. This approach to program verification is also sometimes called implementation verification or equivalence checking.
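As a small, purely illustrative sketch of equivalence checking (not a tool or algorithm from this book), the following Python fragment represents two finite labelled transition systems and tests whether designated start states are strongly bisimilar by naive refinement of the full relation:

    # Naive strong-bisimilarity check between states of finite labelled transition
    # systems (illustrative only; states and actions are plain strings).
    from itertools import product

    def bisimilar(lts, s0, t0):
        """Return True iff s0 and t0 are strongly bisimilar, where `lts` maps each
        state to a set of (action, successor) pairs."""
        states = list(lts)
        rel = set(product(states, states))   # start from the full relation
        changed = True
        while changed:                        # refine until a fixed point is reached
            changed = False
            for (s, t) in list(rel):
                forth = all(any(a == b and (s2, t2) in rel for (b, t2) in lts[t])
                            for (a, s2) in lts[s])
                back = all(any(a == b and (s2, t2) in rel for (b, s2) in lts[s])
                           for (a, t2) in lts[t])
                if not (forth and back):
                    rel.discard((s, t))
                    changed = True
        return (s0, t0) in rel

    # SYS and SPEC as one combined transition table, for brevity.
    LTS = {"sys0": {("coin", "sys1")}, "sys1": {("coffee", "sys0")},
           "spec0": {("coin", "spec1")}, "spec1": {("coffee", "spec0")}}
    print(bisimilar(LTS, "sys0", "spec0"))   # -> True

Here SYS and SPEC describe the same simple coin/coffee behaviour, so the check succeeds; practical verification tools use far more efficient partition-refinement algorithms for the same task.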
Criteria for good behavioural equivalence
We have already argued informally that some processes that we have met so far ought to be considered as behaviourally equivalent.
This chapter will discuss the implementation of the reasoning services which form the core of DL-based knowledge representation systems. To be useful in realistic applications, such systems need both expressive logics and fast reasoners. As expressive logics inevitably have high worst-case complexities, this can only be achieved by employing highly optimized implementations of suitable reasoning algorithms. Systems based on such implementations have demonstrated that they can perform well with problems that occur in realistic applications, including problems where unoptimized reasoning is hopelessly intractable.
Introduction
The usefulness of Description Logics in applications has been hindered by the basic conflict between expressiveness and tractability. Realistic applications typically require both expressive logics, with inevitably high worst-case complexities for their decision procedures, and acceptable performance from the reasoning services. Although the definition of acceptable may vary widely from application to application, early experiments with Description Logics indicated that, in practice, performance was a serious problem, even for logics with relatively limited expressive powers [Heinsohn et al., 1992].
On the other hand, theoretical work has continued to extend our understanding of the boundaries of decidability in Description Logics, and has led to the development of sound and complete reasoning algorithms for much more expressive logics. The expressive power of these logics goes a long way towards addressing the criticisms leveled at Description Logics in traditional applications such as ontological engineering [Doyle and Patil, 1991] and is sufficient to suggest that they could be useful in several exciting new application domains, for example reasoning about database schemas and queries [Calvanese et al., 1998f; 1998a] and providing reasoning support for the so-called Semantic Web [Decker et al., 2000; Bechhofer et al., 2001b].
This book is based on courses that have been held at Aalborg University and at Reykjavík University over the last six years or so. The aim of these semester-long courses has been to introduce computer science students, at an early stage of their M.Sc. degrees or late in their B.Sc. degree studies, to the theory of concurrency and to its applications in the modelling and analysis of reactive systems. This is an area of formal-methods study that is finding increasing application outside academic circles and allows students to appreciate how techniques and software tools based on sound theoretical principles are very useful in the design and analysis of nontrivial reactive computing systems.
In order to carry this message across to students in the most effective way, the courses on which the material in this book is based have presented:
some prime models used in the theory of concurrency (with special emphasis on state-transition models of computation such as labelled transition systems and timed automata);
languages for describing actual systems and their specifications (with a focus on classic algebraic process calculi such as Milner’s calculus of communicating systems and logics such as modal and temporal logics); and
the embodiment of these models and languages in tools for the automatic verification of computing systems.
It has long been realized that the web could benefit from having its content understandable and available in a machine processable form. The Semantic Web aims to achieve this via annotations that use terms defined in ontologies to give well defined meaning to web accessible information and services. OWL, the ontology language recommended by the W3C for this purpose, was heavily influenced by Description Logic research. In this chapter we review briefly some early efforts that combine Description Logics and the web, including predecessors of OWL such as OIL and DAML+OIL. We then go on to describe OWL in some detail, including the various influences on its design, its relationship with RDFS, its syntax and semantics, and a range of tools and applications.
Background and history
The World Wide Web, while wildly successful in growth, may be viewed as being limited by its reliance on languages such as HTML that are focused on presentation (i.e., text formatting) rather than content. Languages such as XML do add some support for capturing the meaning of web content (instead of simply how to render it in a browser), but more is needed in order to support intelligent applications that can better exploit the ever increasing range of information and services accessible via the web. Such applications are urgently needed in order to avoid overwhelming users with the sheer volume of information becoming available.
The aim of the first part of this book is to introduce three basic notions that we shall use to describe, specify and analyse reactive systems, namely
Milner's calculus of communicating systems (CCS) (Milner, 1989),
the model known as labelled transition systems (LTSs) (Keller, 1976), and
Hennessy–Milner logic (HML) (Hennessy and Milner, 1985) and its extension with recursive definitions of formulae (Larsen, 1990).
We shall present a general theory of reactive systems and its applications. In particular, we intend to show the following:
how to describe actual systems using terms in our chosen models (i.e. either as terms in the process description language CCS or as labelled transition systems);
how to offer specifications of the desired behaviour of systems either as terms of our models or as formulae in HML; and
how to manipulate these descriptions, possibly (semi-)automatically, in order to analyse the behaviour of the model of the system under consideration.
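To make this trinity concrete before developing the theory, here is a small, purely illustrative sketch (not one of the tools discussed in the book) of a labelled transition system encoded as a Python data structure, together with an evaluator for recursion-free HML formulae:

    # A toy LTS and an evaluator for (recursion-free) Hennessy-Milner logic formulae.
    # Formulae are tuples: ("tt",), ("ff",), ("and", f, g), ("or", f, g),
    # ("dia", a, f) for <a>f, and ("box", a, f) for [a]f.

    def sat(lts, state, phi):
        """Return True iff `state` of `lts` satisfies the HML formula `phi`."""
        tag = phi[0]
        if tag == "tt":
            return True
        if tag == "ff":
            return False
        if tag == "and":
            return sat(lts, state, phi[1]) and sat(lts, state, phi[2])
        if tag == "or":
            return sat(lts, state, phi[1]) or sat(lts, state, phi[2])
        if tag == "dia":    # <a>f: some a-successor satisfies f
            return any(a == phi[1] and sat(lts, s2, phi[2]) for (a, s2) in lts[state])
        if tag == "box":    # [a]f: every a-successor satisfies f
            return all(a != phi[1] or sat(lts, s2, phi[2]) for (a, s2) in lts[state])
        raise ValueError(f"unknown connective: {tag}")

    # A two-state coffee machine: insert a coin, then coffee or tea can be dispensed.
    CM = {"p0": {("coin", "p1")}, "p1": {("coffee", "p0"), ("tea", "p0")}}
    # "after every coin, coffee is possible"
    print(sat(CM, "p0", ("box", "coin", ("dia", "coffee", ("tt",)))))   # -> True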
In the second part of the book, we shall introduce a similar trinity of basic notions that will allow us to describe, specify and analyse real-time systems – that is, systems whose behaviour depends crucially on timing constraints. There we shall present the formalisms of timed automata (Alur and Dill, 1994) and timed CCS (Yi, 1990, 1991a, b) to describe real-time systems, the model of timed labelled transition systems (TLTSs) and a real-time version of Hennessy–Milner logic (Laroussinie, Larsen and Weise, 1995).
This chapter covers extensions of the basic Description Logics introduced in Chapter 2 by very expressive constructs that require advanced reasoning techniques. In particular, we study reasoning in description logics that include general inclusion axioms, inverse roles, number restrictions, reflexive–transitive closure of roles, fixpoint constructs for recursive definitions, and relations of arbitrary arity. The chapter will also address reasoning w.r.t. knowledge bases including both a TBox and an ABox, and discuss more general ways to treat objects. Since the logics considered in the chapter lack the finite model property, finite model reasoning is of interest and will also be discussed. Finally, we mention several extensions to description logics that lead to undecidability, confirming that the expressive description logics considered in this chapter are close to the boundary between decidability and undecidability.
Introduction
Description Logics have been introduced with the goal of providing a formal reconstruction of frame systems and semantic networks. Initially, research concentrated on subsumption of concept expressions. However, for certain applications, it turns out that it is necessary to represent knowledge by means of inclusion axioms without limitation on cycles in the TBox. Therefore, there has recently been strong interest in the problem of reasoning over knowledge bases of a general form. See Chapters 2, 3, and 4 for more details.
When reasoning over general knowledge bases, it is not possible to gain tractability by limiting the expressive power of the description logic, because the power of arbitrary inclusion axioms in the TBox alone leads to high complexity in the inference mechanisms.
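As a small illustration (with hypothetical concept and role names), even a TBox containing only a few general, possibly cyclic, inclusion axioms of the following kind already exhibits the expressive power referred to here:

    % General concept inclusion axioms (GCIs), including a terminological cycle
    \begin{align*}
    \mathsf{Human} &\sqsubseteq \exists \mathsf{hasParent}.\mathsf{Human}\\
    \exists \mathsf{hasChild}.\mathsf{Doctor} &\sqsubseteq \mathsf{HappyParent}\\
    \mathsf{Doctor} &\sqsubseteq \mathsf{Human} \sqcap (\leqslant 2\;\mathsf{worksAt})
    \end{align*}

The first axiom is cyclic (Human occurs on both sides), and the second has a complex concept on its left-hand side; neither can be recast as an acyclic definition of an atomic concept.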
The purpose of the chapter is to help someone familiar with DLs to understand the issues involved in developing an ontology for some universe of discourse, which is to become a conceptual model or knowledge base represented and reasoned about using Description Logics.
We briefly review the purposes and history of conceptual modeling, and then use the domain of a university library to illustrate an approach to conceptual modeling that combines general ideas of object-centered modeling with a look at special modeling/ontological problems, and DL-specific solutions to them.
Among the ontological issues considered are the nature of individuals, concept specialization, non-binary relationships, materialization, aspects of part–whole relationships, and epistemic aspects of individual knowledge.
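As one hedged illustration of these issues (all names below are hypothetical), a ternary ‘borrows’ relationship in the library domain – who borrowed which item on which date – cannot be expressed directly as a DL role, but it can be reified as a concept with three binary roles:

    % Reifying a ternary relationship as a concept with at-most-single-valued roles
    \begin{align*}
    \mathsf{Borrowing} \sqsubseteq\ & \exists \mathsf{borrower}.\mathsf{Patron}
        \sqcap \exists \mathsf{item}.\mathsf{LibraryItem}
        \sqcap \exists \mathsf{onDate}.\mathsf{Date}\\
    \mathsf{Borrowing} \sqsubseteq\ & (\leqslant 1\;\mathsf{borrower})
        \sqcap (\leqslant 1\;\mathsf{item}) \sqcap (\leqslant 1\;\mathsf{onDate})
    \end{align*}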
Background
Information modeling is concerned with the construction of computer-based symbol structures that model some part of the real world. We refer to such symbol structures as information bases, generalizing the term from related terms in Computer Science, such as databases and knowledge bases. Moreover, we shall refer to the part of the real world being modeled by an information base as its universe of discourse (UofD). The information base is checked for consistency, and sometimes queried and updated through special-purpose languages. As with all models, the advantage of information models is that they abstract away irrelevant details, allowing more efficient examination of the current state of the UofD as well as its past and projected future states.
In contrast to the relatively complex information that can be expressed in DL ABoxes (which we might call knowledge or information), databases and other sources such as files, semistructured data, and the World Wide Web provide rather simpler data, which must however be managed effectively. This chapter surveys the major classes of application of Description Logics and their reasoning facilities to the issues of data management, including: (i) expressing the conceptual domain model/ontology of the data source, (ii) integrating multiple data sources, and (iii) expressing and evaluating queries. In each case we utilize the standard properties of Description Logics, such as the ability to express ontologies at a level closer to that of human conceptualization (e.g., representing conceptual schemas), determining consistency of descriptions (e.g., determining if a query or the integration of some schemas is consistent), and automatically classifying descriptions that are definitions (e.g., queries are really definitions, so we can classify them and determine subsumption between them).
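For example (with hypothetical concept and role names), two queries written as concept definitions can be compared by standard subsumption reasoning, so that every answer to the more specific query is guaranteed to be an answer to the more general one:

    % Queries expressed as concept definitions, plus one TBox axiom
    \begin{align*}
    \mathsf{GraduateCourse} &\sqsubseteq \mathsf{Course}\\
    Q_1 &\equiv \mathsf{Student} \sqcap \exists \mathsf{enrolledIn}.\mathsf{GraduateCourse}\\
    Q_2 &\equiv \mathsf{Student} \sqcap \exists \mathsf{enrolledIn}.\mathsf{Course}
    \end{align*}

Classification then places Q1 below Q2 (Q1 ⊑ Q2), so the answers to Q1 form a subset of the answers to Q2.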
Introduction
According to [Elmasri and Navathe, 1994], a database is a coherent collection of related data, which have some “inherent meaning”. Databases are similar to knowledge bases because they are usually used to maintain models of some universe of discourse (UofD). Of course, the purpose of such computer models is to support end-users in finding out things about the world, and therefore it is important to maintain an up-to-date and error-free model.