The data integration saga: A QUESTION OF SEMANTICS?

The life sciences industry has recognized the importance of data integration for many years. Effective integration promises to help organizations make better decisions about resource allocation during drug discovery and development, and better-informed decisions about the market opportunity for compounds. However, effective data integration has largely eluded the life sciences.

The life sciences have a rich history of making data publicly available, as scientists have long recognized the benefits of sharing data for the greater good of science. However, there are now more than a thousand different biological data repositories available. This is far too many for biologists to remember, let alone to navigate the different interfaces and reach the data of interest.

It's All Relational

To give researchers easier access to the publicly available data, life sciences organizations typically bring the data in-house and integrate it within a relational database management system (RDBMS). However, this is not an easy task, as the many public repositories were developed in relative isolation and have consequently adopted different data models and naming conventions. Furthermore, the same biological entities in different repositories may not be equivalent, because of different methods of data generation or different uses of terms for defining concepts. Frequent updates to data models, made to reflect advances in scientific understanding, compound the difficulties.

Managing data in an RDBMS provides an excellent architecture when concise and efficient queries are required. Typically, this occurs with enterprise applications, where the user interacts with the data through a tightly constrained set of forms provided by the application. As all of the metadata regarding the query is embedded in the application or implicit within the RDBMS itself, the application needs only a minimal amount of input to execute properly.

However, an RDBMS does not provide a strong architecture for sharing data or executing queries across organizational boundaries. This is because different organizations, and even different departments within the same organization, will use dissimilar database schemas and applications, even for performing the same activities. In this environment, using

The Semantic Web

A new approach to data integration is rapidly becoming popular in the life sciences. The approach is termed the Semantic Web, and consists of a number of standard recommendations from the World Wide Web Consortium (W3C), including the Resource Description Framework (RDF) and the Web Ontology Language (OWL). RDF has been designed for information sharing with ultimate flexibility. RDF uses a triple (subject, predicate, object) syntax to represent a statement about a fact, a statement about an association, or a statement about another statement. Each component of the triple can be assigned a Uniform Resource Identifier (URI), which makes it possible to determine exactly what is being described. There is, therefore, no need for a community to agree on a schema for sharing data.
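As a minimal sketch of the triple model, the following Python represents statements as (subject, predicate, object) tuples and looks them up by pattern. The `http://example.org/` URIs, the property names, the data values and the `match` helper are all hypothetical, chosen only for illustration; a real deployment would more likely use an RDF library and established vocabularies.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

EX = "http://example.org/"  # hypothetical namespace, not a real vocabulary

triples = {
    # A statement about a fact.
    Triple(EX + "compound/42", EX + "molecularWeight", "180.16"),
    # A statement about an association between two resources.
    Triple(EX + "compound/42", EX + "inhibits", EX + "protein/P1"),
    # A statement about another statement: "statement/1" is assumed to be
    # an identifier that names the association above.
    Triple(EX + "statement/1", EX + "reportedBy", EX + "lab/A"),
}

def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every component that is not None."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]

# Everything stated directly about the compound, with no schema consulted.
about_compound = match(triples, subject=EX + "compound/42")
```

Because each component is a URI or a literal rather than a column in a fixed table, new kinds of statements can be added to the set without changing any existing structure.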

The very flexible triple representation of RDF, combined with the use of URIs, enables organizations to more easily aggregate all information that is relevant to an entity of interest. For example, if a URI is assigned to a compound of interest within a life sciences organization, it becomes much simpler to aggregate all information relating to the compound across the company. Furthermore, the bioinformatics community has worked to develop a common mechanism for assigning URIs to biological entities, which is making it simpler to aggregate information across the different publicly available data repositories.
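The aggregation idea can be illustrated with a short sketch, assuming two in-house sources that describe the same compound under one shared URI. The source names, URIs, property names and values below are invented for the example and do not come from any real repository.

```python
from collections import defaultdict

COMPOUND = "http://example.org/compound/42"  # hypothetical shared URI

# Two hypothetical sources that were built independently and use
# different property names, but agree on the compound's URI.
chemistry_db = [
    (COMPOUND, "http://example.org/chem/formula", "C6H12O6"),
]
assay_db = [
    (COMPOUND, "http://example.org/assay/ic50_nM", "35"),
    ("http://example.org/compound/7", "http://example.org/assay/ic50_nM", "900"),
]

def aggregate(*sources):
    """Merge triples from all sources and group them by subject URI."""
    by_subject = defaultdict(dict)
    for source in sources:
        for subject, predicate, obj in source:
            by_subject[subject].setdefault(predicate, set()).add(obj)
    return by_subject

# Everything known about the compound, regardless of which source said it.
profile = aggregate(chemistry_db, assay_db)[COMPOUND]
```

The merge needs no mapping between the two sources' schemas; the shared URI alone is what ties the statements together.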

OWL provides an expressive and unambiguous schema language for modeling information as ontologies. Using RDF as a core data format, it constrains the data relationships to a set of named properties. A reasoner can then derive implicit facts by combining constraints expressed in the schema with explicitly stated facts. Communities can share, distribute and join references to other data by importing new ontologies and referring to external triples, allowing communities to use common metadata dictionaries without needing to adopt a common data model. An extensive dependence on custom programming, and the need to rewrite code whenever a schema or model changes, is avoided when domain knowledge is abstracted to an ontology layer.
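To make the idea of deriving implicit facts concrete, here is a deliberately small sketch of forward-chaining inference over a class hierarchy. It applies only the subclass rule (using the standard `rdf:type` and `rdfs:subClassOf` URIs), which is a tiny fragment of what an OWL reasoner handles; the `http://example.org/` classes and the `infer` function are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/"  # hypothetical namespace

# Constraints expressed in the schema (the ontology layer).
schema = {
    (EX + "Kinase", SUBCLASS_OF, EX + "Enzyme"),
    (EX + "Enzyme", SUBCLASS_OF, EX + "Protein"),
}
# Explicitly stated facts (the data layer).
facts = {
    (EX + "protein/P1", RDF_TYPE, EX + "Kinase"),
}

def infer(schema, facts):
    """Apply the subclass rule repeatedly until no new triples appear."""
    known = set(schema) | set(facts)
    while True:
        new = set()
        for s, p, o in known:
            if p not in (RDF_TYPE, SUBCLASS_OF):
                continue
            # If o is a subclass of d, then whatever holds for o via
            # rdf:type or rdfs:subClassOf also holds for d.
            for c, q, d in known:
                if q == SUBCLASS_OF and c == o:
                    new.add((s, p, d))
        if new <= known:
            return known
        known |= new

# Triples that were never stated but follow from schema plus facts.
derived = infer(schema, facts) - schema - facts
```

Here the data only says that P1 is a kinase; the statements that it is also an enzyme and a protein come from the schema. If the hierarchy changes, only the ontology triples need updating, not the code.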

The biomedical domain has a rich history of using common vocabularies for the integration and annotation of data repositories and database schemas. Vocabularies are increasingly being made available as OWL-based ontologies, providing a formal representation of knowledge about specific domains. Examples of ontologies that are available in OWL include BioPAX, which provides information about biological pathways; Gene Ontology, which maps gene and gene product attributes; and Unified Medical Language System, which models the medical domain.

Semantic Web technologies provide a very flexible architecture for data integration. They enable users to re-use data in unforeseen ways, which is of great importance in a field that is rapidly progressing. They also enable users to form queries, even if the user has little or no technical knowledge of where the data is located or how it is structured. This is especially powerful for users who are working in a grid environment. However, users should be aware that it is difficult with this approach to guarantee the completeness and accuracy of query results. Also, these queries cannot be performed as efficiently or with the scalability of

Life Science Leadership

The Semantic Web has generated much interest within the science industry over the last couple of years. This is in part due to a growing realization that traditional technology is not going to be able to help the industry overcome its complex data-sharing and integration challenges. However, interest has also been raised by the W3C, which held a workshop on the Semantic Web for Life Sciences in 2004 and then, in 2005, formed an interest group focused on the application of the Semantic Web to the health care and life sciences industries. The interest group will provide a beneficial forum for the exchange of information, coordination between different groups and the establishment of best practices. Furthermore, the National Institutes of Health has recognized that advances are required in the area of information management, and it has invested heavily in the establishment of a National Center for Biomedical Ontologies.

Another key driver towards adoption of Semantic Web technology is Oracle's support for an RDF Data Model in the latest release of its database. As Oracle is the first enterprise technology to provide such support, it is a critical step forward in giving the community access to an RDF data repository that is scalable, highly available and secure. It also provides a strong validation of Semantic Web technology.

The life sciences industry has very complex data management requirements due to the distributed and heterogeneous nature of its data. Semantic Web technologies promise the flexibility that researchers need for integrating data from many different sources. As each of the information management approaches discussed in this article has distinct strengths, it will be interesting to see how they will be used together to help users gain maximum insight from their data.

Dr Susie Stephens joined Oracle in 2002 to lead the development of the Database, Application Server and Collaboration Suite and to further enhance their capabilities as a powerful infrastructure platform and analytical engine for drug discovery and development. Previously, she spent four years at Sun Microsystems, initially as a pre-sales systems engineer and then as Life Sciences Market Segment Manager, where she played a key role in establishing the company's presence in the life sciences industry.