New Directions in Genome Databases at Stanford
Mike Cherry, David Flanders, and Fabien Petel
Department of Genetics, Stanford University, Stanford, California 94305-5120 U.S.A.

Abstract
A small group at Stanford University has begun development of a new database that will provide easier access, better reliability, and the ability to store more information than the current ACEDB-based database. We do not describe the specifics of our development efforts here; rather, we introduce the basic technologies being used: the new programming language called Java and the new object-relational database from Illustra Information Technologies.

Introduction
In 1991, Howard Goodman's laboratory (Massachusetts General Hospital, Boston) undertook the creation of a database for Arabidopsis thaliana. The ACEDB software, originally developed for the Caenorhabditis elegans genome project, was used to create AAtDB (An Arabidopsis thaliana Database). This database contained genetic and physical maps, all Arabidopsis DNA sequences, a catalog of DNA and seed stocks, an extensive collection of published Arabidopsis literature, names and addresses of scientists interested in Arabidopsis, and more. In late 1995, the AAtDB project was moved to the Genome Databases Group, Department of Genetics, at Stanford University and given a new name: AtDB (Arabidopsis thaliana DataBase).
The AtDB project at Stanford is currently developing new software that will greatly expand the ability of our group to provide information to the Arabidopsis community. However, while this development is our main focus, the old ACEDB-based database is being maintained. Indeed, the old ACEDB database has been given a new WWW look and will continue to be used at Stanford for at least the next year or two. Many new features have also been added to the AtDB home page, for example, sequence similarity searches against Arabidopsis-only sequences using BLAST and FASTA.
The software developments at Stanford are focusing on two areas: user interface and modern database technology. As more and more information becomes available on Arabidopsis, the databases that hold this information must be able to process information faster and provide summary views of complex details. To do this, the Java programming language will be used to create interactive graphical displays of complex information, such as genetic and physical maps, and gene structure. The latter will be at both the DNA and protein sequence level as well as the structural/3-dimensional level. The ability to hold and process large amounts of complex information effectively will come from the use of a new commercial database product from Illustra.
The ACEDB WWW version of AtDB provides a useful view of Arabidopsis information, but there are limitations. The graphic views of various parts of the database are slow and static. The user can click on these images to change the view, but each click requires a new image to be created on the AtDB server at Stanford and then transported to the user's computer. ACEDB (developed by Richard Durbin, Sanger Centre, U.K., and Jean Thierry-Mieg, CNRS, Montpellier, France) is a fantastic example of software engineering and allows biological information to be stored and presented easily. However, ACEDB was designed for use by small groups of people and is limited in the features needed by database systems in which several people add or modify information simultaneously.
Java will allow the creation of new WWW programs that minimize the amount of information that is transmitted between the AtDB computers and the user's computer. Illustra provides all the standard features required by commercial users of large database centers, plus several new enhancements that will aid in the creation of an easy-to-use and easy-to-manage database environment for the AtDB curators, and thus allow more timely and complete information to be provided to the Arabidopsis and plant biology communities worldwide.
Java
You've probably at least heard of Java. Even though it's only been around for a year or so in its publicly available form, it's been the subject of numerous news articles.
So what is Java, what's all the fuss about, and of what interest is it to Weeds World-types?
Java is a new type of (relatively) simple computer programming language, developed by Sun Microsystems. Java comes into its own in enabling the creation of truly interactive Web pages. A measure of Java's popularity is that over 1.7 gigabytes are downloaded from Sun's Java pages a day.
Web interactivity is achieved through small programs -- termed applets -- written in Java. Web pages containing applets include special instructions that direct a Java-capable WWW browser to download the applet onto your computer. (Great care has been and is being taken to ensure that this is not a security risk.) The applet then runs on your computer, rather than on the computer where the Web page is located.
This is possible because one of the main strengths of Java is that an applet, once written, will run, without further modification, on any type of computer -- PC, Mac, UNIX, etc. -- provided that the computer is running a Java-enabled Web browser, such as Netscape. (The recently released Netscape version 3.0 is Java-enabled for Windows 95, UNIX and Mac machines and can be downloaded from Netscape; URL: http://home.netscape.com/comprod/upgrades/)
This ability to run applets on your computer, combined with transfer of data over the Web, permits the production of very interactive Web pages.
AtDB will be incorporating Java into our Web-based facilities. One important area for this is our Web forms. For example, you can send physical-map contig details to AtDB using a Web form. Currently, you do this by entering clone names and positions. When this form is re-written using Java, you will be able to draw and label your contig directly in the AtDB Web page (using an applet that looks and behaves very much like a standard computer drawing program). This will be quicker, easier and less prone to error, and it will provide you and the community with information all ready to go in a "map" display.
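As a rough illustration of the data such a form collects, a contig can be thought of as a list of named clones with start and end positions. The sketch below is written in Python rather than Java, purely for brevity. The `Clone` class, the `validate` and `render_contig` functions, and the clone names and coordinates are all hypothetical and are not part of AtDB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clone:
    """One clone in a contig; positions are in arbitrary map units (assumed)."""
    name: str
    start: int
    end: int


def validate(clones):
    """Return a list of error messages for obviously bad entries."""
    errors = []
    for c in clones:
        if not c.name.strip():
            errors.append("clone with empty name")
        if c.start >= c.end:
            errors.append(f"{c.name}: start must be less than end")
    return errors


def render_contig(clones, width=40):
    """Draw each clone as a bar of '=' characters scaled to `width` columns."""
    lo = min(c.start for c in clones)
    hi = max(c.end for c in clones)
    scale = width / (hi - lo)
    lines = []
    for c in sorted(clones, key=lambda c: c.start):
        left = int((c.start - lo) * scale)
        length = max(1, int((c.end - c.start) * scale))
        lines.append(" " * left + "=" * length + " " + c.name)
    return "\n".join(lines)


# Hypothetical example data, not taken from AtDB.
contig = [Clone("yacA", 0, 200), Clone("yacB", 150, 400), Clone("cosC", 380, 420)]
print(validate(contig))
print(render_contig(contig))
```

Checking the entries before drawing them is the kind of step that makes a drawn contig less error-prone than typed names and positions: an impossible clone (start at or after end) is caught before it reaches the map.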
AtDB is involved in a Bay Area collaboration with SGD (Saccharomyces Genome Database, also in the Genetics Dept. at Stanford) and BDGP (Berkeley Drosophila Genome Project). All three teams are working on using Java, and a really nice example of an interactive, Java-based Genome Browser for fly information has already been produced by Gregg Helt at Berkeley.
Other examples of Java pages of interest to biologists can be found at The Java-based Molecular Biology Work Bench. Links to hundreds of other Java sites may be found at Gamelan, the Java Directory.

The Illustra ORDBMS

A Bit of Background

The Illustra server is an example of an 'Object-Relational Database Management System' (ORDBMS), also called 'Extended-Relational' by others. It is similar in many ways to traditional DBMS such as Sybase or Oracle, but also provides some interesting object-oriented features.
Illustra, the company that sells the Illustra server, was founded in 1992 by Michael Stonebraker (the founder of Ingres Corp.) and represents the commercialization of a seven-year University of California database research project called "Postgres".
The goal is to allow users to ask more complex queries on more complex data than was possible with previous DBMS's. Application architects are seeking to expand the definition of data to include, for example, diagrams, maps, images, sound, documents, time series and multi-dimensional data. Relational Database Management Systems (RDBMS's) usually store complex data as uninterpreted BLOBs (Binary Large Objects). Because BLOBs are uninterpreted bit patterns, the RDBMS does not know how to perform content-based queries on them and has no sensible comparison operators for them, so it cannot intelligently build an optimal query plan or provide high-performance storage and retrieval. The hard work of understanding the contents, format, and methodologies required for each data type is left to the individual application developer, and this work must be re-invented and re-embedded in each application. Worse, because the RDBMS cannot understand BLOB contents, BLOBs have to be shipped across the network to the client for processing, placing a heavy burden on scarce network bandwidth.
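The contrast can be sketched with Python's built-in `sqlite3` module, used here purely as a stand-in: SQLite is not Illustra, and an ORDBMS goes much further by adding types, operators and access methods. The table, the sequences and the `gc_content` function are hypothetical. Once a function that understands the stored bytes is registered with the engine, a content-based condition can be written in SQL and evaluated inside the database, instead of shipping every BLOB to the client.

```python
import sqlite3


def gc_content(seq_blob):
    """Fraction of G and C bases in a DNA sequence stored as bytes."""
    seq = bytes(seq_blob).upper()
    if not seq:
        return 0.0
    return (seq.count(b"G") + seq.count(b"C")) / len(seq)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clone_seq (name TEXT PRIMARY KEY, seq BLOB)")
conn.executemany(
    "INSERT INTO clone_seq VALUES (?, ?)",
    [("a", b"ATATATAT"), ("b", b"GGCCGGAT"), ("c", b"GCGCATAT")],
)

# Teach the engine how to interpret the BLOB; 1 is the argument count.
conn.create_function("gc_content", 1, gc_content)

rows = conn.execute(
    "SELECT name FROM clone_seq WHERE gc_content(seq) >= 0.5 ORDER BY name"
).fetchall()
gc_rich = [name for (name,) in rows]
print(gc_rich)
```

Without the registered function, the only conditions the engine could apply to the `seq` column would be byte-for-byte comparisons, which is the limitation described above.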
Object Orientation (OO) offers a new way to view and model the relationship between data and applications and brings the hope that applications will become easier to develop, more robust and more maintainable. Some RDBMS vendors are grafting an object layer on top of a previous relational product, but the basic engine is generally unable to understand how to optimize storage and access to object data, thus making the DBMS somewhat inefficient. The merger of the existing client/server database model with Object Orientation led to the development of Object-Oriented Database Management Systems (OODBMS) that could store objects created by an OO application, e.g., a C++ program. However, the OODBMS model suffered from the lack of a common query language. Much of the appeal of the RDBMS's stems from their use of a widely adopted query language, SQL, which makes their architecture so flexible. In addition, many OODBMS's seem to lack very important high-level features required by corporations, including scalability, security, server-side functions and concurrency.
Illustra tries to combine the best of both worlds by providing a reliable and efficient system with a standard SQL query language interface together with the tools to manage complex data and achieve content-based query capability, comparison operators, query optimization, efficient storage and access, and reasonable performance on complex types.

Brief Comparison with Traditional RDBMS

Illustra shares many aspects with other well-known Relational DBMS products:

usual data types and built-in functions: int, float, date, varchar, abs(), sum(), count(), etc.
usual structures and server-enforced data integrity: page and table locks, keys, defaults, tables, views, cursors, etc.
flexible data access via ANSI-SQL 92, the current agreed-upon industry standard
client-server architecture with built-in support for TCP/IP
rules (to specify actions to be taken before, after, or instead of a user request)
standard security controls, transactions and recovery (log, dump and transaction management)
client API (C and C++) and user-defined functions executable on the server side

Illustra adds some very interesting features:

function overloading (use of a signature)
single and multiple inheritance on both tables and types
user-defined composite datatypes and type casting
arrays and sets
references (i.e., pointer to a row's internal object identifier)
alerters to inform external programs about events within the databases
archiving of non-current (i.e., deleted or updated) data for 'time travel' or versioning
scrollable (non-updatable) cursors
possibility of defining a table's column as the result of a user-defined function
datablades, which are sets of domain-oriented data types, functions and access methods optimized for certain tasks such as image processing, text searches or www publishing. The datablades enhance the flexibility and the extensibility of the databases and their related applications
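To make the 'time travel' item in the list above more concrete, the following sketch imitates it with ordinary triggers in SQLite, via Python's built-in `sqlite3` module. This is only an approximation under stated assumptions: Illustra archives non-current rows itself, whereas here the archiving is spelled out by hand, and the `marker` and `marker_archive` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE marker (name TEXT PRIMARY KEY, map_position REAL);
    CREATE TABLE marker_archive (
        name TEXT,
        map_position REAL,
        retired_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Copy the old row aside whenever it is updated or deleted.
    CREATE TRIGGER marker_on_update BEFORE UPDATE ON marker
    BEGIN
        INSERT INTO marker_archive (name, map_position)
        VALUES (OLD.name, OLD.map_position);
    END;
    CREATE TRIGGER marker_on_delete BEFORE DELETE ON marker
    BEGIN
        INSERT INTO marker_archive (name, map_position)
        VALUES (OLD.name, OLD.map_position);
    END;
    """
)

conn.execute("INSERT INTO marker VALUES ('m1', 10.5)")  # hypothetical marker
conn.execute("UPDATE marker SET map_position = 12.0 WHERE name = 'm1'")
conn.execute("DELETE FROM marker WHERE name = 'm1'")

# Every superseded version of the row is still available for inspection.
history = conn.execute(
    "SELECT name, map_position FROM marker_archive ORDER BY rowid"
).fetchall()
print(history)
```

The same trigger mechanism is also a small example of a rule that runs before a user request, one of the features listed earlier as shared with traditional systems.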

The combination of some of these features allows Illustra to exhibit the following object-oriented features:

encapsulation
inheritance
polymorphism
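For readers unfamiliar with these three terms, here is a minimal sketch in Python. It illustrates the general concepts only and says nothing about how Illustra implements them; the `Clone`, `YAC` and `Cosmid` classes and their data are hypothetical.

```python
class Clone:
    """Base type: state is kept behind a small interface (encapsulation)."""

    def __init__(self, name, length_kb):
        self._name = name
        self._length_kb = length_kb

    @property
    def name(self):
        return self._name

    def describe(self):
        return f"{self._name}: clone, {self._length_kb} kb"


class YAC(Clone):
    """Subtype that inherits from Clone and overrides describe()."""

    def describe(self):
        return f"{self._name}: YAC, {self._length_kb} kb"


class Cosmid(Clone):
    """Another subtype with its own describe()."""

    def describe(self):
        return f"{self._name}: cosmid, {self._length_kb} kb"


# Polymorphism: the same call selects the right implementation per object.
clones = [YAC("y1", 300), Cosmid("c1", 40), Clone("x1", 5)]
descriptions = [c.describe() for c in clones]
print(descriptions)
```

In a database setting, the analogous idea is that a query written against a parent table or type can also work on rows of its subtypes, each handled by the function defined for it.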

Illustra lacks some features that have proven to be useful in other systems such as Sybase:

no added flow control in the SQL language (e.g., nothing like Sybase's Transact-SQL)
only a few third-party tools (this may change, as it is still a young product)
uncertainty about the performance on sites with multiple simultaneous database modifications
nothing done for the replication of server data on secondary sites

However, the recent purchase of Illustra by Informix may solve some of these issues in the soon-to-be-released 'Informix Universal Server', which is supposed to integrate the Illustra engine and its datablades.

Advantages for Biology and Genetics

Using object-orientation doesn't necessarily make your model easier to design, but it may help make it more understandable to biologists with no particular database background. Moreover, your logical database design is closer to reality and to what users really want, thus making it potentially easier to maintain and re-use.
The datablades allow you to quickly use efficient tools for manipulating biological data that are usually not easily handled by relational DBMS's. Among many features, Illustra provides:

visual information retrieval (VIR) to search the content of an image on color, composition (spatial arrangements of the color regions in the scene), texture (e.g. wood grain, granite, marble or clouds) and structure (general shape characteristics of the object in the image). You don't need to store images in the database unless you choose to do so. When an image is loaded, the Illustra engine analyzes it and does a "feature extraction" which produces a small numeric string that describes the image along the search criteria listed above. This allows you to store the image on a less expensive archival medium and also provides very fast searches.
2D and 3D spatial objects to search for overlap, perimeter, area, containment and volumes
advanced document and text searches, storage and management.
management of web pages and generation of dynamic database front-ends with live web pages.

All these features can be combined to answer complex queries such as:

find pictures of anthocyanin-producing lines with a Columbia background
give me all the references where an abstract is related to gene splicing
find all the people in Sweden working on seed-irradiation
draw me the physical map with this set of contigs
find all light-regulated genes and color in red their protein factor on the global sequence
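One of the simpler requests above, finding the people in Sweden who work on seed irradiation, can already be expressed in plain SQL. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the database server. The `colleague` and `interest` tables, their columns and the sample rows are hypothetical and do not reflect the actual AtDB schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE colleague (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE interest (
        colleague_id INTEGER REFERENCES colleague(id),
        keyword TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO colleague VALUES (?, ?, ?)",
    [
        (1, "Researcher A", "Sweden"),
        (2, "Researcher B", "Sweden"),
        (3, "Researcher C", "France"),
    ],
)
conn.executemany(
    "INSERT INTO interest VALUES (?, ?)",
    [(1, "seed irradiation"), (2, "flowering time"), (3, "seed irradiation")],
)

# Join people to their research keywords and filter on both tables.
rows = conn.execute(
    """
    SELECT c.name
    FROM colleague AS c
    JOIN interest AS i ON i.colleague_id = c.id
    WHERE c.country = ? AND i.keyword = ?
    ORDER BY c.name
    """,
    ("Sweden", "seed irradiation"),
).fetchall()
names = [name for (name,) in rows]
print(names)
```

The other requests in the list, such as searching images or drawing a map, are the ones that need the extended types and functions described in this section, since they depend on the database understanding the content of what it stores.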

Information on Object-Relational Databases

Although some uncertainties still surround the Illustra products, it is time to think about better approaches to tackling the complex and intuitive queries that biologists have been wanting to ask for years. Illustra is one of the few DBMS's that provides flexibility and reliability while maintaining the legacy of SQL-based projects. Furthermore, its set of tools allows us to create biology-related applications and databases that could be easier to learn, query, adapt and maintain.
More information is available on the WWW:

Object-Relational DBMS - The Next Wave: a paper from Michael Stonebraker, the founder of Illustra
The Illustra DBMS server: A document describing the features of Illustra's product.
Informix products: A list of current Informix products.

Other ORDBMS's:

UniSQL
Omniscience

Current Arabidopsis thaliana Database Prototype
The Arabidopsis thaliana Database Prototype, a work in progress, can now be used. As of July 1996, three types of information are available from the prototype: Colleague (names and addresses of Arabidopsis researchers from around the world), References (published literature on Arabidopsis), and YAC/Cosmid physical map walking results called "Tiling Paths". The first two areas can also be found in the ACEDB-based AtDB, but the Tiling Paths are only available from the prototype.
