Introduction

The approach of producing Expressed Sequence Tags (ESTs) was first used to characterize gene expression in the human brain five years ago. Since then, the technique has found widespread use as a tool for gene discovery and expression profile characterization in projects of various sizes worldwide. These range from massive-scale sequencing of human cDNA libraries at The Institute for Genomic Research and Washington University to much smaller and less well-funded projects on a large variety of organisms.

Through the work of Tom Newman at Michigan State University and a collaboration of French labs, there are more Arabidopsis EST sequences in the public database than for any other organism except human (Newman et al., 1994; Höfte et al., 1993). To date, over 24,000 EST sequences have been deposited in the public database, dbEST (Boguski et al., 1993). The latest summary of EST totals is available from dbEST.

The information management of sequence data is one of the more challenging aspects of the whole EST game. There are issues of volume, redundancy, sequence quality, and presentation of the data to be considered. One of the problems a user faces when accessing the EST databases is the large volume of data. This is related in part to the ease of relatively high-throughput sequence generation, and also to the redundancy inherent in any EST data set. Even in the best normalized cDNA libraries, transcripts of highly expressed genes are represented at a higher frequency than rare transcripts. As a result, transcripts from the same gene are sequenced multiple times, which lowers the efficiency of the sequencing project. Sometimes such duplicate sequences are not deposited in the databases. However, withholding them is not always desirable: although the gene itself may already be represented in the database by a previous EST, the new EST may contribute additional sequence information. This is often the case when ESTs are derived from the 5' end of a cDNA, because of the tendency for incomplete reverse transcription in the generation of cDNA libraries. Thus the database may hold many ESTs derived from the same gene.

Sequence quality is an important consideration in evaluating EST data. The price paid for the high throughput of EST projects is that each clone is sequenced only once. Thus the handling of sequence quality must occur on the informatics side of the projects. Most EST projects employ good quality control procedures to ensure that sequence quality is at an acceptable level before submission to the public database. However, ambiguous bases are still a fact of life when dealing with EST entries.
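As a rough illustration of what handling quality "on the informatics side" can mean, the sketch below measures the proportion of ambiguous bases in a single-pass read and applies a simple accept/reject rule. The function names, the 3% ambiguity cutoff, and the 100-base minimum length are assumptions chosen for the example, not values used by dbEST or by any particular EST project.

```python
def ambiguous_fraction(seq: str) -> float:
    """Return the fraction of bases that are not an unambiguous A, C, G, or T."""
    seq = seq.upper()
    if not seq:
        return 1.0  # treat an empty read as entirely unusable
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return (len(seq) - unambiguous) / len(seq)


def passes_quality(seq: str, max_ambiguous: float = 0.03, min_length: int = 100) -> bool:
    """Accept a read only if it is long enough and has few ambiguous bases.

    Both thresholds are hypothetical defaults for illustration.
    """
    return len(seq) >= min_length and ambiguous_fraction(seq) <= max_ambiguous


if __name__ == "__main__":
    read = "ACGT" * 30 + "NNNN"
    print(ambiguous_fraction(read), passes_quality(read))
```

A filter of this kind only flags reads; it cannot correct a miscalled base, which is one reason overlapping ESTs from the same gene remain useful.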

Access to the EST data for Arabidopsis is currently available in two locations: dbEST and the University of Minnesota's Arabidopsis cDNA Analysis Project. A detailed tutorial on the tools available at the University of Minnesota site appeared in a previous issue of Weeds World. In a nutshell, however, the user can search the EST database with either a sequence or a keyword to identify appropriate ESTs. The keyword search provides access to the results of prior BLAST searches of each EST against the public databases.

All of this information is very valuable, but the volume can sometimes become daunting because of the redundancy of the EST database. For example, both keyword and sequence searching for actin genes can produce large lists of ESTs. Partial results of a keyword search are shown here. Wading through that information can be particularly time-consuming, and provides little or no information about how many genes are represented by those numerous ESTs.

In an attempt to circumvent these sorts of problems, we have produced a database of Arabidopsis ESTs that complements the services already provided in dbEST and at the University of Minnesota. By taking the redundancy of the data and using it as a tool in and of itself, we have grouped together ESTs that have long stretches of overlapping, nearly identical sequence and aligned them into contigs or assemblies. The resulting consensus sequence is longer than any of the individual ESTs and likely represents a single gene. Thus the sequence information contained in numerous EST entries has been condensed into a single entity, allowing for a much smaller dataset that is a closer approximation to individual genes than the existing raw EST data.
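The idea of turning redundancy into contigs can be sketched with a toy greedy assembler: find two sequences where the end of one nearly matches the start of the other, merge them, and repeat until nothing else joins. This is a minimal sketch under stated assumptions, not the procedure used to build the database described here. The function names and the default thresholds are hypothetical, and the sketch ignores reverse complements, insertions and deletions, and ESTs contained in the middle of another sequence. It also keeps the left-hand read's bases in the overlap rather than computing a true consensus.

```python
def overlap_length(left: str, right: str, min_overlap: int, max_mismatch_rate: float) -> int:
    """Length of the longest suffix of `left` that nearly matches a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases stays within the
    allowed mismatch rate.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        suffix, prefix = left[-size:], right[:size]
        mismatches = sum(1 for a, b in zip(suffix, prefix) if a != b)
        if mismatches <= max_mismatch_rate * size:
            return size
    return 0


def assemble(reads, min_overlap: int = 20, max_mismatch_rate: float = 0.05):
    """Greedily merge overlapping reads until no further merge is possible."""
    contigs = list(reads)
    merged = True
    while merged:
        merged = False
        for i in range(len(contigs)):
            for j in range(len(contigs)):
                if i == j:
                    continue
                size = overlap_length(contigs[i], contigs[j], min_overlap, max_mismatch_rate)
                if size:
                    # Extend the left sequence with the non-overlapping tail of the right one.
                    joined = contigs[i] + contigs[j][size:]
                    contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
                    contigs.append(joined)
                    merged = True
                    break
            if merged:
                break
    return contigs


if __name__ == "__main__":
    # Made-up sequence standing in for a transcript; the reads are overlapping pieces of it.
    gene = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTTGCA"
    reads = [gene[0:25], gene[15:40], gene[30:], "T" * 24]
    for contig in assemble(reads, min_overlap=8, max_mismatch_rate=0.0):
        print(len(contig), contig)
```

In the example, three overlapping reads collapse into one longer sequence while the unrelated read is left alone, which is the sense in which an assembly is both smaller than the raw EST set and closer to a list of individual genes.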

