The European Arabidopsis Genome Sequencing Project

ESSA (European Scientists Sequencing Arabidopsis) Coordinator, Mike Bevan
Cambridge Lab, JICPSR, Colney Lane, Norwich, UK

The systematic sequencing of the Arabidopsis genome began last year. An EU-wide network, funded by the EC, was set up to sequence on a pilot scale 2 Mbp of chromosome 4 and 0.5 Mbp of other regions of genomic DNA. The partial sequencing of 3000 novel cDNAs, the development of a sequence Informatics Node and a contribution to the cost of preparing sequence-ready libraries were also funded. The project began last September and most labs were operational by the beginning of this year; the time is therefore ripe to provide UK scientists with both a review of progress to date and an idea of the future scope of this sort of work.

The largest contiguous region sequenced so far is 16 kb surrounding the GAP-A gene on chromosome 3. In addition to the GAP-A gene, 4 novel ORFs and a retrotransposon were identified, as well as a peculiar AT-rich tract. The density of genes in this area, together with the means of identifying ORFs, bodes well for the future of the programme; we can expect a gene density of 1 every 4-5 kb, close to that predicted. Based on a genome size of 100 Mbp, this indicates a total of 20,000-25,000 genes in Arabidopsis. Data are still coming in on the large regions of chromosome 4, as the participating labs have had to learn new methods of large-scale sequencing. Where the regions assigned to the 7 participants overlap, the independently determined sequences show a high degree of accuracy, a necessary precondition for integrated genome activities. In the next 3-4 months the first year's quota of 350 kb of chromosome 4 should be completed. A detailed analysis of this region will provide important new information on gene density, possible clustering of gene families, intergenic DNA composition, and the sequence of novel classes of plant genes.
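The gene-count estimate above is a simple extrapolation: genome size divided by the average spacing between genes. A minimal sketch of that arithmetic follows; the function and variable names are illustrative only, and the calculation assumes that the 4-5 kb spacing seen in the regions sequenced so far holds across the whole 100 Mbp genome.

```python
def estimate_gene_count(genome_size_bp: int, spacing_bp: int) -> int:
    """Estimate the total number of genes, assuming one gene per `spacing_bp`.

    Assumption (not established by the data): gene density is uniform
    across the genome.
    """
    return genome_size_bp // spacing_bp


GENOME_SIZE_BP = 100_000_000  # 100 Mbp, the genome size used in the text

# One gene every 5 kb gives the lower bound, one every 4 kb the upper bound.
low_estimate = estimate_gene_count(GENOME_SIZE_BP, 5_000)
high_estimate = estimate_gene_count(GENOME_SIZE_BP, 4_000)

print(f"Estimated gene count: {low_estimate:,} to {high_estimate:,}")
```

If later data from chromosome 4 suggest a different average spacing, the estimate scales in inverse proportion to it.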

A sequencing project is most interesting and useful if it aims to complete the genome in a reasonable time. It is clear that the EC has neither the resources nor the will to sequence the entire genome on its own; it is most properly a joint effort, both in terms of the relatively large sums required and in terms of distributing the effort fairly. Because of this, and in view of the unprecedented volume of valuable information to be obtained, US colleagues have agreed to join the EU effort from 1995 onwards. The EU network, once expanded, mainly by increasing the capacity of a few of its labs to sequence on a megabase scale, will join with a US network consisting of 3-5 major labs, and sequence 10 Mb each between 1996 and 1999. It is hoped that the remaining unknown areas of the genome, comprising about 60-70 Mb, will then be sequenced in the next five years, using the latest methods adopted from the human genome programme. The major rate-limiting step in this plan is the provision of sequence-ready libraries; present cosmid coverage amounts to only 80% of the regions to be sequenced in the next 2 years. The increased effort being put into YAC coverage, particularly on chromosomes 4 and 5, means that YACs have to be the main source of sequence substrates. Because of this, new methods for deriving random libraries from YACs are being investigated.

Further information can be obtained from

the ESSA Coordinator, Mike Bevan, bevan@bbsrc.ac.uk