HTML Markup provided by AtDB (now TAIR)
The Multinational Science Steering Committee:
Committee Chair: Gerd Jürgens, University of Tübingen,
Germany
Michael Bevan, John Innes Centre, Norwich, United Kingdom
Michel Caboche, Lab. Biol. Cellulaire, INRA, Versailles, France
Daphne Preuss, University of Chicago, Chicago, IL, USA
Joseph Ecker, University of Pennsylvania, Philadelphia, PA, USA
Fernando Migliaccio, CNR, Monterotondo, Italy
Kiyotaka Okada, Kyoto University, Kyoto, Japan
David Smyth, Monash University, Clayton, Australia
Marc Van Montagu, University of Ghent, Belgium
Preface
Overview of Genome Analysis
Stock Center Resources and Data Bases
National and Transnational Projects
Appendix 1: NSF Arabidopsis Genome Meeting Report
Appendix 2: Summary of December 1998 AGI Meeting at CSHL
Appendix 3: Database Workshop (Madison, WI 1998) Report
Appendix 4: "Arabidopsis thaliana Information Resource Project"
Announcement
The "Multinational Coordinated Arabidopsis thaliana Genome
Research Project" was established in 1990 to promote international
cooperation in basic and applied research with Arabidopsis, a
model plant species amenable to experimental manipulation in the
laboratory. The primary objective of this project has been to
understand the molecular basis of plant growth and development
and to address fundamental questions in plant genetics, physiology,
biochemistry, cell biology, and pathology. Initial plans were
outlined in a publication (NSF #90-80) drafted nine years ago
by an ad hoc committee of nine scientists from the United States,
Europe, Japan, and Australia. In recent years, this project has
become a model for widespread participation and effective coordination
of multinational research efforts in modern biology.
Arabidopsis thaliana, a small plant in the mustard family, was
chosen for this large-scale research effort because it offers
many advantages for detailed genetic and molecular studies. Among
these features are its small size, short life cycle, small genome,
ability to be transformed, availability of numerous mutations,
and prolific seed production. By concentrating research efforts
on a single model organism, detailed information on specific genes
and cellular processes can be readily obtained and rapidly applied
to a wide range of plants relevant to agriculture, health, energy,
manufacturing, and the environment.
Each year since 1990, the scientific steering committee for the
Arabidopsis Genome Project has prepared a progress report summarizing
recent advances in Arabidopsis research. This is the seventh annual
progress report published by the steering committee in conjunction
with the U.S. National Science Foundation. Three years ago the
report was a color brochure designed to explain the value and
significance of Arabidopsis research to a wide audience. Two years
ago the report presented a detailed overview of recent advances
in research with Arabidopsis, along with technical information
for use by members of the Arabidopsis community. The sixth report
presented an updated vision statement for the future to stimulate
further advances in the use of Arabidopsis as a model system for
the analysis of complex organisms.
This report covers progress for the seventh and eighth years of
the project. It is focused on the large-scale analysis of the
Arabidopsis genome. Specifically, this report is designed to make
the available information accessible to the scientific community
in a hands-on format. At the current rate of progress, the genome
sequencing project can be expected to be completed within two
years. The 1998 genome issue of Science (Meinke et al. 1998) featured
Arabidopsis prominently.
Multinational cooperation and communication continue to be an important feature of the Arabidopsis genome project. A brief overview of Arabidopsis research efforts in a number of participating countries is therefore included in this report. Additional information can be obtained through recent publications, electronic news groups and databases, and biological resource centers devoted to Arabidopsis research. As with any document that attempts to summarize the contributions of many individuals, this report may fail to include or misrepresent some significant achievements. The steering committee hopes that members of the Arabidopsis community will overlook such shortcomings and will communicate any concerns to committee members so that future reports will be as accurate as possible. We thank all members of the Arabidopsis community for their many contributions to the success of the initial phase of the Multinational Coordinated Arabidopsis thaliana Genome Research Project.
1983 | Publication of first genetic map |
1988-89 | Publication of RFLP maps |
1990 | Multinational Coordinated Arabidopsis thaliana Genome Research Project initiated |
1991 | Arabidopsis Stock Centers at Ohio State (USA) and Nottingham (UK), as well as the Arabidopsis Data Base (AtDB), were established |
1991 | First YAC libraries and anchoring of YAC clones to RFLP map |
1992 | Publication of first chromosome walk (local contig) |
1993 | Recombinant inbred (RI) map |
1994-8 | Collections of cDNA (EST) clones sequenced linking up genetic and cytogenetic with physical maps |
1995-6 | CIC-YACs, TAMU-BACs, IGF-BACs, Mitsui-P1, Kazusa-P1 libraries |
1995-8 | Physical map of all 5 chromosomes delineated |
Jan 98 | Publication of 1.9 Mb of contiguous DNA sequence from chromosome 4 |
June 98 | 29 Mb of genomic DNA sequenced |
Oct. 98 | Arabidopsis featured in genome issue of "Science" |
Dec 98 | >46 Mb of genomic DNA sequenced and annotated; 90 Mb of genomic DNA in edited BAC contigs; >41,000 (of 44,000) BAC ends sequenced; >11,000 non-redundant (of >37,000) EST clones |
2000 | Completion of genome sequencing (expected date) |
Two genetic maps were independently developed: a classic map of
mutations (Koornneef et al., 1983) and a recombinant inbred (RI)
map of molecular markers (Lister and Dean, 1993). As an increasing
number of genes originally identified by mutation has been cloned
and converted to molecular markers mapped onto the RI map, the
two maps are beginning to merge into a unified genetic map. Map
distances differ between the two maps, presumably because of the
different genetic backgrounds. In addition, map distances are
calculated with the Mapmaker program, resulting in local inaccuracies,
such as relative order of closely linked markers. These problems
will eventually be resolved by physical mapping.
The RI map is now commonly used as the standard reference, enabling
new genes identified by mutation to be easily mapped by PCR markers
(SSLP, CAPS). The current RI map (November 1998) contains ca.
800 markers which fall into 3 different categories: "framework"
(fixed reference location), "unique" (defined location
on the map) and "multiple" (several possible locations).
RI markers were also used to map a collection of YAC, BAC and
P1 clones from which physical maps of the 5 chromosomes were initiated,
thus linking genetic and physical maps from the very beginning.
Several physical maps have been established for all 5 chromosomes.
Initially, contigs of large YAC clones were assembled and anchored
to RI markers (e.g. Schmidt et al., 1997; Bouchez et al., 1998).
Corresponding BAC and P1 clones were identified by hybridisation
with YAC clones. For chromosome 5, a nearly complete physical
map was established by P1 and TAC clone contigs (Kazusa homepage;
Kotani et al., 1997). BAC contigs have also been established at
the global scale by fingerprinting and by hybridisation
with BAC endprobes. For example, 9 Mb constituting the bottom
arm of chromosome 3 have been covered by a single BAC contig (see
http://www.genoscope.cns.fr/externe/English/Projets/projetsindex.html).
In addition to whole-chromosome physical mapping with YAC, BAC
and P1 clones, chromosome walks in several chromosome regions
have yielded local contigs up to 2 Mb long (e.g. Hardtke &
Berleth, 1996; Wang et al., 1997; Thorlby et al., 1997), and several
hundred EST clones have been PCR-mapped onto YAC clones (Agyare
et al., 1997).
Fingerprinting data of BAC clones were used to assemble contigs
with FPC software, followed by manual editing to join the initial
contigs. At present, ca. 70 BAC contigs encompass ca. 90 Mb of
an estimated 121 Mb total sequence (M. Marra & M. Sekhon, Washington
University, St. Louis; M.A. Marra et al., 1997). High throughput
BAC-endprobe hybridization was used as a complementary approach
to assemble contigs (Mozo et al., 1998). Information gathered
from 2995 hybridization data (including 272 mapped markers) was
manually edited after application of the probeorder computer
program and integrated with the fingerprint data to generate a
complete BAC-based physical map consisting of 27 contigs distributed
over the 10 chromosome arms that covers approximately 124 Mb (see:
http://www.mpimp-golm.mpg.de/101/bac.html). As the genome sequencing
project progresses, many RI markers are being mapped physically,
resulting in an excellent alignment of genetic and physical maps
(see AtDB; see also integrated contig tables by Daphne Preuss
and colleagues at the CSHL website). This integration will undoubtedly
facilitate gene isolation by map-based cloning.
In addition to the unique-sequence regions of the chromosome arms,
both rDNA repeats (NORs on chromosomes 2 and 4) and centromeric
regions have been mapped genetically and physically. The centromeric
regions were mapped by tetrad analysis (Copenhaver et al., 1998)
and localized by in situ hybridization (Brandes et al., 1997).
Thus, an outline of the physical organisation of the nuclear genome
has emerged.
Agyare FD, Lashkari DA, Lagos A, Namath AF, Lagos
G, Davis RW, Lemieux B (1997) Mapping expressed sequence tag sites
on yeast artificial chromosome clones of Arabidopsis thaliana
DNA. Genome Res. 7: 1-9.
Brandes A, Thompson H, Dean C, Heslop-Harrison JS
(1997) Multiple repetitive DNA sequences in the paracentromeric
regions of Arabidopsis thaliana L. Chromosome Res. 5: 238-246.
Camilleri C, Lafleuriel J, Macadre C, Varoquaux F,
Parmentier Y, Picard G, Caboche M, Bouchez D (1998) A YAC contig
map of Arabidopsis thaliana chromosome 3. Plant J. 14:633-642.
Copenhaver GP, Browne WE, Preuss D (1998) Assaying
genome-wide recombination and centromere functions with Arabidopsis
tetrads. Proc. Natl. Acad. Sci. USA 95: 247-252.
Hardtke CS, Berleth T (1996) Genetic and contig map
of a 2200-kb region encompassing 5.5 cM on chromosome 1 of Arabidopsis
thaliana. Genome 39: 1086-1092.
Kotani H, Sato S, Fukami M, Hosouchi T, Nakazaki
N, Okumura S, Wada T, Liu YG, Shibata D, Tabata S (1997) A fine
physical map of Arabidopsis thaliana chromosome 5: construction
of a sequence-ready contig map. DNA Res. 4:371-378.
Marra MA, Kucaba TA, Dietrich NL, Green ED, Brownstein B, Wilson RK, McDonald KM, Hillier LW,
McPherson JD, Waterston RH (1997) High throughput
fingerprint analysis of large-insert clones. Genome Res. 7:
1072-1084.
Meinke DW, Cherry JM, Dean C, Rounsley SD, Koornneef
M (1998) Arabidopsis thaliana: A model plant for genome analysis.
Science 282: 662-682.
Mozo T, Fischer S, Maier-Ewert S, Lehrach H, Altmann
T (1998) Use of the IGF BAC library for physical mapping of the
Arabidopsis thaliana genome. Plant J. 16: 377-384.
Round EK, Flowers SK, Richards EJ (1997) Arabidopsis
thaliana centromere regions: genetic map positions and repetitive
DNA structure. Genome Res. 7: 1045-1053.
Sato S, Kotani H, Hayashi R, Liu YG, Shibata D, Tabata
S (1998) A physical map of Arabidopsis thaliana chromosome 3 represented
by two contigs of CIC YAC, P1, TAC and BAC clones. DNA Res. 5: 163-168.
Schmidt R, Love K, West J, Lenehan Z, Dean C (1997)
Description of 31 YAC contigs spanning the majority of Arabidopsis
thaliana chromosome 5. Plant J. 11: 563-572.
Thorlby GJ, Shlumukov L, Vizir IY, Yang CY, Mulligan
BJ, Wilson ZA (1997) Fine-scale molecular genetic (RFLP) and physical
mapping of a 8.9 cM region on the top arm of Arabidopsis chromosome
5 encompassing the male sterility gene, ms1. Plant J. 12: 471-479.
Wang ML, Huang L, Bongard-Pierce DK, Belmonte S,
Zachgo EA, Morris JW, Dolan M, Goodman HM (1997) Construction
of an approximately 2 Mb contig in the region around 80 cM of
Arabidopsis thaliana chromosome 2. Plant J. 12: 711-730.
More than 37,000 partial cDNA (EST) sequences have been deposited
in the public databases while the total number of genes is most
likely about 20,000. Building EST "contigs", i.e. larger
cDNA sequences from overlapping ESTs, reduces the number of ESTs
to those representing different genomic sequences (Rounsley et
al., 1996; Cooke et al., 1997). The current estimate of non-redundant
ESTs is about 11,000 or approximately half the total number of
genes.
Large-scale high-throughput genomic sequencing makes use of the
physical maps and the available BAC (TAMU, IGF), P1 and TAC (Mitsui,
Kazusa) libraries (see AGI). BAC, TAC and P1 clones are mapped
onto YACs, and their ends are sequenced to determine minimum tiling
paths for sequencing large regions. More than 41,000 BAC ends
(of a total of 22,000 BAC clones) have been sequenced, yielding
stretches of ca. 400 bp every 4 kb on average (total sequence
ca. 14 Mb). The largest contiguous region sequenced to date is
nearly 1.9 Mb long (Bevan et al., 1998). This region around FCA
on chromosome 4 contains 389 genes of which 46% could not be assigned
a putative function by sequence comparisons with the databases.
On average, one gene (ORF) was found every 4.8 kb, and similar
values were observed for other genomic regions (Quigley et al.,
1996; Sato et al., 1997; Kotani et al., 1997). For many ORFs no
corresponding EST was found in the databases. To identify expressed
genes within contig regions, a novel cDNA selection method has
been proposed (Seki et al., 1997).
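The gene-density figure quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the rounded region length of 1.9 Mb, so it gives about 4.9 kb per gene rather than the 4.8 kb quoted, which presumably rests on the exact region length. The function and constant names are hypothetical.

```python
# Figures quoted in the text for the region around FCA on chromosome 4.
REGION_BP = 1_900_000        # "nearly 1.9 Mb", rounded here (assumption)
GENE_COUNT = 389             # genes annotated in the region
UNASSIGNED_FRACTION = 0.46   # genes without a putative function


def kb_per_gene(region_bp: int, gene_count: int) -> float:
    """Average spacing between genes, in kb."""
    return region_bp / gene_count / 1000


def unassigned_genes(gene_count: int, fraction: float) -> int:
    """Approximate number of genes with no putative function."""
    return round(gene_count * fraction)


if __name__ == "__main__":
    print(f"{kb_per_gene(REGION_BP, GENE_COUNT):.1f} kb per gene")
    print(f"{unassigned_genes(GENE_COUNT, UNASSIGNED_FRACTION)} genes without an assigned function")
```

On these inputs, roughly 179 of the 389 genes have no putative function.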
Bevan M et al. (1998) Analysis of 1.9 Mb of contiguous
sequence from chromosome 4 of Arabidopsis thaliana. Nature 391:
485-488.
Cooke R, Raynal M, Laudie M, Delseny M (1997) Identification
of members of gene families in Arabidopsis thaliana by contig
construction from partial cDNA sequences: 106 genes encoding 50
cytoplasmic ribosomal proteins. Plant J. 11: 1127-1140.
Kotani H, Nakamura Y, Sato S, Kaneko T, Asamizu E,
Miyajima N, Tabata S (1997) Structural analysis of Arabidopsis
thaliana chromosome 5. II. Sequence features of the regions of
1,044,062 bp covered by thirteen physically assigned P1 clones.
DNA Res. 4: 291-300.
Quigley F, Dao P, Cottet A, Mache R (1996) Sequence
analysis of an 81 kb contig from Arabidopsis thaliana chromosome
III. Nucl. Acids Res. 24: 4313-4318.
Rounsley SD, Glodek A, Sutton G, Adams MD, Somerville
CR, Venter JC (1996) The construction of Arabidopsis expressed
sequence tag assemblies. Plant Phys. 112: 1177-1183.
Sato S, Kotani H, Nakamura Y, Kaneko T, Asamizu E,
Fukami M, Miyajima N, Tabata S (1997) Structural analysis of Arabidopsis
thaliana chromosome 5. I. Sequence features of the 1.6 Mb regions
covered by twenty physically assigned P1 clones. DNA Res. 4: 215-230.
Seki M, Hayashida N, Kato N, Yohda M, Shinozaki K
(1997) Rapid construction of a transcription map for a cosmid
contig of Arabidopsis thaliana genome using a novel cDNA selection
method. Plant J. 12: 481-487.
The AGI was established on August 20-21, 1996 when representatives
of six research groups (3 from USA and one each from EU, Japan
and France) committed to sequencing the Arabidopsis genome met
in Arlington, VA to discuss strategies for facilitating international
cooperation in completing the genome project. In order to avoid
duplication of efforts, the six groups of the Arabidopsis Genome
Initiative (AGI) agreed to focus on different regions of the genome
(Bevan et al., 1997, Plant Cell 9:476-487). In July 1998, the
members of the AGI met again in Arlington, VA to discuss progress
to date, to anticipate barriers to timely completion, and to establish
an oversight committee for the U.S.-based labs (see Appendix).
At present, the major sequencing domains of the AGI groups have
been assigned as follows:
Chromosome 1 (30 Mb) | SPP group (Stanford, PennU, PGEC) |
Chromosome 2 (14 Mb) | TIGR group |
Chromosome 3 (top - 13.5 Mb) | Kazusa group |
Chromosome 3 (top - 5 Mb) | TIGR group* |
Chromosome 3 (bottom - 9 Mb) | EU project chrom3 (coordinated by Genoscope) |
Chromosome 4 (top - 4 Mb) | CSHSC (CSH-WU-ABI group) |
Chromosome 4 (bottom - 13 Mb) | EU group (ESSA I, II, III) |
Chromosome 5 (top - 9 Mb) | EU group (ESSA III) |
Chromosome 5 (top + middle - 4 Mb) | CSHSC (CSH-WU-ABI group) |
Chromosome 5 (top + bottom - 17 Mb) | Kazusa group |
Sequencing is being done on BAC and P1 clones. Two different strategies
are pursued. Both the SPP group and the TIGR group have selected
nucleating sites ("seed BACs") around which BAC contigs
have been established by using BAC end sequences to select adjacent
clones with minimum overlap. This sequential sequencing procedure
involves 32 and 16 starting points on chromosomes 1 and 2, respectively.
The other sequencing strategy adopted by CSHSC, ESSA and Kazusa
involves building of BAC or P1/TAC tiling paths with minimum overlap
of adjacent clones ("sequence ready maps"). This procedure
requires more preparative work but once established, large regions
can be sequenced in parallel, e.g. by the several sequencing groups
within the ESSA group.
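The idea of a tiling path with minimum overlap can be illustrated with a small sketch. This is not the procedure used by any of the AGI groups. It is a generic greedy interval-cover routine, assuming each clone has already been placed on a physical map as a (name, start, end) interval in kb. The clone names and coordinates are hypothetical.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start_kb, end_kb) on a hypothetical map


def minimal_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[str]:
    """Greedy interval cover: at each step, among the clones that start at or
    before the position covered so far, pick the one that reaches furthest."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[str] = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Scan all clones that overlap or abut the covered part of the region.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at {covered} kb")
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical clones covering a 300 kb region.
example_clones = [
    ("F1", 0, 100),
    ("F2", 40, 130),
    ("F3", 90, 190),
    ("F4", 120, 210),
    ("F5", 180, 300),
]

if __name__ == "__main__":
    print(minimal_tiling_path(example_clones, 0, 300))
```

For the example clones, the routine selects F1, F3 and F5 and skips F2 and F4, which add no new coverage. Picking the furthest-reaching clone at each step keeps the number of clones, and hence the redundant sequencing, small. A real sequence-ready map would also have to allow for uncertain clone positions and chimaeric clones.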
Lists of clones selected for sequencing can be found on the web
sites of the sequencing groups. Start dates for sequencing are
indicated and it is agreed that the finished sequences will be
released within 4-6 months after the start of sequencing (for details,
see Appendix). The current state of genome sequencing is as follows
(for overview by chromosome region, see AtDB / Arabidopsis Sequencing
View and the homepages of the AGI groups):
Chr. | Est. Size (Target) | Completed: Clones | Completed: Mb | Clones | Mb | Clones | Mb | Clones | Mb |
1 | 30 Mb | 52 | 5.52 | 16 | 2.02 | 16 | 1.7 | 84 | 9.25 |
2 | 17 Mb | 107 | 10.29 | 45 | 4.32 | 27 | 2.64 | 179 | 17.25 |
3 | 22.2 Mb | 13 | 1.09 | 29 | 2.5 | 27 | 1.2 | 69 | 5.8 |
4 | 18.5 Mb | 153 | 11.71 | 51 | 5.2 | 28 | 2.6 | 231 | 19.4 |
5 | 29.2 Mb | 208 | 14.62 | 15 | 1.2 | 38 | 2.7 | 261 | 18.54 |
Total | 120 Mb | 534 | 43.32 | 156 | 15.2 | 136 | 11.8 | 826 | 70.3 |
Note that the total sequence entered into AtDB and summarised above includes overlaps between adjacent clones (except for sequences submitted by ESSA and WashU, from which almost all overlaps have been removed). For this reason the total number of clones sequenced is a better estimate of progress. With 10% overlap, 120 Mb will require 1,390 BAC clones. On 31 December 1998 the following finished clones had been deposited in GenBank:
SPP 50 BAC clones
TIGR 105 BAC clones
CSHSC 60 BAC clones
ESSA 117 BAC and cosmid clones
Kazusa 202 P1, TAC and BAC clones
Total 534 clones (approx. 39% of the total genome)
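The clone counts above can be tied back to the estimate of 1,390 clones with a short calculation. The sketch below makes one assumption not stated in the report: that "10% overlap" means the clones together carry 1.1 times the genome length. Under that reading the estimate implies an average insert of about 95 kb. All names are hypothetical.

```python
GENOME_MB = 120          # estimated genome size used in the report
OVERLAP = 0.10           # overlap between adjacent clones (interpretation assumed)
CLONES_NEEDED = 1390     # estimate quoted in the report

# Finished clones deposited in GenBank by 31 December 1998, per group.
FINISHED = {"SPP": 50, "TIGR": 105, "CSHSC": 60, "ESSA": 117, "Kazusa": 202}


def implied_insert_kb(genome_mb: float, overlap: float, clones: int) -> float:
    """Average insert size (kb) implied by the clone estimate."""
    return genome_mb * 1000 * (1 + overlap) / clones


def fraction_finished(finished: dict, clones_needed: int) -> float:
    """Share of the required clones that have been finished."""
    return sum(finished.values()) / clones_needed


if __name__ == "__main__":
    print(f"finished clones: {sum(FINISHED.values())}")
    print(f"implied insert size: {implied_insert_kb(GENOME_MB, OVERLAP, CLONES_NEEDED):.0f} kb")
    print(f"fraction finished: {fraction_finished(FINISHED, CLONES_NEEDED):.1%}")
```

The per-group counts sum to 534 clones, or a little over 38% of the 1,390 required, in line with the approximate figure given above.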
As of 31 December 1998, the AtDB Sequencing View displays 46 Mb
(39% of estimated 120 Mb genome size) of complete sequence. This
figure is 17 Mb higher than that given at the end of June 1998,
indicating that the current rate of sequencing is close to 3 Mb
per month for the entire AGI project. Taking into account the
sequences that have not been released, the actual amount of sequence
information is close to 55 Mb (almost 50% of the unique sequences).
It is thus a realistic goal to finish the sequence of the Arabidopsis
genome (excluding telomeric and centromeric regions as well as
NORs) by the end of the year 2000.
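The projection above can be reproduced as a simple linear extrapolation. The sketch below assumes that the sequencing rate stays constant at the value observed between June and December 1998, which is a simplification. The function names are hypothetical.

```python
def monthly_rate(mb_start: float, mb_end: float, months: float) -> float:
    """Average sequencing rate in Mb per month over the interval."""
    return (mb_end - mb_start) / months


def months_to_finish(genome_mb: float, done_mb: float, rate: float) -> float:
    """Months still needed at a constant rate."""
    return (genome_mb - done_mb) / rate


# 29 Mb at the end of June 1998, 46 Mb at the end of December 1998.
rate = monthly_rate(29, 46, 6)

if __name__ == "__main__":
    print(f"rate: {rate:.2f} Mb per month")
    # Counting only released sequence (46 Mb).
    print(f"from 46 Mb: {months_to_finish(120, 46, rate):.1f} months")
    # Counting unreleased sequence as well (about 55 Mb).
    print(f"from 55 Mb: {months_to_finish(120, 55, rate):.1f} months")
```

Starting from about 55 Mb, the remaining sequence would take roughly 23 months, which from the end of December 1998 points to late 2000. Starting from the 46 Mb of released sequence alone gives about 26 months.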
Completion of the sequence is defined as each chromosome arm between
subtelomeric repeats and centromeric repeats consisting of a single
fully sequenced contig. This excludes the rDNA repeats (NORs on
chromosomes 2 and 4, each of which accounts for ca. 3.5 Mb) and
other internal tandem repeat regions. For these regions, it will
be sufficient to sequence one repeat unit and to estimate the
repeat number at each site. By these criteria, sequencing of chromosomes
2 (14 Mb) and 4 (17 Mb) can be expected to be complete before
the end of 1999.
As sequencing is reaching the closing phase, boundaries between
sequencing domains have to be defined precisely to avoid duplication
of efforts by different sequencing groups. This difficulty has
already been encountered by all the sequencing groups, resulting
in duplication of sequences and mismapped clones (see table).
For example, on chromosome 4 both CSHSC and ESSA sequenced two
different but overlapping clones and had to reassign remaining
projects in a common region of ca. 900 kb. TIGR and SPP have abandoned
or mismapped at least 4 BACs and a chimaeric YAC, while Kazusa
has sequenced several duplicate clones on chromosome 5. Depending
on different rates of progress, it may seem advisable, in the
interest of the Arabidopsis community, to reallocate genomic regions
between the sequencing groups (see Appendix 1 and 2). The fingerprint
map constructed at Washington University and the hybridisation-based
map constructed by T. Altmann have the potential for delineating
these regions before they are sequenced, and will probably be
used for this purpose.
The seed stocks currently available from the two centers include
mutant lines (600), T-DNA lines and pools (30,000+), mapping strains,
the G. P. Rédei collection of mutants and research lines
(300+), the A. R. Kranz collection of mutants and ecotypes (700+),
transposon/transposase lines (100+), RI lines (3 populations),
ecotypes (400+), transgene lines and related species. The genetic
mapping resources of the centers and the T-DNA and transposon
resources complement the AGI sequencing efforts and the current
research focus on functional genomics.
DNA stocks of ABRC include cloned genes (200), RFLP mapping clones
(300+), expressed sequence tagged (EST) clones (30,000+), cDNA
libraries (7), a phage genomic library, YAC libraries (6), BAC
and P1 libraries used in genome sequencing (3) and two-hybrid
libraries (2). In addition, filters of BACs, P1s and YACs for
hybridization and isolated DNA from T-DNA populations (12,000
lines) are available.
The EST collection has been organized so that a set of 11,000 clones,
non-redundant on the basis of the sequences available to TIGR, is being
used by AGI. The 3' sequences of these clones are being analyzed
by the MSU EST project to further eliminate redundancy. Copies
of BAC and P1 clones, for which sequences have been published,
are being sent to many research laboratories. In this connection,
ABRC requests that all sequencing projects adhere, if at all possible,
to the agreed clone-naming conventions when publishing sequences
so that researchers can identify, without confusion, the proper
clones to obtain.
NASC and ABRC are working to enlarge the collections of characterized
mutants and clones. In addition, it is expected that large numbers
of T-DNA lines will be received so that, within the next year,
the available T-DNA lines will represent essential saturation
of the genome. In connection with the accumulating genomic and
cDNA sequence information, these resources will prove invaluable
to the research community. In addition, new transposon-tagged
populations, recombinant inbred mapping populations, a tetrad
mapping population and GFP lines are being incorporated into
the collections.
The Nottingham Arabidopsis Stock Centre (NASC) curates the Lister
and Dean RI maps that were originally developed and maintained
by Clare Lister and Caroline Dean (JIC, Norwich). NASC also offers
a weekly community mapping service. Anyone can submit data to
NASC for mapping using the specially designed data submission
form. The positions of all markers mapped at NASC are made publicly
available through the NASC WWW server, the Arabidopsis Genome
Resource and AtDB. For private mapping, all the marker scores
are available from NASC. However, the aim for the community is
to have as many markers as possible placed on the canonical map
and so the submission of mapping data for inclusion on the RI
map is appreciated.
The Arabidopsis node of the BBSRC-funded UK-Crop Plant Bioinformatics Network (UK-CropNet), based at NASC, has established the Arabidopsis Genome Resource (AGR). AGR is being developed as a repository of Arabidopsis data of value in the comparative analysis of plant genomes and as an essential tool to aid in the cloning of homeologous genes of agronomic importance.
Comparative analysis in plants relies upon genetic and physical mapping of common probes between species. To this end AGR has made available the YAC physical maps of chromosomes IV and V (from C.Dean, R.Schmidt, M. Stammers). AGR also includes the Recombinant Inbred Maps from NASC integrated with the AGI sequence template clones (locations provided through AtDB). Arabidopsis nucleotide sequences are also included within AGR.
Integrating these data sets is the next key step in the development
of AGR. Sequence overlaps between completed AGI clones define contigs
of BACs and P1s. These contigs will be fixed to the YAC physical
maps using the results of BAC-YAC hybridisations. Contigs may
be anchored on the RI maps through the nearest marker information
from individual clones. RI maps and YAC physical maps are to some
extent integrated through the use of some RI markers as probes
in YAC physical mapping.
In collaboration with Martin Trick (John Innes Centre), these
data will be used to generate comparative map displays between
Arabidopsis and the Brassicas.
Contact Persons
Randy Scholl, ABRC email: scholl.1@osu.edu
Mary Anderson, NASC email: arabidopsis@nottingham.ac.uk
Web sites
ABRC: | http://aims.cps.msu.edu/aims/ |
NASC: | http://nasc.nott.ac.uk/ |
AtDB: | http://genome-www.stanford.edu/Arabidopsis/ |
The Arabidopsis Data Base (AtDB) is, at this time, located at Stanford University, Mike Cherry, P.I. The explosion of data, both genomic and biological, makes it clear that the data base as it now exists is operating at a minimal, not an optimal, level. The recognition that the community had to express its needs in a more concrete way resulted in two workshops addressing the issues of database composition and management. One was held in 1993 in Dallas, TX and that report can be accessed at http://genome-www.stanford.edu/Arabidopsis/db/dallas.report.html.
However, a more recent workshop on the same topic was held at
the international meeting at Madison, WI in 1998 and that report
is attached as an appendix. The needs are for a central database
with links to other useful databases and information which is
organized in a user-friendly fashion. Recognition of the needs
of the Arabidopsis community as well as other interested communities
has resulted in a call for proposals to the NSF titled "Arabidopsis
thaliana Information Resource Project (AtIR)". The deadline
date is March 22, 1999 and a copy of that announcement is attached
to this report as an appendix.
Recommendation on information management
Large-scale genomic sequencing has reached a critical stage, with
about half the genome in hand. Although the AGI sequencing groups
provide information for specific regions of chromosomes, it is
difficult and time-consuming for the Arabidopsis community to
retrieve the relevant information. To take full advantage of all
the progress that has been made in the analysis of the Arabidopsis
genome, it will be necessary to establish a well-funded unified
genome database that displays sequence and related features together
with biological information in a user-friendly way.
Australia
Arabidopsis research in Australia is focused on building an understanding
of fundamental aspects of plant biology. There is no direct commitment
to large scale genome sequencing at this stage.
Among recent highlights, Liz Dennis, Jim Peacock and colleagues
from CSIRO Division of Plant Industry in Canberra have discovered
a second nonsymbiotic leghemoglobin gene from Arabidopsis (Proc.
Natl. Acad. Sci. USA 94, 12230-12234, 1997). They propose that
all plants have two classes of leghemoglobins, as exemplified
by the two genes in Arabidopsis. In the evolution of symbiosis,
the product of one or other of the genes has been recruited on
different occasions to play a new role in association with the
symbiont. In most cases class 1 gene products have been involved,
but the newly discovered class 2 proteins are also potentially
symbiotic.
Another highlight has been the discovery of a gene encoding the
catalytic subunit of cellulose synthase (Science 279, 717-720,
1998). Tony Arioli and colleagues in Richard Williamson's research
group in the Research School of Biological Sciences at ANU in
Canberra have walked to the locus of a temperature sensitive mutant
that leads to root swelling (RSW1). The gene that complements
the mutant phenotype is related to a cellulose synthase subunit
gene from cotton. In the mutant there is widespread accumulation
of beta-1,4-glucan but it is not crystallised into microfibrils,
suggesting such assembly is a role of the RSW1 gene product.
Other active programs include studies of various aspects of flowering,
from induction through floral organ morphogenesis to fertilisation
and seed development. Topics as diverse as aspects of photosynthesis,
analysis of effects of abiotic stresses including heavy metals
and UV, epigenetic effects of cytosine methylation, and the roles
of the MYB gene family are also being actively investigated.
A major commitment is being made to host the 10th International
Conference on Arabidopsis Research in Melbourne from 4-8 July
1999. A Regional Advisory Committee, with colleagues from Japan,
South Korea, Singapore and New Zealand, has been set up to give
the meeting a Western Pacific focus. This will be the first
time the Arabidopsis community has met outside Europe and North
America, and we look forward to welcoming scientists and students
to Australia where plant science continues to thrive.
Contact Person: David Smyth, Monash University, Melbourne
E-mail Address: David.Smyth@sci.monash.edu.au
Belgium
As Belgium is a federal country we have both federal and Flemish initiatives to support research using Arabidopsis thaliana as the experimental organism.
A Flemish project is running on the isolation and characterization of new ethylene mutants in Arabidopsis thaliana. This project aims at the isolation of a new series of mutants in the ethylene signal transduction pathway. A combined morphological, physiological and molecular-genetic approach will elucidate a number of previously unknown elements and will provide better insight into the control of plant development by this hormone.
The Belgian government also stimulates interactions between the different
universities. Within this framework, a project is running between the universities
of Gent, Antwerp, Brussels and Liège on the growth and
development of higher plants. Many external factors, such as light
intensity, light quality, temperature, the availability of nutrients
and the interaction with pathogenic organisms, greatly influence
the growth and development of higher plants. Current knowledge
of the molecular processes that control growth and development
is still very limited. The national network aims to contribute
to developmental biology by studying a limited number of aspects
of plant development. Wherever possible, Arabidopsis thaliana
will be used as a model plant. Key projects include the identification
and cloning of key regulatory genes involved in leaf morphogenesis
and the molecular analysis of the formation of syncytia (large feeding
cells) in nematode-infected Arabidopsis roots. The Flemish community
also supports these projects.
Contact Person: Nancy Terryn /Marc Van Montagu, University of Ghent
E-mail Address: nater@gengenp.rug.ac.be
China
Research using Arabidopsis as a model system was further
established in China at national research institutes and universities
in the past year. The research areas mainly include biosynthesis
of amino acids, signal transduction and metabolism of plant hormones,
cell wall formation, seed storage proteins, response to environmental
stresses, isolation of various mutants affecting growth and development,
and characterization of transposable elements. Interest in reverse
genetics and functional genomics has also greatly increased, with
a focus on gene targeting, constructing a large transgenic
population with mapped Ds elements randomly distributed at a high density,
developing an expression library for in planta transformation, and establishing
cDNA arrays to monitor gene expression and identify functional
genes. Grants to support the research projects mentioned above
come mainly from the National Natural Science Foundation of China,
the Chinese Academy of Sciences and the Hong Kong Research Grant Council/UPGC
Grant HKU.
Contact Person: Jiayang Li, Institute of Genetics, Chinese Academy of Sciences
E-mail Address: jyli@ss10.igtp.ac.cn
France
Genome sequencing
During the last year, three French laboratories (M. Delseny/Perpignan,
M. Kreis/Orsay and R. Mache/Grenoble) have systematically sequenced
three BACs (300 kb) as part of the EU-ESSA II Program. Delseny's
group has also continued to sequence cDNA clones corresponding
to the 60 kbp locus, Em1, on chromosome 3. A French sequencing
center, Genoscope CNS, has been created, and part of its activity
is devoted to sequencing the Arabidopsis genome. In collaboration
with TIGR and UPenn, Genoscope is generating end sequences from
all 23,000 BAC clones from the TAMU and IGF libraries to expedite
the selection of clones with minimal overlap with those already
sequenced. Genoscope is also coordinating a new EU project aimed at
sequencing the lower arm of chromosome 3 (9 Mb). This project involves
16 sequencing groups. The goal for Genoscope and three academic
French laboratories is about 2 Mb.
Synteny with other genomes
A program was developed between INRA Rennes and Versailles groups
to identify consensus markers between rapeseed and Arabidopsis
for a number of agronomically relevant genes. A collaboration
between laboratories in Perpignan, Davis and Poznan has found
synteny between five adjacent genes in the chromosome 3 Em locus
of Arabidopsis and genes in B. oleracea, B. nigra and B. rapa.
The EU program EuDicotMap has started to select highly conserved
ESTs of rice and Arabidopsis and to map them in Arabidopsis as
well as important European crops in order to identify synteny
blocks between different families.
Generation of insertion lines and reverse genetics screenings
INRA-Versailles has now generated more than 38,000 T-DNA mutant
lines. Screening of the collection is being done via a coordinated
effort between INRA, CNRS and various European laboratories. Out
of approximately a hundred target genes selected for the screen,
insertions were identified in 50%. The systematic characterization
of flanking sequence tags of insertions in over a thousand mutants
has now begun. About 11,000 lines will be donated to NASC by the
beginning of next year.
A summary of Arabidopsis genes under study
Research in many areas of plant genetics and biology is being
actively pursued in French laboratories. Plant hormone and signal
transduction, cell wall, secreted and membrane proteins, metabolism,
development, and plant pathogen interactions are being investigated
in laboratories throughout France.
Contact Person: Michel Caboche, INRA Versailles
E-mail Address: caboche@tournesol.versailles.inra.fr
Germany
Arabidopsis research is still increasing in scope at universities
and research institutions. The national research program on "Arabidopsis
as a model for analysing plant development" is in its
final two-year funding period. Because of its tremendous success,
an initiative has been made by Arabidopsis researchers to establish
a new program focusing on plant cell biology. Another six-year
national research program on plant hormones, to start in 1999, includes
several groups working on Arabidopsis. Besides these programs,
Arabidopsis research is funded within European projects
and by DFG grants on an individual basis or as part of local research
programs.
Several Arabidopsis projects are related to genome research.
ZIGIA, a program operated at the Max-Planck-Institut in Cologne,
aims at functional analysis through gene inactivation by transposon
insertion. High-throughput end-probe hybridization of BAC clones
from the IGF library was done at the Max-Planck-Institut in Golm.
These data were integrated with information made available by
other groups to assemble a complete BAC-based physical map of
the Arabidopsis genome. Projects on transcript profiling
have been initiated at the DKFZ in Heidelberg, the MPI in Golm
and the IPK in Gatersleben. The Federal Ministry of Education
and Science (BMBF) has made a call for proposals within a newly-established
Plant Genome Analysis program (GABI). A joint Arabidopsis proposal
involving 32 projects from 27 different institutions has been
submitted, aiming at a functional analysis of the genome.
An EMBO (European Molecular Biology Organisation) Course held
at the Max-Planck-Institut in Cologne in May 1998 entitled "Molecular
and Biochemical Analysis of Arabidopsis" was attended by
16 participants representing 13 European countries. The course
covered the theoretical and practical aspects of forward and reverse
genetics, genetic and physical mapping, transformation, transient
gene expression, in situ hybridisation, cell biology, physiology,
the yeast two-hybrid system, complementation of yeast mutants
and bioinformatics over an eleven-day period. EMBO Course seminars
from ten invited speakers were integrated with a two-day meeting
of the national Arabidopsis research program.
Contact Person: Gerd Jürgens, Universität Tübingen
E-mail Address: gerd.juergens@uni-tuebingen.de
Italy
Research in Italy with Arabidopsis is growing. About twenty
laboratories are currently conducting research on this
model system. Investigations cover: plant-pathogen relationships,
expression of PG and PGIF genes, role of rolB and rolD in plant
differentiation, HD-ZIP transcription factors in plant morphogenesis,
complementation of yeast by Arabidopsis genes, selection of Ca2+
and K+ transport mutants, genes involved in heat and cold resistance,
myb transcription factors, genes of the polyamine pathway, induction
of nodulin genes in plants by Rhizobium, use of antisense RNA
to inhibit nitrogen transport, and study of agravitropic mutants under
Earth and microgravity conditions (ESA-ASI projects). Financial support
for this research comes from different sources, e.g. the
National Research Council, the Ministry of Agriculture, the European
Framework IV Programs, the ESA-ASI Space Programs, and a few other
national agencies. Research groups are located both in universities
and in national institutes (National Research Council, ENEA, National
Institute of Nutrition). The Italian association of researchers
interested in Arabidopsis (ARABITALIA) met for the first time
in September 1997 in Abbadia di Fiastra (Macerata, central Italy).
On this occasion, the scientists present at the meeting reported
on their Arabidopsis investigations and projects, and
a booklet with information about research on Arabidopsis
in Italy was distributed. Some young Italian
researchers working in foreign countries (USA and UK)
also reported on their recent investigations. The 1998 annual
meeting was held at the end of September in Viterbo (central Italy)
on the occasion of the EUCARPIA Symposium on plant breeding.
A document is in preparation on the state of Arabidopsis research
in Italy and on the actions that can be taken to obtain
the financial support needed to foster it.
Contact Person: Fernando Migliaccio, CNR (Monterotondo)
E-mail Address: miglia@nserv.icmat.mlib.cnr.it
Japan
Arabidopsis research is well-established in Japan. The
number of laboratories using the model plant for research and
education is still increasing gradually in universities, national
institutes, and private companies. Areas of research range widely
from developmental biology, metabolic regulation, gene
expression, environmental stress signaling, and DNA methylation
to large-scale DNA sequencing. Research results
were reported at international meetings such as the "International
Congress of Arabidopsis Research" in Madison, WI, and the "Joint
Meeting of Japanese and American Societies of Plant Physiologists"
in Vancouver, BC, and at national meetings, especially the "Workshop
on Arabidopsis Studies", an annual meeting. The 8th workshop
was organized by Kazuo Shinozaki, Minami Matsui, Yuji Kamiya,
and Richard E. Kendrick from October 11 to 13, 1997, at Riken
Institute at Wako city, Saitama. The workshop was held jointly with
the Frontier Research Forum, "Recent Progress of Plant Hormone
Research in Arabidopsis". We had nearly 250 participants,
20 poster presentations, and 37 speakers including 7 guest speakers
from abroad. The 9th workshop was held at the Kazusa Academia Center
from Nov. 19 to 20, 1998. The workshop, organized by Satoshi Tabata,
had nearly 300 participants, 40 poster presentations and 11 presentations.
Topics of the presentations included systematic genome analyses,
patents, and post-genome strategies, as well as mutant analyses, gene
cloning, and newly developed techniques.
The Japanese Arabidopsis communication network, nazuna-net,
started in January 1995 and now includes 442 members (Sept. 1998)
from 99 organizations including 17 private companies (contact:
Dr. Takayuki Kohchi: kouchi@bs.aist-nara.ac.jp). A large-scale
genome sequencing project has made extensive progress at the Kazusa
DNA Research Institute in coordination with the Multinational
Arabidopsis Genome Initiative (contact: Dr. Satoshi Tabata: tabata@kazusa.or.jp).
Nearly 12.5 Mb covering 174 P1 clones have been sequenced and
reported in the journal "DNA Research" (contact:
http://www.uap.co.jp) and
on a homepage (http://www.kazusa.or.jp/arabi/). The Sendai Seed
Stock Center (SASSC) has been operated by Dr. Nobuharu Goto (n-goto@ipc.miyakyo-u.ac.jp)
since 1993.
Contact Person: Kiyotaka Okada, Kyoto University
E-mail Address: kiyo@ok-lab.bot.kyoto-u.ac.jp
The Netherlands
The Dutch Arabidopsis groups organized their annual meeting in
Utrecht on February 19, which was attended by approximately 80
participants. Arabidopsis groups are located at the Universities
of Leiden, Utrecht and Wageningen and at CPRO-DLO in Wageningen.
Important research topics are recombination, auxin action and
apoptosis in Leiden (Hooykaas); sugar sensing (Smeekens),
root development (Scheres) and acquired resistance (van Loon) in Utrecht;
and, in Wageningen, embryogenesis (de Vries, van Lammeren), flowering
and seed development (Koornneef), and transposons, genome sequencing,
plant disease resistance genes and developmental biology (Stiekema,
Pereira, Angenent, Groot, all CPRO-DLO). The groups collaborate
through their involvement in graduate schools and EU programs.
Contact Person: Maarten Koornneef, Agricultural University Wageningen
E-mail Address: Maarten.Koornneef@BOTGEN.EL.WAU.NL
Spain
No special funding programme supports Arabidopsis research
in Spain. However, more than 20 research groups are currently
active in research with this organism, mainly funded by the National
Biotechnology Programme, Basic Research Programmes, and the European
Union BIOTECH Programme. Some of these groups are involved in
large-scale genome sequencing and function search, especially in
the case of the Myb family of transcription factors. Spanish
groups interested in Arabidopsis development are mainly
focused on seed, leaf and flower development, and flowering induction.
This area is seeing the incorporation of new groups of Arabidopsis
users, some of them also interested in cell differentiation. In
the area of plant physiology and metabolism some topics that have
seen significant contributions during the year are the study of
secondary metabolism, the identification of new elements in the
signal transduction pathways involved in different environmental
stress responses, and the analysis of sulfur and phosphate assimilation.
Arabidopsis has also been increasingly used for studies
in plant pathogen interactions to identify new elements in the
response signal transduction pathways.
The Spanish Arabidopsis network, funded by the National Biotechnology
Programme, generated a collection of 10,000 T-DNA lines that is
being actively used in mutant screenings at both the phenotypic
and DNA levels in many laboratories. This network, which includes
all the Spanish laboratories working with Arabidopsis, is
now discussing future joint activities. Many more Spanish scientists
are currently involved in Arabidopsis research in other
laboratories around the world. Their successful integration into
the Spanish R&D system would help to steer
the field and increase the contribution of our country.
Contact Person: José Martinez Zapater, Centro Nacional de Biotecnología (Madrid)
E-mail Address: zapater@cnb.uam.es
United Kingdom
There are over 190 projects at present in the UK involving Arabidopsis.
The European Commission continues to be a major source of funding
and the newly announced Framework V programme is due to begin
calls for proposals. Although there are no longer any special
initiatives aimed specifically at Arabidopsis research, the Biotechnology
and Biological Sciences Research Council (BBSRC) funds projects
through competitive grants and special initiatives, contributing
approximately £6.8m to Arabidopsis research in the
UK.
An Arabidopsis Gene Function Search Network is currently under
development by Mike Bevan at the John Innes Centre. This is a
network of consortia (groups of labs with a common goal) being
brought together to carry out large-scale screening programmes
to determine the functions of the very large numbers of genes being revealed
by the genome project.
The Genetical Society of Great Britain chose Arabidopsis as the subject area for their annual autumn meeting in 1997. The Mendel Lecture was given by Elliot Meyerowitz, who was preceded during the day by Mike Bevan, Rob Martienssen, Joe Ecker, Ben Scheres, Caroline Dean, Gerd Jürgens and Brian Staskawicz. "Arabidopsis thaliana: Big Ideas from a Small Plant" was such a success that the Society has decided to host a biennial conference on Arabidopsis.
An EMBO (European Molecular Biology Organisation) Course held
at the John Innes Centre in May 1997 entitled "Arabidopsis
as an Experimental Organism" was attended by 12 participants
representing seven European countries. The course covered the
theoretical and practical aspects of mutant screening, genetic
and physical mapping, plant pathology, microscopy, biolistics,
the yeast two-hybrid system, and sequence fragment and data analysis
over a ten-day period, which also included seminars from ten invited
speakers.
The Chelsea Flower Show judges awarded a prestigious Silver Medal
to the John Innes Centre Science Communication and Education Department
exhibit, entitled "Arabidopsis - a Wonderful Weed".
The exhibit demonstrated how Arabidopsis is used to recognise
genes of agronomic importance in agricultural crops. The public
exposure and media coverage the display attracted in the UK and
abroad have helped to increase awareness of the importance of plant
molecular biology.
In the last year the Nottingham Arabidopsis Stock Centre (NASC)
in collaboration with the Arabidopsis Biological Resource Center
(ABRC) has continued to accumulate the broadest possible range
of stocks to provide the best platform of genetic diversity and
genetic tools for the investigation of this model system. Currently
NASC maintains and distributes over 20,000 accessions of Arabidopsis
to the research community. New stocks generated within the UK
and shortly to be made available include the first 10,000 of the
Sainsbury Laboratory Arabidopsis transposants (SLAT) lines
(Jonathan Jones, Sainsbury Lab, UK), 100 GFP lines (Jim Haseloff,
Cambridge, UK) and a Recombinant Inbred population of Nd (Niederzenz)
x Columbia generated by Eric Holub, Jim Beynon and Ian Crute (HRI
Wellesbourne, UK).
Contact Person: Caroline Dean, John Innes Centre, Norwich
E-mail Address: caroline.dean@bbsrc.ac.uk
United States
Arabidopsis research continues to flourish in both academic and
corporate laboratories in the United States. One of the most obvious
indicators of the value of information that can be gleaned from
Arabidopsis research has been the establishment of several
genomics companies that are exploiting Arabidopsis genetics.
Thanks to continued support from the National Science Foundation (NSF), the
Department of Energy (DOE) and the U.S. Department of
Agriculture (USDA), the Arabidopsis genome
is on track to be completely sequenced by the end of 2000.
A total of 46 Mb of finished sequence had been deposited in public
databases as of January 1999, of which the US sequencing groups
contributed more than 24 Mb. Importantly, the groups in the
US Arabidopsis Genome Initiative (AGI) finished the first phase
of their sequencing effort in less than the three years originally
allotted, and could thus begin the second phase
of sequencing ahead of schedule during 1998. In addition to its value for database
mining and other more traditional genomic approaches, the availability
of large amounts of genome sequence, together with physical maps
that cover almost the entire genome, has begun to eliminate positional
cloning as a bottleneck in Arabidopsis genetics. Much of
this information is conveniently accessed through the
Arabidopsis thaliana database (AtDB) at Stanford University.
The growing importance of Arabidopsis research has also
been evident in the increasing number of participants at the Eighth
and Ninth International Conferences on Arabidopsis Research,
which were held in Madison, WI, and drew 817 and 998 participants,
respectively.
Apart from the genome sequencing efforts, important tools are
being developed for reverse genetics and functional genomics.
A significant advance in this area has been an $8.7M award from
the NSF Plant Genome Research Program for a cooperative effort
to provide high-throughput gene expression profiling as well as
gene knockout services to the Arabidopsis community. The identification
of gene knockouts has been made possible through the availability
of large numbers of T-DNA insertion lines, of which 48,500 have
already been deposited with the Arabidopsis Biological
Resource Center (ABRC) at Ohio State University. This number can
be expected to at least double in 1999. The ABRC continues to
be an important resource for the Arabidopsis community. It shipped
29,500 seed and 13,000 DNA stocks in 1997; and 46,500 seed and
16,000 DNA stocks in 1998.
As a direct consequence of the improvements in scientific infrastructure,
significant scientific advances have been made in every area of
Arabidopsis research, including hormone and light signaling,
circadian clock, responses to biotic and abiotic stress and developmental
biology. Some of the most noteworthy discoveries in 1998 included
the discovery of master regulatory genes that protect Arabidopsis
from cold damage and the identification of proteins that transport
auxins.
Contact Person: Detlef Weigel
E-mail Address: detlef_weigel@gm.salk.edu
Contact Person: Jeff Dangl
E-mail Address: dangle@email.unc.edu
NSF ARABIDOPSIS GENOME MEETING REPORT
INTRODUCTION
In 1990, a report entitled "A Long-range Plan for the Multinational
Coordinated Arabidopsis thaliana Genome Research Project"
was published by the National Science Foundation (NSF 90-80).
The report detailed plans made by members of the Arabidopsis
research community in the U.S. and abroad, to collaborate in the
sequencing of the genome of this model plant, and to characterize
the structure, function and regulation of all Arabidopsis
genes. In 1998 it became possible to set a realistic goal of
finishing the sequence by the end of the year 2000.
Since then, a multinational genome sequencing project involving
laboratories in the United States, in Europe, and in Japan, has
been engaged in achieving this goal. This report is the proceedings
of a meeting held to discuss progress to date, to anticipate barriers
to timely completion, and to establish an oversight committee
for the U.S.-based labs. The meeting was held at the National
Science Foundation in Arlington, Virginia on July 9 and 10, 1998.
Participants                                            Representing
Elliot Meyerowitz, California Institute of Technology   Chair
Ian Bancroft, John Innes Centre                         ESSA
Michael Bevan, John Innes Centre                        ESSA
Ellson Chen, Perkin-Elmer Applied Biosystems            CSHSC
Ronald Davis, Stanford University                       SPP
Nancy Federspiel, Stanford University                   SPP
Gerd Jürgens, University of Tübingen                    MSC
Richard McCombie, Cold Spring Harbor Laboratory         CSHSC
Rob Martienssen, Cold Spring Harbor Laboratory          CSHSC
David Meinke                                            Arabidopsis community
Xiaoying Lin, TIGR                                      TIGR
Curtis Palm, Stanford University                        SPP
Daphne Preuss, University of Chicago                    Arabidopsis community
Francis Quetier, Genoscope                              Genoscope
Steven Rounsley, TIGR                                   TIGR
Marcel Salanoubat, Genoscope                            Genoscope
Satoshi Tabata, Kazusa                                  Kazusa
Athanasios Theologis, USDA Plant Gene Expression Ctr.   SPP
Richard Wilson, Washington University                   CSHSC
Mary Clutter                                            NSF
Machi Dilworth                                          NSF
DeLill Nasser                                           NSF
James Tavares                                           DOE
Jane Peterson                                           NIH
Adam Felsenfeld                                         NIH
Peter Bretting                                          USDA
Liang-Shiou Lin                                         USDA
STRUCTURE AND PROGRESS
There are six different sequencing consortia participating in
the sequencing phase of the Arabidopsis genome project,
three from the United States, two from the European Community,
and one from Japan. Each is sequencing a different region of
the genome, and each has its own model for distribution of the
necessary work among consortium members. The progress of each
follows, taking them in turn.
TIGR (The Institute for Genomic Research,
http://www.tigr.org/tdb/at/at.html)
TIGR has taken on two aspects of the sequencing project. The
first is BAC end sequencing (along with SPP and Genoscope), to
provide one-pass sequences of both ends of the 22,000 BAC clones,
one of the clone types being used for sequencing in the genome
project. The purpose of this is to allow sequential progression
from a single sequenced BAC to the two adjacent genomic regions
with minimal overlap. TIGR has sequenced 16,392 BAC ends from
a total of 9,572 BAC clones, providing a total of 7.34 Mb of BAC
end sequence. The total BAC end sequence from all three groups
is 36,574 BAC ends from 18,746 clones, representing 13.64 Mb.
The second TIGR project is the sequencing of chromosome 2. They
have chosen 16 well-spaced starting points (by use of the Goodman
lab chromosome 2 contig map), and are sequencing BAC clones in
parallel, starting with the original clone in each location, and
proceeding by use of BAC end sequences to adjacent clones with
minimal overlap. The average overlap between adjacent BAC clones
has been 8.2 kb, with a range from 150 bp to 30 kb. At present
4.83 Mb is complete and annotated, 3.25 Mb has shotgun sequencing
or annotation in progress, and 1.38 Mb of BAC clones are in preparation
for sequencing, for a total of 9.46 Mb.
The only problem encountered so far is a gap with no clones to
cross it in present BAC collections, in the m336 large contig.
Fiber FISH done at the University of Wisconsin indicates a gap
size of 500 kb, and the sequence at either side of the gap shows
no special features. There has also been a BAC difficult to close
due to long tandem dinucleotide repeats, but there is no theoretical
barrier to completion of such clones.
The total estimated length of chromosome 2 is less than 14 Mb,
not including an estimated 3.5 Mb of ribosomal DNA tandem repeats
at one end of the chromosome. The current rate of sequencing
in this phase of the project at TIGR is 8 Mb per year,
and there is an existing proposal to increase that to 12 Mb per
year. It is estimated that, barring unforeseen problems, chromosome
two, excluding highly repetitive centromeric regions and the rDNA
repeats, will be completed by the end of 1999; if the full capacity
is to be used, clones on other chromosomes will have to be started
by the end of 1998.
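The chromosome 2 figures quoted in this section can be cross-checked with a short calculation. The Python sketch below is illustrative only: it takes 14 Mb as the chromosome length (the report says only "less than 14 Mb"), assumes a constant sequencing rate, and uses a hypothetical helper name, months_to_finish.

```python
# Chromosome 2 status at TIGR, in Mb, as quoted in the report.
complete_mb = 4.83
in_progress_mb = 3.25
in_preparation_mb = 1.38

# Sum of the three categories; the report gives 9.46 Mb.
committed_mb = complete_mb + in_progress_mb + in_preparation_mb

# Assumption: use the stated upper bound of 14 Mb as the target length.
chromosome_mb = 14.0
remaining_mb = chromosome_mb - complete_mb


def months_to_finish(remaining: float, rate_mb_per_year: float) -> float:
    """Months needed to finish `remaining` Mb at a constant yearly rate."""
    return 12 * remaining / rate_mb_per_year


for rate in (8.0, 12.0):  # current and proposed rates from the report
    months = months_to_finish(remaining_mb, rate)
    print(f"{rate:>4.0f} Mb/year: {months:.1f} months")
```

Under these assumptions the current rate of 8 Mb per year leaves roughly 14 months of work from the July 1998 meeting, which is consistent with the end-of-1999 estimate.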
SPP (Stanford University, Plant Gene Expression Center, University
of Pennsylvania;
http://pgec-genome.pw.usda.gov;
http://cbil.humgen.upenn.edu/~atgc/ATGCUP.html;
http://sequence-www.stanford.edu/ara/ArabidopsisSeqStanford.html)
These three groups have as a goal completing the sequence of chromosome
1. They have divided some of the preparative tasks, with Stanford
providing automated template preparation, Penn mapping chromosome
1 BACs and providing BAC end sequences to the project, and PGEC
making the sequencing libraries. All groups are involved in sequencing.
The strategy is similar to that of TIGR, whereby seed BAC clones
chosen by the Penn laboratory are used as sequencing origins, and
progress is made by use of both BAC end sequences and BAC fingerprints,
to provide minimal overlap. Initially 20 starting points were
used; there are plans to add an additional 20 soon.
SPP has provided 8,936 BAC end sequences to the 36,574 BAC end
total.
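The BAC end counts quoted in this report allow a simple consistency check. The Python sketch below derives the share left for the third group and the average read length implied by the combined totals; both derived values are arithmetic on the quoted figures, not numbers stated by the participants.

```python
# BAC end sequence counts quoted in the report.
total_ends = 36_574   # all three groups combined
tigr_ends = 16_392    # TIGR
spp_ends = 8_936      # SPP

# Derived: what remains for the third group (Genoscope).
genoscope_ends = total_ends - tigr_ends - spp_ends

# Derived: mean read length implied by 13.64 Mb over all end sequences.
total_mb = 13.64
mean_read_bp = total_mb * 1_000_000 / total_ends

print(f"Implied Genoscope end sequences: {genoscope_ends}")
print(f"Implied mean read length: {mean_read_bp:.0f} bp")
```

The implied Genoscope share of 11,246 end sequences is in line with the "approximately 11,500" reported in the Genoscope section.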
The chromosome 1 sequencing done or in progress has so far totaled
5.64 Mb, which is the sequence of 55 BACs and 1 YAC clone. Excluding
overlap between adjacent clones leaves a total unique sequence
in progress or finished of 5.36 Mb. Of this, 4.02 Mb is complete,
0.65 Mb is in finishing and 0.97 Mb is in shotgun phase. Overlap between
adjacent clones has been 2 to 38 kb, with an average less than
7 kb; there has as yet been no failure to find the adjacent clone
from any sequenced BAC.
The total estimated length of chromosome 1 is 30 Mb. Capacity
exists to finish it by the end of 2000, given sufficient funding;
completion will require sequencing approximately 300 BAC clones
in the next 3 years, or 33 BACs per year per participating site.
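The per-site workload follows directly from the figures above. The Python sketch below reproduces it and also derives the average unique sequence each remaining BAC would need to contribute; that second figure is an inference from the quoted numbers, not a value given in the report.

```python
# Figures quoted in the report for chromosome 1.
bacs_needed = 300          # approximate number of BAC clones still to sequence
years = 3
sites = 3                  # Stanford, PGEC and Penn
chromosome_mb = 30.0       # estimated length of chromosome 1
complete_mb = 4.02         # sequence already complete

# BACs each site must sequence per year (the report rounds this to 33).
per_site_per_year = bacs_needed / (years * sites)

# Derived: average unique sequence per remaining BAC, in kb.
kb_per_bac = (chromosome_mb - complete_mb) * 1000 / bacs_needed

print(f"BACs per site per year: {per_site_per_year:.1f}")
print(f"Implied unique sequence per BAC: {kb_per_bac:.1f} kb")
```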
CSHSC (Cold Spring Harbor Sequencing Consortium;
http://www.cshl.org/arabweb/;
http://genome.wustl.edu/gsc/)
This consortium includes Cold Spring Harbor Laboratory, Washington
University and Perkin-Elmer Applied Biosystems. They are taking
a different approach to choosing the BAC clones to sequence, which
involves HindIII and EcoRI fingerprinting of BAC clones, and from
the clone overlaps inferred from fingerprint identity, producing
deep contigs of overlapping clones. Each contig is then to be
anchored to known chromosomal positions by use of the abundant
public information on BAC clone map positions, or by cross-hybridization
with the YAC contigs already established for chromosomes 4 and
5 at the John Innes Centre in the U.K. Once a genome-wide set
of BAC contigs is available, a minimal tiling path can be calculated
and many clones can be sequenced in parallel. This approach requires
the same degree of preparative work as BAC end sequencing for
a comparable cost, but has the advantages of providing a physical
map to the Arabidopsis community prior to the completion
of the genomic sequence, and also will allow parallel sequencing
of clones rather than the necessarily sequential sequencing using
BAC end sequences. In addition, this method will allow gaps to
be identified in advance of sequencing in the gapped region, and
thus may allow a longer time to close gaps before they become
a critical problem with sequence completion.
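The report does not say how the minimal tiling path is calculated. As one possible illustration, the Python sketch below applies a standard greedy interval-cover method to clones represented as (name, start, end) intervals. The coordinates, the clone names and the function name minimal_tiling_path are hypothetical, and real contigs built from fingerprints would supply relative order and overlap rather than exact positions.

```python
from typing import List, Tuple

# (name, start, end) in kb; hypothetical coordinates for illustration only.
Clone = Tuple[str, int, int]


def minimal_tiling_path(clones: List[Clone], start: int, end: int) -> List[Clone]:
    """Greedy interval cover: at each step, among clones that begin at or
    before the position already covered, pick the one reaching furthest."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[Clone] = []
    covered = start
    i = 0
    while covered < end:
        best = None
        # Scan every clone that overlaps or abuts the covered region.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            # No clone extends the path: a gap, known before sequencing starts.
            raise ValueError(f"gap in clone coverage at {covered} kb")
        path.append(best)
        covered = best[2]
    return path


example = [("A", 0, 100), ("B", 40, 150), ("C", 90, 210),
           ("D", 140, 230), ("E", 200, 300)]
print([name for name, _, _ in minimal_tiling_path(example, 0, 300)])
```

In this toy contig, three of the five clones are enough to span the region, and a region with no spanning clone raises an error, mirroring the point that gaps can be identified in advance of sequencing.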
So far an estimated 71 Mb of the perhaps 120 Mb nuclear genome
is contained in 66 BAC contigs, which contain 10,840 BAC clones.
The chromosome totals are:
Chromosome   Mb      Contigs
1            22.5    13
2            >4      7
3            17.0    11
4            15.3    8
5            13.4    8
The current rate of BAC clone fingerprinting and editing is 15
Mb per month. It is expected that all 22,000 available BAC clones
will be added to this map by the end of 1998. Concentration at
present is on chromosome 5, where the CSHSC is sequencing, and
chromosome 3, where Genoscope plans to sequence using the CSHSC
contigs.
The CSHSC is committed to sequencing the top of chromosome 4 and
a region of approximately 4 Mb around the centromere and on the
north arm of chromosome 5. Sequence data has been contributed
by all three collaborating partners. Totals finished so far are
690 kb from ABI, 1.22 Mb from CSH and 1.64 Mb from Washington
University, adding up to 3.54 Mb (with overlap subtracted). In
addition to this, approximately 3 Mb of sequencing is in progress,
making a total of more than 6.0 Mb in 61 BAC clones and 1 YAC.
If this rate were to be continued, the proposed chromosome 4
region could be completed by the end of 1998, with completion of
the chromosome 5 region in either 1998 or early 1999.
ESSA (European Scientists Sequencing Arabidopsis;
http://muntjac.mips.biochem.mpg.de/arabi/index.html)
The ESSA project is in three phases. Phase I, which is complete,
was to sequence two contiguous regions on chromosome 4. One,
surrounding the FCA genetic marker, is 1.92 Mb (Bevan et al. 1998,
Nature 391:485), the other, around the genetic marker AP2, is
0.41 Mb, for a total completed ESSA I sequence of 2.33 Mb. ESSA
II, which is to be completed in October 1998, has the goal of
completing a 5 Mb region on the long arm of chromosome 4. So
far 3.16 Mb is completed and annotated, and an additional 1.73 Mb is
completed and in annotation phase, for a total of 4.89 Mb sequenced.
Another 0.24 Mb is nearly complete, for an overall total of ESSA
II complete and nearly complete contiguous sequence of 5.13 Mb.
The ESSA I and ESSA II total of completed and nearly completed
sequence is thus 7.46 Mb.
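The ESSA subtotals above can be verified with a few lines of arithmetic. The Python sketch below only re-adds the figures quoted in the text; the dictionary keys are descriptive labels chosen for this example.

```python
# ESSA I regions on chromosome 4, in Mb.
essa1 = {"FCA region": 1.92, "AP2 region": 0.41}

# ESSA II progress on the long arm of chromosome 4, in Mb.
essa2 = {
    "completed and annotated": 3.16,
    "completed, in annotation": 1.73,
    "nearly complete": 0.24,
}

essa1_total = round(sum(essa1.values()), 2)
essa2_sequenced = round(
    essa2["completed and annotated"] + essa2["completed, in annotation"], 2
)
essa2_total = round(sum(essa2.values()), 2)
combined_total = round(essa1_total + essa2_total, 2)

print(f"ESSA I total:     {essa1_total} Mb")
print(f"ESSA II sequenced: {essa2_sequenced} Mb")
print(f"ESSA II total:    {essa2_total} Mb")
print(f"ESSA I + II:      {combined_total} Mb")
```

Each value agrees with the corresponding total stated in the text (2.33, 4.89, 5.13 and 7.46 Mb).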
The two-year ESSA III project begins in August, 1998. Its goal
is to complete the sequence of the long arm of chromosome 4 (estimated
to total 13 to 13.5 Mb) and to sequence two regions of the north
arm of chromosome 5 (with others to be done by CSHSC and Kazusa),
with a total goal of sequencing 9 Mb.
The ESSA procedure is to use the existing YAC contig maps of chromosomes
4 and 5 to group BAC clones in bins according to their YAC cross-hybridization,
then to use SalI digestions and pulsed-field gel electrophoresis
followed by blotting and iterative hybridization with BAC clones
to establish both BAC contigs and an overall SalI restriction
map of both chromosomes. A minimal BAC tiling path is then defined
and called the "sequence ready map"; the clones from
this map are then sent to one of 9 collaborating sequencing laboratories
for nucleotide sequencing. The data are collected and annotated
at MIPS, the Munich Information Center for Protein Sequences.
The only problems encountered so far have been two difficult clones,
one with a large hairpin and the other with a large region of
tandem repeats. Both have been nearly completed, with the tandem
repeats solved by long PCR as a supplement to the shotgun sequencing.
Kazusa DNA Research Institute (
http://www.kazusa.or.jp/arabi/)
The Kazusa Institute is engaged in sequencing the long arm of
chromosome 5 and, along with ESSA and CSHSC, portions of the short
arm of this chromosome (totaling 17.2 Mb when complete), and they
are beginning the sequencing of the long (13.2 Mb) arm of chromosome
3.
The clone libraries used are from the Mitsui Plant Biotechnology
Research Institute, and consist of P1 and TAC clones. Clones
from these libraries are initially selected by cross-hybridization
to mapped clone markers. The clones are then anchored on the
YAC contig (for chromosome 5 clones), and fingerprinted as an
integrity check. They are then shotgun sequenced, assembled,
and annotated. A collection of YAC, TAC and P1 clone end sequences
has been made for tiling the chromosome 5 clones; it includes
1254 sequences from 690 CIC YAC clones and 706 sequences from
389 P1 or TAC clones on chromosome 5. Similar methods for chromosome
3 are starting, using the YAC contig map of that chromosome produced
by D. Bouchez and collaborators at INRA. At present, two large
contigs for chromosome 3 exist, one of 13.6 Mb for the long arm,
and one of 9.2 Mb for the bottom arm.
Progress to date has been the release of 8.89 Mb of completed,
annotated sequence, with release of an additional 1.60 Mb scheduled
by August 1. Thus by August 1, 1998, 10.49 Mb will have been
completed and released. 10.15 Mb of this is on chromosome 5,
0.34 Mb on chromosome 3. An additional 2 Mb of chromosome 5 sequencing
is in progress. At current rates of 700 to 800 kb per month,
it is expected that 27 months will be required for completion
of this part of the project, which is estimated to include (in
addition to the 10.49 Mb to be completed by August 1) 7.05 Mb
of chromosome 5 and 13.3 Mb of chromosome 3. Genoscope has proposed
to do 5 Mb of the long arm of chromosome 3 (see below). If they
are able to take this on (a matter now being considered there,
and dependent upon the demand for their resources by human genome
sequencing), the total sequence proposed by Kazusa will be reduced,
and completion will be expected within 2 years.
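The 27-month estimate given earlier in this paragraph can be checked with simple arithmetic. The following is a minimal sketch in Python that uses only the figures quoted in the text; the function name `months_to_finish` is illustrative and not part of any project software.

```python
def months_to_finish(remaining_mb, rate_kb_per_month):
    """Months needed to sequence remaining_mb at a steady monthly rate."""
    return remaining_mb * 1000.0 / rate_kb_per_month


# Remaining work quoted in the text:
# 7.05 Mb of chromosome 5 plus 13.3 Mb of chromosome 3.
remaining_mb = 7.05 + 13.3

fast = months_to_finish(remaining_mb, 800)  # upper end of the quoted rate
slow = months_to_finish(remaining_mb, 700)  # lower end of the quoted rate
mid = months_to_finish(remaining_mb, 750)   # midpoint of the quoted rate

print(f"{fast:.1f} to {slow:.1f} months; about {mid:.0f} months at 750 kb/month")
```

At the midpoint of the quoted rate the result is close to the 27 months stated in the report; the two ends of the rate range bracket it by roughly two months either way.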
Genoscope (Centre National de Séquençage;
http://www.genoscope.cns.fr/externe/arabidopsis/Arabidopsis.html)
Genoscope is involved in the second European project. They have
already provided approximately 11,500 completed BAC end sequences,
with plans to provide 2,000 more. Once
this is complete, 91% of the 22,000 BAC clones used in the sequencing
project (from the IGF and the TAMU collections) will have available
end sequences.
Their sequencing plan is to use the Bouchez chromosome 3 YAC contigs
to make a minimal BAC tiling path by use of fingerprints done
at Genoscope and at CSHSC, then to sequence the bottom (9 Mb)
arm of chromosome 3. Complete contigs for this region have been
supplied by CSHSC. Sixteen different European sequencing groups are
receiving the BAC clones from Genoscope, and the data are returned
to MIPS for annotation and entry into a public database. The
sending out of clones is to begin within weeks, and completion
of the 9 Mb region is expected by the end of 2000.
Genoscope has in addition explored with Kazusa the possibility
of sequencing an additional 5 Mb on the top arm of chromosome
3; their ability to do this will depend upon the amount of their
sequencing capacity that will be required to do their part of
human chromosome 14, and their ability to generate extra sequencing
capacity. A decision on whether Genoscope or Kazusa will sequence
this 5 Mb is planned for September, 1998.
Summary of Progress
Chromosome   Est. Size (Mb)     Complete (Mb)   Group
1            ~30                 4.02           SPP
2            14 (+rDNA)          4.83           TIGR
3            23                  0.34           Kazusa & Genoscope
4            17 (+rDNA)          9.02           ESSA & CSHSC
5            ~30                10.15           Kazusa, CSHSC, ESSA
TOTAL        ~114 Mb +rDNA      28.36
In addition, shotgun sequencing libraries are in preparation for
an additional 2.80 Mb, and sequencing is in progress but not yet
complete for an additional 2.98 Mb. Furthermore, 36,574 BAC ends
from 18,746 clones, representing 13.64 Mb, provided by TIGR, SPP
and Genoscope are completed, as are 1254 end sequences from 690
CIC YAC clones and 706 sequences from 389 P1 or TAC clones, provided
by Kazusa.
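The totals in the summary table, and the BAC end figures in the paragraph above, can be cross-checked with a few lines of Python. This is only a sketch: the numbers are copied from this report, the variable names are illustrative, and the "fraction complete" ignores the rDNA arrays, which the table lists separately.

```python
# Completed sequence per chromosome (Mb), copied from the summary table.
completed_mb = {1: 4.02, 2: 4.83, 3: 0.34, 4: 9.02, 5: 10.15}
# Estimated chromosome sizes (Mb), excluding the rDNA arrays on 2 and 4.
estimated_mb = {1: 30, 2: 14, 3: 23, 4: 17, 5: 30}

total_done = round(sum(completed_mb.values()), 2)
total_size = sum(estimated_mb.values())
fraction_done = total_done / total_size

# BAC end sequences: 36,574 ends from 18,746 clones, representing 13.64 Mb.
bac_ends, bac_clones, bac_end_mb = 36_574, 18_746, 13.64
mean_read_bp = bac_end_mb * 1_000_000 / bac_ends
ends_per_clone = bac_ends / bac_clones

print(total_done, total_size, f"{fraction_done:.1%}")
print(f"{mean_read_bp:.0f} bp per end, {ends_per_clone:.2f} ends per clone")
```

The per-chromosome values sum to the 28.36 Mb and ~114 Mb totals in the table, which corresponds to roughly a quarter of the estimated genome. The BAC end figures imply an average end read of a little under 400 bp, with most clones sequenced from both ends.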
COMPLETING THE SEQUENCE
Defining completion
In addition to the gene-rich and highly informative regions of
the genome (with one gene every 4-5 kb), there are regions of
repetitive DNA, and perhaps of lower gene density.
One instance is the ribosomal DNA repeats, which are arranged
in two uninterrupted tandem arrays. Each repeat unit contains
a gene for 18S, 5.8S and 25S structural ribosomal RNAs and is
10-10.5 kb in length. The large tandem arrays of repeat units
are found at the top arms of chromosomes 2 (NOR2) and 4 (NOR4).
Each is on the order of 3-3.5 Mb, or 300-350 repeat units.
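The relationship between array size, repeat unit length, and copy number is a simple division. The sketch below (Python, with an illustrative function name) shows how the quoted 300-350 units follow from the quoted array sizes if the unit length is taken as 10 kb.

```python
def repeat_units(array_kb, unit_kb):
    """Approximate number of tandem repeat units in an array."""
    return array_kb / unit_kb


# Array sizes of 3 to 3.5 Mb and a repeat unit of 10 kb, as quoted above.
low = repeat_units(3000, 10.0)
high = repeat_units(3500, 10.0)

# The same arrays with the longer 10.5 kb unit give somewhat fewer copies.
low_long_unit = repeat_units(3000, 10.5)
high_long_unit = repeat_units(3500, 10.5)

print(low, high, round(low_long_unit), round(high_long_unit))
```

With a 10.5 kb unit the same arrays would hold roughly 286 to 333 copies, so the figures in the text are best read as order-of-magnitude estimates.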
Centromeric regions are only beginning to be defined at the molecular
level in Arabidopsis, but cloning and chromosome in
situ hybridization studies have shown that these regions contain
multiple tandem repeats of short sequences, a major element of
which is a family of 180 bp repeats and related repeats. In one case (chromosome
1) the repeat region is estimated to be 950 kb long. For chromosome
4 the functional centromere is probably on one side of a 180 bp
repeat region, and so far does not seem to be unclonable. There
is some indication that BAC clones from this region may have a
higher amount of repetitive sequence in tandem arrays than other
BAC clones sequenced to date, and one BAC clone from the chromosome
2 centromere region has only 3 genes, a much lower density than
the typical 1 gene per 4-5 kb found elsewhere. Another BAC from
the centromere region of chromosome 4 has a more typical density.
Telomeres and subtelomeric regions in Arabidopsis have
been characterized and appear to be small (totaling perhaps 100
to 200 kb in the genome) and not difficult to sequence so far.
There are also small regions of simple tandem repeats, such as the
one described above in the ESSA project progress report.
That clone, BAC F9F13, contained 10 tandem copies of a 3.5 kb
repeat, as well as 2 additional copies of the same repeat.
Because the exact sequence and number of tandem repeats are not
thought to be consequential for any functional analysis, and in
fact are quite polymorphic between ecotypes, it was decided that
a sufficient characterization of these repeats would be a sequence
of one subunit, and an estimation from blotting or long-range
PCR of the number of tandem copies at each site.
Given this, the complete sequence of the nuclear genome will be
considered to be in hand when each chromosome arm is fully sequenced
as a single contig from subtelomeric repeat to "centromeric"
tandem repeats, with internal tandem repeat regions (including
rDNA repeats) characterized only as far as demonstrating that
they are pure tandem repeats, with the sequence of one repeat
unit determined, and an estimate of repeat number at each site
provided. This characterization already exists for the rDNA repeats
(Copenhaver et al. (1995) Plant J. 7:273-286). This definition
may have to change if unclonable regions are found, or if non-tandemly
organized but nonetheless impossible to sequence (with available
relevant technology) clones are found. To date there is no indication
of either unclonable regions or of clones impossible to sequence
for reasons other than large numbers of small tandem repeats.
Other sequence parameters
Accuracy
All of the participants have agreed before, and continue to agree,
that the standard for sequence accuracy should be one error in
10,000 nucleotides or better, and the projects so far seem to
be achieving this goal. The U.S. groups agreed to a common pair
of tests to monitor sequence accuracy. The first is to use
base calling programs such as Phred (Ewing et al. (1998) Genome
Res. 8:175-185) or TIGR Assembler to assess sequence accuracy
in each sequencing run. The second is to independently determine
the sequence of all regions of overlap between adjacent clones,
and only after sequence finishing to compare them for mismatches.
This serves as an independent method to determine sequence accuracy,
and since all mismatches are to be resolved by further analysis,
this test will in addition indicate the degree of sequence change
due to mutation in the clones being used for sequencing.
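The second test, comparing independently finished sequences across clone overlaps, can be sketched as follows. This is a minimal illustration in Python, not the procedure any of the groups actually used: it assumes the two overlap sequences have already been aligned to equal length with no insertions or deletions, and the function name, constant, and toy sequences are all hypothetical.

```python
MAX_ERROR_RATE = 1 / 10_000  # the agreed standard: one error in 10,000 bases


def mismatch_rate(seq_a, seq_b):
    """Count mismatches between two pre-aligned, equal-length sequences.

    Returns (mismatch_count, mismatches_per_base).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("overlap sequences must be aligned to equal length")
    if not seq_a:
        raise ValueError("overlap is empty")
    mismatches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)
    return mismatches, mismatches / len(seq_a)


# Toy data: a 20,000-base overlap in which the second clone differs at one site.
overlap_a = "ACGT" * 5000
overlap_b = overlap_a[:12_345] + "A" + overlap_a[12_346:]

count, rate = mismatch_rate(overlap_a, overlap_b)
meets_standard = rate <= MAX_ERROR_RATE
print(count, rate, meets_standard)
```

A mismatch in an overlap may come from an error in either sequence or from a mutation in one of the clones, so the observed rate is an indicator rather than a direct per-sequence error rate; that is why each mismatch is to be resolved by further analysis.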
The European and Japanese groups have different methods to measure
sequence accuracy, but have the same goal of less than one error
in 10,000 bases.
Annotation
Proper annotation of sequences to indicate the position, structure
and nature of each of the coded genes is a critical component,
and in fact the primary product, of the genome project. It is
clear, though, that initial annotation of sequences is not fully
(or even very) accurate, as the software and algorithms used for
gene recognition can miss exons and introns, and can also indicate
the presence of exons or introns where there are none. This is
as true in animal genome projects as in plant projects. Thus,
annotation will have to be done in stages, with initial annotations
that can be useful, but that must be acknowledged to be flawed.
Each of the sequence groups performs its own annotation, as this
is not only an interesting part of the work, but also helps with
continued sequencing. It was agreed that, to provide the highest
quality initial annotation, each group would use multiple software
programs for gene recognition, and would indicate in its output
the product of each of the programs (something that GenBank cannot
do; thus this requires output to be in a form other than that
sent to GenBank or equivalent public databases). It should be
emphasized that doing this does not remove the requirement for
inclusion of the output in public databases like GenBank or DDBJ.
In addition, experimental means of annotation are to be used
by each group - that is, sequences must be compared with the EST
sequences that are available and that indicate actual RNA sequences,
and must be compared with the genes of known structure that have
been individually studied. Furthermore, feedback from the community
of Arabidopsis researchers should be invited by each group,
to allow correction or improvement of each group's annotations.
As the genome project proceeds, it is important to consider additional
experimental methods for gene recognition, and the application
of such methods should be considered important goals for the project.
Among the experimental methods to be considered is sequencing
of related genomes (such as those of Arabis lyrata or Cardaminopsis
petraea, see
http://www.arabis.net/wild.htm). Because exonic
sequences change more slowly than intronic or intergenic sequences,
this could serve as a very useful indicator of gene location and
exon boundaries. Additional experimental means for improving
annotations include RNA blots and RT-PCR to find if suggested
genic sequences in fact correspond to RNAs, and full-length sequencing
of large numbers of cDNA clones for comparison to genomic sequences.
Maintenance of summary lists of identified genes according to
the type of protein coded (see Bevan et al. 1998, Nature 391:485)
is also an important aspect of annotation.
Because annotation methods and the experimental information on
which they are based are subject to continual improvement, frequent
reannotation is worthwhile. Both the Kazusa and TIGR groups have
plans for systematic reannotation of sequences from all groups.
To facilitate this and, especially, to facilitate community access
to annotations, it was agreed that all groups would work toward
a standardized format for data presentation, and that groups doing
large-scale reannotation would make their data freely available
for mirroring on the web sites of all groups that wish to display
them.
Data release
Each of the U.S. groups sends sequence out unannotated and in
small fragments as soon as it reaches either approximately 2 kb
contigs or 7x average coverage. The sequences from two of the
three groups are sent at this stage to the high throughput genome
sequence (HTGS) part of GenBank; the third group has agreed to
start doing this as well. The sequences are now sent to each
group's own web page, each of which supports BLAST searches, and
are also sent at short intervals to AtDB, the public Arabidopsis
database, where they are also BLAST searchable (
http://genome-www2.stanford.edu/cgi-bin/AtDB/nph-blast2atdb).
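The release trigger described above (contigs of about 2 kb, or 7x average coverage) can be expressed as a simple check. The sketch below is one plausible reading of that rule, written in Python: it treats average coverage as total read bases divided by clone length, and it treats the contig condition as met when any contig reaches 2 kb. Both interpretations, and all names and example numbers, are assumptions rather than the groups' actual criteria.

```python
def average_coverage(read_lengths_bp, clone_length_bp):
    """Average depth: total bases read divided by the clone length."""
    return sum(read_lengths_bp) / clone_length_bp


def ready_for_release(contig_lengths_bp, read_lengths_bp, clone_length_bp,
                      min_contig_bp=2_000, min_coverage=7.0):
    """True if any contig reaches min_contig_bp or coverage reaches min_coverage."""
    long_contig = any(length >= min_contig_bp for length in contig_lengths_bp)
    deep_enough = average_coverage(read_lengths_bp, clone_length_bp) >= min_coverage
    return long_contig or deep_enough


clone_bp = 100_000  # hypothetical BAC insert size

# Early in a shotgun project: short contigs and about 1x coverage.
early = ready_for_release([800, 1200], [500] * 200, clone_bp)
# A contig passes 2 kb while coverage is still low.
by_contig = ready_for_release([800, 2300], [500] * 200, clone_bp)
# Contigs are still short, but 1,400 reads of 500 bp give 7x coverage.
by_depth = ready_for_release([800, 1200], [500] * 1400, clone_bp)

print(early, by_contig, by_depth)
```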
The structure of the European projects, where sequence-ready clones
are allocated to many groups, and each group has some discretion
(and rules from their own national government) in how to sequence
and when to submit completed sequence, does not lend itself to
identical release methods or policies. Nonetheless, the groups
agree to collect and distribute through MIPS and AtDB all sequences
as soon as practicable, at latest after completion and before
annotation.
The Japanese group also has its own policies and level of funding
for informatics, which so far have dictated that sequence be released
only after both completion and annotation, and then posted to
DDBJ (DNA Database of Japan) and GenBank. This entails a delay
in public access relative to other groups, as the time from completion
to annotation is about a month, and the time from acquisition
of the earliest data to completion is also appreciable. The Japanese
group will consider mechanisms for earlier release, within the
constraints of policy and of funding for this aspect of the project.
Clone registration (intention to sequence)
One critical aspect of the project is coordination between groups
on the clones to be sequenced, as without tight coordination,
duplication of effort will occur, especially in the closing phases
of the project. In addition, as different groups complete their
assigned regions, reallocation of regions may become necessary
so that groups ahead of their predicted rate can help by sequencing
clones originally assigned to other groups. At present this coordination
has been supplied by direct communication between the groups,
and by the function of an international coordinating committee
of the Arabidopsis Genome Initiative (AGI: see
http://genome-www3.stanford.edu/cgi-bin/Webdriver?MIval=atdb_registry_info.html).
This committee will remain the arbitrator of international sequencing
efforts, but will be supplemented with a new committee that will
allow for closer coordination of the U.S. groups. This new committee
has been mandated by the U.S. funding agencies, as a replacement
for the three separate advisory groups that now exist, one for
each group.
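The coordination problem described above amounts to keeping a shared record of which group intends to sequence which clone, refusing conflicting claims, and allowing deliberate reallocation. The following Python sketch illustrates that idea only; the class, its methods, and the clone and group labels are hypothetical and do not describe the actual AGI registry.

```python
class CloneRegistry:
    """Toy intention-to-sequence registry: each clone has at most one group."""

    def __init__(self):
        self._owner = {}

    def register(self, clone, group):
        """Record a claim; reject it if another group already holds the clone."""
        current = self._owner.get(clone)
        if current is not None and current != group:
            raise ValueError(f"{clone} is already registered to {current}")
        self._owner[clone] = group

    def reallocate(self, clone, new_group):
        """Move an existing claim to another group by committee decision."""
        if clone not in self._owner:
            raise KeyError(f"{clone} has not been registered")
        self._owner[clone] = new_group

    def owner(self, clone):
        return self._owner.get(clone)


registry = CloneRegistry()
registry.register("clone-001", "group A")

# A second group claiming the same clone is refused.
try:
    registry.register("clone-001", "group B")
    duplicate_blocked = False
except ValueError:
    duplicate_blocked = True

# Reallocation is an explicit, separate step.
registry.reallocate("clone-001", "group B")
print(duplicate_blocked, registry.owner("clone-001"))
```

Keeping registration and reallocation as separate operations mirrors the distinction in the text between routine claims by the groups and reassignment decided by a committee.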
One of the tasks of the U.S. committee will be clone reallocation,
and in addition frequent communication with the members of the
international AGI committee, as a way of stimulating continued
discussion among all groups. As representatives of all groups
will be invited to the meetings of the U.S. committee, these meetings
may also be able to serve as a forum for discussion and decisions
of the AGI committee. This may help the AGI by increasing the
frequency of its considerations.
NEW U.S. STEERING COMMITTEE
Given the important new role of the mandated U.S. Steering Committee
as arbitrator and communication facilitator between the U.S. groups,
and as aid to the AGI committee on the international front, the
role and responsibilities of the committee were discussed and agreed
upon.
The U.S. Steering Committee will have the following responsibilities:
1) Setting boundaries between the U.S. sequencing groups (ideally,
to be defined by sequenced clones) to avoid duplication of effort
in chromosomes where more than one group is working.
2) Reallocation of clones or chromosome regions from one group
to another to fit sequencing capabilities to the remaining work.
3) Monitoring and enforcement of the common agreements described
earlier in this report, namely the agreement to work toward a
common annotation format, to provide quality control information
both from base calling programs and from clone overlap regions,
and to monitor sequence release compliance.
4) Providing annual progress reports to the Arabidopsis
community and to the U.S. funding agencies, separate from the
progress reports of each of the individual sequencing groups.
These reports will include a careful consideration not only of
amount of sequence provided by each group, but of progress in
all respects, balanced so that groups taking on difficult clones
to sequence, or who are in closing phase and thus must devote
time to closing gaps, are given full credit for such efforts.
In addition, these reports are to detail progress in the informatics
aspects of the project, including a summary of the progress and
needs of the Arabidopsis database - as an interface between
the database and its advisory committee, the sequencing groups,
and the Arabidopsis community.
5) Providing an interface between the U.S. groups and the international
AGI committee, and act to facilitate the setting of boundaries
and clone reallocation at an international level.
6) The committee should endeavor to meet in person at least once
a year, and have regularly scheduled meetings by electronic mail
or conference call.
The composition of the committee is as follows:
Members:
Ex officio:
The actual members of the committee who have so far agreed to
serve:
Elliot Meyerowitz, chair (U.S. Arabidopsis community)
Daphne Preuss (U.S. Arabidopsis community)
Gerd Jürgens (international Arabidopsis community)
Ex officio:
Joe Ecker, SPP
Dick McCombie, CSHSC
Steve Rounsley, TIGR
Ian Bancroft, ESSA III
Francis Quetier, Genoscope
Satoshi Tabata, Kazusa
Recommendations for the other members were:
Joanne Chory, Pam Green or Detlef Weigel (U.S. Arabidopsis community)
Mark Johnson, Richard Gibbs, John Sulston, Maynard Olson (sequencing experts)
Mark Boguski (database expert)
Mike Cherry (AtDB representative)
FINAL PROSPECT
Given sufficient funding, which seems very likely, there is no
technical obstacle to the completion of the Arabidopsis
nuclear genome sequence by December 31, 2000. Although the efforts
of the project members must be focused tightly on finishing the
sequencing, it is not too early to begin considering the next
steps, among them experimental methods for annotation, and functional
analyses of genes and gene families.
submitted by:
Elliot M. Meyerowitz July 15, 1998
Summary of December 1998 AGI Meeting at CSHL
1. Daphne Preuss summarized her work on centromeric regions and
presented detailed information on approximate map locations of
BAC contigs and sequenced BACs based on hybridization (Altmann)
and fingerprint (WashU) data. She agreed to make this information
available to the community. Rob Martienssen stressed that individual
clones would need to be compared closely with fingerprint contigs
constructed at WashU because some hybridization data were unreliable.
2. Each group discussed their estimated sequencing capacity and
assigned chromosomal regions for the coming year. Kazusa expects
to finish their assigned regions on III and V by the end of 1999.
ESSA and CSHL/WashU may also complete their assignments on IV
and V at about the same time. SPP is continuing with chromosome
I and was encouraged to avoid starting many additional nucleation
points in order to focus on the same closure issues being addressed
by the other groups. Genoscope has begun sequencing the bottom
arm of III and will continue with this region through 2000. TIGR
expects to finish chromosome II by summer 1999 and will therefore
be the first funded group to run out of an assigned region to
sequence.
3. AGI members discussed the importance of finishing difficult
areas within assigned regions of the genome while also continuing
to make rapid progress on other regions to maximize release of
information to the community.
4. Both TIGR and Kazusa proposed to begin sequencing the "unassigned"
top 5-6 Mb of chromosome III during 1999. After considerable
discussion, both at the AGI meeting and later in the conference
when Satoshi Tabata arrived, a consensus was reached to have
TIGR begin sequencing this region of chromosome III during the
spring of 1999 with the aim of finishing this region by January
2000.
5. Starting in January 2000, TIGR, Kazusa, CSHL, and ESSA will likely have residual sequencing capacity ready to shift to centromeric regions and portions of chromosome I that have not yet been completed. By this time a minimal tiling path based on fingerprint data should be available to facilitate assignment of remaining BACs to AGI members. SPP has funding to complete most or all of chromosome I but recognizes that the entire genome
may be completed more rapidly if other groups contribute in the
year 2000 to sequencing portions of this chromosome (or possibly
part of the bottom of chromosome III depending on progress made
by Genoscope) after their own assigned regions have been essentially
completed.
6. Marcel Salanoubat and Francis Quetier led a discussion of
the Genoscope policy for sequence release. While it was clear
that the informatics capabilities of the individual laboratories
in their program varied significantly, there was a general agreement
that the group should strive for immediate release of sequences
(at least for the bigger laboratories within their program).
7. Rob Martienssen and David Meinke discussed the status of the
CSHL/WashU consortium plans to continue sequencing and fingerprinting
efforts. NSF has now received all of the necessary paperwork
for continued funding of this consortium and expects to make an
award at a level sufficient to enable sequencing another 2.4 Mb
per year starting early in 1999. In addition, NSF has recommended
funding an informatics person at WashU to finish editing of fingerprinted
contigs and establishment of an interactive version of the BAC
physical map that can be accessed via the Internet. This person
will work closely with AtDB to avoid duplication of effort.
8. The CSHL/WashU group has agreed to release to other sequencing
groups all of their edited contig information and fingerprint
database through their ftp site no later than the end of January,
1999. The SPP and TIGR groups are particularly anxious to make
use of this information in order to avoid repeating the contig-building
steps that have already been completed elsewhere. Rob Martienssen
agreed to provide as soon as possible a minimal BAC tiling path
for regions of the genome that may require coordination during
the final year of the project.
9. Joe Ecker and David Meinke discussed a proposal by Hiroaki
Shizuya at Caltech to fingerprint and end-sequence a new BAC library
with large inserts (180 kb average). The general consensus was
that although this library might be very useful in regions of
the genome with minimal coverage and could reduce the overall
cost of sequencing other regions by reducing overlaps, it was
unlikely that many AGI participants would immediately move away
from using TAMU and IGF clones for the bulk of their sequencing
efforts. NSF is willing to discuss further the potential value
of this library with interested AGI members.
10. Rob Martienssen agreed to serve as the next AGI chairperson.
There was general agreement that AGI members should meet again
in summer 1999, perhaps at the next Arabidopsis meeting in Australia,
to assess progress and make specific plans for the future.
Joe Ecker, AGI chairperson
I. VENUE AND PARTICIPANTS
To assess the current and future database needs of the Arabidopsis
community, an NSF-supported workshop on this topic was convened
in Madison, Wisconsin on June 28, 1998. The workshop participants
included the following individuals:
Rick Amasino, University of Wisconsin
Mary Anderson, Nottingham University
Mike Cherry, Stanford University
Joanne Chory, Salk Institute
Maarten Chrispeels, University of California San Diego
Jeff Dangl, University of North Carolina
Keith Davis, Ohio State University
Allan Dickerman, National Center for Genome Research
David Flanders, Stanford University
Pam Green, Michigan State University
Bertrand Lemieux, University of Delaware
David Meinke, Oklahoma State University
Larry Parnell, Cold Spring Harbor Laboratory
Daphne Preuss, University of Chicago
Ralph Quatrano, Washington University
Ernie Retzel, University of Minnesota
Steve Rounsley, The Institute for Genomic Research
Randy Scholl, Ohio State University
Chris Somerville, Carnegie Institution of Washington and Stanford
University (chair)
Desh Pal Verma, Ohio State University
The following individuals provided valuable written comments prior
to the meeting (Appendix I):
Jean Greenberg, University of Chicago
Katie Krolikowski, Harvard University
Russell Malmberg, University of Georgia
Jose Martinez-Zapater, Biology Molecular y Virologia Vegetal,
CIT-INIA
Natasha Raikhel, Michigan State University
Pierre Rouze, Flanders Institute of Biotechnology
Chris Town, Case Western Reserve University
Desh Pal S Verma, The Ohio State University
In addition, the workshop was attended by the following observers:
Peter Bretting, USDA/ARS National Program Staff
Greg Dilworth, Department of Energy
Machi Dilworth, National Science Foundation
Margarita Garcia, Stanford University
Paul Gilna, National Science Foundation
Xiaoying Lin, The Institute for Genomic Research
Bob MacDonald, US Department of Agriculture
DeLill Nasser, National Science Foundation
II. GOALS
The general goals of the workshop were to examine the present
and future database needs of the Arabidopsis community and to
outline in general terms the main issues which should be addressed
in any future proposals concerning the development of new or expanded
Arabidopsis databases. The discussions were intentionally focused
on biological and community issues and there was no attempt to
define or specify issues which are related to specific computer
hardware or specific database programs. In particular, no assumptions
were made concerning continued government funding of any current
Arabidopsis database activities.
A previous workshop with these goals was held on June 5th and
6th, 1993. A copy of the published summary of that workshop was provided
to all participants and served as a reference to earlier views
and objectives of the Arabidopsis community. [1993
Dallas Workshop Report] In addition, participants were
provided with a draft summary of a BBSRC-USDA bilateral plant
bioinformatics and coordination meeting held at Llangollen, Wales,
March 22-24, 1998. A copy of a memorandum, dated February 26,
1998, from the North American Arabidopsis Steering Committee to
the curators of AtDB, concerning the current Arabidopsis community
database needs was also provided. [NAASC
Memorandum] Finally, in preparation for the meeting,
written comments solicited from the community on the Arabidopsis
electronic newsgroup were provided to the participants before
the meeting. The solicitation and written comments are
appended as Appendix I.
III. RATIONALE FOR AN ARABIDOPSIS DATABASE
The genomes of higher plants, such as Arabidopsis, contain approximately
25,000 genes. During the next several years, the sequence of the
Arabidopsis genome will be completed and extensive sequence information
will become available for many other species, including many plants.
Most or all of the Arabidopsis genes will be used to develop gene
chips or microarrays that permit simultaneous measurements of
the expression (mRNA levels) of all of the genes. These will be
used to generate information about the expression of all the genes
in the organism in response to a wide variety of treatments and
genetic backgrounds. Each experiment could have as many as 25,000
data points for each time point or treatment of each genotype!
Comprehensive libraries of insertional mutations will permit the
isolation, by reverse genetics, of null mutations in any Arabidopsis
gene. Extensive collections of enhancer-trap or promoter-trap
lines are being developed that permit sensitive analyses of the
spatial patterns of gene expression down to the single-cell level.
Thousands of new classes of mutants will be isolated by selecting
for suppressors or enhancers of existing mutations. The corresponding
genes will be cloned by very high resolution mapping of the mutations
so that a limited number of candidate genes which are evident
in the delimited region of genomic sequence can be directly tested
for complementation. This will depend on the development of very
high resolution maps. It seems likely that high resolution proteomics
methods will become important for identifying the substrates of
the thousands of kinase genes that form many of the regulatory
networks in Arabidopsis and other plants. Additionally, extensive
genomic-based work in other plant species will produce a flood
of sequence information. The value of much of that information
will be greatly enhanced by comparison with the aggregate information
available in Arabidopsis. Thus, we are entering an era of explosive
growth of knowledge about Arabidopsis in particular, and plants
in general. Most of the data generated by the projects described
above will never appear in printed journals and will only be available
to the community through electronic databases.
Because Arabidopsis is one of the most intensively studied organisms, and is a direct model for 250,000 closely related species, we believe that it is appropriate to undertake a major investment in developing new information retrieval tools (IRTs) for Arabidopsis in particular and plants in general. By this we mean that because we will know everything about Arabidopsis, it is a suitable object on which to focus the building of a comprehensive database or set of linked databases. However, because the value of Arabidopsis derives from its utility in understanding other plants, it would be desirable to build a structure that permits facile high resolution linking of specific information about Arabidopsis to all other plants.
Looking into the future more generally, it is apparent that scientific
publishing is undergoing a much needed revolution. All of the
major journals will be electronic within a few years and once
that transition is complete, scientists will develop new tools
for interacting with data. The complexity of biological knowledge
in many fields is such that new mechanisms for integrating data
are required. The development of computer programs that calculate
genetic maps "on the fly" from currently available data
is an early example of what will become a more general mechanism
for integrating data. Integrated graphical representations of
patterns of gene expression in individual cells of three dimensional
models of organisms at various developmental stages are another
example under development. With such a model it will be
possible to find relationships between objects (e.g., genes) and
processes that would be difficult or impossible with current information
retrieval technologies.
Because of the changes taking place in publishing, there may be an opportunity to develop databases that will eventually be self supporting in the same way that journals are self supporting. As the distinction between the two formats blurs, the concept of paying for a database subscription will become commonplace. However, there are many complex issues associated with imposing charges for database use and the question is largely academic at present.
There are many challenges in developing a new generation database.
Perhaps the foremost is the difficulty in collecting information
from the thousands of scientists who produce primary information
for conventional publication in journals.
IV. CURRENT PUBLICLY SUPPORTED DATABASE ACTIVITIES
The principal publicly supported Arabidopsis database activities
are the AtDB database at Stanford University and the stock center
databases maintained by the Arabidopsis resource centers at Ohio
State University and the University of Nottingham. In addition,
the University of Minnesota supports an EST database for all plants,
and each of the Arabidopsis genome sequencing groups provides
database access to genomic sequences, including BAC end sequences.
The AtDB goal is to provide the plant-biology research community with convenient and correlated access to the publicly available results of Arabidopsis research. This includes published and otherwise freely available information about the genome, the genes it contains, the gene products, their positions on genetic and physical maps, as well as DNA sequences. The users of the database are very diverse, ranging from Arabidopsis molecular biologists to biologists focusing on any other organism. The members of the AtDB project are currently shared with the Saccharomyces Genome Database, and the database administrator is shared with the Expression Microarray database and Genetic Footprinting database projects, all located at the Department of Genetics at Stanford University. In an effort to minimize wasteful duplication of effort, the AtDB project uses much of the same software and staffing structure as the Saccharomyces Genome Database (SGD). The combined SGD and AtDB groups thus benefit from an economy of scale by sharing computing and human resources.
At a meeting of the Arabidopsis genome community in 1992 at the
Cold Spring Harbor Banbury Center, a consensus was reached that
AtDB should take responsibility for providing centralized access
to Arabidopsis databases, a recommendation that has been repeatedly
endorsed by the North American Arabidopsis Steering Committee.
Since that time AtDB has been supported by a grant from the National
Science Foundation. However, the annual level of support for AtDB
has been only a small fraction of the support provided for database
activities for similarly advanced models such as Drosophila, yeast
and mouse.
V. SUMMARY OF CONCLUSIONS AND RECOMMENDATIONS
The highest priorities for database content are:
VI. WHAT SHOULD BE IN THE DATABASES?
The long-term goal is to provide interconnected access to all
information about Arabidopsis. However, certain classes of information
should have a higher priority for immediate inclusion and also
require a high degree of curation in order to be most useful to
the community.
A. Map-Based Information
At present, many laboratories are engaged in cloning genes by
map-based cloning methods. The use of map-based cloning is expected
to continue indefinitely and to become the most widely used method
of cloning genes in the future. The ease with which this can be
accomplished is directly proportional to the availability of information
about genetic and physical maps, polymorphisms, and large clones.
Thus, the greatest current need is a unified genetic and physical
map that incorporates all available information about polymorphic
markers (e.g., CAPS, SSLPs, RFLPs), mutations, BAC and YAC clones,
mapped clones and insertions or other modifications of the genome.
Because of the pending completion of the genomic sequence, the
state of the genetic map is expected to change dramatically during
the next several years as sequence-based markers become anchored
on the genomic sequence. The availability of the sequence information
will enhance the value of the integrated map because it will stimulate
map-based cloning efforts which will remain dependent on a high
density of polymorphic markers. The integration of the genetic
and physical maps should be undertaken by a group with appropriate
expertise in both genetic and physical maps and database management
and curation.
Ready access to primary mapping data should be given highest
priority in database development. Map information should be collected
and presented in a manner that allows the user to determine what
is known, plus what remains questionable or unresolved with respect
to map locations of genetic and molecular markers in combination
with a complete physical map anchored to the complete nucleotide
sequence. In constructing the database, it should be remembered
that recombination data generally provide only rough estimates
of map location, and that mapping data may differ widely in quality
and reliability. Therefore, some database users may prefer direct
access to primary mapping data in order to compare their results
with those obtained in other laboratories. A database that provides
options for visualizing several different maps constructed with
different mapping functions or subsets of markers and primary
mapping data would be particularly valuable to the Arabidopsis
community.
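As an illustration of how the choice of mapping function changes a
map, the following minimal sketch converts a recombination fraction
into centimorgans using the Haldane and Kosambi functions, two standard
choices. The function names and the sample value are assumptions made
for illustration; they are not taken from any existing Arabidopsis
database.

```python
import math


def haldane_cm(r):
    """Haldane map distance in cM (assumes no crossover interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map distance in cM (allows for some interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


if __name__ == "__main__":
    r = 0.10  # hypothetical recombination fraction between two markers
    print(f"Haldane: {haldane_cm(r):.2f} cM, Kosambi: {kosambi_cm(r):.2f} cM")
```

Because the two functions diverge as the recombination fraction grows,
a database that stores the primary recombination data can redraw the
map under either function on request.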
Any proposal for database development should also discuss in some
detail how the integrity of these maps would be verified and maintained.
Some mutations and cloned genes are likely to be known by several
different names. It will therefore be important to establish a
database that will accommodate multiple changes in nomenclature.
Other plant databases are moving toward the use of standard gene
names as described in the Mendel database. The Arabidopsis databases
should also adopt this policy to ensure compatibility with other
databases.
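One way to accommodate several names for the same gene is to keep a
synonym index that maps every known name to a single standard name.
The sketch below is only an illustration: the class, its methods, and
the gene names are hypothetical, and the case-insensitive lookup is a
simplifying assumption that a real database might not want.

```python
class GeneNameIndex:
    """Maps any known name of a gene to its standard name."""

    def __init__(self):
        self._standard = {}   # lower-cased name -> standard name
        self._synonyms = {}   # standard name -> list of all known names

    def register(self, standard, synonyms=()):
        """Record a standard name together with any older or alternative names."""
        names = self._synonyms.setdefault(standard, [standard])
        for name in synonyms:
            if name not in names:
                names.append(name)
        for name in names:
            self._standard[name.lower()] = standard

    def resolve(self, name):
        """Return the standard name for any known name, or None if unknown."""
        return self._standard.get(name.lower())

    def synonyms_of(self, name):
        """Return every recorded name for the gene that `name` refers to."""
        standard = self.resolve(name)
        return list(self._synonyms.get(standard, []))


if __name__ == "__main__":
    index = GeneNameIndex()
    index.register("GENE1", ["abc1", "xyz3"])  # hypothetical names
    print(index.resolve("xyz3"), index.synonyms_of("abc1"))
```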
Provisions should also be made to add new types of information
to genetic and physical maps as they become available (break points
of chromosomal aberrations; regions of extensive heterochromatin;
regions with a high/low degree of sequence homology to related
plants; etc.).
B. Sequence information
The value of the genomic sequence will depend on the quality of
the annotation. The goal for the quality of annotation should
be similar or identical to that of other higher organisms. It
should be possible to arrive at an integrated map of a gene by
various routes. A user should be able to begin a query with a
sequence, a gene name, a keyword or a genetic map location. A
user should be able to highlight a region of the genome on a graphical
display and move to increasingly higher levels of resolution with
the click of a mouse. For example, one might start with a whole
chromosome, then move to a ~10 cM region which shows the contigs
of BACs and YACs, the mapped mutations, the sites of insertional
mutations or launching pads for transposons. Next the user should
be able to visualize a ~1 cM region showing all of the above features
plus the locations of open reading frames (theoretical and verified),
ESTs, polymorphic markers, potentially polymorphic markers (i.e.,
SSLPs). Finally, at the next level of resolution the user should
be able to visualize the DNA sequence, the various putative open
reading frames indicated by gene finding programs, experimentally
verified genes, ESTs, BAC and YAC end sequences, polymorphisms,
mutations and other known aberrations. The open reading frames
should be linked to information about gene expression, experimentally
verified information about gene function, mutant phenotypes associated
with classical mutations or over- or under-expression, theoretical
information about gene function based on inference from other
organisms, subcellular localization of the gene product, and known
or predicted modifications of the gene product. If there are other
genes of similar structure in the genome, the presence of these
genes should be indicated. Similarity to genes from other plants
should be indicated with a link to the appropriate databases.
The control regions of the genes should be annotated with known
or predicted motifs and with information about the identity of
other genes with similar motifs.
The sequence information should not simply be a link to raw sequence
in GenBank because the level of annotation and tools to manipulate
that sequence do not directly support the kinds of queries made
by most biologists. Thus, the sequence should be directly available
from a specialized database which provides useful tools for manipulating
the sequence. It should be possible to retrieve from the database
sequence information based on map position, type of sequence,
or other specific requirements. All information should be linked
to publications describing the data when possible.
Because the sequencing groups are not expected to have the resources
to provide continued annotation, there will be a need for a group
to take responsibility for continued upgrading of the annotation
of the genomic sequence as information about the sequence becomes
available from direct experimentation and from computational analyses
based on experimental results obtained with other organisms.
C. Expression information
The use of microarrays and gene chips is expected to provide
a massive amount of new information. Most or all of the Arabidopsis
genes will be used to develop gene chips or microarrays that permit
simultaneous measurements of the expression (mRNA levels) of all
of the genes. These will be used to generate information about
the expression of all the genes in the organism in response to
a wide variety of treatments and genetic backgrounds. Each time
point or treatment could have as many as 25,000 data points. Because
the experiments are technically straightforward, it seems likely
that a common type of experiment will be to prepare mRNA from
a mutant and a wild type and to compare the consequences of the
mutation on the expression of all the genes in the organism. In
addition to simply archiving the raw data it should be possible
to query the data in various ways. For instance, as data from
different treatments accumulate, it will become possible to search
for genes that are coregulated with a gene. This kind of query
may provide insights into the identity of otherwise anonymous
genes or reveal the existence of networks. It should also be possible
to identify all the factors that cause altered expression of a
gene, to identify all genes that specifically respond to certain
treatments, and to identify mutations that cause similar effects on
gene expression. For these kinds of queries it will be necessary
to have software that can identify data sets that are most similar
from among hundreds or thousands of different data sets produced
by different treatments.
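A search for coregulated genes of the kind described above could, for
example, rank genes by the correlation of their expression profiles
across treatments. The sketch below uses the Pearson correlation
coefficient, which is one common choice and is not prescribed by this
report; the function names, the threshold, and the data layout are
assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no information about coregulation
    return cov / (sd_x * sd_y)


def coregulated(profiles, query, threshold=0.9):
    """Return (gene, r) pairs whose profile correlates with the query gene.

    profiles: dict mapping gene name -> list of expression values,
              one value per treatment, in the same treatment order.
    """
    target = profiles[query]
    hits = []
    for gene, profile in profiles.items():
        if gene == query:
            continue
        r = pearson(target, profile)
        if r >= threshold:
            hits.append((gene, r))
    return sorted(hits, key=lambda pair: -pair[1])
```

The same ranking can be applied to whole data sets rather than single
genes, for instance to find the mutants whose expression changes most
resemble those caused by a given treatment.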
There is also a large need for a repository for information about
spatial aspects of gene expression. There are now many transgenic
lines which exhibit specific spatial patterns of reporter gene
expression, and cloned genes which confer such patterns. In the
short term a database with a controlled vocabulary for the various
cell and tissue types and linked images of the patterns of gene
expression would meet immediate needs. In the longer term, it
would be useful to have graphical tools that would integrate the
patterns of gene expression into an organismic model.
D. Phenotypic Information
Because of the diversity of processes that are being analyzed
by a mutational approach in Arabidopsis, there is a need for facile
access to information about gene function as it relates to the
organism. One aspect of the problem involves determining the genetic
basis for a phenotype. In this case it should be possible to enter
a description of a phenotype and obtain a ranked list of probable
genetic alterations that could give rise to the phenotype. Conversely,
it would be very helpful to be able to enter a gene name and obtain
a description of the corresponding mutant. This capability will
greatly enhance the efficiency with which new mutations will be
studied as the number of known mutations begins to plateau. It
is expected that we will soon have saturating collections of transposon
mutants, so having ways of describing these phenotypes, and making
them accessible, will be important. No capability of this kind
currently exists.
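To make the idea of a ranked list concrete, the sketch below scores
each locus by the overlap between the terms in a query and the terms
recorded for its mutant phenotype (Jaccard similarity). This is only
one simple scoring scheme, chosen here for illustration; the function
name, the locus names, and the phenotype terms are all hypothetical.

```python
def rank_by_phenotype(query_terms, catalog):
    """Rank loci by overlap between query terms and recorded phenotype terms.

    catalog: dict mapping locus name -> iterable of phenotype terms.
    Returns (locus, score) pairs, best match first; loci with no
    overlapping term are omitted.
    """
    query = {term.lower() for term in query_terms}
    scored = []
    for locus, terms in catalog.items():
        recorded = {term.lower() for term in terms}
        union = query | recorded
        score = len(query & recorded) / len(union) if union else 0.0
        if score > 0:
            scored.append((locus, score))
    # Highest score first; ties broken alphabetically by locus name.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    catalog = {  # hypothetical loci and terms
        "locus1": ["dwarf", "late flowering"],
        "locus2": ["dwarf", "pale leaves", "sterile"],
        "locus3": ["short root"],
    }
    print(rank_by_phenotype(["dwarf", "late flowering"], catalog))
```

A scheme of this kind only works if phenotypes are described with a
controlled vocabulary, so that the same term is used for the same
feature in every entry.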
One strategy may be to use organizational schemes as entry points
(phenotypic indexes, so to speak). One such index is the genetic
map position. Knowledge of this provides an entry point to other
mutants and papers. Another possible organizing scheme could be
based on the EcoCyc database format of metabolic pathways, so
that biochemical phenotypes could be correlated, or the knowledge
of existing pathways could be queried. The user would click on
a pathway and learn what was known about this. Another way of
indexing and accessing the data for development might be to have
a standardized Arabidopsis growth animation - at appropriate times
during the growth animation, a user could click on a graphic representation
of an organ or other feature, and then this would lead to additional
information. Clicking on a rosette leaf might lead to various
types of leaf cells or indexed leaf morphologies.
E. Stock-Based Information
The databases maintained by the two Arabidopsis resource centers
at Ohio State University and the University of Nottingham provide
excellent access to information on the availability of biological
and chemical materials related to Arabidopsis research. These
databases have implemented many of the recommendations of the
1993 workshop report and should continue to assume responsibility
for descriptive information concerning seed stocks, clones, vectors,
libraries, cDNAs, oligonucleotides, and any other materials that
may require distribution to the Arabidopsis community. Emphasis
should be placed on careful documentation of biological materials,
controlled vocabularies, and maximal utilization of sophisticated
graphics to display plant phenotypes, molecular hybridization
patterns, and other data where appropriate.
With respect to seed stocks, it should be possible to search the database by general phenotype, not just by gene symbol, in order to obtain a broad listing of ecotypes and mutant lines with similar features. Information on phenotypes, screening methods, growth conditions, and differences between alleles should be included for all mutants available through the stock centers. It should also be possible to obtain information on additional mutants or alleles that have been isolated in specific laboratories but are not available from the stock centers.
Individuals should be able to search for specialized libraries,
vectors, transgenic lines, and molecular reagents (antibodies,
purified proteins, unusual compounds, and biochemical standards)
required for Arabidopsis research.
The stock center databases should be directly linked to a central
Arabidopsis database so that queries about the properties of a
gene or mutant can lead directly to a query about the availability
of the resources used to study these or related aspects of the
biology.
F. Community-Based Information
During the past several years there has been a proliferation of electronic resources that provide easy access to information on a wide range of community issues. For instance, it is now relatively easy to retrieve contact information for colleagues or previous postings on the Arabidopsis newsgroup; the abstracts for meetings are available on-line; and there is an electronic Arabidopsis journal, Weeds World, which provides a forum for discussion of methods and problems and publication of short papers. Many laboratories have mounted web pages that provide detailed information about specialized methods, specialized databases or collections of genetic materials. The curators of AtDB have provided convenient access to these diverse resources by maintaining a web page that links to them.
While it is desirable to continue having one group take responsibility
for maintaining a centralized launcher or "data warehouse"
for Arabidopsis-related web sites, this should be a relatively
inexpensive activity and should not require significant public
financial support. The distinction between this activity and a
database does not seem to be fully appreciated by the community.
The result is that, because of the proliferation of sites which
are all superficially similar, the users do not know how to efficiently
find information. Therefore, it may be desirable to maintain a
clear distinction between a centralized internet launcher and
any future attempts to develop a unified Arabidopsis database.
G. Biology-Based Information
The focus of research with Arabidopsis is likely to change in
the future from the immediate emphasis on mapping, sequencing,
and gene identification, to the long-term questions of general
biology and gene function during plant growth and development.
Thus, there is a long-term need to develop Arabidopsis database(s)
that provide facile access to information that may be of critical
importance during this second phase. In proposing a vision of
the future requirements one correspondent wrote the following
(Appendix I):
"I envision a data base organized by levels of organization
that can be addressed at different levels. This database should
contain both structural and functional data organized at different
levels. In this way starting, for example, with the keyword root,
one can access information about root structure, root cell components,
root development, nutrients uptake, etc. and end up in the interactive
pathways and proteins responsible for these processes and the
corresponding genes. It should also be possible to address the
database by processes - for example elongation or flowering or
pollination. Of course this is likely far away from real possibilities.
Going down to the specifics, the information in the database could
be implemented with information on pathways and networks, protein
interaction maps, protein structures, subcellular organelles,
cell structure, etc. This will be a way to reach to a database
as described above."
Examples of topics that might be included in this category include:
information on plant pathogens that infect Arabidopsis and details
on the molecular interactions that take place between host and
pathogen; information on the chemical composition of specific
plant parts (sugars, lipids, proteins, polysaccharides, specialized
compounds, etc.); physiological data on the normal life cycle
and the response of mutant and wild-type plants to various environmental
and experimental treatments; protein profiles of different plant
parts revealed through 2-D gel electrophoresis; information on
the natural distribution and ecology of Arabidopsis and closely
related species; detailed comparisons of the different ecotypes
with respect to morphology, physiology, and molecular biology;
information on the taxonomy of Arabidopsis with particular attention
to related plants used in agriculture; light and electron micrographs
of different types of cells in wild-type plants; records of expression
patterns of specific genes during growth and development; and
computer-enhanced reconstructions of serial sections through various
plant structures.
At present it appears that the development of these resources
will be best accomplished by the individual initiative of members
of the community with specific knowledge and interests in specialized
information of the kinds described above. The eventual integration
of specialized databases of this type into a unified Arabidopsis
database will be facilitated by encouraging the open exchange
of schema between database developers. Therefore, public support
for Arabidopsis databases should be contingent on unrestricted
access to all schema and source code used in Arabidopsis databases.
VII. STANDARDS FOR QUALITY OF DATA
All data that is acquired by the databases should be available to users. However, where data is suspect or in conflict with other data, it may be desirable or necessary to provide various views of data. Thus, it may be desirable to provide a user with a curated version of a certain kind of data and an uncurated version. A specific example might be in the interpretation of open reading frames. Since the various gene finder programs do not always make the same prediction, it should be possible to provide the curator's best guess as one view and the various alternatives as another view. A simple tab associated with each view would provide a convenient tool for meeting this need. It is also desirable to provide access to the primary mapping data used to position mutations and genes on the genetic map.
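The curated and uncurated views of open reading frames could be served
from a single set of records in which every gene model carries its
source. The following sketch assumes a hypothetical record layout and
a source label of "curator" for the curated model; neither is taken
from an existing database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    locus: str     # hypothetical locus identifier
    source: str    # name of the gene finder program, or "curator"
    exons: tuple   # ((start, end), ...) in sequence coordinates


def curated_view(models, locus):
    """Return the curator's best-guess models for a locus."""
    return [m for m in models if m.locus == locus and m.source == "curator"]


def alternatives_view(models, locus):
    """Return every program-generated prediction for a locus."""
    return [m for m in models if m.locus == locus and m.source != "curator"]
```

Keeping both kinds of model in the same table means that no prediction
is discarded when a curator makes a choice, and a user can switch
between the two views without leaving the locus.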
Publication of data should not be a prerequisite for inclusion
in the databases. Indeed, the vast majority of data is unlikely
to ever be available via traditional publishing methods.
VIII. HOW WILL THE DATABASE BE USED? WHAT LINKS SHOULD BE MADE
BETWEEN CATEGORIES OF INFORMATION?
In addition to the specific ability to perform searches as described in the previous sections, the categories of information must be linked with user-friendly interfaces. To facilitate maximal utility of Arabidopsis databases, there is a need to develop a standard interface for access to Arabidopsis genomic sequence information. All information must carry the names of the individuals who provided the data. Attention should be paid to tight coordination between the genetic map and related genes, clones, and sequences, so that selection of any of these will lead transparently to accession of the others. Also, it is highly desirable for the database to have simple links for comparative sequence and mutant analysis with other plants and beyond that, with all organisms. The interface should allow viewing in a variety of ways.
As examples of the types of links desired, we list below a series
of questions that the system should be able to answer.
(1) If a user enters two cloned markers, the system should return
a list of all markers of a specified type that map between them.
(2) If a user points to a location on a genetic map, the genes,
clones, and sequences should appear. Likewise, a user should be
able to derive map position if a DNA sequence is used as the starting
point.
(3) For any gene, the expression pattern of the RNA encoded by
the clone should be readily accessed. Since information may be
available about how the expression of the gene changes with different
treatments, or in different mutants, it will be necessary to allow
the user to define a set of comparator genes that can serve as
standards. If available, links to spatially resolved information
should be available as images.
(4) If a user finds a mutant that is altered in a particular way,
the system should retrieve all mutants altered in a similar manner.
A cross-species accession to similar mutants in other plants might
be useful.
(5) It should be possible for a user to rapidly determine the
map positions for all genes in a given biochemical or developmental
pathway.
(6) If a user has new mapping information, the system should have
the ability to download archived data in that region for manipulation.
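As a concrete reading of question (1), the sketch below returns the
markers of a given type that lie between two named markers on the same
chromosome. The data layout (a dictionary of name to chromosome,
position in cM, and marker type) and all marker names are assumptions
made for illustration.

```python
def markers_between(markers, first, second, marker_type=None):
    """List markers located strictly between two markers, in map order.

    markers: dict mapping marker name -> (chromosome, position_cM, type).
    marker_type: if given, only markers of this type are returned.
    """
    chrom_1, pos_1, _ = markers[first]
    chrom_2, pos_2, _ = markers[second]
    if chrom_1 != chrom_2:
        raise ValueError("the two markers are on different chromosomes")
    low, high = sorted((pos_1, pos_2))
    hits = [
        (pos, name)
        for name, (chrom, pos, mtype) in markers.items()
        if chrom == chrom_1
        and low < pos < high
        and (marker_type is None or mtype == marker_type)
    ]
    return [name for pos, name in sorted(hits)]
```

The same interval lookup, applied to genes, clones, or sequences
instead of markers, would also serve question (2).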
IX. COMMUNITY ISSUES THAT MUST BE CONSIDERED IN THE DESIGN
AND OPERATION OF THE DATABASE
A. Advisory Committee
All Arabidopsis database proposals should include a provision
for an oversight committee that will represent the community of
Arabidopsis researchers and will advise database investigators
on priorities and data to be included. The oversight committee
should also include individuals with technical expertise in database
design and management. It may also be desirable to have representation
from individuals involved in development and operation of other
plant databases. In order to maximize accountability, it would
be desirable to have the oversight committee formally approved
by the North American Arabidopsis Steering Committee (NASC).
B. Curation, Entry, Correction, and Long-Term storage of Data
One of the major problems associated with developing a database is collecting data. Because database deposits do not currently generate a citation for inclusion in an individual's vita, there is no incentive to make the effort to deposit data. One mechanism for encouraging deposits may be to implement a citation system for database deposits which would resemble those currently used for journal publications (i.e., author, title, date, accession number).
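A citation for a database deposit built from the four elements named above (author, title, date, accession number) might be generated as in the sketch below. The punctuation, the function name, and the example values are hypothetical; no citation style is specified in this report.

```python
def format_deposit_citation(authors, title, date, accession):
    """Build a one-line citation for a database deposit.

    authors: list of author names, in the order they should appear.
    """
    if not authors:
        raise ValueError("at least one author is required")
    return f"{'; '.join(authors)}. {title}. {date}. Accession {accession}."


if __name__ == "__main__":
    # All values below are placeholders, not real deposits.
    print(format_deposit_citation(
        ["A. Author", "B. Author"],
        "Hypothetical mapping data set",
        "1999",
        "DEP0001",
    ))
```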
The task of data acquisition would be greatly facilitated if the journals would require authors to make deposits of data directly into appropriate databases at the time of publication in much the same way that all journals now require GenBank accession numbers. There was unanimous agreement that this would be a desirable development and there are indications that at least some of the plant journals are willing to implement such a change. Future proposals should include a plan for creating user-friendly interfaces that can be used by the authors of journal articles to enter data directly into an internet-accessible form. Such forms could also be used by members of the community to enter unpublished data into the database. There was broad enthusiasm for a requirement that anyone receiving public research support be obliged by the funding agencies to describe how the data and research materials from previously supported research have been made available to the stock centers and databases.
Previous attempts to acquire data by soliciting input from the
community have been generally unsuccessful and curation of data
by the community is not considered feasible. Thus, the Arabidopsis
databases must be curated by professional curators. Professional
curators of Arabidopsis databases should make every effort to
leverage the database activities undertaken elsewhere and to adapt
existing software when appropriate for use in the Arabidopsis
research community. Thus, the major activity of Arabidopsis databases
should be the collection, entry, and correction of data rather
than writing software for storing, retrieval, and presentation
of data. It is clear from past experience that full time professional
curators are required for the development and operation of an
adequate database. In order to recruit and retain highly skilled
personnel to develop and operate the Arabidopsis databases, it
is essential that there be a reasonable expectation of stable
long-term funding.
C. Relation to Other Databases and Programs
All Arabidopsis databases should use industry-standard hardware
and software, so that they are both compatible with and can communicate
transparently with other databases. However, as stated elsewhere
in this report, the primary goal should be to collect and store
data using currently accepted database models rather than to develop
new database software. The most important principle, therefore,
in the design of next generation databases is that the data be
entered in a form that makes it possible to interface easily with
other databases and which makes the data portable to future generation
database software. Any software that is written specifically for
an Arabidopsis database (display of genetic maps, for example)
should be layered and use industry standard interfaces so that
the software, as well as the underlying data, is also compatible
with and portable to future generation databases. In addition,
consideration should be given to production of generic database
structures that can be used for a variety of different organisms.
Databases are currently being developed for most plants of economic
significance. Because all higher plants are very closely related
and are thought to contain a similar basic gene set, the information
in these databases can be readily interrelated by biological criteria.
However, because of the various concerns of the groups developing
other plant databases, and because of the different kinds and
amount of information available, it is not feasible at this time
to consider a common database structure that would accommodate
Arabidopsis and other plants. Therefore, in order to facilitate
future interconnectivity between the Arabidopsis databases and
other plant databases, a concerted effort must be made to adopt
common standards whenever possible. The use of the Mendel gene
nomenclature conventions is a case in point. The developers of
Arabidopsis databases should be informed about major activities
with other plants and wherever possible should endeavor to share
software.
D. Access of Databases
Data accumulated by a publicly funded database should be community
property. There should be no restrictions on the availability
of the data in the databases and they must be accessible internationally
by the internet.
E. One or Several Databases?
It is desirable to facilitate full expression of the collective
genius of the world Arabidopsis community. Because talent in bioinformatics
and enthusiasm for Arabidopsis is distributed around the world,
and because of the ease with which databases can communicate via
the internet, a distributed database should be the goal. However,
based on past experience, the users experience difficulty if information
is fragmented or presented in a variety of different interfaces.
Thus, the current situation in which users must navigate six separate
databases to view genome sequence information is unacceptable.
Bringing all genome sequence annotation into a common format should
have a top priority. Thus, if there are several databases, each
should have a clear and defined subset of the database task, and
appropriate links to the others. It is imperative that they be
integrated and that the staff operating the different databases
be committed to cooperating with each other. Unrestricted access
to all schema and source codes should be a requirement for public
support. The goal should be to have a single user interface for
a specific class of information. Proposals requesting support
for database development must address this issue.
F. Education
The ADB investigators should be provided funding for the provision
of community education and training. This would include the development
of on-line help, training manuals, workshops, and short courses.
The ADB developers should maintain complete documentation and
source code. This information should be in the public domain.
Because educators and students in higher education (including
high-school students) may make use of ADB, sufficient documentation
for non-sophisticated users should be made available.
G. Financial Support for Arabidopsis Databases
Although there is a general willingness of most members of the
community to pay directly for database services in much the same
way that journal subscriptions are currently purchased, it was
concluded that the disadvantages of imposing charges outweigh
the likely benefits for the foreseeable future. Thus, at present,
it would be inappropriate to impose charges for the use of publicly
supported databases. As with other organism-specific databases,
the burden of funding must be borne by government agencies. In
order to retain highly qualified database curators and developers,
there must be reasonable assurance of continuing support.
H. Ownership of Databases
Because of the convergence of electronic publishing and database
activities, potential liability issues, and because of the intrinsic
value of established databases, consideration needs to be given
to the legal ownership of databases. At present, databases developed
with US federal grants are the property of the institutions that
administer the grants. Because of the importance of ensuring unrestricted
public access to Arabidopsis databases, proposals for funding
of future database activities should provide assurances that institutional
policies are consistent with the continuing need for free unrestricted
access.
X. WHAT DESIGN-FEATURE ISSUES NEED TO BE CONSIDERED?
The design considerations for Arabidopsis databases are essentially
unchanged from the 1993 workshop report. One of the most pressing
needs reported was for improved graphical visualization tools
for various forms of data.
A. Design Considerations that Should be Discussed in any Proposal:
B. Research Goals
Developers should consider and propose to carry out some short-term
research relevant to improving the quality of the Arabidopsis
thaliana database. Some possibilities for short-term research
would be:
C. Possible Long-Term Research Goals
NATIONAL SCIENCE FOUNDATION
DEADLINE: MARCH 22, 1999
MATRIX OF PROGRAM REQUIREMENTS
The Directorate for Biological Sciences (BIO) of the National Science Foundation (NSF), through the Biological Database Activities Program in the Division of Biological Infrastructure, has identified as a priority support for the design, development, and implementation of biological information resources for the Multinational Coordinated Arabidopsis thaliana Genome Research project. Therefore, the Biological Database Activities Program announces a special competition to extend, maintain and distribute a user-focused, on-line resource for biological information on Arabidopsis thaliana, termed here the Arabidopsis thaliana Information Resource (AtIR). The successful awardee of this competition will be required to incorporate and build on the existing Arabidopsis thaliana Database (AtDB, http://genome-www.stanford.edu/Arabidopsis/), which continues to be a unique resource in its role as a primary repository of Arabidopsis information.
Proposal preparation instructions: Standard Grant Proposal Guide (GPG) plus supplementary guidance
Deviations from standard GPG proposal preparation instructions: PIs must complete the BIO Proposal Classification Form (PCF)
Cost sharing/matching requirements: None
Indirect cost (F&A) limitations: None
Other budgetary limitations: Funds may not be requested or used for construction or renovation of facilities.
Use of FastLane in Proposal Preparation & Submission: Entire Proposal Required
FastLane point of contact for this program: E-mail biofl@nsf.gov.
Full Proposal Deadline: March 22, 1999
Description of supplementary criteria: In addition, reviewers will focus on the following issues:
Where appropriate, reviewers will also consider:
Special grant conditions anticipated: None
The Directorate for Biological Sciences (BIO) of the National Science Foundation (NSF), through the Biological Database Activities Program in the Division of Biological Infrastructure, has identified as a priority support for the design, development, and implementation of biological information resources for the Multinational Coordinated Arabidopsis thaliana Genome Research project.
The Multinational Coordinated Arabidopsis thaliana Genome Research project was established in 1990 to develop Arabidopsis thaliana as an experimental model system for flowering plants. During the next several years, the sequence of the Arabidopsis genome will be completed and extensive sequence and mapping information will become available for this and many other plant species. New technologies such as microarrays and gene chips now present the capacity to study the functional expression of thousands of genes at a time, while new capabilities in creating libraries of insertional mutations will allow detailed studies and ultimately manipulation of specific gene function. Drawing on the original goals of embarking on model organism genomes, the value of the Arabidopsis project lies in the utility of the information gathered in seeking to understand the biology of flowering plants.
Therefore, the Biological Database Activities Program announces a special competition to extend, maintain, and distribute a user-focused, on-line resource for biological information on Arabidopsis thaliana, termed here the Arabidopsis thaliana Information Resource (AtIR). The successful awardee of this competition will be required to incorporate and build on the existing Arabidopsis thaliana Database (AtDB, http://genome-www.stanford.edu/Arabidopsis/), which continues to be a unique resource in its role as a primary repository of Arabidopsis information.
The Arabidopsis thaliana Information Resource (AtIR) is expected to serve as a repository for data and information generated from multiple genomic studies on Arabidopsis. Operational priorities for this project will be predominantly needs-driven as defined by the Arabidopsis (and related) research communities, and as gathered through mechanisms established by the awardee. While it is understood that some software development will be required to meet these needs, the major mission of AtIR should be viewed as the collection, entry, and updating of data and information.
The project will be expected to focus on specific needs that have been defined by the Arabidopsis research community during the course of meetings held in Dallas, Texas in 1993 ( http://genome-www.stanford.edu/Arabidopsis/db/dallas.report.html ) and updated in a meeting in Madison, Wisconsin, in 1998 ( http://genome-www.stanford.edu/Arabidopsis/db/database.needs.html ).
The greatest current need is a unified genetic and physical map that incorporates all available information about polymorphic markers (e.g., CAPS, SSLPs, RFLPs), mutations, BAC and YAC clon>
Because of the diversity of processes that are being analyzed by a mutational approach in Arabidopsis, there is a need for the entire scientific community to have facile access to information about gene function as it relates to the organism. This capability will greatly enhance the efficiency with which new mutations will be studied as the number of known mutations begins to plateau. AtIR will be expected to incorporate this capability.
AtIR should contain cross-references to all other relevant databases (e.g., GenBank nucleotide sequence database; Arabidopsis thaliana stock center databases; cell and/or probe repository catalogue number(s); and genetic map databases for other species showing significant synteny with Arabidopsis thaliana).
Storage and dissemination of expression data. Most or all of the Arabidopsis genes will be used to develop gene chips or microarrays that permit simultaneous measurements of the expression (mRNA levels) of all of the genes. The use of microarrays and gene chips is expected to provide a massive amount of new information. The ability to query this information may provide insights into the identity of otherwise anonymous genes, reveal the existence of networks, or identify factors that cause altered expression of a gene. While it is not necessarily expected that the AtIR will serve as a primary repository for such data, it is expected that user access to such resources will be enabled through the use of appropriate links to other such databases.
Links to stock-based information. The databases maintained by the two Arabidopsis resource centers at Ohio State University and the University of Nottingham provide excellent access to information on the availability of biological and chemical materials related to Arabidopsis research. These databases will continue to assume responsibility for descriptive information concerning seed stocks, clones, vectors, libraries, cDNAs, oligonucleotides, and any other materials that may require distribution to the Arabidopsis community. The AtIR should be directly linked to the stock center databases so that queries about the properties of a gene or mutant can lead in turn to information about the availability of, and ordering procedures for, associated reagents.
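As an illustration only, the cross-referencing described above (a gene or mutant record that leads to sequence databases and to stock center catalogues) might be sketched in Python as follows. This announcement does not prescribe any data model; the class names, database labels, and identifiers below are hypothetical placeholders, not actual AtDB or stock center fields.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CrossRef:
    """A pointer from an AtIR record to an entry in another database."""
    database: str    # hypothetical label, e.g. "GenBank" or a stock center
    identifier: str  # accession or catalogue number in that database


@dataclass
class GeneRecord:
    """A gene or mutant record carrying its cross-references."""
    name: str
    xrefs: List[CrossRef] = field(default_factory=list)


def stock_links(record: GeneRecord, stock_databases: Set[str]) -> List[CrossRef]:
    """Return only the cross-references that point at stock center databases."""
    return [x for x in record.xrefs if x.database in stock_databases]


# Hypothetical labels for the two resource centers named in the text.
STOCK_DATABASES = {"ohio_state_stock_center", "nottingham_stock_center"}

record = GeneRecord(
    name="example-gene",
    xrefs=[
        CrossRef("GenBank", "HYPOTHETICAL-ACCESSION"),
        CrossRef("ohio_state_stock_center", "HYPOTHETICAL-STOCK-1"),
    ],
)

for link in stock_links(record, STOCK_DATABASES):
    print(link.database, link.identifier)
```

Keeping each cross-reference as a (database, identifier) pair is one simple way to let a query about a gene follow through to ordering information held elsewhere, without copying that information into the AtIR itself.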
The task of data acquisition would be greatly facilitated if members of the Arabidopsis research community could deposit data directly. The AtIR should include a plan for creating user-friendly interfaces that can be used by scientists to deposit data directly to the AtIR via the internet, and address approaches to be taken to encourage direct submission of data from the research community.
Curation and maintenance refers to the need to add, validate and update the biological attributes of repository data. Approaches to this task have ranged from an "in-house" staff of curators or annotators to dependency on community-based methods of data correction, maintenance and updating, to, conceivably, a highly automated suite of computational tools. Curation of data in an Arabidopsis data resource has been and will continue to be an important community need and will be an important facet of the AtIR operation. Proposers will be expected to outline approaches to this task and address the utility of automated or community-based approaches to data curation.
The Arabidopsis database should use industry-standard hardware and software, so that it is compatible with, and can communicate transparently with, other databases. An important principle in designing the resource will be that the storage architecture is structured in a form that makes it possible to interface easily with other databases. Some consideration should be given to production of generic database structures that can potentially be adopted for use in a variety of different organisms and particularly in related mapping and/or sequencing activities in the Plant Genome Research community.
Proposals submitted in response to this announcement must discuss the structure of the proposed database with these goals and scope in mind, and provide detailed plans for long-term management and distribution of the database. The data should be structured and maintained in a way that permits the development and use of complex queries by knowledgeable users or by third party software developers. The AtIR will be expected to collaborate with other efforts relevant to plant databases, both nationally and internationally. Plans detailing how such collaborations might work should be provided. However, formal arrangements for the collaborations need not be made prior to an award. The proposals must also provide plans for the incorporation into the AtIR of information currently found in the Arabidopsis thaliana Database (AtDB) and for the timely assumption of responsibility for data entry, repository maintenance and database distribution, all of which are now provided by AtDB.
The Arabidopsis thaliana Information Resource Project competition will accept applications from eligible institutions as described in the NSF "Grant Proposal Guide" (GPG), NSF 99-2, Chapter I, Section D, in categories 1 and 2 only. The GPG is available on the NSF web site at the URL http://www.nsf.gov/cgi-bin/getpub?nsf992. Paper copies of the GPG may be purchased from the NSF Publication Clearinghouse, P.O. Box 218 Jessup, Maryland 20794-0218, telephone (301) 947-2722, or by e-mail from pubs@nsf.gov.
Consortia of eligible individuals or organizations may also apply, but a single individual or organization must accept overall management responsibility. International collaboration is encouraged; however, financial support for any non-U.S. participant organization must be provided from within the participant's country or other non-U.S. sources.
The Principal Investigator (PI) and other senior staff responsible for the project must have the necessary skills to successfully carry out the tasks covered in this announcement, or the proposal must present convincing plans to hire such staff. The PI should have demonstrated the leadership necessary to meet the challenges of managing a large community database in a rapidly changing technological and scientific environment. The PI and other members of the senior staff should, in the aggregate, have experience with aspects of plant biology research relevant to the database, have current knowledge about computerized databases and their management, and have a demonstrated ability to interact with the members of the various scientific disciplines and other groups important for the successful operation of the database. Experience with the successful management of a database effort of comparable scope and complexity will be considered an important asset.
The NSF expects to make one five-year award in Fiscal Year 1999, depending on the quality of submissions and the availability of funds. The total award size is expected to range up to $1 million per year. The exact amount will depend on the advice of reviewers and on the availability of funds. It is anticipated that the award will be administered as a grant or cooperative agreement.
Note that while the terms "award" and "awardee" as used herein imply a single entity, NSF is not necessarily constrained by this model and is open to proposals of innovative models involving more than one entity by which the primary functions of AtIR might be administered (e.g., a "virtual resource"). Again, a single individual or organization must accept overall management responsibility.
Proposals to the Arabidopsis thaliana Information Resource (AtIR) Project competition require electronic submission via the NSF FastLane system in accordance with the guidelines provided in the "Instructions for Proposal Preparation" found in the GPG, Chapter II. The GPG is available on the NSF Web Site at the URL http://www.nsf.gov/cgi-bin/getpub?nsf992. Paper copies of the GPG may be purchased from the NSF Publication Clearinghouse, P.O. Box 218 Jessup, Maryland 20794-0218, telephone (301) 947-2722, or by e-mail from pubs@nsf.gov.
Include in proposals to AtIR the components listed in GPG, Chapter II, Section D. State information in each component as clearly and concisely as possible for merit review. Take special care in adhering to the requirements for page limitations, font size, and margins (see GPG, Chapter II, Section C). Proposals not strictly adhering to the requirements of the GPG and these guidelines are returned without review. Instructions and guidelines for the FastLane submission of proposals are detailed in Instructions for Preparing and Submitting a Standard Proposal via FastLane located at http://www.fastlane.nsf.gov/a1/newstan.htm. Also, see the "FastLane Submission" section below.
Guidelines are provided for specific sections of the proposal as follows:
In the NSF FastLane system follow instructions on proposal preparation. When completing the Cover Sheet click on the "Add Org Unit" button. Highlight "DIRECT FOR BIOLOGICAL SCIENCES" and click "OK." Highlight "Database Activities" and click "OK." Clicking "OK" designates this program as the NSF organizational unit of consideration. In the box labeled "Program Announcement/Solicitation No." enter "NSF 99-50" with no additional characters.
Begin the title of the proposal with "AtIR: . . . ."
The first-listed Principal Investigator (PI) is designated as the primary PI and is responsible for coordinating the entire proposed project.
Provide a brief (200 words or less) description of the project.
Particular attention must be paid to the following major aspects in preparing a description of the proposed project. Although some relevant technical issues are mentioned below, these details are intended only as guidelines. This section must not exceed 25 pages inclusive of tables, diagrams or other visual material. Clearly label sections and major subdivisions of the project description.
Describe your vision for the long-term future of such a database as the AtIR and the role this operation should play in the overall plant genome research forum. Address issues such as long-term economic sustainability of the database, potential economic models that invoke alternative sources of support, and possible transition plans to such models.
The proposal should provide a description of (1) the logical or conceptual model for the data, and (2) a general outline of the physical implementation schema for the repository. The general features and overall design of both must be justified in the context of efficient data management and researcher support functions. Extensibility of the design to the maintenance of data and information from other databases of plant research information may be discussed here.
Proposals should describe the manner in which the data to be placed in the resource will be acquired. Specifically, if it is intended that data be acquired from investigators as the original source of the data, procedures for the handling of such submissions should be described, including any standard or proprietary data exchange formats or tools to be used.
Because it is anticipated that the volume and rate of data generation will continue to increase in the future, an important technical issue to be considered is the development and use of approaches which are capable of scaling to anticipated increases in the volume of data.
Proposals should describe precisely the expected content of the database. The description should include some definition of what constitutes a minimum dataset, as well as a description of what might constitute a fully annotated dataset.
Minimum criteria for ensuring the completeness and consistency of entries at the time they are placed in the database should be described, as should procedures for assuring that the criteria have been met. It is expected that the utility of the criteria and procedures will be periodically reviewed and approved using the formal external advisory mechanism.
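As a hedged illustration of what such minimum criteria could look like in practice, the Python sketch below checks an entry for completeness and consistency before it is accepted. The field names and rules are assumptions made for this example only; the announcement leaves the actual definition of a minimum dataset to the proposer. The one fixed fact used is that the Arabidopsis thaliana nuclear genome has five chromosomes.

```python
from dataclasses import dataclass
from typing import List, Optional

# Nuclear chromosomes of Arabidopsis thaliana. Treating anything else as an
# error is an assumption of this sketch (organellar genomes are ignored).
VALID_CHROMOSOMES = {"1", "2", "3", "4", "5"}


@dataclass
class Entry:
    """A hypothetical minimum dataset for one mapped locus."""
    locus_name: str
    chromosome: str
    source: str                               # who submitted the data
    genbank_accession: Optional[str] = None   # annotation, not required


def check_entry(entry: Entry) -> List[str]:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    if not entry.locus_name.strip():
        problems.append("missing locus_name")
    if entry.chromosome not in VALID_CHROMOSOMES:
        problems.append(f"unknown chromosome: {entry.chromosome!r}")
    if not entry.source.strip():
        problems.append("missing source")
    return problems


print(check_entry(Entry("example-locus", "4", "example-lab")))
print(check_entry(Entry("", "9", "")))
```

Returning a list of problems rather than raising on the first failure lets a curator or a submitting scientist see every shortfall in an entry at once.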
Proposals should address the technical issues involved in the maintenance of a highly automated information repository, with convenient public access and off-site backup or other provision for protection from software or hardware failure. Provisions for maintenance of internal and external links should be discussed. The focus of the proposal should be the operation of a basic repository.
Proposals should also describe the distribution methods envisioned, for example network access to the complete collection using the WWW or other means, and periodic production of tapes, CD-ROM or other media containing current entries.
If mirror sites are to be used, describe how the central and mirror sites will interact, estimate the time and effort required to operate a typical mirror and provide the criteria to be used in selecting mirror sites.
Any planned charges for copies on tape or other media, or for permission to provide such copies, should be discussed briefly in the proposal. All such charges will be subject to approval by the NSF. Periodic assessment of the utility of the distribution methods will be expected as part of the management and oversight of the AtIR.
NSF expects that Principal Investigators agree to complete and open sharing of data and material in an expeditious manner. By submitting a proposal, it is understood that the submitting institution and all participants agree to these guidelines (see the NSF GPG, NSF 99-2, Chapter VII, Section H).
Describe how users will be able to develop and use direct queries of the database. The interaction with the repository and the means to ensure stability and security should be specified.
Provide a timetable for the assumption of responsibility for new data entry and distribution of the database, including any efforts necessary for incorporation of entries now found in the AtDB into the new database. It is anticipated that the time required for complete assumption of the responsibility will not exceed one year from the date of the award.
Describe provisions for ensuring the quality of the database and its operation, including procedures for obtaining and responding to user feedback on issues related to quality.
A sound management plan will be a crucial aspect of the proposal. The responsibilities of the various senior personnel must be clearly described, as must the time and effort to be committed by each. A mechanism for replacing key personnel who leave the project must also be described. In the event senior personnel will participate in multiple activities related to the database (e.g., outreach, data acquisition, etc.), estimate the anticipated effort with respect to each activity.
The awardee will be expected to establish a formal mechanism for ensuring ongoing external input from relevant groups and interested individuals regarding AtIR policies and practices. An appropriate mechanism could, for example, consist of a standing external advisory board with relevant technical and managerial expertise. The function of such an advisory board could be to advise senior management of the AtIR and the awardee institution(s) on policies such as those regarding operational priorities, format, content and validation of entries and reports, those related to other aspects of use or distribution of the database, etc. Periodic review and approval of the utility and appropriateness of any such criteria will be expected.
Implementation of the mechanism should ensure that the views of relevant research communities are represented as part of this advice. In general, the mechanism should provide an opportunity for input from the international Arabidopsis research community. The appropriateness and adequacy of the mechanism, as implemented, will be subject to approval by the NSF.
Describe provisions for timely and widespread communication of activities of the AtIR, in particular procedures for alerting user/developer communities to impending changes in software/formats/policies, etc. Describe any activities planned to train new or experienced users in use of the resource. Activities supported by this award may provide an ideal environment to train young scientists in cutting-edge research technologies and to expose them to new paradigms in plant biology informatics. In addition, these activities should promote increased participation by members of under-represented groups. Proposers should describe plans to increase diversity whenever feasible.
If the PI or any Co-PI has received federal support for the establishment or operation of a publicly available database within the last five years, provide a brief description of the relevant features of the database together with the name of the agency providing support, the award number and title, and the amount and duration of the award. This section should include a general description of the type of database, number of users, means of distribution, etc. If the database is available electronically, provide the relevant URL. If awards for more than one project have been received, describe the project most relevant to the current proposal. This section is limited to a maximum of 5 pages, including any references, and counts toward the 25-page limit of the Project Description.
For each of the key personnel, including senior staff and any other staff whose participation is critical to the success of the project, provide a curriculum vitae or short biographical sketch. Briefly describe relevant experience and list up to 10 publications (to include the individual's 5 most important and up to 5 other relevant publications). Include an alphabetical list of current and past collaborators of all key personnel whose biosketches are included, and of any other staff or collaborators mentioned by name in the proposal. Additionally, include names of all graduate students and postdoctoral fellows who have trained with these individuals, as well as anyone with whom these individuals have co-authored a paper within the last 4 years. The information may not exceed 2 pages for each individual. Applicants may include letters of support in the FastLane submission by scanning the documents and adding them at the end of the Project Description file, clearly labeled.
Copies of letters indicating agreement to participate should be provided by all senior personnel who do not endorse the cover page as PI or Co-PI. Such letters should include a brief description of the individual's expected role in the project and an estimate of the time and effort to be required. Scan the letters and add them at the end of the Project Description file, clearly labeled as Appendix A. This information is not counted as part of the 25 page limit of the Project Description.
Provide a budget and budget justification for each year of support requested as well as a separate, cumulative budget for all years. If funds for subcontracts are requested, then a separate budget and budget justification must be prepared by each subcontractor to show the distribution of subcontract funds across categories. Funds for facility construction or renovation may not be requested.
A brief justification for funds in each budget category should be provided. For major equipment or software materials, a particular model or source and the current or expected price should be specified whenever possible. A brief explanation of the need for each item whose cost exceeds $10,000 should be provided. This section should also include details of institutional cost sharing, if any, and of other sources of support for the project, such as government, industry, or private foundations. Although cost sharing is not required, any such commitment specified in the proposal will be referenced and included as a condition of an award resulting from this solicitation.
Appropriate documentation of any such commitments should be provided in an appendix (Appendix B). Scan the documents and add them at the end of the Project Description file, clearly labeled as Appendix B. This information is not counted as part of the 25 page limit of the Project Description.
Provide a complete list of current and pending support for all PIs and Co-PIs.
Include a brief description of available facilities, including space and computational equipment available for the project. Where requested equipment or materials duplicate existing items, explain the need for duplication. This section is limited to 2 pages.
Complete the BIO PCF, available on the NSF FastLane system. The PCF is an on-line coding system that allows the Principal Investigator to characterize his/her project when submitting proposals to the Directorate for Biological Sciences. Once a PI begins preparation of his/her proposal in the NSF FastLane system and selects a division, cluster, or program within the Directorate for Biological Sciences as the first or only organizational unit to review the proposal, the PCF will be generated and available through the Form Selector screen. Additional information about the BIO PCF is available in FastLane at http://www.fastlane.nsf.gov/a1/BioInstr.htm.
Plans requiring collaborative effort by an individual not employed at the submitting institution(s) should be supported by a signed letter from the individual. Besides indicating a willingness to collaborate, the letter should provide a brief outline of the goals of the collaboration and estimate the time and effort the individual expects to devote to the collaboration. Biographical sketches should not be provided for such individuals, unless requested by NSF. A collaborator whose primary purpose is advisory (e.g., service on a committee that will provide policy advice) does not need to provide/submit such a letter.
Scan the letters and other relevant Special Information and Supplementary Documentation, as specifically described in the GPG, Chapter II, Section D.12, and add them at the end of the Project Description file after Appendices A and B, clearly labeled as "Special Information and Supplementary Documentation." Only documentation as described in the GPG, Chapter II, Section D.12 and detailed above is allowed. This information is not counted as part of the 25 page limit of the Project Description.
Only the appendices described in the "Budget Justification" and "Biographical Sketches" sections are allowed. Other letters of endorsement may not be included.
Proposals must be sent by 5:00 p.m., submitter's local time, March 22, 1999 via the NSF FastLane system.
Mail the following materials directly to the Biological Database Activities Program:
Do not mail copies of the full proposal. NSF will make the appropriate number of copies of the proposal.
The grantee is responsible for ensuring that the materials are received by March 26, 1999. Send materials to:
Arabidopsis thaliana Information Resource Project-NSF 99-50
Division of Biological Infrastructure
National Science Foundation
4201 Wilson Boulevard
Room 615
Arlington, VA 22230
Unless requested by NSF, additional information may not be sent following proposal submission.
In order to use NSF FastLane to prepare and submit a proposal, you must have the following software: Netscape Navigator 3.0 or above, or Microsoft Internet Explorer 4.01 or above; Adobe Acrobat Reader 3.0 or above for viewing PDF files; and Adobe Acrobat 3.X or Aladdin Ghostscript 5.10 or above for converting files to PDF.
To use FastLane to prepare the proposal, your institution needs to be a registered FastLane institution. A list of registered institutions and the FastLane registration form are located on the FastLane Home Page. To register an organization, authorized organizational representatives must complete the registration form. Once an organization is registered, PINs for individual staff are available from the organization's sponsored projects office.
To access FastLane, go to the NSF Web site at http://www.nsf.gov, then select "FastLane," or go directly to the FastLane home page at http://www.fastlane.nsf.gov/. Please see "Instructions for Preparing and Submitting a Proposal to the NSF Directorate for Biological Sciences" located at http://www.fastlane.nsf.gov/a1/BioInstr.htm. Additionally, read the "PI Tipsheet for Proposal Preparation" and the "Frequently Asked Questions about FastLane Proposal Preparation," accessible at https://www.fastlane.nsf.gov/a1/A1Prep.htm.
IMPORTANT NOTE: For technical assistance with FastLane, please send an e-mail message to biofl@nsf.gov. If you have inquiries regarding other aspects of proposal preparation or submission, please contact the cognizant program officer, preferably at least three weeks before the competition deadline.
Reviews of proposals submitted to NSF are solicited from peers with expertise in the substantive area of the proposed research or education project. These reviewers are selected by Program Officers charged with the oversight of the review process. NSF invites the proposer to suggest, at the time of submission, the names of appropriate or inappropriate reviewers. Special care is taken to ensure that reviewers have no immediate and obvious conflicts with the proposer. Special efforts are made to recruit reviewers from non-academic institutions, minority serving institutions, adjacent disciplines to that principally addressed in the proposal, first time NSF reviewers, etc.
Proposals will be reviewed against the following general merit review criteria established by the National Science Board. Following each criterion are potential considerations that the reviewer may employ in the evaluation. These are suggestions and not all will apply to any given proposal. Each reviewer will be asked to address only those that are relevant to the proposal and for which he/she is qualified to make judgments.
How important is the proposed activity to advancing knowledge and understanding within its own field and across different fields? How well qualified is the proposer (individual or team) to conduct the project? To what extent does the proposed activity suggest and explore creative and original concepts? How well conceived and organized is the proposed activity? Is there sufficient access to resources?
How well does the activity advance discovery and understanding while promoting teaching, training, and learning? How well does the proposed activity broaden the participation of underrepresented groups (e.g., gender, ethnicity, geographic, etc.)? To what extent will it enhance the infrastructure for research and education, such as facilities, instrumentation, networks, and partnerships? Will the results be disseminated broadly to enhance scientific and technological understanding? What may be the benefits of the proposed activity to society?
In addition, reviewers will focus on the following issues:
Where appropriate, reviewers will also consider:
One of the principal strategies in support of NSF's goals is to foster integration of research and education through the programs, projects and activities it supports at academic and research institutions. These institutions provide abundant opportunities where individuals may concurrently assume responsibilities as researchers, educators, and students and where all can engage in joint efforts that infuse education with the excitement of discovery and enrich research through the diversity of learner perspectives. PIs should address this issue in their proposal to provide reviewers with the information necessary to respond fully to both NSF merit review criteria. NSF staff will give this careful consideration in making funding decisions.
Broadening opportunities and enabling the participation of all citizens (women and men, underrepresented minorities, and persons with disabilities) is essential to the health and vitality of science and engineering. NSF is committed to this principle of diversity and deems it central to the programs, projects, and activities it considers and supports. PIs should address this issue in their proposal to provide reviewers with the information necessary to respond fully to both NSF merit review criteria. NSF staff will give this careful consideration in making funding decisions.
Most proposals submitted to the NSF are reviewed by mail review, panel review, or some combination of mail and panel review.
Proposals submitted to this activity will be evaluated by mail reviewers and by a special emphasis panel formed to review the applications. Site visits may be conducted as needed. NSF will be able to tell applicants whether their proposals have been declined or recommended for funding within six months for 95 percent of proposals in this category.
Notification of the award is made to the submitting organization by a Grants Officer in the Division of Grants and Agreements. Organizations whose proposals are declined will be advised as promptly as possible by the cognizant NSF Program Division administering the program. Verbatim copies of reviews, not including the identity of the reviewer, will be provided automatically to the lead Principal Investigator.
Grants awarded as a result of this announcement are administered in accordance with the terms and conditions of NSF GC-1 (10/98), "Grant General Conditions" (10/98), or FDP-III (7/97), "Federal Demonstration Partnership General Terms and Conditions," or CA-1 "Cooperative Agreement General Terms and Conditions" (2/98), depending on the grantee organization. Copies of these documents are available at no cost from the NSF Publications Clearinghouse, P.O. Box 218, Jessup, Maryland 20794-0218, telephone (301) 947-2722, or via e-mail to pubs@nsf.gov. More comprehensive information is contained in the NSF Grant Policy Manual (NSF 95-26), available on the NSF OnLine Document System located at http://www.nsf.gov/, or for sale through the Superintendent of Documents, Government Printing Office, Washington, D.C. 20402.
For all multi-year grants (including both standard and continuing grants), the PI must submit an annual project report to the cognizant Program Officer at least 90 days before the end of the current budget period.
Within 90 days after expiration of a grant, the PI also is required to submit a final project report. Approximately 30 days before expiration, NSF will send a notice to remind the PI of the requirement to file the final project report. Failure to provide final technical reports delays NSF review and processing of pending proposals for the PI. PIs should examine the formats of the required reports in advance to assure availability of required data.
NSF has implemented a new electronic project reporting system, available through FastLane, which permits electronic submission and updating of project reports, including information on: project participants (individual and organizational); activities and findings; publications; and other specific products and contributions. Reports will continue to be required annually and after the expiration of the grant, but PIs will not need to re-enter information previously provided, either with the proposal or in earlier updates using the electronic system.
Effective October 1, 1998, PIs are required to use the new reporting format for annual and final project reports. PIs are strongly encouraged to submit reports electronically via FastLane. For those PIs who cannot access FastLane, paper copies of the new report formats may be obtained from the NSF Clearinghouse as specified above. NSF expects to require electronic submission of all annual and final project reports via FastLane beginning in October 1999.
If the submitting organization has never received an NSF award, it is recommended that the organization's appropriate administrative officials become familiar with the policies and procedures in the NSF Grant Policy Manual which are applicable to most NSF awards. The "Prospective New Awardee Guide" (NSF 97-100) includes information on: Administration and Management Information; Accounting System Requirements and Auditing Information; and Payments to Organizations with Awards. This information will assist an organization in preparing documents that NSF requires to conduct administrative and financial reviews of an organization. The guide also serves as a means of highlighting the accountability requirements associated with Federal awards. This document is available electronically on NSF's Web site at http://www.nsf.gov/cgi-bin/getpub?nsf97100.
Inquiries regarding the announcement should be directed to the cognizant NSF official: Dr. Paul Gilna, Division of Biological Infrastructure, National Science Foundation, 4201 Wilson Boulevard, Room 615, Arlington, VA 22230. Telephone: (703) 306-1469; FAX: (703) 306-0356; E-mail: pgilna@nsf.gov
The National Science Foundation (NSF) funds research and education in most fields of science and engineering. Grantees are wholly responsible for conducting their project activities and preparing the results for publication. Thus, the Foundation does not assume responsibility for such findings or their interpretation.
NSF welcomes proposals from all qualified scientists, engineers, and educators. The Foundation strongly encourages women, minorities, and persons with disabilities to compete fully in its programs. In accordance with federal statutes, regulations, and NSF policies, no person on grounds of race, color, age, sex, national origin, or disability shall be excluded from participation in, be denied the benefits of, or be subjected to discrimination under any program or activity receiving financial assistance from NSF. Some programs may have special requirements that limit eligibility.
Facilitation Awards for Scientists and Engineers with Disabilities (NSF 91-54) provide funding for special assistance or equipment to enable persons with disabilities (investigators and other staff, including student research assistants) to work on NSF-supported projects.
The National Science Foundation has Telephonic Device for the Deaf (TDD) and Federal Information Relay Service (FIRS) capabilities that enable individuals with hearing impairments to communicate with the Foundation regarding NSF programs, employment, or general information. TDD may be accessed at (703) 306-0090; FIRS at 1-800-877-8339.
The information requested on proposal forms and project reports is solicited under the authority of the National Science Foundation Act of 1950, as amended. The information on proposal forms will be used in connection with the selection of qualified proposals; project reports submitted by awardees will be used for program evaluation and reporting within the Executive Branch and to Congress. The information requested may be disclosed to qualified reviewers and staff assistants as part of the review process; to applicant institutions/grantees to provide or obtain data regarding the proposal-review process, award decisions, or the administration of awards; to government contractors, experts, volunteers, and researchers and educators as necessary to complete assigned work; to other government agencies needing information as part of the review process or in order to coordinate programs; and to another Federal agency, court or party in a court or Federal administrative proceeding if the government is a party. Information about Principal Investigators may be added to the Reviewer file and used to select potential candidates to serve as peer reviewers or advisory committee members. See Systems of Records, NSF-50, "Principal Investigator/Proposal File and Associated Records," 63 Federal Register 267 (January 5, 1998), and NSF-51, "Reviewer/Proposal File and Associated Records," 63 Federal Register 268 (January 5, 1998). Submission of the information is voluntary. Failure to provide full and complete information, however, may reduce the possibility of receiving an award.
Public reporting burden for this collection of information is estimated to average 120 hours per response, including the time for reviewing instructions. Send comments regarding this burden estimate and any other aspect of this collection of information, including suggestions for reducing this burden, to: Reports Clearance Officer; Information Dissemination Branch, DAS; National Science Foundation; Arlington, VA 22230.
The program described in this announcement is in the category 47.074 (BIO) of the Catalog of Federal Domestic Assistance.
In accordance with NSF Important Notice No. 120 dated June 27, 1997, Subject: Year 2000 Computer Problem, NSF awardees are reminded of their responsibility to take appropriate actions to ensure that the NSF activity being supported is not adversely affected by the Year 2000 problem. Potentially affected items include computer systems, databases, and equipment. The National Science Foundation should be notified if an awardee concludes that the Year 2000 will have a significant impact on its ability to carry out an NSF-funded activity. Information concerning Year 2000 activities can be found on the NSF Web site at http://www.nsf.gov/oirm/y2k/start.htm.
OMB NO. 3145-0058
P.T.: 34
K.W.: 1002037
NSF 99-50 Electronic Dissemination Only