The Arabidopsis Assemblies

Input sequences

ESTs: All Arabidopsis EST sequences in dbEST were downloaded and subjected to a 'cleaning' procedure. Because assembly relies on sequence similarity, shared stretches of 'contaminating' sequence can cause false assemblies. The 'cleaning' process checks for vector sequence, poly(A) or poly(T) tails, and poly(CT) stretches, and trims these away. In addition, at most 4% ambiguous bases are allowed in any one sequence, and the final length after trimming must be greater than 100 bp. A nightly script downloads new ESTs from dbEST, and these are periodically incorporated into the existing assemblies.
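The trimming and filtering rules above can be illustrated with a minimal sketch. This is not the script used to build the assemblies. The 4% and 100 bp limits come from the text, but the minimum run length (MIN_RUN), the restriction of trimming to the sequence ends, and the function name clean_est are assumptions made for illustration. Vector screening is omitted because it needs a vector sequence database.

```python
import re

MAX_AMBIGUOUS_FRACTION = 0.04  # at most 4% ambiguous bases (from the text)
MIN_LENGTH = 100               # trimmed length must exceed 100 bp (from the text)
MIN_RUN = 10                   # ASSUMPTION: shortest run that counts as a tail

# ASSUMPTION: only runs at the ends of the read are trimmed.
_RUNS = r"(?:A{%d,}|T{%d,}|(?:CT){%d,})" % (MIN_RUN, MIN_RUN, MIN_RUN // 2)
_LEADING = re.compile("^" + _RUNS)
_TRAILING = re.compile(_RUNS + "$")


def clean_est(seq):
    """Return the trimmed sequence, or None if it fails the filters."""
    seq = seq.upper()
    # Repeat until stable, in case one terminal run hides another.
    while True:
        trimmed = _TRAILING.sub("", _LEADING.sub("", seq))
        if trimmed == seq:
            break
        seq = trimmed
    if len(seq) <= MIN_LENGTH:
        return None
    # Anything other than A, C, G or T is counted as ambiguous.
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    if ambiguous / len(seq) > MAX_AMBIGUOUS_FRACTION:
        return None
    return seq
```

A read that is too short after trimming, or that carries too many ambiguous bases, is rejected outright rather than trimmed further.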

Known Arabidopsis coding sequences: To help 'anchor' ESTs from known Arabidopsis genes, we included a set of non-redundant Arabidopsis protein-coding genes in the assembly process. ESTs that assemble with these known genes can easily be assigned an identification.

23921 EST sequences and 822 known transcript sequences were input.

3764 assemblies produced, containing 15218 ESTs.
8701 'singletons' remained.

63.6% of ESTs assemble into contigs.
Average number of ESTs per contig = 4.0; maximum = 221
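The two derived figures above follow directly from the counts. As a quick check, assuming the percentage is taken over the 23921 input ESTs (variable names are illustrative only):

```python
ests_input = 23921          # EST sequences input
assemblies = 3764           # assemblies produced
ests_in_assemblies = 15218  # ESTs contained in those assemblies

percent_assembled = 100 * ests_in_assemblies / ests_input
mean_ests_per_contig = ests_in_assemblies / assemblies

print(f"{percent_assembled:.1f}% of ESTs assemble into contigs")
print(f"Average number of ESTs per contig = {mean_ests_per_contig:.1f}")
```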

Assigning Identifications

Assemblies: The consensus sequence of each assembly was searched against the public nucleotide and protein databases to obtain a putative identification of the gene. Alignments were assessed by eye, and the most sensible biological identification was assigned if the match was significant. Assemblies that contained a known Arabidopsis coding sequence were assigned an identification automatically. Putative identifications were kept as general as possible to enable useful keyword searching.

Classes of Identifications: Based on the sequence to which a significant match is found, the match is assigned to one of four classes. Class 1 indicates that the EST is derived from an Arabidopsis gene that has already been isolated and sequenced. Class 2 indicates that the EST is from a gene that is closely related to a previously cloned Arabidopsis gene, but the two are distinct members of the same gene family. Class 3 indicates that the hit is to a non-Arabidopsis gene, and that no similar Arabidopsis gene is present in the public databases. This class can be subdivided to indicate whether the match is from another plant species or from a more evolutionarily distant organism. Class 4 covers 'junk' sequences such as rRNA genes and contamination by bacterial sequences.
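The four classes can be summarized as a small decision sketch. The class assignments described here were made by eye, so this only restates the definitions. The three boolean inputs, the names MatchClass and assign_class, and the choice to test for 'junk' first are all assumptions made for illustration.

```python
from enum import IntEnum


class MatchClass(IntEnum):
    KNOWN_ARABIDOPSIS_GENE = 1     # EST derives from an already sequenced gene
    ARABIDOPSIS_FAMILY_MEMBER = 2  # distinct member of the same gene family
    NON_ARABIDOPSIS_GENE = 3       # hit only to a gene from another organism
    JUNK = 4                       # rRNA genes, bacterial contamination


def assign_class(is_junk, hit_is_arabidopsis, same_gene):
    """Map a curator's judgement of a significant match to a class.

    All three arguments are hypothetical flags a curator would supply.
    """
    if is_junk:  # ASSUMPTION: junk takes precedence over other classes
        return MatchClass.JUNK
    if hit_is_arabidopsis:
        if same_gene:
            return MatchClass.KNOWN_ARABIDOPSIS_GENE
        return MatchClass.ARABIDOPSIS_FAMILY_MEMBER
    return MatchClass.NON_ARABIDOPSIS_GENE
```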

Non-assembled ESTs (Singletons): ESTs that were not part of an assembly were not subjected to manual identification or class assignment. Instead, the results of BLAST searches contained in the dbEST database were parsed to extract the ID of the highest-scoring blastx hit, provided its score was greater than 90. ESTs that did not have a significant hit to a protein entry were not assigned any identification.
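The selection rule for singletons can be sketched as follows. The layout of the BLAST results stored in dbEST is not described here, so the sketch assumes the hits for one EST have already been parsed into (hit ID, score) pairs. The function name best_blastx_id and that input format are hypothetical.

```python
def best_blastx_id(hits, min_score=90):
    """Return the ID of the highest-scoring hit if its score exceeds min_score.

    hits: iterable of (hit_id, score) pairs for one EST (assumed format).
    Returns None when there are no hits or the best score is 90 or lower.
    """
    best = max(hits, key=lambda hit: hit[1], default=None)
    if best is None or best[1] <= min_score:
        return None
    return best[0]
```

The cutoff is strict: a top score of exactly 90 is treated as not significant, matching "greater than 90" in the text.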

Name Assignment Summary

TCs: 1826 assemblies (48.5%) have been assigned putative IDs.

Class       Assemblies      ESTs
1            641 (17%)      4275 (28%)
2            279 (7.4%)     1286 (8.5%)
3            888 (23.6%)    3808 (25%)
4             18 (0.5%)       71 (0.5%)
No Match    1938 (51.5%)    5778 (38%)

Singletons (IDs parsed automatically from dbEST)

Significant hit        2446 (32%)
No significant hits    5175 (68%)
No data available      1082
