Arabidopsis EST Analysis

Protocol for Assembly of Arabidopsis ESTs and Transcripts

Preparation of EST data

Sequences were extracted from the dbEST public Sybase server.
Sequences were subjected to quality control screening
(removal of vector, polyA, polyT, or CT sequence; minimum length criteria; percent_N criteria).
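
The screening step can be illustrated with a minimal sketch. The source does not give the thresholds or the tool used, so the values of MIN_LENGTH, MAX_PERCENT_N and MIN_TAIL below are assumptions chosen only for illustration, and the function names are hypothetical. Vector and CT removal are left out because they depend on reference data not described here.

```python
import re

# Assumed thresholds; the source does not state the actual values.
MIN_LENGTH = 100      # hypothetical minimum length after trimming
MAX_PERCENT_N = 3.0   # hypothetical maximum percentage of N bases
MIN_TAIL = 10         # hypothetical minimum run length treated as a tail


def trim_tails(seq: str) -> str:
    """Strip a trailing polyA run and a leading polyT run."""
    seq = re.sub(r"A{%d,}$" % MIN_TAIL, "", seq)
    seq = re.sub(r"^T{%d,}" % MIN_TAIL, "", seq)
    return seq


def percent_n(seq: str) -> float:
    """Percentage of ambiguous (N) bases; an empty sequence counts as 100%."""
    return 100.0 * seq.count("N") / len(seq) if seq else 100.0


def passes_qc(seq: str) -> bool:
    """Apply tail trimming, then the length and percent_N criteria."""
    cleaned = trim_tails(seq.upper())
    return len(cleaned) >= MIN_LENGTH and percent_n(cleaned) <= MAX_PERCENT_N
```

A sequence is kept only if it passes both criteria after trimming, so short reads that consist mostly of tail are discarded.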

Preparation of non-redundant transcript database

All Arabidopsis sequences in the PLANT division of GenBank were extracted.
Non-coding sequences were discarded.
Coding sequences from genomic entries were extracted.
Redundant entries for the same gene were removed, retaining a link to each accession number.
822 transcript sequences were used from 1314 entries extracted from GenBank.
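
One way to collapse redundant entries while keeping their accession links is sketched below. The source does not say how entries were recognized as belonging to the same gene, so the gene_key field is an assumption, as are the function name and the record layout.

```python
def remove_redundant(entries):
    """Keep one coding sequence per gene and remember every accession.

    entries: iterable of (accession, gene_key, cds_sequence) tuples.
    gene_key is a hypothetical identifier for "the same gene".
    """
    kept = {}
    for accession, gene_key, cds in entries:
        if gene_key not in kept:
            # First entry seen for this gene supplies the sequence.
            kept[gene_key] = {"sequence": cds, "accessions": [accession]}
        else:
            # Later entries are dropped, but their accession is retained.
            kept[gene_key]["accessions"].append(accession)
    return kept
```

Each retained transcript then carries the list of all GenBank accessions it represents.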

Assembly

Cleaned EST sequences and non-redundant transcript sequences were combined.
Using TIGR Assembler, the sequences were assembled into contigs. Strict assembly criteria help minimize the creation of chimeric contigs. Each contig is assigned a TC (Tentative Consensus) number.

Summary of Assembly Results for Release 1.0
23921 EST sequences were input following quality control.
3764 assemblies were produced, containing 15218 ESTs (63.6%).
8703 'singletons' remained (36.4%).
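
The percentages in this summary follow directly from the counts, as the short check below shows. The variable names are illustrative; the mean assembly size is derived from the figures above and is not stated in the source.

```python
total_ests = 23921          # ESTs input following quality control
assemblies = 3764           # assemblies (TCs) produced
ests_in_assemblies = 15218  # ESTs placed in an assembly
singletons = 8703           # ESTs left unassembled

# Every input EST is either in an assembly or a singleton.
accounted_for = ests_in_assemblies + singletons

pct_assembled = round(100 * ests_in_assemblies / total_ests, 1)   # 63.6
pct_singletons = round(100 * singletons / total_ests, 1)          # 36.4
mean_ests_per_assembly = round(ests_in_assemblies / assemblies, 2)  # 4.04

print(accounted_for == total_ests, pct_assembled, pct_singletons,
      mean_ests_per_assembly)
```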

Name Assignment

Names are assigned on the basis of the results of sequence similarity searches against protein and nucleotide databases. Only 'significant' matches are assigned names. Each match is placed in one of 4 classes.

Class 1: An exact match against a known Arabidopsis coding sequence.
Class 2: A non-exact match against a known Arabidopsis coding sequence.
Class 3: A non-exact match against a non-Arabidopsis coding sequence.
Class 4: Various forms of contamination (E. coli sequences, rRNA genes, etc.).

Name assignment for singleton ESTs is automatic. The results of blastx searches stored in dbEST are parsed, and the name of the top scoring protein match is stored if the score is above a threshold of 90. These matches, because they are not manually inspected, are not assigned a class category.
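
The automatic rule for singletons can be sketched as follows. The format of the blastx results stored in dbEST is not described in the source, so the (name, score) tuples and the function name are assumptions. "Above a threshold of 90" is read here as strictly greater than 90.

```python
SCORE_THRESHOLD = 90  # threshold stated in the text


def assign_singleton_name(hits):
    """Return the name of the top-scoring protein match, or None.

    hits: list of (protein_name, score) tuples, a hypothetical parsed
    form of the blastx results for one singleton EST.
    """
    if not hits:
        return None  # no search data available for this EST
    name, score = max(hits, key=lambda hit: hit[1])
    # Read "above a threshold of 90" as strictly greater than 90.
    return name if score > SCORE_THRESHOLD else None
```

Because the result is taken without inspection, the function returns only a name and no class category.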

Name Assignment Summary
TCs: 1826 assemblies (48.5%) have been assigned putative IDs.
Class	Assemblies	ESTs
1	641 (17%)	4275 (28%)
2	279 (7.4%)	1286 (8.5%)
3	888 (23.6%)	3808 (25%)
4	18 (0.5%)	71 (0.5%)
No Match	1938 (51.5%)	5778 (38%)

Singletons (IDs parsed automatically from dbEST)
Significant hit	2446 (32%)
No significant hits	5175 (68%)
No data available	1082
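
The figures in this summary can be cross-checked with a few lines of arithmetic. The check below suggests that the singleton percentages are computed over the singletons for which search data were available (2446 + 5175 = 7621), not over all 8703 singletons; that reading is inferred from the numbers and is not stated in the source.

```python
# Assemblies and ESTs per class, copied from the summary table.
assemblies = {"1": 641, "2": 279, "3": 888, "4": 18, "No Match": 1938}
ests = {"1": 4275, "2": 1286, "3": 3808, "4": 71, "No Match": 5778}

total_assemblies = sum(assemblies.values())              # 3764
named_assemblies = total_assemblies - assemblies["No Match"]  # 1826
pct_named = round(100 * named_assemblies / total_assemblies, 1)  # 48.5
total_ests_in_tcs = sum(ests.values())                   # 15218

# Singleton counts, copied from the summary table.
significant, not_significant, no_data = 2446, 5175, 1082
with_data = significant + not_significant                # 7621
pct_hit = round(100 * significant / with_data)           # 32
pct_no_hit = round(100 * not_significant / with_data)    # 68

print(total_assemblies, named_assemblies, pct_named, total_ests_in_tcs)
print(with_data + no_data, pct_hit, pct_no_hit)
```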
