Protocol for Assembly of Arabidopsis ESTs and Transcripts
Preparation of EST data
- Sequences were extracted from the dbEST public Sybase server.
- Sequences were subjected to quality-control screening
(removal of vector, polyA, polyT, or CT sequence; minimum-length and percent-N criteria).
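The screening criteria above can be illustrated with a minimal Python sketch. The thresholds (`MIN_LENGTH`, `MAX_PERCENT_N`, `MIN_TAIL`) and the function names are hypothetical, since the protocol does not state the values or tools actually used. Vector and CT removal are omitted here.

```python
import re

# Hypothetical thresholds; the protocol does not state the actual values.
MIN_LENGTH = 100       # minimum sequence length after trimming
MAX_PERCENT_N = 3.0    # maximum percentage of ambiguous (N) bases
MIN_TAIL = 10          # shortest A or T run treated as a polyA/polyT tail

def trim_tails(seq: str) -> str:
    """Remove a trailing polyA run and a leading polyT run."""
    seq = re.sub(r"A{%d,}$" % MIN_TAIL, "", seq)
    seq = re.sub(r"^T{%d,}" % MIN_TAIL, "", seq)
    return seq

def percent_n(seq: str) -> float:
    """Percentage of N bases; an empty sequence counts as 100% N."""
    return 100.0 * seq.count("N") / len(seq) if seq else 100.0

def passes_qc(seq: str):
    """Return the cleaned sequence, or None if it fails screening."""
    cleaned = trim_tails(seq.upper())
    if len(cleaned) < MIN_LENGTH:
        return None
    if percent_n(cleaned) > MAX_PERCENT_N:
        return None
    return cleaned
```

A sequence that returns `None` would be dropped before assembly; any other return value is the trimmed sequence that goes forward.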
Preparation of non-redundant transcript database
- All Arabidopsis sequences in the PLANT division of GenBank were extracted.
- Non-coding sequences were discarded.
- Coding sequences from genomic entries were extracted.
- Redundant entries for the same gene were removed, retaining link to accession number.
- 822 transcript sequences were retained from the 1314 entries extracted from GenBank.
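The redundancy-removal step can be sketched as follows. This is an illustration only: the `(accession, gene_name, coding_sequence)` record format is hypothetical, and keeping the longest sequence per gene is an assumption of the sketch, since the protocol does not say how the representative entry was chosen.

```python
from collections import defaultdict

def collapse_redundant(entries):
    """Collapse entries for the same gene into one record.

    entries: iterable of (accession, gene_name, coding_sequence) tuples.
    Returns {gene_name: {"sequence": str, "accessions": [str, ...]}}.
    """
    by_gene = defaultdict(list)
    for accession, gene, seq in entries:
        by_gene[gene].append((accession, seq))

    result = {}
    for gene, records in by_gene.items():
        # Assumption: keep the longest sequence; ties keep the first seen.
        _, best_seq = max(records, key=lambda r: len(r[1]))
        result[gene] = {
            "sequence": best_seq,
            # Every source accession stays linked to the retained record.
            "accessions": [acc for acc, _ in records],
        }
    return result
```

Storing the full accession list alongside each retained sequence is one way to keep the link back to GenBank that the protocol calls for.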
Assembly
- Cleaned EST sequences and non-redundant transcript sequences were combined.
- Sequences were assembled into contigs using TIGR Assembler. Strict assembly criteria help minimize the creation of chimeric contigs. Each contig is assigned a TC (Tentative Consensus) number.
Summary of Assembly Results for Release 1.0
- 23921 EST sequences were input following quality control.
- 3764 assemblies were produced, containing 15218 ESTs (63.6%).
- 8703 'singletons' remained (36.4%).
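The percentages in this summary are fractions of the 23921 input ESTs. A short Python check, using only the figures reported above (the variable names are illustrative):

```python
total_ests = 23921      # ESTs input after quality control
assemblies = 3764       # assemblies (TCs) produced
in_assemblies = 15218   # ESTs placed in assemblies
singletons = 8703       # ESTs left unassembled

# Every input EST is either in an assembly or a singleton.
assert in_assemblies + singletons == total_ests

pct_assembled = round(100 * in_assemblies / total_ests, 1)
pct_singleton = round(100 * singletons / total_ests, 1)
mean_ests_per_assembly = round(in_assemblies / assemblies, 2)

print(pct_assembled, pct_singleton)   # 63.6 36.4
print(mean_ests_per_assembly)         # 4.04
```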
Name Assignment
Names are assigned on the basis of sequence similarity searches against
protein and nucleotide databases. Only 'significant' matches are assigned names. Each match is placed in one of four classes:
- Class 1: An exact match against a known Arabidopsis coding sequence.
- Class 2: A non-exact match against a known Arabidopsis coding sequence.
- Class 3: A non-exact match against a non-Arabidopsis coding sequence.
- Class 4: Various forms of contamination (E. coli sequences, rRNA genes, etc.).
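The four classes can be read as a simple decision rule, sketched below in Python. The `Match` record and its fields are hypothetical, because the protocol does not define a data format or say how exactness and contamination are determined. The sketch only encodes the class definitions listed above.

```python
from dataclasses import dataclass

@dataclass
class Match:
    # All fields are hypothetical; the protocol defines no record format.
    is_exact: bool         # the match is exact
    is_arabidopsis: bool   # hit is a known Arabidopsis coding sequence
    is_contaminant: bool   # e.g. E. coli sequence or rRNA gene

def match_class(m: Match) -> int:
    """Map a significant match to class 1-4."""
    if m.is_contaminant:
        return 4
    if m.is_arabidopsis:
        return 1 if m.is_exact else 2
    # The protocol lists only non-exact matches for non-Arabidopsis hits;
    # this sketch places any non-Arabidopsis match in class 3.
    return 3
```

Checking for contamination first is a design choice of the sketch, so that a contaminant hit is never reported as a gene name.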
Name assignment for singleton ESTs is automatic. The results of blastx
searches stored in dbEST are parsed, and the name of the top-scoring protein
match is stored if its score is above a threshold of 90. Because these matches are not manually inspected, they are not assigned a class.
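A minimal Python sketch of the singleton rule follows. The `(protein_name, score)` tuple format is hypothetical and stands in for whatever is parsed from the stored blastx results. Reading "above a threshold of 90" as strictly greater than 90 is also an assumption.

```python
SCORE_THRESHOLD = 90  # threshold stated in the protocol

def assign_name(hits):
    """Return the name of the top-scoring hit, or None.

    hits: list of (protein_name, score) tuples (hypothetical format).
    """
    if not hits:
        return None  # no blastx data available for this EST
    name, score = max(hits, key=lambda h: h[1])
    # Assumption: "above" means strictly greater than the threshold.
    return name if score > SCORE_THRESHOLD else None
```

For example, `assign_name([("protein X", 85), ("protein Y", 140)])` returns `"protein Y"`, while a list whose best score is 90 or lower returns `None`.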
Name Assignment Summary
TCs: 1826 assemblies (48.5%) have been assigned putative IDs.

| Class | Assemblies | ESTs |
|---|---|---|
| 1 | 641 (17%) | 4275 (28%) |
| 2 | 279 (7.4%) | 1286 (8.5%) |
| 3 | 888 (23.6%) | 3808 (25%) |
| 4 | 18 (0.5%) | 71 (0.5%) |
| No Match | 1938 (51.5%) | 5778 (38%) |

Singletons (IDs parsed automatically from dbEST):

- Significant hit: 2446 (32%)
- No significant hits: 5175 (68%)
- No data available: 1082
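The figures in this summary can be cross-checked against the assembly totals with a few lines of Python. All numbers are taken from the summary itself; the variable names are illustrative. The check also shows that the singleton percentages are relative to the singletons for which data were available, not to all 8703 singletons.

```python
# (assemblies, ESTs) per class, as reported in the summary
class_table = {
    "1": (641, 4275),
    "2": (279, 1286),
    "3": (888, 3808),
    "4": (18, 71),
    "No Match": (1938, 5778),
}

total_assemblies = sum(a for a, _ in class_table.values())
total_ests = sum(e for _, e in class_table.values())
named = total_assemblies - class_table["No Match"][0]
pct_named = round(100 * named / total_assemblies, 1)

# Singleton counts
sig_hit, no_hit, no_data = 2446, 5175, 1082
total_singletons = sig_hit + no_hit + no_data
with_data = sig_hit + no_hit
pct_sig = round(100 * sig_hit / with_data)

print(total_assemblies, total_ests)  # 3764 15218
print(named, pct_named)              # 1826 48.5
print(total_singletons, pct_sig)     # 8703 32
```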