RE-ANNOTATED GENOMIC SEQUENCE

Release 3 Notes: Updated February 13, 2004

The January 2003 annotation of the Drosophila genome, Release 3.1, is available at the new FlyBase annotation database, as well as at the public data libraries (NCBI, EBI, DDBJ). Release 3.1 is also available via the new FlyBase Fly BLAST server and at the Download site. Unannotated BAC-based sequences were deposited in GenBank as they were finished. Finally, the WGS3 whole-genome shotgun sequence assembly has been deposited in GenBank.
SEQUENCE FINISHING

The annotated D. melanogaster sequence was first released on March 24, 2000, and constituted Release 1 of the genomic sequence (Adams et al. 2000). After Celera/BDGP filled 330 gaps and changed ~3000 annotations, Release 2 was made public in October 2000. This whole-genome shotgun assembly had ~1300 gaps. To produce the 116.8-Mb Release 3 euchromatic sequence, the BDGP closed almost all of the gaps in the euchromatic portion of the genome, and raised the sequence quality to an estimated error rate of less than one in 100,000 base pairs in the unique portion of the sequence, and less than one in 10,000 base pairs in the repetitive portion (Celniker et al. 2002). The accuracy of the assembly has been verified by restriction digestion of BAC clones, and composite sequences of transposable elements in the previous releases have been replaced in Release 3 with the true sequences of the individual transposons.

The euchromatic chromosome arms 2L, 2R, 3R, 4 and numbered divisions 12-20 of the X chromosome (80.4 Mb) were finished by the BDGP at the Lawrence Berkeley National Laboratory (LBNL). Chromosome arm 3L and numbered divisions 1-11 of the X chromosome (36.3 Mb) were finished at the Human Genome Sequencing Center at Baylor College of Medicine.

An additional 20.7 Mb of genomic sequence, the "WGS3 heterochromatic sequence", was produced as a part of a whole-genome shotgun assembly performed by Celera Genomics (Celniker et al. 2002). This sequence corresponds to a significant part of the heterochromatin and has been annotated (Hoskins et al. 2002). The 116.8-Mb finished euchromatic sequence and the 20.7-Mb WGS3 heterochromatic sequence together constitute the 137.5-Mb Release 3 version of the D. melanogaster genomic sequence.

RE-ANNOTATION

FlyBase re-annotated the finished euchromatic sequence (Misra et al. 2002), as chromosome arms 2R, 2L, 3R, 3L, 4, and X. The goals of this re-annotation were:
The Release 3 re-annotation improves the quality of the majority of gene models. The length of UTRs and the number of alternative transcripts have increased, due to the increase in EST and complete cDNA sequences. The fine details of the exon-intron structure are significantly improved. Numerous genes have been merged and/or split, based on the cDNA and BLASTX data; some genes predicted in earlier releases have been deleted, and others are newly predicted. Genes were deleted if they overlapped transposons or if they fell below a minimum size cutoff (100 aa) and had no experimental evidence beyond a computational gene prediction. Overall, these improved annotations resulted in changes in >45% of the predicted proteins. However, it is interesting to note that the total number of protein-coding genes has changed very little (Misra et al. 2002).

The confidence we have in gene models varies considerably, and those models considered particularly uncertain are marked "problematic". The annotation of genes will be ongoing (see below), and will require the continued input of the community. If you notice a mistake in an annotation, please submit an error report form, including the sequence of the corrected annotation.

Of the several classes of RNA-coding genes, all tRNA genes found in the sequenced strain are included in Release 3. The rRNA and 5S RNA genes, located in centromeric heterochromatin, are not included. Because this region is very repetitive and difficult to assemble, it is unlikely that these genes will appear in annotated genomic sequence in Release 3. For other families of RNA genes, such as snRNAs, snoRNAs, and microRNAs, generally only genes reported in the literature and curated by FlyBase have been annotated.

HETEROCHROMATIN

A similar annotation of heterochromatic sequences was performed (Hoskins et al. 2002).
The boundary between the euchromatic and heterochromatic portions of the genome is somewhat arbitrary, but a working definition based on cytology is described in Hoskins et al. 2002. In previous releases, scaffolds that could not be mapped were assigned to "chromosome U (unmapped)". Other scaffolds that were part of heterochromatin but mapped to chromosome arms were included at the ends of those arms. In Release 3, certain scaffolds (AE002743 and AE002734 on 2L, AE002751 and AE002760 on 2R) have not been included in the euchromatic portion of the genome, because they are heterochromatic and were not included in the finished Release 3 euchromatic sequence. In Release 3, WGS3 heterochromatic scaffolds corresponding to these Release 2 mapped scaffolds and all Release 2 "unmapped" scaffolds are represented as a set of heterochromatic chromosome segments: "2h", "3h", "4h", "Xh", or "Yh" if they have been mapped to particular chromosomes, and "U" if they have not yet been mapped.

ACCESSING THE DATA

The January 2003 Release 3.1 annotation of euchromatic and heterochromatic sequence is available at the new FlyBase annotation database, as well as at the public data libraries (NCBI, EBI, DDBJ). Release 3.1 is also available via the FlyBase BDGP Fly BLAST server and download site.

Previously, annotations were changed only at the time of new releases; we now plan to update individual gene annotations more frequently. The next update of the annotation database should happen by March 2004. These updates will not constitute new releases, since the underlying genomic sequence will not change. It will therefore be necessary to make a note of the date on which you look at a particular annotation, since the annotation may change. When the genomic sequence changes, annotations will be mapped forward and the version will get a new release number (see Release 4 below). We need your input if you discover a mistake in an annotation; please fill out an error report form and include the sequence of the corrected annotation.
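As a minimal sketch of the note-taking recommended above, the following Python fragment records the release and viewing date alongside an annotation identifier. The class name, field names, identifier "CG0000", and dates are all hypothetical and chosen only for illustration; nothing here is part of FlyBase or BDGP software.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AnnotationNote:
    """A personal record of when an annotation was looked at (hypothetical)."""
    identifier: str   # annotation identifier, e.g. a CG identifier
    release: str      # release shown on the report page, e.g. "3.1"
    viewed_on: date   # date the annotation was viewed


def same_release(a: AnnotationNote, b: AnnotationNote) -> bool:
    """True when both notes refer to the same release."""
    return a.release == b.release


def may_have_changed(earlier: AnnotationNote, later: AnnotationNote) -> bool:
    """True when the annotation could differ between the two viewings.

    Annotations can be updated between releases, so a different date is
    enough to warrant re-checking, even when the release is unchanged.
    """
    if earlier.identifier != later.identifier:
        raise ValueError("notes refer to different annotations")
    return earlier.release != later.release or earlier.viewed_on != later.viewed_on


# Illustrative values only.
first = AnnotationNote("CG0000", "3.1", date(2003, 1, 15))
second = AnnotationNote("CG0000", "3.1", date(2004, 3, 15))
print(same_release(first, second), may_have_changed(first, second))
```

Keeping the date next to the release is the point of the sketch: two viewings under the same release number are not guaranteed to show the same gene model.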
Release 2, the previous release of our GadFly annotation database, is available at the BDGP and FlyBase web sites. The Release 1, 2, and 3.0 (July 2002) datasets in FASTA, GFF, and XML formats are also available at the BDGP for download. Users must be certain to check the Release number of any genomic sequence or annotation. Version numbers appear after the accession number, for example:
However, if the genomic sequence did not change between March and October 2000, GenBank retained the .1 version number but changed the date to October. Now that the sequence has changed between Release 2 in October 2000 and Release 3 in the summer of 2002, GenBank has updated the sequence to the .2 version number, for example:
Links from FlyBase gene and annotation reports will go to the most recent release at NCBI. You can always query at NCBI using the accession with version number if you need to access previous releases. The release number will appear at the top of each annotation query and report page, and also at the FlyBase and BDGP download sites for sequence (multiple FASTA) and XML- and GFF-formatted annotations. Please make a note of the release number you are working with. Our GAME XML DTD and more information about the GFF format are also available.

GENE AND TRANSCRIPT IDENTIFIERS

In Releases 1 and 2, only protein-coding genes were annotated, and CG identifiers were assigned to genes, CT identifiers to transcripts, and pp-CT identifiers to peptides. In Release 3.0 and Release 3.1, protein-coding genes were given CG identifiers, but tRNAs, snRNAs, snoRNAs, microRNAs, miscellaneous non-coding RNAs, and pseudogenes were given CR identifiers. Transposable elements were given TE identifiers. Transcripts were assigned FlyBase transcript identifiers, e.g., CG*-RA, CG*-RB, etc., and peptides were assigned FlyBase peptide identifiers, with the letter of the transcript corresponding to the peptide, e.g., CG*-PA, CG*-PB, etc.

The old Release 1 and 2 CT identifiers are now obsolete, and there is no mapping between CT identifiers and the Release 3 CG*-RA identifiers. However, in most cases the CT identifier has become a synonym of the gene, and can be queried using the FlyBase Gene Search page to find the gene it was associated with in Release 2. In some cases, a Release 2 gene may correspond to more than one Release 3 gene, e.g., if exons were redistributed or split between two new Release 3 genes.
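The Release 3 identifier scheme described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not FlyBase code: it assumes that the "*" in CG*-RA stands for a run of digits, that the isoform suffix is one or more capital letters, and that a suffix may follow any of the three prefixes. The identifier "CG0000" and all function and type names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: prefix, digits, then an optional "-R" (transcript) or
# "-P" (peptide) suffix followed by the isoform letter(s).
_IDENTIFIER = re.compile(r"^(CG|CR|TE)(\d+)(?:-([RP])([A-Z]+))?$")

_GENE_CLASS = {
    "CG": "protein-coding gene",
    "CR": "non-coding RNA gene or pseudogene",
    "TE": "transposable element",
}
_PRODUCT = {None: "gene", "R": "transcript", "P": "peptide"}


class ParsedId(NamedTuple):
    gene_id: str            # identifier without any isoform suffix
    gene_class: str         # class implied by the CG/CR/TE prefix
    product: str            # "gene", "transcript", or "peptide"
    isoform: Optional[str]  # isoform letter(s), or None for a bare gene


def parse_identifier(text: str) -> ParsedId:
    """Split a Release 3 style identifier into its parts."""
    match = _IDENTIFIER.match(text.strip())
    if match is None:
        raise ValueError(f"not a Release 3 style identifier: {text!r}")
    prefix, number, kind, letters = match.groups()
    return ParsedId(prefix + number, _GENE_CLASS[prefix], _PRODUCT[kind], letters)


def peptide_for(transcript_id: str) -> str:
    """Return the peptide identifier that shares the transcript's letter."""
    parsed = parse_identifier(transcript_id)
    if parsed.product != "transcript":
        raise ValueError(f"not a transcript identifier: {transcript_id!r}")
    return f"{parsed.gene_id}-P{parsed.isoform}"


print(parse_identifier("CG0000-RB"))
print(peptide_for("CG0000-RB"))
```

Obsolete CT identifiers are rejected by design, since the notes state that they have no mapping to the Release 3 transcript identifiers; they would have to be looked up as gene synonyms instead.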
RELEASE 4

The BDGP will continue to finish the genomic sequence to high quality, and will include the corrections submitted by the public in error reports. Individual gene annotations will be updated and made public in the annotation database by March 2004. These updates will not constitute new releases, since the underlying genomic sequence will not change, but will get new version numbers. When the genomic sequence does change, annotations will be mapped forward and the version will get a new release number. These changes to the sequence will be submitted every 6 months or so. The BDGP anticipates that the Release 4 finished genomic sequence will be submitted to GenBank by March 2004, which means that annotation of the Release 4 sequence should be available by Fall 2004.

TRANSPOSABLE ELEMENTS

As a result of the whole-genome shotgun assembly, the sequence of each transposon in Releases 1 and 2 was a composite derived from a number of elements of that transposon type. In Release 3, we have determined the sequence of each transposon in the y[1]; cn[1] bw[1] sp[1] strain (Kaminker et al. 2002). Please see the Natural Transposable Element web page for more information.

KNOWN MUTATIONS IN THE SEQUENCED STRAIN

Because the sequenced strain, y[1]; cn[1] bw[1] sp[1], has known mutations (in the y, cn, bw, LysC, MstProx and Rh6 genes), the wild-type sequence for these genes has been provided from other sequence entries in GenBank/EMBL/DDBJ. Wherever possible, a RefSeq accession based on FlyBase Annotated Reference Genome Sequences (ARGSs) has been provided.