GenBank

18/07/13 12:44

The GenBank release notes for release 162.0 (October 2007) state that "from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months."

That is taken directly from Wikipedia today. There are billions and billions of nucleotides in there; probably billions of individual sequences and sequence fragments, plus all sorts of other data. So the question is: how did so much data get uploaded to GenBank when the main data entry portal to GenBank, the program Sequin, is so clunky and flawed?

Current options for uploading data include Sequin, which the last time I used it kept shifting coding sequences around on the mitochondrial genomes I was uploading; tbl2asn, which requires that you be familiar with shell scripting and able to generate a tab-delimited data file in a unique format that is almost impossible to produce automatically from the spreadsheet or .csv output of other programs; or Geneious, which so far has failed me because the genetic code in the submission ends up different from the one annotated in the software.
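
To make the tbl2asn complaint concrete, here is a minimal sketch of turning a spreadsheet export into the tab-delimited feature table. It assumes a hypothetical .csv with the columns start, stop, strand, key, gene, and product; those column names, the function name, and the example rows are all made up for illustration. It relies on two conventions of NCBI's feature table format as I understand them: qualifier lines are indented by three tabs, and reverse-strand features are written with the start and stop coordinates swapped. Check the output against NCBI's own documentation before trusting it with a real submission.

```python
import csv
import io


def csv_to_feature_table(csv_text, seq_id):
    """Convert CSV rows into a tab-delimited feature table string.

    The CSV column names used here (start, stop, strand, key, gene,
    product) are hypothetical; adapt them to your own spreadsheet.
    """
    lines = [f">Feature {seq_id}"]
    for row in csv.DictReader(io.StringIO(csv_text)):
        start, stop = int(row["start"]), int(row["stop"])
        # Reverse-strand features are written with coordinates swapped.
        if row["strand"] == "-":
            start, stop = stop, start
        lines.append(f"{start}\t{stop}\t{row['key']}")
        # Qualifiers go on their own lines, indented by three tabs.
        for qualifier in ("gene", "product"):
            value = (row.get(qualifier) or "").strip()
            if value:
                lines.append(f"\t\t\t{qualifier}\t{value}")
    return "\n".join(lines) + "\n"


# Made-up example rows, not real annotation data.
EXAMPLE_CSV = (
    "start,stop,strand,key,gene,product\n"
    "1,1542,+,CDS,COX1,cytochrome c oxidase subunit I\n"
    "1600,2283,-,CDS,COX2,\n"
)

if __name__ == "__main__":
    print(csv_to_feature_table(EXAMPLE_CSV, "Seq1"), end="")
```

This only covers single-interval features. Multi-exon genes, partial features, and the genetic code are not handled, which is exactly where the format gets painful.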

Yet somehow all these data are there. Most of them are missing critical metadata, like the latitude/longitude where the sample was collected, or who identified the specimen. Adding that takes extra work, and NCBI doesn’t make it easy.

So how can we make this easier? For single-gene submissions, tbl2asn can be very easy because you can annotate most of the data as part of a FASTA file. But we are moving beyond the world of single-gene submissions very quickly. The complications of exons, introns, reverse-strand coding sequences, whole chromosomes, and whole genomes mean we all need to get more savvy about how to do this.
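
For the single-gene case, the annotation in the FASTA file takes the form of bracketed modifiers on the definition line. The sketch below writes one record that way. The helper names, the sample ID, and the modifier spellings (organism, lat_lon, identified_by) are assumptions on my part; NCBI publishes the list of accepted modifiers, and the exact spelling should be checked against it.

```python
def format_lat_lon(lat, lon):
    """Format decimal degrees as e.g. '33.95 N 83.37 W'."""
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    return f"{abs(lat):.2f} {ns} {abs(lon):.2f} {ew}"


def fasta_record(seq_id, sequence, modifiers, width=70):
    """Build one FASTA record with [key=value] modifiers on the defline.

    The modifier names passed in are the caller's responsibility; this
    function does not validate them against NCBI's accepted list.
    """
    defline = ">" + seq_id + "".join(
        f" [{key}={value}]" for key, value in modifiers.items()
    )
    # Wrap the sequence to a fixed line width.
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([defline] + body) + "\n"


if __name__ == "__main__":
    # Made-up specimen data for illustration only.
    record = fasta_record(
        "Sample01",
        "ACGTACGTACGT",
        {
            "organism": "Mytilus edulis",
            "lat_lon": format_lat_lon(33.95, -83.37),
            "identified_by": "A. Collector",
        },
    )
    print(record, end="")
```

Because the metadata comes from a dictionary, the same function can be looped over the rows of a specimen spreadsheet, which would make it much less tempting to leave out the collection coordinates.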

I don’t have the answer; I’m just complaining. I’m pretty computer/bioinformatics-savvy, so if I find this frustrating, what about people who are new to the field?

Tags: data


Wares Lab @ UGA
