Swimming in Data
03/06/13 11:46
In the past 30 days, I have had hundreds of SNPs scored in hundreds of individuals, plus an Illumina MiSeq run for the barnacle Chelonibia, another for the coral Agaricia, 24 cells of PacBio data for Serratia, 454 data from the Dry Tortugas, and HiSeq data for the barnacle Notochthamalus, all dropped in my lap. I mean lab. We are talking about something like 15 billion nucleotides that I am, in theory, learning something from. And don’t forget, I am not really a power user of such data!
What is interesting about this problem is only secondarily biology. At this point, learning how to handle such information is one of the biggest challenges science is grappling with. It remains difficult to upload even simple data sets to NCBI (I have hired an IOB graduate student for the year, and his first task is submitting some mitochondrial genomes to GenBank, only eight 15 kb fragments, which will probably take him all day). We run out of disk space on our computers on a regular basis, and I will soon have a room full of terabyte drives sitting around just to back up the big data files.
At the same time, publishing is stuck in a centuries-old model. Peer review is important, but we are sending more and more submissions out into an ever-expanding galaxy of scientific journals (of varying credibility), to the extent that we actually know less and less about what has been done. It is simply too big. Too much.
It is with that in mind that I am so enthusiastically behind wiki technology as a way to combine, compile, and collectively edit what we know. Wikipedia is, to my mind, an enormously successful venture. No, nothing is 100% right. But that is equally true for any creation of man. And as it turns out, it is as close to right as any other respected outlet of information.
Not all information, of course, is easily put into a narrative. Nor do we always need a narrative about every bit of news or data. So I was interested to come across WikiGenes, a repository for information on gene regions that allows collaborative contributions and gives credit for them (useful for those of us who rely on our CV for promotion, etc.). I haven’t contributed yet, though I do sometimes contribute to Wikipedia, but I will definitely consider it as another product of my research.