R Markdown

It has been a minute since I updated this blog! I guess the advent of Twitter in my lab's life has given us enough room to express what is going on, for the most part. However, a more recent technical change has really changed how we are doing science in the Wares Lab.

About 6 weeks ago, Scott Chamberlain from rOpenSci visited UGA for the Institute of Bioinformatics seminar. The goal was to talk about how R could be used to help develop repeatable science - in other words, intelligently using analytical packages to explore data in a way that lets us share the data and the analysis simultaneously. It was a great talk, and it gave Katie, Christine, and me much to think about.

Only a few weeks prior, postdoc Paula Pappalardo had started showing us RMarkdown (available in RStudio). We are all becoming R users more and more each day, and the confluence of these directions for expressing our research has led to almost everybody in the lab - certainly all of my grads, my postdoc, and myself - writing papers in RMarkdown and learning to use analytical tools in R in preference to external packages.

I am doing this at least in part so I never have to boot up Windows again ;)

Anyway, it is going well. I'm working on an NSF proposal and 3 manuscripts simultaneously in RMarkdown, and whatever isn't currently available in R packages either means I haven't looked hard enough (!) or forces me to explore the side of programming I actually do enjoy - figuring out how to make a task I do often easier in the future.

An example is calculating Hudson's Snn statistic. I'm using the PopGenome package, which calculates Snn - but it doesn't appear to perform a permutation test for significant deviation from the null expectation. So I've had to write that myself, which is improving my R skills very quickly. Learning to subset and randomly sample efficiently will help with future efforts, and I've learned a bit about the behavior of Snn along the way.

The easy way to explain Snn is how Dick Hudson did in his 2000 Genetics paper: given 2 populations, take each haplotype in turn and ask where its most similar haplotype (its nearest neighbor) was sampled. If it is in the same location, score a 1; if in a different location, score a 0; then divide by the number of comparisons. For 2 locations of equal size, the null expectation is 0.5 (even odds of the most similar haplotype being in the same location), and if all haplotypes from the 2 locations are reciprocally monophyletic, the answer is 1.
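In case it helps anybody, here is a rough sketch in R of the calculation just described, plus a label-shuffling permutation test like the one I had to write. The function names (`snn`, `snn_perm_test`) and the use of a precomputed pairwise distance matrix are my own illustration, not PopGenome's interface; ties for nearest neighbor share credit equally, as in Hudson (2000):

```r
# Sketch of Hudson's Snn: d is a symmetric pairwise distance matrix
# among haplotypes, pop is the vector of location labels.
snn <- function(d, pop) {
  n <- length(pop)
  w <- numeric(n)
  for (i in seq_len(n)) {
    di   <- d[i, -i]                      # distances from haplotype i to all others
    nn   <- which(di == min(di))          # nearest neighbor(s); ties share credit
    w[i] <- mean(pop[-i][nn] == pop[i])   # fraction of nearest neighbors in same location
  }
  mean(w)
}

# Permutation test: shuffle the location labels, recompute Snn each time,
# and ask how often the shuffled values reach the observed one.
snn_perm_test <- function(d, pop, nperm = 1000) {
  obs  <- snn(d, pop)
  perm <- replicate(nperm, snn(d, sample(pop)))
  p    <- (sum(perm >= obs) + 1) / (nperm + 1)   # one-tailed p, with pseudocount
  list(observed = obs, p.value = p)
}
```

For two reciprocally monophyletic locations (every within-location distance smaller than every between-location distance), `snn` returns 1, and the shuffled replicates scatter around the null expectation instead.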

Clearly for 4 locations, you would have a null expectation of 0.25 - but only if all samples are of equal size. This hadn't really clicked in my head until I had to write the code to do the calculations for the permutation test. It makes sense, however: if 95% of your samples come from one location, then about 95% of the time a haplotype's nearest neighbor will be in that same location just by chance, and the Snn statistic will be biased towards 1 even if the two locations are not differentiated at all.
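Under random labelling, a haplotype's nearest neighbor is (roughly) equally likely to be any of the other n - 1 samples, so the null expectation works out to a weighted sum over locations. This tiny helper is hypothetical, just to illustrate the arithmetic:

```r
# Hypothetical helper: approximate expectation of Snn under random
# labelling, treating each haplotype's nearest neighbor as a uniform
# draw from the other n - 1 samples. sizes = samples per location.
expected_snn_null <- function(sizes) {
  n <- sum(sizes)
  sum((sizes / n) * ((sizes - 1) / (n - 1)))
}

expected_snn_null(c(25, 25, 25, 25))  # about 0.24 - near 1/4 for 4 equal locations
expected_snn_null(c(95, 5))           # about 0.90 - pulled toward 1 by the big location
```

So with 4 equal locations you do get roughly 0.25, but a 95/5 split between two locations pushes the null expectation up near 0.9 - which is exactly why the permutation test has to preserve the observed sample sizes.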

Fortunately, the way I've programmed the permutation test reflects these unequal sample sizes, and so the answer that comes out in the paper I'm working on today is: meh. Nothing exciting happening here. But for the part of my brain that enjoys coding, very exciting indeed.

(I know I haven't explained the statistic all that well here: the point was the weird joy of coding. But if anybody has questions about it, why to use it, etc., either see the Hudson paper or email me.)