The effects of sampling strategy on the quality of reconstruction of viral population dynamics using Bayesian skyline family coalescent methods: A simulation study

Hall MD, Woolhouse M & Rambaut A

(2016) Virus Evolution 2, vew003.

The ongoing large-scale increase in the total amount of genetic data for viruses and other pathogens has led to a situation in which it is often not possible to include every available sequence in a phylogenetic analysis and expect the procedure to complete in reasonable computational time. This raises questions about how a set of sequences should be selected for analysis, particularly if the data are used to infer more than just the phylogenetic tree itself. The design of sampling strategies for molecular epidemiology has been a neglected field of research. This article describes a large-scale simulation exercise that was undertaken to select an appropriate strategy when using the GMRF skygrid, one of the Bayesian skyline family of coalescent methods, in order to reconstruct past population dynamics. The simulated scenarios were intended to represent sampling for the population of an endemic virus across multiple geographical locations. Large phylogenies were simulated under a coalescent or structured coalescent model and sequences simulated from these trees; the resulting datasets were then downsampled for analyses according to a variety of schemes. Variation in results between different replicates of the same scheme was not insignificant, and as a result, we recommend that where possible analyses are repeated with different datasets in order to establish that elements of a reconstruction are not simply the result of the particular set of samples selected. We show that an individual stochastic choice of sequences can introduce spurious behaviour in the median line of the skygrid plot and that even marginal likelihood estimation can suggest complicated dynamics that were not in fact present. We recommend that the median line should not be used to infer historical events on its own. Sampling sequences with uniform probability with respect to both time and spatial location (deme) never performed worse than sampling with probability proportional to the effective population size at that time and in that location and frequently was superior. As a result, we recommend this approach in the design of future studies. We also confirm that the inclusion of many recent sequences from a single geographical location in an analysis tends to result in a spurious bottleneck effect in the reconstruction and caution against interpreting this as genuine.

Andrew Rambaut, 2007