Bioinformatic Tool Development

An important component of our research is the development of new bioinformatic tools and resources to facilitate the processing, analysis, and visualization of large data sets. Our software tools are free of charge and free to modify, and many are actively maintained on GitHub. Links and further details are provided on the Software page.

QIIME is a suite of bioinformatic tools for analyzing large microbial data sets, including 16S, 18S, and fungal ITS amplicon sequences as well as shotgun metagenomic data from a variety of high-throughput sequencing platforms (Caporaso et al., 2010; Kuczynski et al., 2012; Navas-Molina et al., 2013). QIIME has been used in hundreds of studies of microbial diversity worldwide, and is constantly evolving to incorporate new functionality. The development and implementation of QIIME have relied heavily on the PyCogent software library (Knight et al., 2007), much of which is now being integrated into the Biological Python Package, or BiPy. BiPy provides a variety of core objects and statistical tools, structured as a toolkit on top of which researchers can build their own pipelines. QIIME is supported by and integrated with a database accessible to the public via a web interface (www.microbio.me/qiime), which allows external users to upload and analyze microbial sequence data and associated metadata. This database is currently being redesigned and restructured under the auspices of Qiita, which will allow more efficient and complete analysis and meta-analysis of microbial community data.

QIIME makes use of several analysis tools and resources developed in our lab, and we have benchmarked these tools in order to provide users with the most robust methods, along with recommendations for downstream analyses (Bokulich et al., 2012; Kuczynski et al., 2010). QIIME offers several methods to cluster ribosomal RNA sequences and assign taxonomy to them, some of which are most commonly used in conjunction with the Greengenes ribosomal RNA database (http://greengenes.secondgenome.com) to generate clusters and/or support assignments. Recently, we were part of an effort to update Greengenes and reconcile its taxonomy with that of other databases (McDonald et al., 2012). This effort led to the development of the tax2tree software and resulted in substantially improved taxonomic classifications.

We developed UniFrac (Lozupone et al., 2010), a distance metric that uses a phylogenetic tree to measure the biological distance between each pair of environments represented in the tree. We can then use clustering methods, such as hierarchical clustering, and ordination methods, such as Principal Coordinates Analysis, to identify which environments are more similar or more different, and to correlate these differences with physical and biological properties of the environment. The subsequent development of Fast UniFrac (Hamady et al., 2010) extended UniFrac to large high-throughput sequencing data sets. EMPeror, which is now integrated in QIIME, provides three-dimensional visualization of ordination plots constructed from these relationships, allowing interactive manipulation of the images produced as well as the ability to save them in different formats for publication and presentation (Vázquez-Baeza et al., 2013). SitePainter provides additional visualization capabilities for studies with a spatial component, allowing taxonomy and within- and between-sample relationships to be juxtaposed with spatial data (Gonzalez et al., 2012).
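To make the idea behind UniFrac concrete, the following is a minimal Python sketch of the unweighted form of the calculation: the fraction of branch length in the tree that leads to members of only one of two environments. It is an illustration under simplifying assumptions, not the implementation used in QIIME or Fast UniFrac. The toy tree, the node names, and the helper functions `branches_for` and `unweighted_unifrac` are all hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical toy tree: each node maps to (parent, length of the branch to that parent).
Tree = Dict[str, Tuple[Optional[str], float]]

TREE: Tree = {
    "root": (None, 0.0),
    "A": ("root", 1.0),   # internal node
    "B": ("root", 2.0),   # internal node
    "t1": ("A", 1.0),     # tips (sequences)
    "t2": ("A", 1.0),
    "t3": ("B", 1.0),
    "t4": ("B", 3.0),
}


def branches_for(tips: Set[str], tree: Tree) -> Set[str]:
    """Return every node whose branch lies on a path from one of the tips to the root."""
    covered: Set[str] = set()
    for tip in tips:
        node: Optional[str] = tip
        # Stop early once we reach a node already covered: its ancestors are covered too.
        while node is not None and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return covered


def unweighted_unifrac(env_a: Set[str], env_b: Set[str], tree: Tree) -> float:
    """Branch length unique to one environment divided by total branch length observed."""
    a = branches_for(env_a, tree)
    b = branches_for(env_b, tree)
    unique = sum(tree[n][1] for n in a ^ b)   # branches seen in exactly one environment
    total = sum(tree[n][1] for n in a | b)    # branches seen in either environment
    return unique / total if total else 0.0


if __name__ == "__main__":
    # Environments that share no lineages are maximally distant.
    print(unweighted_unifrac({"t1", "t2"}, {"t3", "t4"}, TREE))
    # Environments that share some lineages fall in between 0 and 1.
    print(unweighted_unifrac({"t1", "t2"}, {"t2", "t3"}, TREE))
```

Computing this value for every pair of environments yields a distance matrix, which is the input that clustering and ordination methods such as Principal Coordinates Analysis operate on.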

The establishment of the BIOM format as a standard for tables detailing microbial community composition (OTU tables) (McDonald et al., 2012) allows QIIME outputs to be used directly as inputs for other recently developed software packages. For example, SourceTracker enables the detection of potential contaminants in microbial data sets by comparing microbial communities in environmental samples, such as air or laboratory surfaces, with those in the actual samples of interest (Knights et al., 2012). PICRUSt uses data from ribosomal gene studies, which provide information on microbial taxonomy and diversity, to predict which functions microbial communities may be carrying out in a given system (Langille et al., 2013).