Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis

Andrew D Fernandes¹, Jennifer Ns Reid², Jean M Macklaim², Thomas A McMurrough², David R Edgell², Gregory B Gloor²

Affiliations

¹ YouKaryote Genomics, London, ON, Canada.
² Department of Biochemistry, Medical Science Building, University of Western Ontario, 1151 Richmond St, N6A 5C1, London, ON, Canada.

PMID: 24910773
PMCID: PMC4030730
DOI: 10.1186/2049-2618-2-15

Andrew D Fernandes et al. Microbiome. 2014 May 5;2:15. doi: 10.1186/2049-2618-2-15. eCollection 2014.

Abstract

Background: Experimental designs that take advantage of high-throughput sequencing to generate datasets include RNA sequencing (RNA-seq), chromatin immunoprecipitation sequencing (ChIP-seq), sequencing of 16S rRNA gene fragments, metagenomic analysis and selective growth experiments. In each case the underlying data are similar and are composed of counts of sequencing reads mapped to a large number of features in each sample. Despite this underlying similarity, the data analysis methods used for these experimental designs are all different, and do not translate across experiments. Alternative methods have been developed in the physical and geological sciences that treat similar data as compositions. Compositional data analysis methods transform the data to relative abundances with the result that the analyses are more robust and reproducible.

Results: Data from an in vitro selective growth experiment, an RNA-seq experiment and the Human Microbiome Project 16S rRNA gene abundance dataset were examined by ALDEx2, a compositional data analysis tool that uses Bayesian methods to infer technical and statistical error. The ALDEx2 approach is shown to be suitable for all three types of data: it correctly identifies both the direction and differential abundance of features in the differential growth experiment, it identifies a substantially similar set of differentially expressed genes in the RNA-seq dataset as the leading tools and it identifies as differential the taxa that distinguish the tongue dorsum and buccal mucosa in the Human Microbiome Project dataset. The design of ALDEx2 reduces the number of false positive identifications that result from datasets composed of many features in few samples.

Conclusion: Statistical analysis of high-throughput sequencing datasets composed of per feature counts showed that the ALDEx2 R package is a simple and robust tool, which can be applied to RNA-seq, 16S rRNA gene sequencing and differential growth datasets, and by extension to other techniques that use a similar approach.

Keywords: 16S rRNA gene sequencing; Dirichlet distribution; Monte Carlo sampling; RNA-seq; centered log-ratio transformation; compositional data; differential abundance; high-throughput sequencing; microbiome.

Figures

**Figure 1**
**Outline of the approach for one feature in three control and three experimental samples.** The count values for feature i, sample j are converted to probabilities by Monte Carlo sampling from the Dirichlet distribution with the addition of a uniform prior. Each count value is now represented by a vector of probabilities 1:n, where n is the number of Monte Carlo instances sampled: three instances are shown in the example, but 128 are used by default. Each probability in the vector is consistent with the number of counts in feature i given the total number of reads observed for sample j. Each Monte Carlo Dirichlet instance is centered log-ratio (clr) transformed, giving a vector of transformed values. These values are the base 2 logarithm of the abundance of the feature in each Dirichlet instance in each sample divided by the geometric mean abundance of the Dirichlet instance of the sample. Significance tests for control samples (C1 : C3) vs experimental samples (E1 : E3) are performed on each element in the vector of clr values. Each resulting P value is corrected using the Benjamini–Hochberg procedure. The expected values are reported for both the distribution of P values and for the distribution of Benjamini–Hochberg corrected values. clr, centered log-ratio; FDR, false discovery rate.
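
The steps in this caption can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the ALDEx2 implementation: the function names, the prior value of 0.5 and the fixed seed are hypothetical choices, and an exact permutation test stands in for the unspecified "significance tests" so that no external statistics library is needed.

```python
import math
import random
from itertools import combinations

def dirichlet_instance(counts, prior=0.5, rng=random):
    """One Monte Carlo draw of feature probabilities for a single sample."""
    # Normalised independent Gamma(count + prior, 1) draws follow a Dirichlet.
    # prior=0.5 is an assumed value for the uniform prior.
    draws = [rng.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

def clr(probs):
    """log2 of each probability divided by the sample's geometric mean."""
    logs = [math.log2(p) for p in probs]
    centre = sum(logs) / len(logs)
    return [v - centre for v in logs]

def permutation_p(control, experiment):
    """Exact two-sided permutation P value for a difference in means
    (a stand-in test chosen for this sketch)."""
    pooled = list(control) + list(experiment)
    n = len(control)

    def gap(idx):
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        return abs(sum(b) / len(b) - sum(a) / len(a))

    observed = gap(tuple(range(n)))
    splits = list(combinations(range(len(pooled)), n))
    hits = sum(1 for idx in splits if gap(idx) >= observed - 1e-12)
    return hits / len(splits)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def expected_values(samples, n_control, n_instances=128, seed=0):
    """Expected raw and adjusted P value per feature over all instances.

    samples: one list of feature counts per sample, controls first.
    """
    rng = random.Random(seed)
    n_features = len(samples[0])
    sum_p = [0.0] * n_features
    sum_q = [0.0] * n_features
    for _ in range(n_instances):
        clr_rows = [clr(dirichlet_instance(s, rng=rng)) for s in samples]
        pvals = []
        for f in range(n_features):
            values = [row[f] for row in clr_rows]
            pvals.append(permutation_p(values[:n_control], values[n_control:]))
        qvals = benjamini_hochberg(pvals)
        for f in range(n_features):
            sum_p[f] += pvals[f]
            sum_q[f] += qvals[f]
    return ([s / n_instances for s in sum_p],
            [s / n_instances for s in sum_q])
```

With three samples per group, the permutation test used here cannot return a P value below 2/20 = 0.1, so the sketch shows the flow of the calculation rather than a usable significance threshold.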

**Figure 2**
**Effect of DMC sampling on the selex dataset.** The first column shows the results when the data are clr transformed without DMC sampling; the next three show the effect of 1, 16 and 128 DMC samples followed by the clr transformation. Features that pass a threshold of P<0.05 are shown in cyan, and those where the FDR statistic is <0.05 are shown in red. Features where the median clr value is below the geometric mean are highlighted in black if they are not significant. Those where the median clr value is greater than the geometric mean are shown in gray. clr, centered log-ratio; DMC, Dirichlet Monte Carlo; FDR, false discovery rate.

**Figure 3**
**MA plot for DESeq.** The base 2 logarithm of average expression across all samples for a feature is plotted vs the base 2 logarithm of fold-change. Points that are significantly different, with an FDR less than 0.05, are in red; all others are in gray.
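
The two axes described in this caption can be computed per feature as in the minimal sketch below. The function name `ma_coordinates` is hypothetical, and the sketch assumes the inputs are already normalized, strictly positive expression values; it does not reproduce how DESeq itself normalizes or estimates fold-change.

```python
import math

def ma_coordinates(control, experiment):
    """Return (A, M) for one feature.

    A: log2 of the average expression across all samples.
    M: log2 of the fold-change (experimental mean over control mean).
    Assumes normalized, strictly positive values.
    """
    mean_control = sum(control) / len(control)
    mean_experiment = sum(experiment) / len(experiment)
    n_total = len(control) + len(experiment)
    mean_all = (sum(control) + sum(experiment)) / n_total
    return math.log2(mean_all), math.log2(mean_experiment / mean_control)

# A feature that is four-fold higher in the experimental samples.
a_value, m_value = ma_coordinates([100.0, 100.0], [400.0, 400.0])
```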

**Figure 4**
**Differential features in common between ALDEx2, DESeq and baySeq.** Genes are colored in light yellow if ALDEx2 and at least one of the other two tools identified them as significantly different with an FDR <0.05, black if they were identified by both baySeq and DESeq, magenta if only by baySeq, cyan if only by DESeq, and orange if only by ALDEx2. Small gray dots are non-differential genes. The Venn diagram illustrates the number of differentially abundant genes identified by each method. FDR, false discovery rate; MA, mean difference between conditions vs average expression; MW, mean difference between conditions vs maximum within-condition variance.
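
The coloring rule in this caption amounts to a small decision procedure over three sets of significant genes. The sketch below is one way to express it; the function name `overlap_colour` and the gene identifiers are hypothetical, and "black" is read here as genes found by both baySeq and DESeq but not by ALDEx2, which is an interpretation of the caption.

```python
from collections import Counter

def overlap_colour(gene, aldex2, deseq, bayseq):
    """Assign the caption's color from membership in each tool's hit set."""
    in_aldex2 = gene in aldex2
    in_deseq = gene in deseq
    in_bayseq = gene in bayseq
    if in_aldex2 and (in_deseq or in_bayseq):
        return "light yellow"  # ALDEx2 plus at least one other tool
    if in_deseq and in_bayseq:
        return "black"         # both other tools, not ALDEx2
    if in_bayseq:
        return "magenta"       # baySeq only
    if in_deseq:
        return "cyan"          # DESeq only
    if in_aldex2:
        return "orange"        # ALDEx2 only
    return "gray"              # not called differential by any tool

# Hypothetical hit sets, for illustration only.
aldex2_hits = {"g1", "g2", "g5"}
deseq_hits = {"g1", "g3", "g4"}
bayseq_hits = {"g2", "g3", "g6"}
all_genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]

colours = {g: overlap_colour(g, aldex2_hits, deseq_hits, bayseq_hits)
           for g in all_genes}
counts_per_colour = Counter(colours.values())
```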

**Figure 5**
**OTUs with different relative abundances between tongue dorsum and buccal mucosa.** Each OTU is colored by membership in the taxonomic level indicated. OTU abundance values are median relative abundance values derived from ALDEx2. OTU, operational taxonomic unit.
