This tutorial shows the flavor of metagenomic analysis in R with a short “toy” example. The analysis workflow presented here appears simple, but it is quite powerful. These half-dozen commands can be modified in many ways for more specific situations. Separate tutorials describing usage details for each are available (or will be soon).
An actual analysis would start with a file of metagenome IDs, one per line. Here, for demonstration purposes, we first create such a file.
library(matR)
writeLines (colnames (xx1), "my_ids.txt")
The next command requests a matrix of abundance counts of function annotations at Subsystems level 2 for the specified metagenomes. Many variations of this command are available. Note that the request may take several minutes to complete.
zz <- biomRequest(file="my_ids.txt", request="function", group_level="level2", evalue=1)
## start stop requested ticket
## 1 1 7 TRUE 29682
## file
## 1 /var/folders/07/pm0p1bcn5f3b66dy_x51yv340000gp/T//Rtmp9P4DCt/file5dd33f10227b
Technically, the returned object zz
is not a matrix
but a classed biom
object. Such objects carry metadata that can enrich analyses. As a first step, it is usually effective to transform the raw abundance counts, as follows.
zz0 <- transform (zz, t_Threshold, t_Log)
Both columns (metagenomes) and rows (annotations) of biom
objects have metadata. The entirety of metadata, or a selection of specific entries (as below), can be examined.
columns (zz0, "host_common_name|samp_store_temp|ncbi_id")
## env_package.data.host_common_name
## mgm4440463.3 Mouse
## mgm4440464.3 Mouse
## mgm4441679.3 cow
## mgm4441680.3 cow
## mgm4441682.3 cow
## mgm4441695.3 striped bass
## mgm4441696.3 striped bass
## env_package.data.samp_store_temp project.data.ncbi_id
## mgm4440463.3 <NA> 17401
## mgm4440464.3 <NA> 17401
## mgm4441679.3 -80 28609
## mgm4441680.3 -80 28609
## mgm4441682.3 -80 28609
## mgm4441695.3 <NA> 28403
## mgm4441696.3 <NA> 28403
Plotting principal coordinates is a standard visualization, shown next. Note how the command uses metadata to determine features of the plot. Details of the princomp.biom()
function are available in a separate tutorial, to help obtain exactly the desired graphical output.
princomp (zz0, map=c(col="host_common_name", pch="samp_store_temp"), cex=1.5, labels="$$ncbi_id")
A computation of distances between samples underlies the principal coordinates plot. That computation, or a groupwise-distance (as here), can be examined directly, and various distance metrics are available.
distx (zz0, groups="$$host_common_name")
## cow Mouse striped bass
## cow 9.954 33.38 23.00
## Mouse 33.384 12.08 28.08
## striped bass 22.997 28.08 11.67
A statistical assessment of the significance for group differentiation of each annotation can be made. Here we simultaneously filter the results at a p-value threshold of 0.05.
pp <- (rowstats (zz0, groups="$$material") $ p.value < 0.05)
rownames (zz0) [pp]
## [1] "Bacterial cytostatics, differentiation factors and antibiotics"
## [2] "CRISPRs and associated hypotheticals"
## [3] "CRISPs"
## [4] "Flagella protein?"
## [5] "Plasmid related functions"
## [6] "Protein secretion system, Type VII (Chaperone/Usher pathway, CU)"
## [7] "Putative Isoquinoline 1-oxidoreductase subunit"
A heatmap restricted to the identified functions neatly shows group differentiation. In this image, annotations are labeled by their top-level place in the function hierarchy, rather than by the specific function names listed just above.
image (zz0 [pp,], labCol="$$ncbi_id", labRow="$$ontology1", margins=c(5,10))
Fine control of the heatmap image, including labeling and colors, is described in a separate tutorial for the image.biom()
function.