This tutorial discusses slightly technical programming issues, and will be most easily understood by experienced R users. Still, the gist is simple. The topic is how matR data objects relate to the built-in data types of R. Annotation information is retrieved by the biomRequest() function as a biom object, essentially a matrix with additional metadata annotations for the rows and columns. Such objects can also be created by importing from a file of data in BIOM format produced by other software.
xx1 to xx4 are example biom objects included with the matR package.
summary (xx1)
## [id:] mgm4440463.3_mgm4440464.3_mgm4441679.3_mgm4441680.3_mgm4441682.3_mgm4441695.3_mgm4441696.3_function_level2_Subsystems_all_abundance_5_60_15_0
## [generated_by:] MG-RAST revision 3.5 on [date:] 2014-09-05T20:17:55
## [type:] sparse Function table (161x7 of which 938 nonzero)
## [format:] Biological Observation Matrix 1.0
Here is an example showing part of the column metadata of xx3. The information is returned as a data.frame.
head (columns (xx3, "latitude|longitude"))
## sample.data.latitude sample.data.longitude
## mgm4477803.3 -77.725 162.311
## mgm4477804.3 39.1 -96.6
## mgm4477805.3 34.9 -115.65
## mgm4477807.3 -12.633 -71.233
## mgm4477872.3 35.383 -105.933
## mgm4477873.3 34.333 -106.733
Metadata is selected by a regular expression. See ?regex for details. The regex below selects project IDs plus all metadata related to the environmental package.
names (columns (xx2, "project\\.id|^env_package"))
## [1] "env_package.data.alkalinity"
## [2] "env_package.data.ammonium"
## [3] "env_package.data.bromide"
## [4] "env_package.data.calcium"
## [5] "env_package.data.chloride"
## [6] "env_package.data.density"
## [7] "env_package.data.diss_org_carb"
## [8] "env_package.data.env_package"
## [9] "env_package.data.magnesium"
## [10] "env_package.data.misc_param"
## [11] "env_package.data.nitrate"
## [12] "env_package.data.nitrite"
## [13] "env_package.data.potassium"
## [14] "env_package.data.salinity"
## [15] "env_package.data.samp_store_temp"
## [16] "env_package.data.silicate"
## [17] "env_package.data.sulfate"
## [18] "env_package.data.suspend_part_matter"
## [19] "env_package.id"
## [20] "env_package.name"
## [21] "env_package.type"
## [22] "project.id"
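The same kind of selection can be sketched in base R without matR. The table `meta`, its column names, and the helper `select_fields()` below are hypothetical and only illustrate how a pattern such as the one above behaves; this is not how matR implements columns().

```r
# Hypothetical metadata table; the column names only mimic those shown above.
meta <- data.frame(
  project.id = c("p1", "p2"),
  project_id_note = c("a", "b"),
  env_package.type = c("water", "soil"),
  sample.data.env_package = c("x", "y"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: keep the columns whose names match a regular expression.
# drop = FALSE guarantees a data.frame even when a single column matches.
select_fields <- function(df, pattern) {
  df[, grepl(pattern, names(df)), drop = FALSE]
}

# "\\." matches a literal dot and "^" anchors the match at the start of the name.
selected <- names(select_fields(meta, "project\\.id|^env_package"))
print(selected)

# Without the escape, "." matches any character, so "project_id_note" is kept too.
loose <- names(select_fields(meta, "project.id"))
print(loose)
```

Escaping the dot and anchoring with `^` keeps the selection from picking up fields that merely contain a similar string.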
The regex can be omitted to show all metadata. Metadata for rows typically consists of annotation hierarchy levels only.
names (rows (xx1))
## [1] "ontology1" "ontology2"
A data.frame is returned even in the case of a single metadata field. Note that in this example, the rownames() and the single variable of the data.frame coincide.
head (rows (xx1, "ontology2"))
## ontology2
## ABC transporters ABC transporters
## ATP synthases ATP synthases
## Acid stress Acid stress
## Adhesion Adhesion
## Alanine, serine, and glycine Alanine, serine, and glycine
## Aminosugars Aminosugars
Metadata fields are almost always coded as factors.
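Factor coding matters when a field holds numbers, such as latitude. The following base R sketch reuses three latitude values from the xx3 output above; the variable names are only illustrative.

```r
# Three latitude values from the example above, coded as a factor.
lat <- factor(c("-77.725", "39.1", "34.9"))

# as.numeric() on a factor returns the internal level codes, not the values.
codes <- as.numeric(lat)
print(codes)

# Converting through character recovers the numeric values.
lat_num <- as.numeric(as.character(lat))
print(lat_num)
```

Converting through as.character() first is the usual way to recover the original numbers from a factor.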
The biom and matrix classes are very similar, aside from metadata. Replacement functions, dim(), rownames(), colnames(), and dimnames() can all be applied to biom objects. This example renames columns with information taken from metadata.
yy <- xx4
colnames (yy) <- columns (yy, "sample.data.sample_name") [[1]]
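The `[[1]]` in the command above is needed because a single metadata field still comes back as a data.frame. A base R analogue, with a hypothetical matrix `counts` and hypothetical metadata `meta`, might look like this:

```r
# Hypothetical count matrix and matching column metadata.
counts <- matrix(
  1:4, nrow = 2,
  dimnames = list(c("Bacteria", "Eukaryota"), c("mgm1", "mgm2"))
)
meta <- data.frame(
  sample_name = c("soil_A", "soil_B"),
  row.names = c("mgm1", "mgm2"),
  stringsAsFactors = TRUE
)

# Selecting one field with drop = FALSE still gives a data.frame ...
one_col <- meta[, "sample_name", drop = FALSE]
print(class(one_col))

# ... so [[1]] extracts the underlying vector (a factor here),
# which is converted to character before being used as column names.
colnames(counts) <- as.character(one_col[[1]])
print(counts)
```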
Subsetting also works in familiar ways.
xx3 [1:10,1:2]
## mgm4477803.3 mgm4477804.3
## Amino Acids and Derivatives 209800 178730
## Carbohydrates 236396 225778
## Cell Division and Cell Cycle 37902 26531
## Cell Wall and Capsule 81677 58110
## Clustering-based subsystems 375908 286724
## Cofactors, Vitamins, Prosthetic Groups, Pigments 161203 122892
## DNA Metabolism 120565 75780
## Dormancy and Sporulation 6628 3734
## Fatty Acids, Lipids, and Isoprenoids 71586 61371
## Iron acquisition and metabolism 16409 9896
##
## [id:] derived with `[.biom`(x = xx3, i = 1:10, j = 1:2)
## [generated_by:] matR: metagenomics analysis tools for R (0.9) on [date:] 2014-12-20 14:56:58
## [type:] sparse Function table (10x2 of which 20 nonzero)
## [format:] Biological Observation Matrix 1.0
Subsetting by row and column names:
xx4 [c("Bacteria", "Eukaryota"), c("mgm4575333.3", "mgm4575334.3", "mgm4575335.3")]
## mgm4575333.3 mgm4575334.3 mgm4575335.3
## Bacteria 168162 164695 179684
## Eukaryota 59 83 23
##
## [id:] derived with `[.biom`(x = xx4, i = c("Bacteria", "Eukaryota"), j = c("mgm4575333.3", "mgm4575334.3", "mgm4575335.3"))
## [generated_by:] matR: metagenomics analysis tools for R (0.9) on [date:] 2014-12-20 14:56:58
## [type:] sparse Taxon table (2x3 of which 6 nonzero)
## [format:] Biological Observation Matrix 1.0
Subsetting to keep metagenomes from only one biome:
summary (xx3 [ ,columns(xx3,"biome") == "Tundra biome"])
## [id:] derived with `[.biom`(x = xx3, j = columns(xx3, "biome") == "Tundra biome")
## [generated_by:] matR: metagenomics analysis tools for R (0.9) on [date:] 2014-12-20 14:56:58
## [type:] sparse Function table (28x2 of which 56 nonzero)
## [format:] Biological Observation Matrix 1.0
Subsetting to keep only rows matching a search term:
summary (xx1 [grepl("Protein secretion system", rownames(xx1)), ])
## [id:] derived with `[.biom`(x = xx1, i = grepl("Protein secretion system", rownames(xx1)))
## [generated_by:] matR: metagenomics analysis tools for R (0.9) on [date:] 2014-12-20 14:56:58
## [type:] sparse Function table (7x7 of which 28 nonzero)
## [format:] Biological Observation Matrix 1.0
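Both of the last two examples pass a logical mask to the subsetting operator. The idea can be sketched on an ordinary matrix; the data, sample names, and row names below are hypothetical, and the sketch does not use matR.

```r
# Hypothetical counts: rows are annotations, columns are samples (filled by column).
counts <- matrix(
  c(5, 0, 3, 2, 7, 1),
  nrow = 2,
  dimnames = list(
    c("Protein secretion system, Type II", "Acid stress"),
    c("s1", "s2", "s3")
  )
)

# Hypothetical column metadata, one row per sample.
meta <- data.frame(
  biome = c("Tundra biome", "Desert biome", "Tundra biome"),
  row.names = colnames(counts)
)

# Column mask built from metadata: keep samples from one biome.
keep_cols <- meta$biome == "Tundra biome"
tundra <- counts[, keep_cols, drop = FALSE]

# Row mask built from a search term; fixed = TRUE treats the term literally.
keep_rows <- grepl("Protein secretion system", rownames(counts), fixed = TRUE)
secretion <- counts[keep_rows, , drop = FALSE]

print(tundra)
print(secretion)
```

Using `drop = FALSE` keeps the result a matrix even when only one row or column survives the mask.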
It can be useful to merge two biom objects, but be careful. Since the operation requires only that all colnames() of the two objects be distinct, it is possible to produce a nonsensical result, as in this example, where the annotations of the merged objects are entirely unrelated (one is taxonomic and the other is functional).
tail (rownames (merge (xx1, xx4)))
## Warning: matR: merging different 'type's forces common 'type'
## [1] "proteosome related" "recX and regulatory cluster"
## [3] "tRNA sulfuration" "Archaea"
## [5] "Bacteria" "Eukaryota"
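To see why such a merge goes through without complaint, here is a rough base R sketch of a merge that checks only for distinct column names and takes the union of the rows. The function `merge_counts()` is hypothetical, and filling missing entries with zero is an assumption made for illustration; matR's actual handling of missing entries is not documented in this tutorial.

```r
# Hypothetical merge of two count matrices with distinct column names.
# ASSUMPTION: entries absent from one input are filled with `fill` (zero here).
merge_counts <- function(a, b, fill = 0) {
  # The only requirement checked: no column name appears in both inputs.
  stopifnot(length(intersect(colnames(a), colnames(b))) == 0)
  all_rows <- union(rownames(a), rownames(b))
  out <- matrix(
    fill,
    nrow = length(all_rows), ncol = ncol(a) + ncol(b),
    dimnames = list(all_rows, c(colnames(a), colnames(b)))
  )
  out[rownames(a), colnames(a)] <- a
  out[rownames(b), colnames(b)] <- b
  out
}

# A functional table and a taxonomic table (sample names are made up).
func <- matrix(c(10, 19), nrow = 2,
               dimnames = list(c("ABC transporters", "ATP synthases"), "f1"))
taxa <- matrix(c(168162, 59), nrow = 2,
               dimnames = list(c("Bacteria", "Eukaryota"), "t1"))

merged <- merge_counts(func, taxa)
print(merged)
```

Nothing in this procedure checks that the two sets of row annotations belong together, which is why unrelated tables combine without an error.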
In this more realistic example, merging makes it possible to apply different normalizations to subsets of the metagenomes from a single original biom object.
aa <- transform (xx4 [,1:8], t_Threshold, t_Log)
bb <- transform (xx4 [,9:16], t_Threshold=list(entry.min=5), t_Log)
xx4_norm <- merge (aa, bb)
It is easy to convert between the biom class and the built-in types of R, but note that BIOM data is often stored in a sparse format. Consequently, this variation of the as.matrix() command is usually best:
head (as.matrix (xx1, expand=TRUE))
## mgm4440463.3 mgm4440464.3 mgm4441679.3
## ABC transporters 10 15 196
## ATP synthases 19 13 194
## Acid stress 8 1 20
## Adhesion 7 5 22
## Alanine, serine, and glycine 42 43 452
## Aminosugars 24 18 84
## mgm4441680.3 mgm4441682.3 mgm4441695.3
## ABC transporters 175 222 208
## ATP synthases 145 210 22
## Acid stress 21 9 9
## Adhesion 8 19 17
## Alanine, serine, and glycine 241 346 119
## Aminosugars 116 122 9
## mgm4441696.3
## ABC transporters 285
## ATP synthases 15
## Acid stress 8
## Adhesion 7
## Alanine, serine, and glycine 136
## Aminosugars 31
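For comparison, here is the result of omitting the expand= option. Each row of the output is a sparse-format triple: a zero-based row index, a zero-based column index, and the value stored at that position.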
head (as.matrix (xx1))
## [,1] [,2] [,3]
## [1,] 0 6 285
## [2,] 0 0 10
## [3,] 0 4 222
## [4,] 0 3 175
## [5,] 0 2 196
## [6,] 0 5 208
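The relationship between the two outputs can be checked by hand. The sketch below copies the six triples printed above and expands them into a dense matrix in base R; the helper `expand_triples()` is hypothetical and is not matR's implementation.

```r
# The six triples shown above: zero-based row, zero-based column, value.
triples <- rbind(
  c(0, 6, 285),
  c(0, 0, 10),
  c(0, 4, 222),
  c(0, 3, 175),
  c(0, 2, 196),
  c(0, 5, 208)
)

# Hypothetical helper: fill a zero matrix from (row, column, value) triples.
expand_triples <- function(triples, nrow, ncol) {
  dense <- matrix(0, nrow = nrow, ncol = ncol)
  # A two-column index matrix addresses individual cells;
  # adding 1 converts the zero-based indices to R's one-based indices.
  dense[cbind(triples[, 1] + 1, triples[, 2] + 1)] <- triples[, 3]
  dense
}

# All six triples belong to the first row of the 161x7 table.
# The second column stays 0 here only because its triple is not among
# the six rows that head() displayed.
dense <- expand_triples(triples, nrow = 1, ncol = 7)
print(dense)
```

The nonzero entries agree with the "ABC transporters" row of the expanded matrix shown earlier.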
JSON text is the native format of BIOM data and can be obtained with either of the following commands. The second writes the text to a file that can be used by other software.
as.character (xx1)
as.character (xx1, file="xx1.biom")
Similarly, the next command creates a biom object from BIOM data in a file (created by matR or other software).
biom (file="xx1.biom")