Handling biom objects and metadata

This tutorial covers somewhat technical programming issues and will be most easily understood by experienced R users, but the gist is simple: it explains how matR data objects relate to the built-in data types of R. The biomRequest() function retrieves annotation information as a biom object, which is essentially a matrix with additional metadata annotations for its rows and columns. Such objects can also be created by importing a file in BIOM format produced by other software.

The matR package includes four example biom objects, xx1 to xx4.

summary (xx1)

## [id:] mgm4440463.3_mgm4440464.3_mgm4441679.3_mgm4441680.3_mgm4441682.3_mgm4441695.3_mgm4441696.3_function_level2_Subsystems_all_abundance_5_60_15_0
## [generated_by:] MG-RAST revision 3.5 on [date:] 2014-09-05T20:17:55
## [type:] sparse Function table (161x7 of which 938 nonzero)
## [format:] Biological Observation Matrix 1.0

Here is an example showing part of the column metadata of xx3. The information is returned as a data.frame.

head (columns (xx3, "latitude|longitude"))

##              sample.data.latitude sample.data.longitude
## mgm4477803.3              -77.725               162.311
## mgm4477804.3                 39.1                 -96.6
## mgm4477805.3                 34.9               -115.65
## mgm4477807.3              -12.633               -71.233
## mgm4477872.3               35.383              -105.933
## mgm4477873.3               34.333              -106.733

Metadata is selected by a regular expression. See ?regex for details. The regex below selects project IDs plus all metadata related to the environmental package.

names (columns (xx2, "project\\.id|^env_package"))

##  [1] "env_package.data.alkalinity"         
##  [2] "env_package.data.ammonium"           
##  [3] "env_package.data.bromide"            
##  [4] "env_package.data.calcium"            
##  [5] "env_package.data.chloride"           
##  [6] "env_package.data.density"            
##  [7] "env_package.data.diss_org_carb"      
##  [8] "env_package.data.env_package"        
##  [9] "env_package.data.magnesium"          
## [10] "env_package.data.misc_param"         
## [11] "env_package.data.nitrate"            
## [12] "env_package.data.nitrite"            
## [13] "env_package.data.potassium"          
## [14] "env_package.data.salinity"           
## [15] "env_package.data.samp_store_temp"    
## [16] "env_package.data.silicate"           
## [17] "env_package.data.sulfate"            
## [18] "env_package.data.suspend_part_matter"
## [19] "env_package.id"                      
## [20] "env_package.name"                    
## [21] "env_package.type"                    
## [22] "project.id"
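
To see why the pattern above escapes the dot and anchors the second alternative, the following base R sketch applies the same regex to a small, made-up metadata table. It assumes that columns() matches the pattern against metadata field names much as grepl() does; that is an assumption about matR rather than something stated here, and the data.frame `meta` and its field names are hypothetical.

```r
# Hypothetical metadata table; field names only imitate those shown above.
meta <- data.frame(
  project.id              = c("mgp1", "mgp1"),
  projectXid              = c("a", "b"),
  env_package.type        = c("water", "water"),
  sample.data.env_package = c("x", "y"),
  stringsAsFactors        = TRUE
)

# "\\." matches a literal dot; "^" anchors env_package to the start of the name.
pattern <- "project\\.id|^env_package"

# Assumption: field selection behaves like grepl() applied to the field names.
selected <- meta[, grepl(pattern, names(meta)), drop = FALSE]
names(selected)
```

In this sketch "projectXid" is rejected because the dot is escaped, and "sample.data.env_package" is rejected because of the "^" anchor.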

The regex can be omitted to show all metadata. Metadata for rows typically consists of annotation hierarchy levels only.

names (rows (xx1))

## [1] "ontology1" "ontology2"

A data.frame is returned even when only a single metadata field is selected. Note that in this example the rownames() of the data.frame coincide with its single variable.

head (rows (xx1, "ontology2"))

##                                                 ontology2
## ABC transporters                         ABC transporters
## ATP synthases                               ATP synthases
## Acid stress                                   Acid stress
## Adhesion                                         Adhesion
## Alanine, serine, and glycine Alanine, serine, and glycine
## Aminosugars                                   Aminosugars

Metadata fields are almost always coded as factors.
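
Factor coding matters when a field holds numbers, such as latitude. The base R sketch below uses a made-up factor `lat`, not data taken from a biom object, to show the usual conversion: go through as.character() first, because as.numeric() applied directly to a factor returns the internal level codes.

```r
# Hypothetical factor standing in for a numeric metadata field.
lat <- factor(c("-77.725", "39.1", "34.9"))

# Pitfall: this yields integer level codes, not the latitudes.
codes <- as.numeric(lat)

# Convert the labels to character first, then to numeric.
lat_num <- as.numeric(as.character(lat))
lat_num
```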

The biom and matrix classes are very similar, aside from metadata. The functions dim(), rownames(), colnames(), and dimnames() can all be applied to biom objects, and replacement forms such as colnames<- are available. This example renames columns with information taken from metadata.

yy <- xx4
colnames (yy) <- columns (yy, "sample.data.sample_name") [[1]]

Subsetting also works in familiar ways.

xx3 [1:10,1:2]

##                                                  mgm4477803.3 mgm4477804.3
## Amino Acids and Derivatives                            209800       178730
## Carbohydrates                                          236396       225778
## Cell Division and Cell Cycle                            37902        26531
## Cell Wall and Capsule                                   81677        58110
## Clustering-based subsystems                            375908       286724
## Cofactors, Vitamins, Prosthetic Groups, Pigments       161203       122892
## DNA Metabolism                                         120565        75780
## Dormancy and Sporulation                                 6628         3734
## Fatty Acids, Lipids, and Isoprenoids                    71586        61371
## Iron acquisition and metabolism                         16409         9896
## 
## [id:] derived with `[.biom`(x = xx3, i = 1:10, j = 1:2)
## [generated_by:] matR: metagenomics analysis tools for R (0.9) on [date:] 2014-12-20 14:56:58
## [type:] sparse Function table (10x2 of which 20 nonzero)
## [format:] Biological Observation Matrix 1.0

Subsetting by row and column names:

xx4 [c("Bacteria", "Eukaryota"), c("mgm4575333.3", "mgm4575334.3", "mgm4575335.3")]

##           mgm4575333.3 mgm4575334.3 mgm4575335.3
## Bacteria        168162       164695       179684
## Eukaryota           59           83           23
## 
## [id:] derived with `[.biom`(x = xx4, i = c("Bacteria", "Eukaryota"), j = c("mgm4575333.3", "mgm4575334.3", "mgm4575335.3"))
## [generated_by:] matR: metagenomics analysis tools for R (0.9) on [date:] 2014-12-20 14:56:58
## [type:] sparse Taxon table (2x3 of which 6 nonzero)
## [format:] Biological Observation Matrix 1.0

Subsetting to keep metagenomes from only one biome:

summary (xx3 [ ,columns(xx3,"biome") == "Tundra biome"])

## [id:] derived with `[.biom`(x = xx3, j = columns(xx3, "biome") == "Tundra biome")
## [generated_by:] matR: metagenomics analysis tools for R (0.9) on [date:] 2014-12-20 14:56:58
## [type:] sparse Function table (28x2 of which 56 nonzero)
## [format:] Biological Observation Matrix 1.0

Subsetting to keep only rows matching a search term:

summary (xx1 [grepl("Protein secretion system", rownames(xx1)), ])

## [id:] derived with `[.biom`(x = xx1, i = grepl("Protein secretion system", rownames(xx1)))
## [generated_by:] matR: metagenomics analysis tools for R (0.9) on [date:] 2014-12-20 14:56:58
## [type:] sparse Function table (7x7 of which 28 nonzero)
## [format:] Biological Observation Matrix 1.0
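
The same logical-index idiom can be tried on an ordinary matrix. The sketch below is base R only and uses a small, made-up matrix `m` with illustrative row names. Two details are additions, not something the biom example uses: fixed=TRUE, which treats the search term as a literal string instead of a regex, and drop=FALSE, which keeps the result a matrix even if only one row matches. Whether `[.biom` accepts drop= is not stated here.

```r
# Hypothetical count matrix with illustrative row names.
m <- matrix(1:8, nrow = 4,
            dimnames = list(c("Protein secretion system, Type II",
                              "ABC transporters",
                              "Protein secretion system, Type III",
                              "Adhesion"),
                            c("s1", "s2")))

# Logical vector: TRUE for each row name containing the search term.
keep <- grepl("Protein secretion system", rownames(m), fixed = TRUE)

# Keep the matching rows; drop = FALSE preserves the matrix shape.
sub <- m[keep, , drop = FALSE]
sub
```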

It can be useful to merge two biom objects, but be careful. Since the operation requires only that all colnames() of the two objects be distinct, it is possible to produce a nonsensical result, as in this example where the annotations of the merged objects are entirely unrelated (one is taxonomic and the other is functional).

tail (rownames (merge (xx1, xx4)))

## Warning: matR: merging different 'type's forces common 'type'

## [1] "proteosome related"          "recX and regulatory cluster"
## [3] "tRNA sulfuration"            "Archaea"                    
## [5] "Bacteria"                    "Eukaryota"

In this more realistic example, two groups of metagenomes from a single biom object are normalized differently and then merged back together.

aa <- transform (xx4 [,1:8], t_Threshold, t_Log)
bb <- transform (xx4 [,9:16], t_Threshold=list(entry.min=5), t_Log)
xx4_norm <- merge (aa, bb)

It is easy to convert between biom class and the built-in types of R, but note that BIOM data is often stored in a sparse format. Consequently, this variation of the as.matrix() command is usually best:

head (as.matrix (xx1, expand=TRUE))

##                              mgm4440463.3 mgm4440464.3 mgm4441679.3
## ABC transporters                       10           15          196
## ATP synthases                          19           13          194
## Acid stress                             8            1           20
## Adhesion                                7            5           22
## Alanine, serine, and glycine           42           43          452
## Aminosugars                            24           18           84
##                              mgm4441680.3 mgm4441682.3 mgm4441695.3
## ABC transporters                      175          222          208
## ATP synthases                         145          210           22
## Acid stress                            21            9            9
## Adhesion                                8           19           17
## Alanine, serine, and glycine          241          346          119
## Aminosugars                           116          122            9
##                              mgm4441696.3
## ABC transporters                      285
## ATP synthases                          15
## Acid stress                             8
## Adhesion                                7
## Alanine, serine, and glycine          136
## Aminosugars                            31

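For comparison, here is the result of omitting the expand= option. Each row of the output is one entry of the sparse representation: a zero-based row index, a zero-based column index, and the value.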

head (as.matrix (xx1))

##      [,1] [,2] [,3]
## [1,]    0    6  285
## [2,]    0    0   10
## [3,]    0    4  222
## [4,]    0    3  175
## [5,]    0    2  196
## [6,]    0    5  208
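
To make the relation between the two outputs concrete, the base R sketch below rebuilds part of the dense table from the six triples printed above, reading each as (row index, column index, value) with zero-based indices. It illustrates the idea only and is not how matR performs the expansion; the names `triples` and `dense` are introduced here.

```r
# Triples copied from the output above: row index, column index, value.
triples <- rbind(
  c(0, 6, 285),
  c(0, 0, 10),
  c(0, 4, 222),
  c(0, 3, 175),
  c(0, 2, 196),
  c(0, 5, 208)
)

# All six triples belong to row 0, so only that row of the 161x7 table is rebuilt.
dense <- matrix(0, nrow = 1, ncol = 7)

# R indices are one-based, so add 1 to both index columns before assigning.
dense[cbind(triples[, 1] + 1, triples[, 2] + 1)] <- triples[, 3]
dense
```

The rebuilt row agrees with the "ABC transporters" row of the expanded matrix, except in the second column, which stays 0 because its triple is not among the first six shown by head().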

JSON text is the native format of BIOM data and can be obtained with either of the following commands. The second writes the text to a file that other software can use.

as.character (xx1)
as.character (xx1, file="xx1.biom")

Similarly, the next command creates a biom object from BIOM data in a file (created by matR or other software).

biom (file="xx1.biom")