Skip to contents

microbiomedataset provides a rectangular data container for microbiome analysis. Its design follows the core idea of massdataset: keep the abundance matrix and all related metadata synchronized, then make common processing steps explicit and traceable.

Load the demo dataset

library(microbiomedataset)
data("global_patterns", package = "microbiomedataset")

object <- global_patterns
object
#> -------------------- 
#> microbiomedataset version: 0.99.1 
#> -------------------- 
#> 1.expression_data:[ 19216 x 26 data.frame]
#> 2.sample_info:[ 26 x 8 data.frame]
#> 3.variable_info:[ 19216 x 8 data.frame]
#> 4.sample_info_note:[ 8 x 2 data.frame]
#> 5.variable_info_note:[ 8 x 2 data.frame]
#> -------------------- 
#> Processing information (extract_process_info())
#> create_microbiome_dataset ---------- 
#>             Package               Function.used                Time
#> 1 microbiomedataset create_microbiome_dataset() 2022-07-11 01:56:13

The package demo object already contains the three core data layers:

dim(object@expression_data)
#> [1] 19216    26
head(object@sample_info[, c("sample_id", "SampleType", "class")])
#>   sample_id SampleType   class
#> 1       CL3       Soil Subject
#> 2       CC1       Soil Subject
#> 3       SV1       Soil Subject
#> 4   M31Fcsw      Feces Subject
#> 5   M11Fcsw      Feces Subject
#> 6   M31Plmr       Skin Subject
head(object@variable_info[, c("variable_id", "Kingdom", "Phylum", "Genus")])
#>   variable_id Kingdom        Phylum      Genus
#> 1      549322 Archaea Crenarchaeota       <NA>
#> 2      522457 Archaea Crenarchaeota       <NA>
#> 3         951 Archaea Crenarchaeota Sulfolobus
#> 4      244423 Archaea Crenarchaeota       <NA>
#> 5      586076 Archaea Crenarchaeota       <NA>
#> 6      246140 Archaea Crenarchaeota       <NA>

Understand the object design

microbiome_dataset keeps these layers aligned:

  1. expression_data: a feature-by-sample abundance matrix.
  2. sample_info: sample metadata keyed by sample_id.
  3. variable_info: feature metadata keyed by variable_id.
  4. Optional microbiome-specific attachments: otu_tree, taxa_tree, and ref_seq.
  5. Explicit tree link metadata: otu_tree_link and taxa_tree_link.

The object structure is:

flowchart LR
  A["expression_data"] --> B["variable_info"]
  A --> C["sample_info"]
  B --> D["otu_tree_link"]
  B --> E["taxa_tree_link"]
  D --> F["otu_tree"]
  E --> G["taxa_tree"]
  B --> H["ref_seq"]

The built-in validator checks that these parts remain synchronized.

check_microbiome_dataset_class(object)
#> [1] TRUE

Explore the basic accessors

nrow(object)
#> variables 
#>     19216
ncol(object)
#> samples 
#>      26

head(extract_sample_info(object))
#>   sample_id  Primer Final_Barcode Barcode_truncated_plus_T Barcode_full_length
#> 1       CL3 ILBC_01        AACGCA                   TGCGTT         CTAGCGTGCGT
#> 2       CC1 ILBC_02        AACTCG                   CGAGTT         CATCGACGAGT
#> 3       SV1 ILBC_03        AACTGT                   ACAGTT         GTACGCACAGT
#> 4   M31Fcsw ILBC_04        AAGAGA                   TCTCTT         TCGACATCTCT
#> 5   M11Fcsw ILBC_05        AAGCTG                   CAGCTT         CGACTGCAGCT
#> 6   M31Plmr ILBC_07        AATCGT                   ACGATT         CGAGTCACGAT
#>   SampleType                                Description   class
#> 1       Soil   Calhoun South Carolina Pine soil, pH 4.9 Subject
#> 2       Soil   Cedar Creek Minnesota, grassland, pH 6.1 Subject
#> 3       Soil Sevilleta new Mexico, desert scrub, pH 8.3 Subject
#> 4      Feces    M3, Day 1, fecal swab, whole body study Subject
#> 5      Feces   M1, Day 1, fecal swab, whole body study  Subject
#> 6       Skin    M3, Day 1, right palm, whole body study Subject
head(extract_variable_info(object))
#>   variable_id Kingdom        Phylum        Class        Order        Family
#> 1      549322 Archaea Crenarchaeota Thermoprotei         <NA>          <NA>
#> 2      522457 Archaea Crenarchaeota Thermoprotei         <NA>          <NA>
#> 3         951 Archaea Crenarchaeota Thermoprotei Sulfolobales Sulfolobaceae
#> 4      244423 Archaea Crenarchaeota        Sd-NA         <NA>          <NA>
#> 5      586076 Archaea Crenarchaeota        Sd-NA         <NA>          <NA>
#> 6      246140 Archaea Crenarchaeota        Sd-NA         <NA>          <NA>
#>        Genus                  Species
#> 1       <NA>                     <NA>
#> 2       <NA>                     <NA>
#> 3 Sulfolobus Sulfolobusacidocaldarius
#> 4       <NA>                     <NA>
#> 5       <NA>                     <NA>
#> 6       <NA>                     <NA>
extract_expression_data(object)[1:6, 1:6]
#>        CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr
#> 549322   0   0   0       0       0       0
#> 522457   0   0   0       0       0       0
#> 951      0   0   0       0       0       0
#> 244423   0   0   0       0       0       0
#> 586076   0   0   0       0       0       0
#> 246140   0   0   0       0       0       0

Objects saved with an older package version can be upgraded to the current schema:

updated_object <- update_microbiome_dataset(object)
updated_object@version
#> [1] "0.99.1"

Use tidy filtering without breaking the object

Activate sample_info or variable_info, then use familiar dplyr verbs.

soil_object <-
  object %>%
  activate_microbiome_dataset("sample_info") %>%
  dplyr::filter(SampleType == "Soil")

dim(soil_object@expression_data)
#> [1] 19216     3
head(soil_object@sample_info$sample_id)
#> [1] CL3 CC1 SV1
#> 26 Levels: AQC1cm AQC4cm AQC7cm CC1 CL3 Even1 Even2 Even3 F21Plmr ... TS29
genus_defined <-
  object %>%
  activate_microbiome_dataset("variable_info") %>%
  dplyr::filter(!is.na(Genus))

dim(genus_defined@expression_data)
#> [1] 8008   26
head(genus_defined@variable_info$Genus)
#> [1] "Sulfolobus"  "Cenarchaeum" "Cenarchaeum" "Cenarchaeum" "Cenarchaeum"
#> [6] "Cenarchaeum"

Run the most common first steps

Transform counts to relative abundance:

relative_object <- transform_counts(object, method = "relative")
colSums(relative_object@expression_data)[1:5]
#>     CL3     CC1     SV1 M31Fcsw M11Fcsw 
#>       1       1       1       1       1

Aggregate features to a taxonomic rank:

genus_object <-
  summarise_taxa(
    genus_defined,
    taxonomic_rank = "Genus",
    what = "sum_intensity"
  )

dim(genus_object@expression_data)
#> [1] 983  26
head(genus_object@variable_info[, c("variable_id", "Genus")])
#>                variable_id                    Genus
#> 1               Sulfolobus               Sulfolobus
#> 2              Cenarchaeum              Cenarchaeum
#> 3           Nitrosopumilus           Nitrosopumilus
#> 4 CandidatusNitrososphaera CandidatusNitrososphaera
#> 5            Natronococcus            Natronococcus
#> 6            Natronorubrum            Natronorubrum

Export tidy long tables:

feature_table <- melt_features(object, relative = TRUE)
taxa_table <- melt_taxa(object, taxonomic_rank = "Phylum", relative = TRUE)

head(feature_table[, c("sample_id", "variable_id", "abundance")])
#>   sample_id variable_id abundance
#> 1       CL3      549322         0
#> 2       CC1      549322         0
#> 3       SV1      549322         0
#> 4   M31Fcsw      549322         0
#> 5   M11Fcsw      549322         0
#> 6   M31Plmr      549322         0
head(taxa_table[, c("sample_id", "Phylum", "abundance")])
#>   sample_id          Phylum    abundance
#> 1    AQC1cm        ABY1_OD1 0.0005138174
#> 2    AQC1cm             AC1 0.0021409059
#> 3    AQC1cm             AD3 0.0002569087
#> 4    AQC1cm   Acidobacteria 1.2521730194
#> 5    AQC1cm  Actinobacteria 2.2601114984
#> 6    AQC1cm Armatimonadetes 0.0690228049

Where to go next

The rest of the tutorial set is organized by task:

  • import_data: create objects and convert from other ecosystems.
  • preprocess: filtering, transformation, and synchronization-safe verbs.
  • merge: taxonomic aggregation and long-table export.
  • relative_intensity: diversity and ordination analysis.
  • filtering: trees, sequences, and interoperability.
  • microbiome_workflow: a compact end-to-end workflow.