vignettes/metadata.Rmd
metadata.Rmd
The data in Gemma are manually annotated by curators with terms, often using an ontology term on both dataset and sample level. In Gemma.R three primary functions allow access to these annotations for a given dataset.
get_dataset_annotations
: This function returns
annotations associated with a dataset. These try to serve as tags
describing the dataset as a whole and they characteristics that samples
within the datasets have while also including some additional
terms.
get_dataset_samples
: This function returns samples
and associated annotations related to their experimental groups for an
experiment
get_dataset_differential_expression_analyses
: This
function returns information about differential expression analyses
automatically performed by Gemma for a given experiment. Each row of the
output is a contrast where a specific property or an interaction of
properties are described.
In the examples below we will be referring to GSE48962 experiment, where striatum and cerebral cortex samples from control mice and mice belonging to a Huntington model (R6/2) were taken from 8 week and 12 week old mice.
Terms returned via get_dataset_annotations
are tags used
to describe a dataset in general terms.
get_dataset_annotations('GSE48962') %>%
gemma_kable
class.name | class.URI | term.name | term.URI | object.class |
---|---|---|---|---|
genotype | http://www.ebi…/EFO_0000513 | Overexpression | http://gemma.msl…/TGEMO_00004 | FactorValue |
study design | http://www.ebi…/EFO_0001426 | SUBSET | http://gemma.msl…/TGEMO_00022 | ExperimentTag |
strain | http://www.ebi…/EFO_0005135 | C57BL/6 | http://www.ebi…/EFO_0022397 | ExperimentTag |
organism part | http://www.ebi…/EFO_0000635 | striatum | http://purl.obolibrary…/UBERON_0002435 | FactorValue |
genotype | http://www.ebi…/EFO_0000513 | HTT [human] huntingtin | http://purl.org/../3064 | FactorValue |
strain | http://www.ebi…/EFO_0005135 | CBA | http://www.ebi…/EFO_0022463 | ExperimentTag |
organism part | http://www.ebi…/EFO_0000635 | cerebral cortex | http://purl.obolibrary…/UBERON_0000956 | FactorValue |
These tags come as a class/term pairs and inherit any terms that is assigned to any of the samples. Therefore we can see all chemicals and cell types used in the experiment.
Samples and differential expression contrasts in Gemma are annotated with factor values. These values contain statements that describe these samples and which samples belong to which experimental in a differential expression analysis respectively.
In gemma.R these values are stored in nested data.table
s
and can be found by accessing the relevant columns of the outputs.
Annotations for samples can be accessed using
get_dataset_samples
. sample.factorValues
column contains the relevant information
samples <- get_dataset_samples('GSE48962')
samples$sample.factorValues[[
which(samples$sample.name == "TSM490")
]] %>%
gemma_kable()
category | category.URI | value | value.URI | predicate | predicate.URI | object | object.URI | summary | ID | factor.ID | factor.category | factor.category.URI |
---|---|---|---|---|---|---|---|---|---|---|---|---|
organism part | http://www.ebi…/EFO_0000635 | striatum | http://purl.obolibrary…/UBERON_0002435 | NA | NA | NA | NA | striatum | 120172 | 20540 | organism part | http://www.ebi…/EFO_0000635 |
genotype | http://www.ebi…/EFO_0000513 | HTT [human] huntingtin | http://purl.org/../3064 | has_genotype | http://purl.obolibrary…/GENO_0000222 | CAG repeats | NA | HTT [human] huntingtin has_… | 120175 | 20541 | genotype | http://www.ebi…/EFO_0000513 |
genotype | http://www.ebi…/EFO_0000513 | HTT [human] huntingtin | http://purl.org/../3064 | has_genotype | http://purl.obolibrary…/GENO_0000222 | Overexpression | http://gemma.msl…/TGEMO_00004 | HTT [human] huntingtin has_… | 120175 | 20541 | genotype | http://www.ebi…/EFO_0000513 |
block | http://www.ebi…/EFO_0005067 | Device=HWUSI-EAS1563_0073_F… | NA | NA | NA | NA | NA | Device=HWUSI-EAS1563_0073_F… | 163476 | 32643 | block | http://www.ebi…/EFO_0005067 |
timepoint | http://www.ebi…/EFO_0000724 | 12 weeks | NA | NA | NA | NA | NA | 12 weeks | 120179 | 20543 | timepoint | http://www.ebi…/EFO_0000724 |
The example above shows a single factor value object for one sample.
The rows of this data.table
are statements that belong to a
factor value. Below each column of this nested table is described. If a
given field is filled by an ontology term, the corresponding URI column
will contain the ontology URI for the field.
category
/category.URI
: Category of the
individual statement, such as treatment, phenotype or strainvalue
/value.URI
: The subject of the
statement.predicate
/predicate.URI
: When a subject
alone is not enough to describe all details, a statement can contain a
predicate and an object. The predicate describes the relationship
between the subject of the statement and the object. In the example
above, these are used to denote properties of the human HTT in the mouse
modelsobject
/object.URI
: The object of a
statement is a property further describing it’s value. In this example
these describe the properties of the HTT gene in the mouse model, namely
that it has CAG repeats and it is overexpressed. If the value was a drug
this could be dosage or timepoint.summary
: A plain text summary of the factorValue.
Different statements will have the same summary if they are part of the
same factor valueID
: An integer identifier for the specific factor
value. In the example above, the genotype of the mouse is defined as a
single factor value made up of two statements stating the HTT gene has
CAG repeats and that it is overexpressed. This factor value has the ID
of 120175 which is shared by both rows containing the statements
describing it. This ID will repeat for every other patient that has the
same genotype or differential expression results using that factor as a
part of their contrast. For instance we can see which samples that was
subjected to this condition using this ID instead of trying to match the
other columns describing the statements
id <- samples$sample.factorValues[[
which(samples$sample.name == "TSM490")
]] %>% filter(value == "HTT [human] huntingtin") %>% {.$ID} %>% unique
# count how many patients has this phenotype
samples$sample.factorValues %>% sapply(\(x){
id %in% x$ID
}) %>% sum
## [1] 12
factor.ID
: An integer identifier for the factor. A
factor holds specific factor values. For the example above whether or
not the mouse is a wild type mouse or if it has a wild type genotype is
stored under the id 20541We can use this to fetch all distinct genotypes
id <- samples$sample.factorValues[[
which(samples$sample.name == "TSM490")
]] %>%
filter(value == "HTT [human] huntingtin") %>% {.$factor.ID} %>% unique
samples$sample.factorValues %>% lapply(\(x){
x %>% filter(factor.ID == id) %>% {.$summary}
}) %>% unlist %>% unique
## [1] "wild type genotype"
## [2] "HTT [human] huntingtin has_genotype CAG repeats and has_genotype Overexpression"
This shows us the dataset has control mice and Huntington Disease
model mice.. This ID can be used to match the factor between samples and
between samples and differential expression experiments -
factor.category
/factor.category.URI
: The
category of the whole factor. Usually this is the same with the
category
of the statements making up the factor value.
However in cases like the example above, where the value describes a
treatment while the factor overall represents a phenotype, they can
differ.
gemma.R includes a convenience function to create a simplified design matrix out of these factor values for a given experiment. This will unpack the nested data.frames and provide a more human readable output, giving each available factor it’s own column.
design <- make_design(samples)
design[,-1] %>% head %>% # first column is just a copy of the original factor values
gemma_kable()
block | organism part | genotype | timepoint | |
---|---|---|---|---|
ESW176 | Device=HWUSI-EAS1563_0071_F… | striatum | wild type genotype | 8 weeks |
TCW9469 | Device=HWUSI-EAS1563_0053:L… | cerebral cortex | wild type genotype | 12 weeks |
ECM175 | Device=HWUSI-EAS1563_0071_F… | cerebral cortex | HTT [human] huntingtin has_… | 8 weeks |
ESW183 | Device=HWUSI-EAS1563_0072_F… | striatum | wild type genotype | 8 weeks |
ECW178 | Device=HWUSI-EAS1563_0071_F… | cerebral cortex | wild type genotype | 8 weeks |
TSW479 | Device=HWUSI-EAS1563_0073_F… | striatum | wild type genotype | 12 weeks |
Using this output, here we look at the sample sizes for different experimental groups.
design %>%
group_by(`organism part`,timepoint,genotype) %>%
summarize(n= n()) %>%
arrange(desc(n)) %>%
gemma_kable()
## `summarise()` has grouped output by 'organism part', 'timepoint'. You can
## override using the `.groups` argument.
organism part | timepoint | genotype | n |
---|---|---|---|
cerebral cortex | 12 weeks | HTT [human] huntingtin has_… | 3 |
cerebral cortex | 12 weeks | wild type genotype | 3 |
cerebral cortex | 8 weeks | HTT [human] huntingtin has_… | 3 |
cerebral cortex | 8 weeks | wild type genotype | 3 |
striatum | 12 weeks | HTT [human] huntingtin has_… | 3 |
striatum | 12 weeks | wild type genotype | 3 |
striatum | 8 weeks | HTT [human] huntingtin has_… | 3 |
striatum | 8 weeks | wild type genotype | 3 |
For most experiments it contains, Gemma performs automated differential expression analyses. The kinds of analyses that will be performed is informed by the factor values belonging to the samples.
# removing columns containing factor values and URIs for brevity
remove_columns <- c('baseline.factors','experimental.factors','subsetFactor','factor.category.URI')
dea <- get_dataset_differential_expression_analyses("GSE48962")
dea[,.SD,.SDcols = !remove_columns] %>%
gemma_kable()
result.ID | contrast.ID | experiment.ID | factor.category | factor.ID | isSubset | probes.analyzed | genes.analyzed |
---|---|---|---|---|---|---|---|
492856 | 120175_120178 | 8972 | genotype,timepoint | 20541,20543 | TRUE | 18969 | 18972 |
492855 | 120175 | 8972 | genotype | 20541 | TRUE | 18968 | 18971 |
492854 | 120178 | 8972 | timepoint | 20543 | TRUE | 18968 | 18971 |
492853 | 120175_120178 | 8972 | genotype,timepoint | 20541,20543 | TRUE | 20621 | 20623 |
492852 | 120175 | 8972 | genotype | 20541 | TRUE | 20621 | 20623 |
492851 | 120178 | 8972 | timepoint | 20543 | TRUE | 20621 | 20623 |
The example above shows the differential expression analyses results.
Each row of this data.table represents a differential expression
contrast connected to a fold change and a p value in the output of
get_differential_expression_values
function. If we look at
the contrast.ID
we will see the factor value identifiers
returned in the ID
column of our
sample.factorValues
. These represent which factor value is
used as the experimental factor. Note that some rows will have two IDs
appended together. These represent the interaction effects of multiple
factors. For simplicity, we will start from a contrast without an
interaction.
contrast <- dea %>%
filter(
factor.category == "genotype" &
subsetFactor %>% map_chr('value') %>% {.=='cerebral cortex'} # we will talk about subsets in a moment
)
# removing URIs for brevity
uri_columns = c('category.URI',
'object.URI',
'value.URI',
'predicate.URI',
'factor.category.URI')
contrast$baseline.factors[[1]][,.SD,.SDcols = !uri_columns] %>%
gemma_kable()
category | value | predicate | object | summary | ID | factor.ID | factor.category |
---|---|---|---|---|---|---|---|
genotype | wild type genotype | NA | NA | wild type genotype | 120174 | 20541 | genotype |
contrast$experimental.factors[[1]][,.SD,.SDcols = !uri_columns] %>%
gemma_kable()
category | value | predicate | object | summary | ID | factor.ID | factor.category |
---|---|---|---|---|---|---|---|
genotype | HTT [human] huntingtin | has_genotype | CAG repeats | HTT [human] huntingtin has_… | 120175 | 20541 | genotype |
genotype | HTT [human] huntingtin | has_genotype | Overexpression | HTT [human] huntingtin has_… | 120175 | 20541 | genotype |
Here, we can see the baseline is the wild type mouse, being compared to the Huntington Disease models
If we examine a factor with interaction, both baseline and experimental factor value columns will contain two factor values.
contrast <- dea %>%
filter(
factor.category == "genotype,timepoint" &
subsetFactor %>% map_chr('value') %>% {.=='cerebral cortex'} # we're almost there!
)
contrast$baseline.factors[[1]][,.SD,.SDcols = !uri_columns] %>%
gemma_kable()
category | value | predicate | object | summary | ID | factor.ID | factor.category |
---|---|---|---|---|---|---|---|
genotype | wild type genotype | NA | NA | wild type genotype | 120174 | 20541 | genotype |
timepoint | 12 weeks | NA | NA | 12 weeks | 120179 | 20543 | timepoint |
contrast$experimental.factors[[1]][,.SD,.SDcols = !uri_columns] %>%
gemma_kable()
category | value | predicate | object | summary | ID | factor.ID | factor.category |
---|---|---|---|---|---|---|---|
genotype | HTT [human] huntingtin | has_genotype | CAG repeats | HTT [human] huntingtin has_… | 120175 | 20541 | genotype |
genotype | HTT [human] huntingtin | has_genotype | Overexpression | HTT [human] huntingtin has_… | 120175 | 20541 | genotype |
timepoint | 8 weeks | NA | NA | 8 weeks | 120178 | 20543 | timepoint |
A third place that can contain factorValues is the
subsetFactor
. Certain differential expression analyses
exclude certain samples based on a given factor. In this example we can
see that this analysis were only performed on samples from the cerebral
cortex.
contrast$subsetFactor[[1]][,.SD,.SDcols = !uri_columns] %>%
gemma_kable()
category | value | predicate | object | summary | ID | factor.ID | factor.category |
---|---|---|---|---|---|---|---|
organism part | cerebral cortex | NA | NA | cerebral cortex | 120173 | 20540 | NA |
The ids of the factor values included in
baseline.factors
and experimental.factors
along with subsetFactor
can be used to determine which
samples represent a given contrast. For convenience,
get_dataset_object
function which is used to compile
metadata and expression data of an experiment in a single object,
includes resultSets
and contrasts
argument
which will return the data already composed of samples representing a
particular contrast.
obj <- get_dataset_object("GSE48962",resultSets = contrast$result.ID,contrasts = contrast$contrast.ID,type = 'list')
obj[[1]]$design[,-1] %>%
head %>% gemma_kable()
block | organism part | genotype | timepoint | |
---|---|---|---|---|
TCW9469 | Device=HWUSI-EAS1563_0053:L… | cerebral cortex | wild type genotype | 12 weeks |
ECM175 | Device=HWUSI-EAS1563_0071_F… | cerebral cortex | HTT [human] huntingtin has_… | 8 weeks |
ECW178 | Device=HWUSI-EAS1563_0071_F… | cerebral cortex | wild type genotype | 8 weeks |
TCW9451 | Device=HWI-EAS413_0047:Lane=2 | cerebral cortex | wild type genotype | 12 weeks |
TCW9457 | Device=HWUSI-EAS1563_0053:L… | cerebral cortex | wild type genotype | 12 weeks |
TCM9450 | Device=HWI-EAS413_0047:Lane=1 | cerebral cortex | HTT [human] huntingtin has_… | 12 weeks |
We suggested that the contrast.ID
of a contrast also
corresponded to a column in the differential expression results,
acquired by get_differential_expression_values
. We can use
what we have learned to take a look at the expression of genes at the
top of the phenotype, treatment interaction. Each result.ID returns its
separate table when accessing differential expression values.
dif_vals <- get_differential_expression_values('GSE48962')
dif_vals[[as.character(contrast$result.ID)]] %>% head %>%
gemma_kable()
Probe | NCBIid | GeneSymbol | GeneName | pvalue | corrected_pvalue | rank | contrast_120175_120178_coefficient | contrast_120175_120178_log2fc | contrast_120175_120178_tstat | contrast_120175_120178_pvalue |
---|---|---|---|---|---|---|---|---|---|---|
100502959 | 100502959 | AV051173 | expressed sequence AV051173 | 1.60e-06 | 0.0163 | 9.70e-05 | 3.7586 | 3.7586 | 12.4896 | 1.60e-06 |
67993 | 67993 | Nudt12 | nudix hydrolase 12 | 1.10e-06 | 0.0163 | 4.85e-05 | -1.1088 | -1.1088 | -13.0595 | 1.10e-06 |
108096 | 108096 | Slco1a5 | solute carrier organic anio… | 1.00e-04 | 0.2082 | 6.00e-04 | -3.6495 | -3.6495 | -6.7998 | 1.00e-04 |
101883 | 101883 | Igflr1 | IGF-like family receptor 1 | 1.00e-04 | 0.2082 | 6.00e-04 | 2.3595 | 2.3595 | 6.7859 | 1.00e-04 |
51795 | 51795 | Srpx | sushi-repeat-containing pro… | 1.00e-04 | 0.2082 | 5.00e-04 | -4.4992 | -4.4992 | -6.8665 | 1.00e-04 |
16969 | 16969 | Zbtb7a | zinc finger and BTB domain … | 7.81e-05 | 0.2082 | 2.00e-04 | 1.1581 | 1.1581 | 7.3738 | 7.81e-05 |
To get the top genes found associated with this interaction we access
the columns with the correct contrast.ID
.
# getting the top 10 genes
top_genes <- dif_vals[[as.character(contrast$result.ID)]] %>%
arrange(across(paste0('contrast_',contrast$contrast.ID,'_pvalue'))) %>%
filter(GeneSymbol!='' | grepl("|",GeneSymbol,fixed = TRUE)) %>% # remove blank genes or probes with multiple genes
{.[1:10,]}
top_genes %>% select(Probe,NCBIid,GeneSymbol) %>%
gemma_kable()
Probe | NCBIid | GeneSymbol |
---|---|---|
67993 | 67993 | Nudt12 |
100502959 | 100502959 | AV051173 |
12153 | 12153 | Bmp1 |
58212 | 58212 | Srrm3 |
16969 | 16969 | Zbtb7a |
19732 | 19732 | Rgl2 |
108096 | 108096 | Slco1a5 |
101883 | 101883 | Igflr1 |
51795 | 51795 | Srpx |
108168973 | 108168973 | Gm12828 |
We can then use the expression data returned by
get_dataset_object
to examine the expression values for
these genes.
exp_subset<- obj[[1]]$exp %>%
filter(Probe %in% top_genes$Probe)
genes <- top_genes$GeneSymbol
# ordering design file
design <- obj[[1]]$design %>% arrange(genotype,timepoint)
# shorten the resistance label a bit
design$genotype[grepl('HTT',design$genotype)] = "Huntington Model"
exp_subset[,.SD,.SDcols = rownames(design)] %>% t %>% scale %>% t %>%
pheatmap(cluster_rows = FALSE,cluster_cols = FALSE,labels_row = genes,
annotation_col =design %>% select(genotype,timepoint))
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] purrr_1.0.2 pheatmap_1.0.12 dplyr_1.1.4 gemma.R_3.3.1
## [5] BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 sass_0.4.9 utf8_1.2.4
## [4] generics_0.1.3 xml2_1.3.6 stringi_1.8.4
## [7] digest_0.6.37 magrittr_2.0.3 timechange_0.3.0
## [10] evaluate_1.0.1 grid_4.4.1 RColorBrewer_1.1-3
## [13] bookdown_0.41 fastmap_1.2.0 jsonlite_1.8.9
## [16] BiocManager_1.30.25 httr_1.4.7 fansi_1.0.6
## [19] viridisLite_0.4.2 scales_1.3.0 textshaping_0.4.1
## [22] jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4
## [25] bit64_4.5.2 munsell_0.5.1 withr_3.0.2
## [28] cachem_1.1.0 yaml_2.3.10 tools_4.4.1
## [31] memoise_2.0.1 colorspace_2.1-1 kableExtra_1.4.0
## [34] assertthat_0.2.1 curl_6.0.1 vctrs_0.6.5
## [37] R6_2.5.1 lubridate_1.9.4 lifecycle_1.0.4
## [40] stringr_1.5.1 fs_1.6.5 htmlwidgets_1.6.4
## [43] bit_4.5.0.1 ragg_1.3.3 pkgconfig_2.0.3
## [46] desc_1.4.3 pkgdown_2.1.1 pillar_1.9.0
## [49] bslib_0.8.0 gtable_0.3.6 data.table_1.16.4
## [52] glue_1.8.0 systemfonts_1.1.0 xfun_0.49
## [55] tibble_3.2.1 tidyselect_1.2.1 rstudioapi_0.17.1
## [58] knitr_1.49 farver_2.1.2 htmltools_0.5.8.1
## [61] svglite_2.1.3 rmarkdown_2.29 compiler_4.4.1