A guide to metadata for samples and differential expression analyses

library(gemma.R)
library(dplyr)
library(pheatmap)
library(purrr)

Introduction

Data in Gemma are manually annotated by curators, often with ontology terms, at both the dataset and the sample level. In gemma.R, three primary functions give access to these annotations for a given dataset.

get_dataset_annotations: This function returns annotations associated with a dataset. These serve as tags describing the dataset as a whole and the characteristics of the samples within it, and may also include some additional terms.
get_dataset_samples: This function returns the samples of an experiment, along with annotations describing their experimental groups.
get_dataset_differential_expression_analyses: This function returns information about differential expression analyses automatically performed by Gemma for a given experiment. Each row of the output is a contrast in which a specific property, or an interaction of properties, is described.

In the examples below we will refer to the GSE48962 experiment, in which striatum and cerebral cortex samples were taken from 8 week and 12 week old control mice and mice belonging to a Huntington model (R6/2).

Dataset tags

Terms returned via get_dataset_annotations are tags used to describe a dataset in general terms.

get_dataset_annotations('GSE48962') %>%
    gemma_kable

class.name	class.URI	term.name	term.URI	object.class
strain	http://www.ebi…/EFO_0005135	C57BL/6	http://www.ebi…/EFO_0022397	ExperimentTag
strain	http://www.ebi…/EFO_0005135	CBA	http://www.ebi…/EFO_0022463	ExperimentTag
study design	http://www.ebi…/EFO_0001426	SUBSET	http://gemma.msl…/TGEMO_00022	ExperimentTag
assay	http://purl.obolibrary…/OBI_0000070	bulk RNA-seq assay	http://purl.obolibrary…/OBI_0003090	ExperimentTag
organism part	http://www.ebi…/EFO_0000635	striatum	http://purl.obolibrary…/UBERON_0002435	FactorValue
organism part	http://www.ebi…/EFO_0000635	cerebral cortex	http://purl.obolibrary…/UBERON_0000956	FactorValue
genotype	http://www.ebi…/EFO_0000513	CAG repeats Overexpression…	http://purl.org/../3064	FactorValue

These tags come as class/term pairs and inherit any terms that are assigned to any of the samples. Therefore we can see the organism parts and the genotype used in the experiment.
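As a rough illustration, the object.class column can be used to separate tags attached to the dataset itself from terms that appear to be inherited from the samples' factor values. The sketch below rebuilds a few rows of the table above by hand as a plain data.frame (an assumption, made so it runs without querying Gemma) and groups the terms by object.class using base R.

```r
# Hand-copied subset of the get_dataset_annotations('GSE48962') output above.
# This is toy data for illustration, not a live query.
tags <- data.frame(
  class.name = c("strain", "strain", "assay", "organism part", "organism part"),
  term.name = c("C57BL/6", "CBA", "bulk RNA-seq assay", "striatum", "cerebral cortex"),
  object.class = c("ExperimentTag", "ExperimentTag", "ExperimentTag",
                   "FactorValue", "FactorValue"),
  stringsAsFactors = FALSE
)

# Group the term names by where the annotation is attached
terms_by_source <- split(tags$term.name, tags$object.class)

# Terms coming from the samples' factor values
terms_by_source$FactorValue
```

With real output, the same split can be applied directly to the table returned by get_dataset_annotations.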

Factor values

Samples and differential expression contrasts in Gemma are annotated with factor values. These values contain statements that describe the samples and, for differential expression analyses, which samples belong to which experimental group.

Sample factor values

In gemma.R these values are stored in nested data.tables and can be found by accessing the relevant columns of the outputs. Annotations for samples can be accessed using get_dataset_samples; the sample.factorValues column contains the relevant information.

samples <- get_dataset_samples('GSE48962')
samples$sample.factorValues[[
    which(samples$sample.name == "TSM490")
    ]] %>% 
    gemma_kable()

category	category.URI	value	value.URI	predicate	predicate.URI	object	object.URI	summary	ID	factor.ID	factor.category	factor.category.URI
organism part	http://www.ebi…/EFO_0000635	striatum	http://purl.obolibrary…/UBERON_0002435	NA	NA	NA	NA	striatum	120172	20540	organism part	http://www.ebi…/EFO_0000635
genotype	http://www.ebi…/EFO_0000513	HTT [human] huntingtin	http://purl.org/../3064	has_genotype	http://purl.obolibrary…/GENO_0000222	CAG repeats	NA	CAG repeats Overexpression…	120175	20541	genotype	http://www.ebi…/EFO_0000513
genotype	http://www.ebi…/EFO_0000513	HTT [human] huntingtin	http://purl.org/../3064	has_genotype	http://purl.obolibrary…/GENO_0000222	Overexpression	http://gemma.msl…/TGEMO_00004	CAG repeats Overexpression…	120175	20541	genotype	http://www.ebi…/EFO_0000513
block	http://www.ebi…/EFO_0005067	Device=HWUSI-EAS1563_0073_F…	NA	NA	NA	NA	NA	Device=HWUSI-EAS1563_0073_F…	163476	32643	block	http://www.ebi…/EFO_0005067
timepoint	http://www.ebi…/EFO_0000724	12 week	NA	NA	NA	NA	NA	12 week	120179	20543	timepoint	http://www.ebi…/EFO_0000724

The example above shows the factor values of a single sample. The rows of this data.table are statements, each belonging to a factor value. Each column of this nested table is described below. If a given field is filled by an ontology term, the corresponding URI column will contain the ontology URI for the field.

category/category.URI: Category of the individual statement, such as treatment, phenotype or strain
value/value.URI: The subject of the statement.
predicate/predicate.URI: When a subject alone is not enough to describe all details, a statement can contain a predicate and an object. The predicate describes the relationship between the subject of the statement and the object. In the example above, these are used to denote properties of the human HTT gene in the mouse models.
object/object.URI: The object of a statement is a property further describing its value. In this example the objects describe the properties of the HTT gene in the mouse model, namely that it has CAG repeats and that it is overexpressed. If the value were a drug, this could be a dosage or a timepoint.
summary: A plain text summary of the factor value. Different statements will have the same summary if they are part of the same factor value.
ID: An integer identifier for the specific factor value. In the example above, the genotype of the mouse is defined as a single factor value made up of two statements, stating that the HTT gene has CAG repeats and that it is overexpressed. This factor value has the ID 120175, which is shared by both rows containing the statements describing it. The same ID appears in every other sample that has this genotype, and in differential expression results that use this factor value as part of their contrast. For instance, we can find the samples subjected to this condition using this ID instead of trying to match the other columns describing the statements:

id <- samples$sample.factorValues[[
    which(samples$sample.name == "TSM490")
]] %>% filter(value == "HTT [human] huntingtin") %>% {.$ID} %>% unique


# count how many samples have this genotype
samples$sample.factorValues %>% sapply(\(x){
    id %in% x$ID
}) %>% sum

## [1] 12

factor.ID: An integer identifier for the factor. A factor holds specific factor values. In the example above, the genotype factor, which records whether a mouse is wild type or a Huntington model, is stored under the ID 20541.

We can use this to fetch all distinct genotypes:

id <- samples$sample.factorValues[[
    which(samples$sample.name == "TSM490")
    ]] %>% 
    filter(value == "HTT [human] huntingtin") %>% {.$factor.ID} %>% unique

samples$sample.factorValues %>% lapply(\(x){
    x %>% filter(factor.ID == id) %>% {.$summary}
}) %>% unlist %>% unique

## [1] "wild type genotype"                                    
## [2] "CAG repeats  Overexpression of  HTT [human] huntingtin"

This shows us that the dataset has control mice and Huntington Disease model mice. This ID can be used to match the factor between samples, and between samples and differential expression experiments.

factor.category/factor.category.URI: The category of the whole factor. Usually this is the same as the category of the statements making up the factor value, as in the example above where both are genotype. However, they can differ, for instance when the value describes a treatment while the factor overall represents a phenotype.
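The relationship between statements, factor values (ID) and factors (factor.ID) described above can be sketched with a small hand-built table. The data.frame below copies a few columns of the TSM490 rows shown earlier (the block row is left out for brevity). It is toy data in base R, not the object returned by gemma.R.

```r
# Statements for sample TSM490, copied by hand from the table above
# (block factor omitted). Each row is one statement.
fv <- data.frame(
  category = c("organism part", "genotype", "genotype", "timepoint"),
  value = c("striatum", "HTT [human] huntingtin", "HTT [human] huntingtin", "12 week"),
  object = c(NA, "CAG repeats", "Overexpression", NA),
  ID = c(120172, 120175, 120175, 120179),
  factor.ID = c(20540, 20541, 20541, 20543),
  stringsAsFactors = FALSE
)

# Number of statements making up each factor value
statements_per_value <- table(fv$ID)

# One row per factor value: statements sharing an ID collapse together
factor_values <- unique(fv[, c("ID", "factor.ID", "category")])
factor_values
```

The genotype factor value (ID 120175) is the only one built from two statements, so four statements collapse into three factor values.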

gemma.R includes a convenience function to create a simplified design matrix out of these factor values for a given experiment. This will unpack the nested data.tables and provide a more human-readable output, giving each available factor its own column.

design <- make_design(samples)
design[,-1] %>% head %>%  # first column is just a copy of the original factor values
    gemma_kable()

	block	organism part	genotype	timepoint
ESW176	Device=HWUSI-EAS1563_0071_F…	striatum	wild type genotype	8 week
TCW9469	Device=HWUSI-EAS1563_0053:L…	cerebral cortex	wild type genotype	12 week
ECM175	Device=HWUSI-EAS1563_0071_F…	cerebral cortex	CAG repeats Overexpression…	8 week
ESW183	Device=HWUSI-EAS1563_0072_F…	striatum	wild type genotype	8 week
ECW178	Device=HWUSI-EAS1563_0071_F…	cerebral cortex	wild type genotype	8 week
TSW479	Device=HWUSI-EAS1563_0073_F…	striatum	wild type genotype	12 week

Using this output, we can look at the sample sizes of the different experimental groups.

design %>%
    group_by(`organism part`,timepoint,genotype) %>% 
    summarize(n= n()) %>% 
    arrange(desc(n)) %>% 
    gemma_kable()

## `summarise()` has grouped output by 'organism part', 'timepoint'. You can
## override using the `.groups` argument.

organism part	timepoint	genotype	n
cerebral cortex	12 week	CAG repeats Overexpression…	3
cerebral cortex	12 week	wild type genotype	3
cerebral cortex	8 week	CAG repeats Overexpression…	3
cerebral cortex	8 week	wild type genotype	3
striatum	12 week	CAG repeats Overexpression…	3
striatum	12 week	wild type genotype	3
striatum	8 week	CAG repeats Overexpression…	3
striatum	8 week	wild type genotype	3

Differential expression analysis factor values

For most experiments it contains, Gemma performs automated differential expression analyses. The kinds of analyses that are performed are informed by the factor values belonging to the samples.

# removing columns containing factor values and URIs for brevity
remove_columns <- c('baseline.factors','experimental.factors','subsetFactor','factor.category.URI')

dea <- get_dataset_differential_expression_analyses("GSE48962")

dea[,.SD,.SDcols = !remove_columns] %>% 
    gemma_kable()

result.ID	contrast.ID	experiment.ID	factor.category	factor.ID	isSubset	probes.analyzed	genes.analyzed
492856	120175_120178	8972	genotype,timepoint	20541,20543	TRUE	18969	18972
492855	120175	8972	genotype	20541	TRUE	18968	18971
492854	120178	8972	timepoint	20543	TRUE	18968	18971
492853	120175_120178	8972	genotype,timepoint	20541,20543	TRUE	20621	20623
492852	120175	8972	genotype	20541	TRUE	20621	20623
492851	120178	8972	timepoint	20543	TRUE	20621	20623

The example above shows the differential expression analysis results. Each row of this data.table represents a differential expression contrast, connected to a fold change and a p value in the output of the get_differential_expression_values function. If we look at the contrast.ID column, we will see the factor value identifiers returned in the ID column of our sample.factorValues. These indicate which factor value is used as the experimental factor. Note that some rows have two IDs joined by an underscore. These represent the interaction effects of multiple factors. For simplicity, we will start from a contrast without an interaction.

contrast <- dea %>% 
    filter(
        factor.category == "genotype" & 
            subsetFactor %>% map_chr('value') %>% {.=='cerebral cortex'} # we will talk about subsets in a moment
        )

# removing URIs for brevity
uri_columns = c('category.URI',
                'object.URI',
                'value.URI',
                'predicate.URI',
                'factor.category.URI')

contrast$baseline.factors[[1]][,.SD,.SDcols = !uri_columns] %>% 
     gemma_kable()

category	value	predicate	object	summary	ID	factor.ID	factor.category
genotype	wild type genotype	NA	NA	wild type genotype	120174	20541	genotype

contrast$experimental.factors[[1]][,.SD,.SDcols = !uri_columns] %>% 
     gemma_kable()

category	value	predicate	object	summary	ID	factor.ID	factor.category
genotype	HTT [human] huntingtin	has_genotype	CAG repeats	CAG repeats Overexpression…	120175	20541	genotype
genotype	HTT [human] huntingtin	has_genotype	Overexpression	CAG repeats Overexpression…	120175	20541	genotype

Here, we can see that the baseline is the wild type mouse, which is compared to the Huntington Disease model.

If we examine a contrast with an interaction, both the baseline and the experimental factor value columns will contain two factor values.

contrast <- dea %>% 
    filter(
        factor.category == "genotype,timepoint" & 
            subsetFactor %>% map_chr('value') %>% {.=='cerebral cortex'} # we're almost there!
        )

contrast$baseline.factors[[1]][,.SD,.SDcols = !uri_columns] %>% 
     gemma_kable()

category	value	predicate	object	summary	ID	factor.ID	factor.category
genotype	wild type genotype	NA	NA	wild type genotype	120174	20541	genotype
timepoint	12 week	NA	NA	12 week	120179	20543	timepoint

contrast$experimental.factors[[1]][,.SD,.SDcols = !uri_columns] %>% 
     gemma_kable()

category	value	predicate	object	summary	ID	factor.ID	factor.category
genotype	HTT [human] huntingtin	has_genotype	CAG repeats	CAG repeats Overexpression…	120175	20541	genotype
genotype	HTT [human] huntingtin	has_genotype	Overexpression	CAG repeats Overexpression…	120175	20541	genotype
timepoint	8 week	NA	NA	8 week	120178	20543	timepoint
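The two factor value IDs packed into an interaction contrast.ID can also be pulled apart programmatically. The snippet below is a base-R sketch using the contrast.ID strings from the analysis table above. It assumes the IDs are always joined by a single underscore, as they are in this example.

```r
# contrast.ID values copied from the differential expression table above
contrast_ids <- c("120175_120178", "120175", "120178")

# Split each contrast.ID into its component factor value IDs
fv_ids <- lapply(strsplit(contrast_ids, "_", fixed = TRUE), as.integer)
names(fv_ids) <- contrast_ids

# Contrasts made of more than one factor value are interactions
is_interaction <- lengths(fv_ids) > 1

fv_ids[["120175_120178"]]
```

The resulting integers can be compared directly against the ID column of the experimental factor values shown above.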

A third place that can contain factor values is the subsetFactor. Some differential expression analyses exclude samples based on a given factor. In this example we can see that the analysis was only performed on samples from the cerebral cortex.

contrast$subsetFactor[[1]][,.SD,.SDcols = !uri_columns] %>%
     gemma_kable()

category	value	predicate	object	summary	ID	factor.ID	factor.category
organism part	cerebral cortex	NA	NA	cerebral cortex	120173	20540	NA

The IDs of the factor values included in baseline.factors and experimental.factors, along with subsetFactor, can be used to determine which samples represent a given contrast. For convenience, the get_dataset_object function, which compiles the metadata and expression data of an experiment into a single object, includes resultSets and contrasts arguments that return the data already restricted to the samples representing a particular contrast.

obj <-  get_dataset_object("GSE48962",resultSets = contrast$result.ID,contrasts = contrast$contrast.ID,type = 'list')
obj[[1]]$design[,-1] %>% 
    head %>% gemma_kable()

	block	organism part	genotype	timepoint
TCW9469	Device=HWUSI-EAS1563_0053:L…	cerebral cortex	wild type genotype	12 week
ECM175	Device=HWUSI-EAS1563_0071_F…	cerebral cortex	CAG repeats Overexpression…	8 week
ECW178	Device=HWUSI-EAS1563_0071_F…	cerebral cortex	wild type genotype	8 week
TCW9451	Device=HWI-EAS413_0047:Lane=2	cerebral cortex	wild type genotype	12 week
TCW9457	Device=HWUSI-EAS1563_0053:L…	cerebral cortex	wild type genotype	12 week
TCM9450	Device=HWI-EAS413_0047:Lane=1	cerebral cortex	CAG repeats Overexpression…	12 week
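To make the ID-based matching concrete, here is a sketch of how samples could be assigned to a contrast by hand. It uses a toy stand-in for samples$sample.factorValues that keeps only the ID column. The four samples and their factor value IDs are taken from the tables in this guide, but the list itself is built manually and the helper variable names are illustrative.

```r
# Toy stand-in for samples$sample.factorValues: one data.frame per sample,
# keeping only the factor value IDs (block factor omitted).
factor_values <- list(
  TCW9469 = data.frame(ID = c(120173, 120174, 120179)), # cortex, wild type, 12 week
  ECM175  = data.frame(ID = c(120173, 120175, 120178)), # cortex, model, 8 week
  ESW176  = data.frame(ID = c(120172, 120174, 120178)), # striatum, wild type, 8 week
  TSM490  = data.frame(ID = c(120172, 120175, 120179))  # striatum, model, 12 week
)

subset_id <- 120173        # subsetFactor: cerebral cortex
baseline_id <- 120174      # baseline.factors: wild type genotype
experimental_id <- 120175  # experimental.factors: Huntington model genotype

# Samples that belong to the subset the analysis was run on
in_subset <- vapply(factor_values, function(x) subset_id %in% x$ID, logical(1))

# Label every sample as baseline or experimental for the genotype contrast
group <- vapply(factor_values, function(x) {
  if (experimental_id %in% x$ID) {
    "experimental"
  } else if (baseline_id %in% x$ID) {
    "baseline"
  } else {
    NA_character_
  }
}, character(1))

# Keep only the samples inside the subset
contrast_samples <- group[in_subset]
contrast_samples
```

Only the two cerebral cortex samples remain, one in each group. In practice get_dataset_object does this selection for you.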

We mentioned above that the contrast.ID of a contrast also corresponds to columns in the differential expression results acquired by get_differential_expression_values. We can use what we have learned to take a look at the expression of the top genes for the genotype, timepoint interaction. Each result.ID returns its own table when accessing differential expression values.

dif_vals <- get_differential_expression_values('GSE48962')
dif_vals[[as.character(contrast$result.ID)]] %>% head %>%  
     gemma_kable()

Probe	NCBIid	gene_ensembl_id	GeneSymbol	GeneName	pvalue	corrected_pvalue	rank	contrast_120175_120178_coefficient	contrast_120175_120178_log2fc	contrast_120175_120178_tstat	contrast_120175_120178_pvalue
100502959	100502959		AV051173	expressed sequence AV051173	1.60e-06	0.0163	9.70e-05	3.7586	3.7586	12.4896	1.60e-06
67993	67993	ENSMUSG00000024228	Nudt12	nudix hydrolase 12	1.10e-06	0.0163	4.85e-05	-1.1088	-1.1088	-13.0595	1.10e-06
108096	108096	ENSMUSG00000063975	Slco1a5	solute carrier organic anio…	1.00e-04	0.2082	6.00e-04	-3.6495	-3.6495	-6.7998	1.00e-04
101883	101883	ENSMUSG00000036826	Igflr1	IGF-like family receptor 1	1.00e-04	0.2082	6.00e-04	2.3595	2.3595	6.7859	1.00e-04
51795	51795	ENSMUSG00000090084	Srpx	sushi-repeat-containing pro…	1.00e-04	0.2082	5.00e-04	-4.4992	-4.4992	-6.8665	1.00e-04
16969	16969	ENSMUSG00000035011	Zbtb7a	zinc finger and BTB domain …	7.81e-05	0.2082	2.00e-04	1.1581	1.1581	7.3738	7.81e-05

To get the top genes associated with this interaction, we access the columns with the correct contrast.ID.

# getting the top 10 genes
top_genes <- dif_vals[[as.character(contrast$result.ID)]] %>% 
    arrange(across(paste0('contrast_',contrast$contrast.ID,'_pvalue'))) %>% 
    filter(GeneSymbol != '' & !grepl("|", GeneSymbol, fixed = TRUE)) %>% # remove blank genes and probes with multiple genes
    {.[1:10,]}
top_genes %>% select(Probe,NCBIid,GeneSymbol) %>% 
     gemma_kable()

Probe	NCBIid	GeneSymbol
67993	67993	Nudt12
100502959	100502959	AV051173
12153	12153	Bmp1
58212	58212	Srrm3
16969	16969	Zbtb7a
19732	19732	Rgl2
108096	108096	Slco1a5
101883	101883	Igflr1
51795	51795	Srpx
108168973	108168973	Gm12828
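The column lookup and filtering steps above can be tried on a small toy table without querying Gemma. The sketch below is base R. Three rows reuse gene symbols and p values from the results table above, while the blank-symbol row and the "GeneA|GeneB" row are hypothetical entries added to show what the filter is meant to remove.

```r
# Build the p value column name from the contrast.ID
contrast_id <- "120175_120178"
pval_col <- paste0("contrast_", contrast_id, "_pvalue")

# Toy results table. The "" and "GeneA|GeneB" rows are made up for illustration.
toy <- data.frame(
  GeneSymbol = c("AV051173", "Nudt12", "", "GeneA|GeneB", "Zbtb7a"),
  pvalue = c(1.6e-06, 1.1e-06, 5e-07, 2e-06, 7.81e-05),
  stringsAsFactors = FALSE
)
names(toy)[names(toy) == "pvalue"] <- pval_col

# Drop blank symbols and probes mapping to multiple genes ("|" separated)
keep <- toy$GeneSymbol != "" & !grepl("|", toy$GeneSymbol, fixed = TRUE)
ranked <- toy[keep, ]

# Order the remaining rows by the contrast-specific p value
ranked <- ranked[order(ranked[[pval_col]]), ]
ranked$GeneSymbol
```

Note that both conditions must hold for a row to be kept, so they are combined with & rather than |.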

We can then use the expression data returned by get_dataset_object to examine the expression values for these genes.

exp_subset<- obj[[1]]$exp %>% 
    filter(Probe %in% top_genes$Probe)
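# take gene symbols in the row order of exp_subset, which follows the expression table rather than top_genes
genes <- top_genes$GeneSymbol[match(exp_subset$Probe, top_genes$Probe)]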

# ordering design file
design <- obj[[1]]$design %>% arrange(genotype,timepoint)

# shorten the genotype label a bit
design$genotype[grepl('HTT',design$genotype)] = "Huntington Model"

exp_subset[,.SD,.SDcols = rownames(design)] %>% t  %>% scale %>% t %>%
    pheatmap(cluster_rows = FALSE,cluster_cols = FALSE,labels_row = genes,
             annotation_col =design %>% select(genotype,timepoint))

Session info