Data Curation
Workflow
The figure above outlines the key steps taken to gather data and prepare it for use in Gemma. These steps include dataset selection, platform processing, expression data processing, metadata curation and downstream analyses. Many of these steps are automated, with some human intervention and manual curation at key stages. Some of these steps are discussed in more detail in the following sections, and additional information is available in Lim et al. 2021.
Dataset selection
When selecting datasets to import into Gemma, we prioritize studies relating to the nervous system, as well as studies involving genetic perturbations of transcriptional regulators and drug treatments. We also prioritize studies that have biological replication and a clear experimental design affording specific comparisons between contrasting conditions (e.g. disease vs. control), and we de-prioritize datasets with small sample sizes, as these are less suitable for the downstream analyses implemented in Gemma. We currently only select studies conducted in human, mouse and rat.
Gemma imports data primarily from the NCBI’s Gene Expression Omnibus (GEO). Data are imported at the GEO series (GSE) level. Unlike in GEO, a GEO sample (GSM) can only appear in one Gemma experiment. As a result, an experiment imported from GEO may contain fewer samples than the corresponding GEO entry, because samples that already appear in another Gemma experiment are removed at load time.
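As a toy illustration of this load-time rule (not Gemma’s actual loader code; the function and accessions below are made up), a sketch in Python might look like this:

```python
# Toy sketch (not Gemma's loader) of the rule that a GEO sample (GSM) may
# appear in at most one Gemma experiment: previously loaded GSMs are dropped.

def filter_new_samples(gse_samples, samples_already_in_gemma):
    """Keep only GSMs that do not already belong to another Gemma experiment."""
    return [gsm for gsm in gse_samples if gsm not in samples_already_in_gemma]

existing = {"GSM100001", "GSM100002"}          # GSMs loaded in other experiments
incoming = ["GSM100001", "GSM200001", "GSM200002"]
print(filter_new_samples(incoming, existing))  # ['GSM200001', 'GSM200002']
```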
Platform processing
Gemma supports a wide variety of microarray platforms, including Affymetrix GeneChips, Agilent spotted arrays, Illumina BeadArrays and many two-color platforms, as well as short-read RNA-seq and single-cell RNA-seq data. A key step in platform processing is linking expression data to genes, especially for microarray platforms, where the probe sequences on the arrays were often designed prior to the availability of high-quality reference genome sequences and annotations.
For microarray platforms, we first have to obtain the nucleotide sequences of the probes. We typically acquire them from the manufacturers’ websites, as they are often not available from GEO. Next, we align the probe sequences against the appropriate reference genome using BLAT. Whenever a new genome assembly is available, the probes are realigned. Probes are then mapped to transcripts using genome annotations from UCSC GoldenPath.
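To make the alignment step concrete, the sketch below runs BLAT on a FASTA file of probe sequences and keeps each probe’s best-scoring hit from the resulting PSL output. The file names and the “most matches wins” heuristic are illustrative assumptions; Gemma’s production pipeline is more elaborate.

```python
# Hedged sketch of the probe alignment step: run BLAT on a FASTA file of probe
# sequences, then keep each probe's best-scoring hit from the PSL output.
import subprocess

def run_blat(genome_2bit, probes_fasta, out_psl):
    # BLAT's basic command line is: blat <database> <query> <output.psl>
    subprocess.run(["blat", genome_2bit, probes_fasta, out_psl], check=True)

def best_hits(psl_path):
    """Parse PSL output and keep the highest-match alignment per probe."""
    best = {}
    with open(psl_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 21 or not fields[0].isdigit():
                continue  # skip the PSL header block
            matches = int(fields[0])                       # matching bases
            probe, chrom = fields[9], fields[13]           # query and target names
            start, end = int(fields[15]), int(fields[16])  # target coordinates
            if probe not in best or matches > best[probe][0]:
                best[probe] = (matches, chrom, start, end)
    return best

# run_blat("hg38.2bit", "probes.fa", "probes.psl")
# print(best_hits("probes.psl"))
```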
For RNA-seq data, we define a ‘pseudo-platform’ for each taxon, where the entire platform’s elements are the set of known genes recorded in the reference genome annotations. The output of our RNA-seq data processing pipeline can then be linked to these ‘generic’ platforms based on NCBI gene IDs.
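The following is a hypothetical sketch of such a pseudo-platform: its elements are simply the known genes for a taxon, keyed by NCBI gene ID, and quantified RNA-seq output is linked by matching those IDs. The data structures are illustrative stand-ins, not Gemma’s internal model.

```python
# Illustrative sketch of a per-taxon RNA-seq 'pseudo-platform': its elements
# are simply the known genes for that taxon, keyed by NCBI gene ID.

def build_pseudo_platform(taxon, gene_annotations):
    """gene_annotations: iterable of (ncbi_gene_id, symbol) pairs taken from
    the reference genome annotations."""
    return {"taxon": taxon,
            "elements": {gene_id: symbol for gene_id, symbol in gene_annotations}}

def link_to_platform(platform, counts_by_gene_id):
    """Keep only quantified genes that exist as elements of the pseudo-platform."""
    return {gid: count for gid, count in counts_by_gene_id.items()
            if gid in platform["elements"]}

human = build_pseudo_platform("human", [(348, "APOE"), (7157, "TP53")])
print(link_to_platform(human, {348: 1200, 7157: 860, 999999: 5}))
# -> {348: 1200, 7157: 860}
```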
Data processing and quality control
The resulting expression data are processed through a common post-processing pipeline, including log2-transformation (if not already performed), quantile normalization, batch correction and outlier detection.
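As an illustration of two of these steps, the sketch below applies a log2 transformation and standard quantile normalization to a genes × samples matrix. It follows the textbook algorithm rather than Gemma’s exact implementation (ties are broken arbitrarily).

```python
import numpy as np

def log2_transform(expr, pseudocount=1.0):
    """expr: genes x samples matrix of non-negative values."""
    return np.log2(expr + pseudocount)

def quantile_normalize(expr):
    """Force every sample (column) to share the same empirical distribution."""
    order = np.argsort(expr, axis=0)                     # per-column sort order
    ranks = np.argsort(order, axis=0)                    # rank of each value in its column
    mean_quantiles = np.sort(expr, axis=0).mean(axis=1)  # average distribution across samples
    return mean_quantiles[ranks]                         # replace each value by its rank's mean

expr = np.array([[5.0, 4.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 4.0, 6.0],
                 [4.0, 2.0, 8.0]])
print(quantile_normalize(log2_transform(expr)))
```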
In Gemma, batches are defined automatically using information in the raw data files. For microarrays, date stamps can usually be found in CEL files for Affymetrix datasets or in GenePix output files for Agilent datasets. Occasionally, batch information is obtained manually from supplementary information provided by the data submitter or the associated publication. For RNA-seq datasets, we use information from the FASTQ headers, which for Illumina platforms often contain information on the ‘device’, ‘run’, ‘flow cell’ and/or ‘lane’. Any batch information obtained is stored as a factor in the experimental design. Batch correction is conducted using an in-house implementation of the [ComBat algorithm](https://academic.oup.com/biostatistics/article-abstract/8/1/118/252073), enhanced to automatically select covariates from the experimental design.
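As a concrete example of the FASTQ-header heuristic, the sketch below derives a batch label from an Illumina-style header of the form `@instrument:run:flowcell:lane:...` (the common Casava 1.8+ layout). Gemma’s actual parser handles additional header variants.

```python
def batch_from_fastq_header(header):
    """Return a device/run/flowcell/lane batch label, or None if unavailable."""
    fields = header.lstrip("@").split(" ")[0].split(":")
    if len(fields) < 4:
        return None  # header carries no usable batch information
    instrument, run, flowcell, lane = fields[:4]
    return f"{instrument}_{run}_{flowcell}_L{lane}"

print(batch_from_fastq_header("@M00123:45:000000000-ABCDE:1:1101:15589:1332 1:N:0:ATCACG"))
# -> 'M00123_45_000000000-ABCDE_L1'
```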
Gemma automatically identifies potential outlier samples using the sample–sample expression correlation matrix. To avoid misclassifying biologically meaningful groups as outliers, the correlation matrix is first adjusted by regressing out the effects of major experimental factors, such as tissue type. All outliers called by the algorithm are reviewed and approved by a curator.
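A simplified version of this heuristic is sketched below: major factors are regressed out of each gene’s expression, sample–sample correlations are computed on the residuals, and samples whose median correlation falls well below typical values are flagged. The MAD-based threshold and regression details are illustrative assumptions, not Gemma’s exact criteria.

```python
import numpy as np

def regress_out(expr, design):
    """expr: genes x samples; design: samples x covariates (e.g. tissue dummies)."""
    X = np.column_stack([np.ones(expr.shape[1]), design])  # add an intercept
    beta, *_ = np.linalg.lstsq(X, expr.T, rcond=None)      # one fit per gene
    return expr - (X @ beta).T                             # residual expression

def call_outliers(expr, design, n_mads=3.0):
    resid = regress_out(expr, design)
    corr = np.corrcoef(resid.T)                 # sample-sample correlation matrix
    np.fill_diagonal(corr, np.nan)              # ignore self-correlation
    med = np.nanmedian(corr, axis=1)            # each sample's typical agreement
    cutoff = np.median(med) - n_mads * np.median(np.abs(med - np.median(med)))
    return np.where(med < cutoff)[0]            # indices of candidate outliers
```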
A number of diagnostics are computed and stored for each dataset, such as principal component analysis (PCA) of the expression levels, the relationship between the principal components and curated experimental factors, the mean–variance relationship of expression levels and sample–sample expression level correlations. These are used for manual and automated QC processes.
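For example, one such diagnostic, the relationship between principal components and experimental factors, could be computed along the following lines. A plain correlation is used here for simplicity; the statistics used in Gemma may differ.

```python
import numpy as np

def pc_factor_association(expr, factor_values, n_pcs=3):
    """expr: genes x samples; factor_values: numeric encoding, one per sample."""
    centered = expr - expr.mean(axis=1, keepdims=True)
    # Right singular vectors of the genes x samples matrix give per-sample PC scores.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return [float(np.corrcoef(vt[i], factor_values)[0, 1]) for i in range(n_pcs)]

expr = np.random.default_rng(0).normal(size=(50, 8))  # 50 genes, 8 samples
tissue = np.array([0, 0, 0, 0, 1, 1, 1, 1])           # numeric factor encoding
print(pc_factor_association(expr, tissue))
```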
Curation of dataset metadata
After loading and basic automatic pre-processing of metadata, we proceed with manual curation based on an extensive set of guidelines. The manual curation includes annotating both the experimental design and the ‘topics’ of each dataset. Curators use Gemma administrative tools to label each sample according to factors such as tissue and treatment. Experiments themselves are ‘tagged’ with terms describing the study at a higher level. For both types of annotations, we primarily use terms from Open Biomedical Ontologies; among the benefits of using ontologies are greater uniformity within Gemma and better interoperability with other resources. These annotations are used to help locate datasets in searches, as well as in performing statistical analyses.
In Gemma, an ‘experimental design’ refers to the characteristics of the samples in a dataset that allow groups to be formed for downstream analysis, such as differential expression. These characteristics are used as input to our statistical models and also in dataset searches. An experimental design is organized around Experimental Factors. A factor is a known variable in an experiment, such as “age”, “genotype” or “treatment”. An experiment can have any number of factors, but most have only one to three. Each sample has a specific Factor Value for each of the factors. For example, for “genotype”, the available values might be “wild-type” and “mutant”; for a continuous factor like “age”, the values might be quantities such as “10.1 years”.
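In code form, this vocabulary might be modeled as in the hypothetical sketch below; Gemma’s real data model, which also attaches ontology terms, is considerably richer.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentalFactor:
    name: str                        # e.g. "genotype", "age", "treatment"

@dataclass
class FactorValue:
    factor: ExperimentalFactor
    value: str                       # e.g. "wild-type" or "10.1 years"

@dataclass
class Sample:
    accession: str                   # e.g. a GSM identifier
    factor_values: list = field(default_factory=list)

genotype = ExperimentalFactor("genotype")
sample = Sample("GSM123456", [FactorValue(genotype, "wild-type")])
print(sample)
```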
Another aspect of design curation is the identification of the ‘baseline condition’ of the study, which is important for differential expression analysis. A baseline is not assigned for every factor, since some factors (e.g. ‘organism part’ and ‘biological sex’) have no natural baseline. For factors that do have one (e.g. ‘genotype’, ‘treatment’ and ‘timepoint’), the baseline condition is annotated with specific ontology terms such as ‘reference substance role’.
Dataset-level annotations are used to capture information that is often not explicitly part of the experimental design and not otherwise captured. Examples include a tissue or disease state that is constant across all of the samples in a dataset, as well as relevant topics that are not otherwise explicit. These tags help ensure that searches for a relevant concept retrieve the dataset. We ensure that certain types of information are always captured, including tissue of origin, cell type, cell line, mouse/rat strains and biological sex.
Gene Expression Experiment Quality (GEEQ) score
The Gene Expression Experiment Quality (GEEQ) score gives a summary measure of data quality. It is meant to help answer the question ‘How well can I trust the results of an analysis of these data?’ Some of the properties we take into account when calculating the score include the number of replicates for each condition, the severity of batch effects (unless corrected) and the median sample–sample correlation. Prior experience has shown that datasets with issues in these categories tend to give results that are noisy and less reproducible; we assign a lower GEEQ score to reflect that. Based on extensive experience, we calibrated the GEEQ scores such that values in the lowest and highest quintiles reflect the extremes of observed data quality. The scores are a rough guide to help users identify datasets that might be especially suitable or problematic.
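Purely as an illustration of how such properties could be combined, the toy sketch below averages three sub-scores for replication, batch-effect severity and sample–sample correlation. The actual GEEQ formula and its calibration differ from this toy.

```python
def toy_geeq(min_replicates, batch_confound_severity, median_sample_corr):
    """Each sub-score is scaled to [0, 1]; the toy score is their mean."""
    replicate_score = min(min_replicates, 4) / 4.0       # saturates at 4 replicates
    batch_score = 1.0 - batch_confound_severity          # 0 = none, 1 = fully confounded
    corr_score = max(0.0, min(median_sample_corr, 1.0))  # clip to [0, 1]
    return (replicate_score + batch_score + corr_score) / 3.0

print(round(toy_geeq(min_replicates=3, batch_confound_severity=0.2,
                     median_sample_corr=0.9), 2))  # -> 0.82
```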