CORR — corr • ermineR

This method examines the gene expression profiles themselves rather than a score for each gene (which is what the other methods, such as ORA, work with). A score is computed for each gene set based on how strongly the expression profiles of its genes are correlated.

This can be thought of as a measure of how well the genes in the set cluster together, but they need not all be in the same cluster. Thus a gene set that contains two coherent clusters that encompass most of the genes in the set will tend to get a good score (though not as good as a gene set that is just one big cluster).
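To make the idea concrete, here is a minimal base-R sketch of a correlation-based gene set score. It assumes the score is the mean absolute pairwise Pearson correlation between the expression profiles in a set; the exact statistic used by ermineJ may differ. The helper `meanAbsCorrelation` and all toy data are hypothetical.

```r
# Hypothetical helper: mean absolute pairwise Pearson correlation
# among the rows (genes) of an expression matrix.
# Illustrative assumption only, not necessarily ermineJ's exact statistic.
meanAbsCorrelation <- function(expr) {
  stopifnot(is.matrix(expr), nrow(expr) >= 2)
  corMat <- cor(t(expr))                 # gene-by-gene correlation matrix
  mean(abs(corMat[upper.tri(corMat)]))   # average over distinct gene pairs
}

set.seed(1)
nSamples <- 20
signalA <- rnorm(nSamples)
signalB <- rnorm(nSamples)

# Toy helper: n genes that follow one shared pattern plus noise (genes in rows).
noisyCopies <- function(signal, n) {
  t(sapply(seq_len(n), function(i) signal + rnorm(length(signal), sd = 0.3)))
}

oneCluster  <- noisyCopies(signalA, 6)                               # one big cluster
twoClusters <- rbind(noisyCopies(signalA, 3), noisyCopies(signalB, 3)) # two clusters
unrelated   <- matrix(rnorm(6 * nSamples), nrow = 6)                 # no structure

oneClusterScore  <- meanAbsCorrelation(oneCluster)
twoClusterScore  <- meanAbsCorrelation(twoClusters)
unrelatedScore   <- meanAbsCorrelation(unrelated)
```

Under these assumptions, the single-cluster set should score highest, the two-cluster set in between, and the unstructured set lowest, which mirrors the behaviour described above.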

If you are interested in gene clustering, as opposed to simply looking at differential expression, this method is appropriate. If you feel limited by the choice of distance metrics in ermineJ, ORA would be an alternative, but you have to define distinct clusters of genes to do that.

One alternative use of correlation scoring is as a control for gene-score-based analysis. Correlated gene sets can cause spuriously high scores, especially if the differential expression in your study is weak. To use this approach, you could first run a gene-score-based analysis (e.g., gene score resampling) and then analyze the same data using correlation analysis. If any of your gene sets have high scores in both analyses, you should inspect the data to check whether the correlation is unrelated to the differential expression. This is a simple (but ad hoc) alternative to using resampling over the samples to do the gene-score-based analysis.

Method overview taken from: http://erminej.msl.ubc.ca/help/tutorials/running-an-analysis-correlation/

corr(
  expression,
  annotation = NULL,
  aspects = c("Molecular Function", "Cellular Component", "Biological Process"),
  iterations = 10000,
  geneReplicates = c("mean", "best"),
  pAdjust = c("FDR", "Bonferroni"),
  geneSetDescription = "Latest_GO",
  customGeneSets = NULL,
  minClassSize = 20,
  maxClassSize = 200,
  output = NULL,
  return = TRUE
)

Arguments

expression

Expression data, given as a file path or a data frame. Necessary for correlation analysis (test = CORR only). See http://erminej.msl.ubc.ca/help/input-files/gene-expression-profiles/ for the data format.

annotation

Annotation. A file path, a data.frame or a platform short name (e.g. GPL127). If given a platform short name, the annotation will be downloaded from the annotation repository of the Pavlidis Lab (https://gemma.msl.ubc.ca/annots/). To get a list of available annotations, use listGemmaAnnotations. Note that if there is a file or folder with the same name as the platform name in the directory, that file will be read instead of getting a copy from the Pavlidis Lab. If this file isn't a valid annotation file, the function will fail. If providing a custom annotation file, see makeAnnotation to create it from R or erminej.msl.ubc.ca/help/input-files/gene-annotations/ to create it manually.

If you are providing a custom gene set, you can leave annotation as NULL.

aspects

Character vector. Which GO aspects to include in the analysis. Can be in long form (e.g. 'Molecular Function') or short form (e.g. c('M','C','B')).

iterations

Number of iterations. We suggest a starting value of 10000 iterations. Once you have settled on the other parameters, we recommend a larger number of iterations (perhaps 200,000 or more). This gives the p-values enough precision for multiple-test correction to work correctly. (test = GSR, CORR and precRecall methods only)
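A rough way to see why more iterations help: assuming p-values are estimated by resampling, the smallest nonzero p-value that k iterations can resolve is about 1/k. The base-R sketch below compares that resolution with a Bonferroni-style threshold. The helper `resolvesThreshold` and the figure of 5000 gene sets are hypothetical and only illustrate the arithmetic.

```r
# Hypothetical helper: can `iterations` resamplings resolve p-values
# below a Bonferroni-corrected threshold for `nGeneSets` tests?
# Assumes the smallest nonzero empirical p-value is roughly 1 / iterations.
resolvesThreshold <- function(iterations, nGeneSets, alpha = 0.05) {
  resolution <- 1 / iterations      # finest p-value step available
  threshold <- alpha / nGeneSets    # per-test significance threshold
  resolution < threshold
}

nGeneSets <- 5000  # illustrative number of gene sets tested

lowIterations  <- resolvesThreshold(10000, nGeneSets)   # 1e-04 vs 1e-05
highIterations <- resolvesThreshold(200000, nGeneSets)  # 5e-06 vs 1e-05
```

With these illustrative numbers, 10000 iterations cannot produce a p-value small enough to survive the correction, while 200000 iterations can.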

geneReplicates

What to do when genes have multiple scores in the input file (due to multiple probes per gene). Can be "mean" or "best".

pAdjust

Which multiple test correction method to use. Can be "FDR" or 'Westfall-Young' (slower).

geneSetDescription

"Latest_GO", a file path that leads to a GO XML or OBO file, or a URL that leads to a GO ontology file that ends with rdf-xml.gz.

If you left annotation as NULL and provided customGeneSets, this argument is not required and will default to NULL. Otherwise, by default it'll be set to "Latest_GO" which downloads the latest available GO XML file. This option won't work without an internet connection. To get a frozen file that you can use later, see goToday, goAtDate and getGoDates. See http://erminej.msl.ubc.ca/help/input-files/gene-set-descriptions/ for details.

customGeneSets

Path to a directory that contains custom gene set files, paths to the custom gene set files themselves, or a named list of character strings. Use this option to create your own gene sets. If you provide a directory or files, you can specify probes or gene symbols to include in your gene sets. See http://erminej.msl.ubc.ca/help/input-files/gene-sets/ for information about the format of these files. If you are providing a list, only gene symbols are accepted.
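As an illustration of the named-list form, the base-R sketch below builds two toy gene sets and runs a few sanity checks before the list is passed on. The set names, the gene symbols chosen and the placeholder `myExpression` are hypothetical.

```r
# Hypothetical gene sets: list names are the set names,
# values are character vectors of gene symbols.
customGeneSets <- list(
  synapticSet = c("GRIN1", "GRIN2B", "DLG4", "SYN1"),
  myelinSet   = c("MBP", "PLP1", "MOG", "MAG")
)

# Basic sanity checks: every set is named, names are unique,
# and every element is a character vector.
stopifnot(
  !is.null(names(customGeneSets)),
  all(nzchar(names(customGeneSets))),
  !any(duplicated(names(customGeneSets))),
  all(vapply(customGeneSets, is.character, logical(1)))
)

# The list could then be supplied to corr, for example:
# corr(expression = myExpression, annotation = "GPL127",
#      customGeneSets = customGeneSets)
# where `myExpression` is a placeholder for your own expression data.
```

These toy sets are smaller than the default minClassSize of 20, so in real use you would presumably need larger sets or a lower minClassSize.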

minClassSize

Minimum class size. Gene sets smaller than this are excluded from the analysis.

maxClassSize

Maximum class size. Gene sets larger than this are excluded from the analysis.

output

Output file name.

return

Whether the results should be returned. Set to FALSE if you only want an output file.

Value

A list containing a "results" component and a "details" component. "results" is a data.frame containing the main output. The columns of this table are

Name: the name of the gene set
ID: the id of the gene set
NumProbes: the number of elements (e.g. probes) in the gene set.
NumGenes: the number of genes in the gene set.
RawScore: the raw statistic for the gene set. For explanations see this page
Pval: the p-value for the gene set.
CorrectedPvalue: the corrected p-value. See this page for more information.
MFPvalue: the p-value after multifunctionality correction. Might be missing if correction was not performed.
CorrectedMFPvalue: like CorrectedPvalue, but for the multifunctionality “corrected” p-value.
Multifunctionality: How biased the genes in the set are towards multifunctional genes.
Same as: a list of gene sets which have the exact same members as this one. Such gene sets are not listed anywhere else.
GeneMembers: If you selected the “Include genes” option when saving, this will contain a list of the genes that are in the gene set, separated by “|”.

The "details" component contains the settings that were used to run the analysis.
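The sketch below shows one way the "results" table might be post-processed in base R. It uses a mock data.frame with a subset of the columns listed above; all values and set IDs are invented for illustration, and the 0.05 cutoff is an arbitrary choice.

```r
# Mock stand-in for the "results" data.frame (all values are invented).
results <- data.frame(
  Name = c("set A", "set B", "set C"),
  ID = c("set_1", "set_2", "set_3"),
  NumGenes = c(25, 40, 31),
  Pval = c(0.0002, 0.03, 0.4),
  CorrectedPvalue = c(0.004, 0.2, 0.8),
  GeneMembers = c("G1|G2|G3", "G4|G5", "G6|G7|G8|G9"),
  stringsAsFactors = FALSE
)

# Keep gene sets below a chosen corrected p-value cutoff, best first.
significant <- results[results$CorrectedPvalue < 0.05, ]
significant <- significant[order(significant$CorrectedPvalue), ]

# Split the "|"-separated GeneMembers column into character vectors,
# one vector per gene set, named by set ID.
members <- strsplit(results$GeneMembers, "|", fixed = TRUE)
names(members) <- results$ID
```

Note that `fixed = TRUE` is needed in `strsplit` because "|" is otherwise interpreted as a regular-expression alternation.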