Signature Analysis#
This page provides guidelines for curating and analyzing RNA-based gene expression signatures. These signatures are typically published in peer-reviewed literature and made publicly available through trusted repositories.
What is a Signature?#
A gene signature is a defined set of genes whose collective expression pattern is associated with a specific biological process, disease state, or response to therapy. Gene signatures are used in enrichment analysis, clustering, classification, and predictive modeling, and are increasingly applied to identify associations with drug response in both preclinical models and clinical datasets.
Signature Sources#
Gene signatures are available from:
- MSigDB (Molecular Signatures Database) – curated gene sets for enrichment analysis
  → https://www.gsea-msigdb.org/gsea/msigdb
- SignatureSets R package – curated gene sets, manually annotated and versioned
  → bhklab/SignatureSets
  - Annotated with GENCODE v40
  - Uses HUGO gene symbols linked to Entrez and Ensembl IDs
Computing Signature Scores#
Depending on the type of signature, different scoring strategies are used:
▸ Unweighted Signatures#
- Methods: GSVA, ssGSEA
- Description: Compute enrichment scores across samples without gene-level weights
▸ Weighted Signatures#
- Method: Weighted mean expression
- Weights: +1 (upregulated), –1 (downregulated)
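As a rough illustration of the weighted mean for a single sample, here is a minimal Python sketch using only the standard library. The gene names and expression values are hypothetical, and dividing by the sum of absolute weights is an assumption, since the normalization is not specified here and individual signatures may define it differently.

```python
def weighted_signature_score(expression, weights):
    """Weighted mean expression of one sample over a signature.

    expression: dict mapping gene symbol -> expression value
    weights:    dict mapping gene symbol -> weight (+1 up, -1 down)
    Signature genes missing from the profile are skipped (an assumption).
    """
    shared = [gene for gene in weights if gene in expression]
    total_weight = sum(abs(weights[gene]) for gene in shared)
    if total_weight == 0:
        raise ValueError("no weighted signature genes found in the profile")
    weighted_sum = sum(weights[gene] * expression[gene] for gene in shared)
    return weighted_sum / total_weight


# Hypothetical signature and sample, for illustration only.
signature = {"GENE_A": 1, "GENE_B": 1, "GENE_C": -1}
sample = {"GENE_A": 5.0, "GENE_B": 3.0, "GENE_C": 2.0, "GENE_D": 7.0}
score = weighted_signature_score(sample, signature)
```

With this convention, high expression of downregulated genes lowers the score, while genes outside the signature (GENE_D here) have no effect.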
▸ Signature-Specific Algorithms#
- Some signatures require custom computation as defined in their original publication
→ e.g., bhklab/PredictioR
Signature Analysis Workflows#
🔹 Cluster Analysis#
- Compute pairwise gene overlaps between signatures
- Perform PCA on the overlap matrix
- Cluster using Affinity Propagation Clustering (apcluster)
- For each cluster, aggregate genes and perform pathway enrichment analysis (e.g., KEGG, Reactome, or GO Biological Process) using tools like enrichR, clusterProfiler, or gprofiler2
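The first step of this workflow, the pairwise overlap matrix, can be sketched in Python with the standard library. The Jaccard index is used here as the overlap measure, which is an assumption since the measure is not specified above; the signature names and genes are hypothetical. The PCA and clustering steps are not shown.

```python
from itertools import combinations


def jaccard(genes_a, genes_b):
    """Jaccard index: shared genes divided by the union of both gene sets."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_matrix(signatures):
    """Return sorted signature names and a symmetric nested-dict overlap matrix."""
    names = sorted(signatures)
    matrix = {n: {m: 1.0 if n == m else 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        value = jaccard(signatures[n], signatures[m])
        matrix[n][m] = matrix[m][n] = value
    return names, matrix


# Hypothetical signatures, for illustration only.
signatures = {
    "SIG_1": ["GENE_A", "GENE_B", "GENE_C"],
    "SIG_2": ["GENE_B", "GENE_C", "GENE_D"],
    "SIG_3": ["GENE_E"],
}
names, overlaps = overlap_matrix(signatures)
```

The resulting matrix is symmetric with ones on the diagonal, so it can be passed to a dimensionality reduction or clustering step as a similarity matrix.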
🔹 Correlation Analysis#
To assess similarity or redundancy between gene signatures:
- Compute Spearman and/or Pearson correlation coefficients between signature scores across samples.
  - Spearman is rank-based, captures monotonic relationships (including non-linear ones), and is robust to outliers.
  - Pearson measures linear relationships and is sensitive to outliers and extreme values.
- Use the cor() function in R with method = "spearman" or "pearson" as appropriate.
- Visualize the correlation matrix using the corrplot R package or a heatmap (e.g., pheatmap or ComplexHeatmap) to identify clusters or patterns among signatures.
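To make the difference between the two coefficients concrete, the following Python sketch computes both from scratch using the standard library. It is an illustration of the definitions, not a replacement for cor(); the two score vectors are made up, with the second an exponential transform of the first so that the relationship is monotonic but not linear.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical signature scores across five samples.
score_a = [1.0, 2.0, 3.0, 4.0, 5.0]
score_b = [math.exp(v) for v in score_a]  # monotonic but non-linear in score_a
rho = spearman(score_a, score_b)
r = pearson(score_a, score_b)
```

Because the ordering of samples is identical in both vectors, Spearman reports a perfect monotonic association, while Pearson is pulled below 1 by the curvature.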
🔹 Association Analysis#
Clinical Associations#
- Binary outcomes (e.g., response vs. no response): Logistic regression
- Time-to-event data (e.g., progression-free survival): Cox proportional hazards model
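For the binary-outcome case, the sketch below fits a logistic regression of response on a single signature score by Newton-Raphson, in Python with the standard library. It is meant only to show what the model estimates; in R the same model is fitted with glm() using family = binomial. The scores and responses are hypothetical, the function names are illustrative, and the Cox model is not covered here.

```python
import math


def predict_probability(x, intercept, slope):
    """Fitted probability of response for a signature score x."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Maximum-likelihood fit of y ~ intercept + slope * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = predict_probability(xi, b0, b1)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * xi     # gradient w.r.t. slope
            h00 += w                # entries of the information matrix X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0:
            raise ValueError("singular information matrix (constant score?)")
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


# Hypothetical signature scores and responses (1 = response, 0 = no response).
scores = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.7, 2.0]
response = [0, 0, 1, 0, 1, 0, 1, 1]
intercept, slope = fit_logistic(scores, response)
odds_ratio = math.exp(slope)  # odds ratio per one-unit increase in score
```

Note that this plain Newton-Raphson loop does not converge when the score perfectly separates responders from non-responders; the toy data above deliberately overlap.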
Preclinical Associations#
- Drug response metrics (e.g., IC50, AUC): Spearman or Pearson correlation
Tip
Apply multiple testing correction (e.g., Benjamini-Hochberg FDR, Bonferroni) where appropriate.
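As an illustration of what these corrections do, here is a short Python sketch of Benjamini-Hochberg and Bonferroni adjusted p-values using only the standard library. The input p-values are made up; in R, p.adjust() provides both methods.

```python
def benjamini_hochberg(p_values):
    """BH-adjusted p-values, returned in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical p-values from testing four signatures.
p_values = [0.01, 0.04, 0.03, 0.005]
bh = benjamini_hochberg(p_values)
bonf = bonferroni(p_values)
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate and is usually preferred when many signatures are tested.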
Additional Tools and Packages#
| Purpose | Tool / Package |
|---|---|
| Signature scoring | GSVA, ssGSEA, custom code |
| Clustering | apcluster, hclust |
| Enrichment analysis | enrichR, clusterProfiler, fgsea |
| Visualization | corrplot, ComplexHeatmap, pheatmap, ggplot2 |
| Association modeling | glm, survival (includes coxph), caret |
Note
The choice of packages or pipelines may vary based on the specific research question and analysis goals.
Related Concept: What is Pathway Analysis?#
While signature analysis scores samples using predefined gene sets (signatures), often linked to clinical traits or phenotypes, pathway analysis identifies biological pathways (e.g., KEGG, Reactome, GO) that are statistically enriched in a gene list or a ranked gene list (e.g., from differential expression). A widely used gene set collection in cancer research is the MSigDB Hallmark collection.
Pathway analysis is often used alongside or after signature analysis to interpret what broader biological processes the gene sets represent.
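One common formulation of "statistically enriched in a gene list" is over-representation analysis with a hypergeometric test. The Python sketch below, using only the standard library, shows that calculation on hypothetical gene identifiers; the choice of test and the function names are assumptions for illustration, and ranked-list methods such as GSEA work differently.

```python
from math import comb


def hypergeom_upper_tail(universe_size, pathway_size, list_size, overlap):
    """P(X >= overlap) when drawing list_size genes without replacement."""
    total = comb(universe_size, list_size)
    max_overlap = min(pathway_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def over_representation_p(gene_list, pathway_genes, universe):
    """Enrichment p-value of a pathway in a gene list, within a gene universe."""
    universe = set(universe)
    genes = set(gene_list) & universe
    pathway = set(pathway_genes) & universe
    return hypergeom_upper_tail(
        len(universe), len(pathway), len(genes), len(genes & pathway)
    )


# Hypothetical universe of ten genes, a three-gene pathway, and a four-gene list.
universe = [f"G{i}" for i in range(1, 11)]
pathway = ["G1", "G2", "G3"]
gene_list = ["G1", "G2", "G4", "G5"]
p_value = over_representation_p(gene_list, pathway, universe)
```

The choice of universe (for example, all expressed genes rather than all annotated genes) strongly affects the p-value, and results across many pathways should be corrected for multiple testing.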
Tip
Use pathway analysis when your research question involves uncovering mechanisms or biological context; use signature analysis when focusing on prediction, classification, or phenotypic scoring.