Immuno-Oncology Clinical Trial Curation

Welcome to the Immuno-Oncology Curation Guide!

Whether you're new to the lab or need a quick refresher, this guide walks you through the process of curating Immuno-Oncology (IO) clinical trial datasets into structured, analysis-ready R objects. The goal is to standardize raw and processed data into clean MultiAssayExperiment (MAE) objects for use in downstream analysis and collaborative research.

What Is IO Curation?

IO curation is the process of transforming clinical trial data into reusable, analysis-ready formats compatible with R-based workflows.

Each curated dataset includes three key components:

Clinical metadata: Patient/sample-level information such as treatment, response, survival, demographics.
Molecular profiles: Expression data (RNA-seq or microarray) and, when available, single-nucleotide variant (SNV) and copy number alteration (CNA) data. These are formatted as SummarizedExperiment (SE) or RangedSummarizedExperiment (RangedSE) objects, depending on the assay type.
Annotation data: Row-level annotations (e.g., Ensembl ID, gene name) are stored within each assay.

We curate all data into the MultiAssayExperiment (MAE) format. All publicly available curated datasets are located on ORCESTRA. We recommend downloading one to explore the clinical metadata and molecular assay structure.

Step-by-Step Workflow

An example clinical data processing pipeline is available in the ICB_Van_Allen Snakemake workflow.

The standard curation process includes:

Access and download source data (raw or processed)
Process or import molecular data (e.g., RNA-seq, SNV, CNA), ensuring standardized formats and identifiers
Process and clean clinical metadata, harmonizing variable names, response labels, and survival fields
Add standardized annotations (e.g., drug names, gene identifiers, tissue types)
Create SE or RangedSE objects, depending on assay type
Assemble the final MAE object, integrating all data components
Review a reference IO dataset: a curated example is available on ORCESTRA.

1. Download Source Data

Begin by reviewing the original publication to confirm study design, molecular assays, and whether the data is public or private.

Dataset Categories

Private datasets: Stored internally (e.g., Box, institutional drives). May include PHI and require ethics approval.
Public datasets: Available via GEO, dbGaP, Zenodo, and EGA.

1.1 Molecular Data

If raw RNA-seq data (FASTQ files) are not available, look for processed files by modality:

RNA-seq: TPM or count matrices (CSV, TSV, Excel)
RNA-seq: Isoform-level expression (optional but recommended)
Microarray: Normalized expression matrices (e.g., quantile normalized)
DNA (SNV): VCF, MAF, or binary gene-level mutation calls
DNA (CNA): Gene-by-sample matrices or segment files

1.2 Expression and Mutation Data

RNA sequencing (RNA-seq) quantifies gene expression by aligning, or in the case of kallisto pseudoaligning, RNA reads to a reference genome or transcriptome.

There are two options depending on data availability:

If only processed RNA-seq data is available: Use the provided gene-level TPM or count matrices (CSV, TSV, or Excel format). Include isoform (transcript-level) data when available.
If RNA-seq FASTQ files are available: Use the kallisto Snakemake pipeline available on HPC4Health (H4H). FASTQ files are typically stored under /cluster/projects/bhklab/rawdata/EGA/ and the pipeline is located in pipelines/kallisto_snakemake_pipeline/, with setup instructions in README.md. Expression values can be extracted using this script.

For microarray data, follow the same structure using quantile-normalized expression matrices. For SNV data, use either pre-processed mutation calls, or extract SNVs directly from FASTQ files using appropriate variant-calling pipelines (e.g., WES and RNA-seq reference).

1.3 Clinical Metadata

Clinical metadata should be collected as CSV or Excel files and should include:

Patient/sample identifiers
Treatment and response information
OS/PFS time and event censoring (highly preferred)

2. Process Molecular Data

You will need TPM values for downstream analysis, whether derived from raw FASTQ files or already processed expression data. The final output should be log-transformed TPM.

If you have TPM, use:

log2(TPM + 0.001)

If you have raw counts, convert to TPM using:

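# Convert raw counts (genes x samples) to TPM.
# gene_length: one length per gene, in the same order as the rows of counts.
GetTPM <- function(counts, gene_length) {
  x <- counts / gene_length            # length-normalized rate per gene
  return(t(t(x) * 1e6 / colSums(x)))   # scale each sample to sum to 1e6
}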
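
As a usage sketch, the conversion and the log transform can be chained as shown here. The count matrix and gene lengths are made-up toy values, not data from any curated study, and the function is repeated only so the block runs on its own.

```r
# Toy counts (genes x samples) and gene lengths in base pairs; all values are hypothetical.
counts <- matrix(
  c(100, 200,
    300, 600,
    600, 1200),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("S1", "S2"))
)
gene_length <- c(geneA = 1000, geneB = 3000, geneC = 2000)

# Same conversion as in the guide, repeated so this block is self-contained.
GetTPM <- function(counts, gene_length) {
  x <- counts / gene_length
  t(t(x) * 1e6 / colSums(x))
}

tpm <- GetTPM(counts, gene_length)
colSums(tpm)  # each sample sums to 1e6 by construction

# Log-transform with the pseudocount used in this guide.
expr_tpm <- log2(tpm + 0.001)
```

Because each sample is rescaled to sum to one million, the unit of gene_length (bp or kb) does not change the result, as long as it is the same for every gene.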

Other data types:

SNV data: Binary gene × sample matrix preferred
CNA data: Gene-level amplifications, deletions, or summary scores
Ensure row and column names are clean, and sample IDs are consistent across all data types
See helpful utility functions in the ICB_Common/code repository
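
One way to check sample ID consistency is sketched below in base R. The helper check_sample_ids and the toy inputs are hypothetical illustrations; they are not functions or data from the ICB_Common repository.

```r
# Hypothetical inputs: two assay matrices (features x samples) and a clinical table.
expr <- matrix(0, nrow = 2, ncol = 3,
               dimnames = list(c("g1", "g2"), c("P1", "P2", "P3")))
snv <- matrix(0L, nrow = 2, ncol = 2,
              dimnames = list(c("g1", "g2"), c("P1", "P4 ")))  # note the trailing space
clin <- data.frame(patientid = c("P1", "P2", "P3"), stringsAsFactors = FALSE)

# For each assay, list duplicated IDs, IDs with stray whitespace,
# and IDs that have no match in the clinical table.
check_sample_ids <- function(assays, clin_ids) {
  lapply(assays, function(m) {
    ids <- colnames(m)
    list(
      duplicated      = ids[duplicated(ids)],
      untrimmed       = ids[ids != trimws(ids)],
      not_in_clinical = setdiff(ids, clin_ids)
    )
  })
}

report <- check_sample_ids(list(expr = expr, snv = snv), clin$patientid)
report$snv
```

An empty report for every assay suggests the IDs are ready for assembly; anything listed should be resolved before building the SE and MAE objects.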

3. Process Clinical Data

Format clinical metadata as:

Rows: patient/sample IDs
Columns: clinical attributes

3.1 Mandatory Columns

Column name	Description
patientid	Unique patient identifier
treatmentid	Treatment regimen of each patient. Individual drug names are separated by ":" and standardized based on the lab's nomenclature. For example, the drug combination "FAC" is represented as "5-fluorouracil:Doxorubicin:Cyclophosphamide"
response	Response status of the patient to the given treatment: responder (R) or non-responder (NR)
tissueid	Cancer type standardized based on the lab's nomenclature from OncoTree. Example: "Breast"
survival_time_pfs/survival_time_os	Time from the start of treatment to the occurrence of the event of interest. The event name ("pfs" or "os") must be appended to survival_time to distinguish the survival measure. Example value: "2.6"
survival_unit	Unit in which survival time is measured. Times recorded in other units, such as "day" or "year", must be converted to "month" for consistency
event_occurred_pfs/event_occurred_os	Binary indicator of whether the event of interest occurred (1) or not (0). The event name ("pfs" or "os") must be appended to event_occurred to distinguish the survival measure

Note

The mandatory columns above must appear first in the metadata. You may add other metadata columns available in the source data, but they should come after the standardized columns.
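
The formatting rules in this section can be sketched in base R as follows. The input table is a made-up example, and the day-to-month factor (365.25 / 12 days per month) is an assumption, since the guide does not specify one.

```r
# Hypothetical raw clinical table; column values are illustrative only.
clin <- data.frame(
  age = c(61, 54),
  patientid = c("P1", "P2"),
  treatmentid = c("Nivolumab", "Ipilimumab:Nivolumab"),
  response = c("R", "NR"),
  tissueid = c("Skin", "Skin"),
  survival_time_os = c(730, 365),   # recorded in days in this toy example
  survival_unit = c("day", "day"),
  event_occurred_os = c(0, 1),
  stringsAsFactors = FALSE
)

# Convert all survival times to months (conversion factors are an assumption).
to_months <- c(day = 12 / 365.25, month = 1, year = 12)
time_cols <- grep("^survival_time_", names(clin), value = TRUE)
for (col in time_cols) {
  clin[[col]] <- clin[[col]] * unname(to_months[clin$survival_unit])
}
clin$survival_unit <- "month"

# Put the mandatory columns first, then any additional columns.
mandatory <- c("patientid", "treatmentid", "response", "tissueid",
               "survival_time_pfs", "survival_time_os", "survival_unit",
               "event_occurred_pfs", "event_occurred_os")
present <- intersect(mandatory, names(clin))
clin <- clin[, c(present, setdiff(names(clin), present)), drop = FALSE]

# Rows are indexed by patient ID.
rownames(clin) <- clin$patientid
```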

3.2 Additional Columns

The table below lists other columns that are common across the 19 curated immune checkpoint blockade (ICB) datasets.

Column name	Description	Type
age	Age	source
AMP	Sum of total AMP/coverage; calculated from CNA values	in-lab curation
cancer_type	Type of cancer tissue	source
CIN	Calculated from CNA values	in-lab curation
CNA_tot	Sum of total CNA/coverage; calculated from CNA values	in-lab curation
DEL	Sum of total DEL/coverage; calculated from CNA values	in-lab curation
dna	DNA sequencing type, e.g., whole exome sequencing	source
dna_info	Method for normalizing DNA sequencing data	in-lab curation
histo	Histological info such as subtype	source
indel_nsTMB_perMb	-	in-lab curation
indel_nsTMB_raw	-	in-lab curation
indel_TMB_perMb	-	in-lab curation
indel_TMB_raw	-	in-lab curation
nsTMB_perMb	-	in-lab curation
nsTMB_raw	-	in-lab curation
recist	Annotated using RECIST. The most commonly used responses are CR, PR, SD, PD.	source
response.other.info	Same data as Responders (R) and Non-responders (NR)	source
rna	Type of processed RNA data, e.g., TPM	source
rna_info	Method for normalizing RNA sequencing data	in-lab curation
sex	Sex of the patient - Male or Female	source
stage	Cancer stage	source
survival_type	PFS or OS or both (denoted by '/'). If both, added by in-lab curation	in-lab curation
TMB_perMb	TMB per megabase (Mb) calculated where: TMB = mutns/target; mutns = number of non-synonymous mutations; and target = target size of the sequencing. See Supplementary Table S2 of PMID: 36055464	in-lab curation
TMB_raw	Tumor Mutation Burden raw values	in-lab curation
treatment	Drug target or drug name	source

4. Add Annotations

Lab-standardized annotation data are stored in BHKLab-Pachyderm's Annotations repository.

4.1 Gene Annotations

Check the gene annotation version used in the original dataset (typically stated in the reference paper or supplement).

Then download the matching file from the BHKLab Annotations repository. The Gencode.v19.annotation.RData and Gencode.v40.annotation.RData files are preferred.

Each .RData file includes features_gene, features_transcript, and tx2gene.

Note

The goal is to retain as many genes as possible and match the original reference. Using a mismatched annotation version can drop gene entries and should be avoided.
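
When the annotation version is not stated in the source, one rough way to choose is to compare how many of the dataset's gene IDs each version covers. The sketch below uses made-up ID vectors; in practice they would come from the expression matrix row names and from features_gene in each .RData file, whose ID column name is not assumed here.

```r
# Hypothetical gene IDs from an expression matrix and from two annotation versions.
expr_genes <- c("ENSG00000000003.10", "ENSG00000000005.5",
                "ENSG00000000419.8", "ENSG00000999999.1")
annot_ids <- list(
  v19 = c("ENSG00000000003.10", "ENSG00000000005.5", "ENSG00000000419.8"),
  v40 = c("ENSG00000000003.15", "ENSG00000000419.14")
)

# Drop the Ensembl version suffix so IDs compare on the stable part only.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Fraction of dataset genes found in each annotation version.
coverage <- vapply(
  annot_ids,
  function(ids) mean(strip_version(expr_genes) %in% strip_version(ids)),
  numeric(1)
)
coverage
best_version <- names(which.max(coverage))
```

The version with the highest coverage is only a candidate; the reference paper or supplement remains the primary source for the annotation version.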

4.2 Drug Annotations

Standardize treatment names using BHKLab’s drug annotation file, drugs_with_ids.csv.

If the treatment is not listed there, search external databases such as PubChem to verify the correct drug name.

Note

For the treatment column, immunotherapy regimens are currently grouped into the following categories:

PD-1/PD-L1: Immune checkpoint inhibitors targeting PD-1 or PD-L1
CTLA4: Checkpoint inhibitors targeting CTLA-4
IO+combo: Combination immunotherapy
IO+chemo: Immunotherapy plus chemotherapy
IO+targeted: Immunotherapy plus targeted therapy
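
The grouping above could be derived from a standardized treatmentid along these lines. This is a sketch only: the drug_class lookup and the precedence rules in categorize_regimen are assumptions made for illustration, not the lab's actual mapping, and a real lookup would be built from the annotation files.

```r
# Hypothetical drug-class lookup (illustrative subset of drugs).
drug_class <- c(
  Nivolumab = "PD-1/PD-L1", Pembrolizumab = "PD-1/PD-L1", Atezolizumab = "PD-1/PD-L1",
  Ipilimumab = "CTLA4",
  Carboplatin = "chemo",
  Vemurafenib = "targeted"
)

# Assign one category to a ":"-separated regimen (rules are assumptions).
categorize_regimen <- function(treatmentid) {
  drugs <- strsplit(treatmentid, ":", fixed = TRUE)[[1]]
  classes <- unique(unname(drug_class[drugs]))
  if (anyNA(classes)) return(NA_character_)   # unknown drug: flag for manual review
  io <- intersect(classes, c("PD-1/PD-L1", "CTLA4"))
  if (length(io) == 0) return(NA_character_)  # no immunotherapy component
  if ("chemo" %in% classes) return("IO+chemo")
  if ("targeted" %in% classes) return("IO+targeted")
  if (length(io) > 1) return("IO+combo")
  io
}

regimens <- c("Nivolumab", "Ipilimumab:Nivolumab", "Pembrolizumab:Carboplatin")
treatment <- vapply(regimens, categorize_regimen, character(1))
treatment
```

Returning NA for unrecognized drugs keeps unmatched regimens visible for manual review instead of silently assigning a category.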

4.3 Tissue Annotations

Use OncoTree to map cancer types. If unmatched, perform manual review and map to standardized tissue categories.

5. Create SE or RangedSE

Use:

SummarizedExperiment: for expression or mutation matrices (TPM, SNV binary calls)
RangedSummarizedExperiment: for genomic ranges (e.g., VCFs with genomic coordinates)

Each object should include:

assay: main data matrix (features × samples)
rowData: feature metadata (e.g., gene symbol, Ensembl ID)
colData: sample-level metadata (clinical)
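
A minimal sketch of assembling these three parts with the Bioconductor SummarizedExperiment package is shown below. The matrix values, gene names, and the assay name expr_tpm are placeholders chosen for illustration.

```r
library(SummarizedExperiment)

# Toy log2-TPM matrix (features x samples); values are hypothetical.
expr <- matrix(
  c(1.2, 3.4, 0.5, 2.2, 4.1, 0.9),
  nrow = 3,
  dimnames = list(c("ENSG01", "ENSG02", "ENSG03"), c("P1", "P2"))
)

# Feature metadata: one row per feature, in the same order as the assay rows.
gene_annot <- DataFrame(
  gene_id = rownames(expr),
  gene_name = c("GENE1", "GENE2", "GENE3"),
  row.names = rownames(expr)
)

# Sample metadata: one row per sample, in the same order as the assay columns.
clin <- DataFrame(
  patientid = colnames(expr),
  response = c("R", "NR"),
  row.names = colnames(expr)
)

se_expr <- SummarizedExperiment(
  assays = list(expr_tpm = expr),
  rowData = gene_annot,
  colData = clin
)
se_expr
```

The row order of rowData must match the assay rows, and the row order of colData must match the assay columns, so it helps to reorder both tables by the matrix dimnames before calling the constructor.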

6. Build MAE

Integrate multiple assay types and clinical data into a single MAE object.

Required Components:

experiments(): a list of SE/RangedSE objects (e.g., expr, snv, cna)
colData(): the clinical metadata
sampleMap(): map linking sample IDs to patients across assays
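
The three components can be combined with the Bioconductor MultiAssayExperiment package roughly as follows. All data are toy values, and the helper make_se is a hypothetical shortcut used only to keep the example short.

```r
library(SummarizedExperiment)
library(MultiAssayExperiment)

# Hypothetical helper: wrap a matrix in a bare SummarizedExperiment.
make_se <- function(mat) SummarizedExperiment(assays = list(data = mat))

# Toy assays: expression for two patients, SNV calls for one of them.
expr <- matrix(c(1.2, 3.4, 0.5, 2.2, 4.1, 0.9), nrow = 3,
               dimnames = list(c("g1", "g2", "g3"), c("P1", "P2")))
snv <- matrix(c(0L, 1L, 1L), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), "P1"))

# Patient-level clinical metadata, with patient IDs as row names.
clin <- DataFrame(
  patientid = c("P1", "P2"),
  response = c("R", "NR"),
  row.names = c("P1", "P2")
)

experiment_list <- ExperimentList(expr = make_se(expr), snv = make_se(snv))

# sampleMap: one row per assay column, linking it to a patient in colData.
sample_map <- DataFrame(
  assay = factor(c("expr", "expr", "snv")),
  primary = c("P1", "P2", "P1"),   # patient ID (row name of colData)
  colname = c("P1", "P2", "P1")    # column name within the assay
)

mae <- MultiAssayExperiment(
  experiments = experiment_list,
  colData = clin,
  sampleMap = sample_map
)
mae
```

Here the assay column names happen to equal the patient IDs. When an assay uses its own sample identifiers, only the colname column changes, while primary keeps pointing at the patient.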

7. IO Example Dataset

View the dataset online at ICB_Van_Allen, available on ORCESTRA.

The following tabs are included:

Dataset Tab: Contains Gencode v19 annotations and related publication references.

Pipeline Tab:

Commit: Key scripts are available in this GitHub commit. Below is the structure of the folder:

📁  ICB_Van_Allen/
├── 📄 Snakefile                       # Snakemake workflow combining all scripts
└── 📁 scripts
    ├── 📄 format_downloaded_data.R   # Generates CLIN, EXPR, SNV input files
    ├── 📄 Format_CLIN.R              # Cleans and annotates clinical metadata
    ├── 📄 Format_EXPR.R              # Processes and log-transforms RNA expression data
    ├── 📄 Format_SNV.R               # Cleans SNV mutation data
    ├── 📄 Format_CNA_seg.R           # Segmented CNA profiles 
    ├── 📄 Format_CNA_gene.R          # Gene-level CNA profiles 
    └── 📄 Format_cased_sequenced.R   # Flags patients with RNA/CNA/SNV data

Script: Core functions for curating clinical and molecular data, ICB_Common
Annotation: Source for drug, tissue, and gene annotation files, Annotations