# Pipeline Standards
A set of standards that ORCESTRA will assume are followed in all future pipelines
This document outlines a set of standards that all pipelines should follow.
These standards are designed to ensure that pipelines are easy to use, reproducible, and well-documented.
They also make it easier for multiple users to collaborate on a pipeline and for others to reproduce its results.
Fundamentally, any user should be able to:
- Clone the repository from GitHub
- Run the pipeline using the exact command provided in the `README.md`
- Obtain the same results as the original pipeline run by the curator, without the pipeline failing
It is highly recommended to try these steps out from a fresh environment to ensure that the pipeline is reproducible.
## TL;DR
- **Version Control**: Use git for version control as early as possible. Store the pipeline in a git repository with a `README.md` file that explains how to run the pipeline.
- **Isolated Environments**: Use conda environments to define the environment the pipeline is run in, as well as the environment each rule is run in. If using containers, include the `Dockerfile` or `Singularity` file in the repository.
- **Directory Structure**: Follow a consistent directory structure, and use relative paths everywhere.
- **Data Description**: Clearly describe data sources, formats, and processing steps, and provide links or instructions to obtain non-public data.
- **Self-Documenting Code**: Ensure that the `Snakefile` and pipeline rules are self-documenting or well-commented. Use consistent naming conventions for rules, scripts, and environment files.
## Definitions
- **Pipeline/Workflow**: A set of rules that are run in a specific order to produce a set of output files.
- **Rule/Step**: A set of commands/scripts that are run to produce a set of output files from a set of input files.
- **ORCESTRA**: The project these standards are being developed for. This project will be used to store and run pipelines on the cloud.
- **Curator (of a pipeline)**: The person responsible for maintaining a pipeline and ensuring that it is up to date and working correctly.
- **User**: The person running the pipeline. This could be the curator, or someone else trying to reproduce the results of the pipeline.
- **Environment**: The set of software and dependencies required to run a pipeline or rule.
- **Local environment**: The environment on the user's computer, including the operating system, software, and dependencies installed there.
- **Cluster**: A set of computers that are connected together and used to run jobs in parallel.
- **Container**: Think of a container as a digital box that holds everything needed to run a program or application: all the necessary files, settings, and tools so that the program can work properly. Containers are isolated from the rest of the system, so they won't interfere with other programs or applications. Learn more at Docker.
- **Repository**: A place where code and other files are stored. This could be a git repository on GitHub, GitLab, or Bitbucket, or a folder on a shared drive or cloud storage.
- **README**: A file that contains information about the repository, such as instructions on how to run the pipeline, information about the data used in the pipeline, or any other relevant details.
- **Snakefile**: A file that contains the rules of the pipeline and is used by Snakemake to run it. See the Snakemake documentation for more information.
## 1. Version Control Your Pipeline
All pipelines should be version controlled using git. This allows for easy tracking of changes to the pipeline, as well as easy collaboration between multiple users.
- The pipeline must be stored in a git repository, with a `README.md` file that explains how to run the pipeline.
- The repository must not contain any large data files. These should be stored in a separate location, such as a shared drive or cloud storage, or ignored using `.gitignore`.
- The repository must contain an `environment.yaml` file that lists the specific version of `snakemake` and any other dependencies required to run the pipeline (a sketch is shown after this list; see the Isolated Environments section for more information).
- The repository must contain a `Snakefile` in the root directory that defines the rules of the pipeline.
- If the pipeline needs to be updated or modified in any way, the curator should create a new branch in the repository (i.e. `development`), make the changes, and then create a pull request to merge the changes into the `main` branch when they are certain that the pipeline is working as expected. This ensures that the `main` branch always contains a working version of the pipeline that can be run by users.
- For more information on git-flow and branching strategies, see the Atlassian Gitflow Workflow.
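A minimal sketch of such an `environment.yaml` (the package names and pinned versions are illustrative assumptions; pin the versions your pipeline actually uses):

```yaml
# environment.yaml -- illustrative example; pin the versions you actually use
name: my_pipeline
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - snakemake-minimal=8.10
```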
## 2. Isolated Environments
Snakemake allows each pipeline to define both the environment the `snakemake` call is run in and the environment each rule is run in.
This is a powerful feature: the pipeline is always run in a consistent environment, regardless of the user's local environment.
This is particularly important when running on a cluster, where the environment may differ from the user's local one, and it is essential for reproducibility.
See the Snakemake documentation on using conda environments for more information.
Note: though Snakemake allows using already existing conda environments, it is required that `.yaml` files are used to define environments.
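For example, a per-rule environment file might look like the following (the packages listed are illustrative assumptions); a rule would then reference it with `conda: "envs/build_PharmacoSet.yaml"`:

```yaml
# envs/build_PharmacoSet.yaml -- illustrative; list only what this rule needs
channels:
  - conda-forge
  - bioconda
dependencies:
  - r-base=4.3
  - bioconductor-pharmacogx
```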
If containers are built specifically for the pipeline (i.e. not reusing an existing container made by someone else), the `Dockerfile` or `Singularity` file should be included in the repository, along with instructions on how to build the container.
## 3. Documentation
All pipelines should be well-documented, both in the code and in the repository.
The `README.md` file should contain detailed instructions on how to run the pipeline, including any required input files, parameters, and output files.
### 3.1 Pipeline Instructions
At minimum, the `README.md` file should contain the exact command to run the pipeline, including any required parameters. This should be clear enough that users do not have to read through the entire `Snakefile` to figure out how to run the pipeline.

i.e. "To run the pipeline, use the following command: `snakemake --use-conda --cores <num_cores>`"
If any additional steps are required, you must let the user know in the README. This could include:

- Setting up additional directories
  - i.e. "Before running the pipeline you will need to create a `references` directory and download the data files from `<URL>` into this directory as `annotations.gtf`, using the following command: `mkdir references && wget <URL> -O references/annotations.gtf`"
### 3.2 Directory Structure
The directory structure of the pipeline should follow:
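A layout consistent with the paths used elsewhere in this document might look like the following (directories beyond those referenced in this document are assumptions; the exact layout is up to the curator):

```
.
├── README.md
├── Snakefile
├── environment.yaml
├── config.yaml        # pipeline configuration and data descriptions
├── envs/              # one conda .yaml per rule (or shared where appropriate)
│   └── build_PharmacoSet.yaml
├── scripts/           # one script per rule, named after the rule
│   └── build_PharmacoSet.R
├── rawdata/           # raw input data (not committed; see .gitignore)
├── procdata/          # intermediate/processed data (not committed)
├── references/        # reference files, e.g. annotations.gtf
├── results/           # final outputs, e.g. PharmacoSet.RDS
└── logs/              # one .log file per rule
```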
### 3.3 Data
All sources of data must be well described, either in the `README.md` or in the `config.yaml` file (a sketch is given after this list). This includes:

- The source of the data (e.g. a URL, a database, etc.)
  - If the data is publicly available, a link to the data must be provided. You cannot assume that any user will be able to find the data on their own.
  - If the data is not publicly available, the curator should provide instructions on how to obtain it. This could include contacting the curator directly, using a specific tool to download the data, or naming the point of contact who can provide the data (i.e. a lab or a database).
- As much metadata as possible about the data (e.g. the format of the data, the size of the data, etc.)
- Any processing that was done to the data (e.g. normalization, filtering, etc.) BEFORE you obtained it.
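As a sketch, data sources might be described in `config.yaml` along these lines (the keys and values shown are illustrative assumptions, not a required schema):

```yaml
# config.yaml -- illustrative only; the keys are not a required schema
data:
  rnaseq:
    description: "RNA-seq FASTQ files for the cell lines in the study"
    source: "https://example.org/rnaseq"   # placeholder URL
    format: "fastq.gz"
    preprocessing: "adapter trimming performed by the data provider"
  treatment_response:
    description: "Raw drug dose-response measurements"
    source: "not public; contact the curator"
    format: "csv"
```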
Since the pipeline needs to run on ORCESTRA, any data that is not publicly available must be shared with the ORCESTRA team. This could include:

- Providing the data to the ORCESTRA team directly
- Providing a link to the data that the ORCESTRA team can access
- Providing instructions on how to obtain the data from a specific location
  - i.e. if the RNA-seq data is on an HPC cluster, the ORCESTRA team should be provided with:
    - The location of the data (the path to the data on the cluster, i.e. `/cluster/projects/lab/rawdata/<your-data>`)
    - The path to the data in the pipeline (i.e. "The data is used in `rule build_RNASEQ_SE` as the input `rawdata/rnaseq/xyz.fastq`")

This will allow the team to organize the data in private cloud storage so that it can be accessed by the ORCESTRA platform (it will not be shared with anyone else).
Data produced as results of the pipeline should also be well described, including:

- The format of the output files (e.g. RDS, CSV, etc.)
- What each file contains (e.g. a table of gene expression values, a plot of the data, a list of `SummarizedExperiment` objects, a `PharmacoSet` object, etc.)
Relative paths MUST be used everywhere in the pipeline.
You cannot assume that the pipeline will be run from a specific directory or that everyone has the same directory structure on their computer.

- i.e. `input: "procdata/drugdata.RDS"` over `input: "/home/user/project/procdata/drugdata.RDS"`
- i.e. `input: "references/genes.gtf"` over `input: "/cluster/projects/bhklab/references/genes.gtf"`

If you need to use shared data (i.e. data that is not stored in your directory), you should create symbolic links to the data in your directory:

- i.e. run `ln -s /cluster/projects/bhklab/references/genes.gtf references/genes.gtf` and use `input: "references/genes.gtf"` in the pipeline.
- In your README, let the user know that they need to set up a `references` directory with the appropriate data files (a sketch of such instructions is shown below).
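A sketch of the setup commands such a README might give (the paths and the URL placeholder are illustrative):

```bash
# Illustrative setup commands for the README; adjust paths to your own data
mkdir -p references
# either link to shared data already available on the cluster...
ln -s /cluster/projects/bhklab/references/genes.gtf references/genes.gtf
# ...or download the file from the source described in the README
# wget <URL> -O references/genes.gtf
```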
### 3.4 Snakefile & Rules
The `Snakefile` and pipeline rules/steps should be self-documenting or well-commented.
This includes named rules, input/output files, and any parameters that are used in the rules.

Note: using directories as outputs is highly discouraged unless no other option is available, as this can lead to issues with the pipeline in the cloud or on a cluster.
- If using a tabulated input file, such as a `samples.tsv`, relative paths should be used over absolute paths.
- If using a `script` directive, the name of the script should be the same as the name of the rule.
  - i.e. `rule build_PharmacoSet` and `script: "scripts/build_PharmacoSet.R"`, or a similar naming.
- If using a `conda` directive, the name of the environment file should be the same as the name of the rule, unless a specific environment is required for multiple rules.
  - i.e. `rule build_PharmacoSet` and `conda: "envs/build_PharmacoSet.yaml"`
  - i.e. `rule build_RNASEQ_SE` and `rule build_CNV_SE` can use the same environment file `conda: "envs/build_SE.yaml"`
- If a rule requires specific resources, this should be declared in the rule itself and not in the `snakemake` call (see the sketch after this list). See the Snakemake Resource Documentation for more information.
  - i.e. `threads: 4` or `threads: 8`
  - i.e. `resources: mem_mb=8000` or `resources: mem_mb=16000`
- `log` files should be generated for each rule, and the log file should be named the same as the rule, with a `.log` extension.
  - i.e. `rule build_PharmacoSet` should generate a log file named `build_PharmacoSet.log`
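A minimal sketch of how per-rule resources and logging might be declared (the rule name, tool, file paths, and resource values are illustrative assumptions, not part of any real pipeline):

```
# Hypothetical rule; names, paths, tool, and resource values are illustrative only
rule align_RNASEQ:
    input:
        fastq = "rawdata/rnaseq/sample1.fastq",
        reference = "references/genes.gtf",
    output:
        bam = "procdata/rnaseq/sample1.bam",
    threads: 8                       # declared in the rule, not in the snakemake call
    resources:
        mem_mb = 16000               # declared in the rule, not in the snakemake call
    log:
        "logs/align_RNASEQ.log"      # log file named after the rule
    conda:
        "envs/align_RNASEQ.yaml"     # environment file named after the rule
    # the shell command writes its stdout/stderr to the rule's log file
    shell:
        "some_aligner --threads {threads} --annotation {input.reference} "
        "-o {output.bam} {input.fastq} > {log} 2>&1"
```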
**Bad rule:**

```
rule pset:
    input:
        # absolute paths, poor file naming
        "/home/user/mydata/data/drugdata.RDS",
        "/home/user/mydata/data/assay.RDS",
        "/home/user/mydata/data/cellline.RDS",
        "/home/user/mydata/data/drugsdata.RDS",
    output:
        "/home/user/mydata/data/output.RDS",
    log:
        "logs/log.log"
    conda:
        # using a user env instead of a yaml file
        "pharmacogx"
    script:
        # poor naming convention
        "scripts/combine.R"
```

**Good rule:**

```
rule build_PharmacoSet:
    input:
        tre = "procdata/treatmentResponseExperiment.RDS",
        mae = "procdata/MultiAssayExperiment.RDS",
        sampleMetadata = "procdata/sampleMetadata.RDS",
        treatmentMetadata = "procdata/treatmentMetadata.RDS",
    output:
        pset = "results/PharmacoSet.RDS",
    log:
        "logs/build_PharmacoSet.log"
    conda:
        "envs/build_PharmacoSet.yaml"
    script:
        "scripts/build_PharmacoSet.R"
```