Pipeline Storage
This document outlines how a pipeline will store its outputs and how the API will reference those outputs.
Context
Snakemake
In Snakemake, the outputs of a rule are declared in the output directive. This directive lists the files that Snakemake expects the rule to produce. If any of the files in the output directive do not exist, the rule will be run.
Basic example
In the following example, running snakemake --cores 1 will make Snakemake look for how output.txt can be produced. If output.txt does not exist, the rule my_rule will be run.
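A minimal Snakefile sketch consistent with this description might look like the following (the shell command is an assumption, and the file is placed under results/ so that it matches the GCS path shown in the next section):

```
# Sketch only: a single rule, my_rule, that produces results/output.txt.
# With no other rules defined, my_rule is the default target when running
# snakemake --cores 1.
rule my_rule:
    output:
        "results/output.txt"
    shell:
        "echo 'hello world' > {output}"
```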
Snakemake Storage Plugins
Snakemake has a plugin system that allows outputs to be stored in different locations. This is useful for storing outputs in a cloud storage system like AWS S3 or Google Cloud Storage. For example, the HTTP plugin will likely be used in the pipelines to retrieve raw data from publicly available sources.
The idea for the API is to take any pipeline and run it so that all of its outputs are stored in a cloud storage system, in this case Google Cloud Storage. This can be achieved by appending --default-storage-provider gcs to the snakemake command.

This requires a dedicated bucket for all the pipelines. For the sake of this example, the bucket will be called orcestra-data-public. We would then append --default-storage-provider gcs --default-storage-prefix gs://orcestra-data-public to the snakemake command.
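Combined with the basic example above, the full invocation would look something like the following sketch (only the flags described above are used; the GCS storage plugin is assumed to be installed):

```
snakemake --cores 1 \
    --default-storage-provider gcs \
    --default-storage-prefix gs://orcestra-data-public
```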
In the above example, this would then upload output.txt to gs://orcestra-data-public/results/output.txt (assuming no errors occur).
API Reference to Outputs
The requirements for the API would be:
1. Establish a Name for each pipeline. This would be used in the API to reference the pipeline itself for the front-end (pipeline selection), and to generate the GCS prefix for the pipeline outputs.
2. Upon submission of a pipeline, the user needs to specify its outputs by their paths relative to the root directory. These paths will then be used by the API, after the pipeline is run, to retrieve the outputs from GCS.
3. The API can then expose the outputs to the user/front-end client, like the previous orcestra implementation, except that instead of just the file name it should use the relative path(s) to the root directory.
So, if this is the given info (a random example, not a real pipeline):
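For illustration, such a submission might look roughly like this (the layout and field names are hypothetical; only the pipeline name and output paths are taken from this example):

```
# Hypothetical submission info; field names are illustrative.
name: test_pipeline
outputs:
  - results/pset.rds
  - results/dnl.json
```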
After the pipeline runs, the API will then retrieve the following files:

- gs://orcestra-data-public/test_pipeline/results/pset.rds
- gs://orcestra-data-public/test_pipeline/results/dnl.json

because the name of the pipeline is test_pipeline and the outputs are results/pset.rds and results/dnl.json.
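To make the path logic explicit, here is a small illustrative sketch (not the API's actual code) of how the bucket, pipeline name, and relative output paths combine into the GCS URIs listed above:

```
# Illustrative sketch only: compose GCS URIs from the bucket, the pipeline
# name, and the relative output paths supplied at submission time.
BUCKET = "orcestra-data-public"

def output_uris(pipeline_name, outputs):
    """Return the GCS URI for each output under the pipeline's prefix."""
    return [f"gs://{BUCKET}/{pipeline_name}/{path}" for path in outputs]

if __name__ == "__main__":
    # Values from the example above.
    for uri in output_uris("test_pipeline", ["results/pset.rds", "results/dnl.json"]):
        print(uri)
    # Prints:
    # gs://orcestra-data-public/test_pipeline/results/pset.rds
    # gs://orcestra-data-public/test_pipeline/results/dnl.json
```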