Working with DMP Directories
Over the years, we have developed a standardized directory structure for our projects through our data management plan (DMP). This structure is designed to:
- Facilitate reproducible research
- Provide a clear organization for data, code, and documentation
- Support collaboration and data sharing
- Promote familiarity and ease of use across projects
Understanding DMP Directories is mandatory in the BHKLab
This page is only a brief overview of the DMP directory structure. For full details, please take the time to read and understand the damply documentation.
The standardized DMP directory structure is implemented via the damply package, which provides tools and conventions for organizing project files in accordance with DMP guidelines.
DMP reproducibility litmus test
If you are unsure whether your project is DMP-reproducible, ask yourself the following questions:
- If I had no prior knowledge of the data used in this project, does the documentation provide enough information to understand the data, its sources, and how to obtain it?
- If I were a new collaborator, would the code and documentation be enough for me to understand and reproduce the project?
- If I were to delete all the procdata and results directories, could I reproduce the results with just the rawdata and workflow content? (A rough sketch of this check follows below.)
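To make the third check concrete, here is a minimal sketch (not part of damply) that clears the regenerable directories before re-running the workflow. The task name is a placeholder you would replace with your project's actual entry point:

import shutil
import subprocess
from pathlib import Path

project_root = Path(".")  # run this from the project root

# The procdata and results directories should be fully regenerable.
for name in ("data/procdata", "data/results"):
    target = project_root / name
    if target.exists():
        shutil.rmtree(target)
    target.mkdir(parents=True, exist_ok=True)

# Placeholder: re-run the workflow using only rawdata and workflow content.
# Replace "your-workflow-task" with your project's actual entry point.
subprocess.run(["pixi", "run", "your-workflow-task"], check=True)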
DMP Directory Structure
As of this writing, the recommended directory structure is as follows:
project_root/
├── config/ # Configuration files
├── data/ # All data in one parent directory
│ ├── procdata/ # Processed/intermediate data
│ ├── rawdata/ # Raw input data
│ └── results/ # Analysis outputs
├── logs/ # Log files
├── metadata/ # Dataset descriptions
└── workflow/ # Code organization
├── notebooks/ # Jupyter notebooks
└── scripts/ # Analysis scripts
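For illustration, the same layout can be reproduced with a short pathlib sketch (directory names are taken from the tree above; in practice the project template typically provides them for you):

from pathlib import Path

project_root = Path("project_root")  # replace with your project directory

# Directory names taken from the layout above.
dmp_dirs = [
    "config",
    "data/procdata",
    "data/rawdata",
    "data/results",
    "logs",
    "metadata",
    "workflow/notebooks",
    "workflow/scripts",
]

for d in dmp_dirs:
    (project_root / d).mkdir(parents=True, exist_ok=True)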
DamplyDirs
Overview
Assuming the above directory structure, the damply package provides a simple way to access these directories via the DamplyDirs class. This class takes advantage of the following environment variables, which are defined in the template's pixi.toml file:
[activation]
# convenient variables which can be used in scripts
env.CONFIG = "${PIXI_PROJECT_ROOT}/config"
env.METADATA = "${PIXI_PROJECT_ROOT}/metadata"
env.LOGS = "${PIXI_PROJECT_ROOT}/logs"
env.RAWDATA = "${PIXI_PROJECT_ROOT}/data/rawdata"
env.PROCDATA = "${PIXI_PROJECT_ROOT}/data/procdata"
env.RESULTS = "${PIXI_PROJECT_ROOT}/data/results"
env.SCRIPTS = "${PIXI_PROJECT_ROOT}/workflow/scripts"
This allows you to programmatically access the directories in your project without hardcoding paths, making your code more portable and easier to maintain.
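Because these variables are defined in pixi.toml's activation table, they are also available directly from the environment whenever the pixi environment is active (for example inside pixi run or pixi shell). A minimal sketch, independent of damply, using hypothetical file names:

import os
from pathlib import Path

# These variables come from the [activation] table of pixi.toml and are set
# whenever the pixi environment is active (e.g. pixi run or pixi shell).
rawdata = Path(os.environ["RAWDATA"])
results = Path(os.environ["RESULTS"])

counts_file = rawdata / "counts.tsv"      # hypothetical input file
summary_file = results / "summary.txt"    # hypothetical output file
print(counts_file)
print(summary_file)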
Example Usage
Here is an example of how to use the DamplyDirs class in your project:
from damply import dirs
fastq_file = dirs.RAWDATA / "fastq" / "sample_1.fq.gz"
print(f"Processing FASTQ file: {fastq_file}")
# Processing FASTQ file: /home/bhkuser/projects/data/rawdata/fastq/sample_1.fq.gz
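The other directories are accessed the same way. The sketch below assumes the attributes mirror the environment variable names and behave like pathlib.Path objects, as the example above suggests; the output path is purely illustrative:

from damply import dirs

# Write an illustrative output under the results directory.
# Assumes dirs.RESULTS behaves like a pathlib.Path, as dirs.RAWDATA does above.
qc_summary = dirs.RESULTS / "qc" / "qc_summary.csv"  # hypothetical output path
qc_summary.parent.mkdir(parents=True, exist_ok=True)
qc_summary.write_text("sample,status\nsample_1,pass\n")
print(f"Wrote QC summary to: {qc_summary}")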
A comprehensive walkthrough of the DamplyDirs utility can be found in the damply documentation.