Auditing the DMP on HPC4Health
What is HPC4Health?
High Performance Computing for Health (HPC4Health / H4H) is a computing cluster operated at the University Health Network. The audit functionality of damply was developed specifically for use by the BHKLab on H4H. The following instructions will not work unless you have access to the BHKLab project folders on H4H.
Damply can audit your DMP setup to track things like memory usage and metadata. This functionality is intended for managing the bhklab and radiomics project directories on H4H.
Installing Damply for the Command Line
- salloc onto the build node on H4H.
- Install uv, a tool that makes Python package tools available globally without messing around with virtual environments or base conda environments.
- Install Damply by running:
uv tool install --refresh --force --upgrade damply
Now you can use damply from the command line on any node on HPC4Health, including the login node.
$ which damply
~/.local/bin/damply
$ damply --help
Usage: damply [OPTIONS] COMMAND [ARGS]...
A tool to interact with systems implementing the Data Management Plan (DMP)
standard.
This tool is meant to allow sys-admins to easily query and audit the
metadata of their projects.
To enable logging, set the env variable `DAMPLY_LOG_LEVEL`.
Options:
--version Show the version and exit.
-h, --help Show this message and exit.
Commands:
audit Audit all subdirectories and aggregate damply output into...
full-audit Run a full audit for the specified project group.
groups-table Generate a user-group membership table from group names.
project Display information about the current project.
whose Print the owner of the file or directory.
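As a quick check after installation, the Python sketch below looks up damply on the PATH and runs it with DAMPLY_LOG_LEVEL set. The helper names damply_env and run_damply are hypothetical, and the value DEBUG is an assumption: the help text does not list the accepted levels.

```python
import os
import shutil
import subprocess


def damply_env(level: str = "DEBUG") -> dict[str, str]:
    """Copy the current environment and set DAMPLY_LOG_LEVEL.

    "DEBUG" is an assumed value; the help text does not list accepted levels.
    """
    env = dict(os.environ)
    env["DAMPLY_LOG_LEVEL"] = level
    return env


def run_damply(args: list[str], level: str = "DEBUG") -> int:
    """Run damply with logging enabled and return its exit code."""
    exe = shutil.which("damply")
    if exe is None:
        raise FileNotFoundError("damply not found on PATH (expected in ~/.local/bin)")
    return subprocess.run([exe, *args], env=damply_env(level)).returncode
```

For example, run_damply(["--version"]) should print the installed version if the uv tool install succeeded.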
Audit
TLDR pipeline
- Run damply full-audit bhklab to submit a bunch of jobs to the cluster
- Wait for the jobs to finish
- Run damply collect-audits bhklab to collect the results and save them as a CSV file
- Run damply plot <filepath> to create a damply plot HTML file
- scp the HTML file to your local machine and open it in Chrome
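The steps above could be scripted roughly as follows. This is a minimal sketch, not part of damply: the function names are hypothetical, and it assumes that the Slurm squeue command is available and that an empty queue for your user means the audit jobs are done.

```python
import os
import subprocess
import time


def pipeline_commands(project_group: str, csv_path: str) -> list[list[str]]:
    """The three damply commands from the list above, in order."""
    return [
        ["damply", "full-audit", project_group],
        ["damply", "collect-audits", project_group],
        ["damply", "plot", csv_path],
    ]


def active_job_ids(squeue_output: str, exclude: frozenset[str] = frozenset()) -> list[str]:
    """Parse `squeue -h -o %i` output (one job ID per line), skipping excluded IDs."""
    ids = (line.strip() for line in squeue_output.splitlines())
    return [job_id for job_id in ids if job_id and job_id not in exclude]


def wait_for_jobs(user: str, poll_seconds: int = 60) -> None:
    """Poll squeue until the user has no jobs left besides the current allocation."""
    # SLURM_JOB_ID is set inside an salloc session; skip it so we do not wait on ourselves.
    current = frozenset(filter(None, [os.environ.get("SLURM_JOB_ID")]))
    while True:
        result = subprocess.run(
            ["squeue", "-h", "-u", user, "-o", "%i"],
            capture_output=True, text=True, check=True,
        )
        if not active_job_ids(result.stdout, exclude=current):
            return
        time.sleep(poll_seconds)
```

Note that wait_for_jobs waits on every job owned by the user, not only the audit jobs, so unrelated long-running jobs will delay it.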
Full Audit
The basic audit batch submissions are built into the damply CLI tool.
$ damply full-audit --help
Usage: damply full-audit [OPTIONS] PROJECT_GROUP
Run a full audit for the specified project group.
This will essentially, submit a bunch of sbatch jobs to the cluster for all
the directories in the project group.
Options:
-f, --force-compute-details Force the computation of details for the
directory and subdirectories regardless of
cache.
Step 1: salloc into a compute node. This does not need a lot of compute power.
salloc -c 1 --mem=256 --time=0:15:0
Step 2: Run the full-audit command. This will submit a bunch of jobs that run damply audit on all the subdirectories. See the source code for details on how it chooses the directories to audit.
damply full-audit bhklab
damply full-audit radiomics
The results will be stored in:
/cluster/projects/bhklab/admin/audit/...
/cluster/projects/radiomics/admin/audit/...
Info
Damply uses caching to avoid recomputing the same metadata for directories that have not changed. This means that jobs that don't need recomputation will run in a few seconds, while jobs that do can take an hour or more, depending on the size of the directory and the number of files in it.
If you want to force recomputation of the metadata, use the --force-compute-details flag.
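To illustrate the general idea of this kind of caching, here is a minimal Python sketch. It is not damply's implementation: the function names, the JSON cache file, and the use of the directory's own modification time as the change signal are all assumptions made for the example. The force argument plays the role of --force-compute-details.

```python
import json
import tempfile
from pathlib import Path


def total_size(directory: Path) -> int:
    """The 'expensive' step: walk the whole tree and sum file sizes in bytes."""
    return sum(p.stat().st_size for p in directory.rglob("*") if p.is_file())


def cached_size(directory: Path, cache_file: Path, force: bool = False) -> tuple[int, bool]:
    """Return (size, recomputed), reusing the cache if the directory mtime is unchanged."""
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    key = str(directory.resolve())
    mtime_ns = directory.stat().st_mtime_ns
    entry = cache.get(key)
    if not force and entry is not None and entry["mtime_ns"] == mtime_ns:
        return entry["size"], False  # cache hit: skip the tree walk
    size = total_size(directory)
    cache[key] = {"mtime_ns": mtime_ns, "size": size}
    cache_file.write_text(json.dumps(cache))
    return size, True


# Demo on a throwaway directory; the cache file lives outside the audited directory.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "a.bin").write_bytes(b"x" * 10)
    cache_path = Path(tmp) / "cache.json"
    first = cached_size(data, cache_path)
    second = cached_size(data, cache_path)
    forced = cached_size(data, cache_path, force=True)
print(first, second, forced)
```

A directory's own modification time only changes when entries directly inside it are added, removed, or renamed, so a real tool needs a stronger change signal for nested files. The sketch keeps the weaker check for brevity.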
Collecting Results
After all the jobs have finished, we need to collect the results.
$ damply collect-audits --help
Usage: damply collect-audits [OPTIONS] PROJECT_GROUP
Collect audits for a project group (after full-audit).
keep_children: If
Options:
-f, --force Force collection even if summary exists
--keep-children Only collect source directories (aka higher level
directories only)
-h, --help Show this message and exit.
Run the command:
damply collect-audits bhklab
damply collect-audits radiomics
This will collect all the audit results from the subdirectories and aggregate them into a single CSV file. The results will be stored in:
/cluster/projects/bhklab/admin/audit/results/<date>/
/cluster/projects/radiomics/admin/audit/results/<date>/
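To locate the newest results without typing the date by hand, something like the following Python sketch could help. The helper names are hypothetical. Because the format of <date> and the CSV file name are not specified here, the sketch picks the most recently modified subdirectory and searches it recursively for CSV files.

```python
from pathlib import Path
from typing import Optional


def results_root(project_group: str, base: Path = Path("/cluster/projects")) -> Path:
    """Results directory for a project group, following the layout shown above."""
    return base / project_group / "admin" / "audit" / "results"


def latest_results_dir(root: Path) -> Optional[Path]:
    """Most recently modified subdirectory of root, or None if there is none."""
    if not root.is_dir():
        return None
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)


def find_csvs(results_dir: Path) -> list[Path]:
    """All CSV files under a results directory, sorted by path."""
    return sorted(results_dir.rglob("*.csv"))
```

For example, latest_results_dir(results_root("bhklab")) returns the newest dated directory, and find_csvs on that directory lists candidate files to pass to damply plot.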
Plotting Results
Run the command:
damply plot /cluster/projects/bhklab/admin/audit/results/<date>/<path_to_csv>
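Before plotting, it can be useful to confirm that the collected CSV is not empty. The Python sketch below reads only the header and counts rows. The column names in the demo are made up for illustration and are not damply's actual schema.

```python
import csv
import io
from typing import TextIO


def summarize_csv(handle: TextIO) -> tuple[list[str], int]:
    """Return the header row and the number of data rows that follow it."""
    reader = csv.reader(handle)
    header = next(reader, [])
    return header, sum(1 for _ in reader)


# Demo on an in-memory stand-in with hypothetical columns.
sample = io.StringIO("path,size\n/a,1\n/b,2\n")
header, n_rows = summarize_csv(sample)
print(header, n_rows)
```

With a real file, open it with open(csv_path, newline="") and pass the handle to summarize_csv.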