Organizing Projects on H4H🔗
H4H provides large group directories for each lab to store raw and processed data, as well as analysis code and results. While most groups have on the order of 50 to 100 TB of raw storage space available, this quickly becomes cluttered as multiple users accumulate data for different analyses over time. In the worst case, the drive quota can be completely saturated and prevent you and your colleagues from downloading new data or running analyses.
To prevent this situation, it is of the utmost importance to keep your projects organized and to archive data that is no longer being used actively.
The Data Management Plan (DMP)🔗
To keep data organized, as well as to streamline archiving and prevent duplication, we have created a general framework for organizing data on H4H called the Data Management Plan, or DMP. By adopting this framework in your projects and analyses, you can ensure that you are making optimal use of the H4H resources without interfering with the work of your colleagues or creating extra work for the H4H data manager/coordinator.
Adherence to the DMP is mandatory, and scripts to automate data audits have been implemented. If you are found to be in violation of the DMP, you will first be notified by email. Failure to act will result in escalation of the issue to senior lab staff or even your PI. Please be respectful of other people's time and do your best to keep your projects organized and up to date. While initial adoption may require some extra planning, doing so will ultimately simplify your workflow and ensure you can maximize both your productivity and the reproducibility of your analyses.
The DMP is simply a set of standard directory names which split up data into logical components of a scientific analysis. The basic structure is as follows:
- rawdata: immutable raw data files, plus a README file for each sub-directory describing the source of the data, its version, and its owner (see README structure below)
    - data directories will be structured as `<source>_<model system>/<molecular data type>/<YYYYMM>/…`
    - e.g., `gdsc_human/rnaseq/{201811/…, 201901/…, latest/<symlink to most recent>}` (see the sketch after this list)
- procdata: pre-processed data for use in projects (e.g., normalized) as well as a README file describing the data
- structure mirrors the raw data folder
- projects: each sub-directory should be named after a project; inside the project should be folders for results, figures, etc., as well as a README file describing the project
- These folders should be small; ideally, small enough to sync each one to a GitHub repo for version control
- pipelines: bioinformatics processing pipelines that have been generalized for use with multiple projects
- references: immutable standard reference genomes; can symlink to these in your project to avoid duplication
- scripts: general-purpose scripts related to data management, job submission, and any other scripts which may be useful to other lab members. Scripts in this directory must include a man page with basic usage (parameter/flags, input, output, etc.).
- e.g., Chris will put in some scripts he wrote for transferring data and checking its integrity
- tmp: a scratch space for temporary files generated throughout an analysis
- This directory will be deleted if we run out of disk quota, so only use it for temporary data!
- archive: a directory where data ready for archiving should be moved; automatic archiving jobs are scheduled once a week on Fridays!
    - Please move your directory into the corresponding source folder in `bhklab/archive` (i.e., rawdata, procdata, projects, etc.). For example, to archive `bhklab/procdata/my_project`, I would move it to `bhklab/archive/procdata/my_project`.
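As a concrete illustration, here is a minimal sketch of these conventions in shell commands, reusing the `gdsc_human` example above and assuming you start in your group's root directory (e.g., `/cluster/projects/bhklab`):

```bash
# create a dated raw data directory: <source>_<model system>/<molecular data type>/<YYYYMM>
mkdir -p rawdata/gdsc_human/rnaseq/201901

# re-point the 'latest' symlink at the most recent version
ln -sfn 201901 rawdata/gdsc_human/rnaseq/latest

# stage a dataset that is no longer in active use for the weekly archiving job
mkdir -p archive/rawdata/gdsc_human/rnaseq
mv rawdata/gdsc_hum an/rnaseq/201811 archive/rawdata/gdsc_human/rnaseq/201811
```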
README structure:

- Write the tags EXACTLY as shown here, as they will be used in scripts to ensure all files are properly documented; if a regex (grep) on your README does not return lines with the mandatory tags, your folder will be flagged as violating the DMP! (A hypothetical example is sketched below.)
- `#OWNER`: a mandatory tag specifying who should be contacted for questions about this data
- `#DATE`: a mandatory tag with the date the directory was created, in the format YYYY/MM
- `#DESC`: a brief description of the directory contents, with information such as the source of the data
- You can add any other information you like; we recommend including a brief overview of your directory structure to help others understand where to look for specific files
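For instance, a minimal README satisfying the mandatory tags might be written as below. The owner name, contact, and description are made up, and the grep is only a guess at what the audit script checks:

```bash
# illustrative only: write a README with the three mandatory tags (contents are made up)
cat > README <<'EOF'
#OWNER: jsmith (contact jsmith@example.com with questions about this data)
#DATE: 2022/10
#DESC: TCGA RNA-seq files downloaded from the GDC data portal

Directory overview:
  202210/  original download
  latest   symlink to the most recent version
EOF

# a hypothetical version of the audit check: the folder would be flagged
# if this grep returns no lines containing the mandatory tags
grep -E '^#(OWNER|DATE|DESC)' README || echo "DMP violation: missing mandatory tags"
```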
Recommended Project Structure🔗
How you organize your work within the projects DMP folder is not audited as part of the DMP; however, mirroring the DMP structure for the basic components of your project makes it much easier to conform to this framework.
A key tool for achieving the goals of the DMP is the symbolic link (or "symlink"). Symlinks act much like hyperlinks in your web browser, allowing you to reference files or directories in other locations on your file system. Such links are essential for separating the different types of data (simplifying archiving) while still running your analysis from within a single project directory.
Symbolic Links🔗
To make a symbolic link, run:
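```bash
ln -s <path to source> <path to link>  # -s creates a symbolic (rather than hard) link
```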
You can confirm this worked by running:
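```bash
ls -l <path to link>  # a long listing shows where the link points
```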
You should see that the symbolic link you just created points (`->`) to the source directory from your command.
Try it yourself:

- SSH into the login node

  Solution:

  ```bash
  ssh -p "$H4HLOGIN_PORT" "<username>@$H4HLOGIN"
  ```

- Allocate an interactive job

  Solution:

  ```bash
  salloc -c 2 --mem=8G -t 0:30:0
  ```

- Navigate to the projects directory

  Solution:

  ```bash
  cd "$PROJ"  # assumes you set up the H4H environment variables as in Tutorial 0
  ```

- Create a new project directory and navigate to it

  Solution:

  ```bash
  mkdir "$(whoami)_test_project"
  cd "$(whoami)_test_project"
  pwd  # check that it worked
  ```

- Create a new data directory in the procdata DMP folder

  Solution:

  ```bash
  mkdir "$PDATA/$(whoami)_test_data"
  ```

- Create a symbolic link in your project to the procdata directory you just created in your group directory, and confirm that it worked

  Solution:

  ```bash
  ln -s "$PDATA/$(whoami)_test_data" "$(pwd)/procdata"  # recommended to use absolute paths
  ls -lah  # should see that procdata -> /cluster/projects/bhklab/procdata/<test data directory>
  ```

- Remove the folders you created

  Solution:

  ```bash
  rm procdata  # removes only the symlink, not the data it points to
  cd ..
  rmdir "$(whoami)_test_project"
  rmdir "$PDATA/$(whoami)_test_data"
  # Note: use of rm -r is NOT recommended on H4H! This command is dangerous!
  ```
Project Template🔗
To simplify project creation for lab members, a basic project template is available in the `templates/basic_project` directory. It includes the recommended directory structure to make following the DMP easy, as well as README files explaining the intended use of each directory.
Copy this folder structure for your new projects, and following the DMP should be very straightforward. You can create additional directories or modify the template as needed for your specific analysis use case.
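For example, assuming (as a guess) that the templates folder sits in your group directory alongside the other DMP folders, and reusing the `$PROJ` variable from the exercises above, starting a new project might look like:

```bash
# copy the template into the projects DMP folder under your new project's name
cp -r /cluster/projects/bhklab/templates/basic_project "$PROJ/my_new_project"
```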
DMP Project Workflow🔗
As mentioned above, the purpose of the DMP is not to micromanage lab members as they create and run their analyses. It is instead to ensure that all data is kept on H4H for the minimum time possible and that all analyses are well documented and reproducible.
To achieve this, regular data archiving is required. Archiving will be taken care of by the Data Coordinator for your group directory, but you are responsible for moving your data into the archive directory to communicate that it is ready.
For an end-to-end analysis, data can ideally be archived incrementally as your project progresses. Using symbolic links to the rawdata and procdata folders inside your project directory lets you keep all of your project code and metadata in one place while still separating the physical location of your project data from the code and results. Write your code so that each analysis step is self-contained, and archive the data associated with a step as soon as possible after that step is completed.
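Concretely, a project wired up this way might contain only code, results, and symlinks. A hypothetical layout, with placeholder names and paths based on the bhklab examples above:

```bash
cd "$PROJ/my_project"
# keep code and results here; link to the physical data locations elsewhere
ln -s "/cluster/projects/bhklab/rawdata/<source>_<model system>/<data type>/<YYYYMM>" rawdata
ln -s "$PDATA/<source>_<model system>/<data type>/<YYYYMM>" procdata
```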
For example, let's say I want to analyze some RNA sequencing data from TCGA (a shell sketch of these steps follows the list):

- I contact the cluster administrators to retrieve the raw `.FASTA` sequence files from the source database
- These files are downloaded to the rawdata DMP directory in a folder called `TCGA_human/rnaseq/202210`
- I then create an associated folder in procdata with the same naming conventions: `procdata/TCGA_human/rnaseq/202210`
- Now I create a project directory where my analysis will live and add symbolic links to my raw and processed data directories
- I select a reference genome to align against and symbolically link it in the references folder
- I then create my sequencing QC and alignment scripts in the scripts directory of my project
- After QC, I write the `.FASTQ` files to the rawdata directory, since they are still relatively large and closely related to the `.FASTA` files
- I then align the QC'd sequencing data to my reference genome and write the `.BAM` files to the procdata directory
- Once I am satisfied this process has been successful, I can move the raw data folder into the archive DMP directory so that the space I was using gets freed up at the end of the week!
- Now I can move on to writing my script for calling transcript counts from the aligned `.BAM` files. Since the final count matrix is going to be relatively small, it is likely appropriate to write it to the results folder of my project. This will allow me to archive the `.BAM` files after I have finished calling the RNA counts from them.
- I can now move my procdata folder into the archive directory so that the space I was using will be freed up on Friday of that week!
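Put together as shell commands, the workflow above might look roughly like the sketch below. The group path, project name, and reference genome name are assumptions, and the actual QC, alignment, and counting commands are elided:

```bash
GROUP=/cluster/projects/bhklab   # your group's root directory; adjust for your lab

# steps 1-3: raw data lands in rawdata; create the matching procdata folder
ls "$GROUP/rawdata/TCGA_human/rnaseq/202210"          # raw sequence files from the admins
mkdir -p "$GROUP/procdata/TCGA_human/rnaseq/202210"

# steps 4-5: create the project and link data and a reference genome into it
mkdir -p "$GROUP/projects/tcga_rnaseq_analysis"
cd "$GROUP/projects/tcga_rnaseq_analysis"
ln -s "$GROUP/rawdata/TCGA_human/rnaseq/202210"  rawdata
ln -s "$GROUP/procdata/TCGA_human/rnaseq/202210" procdata
ln -s "$GROUP/references/GRCh38" reference            # hypothetical reference name

# steps 6-8: QC and alignment scripts (elided) write QC'd .FASTQ files through
# the rawdata symlink and aligned .BAM files through the procdata symlink

# step 9: stage the raw data for the Friday archiving job
mkdir -p "$GROUP/archive/rawdata/TCGA_human/rnaseq"
mv "$GROUP/rawdata/TCGA_human/rnaseq/202210" "$GROUP/archive/rawdata/TCGA_human/rnaseq/"

# steps 10-11: count calling (elided) writes the small count matrix to results/,
# after which the .BAM files can be staged for archiving too
mkdir -p results
mkdir -p "$GROUP/archive/procdata/TCGA_human/rnaseq"
mv "$GROUP/procdata/TCGA_human/rnaseq/202210" "$GROUP/archive/procdata/TCGA_human/rnaseq/"
```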
If additional follow-up analysis will be conducted using the RNA sequencing counts, it is probably best to create a new project directory and organize the additional steps there, using the RNA count matrix as the processed data input. This ensures analysis projects stay modular and that data can be archived promptly once the information needed for downstream analysis has been extracted.