`DamplyDirs`: Simplified Data Science Directory Management

Data science projects often involve complex directory structures to organize raw data, processed data, results, and code. Managing these directories manually can be error-prone and time-consuming. The damply.dmpdirs module solves this problem by providing standardized access to these directories through a simple, consistent interface.

The Problem

As data scientists, we face several challenges when working with project directories:

Inconsistent Naming: Different team members might use different directory names
Path Construction: Building paths can be error-prone across operating systems
Missing Directories: Time wasted creating directories that should already exist
Environment Variables: Need to use the same paths in Python, R, and shell scripts
Project Transfer: Moving projects between systems requires path adjustments

Solution with `DamplyDirs`

DamplyDirs provides intuitive, standardized access to common data science project directories with environment variable support:

from damply import dirs

# Access directory paths easily
data_file = dirs.RAWDATA / "dataset.csv"

# Write outputs to standard locations
results_file = dirs.RESULTS / "analysis_results.csv"

# Get a nice visual representation of your directories
print(dirs)
# DamplyDirs<Strict Mode: OFF>
# Project Root: /private/tmp/ctrpv2-treatmentresponse-snakemake
# CONFIG       : ├── config
# LOGS         : ├── logs
# METADATA     : ├── metadata
# NOTEBOOKS    : ├── workflow/notebooks
# PROCDATA     : ├── data/procdata
# RAWDATA      : ├── data/rawdata
# RESULTS      : ├── data/results
# SCRIPTS      : └── workflow/scripts

Directory Resolution

DamplyDirs resolves directory paths in the following order:

1. Environment Variables (RECOMMENDED): If an environment variable with the same name as the directory exists (e.g., RAWDATA), its value is used.
2. Default Structure: Otherwise, DamplyDirs falls back to a standard directory structure under the project root.
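
The resolution order above can be sketched with the standard library alone. This is a minimal illustration, not damply's actual implementation: `resolve_dir` and `DEFAULTS` are hypothetical names, and the default paths are copied from the `print(dirs)` output shown earlier.

```python
import os
from pathlib import Path

# Hypothetical mapping of directory names to default relative paths,
# copied from the listing printed by `print(dirs)`.
DEFAULTS = {
    "CONFIG": "config",
    "LOGS": "logs",
    "METADATA": "metadata",
    "NOTEBOOKS": "workflow/notebooks",
    "PROCDATA": "data/procdata",
    "RAWDATA": "data/rawdata",
    "RESULTS": "data/results",
    "SCRIPTS": "workflow/scripts",
}


def resolve_dir(name, project_root, env=None):
    """Return the environment variable's value if set, else the default path."""
    env = os.environ if env is None else env
    value = env.get(name)
    if value:
        # Step 1: an environment variable with the same name wins.
        return Path(value)
    # Step 2: fall back to the default location under the project root.
    return Path(project_root) / DEFAULTS[name]


# An explicit `env` dict keeps the example independent of the real environment.
print(resolve_dir("RAWDATA", "/proj", env={"RAWDATA": "/mnt/shared/raw"}))
print(resolve_dir("RAWDATA", "/proj", env={}))
```

Passing the environment as an argument is a design choice for this sketch only; it makes the lookup easy to test without modifying `os.environ`.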

Default Directory Structure

When environment variables aren't set, DamplyDirs uses this standard structure:

project_root/
├── config/         # Configuration files
├── data/           # All data in one parent directory
│   ├── procdata/   # Processed/intermediate data
│   ├── rawdata/    # Raw input data
│   └── results/    # Analysis outputs
├── logs/           # Log files
├── metadata/       # Dataset descriptions
└── workflow/       # Code organization
    ├── notebooks/  # Jupyter notebooks
    └── scripts/    # Analysis scripts
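
If you prefer to create this layout up front (for example, before enabling strict mode), a few lines of `pathlib` are enough. The sketch below is not part of damply; `scaffold` and `DEFAULT_DIRS` are hypothetical names, and the example writes into a temporary directory so it can be run safely.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the default structure shown above.
DEFAULT_DIRS = [
    "config",
    "data/procdata",
    "data/rawdata",
    "data/results",
    "logs",
    "metadata",
    "workflow/notebooks",
    "workflow/scripts",
]


def scaffold(project_root):
    """Create every default directory under project_root and return the paths."""
    root = Path(project_root)
    created = []
    for rel in DEFAULT_DIRS:
        path = root / rel
        # parents=True creates data/ and workflow/ as needed;
        # exist_ok=True makes the function safe to re-run.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    paths = scaffold(tmp)
    all_exist = all(p.is_dir() for p in paths)
    print(f"created {len(paths)} directories, all exist: {all_exist}")
```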

Recommended Environment Variable Integration

DamplyDirs reads the same environment variables that your other tools can see, which makes it a good fit for projects that use an environment manager such as pixi:

# Example pixi.toml configuration
[activation]
# convenient variables which can be used in scripts
env.CONFIG = "${PIXI_PROJECT_ROOT}/config"
env.METADATA = "${PIXI_PROJECT_ROOT}/metadata"
env.LOGS = "${PIXI_PROJECT_ROOT}/logs"
env.RAWDATA = "${PIXI_PROJECT_ROOT}/data/rawdata"
env.PROCDATA = "${PIXI_PROJECT_ROOT}/data/procdata"
env.RESULTS = "${PIXI_PROJECT_ROOT}/data/results"
env.SCRIPTS = "${PIXI_PROJECT_ROOT}/workflow/scripts"

This will automatically set these environment variables when you activate your project environment via pixi shell or pixi run.

With this setup, your paths will be consistent across:

Python scripts using DamplyDirs
R scripts using environment variables
Shell scripts and commands
Snakemake workflows
Any other tools that can access environment variables

Getting Started

Basic Usage

from damply import dirs

# Access paths directly
config_path = dirs.CONFIG / "analysis_config.yaml"
data_path = dirs.RAWDATA / "experiment_1" / "samples.csv"
results_path = dirs.RESULTS / "figures" / "figure1.png"

Auto-Directory Creation

By default, DamplyDirs operates in non-strict mode, which automatically creates missing directories when they are accessed. This behavior helps get you started quickly without having to manually create all directories first:

# Accessing this path will create the directory if it doesn't exist
missing_dir = dirs.RESULTS / "new_analysis"

# You can enable strict mode if you prefer to get errors for missing directories
dirs.set_strict_mode(True)

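# Now this will raise DirectoryNameNotFoundError if the directory doesn't exist
# (assumes DirectoryNameNotFoundError has already been imported from damply;
# check the package for its exact import path)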
try:
    missing_dir = dirs.RESULTS / "another_analysis"
except DirectoryNameNotFoundError:
    print("Directory doesn't exist and won't be created in strict mode")

When not in strict mode, you'll see informative log messages when directories are created automatically.

Advanced Usage

Project Root Discovery

DamplyDirs finds your project root in the following order:

1. The DMP_PROJECT_ROOT environment variable
2. The PIXI_PROJECT_ROOT environment variable
3. The current working directory, as a fallback (not recommended, since the working directory of a Jupyter notebook is often not the project root)

Set the environment variable to ensure consistent behavior across scripts:

export DMP_PROJECT_ROOT=/path/to/my/project
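
The lookup order described above can be sketched as follows. This is an illustration only, not damply's code: `find_project_root` is a hypothetical helper that returns both the path and the source it came from, which is handy when debugging which root was picked.

```python
import os
from pathlib import Path

# Environment variables checked in priority order, as listed above.
ROOT_VARIABLES = ("DMP_PROJECT_ROOT", "PIXI_PROJECT_ROOT")


def find_project_root(env=None, cwd=None):
    """Return (root_path, source) using the documented priority order."""
    env = os.environ if env is None else env
    for var in ROOT_VARIABLES:
        value = env.get(var)
        if value:
            return Path(value), var
    # Fallback: the current working directory.
    fallback = Path(cwd) if cwd is not None else Path.cwd()
    return fallback, "cwd"


root, source = find_project_root()
print(f"project root: {root} (from {source})")
```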

Using in Other Languages and Tools

Note: Assuming you have set the environment variables as shown in the pixi.toml example, you can access these directories in various languages and tools.

The examples below cover Python, R scripts, shell scripts, and Snakemake, in that order:

from damply import dirs

# Access directories using environment variables
raw_data_path = dirs.RAWDATA / "dataset.csv"
results_path = dirs.RESULTS / "analysis_results.csv"

# Read data and save results
import pandas as pd
data = pd.read_csv(raw_data_path)
data.to_csv(results_path, index=False)

# Access the same directories in R
RAWDATA <- Sys.getenv("RAWDATA")
RESULTS <- Sys.getenv("RESULTS")

# Read data and save results using those paths
data <- read.csv(file.path(RAWDATA, "dataset.csv"))
write.csv(data, file.path(RESULTS, "analysis_results.csv"))

# Access the same directories in shell scripts
echo $RAWDATA
ls $RAWDATA/dataset.csv
cp $RAWDATA/dataset.csv $PROCDATA/processed_dataset.csv

# Snakefiles are a superset of Python, so you can use damply.dmpdirs directly!
from damply import dirs
rule all:
    input:
        dirs.RESULTS / "final_results.txt"
rule process_data:
    input:
        dirs.RAWDATA / "dataset.csv"
    output:
        dirs.PROCDATA / "processed_data.csv"
    shell:
        "{dirs.SCRIPTS}/process_data.sh {input} > {output}"

Real-World Examples

Data Processing Workflow

from damply import dirs
import pandas as pd

# Load input data
input_file = dirs.RAWDATA / "experiment_2023" / "samples.csv"
data = pd.read_csv(input_file)

# Process data
processed_data = data.groupby('sample_id').mean(numeric_only=True)

# Save intermediate result
interim_file = dirs.PROCDATA / "aggregated_samples.csv"
processed_data.to_csv(interim_file)

# Generate and save visualization
output_file = dirs.RESULTS / "sample_means.png"
processed_data.plot(kind='bar').figure.savefig(output_file)
print(f"Results saved to {output_file}")
# Results saved to /path/to/project_root/data/results/sample_means.png

Configuration Management

from damply import dirs
import yaml

# Load configuration
config_file = dirs.CONFIG / "analysis_params.yaml"
with open(config_file, 'r') as f:
    config = yaml.safe_load(f)

# Use configuration in analysis
threshold = config['filtering']['threshold']
print(f"Using threshold: {threshold}")

Troubleshooting

"Directory Not Found" Errors

If you get a DirectoryNameNotFoundError:

Check that you're in the correct project root
Verify that the directory exists or the environment variable is set
Consider calling dirs.set_strict_mode(False) to auto-create directories

Environment Variable Not Set

If you get an EnvironmentVariableNotSetError or see a warning log about falling back to default paths:

Make sure your environment is activated and you have set the environment variables in your pixi.toml!
Check that the variable is set correctly (try echo $VARNAME)
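
The same check can be run from Python to see every missing variable at once. This is a sketch using only the standard library; `unset_variables` is a hypothetical helper, and the variable list mirrors the pixi.toml example earlier in this document.

```python
import os

# Variable names from the pixi.toml example.
EXPECTED = ("CONFIG", "METADATA", "LOGS", "RAWDATA", "PROCDATA", "RESULTS", "SCRIPTS")


def unset_variables(env=None):
    """Return the expected variable names that are missing or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED if not env.get(name)]


missing = unset_variables()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All expected directory variables are set.")
```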

Best Practices

Use DamplyDirs consistently across all project scripts
Set environment variables in your project configuration (pixi.toml, etc.)
Keep a clear separation between raw data, processed data, and results
Monitor the logs - they provide useful information about directory creation and fallbacks

By using DamplyDirs with environment variables, you ensure that your data science projects remain organized, portable, and consistent across languages, tools, and team members.

Strict Mode and Environment Variables

When using environment variables with DamplyDirs, the behavior depends on the strict mode setting:

In Strict Mode (dirs.set_strict_mode(True)):
    If an environment variable for a directory is not set, an EnvironmentVariableNotSetError is raised.
    If a directory doesn't exist (even when its environment variable is set), a DirectoryNameNotFoundError is raised.
In Non-Strict Mode (dirs.set_strict_mode(False), the default):
    If an environment variable isn't set, DamplyDirs falls back to default paths with a warning log message.
    If a directory doesn't exist, it's automatically created with an info log message.

This behavior gives you flexibility to enforce environment variable usage in production while allowing more permissive behavior during development:

# In development: auto-create missing directories (default behavior)
dirs.set_strict_mode(False)
data_path = dirs.RAWDATA / "my_data.csv"  # Will use default path if env var not set

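# In production: enforce that environment variables are properly set
# (assumes EnvironmentVariableNotSetError has already been imported from damply;
# check the package for its exact import path)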
dirs.set_strict_mode(True)
try:
    data_path = dirs.RAWDATA / "my_data.csv"
except EnvironmentVariableNotSetError:
    print("RAWDATA environment variable must be set!")
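
To make the four cases above concrete, here is a self-contained sketch of the decision logic. It does not use damply and is not its implementation: `get_dir` is a hypothetical function, and the two exception classes are local stand-ins that only borrow the names of damply's exceptions.

```python
import logging
import os
import tempfile
from pathlib import Path

logger = logging.getLogger("dirs_sketch")


class EnvironmentVariableNotSetError(Exception):
    """Local stand-in for the damply exception of the same name."""


class DirectoryNameNotFoundError(Exception):
    """Local stand-in for the damply exception of the same name."""


def get_dir(name, default, strict, env=None):
    """Resolve a directory according to the strict/non-strict rules above."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        if strict:
            raise EnvironmentVariableNotSetError(name)
        logger.warning("%s is not set; falling back to %s", name, default)
        value = default
    path = Path(value)
    if not path.is_dir():
        if strict:
            raise DirectoryNameNotFoundError(str(path))
        logger.info("creating missing directory %s", path)
        path.mkdir(parents=True, exist_ok=True)
    return path


logging.basicConfig(level=logging.INFO)
with tempfile.TemporaryDirectory() as tmp:
    default = Path(tmp) / "data" / "rawdata"
    # Non-strict: the variable is unset, so the default is used and created.
    path = get_dir("RAWDATA", default, strict=False, env={})
    print(f"non-strict resolved to an existing directory: {path.is_dir()}")
    # Strict: the same unset variable is now an error.
    try:
        get_dir("RAWDATA", default, strict=True, env={})
    except EnvironmentVariableNotSetError as err:
        print(f"strict mode rejected unset variable: {err}")
```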
