Data Sources
Overview
This section should document all data sources used in your project. Proper documentation ensures reproducibility and helps others understand your research methodology.
How to Document Your Data
For each data source, include the following information:
1. External Data Sources
- Name: Official name of the dataset
- Version/Date: Version number or release date
- URL: Link to the data source
- Access Method: How the data was obtained (direct download, API, etc.)
- Access Date: When the data was accessed/retrieved
- Data Format: Format of the data (FASTQ, DICOM, CSV, etc.)
- Citation: Proper academic citation if applicable
- License: Usage restrictions and attribution requirements
Example:
## TCGA RNA-Seq Data
- **Name**: The Cancer Genome Atlas RNA-Seq Data
- **Version**: Data release 28.0 - March 2021
- **URL**: https://portal.gdc.cancer.gov/
- **Access Method**: GDC Data Transfer Tool
- **Access Date**: 2021-03-15
- **Citation**: The Cancer Genome Atlas Network. (2012). Comprehensive molecular portraits of human breast tumours. Nature, 490(7418), 61-70.
- **License**: [NIH Genomic Data Sharing Policy](https://sharing.nih.gov/genomic-data-sharing-policy)
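If you prefer to keep these entries in code and generate the documentation from them, the fields above map naturally onto a small record type. The following is a minimal sketch, not a prescribed tool: the `ExternalSource` class and its `to_markdown` method are hypothetical names introduced here for illustration, and the values are copied from the example above. Fields left empty (here, the data format) are simply skipped.

```python
from dataclasses import dataclass


@dataclass
class ExternalSource:
    """Hypothetical record holding the fields listed for an external data source."""

    name: str
    version: str
    url: str
    access_method: str
    access_date: str
    data_format: str = ""
    citation: str = ""
    license: str = ""

    def to_markdown(self, title: str) -> str:
        """Render the record in the same bullet layout as the example entry."""
        fields = [
            ("Name", self.name),
            ("Version", self.version),
            ("URL", self.url),
            ("Access Method", self.access_method),
            ("Access Date", self.access_date),
            ("Data Format", self.data_format),
            ("Citation", self.citation),
            ("License", self.license),
        ]
        lines = [f"## {title}"]
        # Skip fields that were left empty instead of printing blank bullets.
        lines += [f"- **{label}**: {value}" for label, value in fields if value]
        return "\n".join(lines)


# Values taken from the example entry above.
entry = ExternalSource(
    name="The Cancer Genome Atlas RNA-Seq Data",
    version="Data release 28.0 - March 2021",
    url="https://portal.gdc.cancer.gov/",
    access_method="GDC Data Transfer Tool",
    access_date="2021-03-15",
)
print(entry.to_markdown("TCGA RNA-Seq Data"))
```

Keeping the access date as an ISO 8601 string (`YYYY-MM-DD`) makes entries sort correctly and avoids ambiguity between day-first and month-first formats.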
2. Internal/Generated Data
- Name: Descriptive name of the dataset
- Creation Date: When the data was generated
- Creation Method: Brief description of how the data was created
- Input Data: What source data was used
- Processing Scripts: References to the scripts or GitHub repository used to generate this data
Example:
## Processed RNA-Seq Data
- **Name**: Processed RNA-Seq Data for TCGA-BRCA
- **Creation Date**: 2021-04-01
- **Creation Method**: Processed using kallisto and DESeq2
- **Input Data**: FASTQ data obtained from the SRA database
- **Processing Scripts**: [GitHub Repo](https://github.com/tcga-brca-rnaseq)
3. Data Dictionary
For complex datasets, include a data dictionary that describes each column, for example:
Column Name | Data Type | Description | Units | Possible Values |
---|---|---|---|---|
patient_id | string | Unique patient identifier | N/A | TCGA-XX-XXXX format |
age | integer | Patient age at diagnosis | years | 18-100 |
expression | float | Gene expression value | TPM | Any positive value |
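A data dictionary is most useful when the data is actually checked against it. The sketch below shows one way to validate a single record against the three example columns. It rests on assumptions that are not part of the table: `validate_record` is a hypothetical helper, and the `TCGA-XX-XXXX` format is interpreted as uppercase letters or digits in each `X` position. Adjust both to match your own dictionary.

```python
import re

# Assumption: each "X" in TCGA-XX-XXXX is an uppercase letter or a digit.
PATIENT_ID_PATTERN = re.compile(r"^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}$")


def validate_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors = []

    patient_id = record.get("patient_id")
    if not isinstance(patient_id, str) or not PATIENT_ID_PATTERN.match(patient_id):
        errors.append(f"patient_id {patient_id!r} is not in TCGA-XX-XXXX format")

    age = record.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool) or not 18 <= age <= 100:
        errors.append(f"age {age!r} is not an integer between 18 and 100")

    expression = record.get("expression")
    # The table lists "any positive value"; relax to >= 0 if zero is valid in your data.
    if (
        not isinstance(expression, (int, float))
        or isinstance(expression, bool)
        or expression <= 0
    ):
        errors.append(f"expression {expression!r} is not a positive number")

    return errors


good = {"patient_id": "TCGA-A1-B2C3", "age": 54, "expression": 12.5}
bad = {"patient_id": "patient-7", "age": 140, "expression": -1.0}
print(validate_record(good))
print(len(validate_record(bad)))
```

Returning a list of messages rather than raising on the first failure lets you report every problem in a record at once, which is usually more convenient when cleaning a new dataset.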
Best Practices
- Store raw data in `data/rawdata/` and never modify it
- Store processed data in `data/procdata/`, and keep all code used to generate it in `workflow/scripts/`
- Document all processing steps
- Track data provenance (where data came from and how it was modified)
- Respect data usage agreements and licenses! This is especially important for data that should not be shared publicly
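One lightweight way to back up the "never modify raw data" and provenance practices is to record a checksum for every raw file and compare against it later. The following is a minimal sketch under stated assumptions: the function names are hypothetical, SHA-256 is chosen here for illustration, and the manifest is a plain dictionary that you could store as JSON alongside your documentation.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large raw files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir: Path) -> dict:
    """Map each file's path (relative to raw_dir) to its SHA-256 checksum."""
    return {
        path.relative_to(raw_dir).as_posix(): sha256_of(path)
        for path in sorted(raw_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(raw_dir: Path, manifest: dict) -> list:
    """List files that were added, removed, or altered since the manifest was built."""
    current = build_manifest(raw_dir)
    names = set(current) | set(manifest)
    return sorted(name for name in names if current.get(name) != manifest.get(name))
```

A typical use would be to call `build_manifest(Path("data/rawdata"))` right after download, save the result with `json.dump`, and run `changed_files` before each analysis; an empty list indicates the raw data is unchanged.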