Data Nutrition Label

We have a Data Nutrition Label (DNL) Mockup

Context

A Data Nutrition Label (DNL) communicates the contents and quality of a dataset to the client (user or front end).

It is expected that the DNL will be generated and stored according to the Pipeline Storage documentation.

As this data structure is designed, we can standardize what the pipeline needs to output and how the API will reference those outputs.

Standards across ALL datasets

This diagram is based on all the sections below:

Diagram: example label for the pipeline gCSI-Pharmacoset_Snakemake, with four sections.

about
  name: gCSI
  description: gCSI dataset is a pharmacogenomics ...
  Disclaimer: The gCSI data were generated and shared by Genentech ...
  Curated By: BHK Lab
  Curated On: 2021-09-01
technical_information
  Dataset DOI: 10.5281/zenodo.123456
  Data Type: PharmacoSet
  Data Format: RDS
  License: CC BY 4.0
pipeline
  GitHub Repo: github.com/repo
  Release Notes: v2024.2 - 10 new drugs; added chembl MOA...
data_quality
  Annotation standards: ????
  Quality Control: link to HTML file

Python Pydantic class definitions

import datetime
from typing import List, Optional

from pydantic import BaseModel, Field


class About_DNL(BaseModel):
    name: str
    description: str
    Disclaimer: str
    Curated_By: str
    Curated_On: datetime.datetime


class TechnicalInformation_DNL(BaseModel):
    Dataset_DOI: str
    Data_Type: str
    Data_Format: str
    License: str


class Pipeline_DNL(BaseModel):
    GitHub_Repo: str
    Release_Notes: str


class DataQuality_DNL(BaseModel):
    Annotation_standards: str
    Quality_Control: str


class DataNutritionLabel(BaseModel):
    about: About_DNL
    technical_information: TechnicalInformation_DNL
    pipeline: Pipeline_DNL
    data_quality: DataQuality_DNL
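As a rough illustration of how these models might be used, the sketch below fills a DataNutritionLabel with the example values from the gCSI diagram and shows that a label missing required sections is rejected. The class definitions are repeated so the snippet runs on its own. Elided values ("...") are kept exactly as they appear in the diagram, and Curated_On is passed as a datetime object to avoid any assumption about string parsing. Only constructor calls and attribute access are used, which should behave the same in Pydantic v1 and v2.

```python
import datetime

from pydantic import BaseModel, ValidationError


class About_DNL(BaseModel):
    name: str
    description: str
    Disclaimer: str
    Curated_By: str
    Curated_On: datetime.datetime


class TechnicalInformation_DNL(BaseModel):
    Dataset_DOI: str
    Data_Type: str
    Data_Format: str
    License: str


class Pipeline_DNL(BaseModel):
    GitHub_Repo: str
    Release_Notes: str


class DataQuality_DNL(BaseModel):
    Annotation_standards: str
    Quality_Control: str


class DataNutritionLabel(BaseModel):
    about: About_DNL
    technical_information: TechnicalInformation_DNL
    pipeline: Pipeline_DNL
    data_quality: DataQuality_DNL


# Values copied from the gCSI diagram; elided text ("...") is left as-is.
label = DataNutritionLabel(
    about=About_DNL(
        name="gCSI",
        description="gCSI dataset is a pharmacogenomics ...",
        Disclaimer="The gCSI data were generated and shared by Genentech ...",
        Curated_By="BHK Lab",
        Curated_On=datetime.datetime(2021, 9, 1),
    ),
    technical_information=TechnicalInformation_DNL(
        Dataset_DOI="10.5281/zenodo.123456",
        Data_Type="PharmacoSet",
        Data_Format="RDS",
        License="CC BY 4.0",
    ),
    pipeline=Pipeline_DNL(
        GitHub_Repo="github.com/repo",
        Release_Notes="v2024.2 - 10 new drugs; added chembl MOA...",
    ),
    data_quality=DataQuality_DNL(
        Annotation_standards="????",  # still undefined in this design
        Quality_Control="link to HTML file",
    ),
)

# A label that omits required sections fails validation at construction time.
try:
    DataNutritionLabel(about=label.about)
    missing_sections_rejected = False
except ValidationError:
    missing_sections_rejected = True
```

Because every field is declared as a required str (or datetime), the pipeline and API have to supply all four sections before a label can be built.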

About

DNL_about.png

| Field | Value | Source | Default |
| --- | --- | --- | --- |
| name | name of pipeline used across databases | API/client | NA |
| description | outputted from the pipeline metadata file | Pipeline Output | NA |
| Disclaimer | some disclaimer indicating usage? see disclaimer below | Pipeline Output | NA |
| Curated By | Group that curated the pipeline | API/client | BHK Lab |
| Curated On | Date of curation should be outputted as well | Pipeline Output | NA |

Disclaimer (Data Usage Policy): The gCSI data were generated and shared by Genentech as part of the Genentech Cell Line Screening Initiative (Eva Lin, Yihong Yu, Scott Martin, Department of Discovery Oncology). The Haibe-Kains Lab has reprocessed and re-annotated the data to maximize overlap with other pharmacogenomic datasets.
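The default column for this section could be encoded directly in the model. The following is a minimal sketch, assuming defaults are applied at the Pydantic layer rather than by the pipeline or the API (the design does not say which). Only Curated By receives a default, since every other field in this section lists NA.

```python
import datetime

from pydantic import BaseModel


class About_DNL(BaseModel):
    # Fields whose default is NA stay required.
    name: str
    description: str
    Disclaimer: str
    # Default taken from the "default" column for Curated By.
    Curated_By: str = "BHK Lab"
    Curated_On: datetime.datetime


# Curated_By is omitted here, so the default applies.
about = About_DNL(
    name="gCSI",
    description="gCSI dataset is a pharmacogenomics ...",
    Disclaimer="The gCSI data were generated and shared by Genentech ...",
    Curated_On=datetime.datetime(2021, 9, 1),
)
print(about.Curated_By)  # BHK Lab
```

A client can still pass Curated_By explicitly to override the default.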

Technical Information

TechnicalInformation.png

| Field | Value | Source | Default |
| --- | --- | --- | --- |
| Dataset DOI | Zenodo DOI of the dataset | API/server | NA |
| Data Type | Type of data (e.g., MultiAssayExperiment, PharmacoSet, SummarizedExperiment) | client/Pipeline Output | NA |
| Data Format | Format of data (e.g., rds, csv, tsv, txt) | client/Pipeline Output | NA |
| License | License of the data (e.g., CC BY 4.0, MIT, GPL-3) | GitHub repo/Pipeline Output/client | Creative Commons Attribution 4.0 license |
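The source and default columns suggest that some fields, such as License, may arrive from several places with a fallback value. Below is a minimal sketch of one way to resolve that, assuming the sources are tried in the order listed (GitHub repo, then pipeline output, then client). That precedence and the helper name resolve_field are hypothetical; the design does not specify either.

```python
from typing import Optional, Sequence

# Default from the "default" column for License.
DEFAULT_LICENSE = "Creative Commons Attribution 4.0 license"


def resolve_field(
    candidates: Sequence[Optional[str]],
    default: Optional[str] = None,
) -> str:
    """Return the first non-empty candidate, else the default.

    Raises ValueError when no source supplies a value and the
    field has no default (NA, represented here as None).
    """
    for value in candidates:
        if value is not None and value.strip():
            return value.strip()
    if default is None:
        raise ValueError("no value supplied and the field has no default")
    return default


# Hypothetical inputs in the assumed order: GitHub repo, pipeline output, client.
license_from_sources = resolve_field([None, "", "CC BY 4.0"], default=DEFAULT_LICENSE)

# No source supplied a license, so the documented default is used.
license_fallback = resolve_field([None, None, None], default=DEFAULT_LICENSE)
```

Fields whose default is NA, such as Dataset DOI, would call resolve_field without a default and fail loudly when no source provides a value.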

Pipeline & Release Notes

pipeline.png
release_notes.png

| Field | Value | Source | Default |
| --- | --- | --- | --- |
| GitHub Repo | URL of the GitHub repository where the pipeline is stored | API/client | NA |
| Release Notes | Notes on the pipeline release | repo/API/client | NA |

Data Quality

badges.png

| Field | Value | Source | Default |
| --- | --- | --- | --- |
| Annotation standards | ???? | ???? | ???? |
| Quality Control | Link to an HTML file with the quality control metrics | ???? | ???? |

Last modified: 28 May 2024