A Standard for Annotations
Introduction
The goal of AnnotationGx is to provide tools for annotating chemi- and bio-informatic data. Although the package is still in its early stages, it already provides a number of functions that may be useful for annotating data. In the interest of standardizing the annotation process, we propose a standard for annotations that may be adopted in future work.
Starting Point
The starting point of any annotation process might be a table with a number of columns or a list of identifiers that need to be annotated.
For example, we might have a data frame with a column of cell line names that we would like to annotate with information about the cell lines, or a list of drugs that we would like to annotate with information about the drugs.
# "sample" refers to the cell line names
data(CCLE_sampleMetadata)
head(CCLE_sampleMetadata)
#> CCLE_ID depMapID Name CCLE_ID_parsed
#> <char> <char> <char> <char>
#> 1: DMS53_LUNG ACH-000698 DMS 53 DMS53
#> 2: SW1116_LARGE_INTESTINE ACH-000489 SW1116 SW1116
#> 3: NCIH1694_LUNG ACH-000431 NCI-H1694 NCIH1694
#> 4: P3HR1_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE ACH-000707 P3HR-1 P3HR1
#> 5: HUT78_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE ACH-000509 HuT 78 HUT78
#> 6: UMUC3_URINARY_TRACT ACH-000522 UM-UC-3 UMUC3
# "treatment" refers to the drug names
data(CCLE_treatmentMetadata)
head(CCLE_treatmentMetadata)
#> CCLE.treatmentid
#> <char>
#> 1: Erlotinib
#> 2: Lapatinib
#> 3: PHA-665752
#> 4: PF-2341066
#> 5: TAE684
#> 6: Vandetanib
Generic names for classes of data
The first standard is to use generic names for different classes so that the annotation process can be generalized. When referring to cell lines, patients, or other biological entities that are being studied, we should use the name “sample”. When referring to drugs, chemicals, or other treatments that are being applied to the samples, we should use the name “treatment”.
Use the name “sample” for biological entities and “treatment” for treatments.
Use the name “treatment” for drugs, chemicals, radiological treatments, etc. that are being applied to the samples.
In the CCLE example data provided, the data frames are named CCLE_sampleMetadata and CCLE_treatmentMetadata, which already follows this standard.
However, within the data frames, the column names are not standardized. CCLE_treatmentMetadata correctly identifies the column with the treatment names as “treatmentid”, but CCLE_sampleMetadata uses varying names such as “CCLE_ID” and “Name”.
Before we rename the columns, we introduce the second standard for column names.
Standardized column names
Throughout the annotation process, many sources might be used for generating metadata.
For example, in annotating treatments, one might use the DrugBank database, the PubChem database, and the ChEMBL database.
For transparency and reproducibility, we should use standardized column names for the metadata that we collect from these sources.
Use the format {SOURCE}.{COLUMN_NAME} for column names in the metadata.
For example, if we are using the PubChem and DrugBank databases, we might have columns like “pubchem.CID”, “pubchem.SMILES”, “drugbank.ID”, “drugbank.SMILES”, etc.
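One way to check this convention is a small helper that tests each column name against the {SOURCE}.{COLUMN_NAME} pattern. The sketch below is illustrative only: `follows_standard` is a hypothetical helper, not part of AnnotationGx, and the regular expression assumes that the source prefix contains only letters and digits.

```r
# Hypothetical helper (not an AnnotationGx function): TRUE for names of the
# form {SOURCE}.{COLUMN_NAME}. Assumes SOURCE is alphanumeric and that
# COLUMN_NAME is made of letters, digits, or underscores.
follows_standard <- function(column_names) {
  grepl("^[A-Za-z0-9]+\\.[A-Za-z0-9_]+$", column_names)
}

# Names taken from the example above: each carries a source prefix
standard_cols <- c("pubchem.CID", "pubchem.SMILES", "drugbank.ID", "drugbank.SMILES")

# Column names of CCLE_sampleMetadata: none carries a source prefix
ccle_cols <- c("CCLE_ID", "depMapID", "Name", "CCLE_ID_parsed")

follows_standard(standard_cols)
follows_standard(ccle_cols)
```

A check like this could be run before saving an annotated table, so that unprefixed columns are caught early.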
This also applies to the data we start with. Take for example the GDSC example data provided:
head(GDSC_sampleMetadata)
#> GDSC.Sample_Name GDSC.COSMIC_ID
#> <char> <num>
#> 1: A253 906794
#> 2: BB30-HNC 753531
#> 3: BB49-HNC 753532
#> 4: BHY 753535
#> 5: BICR10 1290724
#> 6: BICR22 1240121
The column names above follow both Standard 1 and Standard 2. They tell us that the data in both columns come from the GDSC database.
This is especially useful when we are combining data from multiple sources and they might have columns with the same name. Additionally, it provides a level of confidence for users trying to compare data from different sources.
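To illustrate the point about name collisions, the sketch below prefixes the columns of two small tables before merging them. The tables are toy stand-ins built from identifiers that appear elsewhere in this document, `add_source_prefix` is a hypothetical helper rather than an AnnotationGx function, and base R is used so that the example runs without extra packages.

```r
# Hypothetical helper: prepend "{source}." to every column except those in `keep`
add_source_prefix <- function(df, source, keep = character()) {
  to_rename <- !(names(df) %in% keep)
  names(df)[to_rename] <- paste(source, names(df)[to_rename], sep = ".")
  df
}

# Toy tables: both sources call their identifier column "ID"
pubchem <- data.frame(
  CCLE.treatmentid = c("Erlotinib", "17-AAG"),
  ID = c(176870L, 6505803L)
)
drugbank <- data.frame(
  CCLE.treatmentid = c("Erlotinib", "17-AAG"),
  ID = c("DB00530", "DB05134")
)

# Prefix each table with its source, then join on the shared key column
annotated <- merge(
  add_source_prefix(pubchem, "pubchem", keep = "CCLE.treatmentid"),
  add_source_prefix(drugbank, "drugbank", keep = "CCLE.treatmentid"),
  by = "CCLE.treatmentid"
)
names(annotated)
```

Without the prefixes, `merge()` would disambiguate the two "ID" columns with its default ".x" and ".y" suffixes, which say nothing about where each value came from.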
Example Annotated Data
In the example below, we have a data frame annotating the treatments for four datasets: CCLE, GDSC, CTRP, and gCSI.
treatmentMetadata <- data.table::fread(system.file("extdata", "treatmentMetadata_annotated_pubchem_unichem_chembl.tsv", package = "AnnotationGx"))
# two drugs: Erlotinib and Tanespimycin
str(treatmentMetadata[pubchem.CID %in% c(6505803L, 176870L), ])
#> Classes 'data.table' and 'data.frame': 2 obs. of 26 variables:
#> $ unichem.ChEMBL : chr "CHEMBL553" "CHEMBL109480"
#> $ pubchem.CID : int 176870 6505803
#> $ CCLE.treatmentid : chr "Erlotinib" "17-AAG"
#> $ CTRP.treatmentid : chr "erlotinib" "tanespimycin"
#> $ GDSC.treatmentid : chr "Erlotinib" "Tanespimycin"
#> $ gCSI.treatmentid : chr "Erlotinib" "17-AAG"
#> $ pubchem.MolecularFormula : chr "C22H23N3O4" "C31H43N3O8"
#> $ pubchem.MolecularWeight : num 393 586
#> $ pubchem.InChIKey : chr "AAKJLRGGTJKAMG-UHFFFAOYSA-N" "AYUNIORJHRXIBJ-TXHRRWQRSA-N"
#> $ pubchem.Title : chr "Erlotinib" "Tanespimycin"
#> $ unichem.ChEBI : chr "114785" "64153"
#> $ unichem.FDA_SRS : chr "J4T82NDH7E" "4GY0AVT3L4"
#> $ unichem.LINCS : chr "LSM-1097" "LSM-43238"
#> $ unichem.rxnorm : chr "ERLOTINIB HYDROCHLORIDE,TARCEVA,ERLOTINIB" ""
#> $ unichem.clinicaltrials : chr "TARCEVA,OSI-774,CP-358,CP-358,774,CP-358774,ERLOTINIB HYDROCHLORIDE,ERLOTINIB" "KOS-953,BMS-722782,CNF-1010,CNF1010,NSC-330507,17-AAG,TANESPIMYCIN"
#> $ unichem.DrugBank : chr "DB00530" "DB05134"
#> $ unichem.NIH_Clinical_Collection : chr "" ""
#> $ unichem.PharmGKB : chr "PA134687924" ""
#> $ chembl.action_type : chr "" "INHIBITOR"
#> $ chembl.disease_efficacy : int NA 1
#> $ chembl.max_phase : int NA 3
#> $ chembl.mec_id : int NA 5733
#> $ chembl.mechanism_of_action : chr "" "Heat shock protein HSP90 inhibitor"
#> $ chembl.target_chembl_id : chr "" "CHEMBL2095165"
#> $ chembl.site_id : int NA NA
#> $ chembl.parent_molecule_chembl_id: chr "" "CHEMBL109480"
#> - attr(*, ".internal.selfref")=<externalptr>
We can see above how the dataset sources are named in the column names (‘CCLE.treatmentid’, ‘GDSC.treatmentid’, ‘CTRP.treatmentid’, ‘gCSI.treatmentid’).
If a user wanted to get the InChIKey, they would use the “pubchem.InChIKey” column, and understand that these InChIKeys are from the PubChem database.
Similarly, they have access to mechanism-of-action data in the “chembl.mechanism_of_action” column, and understand that these mechanisms are from the ChEMBL database.
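Because the source is encoded in the column name, the columns from one source can be selected with a simple prefix match. The following is a minimal sketch on a two-row toy table copied from the output above; `columns_from` is a hypothetical helper, not an AnnotationGx function.

```r
# Toy subset of the annotated table shown above
treatments <- data.frame(
  CCLE.treatmentid = c("Erlotinib", "17-AAG"),
  pubchem.CID = c(176870L, 6505803L),
  pubchem.InChIKey = c("AAKJLRGGTJKAMG-UHFFFAOYSA-N", "AYUNIORJHRXIBJ-TXHRRWQRSA-N"),
  unichem.DrugBank = c("DB00530", "DB05134"),
  chembl.max_phase = c(NA, 3L)
)

# Hypothetical helper: names of all columns whose prefix matches `source`
columns_from <- function(df, source) {
  names(df)[startsWith(names(df), paste0(source, "."))]
}

# Keep only the PubChem-derived columns
treatments[, columns_from(treatments, "pubchem")]

# Everything before the first "." is the source, so sources can be tabulated
table(sub("\\..*$", "", names(treatments)))
```

The same prefix match works for dataset sources such as “CCLE” or “GDSC”, since they follow the same naming format.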