Introduction

This vignette compares two ways of annotating CTRP-provided treatment IDs: mapping them to PubChem CIDs through the PubChem API, and looking them up in the CTD2 database.

Although a PubChem CID uniquely identifies a compound, the PubChem API does not easily map treatment names to CIDs, at least not in a way that is robust to commonly misnamed treatments. Specifically, the PubChem API does not correctly map every one of the CTRP treatment names (n = 545) to a PubChem CID.

The CTD2 database is the central database where CTRP data is hosted. It exposes an API (https://ctd2-dashboard.nci.nih.gov/dashboard/#api-documentation) for its data.

Developer Note: The API calls described in the CTD2 API documentation are useful, but there is also an undocumented endpoint: GET /compound/{compoundId}. This endpoint is useful for mapping compound names, as they appear in CTD2-hosted data (i.e., CTRP), to PubChem CIDs.
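
CTRP treatment names often contain spaces, colons, and parentheses, so a name has to be percent-encoded before it is placed in the {compoundId} slot of the path. The sketch below illustrates only that step. The helper build_compound_url and the base URL are hypothetical placeholders: the vignette does not state the real base path, and this is not how mapCompound2CTD is necessarily implemented.

```r
# Hypothetical helper (not part of AnnotationGx): build a request URL for a
# GET /compound/{compoundId} style endpoint from a free-text compound name.
build_compound_url <- function(base_url, compound_name) {
  # reserved = TRUE also encodes characters such as ':', '/', '(' and ')'
  encoded <- utils::URLencode(compound_name, reserved = TRUE)
  # drop any trailing slash on the base URL before appending the path
  paste0(sub("/+$", "", base_url), "/compound/", encoded)
}

# Placeholder base URL, assumed for illustration only
base_url <- "https://example.org/api"

url <- build_compound_url(base_url, "Bax channel blocker")
print(url)
#> [1] "https://example.org/api/compound/Bax%20channel%20blocker"
```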

The functionality for this is implemented in the mapCompound2CTD function.

The goal is to investigate which of the two methods maps more compounds.

library(AnnotationGx)

data(CTRP_treatmentMetadata)

# take the first row of CTRP_treatmentMetadata as an example

treatment <- CTRP_treatmentMetadata[1, CTRP.treatmentid]
sprintf("CTRP treatment id : %s", treatment)
#> [1] "CTRP treatment id : CIL55"

# map the treatment to a CID using the CTD database
mapCompound2CTD(treatment)[, .(displayName, PUBCHEM)]
#>    displayName PUBCHEM
#>         <char>  <char>
#> 1:       CIL55 6623618


# map the treatment to a CID using PubChem
mapCompound2CID(treatment)
#>      name    cids
#>    <char>   <int>
#> 1:  CIL55 6623618

Annotating using the CTD database

result <- CTRP_treatmentMetadata[, mapCompound2CTD(CTRP.treatmentid, query_only = FALSE, raw = FALSE)]
#> Iterating ■■                                 3% | ETA: 29s
#> Iterating ■■■■■■■■■■                        30% | ETA:  4s
#> Iterating ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100% | ETA:  0s

show(result)
#>                        displayName BROAD_COMPOUND CTRP ID           CTRP NAME
#>                             <char>         <char>  <char>              <char>
#>   1:                         CIL55           1788    1788               CIL55
#>   2:                       BRD4132           3588    3588             BRD4132
#>   3:                       BRD6340          12877   12877             BRD6340
#>   4:                         ML006          17712   17712               ML006
#>   5:           Bax channel blocker          18311   18311 Bax channel blocker
#>  ---                                                                         
#> 541:                      avicin D         688975  688975            avicin D
#> 542: BRD9876:MK-1775 (4:1 mol/mol)           <NA>    <NA>                <NA>
#> 543:                 BRD-K30748066         689506  689506       BRD-K30748066
#> 544:                    linsitinib         705300  705300          linsitinib
#> 545:                        AT-406         710154  710154              AT-406
#>          DepMap compound             IMAGE  PUBCHEM         CAS DRUG BANK
#>                   <char>            <char>   <char>      <char>    <char>
#>   1:               CIL55   struct_1788.png  6623618        <NA>      <NA>
#>   2:             BRD4132   struct_3588.png  7326481        <NA>      <NA>
#>   3:             BRD6340  struct_12877.png  1641662        <NA>      <NA>
#>   4:               ML006  struct_17712.png  2842253        <NA>      <NA>
#>   5: BAX-channel-blocker  struct_18311.png  2729027        <NA>      <NA>
#>  ---                                                                     
#> 541:            avicin D struct_688975.png 73707595        <NA>      <NA>
#> 542:                <NA>              <NA>     <NA>        <NA>      <NA>
#> 543:       BRD-K30748066 struct_689506.png 11257553        <NA>      <NA>
#> 544:          linsitinib struct_705300.png 11640390 867160-71-2   DB06075
#> 545:              AT-406 struct_710154.png 25022340        <NA>      <NA>
message("Failed results: ", result[is.na(PUBCHEM), .N])
#> Failed results: 92

failed_names <- result[is.na(PUBCHEM), displayName]
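
To compare methods, it helps to express the failure count as a fraction of all queried treatments. A minimal base R sketch, using a small made-up vector as a stand-in for result$PUBCHEM (the values are illustrative, not taken from the full result):

```r
# Toy stand-in for result$PUBCHEM; NA marks a treatment that could not be mapped
pubchem <- c("6623618", "7326481", NA, "11640390", NA)

n_failed <- sum(is.na(pubchem))       # number of unmapped treatments
failure_rate <- mean(is.na(pubchem))  # fraction of unmapped treatments

message(sprintf("Failed: %d of %d (%.1f%%)",
                n_failed, length(pubchem), 100 * failure_rate))
```

Applied to the numbers reported above, 92 failures out of 545 treatments is a failure rate of about 16.9%.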

Annotating using PubChem

(compounds_to_cids <- 
  CTRP_treatmentMetadata[, 
    AnnotationGx::mapCompound2CID(
        names =  CTRP.treatmentid,
        first = TRUE
        )
      ]
)
failed <- 
  attributes(compounds_to_cids)$failed |> 
    names()
failed <- unique(CTRP_treatmentMetadata[CTRP.treatmentid %in% failed, ])

failed[, CTRP.treatmentid_CLEANED := cleanCharacterStrings(CTRP.treatmentid)]

(failed_to_cids <-
  failed[, 
    AnnotationGx::mapCompound2CID(
      names = CTRP.treatmentid_CLEANED,
      first = TRUE
    )
  ]
)
failed_again <-
  attributes(failed_to_cids)$failed |> 
    names()
failed_dt <- merge(failed_to_cids[!is.na(cids), ], failed, by.x = "name", by.y = "CTRP.treatmentid_CLEANED", all.x = FALSE)
failed_dt$name <- NULL

successful_dt <- merge(CTRP_treatmentMetadata, compounds_to_cids[!is.na(cids), ], by.x = "CTRP.treatmentid", by.y = "name", all.x = FALSE)

mapped_PubChem <- data.table::rbindlist(list(successful_dt, failed_dt), use.names = TRUE, fill = TRUE)
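
The PubChem workflow above follows a two-pass pattern: query the names as given, clean the names that failed, retry only those, and then combine the results. The sketch below reproduces that pattern in base R without any network access. Everything in it is an assumption made for illustration: lookup_cid is a stub backed by a tiny lookup table rather than mapCompound2CID, and clean_name is a guess at one kind of normalisation, not the actual behaviour of cleanCharacterStrings.

```r
# Hypothetical lookup table standing in for a remote name-to-CID service
known_cids <- c(ASPIRIN = 2244L, CAFFEINE = 2519L)

# Stub mapper: returns NA for any name it does not recognise
lookup_cid <- function(names) unname(known_cids[names])

# Hypothetical cleaner: upper-case, then drop everything but letters and digits
clean_name <- function(x) gsub("[^A-Z0-9]", "", toupper(x))

treatments <- c("ASPIRIN", "caffeine ", "not-a-drug")

# Pass 1: query the names exactly as provided
mapped <- data.frame(treatment = treatments, cid = lookup_cid(treatments))

# Pass 2: retry only the failures, this time with cleaned names
retry <- is.na(mapped$cid)
mapped$cid[retry] <- lookup_cid(clean_name(mapped$treatment[retry]))

# Fraction of treatments that ended up with a CID after both passes
coverage <- mean(!is.na(mapped$cid))
print(mapped)
```

Keeping the original treatment name in the table while querying with the cleaned name is what allows the retried rows to be joined back to the metadata, which is the role CTRP.treatmentid_CLEANED plays in the merge above.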