Querying PubChem • AnnotationGx

Introduction to PubChem APIs

PubChem is a database of chemical molecules and their biological activities. It is a part of the National Center for Biotechnology Information (NCBI), which is a part of the National Institutes of Health (NIH). PubChem provides a set of APIs to query its database. The AnnotationGx package provides a set of functions to query PubChem using these APIs.

The first of these APIs is the PubChem PUG REST API, which is designed to make specific queries based on some input identifier and return data that PubChem has labelled or computed internally [1]. It is useful for querying information about a specific chemical compound, such as getting the standardized PubChem compound identifier (CID) for a given chemical name or SMILES string, or getting the chemical structure for a given CID. Given a specific identifier, it provides access to a wide range of data, including chemical properties, bioassay data, and chemical classification data.
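To make the idea of an identifier-based query concrete, the sketch below builds a PUG REST URL that asks for the CIDs matching a compound name. The helper `build_pug_rest_url` is hypothetical and not part of AnnotationGx; the package may construct its requests differently. The code only builds the URL and does not contact PubChem.

```r
# Hypothetical helper (not an AnnotationGx function): build a PUG REST URL
# of the form <base>/compound/name/<name>/cids/<output format>.
build_pug_rest_url <- function(name, output = "JSON") {
  base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
  # Percent-encode the name so spaces and punctuation are safe in a URL path
  encoded <- utils::URLencode(name, reserved = TRUE)
  paste(base, "compound", "name", encoded, "cids", output, sep = "/")
}

build_pug_rest_url("aspirin")
build_pug_rest_url("acetylsalicylic acid")
```

Opening the resulting URL in a browser is a quick way to inspect the raw response that a name-to-CID lookup works from.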

The second API is the PubChem PUG VIEW API, which is designed to give access to aggregated annotations for a given chemical compound [3]. These annotations are mapped to PubChem's data but are not curated by PubChem itself; i.e., given a specific identifier, the API provides access to annotations from external sources such as UniProt, ChEBI, and ChEMBL.
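As a rough illustration of how a PUG VIEW request differs from a PUG REST one, the sketch below builds a URL for the annotation record of a single CID, optionally restricted to one heading. The helper `build_pug_view_url` is hypothetical and not part of AnnotationGx; it only constructs the URL and makes no network request.

```r
# Hypothetical helper (not an AnnotationGx function): build a PUG VIEW URL
# for the annotations of one compound, optionally limited to one heading.
build_pug_view_url <- function(cid, heading = NULL) {
  # Coerce to integer so large CIDs are never written in scientific notation
  cid <- as.integer(cid)
  url <- paste0(
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/",
    cid, "/JSON"
  )
  if (!is.null(heading)) {
    # Headings may contain spaces (e.g. "ChEMBL ID"), so percent-encode them
    url <- paste0(url, "?heading=", utils::URLencode(heading, reserved = TRUE))
  }
  url
}

build_pug_view_url(2244, heading = "CAS")
```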

Setup

library(AnnotationGx)

Mapping from chemical name to PubChem CID

The main function provided by the package is mapCompound2CID, which maps compound names to PubChem CIDs.

mapCompound2CID("aspirin")
#>       name  cids
#>     <char> <int>
#> 1: aspirin  2244

You can pass in a character vector of compound names to get the CIDs for all of them at once.

drugs <- c(
  "Aspirin", "Erlotinib", "Acadesine", "Camptothecin", "Vincaleukoblastine",
  "Cisplatin"
)

mapCompound2CID(drugs)
#>                  name    cids
#>                <char>   <int>
#> 1:            Aspirin    2244
#> 2:          Erlotinib  176870
#> 3:          Acadesine   17513
#> 4:       Camptothecin   24360
#> 5: Vincaleukoblastine   13342
#> 6: Vincaleukoblastine  241903
#> 7: Vincaleukoblastine 3823887
#> 8:          Cisplatin 5702198

A single name can map to more than one CID (multimapping). This is the case for ‘Vincaleukoblastine’ in the above query. In cases of multimapping, the first entry usually has the highest similarity to the requested drug. To keep only the first occurrence of each drug name, use the first = TRUE argument:

mapCompound2CID(drugs, first = TRUE)
#>                  name    cids
#>                <char>   <int>
#> 1:            Aspirin    2244
#> 2:          Erlotinib  176870
#> 3:          Acadesine   17513
#> 4:       Camptothecin   24360
#> 5: Vincaleukoblastine   13342
#> 6:          Cisplatin 5702198
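Before discarding the extra matches, it can be useful to see which names multimapped. The sketch below works offline on a hand-copied subset of the output above, stored as a plain data.frame (an assumption made to keep the example self-contained; the package itself returns a data.table). It lists the multimapped names and then keeps the first row per name, which mirrors what first = TRUE returns in the example above.

```r
# Subset of the mapping shown above, copied by hand for illustration
mapped <- data.frame(
  name = c("Aspirin", "Vincaleukoblastine", "Vincaleukoblastine",
           "Vincaleukoblastine", "Cisplatin"),
  cids = c(2244L, 13342L, 241903L, 3823887L, 5702198L)
)

# Names that appear more than once, i.e. map to several CIDs
counts <- table(mapped$name)
multimapped <- names(counts[counts > 1])

# Keep only the first row for each name
first_only <- mapped[!duplicated(mapped$name), ]

multimapped
first_only
```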

If a compound can’t be mapped, NA is returned for it and a warning is issued. Details of each failed query are stored in the 'failed' attribute of the result.

(result <- mapCompound2CID(c(drugs, "non existent compound", "another bad compound"), first = TRUE))
#> [22:05:30][WARNING][AnnotationGx::getPubchemCompound]  Some queries failed. See the 'failed' object for details. 
#>                     name    cids
#>                   <char>   <int>
#> 1:               Aspirin    2244
#> 2:             Erlotinib  176870
#> 3:             Acadesine   17513
#> 4:          Camptothecin   24360
#> 5:    Vincaleukoblastine   13342
#> 6:             Cisplatin 5702198
#> 7: non existent compound      NA
#> 8:  another bad compound      NA

failed <- attributes(result)$failed

# get the list of failed inputs
names(failed)
#> [1] "non existent compound" "another bad compound"

# get the error message for the failed input
print(failed[1])
#> $`non existent compound`
#> $`non existent compound`$Code
#> [1] "PUGREST.NotFound"
#> 
#> $`non existent compound`$Message
#> [1] "No CID found"
#> 
#> $`non existent compound`$Details
#> [1] "No CID found that matches the given name"
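When many queries fail, a flat table is easier to scan than the nested list. The sketch below rebuilds a list with the same shape as the `failed` object printed above and collapses it into one row per input. Only the first entry is copied from the output; the second entry's contents are assumed to have the same fields for illustration.

```r
# Mock of the 'failed' attribute. The first entry is copied from the output
# above; the second is an ASSUMPTION with the same structure.
failed <- list(
  "non existent compound" = list(
    Code = "PUGREST.NotFound",
    Message = "No CID found",
    Details = "No CID found that matches the given name"
  ),
  "another bad compound" = list(
    Code = "PUGREST.NotFound",
    Message = "No CID found",
    Details = "No CID found that matches the given name"
  )
)

# One row per failed input, with its error code and message
failure_table <- do.call(rbind, lapply(names(failed), function(input) {
  data.frame(
    input = input,
    code = failed[[input]]$Code,
    message = failed[[input]]$Message,
    stringsAsFactors = FALSE
  )
}))

failure_table
```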

Mapping from PubChem CID to Properties

Once CIDs are obtained, they can be used to query the properties of the compounds. To view the properties available from PubChem, use the getPubchemProperties function.

getPubchemProperties()
#>                         name         type
#>                       <char>       <char>
#>  1:                      CID          int
#>  2:         MolecularFormula       string
#>  3:          MolecularWeight       string
#>  4:          CanonicalSMILES       string
#>  5:           IsomericSMILES       string
#>  6:                    InChI       string
#>  7:                 InChIKey       string
#>  8:                IUPACName       string
#>  9:                    XLogP       double
#> 10:                ExactMass       string
#> 11:         MonoisotopicMass       string
#> 12:                     TPSA       double
#> 13:               Complexity          int
#> 14:                   Charge          int
#> 15:          HBondDonorCount          int
#> 16:       HBondAcceptorCount          int
#> 17:       RotatableBondCount          int
#> 18:           HeavyAtomCount          int
#> 19:         IsotopeAtomCount          int
#> 20:          AtomStereoCount          int
#> 21:   DefinedAtomStereoCount          int
#> 22: UndefinedAtomStereoCount          int
#> 23:          BondStereoCount          int
#> 24:   DefinedBondStereoCount          int
#> 25: UndefinedBondStereoCount          int
#> 26:        CovalentUnitCount          int
#> 27:                 Volume3D       double
#> 28:      XStericQuadrupole3D       double
#> 29:      YStericQuadrupole3D       double
#> 30:      ZStericQuadrupole3D       double
#> 31:           FeatureCount3D          int
#> 32:   FeatureAcceptorCount3D          int
#> 33:      FeatureDonorCount3D          int
#> 34:      FeatureAnionCount3D          int
#> 35:     FeatureCationCount3D          int
#> 36:       FeatureRingCount3D          int
#> 37: FeatureHydrophobeCount3D          int
#> 38:     ConformerModelRMSD3D       double
#> 39:    EffectiveRotorCount3D       double
#> 40:         ConformerCount3D          int
#> 41:            Fingerprint2D base64Binary
#> 42:                    Title       string
#> 43:              PatentCount          int
#> 44:        PatentFamilyCount          int
#> 45:          LiteratureCount          int
#>                         name         type

After deciding which properties to query, you can use the mapCID2Properties function to get the properties for a specific CID.

properties <- c("Title", "MolecularFormula", "InChIKey", "MolecularWeight")

# Need to remove NA values from the query as they will cause an error
result[!is.na(cids), mapCID2Properties(ids = cids, properties = properties)]
#>        CID MolecularFormula MolecularWeight                    InChIKey
#>      <int>           <char>          <char>                      <char>
#> 1:    2244           C9H8O4          180.16 BSYNRYMUTXBXSQ-UHFFFAOYSA-N
#> 2:  176870       C22H23N3O4           393.4 AAKJLRGGTJKAMG-UHFFFAOYSA-N
#> 3:   17513        C9H14N4O5          258.23 RTRQQBHATOEIAF-UUOKFMHZSA-N
#> 4:   24360       C20H16N2O4           348.4 VSJKWCGYPAHWDS-FQEVSTJZSA-N
#> 5:   13342       C46H58N4O9           811.0 JXLYSJRDGCGARV-CFWMRBGOSA-N
#> 6: 5702198        Cl2H6N2Pt          300.05 LXZZYRPGZAFOLE-UHFFFAOYSA-L
#>                     Title
#>                    <char>
#> 1:                Aspirin
#> 2:              Erlotinib
#> 3:              Acadesine
#> 4:           Camptothecin
#> 5:            Vinblastine
#> 6: azane;dichloroplatinum
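The property table is keyed by CID and no longer carries the names that were queried, and MolecularWeight comes back as a character column. The sketch below shows one way to address both, using base R on small hand-copied subsets of the two outputs above (plain data.frames are an assumption for self-containment): convert the weight to numeric, then left-join the properties onto the name mapping so unmapped names are kept with NA properties.

```r
# Hand-copied subsets of the mapping and property outputs shown above
mapping <- data.frame(
  name = c("Aspirin", "Erlotinib", "non existent compound"),
  cids = c(2244L, 176870L, NA)
)
props <- data.frame(
  CID = c(2244L, 176870L),
  MolecularWeight = c("180.16", "393.4"),
  Title = c("Aspirin", "Erlotinib")
)

# MolecularWeight is returned as text; convert it before doing arithmetic
props$MolecularWeight <- as.numeric(props$MolecularWeight)

# Left join: keep every queried name, even those without a CID
annotated <- merge(mapping, props, by.x = "cids", by.y = "CID", all.x = TRUE)

annotated
```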

Mapping from PubChem CID to Annotations

PubChem’s PUG VIEW API provides access to annotations from external sources such as UniProt, ChEBI, and ChEMBL, given a specific identifier. Before querying annotations, we need to know the exact heading we want to query.

You can use the getPubchemAnnotationHeadings function to get the available annotation headings and types.

Get ALL available annotation headings:

getPubchemAnnotationHeadings()
#>               Heading     Type
#>                <char>   <char>
#>   1:  11B NMR Spectra Compound
#>   2:  13C NMR Spectra Compound
#>   3:  15N NMR Spectra Compound
#>   4:  17O NMR Spectra Compound
#>   5:  19F NMR Spectra Compound
#>  ---                          
#> 640: Wiley References Compound
#> 641:      WormBase ID     Gene
#> 642:      WormBase ID  Protein
#> 643:  Xenbase Gene ID     Gene
#> 644:          ZFIN ID     Gene

Get annotation headings for a specific type:

getPubchemAnnotationHeadings(type = "Compound")
#>                                        Heading     Type
#>                                         <char>   <char>
#>   1:                           11B NMR Spectra Compound
#>   2:                           13C NMR Spectra Compound
#>   3:                           15N NMR Spectra Compound
#>   4:                           17O NMR Spectra Compound
#>   5:                           19F NMR Spectra Compound
#>  ---                                                   
#> 478: Volatilization from Water/Soil (Complete) Compound
#> 479:                   WHO Essential Medicines Compound
#> 480:                                  Wikidata Compound
#> 481:                                 Wikipedia Compound
#> 482:                          Wiley References Compound

Get annotation headings for a specific heading:

getPubchemAnnotationHeadings(heading = "ChEMBL ID")
#>      Heading     Type
#>       <char>   <char>
#> 1: ChEMBL ID Compound

Get annotation headings for a specific type and heading:

getPubchemAnnotationHeadings(type = "Compound", heading = "CAS")
#>           Heading     Type
#>            <char>   <char>
#> 1:            CAS Compound
#> 2: Deprecated CAS Compound
#> 3:    Related CAS Compound
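The query for "CAS" above returns every heading that contains that text, which suggests the heading argument is matched as a pattern rather than exactly (an inference from this output, not documented behaviour). If only the exact heading is wanted, the result can be filtered afterwards. The sketch below does this on a hand-copied data.frame version of the output.

```r
# Hand-copied version of the output above, as a plain data.frame
headings <- data.frame(
  Heading = c("CAS", "Deprecated CAS", "Related CAS"),
  Type = "Compound"
)

# Keep only the row whose heading is exactly "CAS"
exact <- headings[headings$Heading == "CAS", ]

exact
```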

Query annotations for a specific CID and heading

We can then use the heading to query the annotations for a specific CID.

result[!is.na(cids), CAS := annotatePubchemCompound(cids, "CAS")]
result
#>                     name    cids                                CAS
#>                   <char>   <int>                             <char>
#> 1:               Aspirin    2244                            50-78-2
#> 2:             Erlotinib  176870                        183321-74-6
#> 3:             Acadesine   17513                          2627-69-2
#> 4:          Camptothecin   24360                          7689-03-4
#> 5:    Vincaleukoblastine   13342                           865-21-4
#> 6:             Cisplatin 5702198 15663-27-1; 26035-31-4; 14913-33-8
#> 7: non existent compound      NA                               <NA>
#> 8:  another bad compound      NA                               <NA>
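Some compounds, such as Cisplatin above, come back with several CAS numbers joined into one string. The sketch below splits such strings into one value per row, assuming the separator is always "; " as it appears in the output above. The input vector is a hand-copied subset of the CAS column.

```r
# Hand-copied subset of the CAS column, named by compound
cas <- c(
  Aspirin = "50-78-2",
  Cisplatin = "15663-27-1; 26035-31-4; 14913-33-8",
  "non existent compound" = NA
)

# Split on the literal separator; NA entries stay as a single NA
cas_list <- strsplit(cas, "; ", fixed = TRUE)

# Long format: one row per (name, CAS number) pair
cas_long <- data.frame(
  name = rep(names(cas_list), lengths(cas_list)),
  cas = unlist(cas_list, use.names = FALSE)
)

cas_long
```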

References

[1] PUG REST. PubChem Docs [website]. Retrieved from https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest.
[2] Kim S, Thiessen PA, Cheng T, Yu B, Bolton EE. An update on PUG-REST: RESTful interface for programmatic access to PubChem. Nucleic Acids Res. 2018 Jul 2; 46(W1):W563-570. doi:10.1093/nar/gky294.
[3] PUG VIEW. PubChem Docs [website]. Retrieved from https://pubchemdocs.ncbi.nlm.nih.gov/pug-view.
[4] Kim S, Thiessen PA, Cheng T, Zhang J, Gindulyte A, Bolton EE. PUG-View: programmatic access to chemical annotations integrated in PubChem. J Cheminform. 2019 Aug 9; 11:56. doi:10.1186/s13321-019-0375-2.