Querying PubChem
Introduction to PubChem APIs
PubChem is a database of chemical molecules and their biological
activities. It is part of the National Center for Biotechnology
Information (NCBI), which is part of the National Institutes of Health
(NIH). PubChem provides a set of APIs to query its database, and the
AnnotationGx package provides a set of functions that query PubChem
using these APIs.
The first of these APIs is the PubChem PUG REST API, which is designed to:
- make specific queries based on some input identifier and return data that PubChem has labelled or computed internally [1].
- look up information about a specific chemical compound, such as the standardized PubChem compound identifier (CID) for a given chemical name or SMILES string, or the chemical structure for a given CID.
- provide access to a wide range of data, including chemical properties, bioassay data, and chemical classification data, given a specific identifier.
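To make the "input identifier in, data out" idea concrete, the sketch below assembles a PUG REST request URL by hand. The helper build_pug_rest_url is hypothetical and is not part of AnnotationGx; it only illustrates the URL layout (input namespace, identifier, operation, output format) and does not claim to show how the package builds its requests internally.

```r
# Hypothetical helper (NOT an AnnotationGx function): assembles a PUG REST URL
# of the form <base>/compound/<namespace>/<identifier>/<operation>/<output>.
build_pug_rest_url <- function(input, namespace = "name", operation = "cids",
                               output = "JSON") {
  base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
  # Percent-encode the identifier so names containing spaces remain valid
  encoded <- utils::URLencode(input, reserved = TRUE)
  paste(base, "compound", namespace, encoded, operation, output, sep = "/")
}

# Name -> CID lookup
build_pug_rest_url("aspirin")

# CID -> property lookup
build_pug_rest_url("2244", namespace = "cid", operation = "property/InChIKey")
```

No request is sent here; the URLs can be pasted into a browser to inspect the raw response.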
The second API is the PubChem PUG VIEW API, which is designed to:
- give access to aggregated annotations for a given chemical compound [3]. These annotations are mapped to PubChem records but are not curated by PubChem itself.
- in other words, provide access to annotations from external sources such as UniProt, ChEBI, and ChEMBL, given a specific identifier.
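A PUG VIEW request follows a similar pattern: a record identifier plus an optional annotation heading. The helper below, build_pug_view_url, is a hypothetical illustration and not an AnnotationGx function; it is a minimal sketch that assumes a compound record and a single heading filter.

```r
# Hypothetical helper (NOT an AnnotationGx function): assembles a PUG VIEW URL
# for one compound record, optionally restricted to one annotation heading.
build_pug_view_url <- function(cid, heading = NULL, output = "JSON") {
  base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound"
  # format() avoids scientific notation for large identifiers
  cid_chr <- format(cid, scientific = FALSE, trim = TRUE)
  url <- paste(base, cid_chr, output, sep = "/")
  if (!is.null(heading)) {
    # Headings may contain spaces, so percent-encode them
    url <- paste0(url, "?heading=", utils::URLencode(heading, reserved = TRUE))
  }
  url
}

build_pug_view_url(2244, heading = "CAS")
```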
Mapping from chemical name to PubChem CID
The main function provided by the package is mapCompound2CID.
mapCompound2CID("aspirin")
#> name cids
#> <char> <int>
#> 1: aspirin 2244
You can pass a character vector of compound names to get the CIDs for all of them at once.
drugs <- c(
"Aspirin", "Erlotinib", "Acadesine", "Camptothecin", "Vincaleukoblastine",
"Cisplatin"
)
mapCompound2CID(drugs)
#> name cids
#> <char> <int>
#> 1: Aspirin 2244
#> 2: Erlotinib 176870
#> 3: Acadesine 17513
#> 4: Camptothecin 24360
#> 5: Vincaleukoblastine 13342
#> 6: Vincaleukoblastine 241903
#> 7: Vincaleukoblastine 3823887
#> 8: Cisplatin 5702198
A single name can map to more than one CID. This is the case for
‘Vincaleukoblastine’ in the above query. When a name maps to multiple
CIDs, the first entry usually has the highest similarity to the
requested drug. To keep only the first match for each drug name, use
the first = TRUE argument:
mapCompound2CID(drugs, first = TRUE)
#> name cids
#> <char> <int>
#> 1: Aspirin 2244
#> 2: Erlotinib 176870
#> 3: Acadesine 17513
#> 4: Camptothecin 24360
#> 5: Vincaleukoblastine 13342
#> 6: Cisplatin 5702198
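If you want to see which names were multimapped before discarding the extra CIDs, you can run the query without first = TRUE and inspect the duplicates yourself. The sketch below uses a hand-built base R data frame that copies a few rows of the output above; it only illustrates the idea of keeping the first row per name and is not the package's implementation.

```r
# Mock of the mapping output shown above (values copied from the printed table)
mapped <- data.frame(
  name = c("Aspirin", "Vincaleukoblastine", "Vincaleukoblastine",
           "Vincaleukoblastine", "Cisplatin"),
  cids = c(2244L, 13342L, 241903L, 3823887L, 5702198L)
)

# Names that appear more than once, i.e. names that map to several CIDs
multimapped <- unique(mapped$name[duplicated(mapped$name)])

# Keep only the first row for each name
first_hits <- mapped[!duplicated(mapped$name), ]

multimapped
first_hits
```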
If a compound can’t be mapped, NA is returned for it and a warning is
issued. The details of each failed query are stored in the failed
attribute of the result.
(result <- mapCompound2CID(c(drugs, "non existent compound", "another bad compound"), first = TRUE))
#> [22:05:30][WARNING][AnnotationGx::getPubchemCompound] Some queries failed. See the 'failed' object for details.
#> name cids
#> <char> <int>
#> 1: Aspirin 2244
#> 2: Erlotinib 176870
#> 3: Acadesine 17513
#> 4: Camptothecin 24360
#> 5: Vincaleukoblastine 13342
#> 6: Cisplatin 5702198
#> 7: non existent compound NA
#> 8: another bad compound NA
failed <- attributes(result)$failed
# get the list of failed inputs
names(failed)
#> [1] "non existent compound" "another bad compound"
# get the error message for the failed input
print(failed[1])
#> $`non existent compound`
#> $`non existent compound`$Code
#> [1] "PUGREST.NotFound"
#>
#> $`non existent compound`$Message
#> [1] "No CID found"
#>
#> $`non existent compound`$Details
#> [1] "No CID found that matches the given name"
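When many queries fail, a flat table of the failures is easier to scan than the nested list. The sketch below rebuilds a failed object by hand with the structure printed above. The contents of the second entry are an assumption (only the first entry is shown in the output), and the name failure_table is illustrative.

```r
# Mock of the 'failed' attribute; the first entry mirrors the printed output,
# the second entry is ASSUMED to have the same fields and values.
failed <- list(
  "non existent compound" = list(
    Code = "PUGREST.NotFound",
    Message = "No CID found",
    Details = "No CID found that matches the given name"
  ),
  "another bad compound" = list(
    Code = "PUGREST.NotFound",
    Message = "No CID found",
    Details = "No CID found that matches the given name"
  )
)

# One row per failed input, with the error code and message side by side
failure_table <- data.frame(
  name = names(failed),
  code = vapply(failed, function(x) x$Code, character(1)),
  message = vapply(failed, function(x) x$Message, character(1)),
  row.names = NULL
)

failure_table
```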
Mapping from PubChem CID to Properties
Once CIDs are obtained, they can be used to query the properties of
the corresponding compounds. To view the properties available from
PubChem, use the getPubchemProperties function.
getPubchemProperties()
#> name type
#> <char> <char>
#> 1: CID int
#> 2: MolecularFormula string
#> 3: MolecularWeight string
#> 4: CanonicalSMILES string
#> 5: IsomericSMILES string
#> 6: InChI string
#> 7: InChIKey string
#> 8: IUPACName string
#> 9: XLogP double
#> 10: ExactMass string
#> 11: MonoisotopicMass string
#> 12: TPSA double
#> 13: Complexity int
#> 14: Charge int
#> 15: HBondDonorCount int
#> 16: HBondAcceptorCount int
#> 17: RotatableBondCount int
#> 18: HeavyAtomCount int
#> 19: IsotopeAtomCount int
#> 20: AtomStereoCount int
#> 21: DefinedAtomStereoCount int
#> 22: UndefinedAtomStereoCount int
#> 23: BondStereoCount int
#> 24: DefinedBondStereoCount int
#> 25: UndefinedBondStereoCount int
#> 26: CovalentUnitCount int
#> 27: Volume3D double
#> 28: XStericQuadrupole3D double
#> 29: YStericQuadrupole3D double
#> 30: ZStericQuadrupole3D double
#> 31: FeatureCount3D int
#> 32: FeatureAcceptorCount3D int
#> 33: FeatureDonorCount3D int
#> 34: FeatureAnionCount3D int
#> 35: FeatureCationCount3D int
#> 36: FeatureRingCount3D int
#> 37: FeatureHydrophobeCount3D int
#> 38: ConformerModelRMSD3D double
#> 39: EffectiveRotorCount3D double
#> 40: ConformerCount3D int
#> 41: Fingerprint2D base64Binary
#> 42: Title string
#> 43: PatentCount int
#> 44: PatentFamilyCount int
#> 45: LiteratureCount int
#> name type
After deciding which properties to query, you can use the
mapCID2Properties function to get those properties for one or more
CIDs.
properties <- c("Title", "MolecularFormula", "InChIKey", "MolecularWeight")
# Need to remove NA values from the query as they will cause an error
result[!is.na(cids), mapCID2Properties(ids = cids, properties = properties)]
#> CID MolecularFormula MolecularWeight InChIKey
#> <int> <char> <char> <char>
#> 1: 2244 C9H8O4 180.16 BSYNRYMUTXBXSQ-UHFFFAOYSA-N
#> 2: 176870 C22H23N3O4 393.4 AAKJLRGGTJKAMG-UHFFFAOYSA-N
#> 3: 17513 C9H14N4O5 258.23 RTRQQBHATOEIAF-UUOKFMHZSA-N
#> 4: 24360 C20H16N2O4 348.4 VSJKWCGYPAHWDS-FQEVSTJZSA-N
#> 5: 13342 C46H58N4O9 811.0 JXLYSJRDGCGARV-CFWMRBGOSA-N
#> 6: 5702198 Cl2H6N2Pt 300.05 LXZZYRPGZAFOLE-UHFFFAOYSA-L
#> Title
#> <char>
#> 1: Aspirin
#> 2: Erlotinib
#> 3: Acadesine
#> 4: Camptothecin
#> 5: Vinblastine
#> 6: azane;dichloroplatinum
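Note that MolecularWeight is returned as a character column in the output above, so it needs converting before any numeric comparison or sorting. The sketch below uses a hand-built data frame with three rows copied from that output; it is a base R illustration and makes no claim about what mapCID2Properties does internally.

```r
# Mock of part of the properties table (values copied from the output above)
props <- data.frame(
  CID = c(2244L, 176870L, 17513L),
  MolecularWeight = c("180.16", "393.4", "258.23")
)

# Convert the character column to numeric before doing arithmetic on it
props$MolecularWeight <- as.numeric(props$MolecularWeight)

# Numeric operations now behave as expected
heaviest <- props$CID[which.max(props$MolecularWeight)]
heaviest
```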
Mapping from PubChem CID to Annotations
PubChem’s PUG VIEW API provides access to annotations from external sources such as UniProt, ChEBI, and ChEMBL, given a specific identifier. Before querying annotations, we need to know the exact heading we want to query.
You can use the getPubchemAnnotationHeadings function to get the
available annotation headings and types.
Get ALL available annotation headings:
getPubchemAnnotationHeadings()
#> Heading Type
#> <char> <char>
#> 1: 11B NMR Spectra Compound
#> 2: 13C NMR Spectra Compound
#> 3: 15N NMR Spectra Compound
#> 4: 17O NMR Spectra Compound
#> 5: 19F NMR Spectra Compound
#> ---
#> 640: Wiley References Compound
#> 641: WormBase ID Gene
#> 642: WormBase ID Protein
#> 643: Xenbase Gene ID Gene
#> 644: ZFIN ID Gene
Get annotation headings for a specific type:
getPubchemAnnotationHeadings(type = "Compound")
#> Heading Type
#> <char> <char>
#> 1: 11B NMR Spectra Compound
#> 2: 13C NMR Spectra Compound
#> 3: 15N NMR Spectra Compound
#> 4: 17O NMR Spectra Compound
#> 5: 19F NMR Spectra Compound
#> ---
#> 478: Volatilization from Water/Soil (Complete) Compound
#> 479: WHO Essential Medicines Compound
#> 480: Wikidata Compound
#> 481: Wikipedia Compound
#> 482: Wiley References Compound
Get annotation headings for a specific heading:
getPubchemAnnotationHeadings(heading = "ChEMBL ID")
#> Heading Type
#> <char> <char>
#> 1: ChEMBL ID Compound
Get annotation headings for a specific type and heading:
getPubchemAnnotationHeadings(type = "Compound", heading = "CAS")
#> Heading Type
#> <char> <char>
#> 1: CAS Compound
#> 2: Deprecated CAS Compound
#> 3: Related CAS Compound
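The query for "CAS" above returned three headings, which suggests that the heading argument matches partially rather than exactly. That is an inference from the printed output, not documented behaviour. The sketch below shows the difference between a partial and an exact match on a small hand-built table, so you can narrow the result to the one heading you intend to query.

```r
# Mock of a few annotation headings (values copied from the outputs above)
headings <- data.frame(
  Heading = c("CAS", "Deprecated CAS", "Related CAS", "ChEMBL ID", "WormBase ID"),
  Type = c("Compound", "Compound", "Compound", "Compound", "Gene")
)

# Partial match: every compound heading that contains "CAS"
partial <- headings[grepl("CAS", headings$Heading, fixed = TRUE) &
                      headings$Type == "Compound", ]

# Exact match: only the heading that is exactly "CAS"
exact <- headings[headings$Heading == "CAS", ]

partial$Heading
exact$Heading
```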
Query annotations for a specific CID and heading
We can then use the heading to query the annotations for a specific CID.
result[!is.na(cids), CAS := annotatePubchemCompound(cids, "CAS")]
result
#> name cids CAS
#> <char> <int> <char>
#> 1: Aspirin 2244 50-78-2
#> 2: Erlotinib 176870 183321-74-6
#> 3: Acadesine 17513 2627-69-2
#> 4: Camptothecin 24360 7689-03-4
#> 5: Vincaleukoblastine 13342 865-21-4
#> 6: Cisplatin 5702198 15663-27-1; 26035-31-4; 14913-33-8
#> 7: non existent compound NA <NA>
#> 8: another bad compound NA <NA>
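Some compounds, such as Cisplatin above, come back with several CAS numbers joined by "; " in a single string. If you need one CAS number per row, the string can be split afterwards. The sketch below works on a hand-built named vector copied from the output; the names cas_list and cas_long are illustrative.

```r
# Mock of the CAS column (values copied from the output above)
cas <- c(
  Aspirin = "50-78-2",
  Cisplatin = "15663-27-1; 26035-31-4; 14913-33-8",
  "non existent compound" = NA
)

# Split on ";" followed by optional whitespace; NA entries stay NA
cas_list <- strsplit(cas, ";\\s*")

# Long format: one row per (name, CAS number) pair
cas_long <- data.frame(
  name = rep(names(cas_list), lengths(cas_list)),
  CAS = unlist(cas_list, use.names = FALSE)
)

cas_long
```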
References
1. PUG REST. PubChem Docs [website]. Retrieved from https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest.
2. Kim S, Thiessen PA, Cheng T, Yu B, Bolton EE. An update on PUG-REST: RESTful interface for programmatic access to PubChem. Nucleic Acids Res. 2018 July 2; 46(W1):W563-570. doi:10.1093/nar/gky294.
3. PUG VIEW. PubChem Docs [website]. Retrieved from https://pubchemdocs.ncbi.nlm.nih.gov/pug-view.
4. Kim S, Thiessen PA, Cheng T, Zhang J, Gindulyte A, Bolton EE. PUG-View: programmatic access to chemical annotations integrated in PubChem. J Cheminform. 2019 Aug 9; 11:56. doi:10.1186/s13321-019-0375-2.