Module 1: Introduction to pharmacogenomics

Jermiah J. Joseph

Princess Margaret Cancer Centre
jermiah.joseph@uhn.ca

Julia Nguyen

Princess Margaret Cancer Centre
julia.nguyen@uhn.ca

Nikta Feizi

Princess Margaret Cancer Centre
nikta.feizi@uhn.ca

18 October 2024

Source: vignettes/Module1.Rmd

Lab 1 Overview

Instructor(s) name(s) and contact information

Jermiah Joseph jermiah.joseph@uhn.ca
Nikta Feizi nikta.feizi@uhn.ca
Julia Nguyen julia.nguyen@uhn.ca

Lab Description

Learning goals

Understand the data structure of a PharmacoSet
Learn how to access features and metadata from a PharmacoSet
Learn how to design linear multivariate predictors
Learn how to filter out outliers and missing values

Learning objectives

Describe the use cases for PharmacoGx in Pharmacogenomics
Understand the structure of the CoreSet and PharmacoSet classes to facilitate their use in subsequent analyses
Download/load a PharmacoSet using PharmacoGx or orcestra.ca
Subset and filter a PharmacoSet by samples and/or treatments
Access the molecular features, dose-response, and metadata contained within the PharmacoSet

Getting Started

Exploring preclinical datasets for pharmacogenomic analysis

See the package Reference for a list of the available subsetted datasets.

Molecular profiles

We will start with RNA-Seq data as a simple example.

data(GDSC_rnaseq)
GDSC_rnaseq |> head()
#>    model_id model_name data_source   gene_id gene_symbol read_count  fpkm
#> 1 SIDM00794       A388      sanger SIDG00082       ABCC6         10  0.01
#> 2 SIDM00794       A388      sanger SIDG00106       ABCF3      20264 25.95
#> 3 SIDM00794       A388      sanger SIDG00108       ABCG2       1070  1.47
#> 4 SIDM00794       A388      sanger SIDG00148        ABI3          6  0.02
#> 5 SIDM00794       A388      sanger SIDG00177      ACADSB       1410  1.50
#> 6 SIDM00794       A388      sanger SIDG00198       ACER1         12  0.06

A few key things to notice here: there are identifiers for the sample (model_id, model_name), identifiers for the gene (gene_id, gene_symbol), as well as two expression values (read_count, fpkm).

When we create our expression matrices, we will select one sample identifier, one feature (gene) identifier, and one expression value.

Metadata / annotation files

When preparing for pharmacogenomic analysis, it is ideal to have accompanying metadata for both the samples (cell lines) and the features (genes).

We have made this data available through the package as well. We’ll start with the gene annotations:

data(GDSC_gene_identifiers)
GDSC_gene_identifiers |> head()
#>     gene_id cosmic_gene_symbol ensembl_gene_id entrez_id    hgnc_id hgnc_symbol
#> 1 SIDG00001               A1BG ENSG00000121410         1     HGNC:5        A1BG
#> 2 SIDG00002                    ENSG00000268895    503538 HGNC:37133    A1BG-AS1
#> 3 SIDG00003               A1CF ENSG00000148584     29974 HGNC:24086        A1CF
#> 4 SIDG00004                A2M ENSG00000175899         2     HGNC:7         A2M
#> 5 SIDG00005                    ENSG00000245105    144571 HGNC:27057     A2M-AS1
#> 6 SIDG00006              A2ML1 ENSG00000166535    144568 HGNC:23336       A2ML1
#>   refseq_id uniprot_id
#> 1 NM_130786     P04217
#> 2 NR_015380           
#> 3 NM_014576     Q9NQ94
#> 4 NM_000014     P01023
#> 5 NR_026971           
#> 6 NM_144670     A8K2U0

The data above has been provided by GDSC and enables mapping across various gene annotations. It is important to identify which gene annotation maps to the RNA-Seq data and to check for completeness.

GDSC_rnaseq$gene_id %in% GDSC_gene_identifiers$gene_id |> table()
#> 
#>   TRUE 
#> 135000

GDSC_rnaseq$gene_symbol %in% GDSC_gene_identifiers$hgnc_symbol |> table()
#> 
#>  FALSE   TRUE 
#>   2100 132900

GDSC_rnaseq$gene_symbol %in% GDSC_gene_identifiers$cosmic_gene_symbol |> table()
#> 
#> FALSE  TRUE 
#> 69600 65400

We can see that gene_id matches every row of our RNA-Seq data, whereas hgnc_symbol and cosmic_gene_symbol fail to match 2,100 and 69,600 rows, respectively (see the counts under FALSE). We will therefore use gene_id for downstream analysis.
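The same completeness check can be summarised as a proportion, and the unmatched identifiers can be listed with setdiff(). A minimal sketch on small made-up vectors (rna_symbols and annot_symbols are hypothetical stand-ins for GDSC_rnaseq$gene_symbol and GDSC_gene_identifiers$hgnc_symbol):

```r
# Toy stand-ins for the two identifier columns (hypothetical values)
rna_symbols   <- c("ABCC6", "ABCF3", "ABCG2", "ABI3", "GENE_X")
annot_symbols <- c("ABCC6", "ABCF3", "ABCG2", "ABI3", "ACADSB")

# Proportion of RNA-Seq identifiers that are found in the annotation
matched_fraction <- mean(rna_symbols %in% annot_symbols)
matched_fraction

# Identifiers present in the RNA-Seq data but absent from the annotation
unmatched <- setdiff(rna_symbols, annot_symbols)
unmatched
```

Listing the unmatched identifiers often shows whether the mismatch comes from outdated symbols or from genes that are simply not annotated.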

Now we move on to the cell line annotations. Several attributes are available; we first confirm which identifier maps to the RNA-Seq data.

data(GDSC_model_list)
print(colnames(GDSC_model_list)[1:39])
#>  [1] "model_id"                "sample_id"              
#>  [3] "patient_id"              "parent_id"              
#>  [5] "model_name"              "synonyms"               
#>  [7] "tissue"                  "cancer_type"            
#>  [9] "cancer_type_ncit_id"     "tissue_status"          
#> [11] "sample_site"             "cancer_type_detail"     
#> [13] "model_type"              "growth_properties"      
#> [15] "model_treatment"         "sampling_day"           
#> [17] "sampling_month"          "sampling_year"          
#> [19] "doi"                     "pmed"                   
#> [21] "msi_status"              "ploidy_snp6"            
#> [23] "ploidy_wes"              "ploidy_wgs"             
#> [25] "mutational_burden"       "model_comments"         
#> [27] "model_relations_comment" "COSMIC_ID"              
#> [29] "BROAD_ID"                "CCLE_ID"                
#> [31] "RRID"                    "HCMI"                   
#> [33] "suppliers"               "supplier"               
#> [35] "cat_number"              "species"                
#> [37] "gender"                  "ethnicity"              
#> [39] "age_at_sampling"

GDSC_rnaseq$model_id %in% GDSC_model_list$model_id |> table()
#> 
#>   TRUE 
#> 135000

Below are some examples of other available variables that may be of interest for downstream analysis.

GDSC_model_list[, c("model_id", "model_name", "tissue", "ploidy_wes", "mutational_burden", "gender", "ethnicity")] |> head()
#>    model_id model_name                      tissue ploidy_wes mutational_burden
#> 1 SIDM01774      PK-59                    Pancreas   3.510751             24.79
#> 2 SIDM00192   SNU-1033             Large Intestine   2.780367             23.29
#> 3 SIDM01447    SNU-466      Central Nervous System   2.054101             20.58
#> 4 SIDM01554  IST-MES-2                        Lung   1.851007             22.92
#> 5 SIDM01689     MUTZ-5 Haematopoietic and Lymphoid   1.941110             28.76
#> 6 SIDM01460      TM-31      Central Nervous System   2.885529             25.89
#>    gender  ethnicity
#> 1 Unknown    Unknown
#> 2  Female East Asian
#> 3    Male    Unknown
#> 4    Male      White
#> 5    Male    Unknown
#> 6  Female East Asian

Drug response data

Finally, we’ll load in the corresponding drug response data for these cell lines.

data(GDSC_drug_response)
GDSC_drug_response |> head()
#>   DATASET NLME_RESULT_ID NLME_CURVE_ID COSMIC_ID CELL_LINE_NAME SANGER_MODEL_ID
#> 1   GDSC2            343      15946320    683667         PFSK-1       SIDM01132
#> 2   GDSC2            343      15946560    684052           A673       SIDM00848
#> 3   GDSC2            343      15946840    684057            ES5       SIDM00263
#> 4   GDSC2            343      15947099    684059            ES7       SIDM00269
#> 5   GDSC2            343      15947381    684062          EW-11       SIDM00203
#> 6   GDSC2            343      15947663    684072        SK-ES-1       SIDM01111
#>      TCGA_DESC DRUG_ID DRUG_NAME PUTATIVE_TARGET     PATHWAY_NAME COMPANY_ID
#> 1           MB    1017  Olaparib    PARP1, PARP2 Genome integrity       1046
#> 2 UNCLASSIFIED    1017  Olaparib    PARP1, PARP2 Genome integrity       1046
#> 3 UNCLASSIFIED    1017  Olaparib    PARP1, PARP2 Genome integrity       1046
#> 4 UNCLASSIFIED    1017  Olaparib    PARP1, PARP2 Genome integrity       1046
#> 5 UNCLASSIFIED    1017  Olaparib    PARP1, PARP2 Genome integrity       1046
#> 6 UNCLASSIFIED    1017  Olaparib    PARP1, PARP2 Genome integrity       1046
#>   WEBRELEASE MIN_CONC MAX_CONC  LN_IC50      AUC     RMSE   Z_SCORE
#> 1          Y 0.010005       10 4.488810 0.974081 0.072391  0.201882
#> 2          Y 0.010005       10 1.782152 0.842679 0.068257 -1.881795
#> 3          Y 0.010005       10 2.116072 0.869909 0.070087 -1.624732
#> 4          Y 0.010005       10 1.685857 0.834608 0.092726 -1.955925
#> 5          Y 0.010005       10 2.078938 0.844879 0.114103 -1.653318
#> 6          Y 0.010005       10 0.592900 0.727416 0.081839 -2.797320

unique(GDSC_rnaseq$model_id) %in% GDSC_drug_response$SANGER_MODEL_ID |> table()
#> 
#> FALSE  TRUE 
#>    10    90

We can use SANGER_MODEL_ID to map back to our RNA-Seq data. DRUG_NAME will be used as the identifier for the treatment. For each cell line-drug pair, we also have both the log-transformed IC50 (LN_IC50) and the AUC (AUC).
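These columns can be arranged into a response matrix with treatments as rows and cell lines as columns. A minimal sketch using base R tapply() on a toy table (the AUC values and the second drug name, DrugB, are made up; averaging with mean is an assumption about how to handle repeated measurements of the same pair):

```r
# Toy drug response table in long format (hypothetical values)
drug_response <- data.frame(
  SANGER_MODEL_ID = c("SIDM01132", "SIDM00848", "SIDM01132", "SIDM00848"),
  DRUG_NAME       = c("Olaparib", "Olaparib", "DrugB", "DrugB"),
  AUC             = c(0.97, 0.84, 0.60, 0.75)
)

# One row per treatment, one column per cell line;
# pairs that were never measured would appear as NA
auc_matrix <- tapply(
  drug_response$AUC,
  list(drug_response$DRUG_NAME, drug_response$SANGER_MODEL_ID),
  mean
)
auc_matrix
```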

Notice that 10 of the 100 cell lines do not have drug response data. These will need to be filtered out before downstream analysis.
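One way to do this filtering is to keep only the cell lines that appear in both tables. A minimal sketch on toy data frames (rnaseq and drug_response below are small hypothetical stand-ins that reuse the column names of GDSC_rnaseq and GDSC_drug_response):

```r
# Toy long-format expression table (hypothetical values)
rnaseq <- data.frame(
  model_id = rep(c("SIDM00001", "SIDM00002", "SIDM00003"), each = 2),
  gene_id  = rep(c("SIDG00082", "SIDG00106"), times = 3),
  fpkm     = c(0.01, 25.95, 0.29, 16.45, 0.04, 8.19)
)

# Toy drug response table: SIDM00003 was never screened
drug_response <- data.frame(
  SANGER_MODEL_ID = c("SIDM00001", "SIDM00002"),
  DRUG_NAME       = c("Olaparib", "Olaparib"),
  AUC             = c(0.97, 0.84)
)

# Cell lines present in both tables
common_models <- intersect(rnaseq$model_id, drug_response$SANGER_MODEL_ID)

# Restrict both tables to the shared cell lines
rnaseq_filtered <- rnaseq[rnaseq$model_id %in% common_models, ]
response_filtered <- drug_response[
  drug_response$SANGER_MODEL_ID %in% common_models,
]

unique(rnaseq_filtered$model_id)
```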

Exploring other multi-omic profiles

We have prepared a variety of other molecular profiles from both GDSC and CCLE. We look through a few more examples below to better understand these data types.

Driver mutations

Load in the driver mutations data from GDSC:

data(GDSC_drivers)
GDSC_drivers |> head()
#>     gene_id gene_symbol  model_id protein_mutation             rna_mutation
#> 1 SIDG27130       RGPD3 SIDM02101       p.N241fs*6  r.809_819delAAUCUUAUGCU
#> 2 SIDG08129        ESR1 SIDM02095       p.L15fs*69        r.412_416delACUGC
#> 3 SIDG03559        CBLC SIDM02090      p.Q419fs*81          r.1295_1296insc
#> 4 SIDG02114        BAP1 SIDM02090           p.C91Y                 r.402g>a
#> 5 SIDG02114        BAP1 SIDM02090           p.N78S                 r.363a>g
#> 6 SIDG36265        SPEN SIDM02089      p.R753fs*53 r.2618_2627delAGGAGGCUUU
#>              cdna_mutation cancer_driver cancer_predisposition_variant
#> 1  c.721_731delAATCTTATGCT          True                         False
#> 2          c.42_46delACTGC          True                         False
#> 3          c.1253_1254insC          True                         False
#> 4                 c.272G>A          True                         False
#> 5                 c.233A>G          True                         False
#> 6 c.2257_2266delAGGAGGCTTT          True                         False
#>       effect    vaf coding source            model_name
#> 1 frameshift 0.2319   True Sanger Mesobank_CellLine-53T
#> 2 frameshift 0.4259   True Sanger  Mesobank_CellLine-26
#> 3 frameshift 0.5217   True Sanger Mesobank_CellLine-50T
#> 4   missense 0.5217   True Sanger Mesobank_CellLine-50T
#> 5   missense 0.4595   True Sanger Mesobank_CellLine-50T
#> 6 frameshift 0.6333   True Sanger  Mesobank_CellLine-45

Notice that, unlike the RNA-Seq data, these are not continuous expression values. The data will need further processing before it can be used to predict drug response.

Methylation

Load in the methylation matrix from GDSC:

data(GDSC_methylation)
GDSC_methylation[1:5, 1:5]
#>                          X8359018054_R03C01 X8359018053_R04C02
#> chr1:1051178-1052445              0.3733729          0.4144962
#> chr1:109824313-109824526          0.4644322          0.5959816
#> chr1:109825710-109826207          0.1411910          0.1770613
#> chr1:110008962-110010124          0.4143231          0.5317602
#> chr1:110527248-110528026          0.3022454          0.1882153
#>                          X8221932075_R03C02 X8221924165_R04C02
#> chr1:1051178-1052445              0.2892333          0.3023764
#> chr1:109824313-109824526          0.5048080          0.4594520
#> chr1:109825710-109826207          0.1668371          0.1698714
#> chr1:110008962-110010124          0.5259936          0.4551888
#> chr1:110527248-110528026          0.2679353          0.2989812
#>                          X7970368131_R04C02
#> chr1:1051178-1052445              0.3880624
#> chr1:109824313-109824526          0.5114854
#> chr1:109825710-109826207          0.1706867
#> chr1:110008962-110010124          0.4740822
#> chr1:110527248-110528026          0.2603969

This data has already been processed into a matrix. Notice, though, that the columns are labelled by array ID and position rather than by sample name. We can use the provided annotation file to map them back to the sample names in our model list.

data(GDSC_methylation_model_list)
GDSC_methylation_model_list |> head()
#>   Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID Sentrix_ID
#> 1       HL-60         A06     SMET0001           NA      NA 5684819030
#> 2      IGR-37         B06     SMET0001           NA      NA 5684819030
#> 3      WM793B         C07     SMET0001           NA      NA 5723654013
#> 4       IGR39         C08     SMET0001           NA      NA 5723654013
#> 5      SW-480         D09     SMET0001           NA      NA 5723654015
#> 6         C32         E09     SMET0001           NA      NA 5723654015
#>   Sentrix_Position   Investigator Project                             Tissue
#> 1           R05C01 Catia Moutinio    <NA> HAEMATOPOIETIC AND LYMPHOID TISSUE
#> 2           R06C01   Javi Carmona    <NA>                               SKIN
#> 3           R03C01   Javi Carmona    <NA>                               SKIN
#> 4           R05C02   Javi Carmona    <NA>                               SKIN
#> 5           R02C02   Javi Carmona    <NA>                    LARGE INTESTINE
#> 6           R03C02 Catia Moutinio    <NA>                               SKIN
#>                     Type EBV Cell_Line Wildtype Normal     Coment  Scan_Date
#> 1 ACUTE MYELOID LEUKEMIA  No       YES      Yes     No       <NA> 2011-02-12
#> 2               MELANOMA  No       YES      Yes     No Metastasis 2011-02-12
#> 3               MELANOMA  No       YES      Yes     No Metastasis 2011-02-12
#> 4               MELANOMA  No       YES      Yes     No    Primary 2011-02-12
#> 5         ADENOCARCINOMA  No       YES      Yes     No    Primary 2011-02-12
#> 6               MELANOMA  No       YES      Yes     No       <NA> 2011-02-12
#>   GDSC1                   GDSC2 cosmic_id
#> 1 blood acute_myeloid_leukaemia    905938
#> 2  skin                melanoma   1240153
#> 3  skin                melanoma   1299081
#> 4  skin                melanoma   1298148
#> 5  <NA>                    <NA>        NA
#> 6  skin                melanoma    906830

GDSC_methylation_model_list$sampleid <- paste0(
  "X", GDSC_methylation_model_list$Sentrix_ID,
  "_", GDSC_methylation_model_list$Sentrix_Position
)
colnames(GDSC_methylation) %in% GDSC_methylation_model_list$sampleid |> table()
#> 
#> TRUE 
#>  100

colnames(GDSC_methylation) <- GDSC_methylation_model_list$Sample_Name[
  match(colnames(GDSC_methylation), GDSC_methylation_model_list$sampleid)
]
GDSC_methylation[1:5, 1:5]
#>                               A673       RT4   8-MG-BA  U-118-MG CHAGO-K-1
#> chr1:1051178-1052445     0.3733729 0.4144962 0.2892333 0.3023764 0.3880624
#> chr1:109824313-109824526 0.4644322 0.5959816 0.5048080 0.4594520 0.5114854
#> chr1:109825710-109826207 0.1411910 0.1770613 0.1668371 0.1698714 0.1706867
#> chr1:110008962-110010124 0.4143231 0.5317602 0.5259936 0.4551888 0.4740822
#> chr1:110527248-110528026 0.3022454 0.1882153 0.2679353 0.2989812 0.2603969
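The renaming step relies on match(), which returns NA for any column that has no entry in the annotation. A minimal sketch of a sanity check on the new names, using made-up IDs (annot, array_ids and their values are hypothetical):

```r
# Toy annotation table (hypothetical values)
annot <- data.frame(
  Sample_Name = c("HL-60", "IGR-37", "WM793B"),
  sampleid    = c("X1_R01C01", "X1_R02C01", "X2_R01C01")
)

# Toy matrix column names; the last array ID is not annotated
array_ids <- c("X2_R01C01", "X1_R01C01", "X9_R09C09")

# match() gives, for each array ID, its row position in the annotation
idx <- match(array_ids, annot$sampleid)
new_names <- annot$Sample_Name[idx]
new_names

# Unannotated arrays become NA, and repeated names would be ambiguous
has_missing    <- anyNA(new_names)
has_duplicates <- anyDuplicated(new_names) > 0
c(missing = has_missing, duplicated = has_duplicates)
```

If either check is TRUE, the affected columns should be inspected before the matrix is used further.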

We have provided a few other subsetted datasets. A full list is available from Reference.

We encourage independent exploration of these datasets.

Creating expression matrices for pharmacogenomic analysis

To facilitate downstream pharmacogenomic analysis, we want to create an expression matrix such that:

Features are the rows
Samples are the columns
Feature expression as the individual values

Below, we show an example of such a matrix using dummy data.

dummy_data <- setNames(
  as.data.frame(replicate(5, rnorm(5))),
  paste0("Sample", 1:5)
)
rownames(dummy_data) <- paste0("Feature", 1:5)
dummy_data
#>               Sample1    Sample2    Sample3     Sample4    Sample5
#> Feature1 -1.400043517  1.1484116 -0.5536994 -1.86301149  0.4681544
#> Feature2  0.255317055 -1.8218177  0.6289820 -0.52201251  0.3629513
#> Feature3 -2.437263611 -0.2473253  2.0650249 -0.05260191 -1.3045435
#> Feature4 -0.005571287 -0.2441996 -1.6309894  0.54299634  0.7377763
#> Feature5  0.621552721 -0.2827054  0.5124269 -0.91407483  1.8885049

Let’s revisit the RNA-Seq example. The data is currently in a long format (i.e. there is one row for each sample-feature observation).

GDSC_rnaseq |> head()
#>    model_id model_name data_source   gene_id gene_symbol read_count  fpkm
#> 1 SIDM00794       A388      sanger SIDG00082       ABCC6         10  0.01
#> 2 SIDM00794       A388      sanger SIDG00106       ABCF3      20264 25.95
#> 3 SIDM00794       A388      sanger SIDG00108       ABCG2       1070  1.47
#> 4 SIDM00794       A388      sanger SIDG00148        ABI3          6  0.02
#> 5 SIDM00794       A388      sanger SIDG00177      ACADSB       1410  1.50
#> 6 SIDM00794       A388      sanger SIDG00198       ACER1         12  0.06
GDSC_rnaseq |> dim()
#> [1] 135000      7

# number of cell line samples
length(unique(GDSC_rnaseq$model_id))
#> [1] 100

# number of genes
length(unique(GDSC_rnaseq$gene_id))
#> [1] 1350

We want to convert this into a wide format such that each row is a gene, each column is a sample, and the values are the gene expression.

expr <- reshape2::dcast(GDSC_rnaseq, gene_id ~ model_name, value.var = "fpkm")
rownames(expr) <- expr$gene_id
expr$gene_id <- NULL

expr[1:5, 1:10]
#>            A388  A427 BB65-RCC Becker BICR78 C-33-A Ca-Ski Ca9-22 CCK-81 CHL-1
#> SIDG00082  0.01  0.29     0.04   0.15   0.03   0.01   0.01   0.01   0.94  1.23
#> SIDG00106 25.95 16.45     8.19  13.88  14.04  10.99  16.42   9.04   9.37 14.86
#> SIDG00108  1.47  0.52     0.02   0.03   1.02   0.01   0.27   0.09   0.01  0.04
#> SIDG00148  0.02  0.02     0.35   0.01   0.00   0.02   0.00   0.10   0.00  0.00
#> SIDG00177  1.50  6.18     3.07   3.40   1.97   7.21   1.01   1.83   5.56 11.63

expr |> dim()
#> [1] 1350  100

Notice that we have the 1350 genes as the rows and the 100 cell lines as the columns.
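The cast assumes that each gene-sample pair occurs only once in the long table. When a pair is repeated and no aggregate function is supplied, reshape2::dcast() falls back to counting rows instead of returning the expression value. A minimal sketch of a pre-check in base R, using a toy table (long_expr and its values are hypothetical):

```r
# Toy long-format table with one repeated gene-sample pair (hypothetical)
long_expr <- data.frame(
  gene_id    = c("SIDG00082", "SIDG00106", "SIDG00082", "SIDG00106", "SIDG00106"),
  model_name = c("A388", "A388", "A427", "A427", "A427"),
  fpkm       = c(0.01, 25.95, 0.29, 16.45, 16.45)
)

# Flag every row whose gene-sample pair has already been seen
is_dup <- duplicated(long_expr[, c("gene_id", "model_name")])
sum(is_dup)

# Drop the repeats so that each pair maps to exactly one value
long_unique <- long_expr[!is_dup, ]
nrow(long_unique)
```

Dropping repeats is only one option; averaging them may be more appropriate when the repeated values differ.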

Feature extraction techniques to define biomarkers

While using the continuous expression of single features is a convenient method for quantifying biomarkers, there are cases when other techniques are needed and/or are more appropriate.

Binarization

Recall that the driver mutations data was not presented as continuous numeric values. One method to prepare this data is to binarize the mutation status.

GDSC_drivers |> head()
#>     gene_id gene_symbol  model_id protein_mutation             rna_mutation
#> 1 SIDG27130       RGPD3 SIDM02101       p.N241fs*6  r.809_819delAAUCUUAUGCU
#> 2 SIDG08129        ESR1 SIDM02095       p.L15fs*69        r.412_416delACUGC
#> 3 SIDG03559        CBLC SIDM02090      p.Q419fs*81          r.1295_1296insc
#> 4 SIDG02114        BAP1 SIDM02090           p.C91Y                 r.402g>a
#> 5 SIDG02114        BAP1 SIDM02090           p.N78S                 r.363a>g
#> 6 SIDG36265        SPEN SIDM02089      p.R753fs*53 r.2618_2627delAGGAGGCUUU
#>              cdna_mutation cancer_driver cancer_predisposition_variant
#> 1  c.721_731delAATCTTATGCT          True                         False
#> 2          c.42_46delACTGC          True                         False
#> 3          c.1253_1254insC          True                         False
#> 4                 c.272G>A          True                         False
#> 5                 c.233A>G          True                         False
#> 6 c.2257_2266delAGGAGGCTTT          True                         False
#>       effect    vaf coding source            model_name
#> 1 frameshift 0.2319   True Sanger Mesobank_CellLine-53T
#> 2 frameshift 0.4259   True Sanger  Mesobank_CellLine-26
#> 3 frameshift 0.5217   True Sanger Mesobank_CellLine-50T
#> 4   missense 0.5217   True Sanger Mesobank_CellLine-50T
#> 5   missense 0.4595   True Sanger Mesobank_CellLine-50T
#> 6 frameshift 0.6333   True Sanger  Mesobank_CellLine-45

Looking at the first row, we can see that there is a mutation in the RGPD3 gene in the SIDM02101 cell line model. We represent such mutation events with 1.

The code below again casts the long data frame into a wide format. This time we pass fun.aggregate = length, so that each cell holds the number of rows (mutation events) for that gene-cell line pair.

expr <- reshape2::dcast(
  GDSC_drivers,
  gene_symbol ~ model_id,
  value.var = "cdna_mutation",
  fun.aggregate = length
)
rownames(expr) <- expr$gene_symbol
expr$gene_symbol <- NULL

expr["RGPD3", "SIDM02101"]
#> [1] 1

expr[1:5, 1:10]
#>       SIDM00001 SIDM00002 SIDM00003 SIDM00006 SIDM00007 SIDM00008 SIDM00009
#> ABCB1         0         0         0         0         0         0         0
#> ABI1          0         0         0         0         0         0         0
#> ABL1          0         0         0         0         0         0         0
#> ABL2          0         0         0         0         0         0         0
#> ACVR1         0         0         0         0         0         0         0
#>       SIDM00011 SIDM00013 SIDM00014
#> ABCB1         0         0         0
#> ABI1          0         0         0
#> ABL1          0         0         0
#> ABL2          0         1         0
#> ACVR1         0         0         0

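There was one mutation event in the RGPD3 gene in the SIDM02101 cell line model, hence the value for this combination is 1.

Mutation events are relatively sparse, so most of the matrix is 0. Because length() counts events, a gene with several mutations in the same cell line (for example, BAP1 in SIDM02090) has a value greater than 1.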
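To make the matrix strictly binary, any non-zero count can be set to 1. A minimal sketch on a toy count matrix (mut_counts and its values are made up for illustration):

```r
# Toy matrix of mutation event counts (hypothetical values)
mut_counts <- matrix(
  c(0, 2, 0,
    0, 0, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("BAP1", "RGPD3"),
    c("SIDM00001", "SIDM02090", "SIDM02101")
  )
)

# Any count above zero becomes 1 (mutated); zero stays 0 (not mutated)
mut_binary <- (mut_counts > 0) * 1
mut_binary
```

The same expression can be applied to the expr data frame created above; the result is then a numeric matrix with the same row and column names.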

Signature extraction

There are cases when individual features have low predictive power, but when combined become much more informative of drug response.

Let’s revisit our methylation data. Recall that each row is a CpG site. There are 1000 CpG sites.

GDSC_methylation[1:5, 1:5]
#>                               A673       RT4   8-MG-BA  U-118-MG CHAGO-K-1
#> chr1:1051178-1052445     0.3733729 0.4144962 0.2892333 0.3023764 0.3880624
#> chr1:109824313-109824526 0.4644322 0.5959816 0.5048080 0.4594520 0.5114854
#> chr1:109825710-109826207 0.1411910 0.1770613 0.1668371 0.1698714 0.1706867
#> chr1:110008962-110010124 0.4143231 0.5317602 0.5259936 0.4551888 0.4740822
#> chr1:110527248-110528026 0.3022454 0.1882153 0.2679353 0.2989812 0.2603969

GDSC_methylation |> dim()
#> [1] 1000  100

Signatures refer to combinations of features that form some pattern with biological relevance. For example, you may choose to define a signature X to represent CpG sites located on promoters of genes involved in pathway Y.

For simplicity, we can define some arbitrary signatures from our CpG sites.

set.seed(123)
signatures <- data.frame(
  CpG = rownames(GDSC_methylation),
  Signature = sample(gl(10, 100, length = 1000))
)
signatures |> head()
#>                        CpG Signature
#> 1     chr1:1051178-1052445         5
#> 2 chr1:109824313-109824526         5
#> 3 chr1:109825710-109826207         2
#> 4 chr1:110008962-110010124         6
#> 5 chr1:110527248-110528026         2
#> 6 chr1:114354373-114355300        10

Each of the 1000 CpG sites was randomly assigned to one of 10 signatures.

Next, we want to quantify each signature for each cell line. Again, for simplicity, we will sum the beta values of the CpG sites in each signature.

sScores <- data.frame(matrix(NA, nrow = 0, ncol = 100))

for (s in c(1:10)) {
  # get CpGs within each signature
  cpgs <- signatures[signatures$Signature == s, ]$CpG
  mSig <- GDSC_methylation[rownames(GDSC_methylation) %in% cpgs, ]

  # compute sum of beta values for each cell line
  sSum <- colSums(mSig)
  sScores <- rbind(sScores, sSum)
}

rownames(sScores) <- paste0("Signature", 1:10)
colnames(sScores) <- colnames(GDSC_methylation)

sScores[1:10, 1:5]
#>                 A673      RT4  8-MG-BA U-118-MG CHAGO-K-1
#> Signature1  35.73567 41.50092 30.31945 35.03195  40.93194
#> Signature2  36.54479 41.78597 34.23020 37.28902  42.43054
#> Signature3  35.08198 39.01984 32.67874 35.14082  39.57340
#> Signature4  37.26978 42.31938 35.94349 38.44868  41.68582
#> Signature5  37.06404 42.90746 32.04316 37.13874  38.69905
#> Signature6  35.32663 40.04647 31.79493 34.86532  39.63569
#> Signature7  36.46212 40.74357 33.85272 36.77940  40.16127
#> Signature8  35.69553 40.46161 32.70537 36.50410  39.12737
#> Signature9  34.37157 38.31962 31.18276 34.40533  39.80003
#> Signature10 40.49063 42.52888 33.00193 39.79155  39.62519

We now have a new expression matrix, this time of the 10 defined signatures for each of our cell lines.
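The same per-signature sums can also be computed without an explicit loop using base R rowsum(), which adds up the rows of a matrix within each group. A minimal sketch on toy data (beta, sig and their values are hypothetical; the grouping vector must be in the same order as the rows of the matrix):

```r
set.seed(1)

# Toy beta-value matrix: 6 CpG sites x 3 cell lines (hypothetical values)
beta <- matrix(
  runif(18), nrow = 6,
  dimnames = list(paste0("CpG", 1:6), paste0("CellLine", 1:3))
)

# Assign each CpG site (row) to one of two signatures
sig <- factor(c(1, 2, 1, 2, 1, 2))

# Sum the rows of `beta` within each level of `sig`
scores <- rowsum(beta, group = sig)
rownames(scores) <- paste0("Signature", rownames(scores))
scores
```

In the vignette's data, signatures$Signature is already aligned with rownames(GDSC_methylation), so it could serve as the grouping vector.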

Individual practice

For the remainder of this session, we provide more sample datasets to practice the above techniques.

data(CCLE_model_list)   # metadata for CCLE cell lines
data(CCLE_metabolomics) # CCLE metabolite expression levels
data(CCLE_rrpa)         # CCLE RPPA protein expression levels
data(CCLE_chromatin)    # CCLE histone modifications