Cell type deconvolutiond of (cf)MeDIP-seq data with decemedip

Citation

If you use decemedip in published research, please cite:

Shen, Ning, and Keegan Korthauer. “decemedip: hierarchical Bayesian modeling for cell type deconvolution of immunoprecipitation-based DNA methylome” bioRxiv, forthcoming.

Installation

This package will be submitted to Bioconductor. For now, you can install the development version from GitHub:

# Install stable version from Bioconductor (once available)
# BiocManager::install("decemedip")

# Install development version from GitHub
remotes::install_github("nshen7/decemedip")

After installation, load the decemedip package:

library(SummarizedExperiment)
library(dplyr)
library(ggplot2)
library(decemedip)
options(digits = 2)

Background

Cell-free and bulk DNA methylation data obtained through MeDIP-seq reflect a mixture of methylation signals across multiple cell types. Decomposing these signals to infer cell type composition can provide valuable insights for cancer diagnosis, immune response monitoring, and other biomedical applications. However, challenges like enrichment-induced biases and sparse reference data make this task complex. The decemedip package addresses these challenges through a hierarchical Bayesian framework that estimates cell type proportions and models the relationship between MeDIP-seq counts and reference methylation data.

decemedip couples a logit-normal model with a generalize additive model (GAM) framework. For each site $i \in \{1, ..., N\}$ in the reference panel, the input to the model is the fractional methylation levels $x_{ik}$ for each cell type $k \in \{1, ..., K\}$ , the CpG density level $z_i$ and the MeDIP-seq read count $y_i$ , where $K$ is the total number of cell types. A unit simplex variable that follows a logit-normal prior, $\boldsymbol{\pi} = (\pi_1, ..., \pi_K)$ where $\pi_k > 0, \sum_{k=1}^K \pi_k = 1$ , is included to describe the proportions of cell types in the reference panel while taking into account the correlations between these cell types.

Input Data

decemedip requires three primary inputs:

Reference methylation matrix: A matrix of methylation levels for multiple cell types at selected CpG sites.
CpG density: Information about CpG site density in the genome.
MeDIP-seq read counts: Coverage values from the sample of interest.

Reference panel

Our package provides default reference matrices for hg19 and hg38 along with the corresponding CpG density information, as objects in class SummarizedExperiment. The objects can be accessed by calling data(hg19.ref.cts.se) and data(hg19.ref.anc.se), or data(hg38.ref.cts.se) and data(hg38.ref.anc.se). By default, the main function decemedip() applies hg19.ref.cts.se and hg19.ref.anc.se as the reference panels. Please refer to the manuscript for details of how the default reference panels were constructed.

On a side note, we provide the function makeReferencePanel() to allow user to build their own reference panels, which only requires input of reference CpGs and corresponded fractional methylation level matrix. The function computes CpG density on its own. Note that the cell type-specific sites and anchor sites need to be included in two SummarizedExperiment objects to be inputs to the main function decemedip(). See ?makeReferencePanel for more information.

We show how the reference panel look like in the following chunks:

data(hg19.ref.cts.se)
print(hg19.ref.cts.se)
## class: RangedSummarizedExperiment 
## dim: 2500 25 
## metadata(0):
## assays(1): beta
## rownames(2500): cg18856478 cg20820767 ... cg01071459 cg20726993
## rowData names(4): probe label pos n_cpgs_100bp
## colnames(25): Monocytes_EPIC B-cells_EPIC ... Upper_GI Uterus_cervix
## colData names(0):
head(granges(hg19.ref.cts.se))
## GRanges object with 6 ranges and 4 metadata columns:
##              seqnames    ranges strand |       probe                  label
##                 <Rle> <IRanges>  <Rle> | <character>            <character>
##   cg18856478     chr1  43814358      * |  cg18856478 Monocytes_EPIC hyper..
##   cg20820767     chr1  45082840      * |  cg20820767 Monocytes_EPIC hyper..
##   cg14855367     chr3 191048308      * |  cg14855367 Monocytes_EPIC hyper..
##   cg25913761    chr15  90727560      * |  cg25913761 Monocytes_EPIC hyper..
##   cg21546950     chr1  77904032      * |  cg21546950 Monocytes_EPIC hyper..
##   cg14981189    chr10   5113871      * |  cg14981189 Monocytes_EPIC hyper..
##                    pos n_cpgs_100bp
##              <integer>    <integer>
##   cg18856478  43814358            7
##   cg20820767  45082840           12
##   cg14855367 191048308            1
##   cg25913761  90727560            4
##   cg21546950  77904032            1
##   cg14981189   5113871            1
##   -------
##   seqinfo: 22 sequences from hg19 genome; no seqlengths

data(hg19.ref.anc.se)
print(hg19.ref.anc.se)
## class: RangedSummarizedExperiment 
## dim: 1000 25 
## metadata(0):
## assays(1): beta
## rownames(1000): 353534 76294 ... 73948 87963
## rowData names(7): probe pos ... avg_beta_rank n_cpgs_100bp
## colnames(25): Monocytes_EPIC B-cells_EPIC ... Upper_GI Uterus_cervix
## colData names(0):
head(granges(hg19.ref.cts.se))
## GRanges object with 6 ranges and 4 metadata columns:
##              seqnames    ranges strand |       probe                  label
##                 <Rle> <IRanges>  <Rle> | <character>            <character>
##   cg18856478     chr1  43814358      * |  cg18856478 Monocytes_EPIC hyper..
##   cg20820767     chr1  45082840      * |  cg20820767 Monocytes_EPIC hyper..
##   cg14855367     chr3 191048308      * |  cg14855367 Monocytes_EPIC hyper..
##   cg25913761    chr15  90727560      * |  cg25913761 Monocytes_EPIC hyper..
##   cg21546950     chr1  77904032      * |  cg21546950 Monocytes_EPIC hyper..
##   cg14981189    chr10   5113871      * |  cg14981189 Monocytes_EPIC hyper..
##                    pos n_cpgs_100bp
##              <integer>    <integer>
##   cg18856478  43814358            7
##   cg20820767  45082840           12
##   cg14855367 191048308            1
##   cg25913761  90727560            4
##   cg21546950  77904032            1
##   cg14981189   5113871            1
##   -------
##   seqinfo: 22 sequences from hg19 genome; no seqlengths

Fit the Bayesian Model

The main function decemedip() fits the decemedip model. It allows two types of input:

A BAM file of the sample MeDIP-seq data, or
read counts of the reference CpG sites of the sample.

We provide instructions for both input types as follows.

Input with a BAM file

sample_bam_file <- "path/to/bam/files"
paired <- TRUE # whether the sequencing is paired-end

output <- decemedip(sample_bam_file = sample_bam_file, 
                    paired = paired, 
                    cores = 4)

PS: By default, the decemedip() function uses a hg19 reference panel. But users may add the arguments ref_cts = hg38.ref.cts.se, ref_anc = hg38.ref.anc.se to apply read counts extraction on hg38 data.

Input with read counts

We use built-in objects of the package that contains read counts of the prostate tumor patient-derived xenograft (PDX) samples from the Berchuck et al. 2022 [1] study, pdx.counts.cts.se and pdx.counts.anc.se, to demonstrate the output and diagnostics in this vignette.

[1] Berchuck JE, Baca SC, McClure HM, Korthauer K, Tsai HK, Nuzzo PV, et al. Detecting neuroendocrine prostate cancer through tissue-informed cell-free DNA methylation analysis. Clinical Cancer Research. 2022;28(5):928–938.

data(pdx.counts.cts.se)
data(pdx.counts.anc.se)

We extract the sample LuCaP_147CR from this example dataset for follwing illustration:

counts_cts <- assays(pdx.counts.cts.se)$counts[,'LuCaP_147CR'] # read counts of cell type-specific CpGs of the sample 'LuCaP_147CR'
counts_anc <- assays(pdx.counts.anc.se)$counts[,'LuCaP_147CR'] # read counts of anchor CpGs of the sample 'LuCaP_147CR'

Due to the vignette running time limit by Bioconductor, we only run 300 iterations (iter = 300) for the purpose of demonstration, which causes the warning of Bulk Effective Samples Size (ESS) is too low. In regular cases, we recommend to run 2000 iterations (the default), or at least 1000 iterations for a stable posterior inference.

output <- decemedip(counts_cts = counts_cts,
                    counts_anc = counts_anc,
                    diagnostics = TRUE,
                    cores = 4,
                    iter = 100)
## Warning: The largest R-hat is NA, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
## MCMC converged with seed 2024

The output of decemedip() is a list containing two elements:

names(output)
## [1] "data_list" "posterior"

data_list: An organized list of variables used as input to the Stan posterior sampling function.
posterior: An stanfit object produced by Stan representing the fitted posteriors.

Checking model outputs

Cell type proportions

After running the model, you may extract and save the summary of fitted posteriors using the monitor() and extract() functions provided by the RStan package. See documentation of RStan for details of these functions.

library(rstan)

Extract the fitted posterior of cell type proportions ( $\boldsymbol\pi$ ):

smr_pi.df <- getSummaryOnPi(output$posterior)

head(smr_pi.df)
##              cell_type    mean se_mean      sd    2.5%     25%     50%     75%
## pi[1]   Monocytes_EPIC 0.00029 1.1e-04 0.00120 1.5e-12 3.9e-08 1.8e-06 5.7e-05
## pi[2]     B-cells_EPIC 0.00084 3.5e-04 0.00395 3.6e-12 2.3e-08 1.4e-06 2.4e-05
## pi[3]  CD4T-cells_EPIC 0.00085 3.6e-04 0.00397 1.3e-12 2.0e-08 1.9e-06 6.7e-05
## pi[4]    NK-cells_EPIC 0.00025 6.9e-05 0.00068 3.2e-13 3.7e-08 2.2e-06 8.5e-05
## pi[5]  CD8T-cells_EPIC 0.00024 9.1e-05 0.00100 1.1e-13 2.3e-08 4.1e-06 8.2e-05
## pi[6] Neutrophils_EPIC 0.00048 1.7e-04 0.00182 2.2e-13 1.4e-08 2.3e-06 1.1e-04
##        97.5% n_eff Rhat valid
## pi[1] 0.0037   111  1.1     1
## pi[2] 0.0114   115  1.0     1
## pi[3] 0.0089   117  1.0     1
## pi[4] 0.0016    76  1.0     1
## pi[5] 0.0017   110  1.0     1
## pi[6] 0.0051   117  1.1     1

Summary Statistics Columns:
- cell_type: The name of the parameter or variable being analyzed.
- mean: The posterior mean, representing the point estimate of the parameter.
- se_mean: The standard error of the mean, calculated as sd / sqrt(n_eff), indicating precision of the mean estimate.
- sd: The posterior standard deviation, representing the spread or uncertainty of the parameter estimate.
- 2.5%, 25%, 50% (median), 75%, 97.5%: Percentiles of the posterior distribution, providing a summary of parameter uncertainty. These define the 95% credible interval (2.5% to 97.5%).
Diagnostics Columns:
- n_eff: The effective sample size, indicating how many independent samples the chain produced after accounting for autocorrelation.
- Rhat: The potential scale reduction factor, measuring chain convergence. Values close to 1.00 suggest good convergence.
- valid: A flag indicating whether diagnostic checks (e.g., Rhat and n_eff) passed for this parameter (1 = passed, 0 = potential issues).

Plotting out the fitted cell type proportions with credible intervals:

labels <- gsub('_', ' ', smr_pi.df$cell_type)
labels <- gsub('(.*) EPIC', '\\1', labels)

smr_pi.df |>
  mutate(cell_type = factor(cell_type, labels = labels)) |>
  ggplot(aes(cell_type, mean)) +
  geom_linerange(aes(ymin = `2.5%`, ymax = `97.5%`), 
                 position = position_dodge2(width = 0.035),
                 linewidth = 7, alpha = 0.3) +
  geom_linerange(aes(ymin = `25%`, ymax = `75%`), 
                 position = position_dodge2(width = 0.035),
                 linewidth = 7, alpha = 1) +
  geom_point(position = position_dodge2(width = 0.035),
             fill = 'white', shape = 21, size = 8) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Fitted relationship between fractional methylation and MeDIP-seq counts

Note that this plot is only accessible when diagnostics is set toTRUE in the decemedip() function. The actual read counts (orange) and the fitted counts predicted by the GAM component (black) are shown in the figure across varying levels of CpG density. Grey area represents the 95% credible intervals of the predicted counts. `CpG density: x’ means that there are x CpGs in the 100-bp window surrounding the CpG.

plotDiagnostics(output, plot_type = "y_fit") +
  ylim(0, 200)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_segment()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

Conclusion

The decemedip package provides a robust framework for cell type deconvolution from MeDIP-seq data. By following this vignette, users can apply the method to their own datasets, extract key model outputs, and generate diagnostic plots for analysis.

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] rstan_2.32.6                StanHeaders_2.32.10        
##  [3] decemedip_0.99.0            ggplot2_3.5.1              
##  [5] dplyr_1.1.4                 SummarizedExperiment_1.36.0
##  [7] Biobase_2.66.0              GenomicRanges_1.58.0       
##  [9] GenomeInfoDb_1.42.3         IRanges_2.40.1             
## [11] S4Vectors_0.44.0            BiocGenerics_0.52.0        
## [13] MatrixGenerics_1.18.1       matrixStats_1.5.0          
## [15] BiocStyle_2.34.0           
## 
## loaded via a namespace (and not attached):
##   [1] rstudioapi_0.17.1        jsonlite_1.8.9           magrittr_2.0.3          
##   [4] farver_2.1.2             rmarkdown_2.29           fs_1.6.5                
##   [7] BiocIO_1.16.0            zlibbioc_1.52.0          ragg_1.3.3              
##  [10] vctrs_0.6.5              memoise_2.0.1            Rsamtools_2.22.0        
##  [13] RCurl_1.98-1.16          usethis_3.1.0            htmltools_0.5.8.1       
##  [16] S4Arrays_1.6.0           progress_1.2.3           curl_6.2.0              
##  [19] SparseArray_1.6.1        sass_0.4.9               bslib_0.9.0             
##  [22] htmlwidgets_1.6.4        desc_1.4.3               testthat_3.2.3          
##  [25] httr2_1.1.0              cachem_1.1.0             GenomicAlignments_1.42.0
##  [28] mime_0.12                lifecycle_1.0.4          pkgconfig_2.0.3         
##  [31] Matrix_1.7-1             R6_2.5.1                 fastmap_1.2.0           
##  [34] GenomeInfoDbData_1.2.13  shiny_1.10.0             digest_0.6.37           
##  [37] colorspace_2.1-1         ps_1.8.1                 AnnotationDbi_1.68.0    
##  [40] rprojroot_2.0.4          pkgload_1.4.0            textshaping_1.0.0       
##  [43] RSQLite_2.3.9            labeling_0.4.3           filelock_1.0.3          
##  [46] MEDIPS_1.58.0            httr_1.4.7               abind_1.4-8             
##  [49] compiler_4.4.2           remotes_2.5.0            bit64_4.6.0-1           
##  [52] withr_3.0.2              inline_0.3.21            BiocParallel_1.40.0     
##  [55] DBI_1.2.3                QuickJSR_1.5.1           pkgbuild_1.4.6          
##  [58] R.utils_2.12.3           biomaRt_2.62.1           sessioninfo_1.2.3       
##  [61] rappdirs_0.3.3           DelayedArray_0.32.0      rjson_0.2.23            
##  [64] gtools_3.9.5             DNAcopy_1.80.0           loo_2.8.0               
##  [67] BH_1.87.0-1              tools_4.4.2              httpuv_1.6.15           
##  [70] R.oo_1.27.0              glue_1.8.0               callr_3.7.6             
##  [73] restfulr_0.0.15          promises_1.3.2           grid_4.4.2              
##  [76] generics_0.1.3           gtable_0.3.6             BSgenome_1.74.0         
##  [79] R.methodsS3_1.8.2        preprocessCore_1.68.0    hms_1.1.3               
##  [82] xml2_1.3.6               XVector_0.46.0           pillar_1.10.1           
##  [85] stringr_1.5.1            later_1.4.1              BiocFileCache_2.14.0    
##  [88] lattice_0.22-6           rtracklayer_1.66.0       bit_4.5.0.1             
##  [91] tidyselect_1.2.1         Biostrings_2.74.1        miniUI_0.1.1.1          
##  [94] knitr_1.49               gridExtra_2.3            bookdown_0.42           
##  [97] xfun_0.50                brio_1.1.5               devtools_2.4.5          
## [100] stringi_1.8.4            UCSC.utils_1.2.0         yaml_2.3.10             
## [103] evaluate_1.0.3           codetools_0.2-20         RcppEigen_0.3.4.0.2     
## [106] tibble_3.2.1             BiocManager_1.30.25      cli_3.6.3               
## [109] RcppParallel_5.1.10      xtable_1.8-4             systemfonts_1.2.1       
## [112] processx_3.8.5           munsell_0.5.1            jquerylib_0.1.4         
## [115] Rcpp_1.0.14              dbplyr_2.5.0             png_0.1-8               
## [118] XML_3.99-0.18            parallel_4.4.2           rstantools_2.4.0        
## [121] ellipsis_0.3.2           pkgdown_2.1.1            blob_1.2.4              
## [124] prettyunits_1.2.0        profvis_0.4.0            bayesplot_1.11.1        
## [127] urlchecker_1.0.1         bitops_1.0-9             scales_1.3.0            
## [130] purrr_1.0.4              crayon_1.5.3             rlang_1.1.5             
## [133] cowplot_1.1.3            KEGGREST_1.46.0

Ning Shen