PDN Vignette

Javier Cabrera, Amanda Kowk and Zhenbang Wang

August 15, 2017

Introduction

A Personalized Disease Network (PDN) is a connected graph that represents the disease evolution of an individual patient. We use diagnoses, medical interventions, and their dates to build a PDN. Each node in the PDN represents an event (e.g. Heart failure) and each node is connected to another by a directed edge if the dates fulfill a certain criterion. PDN were introduced by Javier Cabrera and proved to be effective for improving goodness of fit of survival models, dominating individual comorbidity variables. (Cabrera et. al. 2016)

Installation

Like many other R packages, the simplest way to obtain glmnet is to install it directly from CRAN. Type the following command in R console:

install.packages("PDN", repos = "https://cran.us.r-project.org")

Users may change the repos options depending on their locations and preferences. Other options such as the directories where to install the packages can be altered in the command. For more details, see help(install.packages). Here the R package has been downloaded and installed to the default directories. Alternatively, users can download the package source at “https://cran.r-project.org/package=PDN” and type Unix commands to install it to the desired location.

Quick Overview

The purpose of this section is to give users a general sense of the package, including the components, what they do and some basic usage. We will briefly go over the main functions, see the basic operations and have a look at the outputs. Users may have a better idea after this section what functions are available, which one to choose, or at least where to seek help. More details are given in later sections.

First, we load the glmnet package:

library(PDN)

The package contains five functions: buildnetworks, datecut, draw.PDN, draw.PDN.circle and draw.PDN.cluster, and three sample data sets: comorbidity_data, demographic_data and survival_data.

You can visit the help page of the package using the following command:

help(package = "PDN")

For comorbidity_data, its columns represent different diagnoses, its rows represent different patients, and the entries represent the time difference between the respective event and January 1st 1970. NA means the patient does not have the respective diagnosis.

summary(comorbidity_data)
##       AFIB             HF              MR            CMTHY      
##  Min.   :13100   Min.   :12796   Min.   :12925   Min.   :12876  
##  1st Qu.:15882   1st Qu.:15018   1st Qu.:14979   1st Qu.:15915  
##  Median :16954   Median :16546   Median :17043   Median :17238  
##  Mean   :16729   Mean   :16300   Mean   :16467   Mean   :16770  
##  3rd Qu.:17854   3rd Qu.:17739   3rd Qu.:17740   3rd Qu.:18168  
##  Max.   :19097   Max.   :19097   Max.   :19191   Max.   :19097  
##  NA's   :63                      NA's   :57      NA's   :42     
##       CHD             HTN              DM             NEO       
##  Min.   :12794   Min.   :12794   Min.   :12794   Min.   :13625  
##  1st Qu.:13151   1st Qu.:13262   1st Qu.:13590   1st Qu.:14946  
##  Median :13602   Median :13866   Median :14474   Median :17517  
##  Mean   :14209   Mean   :14319   Mean   :14953   Mean   :16801  
##  3rd Qu.:14867   3rd Qu.:15266   3rd Qu.:15998   3rd Qu.:18526  
##  Max.   :18837   Max.   :18528   Max.   :18263   Max.   :18900  
##  NA's   :3       NA's   :1       NA's   :43      NA's   :71     
##       COPD           RENAL      
##  Min.   :12829   Min.   :13166  
##  1st Qu.:13944   1st Qu.:16932  
##  Median :16062   Median :17821  
##  Mean   :15916   Mean   :17407  
##  3rd Qu.:17955   3rd Qu.:18405  
##  Max.   :19130   Max.   :19233  
##  NA's   :30      NA's   :31

For demographic_data, it contains the demographic information of each patients It has five columns which are sex, race, hispan, dshyr and prime.

summary(demographic_data)
##  SEX         RACE        HISPAN       DSHYR          PRIME       
##  F:40   1      :66   0      :87   Min.   :1995   Min.   : 11.00  
##  M:60   21063  :14   9      :10   1st Qu.:2001   1st Qu.: 11.00  
##         2      :13   2      : 2   Median :2005   Median : 11.00  
##         0      : 2   5      : 1   Mean   :2004   Mean   : 41.84  
##         9      : 2   1      : 0   3rd Qu.:2008   3rd Qu.: 32.00  
##         21311  : 1   3      : 0   Max.   :2012   Max.   :303.00  
##         (Other): 2   (Other): 0

For survival_data, it contains information of survival days from admission to death, as well as censor information regarding whether the patient is dead or not.

summary(survival_data)
##     surdays         event     
##  Min.   :  69   Min.   :0.00  
##  1st Qu.:1300   1st Qu.:0.00  
##  Median :2466   Median :0.00  
##  Mean   :2817   Mean   :0.43  
##  3rd Qu.:4213   3rd Qu.:1.00  
##  Max.   :6562   Max.   :1.00

Datecuts

In order to create a PDN based on comorbidity_data, we need to find the criterion for each pair of diagnoses that define the connection of the network.

Generally, it is important to choose an appropriate cutoff point such that it captures information of the interaction between the two diagnoses.

For CMTHY and HTN, the distribution of the time difference is shown below:

We used Cox PH regression to find the best criterion for cutoff point following the rule proposed by (Cabrera et. al. 2016).

k1 = datecut(comorbidity_data,survival_data[,1],survival_data[,2])
k1[1:20]
##    AFIB to HF    HF to AFIB    AFIB to MR    MR to AFIB AFIB to CMTHY 
##          6000          6000          1000          1000           900 
## CMTHY to AFIB   AFIB to CHD   CHD to AFIB   AFIB to HTN   HTN to AFIB 
##           900          5500          5500          5500          5500 
##    AFIB to DM    DM to AFIB   AFIB to NEO   NEO to AFIB  AFIB to COPD 
##          1400          1400          4700          4700          3300 
##  COPD to AFIB AFIB to RENAL RENAL to AFIB      HF to MR      MR to HF 
##          3300          1900          1900          5300          5300

Build Networks

By using function buildnetworks, we past comorbidity_data and criterion information we get from the datecut into the function, then we will get the adjacency matrix for the network shown below:

a = buildnetworks(comorbidity_data,k1)
a[1:10,1:10]
##       AFIB to HF HF to AFIB AFIB to MR MR to AFIB AFIB to CMTHY
##  [1,]          0          0          0          0             0
##  [2,]          0          0          0          0             0
##  [3,]          0          1          0          0             0
##  [4,]          0          0          0          0             0
##  [5,]          0          0          0          0             0
##  [6,]          0          0          0          0             0
##  [7,]          0          0          0          0             0
##  [8,]          0          0          0          0             0
##  [9,]          0          0          0          0             0
## [10,]          0          1          0          0             1
##       CMTHY to AFIB AFIB to CHD CHD to AFIB AFIB to HTN HTN to AFIB
##  [1,]             0           0           0           0           0
##  [2,]             0           0           0           0           0
##  [3,]             0           0           1           0           1
##  [4,]             0           0           0           0           0
##  [5,]             0           0           0           0           0
##  [6,]             0           0           0           0           0
##  [7,]             0           0           0           0           0
##  [8,]             0           0           0           0           0
##  [9,]             0           0           0           0           0
## [10,]             0           1           1           0           1

Visualize Networks

In our case, the nodes have chronological order, and we opt to incorporate this order in the network visualization for a more informative and clean result. Below left is the example PDN from function draw.PDN.circle.

PDN for individual patient:

Here is the example using the functioin draw.PDN, which utilize network and ggplot2 for better visualization:

By looping each patient through draw.PDN.circle, we obtain the visualizations for the set of patients: Here is an example of using draw.PDN.cluster to visulaize network of cluster, we use different colors to represent weighted connection of each nodes.

## Loading required package: survival

Analyze Networks

In the below example using PDN network information to perform regression, we combine comorbidity information and the adjacency matrix from buildnetworks as our design matrix which is shown below:

x = cbind(!is.na(comorbidity_data),a)
x[1:10,1:10]
##    AFIB HF MR CMTHY CHD HTN DM NEO COPD RENAL
## 1     0  1  1     1   1   1  1   0    1     1
## 2     0  1  1     0   1   1  0   0    1     1
## 3     1  1  0     0   1   1  1   0    1     1
## 4     0  1  1     0   1   1  1   1    1     1
## 5     0  1  0     1   1   1  1   1    1     1
## 6     0  1  0     0   1   1  1   0    0     0
## 7     0  1  0     1   1   1  1   0    1     1
## 8     0  1  0     0   1   1  0   1    1     0
## 9     0  1  0     0   1   1  0   1    1     1
## 10    1  1  0     1   1   1  1   0    0     1

We then use the glmnet package to perform k-fold cross-validation for the Cox model. We extract two optimal lambdas, that is lambda.min: the ?? at which the minimal MSE is achieved, and lambda.1se: the largest ?? at which the MSE is within one standard error of the minimal MSE. We can check the active covariates in our model and see their coefficients based on the lambda we have chosen.

require(survival)
require(glmnet)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-10
y=Surv(survival_data[,1],survival_data[,2])
glm1 = cv.glmnet(x,y,family="cox",alpha=0.8)
b = glmnet(x,y,family="cox",alpha=0.8,lambda=c(glm1$lambda.min,glm1$lambda.1se))$beta
b[1:20,]
## 20 x 2 sparse Matrix of class "dgCMatrix"
##                    s0          s1
## AFIB          .        .         
## HF            .        .         
## MR            .        .         
## CMTHY         .        .         
## CHD           .        .         
## HTN           .        .         
## DM            .       -0.52194109
## NEO           .        .         
## COPD          .        .         
## RENAL         0.42446  0.54860645
## AFIB to HF    .        .         
## HF to AFIB    .        .         
## AFIB to MR    .        .         
## MR to AFIB    .        .         
## AFIB to CMTHY .        .         
## CMTHY to AFIB .        .         
## AFIB to CHD   .       -0.03440171
## CHD to AFIB   .        .         
## AFIB to HTN   .        .         
## HTN to AFIB   .        .

Through stepwise model selection based on the AIC criterion, we can see that the network variables dominate individual comorbidity variables. We conclude that the network information is significant for improving the model compared to just comorbidity information.

isel = b[,2]!=0
bcox = coxph(y~.,data=data.frame(x[,isel]))
step1 = step(bcox)
step1$anova