# MCMC scheme for posterior inference of Gaussian DAG models: the learn_DAG() function

library(BCDAG)

This is the second of a series of three vignettes for the R package BCDAG. In this vignette we focus on function learn_DAG(), which implements a Markov Chain Monte Carlo (MCMC) algorithm to sample from the joint posterior of DAG structures and DAG-parameters under the Gaussian assumption.

### Model description

The underlying Bayesian Gaussian DAG-model can be summarized as follows: $\begin{eqnarray} X_1, \dots, X_q \,|\,\boldsymbol L, \boldsymbol D, \mathcal{D} &\sim& \mathcal{N}_q\left(\boldsymbol 0, (\boldsymbol{L}\boldsymbol{D}^{-1}\boldsymbol{L}^\top)^{-1}\right)\\ (\boldsymbol L, \boldsymbol D)\,|\,\mathcal{D} &\sim& \text{DAG-Wishart}(\boldsymbol{a}_{c}^{\mathcal{D}}, \boldsymbol U) \\ p(\mathcal{D}) &\propto& w^{|\mathcal{S}_\mathcal{D}|}(1-w)^{\frac{q(q-1)}{2} - {|\mathcal{S}_\mathcal{D}|}} \end{eqnarray}$

In particular $$\mathcal{D}=(V,E)$$ denotes a DAG structure with set of nodes $$V=\{1,\dots,q\}$$ and set of edges $$E\subseteq V \times V$$. Moreover, $$(\boldsymbol L, \boldsymbol D)$$ are model parameters providing the decomposition of the precision (inverse-covariance) matrix $$\boldsymbol{\Omega} = \boldsymbol{L}\boldsymbol{D}^{-1}\boldsymbol{L}^\top$$; specifically, $$\boldsymbol{L}$$ is a $$(q, q)$$ matrix of coefficients such that for each $$(u, v)$$-element $$\boldsymbol{L}_{uv}$$ with $$u \ne v$$, $$\boldsymbol{L}_{uv} \ne 0$$ if and only if $$(u, v) \in E$$, while $$\boldsymbol{L}_{uu} = 1$$ for each $$u = 1,\dots, q$$; also, $$\boldsymbol{D}$$ is a $$(q, q)$$ diagonal matrix with $$(u, u)$$-element $$\boldsymbol{D}_{uu}$$.

The latter decomposition follows from the equivalent Structural Equation Model (SEM) representation of a Gaussian DAG-model; see also Castelletti & Mascaro (2021).

Conditionally to $$\mathcal{D}$$, a prior to $$(\boldsymbol{L}, \boldsymbol{D})$$ is assigned through a compatible DAG-Wishart distribution with rate hyperparameter $$\boldsymbol{U}$$, a $$(q,q)$$ s.p.d. matrix, and shape hyperparameter $$\boldsymbol{a}^{\mathcal {D}}_{c}$$, a $$(q,1)$$ vector; see also Cao et al. (2019) and Peluso & Consonni (2020).

Finally, a prior on DAG $$\mathcal{D}$$ is assigned through a Binomial distribution on the number of edges in the graph; in $$p(\mathcal{D})$$, $$w \in (0,1)$$ is a prior probability of edge inclusion, while $$|\mathcal{S_{\mathcal{D}}}|$$ denotes the number of edges in $$\mathcal{D}$$; see again Castelletti & Mascaro (2021) for further details.

Target of the MCMC scheme is therefore the joint posterior of $$(\boldsymbol{L},\boldsymbol{D},\mathcal{D})$$, $\begin{equation} p(\boldsymbol L, \boldsymbol D, \mathcal{D}\,|\, \boldsymbol X) \propto p(\boldsymbol{X}\,|\,\boldsymbol L, \boldsymbol D, \mathcal{D})p(\boldsymbol{L},\boldsymbol{D}\,|\,\mathcal{D}) \,p(\mathcal{D}), \end{equation}$ where $$\boldsymbol{X}$$ denotes a $$(n,q)$$ data matrix as obtained through $$n$$ i.i.d. draws from the Gaussian DAG-model and $$p(\boldsymbol{X}\,|\,\boldsymbol L, \boldsymbol D, \mathcal{D})$$ is the likelihood function. See also Castelletti & Mascaro (2022+) for full details.

### Generating data

We first randomly generate a DAG $$\mathcal{D}$$, the DAG parameters $$(\boldsymbol{L},\boldsymbol{D})$$ and then $$n=1000$$ i.i.d. observations from a Gaussian DAG-model as follows:

set.seed(1)
q <- 8
w <- 0.2
DAG <- rDAG(q,w)
a <- q
U <- diag(1,q)
outDL <- rDAGWishart(n=1, DAG, a, U)
L <- outDL$L; D <- outDL$D
Omega <- L %*% solve(D) %*% t(L)
Sigma <- solve(Omega)
n <- 1000
X <- mvtnorm::rmvnorm(n = n, sigma = Sigma)

## learn_DAG()

Function learn_DAG() implements an MCMC algorithm to sample from the joint posterior of DAGs and DAG parameters. This is based on a Partial Analytic Structure (PAS) algorithm (Godsill, 2012) which, at each iteration:

1. Updates the DAG through a Metropolis-Hastings (MH) step where, given the current DAG, a new (direct successor) DAG is drawn from a suitable proposal distribution and accepted with a probability given by the MH acceptance rate (see also section A note on fast = TRUE);
2. Samples from the posterior distribution of the (updated DAG) parameters; see also Castelletti & Consonni (2021) for more details.

We implement it as follows:

out <- learn_DAG(S = 5000, burn = 1000, data = X,
a, U, w,
fast = FALSE, save.memory = FALSE, collapse = FALSE)

### Input

Inputs of learn_DAG() correspond to three different sets of arguments:

• S, burn and data are standard inputs required by any MCMC algorithm. In particular, S defines the desired length of the chain, which is obtained by discarding the first burn observations (the total number of sampled observations is therefore S + burn); data is instead the $$(n,q)$$ matrix $$\boldsymbol{X}$$;
• a, U and w are hyperparameters of the priors on DAGs (w) and DAG parameters (a, U); see also Equation [REF]. The same appear in functions rDAG() and rDAGWishart() which were introduced in our vignette [ADD REF TO THE VIGNETTE].
• fast, save.memory and collapse are boolean arguments which allow to: implement an approximate proposal distribution within the MCMC scheme (fast = TRUE); change the array structure of the stored MCMC output into strings in order to save memory (save.memory = TRUE); implement an MCMC for DAG structure learning only, without sampling from the posterior of parameters (collapse = TRUE). See also A note on fast = TRUE and Castelletti & Mascaro (2022+) for full details.

### Output

The output of learn_DAG() is an object of class bcdag:

class(out)
#>  "bcdag"

bcdag objects include the output of the MCMC algorithm together with a collection of meta-data representing the input arguments of learn_DAG(); these are stored in the attributes of the object::

str(out)
#> List of 3
#>  $Graphs: num [1:8, 1:8, 1:5000] 0 0 0 0 1 0 0 1 0 0 ... #>$ L     : num [1:8, 1:8, 1:5000] 1 0 0 0 -0.0194 ...
#>  $D : num [1:8, 1:8, 1:5000] 0.164 0 0 0 0 ... #> - attr(*, "class")= chr "bcdag" #> - attr(*, "type")= chr "complete" #> - attr(*, "input")=List of 10 #> ..$ S          : num 5000
#>   ..$burn : num 1000 #> ..$ data       : num [1:1000, 1:8] -0.299 -0.351 -0.631 0.219 -0.351 ...
#>   ..$a : num 8 #> ..$ U          : num [1:8, 1:8] 1 0 0 0 0 0 0 0 0 1 ...
#>   ..$w : num 0.2 #> ..$ fast       : logi FALSE
#>   ..$save.memory: logi FALSE #> ..$ collapse   : logi FALSE
#>   ..$verbose : logi TRUE Attribute type refers to the output of learn_DAG(), whose structure depends on the choice of boolean arguments save.memory and collapse. When both are set equal to FALSE, as in the previous example, the output of learn_DAG() is a complete bcdag object, collecting three $$(q,q,S)$$ arrays with the DAG structures (in the form of $$q \times q$$ adjacency matrices) and the DAG parameters $$\boldsymbol{L}$$ and $$\boldsymbol{D}$$ (both as $$q \times q$$ matrices) sampled by the MCMC: out$Graphs[,,1]
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    0    0    0    0    0    0    0    0
#> [2,]    0    0    0    0    0    1    0    0
#> [3,]    0    0    0    0    0    0    0    1
#> [4,]    0    0    0    0    0    0    0    0
#> [5,]    1    0    0    0    0    0    0    0
#> [6,]    0    0    0    0    0    0    0    0
#> [7,]    0    1    0    1    0    0    0    0
#> [8,]    1    0    0    0    0    0    0    0
round(out$L[,,1],2) #> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] #> [1,] 1.00 0.00 0 0.00 0 0.00 0 0.00 #> [2,] 0.00 1.00 0 0.00 0 0.07 0 0.00 #> [3,] 0.00 0.00 1 0.00 0 0.00 0 0.94 #> [4,] 0.00 0.00 0 1.00 0 0.00 0 0.00 #> [5,] -0.02 0.00 0 0.00 1 0.00 0 0.00 #> [6,] 0.00 0.00 0 0.00 0 1.00 0 0.00 #> [7,] 0.00 -0.02 0 -1.97 0 0.00 1 0.00 #> [8,] 0.40 0.00 0 0.00 0 0.00 0 1.00 round(out$D[,,1],2)
#>      [,1] [,2] [,3] [,4]  [,5] [,6] [,7] [,8]
#> [1,] 0.16 0.00 0.00 0.00  0.00  0.0 0.00 0.00
#> [2,] 0.00 0.57 0.00 0.00  0.00  0.0 0.00 0.00
#> [3,] 0.00 0.00 0.74 0.00  0.00  0.0 0.00 0.00
#> [4,] 0.00 0.00 0.00 0.94  0.00  0.0 0.00 0.00
#> [5,] 0.00 0.00 0.00 0.00 77.94  0.0 0.00 0.00
#> [6,] 0.00 0.00 0.00 0.00  0.00  0.8 0.00 0.00
#> [7,] 0.00 0.00 0.00 0.00  0.00  0.0 0.31 0.00
#> [8,] 0.00 0.00 0.00 0.00  0.00  0.0 0.00 0.96

When collapse = TRUE and save.memory = FALSE the output of learn_DAG() is a collapsed bcdag object, consisting of a $$(q,q,S)$$ array with the adjacency matrices of the DAGs sampled by the MCMC:

collapsed_out <- learn_DAG(S = 5000, burn = 1000, data = X,
a, U, w,
fast = FALSE, save.memory = FALSE, collapse = TRUE)
names(collapsed_out)
#>  "Graphs"
class(collapsed_out)
#>  "bcdag"
attributes(collapsed_out)$type #>  "collapsed" collapsed_out$Graphs[,,1]
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    0    0    0    0    0    0    0    0
#> [2,]    0    0    0    0    0    0    0    0
#> [3,]    0    0    0    0    0    0    0    0
#> [4,]    0    0    0    0    0    0    1    0
#> [5,]    1    0    0    0    0    0    0    0
#> [6,]    0    0    0    0    0    0    0    0
#> [7,]    0    0    0    0    0    0    0    1
#> [8,]    1    0    1    0    0    0    0    0

When save.memory = TRUE and collapse = FALSE, the output is a compressed bcdag object, collecting samples from the joint posterior on DAGs and DAG parameters in the form of a vector of strings:

compressed_out <- learn_DAG(S = 5000, burn = 1000, data = X,
a, U, w,
fast = FALSE, save.memory = TRUE, collapse = FALSE)
names(compressed_out)
#>  "Graphs" "L"      "D"
class(compressed_out)
#>  "bcdag"
attributes(compressed_out)$type #>  "compressed" In such a case, we can access to the MCMC draws as: compressed_out$Graphs
#>  "0;0;0;0;1;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;1;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0"
compressed_out$L #>  "1;0;0;0;-0.0207148822282879;0;0;0.389080449183578;0;1;0;0;0;0;0;0;0;0;1;0;0;0;0;0.464492744238912;0;0;0;1;0;0;-1.76444008583527;-0.020197409100651;0;0;0;0;1;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;1" compressed_out$D
#>  "0.154822760388322;0;0;0;0;0;0;0;0;0.544421200660809;0;0;0;0;0;0;0;0;0.395261505201729;0;0;0;0;0;0;0;0;1.04755025995276;0;0;0;0;0;0;0;0;67.9926231564594;0;0;0;0;0;0;0;0;0.831649133274509;0;0;0;0;0;0;0;0;0.303043124888535;0;0;0;0;0;0;0;0;1.76018923453861"

In addition, we implement bd_decode, an internal function that can be used to visualize the previous objects as matrices:

bd_decode(compressed_out$Graphs) #> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] #> [1,] 0 0 0 0 0 0 0 0 #> [2,] 0 0 0 0 0 0 0 0 #> [3,] 0 0 0 0 0 0 0 0 #> [4,] 0 0 0 0 0 0 0 0 #> [5,] 1 0 0 0 0 0 0 0 #> [6,] 0 0 0 0 0 0 0 0 #> [7,] 0 0 0 1 0 0 0 0 #> [8,] 1 0 1 1 0 0 0 0 round(bd_decode(compressed_out$L),2)
#>       [,1] [,2] [,3]  [,4] [,5] [,6] [,7] [,8]
#> [1,]  1.00    0 0.00  0.00    0    0    0    0
#> [2,]  0.00    1 0.00  0.00    0    0    0    0
#> [3,]  0.00    0 1.00  0.00    0    0    0    0
#> [4,]  0.00    0 0.00  1.00    0    0    0    0
#> [5,] -0.02    0 0.00  0.00    1    0    0    0
#> [6,]  0.00    0 0.00  0.00    0    1    0    0
#> [7,]  0.00    0 0.00 -1.76    0    0    1    0
#> [8,]  0.39    0 0.46 -0.02    0    0    0    1
round(bd_decode(compressed_out$D),2) #> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] #> [1,] 0.15 0.00 0.0 0.00 0.00 0.00 0.0 0.00 #> [2,] 0.00 0.54 0.0 0.00 0.00 0.00 0.0 0.00 #> [3,] 0.00 0.00 0.4 0.00 0.00 0.00 0.0 0.00 #> [4,] 0.00 0.00 0.0 1.05 0.00 0.00 0.0 0.00 #> [5,] 0.00 0.00 0.0 0.00 67.99 0.00 0.0 0.00 #> [6,] 0.00 0.00 0.0 0.00 0.00 0.83 0.0 0.00 #> [7,] 0.00 0.00 0.0 0.00 0.00 0.00 0.3 0.00 #> [8,] 0.00 0.00 0.0 0.00 0.00 0.00 0.0 1.76 Finally, if save.memory = TRUE and collapse = TRUE, the output of learn_DAG() is a compressed and collapsed bcdag object collecting only the sampled DAGs represented as vector of strings: comprcoll_out <- learn_DAG(S = 5000, burn = 1000, data = X, a, U, w, fast = FALSE, save.memory = TRUE, collapse = TRUE) names(comprcoll_out) #>  "Graphs" class(comprcoll_out) #>  "bcdag" attributes(comprcoll_out)$type
#>  "compressed and collapsed"
bd_decode(comprcoll_out\$Graphs)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    0    0    0    0    0    0    0    0
#> [2,]    0    0    0    0    0    1    0    0
#> [3,]    0    0    0    0    0    0    0    0
#> [4,]    1    0    0    0    0    0    0    0
#> [5,]    1    0    0    0    0    0    0    0
#> [6,]    0    0    0    0    0    0    0    0
#> [7,]    0    0    0    1    0    0    0    0
#> [8,]    1    0    1    0    0    0    0    0

## A note on fast = TRUE

Step 1. of the MCMC scheme implemented by learn_DAG() updates DAG $$\mathcal{D}$$ by randomly drawing a new candidate DAG $$\mathcal{D}'$$ from a proposal distribution and then accepting it with probability given by the Metropolis Hastings (MH) acceptance rate; see also Castelletti & Mascaro (2021). For a given DAG $$\mathcal{D}$$, the proposal distribution $$q(\mathcal{D}'\,|\,\mathcal{D})$$ is built over the set $$\mathcal{O}_{\mathcal{D}}$$ of direct successors DAGs that can be reached from $$\mathcal{D}$$ by inserting, deleting or reversing a single edge in $$\mathcal{D}$$. A DAG $$\mathcal{D}'$$ is then proposed uniformly from the set $$\mathcal{O}_{\mathcal{D}}$$ so that $$q(\mathcal{D}'\,|\,\mathcal{D})=1/|\mathcal{O}_{\mathcal{D}}|$$. Moreover, the MH rate requires to evaluate the ratio of proposals $$q(\mathcal{D}'\,|\,\mathcal{D})/q(\mathcal{D}\,|\,\mathcal{D}') = |\mathcal{O}_{\mathcal{D}'}|/|\mathcal{O}_{\mathcal{D}}|$$, and accordingly the construction of both $$\mathcal{O}_{\mathcal{D}}$$ and $$\mathcal{O}_{\mathcal{D}'}$$.

If fast = FALSE, the proposal ratio is computed exactly; this requires the enumerations of $$\mathcal{O}_\mathcal{D}$$ and $$\mathcal{O}_{\mathcal{D}'}$$ which may become computationally expensive, especially when $$q$$ is large. However, the ratio approaches $$1$$ as the number of variables $$q$$ increases: option fast = TRUE implements such an approximation, which therefore avoids the construction of $$\mathcal{O}_\mathcal{D}$$ and $$\mathcal{O}_{\mathcal{D}'}$$. A comparison between fast = FALSE and fast = TRUE in the execution of learn_DAG() produces the following results in terms of computational time:

# No approximation
time_nofast <- system.time(out_nofast <- learn_DAG(S = 5000, burn = 1000, data = X,
a, U, w,
fast = FALSE, save.memory = FALSE, collapse = FALSE))
# Approximation
time_fast <- system.time(out_fast <- learn_DAG(S = 5000, burn = 1000, data = X,
a, U, w,
fast = TRUE, save.memory = FALSE, collapse = FALSE))
time_nofast
#>    utente   sistema trascorso
#>     16.44      0.00     16.44
time_fast
#>    utente   sistema trascorso
#>      7.34      0.01      7.36

Finally, the corresponding estimated posterior probabilities of edge inclusion are the following:

round(get_edgeprobs(out_nofast), 2)
#>      1    2    3    4 5    6    7    8
#> 1 0.00 0.06 0.07 0.00 0 0.00 0.00 0.00
#> 2 0.05 0.00 0.02 0.03 0 0.21 0.04 0.00
#> 3 0.09 0.02 0.00 0.00 0 0.01 0.02 0.52
#> 4 0.00 0.06 0.00 0.00 0 0.01 0.52 0.00
#> 5 1.00 0.00 0.00 0.00 0 0.00 0.01 0.00
#> 6 0.03 0.26 0.00 0.00 0 0.00 0.02 0.00
#> 7 0.03 0.01 0.00 0.48 0 0.03 0.00 0.02
#> 8 1.00 0.00 0.48 0.01 0 0.00 0.03 0.00
round(get_edgeprobs(out_fast), 2)
#>      1    2    3    4    5    6    7    8
#> 1 0.00 0.04 0.01 0.02 0.00 0.04 0.01 0.00
#> 2 0.05 0.00 0.00 0.03 0.00 0.20 0.01 0.00
#> 3 0.05 0.00 0.00 0.00 0.01 0.01 0.06 0.55
#> 4 0.02 0.03 0.00 0.00 0.00 0.02 0.55 0.01
#> 5 1.00 0.00 0.01 0.00 0.00 0.00 0.01 0.00
#> 6 0.06 0.24 0.01 0.00 0.01 0.00 0.03 0.01
#> 7 0.00 0.02 0.03 0.45 0.00 0.01 0.00 0.00
#> 8 1.00 0.00 0.45 0.02 0.00 0.04 0.00 0.00