In this tutorial we go through some of the functionality of the `NetworkInference`

package using the example application from Desmarais et al. (2015) and extending it with the recently released State Policy Innovation and Diffusion (SPID) database. In this paper Desmarais et al. infer a latent network for policy diffusion based on adoption of policies in the US states.

`netinf`

infers the optimal diffusion network from a set of *nodes* (in this case US-states) and a number of so called *cascades* (here a cascade corresponds to a policy that is adopted by at least two states). A cascade is a series of events occurring at a specified time.

Installing `NetworkInference`

:

Some other packages required for this tutorial:

The policy adoption data is available in the package:

This loads two `data.frame`

-objects. `policies`

contains the adoption events and `policies_metadata`

contains additional information on each policy.

statenam | policy | adopt_year |
---|---|---|

Alaska | aboldeapen | 1957 |

Delaware | aboldeapen | 1958 |

Hawaii | aboldeapen | 1957 |

Iowa | aboldeapen | 1965 |

Maine | aboldeapen | 1887 |

Michigan | aboldeapen | 1846 |

The first column (`statenam`

) gives the state that adopted the `policy`

in year `adopt_year`

. Let’s take a closer look at the data.

Number of policies:

Number of adoption events:

Some example policies:

```
unique(policies$policy)[1:5]
#> [1] "aboldeapen" "aborparc"
#> [3] "aborparn" "aborpreroe"
#> [5] "abortion_consent_1973_199"
```

Not all the policy abbreviations are understandable. So let’s check the metadata for more information:

policy | first_year | last_year | adopt_count |
---|---|---|---|

civil defense and disaster compact | 1951 | 1978 | 19 |

civinjaut | 1998 | 2001 | 15 |

class actions | 1977 | 1980 | 2 |

clinic_access | 1973 | 2005 | 16 |

cogrowman | 1961 | 1998 | 10 |

description | majortopic |
---|---|

Bilateral/Mutual Aid For Disasters | Domestic Commerce |

Civil Injunction Authority | Law and Crime |

Provides Complete Procedures To Govern The Conduct Of A Class Action Law Suit. | Law and Crime |

No-Protest Zone Around Abortion Clinic | Civil Rights |

Requiring Local Government To Coordinate Growth Management | Housing |

Most functionality of the `NetworkInference`

package is based on the `cascades`

data format. So before starting with the analysis we have to transform our data to such an object.

```
policy_cascades <- as_cascade_long(policies, cascade_node_name = 'statenam',
event_time = 'adopt_year',
cascade_id = 'policy')
```

In this case we used the function `as_cascade_long`

. If your data is in wide format you can convert it using the function `as_cascade_wide`

.

The `cascade`

class contains the same data as the `policies`

`data.frame`

, just in a different format. You don’t need to understand how the object is constructed but let’s take a look for clarity:

```
class(policy_cascades)
#> [1] "cascade" "list"
length(policy_cascades)
#> [1] 3
names(policy_cascades)
#> [1] "cascade_nodes" "cascade_times" "node_names"
```

The `cascade`

class is basically a list containing three elements:

```
policy_cascades$cascade_nodes[2:3]
#> $aborparc
#> [1] "Louisiana" "Massachusetts" "Rhode Island" "Missouri"
#> [5] "Indiana" "Alabama" "Maine" "Wyoming"
#> [9] "South Carolina" "Michigan" "Wisconsin" "Kentucky"
#> [13] "Pennsylvania" "North Carolina" "Tennessee"
#>
#> $aborparn
#> [1] "Minnesota" "Utah" "Arizona" "West Virginia"
#> [5] "Arkansas" "Connecticut" "Ohio" "Georgia"
#> [9] "Nebraska" "Kansas" "Maryland" "Tennessee"
#> [13] "Delaware" "Idaho" "South Dakota" "Virginia"
#> [17] "Texas"
```

```
policy_cascades$cascade_times[2:3]
#> $aborparc
#> [1] 1981 1981 1982 1983 1984 1987 1989 1989 1990 1991 1992 1994 1994 1996
#> [15] 1999
#>
#> $aborparn
#> [1] 1981 1981 1982 1984 1989 1990 1990 1991 1991 1992 1992 1992 1996 1996
#> [15] 1998 1998 2000
```

There are a few convenience functions to manipulate the cascades (but you can also manipulate the data before converting it to the `cascade`

format).

```
selected_policies <- subset_cascade(cascade = policy_cascades,
selection = c('clinic_access', 'cogrowman'))
selected_policies[1:2]
#> $cascade_nodes
#> $cascade_nodes$clinic_access
#> [1] "Oregon" "Wisconsin" "Kansas" "Maryland"
#> [5] "California" "Nevada" "Colorado" "Connecticut"
#> [9] "Massachusetts" "Minnesota" "North Carolina" "Washington"
#> [13] "Maine" "Michigan" "New York" "Montana"
#>
#> $cascade_nodes$cogrowman
#> [1] "Hawaii" "Vermont" "Oregon" "Florida"
#> [5] "New Jersey" "Rhode Island" "Washington" "Maryland"
#> [9] "Arizona" "Tennessee"
#>
#>
#> $cascade_times
#> $cascade_times$clinic_access
#> [1] 1973 1985 1988 1989 1990 1991 1993 1993 1993 1993 1993 1993 1995 1995
#> [15] 1999 2005
#>
#> $cascade_times$cogrowman
#> [1] 1961 1970 1973 1985 1986 1988 1990 1992 1998 1998
```

```
time_constrained <- subset_cascade_time(cascade = selected_policies,
start_time = 1990, end_time = 2000)
time_constrained[1:2]
#> $cascade_nodes
#> $cascade_nodes$clinic_access
#> [1] "California" "Nevada" "Colorado" "Connecticut"
#> [5] "Massachusetts" "Minnesota" "North Carolina" "Washington"
#> [9] "Maine" "Michigan" "New York"
#>
#> $cascade_nodes$cogrowman
#> [1] "Washington" "Maryland" "Arizona" "Tennessee"
#>
#>
#> $cascade_times
#> $cascade_times$clinic_access
#> [1] 1990 1991 1993 1993 1993 1993 1993 1993 1995 1995 1999
#>
#> $cascade_times$cogrowman
#> [1] 1990 1992 1998 1998
```

```
less_nodes <- drop_nodes(cascades = time_constrained,
nodes = c('Maryland', 'Washington'))
less_nodes[1:2]
#> $cascade_nodes
#> $cascade_nodes$clinic_access
#> [1] "California" "Nevada" "Colorado" "Connecticut"
#> [5] "Massachusetts" "Minnesota" "North Carolina" "Maine"
#> [9] "Michigan" "New York"
#>
#> $cascade_nodes$cogrowman
#> [1] "Arizona" "Tennessee"
#>
#>
#> $cascade_times
#> $cascade_times$clinic_access
#> [1] 1990 1991 1993 1993 1993 1993 1993 1995 1995 1999
#>
#> $cascade_times$cogrowman
#> [1] 1998 1998
```

It’s always good practice to visually inspect the data before working with it. The `NetworkInference`

package provides functionality to visualize the cascade data.

The function `summary.cascades()`

provides quick summary statistics on the cascade data:

```
summary(policy_cascades)
#> # cascades: 728
#> # nodes: 50
#> # nodes in cascades: 50
#> # possible edges: 2450
#>
#> Summary statistics for cascade length and number of ties:
#> length ties
#> Min. 1.00000 0.00000
#> 1st Qu. 11.00000 4.00000
#> Median 23.00000 11.00000
#> Mean 24.49863 14.53571
#> 3rd Qu. 40.00000 23.00000
#> Max. 50.00000 48.00000
```

The first four lines provide the number of cascades, the number of nodes in the system, the number of nodes involved in cascades (there might be states that we don’t have diffusion data on, but we still want them represented in the dataset) and the possible number of edges in a potential diffusion network (a diffusion edge between nodes `u`

and `v`

only makes sense if there is at least one cascade in which `u`

experiences an event before `v`

). In this example there are 187 policies and 50 states. Each state is involved in at least one policy cascade and a fully connected diffusion network would have 2450 edges.

It also provides summary statistics on the distribution of the cascade lengths (number of nodes involved in each cascade) and the number of ties in the cascades (two nodes experiencing the same event at the same time). For our example, we can see that the ‘smallest’ policy was adopted by 10 states and the ‘largest’ by all 50 states. From the tie summaries we see that there is at least one policy that was adopted by 45 states in the same year.

The `plot()`

method allows to plot cascades with varying degrees of detail. The argument `label_nodes`

(`TRUE/FALSE`

) provides node labels which require more space but provide more detail. The argument `selection`

allows to pick a subset of cascades to visualize in case there are too many to plot. If `label_nodes`

is set to `FALSE`

each event is depicted by a dot, which allows to visualize more cascades simultaneously.

Let’s first look at the visualization with labels. Here we plot two cascades, selected by their name:

```
selection <- c('guncontrol_assaultweapon_ba', 'guncontrol_licenses_dealer')
plot(policy_cascades, label_nodes = TRUE, selection = selection)
```

We can also plot more cascades with less detail:

```
selection <- c('waiting', 'threestrikes', 'unionlimits', 'smokeban',
'paperterror', 'miglab', 'methpre', 'lott', 'lemon', 'idtheft',
'harass', 'hatecrime', 'equalpay')
plot(policy_cascades, label_nodes = FALSE, selection = selection)
```

This produces a ‘violin plot’ for each cascade with the single diffusion events overplotted as dots. As we already saw in the previous visualization, the policy data has a lot of ties (i.e. many states adopted a policy in the same year) which is indicated by the areas of higher density in the violin plot.

The `netinf`

algorithm is implemented in the `netinf()`

function. The `netinf`

inferrs edges based on a diffusion model. That is we assume a parametric model for the diffusion times between edges. Currently three different diffusion models are implemented: The exponential distribution, the rayleigh distribution and the log-normal distribution. The model can be chosen with the `trans_mod`

argument (default is the exponential distribution).

In the original implementation the number of edges to infer had to be fixed and chosen by the researcher. If you want to run `netinf`

in this classic way you can do so by specifiying all parameters and the number of edges:

```
results <- netinf(policy_cascades, trans_mod = "exponential", n_edges = 100,
params = 0.5, quiet = TRUE)
```

The exponential model has only one parameter (lambda or the rate). If there are more parameters the details section in the documentation of the `netinf`

function (`?netinf`

) has more detail on how to specify parameters and on the specific parametrization used by the implementation.

`n_edges`

specifies how many edges should be inferred. See @gomez2010inferring and @desmarais2015persistent for guidance on choosing this parameter if running netinf in classic mode. If the number of edges is specified manually, it has to be lower than the maximum number of possible edges. An edge `u->v`

is only possible if in at least one cascade `u`

experiences an event *before* `v`

. This means, that the maximum number of edges depends on the data. The function `count_possible_edges()`

allows us to compute the maximum number of edges in advance:

With version 1.2.0 `netinf`

can be run without manually specifying the number of edges or the parameters of the diffusion model.

After each iteration of the netinf algorithm, we check if the edge added significant improvement to the network. This is done via a vuong style test. Given the likelihood score for each cascade conditional on the network inferred so far, we penalize the network with one addional edge and test if the increase in likelihood accross all cascades is significant. The user still has to specify a p-value cut-off. If the p-value of an edge is larger than the specified cut-off the algorithm stops inferring more edges. The cut-off is set via the `p_value_cutoff`

argument.

```
results <- netinf(policy_cascades, trans_mod = "exponential",
p_value_cutoff = 0.1, params = 0.5, quiet = TRUE)
nrow(results)
#> [1] 872
```

We see that with a fixed lambda of 0.5 and a p-value cut-off of 0.1 the algorithm inferred 872 edges.

The diffusion model parameters can be selected automatically. Setting the `params`

argument to `NULL`

(default value) makes the `netinf`

function initialize the parameters automatically. The parameters are initialized at the midpoint between the MLE of the minimum diffusion times and the MLE of the maximum diffusion times, across all cascades. Edges are then inferred until either the p-value cut-off or a manually specified number of edges (`n_edges`

) is reached.

Let’s take a look at the output of the algorithm. The output is a dataframe containing the inferred latent network in the form of an edgelist:

origin_node | destination_node | improvement | p_value |
---|---|---|---|

California | Oregon | 3745 | 0 |

California | Colorado | 3363 | 0 |

California | Maryland | 3356 | 0 |

California | Washington | 3307 | 0 |

California | Illinois | 3286 | 0 |

California | North Carolina | 3264 | 0 |

Each row corresponds to a directed edge. The first column indicates the origin node, the second the destination node. The third column displays the gain in model fit from each added edge. The last column displays the p-value from the Vuong test of each edge. There is a generic plot method to inspect the results. If more tweaking is required, the results are a dataframe so it should be easy for the more experienced users to make your own plot. With `type = "improvement"`

the improvement from each edge can be plotted:

We can also quickly check the p-value from the Vuong test associated with each edge addition:

In order to produce a quick visualization of the resulting diffusion network we can use the plot method again, this time with `type = "network"`

. Note that in order to use this functionality the igraph package has to be installed.

```
#install.packages('igraph')
# For this functionality the igraph package has to be installed
# This code is only executed if the package is found:
if(requireNamespace("igraph", quietly = TRUE)) {
plot(results, type = "network")
}
```

If additional tweaking of the plot is desired, the network can be visualized using `igraph`

explicitly. We refer you you to the igraph documentation for details on how to customize the plot.