Example Streaming RL on Simulated Data

library(bstrl)

Data

We will be using the included data, geco_small, a small simulated dataset with 7 files, 14 fields per file, known true identities, and 3 errors per duplicated record. This variable contains a list of data frames, each of which represents a file with 10 records each.

length(geco_small)
#> [1] 7
head(geco_small[[1]])
#>           rec.id entity given.name    surname age occup extra1 extra2 extra3
#> 3   rec-0252-org    252   harrison  widdowson   a     4      6      7      5
#> 7   rec-0534-org    534     taylah     deakin   f     2     12      4      8
#> 10  rec-0605-org    605        jai      green   k     2      2     10      4
#> 73  rec-4150-org   4150    spencer      blake   g     5      4     12     11
#> 107 rec-5641-org   5641    makenzi   chandler   j     2      6      4      4
#> 112 rec-5932-org   5932      blake gilbertson   g     1     12     11      9
#>     extra4 extra5 extra6 extra7 extra8 extra9 extra10
#> 3        1      8      3      3      1      1       8
#> 7        7      3      5      9      2      2       6
#> 10       5     10      2      6      1      5       4
#> 73      10      5      6      7      5      1       5
#> 107      8     11      7     11      7     10       6
#> 112     12      4      4      7      8      7       3

A larger version of this dataset is also included as geco_30over_3err.

To perform linkage, we first have to define the fields we will be using and their data types. For this example we will only use a subset of the available fields.

# Names of the columns on which to perform linkage
fieldnames <- c("given.name", "surname", "age", "occup", "extra1", "extra2", "extra3", "extra4", "extra5", "extra6")

# How to compare each of the fields
types <- c("lv", "lv", # First name and last name use normalized edit distance
           "bi", "bi", "bi", "bi", "bi", "bi", "bi", "bi") # All others binary equal/unequal
breaks <- c(0, 0.25, 0.5) # Break continuous difference measures into 4 levels using these split points

We will link using the fields given.name, surname, age, occup, and extra1 through extra6. Given name and surname are compared using normalized Levenshtein edit distance, with breaks at 0, 0.25, and 0.5. In other words, exact equality is given its own level, then there are three levels for varying inequality. All other fields are compared in a binary fashion.

Streaming Record Linkage

Base two file linkage

To perform streaming record linkage, we need an initial two-file linkage to use as a base for streaming updates. This is accomplished using the bipartiteRL function which outsources the task to the BRL package.

res.twofile <- bipartiteRL(geco_small[[1]], geco_small[[2]],
                           flds = fieldnames, types = types, breaks = breaks,
                           nIter = 600, burn = 100,
                           seed = 0)

It is important to pass the comparison details we created earlier to this function as shown. These will be stored with the result and used in future streaming updates. Further information on function parameters can be found in the documentation by running help(bipartiteRL).

PPRB updates

To perform PPRB updates, pass an existing result object and a new data frame to the function PPRBupdate, along with parameters that define how the sampling should proceed.

res.pprb3 <- PPRBupdate(res.twofile, geco_small[[3]], # Comparison details are stored with previous result
                       nIter = 600, burn = 100,
                       seed = 0,
                       refresh = 0.05)
#> 30/600 [5%] (burn)
#> 60/600 [10%] (burn)
#> 90/600 [15%] (burn)
#> 120/600 [20%]
#> 150/600 [25%]
#> 180/600 [30%]
#> 210/600 [35%]
#> 240/600 [40%]
#> 270/600 [45%]
#> 300/600 [50%]
#> 330/600 [55%]
#> 360/600 [60%]
#> 390/600 [65%]
#> 420/600 [70%]
#> 450/600 [75%]
#> 480/600 [80%]
#> 510/600 [85%]
#> 540/600 [90%]
#> 570/600 [95%]
#> 600/600 [100%]
res.pprb4 <- PPRBupdate(res.pprb3, geco_small[[4]], # Comparison details are stored with previous result
                       nIter = 600, burn = 100,
                       seed = 0,
                       refresh = 0.05)
#> 30/600 [5%] (burn)
#> 60/600 [10%] (burn)
#> 90/600 [15%] (burn)
#> 120/600 [20%]
#> 150/600 [25%]
#> 180/600 [30%]
#> 210/600 [35%]
#> 240/600 [40%]
#> 270/600 [45%]
#> 300/600 [50%]
#> 330/600 [55%]
#> 360/600 [60%]
#> 390/600 [65%]
#> 420/600 [70%]
#> 450/600 [75%]
#> 480/600 [80%]
#> 510/600 [85%]
#> 540/600 [90%]
#> 570/600 [95%]
#> 600/600 [100%]

Further information on function parameters can be found in the documentation by running help(PRPBupdate).

SMCMC updates

To perform SMCMC updates, pass an existing result object and a new data frame to the function SMCMCupdate, along with parameters that define how the sampling should proceed.

First, SMCMC can work with a smaller number of iterations than PPRB. To filter an existing result object by thinning every \(n^{th}\) posterior sample, run the thinsamples function. SMCMC produces independent samples from the posterior, so filter to the number of independent samples that would be desired from the posterior distribution for estimating your parameters or quantities of interest.

filtered <- thinsamples(res.twofile, 50) # Don't need 500, 50 for demonstration

The filtered sample pool can be passed to a streaming update in the same way as any result object.

res.smcmc3 <- SMCMCupdate(filtered, geco_small[[3]],
                         nIter.jumping=2, nIter.transition = 8,
                         proposals.jumping="component", proposals.transition="component", #Either can be LB, but increase corresponding number of iterations
                         cores = 2) # Parallel execution
#> Random seed not set in parallel execution.
res.smcmc4 <- SMCMCupdate(res.smcmc3, geco_small[[4]],
                         nIter.jumping=2, nIter.transition = 8,
                         proposals.jumping="component", proposals.transition="component", #Either can be LB, but increase corresponding number of iterations
                         cores = 2) # Parallel execution
#> Random seed not set in parallel execution.

SMCMC uses two MCMC kernels, a jumping kernel and a transition kernel, which are performed on each member of the ensemble in parallel. The jumping kernel is used to initialize the links to the latest file, and the transition kernel is used to simultaneously update all parameters.

This results in two differences to the function’s parameters. Instead of a single nIter parameter, SMCMCupdate has two: nIter.jumping and nIter.transition. The function also has two parameters, proposals.jumping and proposals.transition, which define the update method for the link parameters. nIter.jumping can be relatively smaller than nIter.transition, since the transition kernel will continue to update links to the latest file. Both values must be larger if either proposal is set to “LB” - locally balanced proposals have slower mixing than component-wise.

Analyzing results

Results objects are made up of a list containing posterior samples of each parameter.

names(res.pprb4)
#>  [1] "Z"           "m"           "u"           "files"       "comparisons"
#>  [6] "priors"      "cmpdetails"  "m.fc.pars"   "u.fc.pars"   "diagnostics"

The values of Z, m, and u are link parameters. They are stored as matrices where each column is a posterior sample and each row is a component of the vector-valued parameters. The value of diagnostics is used internally by some functions, and all other values store details necessary to perform streaming updates when further files arrive.

# All have 500 columns and different numbers of rows.
dim(res.pprb4$Z)
#> [1]  30 500
dim(res.pprb4$m)
#> [1]  24 500
dim(res.pprb4$u)
#> [1]  24 500

The Z parameter contains posterior samples of links between records, so is the main parameter of interest in record linkage applications.

# The first post-burn MCMC sample of Z
Zexample <- res.pprb4$Z[,1]
Zexample
#>  [1]  3 12 13 14 15 16  7  9 10 20  1 22 13 24 14 26  6  8 29 20 31 32 33 34  5
#> [26] 36 18 38 39 30

The parameter \(Z\) contains one value per record starting in file 2. The value at an index gives the record to which the corresponding record is linked. For example,

Zexample[8]
#> [1] 9

indicates that the \(8^{th}\) record in file 2 is linked to the \(9^{th}\) record in file 1. These links can also lead to further links, such as

Zexample[27]
#> [1] 18
Zexample[8]
#> [1] 9

collectively defining a cluster of the \(7^{th}\) record in file 4, the \(8^{th}\) record in file 2 and the \(9^{th}\) record in file 1.

This is unwieldy for an increasing number of files or for larger files, so functions to process these links are provided.

# Create a list of length 500, where each element is one streaming link object
# for each posterior sample.
samples <- extractlinks(res.pprb4)

# Are record 9 in file 1 and record 7 in file 4 linked in the first posterior sample?
islinked(samples[[1]], file1=1, record1=9, file2=4, record2=7)
#> [1] TRUE

# In what proportion of posterior samples are record 9 in file 1 and record 7 in file 4 linked?
mean(sapply(samples, islinked, file1=1, record1=9, file2=4, record2=7))
#> [1] 1

# In what proportion of posterior samples are record 8 in file 1 and record 1 in file 2 linked?
mean(sapply(samples, islinked, file1=1, record1=8, file2=2, record2=1))
#> [1] 0

Method notes and caveats

Locally balanced proposals

Locally balanced proposals are an alternate way for link vectors to be sampled by an MCMC. While component-wise proposals draw values from the full conditional distribution of each component of a link vector - essentially updating the link of each record in each file sequentially - locally balanced proposals perform an add, delete, or swap operation based on the target posterior probability. Locally balanced proposals sample by more intelligently moving through the space of possible links between records at the cost of slower mixing (because fewer links can be updated in any one iteration). Locally balanced proposals have another advantage in that they can be blocked. Blocking allows for a locally balanced proposal to only consider a small subset of records for its proposal, reducing the time required at the cost of even slower mixing.

In streaming update functions such as SMCMCupdate, the proposals.jumping and proposals.transition parameters can instruct the sampler to use locally balanced proposals and the blocksize parameter can be used to enable blocking. Similar options are also available in the multifileRL function.

Mixing Streaming Updates

Streaming updates can be mixed and matched. For example, file 3 can be incorporated using a PPRB update and file 4 can be incorporated using an SMCMC update.

MCMC Iterations

The number of MCMC iterations does not need to be the same in subsequent streaming updates. This is especially useful when alternating between SMCMC updates (which can function well with small ensembles) and PPRB updates (which can quickly produce large numbers of samples).

PPRB Sample Degredation

As a filtering method, PPRB reduces the number of distinct values within its sample pool after repeated use. Eventually, this can lead to inaccurate estimates of quantities of interest as the pool has degraded too much. To couteract this, SMCMC updates can be occasionally used to refresh the diversity of available samples.

Non-streaming Multifile Record Linkage

This package also provides the option to perform non-streaming, multi-file record linkage using a Gibbs sampler. This can be done using the multifileRL function. This is significantly slower than streaming record linkage and is only provided for educational purposes.

# Link three files from scratch, returning 500 posterior samples
res.threefile <- multifileRL(geco_small[1:3],
                             flds = fieldnames, types = types, breaks = breaks,
                             nIter = 600, burn=100, # Number of iterations to run
                             proposals = "comp", # Change to "LB" for faster iterations, slower convergence
                             seed = 0,
                             refresh = 0.05) # Print progress every 5% of run.
#> 30/600 [5%] (burn)
#> 60/600 [10%] (burn)
#> 90/600 [15%] (burn)
#> 120/600 [20%]
#> 150/600 [25%]
#> 180/600 [30%]
#> 210/600 [35%]
#> 240/600 [40%]
#> 270/600 [45%]
#> 300/600 [50%]
#> 330/600 [55%]
#> 360/600 [60%]
#> 390/600 [65%]
#> 420/600 [70%]
#> 450/600 [75%]
#> 480/600 [80%]
#> 510/600 [85%]
#> 540/600 [90%]
#> 570/600 [95%]
#> 600/600 [100%]