Filtering cohorts

library(CohortConstructor)
library(CohortCharacteristics)
library(ggplot2)

For this example we’ll use the Eunomia synthetic data from the CDMConnector package.

con <- DBI::dbConnect(duckdb::duckdb(), dbdir = CDMConnector::eunomiaDir())
cdm <- CDMConnector::cdmFromCon(con, cdmSchema = "main", 
                    writeSchema = "main", writePrefix = "my_study_")

Let’s start by creating two drug cohorts, one for users of diclofenac and another for users of acetaminophen.

cdm$medications <- conceptCohort(cdm = cdm, 
                                 conceptSet = list("diclofenac" = 1124300,
                                                   "acetaminophen" = 1127433), 
                                 name = "medications")
cohortCount(cdm$medications)
#> # A tibble: 2 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1           9365            2580
#> 2                    2            830             830

We can take a sample from a cohort table using the function sampleCohorts(). Its argument n sets the number of individuals to keep in each cohort.

cdm$medications |> sampleCohorts(cohortId = NULL, n = 100)
#> # Source:   table<my_study_medications> [?? x 4]
#> # Database: DuckDB v1.0.0 [eburn@Windows 10 x64:R 4.2.1/C:\Users\eburn\AppData\Local\Temp\RtmpUtkMzn\file75f020f15898.duckdb]
#>    cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                   <int>      <int> <date>            <date>         
#>  1                    1       3227 1984-05-24        1984-06-07     
#>  2                    1        703 1992-06-19        1992-07-03     
#>  3                    1       1666 1991-09-03        1991-09-24     
#>  4                    1        246 2018-03-15        2018-04-05     
#>  5                    1        316 2007-10-25        2007-11-24     
#>  6                    2       4999 1991-06-28        1991-06-28     
#>  7                    2       1966 2000-04-05        2000-04-05     
#>  8                    2       5333 1986-08-13        1986-08-13     
#>  9                    1       5222 2015-07-21        2015-08-04     
#> 10                    2       3943 1990-07-24        1990-07-24     
#> # ℹ more rows

cohortCount(cdm$medications)
#> # A tibble: 2 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1            354             100
#> 2                    2            100             100

When cohortId = NULL all cohorts in the table are used. Note that this function samples individuals, not records: every record of a sampled individual is kept, so a cohort can still hold more than n records (cohort 1 above has 354 records for its 100 individuals).
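To make the difference between sampling individuals and sampling records concrete, here is a minimal base R sketch on a toy data frame. The helper sample_subjects() is hypothetical and written only for illustration; it is not part of CohortConstructor and makes no claim about how sampleCohorts() is implemented.

```r
# Toy cohort table: in cohort 1 some people have several records,
# in cohort 2 everyone has exactly one.
toy <- data.frame(
  cohort_definition_id = c(1, 1, 1, 1, 1, 2, 2, 2),
  subject_id           = c(10, 10, 11, 12, 12, 10, 13, 14)
)

# Hypothetical helper (not a CohortConstructor function): keep up to `n`
# randomly chosen subjects per cohort, together with all of their records.
sample_subjects <- function(cohort, n) {
  pieces <- lapply(split(cohort, cohort$cohort_definition_id), function(x) {
    ids <- unique(x$subject_id)
    # Sample positions rather than values, so a single id is handled safely.
    keep <- ids[sample.int(length(ids), min(n, length(ids)))]
    x[x$subject_id %in% keep, , drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

set.seed(1)
sampled <- sample_subjects(toy, n = 2)

# Count distinct subjects and total records in each cohort after sampling.
subjects_per_cohort <- tapply(sampled$subject_id, sampled$cohort_definition_id,
                              function(s) length(unique(s)))
records_per_cohort <- table(sampled$cohort_definition_id)
```

Each cohort ends up with two subjects, but cohort 1 keeps at least three records because repeated records of the sampled people are retained.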

It is also possible to sample only one cohort within a cohort table; the other cohorts are left unchanged. The counts below start again from the full medications table.

cdm$medications <- cdm$medications |> sampleCohorts(cohortId = 2, n = 100)

cohortCount(cdm$medications)
#> # A tibble: 2 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1           9365            2580
#> 2                    2            100             100

The chosen cohort (cohort 2, users of diclofenac) has been reduced to 100 individuals, as specified in the function call; all individuals from cohort 1 (users of acetaminophen) and their records remain.

If you want the cohort table to include only the individuals and records of a specified cohort, you can use the function subsetCohorts(). Applied to the full, unsampled table:

cdm$medications <- cdm$medications |> subsetCohorts(cohortId = 2)
cohortCount(cdm$medications)
#> # A tibble: 1 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    2            830             830

The cohort table has been filtered so it now only includes individuals and records from cohort 2. To take a sample of the filtered cohort table, use sampleCohorts() as before.

cdm$medications <- cdm$medications |> sampleCohorts(cohortId = 2, n = 100)

cohortCount(cdm$medications)
#> # A tibble: 1 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    2            100             100
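
The contrast between sampling one cohort and subsetting to one cohort can be pictured on a plain data frame. The sketch below uses two hypothetical helpers, sample_one_cohort() and subset_one_cohort(), which only mimic the behaviour described above in base R; they are assumptions for illustration, not CohortConstructor functions.

```r
# Toy cohort table with three people in each of two cohorts.
toy <- data.frame(
  cohort_definition_id = c(1, 1, 1, 2, 2, 2),
  subject_id           = c(10, 11, 12, 20, 21, 22)
)

# Hypothetical stand-in for sampling a single cohort:
# rows belonging to other cohorts pass through untouched.
sample_one_cohort <- function(cohort, cohort_id, n) {
  target <- cohort$cohort_definition_id == cohort_id
  ids <- unique(cohort$subject_id[target])
  keep <- ids[sample.int(length(ids), min(n, length(ids)))]
  cohort[(!target) | (cohort$subject_id %in% keep), , drop = FALSE]
}

# Hypothetical stand-in for subsetting: only the requested cohort is kept.
subset_one_cohort <- function(cohort, cohort_id) {
  cohort[cohort$cohort_definition_id %in% cohort_id, , drop = FALSE]
}

set.seed(1)
after_sample <- sample_one_cohort(toy, cohort_id = 2, n = 1)
after_subset <- subset_one_cohort(toy, cohort_id = 2)

table(after_sample$cohort_definition_id)  # cohort 1 keeps 3 rows, cohort 2 has 1
table(after_subset$cohort_definition_id)  # only cohort 2 remains, with 3 rows
```

Sampling changes the size of the chosen cohort only, whereas subsetting changes which cohorts are present; chaining the two, as in the example above, gives a table with a single, sampled cohort.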