Retrospective clinical data harmonisation reporting using R and Quarto


Jeremy Selva

@JauntyJJS
https://jeremy-selva.netlify.app

For useR! 2025

useR! 2025 Logo.

9 August 2025

whoami

Research Officer from National Heart Centre Singapore who collects, cleans and harmonises clinical data.

Picture by Allison Horst about a data analyst facing a dataset in the form of a monster.

Taming the Data Beast from “Cleaning Medical Data with R” workshop by Shannon Pileggi, Crystal Lewis and Peter Higgins presented at R/Medicine 2023. Illustrated by Allison Horst.

About Data Harmonisation

Data harmonisation is part of data wrangling process where

  • Similar variables from different datasets are identified.
  • Grouped based on a generalised concept they represent.
  • Transformed into unified harmonised variables for analysis.
The data harmonization process: Study data variables collected from different sources need to be mapped to one another, classified into the generalized concepts they represent, and transformed into unified harmonized variables for analysis

Image from Mallya et al. Circ Cardiovasc Qual Outcomes. 2023 Nov; 16(11):e009938 doi: 10.1161/CIRCOUTCOMES.123.009938.

How it started

Figure showing me welcoming the collaborator to the project and sending them a data dictionary and template file for them to use and send the data back to me.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

How it started

Figure showing the collaborator sharing that he does not have an analyst to do the mapping and warns that if his end were to do it himself, it will take a long time.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

How it started

Figure showing the collaborator saying that they will send the data to me and ask me to do the data harmonisation instead.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

How it started

Figure showing a sad guy receving a package/data full of bugs and issue.

snapshot from Ready for QA | MonkeyUser 2SP Animation Video from MonkeyUser.com.

How it started

A 3 by 3 repeared figure showing a sad guy receving a package/data full of bugs and issue.

snapshot from Ready for QA | MonkeyUser 2SP Animation Video from MonkeyUser.com.

How it started

Turn my sorrow into opportunities.

How it started

Figure showing the collaborator saying that they need the harmonised dataset back with a report for higher management to show to data auditors.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

Additional Motivation

Some data fields just cannot be planned in advanced.

Figure showing different race or ethnicity grouping from different collaborators that it is impossible to plan race or ethnicity grouping.

Issues

While there are R packages to facilitate data harmonisation,

There are limited resources on how to make a data harmonisation report.

Harmonisation Project Template

A template to offer a systematic way to report data harmonisation processes.

[Link: https://jauntyjjs.github.io/harmonisation/]

Workflow with collaborators

Collaborator can send the raw data once and you keep updating the cleaned data for harmonsation.

Picture showing a top workfow to first clean data for harmonisation before doing the actual harmonisation. Bottom workflow to verfiy with the collaborators and update the clean data for harmonisation periodically

Dataset vectors by Vectora Artworks from Vecteezy.

Workflow with collaborators

Figure showing the collaborator saying that they need the updated raw data in addition to the harmonised data.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

Workflow with collaborators

Figure showing the data analyst providing the updated raw data and harmonised data.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

Workflow with collaborators

Collaborator can update the raw data. For example, adding new clinical data, add more patients, correct errors.

Picture showing a top workfow to first clean data for harmonisation before doing the actual harmonisation. Bottom workflow to verfiy with the collaborators and update the raw data for harmonisation periodically

Dataset vectors by Vectora Artworks from Vecteezy.

Workflow with collaborators

New version means new bugs or reopen issues to fix. Is there an automated way to catch warnings/issues when reading these updated files ?

Picture showing that new version of raw data sent by collaborator can give new issues to a supposedly fixed problem

Dataset vectors by Vectora Artworks from Vecteezy. Anticipate from MonkeyUser.com

Automated capture of warnings (csv)

Is there an automated way to catch warnings/issues when reading csv files ?

cohort_data_csv <- vroom::vroom(
  file = here::here("data-raw", "Cohort_csv", 
    "data_to_harmonise_age_issue.csv"),
  delim = ",",
  col_select = 1:2,
  show_col_types = FALSE,
  col_types = list(
    ID = vroom::col_character(), 
    Age = vroom::col_integer()
  )
)

head(cohort_data_csv, n = 3)
# A tibble: 3 × 2
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
  ID      Age
  <chr> <int>
1 B001     32
2 B002     52
3 B003     NA

A csv file with text data in the Age column.

Automated capture of warnings (csv)

If there are issues with the data, the output of vroom::problems will be a tibble.

cohort_data_csv |> 
  vroom::problems()
# A tibble: 4 × 5
    row   col expected   actual  file                                           
  <int> <int> <chr>      <chr>   <chr>                                          
1     2     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/use…
2     4     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/use…
3    10     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/use…
4    17     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/use…

To check for this automatically, we can use pointblank::expect_row_count_match.

cohort_data_csv |> 
  vroom::problems() |> 
  pointblank::expect_row_count_match(count = 0)
Error: Row counts for the two tables did not match.
The `expect_row_count_match()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)

Automated capture of warnings (csv)

Here is a case with no issues.

cohort_data_csv <- vroom::vroom(
  file = here::here("data-raw", "Cohort_csv", 
    "data_to_harmonise_age_issue_fixed.csv"),
  delim = ",",
  col_select = 1:2,
  show_col_types = FALSE,
  col_types = list(
    ID = vroom::col_character(), 
    Age = vroom::col_integer()
  )
)

cohort_data_csv |> 
  vroom::problems()
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
cohort_data_csv |> 
  vroom::problems() |> 
  pointblank::expect_row_count_match(count = 0)

A csv file with only integer data in the Age column.

Automated capture of warnings (Excel)

Is there an automated way to catch warnings/issues when reading Excel files ?

cohort_data_excel <- readxl::read_excel(
  path = here::here("data-raw", "Cohort_Excel", 
    "data_to_harmonise_age_issue.xlsx"),
  sheet = "Sheet1",
  col_types = c(
    "text", "numeric"
    )
  )
Warning: Expecting numeric in B7 / R7C2: got 'missing'
Warning: Expecting numeric in B14 / R14C2: got 'missing'

An Excel file with text data in the Age column.

Automated capture of warnings (Excel)

We can read the Excel file with testthat::expect_no_condition.

testthat::expect_no_condition(
  cohort_data_excel <- readxl::read_excel(
    path = here::here("data-raw", "Cohort_Excel", 
      "data_to_harmonise_age_issue.xlsx"),
    sheet = "Sheet1",
    col_types = c("text", "numeric")
  )
)
Error: Expected `... <- NULL` to run without any conditions.
ℹ Actually got a <simpleWarning> with text:
  Expecting numeric in B7 / R7C2: got 'missing'

An Excel file with text data in the Age column.

Automated capture of warnings (Excel)

However, this method means that you will lose the pipe workflow.

testthat::expect_no_condition(
  cohort_data_excel <- readxl::read_excel(
    path = here::here("data-raw", "Cohort_Excel", 
      "data_to_harmonise_age_issue_fixed.xlsx"),
    sheet = "Sheet1",
    col_types = c("text", "numeric")
  )
)

cohort_data_excel <- cohort_data_excel |> 
  # Check if Serial Number is unique
  pointblank::rows_distinct(
    columns = "Serial Number",
  )

An Excel file with numeric data in the Age column.

Automated capture of warnings (Excel)

We can use the tee pipe operator %T>% from magrittr.

With Issues

cohort_data_excel <- readxl::read_excel(
  path = here::here("data-raw", "Cohort_Excel", 
    "data_to_harmonise_age_issue.xlsx"),
  sheet = "Sheet1",
  col_types = c(
    "text", "numeric"
    )
  ) %T>%
  testthat::expect_no_condition()
Error: Expected `.` to run without any conditions.
ℹ Actually got a <simpleWarning> with text:
  Expecting numeric in B7 / R7C2: got 'missing'

No Issues

cohort_data_excel_2 <- readxl::read_excel(
  path = here::here("data-raw", "Cohort_Excel", 
    "data_to_harmonise_age_issue_fixed.xlsx"),
  sheet = "Sheet1",
  col_types = c("text", "numeric")
) %T>%
testthat::expect_no_condition() |> 
# Check if Serial Number is unique
  pointblank::rows_distinct(
    columns = "Serial Number",
)

Variable Mapping

Variable Mapping Reporting Workflow. Starting with introducing the mapping procedure, writing code to do mapping ans validation, showing that the code works, clean up data for merging.

Variable Mapping

Let take this data set as an example.

cohort_csv_data <- vroom::vroom(
  file = here::here("data-raw",
                    "Cohort_csv",
                    "data_to_harmonise.csv"),
  delim = ",",
  col_select = 1:8,
  show_col_types = FALSE,
  col_types = list(
    ID = vroom::col_character(),
    Age = vroom::col_integer(),
    Sex = vroom::col_character(),
    Height = vroom::col_double(),
    Weight = vroom::col_double(),
    `Smoke History` = vroom::col_character(),
    `Chest Pain Character` = vroom::col_character(),
    Dyspnea = vroom::col_character()
    )
  ) |>  
  dplyr::rename(cohort_unique_id = "ID") |>
  # Remove rows when the ID value is NA
  dplyr::filter(!is.na(.data[["cohort_unique_id"]])) |>
  # Remove white spaces in column names
  dplyr::rename_all(stringr::str_trim) |> 
  # Check if cohort id is unique
  pointblank::rows_distinct(
    columns = "cohort_unique_id",
  )

cohort_csv_data |> 
  vroom::problems() |> 
  pointblank::expect_row_count_match(count = 0)

Variable Mapping

Let the reader know how the collaborator’s data Smoke History is going to be mapped.

Documentation indicating how the collaborator's data Smoke History is going to be mapped or harmonised in Quarto code and output.

Variable Mapping

Code to do mapping.

smoking_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id", 
                  "Smoke History")) |>
  dplyr::mutate(
    smoke_current = dplyr::case_when(
      is.na(.data[["Smoke History"]]) ~ "-1",
      .data[["Smoke History"]] == "non-smoker" ~ "0",
      .data[["Smoke History"]] == "past smoker" ~ "0",
      .data[["Smoke History"]] == "current smoker" ~ "1",
      .default = NA_character_
    ),
    smoke_current = forcats::fct_relevel(
      .data[["smoke_current"]],
      c("0", "1")), 
    smoke_past = dplyr::case_when(
      is.na(.data[["Smoke History"]]) ~ "-1",
      .data[["Smoke History"]] == "non-smoker" ~ "0",
      .data[["Smoke History"]] == "past smoker" ~ "1",
      .data[["Smoke History"]] == "current smoker" ~ "0",
      .default = NA_character_
    ),
    smoke_past = forcats::fct_relevel(
      .data[["smoke_past"]],
      c("0", "1")),
    `Smoke History` = forcats::fct(
      .data[["Smoke History"]]
    )
  )

Documentation of harmonisation of smoking history.

Variable Mapping

Code to do validation.

smoking_data <- smoking_data |>
  pointblank::col_vals_in_set(
    columns = c("smoke_current", "smoke_past"),
    set = c("0", "1", "-1")
  ) |> 
  pointblank::col_vals_expr(
    expr = pointblank::expr(
      (.data[["smoke_current"]] == "1" & .data[["smoke_past"]] == "0") |
      (.data[["smoke_current"]] == "-1" & .data[["smoke_past"]] == -"1") |
      (.data[["smoke_current"]] == "0" & .data[["smoke_past"]] %in% c("0", "1"))
    )
  )

Documentation of harmonisation of smoking history.

Reference: https://github.com/rstudio/pointblank/issues/578

Variable Mapping

Show examples that the code works. Code to print the reactable in html file.

Html Output

Variable Mapping

Show examples that the code works. Code to print the table in pdf file.

Pdf Output

Output code in pdf file.

Output table in pdf file.

Variable Mapping

Clean for merging.

smoking_data <- smoking_data |>
  dplyr::select(-c("Smoke History"))

Merging Harmonised Data

Suppose we have completed harmonising a batch of clinical data.

age_gender_data |> 
  reactable_with_download_csv_button(
    defaultPageSize = 5,
    paginationType = "jump",
    style = list(fontSize = "1rem"),
  )
body_measurement_data |> 
  reactable_with_download_csv_button(
    defaultPageSize = 5,
    paginationType = "jump",
    style = list(fontSize = "1rem"),
  )

How can we merge them without issues of missing rows or additional columns ?

Merging Harmonised Data

unmatched = "error" in dplyr::inner_join helps to avoid patients with no match.

join_specification <- dplyr::join_by("cohort_unique_id")

demo_behave_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id")) |>
  dplyr::inner_join(age_gender_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(body_measurement_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(smoking_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  2,         4730,
  4,         4725
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    unmatched = "error"
 ) 
Error in `dplyr::inner_join()`:
! Each row of `y` must be matched by `x`.
ℹ Row 3 of `y` was not matched.

Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality

Merging Harmonised Data

unmatched = "error" in dplyr::inner_join helps to avoid patients with no match.

join_specification <- dplyr::join_by("cohort_unique_id")

demo_behave_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id")) |>
  dplyr::inner_join(age_gender_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(body_measurement_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(smoking_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  3,         4725
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    unmatched = "error"
 ) 
Error in `dplyr::inner_join()`:
! Each row of `x` must have a match in `y`.
ℹ Row 2 of `x` does not have a match.

Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality

Merging Harmonised Data

relationship = "one-to-one" in dplyr::inner_join helps to avoid patients with multiple match.

join_specification <- dplyr::join_by("cohort_unique_id")

demo_behave_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id")) |>
  dplyr::inner_join(age_gender_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(body_measurement_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(smoking_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  2,         4730,
  2,         4725,
  3,         4000
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    relationship = "one-to-one"
 ) 
Error in `dplyr::inner_join()`:
! Each row in `x` must match at most 1 row in `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.

Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality

Merging Harmonised Data

Use pointblank::has_columns to ensure we only have harmonised variables.

testthat::expect_false(
  pointblank::has_columns(
    demo_behave_data,
    columns = c(
      dplyr::ends_with(".x"), 
      dplyr::ends_with(".y")
    )
  )
)

testthat::expect_equal(
  ncol(demo_behave_data), 9
)

testthat::expect_true(
  pointblank::has_columns(
    demo_behave_data,
    columns = c(
      "age_years", "sex",
      "height_cm", "weight_kg", "bsa_m2", "bmi",
      "smoke_current", "smoke_past"
    )
  )
) 
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~island,
  1,         "Torgersen",
  2,         "Biscoe",
  3,         "Dream"
)

three_penguins <- three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    unmatched = "error",
    relationship = "one-to-one"
 )

three_penguins |> 
  pointblank::has_columns(
    columns = c(
      dplyr::ends_with(".x"), 
      dplyr::ends_with(".y")
      )
  )
[1] TRUE
colnames(three_penguins)
[1] "samp_id"  "species"  "island.x" "island.y"

Harmonisation Report Types

Collaborator wants different ways to report how data harmonisation is done.

Four major documentation type.

The documentation system by Divio

Harmonisation Report Types

Collaborator wants different ways to report how data harmonisation is done.

Four major documentation type in terms of data harmonisation

Tehcncal Report Challenge

One variable mapping report takes at least one page.

On average, a clinical trial will have a few hundred variables.

  • One hundred columns for clinical and demographics.
  • Two hundred columns for medication.

Harmonisation report can have at least a few hundreds pages for each cohort.

There is a need to automate the creation of these reports.

Businessman in pile of documents asking for help.

Businessman in pile of documents asking for help by Amonrat Rungreangfangsai

Quarto Books

To make a Quarto book or website, we need a _quarto.yml and index.qmd file

A preview of the _quarto.yml file.

A preview of the index.qmd file.

Automated Technical Report (Reference)

We create an index.qmd file for technical report generation.

A preview of the index qmd file for technical reporting.

Automated Technical Report (Reference)

We create a _quarto.yml file and relevant Quarto files for each cohort.

A preview of the _quarto.yml and relevant Quarto files for Cohort A.

Automated Technical Report (Reference)

Create a script to generate technical reports in pdf, word and html for each cohort.

# Copy the right index.qmd 
# file

index_qmd_file <- paste0(
  "_index_",
  "report",
  ".qmd"
)

fs::file_copy(
  path = here::here(
    "templates",
    "index-qmd",
    index_qmd_file),
  new_path = here::here(
    "index.qmd"
  ),
  overwrite = TRUE
)

index_report.qmd file copied and changed to index.qmd

Automated Technical Report (Reference)

Create a script to generate technical reports in pdf, word and html for each cohort.

copy_and_render <- function(
    cohort
) {

  # Copy quarto.yml file
  # for each cohort
  
  quarto_yml_file <- paste0(
    "_quarto_",
    cohort,
    ".yml"
  )

  fs::file_copy(
    path = here::here(
      "templates",
      "quarto-yaml",
      quarto_yml_file),
    new_path = here::here("_quarto.yml"),
    overwrite = TRUE
  )
  
  # Render each cohort
  quarto::quarto_render(
    as_job = FALSE
  )
}

cohort_name <- c("Cohort_A", 
                 "Cohort_B")

purrr::walk(
  .x = cohort_name,
  .f = ~copy_and_render(
    cohort = .x
  )
)

_quarto_Cohort_A.yml file copied and changed to _quarto.qmd

Automated Technical Report (Reference)

Output of technical report.

Automated Summary Report (How-to-Guide)

A similar method is done to create a summary report in word using flextable.

Preview of summary report.

Overview Diagrams

How many variables can each cohort provide ?

How many variables can be harmonised ?

demographic_list <- list(
  A = c("Age", "Sex",
        "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
        "Smoke Current", "Smoke Past",
        "Have Chest Pain", "Chest Pain Character",
        "Dyspnea",
        "BMI", "Height", "Weight"),
  B = c("Age", "Sex",
        "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
        "Smoke Current", "Smoke Past",
        "Have Chest Pain", "Chest Pain Character",
        "Dyspnea",
        "HDL", "Total Cholesterol", 
        "Triglyceride", "LDL"),
  C = c("Age", "Sex",
        "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
        "Smoke Current", "Smoke Past",
        "Have Chest Pain", "Chest Pain Character",
        "Dyspnea",
        "BMI", "Height", "Weight",
        "HDL", "Total Cholesterol", 
        "Triglyceride", "LDL")
)  

cohort_a_label_name <- "Cohort A\n(n=1000)"
cohort_b_label_name <- "Cohort B\n(n=2000)"
cohort_c_label_name <- "Cohort C\n(n=1500)"

demographic_venn_data <- demographic_list |> 
  RVenn::Venn() |> 
  ggVennDiagram::process_data()

demographic_venn_regionedge_data <- demographic_venn_data |> 
  ggVennDiagram::venn_regionedge() |> 
  dplyr::mutate(
    name = stringr::str_replace_all(
      .data[["name"]],
      "/",
      " and "
    ),
    name = stringr::str_wrap(.data[["name"]], width = 10),
    name = forcats::fct_reorder(.data[["name"]],
                                nchar(.data[["id"]]))
  ) |> 
  dplyr::rename(`Cohort` = "name")

demographic_venn_label_data <- demographic_venn_data |>
  ggVennDiagram::venn_setlabel() |> 
  dplyr::mutate(
    name = dplyr::case_match(
      .data[["name"]],
      "A" ~ cohort_a_label_name,
      "B" ~ cohort_b_label_name,
      "C" ~ cohort_c_label_name
    )
  ) |> 
  dplyr::rename(`Cohort` = "name")

demographic_venn_regionlabel_data <- demographic_venn_data |> 
  ggVennDiagram::venn_regionlabel() |> 
  dplyr::mutate(
    name = stringr::str_replace_all(
      .data[["name"]],
      "/",
      " and "
    ),
    name = stringr::str_wrap(.data[["name"]], width = 10),
    name = forcats::fct_reorder(.data[["name"]],
                                nchar(.data[["id"]]))
  ) |> 
  dplyr::rename(`Cohort` = "name")

demographic_venn_edge_data <- demographic_venn_data |> 
  ggVennDiagram::venn_setedge()

demographic_venn_diagram <- ggplot2::ggplot() +
  # 1. region count layer
  ggplot2::geom_polygon(
    data = demographic_venn_regionedge_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      fill = .data[["Cohort"]],
      group = .data[["id"]])
  ) +
  # 2. set edge layer
  ggplot2::geom_path(
    data = demographic_venn_edge_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      colour = "black",
      group = .data[["id"]]
    ),    
    show.legend = FALSE
  ) +
  # 3. set label layer
  ggplot2::geom_text(
    data = demographic_venn_label_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      label = .data[["Cohort"]]),
    size = 5.5
  ) +
  # 4. region label layer
  ggplot2::geom_label(
    data = demographic_venn_regionlabel_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      label = .data[["count"]]),
    size = 7
  ) +
  ggplot2::scale_x_continuous(
    expand = ggplot2::expansion(mult = 0.2)
  ) +
  ggplot2::theme_void() +
  ggplot2::theme(
    text = element_text(size = 20)
  )

An venn diagram showing common variables within six cohorts

Venn diagram does not work for many (> 10) cohorts.

Overview Diagrams

Upset plots are too complicated for clinicians.

demographic_venn <- tibble::tibble(
  column_name = c("Age", "Sex",
                  "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
                  "Smoke Current", "Smoke Past",
                  "Have Chest Pain", "Chest Pain Character",
                  "Dyspnea",
                  "BMI", "Height", "Weight",
                  "HDL", "Total Cholesterol", 
                  "Triglyceride", "LDL"),
  `Cohort A` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 1,
                 1, 1, 1,
                 0, 0, 
                 0, 0),
  `Cohort B` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 1,
                 0, 0, 0,
                 1, 1, 
                 1, 1),
  `Cohort C` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 1,
                 1, 1, 1,
                 1, 1, 
                 1, 1),
  `Cohort D` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 0,
                 1, 0, 0,
                 1, 1, 
                 1, 1),
  `Cohort E` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 0,
                 1, 1, 1,
                 1, 1, 
                 0, 0),
  `Cohort F` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 0,
                 1, 1, 1,
                 0, 0, 
                 0, 0),
)

cohort_a_upset_col_name <- "Cohort A\n(n=1000)"
cohort_b_upset_col_name <- "Cohort B\n(n=2000)"
cohort_c_upset_col_name <- "Cohort C\n(n=1500)"
cohort_d_upset_col_name <- "Cohort D\n(n=500)"
cohort_e_upset_col_name <- "Cohort E\n(n=1000)"
cohort_f_upset_col_name <- "Cohort F\n(n=2500)"


demographic_upset_data <- demographic_venn |>
  dplyr::rename(
    !!cohort_a_upset_col_name := "Cohort A",
    !!cohort_b_upset_col_name := "Cohort B",
    !!cohort_c_upset_col_name := "Cohort C",
    !!cohort_d_upset_col_name := "Cohort D",
    !!cohort_e_upset_col_name := "Cohort E",
    !!cohort_f_upset_col_name := "Cohort F"
  )

upset_plot <- ComplexUpset::upset(
  demographic_upset_data,
  c(cohort_a_upset_col_name, 
    cohort_b_upset_col_name, 
    cohort_c_upset_col_name,
    cohort_d_upset_col_name,
    cohort_e_upset_col_name,
    cohort_f_upset_col_name),
  base_annotations = list(
    `Intersection size` = 
      ComplexUpset::intersection_size(
        text = list(size = 9, nudge_y = 0.5),
        text_colors = c(on_background='black', 
                        on_bar='white')
      ) + 
      ggplot2::annotate(
        size = 5.5,
        geom = 'text', 
        x = Inf, 
        y = Inf,
        label = paste('Total:', nrow(demographic_venn)),
        vjust = 1, 
        hjust = 1
      ) + 
      ggplot2::theme(
        axis.title.y = ggplot2::element_text(angle = 0, size = 20),
        axis.title.x = ggplot2::element_text(angle = 0, size = 20), 
        text = element_text(size = 20)
      ) +
      ggplot2::labs(y = "",
                    title = "Intersection size")
      ),
  set_sizes = ( 
    ComplexUpset::upset_set_size() +
    ggplot2::geom_text(
      size = 5.5,
      mapping = ggplot2::aes(label= ggplot2::after_stat(.data[["count"]])), 
      hjust = 1.5, 
      stat = 'count') +
    ggplot2::expand_limits(y = 30) + 
    ggplot2::theme(
        axis.title.y = ggplot2::element_text(angle = 0, size = 15),
        axis.title.x = ggplot2::element_text(angle = 0, size = 15), 
        text = element_text(size = 15)
    ) +
    ggplot2::labs(y = "Number of variables")
  ),
  sort_intersections_by = "degree",
  sort_sets = FALSE,
  name = "", 
  width_ratio = 0.25,
  themes = ComplexUpset::upset_default_themes(text = element_text(size = 15))
  )

An upset plot showing common variables within six cohorts

Cannot answer follow-up questions:

How many cohorts provide patient’s blood lipid information and how many patients have them ?

Overview Diagrams

Create a “heatmap”using Microsoft PowerPoint.

A modified heatmap to show command variables.

Summary

The data harmonization process: Study data variables collected from different sources need to be mapped to one another, classified into the generalized concepts they represent, and transformed into unified harmonized variables for analysis

Variable Mapping Reporting Workflow. Starting with introducing the mapping procedure, writing code to do mapping ans validation, showing that the code works, clean up data for merging.

Four major documentation type in terms of data harmonisation

A modified heatmap to show command variables.

Thank you

Happy 60th Birthday Singapore

Gif image of the 60th National Day Parade in Singapore themed Majulah Singapura.

Singapore National Day Parade (NDP) 2025 Logo from https://www.ndp.gov.sg