Retrospective clinical data harmonisation reporting

12th June 2025

whoami

Research Officer from National Heart Centre Singapore who collects, cleans and harmonises clinical data.

Picture by Allison Horst about a data analyst facing a dataset in the form of a monster.

Taming the Data Beast from “Cleaning Medical Data with R” workshop by Shannon Pileggi, Crystal Lewis and Peter Higgins presented at R/Medicine 2023. Illustrated by Allison Horst.

Outline

Data harmonisation overview strategy

Counter small but annoying issues during retrospective data harmonisation.

Suggested reports and diagram to create for different clients.

About Data Harmonisation

The data harmonization process: Study data variables collected from different sources need to be mapped to one another, classified into the generalized concepts they represent, and transformed into unified harmonized variables for analysis

Image from Mallya et al. Circ Cardiovasc Qual Outcomes. 2023 Nov; 16(11):e009938 doi: 10.1161/CIRCOUTCOMES.123.009938.

Tasks

Here is a summary of my task for an international study.

Data harmonization task: Liaise with collaborator on what variables we need for the study and collect them by providing a data dictionary and input template file, harmonise data provided and return a harmonised dataset with a report of how data is harmonised, and provide analyst team combined data for further analysis.

Why Retropective Data Harmonisation ?

Collaborators may not have resources (time/funding) to match their data to the data dictionary and input template provided.

Data provided may have issues that needs relevant experience and may be hard to solve.

Why Retropective Data Harmonisation ?

Collaborator wants transparency on how the data is processed.

High risk of getting blame should things don’t go well.

Businessman kneel on floor with fingers pointing to him.

Businessman kneel on floor with pointing finger to him by Amonrat Rungreangfangsai

Issues

Limited resources on how to do data harmonisation and to make a report.

R packages related to data harmonisation

Issues

retroharmonize works with categorical but not continuous data variables.

Rmonize documentation has too much jargon and difficult for me (and people with limited data science background) to understand.

psHarmonize variable harmonisation process is reported in a Microsoft Excel file. Works for variables with simple harmonisation process.

Harmonisation report from psHarmonise in an excel file.

Image from Stephen et al. Patterns (N Y). 2024 Jun; 5(8):101003 doi: 10.1016/j.patter.2024.101003.

Harmonisation Project Template

https://jauntyjjs.github.io/harmonisation/

Automated capture of warnings (csv)

When reading files in batches or the same file but in a different version, is there an automated way to catch warnings/issues when reading csv files ?

cohort_data_csv <- vroom::vroom(
  file = here::here("data-raw", "Cohort_csv", 
    "data_to_harmonise_age_issue.csv"),
  delim = ",",
  col_select = 1:2,
  show_col_types = FALSE,
  col_types = list(
    ID = vroom::col_character(), 
    Age = vroom::col_integer()
  )
)

head(cohort_data_csv, n = 3)
# A tibble: 3 × 2
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
  ID      Age
  <chr> <int>
1 B001     32
2 B002     52
3 B003     NA

A csv file with text data in the Age column.

Automated capture of warnings (csv)

If there are issues with the data, the output of vroom::problems will be a tibble.

cohort_data_csv |> 
  vroom::problems()
# A tibble: 4 × 5
    row   col expected   actual  file                                           
  <int> <int> <chr>      <chr>   <chr>                                          
1     2     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/RMe…
2     4     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/RMe…
3    10     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/RMe…
4    17     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/RMe…

To check for this automatically, we can use pointblank::expect_row_count_match.

cohort_data_csv |> 
  vroom::problems() |> 
  pointblank::expect_row_count_match(count = 0)
Error: Row counts for the two tables did not match.
The `expect_row_count_match()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)

Automated capture of warnings (csv)

Here is a case with no issues.

cohort_data_csv <- vroom::vroom(
  file = here::here("data-raw", "Cohort_csv", 
    "data_to_harmonise_age_issue_fixed.csv"),
  delim = ",",
  col_select = 1:2,
  show_col_types = FALSE,
  col_types = list(
    ID = vroom::col_character(), 
    Age = vroom::col_integer()
  )
)

cohort_data_csv |> 
  vroom::problems()
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
cohort_data_csv |> 
  vroom::problems() |> 
  pointblank::expect_row_count_match(count = 0)

A csv file with only integer data in the Age column.

Automated capture of warnings (Excel)

Is there an automated way to catch warnings/issues when reading Excel files ?

cohort_data_excel <- readxl::read_excel(
  path = here::here("data-raw", "Cohort_Excel", 
    "data_to_harmonise_age_issue.xlsx"),
  sheet = "Sheet1",
  col_types = c(
    "text", "numeric"
    )
  )
Warning: Expecting numeric in B7 / R7C2: got 'missing'
Warning: Expecting numeric in B14 / R14C2: got 'missing'

An Excel file with text data in the Age column.

Automated capture of warnings (Excel)

We can read the Excel file with testthat::expect_no_condition.

testthat::expect_no_condition(
  cohort_data_excel <- readxl::read_excel(
    path = here::here("data-raw", "Cohort_Excel", 
      "data_to_harmonise_age_issue.xlsx"),
    sheet = "Sheet1",
    col_types = c("text", "numeric")
  )
)
Error: Expected `... <- NULL` to run without any conditions.
ℹ Actually got a <simpleWarning> with text:
  Expecting numeric in B7 / R7C2: got 'missing'

An Excel file with text data in the Age column.

Automated capture of warnings (Excel)

However, this method means that you will lose the pipe workflow.

testthat::expect_no_condition(
  cohort_data_excel <- readxl::read_excel(
    path = here::here("data-raw", "Cohort_Excel", 
      "data_to_harmonise_age_issue_fixed.xlsx"),
    sheet = "Sheet1",
    col_types = c("text", "numeric")
  )
)

cohort_data_excel <- cohort_data_excel |> 
  # Check if Serial Number is unique
  pointblank::rows_distinct(
    columns = "Serial Number",
  )

An Excel file with numeric data in the Age column.

Automated capture of warnings (Excel)

We can use the tee pipe operator %T>%.

With Issues

cohort_data_excel <- readxl::read_excel(
  path = here::here("data-raw", "Cohort_Excel", 
    "data_to_harmonise_age_issue.xlsx"),
  sheet = "Sheet1",
  col_types = c(
    "text", "numeric"
    )
  ) %T>%
  testthat::expect_no_condition()
Error: Expected `.` to run without any conditions.
ℹ Actually got a <simpleWarning> with text:
  Expecting numeric in B7 / R7C2: got 'missing'

No Issues

cohort_data_excel_2 <- readxl::read_excel(
  path = here::here("data-raw", "Cohort_Excel", 
    "data_to_harmonise_age_issue_fixed.xlsx"),
  sheet = "Sheet1",
  col_types = c("text", "numeric")
) %T>%
testthat::expect_no_condition() |> 
# Check if Serial Number is unique
  pointblank::rows_distinct(
    columns = "Serial Number",
)

Variable Mapping

Variable Mapping Reporting Workflow. Starting with introducing the mapping procedure, writing code to do mapping ans validation, showing that the code works, clean up data for merging.

Variable Mapping

Let take this data set as an example.

cohort_csv_data <- vroom::vroom(
  file = here::here("data-raw",
                    "Cohort_csv",
                    "data_to_harmonise.csv"),
  delim = ",",
  col_select = 1:8,
  show_col_types = FALSE,
  col_types = list(
    ID = vroom::col_character(),
    Age = vroom::col_integer(),
    Sex = vroom::col_character(),
    Height = vroom::col_double(),
    Weight = vroom::col_double(),
    `Smoke History` = vroom::col_character(),
    `Chest Pain Character` = vroom::col_character(),
    Dyspnea = vroom::col_character()
    )
  ) |>  
  dplyr::rename(cohort_unique_id = "ID") |>
  # Remove rows when the ID value is NA
  dplyr::filter(!is.na(.data[["cohort_unique_id"]])) |>
  # Remove white spaces in column names
  dplyr::rename_all(stringr::str_trim) |> 
  # Check if cohort id is unique
  pointblank::rows_distinct(
    columns = "cohort_unique_id",
  )

cohort_csv_data |> 
  vroom::problems() |> 
  pointblank::expect_row_count_match(count = 0)

Variable Mapping

Let the reader know how the collaborator’s data Smoke History is going to be mapped.

Documentation indicating how the collaborator's data Smoke History is going to be mapped or harmonised in Quarto code and output.

Variable Mapping

Code to do mapping.

smoking_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id", 
                  "Smoke History")) |>
  dplyr::mutate(
    smoke_current = dplyr::case_when(
      is.na(.data[["Smoke History"]]) ~ "-1",
      .data[["Smoke History"]] == "non-smoker" ~ "0",
      .data[["Smoke History"]] == "past smoker" ~ "0",
      .data[["Smoke History"]] == "current smoker" ~ "1",
      .default = NA_character_
    ),
    smoke_current = forcats::fct_relevel(
      .data[["smoke_current"]],
      c("0", "1")), 
    smoke_past = dplyr::case_when(
      is.na(.data[["Smoke History"]]) ~ "-1",
      .data[["Smoke History"]] == "non-smoker" ~ "0",
      .data[["Smoke History"]] == "past smoker" ~ "1",
      .data[["Smoke History"]] == "current smoker" ~ "0",
      .default = NA_character_
    ),
    smoke_past = forcats::fct_relevel(
      .data[["smoke_past"]],
      c("0", "1")),
    `Smoke History` = forcats::fct(
      .data[["Smoke History"]]
    )
  )

Documentation of harmonisation of smoking history.

Variable Mapping

Code to do validation.

smoking_data <- smoking_data |>
  pointblank::col_vals_in_set(
    columns = c("smoke_current", "smoke_past"),
    set = c("0", "1", "-1")
  ) |> 
  pointblank::col_vals_expr(
    expr = pointblank::expr(
      (.data[["smoke_current"]] == "1" & .data[["smoke_past"]] == "0") |
      (.data[["smoke_current"]] == "-1" & .data[["smoke_past"]] == -"1") |
      (.data[["smoke_current"]] == "0" & .data[["smoke_past"]] %in% c("0", "1"))
    )
  )

Documentation of harmonisation of smoking history.

Reference: https://github.com/rstudio/pointblank/issues/578

Variable Mapping

Show examples that the code works. Code to print the reactable in html file.

Html Output

Variable Mapping

Show examples that the code works. Code to print the table in pdf file.

Pdf Output

Output code in pdf file.

Output table in pdf file.

Variable Mapping

Clean for merging.

smoking_data <- smoking_data |>
  dplyr::select(-c("Smoke History"))

Merging Harmonised Data

Suppose we have completed harmonising a batch of clinical data.

age_gender_data |> 
  reactable_with_download_csv_button(
    defaultPageSize = 5,
    paginationType = "jump",
    style = list(fontSize = "1rem"),
  )
body_measurement_data |> 
  reactable_with_download_csv_button(
    defaultPageSize = 5,
    paginationType = "jump",
    style = list(fontSize = "1rem"),
  )

How can we merge them without issues of missing rows or additional columns ?

Merging Harmonised Data

unmatched = "error" in dplyr::inner_join helps to avoid patients with no match.

join_specification <- dplyr::join_by("cohort_unique_id")

demo_behave_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id")) |>
  dplyr::inner_join(age_gender_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(body_measurement_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(smoking_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  2,         4730,
  4,         4725
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    unmatched = "error"
 ) 
Error in `dplyr::inner_join()`:
! Each row of `y` must be matched by `x`.
ℹ Row 3 of `y` was not matched.

Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality

Merging Harmonised Data

unmatched = "error" in dplyr::inner_join helps to avoid patients with no match.

join_specification <- dplyr::join_by("cohort_unique_id")

demo_behave_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id")) |>
  dplyr::inner_join(age_gender_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(body_measurement_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(smoking_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  3,         4725
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    unmatched = "error"
 ) 
Error in `dplyr::inner_join()`:
! Each row of `x` must have a match in `y`.
ℹ Row 2 of `x` does not have a match.

Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality

Merging Harmonised Data

relationship = "one-to-one" in dplyr::inner_join helps to avoid patients with multiple match.

join_specification <- dplyr::join_by("cohort_unique_id")

demo_behave_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id")) |>
  dplyr::inner_join(age_gender_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(body_measurement_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(smoking_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  2,         4730,
  2,         4725,
  3,         4000
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    relationship = "one-to-one"
 ) 
Error in `dplyr::inner_join()`:
! Each row in `x` must match at most 1 row in `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.

Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality

Merging Harmonised Data

Use pointblank::has_columns to ensure we only have harmonised variables.

testthat::expect_false(
  pointblank::has_columns(
    demo_behave_data,
    columns = c(
      dplyr::ends_with(".x"), 
      dplyr::ends_with(".y")
    )
  )
)

testthat::expect_equal(
  ncol(demo_behave_data), 9
)

testthat::expect_true(
  pointblank::has_columns(
    demo_behave_data,
    columns = c(
      "age_years", "sex",
      "height_cm", "weight_kg", "bsa_m2", "bmi",
      "smoke_current", "smoke_past"
    )
  )
) 
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~island,
  1,         "Torgersen",
  2,         "Biscoe",
  3,         "Dream"
)

three_penguins <- three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    unmatched = "error",
    relationship = "one-to-one"
 )

three_penguins |> 
  pointblank::has_columns(
    columns = c(
      dplyr::ends_with(".x"), 
      dplyr::ends_with(".y")
      )
  )
[1] TRUE
colnames(three_penguins)
[1] "samp_id"  "species"  "island.x" "island.y"

Automated Report Challenge

One variable mapping report takes at least one page.

On average, a clinical trial will have a few hundred variables.

  • One hundred columns for clinical and demographics.
  • Two hundred columns for medication.

Harmonisation report can have at least a few hundreds pages for each cohort.

There is a need to automate the creation of these reports.

Businessman in pile of documents asking for help.

Businessman in pile of documents asking for help by Amonrat Rungreangfangsai

Automated Report Challenge

Collaborator wants different ways to report how data harmonisation is done.

Four major documentation type.

The documentation system by Divio

Quarto Books

To make a Quarto book or website, we need a _quarto.yml and index.qmd file

A preview of the _quarto.yml file.

A preview of the index.qmd file.

Automated Technical Report (Reference)

We create an index.qmd file for technical report generation.

A preview of the index.qmd file for technical reporting.

Automated Technical Report (Reference)

We create a _quarto.yml file and relevant Quarto files for each cohort.

A preview of the _quarto.yml and relevant Quarto files for Cohort A.

Automated Technical Report (Reference)

Create an automated script to generate a technical report in pdf, word and html for each cohort

# Copy the right index.qmd 
# file

index_qmd_file <- paste0(
  "_index_",
  "report",
  ".qmd"
)

fs::file_copy(
  path = here::here(
    "templates",
    "index-qmd",
    index_qmd_file),
  new_path = here::here(
    "index.qmd"
  ),
  overwrite = TRUE
)

index_report.qmd file copied and changed to index.qmd

Automated Technical Report (Reference)

Create an automated script to generate a technical report in pdf, word and html for each cohort

copy_and_render <- function(
    cohort
) {

  # Copy quarto.yml file
  # for each cohort
  
  quarto_yml_file <- paste0(
    "_quarto_",
    cohort,
    ".yml"
  )

  fs::file_copy(
    path = here::here(
      "templates",
      "quarto-yaml",
      quarto_yml_file),
    new_path = here::here("_quarto.yml"),
    overwrite = TRUE
  )
  
  # Render each cohort
  quarto::quarto_render(
    as_job = FALSE
  )
}

cohort_name <- c("Cohort_A", 
                 "Cohort_B")

purrr::walk(
  .x = cohort_name,
  .f = ~copy_and_render(
    cohort = .x
  )
)

_quarto_Cohort_A.yml file copied and changed to _quarto.qmd

Automated Technical Report (Reference)

Output of technical report.

Automated Summary Report (How-to-Guide)

A similar method is done to create a summary report in word using flextable.

Preview of summary report.

Data Variable Justification Report (Explanation)

Initially we have smoke_history with values 0 (Non-smoker), 1 (Current smoker), 2 (Past smoker) and -1 (Unknown).

However, in the early stage of the study, some collaborators could only provide if the patient has a smoking history but could not specify if the patient is a current or past smoker.

Column smoke_history with values 0 for non smoker, 1 for past-smoker, 2 for current smoker, -1 for unknown.

Data Variable Justification Report (Explanation)

To deal with this case, we have smoke_history, smoke_current and smoke_past to hold values 1 (Yes), 0 (No) and -1 (Unknown).

Smoking history issue fixed by having smoke_history, smoke_current and smoke_past to hold values 1 for Yes, 0 for No and -1 for Unknown.

Overview Diagrams

How many variables can each cohort provide ?

How many variables can be harmonised ?

demographic_list <- list(
  A = c("Age", "Sex",
        "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
        "Smoke Current", "Smoke Past",
        "Have Chest Pain", "Chest Pain Character",
        "Dyspnea",
        "BMI", "Height", "Weight"),
  B = c("Age", "Sex",
        "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
        "Smoke Current", "Smoke Past",
        "Have Chest Pain", "Chest Pain Character",
        "Dyspnea",
        "HDL", "Total Cholesterol", 
        "Triglyceride", "LDL"),
  C = c("Age", "Sex",
        "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
        "Smoke Current", "Smoke Past",
        "Have Chest Pain", "Chest Pain Character",
        "Dyspnea",
        "BMI", "Height", "Weight",
        "HDL", "Total Cholesterol", 
        "Triglyceride", "LDL")
)  

cohort_a_label_name <- "Cohort A\n(n=1000)"
cohort_b_label_name <- "Cohort B\n(n=2000)"
cohort_c_label_name <- "Cohort C\n(n=1500)"

demographic_venn_data <- demographic_list |> 
  RVenn::Venn() |> 
  ggVennDiagram::process_data()

demographic_venn_regionedge_data <- demographic_venn_data |> 
  ggVennDiagram::venn_regionedge() |> 
  dplyr::mutate(
    name = stringr::str_replace_all(
      .data[["name"]],
      "/",
      " and "
    ),
    name = stringr::str_wrap(.data[["name"]], width = 10),
    name = forcats::fct_reorder(.data[["name"]],
                                nchar(.data[["id"]]))
  ) |> 
  dplyr::rename(`Cohort` = "name")

demographic_venn_label_data <- demographic_venn_data |>
  ggVennDiagram::venn_setlabel() |> 
  dplyr::mutate(
    name = dplyr::case_match(
      .data[["name"]],
      "A" ~ cohort_a_label_name,
      "B" ~ cohort_b_label_name,
      "C" ~ cohort_c_label_name
    )
  ) |> 
  dplyr::rename(`Cohort` = "name")

demographic_venn_regionlabel_data <- demographic_venn_data |> 
  ggVennDiagram::venn_regionlabel() |> 
  dplyr::mutate(
    name = stringr::str_replace_all(
      .data[["name"]],
      "/",
      " and "
    ),
    name = stringr::str_wrap(.data[["name"]], width = 10),
    name = forcats::fct_reorder(.data[["name"]],
                                nchar(.data[["id"]]))
  ) |> 
  dplyr::rename(`Cohort` = "name")

demographic_venn_edge_data <- demographic_venn_data |> 
  ggVennDiagram::venn_setedge()

demographic_venn_diagram <- ggplot2::ggplot() +
  # 1. region count layer
  ggplot2::geom_polygon(
    data = demographic_venn_regionedge_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      fill = .data[["Cohort"]],
      group = .data[["id"]])
  ) +
  # 2. set edge layer
  ggplot2::geom_path(
    data = demographic_venn_edge_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      colour = "black",
      group = .data[["id"]]
    ),    
    show.legend = FALSE
  ) +
  # 3. set label layer
  ggplot2::geom_text(
    data = demographic_venn_label_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      label = .data[["Cohort"]]),
    size = 5.5
  ) +
  # 4. region label layer
  ggplot2::geom_label(
    data = demographic_venn_regionlabel_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      label = .data[["count"]]),
    size = 7
  ) +
  ggplot2::scale_x_continuous(
    expand = ggplot2::expansion(mult = 0.2)
  ) +
  ggplot2::theme_void() +
  ggplot2::theme(
    text = element_text(size = 20)
  )

An venn diagram showing common variables within six cohorts

But it does not work for many (> 10) cohorts.

Overview Diagrams

Upset plots are too complicated for clinicians.

demographic_venn <- tibble::tibble(
  column_name = c("Age", "Sex",
                  "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
                  "Smoke Current", "Smoke Past",
                  "Have Chest Pain", "Chest Pain Character",
                  "Dyspnea",
                  "BMI", "Height", "Weight",
                  "HDL", "Total Cholesterol", 
                  "Triglyceride", "LDL"),
  `Cohort A` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 1,
                 1, 1, 1,
                 0, 0, 
                 0, 0),
  `Cohort B` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 1,
                 0, 0, 0,
                 1, 1, 
                 1, 1),
  `Cohort C` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 1,
                 1, 1, 1,
                 1, 1, 
                 1, 1),
  `Cohort D` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 0,
                 1, 0, 0,
                 1, 1, 
                 1, 1),
  `Cohort E` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 0,
                 1, 1, 1,
                 1, 1, 
                 0, 0),
  `Cohort F` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 0,
                 1, 1, 1,
                 0, 0, 
                 0, 0),
)

cohort_a_upset_col_name <- "Cohort A\n(n=1000)"
cohort_b_upset_col_name <- "Cohort B\n(n=2000)"
cohort_c_upset_col_name <- "Cohort C\n(n=1500)"
cohort_d_upset_col_name <- "Cohort D\n(n=500)"
cohort_e_upset_col_name <- "Cohort E\n(n=1000)"
cohort_f_upset_col_name <- "Cohort F\n(n=2500)"


demographic_upset_data <- demographic_venn |>
  dplyr::rename(
    !!cohort_a_upset_col_name := "Cohort A",
    !!cohort_b_upset_col_name := "Cohort B",
    !!cohort_c_upset_col_name := "Cohort C",
    !!cohort_d_upset_col_name := "Cohort D",
    !!cohort_e_upset_col_name := "Cohort E",
    !!cohort_f_upset_col_name := "Cohort F"
  )

upset_plot <- ComplexUpset::upset(
  demographic_upset_data,
  c(cohort_a_upset_col_name, 
    cohort_b_upset_col_name, 
    cohort_c_upset_col_name,
    cohort_d_upset_col_name,
    cohort_e_upset_col_name,
    cohort_f_upset_col_name),
  base_annotations = list(
    `Intersection size` = 
      ComplexUpset::intersection_size(
        text = list(size = 9, nudge_y = 0.5),
        text_colors = c(on_background='black', 
                        on_bar='white')
      ) + 
      ggplot2::annotate(
        size = 5.5,
        geom = 'text', 
        x = Inf, 
        y = Inf,
        label = paste('Total:', nrow(demographic_venn)),
        vjust = 1, 
        hjust = 1
      ) + 
      ggplot2::theme(
        axis.title.y = ggplot2::element_text(angle = 0, size = 20),
        axis.title.x = ggplot2::element_text(angle = 0, size = 20), 
        text = element_text(size = 20)
      ) +
      ggplot2::labs(y = "",
                    title = "Intersection size")
      ),
  set_sizes = ( 
    ComplexUpset::upset_set_size() +
    ggplot2::geom_text(
      size = 5.5,
      mapping = ggplot2::aes(label= ggplot2::after_stat(.data[["count"]])), 
      hjust = 1.5, 
      stat = 'count') +
    ggplot2::expand_limits(y = 30) + 
    ggplot2::theme(
        axis.title.y = ggplot2::element_text(angle = 0, size = 15),
        axis.title.x = ggplot2::element_text(angle = 0, size = 15), 
        text = element_text(size = 15)
    ) +
    ggplot2::labs(y = "Number of variables")
  ),
  sort_intersections_by = "degree",
  sort_sets = FALSE,
  name = "", 
  width_ratio = 0.25,
  themes = ComplexUpset::upset_default_themes(text = element_text(size = 15))
  )

An upset plot showing common variables within six cohorts

Cannot answer follow-up questions:

How many cohorts provide patient’s blood lipid information and how many patients have this information ?

Overview Diagrams

Create a “heatmap”using Microsoft PowerPoint.

A modified heatmap to show command variables.

Summary

The data harmonization process: Study data variables collected from different sources need to be mapped to one another, classified into the generalized concepts they represent, and transformed into unified harmonized variables for analysis

Variable Mapping Reporting Workflow. Starting with introducing the mapping procedure, writing code to do mapping ans validation, showing that the code works, clean up data for merging.

Four major documentation type.

A modified heatmap to show command variables.

Thank you

Harmonisation project template on GitHub: https://github.com/JauntyJJS/harmonisation/

Data harmonisation project template on Github.