Retrospective Clinical Data Harmonisation Reporting Using R and Quarto

14th November 2025

whoami

Research Officer from National Heart Centre Singapore who collects, cleans and harmonises clinical data.

Picture by Allison Horst about a data analyst facing a dataset in the form of a monster.

Taming the Data Beast from “Cleaning Medical Data with R” workshop by Shannon Pileggi, Crystal Lewis and Peter Higgins presented at R/Medicine 2023. Illustrated by Allison Horst.

About Data Harmonisation

Data harmonisation is part of the data wrangling process in which:

  • Similar variables from different datasets are identified.
  • Variables are grouped based on the generalised concept they represent.
  • Variables are transformed into unified harmonised variables for analysis.

The data harmonization process: Study data variables collected from different sources need to be mapped to one another, classified into the generalized concepts they represent, and transformed into unified harmonized variables for analysis

Image from Mallya et al. Circ Cardiovasc Qual Outcomes. 2023 Nov; 16(11):e009938 doi: 10.1161/CIRCOUTCOMES.123.009938.

How it started

Figure showing me welcoming the collaborator to the project and sending them a data dictionary and template file for them to use and send the data back to me.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

How it started

Figure showing the collaborator sharing that he does not have an analyst to do the mapping and warning that doing it on his end would take a long time.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

How it started

Figure showing the collaborator saying that they will send the data to me and ask me to do the data harmonisation instead.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

How it started

Figure showing a sad guy receiving a package/data full of bugs and issues.

snapshot from Ready for QA | MonkeyUser 2SP Animation Video from MonkeyUser.com.

How it started

A 3 by 3 repeated figure showing a sad guy receiving a package/data full of bugs and issues.

snapshot from Ready for QA | MonkeyUser 2SP Animation Video from MonkeyUser.com.

How it started

Turning my sorrow into opportunities.

Why a harmonisation report

Figure showing the collaborator saying that they need the harmonised dataset back with a report for higher management to show to data auditors.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

Why a harmonisation report

Some data fields just cannot be planned in advance.

Figure showing different race or ethnicity groupings from different collaborators, illustrating that race or ethnicity grouping is impossible to plan in advance.

Issues

While there are 📦 to facilitate data harmonisation,

There are limited resources on how to make a data harmonisation report.

Harmonisation Project Template

A template to offer a systematic way to report data harmonisation processes.

Link: https://jauntyjjs.github.io/harmonisation/

Tools to create documentation

Figure showing the logo of the R programming language and hexsticker of Quarto.

R Programming Logo from CleanPNG and Quarto Hex Sticker from Posit.


What Did We Forget to Teach You about R?

Figure showing a Google search on what is the R Programming Language.

Figure showing that R is often taught only as a tool for statistics, not the other things it can do.

Image from Project Oriented Workflows slides from What They Forgot to Teach You About R.


We will share a glimpse of “Everything else” R and its friends can do.

Pipes

R \(\geq\) 4.1.0 has a native “pipe” operator |> to make code easier to read.

Without |>

# Option 1: nested calls, read inside-out
data_after_task_3 <- task_3(task_2(task_1(data, arg_1_2), arg_2_2, arg_2_3), arg_3_2, arg_3_3)

# Option 2: intermediate objects at each step
data_after_task_1 <- task_1(data, arg_1_2)
data_after_task_2 <- task_2(data_after_task_1, arg_2_2, arg_2_3)
data_after_task_3 <- task_3(data_after_task_2, arg_3_2, arg_3_3)

With |>

data_after_task_3 <- data |> 
  task_1(arg_1_2) |> 
  task_2(arg_2_2, arg_2_3) |>
  task_3(arg_3_2, arg_3_3)
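The same left-to-right idea can be shown with real base-R functions; the sketch below swaps the placeholder task_1()/task_2()/task_3() for sum(), sqrt() and round().

```r
# The same computation three ways; sum/sqrt/round stand in for task_1/2/3.
data <- c(4, 9, 16)

# Nested calls, read inside-out
nested <- round(sqrt(sum(data)), 1)

# Intermediate objects at each step
total   <- sum(data)
stepped <- round(sqrt(total), 1)

# Native pipe (R >= 4.1.0), read left-to-right
piped <- data |> sum() |> sqrt() |> round(1)
```

All three produce the same value; the pipe version reads in the order the tasks happen.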


Inspired by the Bash Pipe |

terminal
# List files, then filter by row, then filter by column, then sort.
ls -l | grep drw | awk '{print $9}' | sort

Pipes

  • 2014+ 📦 magrittr pipe %>%

  • 2021+ (R \(\geq\) 4.1.0) native pipe |>

More details between the two pipes in Understanding the native R pipe |>.

Namespacing

dplyr::select()

  • tells R explicitly to use the function select from the package dplyr

  • can help to avoid name conflicts (e.g., MASS::select())

  • does not require library(dplyr)

Without Namespace

library(dplyr)

select(mtcars, mpg, cyl) 

mtcars |>  
  select(mpg, cyl) 

With Namespace

# library(dplyr) not needed

dplyr::select(mtcars, mpg, cyl) 

mtcars |>  
  dplyr::select(mpg, cyl) 

Session

Before R starts up in a given project, it performs the following steps:

  1. Set up Environment Variables.
  2. Run startup script.
  3. Set up the session.

We can customise Step 1 and 2 using these two main text files.

  • .Renviron (Contains environment variables to be set in sessions.)
  • .Rprofile (Contains code to be run in each session.)

Figure showing the .Renviron and .Rprofile file in an R project template.

what goes in .Renviron

  • R-specific environment variables.
  • API keys or other secrets.
  • Not R code (code belongs in .Rprofile).

APPDATA="D:/Jeremy/PortableR/RAppData/Roaming"
LOCALAPPDATA="D:/Jeremy/PortableR/RAppData/Local"
TEMP="D:/Jeremy/PortableR/RPortableWorkDirectory/temp"
TMP="D:/Jeremy/PortableR/RPortableWorkDirectory/temp"
_R_CHECK_SYSTEM_CLOCK_=0
RENV_CONFIG_PAK_ENABLED=TRUE
CONNECT_API_KEY=DaYK2hBUriSBYUEGIAiyXsRJHSjTYJN3
DB_USER=elephant
DB_PASS=p0stgr3s

user

~/.Renviron

project

path/to/your/project/.Renviron

Sys.getenv("RENV_CONFIG_PAK_ENABLED")
[1] "TRUE"
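Environment variables can also be set and read from R itself; DEMO_VAR below is a made-up name for illustration only.

```r
# Set an environment variable for the current session only.
Sys.setenv(DEMO_VAR = "hello")
Sys.getenv("DEMO_VAR")

# Provide a fallback for variables that are not set.
Sys.getenv("SOME_UNSET_VAR", unset = "fallback")
```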

what goes in .Rprofile

  • Set a default CRAN mirror.
  • Customise the R prompt.

source("renv/activate.R")
options(
  repos = c(
    P3M_20250306 = "https://packagemanager.posit.co/cran/2025-10-13",
    ropensci = "https://ropensci.r-universe.dev",
    janmarvin = "https://janmarvin.r-universe.dev",
    CRAN = 'https://cloud.r-project.org'
  )
)
if (interactive()) prompt::set_prompt(prompt::prompt_fancy)

Prompt Resources:

📦 – prompt.
xinxxxin/rprofile-custom-prompt.R.
Me, Myself and my Rprofile.
Prompt-moting a custom R prompt.

States of 📦

📦 – Structure

Figure showing the summary of the phases of the R package structure.

Binary 📦 Installation

Go for (Windows | Mac) binary 📦 in CRAN.

  • compiled ahead of time -> easiest / fastest to install

Figure showing the Windows and Mac binary files of R package pretestcad.

R-universe and Posit Public Package Manager

Consider installing (Windows | Mac | Linux) binaries from R-universe and Posit Public Package Manager.

Set in your .Rprofile file.

options(
  repos = c(
    P3M_20250306 = "https://packagemanager.posit.co/cran/2025-10-13",
    ropensci = "https://ropensci.r-universe.dev",
    CRAN = 'https://cloud.r-project.org'
  )
)

Website of Appsilon R-universe.

Website of Posit Public Package Manager.

How do I know I got a binary?

> install.packages("parallelly", repos = "https://cran.r-project.org")
Installing package into ‘C:/Users/edavi/Documents/R/win-library/4.1
(as ‘lib’ is unspecified)
trying URL 'https://cran.r-project.org/bin/windows/contrib/4.1/parallelly_1.32.1.zip'
Content type 'application/zip' length 306137 bytes (298 KB)
downloaded 298 KB

package ‘parallelly’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\edavi\AppData\Local\Temp\Rtmpa2s3e8\downloaded_packages
> install.packages("renv", repos="https://cran.r-project.org")
Installing package into ‘/Users/edavidaja/Library/R/x86_64/4.1/library’
(as ‘lib’ is unspecified)
trying URL 'https://cran.r-project.org/bin/macosx/contrib/4.1/renv_0.15.5.tgz'
Content type 'application/x-gzip' length 1866760 bytes (1.8 MB)
==================================================
downloaded 1.8 MB


The downloaded binary packages are in
 /var/folders/b5/fl4ff68d23s148tg1_1gnflc0000gn/T//RtmpMk69B0/downloaded_packages
> install.packages("remotes")
Installing package into ‘C:/Users/WDAGUtilityAccount/AppData/Local/R/win-library/4.2
(as ‘lib’ is unspecified)
trying URL 'https://p3m.dev/cran/latest/bin/windows/contrib/4.2/remotes_2.4.2.zip'
Content type 'binary/octet-stream' length 399930 bytes (390 KB)
downloaded 390 KB

package ‘remotes’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
  C:\Users\WDAGUtilityAccount\AppData\Local\Temp\RtmpA1edRi\downloaded_packages

Source 📦 Installation

Go for source 📦 if you need the latest version urgently.

Figure showing the Windows binary files are not the latest version for R package parallelly.

Some 📦 only have a source version in CRAN.

Source 📦 Installation

You will need additional tools and dependencies.

Windows: install Rtools.

Mac: install XCode ⚠️ takes some time to install

terminal
xcode-select --install

Linux: install build tools via your package manager, e.g.

terminal
apt install make


Run devtools::has_devel() in console.

## Your system is ready to build packages!

Install using 📦 pak

Consider using pak::pkg_install instead of install.packages()
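A minimal sketch of what that swap looks like (not run here; the package names are examples only):

```r
# One-time setup: install pak itself from CRAN.
install.packages("pak")

# Then install packages through pak's dependency solver.
pak::pkg_install("dplyr")            # a CRAN package
pak::pkg_install("tidyverse/dplyr")  # a GitHub reference also works
```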

Isolated project environment using renv

Most commonly used 📦 to create isolated project environments.

“virtual environment” using renv

Some advice …

  • In an existing project, use renv::init(bare = TRUE) to initiate renv with an empty library and then install 📦 manually.
  • After installing 📦 pak in the renv environment, set RENV_CONFIG_PAK_ENABLED=TRUE in the .Renviron file for renv::install() to use 📦 pak at the backend to install 📦.
  • Indicate folders/files to ignore in the .renvignore file to speed up the snapshot process (renv::snapshot()).
  • You can update the repositories specified in the renv.lock file.
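Put together, the advice above looks roughly like this (not run; a sketch assuming a fresh project):

```r
# Start renv with an empty project library, then install manually.
renv::init(bare = TRUE)

# After installing pak and setting RENV_CONFIG_PAK_ENABLED=TRUE in .Renviron,
# renv::install() delegates installation to pak.
renv::install("dplyr")

# Record the state of the library; .renvignore entries speed this step up.
renv::snapshot()
```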

“virtual environment” using renv

If you are frustrated with renv::restore() … please

watch Practical {renv} and read Practical {renv} Materials

New Kid for 📦 Management: Require

Personal R Administration

https://rstats-wtf.github.io/wtf-personal-radmin-slides

Project Organisation

Organise your project as you go instead of waiting for “tomorrow”.

Data is cheap but time is expensive.

Comic from Monkeyuser.com showing that it is hard to get help with programming issues.

Figure showing Research Compendium as a tool to sort out different analysis.

Research Compendium by Scriberia from The Turing Way project and Project Layout from Good enough practices in scientific computing.

Project Organisation

My harmonisation template organisation is based on the 📦 rcompendium, but there are others (orderly, prodigenr and workflowr) as well.

Examples of R packages (rcompendium, orderly and workflowr) for project organisation.

Custom R Function Management

It is better to organise your custom functions into a 📦 to make your code easier to re-use, document, and test.
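A sketch of how such a 📦 can be started with usethis/devtools; the package and function names below are made up (not run):

```r
# Create the package skeleton (made-up name).
usethis::create_package("harmonisationhelpers")

# One file per function, plus a matching test file.
usethis::use_r("calculate_bmi")
usethis::use_test("calculate_bmi")

# Generate documentation and run checks.
devtools::document()
devtools::check()
```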

If the top of your script is

setwd("C:\Users\jenny\path\that\only\I\have")
rm(list = ls())

Jenny will come into your office and SET YOUR COMPUTER ON FIRE 🔥.

https://tidyverse.org/blog/2017/12/workflow-vs-script/

Practise “safe paths”

📦 with file system functions (fs and here).

Practise “safe paths”

Avoid typing absolute paths in scripts. Let 📦 do it for you.

BAD <- "D://Jeremy//PortableR//RPortableWorkDirectory//"


User’s home directory

fs::path_home()
C:/Users/Jeremy

Project directory

here::here()
[1] "D:/Jeremy/PortableR/RPortableWorkDirectory/hat_2025"


here::here() does not create directories; that’s your job.
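A base-R sketch of doing that job, using tempdir() instead of a real project path:

```r
# Create the output folder before saving anything into it.
out_dir <- file.path(tempdir(), "figs")
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}
dir.exists(out_dir)
```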

Practise “safe paths”

Avoid typing / or \ manually. Let 📦 do it for you.

file.path("data", "raw-data.csv")
[1] "data/raw-data.csv"
fs::path_home("data", "raw-data.csv")
C:/Users/Jeremy/data/raw-data.csv
here::here("data", "raw-data.csv")
[1] "D:/Jeremy/PortableR/RPortableWorkDirectory/hat_2025/data/raw-data.csv"


Use relative path within the project directory.

readxl::read_excel(path = here::here("data-folder", "data.xlsx"))
ggplot2::ggsave(filename = here::here("figs", "built-barchart.png"))

Works on my machine, works on yours!

Practise “safe paths”

Efficient File Management in R with {fs} with Jadey Ryan

Start Session in a “blank slate”

Avoid rm(list = ls())

Figure showing the options that persisted despite using rm(list = ls()). Slide from Project oriented workflows.


Use usethis::use_blank_slate()

R console
usethis::use_blank_slate()

Tools -> Global Options in RStudio.

Starting R Session in a blank slate.

Project Oriented Workflow

https://rstats-wtf.github.io/wtf-project-oriented-workflow-slides

Quarto

Quarto is an open-source software that weaves narrative and programming code together to produce elegantly formatted output as documents (in HTML, Word, PDF), presentations, books, web pages, and more.

Picture by Allison Horst about Quarto turning programming code to documents. Artwork from “Hello, Quarto” keynote by Julia Lowndes and Mine Çetinkaya-Rundel, presented at RStudio Conference 2022. Illustrated by Allison Horst.

Open A Quarto Document

Create a Quarto document

Steps to create a Quarto document

Render to HTML Report

Steps to render a Quarto document to a report in html

Quarto Level 1

A Quarto file is a plain text file that has the extension .qmd containing three important types of content:

Quarto file has three contents: An optional YAML header surrounded by `---`s, Chunks of R code surrounded by ```s and Text mixed with markdown text formatting. Simple Quarto file example from R for Data Science (2e) Chapter 28 Quarto
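A minimal .qmd sketch showing those three parts together (the title and chunk contents are placeholders):

````markdown
---
title: "Harmonisation report"
format: html
---

## Cohort overview

Some narrative text with **markdown** formatting.

```{r}
summary(cars)
```
````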

Quarto Level 1

https://kpuka.ca/resources/quarto_intro.html

Quarto Level 2

terminal
quarto pandoc -o custom-reference-doc.docx --print-default-data-file reference.docx

How to change document fonts & formats in Quarto (Word/Docx)

Quarto Level 2

My word template for Quarto

Quarto Level 2

terminal
quarto install tinytex

Preparing RStudio to Generate PDF Files with Quarto and tinyTeX

Quarto Level 2

Christopher Kenny’s Quarto templates

Workflow with collaborators

Collaborators can send the raw data once while you keep updating the cleaned data for harmonisation.

Picture showing a top workflow to first clean data for harmonisation before doing the actual harmonisation. Bottom workflow to verify with the collaborators and update the clean data for harmonisation periodically.

Dataset vectors by Vectora Artworks from Vecteezy.

Workflow with collaborators

Figure showing the collaborator saying that they need the updated raw data in addition to the harmonised data.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

Workflow with collaborators

Figure showing the data analyst providing the updated raw data and harmonised data.

Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.

Workflow with collaborators

Collaborators can update the raw data, for example by adding new clinical data, adding more patients, or correcting errors.

Picture showing a top workflow to first clean data for harmonisation before doing the actual harmonisation. Bottom workflow to verify with the collaborators and update the raw data for harmonisation periodically.

Dataset vectors by Vectora Artworks from Vecteezy.

Workflow with collaborators

A new version means new bugs or reopened issues to fix. Is there an automated way to catch warnings/issues when reading these updated files?

Picture showing that a new version of raw data sent by a collaborator can reintroduce issues to a supposedly fixed problem.

Dataset vectors by Vectora Artworks from Vecteezy. Anticipate from MonkeyUser.com

Automated capture of warnings (csv)

Is there an automated way to catch warnings/issues when reading csv files?

cohort_data_csv <- vroom::vroom(
  file = here::here("data-raw", "Cohort_csv", 
    "data_to_harmonise_age_issue.csv"),
  delim = ",",
  col_select = 1:2,
  show_col_types = FALSE,
  col_types = list(
    ID = vroom::col_character(), 
    Age = vroom::col_integer()
  )
)
head(cohort_data_csv, n = 3)
# A tibble: 3 × 2
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
  ID      Age
  <chr> <int>
1 B001     32
2 B002     52
3 B003     NA

A csv file with text data in the Age column.

Automated capture of warnings (csv)

If there are issues with the data, the output of vroom::problems() will be a non-empty tibble.

cohort_data_csv |> 
  vroom::problems()
# A tibble: 4 × 5
    row   col expected   actual  file                                           
  <int> <int> <chr>      <chr>   <chr>                                          
1     2     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/hat…
2     4     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/hat…
3    10     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/hat…
4    17     2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/hat…

To check for this automatically, we can use pointblank::expect_row_count_match.

cohort_data_csv |> 
  vroom::problems() |> 
  pointblank::expect_row_count_match(count = 0)
Error: Row counts for the two tables did not match.
The `expect_row_count_match()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)

Automated capture of warnings (csv)

Here is a case with no issues.

cohort_data_csv <- vroom::vroom(
  file = here::here("data-raw", "Cohort_csv", 
    "data_to_harmonise_age_issue_fixed.csv"),
  delim = ",",
  col_select = 1:2,
  show_col_types = FALSE,
  col_types = list(
    ID = vroom::col_character(), 
    Age = vroom::col_integer()
  )
)

cohort_data_csv |> 
  vroom::problems()
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
cohort_data_csv |> 
  vroom::problems() |> 
  pointblank::expect_row_count_match(count = 0)

A csv file with only integer data in the Age column.

Automated capture of warnings (Excel)

Is there an automated way to catch warnings/issues when reading Excel files?

cohort_data_excel <- readxl::read_excel(
  path = here::here("data-raw", "Cohort_Excel", 
    "data_to_harmonise_age_issue.xlsx"),
  sheet = "Sheet1",
  col_types = c(
    "text", "numeric"
    )
  )
Warning: Expecting numeric in B7 / R7C2: got 'missing'
Warning: Expecting numeric in B14 / R14C2: got 'missing'

An Excel file with text data in the Age column.

Automated capture of warnings (Excel)

We can read the Excel file with testthat::expect_no_condition.

testthat::expect_no_condition(
  cohort_data_excel <- readxl::read_excel(
    path = here::here("data-raw", "Cohort_Excel", 
      "data_to_harmonise_age_issue.xlsx"),
    sheet = "Sheet1",
    col_types = c("text", "numeric")
  )
)
Error: Expected `... <- NULL` to run without any conditions.
ℹ Actually got a <simpleWarning> with text:
  Expecting numeric in B7 / R7C2: got 'missing'

An Excel file with text data in the Age column.
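When testthat is not available, a base-R alternative can collect the warnings itself; a sketch using a deliberate coercion warning instead of readxl:

```r
# Collect warnings raised by an expression without stopping it.
warnings_seen <- character()
result <- withCallingHandlers(
  as.integer(c("1", "2", "oops")),  # "oops" triggers a coercion warning
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
length(warnings_seen)  # one warning captured
```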

Automated capture of warnings (Excel)

However, this method means that you will lose the pipe workflow.

testthat::expect_no_condition(
  cohort_data_excel <- readxl::read_excel(
    path = here::here("data-raw", "Cohort_Excel", 
      "data_to_harmonise_age_issue_fixed.xlsx"),
    sheet = "Sheet1",
    col_types = c("text", "numeric")
  )
)

cohort_data_excel <- cohort_data_excel |> 
  # Check if Serial Number is unique
  pointblank::rows_distinct(
    columns = "Serial Number"
  )

An Excel file with numeric data in the Age column.

Automated capture of warnings (Excel)

We can use the tee pipe operator %T>% from 📦 magrittr.

With Issues

cohort_data_excel <- readxl::read_excel(
  path = here::here("data-raw", "Cohort_Excel", 
    "data_to_harmonise_age_issue.xlsx"),
  sheet = "Sheet1",
  col_types = c(
    "text", "numeric"
    )
  ) %T>%
  testthat::expect_no_condition()
Error: Expected `.` to run without any conditions.
ℹ Actually got a <simpleWarning> with text:
  Expecting numeric in B7 / R7C2: got 'missing'

No Issues

cohort_data_excel_2 <- readxl::read_excel(
  path = here::here("data-raw", "Cohort_Excel", 
    "data_to_harmonise_age_issue_fixed.xlsx"),
  sheet = "Sheet1",
  col_types = c("text", "numeric")
) %T>%
testthat::expect_no_condition() |> 
# Check if Serial Number is unique
  pointblank::rows_distinct(
    columns = "Serial Number"
)

Variable Mapping

Variable Mapping Reporting Workflow: introduce the mapping procedure, write code to do the mapping and validation, show that the code works, then clean up the data for merging.

Variable Mapping

Let's take this dataset as an example.

cohort_csv_data <- vroom::vroom(
  file = here::here("data-raw",
                    "Cohort_csv",
                    "data_to_harmonise.csv"),
  delim = ",",
  col_select = 1:8,
  show_col_types = FALSE,
  col_types = list(
    ID = vroom::col_character(),
    Age = vroom::col_integer(),
    Sex = vroom::col_character(),
    Height = vroom::col_double(),
    Weight = vroom::col_double(),
    `Smoke History` = vroom::col_character(),
    `Chest Pain Character` = vroom::col_character(),
    Dyspnea = vroom::col_character()
    )
  ) |>  
  dplyr::rename(cohort_unique_id = "ID") |>
  # Remove rows when the ID value is NA
  dplyr::filter(!is.na(.data[["cohort_unique_id"]])) |>
  # Remove white spaces in column names
  dplyr::rename_all(stringr::str_trim) |> 
  # Check if cohort id is unique
  pointblank::rows_distinct(
    columns = "cohort_unique_id"
  )

cohort_csv_data |> 
  vroom::problems() |> 
  pointblank::expect_row_count_match(count = 0)

Variable Mapping

Let the reader know how the collaborator’s data Smoke History is going to be mapped.

Documentation indicating how the collaborator's data Smoke History is going to be mapped or harmonised in Quarto code and output.

Variable Mapping

Code to do mapping.

smoking_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id", 
                  "Smoke History")) |>
  dplyr::mutate(
    smoke_current = dplyr::case_when(
      is.na(.data[["Smoke History"]]) ~ "-1",
      .data[["Smoke History"]] == "non-smoker" ~ "0",
      .data[["Smoke History"]] == "past smoker" ~ "0",
      .data[["Smoke History"]] == "current smoker" ~ "1",
      .default = NA_character_
    ),
    smoke_current = forcats::fct_relevel(
      .data[["smoke_current"]],
      c("0", "1")), 
    smoke_past = dplyr::case_when(
      is.na(.data[["Smoke History"]]) ~ "-1",
      .data[["Smoke History"]] == "non-smoker" ~ "0",
      .data[["Smoke History"]] == "past smoker" ~ "1",
      .data[["Smoke History"]] == "current smoker" ~ "0",
      .default = NA_character_
    ),
    smoke_past = forcats::fct_relevel(
      .data[["smoke_past"]],
      c("0", "1")),
    `Smoke History` = forcats::fct(
      .data[["Smoke History"]]
    )
  )

Documentation of harmonisation of smoking history.

Variable Mapping

Code to do validation.

smoking_data <- smoking_data |>
  pointblank::col_vals_in_set(
    columns = c("smoke_current", "smoke_past"),
    set = c("0", "1", "-1")
  ) |> 
  pointblank::col_vals_expr(
    expr = pointblank::expr(
      (.data[["smoke_current"]] == "1" & .data[["smoke_past"]] == "0") |
      (.data[["smoke_current"]] == "-1" & .data[["smoke_past"]] == "-1") |
      (.data[["smoke_current"]] == "0" & .data[["smoke_past"]] %in% c("0", "1"))
    )
  )

Documentation of harmonisation of smoking history.

Reference: https://github.com/rstudio/pointblank/issues/578

Variable Mapping

Make use of Quarto’s parameters, conditional content and !expr knitr engine syntax to choose what code/items to run/display on your html, pdf or word report.

Show examples that the code works.

your_quarto_script.qmd
---
params:
  show_table: TRUE
---

```{r}
#| label: output-type
#| echo: false
#| warning: false
#| message: false

out_type <- knitr::opts_knit$get("rmarkdown.pandoc.to")
```

Code to print the reactable in html file.

Variable Mapping

Parameterized Quarto Reports Improve Understanding of Soil Health by Jadey Ryan.

Variable Mapping

Show examples that the code works. Code to print the reactable in html file.

Html Output

Variable Mapping

Show examples that the code works. Code to print the table in pdf file.

Pdf Output

Output code in pdf file.

Output table in pdf file.

Variable Mapping

Clean for merging.

smoking_data <- smoking_data |>
  dplyr::select(-c("Smoke History"))

Merging Harmonised Data

Suppose we have completed harmonising a batch of clinical data.

age_gender_data |> 
  reactable_with_download_csv_button(
    defaultPageSize = 5,
    paginationType = "jump",
    style = list(fontSize = "1rem")
  )
body_measurement_data |> 
  reactable_with_download_csv_button(
    defaultPageSize = 5,
    paginationType = "jump",
    style = list(fontSize = "1rem")
  )

How can we merge them without issues of missing rows or additional columns?

Merging Harmonised Data

unmatched = "error" in dplyr::inner_join helps to avoid patients with no match.

join_specification <- dplyr::join_by("cohort_unique_id")

demo_behave_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id")) |>
  dplyr::inner_join(age_gender_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(body_measurement_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(smoking_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  2,         4730,
  4,         4725
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    unmatched = "error"
 ) 
Error in `dplyr::inner_join()`:
! Each row of `y` must be matched by `x`.
ℹ Row 3 of `y` was not matched.

Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality

Merging Harmonised Data

unmatched = "error" in dplyr::inner_join helps to avoid patients with no match.

join_specification <- dplyr::join_by("cohort_unique_id")

demo_behave_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id")) |>
  dplyr::inner_join(age_gender_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(body_measurement_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(smoking_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  3,         4725
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    unmatched = "error"
 ) 
Error in `dplyr::inner_join()`:
! Each row of `x` must have a match in `y`.
ℹ Row 2 of `x` does not have a match.

Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality

Merging Harmonised Data

relationship = "one-to-one" in dplyr::inner_join helps to avoid patients with multiple match.

join_specification <- dplyr::join_by("cohort_unique_id")

demo_behave_data <- cohort_csv_data |>
  dplyr::select(c("cohort_unique_id")) |>
  dplyr::inner_join(age_gender_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(body_measurement_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::inner_join(smoking_data,
                   by = join_specification,
                   unmatched = "error",
                   relationship = "one-to-one") |>
  dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  2,         4730,
  2,         4725,
  3,         4000
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    relationship = "one-to-one"
 ) 
Error in `dplyr::inner_join()`:
! Each row in `x` must match at most 1 row in `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.

Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality

Merging Harmonised Data

Use pointblank::has_columns to ensure we only have harmonised variables.

testthat::expect_false(
  pointblank::has_columns(
    demo_behave_data,
    columns = c(
      dplyr::ends_with(".x"), 
      dplyr::ends_with(".y")
    )
  )
)

testthat::expect_equal(
  ncol(demo_behave_data), 9
)

testthat::expect_true(
  pointblank::has_columns(
    demo_behave_data,
    columns = c(
      "age_years", "sex",
      "height_cm", "weight_kg", "bsa_m2", "bmi",
      "smoke_current", "smoke_past"
    )
  )
) 
three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~island,
  1,         "Torgersen",
  2,         "Biscoe",
  3,         "Dream"
)

three_penguins <- three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    unmatched = "error",
    relationship = "one-to-one"
 )

three_penguins |> 
  pointblank::has_columns(
    columns = c(
      dplyr::ends_with(".x"), 
      dplyr::ends_with(".y")
      )
  )
[1] TRUE
colnames(three_penguins)
[1] "samp_id"  "species"  "island.x" "island.y"

Comparing Datasets

Use 📦 daff to compare different version of harmonised datasets.

data1 <- data.frame(
  Name=c("P1","P2","P3","P4","P5"),
  col1=c(1,2,3,4,5),
  col2=c(11,13,14,15,17)
)

data2 <- data.frame(
  col1=c(1,3,3,6,9), 
  Name=c("P1","P2","P6","P4","P5")
)

compare_results <- daff::diff_data(data1, data2)
compare_results
Daff Comparison: 'data1' vs. 'data2' 
        B:A  A:B  C:- 
    !        :    --- 
    @@  col1 Name col2
1:1     1    P1   11  
2:2 ->  2->3 P2   13  
-:3 +++ 3    P6   <NA>
3:- --- 3    P3   14  
4:4 ->  4->6 P4   15  
5:5 ->  5->9 P5   17  
daff::render_diff(compare_results)

Interactive results from daff.

Comparing Datasets

Use summary() to return a summary list.

data1 <- data.frame(
  Name=c("P1","P2","P3","P4","P5"),
  col1=c(1,2,3,4,5),
  col2=c(11,13,14,15,17)
)

data2 <- data.frame(
  col1=c(1,3,3,6,9), 
  Name=c("P1","P2","P6","P4","P5")
)

compare_different_summary <- daff::diff_data(
  data1, data2) |> 
  summary()

compare_different_summary

Data diff: 'data1' vs. 'data2' 
        #       Modified Reordered Deleted Added
Rows    5       3        0         1       1    
Columns 3 --> 2 1        1         1       0    
data1 <- data.frame(
  Name=c("P1","P2","P3","P4","P5"),
  col1=c(1,2,3,4,5),
  col2=c(11,13,14,15,17)
)

data2 <- data.frame(
  Name=c("P1","P2","P3","P4","P5"),
  col1=c(1,2,3,4,5),
  col2=c(11,13,14,15,17)
)

compare_same_summary <- daff::diff_data(data1, data2) |> 
  summary()

compare_same_summary

Data diff: 'data1' vs. 'data2' 
        # Modified Reordered Deleted Added
Rows    5        0         0       0     0
Columns 3        0         0       0     0

Comparing Datasets

Use the summary list and pointblank::expect_col_vals_in_set() to perform the validation automatically.

tibble::tibble(
  row_deletes = compare_different_summary$row_deletes,
  row_inserts = compare_different_summary$row_inserts,
  row_updates = compare_different_summary$row_updates,
  row_reorders = compare_different_summary$row_reorders,
  col_deletes = compare_different_summary$col_deletes,
  col_inserts = compare_different_summary$col_inserts,
  col_updates = compare_different_summary$col_updates,
  col_reorders = compare_different_summary$col_reorders,
) |>
  pointblank::expect_col_vals_in_set(
    columns = c(
      "row_deletes", "row_inserts", 
      "row_updates", "row_reorders",
      "col_deletes", "col_inserts", 
      "col_updates", "col_reorders"
    ),
    set = c(0)
  )
Error: Exceedance of failed test units where values in `row_deletes` should have been in the set of `0`.
The `expect_col_vals_in_set()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)
tibble::tibble(
  row_deletes = compare_same_summary$row_deletes,
  row_inserts = compare_same_summary$row_inserts,
  row_updates = compare_same_summary$row_updates,
  row_reorders = compare_same_summary$row_reorders,
  col_deletes = compare_same_summary$col_deletes,
  col_inserts = compare_same_summary$col_inserts,
  col_updates = compare_same_summary$col_updates,
  col_reorders = compare_same_summary$col_reorders,
) |>
  pointblank::expect_col_vals_in_set(
    columns = c(
      "row_deletes", "row_inserts", 
      "row_updates", "row_reorders",
      "col_deletes", "col_inserts", 
      "col_updates", "col_reorders"
    ),
    set = c(0)
  )

Comparing Datasets

Find the difference between two datasets in R

Technical Report Challenge

One variable mapping report takes at least one page.

On average, a clinical trial will have a few hundred variables.

  • One hundred columns for clinical and demographic data.
  • Two hundred columns for medication.

A harmonisation report can therefore run to at least a few hundred pages per cohort.

There is a need to automate the creation of these reports.

Businessman in pile of documents asking for help.

Businessman in pile of documents asking for help by Amonrat Rungreangfangsai

Quarto Books/Websites

Quarto Level 3

To make a Quarto book or website, we need a _quarto.yml file and an index.qmd file.

A preview of the _quarto.yml file.

A preview of the index.qmd file.

Quarto Level 3

_quarto.yml is a configuration file to tell Quarto to create a book.

_quarto.yml
---
project:
  type: book
  output-dir: reports/Cohort_B

book:
  downloads: [pdf, docx]
  title: "Harmonisation Template for Cohort B"
  author: "My Name"
  navbar:
    search: true
  sidebar:
    collapse-level: 1

  chapters:
    - index.qmd
    - part: Cohort B Cleaning
      chapters:
        - codes/Cohort_B/00_R_Package_And_Environment.qmd
        - codes/Cohort_B/01_Read_Cohort_B_Data.qmd
        - codes/Cohort_B/02_Extract_Demographic.qmd
        - codes/Cohort_B/03_Export_To_Excel.qmd

crossref:
  chapters: false
  fig-title: Figure     # (default is "Figure")
  tbl-title: Table      # (default is "Table")
  fig-prefix: Figure    # (default is "Figure")
  tbl-prefix: Table     # (default is "Table")
  ref-hyperlink: true   # (default is true)
  title-delim: ":"      # (default is ":")

bibliography: references.bib
csl: csl_file.csl

format:
  html:
    toc: true
    toc-depth: 5
    toc-location: right
    toc-expand: true
    number-sections: true
    number-depth: 5
    smooth-scroll: true
    theme:
      light:
        - flatly
        #- custom.scss
      dark:
        - solar
  docx:
    reference-doc: custom-reference.docx
    toc: true
    toc-depth: 5
    number-sections: true
    number-depth: 5
    prefer-html: true
    highlight-style: github
  pdf:
    pdf-engine: xelatex
    documentclass: scrreprt
    papersize: a4
    toc-depth: 5
    number-sections: true
    number-depth: 5
    keep-tex: false
    include-in-header:
      text: |
        \usepackage{fvextra}
        \DefineVerbatimEnvironment{Highlighting}{Verbatim}{breaklines,commandchars=\\\{\}}
        \DefineVerbatimEnvironment{OutputCode}{Verbatim}{breaklines,commandchars=\\\{\}}
    include-before-body:
      text: |
        \begin{flushleft}
        \begin{sloppypar}
        \RecustomVerbatimEnvironment{verbatim}{Verbatim}{
          showspaces = false,
          showtabs = false,
          breaksymbolleft={}, % Needs package fvextra
          breaklines
          % Note: setting commandchars=\\\{\} here will cause an error
        }
    include-after-body:
      text: |
        \end{sloppypar}
        \end{flushleft}
---

Quarto Level 3

The index.qmd file gives the preface (homepage) content of the Quarto book (website). It is a compulsory file; rendering will not work without it.

index.qmd
---
date: "2025-03-10"
format:
  html:
    code-fold: true
    freeze: false
params:
  show_table: TRUE
---

```{r}
#| label: output-type
#| echo: false
#| warning: false
#| message: false

out_type <- knitr::opts_knit$get("rmarkdown.pandoc.to")
```

# Preface {.unnumbered .unlisted}

Here is the documentation of the data harmonisation step generated using [Quarto](https://quarto.org/). To learn more about Quarto books visit <https://quarto.org/docs/books>.

## File Structure

Here is the file structure of the project used to generate the document.

```
harmonisation/                            # Root of the project template.
|
├── .quarto/ (not in repository)          # Folder to keep intermediate files/folders 
|                                         # generated when Quarto renders the files.
|
├── archive/                              # Folder to keep previous books and harmonised data.
|   |
│   ├── reports/                          # Folder to keep previous versions of
|   |   |                                 # data harmonisation documentation.
|   |   |
|   |   ├── {some_date}_batch/            # Folder to keep {some_date} version of
|   |   |                                 # data harmonisation documentation.
|   |   |
|   |   └── Flowchart.xlsx                # Flowchart sheet to record version control.
|   |
|   └── harmonised/                       # Folder to keep previous version of harmonised data.
|       |
|       ├── {some_date}_batch/            # Folder to keep {some_date} version of
|       |                                 # harmonised data.
|       |
|       └── Flowchart.xlsx                # Flowchart sheet to record version control.
|
├── codes/                                # Folder to keep R/Quarto scripts 
|   |                                     # to run data harmonisation.
|   |
│   ├── {cohort name}/                    # Folder to keep Quarto scripts to run
|   |   |                                 # data cleaning, harmonisation 
|   |   |                                 # and output them for each cohort.
|   |   |
|   |   └── preprocessed_data/            # Folder to keep preprocessed data.
|   |
│   ├── harmonisation_summary/            # Folder to keep Quarto scripts to create
|   |                                     # data harmonisation summary report.
|   |
│   ├── output/                           # Folder to keep harmonised data.
|   |                                     
|   ├── cohort_harmonisation_script.R     # R script to render each {cohort name}/ folder
|   |                                     # into html, pdf and word documents.
|   |
|   └── harmonisation_summary_script.R    # R script to render the {harmonisation_summary}/
|                                         # folder into a word document.
|
├── data-raw/                             # Folder to keep cohort raw data (.csv, .xlsx, etc.)
|   |
│   ├── {cohort name}/                    # Folder to keep cohort raw data.
|   |   |
|   |   ├── {data_dictionary}             # Data dictionary file that corresponds to the
|   |   |                                 # cohort raw data. Can be one provided by the
|   |   |                                 # collaborator or provided by us.
|   |   |
|   |   └── Flowchart.xlsx                # Flowchart sheet to record version control.
|   |
|   ├── data-dictionary/                  # Folder to keep data dictionary 
|   |   |                                 # used for harmonising data.
|   |   |
|   |   └── Flowchart.xlsx                # Flowchart sheet to record version control.
|   |
|   └── data-input/                       # Folder to keep data input file 
|       |                                 # for collaborators to fill in.
|       |
|       └── Flowchart.xlsx                # Flowchart sheet to record version control.
|  
├── docs/                                 # Folder to keep R functions documentation 
|                                         # generated using pkgdown:::build_site_external().
|  
├── inst/                                 # Folder to keep arbitrary additional files 
|   |                                     # to include in the project.
|   |  
|   └── WORDLIST                          # File generated by spelling::update_wordlist()
|  
├── man/                                  # Folder to keep R functions documentation
|   |                                     # generated using devtools::document().
|   |
│   ├── {fun-demo}.Rd                     # Documentation of the demo R function.
|   |
│   └── harmonisation-template.Rd         # High-level documentation.
|  
├── R/                                    # Folder to keep R functions.
|   |
│   ├── {fun-demo}.R                      # Script with R functions.
|   |
│   └── harmonisation-package.R           # Dummy R file for high-level documentation.
|
├── renv/ (not in repository)             # Folder to keep all packages 
|                                         # installed in the renv environment.
| 
├── reports/                              # Folder to keep the most recent data harmonisation
|                                         # documentation.
|
├── templates/                            # Folder to keep template files needed to generate
|   |                                     # data harmonisation documentation efficiently.
|   |
|   ├── quarto-yaml/                      # Folder to keep template files to generate 
|   |   |                                 # data harmonisation documentation structure 
|   |   |                                 # in Quarto. 
|   |   |
│   |   ├── _quarto_{cohort name}.yml     # Quarto book template data harmonisation documentation 
|   |   |                                 # for {cohort name}.
|   |   |
|   |   └── _quarto_summary.yml           # Quarto book template data harmonisation summary.
|   |
|   └── index-qmd/                        # Folder to keep template files to generate
|       |                                 # the preface page of the data harmonisation 
|       |                                 # documentation.
|       |
|       ├── _index_report.qmd             # Preface template for each cohort data harmonisation
|       |                                 # report. 
|       |
|       └── _index_summary.qmd            # Preface template for data harmonisation 
|                                         # summary report. 
|        
├── tests/                                # Folder to keep unit test files.
|                                         # Files will be used by the R package testthat.
|
├── .Rbuildignore                         # List of files/folders to be ignored while 
│                                         # checking/installing the package.
|
├── .Renviron (not in repository)         # File to set environment variables.
|
├── .Rprofile (not in repository)         # R code to be run when R starts up.
|                                         # It is run after the .Renviron file is sourced.
|
├── .Rhistory (not in repository)         # File containing R command history.
|
├── .gitignore                            # List of files/folders to be ignored while 
│                                         # using the git workflow.
|
├── .lintr                                # Configuration for linting
|                                         # R projects and packages using linter.
|        
├── .renvignore                           # List of files/folders to be ignored when 
│                                         # renv is doing its snapshot.
|
├── DESCRIPTION[*]                        # Overall metadata of the project.
|
├── LICENSE                               # Content of the MIT license generated via
|                                         # usethis::use_mit_license().
|
├── LICENSE.md                            # Content of the MIT license generated via
|                                         # usethis::use_mit_license().
|
├── NAMESPACE                             # List of functions users can use or imported
|                                         # from other R packages. It is generated 
|                                         # by devtools::document().
|
├── README.md                             # GitHub README markdown file generated by Quarto.
|
├── README.qmd                            # GitHub README quarto file used to generate README.md. 
|        
├── _pkgdown.yml                          # Configuration for R package documentation
|                                         # using pkgdown:::build_site_external().
|        
├── _quarto.yml                           # Configuration for Quarto book generation.
|                                         # It is also the project configuration file.
|
├── csl_file.csl                          # Citation Style Language (CSL) file to ensure
|                                         # citations follow the Lancet style.
|        
├── custom-reference.docx                 # Microsoft Word template used when rendering the
|                                         # data harmonisation documentation to Word.
|
├── harmonisation_template.Rproj          # RStudio project file.
|        
├── index.qmd                             # Preface page of Quarto book content.
|        
├── references.bib                        # Bibtex file for Quarto book.
|      
└── renv.lock                             # Metadata of R packages installed generated
                                          # using renv::snapshot().

[*] These files are automatically created but user needs to manually add some information.
```

Quarto (Reference)

https://quarto.org/

Harmonisation Report Types

Collaborator wants different ways to report how data harmonisation is done.

Four major documentation types.

The documentation system by Divio

Harmonisation Report Types

Collaborator wants different ways to report how data harmonisation is done.

Four major documentation types in terms of data harmonisation.

Automated Technical Report (Reference)

We create a _quarto.yml file and relevant Quarto files for each cohort.

A preview of the _quarto.yml and relevant Quarto files for Cohort A.

Automated Technical Report (Reference)

We create an index.qmd file for each kind of report.

A preview of the index qmd file for technical reporting.

Automated Technical Report (Reference)

Create a script to generate technical reports in PDF, Word and HTML for each cohort.

# Copy the right index.qmd 
# file

index_qmd_file <- paste0(
  "_index_",
  "report",
  ".qmd"
)

fs::file_copy(
  path = here::here(
    "templates",
    "index-qmd",
    index_qmd_file),
  new_path = here::here(
    "index.qmd"
  ),
  overwrite = TRUE
)

_index_report.qmd file copied and renamed to index.qmd.

Automated Technical Report (Reference)

Create a script to generate technical reports in PDF, Word and HTML for each cohort.

copy_and_render <- function(
    cohort
) {

  # Copy quarto.yml file
  # for each cohort
  
  quarto_yml_file <- paste0(
    "_quarto_",
    cohort,
    ".yml"
  )

  fs::file_copy(
    path = here::here(
      "templates",
      "quarto-yaml",
      quarto_yml_file),
    new_path = here::here("_quarto.yml"),
    overwrite = TRUE
  )
  
  # Render each cohort
  quarto::quarto_render(
    as_job = FALSE
  )
}

cohort_name <- c("Cohort_A", 
                 "Cohort_B")

purrr::walk(
  .x = cohort_name,
  .f = ~copy_and_render(
    cohort = .x
  )
)

_quarto_Cohort_A.yml file copied and renamed to _quarto.yml.

Automated Technical Report (Reference)

Output of technical report.

Automated Summary Report (How-to-Guide)

A similar method is used to create a summary report in Word using 📦 flextable.

Preview of summary report.
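The slide above only shows a preview of the report. As a hedged sketch (not the project's actual code), a minimal flextable-to-Word pipeline could look like the following; summary_df, summary_ft and summary_report.docx are invented for illustration:

```r
# Hypothetical variable-coverage summary to be rendered as a Word table.
summary_df <- data.frame(
  Variable   = c("Age", "Sex", "BMI"),
  `Cohort A` = c("Yes", "Yes", "Yes"),
  `Cohort B` = c("Yes", "Yes", "No"),
  check.names = FALSE
)

# Build the flextable and apply light formatting.
summary_ft <- flextable::flextable(summary_df) |>
  flextable::bold(part = "header") |>
  flextable::autofit()

# Write the table into a standalone Word document.
flextable::save_as_docx(summary_ft, path = "summary_report.docx")
```

In the actual workflow such tables are embedded in a Quarto document rendered to docx rather than saved one by one.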

Overview Diagrams

How many variables can each cohort provide?

How many variables can be harmonised?

demographic_list <- list(
  A = c("Age", "Sex",
        "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
        "Smoke Current", "Smoke Past",
        "Have Chest Pain", "Chest Pain Character",
        "Dyspnea",
        "BMI", "Height", "Weight"),
  B = c("Age", "Sex",
        "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
        "Smoke Current", "Smoke Past",
        "Have Chest Pain", "Chest Pain Character",
        "Dyspnea",
        "HDL", "Total Cholesterol", 
        "Triglyceride", "LDL"),
  C = c("Age", "Sex",
        "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
        "Smoke Current", "Smoke Past",
        "Have Chest Pain", "Chest Pain Character",
        "Dyspnea",
        "BMI", "Height", "Weight",
        "HDL", "Total Cholesterol", 
        "Triglyceride", "LDL")
)  

cohort_a_label_name <- "Cohort A\n(n=1000)"
cohort_b_label_name <- "Cohort B\n(n=2000)"
cohort_c_label_name <- "Cohort C\n(n=1500)"

demographic_venn_data <- demographic_list |> 
  ggVennDiagram::Venn() |> 
  ggVennDiagram::process_data()

demographic_venn_regionedge_data <- demographic_venn_data |> 
  ggVennDiagram::venn_regionedge() |> 
  dplyr::mutate(
    name = stringr::str_replace_all(
      .data[["name"]],
      "/",
      " and "
    ),
    name = stringr::str_wrap(.data[["name"]], width = 10),
    name = forcats::fct_reorder(.data[["name"]],
                                nchar(.data[["id"]]))
  ) |> 
  dplyr::rename(`Cohort` = "name")

demographic_venn_label_data <- demographic_venn_data |>
  ggVennDiagram::venn_setlabel() |> 
  dplyr::mutate(
    name = dplyr::case_match(
      .data[["name"]],
      "A" ~ cohort_a_label_name,
      "B" ~ cohort_b_label_name,
      "C" ~ cohort_c_label_name
    )
  ) |> 
  dplyr::rename(`Cohort` = "name")

demographic_venn_regionlabel_data <- demographic_venn_data |> 
  ggVennDiagram::venn_regionlabel() |> 
  dplyr::mutate(
    name = stringr::str_replace_all(
      .data[["name"]],
      "/",
      " and "
    ),
    name = stringr::str_wrap(.data[["name"]], width = 10),
    name = forcats::fct_reorder(.data[["name"]],
                                nchar(.data[["id"]]))
  ) |> 
  dplyr::rename(`Cohort` = "name")

demographic_venn_edge_data <- demographic_venn_data |> 
  ggVennDiagram::venn_setedge()

demographic_venn_diagram <- ggplot2::ggplot() +
  # 1. region count layer
  ggplot2::geom_polygon(
    data = demographic_venn_regionedge_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      fill = .data[["Cohort"]],
      group = .data[["id"]])
  ) +
  # 2. set edge layer
  ggplot2::geom_path(
    data = demographic_venn_edge_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      colour = "black",
      group = .data[["id"]]
    ),    
    show.legend = FALSE
  ) +
  # 3. set label layer
  ggplot2::geom_text(
    data = demographic_venn_label_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      label = .data[["Cohort"]]),
    size = 5.5
  ) +
  # 4. region label layer
  ggplot2::geom_label(
    data = demographic_venn_regionlabel_data,
    mapping = ggplot2::aes(
      x = .data[["X"]], y = .data[["Y"]],
      label = .data[["count"]]),
    size = 7,
    fill = "white"
  ) +
  ggplot2::scale_x_continuous(
    expand = ggplot2::expansion(mult = 0.2)
  ) +
  ggplot2::theme_void() +
  ggplot2::theme(
    text = ggplot2::element_text(size = 20)
  )

A Venn diagram showing common variables across three cohorts.

A Venn diagram does not work for many (> 10) cohorts.

Overview Diagrams

UpSet plots are too complicated for clinicians.

demographic_venn <- tibble::tibble(
  column_name = c("Age", "Sex",
                  "Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
                  "Smoke Current", "Smoke Past",
                  "Have Chest Pain", "Chest Pain Character",
                  "Dyspnea",
                  "BMI", "Height", "Weight",
                  "HDL", "Total Cholesterol", 
                  "Triglyceride", "LDL"),
  `Cohort A` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 1,
                 1, 1, 1,
                 0, 0, 
                 0, 0),
  `Cohort B` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 1,
                 0, 0, 0,
                 1, 1, 
                 1, 1),
  `Cohort C` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 1,
                 1, 1, 1,
                 1, 1, 
                 1, 1),
  `Cohort D` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 0,
                 1, 0, 0,
                 1, 1, 
                 1, 1),
  `Cohort E` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 0,
                 1, 1, 1,
                 1, 1, 
                 0, 0),
  `Cohort F` = c(1, 1,
                 1, 1, 1, 1,
                 1, 1,
                 1, 1,
                 0,
                 1, 1, 1,
                 0, 0, 
                 0, 0),
)

cohort_a_upset_col_name <- "Cohort A\n(n=1000)"
cohort_b_upset_col_name <- "Cohort B\n(n=2000)"
cohort_c_upset_col_name <- "Cohort C\n(n=1500)"
cohort_d_upset_col_name <- "Cohort D\n(n=500)"
cohort_e_upset_col_name <- "Cohort E\n(n=1000)"
cohort_f_upset_col_name <- "Cohort F\n(n=2500)"


demographic_upset_data <- demographic_venn |>
  dplyr::rename(
    !!cohort_a_upset_col_name := "Cohort A",
    !!cohort_b_upset_col_name := "Cohort B",
    !!cohort_c_upset_col_name := "Cohort C",
    !!cohort_d_upset_col_name := "Cohort D",
    !!cohort_e_upset_col_name := "Cohort E",
    !!cohort_f_upset_col_name := "Cohort F"
  )

upset_plot <- ComplexUpset::upset(
  demographic_upset_data,
  c(cohort_a_upset_col_name, 
    cohort_b_upset_col_name, 
    cohort_c_upset_col_name,
    cohort_d_upset_col_name,
    cohort_e_upset_col_name,
    cohort_f_upset_col_name),
  base_annotations = list(
    `Intersection size` = 
      ComplexUpset::intersection_size(
        text = list(size = 9, nudge_y = 0.5),
        text_colors = c(on_background='black', 
                        on_bar='white')
      ) + 
      ggplot2::annotate(
        size = 5.5,
        geom = 'text', 
        x = Inf, 
        y = Inf,
        label = paste('Total:', nrow(demographic_venn)),
        vjust = 1, 
        hjust = 1
      ) + 
      ggplot2::theme(
        axis.title.y = ggplot2::element_text(angle = 0, size = 20),
        axis.title.x = ggplot2::element_text(angle = 0, size = 20), 
        text = ggplot2::element_text(size = 20)
      ) +
      ggplot2::labs(y = "",
                    title = "Intersection size")
      ),
  set_sizes = ( 
    ComplexUpset::upset_set_size() +
    ggplot2::geom_text(
      size = 5.5,
      mapping = ggplot2::aes(label= ggplot2::after_stat(.data[["count"]])), 
      hjust = 1.5, 
      stat = 'count') +
    ggplot2::expand_limits(y = 30) + 
    ggplot2::theme(
        axis.title.y = ggplot2::element_text(angle = 0, size = 15),
        axis.title.x = ggplot2::element_text(angle = 0, size = 15), 
        text = ggplot2::element_text(size = 15)
    ) +
    ggplot2::labs(y = "Number of variables")
  ),
  sort_intersections_by = "degree",
  sort_sets = FALSE,
  name = "", 
  width_ratio = 0.25,
  themes = ComplexUpset::upset_default_themes(
    text = ggplot2::element_text(size = 15)
    )
  )

An UpSet plot showing common variables across six cohorts.

Cannot answer follow-up questions:

How many cohorts provide patients’ blood lipid information, and how many patients have it?

Overview Diagrams

Create a “heatmap” using Microsoft PowerPoint.

A modified heatmap to show common variables.
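For those who prefer to stay in R, a similar variable-availability "heatmap" can be sketched with ggplot2::geom_tile(). This is only an illustrative alternative to the PowerPoint figure; the availability tibble below is a made-up subset, not the project's data:

```r
library(ggplot2)

# Made-up availability matrix: 1 = variable collected by the cohort.
availability <- tibble::tibble(
  column_name = c("Age", "Sex", "BMI", "LDL"),
  `Cohort A`  = c(1, 1, 1, 0),
  `Cohort B`  = c(1, 1, 0, 1)
)

# Reshape to long format: one row per (variable, cohort) pair.
heatmap_data <- availability |>
  tidyr::pivot_longer(
    cols = dplyr::starts_with("Cohort"),
    names_to = "cohort",
    values_to = "available"
  )

heatmap_plot <- ggplot2::ggplot(
  heatmap_data,
  ggplot2::aes(x = cohort, y = column_name, fill = factor(available))
) +
  ggplot2::geom_tile(colour = "white") +
  ggplot2::scale_fill_manual(
    values = c(`0` = "grey90", `1` = "steelblue"),
    labels = c("Not collected", "Collected"),
    name = NULL
  ) +
  ggplot2::labs(x = NULL, y = NULL)
```

The PowerPoint version was preferred here because clinicians can annotate it directly.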

Summary

The data harmonization process: Study data variables collected from different sources need to be mapped to one another, classified into the generalized concepts they represent, and transformed into unified harmonized variables for analysis

R administration using pipes, namespacing, the .Renviron and .Rprofile files, the R packages renv and pak, and the Posit Public Package Manager.

R project organisation using a research compendium, the R packages here, fs and rcompendium, and always starting the project with a blank slate.

Picture by Allison Horst about Quarto turning programming code to documents.

Summary

Variable Mapping Reporting Workflow: start by introducing the mapping procedure, write code to do the mapping and validation, show that the code works, and clean up the data for merging.

An interactive output showing the difference between two datasets from the R package daff.

Four major documentation types in terms of data harmonisation.

A modified heatmap to show common variables.

Thank you