14th November 2025
Research Officer from National Heart Centre Singapore who collects, cleans and harmonises clinical data.
Taming the Data Beast from “Cleaning Medical Data with R” workshop by Shannon Pileggi, Crystal Lewis and Peter Higgins presented at R/Medicine 2023. Illustrated by Allison Horst.
Data harmonisation is part of the data wrangling process.
Image from Mallya et al. Circ Cardiovasc Qual Outcomes. 2023 Nov; 16(11):e009938 doi: 10.1161/CIRCOUTCOMES.123.009938.
Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.
snapshot from Ready for QA | MonkeyUser 2SP Animation Video from MonkeyUser.com.
Turn my sorrow into opportunities.
Some data fields just cannot be planned in advance.
While there are 📦 to facilitate data harmonisation,
retroharmonize for survey data.
Rmonize for epidemiological data.
psHarmonize for health and education data.
There are limited resources on how to make a data harmonisation report.
A template to offer a systematic way to report data harmonisation processes.
R Programming Logo from CleanPNG and Quarto Hex Sticker from Posit.
What Did We Forget
to Teach You about R?
Image from Project Oriented Workflows slides from What They Forgot to Teach You About R.
We will share a glimpse of “Everything else” R and its friends can do.
R \(\geq\) 4.1.0 has a native "pipe" operator |> to make code easier to read.
Without |>
With |>
Inspired from the Bash Pipe |
terminal
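For comparison, a minimal Bash pipeline (the values are made up for illustration):

```shell
# Each | sends the previous command's output into the next:
# emit three values, sort them, then count the distinct ones.
printf 'b\na\nb\n' | sort | uniq | wc -l
```

The output flows left to right through the pipeline, exactly the reading order the R pipe borrows.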
More details between the two pipes in Understanding the native R pipe |>.
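For example, the same computation written both ways (the numbers are made up for illustration):

```r
# Without |>: nested calls read inside-out.
nested <- round(sqrt(sum(c(1, 4, 11))), digits = 1)

# With |>: each step reads left to right, top to bottom.
piped <- c(1, 4, 11) |>
  sum() |>
  sqrt() |>
  round(digits = 1)

nested  # 4
piped   # 4
```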
dplyr::select() tells R explicitly to use the function select from the package dplyr
can help to avoid name conflicts (e.g., MASS::select())
does not require library(dplyr)
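A small illustration with base packages (stats::filter(), a time-series filter, is a classic masking victim once dplyr::filter() is attached):

```r
# The :: prefix states exactly which package's function runs,
# and works even without library(stats).
x <- 1:5
smoothed <- stats::filter(x, filter = rep(1/3, 3))
smoothed  # centred moving average: NA 2 3 4 NA
```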
Before R starts up in a given project, it performs the following steps.
We can customise Step 1 and 2 using these two main text files.
.Renviron (Contains environment variables to be set in R sessions.)
.Rprofile (Contains R code to be run in each session.)
.Renviron
✅ R-specific environment variables.
✅ API keys or other secrets
❌ R code
APPDATA="D:/Jeremy/PortableR/RAppData/Roaming"
LOCALAPPDATA="D:/Jeremy/PortableR/RAppData/Local"
TEMP="D:/Jeremy/PortableR/RPortableWorkDirectory/temp"
TMP="D:/Jeremy/PortableR/RPortableWorkDirectory/temp"
_R_CHECK_SYSTEM_CLOCK_=0
RENV_CONFIG_PAK_ENABLED=TRUE
CONNECT_API_KEY=DaYK2hBUriSBYUEGIAiyXsRJHSjTYJN3
DB_USER=elephant
DB_PASS=p0stgr3s
~/.Renviron
path/to/your/project/.Renviron
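A sketch of how these variables reach your code (Sys.setenv() stands in for what R does with each NAME=value line at startup; DB_USER is the example above):

```r
# At startup, R turns each NAME=value line in .Renviron into an
# environment variable. Scripts read it with Sys.getenv(), so the
# secret itself never appears in the code.
Sys.setenv(DB_USER = "elephant")        # simulating a .Renviron line
user <- Sys.getenv("DB_USER")           # "elephant"
key  <- Sys.getenv("NOT_SET_ANYWHERE")  # "" when a variable is absent
```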
.Rprofile
✅ set a default CRAN mirror.
✅ customize R prompt.
source("renv/activate.R")
options(
repos = c(
P3M_20250306 = "https://packagemanager.posit.co/cran/2025-10-13",
ropensci = "https://ropensci.r-universe.dev",
janmarvin = "https://janmarvin.r-universe.dev",
CRAN = 'https://cloud.r-project.org'
)
)
if (interactive()) prompt::set_prompt(prompt::prompt_fancy)
Prompt Resources:
📦 – prompt.
xinxxxin/rprofile-custom-prompt.R.
Me, Myself and my Rprofile.
Prompt-moting a custom R prompt.
📦 – Structure
Go for binary 📦 (Windows | macOS) in CRAN.
Consider installing binaries (Windows | macOS | Linux) from Posit Package Manager.
Set the repos option in your .Rprofile file.
> install.packages("parallelly", repos = "https://cran.r-project.org")
Installing package into ‘C:/Users/edavi/Documents/R/win-library/4.1’
(as ‘lib’ is unspecified)
trying URL 'https://cran.r-project.org/bin/windows/contrib/4.1/parallelly_1.32.1.zip'
Content type 'application/zip' length 306137 bytes (298 KB)
downloaded 298 KB
package ‘parallelly’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\edavi\AppData\Local\Temp\Rtmpa2s3e8\downloaded_packages
> install.packages("renv", repos="https://cran.r-project.org")
Installing package into ‘/Users/edavidaja/Library/R/x86_64/4.1/library’
(as ‘lib’ is unspecified)
trying URL 'https://cran.r-project.org/bin/macosx/contrib/4.1/renv_0.15.5.tgz'
Content type 'application/x-gzip' length 1866760 bytes (1.8 MB)
==================================================
downloaded 1.8 MB
The downloaded binary packages are in
/var/folders/b5/fl4ff68d23s148tg1_1gnflc0000gn/T//RtmpMk69B0/downloaded_packages
> install.packages("remotes")
Installing package into ‘C:/Users/WDAGUtilityAccount/AppData/Local/R/win-library/4.2’
(as ‘lib’ is unspecified)
trying URL 'https://p3m.dev/cran/latest/bin/windows/contrib/4.2/remotes_2.4.2.zip'
Content type 'binary/octet-stream' length 399930 bytes (390 KB)
downloaded 390 KB
package ‘remotes’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\WDAGUtilityAccount\AppData\Local\Temp\RtmpA1edRi\downloaded_packages
Go for source 📦 if you need the latest version urgently.
Some 📦 only have source versions in CRAN.
You will need additional tools and dependencies.
Run devtools::has_devel() in console.
## Your system is ready to build packages!
pak
Consider using pak::pkg_install() instead of install.packages().
renv
Most commonly used 📦 to create isolated project environments.
renv
Some advice …
Use renv::init(bare = TRUE) to initiate renv with an empty library and then install 📦 manually.
To use pak in the renv environment, set RENV_CONFIG_PAK_ENABLED=TRUE in the .Renviron file so that renv::install() uses 📦 pak at the backend to install 📦.
Use a .renvignore file to list files/folders to ignore and speed up the snapshot process (renv::snapshot()) that updates the renv.lock file.
renv
If you are frustrated about renv::restore() … please
watch Practical {renv} and read Practical {renv} Materials
Require
Organise your project as you go instead of waiting for "tomorrow".
Data is cheap but time is expensive.
Research Compendium by Scriberia from The Turing Way project and Project Layout from Good enough practices in scientific computing.
My harmonisation template organisation is based on the 📦 rcompendium but there are others (orderly, prodigenr and workflowr) as well.
It is better to organise your custom functions into a 📦 to make your code easier to re-use, document, and test.
Jenny will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
📦 with file system functions (fs and here).
❌ Avoid typing absolute paths in scripts. Let 📦 here do it for you.
User’s home directory
Project directory
here::here() does not create directories; that’s your job.
❌ Avoid typing / or \ manually. Let 📦 fs do it for you.
[1] "data/raw-data.csv"
C:/Users/Jeremy/data/raw-data.csv
[1] "D:/Jeremy/PortableR/RPortableWorkDirectory/hat_2025/data/raw-data.csv"
✅ Use relative path within the project directory.
Works on my machine, works on yours!
https://rstats-wtf.github.io/wtf-project-oriented-workflow-slides
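As a base-R illustration of the same principle (fs::path() and here::here() from the slides add project-awareness on top), file.path() assembles the separator for you:

```r
# Build the path from parts; R inserts the separator, so the same
# code works on Windows, macOS and Linux.
p <- file.path("data", "raw-data.csv")
p  # "data/raw-data.csv"
```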
Quarto is an open-source software that weaves narrative and programming code together to produce elegantly formatted output as documents (in HTML, Word, PDF), presentations, books, web pages, and more.
Artwork from “Hello, Quarto” keynote by Julia Lowndes and Mine Çetinkaya-Rundel, presented at RStudio Conference 2022. Illustrated by Allison Horst.
Create a Quarto document
A Quarto file is a plain text file with the extension .qmd containing three important types of content:
An (optional) YAML header surrounded by ---s.
Chunks of R code surrounded by ```.
Text mixed with simple formatting like # heading and _italics_.
Simple Quarto file example from R for Data Science (2e) Chapter 28 Quarto
How to change document fonts & formats in Quarto (Word/Docx)
Preparing RStudio to Generate PDF Files with Quarto and tinyTeX
The collaborator can send the raw data once and you keep updating the cleaned data for harmonisation.
Dataset vectors by Vectora Artworks from Vecteezy.
Cheerful Businessman designed by Iftikhar Alam from Vecteezy and Medical Doctor Man from Creazilla.
The collaborator can update the raw data: for example, adding new clinical data, adding more patients, or correcting errors.
Dataset vectors by Vectora Artworks from Vecteezy.
A new version means new bugs or reopened issues to fix. Is there an automated way to catch warnings/issues when reading these updated files?
Dataset vectors by Vectora Artworks from Vecteezy. Anticipate from MonkeyUser.com
Is there an automated way to catch warnings/issues when reading csv files?
If there are issues with the data, the output of vroom::problems will be a tibble.
# A tibble: 4 × 5
row col expected actual file
<int> <int> <chr> <chr> <chr>
1 2 2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/hat…
2 4 2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/hat…
3 10 2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/hat…
4 17 2 an integer missing D:/Jeremy/PortableR/RPortableWorkDirectory/hat…
To check for this automatically, we can use pointblank::expect_row_count_match.
Error: Row counts for the two tables did not match.
The `expect_row_count_match()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)
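The underlying idea can be sketched in base R (csv_text is made-up data; vroom and pointblank do this far more robustly):

```r
# Made-up CSV with one value that cannot be parsed as an integer.
csv_text <- "id,age\nA,34\nB,missing\nC,51"

# Read everything as character first, then attempt the strict
# conversion; values that fail to parse become NA.
raw <- read.csv(text = csv_text, colClasses = "character")
age <- suppressWarnings(as.integer(raw$age))

# Rows that held a value but failed to parse -- the "problems".
bad_rows <- which(is.na(age) & !is.na(raw$age))
bad_rows  # row 2 ("missing")
```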
Here is a case with no issues.
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
Is there an automated way to catch warnings/issues when reading Excel files?
We can wrap the Excel read in testthat::expect_no_condition.
Error: Expected `... <- NULL` to run without any conditions.
ℹ Actually got a <simpleWarning> with text:
Expecting numeric in B7 / R7C2: got 'missing'
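Under the hood this relies on R's condition system. A base-R sketch of the trapping idea (the as.integer() coercion is a stand-in, not the actual Excel read):

```r
# Run an expression while recording whether it signalled a warning.
got_warning <- FALSE
value <- withCallingHandlers(
  as.integer("missing"),  # stand-in for a strict typed read
  warning = function(w) {
    got_warning <<- TRUE
    invokeRestart("muffleWarning")  # suppress after recording it
  }
)
got_warning  # TRUE
value        # NA
```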
However, wrapping the read in testthat::expect_no_condition means that you will lose the pipe workflow.
testthat::expect_no_condition(
cohort_data_excel <- readxl::read_excel(
path = here::here("data-raw", "Cohort_Excel",
"data_to_harmonise_age_issue_fixed.xlsx"),
sheet = "Sheet1",
col_types = c("text", "numeric")
)
)
cohort_data_excel <- cohort_data_excel |>
# Check if Serial Number is unique
pointblank::rows_distinct(
columns = "Serial Number",
)
We can use the tee pipe operator %T>% from 📦 magrittr.
With Issues
Error: Expected `.` to run without any conditions.
ℹ Actually got a <simpleWarning> with text:
Expecting numeric in B7 / R7C2: got 'missing'
No Issues
cohort_data_excel_2 <- readxl::read_excel(
path = here::here("data-raw", "Cohort_Excel",
"data_to_harmonise_age_issue_fixed.xlsx"),
sheet = "Sheet1",
col_types = c("text", "numeric")
) %T>%
testthat::expect_no_condition() |>
# Check if Serial Number is unique
pointblank::rows_distinct(
columns = "Serial Number",
)
Let's take this data set as an example.
cohort_csv_data <- vroom::vroom(
file = here::here("data-raw",
"Cohort_csv",
"data_to_harmonise.csv"),
delim = ",",
col_select = 1:8,
show_col_types = FALSE,
col_types = list(
ID = vroom::col_character(),
Age = vroom::col_integer(),
Sex = vroom::col_character(),
Height = vroom::col_double(),
Weight = vroom::col_double(),
`Smoke History` = vroom::col_character(),
`Chest Pain Character` = vroom::col_character(),
Dyspnea = vroom::col_character()
)
) |>
dplyr::rename(cohort_unique_id = "ID") |>
# Remove rows when the ID value is NA
dplyr::filter(!is.na(.data[["cohort_unique_id"]])) |>
# Remove white spaces in column names
dplyr::rename_all(stringr::str_trim) |>
# Check if cohort id is unique
pointblank::rows_distinct(
columns = "cohort_unique_id",
)
cohort_csv_data |>
vroom::problems() |>
pointblank::expect_row_count_match(count = 0)
Let the reader know how the collaborator's data Smoke History is going to be mapped.
smoking_data <- cohort_csv_data |>
dplyr::select(c("cohort_unique_id",
"Smoke History")) |>
dplyr::mutate(
smoke_current = dplyr::case_when(
is.na(.data[["Smoke History"]]) ~ "-1",
.data[["Smoke History"]] == "non-smoker" ~ "0",
.data[["Smoke History"]] == "past smoker" ~ "0",
.data[["Smoke History"]] == "current smoker" ~ "1",
.default = NA_character_
),
smoke_current = forcats::fct_relevel(
.data[["smoke_current"]],
c("0", "1")),
smoke_past = dplyr::case_when(
is.na(.data[["Smoke History"]]) ~ "-1",
.data[["Smoke History"]] == "non-smoker" ~ "0",
.data[["Smoke History"]] == "past smoker" ~ "1",
.data[["Smoke History"]] == "current smoker" ~ "0",
.default = NA_character_
),
smoke_past = forcats::fct_relevel(
.data[["smoke_past"]],
c("0", "1")),
`Smoke History` = forcats::fct(
.data[["Smoke History"]]
)
)
smoking_data <- smoking_data |>
pointblank::col_vals_in_set(
columns = c("smoke_current", "smoke_past"),
set = c("0", "1", "-1")
) |>
pointblank::col_vals_expr(
expr = pointblank::expr(
(.data[["smoke_current"]] == "1" & .data[["smoke_past"]] == "0") |
(.data[["smoke_current"]] == "-1" & .data[["smoke_past"]] == "-1") |
(.data[["smoke_current"]] == "0" & .data[["smoke_past"]] %in% c("0", "1"))
)
)
Make use of Quarto's parameters, conditional content and !expr knitr engine syntax to choose what code/items to run/display in your html, pdf or word report.
Parameterized Quarto Reports Improve Understanding of Soil Health by Jadey Ryan.
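A minimal sketch of the mechanism (the file and chunk are illustrative; the `show_table` parameter mirrors the one declared in this deck's own index.qmd):

````markdown
---
format: html
params:
  show_table: true
---

::: {.content-visible when-format="html"}
This paragraph only appears in the HTML report.
:::

```{r}
#| eval: !expr params$show_table
# This chunk runs only when rendered with show_table: true.
# cohort_data is a hypothetical object for illustration.
head(cohort_data)
```
````

Rendering with `quarto render report.qmd -P show_table:false` skips the chunk without touching the source.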
Html Output
Suppose we have completed harmonising a batch of clinical data.
How can we merge them without issues of missing rows or additional columns?
unmatched = "error" in dplyr::inner_join helps to avoid patients with no match.
join_specification <- dplyr::join_by("cohort_unique_id")
demo_behave_data <- cohort_csv_data |>
dplyr::select(c("cohort_unique_id")) |>
dplyr::inner_join(age_gender_data,
by = join_specification,
unmatched = "error",
relationship = "one-to-one") |>
dplyr::inner_join(body_measurement_data,
by = join_specification,
unmatched = "error",
relationship = "one-to-one") |>
dplyr::inner_join(smoking_data,
by = join_specification,
unmatched = "error",
relationship = "one-to-one") |>
dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
~samp_id, ~species, ~island,
1, "Adelie", "Torgersen",
2, "Gentoo", "Biscoe",
)
weight_extra <- tibble::tribble(
~samp_id, ~body_mass_g,
1, 3220,
2, 4730,
4, 4725
)
three_penguins |>
dplyr::inner_join(
y = weight_extra,
by = dplyr::join_by("samp_id"),
unmatched = "error"
  )
Error in `dplyr::inner_join()`:
! Each row of `y` must be matched by `x`.
ℹ Row 3 of `y` was not matched.
Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality
unmatched = "error" in dplyr::inner_join helps to avoid patients with no match.
join_specification <- dplyr::join_by("cohort_unique_id")
demo_behave_data <- cohort_csv_data |>
dplyr::select(c("cohort_unique_id")) |>
dplyr::inner_join(age_gender_data,
by = join_specification,
unmatched = "error",
relationship = "one-to-one") |>
dplyr::inner_join(body_measurement_data,
by = join_specification,
unmatched = "error",
relationship = "one-to-one") |>
dplyr::inner_join(smoking_data,
by = join_specification,
unmatched = "error",
relationship = "one-to-one") |>
dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
~samp_id, ~species, ~island,
1, "Adelie", "Torgersen",
2, "Gentoo", "Biscoe",
3, "Chinstrap", "Dream"
)
weight_extra <- tibble::tribble(
~samp_id, ~body_mass_g,
1, 3220,
3, 4725
)
three_penguins |>
dplyr::inner_join(
y = weight_extra,
by = dplyr::join_by("samp_id"),
unmatched = "error"
  )
Error in `dplyr::inner_join()`:
! Each row of `x` must have a match in `y`.
ℹ Row 2 of `x` does not have a match.
Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality
relationship = "one-to-one" in dplyr::inner_join helps to avoid patients with multiple match.
join_specification <- dplyr::join_by("cohort_unique_id")
demo_behave_data <- cohort_csv_data |>
dplyr::select(c("cohort_unique_id")) |>
dplyr::inner_join(age_gender_data,
by = join_specification,
unmatched = "error",
relationship = "one-to-one") |>
dplyr::inner_join(body_measurement_data,
by = join_specification,
unmatched = "error",
relationship = "one-to-one") |>
dplyr::inner_join(smoking_data,
by = join_specification,
unmatched = "error",
relationship = "one-to-one") |>
dplyr::relocate(c("bsa_m2", "bmi"),
                  .after = "sex")
three_penguins <- tibble::tribble(
~samp_id, ~species, ~island,
1, "Adelie", "Torgersen",
2, "Gentoo", "Biscoe",
3, "Chinstrap", "Dream"
)
weight_extra <- tibble::tribble(
~samp_id, ~body_mass_g,
1, 3220,
2, 4730,
2, 4725,
3, 4000
)
three_penguins |>
dplyr::inner_join(
y = weight_extra,
by = dplyr::join_by("samp_id"),
relationship = "one-to-one"
  )
Error in `dplyr::inner_join()`:
! Each row in `x` must match at most 1 row in `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.
Reference: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/#improved-and-expanded-_join-functionality
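The invariant behind relationship = "one-to-one" can also be checked by hand in base R (weights is a made-up table):

```r
# Made-up lookup table where the join key is not unique.
weights <- data.frame(
  samp_id     = c(1, 2, 2, 3),  # samp_id 2 appears twice
  body_mass_g = c(3220, 4730, 4725, 4000)
)

# A one-to-one join requires the key to be unique on both sides;
# anyDuplicated() returns 0 when there are no duplicates.
key_is_unique <- anyDuplicated(weights$samp_id) == 0
key_is_unique  # FALSE: joining on samp_id would duplicate rows
```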
Use pointblank::has_columns to ensure we only have harmonised variables.
testthat::expect_false(
pointblank::has_columns(
demo_behave_data,
columns = c(
dplyr::ends_with(".x"),
dplyr::ends_with(".y")
)
)
)
testthat::expect_equal(
ncol(demo_behave_data), 9
)
testthat::expect_true(
pointblank::has_columns(
demo_behave_data,
columns = c(
"age_years", "sex",
"height_cm", "weight_kg", "bsa_m2", "bmi",
"smoke_current", "smoke_past"
)
)
)
three_penguins <- tibble::tribble(
~samp_id, ~species, ~island,
1, "Adelie", "Torgersen",
2, "Gentoo", "Biscoe",
3, "Chinstrap", "Dream"
)
weight_extra <- tibble::tribble(
~samp_id, ~island,
1, "Torgersen",
2, "Biscoe",
3, "Dream"
)
three_penguins <- three_penguins |>
dplyr::inner_join(
y = weight_extra,
by = dplyr::join_by("samp_id"),
unmatched = "error",
relationship = "one-to-one"
)
three_penguins |>
pointblank::has_columns(
columns = c(
dplyr::ends_with(".x"),
dplyr::ends_with(".y")
)
)
[1] TRUE
[1] "samp_id" "species" "island.x" "island.y"
Use 📦 daff to compare different versions of harmonised datasets.
Daff Comparison: 'data1' vs. 'data2'
B:A A:B C:-
! : ---
@@ col1 Name col2
1:1 1 P1 11
2:2 -> 2->3 P2 13
-:3 +++ 3 P6 <NA>
3:- --- 3 P3 14
4:4 -> 4->6 P4 15
5:5 -> 5->9 P5 17
Use summary() to return a summary list.
Data diff: 'data1' vs. 'data2'
# Modified Reordered Deleted Added
Rows 5 3 0 1 1
Columns 3 --> 2 1 1 1 0
Data diff: 'data1' vs. 'data2'
# Modified Reordered Deleted Added
Rows 5 0 0 0 0
Columns 3 0 0 0 0
Use the summary list and pointblank::expect_col_vals_in_set to do the validation automatically.
tibble::tibble(
row_deletes = compare_different_summary$row_deletes,
row_inserts = compare_different_summary$row_inserts,
row_updates = compare_different_summary$row_updates,
row_reorders = compare_different_summary$row_reorders,
col_deletes = compare_different_summary$col_deletes,
col_inserts = compare_different_summary$col_inserts,
col_updates = compare_different_summary$col_updates,
col_reorders = compare_different_summary$col_reorders,
) |>
pointblank::expect_col_vals_in_set(
columns = c(
"row_deletes", "row_inserts",
"row_updates", "row_reorders",
"col_deletes", "col_inserts",
"col_updates", "col_reorders"
),
set = c(0)
)
Error: Exceedance of failed test units where values in `row_deletes` should have been in the set of `0`.
The `expect_col_vals_in_set()` validation failed beyond the absolute threshold level (1).
* failure level (1) >= failure threshold (1)
tibble::tibble(
row_deletes = compare_same_summary$row_deletes,
row_inserts = compare_same_summary$row_inserts,
row_updates = compare_same_summary$row_updates,
row_reorders = compare_same_summary$row_reorders,
col_deletes = compare_same_summary$col_deletes,
col_inserts = compare_same_summary$col_inserts,
col_updates = compare_same_summary$col_updates,
col_reorders = compare_same_summary$col_reorders,
) |>
pointblank::expect_col_vals_in_set(
columns = c(
"row_deletes", "row_inserts",
"row_updates", "row_reorders",
"col_deletes", "col_inserts",
"col_updates", "col_reorders"
),
set = c(0)
)
One variable mapping report takes at least one page.
On average, a clinical trial will have a few hundred variables.
A harmonisation report can therefore run to at least a few hundred pages for each cohort.
There is a need to automate the creation of these reports.
Businessman in pile of documents asking for help by Amonrat Rungreangfangsai
To make a Quarto book or website, we need a _quarto.yml and index.qmd file
_quarto.yml is a configuration file to tell Quarto to create a book.
_quarto.yml
---
project:
type: book
output-dir: reports/Cohort_B
book:
downloads: [pdf, docx]
title: "Harmonisation Template for Cohort B"
author: "My Name"
navbar:
search: true
sidebar:
collapse-level: 1
chapters:
- index.qmd
- part: Cohort B Cleaning
chapters:
- codes/Cohort_B/00_R_Package_And_Environment.qmd
- codes/Cohort_B/01_Read_Cohort_B_Data.qmd
- codes/Cohort_B/02_Extract_Demographic.qmd
- codes/Cohort_B/03_Export_To_Excel.qmd
crossref:
chapters: false
fig-title: Figure # (default is "Figure")
tbl-title: Table # (default is "Table")
fig-prefix: Figure # (default is "Figure")
tbl-prefix: Table # (default is "Table")
ref-hyperlink: true # (default is true)
title-delim: ":" # (default is ":")
bibliography: references.bib
csl: csl_file.csl
format:
html:
toc: true
toc-depth: 5
toc-location: right
toc-expand: true
number-sections: true
number-depth: 5
smooth-scroll: true
theme:
light:
- flatly
#- custom.scss
dark:
- solar
docx:
reference-doc: custom-reference.docx
toc: true
toc-depth: 5
number-sections: true
number-depth: 5
prefer-html: true
highlight-style: github
pdf:
pdf-engine: xelatex
documentclass: scrreprt
papersize: a4
toc-depth: 5
number-sections: true
number-depth: 5
keep-tex: false
include-in-header:
text: |
\usepackage{fvextra}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{breaklines,commandchars=\\\{\}}
\DefineVerbatimEnvironment{OutputCode}{Verbatim}{breaklines,commandchars=\\\{\}}
include-before-body:
text: |
\begin{flushleft}
\begin{sloppypar}
\RecustomVerbatimEnvironment{verbatim}{Verbatim}{
showspaces = false,
showtabs = false,
breaksymbolleft={}, % Needs package fvextra
breaklines
% Note: setting commandchars=\\\{\} here will cause an error
}
include-after-body:
text: |
\end{sloppypar}
\end{flushleft}
---
The index.qmd file gives the preface (homepage) content of the Quarto book (website). It is a compulsory file needed for the rendering to work.
index.qmd
---
date: "2025-03-10"
format:
html:
code-fold: true
freeze: false
params:
show_table: TRUE
---
```{r}
#| label: output type
#| echo: false
#| warning: false
#| message: false
out_type <- knitr::opts_chunk$get("rmarkdown.pandoc.to")
```
# Preface {.unnumbered .unlisted}
Here is the documentation of the data harmonisation step generated using [Quarto](https://quarto.org/). To learn more about Quarto books visit <https://quarto.org/docs/books>.
## File Structure
Here is the file structure of the project used to generate the document.
```
harmonisation/ # Root of the project template.
|
├── .quarto/ (not in repository) # Folder to keep intermediate files/folders
| # generated when Quarto renders the files.
|
├── archive/ # Folder to keep previous books and harmonised data.
| |
│ ├── reports/ # Folder to keep previous versions of
| | | # data harmonisation documentation.
| | |
| | ├── {some_date}_batch/ # Folder to keep {some_date} version of
| | | # data harmonisation documentation.
| | |
| | └── Flowchart.xlsx # Flowchart sheet to record version control.
| |
| └── harmonised/ # Folder to keep previous version of harmonised data.
| |
| ├── {some_date}_batch/ # Folder to keep {some_date} version of
| | # harmonised data.
| |
| └── Flowchart.xlsx # Flowchart sheet to record version control.
|
├── codes/ # Folder to keep R/Quarto scripts
| | # to run data harmonisation.
| |
│ ├── {cohort name}/ # Folder to keep Quarto scripts to run
| | | # data cleaning, harmonisation
| | | # and output them for each cohort.
| | |
| | └── preprocessed_data/ # Folder to keep preprocessed data.
| |
│ ├── harmonisation_summary/ # Folder to keep Quarto scripts to create
| | # data harmonisation summary report.
| |
│ ├── output/ # Folder to keep harmonised data.
| |
| ├── cohort_harmonisation_script.R # R script to render each {cohort name}/ folder.
| | # folder into html, pdf and word document.
| |
| └── harmonisation_summary_script.R # R script to render the {harmonisation_summary}/
| # folder into word document.
│
├── data-raw/ # Folder to keep cohort raw data (.csv, .xlsx, etc.)
| |
│ ├── {cohort name}/ # Folder to keep cohort raw data.
| | |
| | ├── {data_dictionary} # Data dictionary file that corresponds to the
| | | # cohort raw data. Can be one provided by the
| | | # collaborator or by us.
| | |
| | └── Flowchart.xlsx # Flowchart sheet to record version control.
| |
| ├── data-dictionary/ # Folder to keep data dictionary
| | | # used for harmonising data.
| | |
| | └── Flowchart.xlsx # Flowchart sheet to record version control.
| |
| └── data-input/ # Folder to keep data input file
| | # for collaborators to fill in.
| |
| └── Flowchart.xlsx # Flowchart sheet to record version control.
|
├── docs/ # Folder to keep R functions documentation
| # generated using pkgdown:::build_site_external().
|
├── inst/ # Folder to keep arbitrary additional files
| | # to include in the project.
| |
| └── WORDLIST # File generated by spelling::update_wordlist()
|
├── man/ # Folder to keep R functions documentation
| | # generated using devtools::document().
| |
│ ├── {fun-demo}.Rd # Documentation of the demo R function.
| |
│ └── harmonisation-template.Rd # High-level documentation.
|
├── R/ # Folder to keep R functions.
| |
│ ├── {fun-demo}.R # Script with R functions.
| |
│ └── harmonisation-package.R # Dummy R file for high-level documentation.
│
├── renv/ (not in repository) # Folder to keep all packages
| # installed in the renv environment.
|
├── reports/ # Folder to keep the most recent data harmonisation
| # documentation.
|
├── templates/ # Folder to keep template files needed to generate
| | # data harmonisation documentation efficiently.
| |
| ├── quarto-yaml/ # Folder to keep template files to generate
| | | # data harmonisation documentation structure
| | | # in Quarto.
| | |
│ | ├── _quarto_{cohort name}.yml # Quarto book template data harmonisation documentation
| | | # for {cohort name}.
| | |
| | └── _quarto_summary.yml # Quarto book template data harmonisation summary.
| |
| └── index-qmd/ # Folder to keep template files to generate
| | # the preface page of the data harmonisation
| | # documentation.
| |
| ├── _index_report.qmd # Preface template for each cohort data harmonisation
| | # report.
| |
| └── _index_summary.qmd # Preface template for data harmonisation
| # summary report.
|
├── tests/ # Folder to keep test unit files.
| # Files will be used by R package testthat.
|
├── .Rbuildignore # List of files/folders to be ignored while
│ # checking/installing the package.
|
├── .Renviron (not in repository) # File to set environment variables.
|
├── .Rprofile (not in repository) # R code to be run when R starts up.
| # It is run after the .Renviron file is sourced.
|
├── .Rhistory (not in repository) # File containing R command history.
|
├── .gitignore # List of files/folders to be ignored while
│ # using the git workflow.
|
├── .lintr # Configuration for linting
| # R projects and packages using linter.
|
├── .renvignore # List of files/folders to be ignored when
│ # renv is doing its snapshot.
|
├── DESCRIPTION[*] # Overall metadata of the project.
|
├── LICENSE # Content of the MIT license generated via
| # usethis::use_mit_license().
|
├── LICENSE.md # Content of the MIT license generated via
| # usethis::use_mit_license().
|
├── NAMESPACE # List of functions users can use or imported
| # from other R packages. It is generated
| # by devtools::document().
│
├── README.md # GitHub README markdown file generated by Quarto.
|
├── README.qmd # GitHub README quarto file used to generate README.md.
|
├── _pkgdown.yml # Configuration for R package documentation
| # using pkgdown:::build_site_external().
|
├── _quarto.yml # Configuration for Quarto book generation.
| # It is also the project configuration file.
|
├── csl_file.csl # Citation Style Language (CSL) file to ensure
| # citations follows the Lancet journal.
|
├── custom-reference.docx # Microsoft word template for data harmonisation
| # documentation to Word.
|
├── harmonisation_template.Rproj # RStudio project file.
|
├── index.qmd # Preface page of Quarto book content.
|
├── references.bib # Bibtex file for Quarto book.
|
└── renv.lock # Metadata of R packages installed generated
# using renv::snapshot().
[*] These files are automatically created but user needs to manually add some information.
```
The collaborator wants different ways to report how data harmonisation is done.
The documentation system by Divio
The collaborator wants different ways to report how data harmonisation is done.
We create a _quarto.yml file and relevant Quarto files for each cohort.
We create an index.qmd file for each kind of report.
Create a script to generate technical reports in pdf, word and html for each cohort.
copy_and_render <- function(
cohort
) {
# Copy quarto.yml file
# for each cohort
quarto_yml_file <- paste0(
"_quarto_",
cohort,
".yml"
)
fs::file_copy(
path = here::here(
"templates",
"quarto-yaml",
quarto_yml_file),
new_path = here::here("_quarto.yml"),
overwrite = TRUE
)
# Render each cohort
quarto::quarto_render(
as_job = FALSE
)
}
cohort_name <- c("Cohort_A",
"Cohort_B")
purrr::walk(
.x = cohort_name,
.f = ~copy_and_render(
cohort = .x
)
)
A similar method is used to create a summary report in Word using 📦 flextable.
How many variables can each cohort provide?
How many variables can be harmonised?
demographic_list <- list(
A = c("Age", "Sex",
"Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
"Smoke Current", "Smoke Past",
"Have Chest Pain", "Chest Pain Character",
"Dyspnea",
"BMI", "Height", "Weight"),
B = c("Age", "Sex",
"Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
"Smoke Current", "Smoke Past",
"Have Chest Pain", "Chest Pain Character",
"Dyspnea",
"HDL", "Total Cholesterol",
"Triglyceride", "LDL"),
C = c("Age", "Sex",
"Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
"Smoke Current", "Smoke Past",
"Have Chest Pain", "Chest Pain Character",
"Dyspnea",
"BMI", "Height", "Weight",
"HDL", "Total Cholesterol",
"Triglyceride", "LDL")
)
cohort_a_label_name <- "Cohort A\n(n=1000)"
cohort_b_label_name <- "Cohort B\n(n=2000)"
cohort_c_label_name <- "Cohort C\n(n=1500)"
demographic_venn_data <- demographic_list |>
ggVennDiagram::Venn() |>
ggVennDiagram::process_data()
demographic_venn_regionedge_data <- demographic_venn_data |>
ggVennDiagram::venn_regionedge() |>
dplyr::mutate(
name = stringr::str_replace_all(
.data[["name"]],
"/",
" and "
),
name = stringr::str_wrap(.data[["name"]], width = 10),
name = forcats::fct_reorder(.data[["name"]],
nchar(.data[["id"]]))
) |>
dplyr::rename(`Cohort` = "name")
demographic_venn_label_data <- demographic_venn_data |>
ggVennDiagram::venn_setlabel() |>
dplyr::mutate(
name = dplyr::case_match(
.data[["name"]],
"A" ~ cohort_a_label_name,
"B" ~ cohort_b_label_name,
"C" ~ cohort_c_label_name
)
) |>
dplyr::rename(`Cohort` = "name")
demographic_venn_regionlabel_data <- demographic_venn_data |>
ggVennDiagram::venn_regionlabel() |>
dplyr::mutate(
name = stringr::str_replace_all(
.data[["name"]],
"/",
" and "
),
name = stringr::str_wrap(.data[["name"]], width = 10),
name = forcats::fct_reorder(.data[["name"]],
nchar(.data[["id"]]))
) |>
dplyr::rename(`Cohort` = "name")
demographic_venn_edge_data <- demographic_venn_data |>
ggVennDiagram::venn_setedge()
demographic_venn_diagram <- ggplot2::ggplot() +
# 1. region count layer
ggplot2::geom_polygon(
data = demographic_venn_regionedge_data,
mapping = ggplot2::aes(
x = .data[["X"]], y = .data[["Y"]],
fill = .data[["Cohort"]],
group = .data[["id"]])
) +
# 2. set edge layer
ggplot2::geom_path(
data = demographic_venn_edge_data,
mapping = ggplot2::aes(
x = .data[["X"]], y = .data[["Y"]],
colour = "black",
group = .data[["id"]]
),
show.legend = FALSE
) +
# 3. set label layer
ggplot2::geom_text(
data = demographic_venn_label_data,
mapping = ggplot2::aes(
x = .data[["X"]], y = .data[["Y"]],
label = .data[["Cohort"]]),
size = 5.5
) +
# 4. region label layer
ggplot2::geom_label(
data = demographic_venn_regionlabel_data,
mapping = ggplot2::aes(
x = .data[["X"]], y = .data[["Y"]],
label = .data[["count"]]),
size = 7,
fill = "white"
) +
ggplot2::scale_x_continuous(
expand = ggplot2::expansion(mult = 0.2)
) +
ggplot2::theme_void() +
ggplot2::theme(
text = ggplot2::element_text(size = 20)
)
Venn diagrams do not work for many (> 10) cohorts.
Upset plots are too complicated for clinicians.
demographic_venn <- tibble::tibble(
column_name = c("Age", "Sex",
"Hypertension", "Dyslipidemia", "Family Hx CAD", "Diabetes",
"Smoke Current", "Smoke Past",
"Have Chest Pain", "Chest Pain Character",
"Dyspnea",
"BMI", "Height", "Weight",
"HDL", "Total Cholesterol",
"Triglyceride", "LDL"),
`Cohort A` = c(1, 1,
1, 1, 1, 1,
1, 1,
1, 1,
1,
1, 1, 1,
0, 0,
0, 0),
`Cohort B` = c(1, 1,
1, 1, 1, 1,
1, 1,
1, 1,
1,
0, 0, 0,
1, 1,
1, 1),
`Cohort C` = c(1, 1,
1, 1, 1, 1,
1, 1,
1, 1,
1,
1, 1, 1,
1, 1,
1, 1),
`Cohort D` = c(1, 1,
1, 1, 1, 1,
1, 1,
1, 1,
0,
1, 0, 0,
1, 1,
1, 1),
`Cohort E` = c(1, 1,
1, 1, 1, 1,
1, 1,
1, 1,
0,
1, 1, 1,
1, 1,
0, 0),
`Cohort F` = c(1, 1,
1, 1, 1, 1,
1, 1,
1, 1,
0,
1, 1, 1,
0, 0,
0, 0),
)
cohort_a_upset_col_name <- "Cohort A\n(n=1000)"
cohort_b_upset_col_name <- "Cohort B\n(n=2000)"
cohort_c_upset_col_name <- "Cohort C\n(n=1500)"
cohort_d_upset_col_name <- "Cohort D\n(n=500)"
cohort_e_upset_col_name <- "Cohort E\n(n=1000)"
cohort_f_upset_col_name <- "Cohort F\n(n=2500)"
demographic_upset_data <- demographic_venn |>
dplyr::rename(
!!cohort_a_upset_col_name := "Cohort A",
!!cohort_b_upset_col_name := "Cohort B",
!!cohort_c_upset_col_name := "Cohort C",
!!cohort_d_upset_col_name := "Cohort D",
!!cohort_e_upset_col_name := "Cohort E",
!!cohort_f_upset_col_name := "Cohort F"
)
upset_plot <- ComplexUpset::upset(
demographic_upset_data,
c(cohort_a_upset_col_name,
cohort_b_upset_col_name,
cohort_c_upset_col_name,
cohort_d_upset_col_name,
cohort_e_upset_col_name,
cohort_f_upset_col_name),
base_annotations = list(
`Intersection size` =
ComplexUpset::intersection_size(
text = list(size = 9, nudge_y = 0.5),
text_colors = c(on_background='black',
on_bar='white')
) +
ggplot2::annotate(
size = 5.5,
geom = 'text',
x = Inf,
y = Inf,
label = paste('Total:', nrow(demographic_venn)),
vjust = 1,
hjust = 1
) +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0, size = 20),
axis.title.x = ggplot2::element_text(angle = 0, size = 20),
text = ggplot2::element_text(size = 20)
) +
ggplot2::labs(y = "",
title = "Intersection size")
),
set_sizes = (
ComplexUpset::upset_set_size() +
ggplot2::geom_text(
size = 5.5,
mapping = ggplot2::aes(label= ggplot2::after_stat(.data[["count"]])),
hjust = 1.5,
stat = 'count') +
ggplot2::expand_limits(y = 30) +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0, size = 15),
axis.title.x = ggplot2::element_text(angle = 0, size = 15),
text = ggplot2::element_text(size = 15)
) +
ggplot2::labs(y = "Number of variables")
),
sort_intersections_by = "degree",
sort_sets = FALSE,
name = "",
width_ratio = 0.25,
themes = ComplexUpset::upset_default_themes(
text = ggplot2::element_text(size = 15)
)
)
Upset plots cannot answer follow-up questions such as:
How many cohorts provide patient's blood lipid information and how many patients have them?
Create a "heatmap" using Microsoft PowerPoint.
Harmonisation project template: https://github.com/JauntyJJS/harmonisation/