Table of Contents
- Motivation
- Acknowledgement
- File Structure
- Software Installation
- R Package Installation
- Using renv
- R Functions Management
- R Packages Used
- R Platform Information
- Data Harmonisation Report For Each Cohort
- Combined Data Harmonisation Report For All Cohort
- Data Harmonisation Summary
- General Recommendations
Motivation
Some large cohort studies pool data from multiple sites, studies or clinical trials. Prior to statistical or machine learning analysis, a data steward must not only clean these heterogeneous inputs but also organise and sort them into a standardised and consistent format. This process is sometimes called retrospective data harmonisation. As the harmonisation methods for certain data fields or variables can be complicated, they must be recorded in a coherent way so that different stakeholders (such as your collaborators or study committee members) can understand what is being done to the raw/provided data. Despite its importance in the big data environment, there are limited resources on how to document the data harmonisation process in a structured, efficient (with some automation) and robust way.
This repository aims to be a project template that allows a data steward to create data harmonisation reports using R and Quarto books. To learn more about Quarto books, visit https://quarto.org/docs/books.
The reports are generated as follows:
Run the R script cohort_harmonisation_script.R in the codes folder to generate:
- Cohort_A Harmonisation Report:
- Cohort_B Harmonisation Report:
Run the R script cohort_all_harmonisation_script.R in the codes folder to generate:
- Combined (All cohorts) Harmonisation Report:
Run the R script harmonisation_summary_script.R in the codes folder to generate:
- Harmonisation Summary:
File Structure
Here is the file structure of this project.
harmonisation/                          # Root of the project template.
|
├── .quarto/ (not in repository)        # Folder to keep intermediate files/folders
|                                       # generated when Quarto renders the files.
|
├── archive/                            # Folder to keep previous books and harmonised data.
|   |
|   ├── reports/                        # Folder to keep previous versions of
|   |   |                               # data harmonisation documentation.
|   |   |
|   |   ├── {some_date}_batch/          # Folder to keep {some_date} version of
|   |   |                               # data harmonisation documentation.
|   |   |
|   |   └── Flowchart.xlsx              # Flowchart sheet to record version control.
|   |
|   └── harmonised/                     # Folder to keep previous versions of harmonised data.
|       |
|       ├── {some_date}_batch/          # Folder to keep {some_date} version of
|       |                               # harmonised data.
|       |
|       └── Flowchart.xlsx              # Flowchart sheet to record version control.
|
├── codes/                              # Folder to keep R/Quarto scripts
|   |                                   # to run data harmonisation.
|   |
|   ├── {cohort name}/                  # Folder to keep Quarto scripts to run
|   |   |                               # data cleaning, harmonisation
|   |   |                               # and output them for each cohort.
|   |   |
|   |   └── preprocessed_data/          # Folder to keep preprocessed data.
|   |
|   ├── harmonisation_summary/          # Folder to keep Quarto scripts to create
|   |                                   # data harmonisation summary report.
|   |
|   ├── output/                         # Folder to keep harmonised data.
|   |
|   ├── cohort_harmonisation_script.R   # R script to render each {cohort name}/ folder
|   |                                   # into html, pdf and word document.
|   |
|   └── harmonisation_summary_script.R  # R script to render the harmonisation_summary/
|                                       # folder into word document.
|
├── data-raw/                           # Folder to keep cohort raw data (.csv, .xlsx, etc.)
|   |
|   ├── {cohort name}/                  # Folder to keep cohort raw data.
|   |   |
|   |   ├── {data_dictionary}           # Data dictionary file that corresponds to the
|   |   |                               # cohort raw data. Can be the one the
|   |   |                               # collaborator provides or one provided by us.
|   |   |
|   |   └── Flowchart.xlsx              # Flowchart sheet to record version control.
|   |
|   ├── data-dictionary/                # Folder to keep data dictionary
|   |   |                               # used for harmonising data.
|   |   |
|   |   └── Flowchart.xlsx              # Flowchart sheet to record version control.
|   |
|   └── data-input/                     # Folder to keep data input file
|       |                               # for collaborators to fill in.
|       |
|       └── Flowchart.xlsx              # Flowchart sheet to record version control.
|
├── docs/                               # Folder to keep R functions documentation
|                                       # generated using pkgdown:::build_site_external().
|
├── inst/                               # Folder to keep arbitrary additional files
|   |                                   # to include in the project.
|   |
|   └── WORDLIST                        # File generated by spelling::update_wordlist()
|
├── man/                                # Folder to keep R functions documentation
|   |                                   # generated using devtools::document().
|   |
|   ├── {fun-demo}.Rd                   # Documentation of the demo R function.
|   |
|   └── harmonisation-template.Rd       # High-level documentation.
|
├── R/                                  # Folder to keep R functions.
|   |
|   ├── {fun-demo}.R                    # Script with R functions.
|   |
|   └── harmonisation-package.R         # Dummy R file for high-level documentation.
|
├── renv/ (not in repository)           # Folder to keep all packages
|                                       # installed in the renv environment.
|
├── reports/                            # Folder to keep the most recent data harmonisation
|                                       # documentation.
|
├── templates/                          # Folder to keep template files needed to generate
|   |                                   # data harmonisation documentation efficiently.
|   |
|   ├── quarto-yaml/                    # Folder to keep template files to generate
|   |   |                               # data harmonisation documentation structure
|   |   |                               # in Quarto.
|   |   |
|   |   ├── _quarto_{cohort name}.yml   # Quarto book template for the data harmonisation
|   |   |                               # documentation of {cohort name}.
|   |   |
|   |   └── _quarto_summary.yml         # Quarto book template for the data harmonisation summary.
|   |
|   └── index-qmd/                      # Folder to keep template files to generate
|       |                               # the preface page of the data harmonisation
|       |                               # documentation.
|       |
|       ├── _index_report.qmd           # Preface template for each cohort data harmonisation
|       |                               # report.
|       |
|       └── _index_summary.qmd          # Preface template for data harmonisation
|                                       # summary report.
|
├── tests/                              # Folder to keep unit test files.
|                                       # Files will be used by R package testthat.
|
├── .Rbuildignore                       # List of files/folders to be ignored while
|                                       # checking/installing the package.
|
├── .Renviron (not in repository)       # File to set environment variables.
|
├── .Rprofile (not in repository)       # R code to be run when R starts up.
|                                       # It is run after the .Renviron file is sourced.
|
├── .Rhistory (not in repository)       # File containing R command history.
|
├── .gitignore                          # List of files/folders to be ignored while
|                                       # using the git workflow.
|
├── .lintr                              # Configuration for linting
|                                       # R projects and packages using lintr.
|
├── .renvignore                         # List of files/folders to be ignored when
|                                       # renv is doing its snapshot.
|
├── DESCRIPTION[*]                      # Overall metadata of the project.
|
├── LICENSE                             # Content of the MIT license generated via
|                                       # usethis::use_mit_license().
|
├── LICENSE.md                          # Content of the MIT license generated via
|                                       # usethis::use_mit_license().
|
├── NAMESPACE                           # List of functions exported for users or imported
|                                       # from other R packages. It is generated
|                                       # by devtools::document().
|
├── README.md                           # GitHub README markdown file generated by Quarto.
|
├── README.qmd                          # GitHub README quarto file used to generate README.md.
|
├── _pkgdown.yml                        # Configuration for R package documentation
|                                       # using pkgdown:::build_site_external().
|
├── _quarto.yml                         # Configuration for Quarto book generation.
|                                       # It is also the project configuration file.
|
├── csl_file.csl                        # Citation Style Language (CSL) file to ensure
|                                       # citations follow the Lancet journal style.
|
├── custom-reference.docx               # Microsoft Word template for rendering data
|                                       # harmonisation documentation to Word.
|
├── harmonisation_template.Rproj        # RStudio project file.
|
├── index.qmd                           # Preface page of Quarto book content.
|
├── references.bib                      # Bibtex file for Quarto book.
|
└── renv.lock                           # Metadata of installed R packages, generated
                                        # using renv::snapshot().
[*] Files marked this way are created automatically, but the user needs to add some information manually.
Software Installation
Installing R
Go to https://cran.rstudio.com/. Choose a version of R that matches the computer's operating system.
Installing RStudio
Go to https://posit.co/download/rstudio-desktop/. Scroll down and choose a version of RStudio that matches the computer's operating system.
Installing Rtools
Go to https://cran.r-project.org/bin/windows/Rtools/. Choose a version of Rtools that matches the R version that was installed.
Quarto
Quarto converts R scripts into a technical report or notebook in html, pdf, Microsoft Word, etc. It is installed together with RStudio. Users can also go to https://quarto.org/docs/get-started/ to install it separately. For Quarto to create pdf files, a pdf engine must be installed as well. For ease, it is suggested to install TinyTeX using the terminal command quarto install tinytex.
R Package Installation
Use Posit Public Package Manager (PPM) to set up your repository environment to install R packages from CRAN. This is because PPM allows installation of frozen R package versions based on a snapshot date.
One way to do that is to set the repository in the .Rprofile file with the code options(repos = c(P3M = "{link to repository URL from Posit Public Package Manager}")).
R packages can be installed using pak::pkg_install() from the R package pak as an alternative to install.packages() and remotes::install_github(). The benefits of using pak are described in the pak documentation.
You can also view your repository environment using the command pak::repo_get().
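As a minimal sketch of the setup described above, the lines below could go in the project's .Rprofile. The snapshot URL is the one that appears in the package tables of this README; dplyr is used only as an example of a package to install.

```r
# Point R at a frozen Posit Public Package Manager snapshot.
# The URL is an example snapshot date; replace it with the one you need.
ppm_url <- "https://packagemanager.posit.co/cran/2025-03-06"
options(repos = c(P3M = ppm_url))

# Confirm which repository R will now use (base R only).
print(getOption("repos"))

# With pak installed, inspect the repositories and install a package.
# Guarded so that nothing is installed when the file is merely sourced.
if (interactive() && requireNamespace("pak", quietly = TRUE)) {
  print(pak::repo_get())
  pak::pkg_install("dplyr")
}
```

Because the snapshot date is part of the URL, every collaborator who uses the same .Rprofile resolves packages against the same frozen set of versions.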
Using renv
You can increase reproducibility by using the package renv. Install renv from CRAN with pak::pak("renv"). If this is your first time using renv, start with the Introduction to renv vignette. Use renv::init(bare = TRUE) to start with an empty renv environment.
renv will freeze the exact package versions you depend on (in renv.lock). This ensures that each collaborator (or you in the future) will use the exact same versions of these packages. Moreover, renv gives each project its own private package library, isolating it from other projects.
Install required dependencies locally with pak::pkg_install() from CRAN, Bioconductor, R-universe, etc.
Sometimes the right downloader (libcurl or others) needs to be set for the installation of R packages inside the renv environment to succeed. Setting the R environment variable RENV_DOWNLOAD_FILE_METHOD = "libcurl" may help.
Save the local environment with renv::snapshot() to create the renv.lock file.
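The renv steps above can be sketched as the following console session. This is only an outline under assumptions: the environment variable name is taken from the text above, dplyr is a placeholder for whatever the project needs, and the renv and pak calls are left commented out because they modify the project and are meant to be run one at a time from the project root.

```r
# Ask renv to download with libcurl (variable name as given in the text above).
# The same setting could instead be placed in the .Renviron file.
Sys.setenv(RENV_DOWNLOAD_FILE_METHOD = "libcurl")
print(Sys.getenv("RENV_DOWNLOAD_FILE_METHOD"))

# Run these one at a time in the console, from the project root:
# renv::init(bare = TRUE)     # start with an empty renv environment
# pak::pkg_install("dplyr")   # install a dependency (dplyr is just an example)
# renv::snapshot()            # record the exact versions in renv.lock
```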
R Functions Management
R functions heavily used in this project can be found in the R folder. Documentation (man folder) and unit tests (tests folder) corresponding to these functions are structured in the same way as in an R package. The R packages required for R package development (and available on Posit Public Package Manager, PPM) are
package | description | version | date | source | repository |
---|---|---|---|---|---|
covr | Test Coverage for Packages | 3.6.4 | 2023-11-09 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
devtools | Tools to Make Developing R Packages Easier | 2.4.5 | 2022-10-11 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
lintr | A 'Linter' for R Code | 3.2.0 | 2025-02-12 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
pkgdown | Make Static HTML Documentation for a Package | 2.1.1 | 2024-09-17 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
roxygen2 | In-Line Documentation for R | 7.3.2 | 2024-06-28 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
sinew | Package Development Documentation and Namespace Management | 0.4.0 | 2022-03-31 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
spelling | Tools for Spell Checking in R | 2.3.1 | 2024-10-04 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
testthat | Unit Testing for R | 3.2.3 | 2025-01-13 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
usethis | Automate Package and Project Setup | 3.1.0 | 2024-11-26 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
For example, use the command pak::pak("{package name}") to install a package from Posit Public Package Manager (PPM).
There is no need to source the functions in the R folder. Use devtools::load_all() instead. devtools::load_all() will load required dependencies listed in DESCRIPTION and R functions stored in R/. Prior installation of these dependencies is required for the load to be successful.
After loading, R functions can be documented (using devtools::document()), tested (using devtools::test() and then devtools::check()) and even installed as an R package (using devtools::install()).
More information on this workflow can be found in Chapter 1: The Whole Game of the R Packages (2e) book.
Documentation of the functions in the R folder can be found at https://jauntyjjs.github.io/harmonisation/reference/index.html.
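To illustrate what a file in the R folder and its matching unit test might look like, here is a small sketch. The function harmonise_sex() and its coding (0 = Female, 1 = Male) are hypothetical and are not part of this repository; they only show the roxygen2 comment layout that devtools::document() reads.

```r
#' Harmonise a sex variable to "Female"/"Male"
#'
#' Hypothetical demo function; the coding 0 = Female, 1 = Male is an
#' assumption made for this example only.
#'
#' @param x Numeric vector coded 0 or 1.
#' @return A character vector, with NA for any other value.
#' @export
harmonise_sex <- function(x) {
  labels <- c("0" = "Female", "1" = "Male")
  # Look up each code by name; unknown codes and NA become NA.
  unname(labels[as.character(x)])
}

# In the project, the test below would live in tests/testthat/ without the
# guard; the guard only lets this sketch run where testthat is missing.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("harmonise_sex maps codes to labels", {
    testthat::expect_equal(harmonise_sex(c(0, 1, 2)), c("Female", "Male", NA))
  })
}
```

With such a file saved in R/, devtools::load_all() makes the function available, devtools::document() writes its help page into man/, and devtools::test() runs the test.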
R Packages Used
Here are all the R packages used in this analysis. They were installed from Posit Public Package Manager (PPM) or CRAN using the command pak::pkg_install("{package name}").
package | title | version | date | source | repository |
---|---|---|---|---|---|
cli | Helpers for Developing Command Line Interfaces | 3.6.4 | 2025-02-13 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
dplyr | A Grammar of Data Manipulation | 1.1.4 | 2023-11-17 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
flextable | Functions for Tabular Reporting | 0.9.7 | 2024-10-27 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
fontawesome | Easily Work with 'Font Awesome' Icons | 0.5.3 | 2024-11-16 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
forcats | Tools for Working with Categorical Variables (Factors) | 1.0.0 | 2023-01-29 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
fs | Cross-Platform File System Operations Based on 'libuv' | 1.6.5 | 2024-10-30 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
fst | Lightning Fast Serialization of Data Frames | 0.9.8 | 2022-02-08 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
ftExtra | Extensions for 'Flextable' | 0.6.4 | 2024-05-10 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
glue | Interpreted String Literals | 1.8.0 | 2024-09-30 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
harmonisation | Utility Functions For A Data Harmonisation Project | 1.0.0.0 | 2025-03-14 | local | NA |
here | A Simpler Way to Find Your Files | 1.0.1 | 2020-12-13 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
htmltools | Tools for HTML | 0.5.8.1 | 2024-04-04 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
htmlwidgets | HTML Widgets for R | 1.6.4 | 2023-12-06 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
knitr | A General-Purpose Package for Dynamic Report Generation in R | 1.49 | 2024-11-08 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
magrittr | A Forward-Pipe Operator for R | 2.0.3 | 2022-03-30 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
openxlsx | Read, Write and Edit xlsx Files | 4.2.8 | 2025-01-25 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
pointblank | Data Validation and Organization of Metadata for Local and Remote Tables | 0.12.2 | 2024-10-23 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
purrr | Functional Programming Tools | 1.0.4 | 2025-02-05 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
quarto | R Interface to 'Quarto' Markdown Publishing System | 1.4.4 | 2024-07-20 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
reactable | Interactive Data Tables for R | 0.4.4 | 2023-03-12 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
readxl | Read Excel Files | 1.4.4 | 2025-02-27 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
renv | Project Environments | 1.1.2 | 2025-03-03 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
rlang | Functions for Base Types and Core R and 'Tidyverse' Features | 1.1.5 | 2025-01-17 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
rmarkdown | Dynamic Documents for R | 2.29 | 2024-11-04 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
sessioninfo | R Session Information | 1.2.2 | 2021-12-06 | CRAN (R 4.4.2) | https://cran.rstudio.com |
stringr | Simple, Consistent Wrappers for Common String Operations | 1.5.1 | 2023-11-14 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
testthat | Unit Testing for R | 3.2.3 | 2025-01-13 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
tibble | Simple Data Frames | 3.2.1 | 2023-03-20 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
tidyr | Tidy Messy Data | 1.3.1 | 2024-01-24 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
vroom | Read and Write Rectangular Text Data Quickly | 1.6.5 | 2023-12-05 | RSPM | https://packagemanager.posit.co/cran/2025-03-06 |
R Platform Information
Here is the R platform environment used in this analysis.
setting | value |
---|---|
version | R version 4.4.2 (2024-10-31 ucrt) |
os | Windows 11 x64 (build 26100) |
system | x86_64, mingw32 |
ui | RTerm |
language | (EN) |
collate | English_Singapore.utf8 |
ctype | English_Singapore.utf8 |
tz | Asia/Singapore |
date | 2025-03-17 |
pandoc | 3.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown) |
quarto | 1.6.37 @ C:/Program Files/Quarto/bin/quarto.exe/ (via quarto) |
knitr | 1.49 from RSPM |
Data Harmonisation Report For Each Cohort
To start the harmonisation of data, run the R script cohort_harmonisation_script.R in the codes folder. The script will clean and harmonise the raw data and create a Quarto harmonisation report book for each cohort in html, word and pdf.
This involves
- copying a specific yml file (_quarto_{cohort name}.yml) from the templates/quarto-yaml folder to the project folder harmonisation and renaming it as _quarto.yml, overwriting any existing _quarto.yml file.
- copying a specific qmd file (_index_report.qmd) from the templates/index-qmd folder to the project folder harmonisation and renaming it as index.qmd, overwriting any existing index.qmd file.
Using the _quarto.yml, index.qmd, references.bib and csl_file.csl files, Quarto will then start running the Quarto scripts in the codes/{cohort_name} folder. This involves reading the raw data in the data-raw/{cohort_name} folder, placing preprocessed data in the codes/{cohort_name}/preprocessed_data folder, and outputting the harmonised data as an Excel file called harmonised_{cohort_name}.xlsx in the codes/output/harmonised folder. The data harmonisation process documentation will also be created in the reports/{cohort_name} folder as a Quarto book in html, word and pdf.
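The template-copying steps could be sketched as below. This is not the code of cohort_harmonisation_script.R; the helper use_report_templates() is a hypothetical name, and the demonstration uses a temporary folder with placeholder template files so that it can run outside the project. Only the folder and file names come from this README.

```r
# Hypothetical helper: copy the chosen templates into the project root,
# overwriting any existing _quarto.yml and index.qmd.
use_report_templates <- function(project_dir, yml_template, index_template) {
  yml_source <- file.path(project_dir, "templates", "quarto-yaml", yml_template)
  index_source <- file.path(project_dir, "templates", "index-qmd", index_template)
  stopifnot(file.exists(yml_source), file.exists(index_source))

  copied <- c(
    file.copy(yml_source, file.path(project_dir, "_quarto.yml"), overwrite = TRUE),
    file.copy(index_source, file.path(project_dir, "index.qmd"), overwrite = TRUE)
  )
  invisible(all(copied))
}

# Demonstration in a throwaway folder with placeholder template files.
demo_dir <- file.path(tempdir(), "harmonisation_demo")
yml_dir <- file.path(demo_dir, "templates", "quarto-yaml")
index_dir <- file.path(demo_dir, "templates", "index-qmd")
dir.create(yml_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(index_dir, recursive = TRUE, showWarnings = FALSE)
writeLines("project:\n  type: book", file.path(yml_dir, "_quarto_Cohort_A.yml"))
writeLines("# Preface", file.path(index_dir, "_index_report.qmd"))

use_report_templates(demo_dir, "_quarto_Cohort_A.yml", "_index_report.qmd")

# In the real project, rendering would follow. One possible call (an
# assumption about how the script renders the book) is:
# quarto::quarto_render(input = project_dir, as_job = FALSE)
```

Passing the template names as arguments means the same helper covers other template pairs, such as _quarto_summary.yml with _index_summary.qmd.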
Combined Data Harmonisation Report For All Cohort
To start the harmonisation of data, run the R script cohort_all_harmonisation_script.R in the codes folder. The script will clean and harmonise the raw data and create a Quarto harmonisation report book (all cohorts combined) in html.
This involves
- copying a specific yml file (_quarto_all.yml) from the templates/quarto-yaml folder to the project folder harmonisation and renaming it as _quarto.yml, overwriting any existing _quarto.yml file.
- copying a specific qmd file (_index_report.qmd) from the templates/index-qmd folder to the project folder harmonisation and renaming it as index.qmd, overwriting any existing index.qmd file.
Using the _quarto.yml, index.qmd, references.bib and csl_file.csl files, Quarto will then start running the Quarto scripts in the codes/{cohort_name} folder. This involves reading the raw data in the data-raw/{cohort_name} folder, placing preprocessed data in the codes/{cohort_name}/preprocessed_data folder, and outputting the harmonised data as an Excel file called harmonised_{cohort_name}.xlsx in the codes/output/harmonised folder. The data harmonisation process documentation will also be created in the reports/all folder as a Quarto book in html.
A harmonisation report file can consist of a few hundred pages. It is not recommended to output the combined report as one pdf or word document file because the file size may be too large and it takes a long time to open the file.
Data Harmonisation Summary
To start creating the data harmonisation summary document, run the R script harmonisation_summary_script.R in the codes folder. The script will create the document in word.
This involves
- copying a specific yml file (_quarto_summary.yml) from the templates/quarto-yaml folder to the project folder harmonisation and renaming it as _quarto.yml, overwriting any existing _quarto.yml file.
- copying a specific qmd file (_index_summary.qmd) from the templates/index-qmd folder to the project folder harmonisation and renaming it as index.qmd, overwriting any existing index.qmd file.
Using the _quarto.yml, index.qmd, references.bib and csl_file.csl files, Quarto will then start running the Quarto scripts in the codes/harmonisation_summary folder. The data harmonisation summary documentation will be created in the reports/harmonisation_summary_report folder as a Quarto book in word.
General Recommendations
- Ensure the workspace is always in a blank state. Use usethis::use_blank_slate(scope = c("user", "project")) to create this setting.
- Keep the root of the project as clean as possible
- Store your raw data in data-raw
- Document modifications to raw data, data dictionary, data input file and archived files in the Flowchart.xlsx provided.
- Export modified raw data in codes/{cohort_name}/preprocessed_data
- Store only R functions in R/
- Store only R scripts and/or qmd in codes/{cohort_name}_Cleaning
- Build relative paths using here::here()
- Call external functions as {package_name}::{function()}
- Use devtools::document() to update the NAMESPACE
- Do not source your functions; use devtools::load_all() instead. devtools::load_all() will load required dependencies listed in DESCRIPTION and R functions stored in R/
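Two of the recommendations above, relative paths and namespaced calls, can be illustrated with a short sketch. The helper raw_data_path() and the file name raw_data.csv are hypothetical; only the data-raw/{cohort_name} layout comes from this README.

```r
# Hypothetical helper: path to a raw data file of one cohort.
# `root` defaults to the project root found by here::here(); the default
# is only evaluated when no root is supplied.
raw_data_path <- function(cohort_name, file_name, root = here::here()) {
  file.path(root, "data-raw", cohort_name, file_name)
}

# Explicit root, so the example also runs outside the project.
print(raw_data_path("Cohort_A", "raw_data.csv", root = "/project"))

# Inside the project the default root is used, and external functions
# are called with their namespace, for example:
# cohort_a <- vroom::vroom(raw_data_path("Cohort_A", "raw_data.csv"))
```

Building every path from the project root keeps the scripts independent of the working directory from which Quarto happens to render them.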