Article type: Application Note. This is a preprint. It has not yet been peer-reviewed.
Authors: David M. Miller1* MD PhD, Sophia Z. Shalhout1 PhD
1Department of Medicine, Division of Hematology/Oncology and the Department of Dermatology, Massachusetts General Hospital, Boston, MA
*Corresponding author: David M. Miller MD PhD
Massachusetts General Hospital
Funding sources: This work was supported by a grant from Project Data Sphere.
Conflicts of interest: None
Manuscript word count: 1869
Abstract word count: 128
Keywords: data visualization, patient registries, shiny app, REDCap, clinical informatics
Abbreviations: AWD: alive with disease, CSV: comma-separated values, CTCAE: common terminology criteria for adverse events, DCIs: data collection instruments, EDA: exploratory data analysis, EDC: electronic data capture, NED: no evidence of disease, PCT: primary cutaneous tumor, PDS: project data sphere, REDCap: research electronic data capture, RWD: real-world data, UI: user interface.
Objectives: Tumor registries are a rich source of real-world data that can be used to test important hypotheses that inform clinical care. Exploratory data analysis at the level of individual subjects, when enhanced by interactive data visualizations, has the potential to provide novel insights and generate new hypotheses.
Materials and Methods: We created StoryboardR, an R package and Shiny application designed to visualize real-world data from tumor registries.
Results: StoryboardR facilitates the visualization of real-world data from tumor registries captured in REDCap®.
Conclusions: StoryboardR is freely available under the Massachusetts Institute of Technology license and can be obtained from GitHub. StoryboardR is executed in R and deployed as a Shiny application for non-R users. It produces data visualizations of patient journeys from tumor registries.
Tumor registries are a rich source of patient-level data that can lead to important clinical insights. When optimally executed, tumor registries capture highly structured real-world data (RWD), which shortens time-to-analysis and time-to-insight. There is significant variability in the scope and depth of data captured within various tumor registries, depending on which elements of the patient journey are targeted for capture. For example, the Project Data Sphere-led Merkel Cell Carcinoma Patient Registry1,2 currently captures over two thousand data elements across 11 forms (Table 1).
While tumor registries can provide large data sets to test important hypotheses, exploratory data analysis (EDA) at the level of individual subjects can lead to novel insights and hypothesis generation. Visualizing patient-level data is a critical part of EDA. Good data visualizations can facilitate the digestion of complex information. Ideal data visualizations leverage superior data, function and design; they are thus simple to generate, easy to understand, informative, and visually appealing.
Here we present StoryboardR, an R package with a Shiny application front-end, which facilitates the visualization of real-world data from clinical registries collected in a Research Electronic Data Capture (REDCap®)-based project. REDCap® is a web-based electronic data capture (EDC) system utilized by investigators to capture structured data3. The functions of StoryboardR wrangle and transform data from REDCap®-based tumor registries to produce an interactive data visualization of the patient journey (Figure 1). StoryboardR is executed in R; however, the application is deployed via Shiny to enhance the user interface for non-R users.
In this manuscript we provide: (1) the data dictionary to allow users to adopt the MCC Patient Registry platform; (2) the StoryboardR R-package with installation instructions and examples which can be viewed on the package GitHub page (http://github.com/TheMillerLab/StoryboardR); and (3) a sample data set for demonstration purposes, which is embedded in the R-package. Importantly, these resources may be adopted by other clinical investigators to facilitate development of a variety of disease-specific registries.
Figure 1. Schema of StoryboardR. The StoryboardR package takes a CSV file of clinical registry data stored in a REDCap® project as input. End users then select a subject to generate an interactive data visualization. Once a subject is selected, the package executes a series of server-side functions that wrangle and transform the data and generate an interactive data visualization of the patient journey.
StoryboardR is written in R (version 4.0.0), organized using roxygen24, and utilizes the following packages: dplyr5, tidyr6, readr7, stringr8, TimeWarp9, magrittr10, plotly11, splitstackshape12, shinydashboard13, and shiny14. For full details, instructions and examples, refer to the video demonstration (https://github.com/TheMillerLab/StoryboardR/blob/main/Video_Demo.md) or the README file (https://github.com/TheMillerLab/StoryboardR/blob/main/README.md), both of which can be viewed on the package GitHub page.
StoryboardR facilitates data visualizations of patient data from the Merkel Cell Carcinoma Tumor Registry electronic data capture system, a REDCap®-based EDC. The data dictionary for this platform is available on the package GitHub page (https://github.com/TheMillerLab/StoryboardR/blob/main/data-raw/StoryboardR_DataDictionary.csv). While this platform is currently being used by the Merkel Cell Carcinoma Tumor Registry, the fields are generalizable to most solid tumors. Potential customizations of the platform are described below.
As shown in Figure 1, StoryboardR takes data from a REDCap® project that has incorporated the instruments from Table 1. The StoryboardR Shiny application is launched via the function launch_StoryboardR(). This function takes two arguments: “Data” and “DateShift”. The “Data” argument is a data frame that contains the raw data from the desired REDCap® project. “DateShift”, which defaults to FALSE, applies a random and uniform shift to all the dates in the data frame when set to TRUE (this is described in more detail below). launch_StoryboardR() is the only function required to execute and utilize StoryboardR. Once launch_StoryboardR() is called, end users interface with StoryboardR in a web browser.
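The idea behind the “DateShift” argument can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper shift_dates(), its max_days parameter, and the toy column names are all hypothetical, and the sketch assumes that a "uniform" shift means the same random offset is applied to every date, so that intervals between events are preserved.

```r
# Hypothetical helper, not part of StoryboardR: shift every Date column of a
# data frame by one shared random offset (assumed meaning of "uniform" shift).
shift_dates <- function(df, max_days = 365) {
  offset <- sample(seq_len(max_days), size = 1)        # one offset for all dates
  is_date <- vapply(df, inherits, logical(1), what = "Date")
  df[is_date] <- lapply(df[is_date], function(x) x - offset)
  df
}

# Toy registry extract; column names and values are made up for illustration
registry <- data.frame(
  record_id = c(1, 2),
  dx_date   = as.Date(c("2020-01-15", "2020-04-02")),
  surg_date = as.Date(c("2020-02-01", "2020-05-10"))
)

set.seed(42)
shifted <- shift_dates(registry)

# The time between diagnosis and surgery is unchanged by the shift
all(shifted$surg_date - shifted$dx_date ==
      registry$surg_date - registry$dx_date)
```

Because a single offset is drawn once and applied to all date columns, the sequence and spacing of events in a patient journey stay intact while the calendar dates themselves are obscured.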
The Shiny application presents a streamlined user interface (UI) in the web browser with a single user input, the subject’s “Record ID”, to maximize usability (Figure 2). The Shiny UI utilizes functions from the shinydashboard package to generate a fully customizable, centralized, easy-to-view dashboard of high-yield clinical information. Patient Characteristics and Initial Staging, Burden of Disease, Genomic Analysis and Therapeutic Interventions are pulled directly from the tumor registry and displayed for the viewer.
launch_StoryboardR() executes a set of functions that wrangle data from the tumor registry’s structured forms, transform that information into data frames with key information from the patient journey, and graph the output as an interactive storyboard (Figure 1). Table 2 summarizes core functions and their respective actions.
Figure 3. Multi-level Functional Processing Approach of StoryboardR. StoryboardR utilizes a series of multi-level functions to wrangle and transform data from individual data collection instruments in order to construct patient-level data visualizations. At the base, the functions (ss(), surgery(), clinical_staging(), pathological_staging(), lesion(), systemic_therapy(), adverse_events(), xrt() and genomics()) select fields from their respective case report forms and re-map them to a five-vector data frame containing the variables “record_id”, “description”, “value”, “date”, and “hover”. The mid-level function combine_storyboard_dfs() combines the output of each of the base-level functions into one principal data frame using the function rbind(). The data frame that results from combine_storyboard_dfs() can then be filtered by a subject Record ID to produce a patient-specific interactive data visualization by calling the function storyboard_plot().
The base-level functions (e.g., genomics()) select fields from their respective case report forms (e.g., Subject Status, Surgery, Genomics) and re-map them to a five-vector data frame containing the variables “record_id”, “description”, “value”, “date”, and “hover”. The vector “record_id” holds the individual subject identifier. The vector “description” designates the function from which the data were processed (e.g., genomics). The vector “value” holds the transformed data from the tumor registry that will be displayed on the patient storyboard, for example the name of the therapeutic agent captured (e.g., “Avelumab”) or the type of lesion detected (e.g., “Metastasis”). The vector “hover” contains additional data from the registry that will be displayed on the storyboard as hover text (e.g., the number of Gray of radiation used for treatment). An example of the output of one of the base-level functions, lesion(), when applied to the sample data set embedded in the StoryboardR package, is shown in Table 3.
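The re-mapping performed by a base-level function can be sketched in base R as follows. This is an illustrative stand-in rather than the package's lesion() function: the name lesion_sketch() and the raw field names (les_type, les_date, les_location) are hypothetical and do not come from the registry's data dictionary. Only the five output variables are taken from the description above.

```r
# Hypothetical base-level function: select fields from one form and re-map
# them to the five-variable layout (record_id, description, value, date, hover).
lesion_sketch <- function(data) {
  rows <- data[!is.na(data$les_date), ]          # keep rows from the lesion form
  data.frame(
    record_id   = rows$record_id,
    description = rep("lesion", nrow(rows)),     # which function produced the row
    value       = rows$les_type,                 # label shown on the storyboard
    date        = as.Date(rows$les_date),
    hover       = paste("Location:", rows$les_location),  # extra hover text
    stringsAsFactors = FALSE
  )
}

# Toy raw export; field names and values are invented for illustration
raw <- data.frame(
  record_id    = c(1, 1, 2),
  les_type     = c("Primary Cutaneous Tumor", "Metastasis", NA),
  les_date     = c("2020-01-15", "2020-09-01", NA),
  les_location = c("Left forearm", "Liver", NA),
  stringsAsFactors = FALSE
)

lesion_df <- lesion_sketch(raw)
names(lesion_df)
```

Giving every base-level function the same five-variable output is what allows their results to be stacked row-wise later without any per-form special cases.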
Table 3. Output of the base-level function lesion() after it is applied to the sample data set embedded in the StoryboardR package. Calling lesion() produces a data frame with five vectors: record_id, description, value, date and hover. The data frame can then be combined with the output of other base-level functions using the mid-level function combine_storyboard_dfs().
Output of combine_storyboard_dfs() after it is applied to the sample data set embedded in the StoryboardR package. The function combine_storyboard_dfs() takes one argument, “data”, a data frame that contains the raw data from the desired REDCap® project. The mid-level function combine_storyboard_dfs() contains all of the base-level functions (e.g., lesion()). Calling combine_storyboard_dfs() aggregates the output of all of the base-level functions via rbind() to produce one principal data frame with the five vectors: record_id, description, value, date and hover. The output of combine_storyboard_dfs() can then be filtered by a subject Record ID to produce a patient-specific interactive data visualization by calling the function storyboard_plot().
Finally, the top-level function storyboard_plot() generates an interactive plotly storyboard (see next section). Importantly, the multi-level functional processing design permits a high degree of customization and therefore maximizes extensibility. Developers can either modify base-level functions or create new ones to incorporate additional data from their clinical registry project.
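The combine-then-filter step of the multi-level design can be sketched in base R as shown here. The function combine_sketch() is a hypothetical stand-in for combine_storyboard_dfs(), which in the package takes the raw data frame rather than a list, and the two toy data frames (with made-up values) stand in for the output of base-level functions. The plotting step is omitted.

```r
# Toy outputs standing in for two base-level functions (values are made up)
lesion_df <- data.frame(
  record_id = c(1, 2), description = "lesion",
  value = c("Metastasis", "Primary Cutaneous Tumor"),
  date = as.Date(c("2020-09-01", "2020-02-10")),
  hover = c("Liver", "Scalp"), stringsAsFactors = FALSE
)
therapy_df <- data.frame(
  record_id = 1, description = "systemic_therapy",
  value = "Avelumab", date = as.Date("2020-10-01"),
  hover = "Cycle 1", stringsAsFactors = FALSE
)

# Hypothetical stand-in for the mid-level function: stack a list of
# five-variable data frames row-wise with rbind().
combine_sketch <- function(dfs) do.call(rbind, dfs)

storyboard <- combine_sketch(list(lesion_df, therapy_df))

# Filter to one subject's Record ID and order events chronologically
subject <- storyboard[storyboard$record_id == 1, ]
subject <- subject[order(subject$date), ]
subject$value
```

Under this layout, incorporating a new data source amounts to writing one more function that returns the same five variables and adding its output to the list that is combined.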