Keywords: clinico-genomics, data abstraction, electronic health records, shiny app, REDCap, clinical informatics
Acknowledgments: We would like to acknowledge Ravikumar Komandur, PhD, Project Director at Project Data Sphere, for review and critique of the manuscript, and Guardant Health and Foundation Medicine for making sample reports available for the development of this package. GENETEX is for research purposes only. No clinical decisions should be made with the information obtained from its output. This article reflects the views and work (including development and use of GENETEX) of the authors and should not be construed to represent the work or policies of any of the vendors whose reports were used to develop GENETEX and whose reports may be provided as part of the package.
Abbreviations: BWH: Brigham and Women’s Hospital, CGD: clinico-genomic data, CNV: copy number variants, EDC: electronic data capture, EHR: electronic health record, MGH: Massachusetts General Hospital, MMR: mismatch repair, NSLS: numeric suffix linker system, REDCap: Research Electronic Data Capture, RWD: real-world data, TMB: tumor mutation burden
Objectives: Clinico-Genomic Data (CGD) acquired through routine clinical practice has the potential to improve our understanding of clinical oncology. However, these data often reside in heterogeneous, semi-structured sources, resulting in prolonged time-to-analyses. Materials and Methods: We created GENETEX: an R package and Shiny application for text mining genomic reports from EHR and direct import into REDCap®. Results: GENETEX facilitates the abstraction of CGD from EHR and streamlines capture of structured data into REDCap®. Its functions include natural language processing of key genomic information, transformation of semi-structured data into structured data and importation into REDCap®. When evaluated against manual abstraction, GENETEX had >99% agreement and captured CGD in approximately one-fifth the time. Conclusions: GENETEX is freely available under the Massachusetts Institute of Technology license and can be obtained from GitHub. GENETEX is executed in R and deployed as a Shiny application for non-R users. It produces high-fidelity abstraction of CGD in a fraction of the time.
Advances in clinical oncology require a deep understanding of cancer biology. Clinico-Genomic Data (CGD) obtained from routine clinical practice can greatly increase our comprehension of tumor biology. However, a number of barriers impede capitalization of these critical real-world data (RWD). Paramount among these obstacles are prolonged time-to-analyses secondary to the difficulties of capturing data from heterogeneous sources, as well as the challenges of processing vast amounts of genomic information. These hurdles increase time-to-insight from RWD and threaten our ability to fully capitalize on advances in molecular and information technologies.
In the real-world setting, genomic information resides in a variety of formats. The most common is a report from an institutional molecular pathology department or a commercial vendor. This information is often accessed by clinicians or clinical researchers as semi-structured data. Collecting CGD of a patient cohort in a structured electronic data capture (EDC) system can facilitate analysis and shorten time-to-analytics and time-to-action.
We previously published an overview of a methodology and design of a Research Electronic Data Capture (REDCap®) [1]-based system to facilitate capture of RWD [2]. That platform incorporates a form entitled Genomics Instrument, which provides a structured format for the collection of CGD [3]. This instrument is freely available and can be incorporated into any existing REDCap® project. It is currently being used by the Project Data Sphere-led Merkel Cell Carcinoma Patient Registry [4] to capture CGD.
Here we present GENETEX (pronounced “genetics”), an R package with a Shiny application front-end, which facilitates the abstraction of CGD from EHR and streamlines capture of structured data into the Genomics Instrument in REDCap®. Its functions include natural language processing of key genomic information, transformation of semi-structured data into structured data, and importation into REDCap® (Figure 1). GENETEX is executed in R but is deployed as a Shiny application to enhance the user interface for non-R users.
Figure 1. Schema of GENETEX
GENETEX is written in R (version 4.0.0), organized using roxygen2 [5], and utilizes the following packages: dplyr [6], tidyr [7], readr [8], stringr [9], purrr [10], REDCapR [11], magrittr [12], splitstackshape [13] and Shiny [14]. For full details, instructions and examples, refer to the README, which can be viewed on the package GitHub page (http://github.com/TheMillerLab/genetex).
Clinical Informatics Dependencies
GENETEX facilitates abstraction of medical records for importation into the Genomics Instrument in REDCap®. The data dictionary for this form has been previously published [3].
Comparison of GENETEX to Manual Abstraction
Sample genomic reports were either generated or obtained from commercial vendors. These data were devoid of protected health information; thus, no IRB review was required for this project. Two highly trained abstractors manually abstracted the reports (i.e. each report had duplicate abstraction), recording the time-to-capture for each report. GENETEX was used to abstract these same reports. To simulate the real-world experience, both techniques incorporated a manual visual quality-control step to verify that imported results were accurate. This step was included in the time captured. Agreement rates were compared using R. For the comparison of time-to-capture, the capture times of the duplicate manual abstractions were averaged for each report and compared with the GENETEX capture time for that report. A paired Wilcoxon signed-rank test was used to compare manual abstraction to GENETEX.
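The timing comparison described above can be sketched in base R as follows. The capture times below are hypothetical values chosen for illustration and are not the study data; the object names are likewise assumptions. The sketch averages the duplicate manual times per report and then applies a paired, two-sided Wilcoxon signed-rank test.

```r
# Hypothetical per-report capture times in seconds (NOT the study data).
manual_1 <- c(310, 460, 500, 640, 720, 880, 2100)  # abstractor 1
manual_2 <- c(290, 440, 540, 580, 680, 920, 1900)  # abstractor 2
genetex  <- c( 80,  95, 110, 120, 150, 160,  210)  # augmented abstraction

# Average the duplicate manual abstractions for each report.
manual_mean <- (manual_1 + manual_2) / 2

# Paired, two-sided Wilcoxon signed-rank test across the seven reports.
result <- wilcox.test(manual_mean, genetex, paired = TRUE)
print(result$p.value)
```

With seven pairs, no tied differences, and every difference in the same direction, the exact two-sided p value is 2/2^7 = 0.015625, which is the smallest value this test can return for seven reports.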
CGD in the real-world setting is predominantly contained in either portable document format (PDF) documents sent to providers by commercial vendors or in text files contained within the EHR. Thus, to facilitate abstraction of these data, we developed a browser-based user interface that takes text captured on a clipboard as input to a Shiny application. Text is copied to a computer’s clipboard and pasted into the text area input in the Shiny application (Figure 2).
Figure 2: Browser-Based User Interface
Users then control additional inputs including free text of the subject’s record id (required), REDCap® instrument instance (required), lesion tag descriptor (optional), and date the tissue was obtained (optional). Drop-down inputs are also presented to the user, including selection of the platform used to generate the genomics report (required) and the type of lesion the genomics report was generated from (e.g. primary lesion vs. metastases) (optional). Finally, to direct the data to REDCap®, users enter the web address of the REDCap® platform (required) as well as the REDCap® API token (required). These inputs are then passed to the function genetex_to_redcap() by the action button “Run GENETEX to REDCap”.
Server Side Verbs
The server side of the Shiny application contains the executable code of GENETEX. The package contains a set of verbs that parse, text mine and transform the input into structured data to serve as the substrate for import into REDCap® (Figure 1). Table 1 summarizes the verbs and their respective functions in extracting key elements from the genomics report. We have methods to automatically mine HUGO Gene Nomenclature Committee (HGNC)-approved gene names and detected amino acid and/or nucleotide alterations, tumor mutation burden (TMB), mismatch repair (MMR) status, copy number variants (CNV) and mutational signatures. In addition, our implementation transforms the data and links the appropriate CGD with the variables used in the Genomics Instrument so that they can be uploaded into REDCap®.
integrates key verbs to provide NLP tools to abstract data from a variety of genomic reports and import them to REDCap
integrates various platform-specific NLP functions to text mine gene names and nucleotide variants from genomic reports and transforms them to structured data for import into REDCap
integrates various platform-specific NLP functions to text mine gene names and copy number variants data from a variety of genomic reports and transforms them to structured data for import into REDCap
text mines mismatch repair status from genomic reports and transforms it to structured data for import into REDCap
text mines mutational signatures data from a variety of genomic reports and transforms it to structured data for import into REDCap
text mines tumor mutation burden (TMB) data from a variety of genomic reports and transforms it to structured data for import into REDCap
applies regular expressions to assign a numerical value for the various platforms used for genomic reports that aligns with the genomics_platform field in the REDCap Genomics Instrument
produces a regular expression of over 900 HGNC gene names
produces a regular expression of over 900 HGNC gene names as a unique string with word boundaries
applies regular expressions to assign a numerical value for the various platforms used for genomic reports that aligns with the genomics_platform field in the REDCap Genomics Instrument
Table 1 - GENETEX Verbs. Key functions unique to GENETEX with brief description of action are shown. Description of other verbs can be found in the package’s Help Page.
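As an illustration of the platform-coding verb summarized in Table 1, the following base R sketch assigns a numeric code to a report based on a regular-expression match. The patterns, the numeric codes and the function name platform_code() are hypothetical; the actual coding is defined by the genomics_platform field of the Genomics Instrument and may differ.

```r
# Hypothetical patterns and codes (assumptions for illustration only).
platform_patterns <- c(
  "guardant"      = 1,
  "foundationone" = 2,
  "snapshot"      = 3,
  "oncopanel"     = 4
)

# Return the code of the first pattern found in the report text, or NA.
platform_code <- function(report_text) {
  hits <- vapply(
    names(platform_patterns),
    function(p) grepl(p, report_text, ignore.case = TRUE),
    logical(1)
  )
  if (!any(hits)) {
    return(NA_real_)
  }
  unname(platform_patterns[hits][1])
}

platform_code("Guardant360 Report")
platform_code("BWH OncoPanel results")
```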
Text Mining CGD
Overview of Data Processing
Following the initial step of securing CGD into the Shiny application, GENETEX converts these character strings to a data frame for text mining. At this time, CGD from the following platforms can be processed by GENETEX: Guardant360™ [15], FoundationOne® [16], MGH SNaPshot [17] and BWH Oncopanel [18].
Due to idiosyncratic differences between these reports, we developed platform-specific functions to text mine data. For example, gene.variants.isolate.oncopanel() and gene.variants.isolate.snapshot() isolate gene variant data from BWH Oncopanel and MGH SNaPshot reports, respectively. In general, however, once the CGD have been secured, the data are tokenized using the cSplit() function from the splitstackshape package.
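The conversion of pasted text into a tokenized data frame can be sketched as below. This is a minimal illustration that uses base R strsplit() as a stand-in for cSplit(), so that it runs without additional packages; the report snippet and object names are hypothetical.

```r
# Hypothetical snippet of pasted report text.
report_text <- paste(
  "RESULTS: Single nucleotide variants",
  "TP53 c.524G>A p.R175H 1.2%",
  "KRAS c.35G>A p.G12D 0.4%"
)

# Split on whitespace so that each token occupies one row
# (base R stand-in for the cSplit() step described in the text).
tokens <- strsplit(report_text, "\\s+")[[1]]
report_df <- data.frame(token = tokens, stringsAsFactors = FALSE)

print(report_df)
```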
Isolating Genomic Data with Regular Expressions
To perform text mining of CGD, we created a number of regular expressions (regex) to identify gene names, nucleotide and amino acid sequences and cell-free DNA (cfDNA) data within genomic reports. For example, the function genes_boundary_regex() generates a regular expression of >900 gene names surrounded by word boundaries (Figure 3a). We further designed regular expressions to detect nucleotide and amino acid sequences and the magnitude of cfDNA (Figure 3b). The regular expressions were then combined to identify only the genomic information of interest (Figure 3c). Finally, we filter out unnecessary elements (e.g. the strings “RESULTS:” “Single” “nucleotide” “variants”) that are not intended to be captured in the Genomics Instrument (Figure 3d).
Figure 3d: Example of report filtered with genes_nuc_aa_cfdna_regex
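A minimal sketch of this filtering step is shown below. The three-gene list and the nucleotide, amino acid and cfDNA patterns are simplified assumptions for illustration; they are not the expressions shipped with the package, which cover more than 900 gene names. Base R grepl() is used in place of the stringr functions.

```r
# Hypothetical subset of gene names, wrapped in word boundaries.
genes <- c("TP53", "KRAS", "EGFR")
genes_boundary <- paste0("\\b(", paste(genes, collapse = "|"), ")\\b")

# Simplified, assumed patterns for the other data elements.
nuc_regex   <- "^c\\.\\S+$"               # e.g. c.524G>A
aa_regex    <- "^p\\.\\S+$"               # e.g. p.R175H
cfdna_regex <- "^[0-9]+(\\.[0-9]+)?%$"    # e.g. 1.2%

# Combine the patterns into a single alternation.
combined <- paste(c(genes_boundary, nuc_regex, aa_regex, cfdna_regex),
                  collapse = "|")

tokens <- c("RESULTS:", "Single", "nucleotide", "variants",
            "TP53", "c.524G>A", "p.R175H", "1.2%",
            "KRAS", "c.35G>A", "p.G12D", "0.4%")

# Keep only the tokens that match one of the patterns.
kept <- tokens[grepl(combined, tokens, perl = TRUE)]
print(kept)
```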
Group Correlated Text
In order to group correlated genomic information (e.g. gene name with the associated nucleotide/amino acid variant), GENETEX utilizes keyword-group pairing. A logical vector “keywords” is created using the dplyr verb mutate() paired with str_detect() and the regular expression “genes_boundary_regex”. This logical vector is then passed to the cumsum() function to create the numeric vector “group”, effectively grouping each unique gene name with its correlated data (Figure 3e).
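The keyword-group pairing can be illustrated with the short sketch below. It uses base R grepl() as a stand-in for mutate() and str_detect(), and a hypothetical three-gene pattern in place of the full gene regular expression. Because cumsum() increments each time a gene name is encountered, every token that follows a gene name inherits that gene's group number.

```r
# Hypothetical filtered tokens and a simplified gene pattern.
tokens <- c("TP53", "c.524G>A", "p.R175H", "1.2%",
            "KRAS", "c.35G>A", "p.G12D", "0.4%")
genes_boundary <- "\\b(TP53|KRAS|EGFR)\\b"

df <- data.frame(token = tokens, stringsAsFactors = FALSE)

# TRUE where the token is a gene name (the keyword).
df$keywords <- grepl(genes_boundary, df$token, perl = TRUE)

# The running count of keywords assigns one group per gene.
df$group <- cumsum(df$keywords)

print(df)
```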
In addition to isolating key genomic data from reports, the above regular expressions are also used to map variables used in the Genomics Instrument. The instrument uses the following variable prefixes “variant_gene”, “variant_nucleotide”, “variant_protein” and “variant_gene_perc_cfdna” to enable tidy data for gene names, nucleotide variants, amino acid variants and percent cfDNA, respectively. Using the following combination of ifelse() statements, str_detect() and the aforementioned regular expressions, these variables can be linked to the corresponding tidy data (Figure 3f).
Figure 3f: Mapping REDCap variables to data elements
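The mapping of tokens to variable prefixes can be sketched with nested ifelse() statements, as below. The patterns are simplified assumptions and base R grepl() stands in for str_detect(); only the four variable prefixes are taken from the Genomics Instrument as described in the text.

```r
# Hypothetical filtered tokens and simplified patterns.
tokens <- c("TP53", "c.524G>A", "p.R175H", "1.2%",
            "KRAS", "c.35G>A", "p.G12D", "0.4%")
genes_boundary <- "\\b(TP53|KRAS|EGFR)\\b"

df <- data.frame(token = tokens, stringsAsFactors = FALSE)

# Assign each token the prefix of the variable it belongs to.
df$var <- ifelse(
  grepl(genes_boundary, df$token, perl = TRUE), "variant_gene",
  ifelse(
    grepl("^c\\.", df$token), "variant_nucleotide",
    ifelse(
      grepl("^p\\.", df$token), "variant_protein",
      ifelse(grepl("%$", df$token), "variant_gene_perc_cfdna", NA_character_)
    )
  )
)

print(df)
```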
As previously described [3], in order to produce unique variables that can be linked with other related information, the Genomics Instrument utilizes a Numeric Suffix Linker System (NSLS). The NSLS links related elements of CGD with a character string in the variable name (e.g. “variant”) with an underscore and a numeric (e.g. “_1”). Therefore, a given gene will be grouped with its correlated nucleotide/amino acid variants and/or cfDNA information with a unique character string and numeric. For example, the variable prefixes “variant_gene”, “variant_nucleotide”, “variant_protein” and “variant_gene_perc_cfdna” will all be linked with the same numeric suffix (e.g. “_1”). Consequently, these elements can be grouped during analysis with the unique pairing of “variant” and “_1”. Therefore, a final step in creating this unique variable linked system involves pasting the “var” vector with the “group” vector (Figure 3g).
Figure 3g: Complete mapping REDCap variables to data elements with numeric suffix linker system
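The final pasting step can be sketched as follows. The input data frame is a hypothetical example of tokens that have already been assigned a variable prefix, and the column name redcap_variable is an assumption; the "var" and "group" vectors follow the description in the text.

```r
# Hypothetical tokens with their already-assigned variable prefixes.
df <- data.frame(
  token = c("TP53", "c.524G>A", "p.R175H", "1.2%",
            "KRAS", "c.35G>A", "p.G12D", "0.4%"),
  var = rep(c("variant_gene", "variant_nucleotide",
              "variant_protein", "variant_gene_perc_cfdna"), times = 2),
  stringsAsFactors = FALSE
)

# One group per gene: the counter increments at each gene name.
df$group <- cumsum(df$var == "variant_gene")

# Numeric suffix linker: prefix, underscore, group number.
df$redcap_variable <- paste0(df$var, "_", df$group)

print(df[, c("redcap_variable", "token")])
```

All elements belonging to the first gene share the suffix "_1" and those of the second gene share "_2", so related fields can be regrouped during analysis.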
An analogous approach is used to identify and abstract data on CNVs, TMB, MMR and mutational signatures. Please see the description file and annotated R scripts contained within the package in GitHub for further details.
The front-end Shiny application is executed with an action button that produces two easy-to-view outputs. The first, which can be viewed by clicking on “Report” in the sidebar, produces a verbatimTextOutput of the genomic report to ensure that the correct report was pasted into the textAreaInput. The second, viewed by clicking on “Data” in the sidebar, is a table of the data generated by genetex_to_redcap() (Supplemental Table 1). This is intended to provide the user with an output of the data generated by the function so that a quality-control step can take place.
Data imported via genomics package
Table 2 - Output of genetex_to_redcap()
Import to REDCap®
The data in Table 2 is imported to REDCap® from the Shiny application by calling the function redcap_write_oneshot() from the REDCapR package. An example of a portion of that form with data imported from GENETEX is seen in Supplemental Figure 1.
Supplemental Figure 1 - Example of the data form in REDCap® after the input text has been passed to genetex_to_redcap().
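The import step can be sketched as below. The helper names to_redcap_row() and import_to_redcap(), the long-format input and the use of record_id as the record identifier field are assumptions for illustration and may not reflect how GENETEX shapes its data internally. The only package call is redcap_write_oneshot() from REDCapR; it is wrapped in a function that is defined but not called here, since it requires a live REDCap® project and API token.

```r
# Hypothetical long-format output: one row per REDCap variable.
long <- data.frame(
  redcap_variable = c("variant_gene_1", "variant_nucleotide_1",
                      "variant_protein_1", "variant_gene_perc_cfdna_1"),
  value = c("TP53", "c.524G>A", "p.R175H", "1.2"),
  stringsAsFactors = FALSE
)

# Reshape to a single wide row keyed by the record id (assumed field name).
to_redcap_row <- function(record_id, long) {
  wide <- as.data.frame(
    as.list(setNames(long$value, long$redcap_variable)),
    stringsAsFactors = FALSE
  )
  cbind(data.frame(record_id = record_id, stringsAsFactors = FALSE), wide)
}

# Thin wrapper around the REDCapR import call; requires the REDCapR package,
# a valid REDCap URL and an API token when it is actually invoked.
import_to_redcap <- function(ds, redcap_uri, token) {
  REDCapR::redcap_write_oneshot(ds = ds, redcap_uri = redcap_uri, token = token)
}

row <- to_redcap_row("1", long)
print(row)
```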
To evaluate the performance of augmented abstraction with GENETEX in the real-world setting compared with classical abstraction (i.e. “manual abstraction”), we selected 7 genomic reports at random (3 Guardant, 2 Foundation Medicine, 1 MGH SNaPshot, 1 BWH Oncopanel). Each report was abstracted independently by two data abstractors via manual abstraction, as well as with GENETEX. In total, 744 data elements were captured from these 7 reports. The agreement rate between the two human abstractors was 99.19%. Importantly, >99% agreement was reached between each human abstractor and GENETEX (99.33% and 99.19%). Given the high agreement between classical and augmented abstraction, we next evaluated whether the GENETEX pipeline would improve time-to-analysis. The mean time for manual abstraction of each report was 784.5 (range: 220.5-3096.5) seconds compared with 136 (range: 75-216) seconds for augmented abstraction (paired Wilcoxon test, p = 0.015625).
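The agreement and timing figures above can be checked with a few lines of base R. The function agreement_rate() and the toy vectors are hypothetical. The discordant counts of 6 and 5 elements are inferred from the reported percentages and are not stated in the text.

```r
# Percent agreement between two abstractions of the same data elements.
agreement_rate <- function(a, b) {
  stopifnot(length(a) == length(b))
  100 * mean(a == b)
}

# Toy example: three of four elements agree.
toy <- agreement_rate(c("TP53", "KRAS", "MSS", "4"),
                      c("TP53", "KRAS", "MSS", "5"))

# Reported rates are consistent with 6 and 5 discordant elements of 744
# (inferred counts, not reported values).
rate_6_discordant <- round(100 * (744 - 6) / 744, 2)
rate_5_discordant <- round(100 * (744 - 5) / 744, 2)

# Ratio of mean capture times, augmented versus manual abstraction.
time_ratio <- 136 / 784.5

print(c(toy, rate_6_discordant, rate_5_discordant, round(time_ratio, 3)))
```

The time ratio of about 0.17 corresponds to the "approximately one-fifth the time" summary used elsewhere in the article.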
Limitations and Solutions
GENETEX is to be used in conjunction with the Genomics Instrument and thus is dependent on that form being installed in a REDCap® project. However, we have made the data dictionary freely available so that others may incorporate it into their individual projects. Importantly, our regular-expression system of mapping variable names to key genomic data provides a high degree of flexibility to map alternate variable names for REDCap® instruments with different data dictionaries. An additional limitation is that, at this time, GENETEX does not support all potential platforms available for CGD. However, because it is open source, external developers can submit pull requests on GitHub for incorporation of additional platforms and future refinement. Lastly, genomic findings abstracted from individual reports describe results at the point in time at which they were provided; they are a historical record and may not represent the latest understanding of genomic science or precision oncology treatment paradigms.
Ideally, structured data objects that exactly match clinical genomic reports would be universally available directly from the providers of genomic profiling reports, and they may be at some point in the future. Until then, there may be a desire to abstract or otherwise process these reports into structured formats. GENETEX is a browser-based application for natural language processing of CGD obtained in routine clinical practice. It facilitates extraction of data from EHR, transformation of semi-structured data into a structured format and loading into REDCap®. Its Shiny extension enables non-R users to execute the package without familiarity with R. Real-world deployment of GENETEX demonstrated excellent agreement with classical abstraction in roughly 1/5 of the time. Thus, augmented abstraction with browser-based applications can lower the barrier to data capture and, importantly, improve time-to-analysis of clinico-genomic data.
Harris, P. A. et al. Research electronic data capture (REDCap)–a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics 42, 377–381 (2009).
Shalhout, S. & Miller, D. Optimizing real-world data collection: Clinical genomics. (2020).
Miller, D. & Shalhout, S. Optimizing real-world data collection: Genomics electronic data capture instrument. (2021).
Project data sphere. https://www.projectdatasphere.org/research/programs/rare-tumor-registries/merkel-cell-carcinoma.
Wickham, H., Danenberg, P. & Eugster, M. roxygen2: In-source documentation for R. https://cran.r-project.org/package=roxygen2. R package version 7.1.1, (2013).
Wickham, H., Francois, R., Henry, L. & Muller, K. dplyr: A grammar of data manipulation. R package version 1.0.5, (2021).
Wickham, H. tidyr: Tidy messy data. R package version 1.1.3, (2013).
Wickham, H., Hester, J. & Francois, R. readr: Read rectangular text data. R package version 1.4.0, (2020).
Wickham, H. stringr: Simple, consistent wrappers for common string operations. R package version 1.4.0, (2019).
Henry, L. & Wickham, H. purrr: Functional programming tools. R package version 0.3.4, (2020).
Beasley, W. et al. REDCapR: Interaction between R and REDCap. R package version 0.11.0, (2020).
Bache, S. M., Wickham, H. & Henry, L. magrittr: A forward-pipe operator for R. R package version 2.0.1, (2020).
Mahto, A. splitstackshape: Stack and reshape datasets after splitting concatenated values. R package version 1.4.8, (2019).
Chang, W., Cheng, J., Allaire, J. J., Xie, Y. & McPherson, J. shiny: Web application framework for R. R package version 1.6.0, (2018).
Foundation medicine. https://www.foundationmedicine.com.
Massachusetts general hospital center for integrated diagnostics. https://www.massgeneral.org/pathology/services/center-for-integrated-diagnostics-default.
Brigham and Women’s Hospital Center for Advanced Molecular Diagnostics. https://www.brighamandwomens.org/pathology/center-for-advanced-molecular-diagnostics/molecular-diagnostics-laboratory.