GENETEX - A GENomics Report TEXt Mining R Package and Shiny Application Designed to Capture Real-World Clinico-Genomic Data

Preprint Article Submitted to JAMIA Open

Article type: Application Note

Authors: David M. Miller1* MD PhD, Sophia Z. Shalhout1 PhD

1Department of Medicine, Division of Hematology/Oncology and the Department of Dermatology, Massachusetts General Hospital, Boston, MA

*Corresponding author: David M. Miller MD PhD
Massachusetts General Hospital
15 Parkman St. Room 132
Boston MA 02114
Telephone: 617-726-2667
Fax: 617-643-1740
Email:

Funding sources: The Harvard Cancer Center Merkel Cell Carcinoma patient registry is supported by grants from Project Data Sphere, ECOG-ACRIN and the American Skin Association.

Conflicts of interest: None

Manuscript word count: 1925 [max 2000 excluding abstract/tabs/figs/res/suppl]
Abstract word count: 150 [max 150]
References: 18 [unlimited]
Figures: 3 [max 3]
Tables: 1 [max 2]
Supplemental Figures: 2

Keywords: clinico-genomics, data abstraction, electronic health records, shiny app, REDCap, clinical informatics

Acknowledgments: We would like to acknowledge Ravikumar Komandur, PhD, Project Director at Project Data Sphere, for review and critique of the manuscript, and Guardant Health and Foundation Medicine for making sample reports available for the development of this package. GENETEX is for research purposes only. No clinical decisions should be made with the information obtained from its output. This article reflects the views and work (including development and use of GENETEX) of the authors and should not be construed to represent the work or policies of any of the vendors whose reports were used to develop GENETEX and whose reports may be provided as part of the package.

Abbreviations: BWH: Brigham and Women’s Hospital, CGD: clinico-genomic data, CNV: copy number variants, EDC: electronic data capture, EHR: electronic health record, MGH: Massachusetts General Hospital, MMR: mismatch repair, NSLS: numeric suffix linker system, REDCap: Research Electronic Data Capture, RWD: real-world data, TMB: tumor mutation burden

ABSTRACT

Objectives: Clinico-Genomic Data (CGD) acquired through routine clinical practice have the potential to improve our understanding of clinical oncology. However, these data often reside in heterogeneous and semi-structured sources, resulting in prolonged time-to-analyses.
Materials and Methods: We created GENETEX: an R package and Shiny application for text mining genomic reports from EHR and direct import into REDCap®.
Results: GENETEX facilitates the abstraction of CGD from EHR and streamlines capture of structured data into REDCap®. Its functions include natural language processing of key genomic information, transformation of semi-structured data into structured data and importation into REDCap®. When compared with manual abstraction, GENETEX had >99% agreement and captured CGD in approximately one-fifth the time.
Conclusions: GENETEX is freely available under the Massachusetts Institute of Technology license and can be obtained from GitHub. GENETEX is executed in R and deployed as a Shiny application for non-R users. It produces high-fidelity abstraction of CGD in a fraction of the time.

Objectives

Advances in clinical oncology require a deep understanding of cancer biology. Clinico-Genomic Data (CGD) obtained from routine clinical practice can greatly increase our comprehension of tumor biology. However, a number of barriers impede the use of these critical real-world data (RWD). Paramount among these obstacles are prolonged time-to-analyses, secondary to the difficulty of capturing data from heterogeneous sources, and the challenge of processing vast amounts of genomic information. These hurdles increase time-to-insight from RWD and threaten our ability to capitalize fully on advances in molecular and information technologies.

In the real-world setting, genomic information resides in a variety of formats. The most common is a report from an institutional molecular pathology department or a commercial vendor. This information is often accessed by clinicians or clinical researchers as semi-structured data. Collecting the CGD of a patient cohort in a structured electronic data capture (EDC) system can facilitate analysis and shorten time-to-analytics and time-to-action.

We previously published an overview of the methodology and design of a Research Electronic Data Capture (REDCap®)-based system [1] to facilitate capture of RWD [2]. That platform incorporates a form entitled Genomics Instrument, which provides a structured format for the collection of CGD [3]. This instrument is freely available and can be incorporated into any existing REDCap project. It is currently being used by the Project Data Sphere-led Merkel Cell Carcinoma Patient Registry [4] to capture CGD.

Here we present GENETEX (pronounced “genetics”), an R package with a Shiny application front-end, which facilitates the abstraction of CGD from EHR and streamlines capture of structured data into the Genomics Instrument in REDCap®. Its functions include natural language processing of key genomic information, transformation of semi-structured data into structured data, and importation into REDCap® (Figure 1). GENETEX is executed in R but is deployed as a Shiny application to enhance the user interface for non-R users.

Figure 1. Schema of GENETEX

Methods

Software Dependencies

GENETEX is written in R (version 4.0.0), organized using roxygen2 [5], and uses the following packages: dplyr [6], tidyr [7], readr [8], stringr [9], purrr [10], REDCapR [11], magrittr [12], splitstackshape [13] and Shiny [14]. For full details, instructions and examples, refer to the README on the package GitHub page (http://github.com/TheMillerLab/genetex).

Clinical Informatics Dependencies

GENETEX facilitates abstraction of medical records for importation into the Genomics Instrument in REDCap®. The data dictionary for this form has been previously published [3].

Comparison of GENETEX to Manual Abstraction

Sample genomic reports were either generated or obtained from commercial vendors. These data were devoid of protected health information; thus, no IRB review was required for this project. Two highly trained abstractors manually abstracted each report (i.e. every report was abstracted in duplicate) and recorded the time-to-capture. GENETEX was then used to abstract the same reports. To simulate the real-world experience, both techniques incorporated a manual visual quality-control step to verify that imported results were accurate; this step was included in the time captured. Agreement rates were calculated in R. For time-to-capture, the capture times of the two manual abstractions were averaged for each report and compared with the GENETEX time using a paired Wilcoxon test.
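
The time-to-capture comparison described above can be sketched in base R as follows. This is only an illustration of the procedure: the timings are invented placeholder values (they are not study data) and the object names are hypothetical.

```r
# Hypothetical capture times in seconds for five reports (NOT study data).
manual_a <- c(300, 420, 360, 510, 450)  # abstractor 1
manual_b <- c(320, 400, 380, 490, 470)  # abstractor 2
genetex  <- c(60, 85, 70, 110, 95)      # GENETEX on the same five reports

# Average the duplicate manual abstractions report by report.
manual_mean <- (manual_a + manual_b) / 2

# Paired Wilcoxon signed-rank test: manual vs. GENETEX on the same reports.
res <- stats::wilcox.test(manual_mean, genetex, paired = TRUE)
print(res$p.value)
```

Pairing matters here because each report is measured under both techniques, so the test works on the per-report differences rather than on two independent samples.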

Results

Inputs/User Interface

CGD in the real-world setting is predominantly contained either in portable document format (PDF) documents sent to providers by commercial vendors or in text files within the EHR. Thus, to facilitate abstraction of these data, we developed a browser-based user interface that accepts text copied to the clipboard as input to a Shiny application. Text is copied to the computer’s clipboard and pasted into the text area input in the Shiny application (Figure 2).

Figure 2: Browser-Based User Interface

Users then supply additional inputs. Free-text inputs include the subject’s record id (required), the REDCap® instrument instance (required), a lesion tag descriptor (optional), and the date the tissue was obtained (optional). Drop-down inputs include the platform used to generate the genomics report (required) and the type of lesion from which the genomics report was generated (e.g. primary lesion vs. metastasis) (optional). Finally, to direct the data to REDCap®, users enter the web address of the REDCap® platform (required) and the REDCap® API token (required). These inputs are then passed to the function genetex_to_redcap() by the action button “Run GENETEX to REDCap”.
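
The distinction between required and optional inputs can be illustrated with a small base R check. This is a sketch only: the helper missing_required() and the field labels are hypothetical and are not part of GENETEX or of the genetex_to_redcap() interface.

```r
# Hypothetical labels for the required inputs described in the text.
required_fields <- c("record_id", "instrument_instance", "platform",
                     "redcap_url", "redcap_token")

# Hypothetical helper (not part of GENETEX): names of required inputs that are missing.
missing_required <- function(inputs, required = required_fields) {
  # An input counts as missing if it is absent, NA, or an empty string.
  is_missing <- vapply(required, function(field) {
    value <- inputs[[field]]
    is.null(value) || length(value) == 0 || is.na(value[1]) ||
      !nzchar(trimws(as.character(value[1])))
  }, logical(1))
  required[is_missing]
}

# Example: the web address is blank and the API token was never supplied.
inputs <- list(record_id = "001", instrument_instance = "1",
               platform = "Guardant360", redcap_url = "", lesion_tag = "")
missing_required(inputs)
```

Optional inputs such as the lesion tag are deliberately ignored by the check, so leaving them blank does not block the run.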

Server Side Verbs

The server side of the Shiny application contains the executable code of GENETEX. The package contains a set of verbs that parse, text mine and transform the input into structured data, which serve as the substrate for import into REDCap® (Figure 1). Table 1 summarizes the verbs and their respective functions in extracting key elements from the genomics report. GENETEX includes methods to automatically mine HUGO Gene Nomenclature Committee (HGNC)-approved gene names and detected amino acid and/or nucleotide alterations, tumor mutation burden (TMB), mismatch repair (MMR) status, copy number variants (CNV) and mutational signatures. In addition, our implementation transforms the data and links the appropriate CGD with the variables used in the Genomics Instrument so that they can be uploaded into REDCap®.

Verb                      Function
genetex_to_redcap()       integrates key verbs to provide NLP tools to abstract data from a variety of genomic reports and import them into REDCap
gene.variants()           integrates various platform-specific NLP functions to text mine gene names and nucleotide variants from genomic reports and transforms them to structured data for import into REDCap
cnv()                     integrates various platform-specific NLP functions to text mine gene names and copy number variant data from a variety of genomic reports and transforms them to structured data for import into REDCap
mmr()                     text mines mismatch repair status from genomic reports and transforms it to structured data for import into REDCap
mutational.signatures()   text mines mutational signature data from a variety of genomic reports and transforms it to structured data for import into REDCap
tmb()                     text mines tumor mutation burden (TMB) data from a variety of genomic reports and transforms it to structured data for import into REDCap
platform()                applies regular expressions to assign a numerical value for the various platforms used for genomic reports that aligns with the genomics_platform field in the REDCap Genomics Instrument
genes_regex()             produces a regular expression of over 900 HGNC gene names
genes_boundary_regex()    produces a regular expression of over 900 HGNC gene names as a unique string with word boundaries
genomics.tissue.type()    applies regular expressions to assign a numerical value for the various tissue types that aligns with the genomics_tissue_type field in the REDCap Genomics Instrument

Table 1 - GENETEX Verbs. Key functions unique to GENETEX are shown with a brief description of their action. Descriptions of other verbs can be found in the package’s help pages.


Text Mining CGD

Overview of Data Processing

After the CGD has been pasted into the Shiny application, GENETEX converts the character strings to a data frame for text mining. At this time, GENETEX can process CGD from the following platforms: Guardant360 [15], FoundationOne® [16], MGH SNaPshot [17] and BWH Oncopanel [18].

Because these reports differ in idiosyncratic ways, we developed platform-specific functions to text mine the data. For example, gene.variants.isolate.oncopanel() and gene.variants.isolate.snapshot() isolate gene variant data from BWH Oncopanel and MGH SNaPshot reports, respectively. In general, however, once the CGD has been captured, the text is tokenized using the cSplit() function from the splitstackshape package.
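
The tokenization step can be illustrated with a minimal base R sketch. Note that this swaps in strsplit() for the cSplit() call used by the package, and that the report fragment and the object names are assumptions chosen to mirror the one-token-per-row layout shown in Figure 3c.

```r
# A pasted report fragment held as a single character string (illustrative input).
report <- "RESULTS: Single nucleotide variants:\nATM 2353C>T R785C 0.1%\nPTEN L257V 769C>G 8.1%"

# Split on runs of whitespace so that each token becomes one element.
tokens <- strsplit(trimws(report), "[[:space:]]+")[[1]]

# Store the tokens one per row in a column named X.
dt <- data.frame(X = tokens, stringsAsFactors = FALSE)
print(dt)
```

Keeping one token per row is what allows the later steps to filter, group and label the tokens with ordinary column operations.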

Isolating Genomic Data with Regular Expressions

To perform text mining of CGD, we created a number of regular expressions (regex) to identify gene names, nucleotide and amino acid sequences and cell-free DNA (cfDNA) data within genomic reports. For example, the function genes_boundary_regex() generates a regular expression of >900 gene names surrounded by word boundaries (Figure 3a). We further designed regular expressions to detect nucleotide and amino acid sequences and the magnitude of cfDNA (Figure 3b). These regular expressions are then combined and applied to the tokenized report (Figure 3c) to identify only the genomic information of interest. Finally, we filter out unnecessary elements (e.g. the strings “RESULTS:”, “Single”, “nucleotide”, “variants:”) that are not intended to be captured in the Genomics Instrument (Figure 3d).

genetex::genes_boundary_regex()
## # A tibble: 1 × 1
##   genes                                                                         
##   <chr>                                                                         
## 1 "\\bABCB1\\b|\\bABCB11\\b|\\bABCC3\\b|\\bABL1\\b|\\bABL2\\b|\\bACTA2\\b|\\bAC…

Figure 3a: Regular expression of gene names
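
A pattern of the kind shown in Figure 3a can be assembled from a vector of gene symbols. The following is a minimal base R sketch rather than the package's own construction; the four-gene list and the object names are assumptions for illustration.

```r
# A handful of HGNC symbols for illustration; the package list has over 900.
genes <- c("ABL1", "ATM", "BRAF", "PTEN")

# Wrap each symbol in word boundaries and join the alternatives with "|".
boundary_regex <- paste0("\\b", genes, "\\b", collapse = "|")

# Word boundaries stop "ATM" from matching inside a longer token such as "ATMX".
grepl(boundary_regex, c("ATM", "ATMX", "PTEN", "RESULTS:"), perl = TRUE)
```

Without the boundaries, short symbols would match as substrings of unrelated tokens, which is why the boundary form is the one used for keyword detection.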

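# Nucleotide changes: a substitution such as C>T, or a deletion such as delA
nuc_regex <- "[ACTG]>[ACTG]|del[ACTG]"
# Amino acid changes: e.g. R785C, frameshifts (fs), deletions (del), stop codons (*), insertions (ins)
aa_regex <- "(\\b([A-Z][0-9]{1,}(([A-Z])|(_[A-Z][1-9]{1,}del)|(fs\\*[1-9]{1,})|(\\*)|(fs)|(del)))|(p\\.[A-Z]))|([0-9]ins[A-Z])"
# Percent cfDNA: one or two digits on each side of the decimal point, e.g. 0.1%
cfdna_regex <- "\\b[0-9]{1,2}\\.[0-9]{1,2}%"

Figure 3b: Regular expression of nucleotide and amino acid sequences, and cell-free DNA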
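
As a quick check of how these three patterns behave, the sketch below applies each of them to a few tokens using base R grepl() (the package itself uses stringr). The patterns are copied from Figure 3b; the token vector and the object name checks are assumptions for illustration.

```r
nuc_regex <- "[ACTG]>[ACTG]|del[ACTG]"
aa_regex <- "(\\b([A-Z][0-9]{1,}(([A-Z])|(_[A-Z][1-9]{1,}del)|(fs\\*[1-9]{1,})|(\\*)|(fs)|(del)))|(p\\.[A-Z]))|([0-9]ins[A-Z])"
cfdna_regex <- "\\b[0-9]{1,2}\\.[0-9]{1,2}%"

# Three informative tokens and one token that should match nothing.
tokens <- c("2353C>T", "R785C", "0.1%", "Single")

# One row per token, one logical column per pattern.
checks <- data.frame(
  token = tokens,
  nucleotide = grepl(nuc_regex, tokens, perl = TRUE),
  protein = grepl(aa_regex, tokens, perl = TRUE),
  cfdna = grepl(cfdna_regex, tokens, perl = TRUE)
)
print(checks)
```

Each informative token is picked up by exactly one pattern here, and the header word "Single" by none, which is the behavior the filtering step relies on.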



X
RESULTS:
Single
nucleotide
variants:
ATM
2353C>T
R785C
0.1%
PTEN
L257V
769C>G
8.1%

Figure 3c: Tokenized genomics report

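# Assumes genes_boundary_regex holds the pattern string returned by genetex::genes_boundary_regex()
genes_boundary_regex <- genetex::genes_boundary_regex()$genes
genes_nuc_aa_cfdna_regex <- paste(genes_boundary_regex, aa_regex, nuc_regex, cfdna_regex, sep = "|")

# Keep only the rows of the tokenized report that match one of the combined patterns
dt.1 <- dt %>%
    dplyr::filter(stringr::str_detect(string = X,
                                      pattern = stringr::regex(genes_nuc_aa_cfdna_regex)))

X
ATM
2353C>T
R785C
0.1%
PTEN
L257V
769C>G
8.1%

Figure 3d: Example of report filtered with genes_nuc_aa_cfdna_regex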


Group Correlated Text

To group correlated genomic information (e.g. a gene name with its associated nucleotide/amino acid variant), GENETEX uses keyword-group pairing. A logical vector “keywords” is created using the dplyr verb mutate() together with str_detect() and the regular expression “genes_boundary_regex”. This logical vector is then passed to the cumsum() function to create the numeric vector “group”, which increases by one at each gene name and thereby groups each gene with its correlated data (Figure 3e).

# Flag rows that hold a gene name, then number each gene block with a running count
dt.2 <- dt.1 %>%
    dplyr::mutate(keywords = stringr::str_detect(string = X,
                                                 pattern = stringr::regex(genes_boundary_regex)),
                  group = base::cumsum(keywords))

X        keywords  group
ATM      TRUE      1
2353C>T  FALSE     1
R785C    FALSE     1
0.1%     FALSE     1
PTEN     TRUE      2
L257V    FALSE     2
769C>G   FALSE     2
8.1%     FALSE     2

Figure 3e: Example of report grouped by gene name
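
The keyword-group pairing can also be reproduced without any packages. The sketch below is a base R illustration of the same idea; the two-gene pattern gene_regex is a stand-in assumption for the full gene expression used by GENETEX.

```r
# Filtered tokens as in Figure 3d, plus a small stand-in gene pattern.
X <- c("ATM", "2353C>T", "R785C", "0.1%", "PTEN", "L257V", "769C>G", "8.1%")
gene_regex <- "\\bATM\\b|\\bPTEN\\b"

# TRUE marks the start of a new gene block ...
keywords <- grepl(gene_regex, X, perl = TRUE)
# ... and the running count of TRUEs gives every row its block number.
group <- cumsum(keywords)

dt.2 <- data.frame(X, keywords, group, stringsAsFactors = FALSE)

# Collect each gene's tokens together by block number.
split(dt.2$X, dt.2$group)
```

The approach depends on report order: every non-gene token is assigned to the most recent gene name that precedes it.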

Mapping REDCap® Variables

In addition to isolating key genomic data from reports, the above regular expressions are also used to map the data to the variables used in the Genomics Instrument. The instrument uses the variable prefixes “variant_gene”, “variant_nucleotide”, “variant_protein” and “variant_gene_perc_cfdna” to hold gene names, nucleotide variants, amino acid variants and percent cfDNA, respectively, as tidy data. Using the following combination of ifelse() statements, str_detect() and the aforementioned regular expressions, these variables can be linked to the corresponding data elements (Figure 3f).

# Label each token with the Genomics Instrument variable prefix it belongs to;
# patterns are tested in order: gene, nucleotide, protein, cfDNA
dt.3 <- dt.2 %>%
  dplyr::mutate(var = ifelse(
    test = stringr::str_detect(string = X, pattern = stringr::regex(genes_boundary_regex)),
    yes = "variant_gene",
    no = ifelse(
      test = stringr::str_detect(string = X, pattern = stringr::regex(nuc_regex)),
      yes = "variant_nucleotide",
      no = ifelse(
        test = stringr::str_detect(string = X, pattern = stringr::regex(aa_regex)),
        yes = "variant_protein",
        no = ifelse(
          test = stringr::str_detect(string = X, pattern = stringr::regex(cfdna_regex)),
          yes = "variant_gene_perc_cfdna",
          no = "")))))

X        keywords  group  var
ATM      TRUE      1      variant_gene
2353C>T  FALSE     1      variant_nucleotide
R785C    FALSE     1      variant_protein
0.1%     FALSE     1      variant_gene_perc_cfdna
PTEN     TRUE      2      variant_gene
L257V    FALSE     2      variant_protein
769C>G   FALSE     2      variant_nucleotide
8.1%     FALSE     2      variant_gene_perc_cfdna

Figure 3f: Mapping REDCap variables to data elements


As previously described [3], in order to produce unique variables that can be linked with other related information, the Genomics Instrument uses a Numeric Suffix Linker System (NSLS). The NSLS links related elements of CGD by combining a character string in the variable name (e.g. “variant”) with an underscore and a number (e.g. “_1”). A given gene is therefore grouped with its correlated nucleotide/amino acid variants and/or cfDNA information through a shared character string and number. For example, the variable prefixes “variant_gene”, “variant_nucleotide”, “variant_protein” and “variant_gene_perc_cfdna” will all be linked with the same numeric suffix (e.g. “_1”). Consequently, these elements can be grouped during analysis by the unique pairing of “variant” and “_1”. The final step in creating this linked variable system is to paste the “var” vector together with the “group” vector (Figure 3g).

# Append the group number to each variable prefix as its numeric suffix
dt.4 <- dt.3 %>%
    dplyr::mutate(variables = paste(var, group, sep = "_"))

X        keywords  group  var                      variables
ATM      TRUE      1      variant_gene             variant_gene_1
2353C>T  FALSE     1      variant_nucleotide       variant_nucleotide_1
R785C    FALSE     1      variant_protein          variant_protein_1
0.1%     FALSE     1      variant_gene_perc_cfdna  variant_gene_perc_cfdna_1
PTEN     TRUE      2      variant_gene             variant_gene_2
L257V    FALSE     2      variant_protein          variant_protein_2
769C>G   FALSE     2      variant_nucleotide       variant_nucleotide_2
8.1%     FALSE     2      variant_gene_perc_cfdna  variant_gene_perc_cfdna_2

Figure 3g: Complete mapping of REDCap variables to data elements with the numeric suffix linker system
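
To illustrate how the numeric suffix allows related elements to be regrouped during analysis, the sketch below splits each variable name into its prefix and suffix and reshapes the data to one row per variant. It is a base R illustration under assumed object names (long, wide), not a function provided by GENETEX.

```r
# Long-format data as in Figure 3g: one row per variable.
long <- data.frame(
  variables = c("variant_gene_1", "variant_nucleotide_1", "variant_protein_1",
                "variant_gene_perc_cfdna_1", "variant_gene_2", "variant_protein_2",
                "variant_nucleotide_2", "variant_gene_perc_cfdna_2"),
  X = c("ATM", "2353C>T", "R785C", "0.1%", "PTEN", "L257V", "769C>G", "8.1%"),
  stringsAsFactors = FALSE
)

# Split each variable name into its prefix and its trailing numeric suffix.
long$prefix <- sub("_[0-9]+$", "", long$variables)
long$suffix <- as.integer(sub("^.*_([0-9]+)$", "\\1", long$variables))

# Rows that share a suffix describe the same variant: reshape to one row per suffix.
wide <- reshape(long[, c("suffix", "prefix", "X")],
                idvar = "suffix", timevar = "prefix", direction = "wide")
# reshape() prefixes the new columns with "X."; drop that to recover the variable prefixes.
names(wide) <- sub("^X\\.", "", names(wide))
print(wide)
```

Because the suffix is shared across prefixes, the order in which a report lists the nucleotide and protein changes does not affect which gene they are attached to.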

An analogous approach is used to identify and abstract data on CNVs, TMB, MMR status and mutational signatures. Please see the DESCRIPTION file and annotated R scripts contained within the package on GitHub for further details.

Package Outputs

The front-end Shiny application is executed with an action button that produces two easy-to-view outputs. The first, viewed by clicking on “Report” in the sidebar, is a verbatimTextOutput of the genomic report, which lets the user confirm that the correct report was pasted into the textAreaInput. The second, viewed by clicking on “Data” in the sidebar, is a table of the data generated by genetex_to_redcap() (Supplemental Table 1). This table gives the user a view of the function's output so that a quality-control step can take place.

variables results
record_id Lewis Jones
redcap_repeat_instrument genomics
redcap_repeat_instance 1
genomics_yn 1
lesion_tag_genomics Liquid Biopsy
genomics_tissue_type 5
genomics_date 2020-12-22
genomics_platform 6
genomics_platform_other
mmr 2
tmb_yn 1
tmb 81.47
tmb_abs
mutation_signature_number 0
mutation_signature_1
mutation_signature_2
mutation_signature_3
mutation_signature_4
mutation_signature_5
mutation_signature_6
mutation_signature_7
mutation_signature_8
mutation_signature_9
mutation_signature_10
variant_number 20
variant_gene_1 NTRK2
variant_nucleotide_1
variant_protein_1 A31T
variant_gene_perc_cfdna_1 3.5%
variant_gene_2 CTNNB1
variant_nucleotide_2
variant_protein_2 R550R
variant_gene_perc_cfdna_2 3.2%
variant_gene_3 PTEN
variant_nucleotide_3
variant_protein_3 E157fs
variant_gene_perc_cfdna_3 3.2%
variant_gene_4 PIK3CA
variant_nucleotide_4
variant_protein_4 E545K
variant_gene_perc_cfdna_4 2.6%
variant_gene_5 FANCA
variant_nucleotide_5
variant_protein_5 R1400H
variant_gene_perc_cfdna_5 2.5%
variant_gene_6 PTEN
variant_nucleotide_6
variant_protein_6 K267fs
variant_gene_perc_cfdna_6 2.5%
variant_gene_7 BRCA2
variant_nucleotide_7
variant_protein_7 I605fs
variant_gene_perc_cfdna_7 2.3%
variant_gene_8 BRAF
variant_nucleotide_8
variant_protein_8 A762V
variant_gene_perc_cfdna_8 2.2%
variant_gene_9 PALB2
variant_nucleotide_9
variant_protein_9 R37C
variant_gene_perc_cfdna_9 1.9%
variant_gene_10 CHEK2
variant_nucleotide_10
variant_protein_10 R346H
variant_gene_perc_cfdna_10 1.8%
variant_gene_11 NTRK1
variant_nucleotide_11
variant_protein_11 T741T
variant_gene_perc_cfdna_11 1.7%
variant_gene_12 AR
variant_nucleotide_12
variant_protein_12 H875Y
variant_gene_perc_cfdna_12 0.5%
variant_gene_13 MTOR
variant_nucleotide_13
variant_protein_13 W1456R
variant_gene_perc_cfdna_13 0.4%
variant_gene_14 NOTCH1
variant_nucleotide_14
variant_protein_14 F357del
variant_gene_perc_cfdna_14 0.4%
variant_gene_15 DDR2
variant_nucleotide_15
variant_protein_15 P157L
variant_gene_perc_cfdna_15 0.3%
variant_gene_16 GNA11
variant_nucleotide_16
variant_protein_16 G208fs
variant_gene_perc_cfdna_16 0.3%
variant_gene_17 MPL
variant_nucleotide_17
variant_protein_17 P530P
variant_gene_perc_cfdna_17 0.3%
variant_gene_18 ALK
variant_nucleotide_18
variant_protein_18 Y1584Y
variant_gene_perc_cfdna_18 0.2%
variant_gene_19 NOTCH1
variant_nucleotide_19
variant_protein_19 S2486fs
variant_gene_perc_cfdna_19 0.2%
variant_gene_20 ARID1A
variant_nucleotide_20
variant_protein_20 P1710P
variant_gene_perc_cfdna_20 0.1%
variant_gene_21
variant_nucleotide_21
variant_protein_21
variant_gene_perc_cfdna_21
variant_gene_22
variant_nucleotide_22
variant_protein_22
variant_gene_perc_cfdna_22
variant_gene_23
variant_nucleotide_23
variant_protein_23
variant_gene_perc_cfdna_23
variant_gene_24
variant_nucleotide_24
variant_protein_24
variant_gene_perc_cfdna_24
variant_gene_25
variant_nucleotide_25
variant_protein_25
variant_gene_perc_cfdna_25
variant_gene_26
variant_nucleotide_26
variant_protein_26
variant_gene_perc_cfdna_26
variant_gene_27
variant_nucleotide_27
variant_protein_27
variant_gene_perc_cfdna_27
variant_gene_28
variant_nucleotide_28
variant_protein_28
variant_gene_perc_cfdna_28
variant_gene_29
variant_nucleotide_29
variant_protein_29
variant_gene_perc_cfdna_29
variant_gene_30
variant_nucleotide_30
variant_protein_30
variant_gene_perc_cfdna_30
variant_gene_31
variant_nucleotide_31
variant_protein_31
variant_gene_perc_cfdna_31
variant_gene_32
variant_nucleotide_32
variant_protein_32
variant_gene_perc_cfdna_32
variant_gene_33
variant_nucleotide_33
variant_protein_33
variant_gene_perc_cfdna_33
variant_gene_34
variant_nucleotide_34
variant_protein_34
variant_gene_perc_cfdna_34
variant_gene_35
variant_nucleotide_35
variant_protein_35
variant_gene_perc_cfdna_35
variant_gene_36
variant_nucleotide_36
variant_protein_36
variant_gene_perc_cfdna_36
variant_gene_37
variant_nucleotide_37
variant_protein_37
variant_gene_perc_cfdna_37
variant_gene_38
variant_nucleotide_38
variant_protein_38
variant_gene_perc_cfdna_38
variant_gene_39
variant_nucleotide_39
variant_protein_39
variant_gene_perc_cfdna_39
variant_gene_40
variant_nucleotide_40
variant_protein_40
variant_gene_perc_cfdna_40
variant_gene_41
variant_nucleotide_41
variant_protein_41
variant_gene_42
variant_nucleotide_42
variant_protein_42
variant_gene_43
variant_nucleotide_43
variant_protein_43
variant_gene_44
variant_nucleotide_44
variant_protein_44
variant_gene_45
variant_nucleotide_45
variant_protein_45
variant_gene_46
variant_nucleotide_46
variant_protein_46
variant_gene_47
variant_nucleotide_47
variant_protein_47
variant_gene_48
variant_nucleotide_48
variant_protein_48
variant_gene_49
variant_nucleotide_49
variant_protein_49
variant_gene_50
variant_nucleotide_50
variant_protein_50
variant_gene_51
variant_nucleotide_51
variant_protein_51
variant_gene_52
variant_nucleotide_52
variant_protein_52
variant_gene_53
variant_nucleotide_53
variant_protein_53
variant_gene_54
variant_nucleotide_54
variant_protein_54
variant_gene_55
variant_nucleotide_55
variant_protein_55
variant_gene_56
variant_nucleotide_56
variant_protein_56
variant_gene_57
variant_nucleotide_57
variant_protein_57
variant_gene_58
variant_nucleotide_58
variant_protein_58
variant_gene_59
variant_nucleotide_59
variant_protein_59
variant_gene_60
variant_nucleotide_60
variant_protein_60
variant_gene_61
variant_nucleotide_61
variant_protein_61
variant_gene_62
variant_nucleotide_62
variant_protein_62
variant_gene_63
variant_nucleotide_63
variant_protein_63
variant_gene_64
variant_nucleotide_64
variant_protein_64
variant_gene_65
variant_nucleotide_65
variant_protein_65
variant_gene_66
variant_nucleotide_66
variant_protein_66
variant_gene_67
variant_nucleotide_67
variant_protein_67
variant_gene_68
variant_nucleotide_68
variant_protein_68
variant_gene_69
variant_nucleotide_69
variant_protein_69
variant_gene_70
variant_nucleotide_70
variant_protein_70
variant_gene_71
variant_nucleotide_71
variant_protein_71
variant_gene_72
variant_nucleotide_72
variant_protein_72
variant_gene_73
variant_nucleotide_73
variant_protein_73
variant_gene_74
variant_nucleotide_74
variant_protein_74
variant_gene_75
variant_nucleotide_75
variant_protein_75
variant_gene_76
variant_nucleotide_76
variant_protein_76
variant_gene_77
variant_nucleotide_77
variant_protein_77
variant_gene_78
variant_nucleotide_78
variant_protein_78
variant_gene_79
variant_nucleotide_79
variant_protein_79
variant_gene_80
variant_nucleotide_80
variant_protein_80
variant_gene_81
variant_nucleotide_81
variant_protein_81
variant_gene_82
variant_nucleotide_82
variant_protein_82
variant_gene_83
variant_nucleotide_83
variant_protein_83
variant_gene_84
variant_nucleotide_84
variant_protein_84
variant_gene_85
variant_nucleotide_85
variant_protein_85
variant_gene_86
variant_nucleotide_86
variant_protein_86
variant_gene_87
variant_nucleotide_87
variant_protein_87
variant_gene_88
variant_nucleotide_88
variant_protein_88
variant_gene_89
variant_nucleotide_89
variant_protein_89
variant_gene_90
variant_nucleotide_90
variant_protein_90
variant_gene_91
variant_nucleotide_91
variant_protein_91
variant_gene_92
variant_nucleotide_92
variant_protein_92
variant_gene_93
variant_nucleotide_93
variant_protein_93
variant_gene_94
variant_nucleotide_94
variant_protein_94
variant_gene_95
variant_nucleotide_95
variant_protein_95
variant_gene_96
variant_nucleotide_96
variant_protein_96
variant_gene_97
variant_nucleotide_97
variant_protein_97
variant_gene_98
variant_nucleotide_98
variant_protein_98
variant_gene_99
variant_nucleotide_99
variant_protein_99
variant_gene_100
variant_nucleotide_100
variant_protein_100
variant_gene_101
variant_nucleotide_101
variant_protein_101
variant_gene_102
variant_nucleotide_102
variant_protein_102
cnv_number 0
cnv_gene_1
cnv_gain_or_loss_1
cnv_gene_2
cnv_gain_or_loss_2
cnv_gene_3
cnv_gain_or_loss_3
cnv_gene_4
cnv_gain_or_loss_4
cnv_gene_5
cnv_gain_or_loss_5
cnv_gene_6
cnv_gain_or_loss_6
cnv_gene_7
cnv_gain_or_loss_7
cnv_gene_8
cnv_gain_or_loss_8
cnv_gene_9
cnv_gain_or_loss_9
cnv_gene_10
cnv_gain_or_loss_10
cnv_gene_11
cnv_gain_or_loss_11
cnv_gene_12
cnv_gain_or_loss_12
cnv_gene_13
cnv_gain_or_loss_13
cnv_gene_14
cnv_gain_or_loss_14
cnv_gene_15
cnv_gain_or_loss_15
cnv_gene_16
cnv_gain_or_loss_16
cnv_gene_17
cnv_gain_or_loss_17
cnv_gene_18
cnv_gain_or_loss_18
cnv_gene_19
cnv_gain_or_loss_19
cnv_gene_20
cnv_gain_or_loss_20
cnv_gene_21
cnv_gain_or_loss_21
cnv_gene_22
cnv_gain_or_loss_22
cnv_gene_23
cnv_gain_or_loss_23
cnv_gene_24
cnv_gain_or_loss_24
cnv_gene_25
cnv_gain_or_loss_25
cnv_gene_26
cnv_gain_or_loss_26
cnv_gene_27
cnv_gain_or_loss_27
cnv_gene_28
cnv_gain_or_loss_28
cnv_gene_29
cnv_gain_or_loss_29
cnv_gene_30
cnv_gain_or_loss_30
cnv_gene_31
cnv_gain_or_loss_31
cnv_gene_32
cnv_gain_or_loss_32
cnv_gene_33
cnv_gain_or_loss_33
cnv_gene_34
cnv_gain_or_loss_34
cnv_gene_35
cnv_gain_or_loss_35
cnv_gene_36
cnv_gain_or_loss_36
cnv_gene_37
cnv_gain_or_loss_37
cnv_gene_38
cnv_gain_or_loss_38
cnv_gene_39
cnv_gain_or_loss_39
cnv_gene_40
cnv_gain_or_loss_40
cnv_gene_41
cnv_gain_or_loss_41
cnv_gene_42
cnv_gain_or_loss_42
cnv_gene_43
cnv_gain_or_loss_43
cnv_gene_44
cnv_gain_or_loss_44
cnv_gene_45
cnv_gain_or_loss_45
cnv_gene_46
cnv_gain_or_loss_46
cnv_gene_47
cnv_gain_or_loss_47
cnv_gene_48
cnv_gain_or_loss_48
cnv_gene_49
cnv_gain_or_loss_49
cnv_gene_50
cnv_gain_or_loss_50
cnv_gene_51
cnv_gain_or_loss_51
cnv_gene_52
cnv_gain_or_loss_52
cnv_gene_53
cnv_gain_or_loss_53
cnv_gene_54
cnv_gain_or_loss_54
cnv_gene_55
cnv_gain_or_loss_55
cnv_gene_56
cnv_gain_or_loss_56
cnv_gene_57
cnv_gain_or_loss_57
cnv_gene_58
cnv_gain_or_loss_58
cnv_gene_59
cnv_gain_or_loss_59
cnv_gene_60
cnv_gain_or_loss_60
cnv_gene_61
cnv_gain_or_loss_61
cnv_gene_62
cnv_gain_or_loss_62
cnv_gene_63
cnv_gain_or_loss_63
cnv_gene_64
cnv_gain_or_loss_64
cnv_gene_65
cnv_gain_or_loss_65
cnv_gene_66
cnv_gain_or_loss_66
cnv_gene_67
cnv_gain_or_loss_67
cnv_gene_68
cnv_gain_or_loss_68
cnv_gene_69
cnv_gain_or_loss_69
cnv_gene_70
cnv_gain_or_loss_70
cnv_gene_71
cnv_gain_or_loss_71
cnv_gene_72
cnv_gain_or_loss_72
cnv_gene_73
cnv_gain_or_loss_73
cnv_gene_74
cnv_gain_or_loss_74
cnv_gene_75
cnv_gain_or_loss_75
cnv_gene_76
cnv_gain_or_loss_76
cnv_gene_77
cnv_gain_or_loss_77
cnv_gene_78
cnv_gain_or_loss_78
cnv_gene_79
cnv_gain_or_loss_79
cnv_gene_80
cnv_gain_or_loss_80
cnv_gene_81
cnv_gain_or_loss_81
cnv_gene_82
cnv_gain_or_loss_82
cnv_gene_83
cnv_gain_or_loss_83
cnv_gene_84
cnv_gain_or_loss_84
cnv_gene_85
cnv_gain_or_loss_85
cnv_gene_86
cnv_gain_or_loss_86
cnv_gene_87
cnv_gain_or_loss_87
cnv_gene_88
cnv_gain_or_loss_88
cnv_gene_89
cnv_gain_or_loss_89
cnv_gene_90
cnv_gain_or_loss_90
cnv_gene_91
cnv_gain_or_loss_91
cnv_gene_92
cnv_gain_or_loss_92
cnv_gene_93
cnv_gain_or_loss_93
cnv_gene_94
cnv_gain_or_loss_94
cnv_gene_95
cnv_gain_or_loss_95
cnv_gene_96
cnv_gain_or_loss_96
cnv_gene_97
cnv_gain_or_loss_97
cnv_gene_98
cnv_gain_or_loss_98
cnv_gene_99
cnv_gain_or_loss_99
cnv_gene_100
cnv_gain_or_loss_100
cnv_gene_101
cnv_gain_or_loss_101
cnv_gene_102
cnv_gain_or_loss_102
cnv_gene_103
cnv_gain_or_loss_103
cnv_gene_104
cnv_gain_or_loss_104
cnv_gene_105
cnv_gain_or_loss_105
cnv_gene_106
cnv_gain_or_loss_106
cnv_gene_107
cnv_gain_or_loss_107
cnv_gene_108
cnv_gain_or_loss_108
cnv_gene_109
cnv_gain_or_loss_109
cnv_gene_110
cnv_gain_or_loss_110
cnv_gene_111
cnv_gain_or_loss_111
cnv_gene_112
cnv_gain_or_loss_112
cnv_gene_113
cnv_gain_or_loss_113
cnv_gene_114
cnv_gain_or_loss_114
cnv_gene_115
cnv_gain_or_loss_115
cnv_gene_116
cnv_gain_or_loss_116
cnv_gene_117
cnv_gain_or_loss_117
cnv_gene_118
cnv_gain_or_loss_118
cnv_gene_119
cnv_gain_or_loss_119
cnv_gene_120
cnv_gain_or_loss_120
cnv_gene_121
cnv_gain_or_loss_121
cnv_gene_122
cnv_gain_or_loss_122
cnv_gene_123
cnv_gain_or_loss_123
cnv_gene_124
cnv_gain_or_loss_124
cnv_gene_125
cnv_gain_or_loss_125
cnv_gene_126
cnv_gain_or_loss_126
cnv_gene_127
cnv_gain_or_loss_127
cnv_gene_128
cnv_gain_or_loss_128
cnv_gene_129
cnv_gain_or_loss_129
cnv_gene_130
cnv_gain_or_loss_130
cnv_gene_131
cnv_gain_or_loss_131
cnv_gene_132
cnv_gain_or_loss_132
cnv_gene_133
cnv_gain_or_loss_133
cnv_gene_134
cnv_gain_or_loss_134
cnv_gene_135
cnv_gain_or_loss_135
cnv_gene_136
cnv_gain_or_loss_136
cnv_gene_137
cnv_gain_or_loss_137
cnv_gene_138
cnv_gain_or_loss_138
cnv_gene_139
cnv_gain_or_loss_139
cnv_gene_140
cnv_gain_or_loss_140
cnv_gene_141
cnv_gain_or_loss_141
cnv_gene_142
cnv_gain_or_loss_142
cnv_gene_143
cnv_gain_or_loss_143
cnv_gene_144
cnv_gain_or_loss_144
cnv_gene_145
cnv_gain_or_loss_145
cnv_gene_146
cnv_gain_or_loss_146
cnv_gene_147
cnv_gain_or_loss_147
cnv_gene_148
cnv_gain_or_loss_148
cnv_gene_149
cnv_gain_or_loss_149
cnv_gene_150
cnv_gain_or_loss_150
cnv_gene_151
cnv_gain_or_loss_151
cnv_gene_152
cnv_gain_or_loss_152
cnv_gene_153
cnv_gain_or_loss_153
cnv_gene_154
cnv_gain_or_loss_154
cnv_gene_155
cnv_gain_or_loss_155
cnv_gene_156
cnv_gain_or_loss_156
cnv_gene_157
cnv_gain_or_loss_157
cnv_gene_158
cnv_gain_or_loss_158
cnv_gene_159
cnv_gain_or_loss_159
cnv_gene_160
cnv_gain_or_loss_160
cnv_gene_161
cnv_gain_or_loss_161
cnv_gene_162
cnv_gain_or_loss_162
cnv_gene_163
cnv_gain_or_loss_163
cnv_gene_164
cnv_gain_or_loss_164
cnv_gene_165
cnv_gain_or_loss_165
cnv_gene_166
cnv_gain_or_loss_166
cnv_gene_167
cnv_gain_or_loss_167
cnv_gene_168
cnv_gain_or_loss_168
cnv_gene_169
cnv_gain_or_loss_169
cnv_gene_170
cnv_gain_or_loss_170
cnv_gene_171
cnv_gain_or_loss_171
cnv_gene_172
cnv_gain_or_loss_172
cnv_gene_173
cnv_gain_or_loss_173
cnv_gene_174
cnv_gain_or_loss_174
cnv_gene_175
cnv_gain_or_loss_175
cnv_gene_176
cnv_gain_or_loss_176
cnv_gene_177
cnv_gain_or_loss_177
cnv_gene_178
cnv_gain_or_loss_178
cnv_gene_179
cnv_gain_or_loss_179
cnv_gene_180
cnv_gain_or_loss_180
cnv_gene_181
cnv_gain_or_loss_181
cnv_gene_182
cnv_gain_or_loss_182
cnv_gene_183
cnv_gain_or_loss_183
cnv_gene_184
cnv_gain_or_loss_184
cnv_gene_185
cnv_gain_or_loss_185
cnv_gene_186
cnv_gain_or_loss_186
cnv_gene_187
cnv_gain_or_loss_187
cnv_gene_188
cnv_gain_or_loss_188
cnv_gene_189
cnv_gain_or_loss_189
cnv_gene_190
cnv_gain_or_loss_190
cnv_gene_191
cnv_gain_or_loss_191
cnv_gene_192
cnv_gain_or_loss_192
cnv_gene_193
cnv_gain_or_loss_193
cnv_gene_194
cnv_gain_or_loss_194
cnv_gene_195
cnv_gain_or_loss_195
cnv_gene_196
cnv_gain_or_loss_196
cnv_gene_197
cnv_gain_or_loss_197
cnv_gene_198
cnv_gain_or_loss_198
cnv_gene_199
cnv_gain_or_loss_199
cnv_gene_200
cnv_gain_or_loss_200
cnv_gene_201
cnv_gain_or_loss_201
cnv_gene_202
cnv_gain_or_loss_202
cnv_gene_203
cnv_gain_or_loss_203
cnv_gene_204
cnv_gain_or_loss_204
cnv_gene_205
cnv_gain_or_loss_205
cnv_gene_206
cnv_gain_or_loss_206
cnv_gene_207
cnv_gain_or_loss_207
cnv_gene_208
cnv_gain_or_loss_208
cnv_gene_209
cnv_gain_or_loss_209
cnv_gene_210
cnv_gain_or_loss_210
cnv_gene_211
cnv_gain_or_loss_211
cnv_gene_212
cnv_gain_or_loss_212
cnv_gene_213
cnv_gain_or_loss_213
cnv_gene_214
cnv_gain_or_loss_214
cnv_gene_215
cnv_gain_or_loss_215
cnv_gene_216
cnv_gain_or_loss_216
cnv_gene_217
cnv_gain_or_loss_217
cnv_gene_218
cnv_gain_or_loss_218
cnv_gene_219
cnv_gain_or_loss_219
cnv_gene_220
cnv_gain_or_loss_220
cnv_gene_221
cnv_gain_or_loss_221
cnv_gene_222
cnv_gain_or_loss_222
cnv_gene_223
cnv_gain_or_loss_223
cnv_gene_224
cnv_gain_or_loss_224
cnv_gene_225
cnv_gain_or_loss_225
cnv_gene_226
cnv_gain_or_loss_226
cnv_gene_227
cnv_gain_or_loss_227
cnv_gene_228
cnv_gain_or_loss_228
cnv_gene_229
cnv_gain_or_loss_229
cnv_gene_230
cnv_gain_or_loss_230
amplifications_number 0
amplifications_gene_1
amplifications_gene_2
amplifications_gene_3
amplifications_gene_4
amplifications_gene_5
amplifications_gene_6
amplifications_gene_7
amplifications_gene_8
amplifications_gene_9
amplifications_gene_10
amplifications_gene_11
amplifications_gene_12
amplifications_gene_13
amplifications_gene_14
amplifications_gene_15
amplifications_gene_16
amplifications_gene_17
amplifications_gene_18
amplifications_gene_19
amplifications_gene_20
amplifications_gene_21
amplifications_gene_22
amplifications_gene_23
amplifications_gene_24
amplifications_gene_25
amplifications_gene_26
amplifications_gene_27
amplifications_gene_28
amplifications_gene_29
amplifications_gene_30
genomics_add_notes Data imported via genomics package
gen_qcdash___daf
gen_qcdash___dac 1
gen_qcdash___dmf
gen_qcdash___dmqc
gen_qcdash___lpf
gen_qcdash___lpqc
gen_daf_note
gen_dmf_note
gen_lpf_note
genomics_complete

Table 2 - Output of genetex_to_redcap()


Import to REDCap®

The data in Table 2 are imported into REDCap® from the Shiny application by calling the function redcap_write_oneshot() from the REDCapR package. A portion of that form, populated with data imported from GENETEX, is shown in Supplemental Figure 1.

Supplemental Figure 1 - Example of the data form in REDCap® after the input text has been passed to genetex_to_redcap().
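
The import step can be sketched roughly as follows. This is a minimal illustration, not the package's source: the column names `field` and `value`, the `record_id` field, the helper names `to_wide_record()` and `write_genomics_record()`, and the environment variable names are all assumptions. Only `ds`, `redcap_uri` and `token` are taken from the documented arguments of redcap_write_oneshot().

```r
# Hypothetical two-column output in the shape of Table 2 (field name, value).
long <- data.frame(
  field = c("record_id", "amplifications_number",
            "genomics_add_notes", "gen_qcdash___dac"),
  value = c("1", "0", "Data imported via genomics package", "1"),
  stringsAsFactors = FALSE
)

# Pivot to a single wide row: a REDCap import expects one column per field.
to_wide_record <- function(long) {
  as.data.frame(as.list(setNames(long$value, long$field)),
                stringsAsFactors = FALSE)
}

wide <- to_wide_record(long)

# Wrapper around the REDCapR call. The URI and token are read from
# environment variables (assumed names) so that no credentials sit in code.
write_genomics_record <- function(wide,
                                  redcap_uri = Sys.getenv("REDCAP_URI"),
                                  token = Sys.getenv("REDCAP_TOKEN")) {
  if (!requireNamespace("REDCapR", quietly = TRUE)) {
    stop("The REDCapR package is required for the import step.")
  }
  REDCapR::redcap_write_oneshot(ds = wide, redcap_uri = redcap_uri, token = token)
}

# Not run here: requires a live REDCap project and API token.
# result <- write_genomics_record(wide)
```

Keeping the reshaping separate from the API call means the record can be inspected before anything is written to the project.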


Real-World Deployment

To evaluate the performance of augmented abstraction with GENETEX in the real-world setting compared with classical abstraction (i.e., “manual abstraction”), we selected 7 genomic reports at random (3 Guardant, 2 Foundation Medicine, 1 MGH SNaPshot, 1 BWH Oncopanel). Each report was abstracted independently by two data abstractors via manual abstraction, as well as with GENETEX. In total, 744 data elements were captured from these 7 reports. The agreement rate between the two human abstractors was 99.19%. Importantly, >99% agreement was also reached between each human abstractor and GENETEX (99.33% and 99.19%). Given that the agreement between classical and augmented abstraction was high, we next evaluated whether the GENETEX pipeline would improve time-to-analysis. The mean time for manual abstraction of each report was 784.5 (range: 220.5-3096.5) seconds compared with 136 (range: 75-216) seconds for augmented abstraction (Wilcoxon test p value = 0.015625).
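
The reported p value equals 2/2^7. That is the value an exact, two-sided, paired Wilcoxon signed-rank test returns for 7 pairs when every difference has the same sign and there are no ties. The sketch below reproduces that arithmetic in base R. The timings are made up for illustration and are not the study data, and the paired form of the test is assumed here, since the text does not specify it.

```r
# ILLUSTRATIVE timings in seconds for 7 reports; NOT the study data.
manual    <- c(300, 450, 620, 810, 990, 1500, 2800)
augmented <- c( 80,  95, 110, 130, 150,  180,  210)

# Paired, two-sided signed-rank test. With 7 untied, non-zero differences
# the exact null distribution is used.
res <- wilcox.test(manual, augmented, paired = TRUE, exact = TRUE)

# All 7 differences are positive, so the statistic takes its maximum value
# and the two-sided exact p value is 2 * (1/2)^7.
res$p.value
2 / 2^7
```

With only 7 pairs, 2/2^7 is the smallest two-sided p value this test can produce, so it indicates that augmented abstraction was faster for every report in such a comparison.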

Limitations and Solutions

GENETEX is designed to be used in conjunction with the Genomics Instrument and is thus dependent on that form being installed in a REDCap® project. However, we have made the data dictionary freely available so that others may incorporate it into their own projects. Importantly, our regular-expression system for mapping variable names to key genomic data provides a high degree of flexibility to map alternate variable names for REDCap® instruments with different data dictionaries. An additional limitation is that, at this time, GENETEX does not support all potential platforms available for CGD. However, because it is open source, external developers can submit pull requests on GitHub to incorporate additional platforms and future refinements. Lastly, genomic findings abstracted from individual reports describe results at the point in time at which they were provided; they are a historical record and may not represent the latest understanding of genomic science or precision oncology treatment paradigms.
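
As a rough illustration of how regular expressions could map Genomics Instrument field names onto a different data dictionary, consider the sketch below. The target names (`copy_number_gene_`, `copy_number_direction_`, `amp_gene_`) and the helper `remap_fields()` are hypothetical and are not part of GENETEX. The source names follow Table 2.

```r
# Field names as they appear in Table 2.
fields <- c("cnv_gene_111", "cnv_gain_or_loss_111",
            "amplifications_gene_2", "genomics_add_notes")

# HYPOTHETICAL naming scheme for another data dictionary. Each pattern
# captures the numeric suffix so that linked fields keep the same index.
rules <- c(
  "^cnv_gene_(\\d+)$"            = "copy_number_gene_\\1",
  "^cnv_gain_or_loss_(\\d+)$"    = "copy_number_direction_\\1",
  "^amplifications_gene_(\\d+)$" = "amp_gene_\\1"
)

# Apply each pattern in turn; names that match no pattern pass through unchanged.
remap_fields <- function(x, rules) {
  for (i in seq_along(rules)) {
    x <- sub(names(rules)[i], rules[[i]], x, perl = TRUE)
  }
  x
}

mapped <- remap_fields(fields, rules)
mapped
```

Capturing the suffix rather than hard-coding each field keeps the rule set short even when an instrument defines hundreds of numbered fields.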

Conclusions

Ideally, structured data objects that exactly match clinical genomic reports would be universally available directly from the providers of genomic profiling reports, and they may be at some point in the future. Until then, there may be a desire to abstract or otherwise process these reports into structured formats. GENETEX is a browser-based application for natural language processing of CGD obtained in routine clinical practice. It facilitates extraction of data from the EHR, transformation of semi-structured data into a structured format and loading into REDCap®. Its Shiny extension enables non-R users to execute the package without familiarity with R. Real-world deployment of GENETEX demonstrated excellent agreement with classical abstraction in roughly 1/5 of the time. Thus, augmented abstraction with browser-based applications can lower the barrier to data capture and, importantly, improve time-to-analysis of clinico-genomic data.

References

1. Harris, P. A. et al. Research electronic data capture (REDCap)–a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics 42, 377–381 (2009).
2. Shalhout, S. & Miller, D. Optimizing real-world data collection: Clinical genomics. (2020).
3. Miller, D. & Shalhout, S. Optimizing real-world data collection: Genomics electronic data capture instrument. (2021).
4. Project Data Sphere. https://www.projectdatasphere.org/research/programs/rare-tumor-registries/merkel-cell-carcinoma.
5. Wickham, H., Danenberg, P. & Eugster, M. roxygen2: In-source documentation for R. R package Version 7.1.1, https://cran.r-project.org/package=roxygen2 (2013).
6. Wickham, H., Francois, R., Henry, L. & Muller, K. dplyr: A grammar of data manipulation. R package Version 1.0.5, (2021).
7. Wickham, H. tidyr: Tidy messy data. R package Version 1.1.3, (2013).
8. Wickham, H., Hester, J. & Francois, R. readr: Read rectangular text data. R package Version 1.4.0, (2020).
9. Wickham, H. stringr: Simple, consistent wrappers for common string operations. R package Version 1.4.0, (2019).
10. Henry, L. & Wickham, H. purrr: Functional programming tools. R package Version 0.3.4, (2020).
11. Beasley, W. et al. REDCapR: Interaction between R and REDCap. R package Version 0.11.0, (2020).
12. Bache, S. M., Wickham, H. & Henry, L. magrittr: A forward-pipe operator for R. R package Version 2.0.1, (2020).
13. Mahto, A. splitstackshape: Stack and reshape datasets after splitting concatenated values. R package Version 1.4.8, (2019).
14. Chang, W., Cheng, J., Allaire, J. J., Xie, Y. & McPherson, J. shiny: Web application framework for R. R package Version 1.6.0, (2018).
15. Guardant Health. https://guardanthealth.com.
16. Foundation Medicine. https://www.foundationmedicine.com.
17. Massachusetts General Hospital Center for Integrated Diagnostics. https://www.massgeneral.org/pathology/services/center-for-integrated-diagnostics-default.
18. Brigham and Women’s Hospital Center for Advanced Molecular Diagnostics. https://www.brighamandwomens.org/pathology/center-for-advanced-molecular-diagnostics/molecular-diagnostics-laboratory.