Optimizing Real-World Data Collection: Clinical Genomics


Sophia Z. Shalhout, David M. Miller


September 12, 2020

Facilitate capture of real-world next-generation genomic data


Clinico-Genomics Databases

Overview of REDCap

Capturing Tumor Sequencing/Genomics Data:

Key Points

  • In this post we offer tips and recommendations for designing a genomics data capture instrument for use in a REDCap clinico-genomics database.
  • We recommend design choices that avoid downstream pitfalls on the back end.
  • After this monograph, you will:
    • Know what aspects to consider in the design and deployment of a genomics-based instrument for capturing clinically acquired next-generation sequencing (NGS) results from the EHR into the registry.
    • Understand how to routinely update the instrument with newly added genes/features as targeted sequencing panels expand to cover more genes that prove to be relevant over time.
  • Skill Level: Intermediate. Knowledge of REDCap instrument and database design and basic data wrangling in R is required.

Aim 1: Accommodating Several Sequenced Tissues per Record

  • One record may have several tissues sequenced, for example a primary tumor and later a metastasis. We will therefore turn on REDCap's repeating instruments function for the Genomics Capture instrument by enabling repeatable instruments in the Project Setup.

  • In the event that more than one instrument is created per record for different tissue samples, the ability to label each tissue by the specific lesion is important. Providing a Tissue Lesion Identifier is essential. However, throughout all the instruments in any clinico-genomics database, consistency aids back end analysis. It is useful to have the same lesion identified with the same identifier or tag in every instrument instance.
  • For example, if a lesion on the upper right extremity has an instance in the Pathology instrument and the Genomics instrument, it is most useful to have the same lesion labeled the same way for correct association on the back end and to facilitate analysis.
  • One way to promote accuracy and consistency in the exact nomenclature is to use REDCap’s piping feature and pipe the labels given in one instrument into the other instruments.
  • This helps ensure that the abstractor uses exactly the same tag name for a lesion every time.
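
Consistent tags can also be verified after export. The following is a minimal sketch in base R, assuming a long-format export with one row per instrument instance; the column names `record_id`, `instrument`, and `lesion_tag` and the tag values are hypothetical and will differ in a real project.

```r
# Hypothetical long-format export: one row per instrument instance.
# Column names and tag values are illustrative only.
export <- data.frame(
  record_id  = c(1, 1, 2, 2),
  instrument = c("pathology", "genomics", "pathology", "genomics"),
  lesion_tag = c("RUE-01", "RUE-01", "Scalp-01", "scalp 1"),
  stringsAsFactors = FALSE
)

pathology <- export[export$instrument == "pathology", ]
genomics  <- export[export$instrument == "genomics", ]

# A genomics tag is "unmatched" when the same record has no pathology
# instance carrying exactly the same tag.
path_keys <- paste(pathology$record_id, pathology$lesion_tag, sep = "|")
gen_keys  <- paste(genomics$record_id, genomics$lesion_tag, sep = "|")
unmatched <- genomics[!(gen_keys %in% path_keys), ]

print(unmatched)
```

Any row returned in `unmatched` points to a lesion whose tag was typed differently in the two instruments and should be reviewed during QC.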

Aim 2: Ensure Record Capture Completeness

  • It is unlikely every record will have tissue that undergoes institutional next generation sequencing. Therefore, it is important to add a “Yes/No” field to indicate if genomic analysis was performed on a tumor specimen.
  • This allows the abstractor to answer “No” after thoroughly searching the EHR chart. A “No” recorded at the current time gives assurance that the record has been searched and that no genomics data are still waiting to be entered.
  • This is beneficial for the abstractors performing a second pass abstraction, as well as for the database manager and the QC aspect of the database pipeline.
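
As an illustration of how the Yes/No field supports QC, here is a small sketch in base R. It assumes a per-record summary with a hypothetical `ngs_performed` field and a hypothetical count of genomics instrument instances, `n_genomics_inst`; neither name comes from REDCap itself.

```r
# Hypothetical per-record summary; field names are illustrative only.
records <- data.frame(
  record_id       = 1:4,
  ngs_performed   = c("Yes", "No", NA, "Yes"),
  n_genomics_inst = c(2, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Classify each record so that "no data" can be told apart from "not searched".
records$status <- ifelse(
  is.na(records$ngs_performed), "not yet reviewed",
  ifelse(
    records$ngs_performed == "No", "reviewed, no NGS",
    ifelse(records$n_genomics_inst > 0, "complete", "NGS data missing")
  )
)

print(records)
```

Records flagged "NGS data missing" were marked as sequenced but have no genomics instance yet, which makes them natural candidates for a second-pass abstraction.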

Aim 3: Unique identifiers

  • Create fields to record the type of tissue sequenced (as a dropdown) and the date the sequencing was performed.
  • This will allow for unique identifiers for each sequencing date, for each tissue specimen within each record.
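
On the back end, these fields can be combined into a single key per sequenced specimen. The sketch below uses base R and hypothetical column names (`record_id`, `tissue_type`, `seq_date`); it also flags keys that occur more than once, which may indicate a duplicated instance.

```r
# Hypothetical export of genomics instrument instances.
instances <- data.frame(
  record_id   = c(1, 1, 1, 2),
  tissue_type = c("Primary", "Metastasis", "Primary", "Primary"),
  seq_date    = as.Date(c("2019-03-01", "2020-01-15", "2019-03-01", "2020-06-30")),
  stringsAsFactors = FALSE
)

# Build one identifier per record, tissue, and sequencing date.
instances$specimen_key <- paste(
  instances$record_id,
  instances$tissue_type,
  format(instances$seq_date, "%Y-%m-%d"),
  sep = "_"
)

# Keys that appear more than once deserve a manual check.
dup_keys <- unique(instances$specimen_key[duplicated(instances$specimen_key)])
print(dup_keys)
```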

Aim 4: Targeted Assay version-year

  • Create a dropdown menu that allows for entering the type of targeted NGS assay performed.
  • This is essential since different platforms have different coverage and sequence different gene panels.
  • Furthermore, recording the assay reveals which genes were wildtype as opposed to never sequenced.
  • Most sequencing platforms report the altered genes, not every gene on the panel that came back wildtype.
  • The version/year of the sequencing platform is also important since different versions of the same platform sequence different genes. For example, over the last five years, some targeted NGS assays have added over 200 new clinically relevant genes to their panels.
  • The full list of genes included in each platform should be maintained for back end use.
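
The distinction between wildtype and never sequenced can be sketched in base R as follows. The panel contents, the version labels, and the helper `gene_status()` are all hypothetical; real gene lists should come from the assay documentation. The sketch also assumes that the report lists every alteration detected on the panel.

```r
# Hypothetical gene lists per assay version.
panels <- list(
  "Assay A v1" = c("EGFR", "ALK", "TP53"),
  "Assay A v2" = c("EGFR", "ALK", "TP53", "TERT")
)

# Classify one gene for one report, given the genes reported as altered.
gene_status <- function(gene, assay_version, altered_genes) {
  panel <- panels[[assay_version]]
  if (is.null(panel)) stop("Unknown assay version: ", assay_version)
  if (gene %in% altered_genes) {
    "altered"
  } else if (gene %in% panel) {
    "wildtype"        # on the panel, but not reported as altered
  } else {
    "not sequenced"   # not covered by this assay version
  }
}

gene_status("TERT", "Assay A v1", altered_genes = "TP53")
gene_status("TERT", "Assay A v2", altered_genes = "TP53")
```

The same gene and the same report content give different answers depending on the assay version, which is why the version/year field matters for analysis.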

Aim 5: Which gene options to include?

  • Once the repeatable genomics instrument and the specimen-identifying dropdown fields are created, it’s essential to decide which genes will be pre-populated in dropdown menus as options for entering genomics data abstracted from the EHR.

  • As previously mentioned, after identifying the main NGS platforms typically used, curate a total gene list of all the genes tested in all the panels. This will serve as the Master Gene List. Many genes will appear in more than one assay, and some will appear under different nomenclature. R can be used to curate one final list.

  • For example, let’s create a simplified list for demonstration purposes.

Load Libraries

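library(dplyr, warn.conflicts=FALSE)   # Useful for manipulating the dataframes
library(reshape2)                      # Provides melt() for reshaping to long format
library(knitr)                         # Provides kable() for rendering tables
library(kableExtra)                    # Provides kable_styling()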
Let’s create a data set of assays and associated genes
  • This is an overly simplified number of assays with reduced lists for demonstration
  • In reality over ten different assays may be used each with 100-800 different genes
Master_Gene_List <- data.frame(
  ASSAY = c("Assay A", "Assay B", "Assay C", "Assay D"),
  Gene1 = c("EGFR", "EGFR", "ALK", "BRAF"),
  Gene2 = c("ALK", "PIK3CA", "ARID1A", "TP53"),
  Gene3 = c("TP53", "B-raf", "TERT", "SMARCF1"),
  stringsAsFactors = FALSE   # keep gene symbols as character strings
)
Let’s view the list of assays and associated genes
kable(Master_Gene_List) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
ASSAY Gene1 Gene2 Gene3
Assay A EGFR ALK TP53
Assay B EGFR PIK3CA B-raf
Assay C ALK ARID1A TERT
Assay D BRAF TP53 SMARCF1
  • Notice that different assays may contain the same exact genes. Furthermore, different assays may refer to the same genes in their respective panels but with different nomenclature or symbols.

  • Therefore, some tidying of the master list is needed. Once we have a finalized list of genes, it can be used in the dropdown menu in the REDCap genomics instrument.

  • Alternatively, each gene list can be kept as is and once a user clicks on the assay used for the sequencing, branching logic can be implemented and only allow dropdown menus with those assay-specific genes

  • However, we have found it easier to update the instrument with new genes using one master list, as opposed to new individual assay lists

    • Furthermore, QC corrections are easier to make with one list than with branching logic
      • For example, if the assay type or version was entered incorrectly and is switched to the correct one during QC, the gene data already entered in the fields that depend on that branching logic will be tied to the wrong assay and will be deleted
      • These gene lists are long and it is easier to maintain one main gene list than separate ones for separate assays
  • Note that several paths work and the reader is urged to test what works best for their system.

  • If the user chooses to have the assay specific list, then stop here. Once an assay is chosen, the assay-specific dropdown menu should branch off and provide the abstractor with the specific assay associated gene list. These genes should be the only ones available for that NGS-specific assay.

  • For example, if the abstractor chooses “Assay A,” three fields with dropdown menus for indels, fusions, and SNVs will only include the genes associated with that assay. Note in this example, that would be ‘EGFR’, ‘ALK’, and ‘TP53’

    • This Assay Dependent Gene List method will still require genes with different names/symbols to be wrangled and replaced with one common preferred symbol/name to facilitate analysis.
  • Below, we outline how to create one master list, remove duplications, and ease data abstractor gene choices. Remember, with updated panels, some adopt new nomenclature of the genes instead of alias/archaic names

  • Thus, it is important to consider UX/UI so that data entry is both easy and accurate.
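
For readers who choose the assay-specific approach, the per-assay gene lists can be derived from the same demonstration table. This is a minimal base R sketch and not a REDCap feature; the resulting vectors would still need to be entered as the choices of each assay-specific dropdown field.

```r
# Same demonstration table as in the text; assays and genes are illustrative.
Master_Gene_List <- data.frame(
  ASSAY = c("Assay A", "Assay B", "Assay C", "Assay D"),
  Gene1 = c("EGFR", "EGFR", "ALK", "BRAF"),
  Gene2 = c("ALK", "PIK3CA", "ARID1A", "TP53"),
  Gene3 = c("TP53", "B-raf", "TERT", "SMARCF1"),
  stringsAsFactors = FALSE
)

gene_cols <- c("Gene1", "Gene2", "Gene3")

# Named list: one character vector of genes per assay.
assay_genes <- lapply(
  split(Master_Gene_List[gene_cols], Master_Gene_List$ASSAY),
  function(rows) unname(unlist(rows))
)

assay_genes[["Assay A"]]
```

Each list element holds only the genes of one assay, so the dropdown shown after the abstractor selects "Assay A" would offer just those symbols.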

Creating one Master Gene List

  • For one master list, we will next remove the duplicated genes.
  • This Master list will be used in the dropdown menu for indels, fusions, and SNVs for the abstractor to enter genomics data, independent of the assay selection.
  • Since the Master Gene List is independent of the assay, we will melt the data frame into long format (one row per assay-gene pair) and remove duplications.
Genes <- reshape2::melt(Master_Gene_List %>% select(ASSAY, Gene1, Gene2, Gene3), id.vars = "ASSAY")

Let’s view the melted table

 kable(Genes) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) 
ASSAY variable value
Assay A Gene1 EGFR
Assay B Gene1 EGFR
Assay C Gene1 ALK
Assay D Gene1 BRAF
Assay A Gene2 ALK
Assay B Gene2 PIK3CA
Assay C Gene2 ARID1A
Assay D Gene2 TP53
Assay A Gene3 TP53
Assay B Gene3 B-raf
Assay C Gene3 TERT
Assay D Gene3 SMARCF1

Now let’s keep only distinct genes in the ‘value’ column and view this new table

 kable(Genes %>% distinct(value, .keep_all= TRUE)) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) 
ASSAY variable value
Assay A Gene1 EGFR
Assay C Gene1 ALK
Assay D Gene1 BRAF
Assay B Gene2 PIK3CA
Assay C Gene2 ARID1A
Assay D Gene2 TP53
Assay B Gene3 B-raf
Assay C Gene3 TERT
Assay D Gene3 SMARCF1
  • As you can see, the duplicated EGFR, ALK, and TP53 entries have been removed. Again, this Master Gene List is independent of the assay, so only the ‘value’ column will be added to the dropdown menu lists of the indel, SNV, and fusion fields in the REDCap genomics instrument.
  • Note that “BRAF” and “B-raf” are the same gene but appear with different nomenclature/symbols. We have found that abstractors are not necessarily familiar with all annotations/symbols of every gene. A simple solution is to include the alternative symbols in the label, so that whether the abstractor starts to type “BR” or “b-” in the dropdown field, only one correct option appears in REDCap.
  • So, let’s use the single label “BRAF (B-raf)” for both, so that the data abstractor chooses the same gene whether entering ‘B-raf’ mutations from Assay B or “BRAF” mutations detected by Assay D.
Genes$value <- replace(Genes$value, Genes$value == "B-raf", "BRAF (B-raf)")
  • Let’s drop the standalone “BRAF” value
Genes<-subset(Genes, value != "BRAF" )
  • Let’s view the new table
 kable(Genes %>% distinct(value, .keep_all= TRUE)) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) 
ASSAY variable value
Assay A Gene1 EGFR
Assay C Gene1 ALK
Assay B Gene2 PIK3CA
Assay C Gene2 ARID1A
Assay D Gene2 TP53
Assay B Gene3 BRAF (B-raf)
Assay C Gene3 TERT
Assay D Gene3 SMARCF1
  • Note that adjusting this list is simple when only a few rows exist. We could have used “ifelse” statements in R, or replaced both values with “BRAF (B-raf)” using partial matches and then removed duplications.
    • However, with a list of thousands of genes, this is not trivial. Also, note that partial matches may alter other, unanticipated gene names when dealing with larger data sets.
    • Furthermore, even in this small gene list, only a curator familiar with gene symbols would have noticed that ‘ARID1A’ has an alias, ‘SMARCF1’.
#Thus we can adjust this as before.

Genes$value <- replace(Genes$value, Genes$value == "SMARCF1", "ARID1A (SMARCF1)")

#Let's drop the ARID1A alone value
Genes1 <- subset(Genes, value != "ARID1A")

#Let's view the new table

kable(Genes1 %>% distinct(value, .keep_all = TRUE)) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
ASSAY variable value
Assay A Gene1 EGFR
Assay C Gene1 ALK
Assay B Gene2 PIK3CA
Assay D Gene2 TP53
Assay B Gene3 BRAF (B-raf)
Assay C Gene3 TERT
Assay D Gene3 ARID1A (SMARCF1)
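
Once the labels are final, they have to be turned into dropdown choices. The sketch below assumes the pipe-separated `code, label` form that, to our understanding, REDCap data dictionary files use for choices; `make_choices()` is a hypothetical helper and the "updated panel" is made up. New genes are appended rather than re-sorted so that codes already stored in the database keep their meaning.

```r
# A few labels in the style of the demonstration above.
final_genes <- c("EGFR", "ALK", "TP53", "BRAF (B-raf)")

# Hypothetical helper: pair each label with a numeric code.
make_choices <- function(labels, codes = seq_along(labels)) {
  paste(paste(codes, labels, sep = ", "), collapse = " | ")
}

choices_v1 <- make_choices(final_genes)

# A made-up panel update: only genes not yet on the list are appended,
# so the codes of existing choices do not change.
new_panel_genes <- c("TP53", "NRAS", "KIT")
genes_to_add    <- setdiff(new_panel_genes, final_genes)
choices_v2      <- make_choices(c(final_genes, genes_to_add))

cat(choices_v1, "\n")
cat(choices_v2, "\n")
```

Keeping the earlier choices as an unchanged prefix of the updated string is a simple way to confirm that an update did not renumber existing genes.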

HGNC Human Gene Nomenclature and Aliases

  • A much more practical approach is to download an annotated gene list with symbols and aliases.
  • Use the downloaded list to curate the final Master Gene List.
  • The downloaded list can be used for search and replace or as a lookup table.
  • Furthermore, always provide an “other” option with a branching free-text field for unanticipated genes that may need to be collected before updates are completed.
  • We prefer downloading sources located on the HGNC (HUGO Gene Nomenclature Committee) site, which provides several downloadable lists of approved human gene nomenclature.
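
The lookup-table idea can be sketched in base R as follows. The small `alias_lookup` table is a hand-written, hypothetical stand-in for a downloaded nomenclature file, whose actual column names and contents will differ. Exact matching with `match()` is used because, as the last lines illustrate, a partial match on "raf" would also catch ARAF, which is a different gene.

```r
# Hypothetical stand-in for a downloaded alias table.
alias_lookup <- data.frame(
  alias    = c("B-raf", "SMARCF1"),
  approved = c("BRAF", "ARID1A"),
  stringsAsFactors = FALSE
)

# Gene names as they might appear across assay panels.
raw_genes <- c("EGFR", "B-raf", "BRAF", "SMARCF1", "ARID1A", "ARAF")

# Exact lookup: replace known aliases, keep everything else unchanged.
idx      <- match(raw_genes, alias_lookup$alias)
approved <- ifelse(is.na(idx), raw_genes, alias_lookup$approved[idx])
master   <- unique(approved)
print(master)

# Partial matching, for contrast: "raf" also hits ARAF.
partial_hits <- raw_genes[grepl("raf", raw_genes, ignore.case = TRUE)]
print(partial_hits)
```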