# Survival Analysis in R

Author

David M. Miller, Sophia Z. Shalhout

Published

April 13, 2020

Abstract
In this post we provide a user guide on how to analyze Survival Data in R

# Key points

• This post provides a resource for navigating and applying the Survival Tools available in `R`.
• We provide an overview of time-to-event Survival Analysis in Clinical and Translational Research (CT Research).
• We outline the steps to creating Kaplan-Meier Curves and visualizing Hazard Ratios with Forest Plots and provide pearls on how to effectively analyze and plot data sets intended for Survival Analysis.
• We briefly describe the statistics behind Survival Analysis. Readers are referred to an abundance of resources available that provide a deeper explanation of the principles behind the statistical tests, which is beyond the scope of this post.
• Skill Level: Intermediate
• Assumptions made by this post is that readers have some familiarity with basic biostatistics and `R`.
• Additional resources on various elements of Survival Analysis include: Fox and Weisberg, and Rickert

## Background on Survival Analysis

• Survival Analysis is one the most common types of time-to-event data analysis in medical research
• Common objectives in Survival Analysis are:
1. Comparing the survival distribution between groups
2. Estimating the survival distribution
3. Modeling the effect of explanatory variables on the outcome variable
• In Clinical and Translational Research (CT Research) there are numerous applications for time-to-event analyses and various tools to conduct the above common objectives
• For example, in clinical trials, the two main analytic tasks in time-to-event evaluation are to:
1. Test the equality of the treatment groups
2. Estimate the magnitude of the treatment effect
• In CT Research the estimate of the magnitude of the treatment effect is often evaluated in the context of the estimation of risk of that treatment
• For example, in order for a labeled claim to be approved by the U.S. Food and Drug Administration, the sponsor of a Investigational New Drug must establish that the proposed agent demonstrates a “positive risk-benefit assessment”
• As Hajim Uno et al. describe it, when these two tasks are combined, one can use a “test/estimation” approach to evaluating time-to-event data
• Of note, there are different approaches to perform “test/estimation” analyses, though the combination of the log-rank test with the estimation of a Harzard Ratio is by far the most popular in medical research
• Therefore, in this post, we will focus on how to perform these two tests on time-to-event data in `R`

### Handling Dates

• Time-to-event data is more complex than traditional continuous outcomes because two or three pieces of information are involved
• In the simple case of a time-to-event when every subject starts at the same point (i.e. the beginning of a clinical trial), the two pieces of information are the time-to-event and an indicator of whether the event is a failure time or a censoring time
• Often, survival data start as calendar dates rather than as survival times, and then we must convert dates into a usable form for R before we can complete any analysis.
• If you are going to use Dates, they should be in YYYY-Month-Day format
• The `as.Date()` function can be applied to convert numbers in various charactor strings (e.g. “02/27/92”) into recognizable date formats
• `as.Date("02/27/92", "%m/%d/%y")`
• Often though, you will be using a discete time interval in days

# Load Packages and Data Sets

## Load the `survival` package

• You need the survival package (which was created by Terry Therneau at Mayo)
``````library(devtools)
install_github('therneau/survival',dependencies=TRUE)
library(survival)``````

``````library(tidyverse)
library(knitr)
library(survminer)
library(survival)``````

### Overview

• We will use two main data sets for this tutorial which are built-in datasets:
• the `veteran` Data Set
• the `colon` Data Set
• Let’s start by viewing the `veteran` data set
``````veteran <- veteran
trt celltype time status karno diagtime age prior
1 squamous 72 1 60 7 69 0
1 squamous 411 1 70 5 64 10
1 squamous 228 1 60 3 38 0
1 squamous 126 1 60 9 63 10
1 squamous 118 1 70 11 65 10
1 squamous 10 1 20 5 49 0

Definition of Variables in Veteran Data Set
trt: 1=standard, 2=test
time: survival time
status: censoring status
karno: Karnofsky performance score (100 = Normal; 0 = Dead)
diagtime: months from diagnosis to randomisation
age: in years
prior: prior therapy 0=no, 10=yes

# Kaplan-Meier Estimator

### Overview

• The Kaplan-Meier curve is a nonparametric estimator of the survival distribution (i.e. the “estimation” component of the “test/estimation” approach to analysis of time-to-event data)
• In their June 1958 paper in the Journal of the American Statistical Association, E.L. Kaplan and Paul Meier proposed a way to nonparametrically estimate S(t), even in the presence of censoring

### Analyze Survival Data in R

#### `Surv()` Function

• The `Surv()` function takes the following arguments:

function (time, time2, event, type = c(“right”, “left”, “interval”, “counting”, “interval2”, “mstate”), origin = 0)

• To use the functions in the `survival` library, we will have to specify both the “survival time” and the “failure indicator” in the `Surv()` function
• When we use the `Surv()` function, we specify the time variable first and the failure indicator second
• In R, the failure indicator should equal 1 for subjects with the event and equal 0 for subjects who are right censored
• You can type `time =` and `event =` as we will see below

#### Build Surv Object

``````# Generate Surv Object
veteran_Surv <- Surv(time = veteran\$time, event = veteran\$status)``````
• In the above, the variable “time” is the time variable and “status” connotes whether or not a subject had the event (“status = 1”) or was right censored (“status = 0”)
``````# View Surv Object
``````   72  411  228  126  118   10   82  110  314  100+  42    8  144   25+  11
  30  384    4   54   13  123+  97+ 153   59  117   16  151   22   56   21
  18  139   20   31   52  287   18   51  122   27   54    7   63  392   10
   8   92   35  117  132 ``````
• In the list above, each time that has a “+” connotes that it was censored in the analysis

#### Analyze the Survival Data with the `survfit()` function

• To analyze the data we use the `survfit()` function, in which you will place the Surv Object of interest (here `veteran_Surv`) followed by a “~” and a `predictor`.
• If you want a single curve, with no specific predictor, use “1”.
• If there is a `predictor variable` for which you want to compare the outcome of, you will place that variable to the right of the “~”.
• Then place the data you want evaluated to the right of the “,”, as below.

### Generate a KM Curve

#### Overview

• First let’s generate a KM Curve wiht No Predictor
• We will then show how to plot that basic curve
• Then we will demontrate how to generate a KM Curve stratifed by a Explanatory Variable (aka a “Predictor”)
``````# Generate Survfit Object
veteran_Survfit <- survfit(veteran_Surv ~ 1, data = veteran)

# View Survfit Analysis at various time points
summary(veteran_Survfit, times = c(1, 50, 100*(1:10)))``````
``````Call: survfit(formula = veteran_Surv ~ 1, data = veteran)

time n.risk n.event survival std.err lower 95% CI upper 95% CI
1    137       2    0.985  0.0102      0.96552       1.0000
50     84      50    0.619  0.0416      0.54299       0.7064
100     55      27    0.418  0.0425      0.34251       0.5101
200     25      26    0.205  0.0360      0.14560       0.2895
300     13      10    0.117  0.0295      0.07147       0.1917
400      6       7    0.054  0.0211      0.02509       0.1163
500      4       2    0.036  0.0175      0.01389       0.0934
600      2       2    0.018  0.0126      0.00459       0.0707
700      2       0    0.018  0.0126      0.00459       0.0707
800      2       0    0.018  0.0126      0.00459       0.0707
900      2       0    0.018  0.0126      0.00459       0.0707``````
• The “times” argument in the “summary()” function allows you to control which time intervals you print
• If you want to view the output of the analysis of the total output, use the following call
• Of note, you could have also used the following code to generate the same survfit object `veteran_Survfit <- survfit(Surv(time = veteran\$time, event = veteran\$status) ~ 1, data = veteran)`
``veteran_Survfit``
``````Call: survfit(formula = veteran_Surv ~ 1, data = veteran)

n events median 0.95LCL 0.95UCL
[1,] 137    128     80      52     105``````

You can also view the median of the data with the following code

``median(veteran\$time)``
`` 80``

# Visualizing Survival Data

### Overview

• You can plot your survival data with base `R` or other functions such as `ggsurvplot()`

### Plot Data Using Base `R`

• The `plot()` function can be used to take a surv object and generate a basic Kaplan-Meier curve, as below
``````plot(veteran_Surv,
main = "Overall Survival", #plot title
sub = "Unstratified Cohort", # Subtitle
xlab="Days",
ylab="Survival Probability" )`````` ### Plot Data Using `ggsurvplot`

• The `ggsurvplot` can generate visually appealing plots
``````# Kaplan-Meier using ggsurvplot
ggsurvplot(fit = veteran_Survfit,
data = veteran,
####### Format Title #######
title = "Overall Survival",
subtitle = "Unstratified Cohort",
font.title = c(22, "bold", "black"),
ggtheme = theme_minimal() + theme(plot.title = element_text(hjust = 0.5, face = "bold"))+
theme(plot.subtitle = element_text(hjust = 0.5, size = 16, face = "italic")), # center and bold title and subtitle
####### Format Axes #######
xlab="Days", # changes xlabel,
ylab = "Survival Probability",
font.x=c(18,"bold"), # changes x axis labels
font.y=c(18,"bold"), # changes y axis labels
font.xtickslab=c(14,"plain"), # changes the tick label on x axis
font.ytickslab=c(14,"plain"),
####### Format Curve Lines #######
color = "black",
####### Censor Details ########
censor.shape="|",
censor.size = 5,
####### Confidence Intervals ########
conf.int = TRUE, # To Remove conf intervals use "FALSE"
conf.int.fill = "purple", # fill color to be used for confidence interval
surv.median.line = "hv", # allowed values include one of c("none", "hv", "h", "v"). v: vertical, h:horizontal
######## Format Legend #######
legend.title = "All Patients",
legend.labs = "All Patients", # Change the Strata Legend
######## Risk Table #######
risk.table = TRUE, # Adds Risk Table
risk.table.height = 0.25, # Adjusts the height of the risk table (default is 0.25)
risk.table.fontsize = 4.5
)`````` # KM Curves with a Predictor

### Overview

• This is typically a more common use of the KM method, where one wants to test a hypothesis that a drug prolongs Overall Survival, for example

### Stratify Analysis Based on Treatment (Standard vs. Experimental)

``````# Generate Survfit_trt Object
veteran_Survfit_trt <- survfit(veteran_Surv ~ veteran\$trt, data = veteran)

# View Survfit Analysis at various time points
summary(veteran_Survfit_trt, times = c(1, 50, 100*(1:10)))``````
``````Call: survfit(formula = veteran_Surv ~ veteran\$trt, data = veteran)

veteran\$trt=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1     69       0   1.0000  0.0000      1.00000        1.000
50     46      22   0.6797  0.0563      0.57782        0.800
100     34      12   0.5020  0.0606      0.39615        0.636
200     12      19   0.1947  0.0501      0.11761        0.322
300      5       6   0.0885  0.0371      0.03896        0.201
400      2       3   0.0354  0.0244      0.00917        0.137
500      1       1   0.0177  0.0175      0.00256        0.123

veteran\$trt=2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1     68       2   0.9706  0.0205      0.93125        1.000
50     38      28   0.5588  0.0602      0.45244        0.690
100     21      15   0.3326  0.0578      0.23670        0.467
200     13       7   0.2162  0.0517      0.13538        0.345
300      8       4   0.1464  0.0454      0.07973        0.269
400      4       4   0.0732  0.0344      0.02912        0.184
500      3       1   0.0549  0.0303      0.01861        0.162
600      2       1   0.0366  0.0251      0.00953        0.140
700      2       0   0.0366  0.0251      0.00953        0.140
800      2       0   0.0366  0.0251      0.00953        0.140
900      2       0   0.0366  0.0251      0.00953        0.140``````
``veteran_Survfit_trt``
``````Call: survfit(formula = veteran_Surv ~ veteran\$trt, data = veteran)

n events median 0.95LCL 0.95UCL
veteran\$trt=1 69     64  103.0      59     132
veteran\$trt=2 68     64   52.5      44      95``````
``````# Kaplan-Meier using ggsurvplot
ggsurvplot(fit = veteran_Survfit_trt,
data = veteran,
####### Format Title #######
title = "Overall Survival",
subtitle = "Stratified By Treatment",
font.title = c(22, "bold", "black"),
ggtheme = theme_classic() + theme(plot.title = element_text(hjust = 0.5, face = "bold"))+ # theme_classic will give a white background with no lines on the plot
theme(plot.subtitle = element_text(hjust = 0.5, size = 16, face = "italic")),
####### Format Axes #######
xlab="Days", # changes xlabel,
ylab = "Survival Probability",
font.x=c(18,"bold"), # changes x axis labels
font.y=c(18,"bold"), # changes y axis labels
font.xtickslab=c(14,"plain"), # changes the tick label on x axis
font.ytickslab=c(14,"plain"),
####### Format Curve Lines #######
palette = c("red","black"),
####### Censor Details ########
censor = TRUE, # logical value. If TRUE, censors will be drawn,
censor.shape="|",
censor.size = 5,
####### Confidence Intervals ########
conf.int = TRUE, # To Remove conf intervals use "FALSE"
conf.int.fill = "purple", # fill color to be used for confidence interval
surv.median.line = "hv", # allowed values include one of c("none", "hv", "h", "v"). v: vertical, h:horizontal
######## Format Legend #######
legend = "none", # If you'd prefer more space for your plot, consider removing the legend
legend.title = "All Patients",
legend.labs = c("Treatment 1","Treatment 2"), # Change the Strata Legend
######## Risk Table #######
risk.table = TRUE, # Adds Risk Table
risk.table.height = 0.25 # Adjusts the height of the risk table (default is 0.25)
)`````` #### Another Example of a KM stratified by an Explanatory Variable - Cell Type

``````# Generate Survfit_histo Object
veteran_Survfit_histo <- survfit(veteran_Surv ~ veteran\$celltype, data = veteran)

# View Survfit Analysis at various time points
summary(veteran_Survfit_histo, times = c(1, 50, 100*(1:10)))``````
``````Call: survfit(formula = veteran_Surv ~ veteran\$celltype, data = veteran)

veteran\$celltype=squamous
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1     35       2    0.943  0.0392       0.8690        1.000
50     23       9    0.681  0.0794       0.5423        0.856
100     20       2    0.622  0.0828       0.4793        0.808
200     13       6    0.426  0.0873       0.2849        0.636
300      8       4    0.288  0.0820       0.1650        0.503
400      5       3    0.180  0.0711       0.0831        0.391
500      3       2    0.108  0.0581       0.0377        0.310
600      2       1    0.072  0.0487       0.0192        0.271
700      2       0    0.072  0.0487       0.0192        0.271
800      2       0    0.072  0.0487       0.0192        0.271
900      2       0    0.072  0.0487       0.0192        0.271

veteran\$celltype=smallcell
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1     48       0   1.0000  0.0000       1.0000        1.000
50     25      23   0.5208  0.0721       0.3971        0.683
100     10      14   0.2257  0.0609       0.1330        0.383
200      3       5   0.0878  0.0457       0.0316        0.244
300      2       1   0.0585  0.0387       0.0160        0.214

time n.risk n.event survival std.err lower 95% CI upper 95% CI
1     27       0    1.000  0.0000       1.0000        1.000
50     14      13    0.519  0.0962       0.3605        0.746
100      5       8    0.206  0.0802       0.0959        0.442

veteran\$celltype=large
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1     27       0   1.0000  0.0000      1.00000        1.000
50     22       5   0.8148  0.0748      0.68071        0.975
100     20       3   0.7037  0.0879      0.55093        0.899
200      9      10   0.3292  0.0913      0.19121        0.567
300      3       5   0.1235  0.0659      0.04335        0.352
400      1       2   0.0412  0.0401      0.00608        0.279
500      1       0   0.0412  0.0401      0.00608        0.279``````
``veteran_Survfit_histo``
``````Call: survfit(formula = veteran_Surv ~ veteran\$celltype, data = veteran)

n events median 0.95LCL 0.95UCL
veteran\$celltype=squamous  35     31    118      82     314
veteran\$celltype=smallcell 48     45     51      25      63
veteran\$celltype=adeno     27     26     51      35      92
veteran\$celltype=large     27     26    156     105     231``````
``````# Kaplan-Meier using ggsurvplot
ggsurvplot(fit = veteran_Survfit_histo,
data = veteran,
title = "Overall Survival",
subtitle = "Stratified By Histology",
font.title = c(22, "bold", "black"),
ggtheme = theme_minimal() + theme(plot.title = element_text(hjust = 0.5, face = "bold"))+ # theme_minimal will give a white background with grid lines on the plot
theme(plot.subtitle = element_text(hjust = 0.5, size = 16, face = "italic")),
####### Censor Details ########
censor = TRUE, #ogical value. If TRUE, censors will be drawn,
censor.shape="|",
censor.size = 5,
####### Confidence Intervals ########
conf.int = TRUE, # To Remove conf intervals use "FALSE"
surv.median.line = "hv", # allowed values include one of c("none", "hv", "h", "v"). v: vertical, h:horizontal
####### Format Axes #######
xlab="Days", # changes xlabel,
ylab = "Survival Probability",
font.x=c(18,"bold"), # changes x axis labels
font.y=c(18,"bold"), # changes y axis labels
font.xtickslab=c(14,"plain"), # changes the tick label on x axis
font.ytickslab=c(14,"plain"),
######## Format Legend #######
legend = "none",
legend.title = "All Patients",
legend.labs = c("Squamous","Small Cell","Adenocarcioma","Large"), # Change the Strata Legend
######## Plot Dimensions #######
surv.plot.height = 0.85, # Default is 0.75
######## Risk Table #######
risk.table = TRUE, # Adds Risk Table
risk.table.height = 0.25, # Adjusts the height of the risk table (default is 0.25)
risk.table.fontsize = 3.0)`````` # Log-Rank Analysis

### Overview

• To test for a difference between survival curves, we can use the `survdiff()` function
• We specify the survival object (which contains the time-to-event and failure indicators) using the `Surv command` and the predictor variable
• This takes the following argument structure:

survdiff(SurvObject ~ Predictor Variable, data = DataFrame)

``survdiff(Surv(time = veteran\$time, event = veteran\$status) ~ veteran\$trt, data = veteran)``
``````Call:
survdiff(formula = Surv(time = veteran\$time, event = veteran\$status) ~
veteran\$trt, data = veteran)

N Observed Expected (O-E)^2/E (O-E)^2/V
veteran\$trt=1 69       64     64.5   0.00388   0.00823
veteran\$trt=2 68       64     63.5   0.00394   0.00823

Chisq= 0  on 1 degrees of freedom, p= 0.9 ``````
• The output from the log-rank test provides important information:
• First, the output provides the observed number of events and the expected number of events under the null hypothesis of no group difference
• We can use this part of the output to determine the direction of the effect (i.e. which group had the event faster).
• In this table, we can see that both groups (trt=1) and (trt=2) were similar as expected
• Consequently it is not surprising that the p value is close to 1 (here p = 0.9)

If we look at the log-rank for outcomes of the different cell types

``survdiff(Surv(time = veteran\$time, event = veteran\$status) ~ veteran\$celltype, data = veteran)``
``````Call:
survdiff(formula = Surv(time = veteran\$time, event = veteran\$status) ~
veteran\$celltype, data = veteran)

N Observed Expected (O-E)^2/E (O-E)^2/V
veteran\$celltype=squamous  35       31     47.7      5.82     10.53
veteran\$celltype=smallcell 48       45     30.1      7.37     10.20
veteran\$celltype=adeno     27       26     15.7      6.77      8.19
veteran\$celltype=large     27       26     34.5      2.12      3.02

Chisq= 25.4  on 3 degrees of freedom, p= 1e-05 ``````
• Here we see that patients with Small Cell histology have more events than expected, meaning they did worse than expected compared to the null hypotheis.

#### Graph Kaplan-Meier with log-rank test displayed

``````ggsurvplot(fit = veteran_Survfit_histo,
data = veteran,
title = "Overall Survival",
subtitle = "Stratified By Histology",
font.title = c(22, "bold", "black"),
ggtheme = theme_grey() + theme(plot.title = element_text(hjust = 0.5, face = "bold"))+ # theme_grey will give a grey background with  lines on the plot
theme(plot.subtitle = element_text(hjust = 0.5, size = 16, face = "italic")),
####### Censor Details ########
censor = TRUE, #ogical value. If TRUE, censors will be drawn
censor.shape="|",
censor.size = 5,
####### Confidence Intervals ########
conf.int = TRUE, # To Remove conf intervals use "FALSE"
surv.median.line = "hv", # allowed values include one of c("none", "hv", "h", "v"). v: vertical, h:horizontal
####### Format Axes #######
xlab="Days", # changes xlabel,
ylab = "Survival Probability",
font.x=c(18,"bold"), # changes x axis labels
font.y=c(18,"bold"), # changes y axis labels
font.xtickslab=c(14,"plain"), # changes the tick label on x axis
font.ytickslab=c(14,"plain"),
######## Format Legend #######
legend.title = "All Patients",
legend.labs = c("Squamous","Small Cell","Adenocarcioma","Large"), # Change the Strata Legend
legend = c(0.8,0.8), #c(0,0) corresponds to the "bottom left" and c(1,1) corresponds to the "top right" position
######## Plot Dimensions #######
surv.plot.height = 0.95, # Default is 0.75
######## Risk Table #######
risk.table = FALSE, # Adds Risk Table
risk.table.height = 0.35, # Adjusts the height of the risk table (default is 0.25)
risk.table.fontsize = 3,
######## p-value details #######
pval = TRUE,
pval.size = 5,
pval.coord = c(1,1)
)`````` # Modeling Survival Data

### Overview

• The goal of modeling survival data is to model the relationship between survival and explanatory variables
• The Cox Proportional Hazards Model is the most common model used for survival data
• It is popular because it allows for a flexible choice of covariates, it is fairly easy to fit and can be performed by standard software (including, of course, `R`)
• First introduced in 1972, The Cox Proportional Hazards model is a linear model for the log of the hazard ratio
• Generally speaking a “Hazard” is the probability of experiencing an event in a defined time period given that the subject has not experienced that event leading up to the interval
• It is a “conditional” probability: i.e. the condition is that the subject in question has not yet experienced the event of interest by the beginning of the interval in question
• The Proportional Hazards (PH) models λ(t;Ζ) = λ0(t)Ψ(Ζ)
• The Cox Proportional hazards model has the advantage over a simple log-rank test of giving us an estimate of the “hazard ratio”
• This is more informative than just a test statistic, and we can also form confidence intervals for the hazard ratio

#### Important Caveats to Proportional Hazards Models

• A condition of the Cox model is that the hazards in the groups being tested are proportional
• What this means is that from the beginning of the study period to it’s end (i.e. the completion of the follow up period) the Hazard Ratio must remain constant for the hazards to be considered proportional
• However, as summarized nicely in their article, Stensrud and Hernán point out that in clinical research, especially intervention trials, due to factors such as variations in disease susceptibility and temporally differential effects between treatment groups, Hazards are quite commonly Not Proportional
• Given this reality, Hazard Ratios must be interpreted in their appropriate context. Stensrud and Hernán suggest ways to supplement HRs with reports of effect measures such as restricted mean survival difference
• A detailed analysis of these approaches is beyond the scope of this post, but we encourage you to read their JAMA article and accompanied references

#### `R` Logistics

• In R, a Cox Model takes the following form

coxph(surv(TimeVariable, EventVariable)~IndicatorVariable, data = DataFrame)

### Coxph in Action

In the `Veteran` data set, let’s look at a model that evaluates the outcome of survival time when the Predictor is Treatment

``coxph(Surv(time = veteran\$time, event = veteran\$status)  ~ veteran\$trt, data = veteran)``
``````Call:
coxph(formula = Surv(time = veteran\$time, event = veteran\$status) ~
veteran\$trt, data = veteran)

coef exp(coef) se(coef)     z     p
veteran\$trt 0.01774   1.01790  0.18066 0.098 0.922

Likelihood ratio test=0.01  on 1 df, p=0.9218
n= 137, number of events= 128 ``````
• Interpretation:
• patients who received treatment had an increase in the log(HR) of survival by 0.01774

In the `Veteran` data set, let’s look at a model that evaluates the outcome of survival time when the Predictor is Cell Type

``coxph(Surv(time = veteran\$time, event = veteran\$status)  ~ veteran\$celltype, data = veteran)``
``````Call:
coxph(formula = Surv(time = veteran\$time, event = veteran\$status) ~
veteran\$celltype, data = veteran)

coef exp(coef) se(coef)     z        p
veteran\$celltypesmallcell 1.0013    2.7217   0.2535 3.950 7.83e-05
veteran\$celltypeadeno     1.1477    3.1510   0.2929 3.919 8.90e-05
veteran\$celltypelarge     0.2301    1.2588   0.2773 0.830    0.407

Likelihood ratio test=24.85  on 3 df, p=1.661e-05
n= 137, number of events= 128 ``````
• Interpretation:
1. compared to patients with a squamous histology, for patients with small cell histology there is a increase in the log(HR) of having an event by 1.0013 (which is statisically significant)
2. compared to patients with a squamous histology, for patients with adenocarcinoma there is a increase in the log(HR) of having an event by 1.1477 (which is statisically significant)
3. compared to patients with a squamous histology, for patients with large cell histology there is a increase in the log(HR) of having an event by 0.2301 (though this is NOT statisically significant)
• These results make intuitive sense. If we look at the survival curves, Large Cell Histology appears to have the best OS (MST of 156 days), meaning this is the group that has the fewest proportional events, where as both small cell and adenocarcinoma have an MST of 51 days, meaning they do worse and thus have more events

With this dataset, we can also test hypotheses based on our biological understanding of lung cancer
- For example if we wanted to test the hypothesis that the treatment tested would be effective in patients with Small Cell Lung Cancer, we could subset the data set, pulling out the patients with SCLC and perform either a log-rank test or CoxPH, as follows

``````SCLC <-  veteran %>% filter(celltype=="smallcell")
survdiff(Surv(time = SCLC\$time, event = SCLC\$status) ~ SCLC\$trt, data = SCLC)``````
``````Call:
survdiff(formula = Surv(time = SCLC\$time, event = SCLC\$status) ~
SCLC\$trt, data = SCLC)

N Observed Expected (O-E)^2/E (O-E)^2/V
SCLC\$trt=1 30       28     32.3     0.575      2.28
SCLC\$trt=2 18       17     12.7     1.464      2.28

Chisq= 2.3  on 1 degrees of freedom, p= 0.1 ``````
``coxph(Surv(time = SCLC\$time, event = SCLC\$status) ~ SCLC\$trt, data = SCLC)``
``````Call:
coxph(formula = Surv(time = SCLC\$time, event = SCLC\$status) ~
SCLC\$trt, data = SCLC)

coef exp(coef) se(coef)     z    p
SCLC\$trt 0.5020    1.6521   0.3313 1.515 0.13

Likelihood ratio test=2.24  on 1 df, p=0.1342
n= 48, number of events= 45 ``````
• If we wanted to test the hypothesis that the treatment tested would be effective in patients with adenocarcinoma, we could subset the data set, pulling out the patients with adenocarcinoma and perform either a log-rank test or CoxPH, as follows
``````adeno <-  veteran %>% filter(celltype=="adeno")
``````Call:

N Observed Expected (O-E)^2/E (O-E)^2/V
adeno\$trt=1  9        9     10.1     0.128     0.233
adeno\$trt=2 18       17     15.9     0.082     0.233

Chisq= 0.2  on 1 degrees of freedom, p= 0.6 ``````
``coxph(Surv(time = adeno\$time, event = adeno\$status) ~ adeno\$trt, data = adeno)``
``````Call:

coef exp(coef) se(coef)     z     p
adeno\$trt 0.2067    1.2296   0.4322 0.478 0.633

Likelihood ratio test=0.23  on 1 df, p=0.6297
n= 27, number of events= 26 ``````
• If we wanted to test the hypothesis that the treatment tested would be effective in patients with squamous cell carcinoma, we could subset the data set, pulling out the patients with SCC and perform either a log-rank test or CoxPH, as follows
``````SCC <-  veteran %>% filter(celltype=="squamous")
survdiff(Surv(time = SCC\$time, event = SCC\$status) ~ SCC\$trt, data = SCC)``````
``````Call:
survdiff(formula = Surv(time = SCC\$time, event = SCC\$status) ~
SCC\$trt, data = SCC)

N Observed Expected (O-E)^2/E (O-E)^2/V
SCC\$trt=1 15       13     9.22     1.545      2.45
SCC\$trt=2 20       18    21.78     0.655      2.45

Chisq= 2.5  on 1 degrees of freedom, p= 0.1 ``````
``coxph(Surv(time = SCC\$time, event = SCC\$status) ~ SCC\$trt, data = SCC)``
``````Call:
coxph(formula = Surv(time = SCC\$time, event = SCC\$status) ~ SCC\$trt,
data = SCC)

coef exp(coef) se(coef)      z     p
SCC\$trt -0.6081    0.5444   0.3954 -1.538 0.124

Likelihood ratio test=2.32  on 1 df, p=0.1274
n= 35, number of events= 31 ``````

## Modeling more than one predictor

### Overview

• The Cox Proportional Model can be used to evalute the effect of multiple Explanatory Variables (i.e Predictors) on an outcome - here survival
• To do so, use the following structure
coxph(surv(TimeVariable, EventVariable) ~ IndicatorVariable1 + IndicatorVariable2 + …, data = DataFrame)
• If we wanted to model the `Veteran` data set evaluating the effect on survival of the treatment and cell type we would use the following model
``coxph(Surv(time = veteran\$time, event = veteran\$status)  ~ veteran\$trt + veteran\$celltype, data = veteran)``
``````Call:
coxph(formula = Surv(time = veteran\$time, event = veteran\$status) ~
veteran\$trt + veteran\$celltype, data = veteran)

coef exp(coef) se(coef)     z        p
veteran\$trt               0.1978    1.2187   0.1968 1.005    0.315
veteran\$celltypesmallcell 1.0964    2.9935   0.2725 4.024 5.73e-05
veteran\$celltypeadeno     1.1689    3.2184   0.2950 3.962 7.43e-05
veteran\$celltypelarge     0.2970    1.3459   0.2857 1.040    0.298

Likelihood ratio test=25.86  on 4 df, p=3.381e-05
n= 137, number of events= 128 ``````

If you had a biological rationale, you could build a model with all of the covariates in order to see their effect on survial, as follows:

``coxph(Surv(time = veteran\$time, event = veteran\$status)  ~ veteran\$trt + veteran\$celltype + veteran\$karno + veteran\$diagtime + veteran\$age + veteran\$prior, data = veteran)``
``````Call:
coxph(formula = Surv(time = veteran\$time, event = veteran\$status) ~
veteran\$trt + veteran\$celltype + veteran\$karno + veteran\$diagtime +
veteran\$age + veteran\$prior, data = veteran)

coef  exp(coef)   se(coef)      z        p
veteran\$trt                2.946e-01  1.343e+00  2.075e-01  1.419  0.15577
veteran\$celltypesmallcell  8.616e-01  2.367e+00  2.753e-01  3.130  0.00175
veteran\$celltypeadeno      1.196e+00  3.307e+00  3.009e-01  3.975 7.05e-05
veteran\$celltypelarge      4.013e-01  1.494e+00  2.827e-01  1.420  0.15574
veteran\$karno             -3.282e-02  9.677e-01  5.508e-03 -5.958 2.55e-09
veteran\$diagtime           8.132e-05  1.000e+00  9.136e-03  0.009  0.99290
veteran\$age               -8.706e-03  9.913e-01  9.300e-03 -0.936  0.34920
veteran\$prior              7.159e-03  1.007e+00  2.323e-02  0.308  0.75794

Likelihood ratio test=62.1  on 8 df, p=1.799e-10
n= 137, number of events= 128 ``````

# Review above principles with `colon` data set

• The colon data set consists of data from the trials evaluating adjuvant therapy in colon cancer and is a built-in dataset within R

Definition of Variables in Colon Data Set
id: id
study: 1 for all patients
rx: Treatment - Obs(ervation), Lev(amisole), Lev(amisole)+5-FU
sex: 1=male
age: in years
obstruct: obstruction of colon by tumour
perfor: perforation of colon
nodes: number of lymph nodes with detectable cancer
time: days until event or censoring
status: censoring status
differ: differentiation of tumour (1=well, 2=moderate, 3=poor)
extent: Extent of local spread (1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures)
surg: time from surgery to registration (0=short, 1=long)
node4: more than 4 positive lymph nodes
etype: event type: 1=recurrence,2=death

Note: The study is originally described in a 1989 Journal of Clinical Oncology paper by Laurie et al. The main report is found in a 1990 New England Journal of Medicine paper by Moertel et al. This data set, which is contained in R, is closest to that of the final report in the 1991 Annals of Internal Medicine paper by Moertel et al. A version of the data with less follow-up time was used in the paper in Statistics in Medicine by Lin (1994).

References

JA Laurie, CG Moertel, TR Fleming, HS Wieand, JE Leigh, J Rubin, GW McCormack, JB Gerstner, JE Krook and J Malliard. Surgical adjuvant therapy of large-bowel carcinoma: An evaluation of levamisole and the combination of levamisole and fluorouracil: The North Central Cancer Treatment Group and the Mayo Clinic. J Clinical Oncology, 7:1447-1456, 1989.

DY Lin. Cox regression analysis of multivariate failure time data: the marginal approach. Statistics in Medicine, 13:2233-2247, 1994.

CG Moertel, TR Fleming, JS MacDonald, DG Haller, JA Laurie, PJ Goodman, JS Ungerleider, WA Emerson, DC Tormey, JH Glick, MH Veeder and JA Maillard. Levamisole and fluorouracil for adjuvant therapy of resected colon carcinoma. New England J of Medicine, 332:352-358, 1990.

CG Moertel, TR Fleming, JS MacDonald, DG Haller, JA Laurie, CM Tangen, JS Ungerleider, WA Emerson, DC Tormey, JH Glick, MH Veeder and JA Maillard, Fluorouracil plus Levamisole as an effective adjuvant therapy after resection of stage II colon carcinoma: a final report. Annals of Internal Med, 122:321-326, 1991.

### View Head of Data Set

``kable(head(colon))``
id study rx sex age obstruct perfor adhere nodes status differ extent surg node4 time etype
1 1 Lev+5FU 1 43 0 0 0 5 1 2 3 0 1 1521 2
1 1 Lev+5FU 1 43 0 0 0 5 1 2 3 0 1 968 1
2 1 Lev+5FU 1 63 0 0 0 1 0 2 3 0 0 3087 2
2 1 Lev+5FU 1 63 0 0 0 1 0 2 3 0 0 3087 1
3 1 Obs 0 71 0 0 1 7 1 2 2 0 1 963 2
3 1 Obs 0 71 0 0 1 7 1 2 2 0 1 542 1

## Kaplan-Meier Curve of the colon data set

``````colon_Survfit <- survfit(Surv(time = colon\$time, event = colon\$status) ~ 1, data = colon)
colon_ggSurvplot <- ggsurvplot (fit = colon_Survfit,
data = colon,
title = "Overall Survival",
subtitle = "Unstratified Curve",
font.title = c(22, "bold", "black"),
ggtheme = theme_dark() + theme(plot.title = element_text(hjust = 0.5, face = "bold"))+
# theme_dark will give a dark background with lines on the plot
theme(plot.subtitle = element_text(hjust = 0.5, size = 16, face = "italic")),
####### Censor Details ########
censor = TRUE, # logical value. If TRUE, censors will be drawn
censor.shape="|",
censor.size = 5,
####### Confidence Intervals ########
conf.int = TRUE, # To Remove conf intervals use "FALSE"
surv.median.line = "hv", # allowed values include one of c("none", "hv", "h", "v"). v: vertical, h:horizontal
####### Format Axes #######
xlab="Days", # changes xlabel,
ylab = "Survival Probability",
font.x=c(18,"bold"), # changes x axis labels
font.y=c(18,"bold"), # changes y axis labels
font.xtickslab=c(14,"plain"), # changes the tick label on x axis
font.ytickslab=c(14,"plain"),
######## Format Legend #######
legend.title = "All Patients",
legend.labs = c("Total Cohort"), # Change the Strata Legend
legend = c(0.8,0.8), #c(0,0) corresponds to the "bottom left" and c(1,1) corresponds to the "top right" position
######## Plot Dimensions #######
surv.plot.height = 0.75, # Default is 0.75
######## Risk Table #######
risk.table = FALSE, # Adds Risk Table
risk.table.height = 0.35, # Adjusts the height of the risk table (default is 0.25)
######## p-value details #######
pval = FALSE,
pval.size = 5,
pval.coord = c(1,1)
)
colon_ggSurvplot`````` ### View Kaplan-Meier Curve of Cohort Stratified by Treatment

First let’s generate a survfit object of the data stratified by treatment

``colon_Survfit_rx <- survfit(Surv(time = colon\$time, event = colon\$status) ~ colon\$rx, data = colon)``

Let’s now view that survfit object

``colon_Survfit_rx``
``````Call: survfit(formula = Surv(time = colon\$time, event = colon\$status) ~
colon\$rx, data = colon)

n events median 0.95LCL 0.95UCL
colon\$rx=Obs     630    345   1723    1323    2213
colon\$rx=Lev     620    333   1709    1219    2593
colon\$rx=Lev+5FU 608    242     NA      NA      NA``````

From the above table, it appears like the group of patients that received Levamisole (Lev) and 5-FU in the adjuvant setting had fewer events than those in the observation group or the levamisole only group.

To evaluate if what we observed is different from what we’d expect under the null hypothesis, let’s evaluate using the log-rank test

``survdiff(Surv(time, status) ~ rx, data = colon)# of note, you'll see that we used a slightly different way to write the above code. It's a simpler way in which we only use the column names and do not require the preceeding "colon\$", which is actually not needed, since we are specifying the data set``
``````Call:
survdiff(formula = Surv(time, status) ~ rx, data = colon)

N Observed Expected (O-E)^2/E (O-E)^2/V
rx=Obs     630      345      299      7.01     10.40
rx=Lev     620      333      295      4.93      7.26
rx=Lev+5FU 608      242      326     21.61     33.54

Chisq= 33.6  on 2 degrees of freedom, p= 5e-08 ``````

As you can see from the above table, there are indeed much fewer events in the Lev+5FU group than what we’d expect under the null, and it is statistically significant

Now we will generate a Kaplan-Meier curve of those three groups

``````colon_ggSurvplot_rx <- ggsurvplot (fit = colon_Survfit_rx,
data = colon,
title = "Overall Survival",
subtitle = "Stratified by Treatment Group",
font.title = c(22, "bold", "black"),
ggtheme = theme_minimal() + theme(plot.title = element_text(hjust = 0.5, face = "bold"))+
theme(plot.subtitle = element_text(hjust = 0.5, size = 16, face = "italic")),
####### Censor Details ########
censor = TRUE, # logical value. If TRUE, censors will be drawn
censor.shape="|",
censor.size = 5,
####### Confidence Intervals ########
conf.int = TRUE, # To Remove conf intervals use "FALSE"
surv.median.line = "hv", # allowed values include one of c("none", "hv", "h", "v"). v: vertical, h:horizontal
####### Format Axes #######
xlab="Days", # changes xlabel,
ylab = "Survival Probability",
font.x=c(18,"bold"), # changes x axis labels
font.y=c(18,"bold"), # changes y axis labels
font.xtickslab=c(14,"plain"), # changes the tick label on x axis
font.ytickslab=c(14,"plain"),
######## Format Legend #######
legend.title = "All Patients",
legend.labs = c("Observation","Levamisole","Levamisole + 5-FU"), # Change the Strata Legend
legend = c(0.8,0.8), #c(0,0) corresponds to the "bottom left" and c(1,1) corresponds to the "top right" position
######## Plot Dimensions #######
surv.plot.height = 0.75, # Default is 0.75
######## Risk Table #######
risk.table = FALSE, # Adds Risk Table
risk.table.height = 0.35, # Adjusts the height of the risk table (default is 0.25)
######## p-value details #######
pval = TRUE,
pval.size = 5,
pval.coord = c(1,0.1)
)
colon_ggSurvplot_rx`````` ## Visualize Hazard Ratios with Forest Plots

### Overview

• Forest plots allow you to evaluate the Hazard Ratios (HR) of all of the covariates that were used in the model you created
• Let’s build a model incorporating covariates that think might influence or explain (i.e. explanatory variables) the survival curve, such as age, treatment, whether or not the tumor adhered to other organs or if patients had more than 4 nodes positive, differentiation status of the tumor, the extent of local spread and time from surgery to registration
• Biologically, all of these covariates may effect the outcome, so you might want to include them in your model
``````model.colon <- coxph( Surv(time, status) ~ sex + rx + adhere + node4 + extent + surg + differ,
data = colon )
model.colon # Call the model to inspect it's contents``````
``````Call:
coxph(formula = Surv(time, status) ~ sex + rx + adhere + node4 +
extent + surg + differ, data = colon)

coef exp(coef) se(coef)      z        p
sex       -0.04499   0.95601  0.06706 -0.671 0.502361
rxLev     -0.01322   0.98687  0.07836 -0.169 0.866068
rxLev+5FU -0.42424   0.65427  0.08466 -5.011 5.42e-07
adhere     0.16073   1.17437  0.08991  1.788 0.073832
node4      0.84508   2.32818  0.06921 12.210  < 2e-16
extent     0.43480   1.54466  0.08098  5.369 7.90e-08
surg       0.24756   1.28089  0.07288  3.397 0.000682
differ     0.19508   1.21541  0.06839  2.853 0.004337

Likelihood ratio test=249.3  on 8 df, p=< 2.2e-16
n= 1812, number of events= 899
(46 observations deleted due to missingness)``````

From the above table, you see that there are a number of covariates that appear to be related to the survival outcome (here recurrence or death). - The column exp(coef) represents the “Hazard Ratio”, which is the most common way to compare the relationship between an explanatory variable and the outcome variable in medical research
- A HR of 1 is the null hypothesis
- A HR < 1 means that there is a decreased risk of the outcome (in survival analysis this = death) if that specific condition is met by the subject (i.e. they received the treatment)
- A HR > 1 means that there is an increased risk of the outcome (in survival analysis this = death) if that specific condition is met by the subject (i.e. had more than 4 positive lymph nodes)

Let’s view the Forest Plot of that model

``ggforest(model.colon)`` Interpretation from this model could be the following:
- There is no difference in death between males and females
- Treatment with Lev+5FU in this cohort is associated with a 35% decreased risk of recurrence or death compared to subjects that received only observation
- There is a trend to worse outcomes in patients tumors that adhere to nearby organs
- Worse outcomes are associated with more than 4 positive lymph nodes, extention of the tumor, long times between surgery and registration to the trial and poorly differentiated tumors.

Of note, you’ll see that your HRs will change depending on what covariates you include in your model. For example your interpretation of whether or not adherence to the surrounding tissue is associated with worse outcomes will be affected by the covariate you include in the model. Watch what happens if we leave “node4”, “extent” and “differ” out of the model

``````model.colon <- coxph( Surv(time, status) ~ sex + rx + adhere  + surg,
data = colon )
ggforest(model.colon)`````` Now it appears that adherence to organs is indeed associated with worse outcomes with a p-value <0.001), whereas in the first model the p-value of this covariate trended towards statistical significance but the 95% CI overlapped the null and therefore, rejecting the null would have been in appropriate

##### This brings up a critical principle in science.
• You must be very rigorous about your hypothesis testing and prespecify which hypotheses you are going to test before you analyze your data
• What can be tempting - but is inappropriate - is to look at your data from numerous viewpoints, testing one hypothesis in sequence after another in order to find a statistically significant p value
• This practice is probably more common in science than we would like to acknowledge and can lead to inappropriate rejection of the null hypothesis
• This methodology has many pejorative sobriquets including “data dredging”, “data fishing”, “data butchery” or “p-hacking”
• This highlights the need to be very thoughtful about your analysis upfront
• Consulting a biostatistician before carrying out your analysis is ideal
• In fact, consulting one before conducting your research, so you can have a Statistical Analysis Plan before you conduct your research, is part-and-parcel with good science

# Conclusion

• Survival Analyses are some of the most common analysis in clinical research
• R has several nice packages that can faciliate Survival Analysis
• Beware of “p-hacking”: when in doubt, consult a biostatistician before you conduct your science!

As always, please reach out to us with thoughts and feedback

# Session Info

``sessionInfo()``
``````R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur/Monterey 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
 en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 survminer_0.4.9 ggpubr_0.4.0    knitr_1.41      forcats_0.5.2
 tidyr_1.2.1     tibble_3.1.8    ggplot2_3.4.0   tidyverse_1.3.2
 survival_3.4-2  devtools_2.4.3  usethis_2.1.6

loaded via a namespace (and not attached):
 ggtext_0.1.1        fs_1.5.2            lubridate_1.9.0
 httr_1.4.4          rprojroot_2.0.3     tools_4.2.0
 backports_1.4.1     utf8_1.2.2          R6_2.5.1
 DBI_1.1.3           colorspace_2.0-3    withr_2.5.0
 gridExtra_2.3       tidyselect_1.2.0    prettyunits_1.1.1
 processx_3.8.0      curl_4.3.3          compiler_4.2.0
 cli_3.4.1           rvest_1.0.3         xml2_1.3.3
 desc_1.4.1          labeling_0.4.2      scales_1.2.1
 survMisc_0.5.6      callr_3.7.3         digest_0.6.30
 rmarkdown_2.18      pkgconfig_2.0.3     htmltools_0.5.3
 sessioninfo_1.2.2   highr_0.9           dbplyr_2.2.1
 fastmap_1.1.0       htmlwidgets_1.5.4   rlang_1.0.6
 generics_0.1.3      zoo_1.8-10          jsonlite_1.8.4
 Matrix_1.4-1        Rcpp_1.0.9          munsell_0.5.0
 fansi_1.0.3         abind_1.4-5         lifecycle_1.0.3
 stringi_1.7.8       yaml_2.3.6          carData_3.0-5
 brio_1.1.3          pkgbuild_1.3.1      grid_4.2.0
 crayon_1.5.2        lattice_0.20-45     cowplot_1.1.1
 haven_2.5.1         splines_4.2.0       gridtext_0.1.4
 hms_1.1.2           ps_1.7.2            pillar_1.8.1