The problem:

How do I visualise the flow of patients through a service? For example how do I see what order of therapies patients have received when they undergo Barrett’s endotherapy for early malignancies or dysplasia?

The background:

It is important to understand the sequence of events for any patients. In the above example, an ideal sequence should be that a patient undergoes endoscopic mucosal resection (EMR) prior to radiofrequency ablation and that no further EMR’s are required. If this is not the sequence they we need to know why and determine if it is for example a problem with lesion recognition for a particular endoscopist.

Sankey plots are ideal for this. One of the most famous uses of a Sankey plot is a rather sad but effective visualisation of Napoleonic troop movements in his disasterous 1812 campaign to conquer Russia. There are of course numerous other examples.

The power of the plot comes from the fact that the flow of time of events is usually the X axis, categories that (in my case) patients move into is the y axis and the volume of movement is represented by the weight of the line.

This makes it perfect to understand patient flow through a defined sequence of events and effectively summarises the concept of patient ‘flow’. An example of its use is given here:

The example:

The problem: How do I determine the order of therapies for patients who have undergone endotherapy for Barrett’s related early malignancies or dysplasia?

Input data

    #Generate some sample data:

proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
HospNum_Id<-sample(c("P433224","P633443","K522332","G244224","S553322","D0739033","U873352","P223333","Y763634","I927282","P223311","P029834","U22415","U234252","S141141","O349253","T622722","J322909","F630230","T432452"), 100, replace = TRUE)
df<-data.frame(proc,dat,HospNum_Id)

Data Preparation for Sankey plot

Now we prepare the data and convert it into a spredsheet format using data.table which can order the sequence of events per patient

    Sankey<-dcast(setDT(df)[, if(any(proc=="EMR"|proc=="RFA")) .SD, HospNum_Id], HospNum_Id~rowid(HospNum_Id), value.var ="proc")
Sankey<-data.frame(Sankey)

We need to rename columns and get rid of the first column which was only originally used to get the events patient specific. It won’t be useful in the next steps so lets ditch it.

PtFlow<-Sankey%>%select(2:9)
PtFlow<-data.frame(PtFlow,stringsAsFactors=F)
names(PtFlow)<-c("ord1","ord2","ord3","ord4","ord5","ord6","ord7","ord8")

orders <- PtFlow %>%
  select(ord1, ord2, ord3, ord4, ord5,ord6,ord7,ord8)
orders.plot <- data.frame(from= character(0), to= character(0), n = numeric(0))

So the data is in the current form

     knitr::kable(orders)
ord1 ord2 ord3 ord4 ord5 ord6 ord7 ord8
EMR RFA EMR EMR NA NA NA NA
RFA EMR RFA Biopsies Biopsies Biopsies Biopsies Biopsies
EMR Biopsies EMR NA NA NA NA NA
RFA EMR Biopsies EMR Biopsies Biopsies EMR NA
Biopsies EMR RFA EMR RFA NA NA NA
RFA RFA Biopsies NA NA NA NA NA
Biopsies EMR RFA RFA EMR NA NA NA
EMR RFA RFA EMR EMR NA NA NA
RFA Biopsies RFA NA NA NA NA NA
Biopsies Biopsies RFA Biopsies NA NA NA NA
Biopsies EMR RFA Biopsies RFA Biopsies EMR RFA
EMR EMR NA NA NA NA NA NA
RFA RFA Biopsies EMR EMR Biopsies RFA NA
EMR EMR RFA NA NA NA NA NA
RFA Biopsies EMR EMR Biopsies Biopsies NA NA
RFA EMR EMR EMR RFA Biopsies EMR NA
Biopsies Biopsies RFA NA NA NA NA NA
RFA EMR EMR EMR Biopsies Biopsies Biopsies NA
RFA RFA Biopsies EMR EMR NA NA NA
Biopsies Biopsies EMR RFA NA NA NA NA

So that creates the basic ‘spreadsheet’ for the creation of the Sankey on a per patient basis.

Now we have to summarise the number of ‘movements’ from one state to another for each patient and then for all the patients. The loop below basically looks at each row and then groups by the combination of subsequent events ie how many events were EMR->EMR, how many were Biopsies->EMR etc with each permutation of EMR, Biopsies and RFA. This is stored in ord.cache and then added cumulatively to orders.plot.

Summarise the Data

for (i in 2:ncol(orders)) {
  ord.cache <- orders %>%
    group_by(orders[ , i-1], orders[ , i]) %>%
    summarise(n=n()) %>%
  ungroup()
  colnames(ord.cache)[1:2] <- c('from', 'to') 
  
  # adding tags
  ord.cache$from <- paste(ord.cache$from, '(', i-1, ')', sep='')
  ord.cache$to <- paste(ord.cache$to, '(', i, ')', sep='') 
  orders.plot <- rbind(orders.plot, ord.cache) 
}

Plot the data:

orders.plot<-data.frame(orders.plot)
orders.plot<-orders.plot[grepl("[A-Z]",orders.plot$from)&grepl("[A-Z]",orders.plot$to), ]
orders.plot<-orders.plot[!grepl("NA",orders.plot$from)&!grepl("NA",orders.plot$to), ]

Now it gets plotted with the googleVis package:

plot(gvisSankey(orders.plot, from='from', to='to', weight='n',
                options=list(height=900, width=1800)))

The Sankey plot:

The resulting Sankey is as follows (yours may be different as the data is randomly generated):







Data preparation for Circos Plots

Another way to visualise flow is to circularise the data. This is different to a Sankey plot as a) it’s circular and b) the summarisation is a higher level so that rather than telling you how many patients underwent an eventA->eventB at a particular time point 1, it will tell you how many occurred over all the time points:

The data generation is as above

The data has to be cleaned slightly differently as follows:

############################################################## Data preparation ###############################################################################
proc<-sample(c("EMR","RFA","Biopsies","Nothing"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
Id<-sample(c("P43","P63","K52","G24","S55","D07","U87","P22","Y76","I92","P22","P02","U22415","U23","S14","O34","T62","J32","F63","T43"), 100, replace = TRUE)
df<-data.frame(proc,dat,Id)


Now we can summarise the data as we wish as follows:


library(dplyr)
library(reshape2)

mydf<-df %>% 
  arrange(dat) %>% 
  group_by(Id) %>%
  mutate(origin = lag(proc, 1), destination = proc) %>%
  select(origin, destination) %>% 
  group_by(origin, destination) %>% 
  summarise(n = n()) %>% 
  ungroup()


mydf<-data.frame(dcast(mydf, origin ~ destination))
#mydf$origin<-as.character(mydf$origin)

#Get rid of NA's
mydf<-mydf[complete.cases(mydf),]

V1<-c("2","7","3","10")
V2<-c("210,150,12","110,255,233","125,175,0","255,219,0")

mydf<-cbind(V1,V2,mydf)


This creates the dataframe structure with the colour scheme embedded for each grouping. Note how the grouping was achieved with dplyr, where origin and destination columns were created by combining the current and the previous column values and then these columns were selected out, grouped and then summarised.

Now we have to create a matrix


df_format <- mydf %>% select(1:3) %>% rename(order = V1, rgb = V2, region = origin) %>% mutate(region = gsub("_", " ", region))
  #flow matrix. Need to add V1 and V2 to the matrix here

  matmydf <- as.matrix(mydf[,-(1:3)])
  dimnames(matmydf) <- list(orig = df_format$region, dest = df_format$region)
  library("tidyr")
  df_format <- df_format %>% arrange(order) %>% separate(rgb, c("r","g","b")) %>% mutate(col = rgb(r, g, b, max=255), max = rowSums(matmydf)+colSums(matmydf))


Plot the Circos plot

Having created that we are ready to plot a beautiful summary visualisation of the patient flow:


  library("circlize")
  circos.clear()
  par(mar = rep(0, 4), cex=0.9)
  circos.par(start.degree = 90, gap.degree = 4)
  par(cex = 0.8, mar = c(0, 0, 0, 0))
  chordDiagram(x = matmydf, directional = 1, order = df_format$region, 
               grid.col = df_format$col, annotationTrack = "grid", 
               transparency = 0.25,  annotationTrackHeight = c(0.1, 0.1),
               diffHeight  = -0.04)
  
  
  circos.trackPlotRegion(track.index = 1, panel.fun = function(x, y) {
    xlim = get.cell.meta.data("xlim")
    sector.index = get.cell.meta.data("sector.index")
    circos.text(mean(xlim), 3.5, sector.index, facing = "bending")
    circos.axis("top", major.at = seq(0, max(xlim),by=5), minor.ticks=1,labels.cex=0.8, labels.away.percentage = 0.2, labels.niceFacing = FALSE )}, bg.border = NA)