How to group

Its fairly common to need to group data by an aspect of the date. For example, how many patients underwent a colonoscopy this month, or week etc. To do this we need a date object as part of the data as usual. As long as we can extract the aspect we want to group by this should be a breeze with dplyr

The problem

How do I find out how many endoscopies were done by month for the past calendar year, by endoscopy type. We will use some data that is already created. If you want to know how it was created you can check out this page…..

#Create the data

EndoHistoMerge<-source('EndoPathMerged_ExternalCode.R')
EndoHistoMerge<-data.frame(EndoHistoMerge)
#Neaten up the names
names(EndoHistoMerge)<-gsub("value.","",names(EndoHistoMerge),fixed=T)
#Lets just select the columns relevant to this page
GroupDatesExample<-EndoHistoMerge%>%select(EndoHospNumId,Date.x)
###The resulting data looks like this:
kable(head(GroupDatesExample,5))
EndoHospNumId Date.x
S553322 2015-04-20
S553322 2015-04-20
S553322 2015-04-20
S553322 2015-04-20
S553322 2015-04-20

Using lubridate, we can extract the month very simply as follows: This can then be incorporated into dplyr

library(lubridate)

kable(GroupDatesExample %>% group_by(month=month(Date.x)) %>% summarise(Number=n()))%>%
  kable_styling(bootstrap_options = "striped", full_width = F)
month Number
1 1469
2 1374
3 1463
4 1606
5 1196
6 1341
7 1238
8 1169
9 1320
10 1356
11 1126
12 1238

Breaking this down, we are using select() to get the columns we are interested in, then we use group_by() to group according to the two aspects we are interested in. We then use summarise to count each of the groups. The output is then as expected.

Simple numbers per year

What if you simply want to plot out the number of procedures done by year. You don’t have to use lubrudate to do this, you can do this is base RThat needs you to extract the year from the date and then summarise as follows:

Tots<-GroupDatesExample %>%
  mutate(year = format(Date.x, "%Y")) %>%
  group_by(year)%>%
  summarise(n = n())

kable(Tots)
year n
2013 3544
2014 3664
2015 3796
2016 3607
2017 1285






Get difference between two dates in consecutive rows

Often you need to know the time between consecutive tests for a patient. This is done using the difftime() function. Not we use the following functions a lot in the surveillance page so these are worth understanding:

DateBetween<-GroupDatesExample %>% arrange(EndoHospNumId, Date.x) %>% group_by(EndoHospNumId) %>%
  mutate(diffDate = difftime(Date.x, lag(Date.x,1),units="weeks"))

kable(head(DateBetween,10))
EndoHospNumId Date.x diffDate
D0739033 2013-01-02 NA
D0739033 2013-01-03 0.1428571 weeks
D0739033 2013-01-03 0.0000000 weeks
D0739033 2013-01-04 0.1428571 weeks
D0739033 2013-01-05 0.1428571 weeks
D0739033 2013-01-05 0.0000000 weeks
D0739033 2013-01-05 0.0000000 weeks
D0739033 2013-01-05 0.0000000 weeks
D0739033 2013-01-05 0.0000000 weeks
D0739033 2013-01-05 0.0000000 weeks






Get the first date or the last date in a group

It may also be that you just need to know the first or last date in the tests for a patient, again using dplyr and the slice() function:

#To get the first
GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(1)
#To get the last
GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(n())
#To get the first and the last
kable(head(GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(c(1,n())),10))
EndoHospNumId Date.x
D0739033 2013-01-02
D0739033 2017-05-01
F630230 2013-01-01
F630230 2017-04-21
G244224 2013-01-07
G244224 2017-04-26
I927282 2013-01-05
I927282 2017-04-29
J322909 2013-01-02
J322909 2017-04-23






Selecting rows by date position based on a conditional

There are many occasions when simply grouping by dates is not sufficient for what you need. Perhaps you want to order the number of investigations that a patient has had by date so that you are ordering the dates once the grouping by hospital number has already been done, or perhaps you need to know the time difference between one test and another for a particular patient

As always, dplyr has a solution for this: Let’s use a new data set just to make things more interesting:

#Generate some sample data:

proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
HospNum_Id<-sample(c("P433224","P633443","K522332","G244224","S553322","D0739033","U873352","P223333","Y763634","I927282","P223311","P029834","U22415","U234252","S141141","O349253","T622722","J322909","F630230","T432452"), 100, replace = TRUE)
df<-data.frame(proc,dat,HospNum_Id)

So now we group the data according to patient number:

Upstage<-df %>%
  group_by(HospNum_Id) %>%
  arrange(HospNum_Id,dat)
#Only show the first 25 samples
kable(head(Upstage,25))
proc dat HospNum_Id
EMR 2013-10-05 D0739033
Biopsies 2014-02-08 D0739033
EMR 2014-08-08 D0739033
Biopsies 2016-08-29 D0739033
Biopsies 2017-02-12 D0739033
RFA 2013-10-16 F630230
EMR 2014-06-20 F630230
EMR 2014-08-27 F630230
RFA 2014-09-14 F630230
Biopsies 2016-09-14 F630230
Biopsies 2014-07-31 G244224
RFA 2015-04-04 G244224
EMR 2015-06-14 G244224
EMR 2016-12-17 G244224
EMR 2017-04-07 G244224
RFA 2013-09-16 I927282
RFA 2015-02-27 I927282
EMR 2016-07-29 I927282
RFA 2016-08-08 I927282
Biopsies 2016-11-08 I927282
EMR 2013-04-02 J322909
RFA 2014-02-02 J322909
Biopsies 2016-01-26 J322909
RFA 2016-10-12 J322909
RFA 2013-08-22 K522332

But actually we want only those patients who have had and EMR followed by RFA. lead() means the leading row ie the row that leads to the next row (which should contain RFA in the proc column).

Upstage<-df %>%
group_by(HospNum_Id)%>%
mutate(ind = proc=="RFA" & lead(proc)=="EMR") %>%
arrange(HospNum_Id,dat)

#Only show the first 25 samples
kable(head(Upstage,25))
proc dat HospNum_Id ind
EMR 2013-10-05 D0739033 FALSE
Biopsies 2014-02-08 D0739033 FALSE
EMR 2014-08-08 D0739033 FALSE
Biopsies 2016-08-29 D0739033 FALSE
Biopsies 2017-02-12 D0739033 FALSE
RFA 2013-10-16 F630230 NA
EMR 2014-06-20 F630230 FALSE
EMR 2014-08-27 F630230 FALSE
RFA 2014-09-14 F630230 TRUE
Biopsies 2016-09-14 F630230 FALSE
Biopsies 2014-07-31 G244224 FALSE
RFA 2015-04-04 G244224 NA
EMR 2015-06-14 G244224 FALSE
EMR 2016-12-17 G244224 FALSE
EMR 2017-04-07 G244224 FALSE
RFA 2013-09-16 I927282 NA
RFA 2015-02-27 I927282 FALSE
EMR 2016-07-29 I927282 FALSE
RFA 2016-08-08 I927282 FALSE
Biopsies 2016-11-08 I927282 FALSE
EMR 2013-04-02 J322909 FALSE
RFA 2014-02-02 J322909 FALSE
Biopsies 2016-01-26 J322909 FALSE
RFA 2016-10-12 J322909 TRUE
RFA 2013-08-22 K522332 TRUE

But that simply gives those values where consecutive rows show EMR and then RFA but we want it for patients so we have to do something a little more complex. We use the fact that the mutate column is boolean (so gives us a TRUE or FALSE return value) and we ask to return only those hospital numbers where all those values where this is true (and we also sort it).

Upstage<-df %>%
  group_by(HospNum_Id)%>%
 mutate(ind = proc=="EMR" & lead(proc)=="RFA") %>%
   slice(sort(c(which(ind),which(ind)+1)))%>%
  arrange(HospNum_Id,dat)
kable(Upstage)
proc dat HospNum_Id ind
RFA 2013-10-16 F630230 FALSE
EMR 2014-06-20 F630230 TRUE
RFA 2015-04-04 G244224 FALSE
EMR 2016-12-17 G244224 TRUE
EMR 2016-07-29 I927282 TRUE
RFA 2016-08-08 I927282 FALSE
EMR 2015-10-12 O349253 TRUE
RFA 2016-10-14 O349253 FALSE
EMR 2013-04-03 P029834 TRUE
RFA 2014-02-12 P029834 FALSE
EMR 2015-07-01 P029834 TRUE
RFA 2016-09-03 P029834 FALSE
RFA 2013-06-18 P223311 FALSE
EMR 2016-09-28 P223311 TRUE
RFA 2013-10-08 P223333 FALSE
EMR 2014-05-02 P223333 TRUE
EMR 2015-01-30 P223333 TRUE
RFA 2016-08-27 P223333 FALSE
EMR 2015-02-25 P433224 TRUE
RFA 2016-04-07 P433224 FALSE
RFA 2014-01-10 S553322 FALSE
EMR 2014-07-04 S553322 TRUE
EMR 2013-06-13 U873352 TRUE
RFA 2016-11-02 U873352 FALSE