Grouping by dates

How to group

It's fairly common to need to group data by an aspect of the date: for example, how many patients underwent a colonoscopy this month or this week. To do this we need a date object as part of the data, as usual. As long as we can extract the aspect we want to group by, this should be a breeze with dplyr.

The problem

How do I find out how many endoscopies were done by month for the past calendar year, by endoscopy type? We will use some data that has already been created. If you want to know how it was created you can check out this page…..

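#Load the packages used on this page
library(dplyr)
library(knitr)
library(kableExtra)

#Create the data (source() returns a list, so convert it to a data frame)
EndoHistoMerge<-source('EndoPathMerged_ExternalCode.R')
EndoHistoMerge<-data.frame(EndoHistoMerge)
#Neaten up the names by removing the "value." prefix
names(EndoHistoMerge)<-gsub("value.","",names(EndoHistoMerge),fixed=TRUE)
#Let's just select the columns relevant to this page
GroupDatesExample<-EndoHistoMerge%>%select(EndoHospNumId,Date.x)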

###The resulting data looks like this:
kable(head(GroupDatesExample,5))

EndoHospNumId	Date.x
S553322	2015-04-20
S553322	2015-04-20
S553322	2015-04-20
S553322	2015-04-20
S553322	2015-04-20

Using lubridate, we can extract the month very simply with month(). This can then be incorporated into a dplyr pipeline:

library(lubridate)

kable(GroupDatesExample %>% group_by(month=month(Date.x)) %>% summarise(Number=n()))%>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

month	Number
1	1469
2	1374
3	1463
4	1606
5	1196
6	1341
7	1238
8	1169
9	1320
10	1356
11	1126
12	1238

Breaking this down: we used select() earlier to keep the columns we are interested in, then group_by() groups the rows by the month extracted from Date.x, and summarise() counts the rows in each group. Note that month() on its own pools the same month from every year in the data, so the counts above cover all years rather than a single calendar year.
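Because month() returns 1 to 12 whatever the year, one way to keep each year's months apart is to group by year and month together. The following is only a minimal sketch: the `toy` data frame and its values are made up for illustration and are not part of the endoscopy data set.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: one row per procedure
toy <- data.frame(
  EndoHospNumId = c("A1", "A1", "B2", "B2", "C3"),
  Date.x = as.Date(c("2015-04-20", "2016-04-02", "2016-04-15",
                     "2016-05-01", "2016-05-09"))
)

# Group by year AND month so that April 2015 and April 2016 stay separate
ByYearMonth <- toy %>%
  group_by(year = year(Date.x), month = month(Date.x)) %>%
  summarise(Number = n(), .groups = "drop")

ByYearMonth
```

To restrict the counts to a single calendar year, a filter() on year(Date.x) could be placed before group_by().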

Simple numbers per year

What if you simply want to plot the number of procedures done by year? You don’t have to use lubridate for this; you can do it in base R. Extract the year from the date with format() and then summarise as follows:

Tots<-GroupDatesExample %>%
  mutate(year = format(Date.x, "%Y")) %>%
  group_by(year)%>%
  summarise(n = n())

kable(Tots)

year	n
2013	3544
2014	3664
2015	3796
2016	3607
2017	1285
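The same pattern extends to other units such as weeks. As a hedged sketch, lubridate's floor_date() can round each date down to the start of its week before grouping. The `toy` dates below are invented for illustration, and starting the week on Monday (week_start = 1) is an assumption rather than something the original analysis specifies.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: five procedure dates spanning two weeks
toy <- data.frame(
  Date.x = as.Date(c("2016-04-04", "2016-04-06", "2016-04-10",
                     "2016-04-11", "2016-04-17"))
)

# floor_date() rounds each date down to the first day of its week;
# week_start = 1 makes weeks begin on Monday
ByWeek <- toy %>%
  group_by(week = floor_date(Date.x, unit = "week", week_start = 1)) %>%
  summarise(n = n(), .groups = "drop")

ByWeek
```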

Get difference between two dates in consecutive rows

Often you need to know the time between consecutive tests for a patient. This is done using the difftime() function together with lag(), which returns the value from the previous row. Note that we use these functions a lot in the surveillance page, so they are worth understanding:

DateBetween<-GroupDatesExample %>% arrange(EndoHospNumId, Date.x) %>% group_by(EndoHospNumId) %>%
  mutate(diffDate = difftime(Date.x, lag(Date.x,1),units="weeks"))

kable(head(DateBetween,10))

EndoHospNumId	Date.x	diffDate
D0739033	2013-01-02	NA
D0739033	2013-01-03	0.1428571 weeks
D0739033	2013-01-03	0.0000000 weeks
D0739033	2013-01-04	0.1428571 weeks
D0739033	2013-01-05	0.1428571 weeks
D0739033	2013-01-05	0.0000000 weeks
D0739033	2013-01-05	0.0000000 weeks
D0739033	2013-01-05	0.0000000 weeks
D0739033	2013-01-05	0.0000000 weeks
D0739033	2013-01-05	0.0000000 weeks
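The first row for each patient is NA because there is no earlier test to compare against. If the gap is needed as a plain number (for filtering or plotting), one option is to wrap difftime() in as.numeric(). A minimal sketch on made-up data follows; `toy` and the column name `weeksSinceLast` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two patients, rows deliberately out of order
toy <- data.frame(
  EndoHospNumId = c("A1", "B2", "A1", "B2", "A1"),
  Date.x = as.Date(c("2016-01-15", "2016-03-01", "2016-01-01",
                     "2016-03-22", "2016-02-12"))
)

Gaps <- toy %>%
  # Sort first so that lag() really is the previous test for that patient
  arrange(EndoHospNumId, Date.x) %>%
  group_by(EndoHospNumId) %>%
  # as.numeric() drops the "weeks" label, leaving an ordinary number
  mutate(weeksSinceLast = as.numeric(difftime(Date.x, lag(Date.x), units = "weeks"))) %>%
  ungroup()

Gaps
```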

Get the first date or the last date in a group

It may also be that you just need to know the first or last date in the tests for a patient, again using dplyr and the slice() function:

#To get the first
GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(1)

#To get the last
GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(n())

#To get the first and the last
kable(head(GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(c(1,n())),10))

EndoHospNumId	Date.x
D0739033	2013-01-02
D0739033	2017-05-01
F630230	2013-01-01
F630230	2017-04-21
G244224	2013-01-07
G244224	2017-04-26
I927282	2013-01-05
I927282	2017-04-29
J322909	2013-01-02
J322909	2017-04-23
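If one row per patient is more convenient than two, an alternative to slice() is summarise() with min() and max(). This is only a sketch on made-up data; `toy` and the column names `firstDate`, `lastDate` and `followUpDays` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two patients with several tests each
toy <- data.frame(
  EndoHospNumId = c("A1", "B2", "A1", "B2", "A1"),
  Date.x = as.Date(c("2016-01-15", "2016-03-01", "2016-01-01",
                     "2016-03-22", "2016-02-12"))
)

FirstLast <- toy %>%
  group_by(EndoHospNumId) %>%
  summarise(firstDate = min(Date.x),
            lastDate = max(Date.x),
            # Subtracting two Dates gives the gap in days
            followUpDays = as.numeric(lastDate - firstDate),
            .groups = "drop")

FirstLast
```

Unlike slice(), this does not need the rows to be sorted first, but it keeps only the dates rather than the whole row.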

Selecting rows by date position based on a conditional

There are many occasions when simply grouping by dates is not sufficient for what you need. Perhaps you want to order a patient's investigations by date, so that the dates are ordered once the grouping by hospital number has been done, or perhaps you need to know the time difference between one test and another for a particular patient.

As always, dplyr has a solution for this. Let’s use a new data set just to make things more interesting:

#Generate some sample data:

proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Sample 100 hospital numbers (with replacement) from a pool of 20 patients:
HospNum_Id<-sample(c("P433224","P633443","K522332","G244224","S553322","D0739033","U873352","P223333","Y763634","I927282","P223311","P029834","U22415","U234252","S141141","O349253","T622722","J322909","F630230","T432452"), 100, replace = TRUE)
df<-data.frame(proc,dat,HospNum_Id)

So now we group the data according to patient number and sort each patient's procedures by date:

Upstage<-df %>%
  group_by(HospNum_Id) %>%
  arrange(HospNum_Id,dat)
#Only show the first 25 samples
kable(head(Upstage,25))

proc	dat	HospNum_Id
EMR	2013-10-05	D0739033
Biopsies	2014-02-08	D0739033
EMR	2014-08-08	D0739033
Biopsies	2016-08-29	D0739033
Biopsies	2017-02-12	D0739033
RFA	2013-10-16	F630230
EMR	2014-06-20	F630230
EMR	2014-08-27	F630230
RFA	2014-09-14	F630230
Biopsies	2016-09-14	F630230
Biopsies	2014-07-31	G244224
RFA	2015-04-04	G244224
EMR	2015-06-14	G244224
EMR	2016-12-17	G244224
EMR	2017-04-07	G244224
RFA	2013-09-16	I927282
RFA	2015-02-27	I927282
EMR	2016-07-29	I927282
RFA	2016-08-08	I927282
Biopsies	2016-11-08	I927282
EMR	2013-04-02	J322909
RFA	2014-02-02	J322909
Biopsies	2016-01-26	J322909
RFA	2016-10-12	J322909
RFA	2013-08-22	K522332

But actually we want only those patients who have had an EMR followed by RFA. lead() returns the value from the next row, so for each EMR row we check whether the next row for the same patient contains RFA in the proc column. The rows must be sorted by date before lead() is applied, otherwise the "next row" is simply whatever came next in the unsorted data.

Upstage<-df %>%
  arrange(HospNum_Id,dat) %>%
  group_by(HospNum_Id)%>%
  mutate(ind = proc=="EMR" & lead(proc)=="RFA")

#Only show the first 25 rows
kable(head(Upstage,25))

proc	dat	HospNum_Id	ind
EMR	2013-10-05	D0739033	FALSE
Biopsies	2014-02-08	D0739033	FALSE
EMR	2014-08-08	D0739033	FALSE
Biopsies	2016-08-29	D0739033	FALSE
Biopsies	2017-02-12	D0739033	FALSE
RFA	2013-10-16	F630230	FALSE
EMR	2014-06-20	F630230	FALSE
EMR	2014-08-27	F630230	TRUE
RFA	2014-09-14	F630230	FALSE
Biopsies	2016-09-14	F630230	FALSE
Biopsies	2014-07-31	G244224	FALSE
RFA	2015-04-04	G244224	FALSE
EMR	2015-06-14	G244224	FALSE
EMR	2016-12-17	G244224	FALSE
EMR	2017-04-07	G244224	NA
RFA	2013-09-16	I927282	FALSE
RFA	2015-02-27	I927282	FALSE
EMR	2016-07-29	I927282	TRUE
RFA	2016-08-08	I927282	FALSE
Biopsies	2016-11-08	I927282	FALSE
EMR	2013-04-02	J322909	TRUE
RFA	2014-02-02	J322909	FALSE
Biopsies	2016-01-26	J322909	FALSE
RFA	2016-10-12	J322909	FALSE
RFA	2013-08-22	K522332	FALSE

ind is NA where an EMR is the patient's last procedure, because there is no next row to compare it with.

But that simply flags the rows where an EMR is immediately followed by RFA. We want to pull out those pairs of procedures for each patient, so we have to do something a little more complex. We use the fact that the mutated column is boolean (TRUE or FALSE): which(ind) gives the positions of the flagged EMR rows within each patient, which(ind)+1 gives the RFA rows that follow them, and slice() keeps just those rows (sorted so that each pair stays together). Patients with no EMR followed by RFA drop out altogether. As before, the data is sorted by date before lead() is applied.

Upstage<-df %>%
  arrange(HospNum_Id,dat) %>%
  group_by(HospNum_Id)%>%
  mutate(ind = proc=="EMR" & lead(proc)=="RFA") %>%
  slice(sort(c(which(ind),which(ind)+1)))
#Only show the first 6 rows
kable(head(Upstage,6))

proc	dat	HospNum_Id	ind
EMR	2014-08-27	F630230	TRUE
RFA	2014-09-14	F630230	FALSE
EMR	2016-07-29	I927282	TRUE
RFA	2016-08-08	I927282	FALSE
EMR	2013-04-02	J322909	TRUE
RFA	2014-02-02	J322909	FALSE