Its fairly common to need to group data by an aspect of the date. For example, how many patients underwent a colonoscopy this month, or week etc. To do this we need a date object as part of the data as usual. As long as we can extract the aspect we want to group by this should be a breeze with dplyr
How do I find out how many endoscopies were done by month for the past calendar year, by endoscopy type. We will use some data that is already created. If you want to know how it was created you can check out this page…..
#Create the data
EndoHistoMerge<-source('EndoPathMerged_ExternalCode.R')
EndoHistoMerge<-data.frame(EndoHistoMerge)
#Neaten up the names
names(EndoHistoMerge)<-gsub("value.","",names(EndoHistoMerge),fixed=T)
#Lets just select the columns relevant to this page
GroupDatesExample<-EndoHistoMerge%>%select(EndoHospNumId,Date.x)
###The resulting data looks like this:
kable(head(GroupDatesExample,5))
EndoHospNumId | Date.x |
---|---|
S553322 | 2015-04-20 |
S553322 | 2015-04-20 |
S553322 | 2015-04-20 |
S553322 | 2015-04-20 |
S553322 | 2015-04-20 |
Using lubridate, we can extract the month very simply as follows: This can then be incorporated into dplyr
library(lubridate)
kable(GroupDatesExample %>% group_by(month=month(Date.x)) %>% summarise(Number=n()))%>%
kable_styling(bootstrap_options = "striped", full_width = F)
month | Number |
---|---|
1 | 1469 |
2 | 1374 |
3 | 1463 |
4 | 1606 |
5 | 1196 |
6 | 1341 |
7 | 1238 |
8 | 1169 |
9 | 1320 |
10 | 1356 |
11 | 1126 |
12 | 1238 |
Breaking this down, we are using select() to get the columns we are interested in, then we use group_by() to group according to the two aspects we are interested in. We then use summarise to count each of the groups. The output is then as expected.
What if you simply want to plot out the number of procedures done by year. You don’t have to use lubrudate to do this, you can do this is base RThat needs you to extract the year from the date and then summarise as follows:
Tots<-GroupDatesExample %>%
mutate(year = format(Date.x, "%Y")) %>%
group_by(year)%>%
summarise(n = n())
kable(Tots)
year | n |
---|---|
2013 | 3544 |
2014 | 3664 |
2015 | 3796 |
2016 | 3607 |
2017 | 1285 |
Often you need to know the time between consecutive tests for a patient. This is done using the difftime() function. Not we use the following functions a lot in the surveillance page so these are worth understanding:
DateBetween<-GroupDatesExample %>% arrange(EndoHospNumId, Date.x) %>% group_by(EndoHospNumId) %>%
mutate(diffDate = difftime(Date.x, lag(Date.x,1),units="weeks"))
kable(head(DateBetween,10))
EndoHospNumId | Date.x | diffDate |
---|---|---|
D0739033 | 2013-01-02 | NA |
D0739033 | 2013-01-03 | 0.1428571 weeks |
D0739033 | 2013-01-03 | 0.0000000 weeks |
D0739033 | 2013-01-04 | 0.1428571 weeks |
D0739033 | 2013-01-05 | 0.1428571 weeks |
D0739033 | 2013-01-05 | 0.0000000 weeks |
D0739033 | 2013-01-05 | 0.0000000 weeks |
D0739033 | 2013-01-05 | 0.0000000 weeks |
D0739033 | 2013-01-05 | 0.0000000 weeks |
D0739033 | 2013-01-05 | 0.0000000 weeks |
It may also be that you just need to know the first or last date in the tests for a patient, again using dplyr and the slice() function:
#To get the first
GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(1)
#To get the last
GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(n())
#To get the first and the last
kable(head(GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(c(1,n())),10))
EndoHospNumId | Date.x |
---|---|
D0739033 | 2013-01-02 |
D0739033 | 2017-05-01 |
F630230 | 2013-01-01 |
F630230 | 2017-04-21 |
G244224 | 2013-01-07 |
G244224 | 2017-04-26 |
I927282 | 2013-01-05 |
I927282 | 2017-04-29 |
J322909 | 2013-01-02 |
J322909 | 2017-04-23 |
There are many occasions when simply grouping by dates is not sufficient for what you need. Perhaps you want to order the number of investigations that a patient has had by date so that you are ordering the dates once the grouping by hospital number has already been done, or perhaps you need to know the time difference between one test and another for a particular patient
As always, dplyr has a solution for this: Let’s use a new data set just to make things more interesting:
#Generate some sample data:
proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
HospNum_Id<-sample(c("P433224","P633443","K522332","G244224","S553322","D0739033","U873352","P223333","Y763634","I927282","P223311","P029834","U22415","U234252","S141141","O349253","T622722","J322909","F630230","T432452"), 100, replace = TRUE)
df<-data.frame(proc,dat,HospNum_Id)
So now we group the data according to patient number:
Upstage<-df %>%
group_by(HospNum_Id) %>%
arrange(HospNum_Id,dat)
#Only show the first 25 samples
kable(head(Upstage,25))
proc | dat | HospNum_Id |
---|---|---|
EMR | 2013-10-05 | D0739033 |
Biopsies | 2014-02-08 | D0739033 |
EMR | 2014-08-08 | D0739033 |
Biopsies | 2016-08-29 | D0739033 |
Biopsies | 2017-02-12 | D0739033 |
RFA | 2013-10-16 | F630230 |
EMR | 2014-06-20 | F630230 |
EMR | 2014-08-27 | F630230 |
RFA | 2014-09-14 | F630230 |
Biopsies | 2016-09-14 | F630230 |
Biopsies | 2014-07-31 | G244224 |
RFA | 2015-04-04 | G244224 |
EMR | 2015-06-14 | G244224 |
EMR | 2016-12-17 | G244224 |
EMR | 2017-04-07 | G244224 |
RFA | 2013-09-16 | I927282 |
RFA | 2015-02-27 | I927282 |
EMR | 2016-07-29 | I927282 |
RFA | 2016-08-08 | I927282 |
Biopsies | 2016-11-08 | I927282 |
EMR | 2013-04-02 | J322909 |
RFA | 2014-02-02 | J322909 |
Biopsies | 2016-01-26 | J322909 |
RFA | 2016-10-12 | J322909 |
RFA | 2013-08-22 | K522332 |
But actually we want only those patients who have had and EMR followed by RFA. lead() means the leading row ie the row that leads to the next row (which should contain RFA in the proc column).
Upstage<-df %>%
group_by(HospNum_Id)%>%
mutate(ind = proc=="RFA" & lead(proc)=="EMR") %>%
arrange(HospNum_Id,dat)
#Only show the first 25 samples
kable(head(Upstage,25))
proc | dat | HospNum_Id | ind |
---|---|---|---|
EMR | 2013-10-05 | D0739033 | FALSE |
Biopsies | 2014-02-08 | D0739033 | FALSE |
EMR | 2014-08-08 | D0739033 | FALSE |
Biopsies | 2016-08-29 | D0739033 | FALSE |
Biopsies | 2017-02-12 | D0739033 | FALSE |
RFA | 2013-10-16 | F630230 | NA |
EMR | 2014-06-20 | F630230 | FALSE |
EMR | 2014-08-27 | F630230 | FALSE |
RFA | 2014-09-14 | F630230 | TRUE |
Biopsies | 2016-09-14 | F630230 | FALSE |
Biopsies | 2014-07-31 | G244224 | FALSE |
RFA | 2015-04-04 | G244224 | NA |
EMR | 2015-06-14 | G244224 | FALSE |
EMR | 2016-12-17 | G244224 | FALSE |
EMR | 2017-04-07 | G244224 | FALSE |
RFA | 2013-09-16 | I927282 | NA |
RFA | 2015-02-27 | I927282 | FALSE |
EMR | 2016-07-29 | I927282 | FALSE |
RFA | 2016-08-08 | I927282 | FALSE |
Biopsies | 2016-11-08 | I927282 | FALSE |
EMR | 2013-04-02 | J322909 | FALSE |
RFA | 2014-02-02 | J322909 | FALSE |
Biopsies | 2016-01-26 | J322909 | FALSE |
RFA | 2016-10-12 | J322909 | TRUE |
RFA | 2013-08-22 | K522332 | TRUE |
But that simply gives those values where consecutive rows show EMR and then RFA but we want it for patients so we have to do something a little more complex. We use the fact that the mutate column is boolean (so gives us a TRUE or FALSE return value) and we ask to return only those hospital numbers where all those values where this is true (and we also sort it).
Upstage<-df %>%
group_by(HospNum_Id)%>%
mutate(ind = proc=="EMR" & lead(proc)=="RFA") %>%
slice(sort(c(which(ind),which(ind)+1)))%>%
arrange(HospNum_Id,dat)
kable(Upstage)
proc | dat | HospNum_Id | ind |
---|---|---|---|
RFA | 2013-10-16 | F630230 | FALSE |
EMR | 2014-06-20 | F630230 | TRUE |
RFA | 2015-04-04 | G244224 | FALSE |
EMR | 2016-12-17 | G244224 | TRUE |
EMR | 2016-07-29 | I927282 | TRUE |
RFA | 2016-08-08 | I927282 | FALSE |
EMR | 2015-10-12 | O349253 | TRUE |
RFA | 2016-10-14 | O349253 | FALSE |
EMR | 2013-04-03 | P029834 | TRUE |
RFA | 2014-02-12 | P029834 | FALSE |
EMR | 2015-07-01 | P029834 | TRUE |
RFA | 2016-09-03 | P029834 | FALSE |
RFA | 2013-06-18 | P223311 | FALSE |
EMR | 2016-09-28 | P223311 | TRUE |
RFA | 2013-10-08 | P223333 | FALSE |
EMR | 2014-05-02 | P223333 | TRUE |
EMR | 2015-01-30 | P223333 | TRUE |
RFA | 2016-08-27 | P223333 | FALSE |
EMR | 2015-02-25 | P433224 | TRUE |
RFA | 2016-04-07 | P433224 | FALSE |
RFA | 2014-01-10 | S553322 | FALSE |
EMR | 2014-07-04 | S553322 | TRUE |
EMR | 2013-06-13 | U873352 | TRUE |
RFA | 2016-11-02 | U873352 | FALSE |