Dates are one of my biggest bugbears. As dull as it sounds, formatting dates correctly is a fundamental part of data cleaning in healthcare. This is especially true if your data objects are event-driven (which is to say you collect events that happen at time-points, such as the time a patient enters an endoscopy room or the day the endoscopy was done). In addition, the correct and standardised formatting of dates allows you to merge data from separate sources and still understand a unified sequence of events. Eventing, as I call it, is fundamental to understanding processes and their inefficiencies.
Given it is so important, perhaps the correct formatting of dates is not so boring after all.
Dates come in many forms, but in R the main formats are as follows:
1. POSIXct or POSIXlt - stores dates and times and can manipulate timezones.
2. Date - a date without times.
3. Character.
4. Numeric (the usual way a date is stored when imported from Excel; it refers to the number of days or seconds from an origin date).
On top of the above, dates even within the same format can come in different combinations, such as the day or the month being first, or an abbreviated year.
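To make the list above concrete, here is a quick console sketch of the same date held in each class (the variable names are just for illustration):

```r
# The same date stored in the main R classes
x_ct  <- as.POSIXct("2013-01-01 07:00", tz = "UTC")  # date-time, timezone-aware
x_dt  <- as.Date("2013-01-01")                       # date only
x_chr <- "2013-01-01"                                # plain text
x_num <- as.numeric(x_dt)                            # days since 1970-01-01

class(x_ct)   # "POSIXct" "POSIXt"
class(x_dt)   # "Date"
x_num         # 15706
```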
Most of the time when importing, the dates will be in character format, but not always.
The finest level of granularity I usually require is the day. This means that the as.Date function, part of base R, is usually sufficient. Occasionally I need hours, which requires a slightly different approach. Below are some examples:
Text to Date
dates <- c("05/22/80", "07/06/01")
#Remember to specify the format of the date in the text otherwise it defaults to yyyy-mm-dd
betterDates <- as.Date(dates, format = "%m/%d/%y")
betterDates
## [1] "1980-05-22" "2001-07-06"
#Or with the date time format
df <- data.frame(Date = c("10/9/2009 0:00:00", "10/15/2009 0:00:00"))
newDate<-as.Date(df$Date, "%m/%d/%Y %H:%M:%S")
newDate
## [1] "2009-10-09" "2009-10-15"
Numeric to Date (e.g. from Excel)
This normally requires remembering to specify an origin as the number is usually the number of days from that origin:
dates <- c(30899, 38567)
NewDates <- as.Date(dates, origin = "1899-12-30")
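As an aside, "1899-12-30" is the origin for workbooks using Excel's default Windows (1900) date system; older Mac workbooks use the 1904 date system, whose origin is "1904-01-01", so it is worth checking which one applies before converting:

```r
# Excel for Windows (1900 date system): origin "1899-12-30"
as.Date(30899, origin = "1899-12-30")
# Legacy Mac Excel workbooks (1904 date system): origin "1904-01-01"
as.Date(30899, origin = "1904-01-01")
```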
POSIXct to Date
mydate <- c("2013-01-01 07:00")
theDatesinPOSIXct <- as.POSIXct(mydate, tz = "UTC")
NewDates <- as.Date(theDatesinPOSIXct)
NewDates
## [1] "2013-01-01"
Date to character
# convert dates to character data
chrDate <- as.character(as.Date("2013-01-01"))
chrDate
## [1] "2013-01-01"
Often dates come as the date with the time in hours, minutes and seconds. At other times you just want the month of a date, or the year. This is where the ‘lubridate’ package can be so useful, as follows. Note, lubridate doesn’t take text but can take anything else, such as: POSIXct, POSIXlt, Date, Period, chron, yearmon, yearqtr, zoo, zooreg, timeDate, xts, its, ti, jul, timeSeries, and fts objects.
library(lubridate)
some_date <- c("01/02/1979", "03/04/1980")
month(as.POSIXlt(some_date, format="%d/%m/%Y"))
## [1] 2 4
#The same thing can be done with: day, month, year, hour, minute, second and many others. See https://rpubs.com/davoodastaraky/lubridate
What to do with mixed dates
If you are using many data sources, the dates can be very painful to standardise. In reality most will not have H:M:S associated with the date, so I would standardise all the dates to %d_%m_%Y. Note I don’t use “/” as the forward slash can be treated oddly in text extraction and particularly in regular expressions.
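Once each source has been parsed into Date objects, the standardisation itself is just a call to format(). A small sketch, using made-up input dates in three different styles:

```r
# Parse mixed-format text dates, then write them all out as %d_%m_%Y
raw <- c("10/9/2009 0:00:00", "2013-01-01", "05/22/80")
parsed <- c(as.Date(raw[1], "%m/%d/%Y %H:%M:%S"),
            as.Date(raw[2]),
            as.Date(raw[3], "%m/%d/%y"))
format(parsed, "%d_%m_%Y")
## [1] "09_10_2009" "01_01_2013" "22_05_1980"
```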
A particularly interesting package is ‘anytime’. This claims to be able to take any date and convert it into a date format. This returns a POSIXct object (or a date object if anyDate is used instead of anytime). This can be seen here: https://cran.r-project.org/web/packages/anytime/anytime.pdf
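A minimal sketch of the two entry points (assuming the package is installed; see the manual linked above for the full list of accepted input styles):

```r
library(anytime)
anytime("2016-09-12 07:39:21")        # returns a POSIXct object
anydate(c("2016-09-12", "20160912"))  # returns Date objects
```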
Time series analysis is worth exploring, particularly if your data are largely numerical. This type of analysis is often used for the assessment of financial data. The main time series object is xts, which essentially organises your data with the date as the label of each row.
The limiting factor I have found is the ability to group the time series according to a non-numerical variable. For example, if I want to split my data up according to the type of endoscopic procedure performed, I cannot use intrinsic time series objects to do this. It is, however, possible (see: Grouping by dates).
So, to create a time series object you should use the package ‘xts’ as follows:
library(knitr)  # provides kable()
library(kableExtra)
#input data
proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
HospNum_Id<-sample(c("P433224","P633443","K522332","G244224","S553322","D0739033","U873352","P223333","Y763634","I927282","P223311","P029834","U22415","U234252","S141141","O349253","T622722","J322909","F630230","T432452"), 100, replace = TRUE)
rndm<-sample(seq(0,40),100,replace=T)
df<-data.frame(proc,dat,HospNum_Id,rndm)
df$proc<-as.character(df$proc)
library(xts)
#Note: xts coerces the data frame to a matrix, so mixed columns become character
Myxts <- xts(df, order.by = df$dat)
kable(head(Myxts,25))
proc | dat | HospNum_Id | rndm |
---|---|---|---|
RFA | 2013-01-26 | D0739033 | 13 |
Biopsies | 2013-02-01 | P029834 | 16 |
Biopsies | 2013-02-12 | U22415 | 10 |
Biopsies | 2013-02-16 | K522332 | 10 |
EMR | 2013-03-27 | Y763634 | 40 |
RFA | 2013-04-06 | P433224 | 5 |
Biopsies | 2013-04-08 | S553322 | 24 |
RFA | 2013-04-23 | D0739033 | 8 |
RFA | 2013-05-12 | S553322 | 18 |
RFA | 2013-05-28 | I927282 | 30 |
RFA | 2013-06-24 | P223311 | 22 |
Biopsies | 2013-07-07 | P433224 | 1 |
EMR | 2013-07-08 | G244224 | 21 |
Biopsies | 2013-07-16 | I927282 | 39 |
Biopsies | 2013-07-19 | J322909 | 9 |
RFA | 2013-09-08 | P029834 | 40 |
Biopsies | 2013-09-21 | P223311 | 17 |
RFA | 2013-10-05 | P223311 | 38 |
Biopsies | 2013-10-11 | P223333 | 32 |
Biopsies | 2013-10-12 | P029834 | 4 |
EMR | 2013-10-26 | P223311 | 12 |
EMR | 2013-11-30 | S553322 | 38 |
Biopsies | 2013-12-05 | J322909 | 11 |
Biopsies | 2013-12-08 | P433224 | 6 |
Biopsies | 2013-12-09 | U873352 | 28 |
So you will notice that the data frame is now an xts object (you can check this by typing str(Myxts) into the console) and that the rows are organised with the date used as the index. More on how to analyse this in the Data Analysis section.
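One workaround for the grouping problem mentioned above is to split the data frame by the non-numerical variable first, and only then build one xts object per group. A sketch using the df created above (the list name "EMR" and the monthly summary are just illustrative choices):

```r
# One xts object per procedure type
xts_by_proc <- lapply(split(df, df$proc), function(d) {
  xts::xts(d[, c("HospNum_Id", "rndm")], order.by = d$dat)
})
# e.g. count the EMR procedures performed in each month
xts::apply.monthly(xts_by_proc[["EMR"]], nrow)
```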