Principles of date formatting


Dates are fundamental. Dates come in many formats. One of my biggest bug bears are dates. As dull as it sounds, formatting dates correctly is a fundamental part of data cleaning in healthcare. This is especially true if your data objects are event-driven (which is to say you collect events that happen at time-points such as the time a patient enters an endoscopy room or the day the endoscopy was done etc.). In addition, the correct and standardised formatting of dates will allow you to merge data from separate data sources and still understand a unified sequence of events. Eventing, as I call it, is fundamental to understanding processes and its inefficiences.

Given it is so important, perhaps the correct formatting of dates is not so boring after all.

Date formats:

Dates come in may forms, but in R the main formats are as follows:

1.POSIXct or POSIXlt - stores dates and times and can manipulate timezones.

2.Date - date without times.

3.Character.

4.Numeric (the usual way it is stored when imported from Excel and refers to the number of days or seconds from an origin date).

5.On top of the above, dates even within the same format can be of different combinations such as the day or the month being first, or an abbreviated year etc.

Most of the time when importing, the dates will be in character format, but not always.




Conversion of dates

The finest level of granularity I usually require is day. This means that the as.Date function, part of base R, is usually sufficient Occasionally I need hours which requires a slightly different approach. Below are some examples:

Text to Date

dates <- c("05/22/80", "07/06/01")
#Remember to specify the format of the date in the text otherwise it defaults to yyyy-mm-dd
betterDates <- as.Date(dates, format = "%B/%d/%Y")
betterDates
## [1] NA NA
#Or with the date time format
df <- data.frame(Date = c("10/9/2009 0:00:00", "10/15/2009 0:00:00"))
newDate<-as.Date(df$Date, "%m/%d/%Y %H:%M:%S")

newDate
## [1] "2009-10-09" "2009-10-15"

Numeric to Date (eg from Excel)

This normally requires remembering to specify an origin as the number is usually the number of days from that origin:

dates <- c(30899, 38567)
NewDates <- as.Date(dates, origin = "1899-12-30")

POSIXct to Date

mydate<-c("2013-01-01 07:00")
theDatesinPOSIXct<-as.Date(as.POSIXct(mydate))
NewDates<-as.Date(theDatesinPOSIXct)

Date to character

# convert dates to character data
#chrDate <- as.character(someDate)




Extracting parts of a date

Often dates come as the date with time in hours, mins and seconds. Other times you just want the month of a date, or the year. This is where the ‘lubridate’ package can be so useful as follows. Note, lubridate doesnt take text but can take anything else such as: POSIXct, POSIXlt, Date, Period, chron, yearmon, yearqtr, zoo, zooreg, timeDate, xts, its, ti, jul, timeSeries, and fts objects

library(lubridate)
some_date <- c("01/02/1979", "03/04/1980")
month(as.POSIXlt(some_date, format="%d/%m/%Y"))
## [1] 2 4
#The same thing can be done with: day, month , year, hour, minute, second and many others. See https://rpubs.com/davoodastaraky/lubridate

What to do with mixed dates If you are using many data sources, the dates can be very painful to standardise. In reality most will not have H:M:S associated with the date so I would standardise all the dates to %d_%m_%Y. Note I don’t use “/” as the forward slash can be treated oddly in text extraction and particularly regular expressions.

A particularly interesting package is ‘anytime’. This claims to be able to take any date and convert it into a date format. This returns a POSIXct object (or a date object if anyDate is used instead of anytime). This can be seen here: https://cran.r-project.org/web/packages/anytime/anytime.pdf






Time Series objects

Time series analysis is worth exploring particularly if you have very numerical data. This type of analysis is often used for the assessment of financial data. The time series object is xts and essentially organises your data with the date as the label of each row.

The limiting factor I have found is the ability to group the time series according to a non numerical variable For example if I want to split my data up according to type of endoscopic procedure performed I cannot use intrinsic time series objects to do this. It is however possible as seen on the page: Grouping by dates:

so to create a timSeries object you should use the package ‘xts’ as follows

library(kableExtra)

#input data
proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
HospNum_Id<-sample(c("P433224","P633443","K522332","G244224","S553322","D0739033","U873352","P223333","Y763634","I927282","P223311","P029834","U22415","U234252","S141141","O349253","T622722","J322909","F630230","T432452"), 100, replace = TRUE)
rndm<-sample(seq(0,40),100,replace=T)
df<-data.frame(proc,dat,HospNum_Id,rndm)
df$proc<-as.character(df$proc)


library(xts)
Myxts<-xts(df, order.by=df$dat)
kable(head(Myxts,25))
proc dat HospNum_Id rndm
RFA 2013-01-26 D0739033 13
Biopsies 2013-02-01 P029834 16
Biopsies 2013-02-12 U22415 10
Biopsies 2013-02-16 K522332 10
EMR 2013-03-27 Y763634 40
RFA 2013-04-06 P433224 5
Biopsies 2013-04-08 S553322 24
RFA 2013-04-23 D0739033 8
RFA 2013-05-12 S553322 18
RFA 2013-05-28 I927282 30
RFA 2013-06-24 P223311 22
Biopsies 2013-07-07 P433224 1
EMR 2013-07-08 G244224 21
Biopsies 2013-07-16 I927282 39
Biopsies 2013-07-19 J322909 9
RFA 2013-09-08 P029834 40
Biopsies 2013-09-21 P223311 17
RFA 2013-10-05 P223311 38
Biopsies 2013-10-11 P223333 32
Biopsies 2013-10-12 P029834 4
EMR 2013-10-26 P223311 12
EMR 2013-11-30 S553322 38
Biopsies 2013-12-05 J322909 11
Biopsies 2013-12-08 P433224 6
Biopsies 2013-12-09 U873352 28

So you will notice that the dataframe is now called a xts object (get this by typing str(Myxts) into the console) and that the rows are organised with date being used as the index. More on how to analyse this in the Data Analysis section