Dates are one of my biggest bugbears. As dull as it sounds, formatting dates correctly is a fundamental part of data cleaning in healthcare. This is especially true if your data objects are event-driven (which is to say you collect events that happen at time-points, such as the time a patient enters an endoscopy room or the day the endoscopy was done). In addition, the correct and standardised formatting of dates allows you to merge data from separate sources and still understand a unified sequence of events. Eventing, as I call it, is fundamental to understanding processes and their inefficiencies.
Given it is so important, perhaps the correct formatting of dates is not so boring after all.
Dates come in many forms, but in R the main formats are as follows:
1. POSIXct or POSIXlt - stores dates and times and can manipulate timezones.
2. Date - a date without times.
3. Character.
4. Numeric (the usual way a date is stored when imported from Excel; it refers to the number of days or seconds from an origin date).
On top of the above, dates even within the same format can come in different combinations, such as the day or the month being first, or an abbreviated year.
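To make the list above concrete, here is a quick console sketch of the same date held in each class (the variable names are just for illustration):

```r
# The same date stored in the main R classes
x_ct  <- as.POSIXct("2013-01-01 07:00", tz = "UTC")  # date-time, timezone-aware
x_dt  <- as.Date("2013-01-01")                       # date only
x_chr <- "2013-01-01"                                # plain text
x_num <- as.numeric(x_dt)                            # days since 1970-01-01

class(x_ct)   # "POSIXct" "POSIXt"
class(x_dt)   # "Date"
x_num         # 15706
```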
Most of the time when importing, the dates will be in character format, but not always.
The finest level of granularity I usually require is the day. This means that the as.Date function, part of base R, is usually sufficient. Occasionally I need hours, which requires a slightly different approach. Below are some examples:
Text to Date
dates <- c("05/22/80", "07/06/01")
#Remember to specify the format of the date in the text otherwise it defaults to yyyy-mm-dd
betterDates <- as.Date(dates, format = "%m/%d/%y")
betterDates
## [1] "1980-05-22" "2001-07-06"
#Or with the date time format
df <- data.frame(Date = c("10/9/2009 0:00:00", "10/15/2009 0:00:00"))
newDate<-as.Date(df$Date, "%m/%d/%Y %H:%M:%S")
newDate
## [1] "2009-10-09" "2009-10-15"
Numeric to Date (e.g. from Excel)
This normally requires remembering to specify an origin as the number is usually the number of days from that origin:
dates <- c(30899, 38567)
NewDates <- as.Date(dates, origin = "1899-12-30")
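As an aside, "1899-12-30" is the origin for workbooks using Excel's default Windows (1900) date system; older Mac workbooks use the 1904 date system, whose origin is "1904-01-01", so it is worth checking which one applies before converting:

```r
# Excel for Windows (1900 date system): origin "1899-12-30"
as.Date(30899, origin = "1899-12-30")
# Legacy Mac Excel workbooks (1904 date system): origin "1904-01-01"
as.Date(30899, origin = "1904-01-01")
```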
POSIXct to Date
mydate <- c("2013-01-01 07:00")
theDatesinPOSIXct <- as.POSIXct(mydate, tz = "UTC")
NewDates <- as.Date(theDatesinPOSIXct)
NewDates
## [1] "2013-01-01"
Date to character
# convert dates to character data
chrDate <- as.character(as.Date("2013-01-01"))
chrDate
## [1] "2013-01-01"
Often dates come as the date with the time in hours, minutes and seconds. At other times you just want the month of a date, or the year. This is where the ‘lubridate’ package can be so useful, as follows. Note, lubridate doesn’t take text but can take anything else, such as: POSIXct, POSIXlt, Date, Period, chron, yearmon, yearqtr, zoo, zooreg, timeDate, xts, its, ti, jul, timeSeries, and fts objects.
library(lubridate)
some_date <- c("01/02/1979", "03/04/1980")
month(as.POSIXlt(some_date, format="%d/%m/%Y"))
## [1] 2 4
#The same thing can be done with: day, month, year, hour, minute, second and many others. See https://rpubs.com/davoodastaraky/lubridate
What to do with mixed dates
If you are using many data sources, the dates can be very painful to standardise. In reality most will not have H:M:S associated with the date, so I would standardise all the dates to %d_%m_%Y. Note I don’t use “/” as the forward slash can be treated oddly in text extraction and particularly in regular expressions.
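Once each source has been parsed into Date objects, the standardisation itself is just a call to format(). A small sketch, using made-up input dates in three different styles:

```r
# Parse mixed-format text dates, then write them all out as %d_%m_%Y
raw <- c("10/9/2009 0:00:00", "2013-01-01", "05/22/80")
parsed <- c(as.Date(raw[1], "%m/%d/%Y %H:%M:%S"),
            as.Date(raw[2]),
            as.Date(raw[3], "%m/%d/%y"))
format(parsed, "%d_%m_%Y")
## [1] "09_10_2009" "01_01_2013" "22_05_1980"
```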
A particularly interesting package is ‘anytime’. This claims to be able to take any date and convert it into a date format. This returns a POSIXct object (or a date object if anyDate is used instead of anytime). This can be seen here: https://cran.r-project.org/web/packages/anytime/anytime.pdf
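A minimal sketch of the two entry points (assuming the package is installed; see the manual linked above for the full list of accepted input styles):

```r
library(anytime)
anytime("2016-09-12 07:39:21")        # returns a POSIXct object
anydate(c("2016-09-12", "20160912"))  # returns Date objects
```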
Time series analysis is worth exploring, particularly if your data are largely numerical. This type of analysis is often used for the assessment of financial data. The main time series object is xts, which essentially organises your data with the date as the label of each row.
The limiting factor I have found is the ability to group the time series according to a non-numerical variable. For example, if I want to split my data up according to the type of endoscopic procedure performed, I cannot use intrinsic time series objects to do this. It is, however, possible (see: Grouping by dates).
So, to create a time series object you should use the package ‘xts’ as follows:
library(knitr)  # provides kable()
library(kableExtra)
#input data
proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
HospNum_Id<-sample(c("P433224","P633443","K522332","G244224","S553322","D0739033","U873352","P223333","Y763634","I927282","P223311","P029834","U22415","U234252","S141141","O349253","T622722","J322909","F630230","T432452"), 100, replace = TRUE)
rndm<-sample(seq(0,40),100,replace=T)
df<-data.frame(proc,dat,HospNum_Id,rndm)
df$proc<-as.character(df$proc)
library(xts)
#Note: xts coerces the data frame to a matrix, so mixed columns become character
Myxts <- xts(df, order.by = df$dat)
kable(head(Myxts,25))
proc | dat | HospNum_Id | rndm |
---|---|---|---|
RFA | 2013-01-26 | D0739033 | 13 |
Biopsies | 2013-02-01 | P029834 | 16 |
Biopsies | 2013-02-12 | U22415 | 10 |
Biopsies | 2013-02-16 | K522332 | 10 |
EMR | 2013-03-27 | Y763634 | 40 |
RFA | 2013-04-06 | P433224 | 5 |
Biopsies | 2013-04-08 | S553322 | 24 |
RFA | 2013-04-23 | D0739033 | 8 |
RFA | 2013-05-12 | S553322 | 18 |
RFA | 2013-05-28 | I927282 | 30 |
RFA | 2013-06-24 | P223311 | 22 |
Biopsies | 2013-07-07 | P433224 | 1 |
EMR | 2013-07-08 | G244224 | 21 |
Biopsies | 2013-07-16 | I927282 | 39 |
Biopsies | 2013-07-19 | J322909 | 9 |
RFA | 2013-09-08 | P029834 | 40 |
Biopsies | 2013-09-21 | P223311 | 17 |
RFA | 2013-10-05 | P223311 | 38 |
Biopsies | 2013-10-11 | P223333 | 32 |
Biopsies | 2013-10-12 | P029834 | 4 |
EMR | 2013-10-26 | P223311 | 12 |
EMR | 2013-11-30 | S553322 | 38 |
Biopsies | 2013-12-05 | J322909 | 11 |
Biopsies | 2013-12-08 | P433224 | 6 |
Biopsies | 2013-12-09 | U873352 | 28 |
So you will notice that the data frame is now an xts object (you can check this by typing str(Myxts) into the console) and that the rows are organised with the date used as the index. More on how to analyse this in the Data Analysis section.
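One workaround for the grouping problem mentioned above is to split the data frame by the non-numerical variable first, and only then build one xts object per group. A sketch using the df created above (the list name "EMR" and the monthly summary are just illustrative choices):

```r
# One xts object per procedure type
xts_by_proc <- lapply(split(df, df$proc), function(d) {
  xts::xts(d[, c("HospNum_Id", "rndm")], order.by = d$dat)
})
# e.g. count the EMR procedures performed in each month
xts::apply.monthly(xts_by_proc[["EMR"]], nrow)
```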