It is unusual to have all data you want to analyse within one dataset. As a result being able to merge data from different datasets is crucial. Merging isn’t as straightforward as you may think. The principles are as follows
Firstly lets create the dataset. You will notice that the dates are slightly different between Endoscopy and path reports for some patients. This is real life where the date received for the pathology numbers can be slightly different to the date of the endoscopy.
#Creating the datasets
#Random letters to fill out a pathology report
pathRep<-replicate(20,paste(sample(LETTERS,20,replace=T),collapse=""))
pathDate<-as.Date(c("1993-12-22","1994-05-16","1992-07-20","1996-06-02","1992-04-20","1996-08-30","1992-01-26","1991-03-23","1995-12-28","1995-07-15","1993-04-04","1994-01-11","1999-08-21","1993-11-10","1994-02-26","1992-08-06","1993-06-29","1997-03-08","1998-03-03","1998-04-17"))
pathHospitalNum<-c("H432243","T662272","G424284","W787634","H432243","Y980037","H432243","W787634","Y980037","E432243","U874287","Y980037","U874287","W787634","Y980037","H432243","Y980037","E432243","W787634","W787634")
#Create the dataframe
pathdf<-data.frame(pathRep,pathDate,pathHospitalNum)
#Random letters to fill out a pathology report
EndoRep<-replicate(20,paste(sample(LETTERS,20,replace=T),collapse=""))
EndoDate<-as.Date(c("1993-12-22","1994-05-14","1992-07-19","1996-06-01","1992-04-20","1996-08-30","1992-01-24","1991-03-21","1995-12-28","1995-07-15","1993-04-02","1994-01-10","1999-08-21","1993-11-10","1994-02-26","1992-08-05","1993-06-29","1997-03-07","1998-03-03","1998-04-17"))
EndoHospitalNum<-c("H432243","T662272","G424284","W787634","H432243","Y980037","H432243","W787634","Y980037","E432243","U874287","Y980037","U874287","W787634","Y980037","H432243","Y980037","E432243","W787634","W787634")
#Create the dataframe:
Endodf<-data.frame(EndoRep,EndoDate,EndoHospitalNum)
A simple way of merging for this ultrasimplified dataset is just to use the function merge() in base R. We will choose the columns date and Hospital number to do this: Before we do this we have to name the columns to merge the same so R can recognise them
names(Endodf)<-c("Report","Date","HospNum")
names(pathdf)<-c("Report","Date","HospNum")
Now we are ready to merge:
themerge<-merge(Endodf,pathdf,by=c("Date","HospNum"))
kable(themerge)
Date | HospNum | Report.x | Report.y |
---|---|---|---|
1992-04-20 | H432243 | IOKHSTVPDISUNNEQVAGP | QZVUSNGAUNBEDIFHPSEE |
1993-06-29 | Y980037 | WKODZTQLYTWHGMPLUZRI | FBDXMVTJUYFMDDRYOKCD |
1993-11-10 | W787634 | CLXBXFRZEIEQEEPBLRIE | KWKPHXJJRCUYBDKHUHRM |
1993-12-22 | H432243 | RUVBTEJSSPCKYURMWNGL | OKMZQKDWLIBEFEFKPQQZ |
1994-02-26 | Y980037 | IYDCAOHKDBHLOROJTQHI | IFULSLEDMQHAASZETFCI |
1995-07-15 | E432243 | OKNOCLWDKNBFIJYAGMHC | HLBXWFQOHJMTEUKLRQAK |
1995-12-28 | Y980037 | GCGJUTIRMYYTJKHBQJTD | XSJQVRUSNACJAYFQIBNZ |
1996-08-30 | Y980037 | TGTPVELHPFCQDYIODUZU | ULRZVUTQKMMOPAMZGYEU |
1998-03-03 | W787634 | NJSRERLELVADXBAGQJTI | LDAQPEXROTZUONDSJQIF |
1998-04-17 | W787634 | YOWAYQEBLRSGASAQYYSD | EUBVODPWCFJVEAOCMTFD |
1999-08-21 | U874287 | WSWTMGEAFBAJHJUPEPUE | MRYSVPTQMDNOILAMERZP |
But note, the result set only includes 11 records. That is because we are merging according to where the records, both Hospital Number and date, are the same. So lets loosen that up a little and lets allow the merge to match if the date is up to 4 days out (to allow delays in transit from the endoscopy unit to the pathology department, samples stored in fridges over bank holidays etc..). To do this we are going to create use data.table
library(data.table)
pathdt<-data.table(pathRep,pathDate,pathHospitalNum)
setkey(pathdt, pathHospitalNum, pathDate)
Endodt<-data.table(EndoRep,EndoDate,EndoHospitalNum)
setkey(Endodt, EndoHospitalNum, EndoDate)
#run the join
themerge2<-Endodt[pathdt,roll="nearest"]
kable(head(themerge2,25))
EndoRep | EndoDate | EndoHospitalNum | pathRep |
---|---|---|---|
OKNOCLWDKNBFIJYAGMHC | 1995-07-15 | E432243 | HLBXWFQOHJMTEUKLRQAK |
GAGVNQVDZINLISBIZVYW | 1997-03-08 | E432243 | QKAVESSKEMGXJDPBWUWZ |
NMVVOSFHYJIJKMLNXBKP | 1992-07-20 | G424284 | IKQJGARHDTYJPSEWRTXU |
PYREFSEEVQZSIPLPZSHA | 1992-01-26 | H432243 | VMUAWSGWVGBFUFTXFXVZ |
IOKHSTVPDISUNNEQVAGP | 1992-04-20 | H432243 | QZVUSNGAUNBEDIFHPSEE |
KWAUINSTNGJZBMYEPNSA | 1992-08-06 | H432243 | UYBTKPBJTZNFDQRGNAUO |
RUVBTEJSSPCKYURMWNGL | 1993-12-22 | H432243 | OKMZQKDWLIBEFEFKPQQZ |
MULEZPZHBGFQOCIDEWEZ | 1994-05-16 | T662272 | HPJPOGPKQLORYMCPTHJF |
SOYZGYYTFVGGGFUMICUC | 1993-04-04 | U874287 | KNDPFWMVTMPHSSZLCRSZ |
WSWTMGEAFBAJHJUPEPUE | 1999-08-21 | U874287 | MRYSVPTQMDNOILAMERZP |
ZNHPURPMFCGDQVHLYTPR | 1991-03-23 | W787634 | YXCIMEJMAKBSDFAZYWRZ |
CLXBXFRZEIEQEEPBLRIE | 1993-11-10 | W787634 | KWKPHXJJRCUYBDKHUHRM |
CMAIJFCNHJURYCHNOAYL | 1996-06-02 | W787634 | VMHPENKNLDGOGLLHIOWG |
NJSRERLELVADXBAGQJTI | 1998-03-03 | W787634 | LDAQPEXROTZUONDSJQIF |
YOWAYQEBLRSGASAQYYSD | 1998-04-17 | W787634 | EUBVODPWCFJVEAOCMTFD |
WKODZTQLYTWHGMPLUZRI | 1993-06-29 | Y980037 | FBDXMVTJUYFMDDRYOKCD |
BGORJLYVJTVRDPCENMJA | 1994-01-11 | Y980037 | HOWWHUTEFOMWCOMUHGPP |
IYDCAOHKDBHLOROJTQHI | 1994-02-26 | Y980037 | IFULSLEDMQHAASZETFCI |
GCGJUTIRMYYTJKHBQJTD | 1995-12-28 | Y980037 | XSJQVRUSNACJAYFQIBNZ |
TGTPVELHPFCQDYIODUZU | 1996-08-30 | Y980037 | ULRZVUTQKMMOPAMZGYEU |
There is of course an alternative way to do this using a package called fuzzyjoin and dplyr. Fuzzyjoin allows you to be less exact about the parameters of the join. It really is a very useful package
library(fuzzyjoin)
library(dplyr)
#Rename the columns so can do the join
names(Endodf)<-c("EndoRep","Date","EndoHospitalNum")
names(pathdf)<-c("pathRep","Date","pathHospitalNum")
themerge3 <-
fuzzyjoin::difference_full_join(Endodf, pathdf, by = 'Date', max_dist = 2, distance_col = 'Days') %>%
filter(EndoHospitalNum == pathHospitalNum)
kable(head(themerge3))
EndoRep | Date.x | EndoHospitalNum | pathRep | Date.y | pathHospitalNum | Days |
---|---|---|---|---|---|---|
RUVBTEJSSPCKYURMWNGL | 1993-12-22 | H432243 | OKMZQKDWLIBEFEFKPQQZ | 1993-12-22 | H432243 | 0 days |
MULEZPZHBGFQOCIDEWEZ | 1994-05-14 | T662272 | HPJPOGPKQLORYMCPTHJF | 1994-05-16 | T662272 | 2 days |
NMVVOSFHYJIJKMLNXBKP | 1992-07-19 | G424284 | IKQJGARHDTYJPSEWRTXU | 1992-07-20 | G424284 | 1 days |
CMAIJFCNHJURYCHNOAYL | 1996-06-01 | W787634 | VMHPENKNLDGOGLLHIOWG | 1996-06-02 | W787634 | 1 days |
IOKHSTVPDISUNNEQVAGP | 1992-04-20 | H432243 | QZVUSNGAUNBEDIFHPSEE | 1992-04-20 | H432243 | 0 days |
TGTPVELHPFCQDYIODUZU | 1996-08-30 | Y980037 | ULRZVUTQKMMOPAMZGYEU | 1996-08-30 | Y980037 | 0 days |