Principles of data merging:

It is unusual to have all data you want to analyse within one dataset. As a result being able to merge data from different datasets is crucial. Merging isn’t as straightforward as you may think. The principles are as follows

  1. When merging different datasets, choose columns to merge on where the resulting merge will result in a unique row. For example let’s say you have two datasets, an endoscopy and a pathology dataset. Both datasets will contain the Hospital number of the patient the report is written for. However, one patient may have many pathology reports, or perhaps many endoscopies but only one where any pathology was taken. You’re job is to figure out which pathology report should be associated with which endoscopy report









  1. Merge all vs merging some When merging you should also be careful to understand what you are including in the merge. Lets say you have a pathology dataset with 10 records but an endoscopy dataset with 100 records. Do you want to merge so you include only the data which have matched records, or do you want to include all endoscopy regardless of whether a matched pathology record is present as well?






  1. Merge records from the same dataset Merging can also happen when you have two subsets from the same dataset and you need to add them together. This is different to classic merging as above and is usually done with the base R functions cbind and rbind. I won’t discuss these as these are basic functions and only written here to illustrate an alternative form of merging that is available.











Situation 1- merging different datasets

I want to merge a pathology dataset with an endoscopy dataset so that the endoscopy has the correct associated pathology for that dataset

Firstly lets create the dataset. You will notice that the dates are slightly different between Endoscopy and path reports for some patients. This is real life where the date received for the pathology numbers can be slightly different to the date of the endoscopy.

#Creating the datasets

#Random letters to fill out a pathology report
pathRep<-replicate(20,paste(sample(LETTERS,20,replace=T),collapse=""))
pathDate<-as.Date(c("1993-12-22","1994-05-16","1992-07-20","1996-06-02","1992-04-20","1996-08-30","1992-01-26","1991-03-23","1995-12-28","1995-07-15","1993-04-04","1994-01-11","1999-08-21","1993-11-10","1994-02-26","1992-08-06","1993-06-29","1997-03-08","1998-03-03","1998-04-17"))
pathHospitalNum<-c("H432243","T662272","G424284","W787634","H432243","Y980037","H432243","W787634","Y980037","E432243","U874287","Y980037","U874287","W787634","Y980037","H432243","Y980037","E432243","W787634","W787634")
#Create the dataframe
pathdf<-data.frame(pathRep,pathDate,pathHospitalNum)

#Random letters to fill out a pathology report
EndoRep<-replicate(20,paste(sample(LETTERS,20,replace=T),collapse=""))
EndoDate<-as.Date(c("1993-12-22","1994-05-14","1992-07-19","1996-06-01","1992-04-20","1996-08-30","1992-01-24","1991-03-21","1995-12-28","1995-07-15","1993-04-02","1994-01-10","1999-08-21","1993-11-10","1994-02-26","1992-08-05","1993-06-29","1997-03-07","1998-03-03","1998-04-17"))
EndoHospitalNum<-c("H432243","T662272","G424284","W787634","H432243","Y980037","H432243","W787634","Y980037","E432243","U874287","Y980037","U874287","W787634","Y980037","H432243","Y980037","E432243","W787634","W787634")
#Create the dataframe:
Endodf<-data.frame(EndoRep,EndoDate,EndoHospitalNum)

A simple way of merging for this ultrasimplified dataset is just to use the function merge() in base R. We will choose the columns date and Hospital number to do this: Before we do this we have to name the columns to merge the same so R can recognise them

names(Endodf)<-c("Report","Date","HospNum")
names(pathdf)<-c("Report","Date","HospNum")

Now we are ready to merge:

themerge<-merge(Endodf,pathdf,by=c("Date","HospNum"))
kable(themerge)
Date HospNum Report.x Report.y
1992-04-20 H432243 IOKHSTVPDISUNNEQVAGP QZVUSNGAUNBEDIFHPSEE
1993-06-29 Y980037 WKODZTQLYTWHGMPLUZRI FBDXMVTJUYFMDDRYOKCD
1993-11-10 W787634 CLXBXFRZEIEQEEPBLRIE KWKPHXJJRCUYBDKHUHRM
1993-12-22 H432243 RUVBTEJSSPCKYURMWNGL OKMZQKDWLIBEFEFKPQQZ
1994-02-26 Y980037 IYDCAOHKDBHLOROJTQHI IFULSLEDMQHAASZETFCI
1995-07-15 E432243 OKNOCLWDKNBFIJYAGMHC HLBXWFQOHJMTEUKLRQAK
1995-12-28 Y980037 GCGJUTIRMYYTJKHBQJTD XSJQVRUSNACJAYFQIBNZ
1996-08-30 Y980037 TGTPVELHPFCQDYIODUZU ULRZVUTQKMMOPAMZGYEU
1998-03-03 W787634 NJSRERLELVADXBAGQJTI LDAQPEXROTZUONDSJQIF
1998-04-17 W787634 YOWAYQEBLRSGASAQYYSD EUBVODPWCFJVEAOCMTFD
1999-08-21 U874287 WSWTMGEAFBAJHJUPEPUE MRYSVPTQMDNOILAMERZP






Situation 2- merging different datasets but merging all instead of some:

But note, the result set only includes 11 records. That is because we are merging according to where the records, both Hospital Number and date, are the same. So lets loosen that up a little and lets allow the merge to match if the date is up to 4 days out (to allow delays in transit from the endoscopy unit to the pathology department, samples stored in fridges over bank holidays etc..). To do this we are going to create use data.table

library(data.table)
pathdt<-data.table(pathRep,pathDate,pathHospitalNum)
setkey(pathdt, pathHospitalNum, pathDate)

Endodt<-data.table(EndoRep,EndoDate,EndoHospitalNum)
setkey(Endodt, EndoHospitalNum, EndoDate)

#run the join
themerge2<-Endodt[pathdt,roll="nearest"]
kable(head(themerge2,25))
EndoRep EndoDate EndoHospitalNum pathRep
OKNOCLWDKNBFIJYAGMHC 1995-07-15 E432243 HLBXWFQOHJMTEUKLRQAK
GAGVNQVDZINLISBIZVYW 1997-03-08 E432243 QKAVESSKEMGXJDPBWUWZ
NMVVOSFHYJIJKMLNXBKP 1992-07-20 G424284 IKQJGARHDTYJPSEWRTXU
PYREFSEEVQZSIPLPZSHA 1992-01-26 H432243 VMUAWSGWVGBFUFTXFXVZ
IOKHSTVPDISUNNEQVAGP 1992-04-20 H432243 QZVUSNGAUNBEDIFHPSEE
KWAUINSTNGJZBMYEPNSA 1992-08-06 H432243 UYBTKPBJTZNFDQRGNAUO
RUVBTEJSSPCKYURMWNGL 1993-12-22 H432243 OKMZQKDWLIBEFEFKPQQZ
MULEZPZHBGFQOCIDEWEZ 1994-05-16 T662272 HPJPOGPKQLORYMCPTHJF
SOYZGYYTFVGGGFUMICUC 1993-04-04 U874287 KNDPFWMVTMPHSSZLCRSZ
WSWTMGEAFBAJHJUPEPUE 1999-08-21 U874287 MRYSVPTQMDNOILAMERZP
ZNHPURPMFCGDQVHLYTPR 1991-03-23 W787634 YXCIMEJMAKBSDFAZYWRZ
CLXBXFRZEIEQEEPBLRIE 1993-11-10 W787634 KWKPHXJJRCUYBDKHUHRM
CMAIJFCNHJURYCHNOAYL 1996-06-02 W787634 VMHPENKNLDGOGLLHIOWG
NJSRERLELVADXBAGQJTI 1998-03-03 W787634 LDAQPEXROTZUONDSJQIF
YOWAYQEBLRSGASAQYYSD 1998-04-17 W787634 EUBVODPWCFJVEAOCMTFD
WKODZTQLYTWHGMPLUZRI 1993-06-29 Y980037 FBDXMVTJUYFMDDRYOKCD
BGORJLYVJTVRDPCENMJA 1994-01-11 Y980037 HOWWHUTEFOMWCOMUHGPP
IYDCAOHKDBHLOROJTQHI 1994-02-26 Y980037 IFULSLEDMQHAASZETFCI
GCGJUTIRMYYTJKHBQJTD 1995-12-28 Y980037 XSJQVRUSNACJAYFQIBNZ
TGTPVELHPFCQDYIODUZU 1996-08-30 Y980037 ULRZVUTQKMMOPAMZGYEU






Joining when the join is fuzzy

There is of course an alternative way to do this using a package called fuzzyjoin and dplyr. Fuzzyjoin allows you to be less exact about the parameters of the join. It really is a very useful package

library(fuzzyjoin)
library(dplyr)
#Rename the columns so can do the join
names(Endodf)<-c("EndoRep","Date","EndoHospitalNum")
names(pathdf)<-c("pathRep","Date","pathHospitalNum")
themerge3 <- 
  fuzzyjoin::difference_full_join(Endodf, pathdf, by = 'Date', max_dist = 2, distance_col = 'Days') %>%
  filter(EndoHospitalNum == pathHospitalNum) 
kable(head(themerge3))
EndoRep Date.x EndoHospitalNum pathRep Date.y pathHospitalNum Days
RUVBTEJSSPCKYURMWNGL 1993-12-22 H432243 OKMZQKDWLIBEFEFKPQQZ 1993-12-22 H432243 0 days
MULEZPZHBGFQOCIDEWEZ 1994-05-14 T662272 HPJPOGPKQLORYMCPTHJF 1994-05-16 T662272 2 days
NMVVOSFHYJIJKMLNXBKP 1992-07-19 G424284 IKQJGARHDTYJPSEWRTXU 1992-07-20 G424284 1 days
CMAIJFCNHJURYCHNOAYL 1996-06-01 W787634 VMHPENKNLDGOGLLHIOWG 1996-06-02 W787634 1 days
IOKHSTVPDISUNNEQVAGP 1992-04-20 H432243 QZVUSNGAUNBEDIFHPSEE 1992-04-20 H432243 0 days
TGTPVELHPFCQDYIODUZU 1996-08-30 Y980037 ULRZVUTQKMMOPAMZGYEU 1996-08-30 Y980037 0 days