Merging data

Principles of data merging:

It is unusual to have all data you want to analyse within one dataset. As a result being able to merge data from different datasets is crucial. Merging isn’t as straightforward as you may think. The principles are as follows

When merging different datasets, choose columns to merge on where the resulting merge will result in a unique row. For example let’s say you have two datasets, an endoscopy and a pathology dataset. Both datasets will contain the Hospital number of the patient the report is written for. However, one patient may have many pathology reports, or perhaps many endoscopies but only one where any pathology was taken. You’re job is to figure out which pathology report should be associated with which endoscopy report

Merge all vs merging some When merging you should also be careful to understand what you are including in the merge. Lets say you have a pathology dataset with 10 records but an endoscopy dataset with 100 records. Do you want to merge so you include only the data which have matched records, or do you want to include all endoscopy regardless of whether a matched pathology record is present as well?

Merge records from the same dataset Merging can also happen when you have two subsets from the same dataset and you need to add them together. This is different to classic merging as above and is usually done with the base R functions cbind and rbind. I won’t discuss these as these are basic functions and only written here to illustrate an alternative form of merging that is available.

Situation 1- merging different datasets

I want to merge a pathology dataset with an endoscopy dataset so that the endoscopy has the correct associated pathology for that dataset

Firstly lets create the dataset. You will notice that the dates are slightly different between Endoscopy and path reports for some patients. This is real life where the date received for the pathology numbers can be slightly different to the date of the endoscopy.

#Creating the datasets

#Random letters to fill out a pathology report
pathRep<-replicate(20,paste(sample(LETTERS,20,replace=T),collapse=""))
pathDate<-as.Date(c("1993-12-22","1994-05-16","1992-07-20","1996-06-02","1992-04-20","1996-08-30","1992-01-26","1991-03-23","1995-12-28","1995-07-15","1993-04-04","1994-01-11","1999-08-21","1993-11-10","1994-02-26","1992-08-06","1993-06-29","1997-03-08","1998-03-03","1998-04-17"))
pathHospitalNum<-c("H432243","T662272","G424284","W787634","H432243","Y980037","H432243","W787634","Y980037","E432243","U874287","Y980037","U874287","W787634","Y980037","H432243","Y980037","E432243","W787634","W787634")
#Create the dataframe
pathdf<-data.frame(pathRep,pathDate,pathHospitalNum)

#Random letters to fill out a pathology report
EndoRep<-replicate(20,paste(sample(LETTERS,20,replace=T),collapse=""))
EndoDate<-as.Date(c("1993-12-22","1994-05-14","1992-07-19","1996-06-01","1992-04-20","1996-08-30","1992-01-24","1991-03-21","1995-12-28","1995-07-15","1993-04-02","1994-01-10","1999-08-21","1993-11-10","1994-02-26","1992-08-05","1993-06-29","1997-03-07","1998-03-03","1998-04-17"))
EndoHospitalNum<-c("H432243","T662272","G424284","W787634","H432243","Y980037","H432243","W787634","Y980037","E432243","U874287","Y980037","U874287","W787634","Y980037","H432243","Y980037","E432243","W787634","W787634")
#Create the dataframe:
Endodf<-data.frame(EndoRep,EndoDate,EndoHospitalNum)

A simple way of merging for this ultrasimplified dataset is just to use the function merge() in base R. We will choose the columns date and Hospital number to do this: Before we do this we have to name the columns to merge the same so R can recognise them

names(Endodf)<-c("Report","Date","HospNum")
names(pathdf)<-c("Report","Date","HospNum")

Now we are ready to merge:

themerge<-merge(Endodf,pathdf,by=c("Date","HospNum"))
kable(themerge)

Date	HospNum	Report.x	Report.y
1992-04-20	H432243	IOKHSTVPDISUNNEQVAGP	QZVUSNGAUNBEDIFHPSEE
1993-06-29	Y980037	WKODZTQLYTWHGMPLUZRI	FBDXMVTJUYFMDDRYOKCD
1993-11-10	W787634	CLXBXFRZEIEQEEPBLRIE	KWKPHXJJRCUYBDKHUHRM
1993-12-22	H432243	RUVBTEJSSPCKYURMWNGL	OKMZQKDWLIBEFEFKPQQZ
1994-02-26	Y980037	IYDCAOHKDBHLOROJTQHI	IFULSLEDMQHAASZETFCI
1995-07-15	E432243	OKNOCLWDKNBFIJYAGMHC	HLBXWFQOHJMTEUKLRQAK
1995-12-28	Y980037	GCGJUTIRMYYTJKHBQJTD	XSJQVRUSNACJAYFQIBNZ
1996-08-30	Y980037	TGTPVELHPFCQDYIODUZU	ULRZVUTQKMMOPAMZGYEU
1998-03-03	W787634	NJSRERLELVADXBAGQJTI	LDAQPEXROTZUONDSJQIF
1998-04-17	W787634	YOWAYQEBLRSGASAQYYSD	EUBVODPWCFJVEAOCMTFD
1999-08-21	U874287	WSWTMGEAFBAJHJUPEPUE	MRYSVPTQMDNOILAMERZP

Situation 2- merging different datasets but merging all instead of some:

But note, the result set only includes 11 records. That is because we are merging according to where the records, both Hospital Number and date, are the same. So lets loosen that up a little and lets allow the merge to match if the date is up to 4 days out (to allow delays in transit from the endoscopy unit to the pathology department, samples stored in fridges over bank holidays etc..). To do this we are going to create use data.table

library(data.table)
pathdt<-data.table(pathRep,pathDate,pathHospitalNum)
setkey(pathdt, pathHospitalNum, pathDate)

Endodt<-data.table(EndoRep,EndoDate,EndoHospitalNum)
setkey(Endodt, EndoHospitalNum, EndoDate)

#run the join
themerge2<-Endodt[pathdt,roll="nearest"]
kable(head(themerge2,25))

EndoRep	EndoDate	EndoHospitalNum	pathRep
OKNOCLWDKNBFIJYAGMHC	1995-07-15	E432243	HLBXWFQOHJMTEUKLRQAK
GAGVNQVDZINLISBIZVYW	1997-03-08	E432243	QKAVESSKEMGXJDPBWUWZ
NMVVOSFHYJIJKMLNXBKP	1992-07-20	G424284	IKQJGARHDTYJPSEWRTXU
PYREFSEEVQZSIPLPZSHA	1992-01-26	H432243	VMUAWSGWVGBFUFTXFXVZ
IOKHSTVPDISUNNEQVAGP	1992-04-20	H432243	QZVUSNGAUNBEDIFHPSEE
KWAUINSTNGJZBMYEPNSA	1992-08-06	H432243	UYBTKPBJTZNFDQRGNAUO
RUVBTEJSSPCKYURMWNGL	1993-12-22	H432243	OKMZQKDWLIBEFEFKPQQZ
MULEZPZHBGFQOCIDEWEZ	1994-05-16	T662272	HPJPOGPKQLORYMCPTHJF
SOYZGYYTFVGGGFUMICUC	1993-04-04	U874287	KNDPFWMVTMPHSSZLCRSZ
WSWTMGEAFBAJHJUPEPUE	1999-08-21	U874287	MRYSVPTQMDNOILAMERZP
ZNHPURPMFCGDQVHLYTPR	1991-03-23	W787634	YXCIMEJMAKBSDFAZYWRZ
CLXBXFRZEIEQEEPBLRIE	1993-11-10	W787634	KWKPHXJJRCUYBDKHUHRM
CMAIJFCNHJURYCHNOAYL	1996-06-02	W787634	VMHPENKNLDGOGLLHIOWG
NJSRERLELVADXBAGQJTI	1998-03-03	W787634	LDAQPEXROTZUONDSJQIF
YOWAYQEBLRSGASAQYYSD	1998-04-17	W787634	EUBVODPWCFJVEAOCMTFD
WKODZTQLYTWHGMPLUZRI	1993-06-29	Y980037	FBDXMVTJUYFMDDRYOKCD
BGORJLYVJTVRDPCENMJA	1994-01-11	Y980037	HOWWHUTEFOMWCOMUHGPP
IYDCAOHKDBHLOROJTQHI	1994-02-26	Y980037	IFULSLEDMQHAASZETFCI
GCGJUTIRMYYTJKHBQJTD	1995-12-28	Y980037	XSJQVRUSNACJAYFQIBNZ
TGTPVELHPFCQDYIODUZU	1996-08-30	Y980037	ULRZVUTQKMMOPAMZGYEU

Joining when the join is fuzzy

There is of course an alternative way to do this using a package called fuzzyjoin and dplyr. Fuzzyjoin allows you to be less exact about the parameters of the join. It really is a very useful package

library(fuzzyjoin)
library(dplyr)
#Rename the columns so can do the join
names(Endodf)<-c("EndoRep","Date","EndoHospitalNum")
names(pathdf)<-c("pathRep","Date","pathHospitalNum")
themerge3 <- 
  fuzzyjoin::difference_full_join(Endodf, pathdf, by = 'Date', max_dist = 2, distance_col = 'Days') %>%
  filter(EndoHospitalNum == pathHospitalNum) 
kable(head(themerge3))

EndoRep	Date.x	EndoHospitalNum	pathRep	Date.y	pathHospitalNum	Days
RUVBTEJSSPCKYURMWNGL	1993-12-22	H432243	OKMZQKDWLIBEFEFKPQQZ	1993-12-22	H432243	0 days
MULEZPZHBGFQOCIDEWEZ	1994-05-14	T662272	HPJPOGPKQLORYMCPTHJF	1994-05-16	T662272	2 days
NMVVOSFHYJIJKMLNXBKP	1992-07-19	G424284	IKQJGARHDTYJPSEWRTXU	1992-07-20	G424284	1 days
CMAIJFCNHJURYCHNOAYL	1996-06-01	W787634	VMHPENKNLDGOGLLHIOWG	1996-06-02	W787634	1 days
IOKHSTVPDISUNNEQVAGP	1992-04-20	H432243	QZVUSNGAUNBEDIFHPSEE	1992-04-20	H432243	0 days
TGTPVELHPFCQDYIODUZU	1996-08-30	Y980037	ULRZVUTQKMMOPAMZGYEU	1996-08-30	Y980037	0 days