Making Data from Data with dplyr::mutate

There are many occasions when a new column needs to be created from an existing column to make the data easier to manipulate. For example, perhaps you have a column of free-text pathology reports and you want to extract all the reports where the diagnosis is dysplasia. You could just subset the data using grepl() so that you only get the reports that mention this word. But what if the data needs to be cleaned before subsetting, for example to exclude reports where the diagnosis is normal but the phrase ‘No evidence of dysplasia’ is present? Or perhaps other manipulations are needed before subsetting.

This is where data accordionisation is useful. This simply means creating new data from existing data, usually from one column into another column in the same dataframe.

The neatest way to do this is with the mutate() function from the ‘dplyr’ package, which is designed for exactly this. There are also other ways, which I will demonstrate at the end.
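As a rough sketch of the pathology example above, assuming a made-up data frame `reports` with a free-text column `ReportText` (both names are hypothetical and not part of the data set used below), the negated phrase can be stripped into a cleaned column before the match is made:

```r
library(dplyr)

# Hypothetical free-text pathology reports, for illustration only
reports <- data.frame(
  ReportText = c(
    "Barrett's oesophagus with low grade dysplasia.",
    "Normal mucosa. No evidence of dysplasia.",
    "High grade dysplasia present.",
    "Intestinal metaplasia only."
  ),
  stringsAsFactors = FALSE
)

reports <- reports %>%
  mutate(
    # Remove the negated phrase first so it cannot trigger a false match
    CleanText = gsub("no evidence of dysplasia", "", ReportText, ignore.case = TRUE),
    # Flag the reports that still mention dysplasia after cleaning
    HasDysplasia = grepl("dysplasia", CleanText, ignore.case = TRUE)
  )

# Subset only once the flag column exists
dysplasia_reports <- reports %>% filter(HasDysplasia)
```

Real reports will contain other negations, so the single phrase removed here is only an assumption about what needs cleaning.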

The input data here will be an endoscopy data set:

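#dplyr provides mutate() and %>%; knitr provides kable()
library(dplyr)
library(knitr)

Age<-sample(1:100, 130, replace=TRUE)
Dx<-sample(c("NDBE","LGD","HGD","IMC"), 130, replace = TRUE)
#Duration of each endoscopy in minutes
TimeOfEndoscopy<-sample(1:60, 130, replace=TRUE)

EMRdf<-data.frame(Age,Dx,TimeOfEndoscopy,stringsAsFactors=FALSE)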

Perhaps you need to calculate the number of hours spent doing each endoscopy rather than the number of minutes:

EMRdftbb<-EMRdf%>%mutate(TimeOfEndoscopy/60)
#Just show the top 20 results
kable(head(EMRdftbb,20))

Age	Dx	TimeOfEndoscopy	TimeOfEndoscopy/60
64	LGD	40	0.6666667
17	HGD	60	1.0000000
96	IMC	43	0.7166667
41	LGD	13	0.2166667
15	NDBE	5	0.0833333
61	NDBE	13	0.2166667
10	HGD	41	0.6833333
28	LGD	42	0.7000000
79	NDBE	60	1.0000000
27	LGD	27	0.4500000
2	IMC	5	0.0833333
99	IMC	8	0.1333333
22	LGD	42	0.7000000
38	NDBE	3	0.0500000
37	HGD	2	0.0333333
64	LGD	15	0.2500000
51	LGD	14	0.2333333
71	NDBE	33	0.5500000
55	HGD	17	0.2833333
76	IMC	10	0.1666667
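Because the new column above was not given a name, mutate() labels it with the expression itself. A minimal sketch, using a small made-up data frame `toy` (not part of the original data), of how naming the column makes it easier to refer to later:

```r
library(dplyr)

# Small made-up data frame, for illustration only
toy <- data.frame(TimeOfEndoscopy = c(40, 60, 15))

# Unnamed: the new column is labelled with the expression itself
unnamed <- toy %>% mutate(TimeOfEndoscopy/60)

# Named: the new column can be referred to as TimeInHours
named <- toy %>% mutate(TimeInHours = TimeOfEndoscopy/60)

names(unnamed)
names(named)
```

A column name containing a slash has to be wrapped in backticks whenever it is used, so giving the column a plain name is usually the more convenient choice.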

That is useful, but what if you want to classify the amount of time spent doing each endoscopy as follows: 0.4 hours or less is too short and more than 0.4 hours is too long?

Using ifelse() with mutate for conditional accordionisation

For this we would use ifelse(). It can be combined with mutate() so that the result is put in another column, as follows:

EMRdf2<-EMRdf%>%
  mutate(TimeInHours=TimeOfEndoscopy/60)%>%
  mutate(TimeClassification = ifelse(TimeInHours>0.4, "Too Long", "Too Short"))
#Just show the top 20 results
kable(head(EMRdf2,20))

Age	Dx	TimeOfEndoscopy	TimeInHours	TimeClassification
64	LGD	40	0.6666667	Too Long
17	HGD	60	1.0000000	Too Long
96	IMC	43	0.7166667	Too Long
41	LGD	13	0.2166667	Too Short
15	NDBE	5	0.0833333	Too Short
61	NDBE	13	0.2166667	Too Short
10	HGD	41	0.6833333	Too Long
28	LGD	42	0.7000000	Too Long
79	NDBE	60	1.0000000	Too Long
27	LGD	27	0.4500000	Too Long
2	IMC	5	0.0833333	Too Short
99	IMC	8	0.1333333	Too Short
22	LGD	42	0.7000000	Too Long
38	NDBE	3	0.0500000	Too Short
37	HGD	2	0.0333333	Too Short
64	LGD	15	0.2500000	Too Short
51	LGD	14	0.2333333	Too Short
71	NDBE	33	0.5500000	Too Long
55	HGD	17	0.2833333	Too Short
76	IMC	10	0.1666667	Too Short

Note how calls to mutate() can be chained together with the pipe.
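Chaining is not the only option. A single mutate() call can also create both columns, because later expressions in the call can use columns created earlier in the same call. A minimal sketch on a small made-up data frame `toy` (an assumption for illustration, not the data above):

```r
library(dplyr)

# Small made-up data frame, for illustration only
toy <- data.frame(TimeOfEndoscopy = c(40, 13, 20, 60))

toy2 <- toy %>%
  mutate(
    # First new column
    TimeInHours = TimeOfEndoscopy / 60,
    # Second new column, built from the first within the same call
    TimeClassification = ifelse(TimeInHours > 0.4, "Too Long", "Too Short")
  )

toy2
```

Which form to use is mostly a matter of taste: separate calls keep each step visually distinct, while one call keeps related columns together.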

Using multiple ifelse()

What if we want to get more complex and use several classifiers? We just nest more ifelse() calls:

EMRdf2<-EMRdf%>%
  mutate(TimeInHours=TimeOfEndoscopy/60)%>%
  mutate(TimeClassification = ifelse(TimeInHours>0.8, "Too Long",
                               ifelse(TimeInHours<0.5, "Too Short",
                               ifelse(TimeInHours>=0.5&TimeInHours<=0.8, "Just Right", "N"))))
#Just show the top 20 results
kable(head(EMRdf2,20))

Age	Dx	TimeOfEndoscopy	TimeInHours	TimeClassification
64	LGD	40	0.6666667	Just Right
17	HGD	60	1.0000000	Too Long
96	IMC	43	0.7166667	Just Right
41	LGD	13	0.2166667	Too Short
15	NDBE	5	0.0833333	Too Short
61	NDBE	13	0.2166667	Too Short
10	HGD	41	0.6833333	Just Right
28	LGD	42	0.7000000	Just Right
79	NDBE	60	1.0000000	Too Long
27	LGD	27	0.4500000	Too Short
2	IMC	5	0.0833333	Too Short
99	IMC	8	0.1333333	Too Short
22	LGD	42	0.7000000	Just Right
38	NDBE	3	0.0500000	Too Short
37	HGD	2	0.0333333	Too Short
64	LGD	15	0.2500000	Too Short
51	LGD	14	0.2333333	Too Short
71	NDBE	33	0.5500000	Just Right
55	HGD	17	0.2833333	Too Short
76	IMC	10	0.1666667	Too Short
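Nested ifelse() calls become hard to read as the number of classes grows. One alternative, shown here as a sketch rather than as the approach used above, is case_when() from dplyr, which tests each condition in order and uses the first one that is true. The data frame `toy` is made up for illustration:

```r
library(dplyr)

# Small made-up data frame, for illustration only
toy <- data.frame(TimeOfEndoscopy = c(60, 40, 13, 30, 54))

toy3 <- toy %>%
  mutate(
    TimeInHours = TimeOfEndoscopy / 60,
    TimeClassification = case_when(
      TimeInHours > 0.8 ~ "Too Long",
      TimeInHours < 0.5 ~ "Too Short",
      # Everything else, i.e. 0.5 to 0.8 hours inclusive
      TRUE ~ "Just Right"
    )
  )

toy3
```

One difference to be aware of: a missing TimeInHours would fall through to the final default here, whereas the nested ifelse() version returns NA for it.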

Using multiple ifelse() with grepl() or str_extract()

Of course we need to extract information from text as well as numeric data. We can do this using grepl() or str_extract() from the ‘stringr’ package. We have used these before, so you may want to refresh your memory.

Let’s say we want to flag all the samples that had IMC. We don’t want to subset the data, just create a column that says ‘IMC’ where the diagnosis is IMC and ‘NoIMC’ for the rest.

Using the dataset above:

library(stringr)
#str_extract() returns "IMC" where there is a match and NA where there is not
EMRdf$MyIMC_Column<-str_extract(EMRdf$Dx,"IMC")

#To fill the NAs we would do:
EMRdf$MyIMC_Column<-ifelse(grepl("IMC",EMRdf$Dx),"IMC","NoIMC")

#Another way to do this (more useful for complex patterns, where you want to keep the text that matched)
EMRdf$MyIMC_Column<-ifelse(grepl("IMC",EMRdf$Dx),str_extract(EMRdf$Dx,"IMC"),"NoIMC")
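The same column can also be built inside mutate() rather than by assigning with `$`. A minimal sketch on a small made-up data frame `toy` (an assumption for illustration), using str_detect() from stringr, which returns TRUE or FALSE for each element much as grepl() does:

```r
library(dplyr)
library(stringr)

# Small made-up data frame, for illustration only
toy <- data.frame(Dx = c("NDBE", "IMC", "LGD", "IMC"), stringsAsFactors = FALSE)

toy4 <- toy %>%
  # Label each row by whether the diagnosis text contains "IMC"
  mutate(MyIMC_Column = ifelse(str_detect(Dx, "IMC"), "IMC", "NoIMC"))

toy4
```

This keeps the text step in the same pipeline as the numeric steps shown earlier.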

So data can usefully be created from data for further analysis.