Data is easiest to work with when it is in a tidy format. A tidy layout is consistent and predictable, which makes later processing and plotting much simpler.
What is Tidy Data?
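A dataset is said to be tidy if it satisfies the following conditions:
1. observations are in rows
2. variables are in columns
3. it is contained in a single dataset
An example of messy data is as follows. Here the column headers (male, female) are values of a variable rather than variable names: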
library(knitr)
library(kableExtra)
# Two columns of counts, one per sex
df <- data.frame(c(5, 6), c(4, 1))
names(df) <- c("male", "female")
kable(df)
male | female |
---|---|
5 | 4 |
6 | 1 |
A convenient way to reorganise the dataset above is with reshape2, a package which works well with dplyr (both were written by Hadley Wickham).
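As a minimal sketch of what that looks like on the small table above: the `group` column and the names `sex` and `count` are assumptions added for illustration, since the original table does not say what each row or number represents.

```r
library(reshape2)

# Same counts as the messy table above: one column per sex
df <- data.frame(male = c(5, 6), female = c(4, 1))

# Hypothetical row label, added only so melt has an id column to keep
df$group <- c("a", "b")

# Collect the male/female columns into one "sex" column and one "count" column
df_tidy <- melt(
  data = df,
  id.vars = "group",
  variable.name = "sex",
  value.name = "count"
)
df_tidy
```

Each row of `df_tidy` is now a single observation: one group, one sex, one count.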
This time we will use a bigger dataset as follows:
pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = FALSE
)
The dataset now has to be melted. To do this you need to specify four things: the dataset (data = pew); the id column, which is the one you are organising the data around (in this case “religion”); the variable name, under which all the other column headers are collected (“income”); and the value name, which describes what each number represents (“frequency”).
library(reshape2)
pew_tidy <- melt(
  data = pew,
  id.vars = "religion",       # column to keep as-is
  variable.name = "income",   # old column headers go here
  value.name = "frequency"    # cell values go here
)
kable(head(pew_tidy,25))
religion | income | frequency |
---|---|---|
Agnostic | <$10k | 27 |
Atheist | <$10k | 12 |
Buddhist | <$10k | 27 |
Catholic | <$10k | 418 |
Don’t know/refused | <$10k | 15 |
Evangelical Prot | <$10k | 575 |
Hindu | <$10k | 1 |
Historically Black Prot | <$10k | 228 |
Jehovah’s Witness | <$10k | 20 |
Jewish | <$10k | 19 |
Mainline Prot | <$10k | 289 |
Mormon | <$10k | 29 |
Muslim | <$10k | 6 |
Orthodox | <$10k | 13 |
Other Christian | <$10k | 9 |
Other Faiths | <$10k | 20 |
Other World Religions | <$10k | 5 |
Unaffiliated | <$10k | 217 |
Agnostic | $10-20k | 34 |
Atheist | $10-20k | 27 |
Buddhist | $10-20k | 21 |
Catholic | $10-20k | 617 |
Don’t know/refused | $10-20k | 14 |
Evangelical Prot | $10-20k | 869 |
Hindu | $10-20k | 9 |
If the data is even messier, with several columns that are not part of the variable you are collecting into a single long column, you can still melt it. Consider the following dataset:
billboards <- read.csv(
  file = "http://stat405.had.co.nz/data/billboard.csv",
  stringsAsFactors = FALSE
)
# Replace the dots in the column names with underscores
names(billboards) <- gsub("\\.", "_", names(billboards))
It has columns in it like ‘track’ and ‘artist_inverted’ that describe each song rather than a weekly rank. We can handle this by telling melt to keep several id columns, like so:
billboards_tidy <- melt(
  billboards,
  id.vars = 1:7,            # the first seven columns identify each song
  variable.name = "week",
  value.name = "rank",
  na.rm = TRUE              # drop weeks in which the song has no rank
)
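The same pattern can be tried offline on a toy table. This is only a sketch: the data frame `toy`, its column names and all of its values are made up for illustration and are not taken from the billboard file. It shows several id columns being kept, `na.rm = TRUE` dropping the empty cells, and one possible way to turn the melted column names into week numbers.

```r
library(reshape2)

# Toy stand-in for the billboard data (all names and values are invented):
# two id columns followed by three weekly rank columns
toy <- data.frame(
  artist = c("Artist A", "Artist B"),
  track = c("Track 1", "Track 2"),
  wk1 = c(10, 50),
  wk2 = c(8, NA),
  wk3 = c(NA, NA),
  stringsAsFactors = FALSE
)

# Keep both id columns; rows whose rank is NA are dropped
toy_tidy <- melt(
  toy,
  id.vars = c("artist", "track"),
  variable.name = "week",
  value.name = "rank",
  na.rm = TRUE
)

# "week" holds the old column names as a factor; strip the prefix to get a number
toy_tidy$week <- as.integer(gsub("wk", "", as.character(toy_tidy$week)))
toy_tidy
```

Of the six week cells in `toy`, only the three that hold a rank survive the melt.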
Once the data is tidy, you can use a wealth of different packages on it. Most importantly, ggplot2 expects tidy data, because each aesthetic in a plot is mapped to a single column.
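To illustrate why the tidy layout suits plotting, here is a small sketch that rebuilds six rows from the melted pew table shown earlier and plots them. The name `pew_small` and the choice of a dodged bar chart are assumptions made for this example, not part of the original analysis.

```r
library(ggplot2)

# Six rows copied from the tidy pew table shown earlier
pew_small <- data.frame(
  religion = rep(c("Agnostic", "Atheist", "Buddhist"), times = 2),
  income = rep(c("<$10k", "$10-20k"), each = 3),
  frequency = c(27, 12, 27, 34, 27, 21)
)

# Each aesthetic maps to exactly one column of the tidy data
p <- ggplot(pew_small, aes(x = income, y = frequency, fill = religion)) +
  geom_col(position = "dodge")
p
```

With the original wide table there would be no single column to give to `x` or `y`, which is why the melt step comes first.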