Data is easiest to work with when it is stored in a tidy format. A consistent layout means later processing steps can all assume the same structure.

What is Tidy Data?

A dataset is said to be tidy if it satisfies the following conditions:

1. Observations are in rows.
2. Variables are in columns.
3. The data is contained in a single dataset.

An example of messy data is as follows. Here the column headers (male, female) are values of a variable rather than variable names:

library(knitr)
library(kableExtra)
df <- data.frame(c(5, 6), c(4, 1))
names(df) <- c("male", "female")
kable(df)
male female
5 4
6 1

A convenient way to reorganise the dataset above is with reshape2, a package which works well with dplyr (the two share an author, Hadley Wickham).
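As a minimal sketch, the small table above can be melted so that the column headers become values of one variable. The names "sex" and "count" are assumptions chosen for this illustration; the original does not name them.

```r
library(reshape2)

# The small messy table from above: the column names are values of a
# variable (sex), and the cells are counts.
df <- data.frame(male = c(5, 6), female = c(4, 1))

# There are no id columns here, so both columns are listed as measured values.
# "sex" and "count" are illustrative names, not taken from the original.
df_tidy <- melt(
  data = df,
  measure.vars = c("male", "female"),
  variable.name = "sex",
  value.name = "count"
)

df_tidy
```

Each row of the result now holds one count together with the sex it belongs to.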

This time we will use a bigger dataset as follows:

pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = FALSE
)

The dataset now has to be melted. To do this, you need to specify four things: the dataset (data = pew); the id column, which is the one the data is organised around (in this case “religion”); the variable name, under which all the other column headers are collected (“income”); and the value name, which says what each number represents (“frequency”).

library(reshape2)
pew_tidy <- melt(
  data = pew,
  id.vars = "religion",      # full argument name; "id" only works through partial matching
  variable.name = "income",
  value.name = "frequency"
)
kable(head(pew_tidy,25))
religion                  income    frequency
Agnostic                  <$10k     27
Atheist                   <$10k     12
Buddhist                  <$10k     27
Catholic                  <$10k     418
Don’t know/refused        <$10k     15
Evangelical Prot          <$10k     575
Hindu                     <$10k     1
Historically Black Prot   <$10k     228
Jehovah’s Witness         <$10k     20
Jewish                    <$10k     19
Mainline Prot             <$10k     289
Mormon                    <$10k     29
Muslim                    <$10k     6
Orthodox                  <$10k     13
Other Christian           <$10k     9
Other Faiths              <$10k     20
Other World Religions     <$10k     5
Unaffiliated              <$10k     217
Agnostic                  $10-20k   34
Atheist                   $10-20k   27
Buddhist                  $10-20k   21
Catholic                  $10-20k   617
Don’t know/refused        $10-20k   14
Evangelical Prot          $10-20k   869
Hindu                     $10-20k   9
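One way to check that a melt did what was intended is to compare the shape and totals before and after. The sketch below does this on a small stand-in (`pew_small`, a hypothetical name) built from three religions and two income columns, with the counts copied from the output above, so it runs without downloading the full file.

```r
library(reshape2)

# Stand-in for `pew`: counts copied from the output above.
# check.names = FALSE keeps the non-syntactic column names intact,
# as in the read.delim() call.
pew_small <- data.frame(
  religion = c("Agnostic", "Atheist", "Buddhist"),
  `<$10k` = c(27, 12, 27),
  `$10-20k` = c(34, 27, 21),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

pew_small_tidy <- melt(
  data = pew_small,
  id.vars = "religion",
  variable.name = "income",
  value.name = "frequency"
)

# One row per religion/income pair, and no counts lost or duplicated.
stopifnot(nrow(pew_small_tidy) == nrow(pew_small) * (ncol(pew_small) - 1))
stopifnot(sum(pew_small_tidy$frequency) == sum(pew_small[, -1]))
```

The same two checks should apply to the full `pew` and `pew_tidy` objects, since melting only rearranges the cells.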

If the data is messier still, with several columns that are not part of the variable you are trying to gather into long form, you can still melt it. Consider the following dataset:

billboards <- read.csv(
  file = "http://stat405.had.co.nz/data/billboard.csv",
  stringsAsFactors = FALSE
)
names(billboards) <- gsub("\\.", "_", names(billboards))

It has columns in it like ‘track’ and ‘artist_inverted’, which describe each song rather than hold a rank. We can get around this by telling melt that the data revolves around several id columns, like so:

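billboards_tidy <- melt(
  data = billboards,
  id.vars = 1:7,             # the first seven columns are kept as id columns
  variable.name = "week",
  value.name = "rank",
  na.rm = TRUE               # drop rows where the rank is missing
)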
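To see what `na.rm = TRUE` does, here is a minimal sketch on a made-up stand-in for the billboard data. The artists, tracks and ranks are invented for illustration, and the week column names (`x1st_week` and so on) are an assumption about how the real columns are named. The last step, turning the week labels into numbers, is an optional extra that the original does not perform.

```r
library(reshape2)

# Made-up stand-in for `billboards`: two id columns and three week columns.
# NA marks a week with no rank recorded.
charts <- data.frame(
  artist_inverted = c("Artist A", "Artist B"),
  track = c("Track A", "Track B"),
  x1st_week = c(87, 91),
  x2nd_week = c(82, NA),
  x3rd_week = c(72, NA),
  stringsAsFactors = FALSE
)

charts_tidy <- melt(
  data = charts,
  id.vars = c("artist_inverted", "track"),
  variable.name = "week",
  value.name = "rank",
  na.rm = TRUE
)

# Strip everything that is not a digit, so "x2nd_week" becomes 2.
charts_tidy$week <- as.integer(gsub("[^0-9]", "", as.character(charts_tidy$week)))

charts_tidy
```

Without `na.rm = TRUE` the result would have six rows, two of them with a missing rank; with it, only the four recorded ranks remain.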

Once the data is tidy, you can use a wealth of different packages on it. Most importantly, ggplot2 loves tidy data.
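As a rough illustration of why tidy data suits plotting, each aesthetic in a ggplot2 call maps to exactly one column. The sketch below uses a small tidy stand-in (`pew_small_tidy`, a hypothetical name) built from the first few counts shown earlier; the choice of a dodged bar chart is an assumption, not something the original prescribes.

```r
library(ggplot2)

# Tidy stand-in: one row per religion/income pair, counts from the output above.
pew_small_tidy <- data.frame(
  religion = rep(c("Agnostic", "Atheist", "Buddhist"), times = 2),
  income = factor(
    rep(c("<$10k", "$10-20k"), each = 3),
    levels = c("<$10k", "$10-20k")   # keep the income bands in their natural order
  ),
  frequency = c(27, 12, 27, 34, 27, 21),
  stringsAsFactors = FALSE
)

# income goes on the x axis, frequency on the y axis, religion picks the colour.
p <- ggplot(pew_small_tidy, aes(x = income, y = frequency, fill = religion)) +
  geom_col(position = "dodge")

p
```

With the wide version of the table, the income bands would be spread across separate columns and could not be mapped to a single axis this way.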