# Attach packages and set chunk options
xfun::pkg_attach(
"knitr",
"DT",
"data.table",
"cdata",
"tidyr",
"tibble"
)
opts_chunk$set(
message = FALSE,
warning = FALSE,
echo = TRUE
)
# create standard datatable function to be used in all tables in this post
show_datatable <- function(data) {
data %>%
DT::datatable(
rownames = FALSE,
extensions = 'FixedColumns',
options = list(
pageLength = 20,
scrollX = TRUE,
scrollCollapse = TRUE,
columnDefs = list(
list(
className = 'dt-center',
targets = 0:(ncol(data) - 1)
)
)
)
)
}
1 Inevitable
With my current work as a junior data scientist, I have to manipulate a lot of data to get to a desired output or a possible answer to a question. Almost all data that I have handled required me to convert a wide table to a long format or a long table to a wide format.
The usual packages I use for length and cross wise transformations are {tidyr}(Wickham and Henry 2019) and {data.table}(Dowle and Srinivasan 2019). For wide to long1, one can use either tidyr::gather() or data.table::melt(); for long to wide, one can use tidyr::spread()2 and data.table::dcast()3.
The functions from both packages have the similar disadvantage of their use cases not really being intuitive in terms of how to transform the data. In my opinion, data is transformed in the mind of the user, i.e. the transformation is hidden within the code that the future user and collaborators have to decipher. It will be better if the user can work with the data and just allow the code to obey its master. I think this concern is addressed by {cdata}.
2 Coordinatized
While I advocate the use and am a heavy user of {cdata}(Mount and Zumel 2019a), the best resource to learn it is from its vignettes. The main concept behind {cdata} is the theory of coordinatized4 or fluid data, each element of the data can be located using coordinates defined by the table name, key column/s and value/s, and the value column.
I really like the idea of coordinatized data because all data is coordinatized. We do not need to identify which data is “tidy” or “messy”. Data is data; and the form a data has in the beginning is not important as long as we know how to interpret it in terms of its coordinates. Once we understand the data, we can break it down and shape it into the form that will be most useful for our purposes.
{cdata} is also the first R package that talked about row records and blocks which may seem confusing at first (at least, for me) but it makes more sense to think about data transformations from row records to blocks, and vice versa. Long and wide might be simpler and maybe even correct when a set of columns are placed into two or when two columns are distributed into multiple columns, but for more complicated transformations, I believe, row records and blocks are more appropriate.
How does {cdata} perform fluid data transformation? How is it different from {tidyr}’s gather() and spread() or {data.table}’s melt() and dcast()?
3 One Data to Compare Them All
I will compare data transformation using {tidyr}, {data.table}, and {cdata}, and I hope I will highlight the power of {cdata}. I will use the Lord of the Rings data from Hiroaki Yutani’s blog post5.
lotr_tbl <- tribble(
~Race , ~Female_FoTR, ~Male_FoTR,
~Female_TT , ~Male_TT ,
~Female_RoTK, ~Male_RoTK,
"Elf" , 1229 , 971 ,
331 , 513 ,
183 , 510 ,
"Hobbit", 14 , 3644 ,
0 , 2463 ,
2 , 2673 ,
"Man" , 0 , 1995 ,
401 , 3589 ,
268 , 2459
)
lotr_dt <- setDT(lotr_tbl)
My first goal is to convert the data from this (Table 1)
head(lotr_tbl) %>%
show_datatable()
Figure 3.1: Table 1
to this (Table 2)
lotr_tbl_2 <- tribble(
~Race , ~key , ~FoTR, ~TT, ~RoTK,
"Elf" , "Female", 1229 , 331 , 183 ,
"Hobbit", "Female", 14 , 0 , 2 ,
"Man" , "Female", 0 , 401 , 268 ,
"Elf" , "Male" , 971 , 513 , 510 ,
"Hobbit", "Male" , 3644 , 2463, 2673,
"Man" , "Male" , 1995 , 3589, 2459
)
lotr_tbl_2 %>%
show_datatable()
Figure 3.2: Table 2
And then back to Table 1.
3.1 Gather and Spread
To convert from Table 1 to 2, a single use of {tidyr}’s gather() will not work because we want to take multiple columns and put it in multiple columns rather than only two. We will need the full functionality of {tidyr}.
My solution to convert Table 1 to 2 is
lotr_tbl %>%
gather(key = "key", value = "count", - Race) %>%
separate(key, into = c("key", "book"), sep = "_") %>%
spread(key = book, value = count) %>%
show_datatable()
I collected all columns first, except the Race column into two columns. The key column is a combination of the gender and book. Since we want book to be column headers, we need to separate this single key column into two (key and book) using tidyr::separate(). With the book only in one column, we can spread this column and get to Table 2.
For Table 2 to 1
lotr_tbl_2 %>%
gather(key = "book", value = "count", -Race, -key) %>%
unite(col = "key", key, book, sep = "_") %>%
spread(key = key, value = count) %>%
show_datatable()
I gathered all the book/count columns and put it in two columns. I used tidyr::unite() to combine key and book. I finished the transformation with a spread() of the key and count column6.
3.2 Melt and (D)Cast
The steps we used in the previous section can also be translated in {data.table}’s melt() and dcast().
lotr_dt[
,
melt(.SD, id.vars = "Race", value.name = "count")
][
,
tstrsplit(variable, split = "_", names = c("key", "book")),
.(Race, count)
][
,
dcast(.SD, Race + key ~ book, value.var = "count")
][
,
show_datatable(.SD)
]
Note that I only specified id.vars since this is the only column that will not be collected. In more complicated cases, I think it will be necessary to specify both id.vars and measure.vars as a vector of column names. Using melt() with .SD7 is a matter of personal preference. The data.table::tstrsplit() helps in splitting one column into two. The formula argument in dcast() has always been confusing to me. I use melt() more often and still check its documentation and arguments. With dcast() it will require me more study and a bit of trial and error to make it work. In my limited understanding, The formula argument wants to know which id columns (Race and key) are a function of which value column. Specifying value.var in this example is not necessary because it is the only column left for casting, but it is better to be clear.
# data.table::copy() is necessary so that lotr_tbl_2 is not affected.
lotr_dt_2 <- setDT(copy(lotr_tbl_2))
lotr_dt_2[
,
melt(
.SD, id.vars = c("Race", "key"),
measure.vars = c("FoTR", "TT", "RoTK"),
variable.name = "book", value.name = "count"
)
][
,
.(key = paste(key, book, sep = "_")),
.(Race, count)
][
,
dcast(.SD, Race ~ key, value.var = "count")
][
,
show_datatable(.SD)
]
We can also convert from Table 2 to 1 using {data.table} by reversing our process. I do not know if tstrsplit() has a complementary function in {data.table}, but paste() seems to work really fine.
3.3 Reshape
Since I am in full comparison mode, let us include a data transformation method using stats::reshape().
key_book <- names(lotr_tbl)[names(lotr_tbl) != "Race"]
lotr_df <-
reshape(
lotr_dt,
direction = "long",
varying = list(key_book),
idvar = "Race", timevar = "key",
times = key_book, v.names = "count"
)
lotr_df[, c("key", "book")] <- transpose(strsplit(lotr_df$key, split = "_"))
lotr_df_2 <-
reshape(
lotr_df,
direction = "wide",
timevar = "book", v.names = "count",
idvar = c("Race", "key"), sep = ""
)
names(lotr_df_2) <- gsub("count", "", names(lotr_df_2))
show_datatable(lotr_df_2)
To be honest, this is the only opportunity that I have used reshape(), so this may not be the best data transformation method available in R without using external packages. The argument names are still confusing to me, but with more practice I think I can feel comfortable using it. However, I find that varying as a list, the need to specify times as a vector of column names, and the extra step to remove v.names string after the “widening” to be features that would need more getting used to.
Reversing the above process,
books <- names(lotr_df_2)[!names(lotr_df_2) %in% c("Race", "key")]
lotr_df_1 <-
reshape(
lotr_df_2,
direction = "long", idvar = c("Race", "key"),
varying = list(books), times = books,
timevar = "book", v.names = "count"
)
lotr_df_1 <- within(lotr_df_1, key_book <- paste(key, book, sep = "_"))
lotr_df_1 <-
reshape(
lotr_df_1,
direction = "wide", idvar = "Race",
timevar = "key_book", v.names = "count",
drop = c("key", "book"), sep = ""
)
names(lotr_df_1) <- gsub("count", "", names(lotr_df_1))
show_datatable(lotr_df_1)
I think I really like reshape() even with its quirks. I did not know {stats} has this powerful tool all along. Exploring available tools in R before looking in external packages provides a better understanding and awareness of R’s power.
3.4 Rowrecs and Blocks
As we have seen, data transformations with {tidyr}, {data.table}, and stats::reshape() requires multiple steps and multiple functions to get from Table 1 to 2 and vice versa. With {cdata}, we only need one function, but it will require us to do one step that, as far as I know, {cdata} pioneered: control tables8.
Control tables are like a summary what the data should look like in a different form. This is the power of coordinatized data. Once we understand the “coordinates” of a value, we can translate that data into using different coordinates.
In Table 1, given the coordinates (Race == “Elf”, Female_RoTK) we find the value 183. In Table 2, we find the same value, 183, using the coordinates (Race = “Elf”, key = “Female”, RoTK). So, if we want to convert from Table 1 to 2, we need a control table that says: Using the id column Race (since it exist in both tables), when key = “Female” in Table 2, the value should be the same as the Female_RoTK in Table 1.
We can make a control table using qchar_frame(), a function exported from {wrapr}(Mount and Zumel 2019b), a package from the same creators of {cdata}.
lotr_control_table <-
qchar_frame(
key , FoTR , TT , RoTK |
Female, "Female_FoTR", "Female_TT", "Female_RoTK" |
Male , "Male_FoTR" , "Male_TT" , "Male_RoTK"
)
Note that you really don’t need qchar_frame() to make a control table. It is only a data frame that bridges the gap between two forms of the same data, so there are multiple ways to make such a data frame. This is important when typing becomes burdensome and code will help make the process more efficient. The important aspect about a control table is that it forces the user to focus more on how the currently looks like and what form the user wants it to be.
Control tables have different meanings when used in {cdata}’s rowrecs_to_blocks(), and blocks_to_rowrecs(). Inside rowrecs_to_blocks(), a control table is what the user wants the data to look like. In blocks_to_rowrecs(), a control table is what the data currently looks like. For more information please refer to the {cdata} vignettes.
Now that we have a control table, we can go from Table 1 to 2,
show_datatable(
rowrecs_to_blocks(
lotr_dt,
controlTable = lotr_control_table,
controlTableKeys = "key",
columnsToCopy = "Race"
)
)
And from Table 2 to 1,
show_datatable(
blocks_to_rowrecs(
lotr_dt_2,
controlTable = lotr_control_table,
controlTableKeys = "key",
keyColumns = "Race"
)
)
Very seamless, isn’t it?
As an additional note, controlTableKeys allow for multiple9 columns, which makes {cdata} more powerful and more general in its implementation. The keyColumns argument is necessary to go from blocks to row records, otherwise, how will the code know which rows belong to the same groups or keys. In this example, Race was an existing id column. In some data sets, an id column may not be available, so we need to make one beforehand to ensure smooth data transformations.
4 For Future Me and Collaborators
After looking at all the different ways to do data transformation using different packages, why am I an advocate of {cdata}?
For the first three methods using {tidyr}, {data.table} and stats::reshape() requires users to understand the data then code a series of transformations to get from Table 1 to 2. On the other hand, {cdata} only requires step one, understanding the data, and emphasizes that understanding is very important by requiring users to translate that understanding into a control table.
I think future me will have lesser problems understanding a control table than a series of code. I also believe that once everyone understands the idea of coordinatized data and control tables, these two concepts are quite intuitive when studied in the context of data, then sharing of control tables offer better collaboration, at least in the realm of data transformation, than a series of codes.
# Grateful
I will always appreciate {cdata}. In just a few months, it became one of my favorite packages in R. When the opportunity arises for data transformation, I will always use it.
I urge you, dear reader, to use {cdata} for your data transformation tasks. I assure you, it will not fail you (unless, bugs of course). With the advent of {tidyr}’s pivot_longer() and pivot_wider(), some of you may prefer to use it. In my opinion, that is alright, just don’t forget to thank and cite John Mount and Nina Zumel for their ideas on coordinatized data and fluid transformation. So, before I forget.
Thank you John Mount and Nina Zumel for {cdata}!
5 Code
The following codes show what I did in this post. I added comments here and there to clarify my thought process.
# Attach packages and set chunk options
xfun::pkg_attach(
"knitr",
"DT",
"data.table",
"cdata",
"tidyr",
"tibble"
)
opts_chunk$set(
message = FALSE,
warning = FALSE,
echo = TRUE
)
# create standard datatable function to be used in all tables in this post
show_datatable <- function(data) {
data %>%
DT::datatable(
rownames = FALSE,
extensions = 'FixedColumns',
options = list(
pageLength = 20,
scrollX = TRUE,
scrollCollapse = TRUE,
columnDefs = list(
list(
className = 'dt-center',
targets = 0:(ncol(data) - 1)
)
)
)
)
}
lotr_tbl <- tribble(
~Race , ~Female_FoTR, ~Male_FoTR,
~Female_TT , ~Male_TT ,
~Female_RoTK, ~Male_RoTK,
"Elf" , 1229 , 971 ,
331 , 513 ,
183 , 510 ,
"Hobbit", 14 , 3644 ,
0 , 2463 ,
2 , 2673 ,
"Man" , 0 , 1995 ,
401 , 3589 ,
268 , 2459
)
lotr_dt <- setDT(lotr_tbl)
head(lotr_tbl) %>%
show_datatable()
lotr_tbl_2 <- tribble(
~Race , ~key , ~FoTR, ~TT, ~RoTK,
"Elf" , "Female", 1229 , 331 , 183 ,
"Hobbit", "Female", 14 , 0 , 2 ,
"Man" , "Female", 0 , 401 , 268 ,
"Elf" , "Male" , 971 , 513 , 510 ,
"Hobbit", "Male" , 3644 , 2463, 2673,
"Man" , "Male" , 1995 , 3589, 2459
)
lotr_tbl_2 %>%
show_datatable()
lotr_tbl %>%
gather(key = "key", value = "count", - Race) %>%
separate(key, into = c("key", "book"), sep = "_") %>%
spread(key = book, value = count) %>%
show_datatable()
lotr_tbl_2 %>%
gather(key = "book", value = "count", -Race, -key) %>%
unite(col = "key", key, book, sep = "_") %>%
spread(key = key, value = count) %>%
show_datatable()
lotr_dt[
,
melt(.SD, id.vars = "Race", value.name = "count")
][
,
tstrsplit(variable, split = "_", names = c("key", "book")),
.(Race, count)
][
,
dcast(.SD, Race + key ~ book, value.var = "count")
][
,
show_datatable(.SD)
]
# data.table::copy() is necessary so that lotr_tbl_2 is not affected.
lotr_dt_2 <- setDT(copy(lotr_tbl_2))
lotr_dt_2[
,
melt(
.SD, id.vars = c("Race", "key"),
measure.vars = c("FoTR", "TT", "RoTK"),
variable.name = "book", value.name = "count"
)
][
,
.(key = paste(key, book, sep = "_")),
.(Race, count)
][
,
dcast(.SD, Race ~ key, value.var = "count")
][
,
show_datatable(.SD)
]
key_book <- names(lotr_tbl)[names(lotr_tbl) != "Race"]
lotr_df <-
reshape(
lotr_dt,
direction = "long",
varying = list(key_book),
idvar = "Race", timevar = "key",
times = key_book, v.names = "count"
)
lotr_df[, c("key", "book")] <- transpose(strsplit(lotr_df$key, split = "_"))
lotr_df_2 <-
reshape(
lotr_df,
direction = "wide",
timevar = "book", v.names = "count",
idvar = c("Race", "key"), sep = ""
)
names(lotr_df_2) <- gsub("count", "", names(lotr_df_2))
show_datatable(lotr_df_2)
books <- names(lotr_df_2)[!names(lotr_df_2) %in% c("Race", "key")]
lotr_df_1 <-
reshape(
lotr_df_2,
direction = "long", idvar = c("Race", "key"),
varying = list(books), times = books,
timevar = "book", v.names = "count"
)
lotr_df_1 <- within(lotr_df_1, key_book <- paste(key, book, sep = "_"))
lotr_df_1 <-
reshape(
lotr_df_1,
direction = "wide", idvar = "Race",
timevar = "key_book", v.names = "count",
drop = c("key", "book"), sep = ""
)
names(lotr_df_1) <- gsub("count", "", names(lotr_df_1))
show_datatable(lotr_df_1)
lotr_control_table <-
qchar_frame(
key , FoTR , TT , RoTK |
Female, "Female_FoTR", "Female_TT", "Female_RoTK" |
Male , "Male_FoTR" , "Male_TT" , "Male_RoTK"
)
show_datatable(
rowrecs_to_blocks(
lotr_dt,
controlTable = lotr_control_table,
controlTableKeys = "key",
columnsToCopy = "Race"
)
)
show_datatable(
blocks_to_rowrecs(
lotr_dt_2,
controlTable = lotr_control_table,
controlTableKeys = "key",
keyColumns = "Race"
)
)
sessioninfo::session_info(include_base = TRUE)
My session info for reproducibility.
sessioninfo::session_info(include_base = TRUE)
## - Session info ----------------------------------------------------------
## setting value
## version R version 3.5.3 (2019-03-11)
## os Windows 10 x64
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_Philippines.1252
## ctype English_Philippines.1252
## tz Asia/Kuala_Lumpur
## date 2019-03-25
##
## - Packages --------------------------------------------------------------
## ! package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.5.3)
## base * 3.5.3 2019-03-11 [?] local
## blogdown 0.11 2019-03-11 [1] CRAN (R 3.5.2)
## bookdown 0.9 2018-12-21 [1] CRAN (R 3.5.2)
## cdata * 1.0.7 2019-03-23 [1] CRAN (R 3.5.3)
## cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.3)
## P compiler 3.5.3 2019-03-11 [1] local
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.2)
## crosstalk 1.0.0 2016-12-21 [1] CRAN (R 3.5.2)
## data.table * 1.12.0 2019-01-13 [1] CRAN (R 3.5.2)
## P datasets * 3.5.3 2019-03-11 [1] local
## digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.2)
## dplyr 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
## DT * 0.5 2018-11-05 [1] CRAN (R 3.5.2)
## evaluate 0.13 2019-02-12 [1] CRAN (R 3.5.2)
## glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.2)
## P graphics * 3.5.3 2019-03-11 [1] local
## P grDevices * 3.5.3 2019-03-11 [1] local
## htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.2)
## htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.5.2)
## httpuv 1.5.0 2019-03-15 [1] CRAN (R 3.5.3)
## jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.2)
## knitr * 1.22 2019-03-08 [1] CRAN (R 3.5.3)
## later 0.8.0 2019-02-11 [1] CRAN (R 3.5.2)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.2)
## P methods * 3.5.3 2019-03-11 [1] local
## mime 0.6 2018-10-05 [1] CRAN (R 3.5.2)
## pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.2)
## pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.2)
## promises 1.0.1 2018-04-13 [1] CRAN (R 3.5.2)
## purrr 0.3.2 2019-03-15 [1] CRAN (R 3.5.3)
## R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2)
## Rcpp 1.0.1 2019-03-17 [1] CRAN (R 3.5.3)
## rlang 0.3.2 2019-03-21 [1] CRAN (R 3.5.3)
## rmarkdown 1.12 2019-03-14 [1] CRAN (R 3.5.3)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.2)
## shiny 1.2.0 2018-11-02 [1] CRAN (R 3.5.2)
## P stats * 3.5.3 2019-03-11 [1] local
## stringi 1.4.3 2019-03-12 [1] CRAN (R 3.5.2)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 3.5.2)
## tibble * 2.1.1 2019-03-16 [1] CRAN (R 3.5.3)
## tidyr * 0.8.3 2019-03-01 [1] CRAN (R 3.5.2)
## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.2)
## P tools 3.5.3 2019-03-11 [1] local
## P utils * 3.5.3 2019-03-11 [1] local
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.2)
## wrapr 1.8.4 2019-02-19 [1] CRAN (R 3.5.2)
## xfun 0.5 2019-02-20 [1] CRAN (R 3.5.2)
## xtable 1.8-3 2018-08-29 [1] CRAN (R 3.5.2)
## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.2)
##
## [1] C:/Users/reclev/R/R-3.5.3/library
##
## P -- Loaded and on-disk path mismatch.
6 References
For more information on tidy principles, techniques, and good practices, and the packages and functions I used in this post, please read
Dowle, Matt, and Arun Srinivasan. 2019. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.
Mount, John, and Nina Zumel. 2019a. Cdata: Fluid Data Transformations. https://CRAN.R-project.org/package=cdata.
———. 2019b. Wrapr: Wrap R Tools for Debugging and Parametric Programming. https://CRAN.R-project.org/package=wrapr.
Wickham, Hadley, and Lionel Henry. 2019. Tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions. https://CRAN.R-project.org/package=tidyr.
I find going from wide to long much easier and more intuitive than going from long to wide, but that might be because I am more exposed and have more experience converting from wide to long as this is the more common data transformation that I have encountered so far.↩
I just heard about
{tidyr}’s new data transformation functions (as I was writing this post, I learned that the new functions will be calledpivot_longer()andpivot_wider()) as possible replacements forgather()andspread(). The vignette mentions{cdata}and{data.table}’smelt()anddcast()as inspiration for the improvements. I am excited to givepivot_longer()andpivot_wider()a spin, but my preference will stay with{cdata}.↩I must confess that
dcast()is my weak point in using{data.table}. I felt its power during the times I have used it successfully, but I somehow have to spend time reading the documentation and do test transformation every time I try to use it. Again, most of my experience involves transforming data from wide to long. Maybe I would have mastereddcast()better with more practice.↩Observe that my reply included a suggestion to study
{cdata}. I changed the data a little. The first book of The Lord of the Rings (a Tolkien fan here) is The Fellowship of the Ring, so I think FoTR will be a better column name.↩I offered a different solution in Hiroaki’s blog that also used
tidyr::nest()andtidyr::unnest(), but that was only for comparison with the suggestedbundle()function. The solution I offer here withseparate()andunite()might be more intuitive and simpler.↩I like how much of
{data.table}’s functionality happens within[. I find functions that force me to get out of that “box” hindering my workflow. As a personal remedy, I use functions with.SDat some loss in efficiency. It is my personal wish that.SDbecomes more efficient as I use it a lot.↩In the implementation of
{tidyr}’spivot_long()andpivot_wide(),specwill be the control table equivalent.↩I requested this feature, and I was very happy John Mount and Nina Zumel accommodated my request. I was able to use it in my work, and I am sure I will be able to use this specific feature again in the future.↩