Tables as places where data are recorded can be pretty messy. The ‘tidy’ paradigm in R proposes that data are organised so that variables are recorded in columns, observations in rows and that there is only one value per cell (Wickham 2014). This, however, is only one interpretation of how data should be organised and especially when scraping data from multiple heterogeneous sources, one frequently encounters messy data that don’t follow this paradigm.
tidyr package is one of the most popular tools to bring data into a tidy format.
However, up until today it is limited to tables that are already organised into topologically coherent (rectangular) chunks and any messiness beyond that requires dedicated scripts for reorganisation.
tabshiftr we try to describe and work with a further dimension of messiness, where data are available as so-called disorganised (messy) data, data that are not yet arranged into rectangular form.
The approach of
tabshiftr is based on describing the arrangement of such tables in a so-called schema description, which is then the basis for automatic reorganisation via the function
Typically there is an input and an output schema, describing the arrangement of the input and output tables, respectively.
The advantage of this procedure is that input and output tables exist explicitly and the schema maps the transformation of the data.
As we want to end up with tidy tables, the output schema is pre-determined by a tidy table of the included variables and the input schema needs to be put together by you, the user.
Data can be disorganised according to multiple dimensions. To understand those dimensions, we first need to understand the nature of data. Data "of the same kind" are collected in a variable, which is always a combination of a name and the values. In a table, names are typically in the topmost row and values are in the column below that name (Tab. 1). Conceptually, there are two types of variables in any table:
Moreover, a table is part of a series when other tables of that series contain the same variables, irrespective of how the distinct tables of that series are arranged.
|identifying variable||observed variable (categorical)||observed variable (continuous)|
Even though data in many spreadsheet are "disorganised", they are mostly not non-systematic. Especially in complex spreadsheets, one often encounters a situation where a set of variables occurs more than once with the same or very similar organisation, which we call cluster here. Data that are part of clusters are split up by another, typically categorical variable (the cluster ID), with the aim to increase the visual accessibility or direct the focus for human readers (Tab. 2). This may also be the case where data are split up across several files or spreadsheets of a file, where the cluster ID can be found in the file or spreasheet name or the meta-data. In that case, the cluster ID would be an implicit variable.
|sample 1||sample 2|
|sample 3||sample 4|
To set up a schema description, go through the following questions step by step and provide the respective answer in the respective function. Linked tables in the next section can serve as examples to compare against.
Variables: Clarify which are the identifying variables and which are the observed variables. Make sure not to mistake a listed observed variable (Tab. 10) as an identifying variable.
Meta-data: Provide potentially information about the table format in
Clusters: In case there are clusters, provide in
Identifying variables: provide in
Observed variable: provide in
As an example, we show here how to build a schema description for a table that has a wide identifying variable and a listed observed variable. This table contains additionally some dummy information one would typically encounter in tables, such as
empty_cols and rows and data that are not immediately of interest (
|unit 1||year 1||.||harvested||xyz||1111||1121||.|
|unit 1||year 1||.||production||xyz||1112||1122||.|
|unit 1||year 2||.||harvested||xyz||1211||1221||.|
|unit 1||year 2||.||production||xyz||1212||1222||.|
|unit 2||year 1||.||harvested||xyz||2111||2121||.|
|unit 2||year 1||.||production||xyz||2112||2122||.|
|unit 2||year 2||.||harvested||xyz||2211||2221||.|
|unit 2||year 2||.||production||xyz||2212||2222||.|
In this case we don’t need to set clusters and can start immediately with setting the first id variable
territories, which is in the first column and otherwise tidy. The order by which we set the variables determines where they ocurr in the output table. Any of the setters start by default with an empty schema, in case none is provided to them from a previous setter, thus none needs to be provided at the beginning of a schema.
Since version 0.3.0,
tabshiftr comes with getters that allow to debug the current schema description. To do this, however, the schema first needs to be validated. This is in order to make sure that all the generic information are evaluated with the respective input. After that, a getter can be used to extract the respective information, for example the reorganised id variables with
After seeing that our specification results in a meaningful output, we can continue setting the other id variables
years (tidy and in column 2) and
commodities (spread over two columns and the values are in the first row). Note, how we pipe the previous schema into the next setter. This results in the next variable being added to that schema.
Validating and checking for id variables again results in the following
validateSchema(schema = schema, input = input) %>% getIDVars(input = input) #> [] #> []$territories #> # A tibble: 8 x 1 #> X1 #> <chr> #> 1 unit 1 #> 2 unit 1 #> 3 unit 1 #> 4 unit 1 #> 5 unit 2 #> 6 unit 2 #> 7 unit 2 #> 8 unit 2 #> #> []$year #> # A tibble: 8 x 1 #> X2 #> <chr> #> 1 year 1 #> 2 year 1 #> 3 year 2 #> 4 year 2 #> 5 year 1 #> 6 year 1 #> 7 year 2 #> 8 year 2 #> #> []$commodities #> # A tibble: 1 x 2 #> X6 X7 #> <chr> <chr> #> 1 soybean maize
The id variable
commodities is clearly wide (more than one column) and its’ values are not repeated four times, as it should be, judging by the combination of the other variables.
However, this is an expected tentative output that will be handled in a later step and the id variables have been specified correctly.
Next, we set the listed observed variables. Listed means that the column names of the observed variables are treated as if they were the values of an identifying variable (in column 4), while the values are in the columns 6 and 7. Moreover, the values need to be filtered by
value (in the key in column 4).
We then get the following observed variables, which is also an expected tentative output.
validateSchema(schema = schema, input = input) %>% getObsVars(input = input) #> [] #> []$harvested #> # A tibble: 4 x 2 #> X6 X7 #> <chr> <chr> #> 1 1111 1121 #> 2 1211 1221 #> 3 2111 2121 #> 4 2211 2221 #> #> #> [] #> []$production #> # A tibble: 4 x 2 #> X6 X7 #> <chr> <chr> #> 1 1112 1122 #> 2 1212 1222 #> 3 2112 2122 #> 4 2212 2222
reorganise() function carries out the steps of validating, extracting the variables, pivoting the tentative output and putting the final table together automatically, so it merely requires the finalised
schema and the
schema # has a pretty print function #> 1 cluster (whole spreadsheet) #> #> variable type row col key value #> ------------- ---------- ----- ----- ----- ------------ #> territories id 1 #> year id 2 #> commodities id 1 6:7 #> harvested observed 6:7 4 harvested #> production observed 6:7 4 production reorganise(input = input, schema = schema) #> # A tibble: 8 x 5 #> territories year commodities harvested production #> <chr> <chr> <chr> <dbl> <dbl> #> 1 unit 1 year 1 maize 1121 1122 #> 2 unit 1 year 1 soybean 1111 1112 #> 3 unit 1 year 2 maize 1221 1222 #> 4 unit 1 year 2 soybean 1211 1212 #> 5 unit 2 year 1 maize 2121 2122 #> 6 unit 2 year 1 soybean 2111 2112 #> 7 unit 2 year 2 maize 2221 2222 #> 8 unit 2 year 2 soybean 2211 2212
In this section we look at some examples of disorganised data, discuss the dimension along which they are disorganised and show which schema description should be used to reorganise them.
To work with
tabshiftr, tables need to be read in while treating any header rows as data, i.e., by not setting the first row as header, because in disorganised tables it’s often not only the first row that is part of the header.
Moreover, each column should be treated as a character data type, because some columns might contain data with both numeric and character cells.
reorganise() takes care of reformatting the data into the most permissive data type that does not introduce
NAs where there should be data, i.e, if a variable can be numeric, it is formatted as numeric column.
All of the following examples contain an
empty_col column and an empty row, which serve the purpose of dummy information or formating that could be found in any table and should not disturb the process of reorganising.
You can run all the examples by simply loading the schema and calling
reorganise(input = tabs2shift$..., schema = schema) with the respective table that is plotted for this example.
In case the observed variables are arranged into individual columns (Tab. 3), we have tidy data (Wickham 2014), which are largely already in the target arrangement. The tidy table may however, still contain unneeded data, need different names, or transformation factors for the values.
|unit 1||year 1||soybean||xyz||1111||1112||.|
|unit 1||year 1||maize||xyz||1121||1122||.|
|unit 1||year 2||soybean||xyz||1211||1212||.|
|unit 1||year 2||maize||xyz||1221||1222||.|
|unit 2||year 1||soybean||xyz||2111||2112||.|
|unit 2||year 1||maize||xyz||2121||2122||.|
|unit 2||year 2||soybean||xyz||2211||2212||.|
|unit 2||year 2||maize||xyz||2221||2222||.|
schema <- setIDVar(name = "territories", columns = 1) %>% setIDVar(name = "year", columns = 2) %>% setIDVar(name = "commodities", columns = 3) %>% setObsVar(name = "harvested", columns = 5) %>% setObsVar(name = "production", columns = 6, factor = 0.1) reorganise(input = tabs2shift$tidy, schema = schema) #> # A tibble: 8 x 5 #> territories year commodities harvested production #> <chr> <chr> <chr> <dbl> <dbl> #> 1 unit 1 year 1 maize 1121 112. #> 2 unit 1 year 1 soybean 1111 111. #> 3 unit 1 year 2 maize 1221 122. #> 4 unit 1 year 2 soybean 1211 121. #> 5 unit 2 year 1 maize 2121 212. #> 6 unit 2 year 1 soybean 2111 211. #> 7 unit 2 year 2 maize 2221 222. #> 8 unit 2 year 2 soybean 2211 221.
Sometimes it may be the case that the number variables is not the same as there are columns because either one variable is spread over several column, or one column contains several variables.
In the former case, columns need to be merged (Tab. 4) and in the latter case, columns need to be split via regular expressions (Tab. 5).
.+?(?=_) gives everything up until the first
(?<=\\_).* everything after the
|unit 1||year 1_soybean||xyz||1111||1112||.|
|unit 1||year 1_maize||xyz||1121||1122||.|
|unit 1||year 2_soybean||xyz||1211||1212||.|
|unit 1||year 2_maize||xyz||1221||1222||.|
|unit 2||year 1_soybean||xyz||2111||2112||.|
|unit 2||year 1_maize||xyz||2121||2122||.|
|unit 2||year 2_soybean||xyz||2211||2212||.|
|unit 2||year 2_maize||xyz||2221||2222||.|
When data are split up into clusters that are stored in separate files or spreadsheets, the cluster ID is often recorded not in the table as an explicit variable, but is only provided in the file or table name. In those cases, we have to register this cluster ID as an identifying variable nevertheless, to output a consistent table.
|here would be some metadata, which does however not tell us that this is in fact unit 1||.||.||.|
|we get that information from the context, for example the filename||.||.||.|
In case identifying variables are factors with a small number of levels, those levels may be falsely used as names of other variables, where they would be next to each other and thus "wide" (Tab. 7). Those other variables (both identifying and observed variables) would then be "nested" in the wide identifying variables. In those cases we have to record for the identifying variable(s) the columns and the row in which the values of the identifying variable are found (they will look like they are part of the header). For the observed variable(s) we need to record also the columns and the topmost row of that variable, i.e., the row in which the name of the nested observed variable is found.
|unit 1||.||year 1||1111||1112||1121||1122||.|
|unit 1||.||year 2||1211||1212||1221||1222||.|
|unit 2||.||year 1||2111||2112||2121||2122||.|
|unit 2||.||year 2||2211||2212||2221||2222||.|