Downloading and using data from bdl

Krzysztof Kania

2023-02-24

The bdl package is an interface to Local Data Bank(Bank Danych Lokalnych - bdl) API with a set of useful tools like quick plotting using data from the data bank.

Intro

Working with bdl is based on id codes. Most of the data downloading functions require specifying one or vector of multiple unit or variable ids as a string.

It is recommended to use a private API key which u can get here. To apply it use: options(bdl.api_private_key ="your_key")

Also, every function returns data in Polish by default. If you would like to get data in English, just add lang = "en" to any function.

Any metadata information (unit levels, aggregates, NUTS code explanation, etc.) can be found here.

Searching unit id

When searching for unit id, we can use two methods:

Units consist of 6 levels:

get_levels()

The lowest - seventh level has its own separate functions with suffix localities. Warning - the localities functions have a different set of arguments. Check package or API documentation for more info.

Tree listing

To get all units available in local data bank run get_units() without any argument(warning - it can eat data limit very fast around 4.5k rows):

To narrow the list add unitParentId. The function will return all children units for a given parent at all levels. Add level argument to filter units even further.

get_units(parentId = "000000000000", level = 5)

Searching subject and variable id

Subjects are themed directories of variables.

We have two searching methods for both subjects and variables:

Subjects

To directly search for subject we just provide search phrase:

search_subjects("lud")

Subjects consist of 3 levels (categories, groups, subgroups) - K, G and P respectively. The fourth level of the subject (child of a subgroup) would be variables.

To list all top level subjects use get_subjects():

get_subjects()

To list sub-subjects to given category or group use get_subjects() with parentId argument:

get_subjects(parentId = "K3")
get_subjects(parentId = "G7")

Variables

Firstly you can list variables for given subject (subgroup):

get_variables("P2425")

Secondly, you can direct search variables with search_variables(). You can use an empty string as name to list all variables but I strongly advise against as it has around 40 000 rows and you will probably hit data limit.

search_variables("samochod")

You can narrow the search to the given subject - subgroup:

search_variables("lud", subjectId = "P2425")

Downloading data

If you picked unit and variable codes, you are ready to download data. You can do this two ways:

Single unit, multiple variables

We will use get_data_by_unit(). We specify our single unit as unitId string argument and variables by a vector of strings. Optionally we can specify years of data. If not all available years are used.

get_data_by_unit(unitId = "023200000000", varId =  "3643")
get_data_by_unit(unitId = "023200000000", varId =  c("3643", "2137", "148190"))

To get more information about data we can add type argument and set it to "label" to add an additional column with the variable info.

get_data_by_unit(unitId = "023200000000", varId = "3643", type = "label")

Multiple units, single variable

We will use get_data_by_variable(). We specify our single variable as varId string argument. If no unitParentId is provided, the function will return all available units for a given variable. Setting unitParentId will return all available children units (on all levels). To narrow unit level set unitLevel. Optionally we can specify years of data. If not all available years are used.

get_data_by_variable("420", unitParentId = "011210000000", year = 2013:2016)
get_data_by_variable("420", unitLevel = "2", year = 2013:2016)

Useful tools

The bdl package provides a couple of additional functions for summarizing and visualizing data.

Summary

Data downloaded via get_data_by_unit() or get_data_by_variable() and their locality versions can be easily summarized by summary():

df <- get_data_by_variable(varId = "3643", unitParentId = "010000000000")
summary(df)

Plotting

Plotting functions in this package are interfaces to the data downloading functions. Some of them require specifying data_type - a method for downloading data, and the rest of the arguments will be relevant to specify data_type function. Check documentation for more details.

line_plot(data_type = "unit", unitId = "000000000000", varId = c("415","420"))
pie_plot(data_type ="variable" ,"1", "2018",unitParentId="042214300000", unitLevel = "6")

Scatter plot is unique - requires vector of only 2 variables.

scatter_2var_plot(data_type ="variable" ,c("60559","415"), unitLevel = "2")

Map generation

The bdl package comes with the bdl.maps dataset containing spatial maps for each Poland’s level. generate_map() use them to generate maps filled with the bdl data. Use unitLevel to change the type of map. When the lower level is chosen, the map generation can be more time consuming as it has more spatial data to process. This function will download and load maps automatically. In case of any errors you can download them manually here.

Download data file and double-click to load it to environment.

generate_map(varId = "60559", year = "2017", unitLevel = 3)

Multi download

Downloading functions get_data_by_unit() and get_data_by_variable() have alternative “multi” downloading mode. Function that would work for example single unit, if provided a vector will make additional column with values for each unit provided:

get_data_by_unit(unitId = c("023200000000", "020800000000"), varId =  c("3643", "2137", "148190"))

Or multiple variables for get_data_by_variable():

get_data_by_variable(varId =c("3643","420"), unitParentId = "010000000000")

This mode works for the locality version as well.

More consistent method of downloading multiple variables for multiple units is provided by get_panel_data() function:

get_panel_data(unitId = c("030210101000", "030210105000", "030210106000"), varId =  c("60270", "461668"), year = c(2015:2016))

It offers also parameter ggplot = TRUE which produces output in the long form suitable for plotting with ggplot package:

library(ggplot2)
df <- get_panel_data(unitId = c("030210101000", "030210105000", "030210106000"), varId =  c("60270", "461668"), year = c(2015:2018), ggplot = TRUE)
ggplot(df,aes(x=year, y= values, color = unit)) + geom_line(aes(linetype = variables)) + scale_color_discrete(labels = c("A", "B", "C")) + scale_linetype_discrete(labels = c("X", "Y"))