Eurostat tools for Python: eust v0.4.0

tl;dr:

$ pip install eust
$ eust download table apro_cpsh1

Then, in Python:

import eust
data = eust.read_table_data('apro_cpsh1', version='2019-05-13 23:00:00')
meta = eust.read_table_metadata('apro_cpsh1', version='2019-05-13 23:00:00')

Bulk downloading and parsing Eurostat tables

Over the last few years I have been working a lot with tables from Eurostat.

There are several ways to download Eurostat data, but for me the most appealing way has always been to download a whole table to disk and then generate whatever extractions I need as I go, keeping the whole original table for future reference.

Unfortunately, Eurostat's data query interfaces limit query sizes so much that you usually cannot come anywhere close to a full table download in one query. So although there are nice tools like pandaSDMX and pyrostat to interact with Eurostat's query interfaces, I much prefer a method that lets me download the whole table at once.

To my knowledge, the most straightforward way to download a whole table at once is to go to Eurostat's bulk download facility, download a .tsv file and parse that. The parsing is not super fun since these .tsv files are not tidy. But with a moderate amount of patience and some pandas hacking, I wrote a reader for these files several years ago.
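To give a feel for what "not tidy" means here, below is a minimal sketch of such a parser. It is not the reader in eust (which uses pandas); it uses only the standard library. It assumes the layout I have typically seen in the bulk files: the first header cell packs the dimension names as "dim1,dim2\time", the remaining columns are time periods, a cell may carry a flag after the number (as in "5.5 p"), and ":" marks a missing value. The function name parse_eurostat_tsv and the sample data are made up for illustration.

```python
import csv
import io


def parse_eurostat_tsv(text):
    """Turn an untidy Eurostat-style TSV string into a list of tidy records.

    Assumed layout (an assumption, check against your own file):
    the first header cell looks like "dim1,dim2\\time", and each data cell
    is ":" (missing) or a number optionally followed by a flag, e.g. "5.5 p".
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)

    # Split "crops,geo\time" into row dimensions and the column dimension.
    dims_part, _, column_dim = header[0].partition("\\")
    dim_names = [name.strip() for name in dims_part.split(",")]
    column_dim = column_dim.strip()
    periods = [cell.strip() for cell in header[1:]]

    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        dim_values = [value.strip() for value in row[0].split(",")]
        for period, cell in zip(periods, row[1:]):
            parts = cell.split()
            if not parts or parts[0] == ":":
                continue  # missing value, nothing to record
            record = dict(zip(dim_names, dim_values))
            record[column_dim] = period
            record["value"] = float(parts[0])
            record["flag"] = parts[1] if len(parts) > 1 else ""
            records.append(record)
    return records


# Hypothetical two-row sample in the assumed layout.
SAMPLE = "crops,geo\\time\t2018 \t2017 \nC1000,SE\t5.5 p\t: \nC1000,DK\t: c\t7 \n"

for record in parse_eurostat_tsv(SAMPLE):
    print(record)
```

Each output record is one observation with one column per dimension, which is the long format that is easy to filter and pivot later.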

Why we need to manage multiple versions of Eurostat tables

As Eurostat updates its databases, things change. Of course new data are added, but already published data can also be removed or changed as corrections become available. Whole new tables are created and old ones retired. A couple of years ago I was actually approached by a colleague who wanted to rerun an old calculation of mine but could no longer even find a table with the name I had referred to.

The reasonable reaction, which I wish I had adopted from the beginning, is to neatly archive every table version I am using, so I can dig it out later, rerun old calculations, and of course share the input data with colleagues. This is really the only way if we want to do reproducible research with these data. Of course, I have instead been lazy and disorganized, saving various versions of Eurostat files here and there in different project directories, and I have never kept track of the files as well as I should have.
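The archiving idea itself is simple enough to sketch in a few lines. The snippet below is only an illustration of the principle, not a description of how eust stores its files: the directory layout (archive root / table name / version), the function names, and the colon-free version label are all assumptions made for this example.

```python
import shutil
from pathlib import Path


def archive_table_version(source_file, archive_root, table_name, version):
    """Copy a downloaded file to archive_root/table_name/version/ (hypothetical layout)."""
    source_file = Path(source_file)
    target_dir = Path(archive_root) / table_name / version
    # exist_ok=False: refuse to silently overwrite an already archived version.
    target_dir.mkdir(parents=True, exist_ok=False)
    target = target_dir / source_file.name
    shutil.copy2(source_file, target)
    return target


def list_versions(archive_root, table_name):
    """Return the archived version labels for a table, sorted alphabetically."""
    table_dir = Path(archive_root) / table_name
    if not table_dir.is_dir():
        return []
    return sorted(path.name for path in table_dir.iterdir() if path.is_dir())
```

With a label such as "2019-05-13T230000", alphabetical order is also chronological order, so the last entry from list_versions is the most recent archived copy.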

Finally built and published a tool to keep things tidy. Why?

This situation has now finally changed thanks to the little Python library I've written. I'm releasing an early, not-so-polished version of this library on PyPI, with the source code on GitHub. The reason? Honestly, mainly to make my own life easier in the long run:

  • First of all, by publishing to PyPI I can easily keep track of the different versions of my little library. I require eust==0.4.0 in my current project, and then that project should keep on working for the foreseeable future, although in a year I might have moved on to eust==0.5.3 for another project, etc.
  • Second, this public announcement of a library seems to be a good way to force myself into maintaining a decent level of documentation and stability of the tool. Although I don't really expect anyone else to use this library, the mere possibility that some fellow Eurostat/Python hacker could find it useful makes me so happy and so frightened that I've now even written some basic documentation for the thing. Funny thing is, I will probably get much more use out of that documentation than anyone else.