Where we get data from

Peter Inglesby edited this page Nov 9, 2018 · 2 revisions

See sources.json for an up-to-date list.

In each case, the download itself is performed by a Python script that runs as part of our data pipeline, rather than manually via a web browser.

Data hosted on the ISP (the BNF codes and the prescribing data) is on a website that's protected by a captcha. Before these datasets can be downloaded, a human has to solve the captcha in their browser. Doing so sets a cookie in the browser, and the value of that cookie is then passed as a parameter to the Python script.

Similarly, the TRUD data (currently just the dm+d dataset) is on a website that requires a login. Again, a human must log in through their browser and then pass the resulting cookie to the Python script.
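
The cookie hand-off described above could look roughly like the following. This is a minimal sketch using only the Python standard library, not the pipeline's actual code: the function names (`build_request`, `download`), the parameter names, and the idea of sending a single named cookie are all assumptions made for illustration. The real cookie name depends on the site, and has to be read from the browser after solving the captcha or logging in.

```python
import shutil
import urllib.request


def build_request(url, cookie_name, cookie_value):
    """Build a request that presents a cookie copied from a browser session.

    `cookie_name` and `cookie_value` are hypothetical parameters: the human
    operator copies them from their browser after solving the captcha
    (or logging in) and supplies them to the script.
    """
    request = urllib.request.Request(url)
    # Send the cookie exactly as the browser would, in a Cookie header.
    request.add_header("Cookie", f"{cookie_name}={cookie_value}")
    return request


def download(url, cookie_name, cookie_value, dest_path):
    """Fetch `url` with the supplied cookie and stream the body to `dest_path`."""
    request = build_request(url, cookie_name, cookie_value)
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as f:
        # Stream in chunks so large datasets are not held in memory.
        shutil.copyfileobj(response, f)
```

A caller would invoke something like `download(dataset_url, cookie_name, cookie_value, "data.csv")`, where all three values come from the operator. Cookies of this kind typically expire, so the manual step usually has to be repeated when a download starts failing.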