g0vse is a project aiming to make the information available on the Swedish government's website (regeringen.se) more accessible.
You can read more about the project at g0v.se. The following documentation on GitHub focuses on the technical aspects of the project.
g0vse uses the regeringen.se search API to fetch the vast majority of the website's pages. For each page, the following information is saved:
- title: the page title, which often contains the name of the report, bill, etc.
- url: the url of the page, which also contains information about the page's object (e.g. "/remisser/", "/proposition/")
- id: the subtitle of the page, which often contains the serial number of the object (beteckningsnummer)
- summary: the excerpt of the page describing its content
- published & updated: the dates when the page was published and last updated
- types: the types of the page's object as codes*
- senders: the senders (often ministers) as codes*
- categories: the page's categories as codes*
- shortcuts: links to related pages that can be found on the right side of the page
- attachments: links to documents, usually the most interesting part if you want to download public inquiries (SOU, Ds, PM), public feedback letters (remissvar), etc.
*The codes are used on the website to categorize content; a list of codes can be found here.
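To make the shape of a saved record concrete, here is a hypothetical example written as a Python dict. All values are invented, and the exact structure of the nested fields is an assumption; the real contents vary as described further down:

```python
page = {
    "title": "Remiss av SOU 2024:12 ...",  # invented example values
    "url": "https://www.regeringen.se/remisser/2024/03/...",
    "id": "SOU 2024:12",
    "summary": "A short excerpt describing the page's content.",
    "published": "2024-03-01",
    "updated": "2024-03-15",
    "types": ["..."],       # codes, see the note above
    "senders": ["..."],
    "categories": ["..."],
    "shortcuts": [{"title": "Related page", "url": "https://..."}],
    "attachments": [{"name": "sou-2024-12.pdf", "url": "https://..."}],
}
```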
The logic behind the API routes is to follow the structure of the government's website as much as possible. For instance, a list of the reports is available at regeringen.se/rapporter and the corresponding API route is g0v.se/rapporter.json. A complete list of routes is available at g0v.se.
The route /api/latest_updated.json can be used to know when the data was last updated.
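For example, assuming plain HTTP access (the shape of the returned JSON is not documented here, so treat the payloads as opaque), the routes can be queried with Python's requests library:

```python
import requests

# Fetch the list of reports (mirrors regeringen.se/rapporter).
reports = requests.get("https://g0v.se/rapporter.json", timeout=30).json()

# Check when the dataset was last refreshed.
latest = requests.get("https://g0v.se/api/latest_updated.json", timeout=30).json()
print(latest)
```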
In addition, the text of each page is available as Markdown. For example, regeringen.se/artiklar/2024/10/sjukvardsminister-[...]-sjukvarden/ is available at g0v.se/artiklar/2024/10/sjukvardsminister-[...]-sjukvarden.md.
If you are unsure what is available, you can try the URL converter or browse the files on GitHub.
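Judging from the example above, a regeringen.se URL maps to its g0v.se Markdown counterpart by swapping the host and replacing the trailing slash with .md. A hypothetical helper (not part of the project's code) could look like:

```python
def to_markdown_url(regeringen_url: str) -> str:
    """Map a regeringen.se page URL to its g0v.se Markdown counterpart."""
    path = regeringen_url.split("regeringen.se", 1)[1].rstrip("/")
    return f"https://g0v.se{path}.md"

print(to_markdown_url("https://www.regeringen.se/artiklar/2024/10/some-article/"))
# -> https://g0v.se/artiklar/2024/10/some-article.md
```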
The license for the data is unclear: Sweden doesn't have a modern law on access to public information in which a default license could be specified, and the Government Offices (Regeringskansliet) haven't provided one either. In practice, it's safe to reuse.
The data was fetched through webscraping from a website used by thousands of civil servants across various departments, each with their own working methods, who never anticipated the information being digitally reused by others.
As a result, the data quality is uneven. Here are a few examples of issues you will need to address when reusing it:
- the id field needs to be cleaned in order to extract an actual identifier for the page's document (see the sketch after this list)
- the title field likewise follows varying conventions
- attachments are provided only with a name and a URL to fetch them. File names do not follow any convention and often contain typos, and links are sometimes dead when files have been removed.
- some pages are not marked as updated even though they have new content (some remiss pages, for instance, although far from the majority)
- departments have different methods when it comes to connecting their documents. Some use a component on the page called the "lagstiftnings-/beslutkedjan", others simply add the previous documents of the chain as a shortcut (in the genvägar box)
- the logic for parsing the lagstiftningskedjor is already written, but unfortunately the lack of reliable page identifiers and consistency in the data makes it hard to use for now. It's coming, though!
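As an illustration of the id cleanup mentioned above, here is a minimal sketch that extracts a beteckningsnummer with a regular expression. The patterns and formats (SOU, Ds, Prop.) are assumptions about common cases, not an exhaustive list and not the project's actual logic:

```python
import re

# Hypothetical patterns for common identifiers such as
# "SOU 2024:12", "Ds 2023:5" or "Prop. 2023/24:99".
ID_PATTERN = re.compile(r"(SOU|Ds|Prop\.)\s*(\d{4}(?:/\d{2})?:\d+)")

def clean_id(raw_id):
    """Try to extract a beteckningsnummer from the raw id field."""
    match = ID_PATTERN.search(raw_id)
    return f"{match.group(1)} {match.group(2)}" if match else None

print(clean_id("SOU 2024:12"))  # -> "SOU 2024:12"
```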
g0vse is composed of three components:
- a webscraper able to download a list of pages and the content of these pages
- a parser to convert the content of the HTML pages into Markdown and structured JSON objects
- a website to present the project and make it easier to understand what is available.
g0vse can be run by anyone on a local machine, but the webscraper and parser are also executed each night at 03:00 CET to fetch new content published the day before.
The webscraper's logic can be found mainly in downloader.py. It uses Selenium and a headless browser.
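The exact setup in downloader.py isn't reproduced here, but a minimal headless fetch with Selenium looks roughly like this (the URL and browser options are illustrative):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.regeringen.se/rapporter/")
    html = driver.page_source  # raw HTML, ready for parsing
finally:
    driver.quit()
```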
The parser's logic can be found mainly in web_parser.py. It uses the Python library beautifulsoup4.
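Again as a toy version rather than the project's actual logic, extracting a title and paragraphs with beautifulsoup4 might look like this (the selectors are assumptions; regeringen.se's real markup differs):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Titel</h1><p>Första stycket.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(title, paragraphs)  # -> Titel ['Första stycket.']
```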
Each night at 03:00 CET, the code is executed through a GitHub Action; the data is updated on the data branch and deployed at g0v.se.
To present the project, a lightweight static website has been built using Next.js and Tailwind CSS. Its code is in the frontend folder.
Go ahead! You'll need Python 3; install the dependencies with:

```bash
pip install -r requirements.txt
```
After that, you can fetch the latest 20 items and their associated codes:

```python
from services.downloader import Downloader

downloader = Downloader()  # instantiate the downloader before use
items, codes = downloader.get_latest_items(20)
```
You can also download a page and parse its content:

```python
from services.downloader import Downloader
from services.web_parser import extract_page

downloader = Downloader()
url = "https://www.regeringen.se/..."  # any regeringen.se page URL
page = downloader.get_webpage(url)
md_content, metadata = extract_page(page)
```
Have a look at fetch.py for more complex logic that can download all articles or just the missing ones.
The code is licensed under AGPLv3.