Dataset Name: The South African Gov-ZA multilingual corpus
Citation: Vukosi Marivate, Matimba Shingange, Richard Lastrucci. Cabinet statements from the SA government in multiple languages
Link to dataset: https://github.com/dsfsi/gov-za-multilingual
Data set Developer(s): Vukosi Marivate, Matimba Shingange, Richard Lastrucci
Data statement author(s): Vukosi Marivate, Matimba Shingange, Richard Lastrucci
The dataset contains cabinet statements from the South African government. Data was scraped from the government's website: https://www.gov.za/cabinet-statements
The dataset contains government cabinet statements in 11 languages; see the next section for details.
The full data is provided in a JSON file (/data/govza-cabinet-statements.json), as well as in CSVs split by language, e.g. “govza-cabinet-statements-en.csv” for English. The dataset does not contain special characters like Unicode or ASCII.
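
The file layout described above can be read with the Python standard library alone. The following is a minimal sketch, not the authors' tooling: the local clone path `gov-za-multilingual/data`, the assumption that the per-language CSVs sit in the same directory as the JSON file, and the helper names `load_full_corpus` and `load_language_rows` are all hypothetical. Only the `en` suffix is confirmed by the file name quoted above, and no column names or JSON schema are assumed.

```python
import csv
import json
from pathlib import Path

# Assumed location of a local clone of the repository; adjust as needed.
DATA_DIR = Path("gov-za-multilingual") / "data"


def load_full_corpus(data_dir=DATA_DIR):
    """Parse the combined JSON file without assuming anything about its schema."""
    path = Path(data_dir) / "govza-cabinet-statements.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def load_language_rows(suffix, data_dir=DATA_DIR):
    """Read one per-language CSV into a list of dicts keyed by its header row.

    `suffix` is the language part of the file name, e.g. "en" for English.
    """
    path = Path(data_dir) / f"govza-cabinet-statements-{suffix}.csv"
    with path.open(encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

Calling `load_language_rows("en")` would then return the English statements as a list of dictionaries, one per CSV row, with keys taken from whatever header the file actually has.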
All recorded cabinet statements are available in 11 languages:
Language | Code | Language | Code |
---|---|---|---|
English | (eng) | Sepedi | (nso) |
Afrikaans | (afr) | Setswana | (tsn) |
isiNdebele | (nbl) | Siswati | (ssw) |
isiXhosa | (xho) | Tshivenda | (ven) |
isiZulu | (zul) | Xitsonga | (tso) |
Sesotho | (sot) | | |
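
For programmatic use, the language table can be captured as a simple lookup. This is an illustrative sketch only: the names `LANGUAGE_CODES` and `code_for` are hypothetical, and the three-letter codes are copied from the table rather than taken from the dataset files.

```python
# Three-letter language codes as listed in the table above.
LANGUAGE_CODES = {
    "English": "eng",
    "Afrikaans": "afr",
    "isiNdebele": "nbl",
    "isiXhosa": "xho",
    "isiZulu": "zul",
    "Sesotho": "sot",
    "Sepedi": "nso",
    "Setswana": "tsn",
    "Siswati": "ssw",
    "Tshivenda": "ven",
    "Xitsonga": "tso",
}


def code_for(language):
    """Return the three-letter code for a language name, ignoring case."""
    by_lowercase = {name.lower(): code for name, code in LANGUAGE_CODES.items()}
    return by_lowercase[language.lower()]
```

The lookup is case-insensitive so that spellings such as "IsiZulu" and "isiZulu" resolve to the same code.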
The data is issued by the government communications department. The datasets are composed of the topics covered in different cabinet meetings.
The data comprises cabinet statements from 2013 to the present. It is written in formal language and is often split into different topical sections.
All data relates to the cabinet meetings of the government, so it covers a variety of topics such as energy, labour, service delivery, crime, COVID, international relations, and the environment, as well as government affairs such as government appointments and cabinet decisions.