README: Corpus of Decisions: International Court of Justice (CD-ICJ)

Overview

This R script downloads and processes the full set of decisions and appended opinions rendered by the International Court of Justice (ICJ) as published on https://www.icj-cij.org into a rich and structured human- and machine-readable data set. It is the basis for the Corpus of Decisions: International Court of Justice (CD-ICJ).

All data sets created with this script will be permanently hosted open access and freely available at Zenodo, the scientific repository of CERN. Each version is uniquely identified with a persistent Digital Object Identifier (DOI), the Version DOI. The newest version of the data set will always be available via the Concept DOI: https://doi.org/10.5281/zenodo.3826444
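
Because the Concept DOI always resolves to the newest version, the current record can also be retrieved programmatically. A minimal R sketch, assuming the 'httr' package is available (it is not a dependency of this project):

library(httr)

# The DOI resolver redirects to the Zenodo record of the newest version
resp <- GET("https://doi.org/10.5281/zenodo.3826444")
resp$url   # URL of the newest version's Zenodo record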

Functionality

This script will produce 21 ZIP archives:

  • 2 archives of CSV files containing the full machine-readable data set (English/French)
  • 2 archives of CSV files containing the full machine-readable metadata (English/French)
  • 2 archives of TXT files containing all machine-readable texts with a reduced set of metadata encoded in the filenames (English/French)
  • 2 archives of PDF files containing all human-readable texts with enhanced OCR (English/French)
  • 2 archives of PDF files containing all human-readable majority opinions with enhanced OCR (English/French)
  • 2 archives of PDF files containing monolingual versions of documents dated 2004 and earlier, with enhanced OCR (English/French)
  • 2 archives of PDF files as originally published by the ICJ (English/French)
  • 2 archives of TXT files containing text as generated by Tesseract for documents dated 2004 or earlier (English/French)
  • 2 archives of TXT files containing extracted text from the original documents (English/French)
  • 1 archive of PDF files that were unlabelled on the website (intended for replication and review only)
  • 1 archive of analysis data and diagrams
  • 1 archive containing all source files

The integrity and veracity of each ZIP archive are documented with cryptographically secure hashes (SHA2-256 and SHA3-512). The hashes are stored in a separate CSV file created during compilation of the data set.
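
To verify an archive after download, recompute its hash and compare it to the value recorded in the CSV file. A minimal R sketch using the 'digest' package; the archive name and the column names of the hash CSV are assumptions, so adjust them to the files shipped with your version:

library(digest)

archive  <- "CD-ICJ_EN_CSV_FULL.zip"           # hypothetical file name
observed <- digest(archive, algo = "sha256", file = TRUE)

hashes   <- read.csv("CD-ICJ_hashes.csv")      # hypothetical file name
expected <- hashes$sha2_256[hashes$filename == basename(archive)]

stopifnot(identical(tolower(observed), tolower(expected)))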

Please refer to the Codebook regarding the relative merits of each variant. Unless you have very specific needs, you should use only the variants denoted 'BEST' for serious work.

System Requirements

  • Docker
  • Docker Compose
  • 25 GB disk space on hard drive
  • Parallelization is customized to your machine automatically by detecting the maximum number of available cores (see the sketch after this list)
  • A full run of this script takes approximately 11 hours on a machine with a Ryzen 3700X CPU using 16 threads, 64 GB DDR4 RAM and a fast SSD.
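
The core detection mentioned above typically works as in the following R sketch; the actual scripts may size their worker pools differently:

library(parallel)

# Number of logical cores available on this machine
n_cores <- detectCores()

# A cluster of that size could then drive the parallel steps
cl <- makeCluster(n_cores)
stopCluster(cl)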

Instructions

Step 1: Prepare Folder

Copy the full source code to an empty folder, for example by executing:

$ git clone https://github.com/seanfobbe/cd-icj

Always use a dedicated and empty folder for compiling the data set. To ensure a clean run, the scripts automatically delete all PDF and TXT files, as well as many other file types, in their working directory.
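
The kind of cleanup the scripts perform is roughly sketched below; this is illustrative R code, not the project's actual implementation:

# Remove all PDF and TXT files below the current working directory
stale <- list.files(pattern = "\\.(pdf|txt)$",
                    ignore.case = TRUE,
                    recursive = TRUE)
unlink(stale)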

Step 2: Create Docker Image

The Dockerfile contains automated instructions to create a full operating system with all necessary dependencies. To create the image from the Dockerfile, please execute:

$ bash docker-build-image.sh

Step 3: Compile Dataset

If you have previously compiled the data set, whether successfully or not, you can delete all output and temporary files by executing:

$ Rscript delete_all_data.R

You can compile the full data set by executing:

$ bash docker-run-project.sh

Results

The data set and all associated files are now saved in the output/ folder of your working directory.
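
A quick way to start working with the results is to unpack one of the CSV archives and read it into R. The file names below are hypothetical; substitute the archives actually present in your output/ folder:

library(data.table)   # fast CSV reader; any CSV reader works

unzip("output/CD-ICJ_EN_CSV_FULL.zip", exdir = "output")
icj <- fread("output/CD-ICJ_EN_CSV_FULL.csv")

str(icj)   # inspect the available variables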

Open Access Publications (Fobbe)

Website --- https://www.seanfobbe.com

Open Data --- https://zenodo.org/communities/sean-fobbe-data

Code Repository --- https://zenodo.org/communities/sean-fobbe-code

Regular Publications --- https://zenodo.org/communities/sean-fobbe-publications

Contact

Did you discover any errors? Do you have suggestions on how to improve the data set? You can either post them to the Issue Tracker on GitHub or send me an e-mail at [email protected]