    Repositories list

    • Scripts for running bitextor/paracrawl/europat jobs on cirrus.ac.uk
      Shell
      1781 Updated Sep 26, 2024
    • giashard

      Public
      Sharding program for Paracrawl
      Go
      1200 Updated Sep 17, 2024
    • giawarc

      Public
      Processing utilities for Internet Archive
      C++
      0141 Updated Apr 19, 2024
    • corset

      Public
      Corset is a web-based data selection portal that helps you get relevant data from massive amounts of parallel data.
      SCSS
      GNU General Public License v3.0
      31710 Updated Nov 6, 2023
    • keops

      Public
      Tool for manual evaluation of parallel sentences.
      PHP
      GNU General Public License v3.0
      41400 Updated Oct 19, 2023
    • Scripts for obtaining patent data
      Java
      2411 Updated Apr 14, 2023
    • tmxutil

      Public
      Tools to generate & filter Europat TMX files.
      Python
      MIT License
      1410 Updated Jan 17, 2023
    • synthesis

      Public
      Data synthesis by contextualizing glossary translations
      Python
      3600 Updated Jul 1, 2021
    • Automate download and training with OPUS corpora
      Shell
      MIT License
      2200 Updated Jan 28, 2021
    • Results of the human evaluation
      Rich Text Format
      3500 Updated Dec 9, 2020
    • Open any Paracrawl corpus-related issue here
      0000 Updated Nov 18, 2020
    • Creative Commons Zero v1.0 Universal
      0000 Updated Nov 13, 2020
    • b64filter

      Public archive
      Program for operating on files that hold one Base64-encoded document per line
      Go
      0110 Updated Aug 4, 2020
    • InDomain detection is a tool designed to extract in-domain data from large collections of data.
      Python
      GNU General Public License v3.0
      1100 Updated Jun 5, 2020
    • Python
      0000 Updated Mar 6, 2020
    • Python
      0100 Updated Mar 6, 2020
    • go-warc

      Public
      A Go library to work with WARC files from the Common Crawl
      Go
      GNU General Public License v2.0
      7000 Updated Aug 4, 2019
    • extractor

      Public
      C++
      Apache License 2.0
      42310 Updated Nov 29, 2017
    • embedding

      Public
      Mine parallel corpora with embeddings
      Perl
      0400 Updated Sep 2, 2017
    • Data collection, alignment and TAUS repository
      Python
      Apache License 2.0
      8830 Updated Aug 1, 2017