
LUMI setup


Recommended reads:

Installation and Configuration

Please ignore the "Compiling Software" section in the README and follow these steps instead. The lumi branch now runs all of the slurm jobs inside a Singularity container that has all the software installed.

Clone the repo (do not clone recursively) and switch to the lumi branch:

git clone https://github.com/hplt-project/bitextor-slurm
cd bitextor-slurm
git checkout lumi

Create a symlink in the bitextor-slurm directory pointing to the latest version of the container file, located at /scratch/project_462000688/bitextor-slurm-containers:

ln -s /project/project_465001864/bitextor-slurm-containers/bitextor-slurm-v4.sif ./bitextor-slurm.sif

Edit config.d/10.lumi.sh to set up working directories for processed data:

  • Change SCRATCH_DIR to the bitexting working directory /scratch/project_462000688/dayyan/sharded_data. BEWARE that this is a shared directory that other users will be working on, so make sure you process only the languages you have been assigned.
  • Change SLURM_LOGS to your own logs directory to avoid storing logs in the same directory as other users. THIS DIRECTORY MUST EXIST before running jobs; otherwise they will fail.
  • Set up collection names and directories.
        export DATA_CLEANING=$SCRATCH_DIR/clean
        export COLLECTION_ROOT="$SCRATCH_DIR"
        declare -A COLLECTIONS=(
                ["archivebot"]="$SCRATCH_DIR/archivebot"
                ["cc13"]="$SCRATCH_DIR/cc13"
                ["cc14"]="$SCRATCH_DIR/cc14"
                ["cc15"]="$SCRATCH_DIR/cc15"
                ["cc16"]="$SCRATCH_DIR/cc16"
                ["cc17"]="$SCRATCH_DIR/cc17"
                ["cc18"]="$SCRATCH_DIR/cc18"
                ["cc19"]="$SCRATCH_DIR/cc19"
                ["cc20"]="$SCRATCH_DIR/cc20"
                ["cc21"]="$SCRATCH_DIR/cc21"
                ["cc22"]="$SCRATCH_DIR/cc22"
                ["cc23"]="$SCRATCH_DIR/cc23"
                ["survey3"]="$SCRATCH_DIR/survey3"
                ["wide10"]="$SCRATCH_DIR/wide10"
                ["wide11"]="$SCRATCH_DIR/wide11"
                ["wide12"]="$SCRATCH_DIR/wide12"
                ["wide15"]="$SCRATCH_DIR/wide15"
                ["wide16"]="$SCRATCH_DIR/wide16"
                ["wide17"]="$SCRATCH_DIR/wide17"
                ["wide5"]="$SCRATCH_DIR/wide5"
                ["wide6"]="$SCRATCH_DIR/wide6"
        )
  • Other relevant variables that may not need modification (see the sketch after this list):
    • SBATCH_ACCOUNT specifies the project that will be billed for the computing hours.
    • SBATCH_PARTITION: we use small since we do not have multi-node jobs and it allows more jobs to run at the same time.
    • SBATCH_MEM_PER_CPU: only needed for the small partition. Comment this line out for the standard partition.
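
For reference, a minimal sketch of what the edited variables in config.d/10.lumi.sh might look like; all paths and values below are examples only and must be adapted to your own setup:

export SCRATCH_DIR=/scratch/project_462000688/dayyan/sharded_data   # shared bitexting working directory
export SLURM_LOGS=/scratch/project_462000688/$USER/slurm-logs       # example path; use your own logs directory
export SBATCH_ACCOUNT=project_462000688                             # project billed for the compute hours
export SBATCH_PARTITION=small                                       # small allows more concurrent jobs
export SBATCH_MEM_PER_CPU=1750                                      # example value; only needed for the small partition

Remember to create the logs directory before submitting any job:

mkdir -p "$SLURM_LOGS"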

Container creation

If you want to build your own container, run the following commands inside the cloned repo on a machine with Docker and Singularity installed:

sudo docker build -t bitextor-slurm:latest .
sudo singularity build -F bitextor-slurm.sif bitextor-slurm.def 

Configure translation

To configure the translation step with a Bergamot student, the following steps are required:

  • Create the language pair directory, e.g. models/es-en.
  • Download the student model files to models/es-en/esen.student.tiny11 and create a model symlink pointing to it.
  • Create a translate.sh symlink to models/translate-bergamot.sh:
cd models/es-en
ln -s ../translate-bergamot.sh translate.sh # relative symlink seems to work better
zaragoza2@uan01:~/proj_462000252/zaragoza/cirrus-scripts> ll models/es-en/
total 8.0K
drwxrws--- 2 zaragoza2 project_462000252 4.0K May 11 13:03 esen.student.tiny11
lrwxrwxrwx 1 zaragoza2 project_462000252   19 May 11 13:14 model -> esen.student.tiny11
lrwxrwxrwx 1 zaragoza2 project_462000252   84 May 11 13:00 translate.sh -> ../translate-bergamot.sh

Note that translate-bergamot.sh will look for the marian-decoder config at models/es-en/model/config.yml. This is an example optimized for Bergamot models:

quiet-translation: true
relative-paths: true
models:
    - model.intgemm.alphas.bin
vocabs:
    - vocab.esen.spm
    - vocab.esen.spm
shortlist:
    - lex.s2t.bin
    - false

beam-size: 1
normalize: 1.0
word-penalty: 0
mini-batch: 8
maxi-batch: 500
maxi-batch-sort: src
workspace: 256
max-length: 400
max-length-crop: true
gemm-precision: int8shiftAlphaAll

max-length-crop avoids very long lines freezing Marian.

The Marian Bergamot CPU version is already compiled and configured in translate-bergamot.sh, so there is no need to compile it yourself.

To use other types of translators, you will need to compile/install them yourself and configure translate.sh. Take a look at the translation template scripts in the models/ directory to get an idea of what is needed. Note the use of the foldfilter wrapper to chop very long lines before translation and rejoin them at the output.

WARNING: foldfilter can mess up spaces in some cases; for example, it cannot handle languages that do not use spaces. Check the outputs before using it.

Run teacher models with CTranslate2 on CPU

To use an OpusMT model (or any Marian model that is not a Bergamot student) with CTranslate2, download the zip, extract it, convert the model and copy the vocab files like this:

cd bitextor-slurm
singularity exec -B $(pwd -P) --pwd $(pwd -P) bitextor-slurm.sif bash
ct2-marian-converter \
   --quantization int8 \
   --model_path spa-eng/opus1m+bt.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz \
   --vocab_paths spa-eng/opus1m+bt.spm32k-spm32k.vocab.yml spa-eng/opus1m+bt.spm32k-spm32k.vocab.yml \
   --output_dir models/spa-eng-ct2
exit
cp spa-eng/{source,target}.spm models/spa-eng-ct2 # the ctranslate scripts expect source.spm and target.spm as the SP models
cd models/es-en
ln -s ../spa-eng-ct2 model
ln -s ../translate-opus-ct2.sh translate.sh

To submit jobs use:

# Running each ctranslate2 process (4 in total per node) with 32 threads seems more efficient than using 64
TPB=64 TPN=4 SBATCH_CPUS_PER_TASK=32 ./04.translate.sh $collection $lang

Run the pipeline

Once everything is configured, all the steps can be followed as the README explains.

General recommendations

Each processing step follows the scheme "run the process writing to an output file with a temporary suffix, then remove the suffix to mark it as finished". So every time a job fails or does not finish properly, it leaves temp files all over the place. If jobs fail, cleaning these up regularly is advised to keep the number of files down.

For example, to list the temporary files:

find *-shards/$lang/ -name "*.gz.[0-9]*"
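
If the listed files are indeed leftovers of failed jobs, the same pattern can be reused to delete them (a sketch; review the matches before adding -delete):

find *-shards/$lang/ -name "*.gz.[0-9]*" -print    # review the matches first
find *-shards/$lang/ -name "*.gz.[0-9]*" -delete   # then remove the leftovers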

Example execution for one language

It is advised to read https://github.com/hplt-project/bitextor-slurm#environment-variables first.

These are the parameters used to process Basque.

To run all collections, the submission scripts can be wrapped in a loop:

for collection in $(./collections.sh); do
    ./step-script.sh $collection $lang
done

collections.sh lists all the collections in the config, so make sure there are no collections used for testing in the config, or comment them out before using the script.
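
Alternatively, if the test collections can be recognised by their names, the list can be filtered on the fly instead of editing the config (a sketch; the grep pattern below is only a hypothetical example):

for collection in $(./collections.sh | grep -v -E 'test|sample'); do
    ./step-script.sh $collection $lang
done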

For 03.split

TPB=1024 TPN=128 ./03.split-text.sh $collection eu

For 04.translate

TPB=32 TPN=2 ./04.translate.sh $collection eu

Note that 04.translate uses 64 cores by default for each batch; therefore TPN=2 uses all 128 physical cores on each node.

For 05.tokenise

TPB=1024 TPN=128 ./05.tokenise.sh $collection eu

For 06.align

TPB=128 TPN=32 ./06.align.sh $collection eu

For 07.fix

TPB=256 TPN=32 ./07.fix.sh $collection eu

For 08.score (jobs are set to use one GPU per slurm job, as there is no parallelization inside 08.score; to use more resources in parallel, reduce TPB so that more jobs are allocated)

TPB=256 TPN=32 ./08.score.sh $collection eu
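
For example, lowering TPB makes each job cover fewer tasks, so more jobs (and therefore more GPUs) can run at once; the value below is only an illustration:

# smaller batches per job -> more concurrent GPU jobs
TPB=128 TPN=32 ./08.score.sh $collection eu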

For 09.clean

TPB=512 TPN=32 ./09.clean.sh $collection eu

For 10.reduce-classified

# collections.sh will list all collection names in the config
# make sure collections for testing are commented out in the config or removed before using it
./10.reduce-classified.sh eu $(./collections.sh)

For 11.reduce-filtered

./11.reduce-filtered.sh eu $(./collections.sh)

For 12.reduce-tmx

./12.reduce-tmx.sh eu $(./collections.sh)