LUMI setup
Recommended reads:
- https://docs.lumi-supercomputer.eu/storage/
- https://docs.lumi-supercomputer.eu/runjobs/
- https://docs.lumi-supercomputer.eu/software/installing/container-wrapper/
- https://github.com/hplt-project/bitextor-slurm#readme
- https://docs.google.com/document/d/1YyjdWofZ65ib9qTnGiJ8n0Rvgm4PKRhwvnFYfXrSMRg/edit?usp=sharing
- https://docs.google.com/presentation/d/1fADyGoYIi1XL8_iY2cW5_cqaAeF3LNSZ4c1GsP00J2A/edit?usp=sharing
Please ignore "Compiling Software" section in README, instead follow these steps. The lumi branch now runs all of the slurm jobs inside a Singularity container that has all the software installed.
Clone the repo (do not clone recursively) and swith to lumi branch
git clone https://github.com/hplt-project/bitextor-slurm
cd bitextor-slurm
git checkout lumi
Create a symlink in the bitextor-slurm directory pointing to the latest version of the container file located at /scratch/project_462000688/bitextor-slurm-containers:
ln -s /project/project_465001864/bitextor-slurm-containers/bitextor-slurm-v4.sif ./bitextor-slurm.sif
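To make sure the link actually resolves (an easy thing to get wrong with shared paths), list it with -L, which follows the symlink:
ls -lL bitextor-slurm.sif   # an error here means the target path is wrong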
Edit config.d/10.lumi.sh to set up working directories for processed data:
- Change SCRATCH_DIR to the bitexting working directory /scratch/project_462000688/dayyan/sharded_data. BEWARE that this is a shared directory that other users will be working on, so make sure you run only the languages you have been assigned.
- Change SLURM_LOGS to your own logs directory to avoid storing logs in the same directory as other users. THIS DIRECTORY MUST EXIST before running jobs, otherwise they will fail.
- Set up collection names and directories:
export DATA_CLEANING=$SCRATCH_DIR/clean
export COLLECTION_ROOT="$SCRATCH_DIR"
declare -A COLLECTIONS=(
["archivebot"]="$SCRATCH_DIR/archivebot"
["cc13"]="$SCRATCH_DIR/cc13"
["cc14"]="$SCRATCH_DIR/cc14"
["cc15"]="$SCRATCH_DIR/cc15"
["cc16"]="$SCRATCH_DIR/cc16"
["cc17"]="$SCRATCH_DIR/cc17"
["cc18"]="$SCRATCH_DIR/cc18"
["cc19"]="$SCRATCH_DIR/cc19"
["cc20"]="$SCRATCH_DIR/cc20"
["cc21"]="$SCRATCH_DIR/cc21"
["cc22"]="$SCRATCH_DIR/cc22"
["cc23"]="$SCRATCH_DIR/cc23"
["survey3"]="$SCRATCH_DIR/survey3"
["wide10"]="$SCRATCH_DIR/wide10"
["wide11"]="$SCRATCH_DIR/wide11"
["wide12"]="$SCRATCH_DIR/wide12"
["wide15"]="$SCRATCH_DIR/wide15"
["wide16"]="$SCRATCH_DIR/wide16"
["wide17"]="$SCRATCH_DIR/wide17"
["wide5"]="$SCRATCH_DIR/wide5"
["wide6"]="$SCRATCH_DIR/wide6"
)

Other relevant variables that may not need modification:
- SBATCH_ACCOUNT specifies the project that will be billed for the computing hours.
- SBATCH_PARTITION: we will be using small, since we do not have multi-node jobs and it allows more jobs to run at the same time.
- SBATCH_MEM_PER_CPU: only needed for the small partition. Comment this line out for the standard partition.
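As a rough sketch, the relevant part of config.d/10.lumi.sh could end up looking like the following; the logs path, account and memory value are placeholders to adapt to your own setup:
export SCRATCH_DIR=/scratch/project_462000688/dayyan/sharded_data   # shared bitexting working directory
export SLURM_LOGS=$HOME/bitextor-slurm-logs                         # placeholder; pick your own directory
export SBATCH_ACCOUNT=project_462000688                             # placeholder; project billed for compute hours
export SBATCH_PARTITION=small                                       # small allows more concurrent jobs
export SBATCH_MEM_PER_CPU=1750                                      # example value; only for small, comment out on standard

mkdir -p "$SLURM_LOGS"   # run this once yourself; the logs directory must exist before submitting jobs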
If you want to build your own container, run the following steps on a machine with Docker and Singularity installed, from inside the cloned repo:
sudo docker build -t bitextor-slurm:latest .
sudo singularity build -F bitextor-slurm.sif bitextor-slurm.def
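A quick smoke test of the freshly built image before copying it to LUMI:
singularity exec bitextor-slurm.sif bash -c 'echo "container OK"'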
To configure the translation step with a Bergamot student model, the following steps are required:
- Create the language pair directory, e.g. models/es-en.
- Download the student model files to models/es-en/esen.student.tiny11 and create a model symlink pointing to it.
- Create a translate.sh symlink to models/translate-bergamot.sh, as shown below.
cd models/es-en
ln -s esen.student.tiny11 model             # point the generic "model" name at the downloaded student
ln -s ../translate-bergamot.sh translate.sh # relative symlink seems to work better

zaragoza2@uan01:~/proj_462000252/zaragoza/cirrus-scripts> ll models/es-en/
total 8.0K
drwxrws--- 2 zaragoza2 project_462000252 4.0K May 11 13:03 esen.student.tiny11
lrwxrwxrwx 1 zaragoza2 project_462000252 19 May 11 13:14 model -> esen.student.tiny11
lrwxrwxrwx 1 zaragoza2 project_462000252 84 May 11 13:00 translate.sh -> ../translate-bergamot.sh
Note that translate-bergamot.sh will look for the marian-decoder config at models/es-en/model/config.yml. Here is an optimized example config for Bergamot models:
quiet-translation: true
relative-paths: true
models:
- model.intgemm.alphas.bin
vocabs:
- vocab.esen.spm
- vocab.esen.spm
shortlist:
- lex.s2t.bin
- false
beam-size: 1
normalize: 1.0
word-penalty: 0
mini-batch: 8
maxi-batch: 500
maxi-batch-sort: src
workspace: 256
max-length: 400
max-length-crop: true
gemm-precision: int8shiftAlphaAll

max-length-crop avoids very long lines freezing Marian.
The Marian Bergamot CPU version is already compiled and configured in translate-bergamot.sh, so there is no need to compile it.
To use other types of translators, you will need to compile/install them yourself and configure translate.sh.
Take a look at the translation template scripts in the models/ directory to get an idea of what is needed.
Note the use of the foldfilter wrapper to chop very long lines before translation and rejoin them at the output.
WARNING: foldfilter can mess up spaces in some cases and, for example, cannot handle languages that do not use spaces.
Check its outputs before relying on it.
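A simple sanity check, assuming your translate.sh reads plain text on stdin and writes one translated line per input line on stdout, is to run a small sample through it inside the container and compare line counts:
head -n 100 corpus.es > /tmp/sample.in      # corpus.es is a hypothetical sample file
singularity exec -B $(pwd -P) --pwd $(pwd -P) bitextor-slurm.sif \
    bash models/es-en/translate.sh < /tmp/sample.in > /tmp/sample.out
wc -l /tmp/sample.in /tmp/sample.out        # line counts should match if foldfilter rejoined lines correctly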
To use an OpusMT model (or any Marian model that is not a Bergamot student) with CTranslate2, download the zip, extract it, convert the model and copy the vocab files like this:
cd bitextor-slurm
singularity exec -B $(pwd -P) --pwd $(pwd -P) bitextor-slurm.sif bash
ct2-marian-converter \
--quantization int8 \
--model_path spa-eng/opus1m+bt.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz \
--vocab_paths spa-eng/opus1m+bt.spm32k-spm32k.vocab.yml spa-eng/opus1m+bt.spm32k-spm32k.vocab.yml \
--output_dir models/spa-eng-ct2
exit
cp spa-eng/{source,target}.spm models/spa-eng-ct2 # the ctranslate script expects source.spm and target.spm SentencePiece models
cd models/es-en
ln -s ../spa-eng-ct2 model
ln -s ../translate-opus-ct2.sh translate.sh
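After these steps the language pair directory should contain the two symlinks; a quick listing confirms the setup:
ls -l   # still inside models/es-en; expect:
#   model -> ../spa-eng-ct2
#   translate.sh -> ../translate-opus-ct2.sh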
To submit jobs use:
# Running each ctranslate2 process (4 in total per node) with 32 threads seems more efficient than 64
TPB=64 TPN=4 SBATCH_CPUS_PER_TASK=32 ./04.translate.sh $collection $lang
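Once submitted, the job arrays can be monitored with the usual Slurm commands, for example:
squeue -u $USER   # list your pending and running jobs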
After all the configuration is done, the steps can be run as the README explains.
Each processing step follows the scheme "run the process writing to an output file with a temporary suffix, then remove the suffix to mark it as finished". So every time a job fails or does not finish properly, it leaves temporary files all over the place. Cleaning them up regularly when jobs fail is advised, to keep the number of files down.
For example, to list the temporary files:
find *-shards/$lang/ -name "*.gz.[0-9]*"
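Once you have checked that the listed files really are leftovers from failed jobs, the same command with -delete removes them:
find *-shards/$lang/ -name "*.gz.[0-9]*" -delete   # double-check the listing first; this deletes matching files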
It is advised to read https://github.com/hplt-project/bitextor-slurm#environment-variables first.
These are the parameters used to process Basque.
To run all collections, the submission scripts can be wrapped in a loop:
for collection in $(./collections.sh); do
./step-script.sh $collection $lang
done
collections.sh lists all the collections in the config, so make sure any collections used for testing are removed or commented out in the config before using the script.
For 03.split
TPB=1024 TPN=128 ./03.split-text.sh $collection eu
For 04.translate
TPB=32 TPN=2 ./04.translate.sh $collection eu
Note that 04.translate uses 64 cores by default for each batch, so TPN=2 uses all 128 physical cores on each node.
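If you are translating with CTranslate2 instead, the invocation shown earlier (four 32-thread processes per node) may be more efficient:
TPB=64 TPN=4 SBATCH_CPUS_PER_TASK=32 ./04.translate.sh $collection eu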
For 05.tokenise
TPB=1024 TPN=128 ./05.tokenise.sh $collection eu
For 06.align
TPB=128 TPN=32 ./06.align.sh $collection eu
For 07.fix
TPB=256 TPN=32 ./07.fix.sh $collection eu
For 08.score (jobs are set to use one GPU per Slurm job; since there is no parallelization inside 08.score, reduce TPB so that more jobs are allocated and more resources are used in parallel)
TPB=256 TPN=32 ./08.score.sh $collection eu
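Since each 08.score job grabs one GPU, halving TPB roughly doubles the number of jobs, and therefore GPUs, working in parallel; for example:
TPB=128 TPN=32 ./08.score.sh $collection eu   # illustrative values; more, smaller jobs than the TPB=256 run above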
For 09.clean
TPB=512 TPN=32 ./09.clean.sh $collection eu
For 10.reduce-classified
# collections.sh will list all collection names in the config
# make sure collections for testing are commented out in the config or removed before using it
./10.reduce-classified.sh eu $(./collections.sh)
For 11.reduce-filtered
./11.reduce-filtered.sh eu $(./collections.sh)
For 12.reduce-tmx
./12.reduce-tmx.sh eu $(./collections.sh)