Device-side assertion not passed when training on cuda device and when there are added tokens to the tokenizer #53

Open
RaymondUoE opened this issue Jan 25, 2024 · 2 comments
Labels: bug

Comments

@RaymondUoE

Bug description

When training on CUDA, the transformer model fails with a device-side assertion if additional special tokens have been added to the tokenizer. This does not happen when device='cpu' or device='mps' is specified, suggesting the problem only surfaces at the CUDA/PyTorch level. However, the issue cannot be worked around through the small-text API; it requires modifying the library's source code.

Steps to reproduce

Using small-text/tree/main/examples/examplecode/transformers_multiclass_classification.py as an example:

Change
tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/')
to

tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

The device-side assertion is then triggered when the classifier is configured to run on CUDA:

clf_factory = TransformerBasedClassificationFactory(TRANSFORMER_MODEL,
                                                        num_classes,
                                                        kwargs=dict({
                                                            'device': 'cuda'
                                                        }))

The failure is caused by a mismatch between the model's embedding size and the tokenizer's enlarged vocabulary.
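
For illustration, the underlying failure can presumably be reduced to an out-of-range embedding lookup (a minimal plain-PyTorch sketch, not small-text code; the vocab size corresponds to bert-base-uncased):

import torch

# The embedding is sized for the original vocabulary, but add_special_tokens()
# produces token ids >= vocab_size. On CUDA this surfaces as a device-side
# assert; on CPU the same lookup raises a plain IndexError.
vocab_size = 30522
embedding = torch.nn.Embedding(vocab_size, 768).to('cuda')
input_ids = torch.tensor([[101, 30522, 102]], device='cuda')  # 30522 = id of a newly added token
embedding(input_ids)  # RuntimeError: CUDA error: device-side assert triggered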

Expected behavior

The model should adjust its embedding matrix to the new vocabulary size automatically.

Workaround:

In small_text/integrations/transformers/utils/classification.py, in the function _initialize_transformer_components, change the following

model = AutoModelForSequenceClassification.from_pretrained(
        transformer_model.model,
        from_tf=False,
        config=config,
        cache_dir=cache_dir,
        force_download=from_pretrained_options.force_download,
        local_files_only=from_pretrained_options.local_files_only
    )

to

model = AutoModelForSequenceClassification.from_pretrained(
        transformer_model.model,
        from_tf=False,
        config=config,
        cache_dir=cache_dir,
        force_download=from_pretrained_options.force_download,
        local_files_only=from_pretrained_options.local_files_only
    )
    model.resize_token_embeddings(new_num_tokens=NEW_VOCAB_SIZE)

i.e., adding the final line. The new vocab size has to be hard-coded here because the customised tokenizer is not accessible in this function. If the tokenizer were accessible, the final line could simply be model.resize_token_embeddings(new_num_tokens=len(tokenizer)).
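
For reference, the equivalent fix outside of small-text, where the tokenizer is at hand, would look roughly like this (model name and label count are placeholders):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Grow the input embedding matrix so the newly added token ids become valid indices.
model.resize_token_embeddings(len(tokenizer))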

Environment:

Python version: 3.11.7
small-text version: 1.3.3
small-text integrations (e.g., transformers): transformers 4.36.2
PyTorch version: 2.1.2
PyTorch-cuda: 11.8

RaymondUoE added the bug label on Jan 25, 2024
@chschroeder
Contributor

Thanks for reporting this! I will look into it.

@chschroeder
Contributor

@RaymondUoE With just the additional tokenizer.add_special_tokens() call, I cannot reproduce the error. Can you provide details on the assertion output?
