Implement the ModernBert model #459
base: main
Conversation
candle::bail!("`splade` is not supported for ModernBert")
}

if pool == Pool::LastToken {
Should be implemented below, right?
Thanks for pointing this out! I had mistakenly disabled support for LastToken pooling even though it was already implemented. I've removed the line blocking it, so LastToken pooling can be used again.
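For context, here is a minimal sketch of what the pooling dispatch looks like once that line is removed; the `Pool` enum mirrors TEI's variants, but the function and error type are simplified assumptions, not the PR's exact code.

```rust
// Minimal sketch, not the PR's exact code: pooling dispatch for ModernBert
// after the line blocking LastToken was removed. `Pool` mirrors TEI's enum;
// the error type is simplified to keep the example self-contained.
enum Pool {
    Cls,
    Mean,
    LastToken,
    Splade,
}

fn check_pool_support(pool: &Pool) -> Result<(), String> {
    match pool {
        // Splade remains unsupported for ModernBert.
        Pool::Splade => Err("`splade` is not supported for ModernBert".to_string()),
        // Cls, Mean, and LastToken pooling are all implemented.
        Pool::Cls | Pool::Mean | Pool::LastToken => Ok(()),
    }
}
```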
FYI, there is now https://huggingface.co/nomic-ai/modernbert-embed-base.
FYI, running the nomic/modernbert-base model yields an error, as the safetensors are not under model.embeddings.* but embeddings.*
Thanks! I've just worked on supporting it. I also appreciate your offer of GPU support! I currently have a lot on my plate, so I'll reach out to you later :) Anyway, thanks again for your support.
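For illustration, a hedged sketch of one way to support both tensor layouts with candle's `VarBuilder`; the probed tensor name and the helper function are assumptions, not necessarily what the fix does.

```rust
use candle_nn::VarBuilder;

// Hedged sketch: choose the embeddings prefix depending on whether the
// checkpoint stores tensors under `model.embeddings.*` or plain
// `embeddings.*`. The probed tensor name is an assumption for illustration.
fn embeddings_prefix<'a>(vb: &VarBuilder<'a>) -> VarBuilder<'a> {
    if vb.contains_tensor("model.embeddings.tok_embeddings.weight") {
        vb.pp("model.embeddings")
    } else {
        vb.pp("embeddings")
    }
}
```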
Is it also supported in the same architecture: https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE ?
It looks like it uses custom tokenizing logic that relies on multiple tokenizers and picks one on the fly depending on the input text. The architecture in and of itself is supported, but it would be hard to use with TEI, I guess.
I have found another fine-tune from them which is specifically for German (DE), https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-DE/blob/main/config.json, but I am having an issue, as their config says pad_token_id is null. I tried to follow through your implementation, but this is where I got stuck: the model expects a pad_token_id.
It seems like it uses a config with pad_token_id set to null. You should fill in missing configs with proper values in the model's config.json.
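As an illustration, a hedged sketch of how a config loader could tolerate `pad_token_id: null`, assuming a serde-based config struct; the field set and the fallback id of 0 are assumptions, not TEI's actual code.

```rust
use serde::Deserialize;

// Hedged sketch: deserialize `pad_token_id` as an Option so a `null` in
// config.json does not break parsing, then fall back to a default id.
// The fallback value 0 is an assumption for illustration.
#[derive(Debug, Deserialize)]
struct ModernBertConfig {
    pad_token_id: Option<u32>,
    // ...other fields elided for brevity
}

fn pad_token_id(config: &ModernBertConfig) -> u32 {
    config.pad_token_id.unwrap_or(0)
}
```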
Thank you. I was able to run nomic-ai/modernbert-base following your instructions. The other fine-tuned one I mentioned already had some changes as you suggested, but it still seems to struggle with longer text (more than 128 tokens). I wrote to them directly.
Great to hear! If you encounter an issue, please let me know.

--- Updated: I just fixed the bug in 63c4224, could you please test with the latest commit?
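Since ModernBERT alternates global attention with a 128-token local window, trouble beyond 128 tokens usually points at the local-attention mask. Below is a hedged sketch of a sliding-window mask for intuition only; it is not the actual fix in 63c4224, and the symmetric-window convention is an assumption.

```rust
// Hedged sketch of a sliding-window (local) attention mask of the kind
// ModernBERT alternates with global attention. Illustrative only, not the
// actual fix in 63c4224; the symmetric-window convention is an assumption.
fn local_attention_mask(seq_len: usize, window: usize) -> Vec<f32> {
    // 0.0 where attention is allowed, -inf where it is masked out.
    let mut mask = vec![f32::NEG_INFINITY; seq_len * seq_len];
    let half = window / 2;
    for i in 0..seq_len {
        let lo = i.saturating_sub(half);
        let hi = (i + half).min(seq_len - 1);
        for j in lo..=hi {
            mask[i * seq_len + j] = 0.0;
        }
    }
    mask
}
```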
What does this PR do?
Close #457
- Bump the `tokenizer` crate from `0.19.1` to `0.21.0` to address a `ModernBert` tokenizer issue.
- Implement the `ModernBert` model.
- `ModernBert` uses local attention. However, I'm unfamiliar with `candle_flash_attn` and don't have any GPU to test FA2 w/ local attn, so the `FlashModernBert` implementation remains unsupported at this time.

`ModernBert` Log
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@OlivierDehaene OR @Narsil