Using CLIP ViT-B-32, getting errors about invalid input dimensions #154
Hi! I'm busy trying to use the CLIP ViT-B-32 ONNX model, which I found in https://github.com/Lednik7/CLIP-ONNX. I can get the model to work, but only if I provide it exactly 77 tokens. I'm hoping someone can help me figure out how to get it to work with an arbitrary number of tokens. Here's the code that works, but I've had to make the input string exactly 77 tokens long:

```rust
use instant_clip_tokenizer::{Token, Tokenizer};
use ndarray::{Array1, Array2, Axis};
use ort::{inputs, GraphOptimizationLevel, Session};

pub fn load_text_model() -> ort::Result<()> {
    // The ONNX file has already been saved to models/textual.onnx
    let text_model = Session::builder()?
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        .with_intra_threads(1)?
        .with_model_from_file("models/textual.onnx")?;

    // The tokenizer comes from
    // https://docs.rs/instant-clip-tokenizer/0.1.0/instant_clip_tokenizer
    let tokenizer = Tokenizer::new();

    // See `tokenize(...)` below. The string I give here is just a dummy piece of text that
    // ends up being 77 tokens long.
    let tokens = tokenize(tokenizer, "Hi there my name is john and I like to walk in the park with my son and daughter. when we go walking in the sun I like to feel it warm my neck and I like to hold their hands as they tell me about their day. sometimes they have had a poor day and it makes me sad to hear about their poor day but other times I hear about")
        .iter()
        .map(|tk| *tk as i64)
        .collect::<Vec<_>>();
    let mut tokens = Array1::from_iter(tokens);

    // Preprocess the tokens into the right shape
    let array = tokens.view().insert_axis(Axis(0));
    let inputs = inputs!["input" => array]?;

    // Pass the inputs through the model
    let model_output = text_model.run(inputs)?;

    // Extract the embedding from the model
    let outputs = model_output["output"].extract_tensor::<f32>()?;

    // This tensor is correct, I've verified it with a Python CLIP model
    println!("Output Tensor: {:?}", outputs);
    Ok(())
}

fn tokenize(tokenizer: Tokenizer, text: &str) -> Vec<u16> {
    let mut tokens = vec![tokenizer.start_of_text()];
    tokenizer.encode(text, &mut tokens);
    tokens.push(tokenizer.end_of_text());
    tokens.into_iter().map(Token::to_u16).collect()
}
```

If I change the string to be a bit shorter or a bit longer:

```rust
...
let tokens = tokenize(tokenizer, "short string")
...
```

Then I get this error:
I know this isn't quite the right place to ask, because this seems like an ONNX issue, but I'm hoping you can point me in the right direction.
Replies: 1 comment 5 replies
Use `Tokenizer` from the `tokenizers` crate, which supports padding. See here for an example: https://github.com/pykeio/diffusers/blob/67158927fb847ebfb63986c14b31fdbb6a2569e7/src/clip.rs#L40-L54
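For what it's worth, a minimal sketch of that approach (the `models/tokenizer.json` path is hypothetical; it assumes a CLIP tokenizer exported from Hugging Face, and that the ONNX export expects zero-padding out to 77 tokens, as the original CLIP tokenizer produces):

```rust
use tokenizers::{PaddingParams, PaddingStrategy, Tokenizer};

/// Encode `text` and pad the ids out to the 77 tokens the CLIP text model expects.
fn tokenize_padded(text: &str) -> tokenizers::Result<Vec<i64>> {
    // Hypothetical path: a CLIP tokenizer.json exported from Hugging Face
    let mut tokenizer = Tokenizer::from_file("models/tokenizer.json")?;
    // Pad every encoding to a fixed length of 77; pad_id defaults to 0,
    // which is what the original CLIP tokenizer uses after the end-of-text token
    tokenizer.with_padding(Some(PaddingParams {
        strategy: PaddingStrategy::Fixed(77),
        ..Default::default()
    }));
    let encoding = tokenizer.encode(text, true)?;
    Ok(encoding.get_ids().iter().map(|&id| id as i64).collect())
}
```

You'd still want truncation for inputs longer than 77 tokens (the `tokenizers` crate supports that as well), and the resulting `Vec<i64>` can be fed into the same `Array1`/`insert_axis` preprocessing as in the question. Note the pad id has to match whatever the ONNX export was made with.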