Release wasm version #152
Comments
Hi @do-me, cool project! I would definitely love to support this. Are you OK if it only supports character-based chunking? The reason is that I'd need some workarounds, or would first have to see whether tokenizer libs can even be used in wasm... If character-based is fine, then I think it should be possible. I'd also need to check whether markdown can be supported, but I guess anything is better than nothing for your use case.
Yes, absolutely! Token-based chunking is absolute overkill for my use case. However, if you'd still want to offer a way to include it for some reason, transformers.js offers a very convenient tokenizing API out of the box. See here for example: https://huggingface.co/docs/transformers.js/api/tokenizers

```js
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
const { input_ids } = await tokenizer('I love transformers!');
// Tensor {
//   data: BigInt64Array(6) [101n, 1045n, 2293n, 19081n, 999n, 102n],
//   dims: [1, 6],
//   type: 'int64',
//   size: 6,
// }
```

So if you shifted the task of calculating tokens to the user instead of including it directly in Rust/wasm, that would probably make the most sense. But again, for me it's not really necessary.
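To make that concrete, shifting the token counting to the user could look something like this on my end. Just a sketch: it assumes the wasm splitter would accept some kind of sizing callback, and `countTokens` is only an illustrative name, not anything from your API.

```js
// Sketch: wrap the transformers.js tokenizer in a plain "count tokens" function,
// so the Rust/wasm side would only ever see a number per candidate chunk.
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');

// Returns the number of tokens in a piece of text, using the same API as above.
async function countTokens(text) {
  const { input_ids } = await tokenizer(text);
  return input_ids.size; // e.g. 6 for "I love transformers!"
}
```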
For me, certainly - whatever is feasible for you. If markdown were supported, that would allow for a really great pipeline: I just discovered https://r.jina.ai/, which converts any web input to LLM-ready markdown. Pairing that tool with your performant chunking and SemanticFinder would deliver a great user experience :) A rough sketch of that pipeline is below.
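If I'm not mistaken, r.jina.ai works by prefixing the target URL, so the fetch step of that pipeline could boil down to something like this (rough sketch, example URL is a placeholder):

```js
// Sketch: any web page -> LLM-ready markdown via r.jina.ai -> (future) wasm chunker.
const response = await fetch('https://r.jina.ai/https://example.com');
const markdown = await response.text();
// `markdown` would then be handed to the chunker once the wasm build exists.
```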
Awesome. Yeah, I think I'd likely do something similar to what I have in the Python bindings and accept a callback/lambda function so the user can bring custom logic that isn't compiled in. It has the downside of making an FFI call quite often, which isn't always performant, but at least it provides the functionality. Cool, assuming the markdown crate works, I think it should be quite easy to support a wasm target for this use case. It would also enable building a playground of sorts so people can play with the effect of different chunk settings and see them visually, which is something I've been wanting to do anyway.
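Roughly the shape I have in mind, though nothing is decided yet; the package name, `TextSplitter`, and the option names below are all hypothetical, not a released API:

```js
// Purely illustrative sketch, mirroring the callback approach in the Python
// bindings. The callback crosses the JS<->wasm boundary on every size check,
// which is where the FFI overhead mentioned above comes from.
import init, { TextSplitter } from 'text-splitter-wasm'; // hypothetical package name

await init();

const splitter = new TextSplitter({
  maxSize: 512,
  // Character-based by default; a user could plug in a tokenizer-backed
  // counter here instead (e.g. the transformers.js snippet above).
  sizeFn: (chunk) => chunk.length,
});

const chunks = splitter.chunks(someMarkdownOrText);
```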
Hey Ben,
I would love to use a wasm version of text-splitter in the web application https://github.com/do-me/SemanticFinder. Currently it only supports chars, words, sentences, regex and tokens, but all of these separators are too "stiff". I found that your unicode-based approach generally works quite well, which would give users more flexibility and hopefully even better results.
Do you think you could release a wasm-compiled version for the web?