Release wasm version #152
Comments
Hi @do-me, cool project! I would definitely love to support this. Are you OK if it only supports character-based chunking? The reason is that I'd need some workarounds, or would first have to see whether tokenizer libs can even be used in wasm... If character-based is fine, then I think it should be possible. I'd also need to check whether markdown can be supported, but I guess anything is better than nothing for your use case.
Yes, absolutely! Token-based chunking is absolute overkill for my use case. However, if you'd still want to offer a way to include it for some reason, transformers.js offers a very convenient tokenizing API out of the box. See here for example: https://huggingface.co/docs/transformers.js/api/tokenizers

```js
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
const { input_ids } = await tokenizer('I love transformers!');
// Tensor {
//   data: BigInt64Array(6) [101n, 1045n, 2293n, 19081n, 999n, 102n],
//   dims: [1, 6],
//   type: 'int64',
//   size: 6,
// }
```

So if you shifted the task of calculating tokens to the user instead of including it directly in Rust/wasm, that would probably make the most sense. But again, for me it's not really necessary.
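To make that concrete, shifting the token counting to the user could look something like this on my end. Just a sketch: it assumes the wasm splitter would accept some kind of sizing callback, and `countTokens` is only an illustrative name, not anything from your API.

```js
// Sketch: wrap the transformers.js tokenizer in a plain "count tokens" function,
// so the Rust/wasm side would only ever see a number per candidate chunk.
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');

// Returns the number of tokens in a piece of text, using the same API as above.
async function countTokens(text) {
  const { input_ids } = await tokenizer(text);
  return input_ids.size; // e.g. 6 for "I love transformers!"
}
```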
For me, certainly - whatever is feasible for you. If markdown were supported, that would allow for a really great pipeline: I just discovered https://r.jina.ai/, which converts any web input to LLM-ready markdown. Pairing that tool with your performant chunking and SemanticFinder would deliver a great user experience :) A rough sketch of that pipeline is below.
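If I'm not mistaken, r.jina.ai works by prefixing the target URL, so the fetch step of that pipeline could boil down to something like this (rough sketch, example URL is a placeholder):

```js
// Sketch: any web page -> LLM-ready markdown via r.jina.ai -> (future) wasm chunker.
const response = await fetch('https://r.jina.ai/https://example.com');
const markdown = await response.text();
// `markdown` would then be handed to the chunker once the wasm build exists.
```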
Awesome. Yeah, I think I'd likely do something similar to what I have in the Python bindings and accept a callback/lambda function so the user can bring custom logic that isn't compiled in. It has the downside of making an FFI call quite often, which isn't always performant, but at least it provides the functionality. Cool, assuming the markdown crate works, I think it should be quite easy to support a wasm target for this use case. It would also enable building a playground of sorts so people can play with the effect of different chunk settings and see them visually, which is something I've been wanting to do anyway.
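Roughly the shape I have in mind, though nothing is decided yet; the package name, `TextSplitter`, and the option names below are all hypothetical, not a released API:

```js
// Purely illustrative sketch, mirroring the callback approach in the Python
// bindings. The callback crosses the JS<->wasm boundary on every size check,
// which is where the FFI overhead mentioned above comes from.
import init, { TextSplitter } from 'text-splitter-wasm'; // hypothetical package name

await init();

const splitter = new TextSplitter({
  maxSize: 512,
  // Character-based by default; a user could plug in a tokenizer-backed
  // counter here instead (e.g. the transformers.js snippet above).
  sizeFn: (chunk) => chunk.length,
});

const chunks = splitter.chunks(someMarkdownOrText);
```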
Hey Ben,
I would love to use a wasm version of text-splitter in the web application https://github.com/do-me/SemanticFinder. Currently it only supports chars, words, sentences, regex and tokens, but all of these separators are too "stiff". I found that your unicode-based approach generally works quite well, which would give users more flexibility and hopefully even better results.
Do you think you could release a wasm-compiled version for the web?