Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic (stack overflow) when encoding a certain string #245

Open
Crazytieguy opened this issue Jan 18, 2024 · 11 comments
Open

Panic (stack overflow) when encoding a certain string #245

Crazytieguy opened this issue Jan 18, 2024 · 11 comments

Comments

@Crazytieguy
Copy link

Crazytieguy commented Jan 18, 2024

Hi, I'm getting a panic when trying to encode the attached file with the gpt-4 tokenizer. This is from the AMPS dataset that was published along with the MATH dataset. Backtrace:

called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:597:5
   1: core::panicking::panic_fmt
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/result.rs:1652:5
   3: _tiktoken::CoreBPE::_encode_native
   4: _tiktoken::_::<impl _tiktoken::CoreBPE>::__pymethod_encode__
   5: pyo3::impl_::trampoline::fastcall_with_keywords
   6: _PyEval_EvalFrameDefault
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:5258:29
   7: _PyEval_EvalFrame
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_ceval.h:73:16
   8: _PyEval_Vector
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:6439:24
   9: _PyFunction_Vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/call.c:393:16
  10: _PyObject_VectorcallTstate
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_call.h:92:11
  11: method_vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/classobject.c:89:18
  12: do_call_core
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:7357:12
  13: _PyEval_EvalFrameDefault
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:5381:22
  14: _PyEval_EvalFrame
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_ceval.h:73:16
  15: _PyEval_Vector
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:6439:24
  16: _PyFunction_Vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/call.c:393:16
  17: do_call_core
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:7357:12
  18: _PyEval_EvalFrameDefault
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:5381:22
  19: _PyEval_EvalFrame
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_ceval.h:73:16
  20: _PyEval_Vector
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:6439:24
  21: _PyFunction_Vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/call.c:393:16
  22: _PyObject_VectorcallTstate
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_call.h:92:11
  23: method_vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/classobject.c:67:20
  24: thread_run
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Modules/_threadmodule.c:1092
  25: pythread_wrapper
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/thread_pthread.h:241:5
  26: <unknown>
  27: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.```
@Crazytieguy Crazytieguy changed the title Panic when encoding a certain string Panic (stack overflow) when encoding a certain string Jan 18, 2024
@l0rinc
Copy link
Contributor

l0rinc commented Feb 11, 2024

I've check every single JSON file in the MATH and AMPS pretraining datasets, all of them were passing this test.

def test_math_roundtrips():
    enc = tiktoken.get_encoding("cl100k_base")
    base_dir = '.../Downloads/amps'

    for dirpath, _, filenames in os.walk(base_dir):
        for filename in filenames:
            if filename.endswith(".json"):
                print(f'Checking ${filename}!')

                with open(os.path.join(dirpath, filename), 'rb') as f:
                    content = f.read().decode('utf-8')
                    encoded_content = enc.encode(content)
                    decoded_content = enc.decode(encoded_content)

                    assert content == decoded_content, f"Roundtrip mismatch for {filename}"

Can you please attach the file and a simple reproducer that fails?

@hauntsaninja
Copy link
Collaborator

hauntsaninja commented Feb 11, 2024

Here's a simple repro of the problem:

import tiktoken
enc = tiktoken.get_encoding("r50k_base")
enc.encode("^" * 1000000)

There's a stack overflow in the Rust regex library we use. I haven't yet gotten a chance to see if it's easy to fix the Rust (fancy-)regex library, but one workaround would be to raise a more specific exception, catch it, and fall back to using the Python regex library to split the string as done in this private method: https://github.com/openai/tiktoken/blob/main/tiktoken/core.py#L360

@l0rinc
Copy link
Contributor

l0rinc commented Feb 11, 2024

Fixed init in my recently pushed PRs for cl100k_base (see #234 and #239) - and the backported possessives quantifiers to the legacy encoding in #258

Cherrypicking the PRs on top of each other not just passes, but is quite fast as well:
image

@fedor57
Copy link

fedor57 commented Jul 11, 2024

Hi, I have the same crash trying to tokenize this web page

https://www.mathauditor.com/2147483647-in-english.html

There is a roman representation of the number 2147483647, which contains about 2,147,483 letters M in one source token

@thijs-hakkenberg
Copy link

I have the same issue. The error according to ChatGPT why we cannot catch the exception in python:
_The error message you are seeing is because a Rust panic unwinding to Python is not an actual Python exception, and thus cannot be caught by a standard Python exception handler. It completely aborts the Python interpreter.

This is an issue with the Rust-Python interoperability. A Python program can't catch Rust panics because Rust panics are designed to unwind the stack, cleaning up as they go, until they reach the application boundary, at which point the application aborts. In this case, the application boundary is the Rust-Python boundary, so the panic unwinds the Rust stack, crosses the boundary, and causes the Python interpreter to abort.

However, latest PyO3 versions have a feature that allows converting Rust panics into Python exceptions, but it's opt-in and has to be enabled in the Rust library. The Python dependency using PyO3 can then be rerun so that Rust panics will become catchable Python RuntimeError exceptions.

Unfortunately, it seems like the tiktoken library doesn't use a new enough PyO3 version or has this feature disabled, and so it doesn't convert Rust panics into Python exceptions. Therefore, you can't catch the panic in Python code and the Python interpreter aborts.

For now, you could try to ensure your code never causes a panic in tiktoken. For instance, by checking properties of the input before passing it to tiktoken methods that might cause a panic.

Otherwise, this is a pretty hard issue to work around from Python. The best way to resolve it would be to open an issue on the repository of the library causing the issue or ask the maintainer to upgrade the PyO3 dependency and enable panic conversion._

@l0rinc
Copy link
Contributor

l0rinc commented Sep 10, 2024

You could add the PRs mentioned in #245 (comment) and build a custom TikToken version that supports big tokens.
@hauntsaninja, do you think we could merge some of those PRs instead?

@fedor57
Copy link

fedor57 commented Sep 10, 2024

What I did, I tested a safe long token length about 200000 chars. Then before calling the tiktoken, I've splited the input into batches, called the tiktoken separately on each batch and concatenated token arrays. The result is still correct if you decode the tokenized string. Theoretically suboptimal for LLM use, but in practice no difference.

Before that I was kindof searching runs of 200000 without spaces to insert batch break if needed. But later I deleted the code as not being truly useful.

hauntsaninja pushed a commit that referenced this issue Oct 3, 2024
Fixes the crash in #245 by
prohibiting the regex engine from backtracking catastrophically via
[possessive
quantifiers](https://www.regular-expressions.info/possessive.html).

<img width="400" alt="image"
src="https://github.com/openai/tiktoken/assets/1841944/ed341153-4cf4-4c1c-93d6-3f5e32133569">

Interestingly these possesives make the encoding a lot faster again in
`fancy-regex`.

Before this change (but with large byte pair merge PR cherry-picked):
```
num_threads: 1, num_bytes: 98379553
tiktoken 	11,946,036 bytes / s
tiktoken 	11,961,343 bytes / s
tiktoken 	11,995,846 bytes / s
tiktoken 	11,951,263 bytes / s
tiktoken 	11,983,405 bytes / s
```
Same, with these changes applied:
```
num_threads: 1, num_bytes: 98379553
tiktoken 	14,511,827 bytes / s
tiktoken 	14,638,134 bytes / s
tiktoken 	14,644,029 bytes / s
tiktoken 	14,729,030 bytes / s
tiktoken 	14,666,903 bytes / s
```
Updating the regex libs makes it a tiny bit faster still:
```
num_threads: 1, num_bytes: 98379553
tiktoken 	14,485,590 bytes / s
tiktoken 	14,854,049 bytes / s
tiktoken 	14,891,086 bytes / s
tiktoken 	14,843,007 bytes / s
tiktoken 	14,874,520 bytes / s
```

This is almost 2x faster than [before any of the
optimizations](#234).

-------

Opened an issue for increasing the [default backtrack
limit](https://github.com/fancy-regex/fancy-regex/blob/bf2c807447f72ee20ae839e0f8cb3a06fc79982c/src/lib.rs#L407),
see: fancy-regex/fancy-regex#134, but it
shouldn't be necessary here anymore.

---------

Co-authored-by: Lőrinc <[email protected]>
@aneubeck
Copy link

Hi,
maybe our implementation https://github.com/github/rust-gems/tree/main/crates/bpe-openai is a solution to problematic inputs?
We have found a novel algorithm to the BPE problem which has linear worst-case complexity.
In our implementation, we also rewrote the problematic look-ahead regexes to make them more efficient and avoid pathological cases. Overall this lead to a pretty nice speedup.

@l0rinc
Copy link
Contributor

l0rinc commented Jan 20, 2025

@aneubeck, does your repository have any tests? Does it pass the tests from tiktoken? Also, I don’t see how a linear worst case is possible, but linearithmic seems plausible. See: #245 (comment)

@aneubeck
Copy link

aneubeck commented Jan 20, 2025

In our repository, we compared results for randomly generated input for our and tiktokens implementation (the tests and benchmarks are in the bpe crate).
We didn't run the tests from tiktoken.

Note: we are not dealing with special tokens, since we didn't have a need for them (yet). So probably not all tests would pass, since this feature is not supported.

It is linear in the input size and has some constant worst case factor which depends on the token dictionary. That factor is relatively small and importantly doesn't increase with the input size. Here is a bit more information: https://github.blog/ai-and-ml/llms/so-many-tokens-so-little-time-introducing-a-faster-more-flexible-byte-pair-tokenizer/ and more information can be found in the bpe repository.

@Ariya12138
Copy link

I met the same problem

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants
@l0rinc @aneubeck @fedor57 @hauntsaninja @Crazytieguy @Ariya12138 @thijs-hakkenberg and others