Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix type notation of merges in BPE Python binding #1685

Closed
wants to merge 1 commit into from

Conversation

Coqueue
Copy link

@Coqueue Coqueue commented Nov 21, 2024

Heya, this change is to correct type notations of merges in Python binding of BPE as the counterpart in Rust implementation expects a file name string or Vec<(String, String)> instead of a mapping from a tuple of integers to another.

@Narsil
Copy link
Collaborator

Narsil commented Jan 8, 2025

Why do you think the annotations are incorrect ? The 2 things are unlinked and the signature here is correct.

@Narsil Narsil closed this Jan 8, 2025
@Coqueue
Copy link
Author

Coqueue commented Jan 8, 2025

Hi Nicolas, thanks for taking a look.

The 2 things are unlinked and the signature here is correct.

I wonder if this is true.
Taking the Python SentencePieceBPETokenizer class as an example, the merges annotated as Optional[Union[str, Dict[Tuple[int, int], Tuple[int, int]]]] is only used to initialize the BPE class:

if vocab is not None and merges is not None:
        tokenizer = Tokenizer(BPE(vocab, merges, dropout=dropout, unk_token=unk_token, fuse_unk=fuse_unk))

where the BPE class is generated from the Rust implementation mentioned in the original post.

BPE class also contains below comment:

class BPE(Model):
    """
    An implementation of the BPE (Byte-Pair Encoding) algorithm

    Args:
        vocab (:obj:`Dict[str, int]`, `optional`):
            A dictionary of string keys and their ids :obj:`{"am": 0,...}`

        merges (:obj:`List[Tuple[str, str]]`, `optional`):
            A list of pairs of tokens (:obj:`Tuple[str, str]`) :obj:`[("a", "b"),...]`

At least this comment gives a different type annotation for merges.

Why do you think the annotations are incorrect?

I encountered an issue initializing SentencePieceBPETokenizer with the given type annotation, but succeeded following the type annotation given in the comment of BPE class.

The type annotation of merges being Dict[Tuple[int, int], Tuple[int, int]]]] should actually be for MergesMap in Rust, as an alias of HashMap<Pair, (u32, u32)>, which is built internally in Rust from merges.

As you closed the PR as a repo collaborator, I'll provide the above context -- I'll be more than happy to learn more if I was wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants