Pubtabnet's structure token annotations differ from origin #34

Open
Arlen-yuzu opened this issue Oct 27, 2024 · 1 comment

Comments

@Arlen-yuzu

In mini_pubtabnet_examples.jsonl, the structure tokens contain '[' and ']', which differs from the original annotations in PubTabNet. Could you share the Python data-processing scripts used to produce this?

@gjgjh

gjgjh commented Dec 14, 2024

This issue is the same as #30. UniTable has preprocessed the original PubTabNet dataset, using <td>[]</td> to represent a non-empty cell and <td></td> to represent an empty cell. The processed format is similar to that of mini_pubtabnet_examples.jsonl. Once you are familiar with the data format, this preprocessing should not be difficult to implement.
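
For anyone looking for a starting point, here is a minimal sketch of that preprocessing. It is not the maintainers' script. It assumes the original PubTabNet layout, where each record has `html.structure.tokens` and a parallel `html.cells` list whose entries carry a `tokens` field that is empty for empty cells. The function name `mark_nonempty_cells` and the way tokens are merged (for example `>[]</td>` for spanning cells) are assumptions, so compare the output against mini_pubtabnet_examples.jsonl and adjust the token boundaries if they differ.

```python
def mark_nonempty_cells(structure_tokens, cells):
    """Merge each cell's opening/closing tokens, adding "[]" when the cell has content.

    Hypothetical helper: the exact token segmentation is an assumption.
    """
    out = []
    cell_idx = 0
    for tok in structure_tokens:
        if tok == "</td>":
            # Each "</td>" closes the next cell in reading order.
            has_content = bool(cells[cell_idx].get("tokens"))
            cell_idx += 1
            # The previous token is "<td>" for plain cells, ">" for cells with attributes.
            opener = out.pop()
            out.append(opener + ("[]" if has_content else "") + "</td>")
        else:
            out.append(tok)
    if cell_idx != len(cells):
        raise ValueError("number of </td> tokens does not match number of cells")
    return out


# Toy record in the assumed PubTabNet layout (values are made up for illustration).
structure = ["<tr>", "<td>", "</td>", "<td>", "</td>",
             "<td", ' colspan="2"', ">", "</td>", "</tr>"]
cells = [
    {"tokens": ["A"], "bbox": [0, 0, 10, 10]},
    {"tokens": []},
    {"tokens": ["B", "C"], "bbox": [20, 0, 40, 10]},
]
result = mark_nonempty_cells(structure, cells)
print(result)
```

To process a whole file, read the annotation JSONL line by line with `json.loads`, apply the function to each record's `html.structure.tokens` and `html.cells`, and write the result back out.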
