Skip to content

Conversation

@TingquanGao
Copy link
Contributor

No description provided.

docsUrl: "https://www.paddleocr.ai/",
snippets: snippets.paddleocr,
filter: true,
countDownloads: `path_extension:"safetensors" OR path_extension:"pdiparams"`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reminder. We will reconsider how to implement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I have updated the codes. It now only counts file whose full path is exactly inference.pdiparams, instead of ending with the pdiparams suffix.
Could you please review it again? If there are any issues or suggestions, please feel free to let me know and I will make further improvements.

Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for iterating! But I think this change will still result in the same situation described in my previous comment. Sorry my message was not clear, let me try to summarize:

  • If the user only downloads PaddleOCR-VL-0.9B and runs it, then the download count will be incremented once.
  • If they use the full pipeline to run both PP-DocLayoutV2 and PaddleOCR-VL-0.9B one after the other, then the download count will be incremented twice: one for each model. This is because files from both directories will match the rule.

If PP-DocLayoutV2 is never intended to be used without PaddleOCR-VL-0.9B, then perhaps it would make sense to add a rule for just PaddleOCR-VL-0.9B. But I don't know if that's the case, or if PP-DocLayoutV2 may be used independently. I'm pointing this out just to give you information about how to interpret the download counts; depending on the use cases you support, you may want to restrict counts to one of the model, or accept duplicate counts when both are used as part of the same pipeline.

As discussed offline, we think that using one repo per model is usually clearer and provides a lot more benefits in addition to cleaner download counts. For example, this repo is a copy of your PaddleOCR-VL-0.9B directory, and it can easily be used with the from_pretrained API as shown in the snippet I added to the model card. It also conveys useful metadata: it's an image-text-to-text model, it works with transformers, but it requires custom code to do so (as indicated by the custom_code tag, which was automatically inferred). It provides some starter code snippets when users click on the "Use this model" button, which are also inferred automatically based on the model characteristics.

If you need more time to work on a single model-per-repo structure, that's totally ok - we can merge this PR meanwhile, but keep in mind what the download rule means, and if possible try to make it match the use cases your models support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comments. Please allow me to clarify our plan:

  1. Both the PaddleOCR-VL-0.9B and PP-DocLayoutV2 models can be used independently. Therefore, we agree with your suggestion that the two models should be stored in separate repositories.
  2. The PaddleOCR-VL-0.9B model will continue to be hosted in the PaddlePaddle/PaddleOCR-VL repository. However, we will move the relevant files to the root directory of the repository and remove the PaddleOCR-VL-0.9B subdirectory. This change will be completed as soon as possible, most likely by tomorrow.
  3. As for the PP-DocLayoutV2 model currently present in the PaddlePaddle/PaddleOCR-VL repository, we will remove it at an appropriate time in the future. After that, the PP-DocLayoutV2 model will only be available in the PaddlePaddle/PP-DocLayoutV2 repository.

Given the plan described above, would it be possible to proceed with merging the current code?

Thank you very much for your patience, guidance, and suggestions. If there is anything inappropriate in our approach, please feel free to point it out.

@TingquanGao TingquanGao changed the title update countDownloads for PaddleOCR [WIP] update countDownloads for PaddleOCR Oct 20, 2025
@TingquanGao TingquanGao changed the title [WIP] update countDownloads for PaddleOCR update countDownloads for PaddleOCR Oct 20, 2025
Copy link
Member

@pcuenca pcuenca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine for me, cc @Wauplin in case there's more feedback

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic seems good to me! Thanks a lot @TingquanGao @pcuenca for the discussions. Also agree splitting the models in 2 repos is the right move.

@Wauplin Wauplin merged commit 1237054 into huggingface:main Oct 21, 2025
2 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants