Improve download statistics #627
I think 2. is a good idea to speed up the website (please specify some details for implementation like desired file size and extension), but what became of the idea to track download counts through the core libraries with a third party service?
But from https://help.zenodo.org/faq/#statistics I gather that we wouldn't have to divide the unique download count by the number of files, as all files downloaded within a one-hour window count as one unique download. So maybe we just take that in combination with cover image caching and call it good enough for now?
Yes, using the core library could be more accurate. However, I don't like the fact that it requires every client (core, the packager, deepimagej, AVIA, qupath etc.) to actively report downloads. It would make the statistics system much harder to maintain: every update would need to propagate to all the clients, which are implemented in different languages, and if things go wrong, we lose the download count. There might also be GDPR implications for collecting such data. If we can get the download statistics from Zenodo directly, it will be much more maintainable and effortless.
Yes, if that's the case, that would be great! Otherwise, I don't think we need a perfect solution. If we manage to improve on our current method, we can put a disclaimer next to these statistics to make it transparent where the error can come from; that should be good enough for now. For the cover image: the actual display size is W: 296, H: 167. I am thinking that if we make it about twice that, say 600x340, it should be good enough.
Then we can use PNG as the format.
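As a sketch of the resizing step discussed above: a small pure-Python helper that fits a cover image's dimensions into the proposed 600x340 box while preserving aspect ratio (the function name and the never-upscale choice are assumptions, not the actual implementation):

```python
def fit_within(width, height, max_w=600, max_h=340):
    """Scale (width, height) to fit inside a max_w x max_h box, preserving aspect ratio.

    Hypothetical helper; assumes we never upscale small covers.
    """
    scale = min(max_w / width, max_h / height, 1.0)  # 1.0 caps upscaling
    return round(width * scale), round(height * scale)
```

The resulting dimensions could then be passed to whatever image library generates the PNG thumbnails.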
Yes, definitely 👍 Alright, I'll implement it in the coming weeks.
Hi @FynnBe Would you have some time to look into this one? We are gathering some download numbers for a report; it would be great if we could access more realistic download numbers.
We are using zenodo's
Great! The numbers still seem a bit too high; fixing the caching of the cover image is definitely the way to go. But for now, would the numbers be more realistic if we used the download volume divided by the total file size? That would also fix the existing stats for the models. What do you think?
Sure, I'll calculate that for all current download counts and add it as an offset to Zenodo's reported download count.
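The volume-based estimate described here can be sketched as a one-liner: divide the reported download volume by the deposit's total size (function name is hypothetical; the real offset calculation may differ):

```python
def estimate_downloads(download_volume_bytes, deposit_size_bytes):
    """Estimate the real download count from Zenodo's reported download volume.

    Assumes one "real" download transfers roughly the full deposit once.
    """
    if deposit_size_bytes <= 0:
        return 0  # guard against empty or unknown deposits
    return round(download_volume_bytes / deposit_size_bytes)
```

For example, a deposit of 42.8 MB with 428 MB of reported volume would be estimated at 10 downloads.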
@FynnBe This is fantastic! Thanks a lot!
Still need to add caching soon... currently we create thumbnails on every CI run.
@FynnBe This is how it looks today: from what I can see, some models didn't change while some others increased their numbers. What do you think?
That's very good. So bioimage.io visits do not trigger download count bumps anymore...
Opened a PR to fix the broken deepimagej manifest: deepimagej/models#51
Cross reference: bioimage-io/bioimage.io#353
@FynnBe It looks like we are counting the CI downloads: in the history, every model increases its downloads by 3: f8b4156. Is this because we downloaded the yaml file? Based on the definition of a unique download, downloading any of the files in the deposit within a one-hour window counts as one unique download -- this is different from what we want. Maybe it's better to use

Take the model "3D UNet Arabidopsis Apical" for example: https://bioimage.io/#/?id=10.5281%2Fzenodo.6346511 The pytorch package size is 42.8 MB, while the size of the two cover images is 0.8MB and the yaml file 3KB. This means we would need to download the two cover images 52 times, or the RDF yaml file 14,758 times, for the volume to add up to 1 full download.
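The ratio reasoning above can be made explicit: by volume, many downloads of a small file are equivalent to one full-package download (helper name is hypothetical; this just restates the arithmetic in the comment):

```python
def equivalent_downloads(package_bytes, file_bytes):
    """How many downloads of a small file match one full-package download by volume.

    Illustrative only; the exact figures quoted in the thread may use
    slightly different size estimates.
    """
    return package_bytes / file_bytes
```

With the sizes above, `equivalent_downloads(42_800_000, 800_000)` comes out near the ~52 cover-image downloads quoted in the comment.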
The issue with the download volume is the cases where two or more weight formats are specified. We encourage downloads with the preferred weight format only, so each real download would only count as approx. 1/2 or 1/3... Let us update the bioimageio packages to check the "CI" env var and adapt our requests accordingly...
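The proposed CI check could look like the following sketch: most CI providers (GitHub Actions among them) set a `CI` environment variable by convention, so a client library can detect it before making requests (the function name and accepted values are assumptions):

```python
import os

def running_in_ci(environ=None):
    """Detect a CI environment via the conventional "CI" variable.

    GitHub Actions sets CI=true; other providers use similar values.
    """
    environ = os.environ if environ is None else environ
    return environ.get("CI", "").lower() in ("1", "true", "yes")
```

A client could then skip downloads entirely in CI, or tag its requests so they can be filtered out of the statistics.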
@FynnBe I think the issue is that we cannot rely on no file being downloaded from the deposit. I just checked: our yaml file contains the Zenodo links to the cover images and the test inputs/outputs. This means that when a user opens a model card or does a test run, files are pulled from Zenodo, which increases the download number. Maybe we should replace the cover image link within the rdf file too, so that opening the model card won't count as a unique download.

For a more conservative download count, maybe a model's download number should be computed from the volume. Or present it as is, saying

We can perhaps use the unique download number, and rename it to

Thanks for the CI configuration, that helps a lot. What do you think?
the collection.json that the website should use does not contain any zenodo links:
We could, but for the website you may as well read
I think we'll manage to make the unique download count a meaningful measure. Once our website, our CIs, and the core Python packages (and Java libraries) all account for use in CI, this should be the best measure. Until then we can update the offset by the estimate from download volume. But a rogue CI using Zenodo links directly would also increase the download volume more than we want any CI to do... So we can only encourage users and developers to interact with the model zoo through our libraries, which set the "User-Agent" accordingly...
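Setting an identifying "User-Agent" as suggested above could be sketched with the standard library like this (the header format, function name, and default client string are assumptions; the real libraries may choose a different convention):

```python
import urllib.request

def zoo_request(url, client="bioimageio.core", version="0.0.0", in_ci=False):
    """Build a request whose User-Agent identifies the client library and CI use.

    Hypothetical header convention for filtering library/CI traffic in stats.
    """
    agent = f"{client}/{version}"
    if in_ci:
        agent += " (CI)"  # lets the statistics backend discount CI traffic
    return urllib.request.Request(url, headers={"User-Agent": agent})
```

The statistics side could then filter or discount requests whose agent string marks them as CI traffic.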
That's a good proposal 👍 (Then we also don't need to worry about caching test inputs/outputs)
We are getting there...
In order to improve the statistics on the website, maybe a quicker way is to normalize the current numbers by the number of files. A major reason for the exaggerated download numbers for the models is that Zenodo adds up the download counts for all the files in one deposit. There are two things we can do to make it more realistic:
@FynnBe what do you think?
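The quick normalization proposed here is just a division by the file count (function name is hypothetical, and the assumption that each real download fetched every file once is a rough one, as the thread's later discussion of weight formats shows):

```python
def normalized_downloads(total_count, n_files):
    """Normalize a deposit's summed per-file download count by its number of files.

    Rough correction assuming each real download fetched every file once.
    """
    if n_files <= 0:
        raise ValueError("deposit must contain at least one file")
    return total_count / n_files
```

For a deposit with 3 files and a summed count of 300, this would report 100 downloads.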