
Improve download statistics #627

Open
oeway opened this issue Aug 23, 2023 · 18 comments · Fixed by #633
@oeway
Contributor

oeway commented Aug 23, 2023

To improve the statistics on the website, a quick win might be to normalize the current numbers by the number of files. A major reason the download numbers for the models are exaggerated is that Zenodo adds up the download counts of all the files in one deposit. There are two things we can do to make them more realistic:

  1. We can divide the total downloaded volume by the actual total size of the files. Alternatively (likely less accurate), divide the unique download number by the number of files in the deposit (see more info here: https://help.zenodo.org/faq/#statistics). Note that CI downloads should already be filtered out.
  2. Cache the cover images in gh-pages, i.e. fetch each cover image and resize it to a small thumbnail, so the website does not pull images from Zenodo every time it is opened. This will also make the website faster.
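
Point 1 could be sketched roughly as follows (a sketch only; the function and the exact record field names are assumptions modeled on Zenodo's records API, which exposes per-file sizes and aggregate usage stats):

```python
def estimated_downloads(record: dict) -> float:
    """Estimate real downloads as total served volume / total deposit size.

    `record` is assumed to look like a Zenodo record with `files` (each
    carrying a `size` in bytes) and `stats` (with `volume`, the total
    bytes served) -- the field names are assumptions, not verified here.
    """
    total_size = sum(f["size"] for f in record["files"])
    if total_size == 0:
        return 0.0
    return record["stats"]["volume"] / total_size
```

For a deposit of ~40.8 MB that served ~408 MB in total, this would report roughly 10 downloads regardless of how the traffic was split across files.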

@FynnBe what do you think?

@FynnBe
Member

FynnBe commented Aug 24, 2023

I think 2. is a good idea to speed up the website (please specify some details for implementation like desired file size and extension), but what became of the idea to track download counts through the core libraries with a third party service?
Otherwise I'm thinking

  • normalize by size: if someone specifies more than one set of weights, the download volume might be skewed due to selective weights-format downloads. Almost a disincentive to upload multiple weights formats.
  • normalize by file count: same issue; a well-documented model with multiple cover images, and maybe multiple input tensors, gets undercounted.

But from https://help.zenodo.org/faq/#statistics I take that we wouldn't have to divide the unique download count by the number of files as all files downloaded within a 1 hour window count as one unique download. So maybe we just take that in combination with cover image caching and call it good enough for now?

@oeway
Contributor Author

oeway commented Aug 24, 2023

I think 2. is a good idea to speed up the website (please specify some details for implementation like desired file size and extension), but what became of the idea to track download counts through the core libraries with a third party service?

Yes, using the core library could be more accurate; however, I don't like the fact that it would require every client (core, the packager, deepimagej, AVIA, qupath, etc.) to actively report downloads. It would make the statistics system much harder to maintain: every update would need to propagate to all the clients implemented in different languages, and if things go wrong, we lose the download count. There might also be GDPR implications for collecting such data. If we can get the download statistics from Zenodo directly, it will be much more maintainable and effortless.

Otherwise I'm thinking

  • normalize by size: if someone specifies more than one set of weights, the download volume might be skewed due to selective weights-format downloads. Almost a disincentive to upload multiple weights formats.
  • normalize by file count: same issue; a well-documented model with multiple cover images, and maybe multiple input tensors, gets undercounted.

But from https://help.zenodo.org/faq/#statistics I take that we wouldn't have to divide the unique download count by the number of files as all files downloaded within a 1 hour window count as one unique download. So maybe we just take that in combination with cover image caching and call it good enough for now?

Yes, if that's the case, that would be great! Otherwise, we don't need a perfect solution: as long as we manage to improve on the current method, we can add a disclaimer when displaying these statistics to make it transparent where the error can come from. That should be good enough for now.

For the cover image: the displayed size is W: 296, H: 167. I am thinking that twice that, say 600x340, should be good enough:

  • When the image is smaller than that, keep its original size
  • If the aspect ratio does not match, keep the original aspect ratio and make sure both W and H do not exceed 600x340.

Then we can use PNG as the format.
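
The sizing rules above (never upscale, preserve the aspect ratio, fit within 600x340) boil down to a single scale factor. A minimal sketch with a hypothetical helper (with Pillow, `Image.thumbnail((600, 340))` behaves the same way and the result can then be saved as PNG):

```python
def thumbnail_size(w: int, h: int, max_w: int = 600, max_h: int = 340) -> tuple:
    """Target (width, height) for a cover thumbnail: scale down to fit
    within max_w x max_h, keep the aspect ratio, and never enlarge an
    image that is already small enough."""
    scale = min(max_w / w, max_h / h, 1.0)
    return (round(w * scale), round(h * scale))
```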

@FynnBe
Member

FynnBe commented Aug 24, 2023

If we can get the download statistics from Zenodo directly, it will be much more maintainable and effortless.

Yes, definitely 👍

Alright, I'll implement it in the coming weeks.

@FynnBe FynnBe self-assigned this Aug 24, 2023
@oeway
Contributor Author

oeway commented Sep 11, 2023

Hi @FynnBe, would you have some time to look into this one? We are gathering download numbers for a report, and it would be great if we could access more realistic figures.

@FynnBe
Member

FynnBe commented Sep 12, 2023

We are using Zenodo's unique_downloads count already:

download_count = int(hit["stats"]["unique_downloads"])

@oeway
Contributor Author

oeway commented Sep 12, 2023

Great! The numbers still seem a bit too high; fixing the caching of cover images is definitely the way to go. But for now, would the numbers be more realistic if we used the download volume divided by the total file size? That would also fix the existing stats for the models. What do you think?

@FynnBe
Member

FynnBe commented Sep 12, 2023

Sure, I'll calculate that for all current download counts and add it as an offset to Zenodo's reported download count.
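
That one-time offset correction could look roughly like this (all names hypothetical, a sketch rather than the actual implementation): replace the historical unique-download count with the volume-based estimate once, then let Zenodo's counter continue from there.

```python
def corrected_count(volume_bytes: float, total_size_bytes: int,
                    unique_at_switch: int, unique_now: int) -> int:
    """One-time offset: the volume-based estimate of past downloads
    replaces the inflated historical unique count, and new unique
    downloads accumulate on top of it."""
    offset = round(volume_bytes / total_size_bytes) - unique_at_switch
    return unique_now + offset
```

E.g. 500 MB served for a 50 MB deposit estimates 10 past downloads; if the unique count was 40 at the switch and is 45 now, the corrected count is 15.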

@oeway
Contributor Author

oeway commented Sep 12, 2023

@FynnBe This is fantastic! Thanks a lot!

@FynnBe
Member

FynnBe commented Sep 12, 2023

Still need to add caching soon... currently we create thumbnails on every CI run

@oeway
Contributor Author

oeway commented Sep 13, 2023

For the record, I took a screenshot of the current download numbers for page 3.

Let's check how the numbers change tomorrow. One thing to verify is whether the CI increases the download count.

[Screenshot 2023-09-13 at 17:31:16]

@oeway
Contributor Author

oeway commented Sep 18, 2023

@FynnBe This is how it looks today:
[Screenshot 2023-09-18 at 22:24:12]

From what I can see, some models' counts didn't change while others increased. What do you think?

@FynnBe
Member

FynnBe commented Sep 18, 2023

that's very good. so bioimage.io visits do not trigger download count bumps anymore...
we should keep in mind though that the collection CI has been failing these last few days: https://github.com/bioimage-io/collection-bioimage-io/actions/workflows/auto_update_main.yaml
I'll take a look into that...

@FynnBe FynnBe reopened this Sep 18, 2023
@FynnBe
Member

FynnBe commented Sep 18, 2023

Opened a PR to fix the broken deepimagej manifest: deepimagej/models#51
Moving forward, #635 should avoid the CI bumping download counts multiple times, for now.

@oeway
Contributor Author

oeway commented Sep 19, 2023

Cross reference: bioimage-io/bioimage.io#353

@oeway
Contributor Author

oeway commented Sep 25, 2023

@FynnBe It looks like we are counting the CI downloads: in the history, every model's download count increases by 3 (f8b4156)

Is this because we downloaded the yaml file? Based on the definition of unique downloads, downloading any file in the deposit within a 1-hour window counts as one unique download -- which is different from what we want.

Maybe it's better to use the total volume / total size after all.

The model "3D UNet Arabidopsis Apical " for example: https://bioimage.io/#/?id=10.5281%2Fzenodo.6346511

The pytorch package size is 42.8 MB, while the two cover images are 0.8 MB and the yaml file 3 KB; this means the two cover images would need to be downloaded 52 times, or the RDF yaml file 14,758 times, to count as 1 download.
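
The arithmetic above generalizes: under volume-based counting, each file fetch contributes only its share of the deposit's total size, so small files (the RDF, the covers) barely move the count. A toy sketch with hypothetical names:

```python
def volume_weighted_downloads(downloads: dict, sizes: dict) -> float:
    """Count downloads weighted by file size: one fetch of a file
    contributes size(file) / total_deposit_size of a 'full' download."""
    total = sum(sizes.values())
    return sum(n * sizes[f] for f, n in downloads.items()) / total
```

With sizes {"weights": 90, "cover": 9, "rdf": 1}, one weights download plus ten RDF fetches counts as exactly 1.0, while five RDF fetches alone count as 0.05.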

@FynnBe
Member

FynnBe commented Sep 26, 2023

@FynnBe It looks like we are counting the CI downloads: in the history, every model's download count increases by 3 (f8b4156)

Is this because we downloaded the yaml file? Based on the definition of unique downloads, downloading any file in the deposit within a 1-hour window counts as one unique download -- which is different from what we want.

Maybe it's better to use the total volume / total size after all.

The model "3D UNet Arabidopsis Apical " for example: https://bioimage.io/#/?id=10.5281%2Fzenodo.6346511

The pytorch package size is 42.8 MB, while the two cover images are 0.8 MB and the yaml file 3 KB; this means the two cover images would need to be downloaded 52 times, or the RDF yaml file 14,758 times, to count as 1 download.

The issue with the download volume is the case where two or more weight formats are specified. We encourage downloading only the preferred weights format, so each real download would only count as approx. 1/2 or 1/3...

Let us update the bioimageio packages to check the "CI" env var and adapt our requests accordingly...
The factor 3 is weird though... I'll try to investigate; nothing should request all RDFs from Zenodo. Our CI only downloads them during PRs and on bioimageio version bumps. (So it might be others' CI that does the regular bumping; we need to amend the request headers.)
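
Checking the "CI" env var could look like this (a sketch; the header scheme is an assumption, though the `CI` variable is set by GitHub Actions and most CI providers):

```python
import os

def user_agent(env=None) -> str:
    """Build a User-Agent that flags CI traffic so it can be excluded
    from download statistics (hypothetical scheme, not the actual one)."""
    env = os.environ if env is None else env
    base = "bioimageio.core"
    return f"{base} (CI)" if env.get("CI") else base
```

A client would then pass `headers={"User-Agent": user_agent()}` with each request to Zenodo.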

@oeway
Contributor Author

oeway commented Sep 27, 2023

@FynnBe I think the issue is that we cannot rely on no file being downloaded from the deposit. I just checked: our yaml file contains the Zenodo links to the cover images and test inputs/outputs, which means that when a user opens a model card or does a test run, it pulls from Zenodo and thus increases the download number.

Maybe we should replace the cover image link within the RDF file too, so that opening the model card won't count as a unique download.

For a more conservative download count, maybe a model's download number should be computed from the volume. Or present it as is, labeled as a download count by volume; and when we compute the total size, we can average the weights file sizes to get closer to reality.

We can perhaps use the unique download number and rename it to a user interaction count, so any interaction, either a download or a test run on the input/output, counts as 1 user interaction.

Thanks for the CI configuration, that helps a lot.

What do you think?

@FynnBe
Member

FynnBe commented Sep 27, 2023

@FynnBe I think the issue is that we cannot rely on no file being downloaded from the deposit. I just checked: our yaml file contains the Zenodo links to the cover images and test inputs/outputs, which means that when a user opens a model card or does a test run, it pulls from Zenodo and thus increases the download number.

the collection.json that the website should use does not contain any zenodo links:
https://github.com/bioimage-io/collection-bioimage-io/blob/gh-pages/collection.json
But the collection.json does not contain links to test inputs/outputs... (we could opt to cache them just like the cover images though... however at some point that might be a lot to deploy to gh-pages...)

Maybe we should replace the cover image link within the RDF file too, so that opening the model card won't count as a unique download.

We could, but for the website you may as well read the covers from the collection.json file.
I'd like to keep the reference to the original, non-resized cover stored at Zenodo in the RDF and have the thumbnail in collection.json; or is the thumbnail not big enough for the expanded model card?

For a more conservative download count, maybe a model's download number should be computed from the volume. Or present it as is, labeled as a download count by volume; and when we compute the total size, we can average the weights file sizes to get closer to reality.

I think we'll manage to make the unique download count a meaningful measure. Once our website, our CIs, and the core Python packages (and Java libraries) all account for use in CI, this should be the best measure; until then we can update the offset by the download volume estimate. But a rogue CI using Zenodo links directly will also increase the download volume more than we want any CI to... So we can only encourage users and developers to interact with the model zoo through our libraries, which set the "User-Agent" accordingly...

We can perhaps use the unique download number and rename it to a user interaction count, so any interaction, either a download or a test run on the input/output, counts as 1 user interaction.

That's a good proposal 👍 (Then we also don't need to worry about caching test inputs/outputs.)

Thanks for the CI configuration, that helps a lot.

we are getting there...
