
submission: automatically identify "pyhf" file_type of resource files #163

Closed
GraemeWatt opened this issue Nov 8, 2019 · 14 comments

@GraemeWatt
Member

Consider the pyhf JSON files attached to https://www.hepdata.net/record/ins1748602?version=1 as an additional resource:

additional_resources:
- description: Archive of full likelihoods in the HistFactory JSON format described
    in ATL-PHYS-PUB-2019-029. Provided are 3 statistical models, labeled RegionA, RegionB
    and RegionC respectively, each in their own sub-directory. For each model the background-only
    model is found in the file named 'BkgOnly.json'. For each model a set of patches
    for various signal points is provided.
  location: HEPData_workspaces.tar.gz

It would be good if we could automatically identify pyhf tarballs from the description and location of the additional_resources in the submission.yaml file, so that we can make these resource files more prominent (as we already do for links to Rivet analyses). Can we agree on some convention, e.g. a location ending in workspaces.tar.gz, that will be used for future pyhf uploads to HEPData? The code could then check for this convention and write file_type as pyhf in the dataresource table of the database.
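The proposed naming convention could be checked with a sketch along these lines (the function name and return value are illustrative, not the actual HEPData implementation):

```python
def file_type_from_location(location):
    """Return 'pyhf' if the resource location follows the proposed
    naming convention of ending in 'workspaces.tar.gz', else None."""
    if location.endswith('workspaces.tar.gz'):
        return 'pyhf'
    return None
```

For example, `file_type_from_location('HEPData_workspaces.tar.gz')` would match, so the dataresource row could be written with file_type set to pyhf.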

Cc: @lukasheinrich

@GraemeWatt
Member Author

More HEPData records have appeared with attached pyhf tarballs.

  1. https://www.hepdata.net/record/ins1765529?version=1
additional_resources:
- description: Archive of full likelihoods in the HistFactory JSON format described
    in ATL-PHYS-PUB-2019-029. In the sub-directory the statistical models SR-lowMass,
    SR-highMass and SR-combined are provided. The SR-combined is a combined fit of
    the SR-lowMass and SR-highMass. For each model the background-only model is found
    in the file named 'BkgOnly.json'. For each model a set of patches for various
    signal points is provided
  location: SUSY-2018-04_likelihoods.tar.gz
  2. https://www.hepdata.net/record/ins1771533?version=1
additional_resources:
- description: Archive of full likelihoods in the HistFactory JSON format described
     in SUSY-2018-06. The background-only fit is found
     in the file named 'BkgOnly.json'. For each model a set of patches for various signal points is
     provided
  location: likelihoods_ANA-SUSY-2018-06_3L-RJ-mimic.tar.gz

The proposed convention of a location ending in workspaces.tar.gz has clearly not been followed. Maybe we instead need to look for a string like "HistFactory JSON" in the description?

@GraemeWatt
Member Author

GraemeWatt commented Mar 19, 2020

Agreement with Louie Corpe for HEPData recommendations for ATLAS:

if location.endswith('.tar.gz') and \
   ('histfactory json' in description.lower() or 'pyhf' in description.lower()):
    file_type = 'pyhf_tarball'

whereas individual pyhf JSON files (#164) would have:

file_type = 'pyhf_json'

For individual pyhf JSON files when we provide native support, we would probably identify them by requiring a data_schema key rather than using the location or description.
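Such a check might look roughly like the sketch below. The key name data_schema is taken from the comment above; the function itself and its error handling are assumptions, not the eventual implementation.

```python
import json

def is_pyhf_json(path):
    """Heuristically identify an individual pyhf JSON file by the
    presence of a top-level 'data_schema' key (key name as discussed
    above; real pyhf workspaces may use a different marker)."""
    try:
        with open(path) as f:
            doc = json.load(f)
    except (OSError, ValueError):
        # Unreadable or not valid JSON: treat as not a pyhf file.
        return False
    return isinstance(doc, dict) and 'data_schema' in doc
```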

@kratsg

kratsg commented Jun 2, 2020

Can we add an option that allows you to override the automated detection? E.g. something like:

additional_resources:
- description: Archive of full likelihoods in the HistFactory JSON format described
     in SUSY-2018-06. The background-only fit is found
     in the file named 'BkgOnly.json'. For each model a set of patches for various signal points is
     provided
  location: likelihoods_ANA-SUSY-2018-06_3L-RJ-mimic.tar.gz
  type: histfactory

@GraemeWatt
Member Author

Hopefully, we'll get around to tackling this issue soon. But I noticed that the latest HEPData record (released today) with an attached pyhf archive has:

additional_resources:
- description: Archive of full statistical likelihoods and README
  location: FullLikelihoods_sm.tar.gz

@ldcorpe, is the agreement above not being followed by ATLAS? Should we add "likelihoods" as a trigger phrase, in addition to "histfactory json" and "pyhf"?

@kratsg, yes, we should probably allow something like type: pyhf_tarball (better than type: histfactory) to override an automatic detection of the file type based on trigger phrases. We'd need to first modify the JSON schema that defines the submission.yaml file (see HEPData/hepdata-validator#23) to add the type field. Of course, future ATLAS submissions would then need to consistently specify type: pyhf_tarball in the submission.yaml file.
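Combining the two ideas, an explicit type field would take precedence over the trigger-phrase heuristic. A minimal sketch, assuming a resource is parsed from submission.yaml into a dict (the function name and dict shape are assumptions; 'pyhf_tarball' and the trigger phrases come from the agreement above):

```python
TRIGGER_PHRASES = ('histfactory json', 'pyhf')

def resolve_file_type(resource):
    """Return the file_type for an additional_resources entry."""
    # 1. An explicit 'type' in submission.yaml wins outright.
    explicit = resource.get('type')
    if explicit:
        return explicit
    # 2. Otherwise fall back to the agreed trigger-phrase heuristic.
    location = resource.get('location', '')
    description = resource.get('description', '').lower()
    if location.endswith('.tar.gz') and any(
            phrase in description for phrase in TRIGGER_PHRASES):
        return 'pyhf_tarball'
    return None
```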

@kratsg

kratsg commented Jan 28, 2021

Hi @GraemeWatt, thanks for flagging this. I can talk with the SUSY conveners and try to get this sorted out. I'm realizing that this is likely just a miscommunication somewhere. One thing I'm worried about is relying on the location filename/description for these sorts of things. I would definitely prefer a type key or similar.

pyhf_tarball and histfactory to me are identical (pyhf is just "python histfactory"). I'm generally not opposed to anything you propose as long as there's a way to look up valid types and what they map to (somewhere).

@GraemeWatt GraemeWatt moved this to Todo in @HEPData Feb 3, 2022
@alisonrclarke alisonrclarke moved this from To do to In Progress in @HEPData Apr 21, 2022
@alisonrclarke alisonrclarke self-assigned this Apr 21, 2022
@alisonrclarke
Contributor

alisonrclarke commented Apr 21, 2022

Plan is:

  • Add function to detect pyhf/HistFactory files either from type property of resource in YAML (checking for HistFactory) or from filename/description, and return the file_type as HistFactory
  • Update the ES indexer to add HistFactory as an analysis type (see use of ANALYSIS_ENDPOINTS, but will need modification to return the file landing page as the URL)
  • Add a CLI/one-off script to check resources of existing submissions and update any with pyhf/HistFactory files
  • Add a badge like the 'Rivet analysis' badge for pyhf
  • Link to the pyhf resource in the 'View analyses' dropdown (might be automatic once indexed correctly)
  • Add the ability to search on analysis:rivet or analysis:histfactory
  • Update submission docs with details of type field
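The first bullet of the plan could look roughly like the following (the function name, dict shape, and exact matching rules are assumptions made for illustration):

```python
def detect_histfactory(resource):
    """Return 'HistFactory' if the resource looks like a pyhf/HistFactory
    file, detected either from an explicit 'type' property or from the
    filename/description heuristic; otherwise None."""
    # Explicit type in the YAML takes precedence.
    if str(resource.get('type', '')).lower() == 'histfactory':
        return 'HistFactory'
    # Fall back to filename/description matching.
    location = resource.get('location', '').lower()
    description = resource.get('description', '').lower()
    if location.endswith('.tar.gz') and (
            'histfactory' in description or 'pyhf' in description):
        return 'HistFactory'
    return None
```

The normalised return value 'HistFactory' matches the naming agreed later in this thread.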

@kratsg

kratsg commented Apr 21, 2022

A minor nitpick (admittedly) but I think we should call it "HistFactory" or "HistFactory JSON" or similar, rather than pyhf. To be precise, pyhf is just a python-only implementation of HistFactory, but in this context, you're talking specifically about storing HistFactory models in a JSON format.

@alisonrclarke
Contributor

> A minor nitpick (admittedly) but I think we should call it "HistFactory" or "HistFactory JSON" or similar, rather than pyhf. To be precise, pyhf is just a python-only implementation of HistFactory, but in this context, you're talking specifically about storing HistFactory models in a JSON format.

HistFactory it is (OK'd by @GraemeWatt).

@GraemeWatt
Member Author

@kratsg @lukasheinrich @matthewfeickert @cranmer: do you get notifications from the HEPData Zulip instance? I posted a message two days ago (tagging @all) asking for feedback on this feature, now implemented on our QA site, before I deploy it in production. (I don't want to post links to the QA site publicly.) Please log in and reply to the post.

@lukasheinrich

Hi @GraemeWatt ,

yes, I did, and it looks great. This past week was particularly busy, but I'll try to go over it next week.

Thanks,
Lukas

@kratsg

kratsg commented May 13, 2022

I've commented on Zulip. Do you want me to copy/paste it into GitHub as well?

@GraemeWatt
Member Author

Thanks for the comments. We can continue the discussion on Zulip, so no need to repeat comments here.

@GraemeWatt
Member Author

Now deployed in production; tweets sent to advertise the new search options.

Repository owner moved this from Ready for review to Done in @HEPData May 24, 2022
@matthewfeickert
Member

Thank you so much HEPData team — this is amazing! 🚀
