submission: provide native support for individual "pyhf" JSON files #164
Comments
From the hepdata-validator module's side of things: it looks like the validator accepts a schema file path, and it has a `validate` method. However, the schema file path argument is not actually used by the `validate` method. I think it would be fairly straightforward to modify the validator to use custom schemas properly, including from remote URLs, without affecting the default schemas. |
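As an illustration of the direction described above, here is a minimal sketch of how a data file could be validated against a JSON schema downloaded from a remote URL, using the requests and jsonschema packages. The function name and structure are invented for this sketch and are not the actual hepdata-validator API.

import json

import requests
from jsonschema import ValidationError, validate


def validate_with_remote_schema(data_file_path, schema_url):
    # Download the schema from the remote URL and validate the local JSON data file against it.
    schema = requests.get(schema_url, timeout=10).json()
    with open(data_file_path) as data_file:
        data = json.load(data_file)
    try:
        validate(instance=data, schema=schema)
        return True
    except ValidationError as exc:
        print(f"{data_file_path} is invalid: {exc.message}")
        return False


# Example usage with the pyhf workspace schema discussed below:
# validate_with_remote_schema(
#     "BkgOnly.json",
#     "https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json",
# )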
Before starting to work on an implementation, it would be better if the JSON schema were stable on the pyhf side; for example, the URL for the schema should not change. The method of telling HEPData which schema and version is being used by the data file should be decided (see scikit-hep/pyhf#740). The current hepdata-validator code is also written assuming that a custom schema is specified as a local file path. Cc: @lukasheinrich |
Adding @kratsg and @matthewfeickert. Yes, I think we'd be happy to finalize the schema to a good base version. I agree that having the patchset schema would be good, because then the likelihood publication maybe only includes 2 files (instead of hundreds). Maybe @alisonrclarke can provide some instructions on what you need from us. The basic functionality of adding support for external schemas is probably independent of what those schemas entail in detail? |
For visualization there is some ongoing work, e.g. https://scikit-hep.org/pyhf/examples/notebooks/altair.html and https://www.youtube.com/watch?v=9egt9ZTm7T0. Probably something based on Vega / Altair is sufficient for us. |
Stability is something I'm working on.
Do you need a |
Hi @GraemeWatt and @alisonrclarke can we add @Sinclert to the team? |
I've sent an invitation to @Sinclert to join the @HEPData organization. Let me know if Write access is needed for individual repositories. You can contribute by forking repositories and making pull requests without having Write access. The idea of giving the |
@GraemeWatt, @Sinclert -- I have provided the fix and closed the relevant issue scikit-hep/pyhf#686 via scikit-hep/pyhf#791. I would be interested to see if the |
Example:
additional_resources:
- {description: Web page with auxiliary material, location: 'https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/SUSY-2018-31/'}
- {description: Truth code to compute acceptance for all signal regions using the
    SimpleAnalysis framework, location: Sbottom_MB2018.cxx}
- {description: Archive of full likelihoods in the HistFactory JSON format described
    in ATL-PHYS-PUB-2019-029. Provided are 3 statistical models labeled RegionA RegionB
    and RegionC respectively each in their own sub-directory. For each model the background-only
    model is found in the file named 'BkgOnly.json'. For each model a set of patches
    for various signal points is provided, location: HEPData_workspaces.tar.gz}
- {description: 'SLHA files for the 3 baseline signal points used in the analysis
    for regions A,B,C', location: SbMB_SLHAs.tar.gz}
comment: ''
data_license: {description: The content can be shared and adapted but you must give
appropriate credit and cannot restrict access to others., name: cc-by-4.0, url: 'https://creativecommons.org/licenses/by/4.0/'}
hepdata_doi: 10.17182/hepdata.89408.v1 |
@kratsg: The example submission.yaml file specifies:

data_file: RegionA-BkgOnly.json
data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json

I've just merged @Sinclert's PR (HEPData/hepdata-validator#17), which extends the hepdata-validator to download JSON schemas from remote locations. I've provided an example submission.yaml file for use with one of the tests. Please take a look. You don't provide a JSON schema for the signal patch files. Would you want to include all signal patch files (combined with the background-only workspace) as separate YAML documents of the submission.yaml file? Please be more precise about exactly how you expect HEPData to natively support pyhf JSON files, to avoid @Sinclert spending time on unnecessary software development. Please look at the recent changes to the hepdata-validator and let us know if further changes are necessary before I make a new release. Please provide an example HEPData submission that we can use for testing. Thank you in advance for your cooperation. |
@GraemeWatt , thanks for the comments. I wasn't aware you wanted an example that was more "native". I need to think more about it.
The reason there is no JSON schema for the patch files is that they follow the jsonpatch format (see RFC 6902); a minimal illustration of such a patch is sketched just below.
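A minimal, invented example using the Python jsonpatch package; the document and patch contents are purely illustrative and far smaller than a real HistFactory workspace.

import jsonpatch

# A tiny stand-in for a background-only workspace fragment.
workspace_fragment = {
    "channels": [
        {"name": "SR", "samples": [{"name": "background", "data": [10.0, 5.0]}]}
    ]
}

# An RFC 6902 patch is just a list of operations (add/remove/replace/...).
signal_patch = [
    {
        "op": "add",
        "path": "/channels/0/samples/1",
        "value": {"name": "signal", "data": [1.5, 0.7]},
    }
]

patched = jsonpatch.apply_patch(workspace_fragment, signal_patch)
print(patched["channels"][0]["samples"][1]["name"])  # -> signal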
We want them all to be available for analysis. Is it possible to include all in an upload, but only display "tables" for a subset of them?
This still won't solve the problem since, as you mentioned, these patch files would just be subsumed or merged into a single file containing multiple sets of patches, from which you choose which patch to apply to the background-only workspace.
Ok, I will get back to you on this. |
Hi @GraemeWatt, to answer a lot of questions, I'm copying/pasting a proposal here that we came up with:

likelihoods:
- type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: BkgOnly.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: patchset.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json

A few key points:
Question:
For HEPData, I have a question for you. Is HEPData open to having pyhf installed to parse the background-only / patchsets (apply patches to the background-only as well) in order to get information out? We do have a
|
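For reference, a hedged sketch of how pyhf itself could combine the proposed files, should HEPData (or an external service) choose to do so; the file names follow the proposal above, while the patch name "RegionA_signal" is invented for illustration.

import json

import pyhf

with open("BkgOnly.json") as ws_file:
    workspace = pyhf.Workspace(json.load(ws_file))

with open("patchset.json") as ps_file:
    patchset = pyhf.PatchSet(json.load(ps_file))

# Apply one named signal patch to the background-only workspace to obtain
# a signal-plus-background workspace, then build the statistical model.
signal_workspace = patchset.apply(workspace, "RegionA_signal")
model = signal_workspace.model()
print(model.config.channels)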
Dear @GraemeWatt , we have developed a patchset schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json . |
Dear @GraemeWatt, following Giordon converging on workspace schemas, one of our analyses in the ATLAS SUSY group (EWK 1L, already with v1 on HEPData: https://www.hepdata.net/record/ins1755298; many thanks to Danika MacDonell for doing so) has prepared all the necessary files for the record upgrade, with the new likelihood YAML format exploited.
In case it helps, you can find a streamlined version of the submission tarball that we use here:
Please let us know if we can assist you with anything from our side. Thanks, |
Hi @butsuri43, as far as I know, the newest version of |
Probably we'd like to use the patchset schema as well for this upload, @kratsg @butsuri43? |
I thought we did use it. |
@lukasheinrich it uses the patchset schema:

likelihoods:
  type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: BkgOnly.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: patchset.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json

Note that |
@danika -- one thing we need to make sure of is that we provide the appropriate region name (there's only one region, so I don't know what you would call it...). |
@kratsg should this be a list? (note the dash on line 2 I added)
|
Yes it should be, good catch! @butsuri43 - can you check the format? It should be what @lukasheinrich just put above. |
Hi all, thanks Giordon and Lukas for checking the files. More importantly, now, when uploading the tarball, the HEPData website complains with an error message. This is a more understandable error to me, and basically points to the fact that what we have in the 'likelihoodDescription.yaml' file is of the type |
Dear Krzysztof (@butsuri43), thanks for the messages. I've just made a commit (HEPData/hepdata-validator@d8f6792) to the validator code to display a more user-friendly error message. The schema introduced above by @kratsg is just a proposal that I haven't even commented on yet, let alone started to work on a code implementation. It's completely different from what I originally anticipated at the top of this issue and from what @Sinclert already implemented for a potential 0.2.2 release of the hepdata-validator package. This issue is a low priority for me and other tasks will take precedence in the next few weeks, such as migrating to new infrastructure at CERN next week, then migrating from Python 2 to Python 3. But perhaps in a month or so, we can discuss whether it's worthwhile to proceed further with this issue. Best regards, |
Sounds good. We can just follow up here then. |
Hi @GraemeWatt, thanks for still considering this. I think we are happy to assist where we can. As you can see this is being picked up by theorists, so I think effort put into this will not go in vain (I wholeheartedly support prioritizing py2->py3 first though). I looked at the JSON record of one of the hepdata entries, e.g. https://www.hepdata.net/record/ins1755298?format=json, and see the section listing the additional resources.
How does this work internally? If we look at this URL: https://www.hepdata.net/record/data/90607/476075/1, I assume this is the original uploaded YAML from the submission? I think an ideal situation would be if we could get this to show up directly in the JSON record.
My understanding is that for this to happen you need the likelihood as part of the overall JSON of the record. This could be easily achieved using https://json-spec.readthedocs.io/reference.html by slightly modifying the record.
If you use this on the hepdata-validator side, all these files could be resolved to inline all of the JSON,
and you'd never even notice it was originally in two files. We use the same strategy in yadage (part of RECAST), https://github.com/yadage/yadage/blob/master/tests/testspecs/mapreduce/workflow.yml, where after processing, the files steps.yml and workflow.yml get fused into one big JSON doc. When viewing the JSON record, I think it'd be nice to (like in the tables) then provide a separate URL for that section of the record. The benefits are the same as with not storing the tables themselves as "additional resources", e.g. you can visualize them better, etc. |
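As a hand-rolled illustration of this $ref-inlining idea (the json-spec package linked above could do the same job), the sketch below resolves {$ref: ...} entries against the submission directory; the record structure follows the proposal earlier in this thread and the directory name is a placeholder.

import json
from pathlib import Path


def inline_refs(node, base_dir):
    # Recursively replace {"$ref": "file.json"} nodes with the referenced file contents.
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            referenced = json.loads((base_dir / node["$ref"]).read_text())
            return inline_refs(referenced, base_dir)
        return {key: inline_refs(value, base_dir) for key, value in node.items()}
    if isinstance(node, list):
        return [inline_refs(item, base_dir) for item in node]
    return node


record = {
    "name": "RegionA",
    "workspace": {"$ref": "BkgOnly.json"},
    "patchset": {"$ref": "patchset.json"},
}

# After inlining, the record is one self-contained JSON document in which the
# workspace and the patchset appear directly, as suggested above.
inlined = json.dumps(inline_refs(record, Path("submission_directory")), indent=2)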
Hi Graeme, Thank you very much for your response. |
Dear @GraemeWatt - independent of which solution we end up going with, is it possible now to get a badge added for the analyses which currently have (and will have) HistFactory likelihoods attached to their entries? I think this is our first concern when trying to get HEPData ready to support likelihoods natively, and it would already massively help with identifying/searching. We can iterate more on the native likelihood support afterwards. |
I think there is some prior art from adding the Rivet badges. @GraemeWatt, does it still need an external JSON file to identify the records with Rivet analyses, or is this now handled internally? |
Small update: we just merged the PR that allows HEPData users to easily define which remotely-defined JSON schemas are accepted from a validation point of view. Now, from a rendering point of view, there is a not-so-clear requirement (explained by Graeme in this comment), where accepted schemas also need to be renderable, by including the independent_variables and dependent_variables fields in their definition.
Therefore, the current list of remotely-defined accepted schemas can be seen in hepdata/hepdata/modules/records/utils/validators.py (lines 37 to 54 at commit e75194b).
|
Thanks @Sinclert - I think one would need to encapsulate this then. The dependent_variables and independent_variables are specific to the table schema, and external data types probably won't be able to guarantee semantically useful mappings to these keys. So it would help if the rendering could somehow be modularized, such that table-schema documents render with the existing tooling while other schemas can plug in their own rendering. |
@Sinclert, to clarify, I wasn't proposing that the pyhf format should be put into the usual table structure of independent_variables and dependent_variables.
This part is done, congratulations! Unfortunately, it's the easiest part.
This part is still to do and is more difficult, requiring substantial parts of the HEPData code to be rewritten to support the pyhf format instead of the default "table" format. I think the next step would be for the pyhf authors to define a likelihoods.json schema, so that the submission.yaml file could contain something like:

data_file: histfactory_likelihoods.yaml
data_schema: https://diana-hep.org/pyhf/schemas/1.0.0/likelihoods.json

To proceed further, it would be good if the pyhf authors could (i) define |
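To illustrate what such a likelihoods.json schema might contain, here is a purely hypothetical sketch written as a Python dictionary and checked with the jsonschema package; the field names follow Giordon's proposal above, but the actual schema would have to be defined and hosted by the pyhf authors.

from jsonschema import validate

# Hypothetical minimal schema for the proposed "likelihoods" format (illustration only).
LIKELIHOODS_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["likelihoods"],
    "properties": {
        "likelihoods": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["type", "name", "workspace", "patchset"],
                "properties": {
                    "type": {"const": "histfactory"},
                    "name": {"type": "string"},
                    "workspace": {"$ref": "#/definitions/file_reference"},
                    "patchset": {"$ref": "#/definitions/file_reference"},
                },
            },
        }
    },
    "definitions": {
        "file_reference": {
            "type": "object",
            "required": ["data_file", "data_schema"],
            "properties": {
                "name": {"type": "string"},
                "description": {"type": "string"},
                "data_file": {"type": "string"},
                "data_schema": {"type": "string", "format": "uri"},
            },
        }
    },
}

example = {
    "likelihoods": [
        {
            "type": "histfactory",
            "name": "RegionA",
            "workspace": {
                "data_file": "BkgOnly.json",
                "data_schema": "https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json",
            },
            "patchset": {
                "data_file": "patchset.json",
                "data_schema": "https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json",
            },
        }
    ]
}

validate(instance=example, schema=LIKELIHOODS_SCHEMA)  # raises ValidationError if invalid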
Oh I misunderstood then. Aiming for a different visualization type seems to be the way to go. |
Hi all... I will be giving a talk "at CERN" for PhyStat / the CERN Data Science Seminar on Oct 14. It would be great if I could plug this development. |
Let me start by saying that I think this integration is suffering from deep communication problems, which have made this issue become (by far) the most commented issue in the whole HEPData ecosystem of packages. It seems to me that this discussion has turned into a kind of messy drawer where different ideas are thrown in but no real decisions are being taken. I urge future responses to address the points raised in the previous comment so that the conversation can progress. Let's focus on what Graeme said:
TLDR summary: @lukasheinrich @kratsg @matthewfeickert, please 🙏🏻
@GraemeWatt, please 🙏🏻
Long explanation:
1. Define the
|
Hi @Sinclert, sorry for the mess :-/
1 + 2: I did not quite understand what problems my proposal has, maybe I forgot? I would make a small change (allowing more than one patchset for a given base likelihood), but the overall structure seems sane to me?
3: I think a summary (Option A) is best as a first step. Ideally pyhf could provide an external JS package. I would propose to leave Option C for a later time once we have more experience. What I tried to convey in my last comment is that the summary does not need to replicate the full |
Apologies for my part in the miscommunication. I put this issue on hold back in April and left many questions unanswered due to other priorities in the last few months. Having said this, some of my existing answers above seem to have been ignored by the pyhf authors. @Sinclert, your last comment above is a great summary! I agree with everything you said there. The misunderstanding probably reflects the fact that modifying HEPData to provide native support for pyhf files is actually very complicated. Commenting in a single GitHub issue is not the best forum for discussion. I've just sent an invitation to @lukasheinrich @kratsg @matthewfeickert @cranmer (using your CERN email addresses) to join the "pyhf" stream of a Zulip instance (similar to Slack) that I was previously using to chat to @Sinclert. I'm also happy to join a video call if you want to discuss how to proceed. I'll try to answer the outstanding questions raised above.
Yes, the
I'd prefer to avoid installing pyhf. |
No, the original YAML is processed into a JSON format suitable for rendering by the front-end JavaScript code. This processing is carried out every time a YAML data table is loaded, and it should be switched off for the pyhf format.
Yes, it's a separate issue #163, but I haven't yet found time to work on the implementation. |
Well, I didn't explicitly point out the problem yet, so let me try to explain now. Each HEPData table corresponds to two YAML documents: (i) one in the submission.yaml file containing the metadata, and (ii) a separate YAML data file. For the pyhf format, the submission.yaml document could instead specify something like:

data_file: pyhf_likelihood.yaml
data_schema: https://diana-hep.org/pyhf/schemas/1.0.0/likelihood.json

Here, you can use
No, I would prefer to avoid installing pyhf. Putting my comments together, I guess the pyhf_likelihood.yaml data file could then look something like:

name: RegionA
workspace: {$ref: BkgOnly.json}
patchset: {$ref: patchset.json}
inspect: {$ref: inspect.json}

Here, I've removed |
@lukasheinrich @kratsg @matthewfeickert @cranmer : I'm closing this issue now, as discussed in May 2022 on the HEPData Zulip chat and in-person with Lukas. To summarise, the original idea of "native support" was too complicated to be workable. A simpler approach to provide visualization for HistFactory JSON files would be to have a separate web application, possibly based on Giordon's Related to this point, I made a comment in Zulip last year:
I've just added a pyhf section to the HEPData submission docs. If you want to edit this text, e.g. by linking to pyhf docs on formatting, please submit a PR to the hepdata-submission repository. Update: An idea emerging from the FAIROS-HEP Kick-Off Workshop in February 2023 was that if Binder could import a pyhf resource file (including a Jupyter notebook) from HEPData, that could provide a simpler way of visualizing the pyhf statistical models than developing a dedicated pyhf visualization web app. |
To provide native support of individual pyhf JSON files, the structure of a HEPData submission would be mostly the same, i.e. tied together by a submission.yaml file containing metadata, but instead of a data file specified as usual and validated by the default "table" schema, there would be something like a data_file accompanied by a data_schema key pointing to the relevant pyhf JSON schema.
The first step would be to extend the hepdata-validator package to validate against the custom JSON schema specified by the data_schema key instead of the default "table" schema. Appropriate visualization and conversion tools would then need to be developed to render each "pyhf" JSON file directly in a web browser, instead of using the current tools for the "table" schema.
See also HEPData/hepdata_lib#98.
Cc: @lukasheinrich