Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

podio::DataSource #309

Merged
merged 32 commits into from
Oct 1, 2024
Merged

podio::DataSource #309

merged 32 commits into from
Oct 1, 2024

Conversation

kjvbrt
Copy link
Contributor

@kjvbrt kjvbrt commented Aug 7, 2023

DataSource providing EDM4hep collections. This allows writing analyzers using EDM4hep collections directly.
Example:

    struct selPDG {
      selPDG(const int pdgID, const bool chargeConjugateAllowed = false);
      const int m_pdg;
      const bool m_chargeConjugateAllowed;
      edm4hep::TrackCollection operator() (
          const edm4hep::MCRecoTrackParticleAssociationCollection& inAssocColl);
    };

Two implementations are provided, frame and legacy.
To run the analysis using these analyzers --use-source or --use-legacy-source.

Implementation: The class keeps three vectors m_Collections, m_podioReaders and m_frames.
Thread safety: questionable

@kjvbrt kjvbrt requested review from vvolkl and tmadlener August 8, 2023 13:54
@vvolkl
Copy link
Member

vvolkl commented Aug 14, 2023

This will be very nice to have, thanks! I'll take a closer look this week. For the naming, I'd suggest "DataSource" instead of "Source". This should also eventually move to the edm4hep repository, right?

@kjvbrt kjvbrt changed the title [WIP] EDM4hep Source [WIP] EDM4hep DataSource Aug 14, 2023
@kjvbrt
Copy link
Contributor Author

kjvbrt commented Aug 15, 2023

Thanks for the suggestion I renamed the Source to DataSource.

Copy link
Contributor

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. I have started to have a look and I have made some more or less detailed comments below. I also have a few questions about how this is supposed to work in the end for RDataFrame, and about some of the concepts there.

First a conceptual one: Does a slot in RDataFrame correspond to a thread? Following up on that, IIUC the basic idea here is to have one reader per slot / thread and then also one Frame per slot / thread? I am not entirely sure how much overhead this is memory wise, but it might be possible to only have one reader but multiple Frames. However, I am not entirely sure how easy it would be to make this work with the RDF infrastructure.

Then one from the usage side. In the end this would be used as free functions that get "proper" edm4hep collections as arguments and return new (subset) collections? e.g.

df.Define("reco_pt", "getPt(ReconstructedParticles)")

or would the idea be to have them as sort of "member functions" in the end? (Is that even possible?)

Then finally one that already goes beyond this PR, but sort of fits anyhow: Can the Snapshot mechanism be made to produce EDM4hep output again from this?

EDM4hepDataSource(const std::vector<std::string>& filePathList,
int nEvents = -1);

void SetNSlots(unsigned int nSlots);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the virtual functions here be marked with either final or override, just so that the compiler has a chance to detect mismatches here?

Comment on lines 79 to 81
std::vector<std::unique_ptr<podio::ROOTFrameReader>> m_podioReaders;
/// Podio frames
std::vector<std::unique_ptr<podio::Frame>> m_frames;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how RDataFrame handles this internally and which parts need to be shared over several threads, and which are managed by RDataFrame and we basically can assume that it will only be accessed from only one thread.

From a podio point of view:

  • ROOTFrameReader / ROOTLegacyReader: no internal thread safety at all, i.e. either make sure that each thread has it's own reader or put a lock / mutex around it for readEntry/readNextEntry.
  • Frame: fully thread safe. Access from as many thread as you want, including putting new things into it. With the caveat that this uses a lock / mutex internally, so there could be some lock congestion if used from very many threads

Comment on lines 343 to 357
if (std::find(m_columnNames.begin(),
m_columnNames.end(),
columnName) != m_columnNames.end()) {
return true;
}

return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (std::find(m_columnNames.begin(),
m_columnNames.end(),
columnName) != m_columnNames.end()) {
return true;
}
return false;
return std::find(m_columnNames.begin(),
m_columnNames.end(),
columnName) != m_columnNames.end();

Could ditch the if here and simply return the result directly.

Comment on lines 366 to 376
<< m_columnTypes.at(i) << std::endl;

return m_columnTypes.at(i) + "Collection";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<< m_columnTypes.at(i) << std::endl;
return m_columnTypes.at(i) + "Collection";
<< m_columnTypes[i] << std::endl;
return m_columnTypes[i] + "Collection";

No need to do another range-check here, since we have already done that just above.

Comment on lines 113 to 157
ROOT::VecOps::RVec<float>
getPt(const edm4hep::ReconstructedParticleCollection& inParticles);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is most likely beyond the scope if this pull request, but there is probably quite a bit of (kinematic) functionality that could be templated on the collection type and then simply forwarded to other utility functionality, e.g. the one that already comes with EDM4hep

Comment on lines 97 to 120
struct selPDG {
selPDG(const int pdgID, const bool chargeConjugateAllowed = false);
const int m_pdg;
const bool m_chargeConjugateAllowed;
edm4hep::ReconstructedParticleCollection operator() (
const edm4hep::MCRecoParticleAssociationCollection& inAssocColl);
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this is most likely beyond the scope if this PR, but this also looks general enough to be templated (but I a likely missing a detail here).

@Zehvogel
Copy link

Can we split the DataSource itself off of this and get it into edm4hep directly? I tested it a bit and it looked ready to use?

@tmadlener
Copy link
Contributor

If putting the datasource into EDM4hep would be (more or less) easily possible, I would also be in favor of having it there. That should make maintenance a bit easier and we would see earlier if we break it when we do changes in EDM4hep.

Copy link
Contributor

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only had a quick look at some of the files and made a few smaller comments.

Overall, what is the (longer term) plan for FCCAnalysis for switching to the podio datasource? This PR suggests that at least for some time the "old way" and the "new way" will coexist for some time. Since EDM4hep has changed enough such that newer files cannot be easily read with older versions of FCCAnalysis, I am wondering whether it makes sense to have the two things in parallel, or whether it would be easier to simply make a clean cut and move everything to the data source way.

Some of the functionality might also be general enough for it to be moved to EDM4hep? E.g. getting the pt (or four momentum) for any datatype that has a momentum (and at least a mass or energy).

CMakeLists.txt Outdated
Comment on lines 34 to 35
set(WITH_SOURCE ON CACHE STRING "Enable PODIO Data Source")
set_property(CACHE WITH_SOURCE PROPERTY STRINGS ON OFF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
set(WITH_SOURCE ON CACHE STRING "Enable PODIO Data Source")
set_property(CACHE WITH_SOURCE PROPERTY STRINGS ON OFF)
set(WITH_PODIO_DATASOURCE ON CACHE STRING "Enable PODIO Data Source")
set_property(CACHE WITH_PODIO_DATASOURCE PROPERTY STRINGS ON OFF)

Just to give this a slightly more discernible name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, used WITH_PODIO_DATASOURCE

Comment on lines 11 to 24
template <typename T>
T sel_PDG::operator() (const T& inAssocColl) {
T result;
result.setSubsetCollection();

for (const auto& assoc: inAssocColl) {
const auto& particle = assoc.getSim();
if (particle.getPDG() == m_pdg) {
result.push_back(assoc);
}
}

return result;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definition needs to go to the header file, otherwise we will most likely run into linker errors. Unless, there are explicit template instantiations that I am missing somewhere. Similar for others cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved also with other ones.

* with the MC particle with the desired PDG ID.
*/
edm4hep::ReconstructedParticleCollection operator() (
const edm4hep::MCRecoParticleAssociationCollection& inAssocColl);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const edm4hep::MCRecoParticleAssociationCollection& inAssocColl);
const edm4hep::RecoMCParticleLinkCollection& inAssocColl);

The Association types are deprecated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything was converted into Links.

Comment on lines 137 to 140
lVec.SetXYZM(particle.getMomentum().x,
particle.getMomentum().y,
particle.getMomentum().z,
particle.getMass());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be done via edm4hep::utils::p4(particle, edm4hep::utils::UseMass) from the edm4hep/utils/kinematics.h header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint, used...

Comment on lines 156 to 158
return std::sqrt(std::pow(p.getMomentum().x, 2) +
std::pow(p.getMomentum().y, 2));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here edm4hep::utils::pt can also do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@kjvbrt
Copy link
Contributor Author

kjvbrt commented Sep 20, 2024

I only had a quick look at some of the files and made a few smaller comments.

Thanks, this was just work in progress commit...

Overall, what is the (longer term) plan for FCCAnalysis for switching to the podio datasource? This PR suggests that at least for some time the "old way" and the "new way" will coexist for some time. Since EDM4hep has changed enough such that newer files cannot be easily read with older versions of FCCAnalysis, I am wondering whether it makes sense to have the two things in parallel, or whether it would be easier to simply make a clean cut and move everything to the data source way.

I would like to ditch "old way" at some point, but I'm not sure we can make clean cut quickly. There is a lot of analysis code which uses the "old way" and also the performance of the datasource way needs to be better understood.

Some of the functionality might also be general enough for it to be moved to EDM4hep? E.g. getting the pt (or four momentum) for any datatype that has a momentum (and at least a mass or energy).

Upstreaming some of the functionalities to edm4hep is the good point. How would you handle generating of those functions? With C++ templating or something like Podio(jinja2)?
Long term I thing it would be good to separate c++ libraries from Python orchestration and only at the point where we know which files the analysis will operate on we source appropriate stack.

@Zehvogel
Copy link

Zehvogel commented Sep 20, 2024

also the performance of the datasource way needs to be better understood.

I think unfortunately the performance will always be worse. Podio gives us an AoS style access in memory while when we access the root file directly we have SoA. I.e. using the podio datasource will potentially always result in reading also unneeded data. Unless I am confusing something... Maybe @m-fila wants to comment? :)

@tmadlener
Copy link
Contributor

There is a lot of analysis code which uses the "old way" and also the performance of the datasource way needs to be better understood.

Good point. Didn't think of that.

I think unfortunately the performance will always be worse. Podio gives us an AoS style access in memory while when we access the root file directly we have SoA. I.e. using the podio datasource will potentially always result in reading also unneeded data. Unless I am confusing something... Maybe @m-fila wants to comment? :)

There are a few things to consider. The major difference is that the podio data source currently has no way to only read selective branches / collections only. It will always read the full event. On top of that reconstructing the relations takes a bit of time. At some point that might be an acceptable overhead, but how large the analysis has to be for that is unclear to me. There are also almost certainly some optimizations that can be (or are already) done if the podio middleman is cut out of the whole thing. The memory layout is another thing, but that might even be something we can change in podio, because the interface should give us enough abstraction to switch the implementation below from AoS to SoA, but this has to be investigated further.

@kjvbrt kjvbrt merged commit eef66b3 into HEP-FCC:master Oct 1, 2024
3 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants