Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

master_june24: major bug in memory access for channelIds in C++ calculate_wavefunctions (missing ieventAccessRecordConst) #899

Closed
valassi opened this issue Jul 9, 2024 · 8 comments · Fixed by #882
Assignees
Milestone

Comments

@valassi
Copy link
Member

valassi commented Jul 9, 2024

Another issue introduced in #830 and being reviewed in #882.

This is a big bug. Essentially, ievt0 in calculate_wavefunctions is ignored
https://github.com/valassi/madgraph4gpu/blob/8e312bc02d9d072615fcb1052b5db54754498517/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc#L288

#ifdef MGONGPUCPP_GPUIMPL
#ifdef __CUDACC__
#pragma nv_diagnostic pop
#endif
      // CUDA kernels take input/output buffers with momenta/MEs for all events
      const fptype* momenta = allmomenta;
      const fptype* COUPs[nxcoup];
      for( size_t ixcoup = 0; ixcoup < nxcoup; ixcoup++ ) COUPs[ixcoup] = allCOUPs[ixcoup];
      fptype* MEs = allMEs;
#ifdef MGONGPU_SUPPORTS_MULTICHANNEL
      fptype* numerators = allNumerators;
      fptype* denominators = allDenominators;
#endif
#else
      // C++ kernels take input/output buffers with momenta/MEs for one specific event (the first in the current event page)
      const fptype* momenta = M_ACCESS::ieventAccessRecordConst( allmomenta, ievt0 );
      const fptype* COUPs[nxcoup];
      for( size_t idcoup = 0; idcoup < ndcoup; idcoup++ )
        COUPs[idcoup] = CD_ACCESS::ieventAccessRecordConst( allCOUPs[idcoup], ievt0 ); // dependent couplings, vary event-by-event
      //for( size_t iicoup = 0; iicoup < nicoup; iicoup++ ) // BUG #823
      for( size_t iicoup = 0; iicoup < nIPC; iicoup++ )     // FIX #823
        COUPs[ndcoup + iicoup] = allCOUPs[ndcoup + iicoup]; // independent couplings, fixed for all events
      fptype* MEs = E_ACCESS::ieventAccessRecord( allMEs, ievt0 );
#ifdef MGONGPU_SUPPORTS_MULTICHANNEL
      fptype* numerators = NUM_ACCESS::ieventAccessRecord( allNumerators, ievt0 );
      fptype* denominators = DEN_ACCESS::ieventAccessRecord( allDenominators, ievt0 );
#endif
#endif

      // Reset color flows (reset jamp_sv) at the beginning of a new event or event page
      for( int i = 0; i < ncolor; i++ ) { jamp_sv[i] = cxzero_sv(); }

#ifdef MGONGPU_SUPPORTS_MULTICHANNEL
      // Numerators and denominators for the current event (CUDA) or SIMD event page (C++)
      fptype_sv& numerators_sv = NUM_ACCESS::kernelAccess( numerators );
      fptype_sv& denominators_sv = DEN_ACCESS::kernelAccess( denominators );
#endif

      // *** DIAGRAM 1 OF 3 ***

      // Wavefunction(s) for diagram number 1
      vxxxxx<M_ACCESS, W_ACCESS>( momenta, 0., cHel[ihel][0], -1, w_fp[0], 0 );

      vxxxxx<M_ACCESS, W_ACCESS>( momenta, 0., cHel[ihel][1], -1, w_fp[1], 1 );

      oxxxxx<M_ACCESS, W_ACCESS>( momenta, cIPD[0], cHel[ihel][2], +1, w_fp[2], 2 );

      ixxxxx<M_ACCESS, W_ACCESS>( momenta, cIPD[0], cHel[ihel][3], -1, w_fp[3], 3 );

      VVV1P0_1<W_ACCESS, CD_ACCESS>( w_fp[0], w_fp[1], COUPs[0], 1.0, 0., 0., w_fp[4] );

      // Amplitude(s) for diagram number 1
      FFV1_0<W_ACCESS, A_ACCESS, CD_ACCESS>( w_fp[3], w_fp[2], w_fp[4], COUPs[1], 1.0, &amp_fp[0] );
#ifdef MGONGPU_SUPPORTS_MULTICHANNEL
      if( channelIds != 0 )
      {
        channelids_sv = CID_ACCESS::kernelAccessConst( channelIds );
#if defined __CUDACC__ or !defined MGONGPU_CPPSIMD
        if( channelids_sv == 1 ) numerators_sv += cxabs2( amp_sv[0] );
        denominators_sv += cxabs2( amp_sv[0] );
#else
        for( int i = 0; i < neppV; ++i )
        {
          if( channelids_sv[i] == 1 ) numerators_sv += cxabs2( amp_sv[0] );
          denominators_sv += cxabs2( amp_sv[0] );
        }
#endif
      }
#endif
      jamp_sv[0] += cxtype( 0, 1 ) * amp_sv[0];
      jamp_sv[1] -= cxtype( 0, 1 ) * amp_sv[0];

      // *** DIAGRAM 2 OF 3 ***

I already noted in #895 that the memory access should be called once for all diagrams instead of once per diagram. Essentially it could be moved higher up (and I am actually moving it elsewhere in sigmakin...).

The problem I report here however is that the current implementation does not reflect and diverges from the relevant design of memory access in the C++ case, look at numerators for instance

  • CUDA access essentially proceeds only through kernelAccess functions (because numerators = allNumerators;), as it looks up the relevant ievt from the 'which thread is this' functions
  • C++ instead MUST take into account ievt0, which is the result of a loop in sigmaKin, to know which fragment of the allNumerators array to look at, numerators = NUM_ACCESS::ieventAccessRecord( allNumerators, ievt0 );

Instead, for channelIds this is ignored, and the kernel function is applied directly to the full channelId array. This means in practice that only the very first events in the channelId array are accessed, over and over. Maybe I am wrong, but I would guess this is the case. As there are no specific tests in master_june24 for different channelIds in teh array (see #896), I think this went undetected.

@valassi valassi self-assigned this Jul 9, 2024
@roiser
Copy link
Member

roiser commented Jul 9, 2024

Hi, for the C++ implementation please regard, as also discussed this morning in the dev meeting, that the convention is that a channelID value will be the same for all entries in the mg5 warp. thanks Stefan

@valassi
Copy link
Member Author

valassi commented Jul 15, 2024

Hi, for the C++ implementation please regard, as also discussed this morning in the dev meeting, that the convention is that a channelID value will be the same for all entries in the mg5 warp. thanks Stefan

Thanks. This clarifies that the reimplementation of this feature can be considerably simplified (see #898)

This (major) bug however has nothing to do with the fact that channelId is meant eventually to be the same inside warps, i.e. what you said above is irrelevant here. The point is that there was a missing ieventAccessRecordConst call (which was not even implemented in MemoryAccessChannelIds.h).

Using debug printouts, for a channelId array that contains123123123123 etc, this is what is actually found

    BACKEND=none
    channelids_sv 0 1
    channelids_sv 1 1
    channelids_sv 2 1
    channelids_sv 3 1
    channelids_sv 4 1
    channelids_sv 5 1
    channelids_sv 6 1
    channelids_sv 7 1
    channelids_sv 8 1
    channelids_sv 9 1
    channelids_sv 10 1
    channelids_sv 11 1
    channelids_sv 12 1
    channelids_sv 13 1
    channelids_sv 14 1
    channelids_sv 15 1
    BACKEND=sse4
    channelids_sv 0 1 2
    channelids_sv 2 1 2
    channelids_sv 4 1 2
    channelids_sv 6 1 2
    channelids_sv 8 1 2
    channelids_sv 10 1 2
    channelids_sv 12 1 2
    channelids_sv 14 1 2
    BACKEND=avx2
    channelids_sv 0 1 2 3 1
    channelids_sv 4 1 2 3 1
    channelids_sv 8 1 2 3 1
    channelids_sv 12 1 2 3 1
    BACKEND=512y
    channelids_sv 0 1 2 3 1
    channelids_sv 4 1 2 3 1
    channelids_sv 8 1 2 3 1
    channelids_sv 12 1 2 3 1
    BACKEND=512z
    channelids_sv 0 1 2 3 1 2 3 1 2
    channelids_sv 8 1 2 3 1 2 3 1 2
    BACKEND=cuda

I have now fixed this. This is what it should print out.

    BACKEND=none
    channelids_sv 0 1
    channelids_sv 1 2
    channelids_sv 2 3
    channelids_sv 3 1
    channelids_sv 4 2
    channelids_sv 5 3
    channelids_sv 6 1
    channelids_sv 7 2
    channelids_sv 8 3
    channelids_sv 9 1
    channelids_sv 10 2
    channelids_sv 11 3
    channelids_sv 12 1
    channelids_sv 13 2
    channelids_sv 14 3
    channelids_sv 15 1
    BACKEND=sse4
    channelids_sv 0 1 2
    channelids_sv 2 3 1
    channelids_sv 4 2 3
    channelids_sv 6 1 2
    channelids_sv 8 3 1
    channelids_sv 10 2 3
    channelids_sv 12 1 2
    channelids_sv 14 3 1
    BACKEND=avx2
    channelids_sv 0 1 2 3 1
    channelids_sv 4 2 3 1 2
    channelids_sv 8 3 1 2 3
    channelids_sv 12 1 2 3 1
    BACKEND=512y
    channelids_sv 0 1 2 3 1
    channelids_sv 4 2 3 1 2
    channelids_sv 8 3 1 2 3
    channelids_sv 12 1 2 3 1
    BACKEND=512z
    channelids_sv 0 1 2 3 1 2 3 1 2
    channelids_sv 8 3 1 2 3 1 2 3 1
    BACKEND=cuda

@valassi valassi changed the title master_june24: wrong memory access for channelIds in C++ implementation master_june24: major bug in memory access for channelIds in C++ calculate_wavefunctions (missing ieventAccessRecordConst) Jul 15, 2024
@valassi
Copy link
Member Author

valassi commented Jul 15, 2024

I renamed this to indicate that this is a bug in calculate_wavefunctions. There is a similar bug in sigmaKin #911

@oliviermattelaer
Copy link
Member

Sorry @valassi, I'm lost here, which line do you think is problematic?

Independently, I think that your test is not the correct one here.
The design for channelId is to contain values like "1111222233332222333333331111222" (when the warp size is set to 4)
i.e. channelId will contain warp_size identical value in a row (each of those values being "randomly" choose by fortran --depending of the PDF of the (set of) events.
So your example of "channelId array that contains123123123123" is not really testing something interesting (since this is possible if warp_size is set to 1)

Did you observe such content? or is it a pure assumption?

Thanks,

Olivier

@valassi
Copy link
Member Author

valassi commented Jul 15, 2024

Hi @oliviermattelaer well since the previous implementation was not even assuming/enforcing that 123123123 was not allowed (I am just implementing this now #898) then I used this as a test. This was mainly to show #894, i.e that the SIMD implementation was completely wrong (I just fixed that).

My comment here on memory access #899 is totally orthogonal to whether you assume that a warp contains the same channelid or not, There was a major bug (now fixed) because in practice the implementation was always checking channelids on the first SIMD vector. You see in the wrong implementation, all SIMD vectors are always 123123 etc, while they should be at least (eg SIMD-4) 1231, 2312, 3123. Now fixed. Now if you do put 111122223333, this means that you would get 1111, 1111, 1111, instead of 1111, 2222, 3333. Bug. Again, this is irrespective of whether you enforce or not that channelsids in a warp are the same.

The bug is:

  • There is a distinction in SIMD code between "allarray" and "array". The former includes all nevt events. The latter includes only a SIMD vector. The implementation form channelIDs #830 did not take this into account
  • To pass from allchannelids to channelids you need a call to ieventAccessRecordConst. This was missing. I have now put it in.

There are so many major bugs in #830 that indeed it's easy to get lost. Anyway, I am fixing them one by one and I file the info on each bug for info and for the record. But you may chhose to ignore individual bugs and just wait for the full thing. I should be done in a couple of days.

@valassi
Copy link
Member Author

valassi commented Jul 15, 2024

So your example of "channelId array that contains123123123123" is not really testing something interesting (since this is possible if warp_size is set to 1)

Did you observe such content? or is it a pure assumption?

PS What I did (#896) is that I added cudacpp tests, without madevent, for the functionality of cudacpp. This is a software functionality that is well defined and can (and MUST) be tested independently of madevent, so I just played around with 123123. Now I am removing this and moving to the 111122223333 (and adding sanity checks that this is the case).

The implementation in #830 was not using sanity checks, and in any case the explicit SIMD loops on neppV seem to indicate that it was ready to handle a more general case. Anyway, ignore. I am just testing, testing, testing, what was missing before.

@oliviermattelaer
Copy link
Member

Ah ok, I guess what you refer to is what I just comment on #894, that it is not clear to me what happens if nb_warp=2 for the SIMD case, if yes then I think that I do understand the issue now, (but no clue how to fix it).

Would you be able to point a commit where you fix this, I think it is super important for me (and @roiser) to fully understand what you did such that we can maintain such tricky stuff in the long term. (If you do not have it yet, it might even be good if I or Stefan tries to fix those for the exact same reason)

Thanks,

Olivier

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 17, 2024
…and recreate txt ref for runTest (use cuda/double as the reference platform)

CUDACPP_RUNTEST_DUMPEVENTS=1 ./runTest_cuda.exe
\cp ../../test/ref/dump* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/

NB: the CUDA test succeeds with the new reference files, but the C++ multichannel test madgraph5#896 fails due to bugs madgraph5#894 and madgraph5#899
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 17, 2024
… bug madgraph5#899

make -j bldall -f cudacpp.mk
for bck in none sse4 avx2 512y 512z cuda; do echo BACKEND=$bck; ./build.${bck}_d_inl0_hrd0/runTest_*.exe; done | egrep '(BACKEND|channelids_sv)'

BACKEND=none
channelids_sv 0 1
channelids_sv 1 1
channelids_sv 2 1
channelids_sv 3 1
channelids_sv 4 1
channelids_sv 5 1
channelids_sv 6 1
channelids_sv 7 1
channelids_sv 8 1
channelids_sv 9 1
channelids_sv 10 1
channelids_sv 11 1
channelids_sv 12 1
channelids_sv 13 1
channelids_sv 14 1
channelids_sv 15 1
BACKEND=sse4
channelids_sv 0 1 2
channelids_sv 2 1 2
channelids_sv 4 1 2
channelids_sv 6 1 2
channelids_sv 8 1 2
channelids_sv 10 1 2
channelids_sv 12 1 2
channelids_sv 14 1 2
BACKEND=avx2
channelids_sv 0 1 2 3 1
channelids_sv 4 1 2 3 1
channelids_sv 8 1 2 3 1
channelids_sv 12 1 2 3 1
BACKEND=512y
channelids_sv 0 1 2 3 1
channelids_sv 4 1 2 3 1
channelids_sv 8 1 2 3 1
channelids_sv 12 1 2 3 1
BACKEND=512z
channelids_sv 0 1 2 3 1 2 3 1 2
channelids_sv 8 1 2 3 1 2 3 1 2
BACKEND=cuda
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 17, 2024
…by cleanly separating allChannelIds and channelIds and adding a missing ieventAccessRecordConst call

Also add the missing ieventAccessRecordConst function in MemoryAccessChannelIds.h and comment out unused non-const function kernelAccess

The debug printouts show now that the issue is solved

for bck in none sse4 avx2 512y 512z cuda; do echo BACKEND=$bck; ./build.${bck}_d_inl0_hrd0/runTest_*.exe; done | egrep '(BACKEND|channelids_sv)'
BACKEND=none
channelids_sv 0 1
channelids_sv 1 2
channelids_sv 2 3
channelids_sv 3 1
channelids_sv 4 2
channelids_sv 5 3
channelids_sv 6 1
channelids_sv 7 2
channelids_sv 8 3
channelids_sv 9 1
channelids_sv 10 2
channelids_sv 11 3
channelids_sv 12 1
channelids_sv 13 2
channelids_sv 14 3
channelids_sv 15 1
BACKEND=sse4
channelids_sv 0 1 2
channelids_sv 2 3 1
channelids_sv 4 2 3
channelids_sv 6 1 2
channelids_sv 8 3 1
channelids_sv 10 2 3
channelids_sv 12 1 2
channelids_sv 14 3 1
BACKEND=avx2
channelids_sv 0 1 2 3 1
channelids_sv 4 2 3 1 2
channelids_sv 8 3 1 2 3
channelids_sv 12 1 2 3 1
BACKEND=512y
channelids_sv 0 1 2 3 1
channelids_sv 4 2 3 1 2
channelids_sv 8 3 1 2 3
channelids_sv 12 1 2 3 1
BACKEND=512z
channelids_sv 0 1 2 3 1 2 3 1 2
channelids_sv 8 3 1 2 3 1 2 3 1
BACKEND=cuda

NB: after fixing bug madgraph5#899, the SIMD tests still fail because of bug madgraph5#894

for bck in none sse4 avx2 512y 512z cuda; do echo BACKEND=$bck; ./build.${bck}_d_inl0_hrd0/runTest_*.exe; done | egrep '(BACKEND| ME |r.ME |In comparing)'
BACKEND=none
BACKEND=sse4
MadgraphTest.h:254: In comparing event 0 from iteration 0
  ME  1.094026373218036e-01
r.ME  8.613813520483170e-02
BACKEND=avx2
MadgraphTest.h:254: In comparing event 0 from iteration 0
  ME  1.914754413491214e-01
r.ME  8.613813520483170e-02
BACKEND=512y
MadgraphTest.h:254: In comparing event 0 from iteration 0
  ME  1.914754413491214e-01
r.ME  8.613813520483170e-02
BACKEND=512z
MadgraphTest.h:254: In comparing event 0 from iteration 0
  ME  1.972915668783644e-01
r.ME  8.613813520483170e-02
BACKEND=cuda
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 17, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 17, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 17, 2024
…n channelIds and allChannelIds also in sigmaKin (fix madgraph5#899; wip on madgraph5#911)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 17, 2024
…the fact madgraph5#898 that channelids are the same inside each warp

This also fixes all pending CID_ACCESS::ieventAccessRecordConst calls madgraph5#899 and madgraph5#911

Note: CPPProcess.cc is now again very close to upstream/master (most of master_june24 changes from madgraph5#830 have been undone)
@oliviermattelaer oliviermattelaer added this to the warp milestone Jul 18, 2024
@valassi
Copy link
Member Author

valassi commented Jul 19, 2024

Ah ok, I guess what you refer to is what I just comment on #894, that it is not clear to me what happens if nb_warp=2 for the SIMD case, if yes then I think that I do understand the issue now, (but no clue how to fix it).

Would you be able to point a commit where you fix this, I think it is super important for me (and @roiser) to fully understand what you did such that we can maintain such tricky stuff in the long term. (If you do not have it yet, it might even be good if I or Stefan tries to fix those for the exact same reason)

Thanks,

Olivier

Thanks Olivier :-)

There are many independent bugs here.

I hope I gave you enough pointers, let me know if you need more information.

This specific 899 is fixed in 882. Linking it there and closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants