Overlap communication with computation in multiply_module #290
base: develop
Conversation
- Refactor do_comms/prefetch to use a flag for non-blocking comms
- Add omp barrier after calculating indices in loop
- Take into account local comms - they do not produce an MPI_Request
…ts run but produce wrong results.
From @davidbowler: compute is only being called on kpart [2 : end]. To fix, call the compute kernels on kpart-1, then call once after the loop on kpart.
…for only kpart=[2,end-1]. Remove the unneeded variable icall of prefetch in various places. The code now produces correct results for MPI nprocs=1 and nthreads=any, but deadlocks for more than 1 MPI process. Separating the MPI_Waits did not fix it.
- recv_part should be stored in one sequential list throughout the loop, not two, since it is used for the MPI_Irecv tags that have to match the MPI_Issend ones.
- The periodicity check has to check the previous partition (kpart-1), not kpart-2. Adapt a bunch of stuff accordingly. Simplify the way the computation is handled for the last partition (now inside the loop).
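The pipelined structure these commits describe can be sketched abstractly. The following is a hedged illustration in plain Python, not the actual multiply_module code: `prefetch`, `wait`, and `compute` are hypothetical stand-ins for the MPI_Irecv, MPI_Wait, and kernel calls, showing how the receive for kpart+1 is posted before computing kpart.

```python
# Hypothetical sketch (plain Python, not CONQUEST/Fortran) of the overlap
# pattern: the receive for the next partition is posted before computing
# the current one, so communication for kpart+1 proceeds during compute.
def process_partitions(partitions, prefetch, wait, compute):
    """prefetch() posts a non-blocking receive and returns a request;
    wait() completes that request and returns the partition data."""
    n = len(partitions)
    req = prefetch(partitions[0])                  # fetch the first partition
    for kpart in range(n):
        data = wait(req)                           # data for current partition
        if kpart + 1 < n:
            req = prefetch(partitions[kpart + 1])  # overlap the next receive
        compute(data)                              # compute while it arrives

# Minimal synchronous stand-ins for the MPI calls, to show the ordering:
log = []
process_partitions([10, 20, 30],
                   prefetch=lambda p: ("req", p),
                   wait=lambda r: r[1],
                   compute=log.append)
```

With these stand-ins every partition is computed exactly once, in order, and the compute for the final partition now happens inside the loop rather than after it.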
Compare: c297692 to bd1ef12
…asses locally though....
Using "-check bounds" when compiling showed that the index of recv_part did not start at 0 when it was not an allocatable in a function that used it.
I think this can be reviewed now. I'll produce some profiles to see if we gained anything when I'm back from the holidays; I don't think I'll have time tomorrow.
I did a pass of reviews, sorry about being a bit nitpicky in some places. It's Friday.
I would like to understand the performance better before we merge this in.
    !lenbind_rem = size(bind_rem)
    ierr = 0
    ilen1 = nc_part
    !if(3*ilen1+5*ilen2>lenbind_rem) call cq_abort('Get error ',3*ilen1+5*ilen2,lenbind_rem)
Suggested change: delete the line `!if(3*ilen1+5*ilen2>lenbind_rem) call cq_abort('Get error ',3*ilen1+5*ilen2,lenbind_rem)`
    ierr = 0
    ilen1 = nc_part
    !if(3*ilen1+5*ilen2>lenbind_rem) call cq_abort('Get error ',3*ilen1+5*ilen2,lenbind_rem)
    if(ilen3>lenb_rem) call cq_abort('Get error 2 ',ilen3,lenb_rem)
It seems like this check should happen inside the `if(ilen3.gt.0)` clause.
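The reviewer's point can be sketched in a few lines. This is a hypothetical Python illustration (the function name and interface are illustrative, not the module's actual API): the buffer-size check only matters when a receive will actually be posted, so it belongs inside the `ilen3 > 0` branch.

```python
# Hypothetical sketch: validate the receive buffer size only when a
# receive will actually happen (ilen3 > 0), mirroring moving the
# cq_abort call inside the if(ilen3.gt.0) clause.
def check_and_post_recv(ilen3, lenb_rem):
    if ilen3 > 0:                      # a receive will be posted
        if ilen3 > lenb_rem:           # buffer too small: hard error
            raise RuntimeError(f"Get error 2 {ilen3} {lenb_rem}")
        return True                    # safe to post the receive
    return False                       # nothing to receive; size is moot
```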
    !lenb_rem = size(b_rem)
    !lenbind_rem = size(bind_rem)
Suggested change: delete the lines `!lenb_rem = size(b_rem)` and `!lenbind_rem = size(bind_rem)`
    bind_rem,b_rem,lenb_rem,bind,&
    a_b_c%istart(ipart,nnode), &
    bmat(1)%mx_abs,parts%mx_mem_grp,tag)
    end if
    end if
I think removing `icall` here was an error that I reverted in #295, but it still seems to have made its way here. Unless you sorted out the logic?
src/multiply_module.f90 (Outdated)
    ! If that previous partition was a periodic one, copy over arrays from previous index
    if(.not.new_partition(index_comp)) then
       part_array(:,index_comp) = part_array(:,index_rec)
       n_cont(index_comp) = n_cont(index_rec)
       ilen2(index_comp) = ilen2(index_rec)
       b_rem(index_comp) = b_rem(index_rec)
       lenb_rem(index_comp) = lenb_rem(index_rec)
Have you looked at how often this happens? I'm slightly concerned about the performance cost of these data copies. Could this be handled differently, for example via an if branch or (shudder) pointers?
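One copy-free alternative can be sketched with an indirection table: the periodic image records which slot already holds its data instead of duplicating the part_array/b_rem-style buffers. This is a hypothetical Python illustration only; `store`, `fetch`, `slot_of`, and `buffers` are invented names, not anything in the module.

```python
# Hypothetical sketch: a periodic image shares the slot of its source
# partition via an indirection table, so no buffer copy is made.
slot_of = {}    # partition index -> slot that actually holds its data
buffers = {}    # slot -> data

def store(index, data, periodic_image_of=None):
    if periodic_image_of is not None:
        slot_of[index] = slot_of[periodic_image_of]   # alias, no copy
    else:
        slot_of[index] = index
        buffers[index] = data

def fetch(index):
    return buffers[slot_of[index]]

store(1, [1.0, 2.0])
store(2, None, periodic_image_of=1)    # periodic image aliases slot 1
```

The trade-off is that every reader must go through the indirection, which is roughly what Fortran pointers would buy, without the aliasing hazards being implicit.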
    end if
    !$omp barrier
    end do main_loop
    end do main_loop
Suggested change: remove the duplicated line, keeping a single `end do main_loop`
    logical, intent(in), optional :: do_nonb
    integer, intent(out), optional :: request(2)

    integer :: ind_part, ipart, nnode, offset
`offset` seems to be unused. Also I'm a bit confused about what offsets the comment in the docstring refers to here.
In general there are a huge number of unused variables in this module (by no means an isolated case; they're a plague throughout the code). Not necessarily worth doing in this PR, but whenever we are working on a source file, it's worth compiling with `-Wall` and removing the unused variables.
    new_partition = .true.

    ! Check if this is a periodic image of the previous partition
    if(kpart>1) then
Is this necessary? I thought `kpart>1` is guaranteed. Especially if we start the loop index from 1 and don't need to subtract 1 on line 609, then this can go.
There seems to be a mix of `.gt.` and `>`. We should decide which one to use. Going on a bit of a tangent here, but we should discuss coding style with Dave (we should have done so at the beginning, rather 🙈).
    if(allocated(b_rem)) deallocate(b_rem)
    if(a_b_c%parts%i_cc2node(ind_part)==myid+1) then
       lenb_rem = a_b_c%bmat(ipart)%part_nd_nabs
    else
       lenb_rem = a_b_c%comms%ilen3rec(ipart,nnode)
    end if
    allocate(b_rem(lenb_rem))
I wonder if we could get rid of this deallocation/reallocation business by just allocating to the max value and being explicit about indexing. I imagine that would have performance benefits as well. Let's put this up in a separate issue if it looks complicated.
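The allocate-once idea can be sketched as follows. This is a hypothetical Python illustration, not the module's code: size the buffer for the worst case up front and track the live length per partition, rather than deallocating and reallocating `b_rem` on every iteration.

```python
# Hypothetical sketch: allocate the receive buffer once at the worst-case
# size and track the live length, instead of dealloc/realloc per partition.
class RecvBuffer:
    def __init__(self, max_len):
        self.data = [0.0] * max_len    # allocated once, worst-case size
        self.n = 0                     # live length for current partition

    def use_for(self, lenb_rem):
        if lenb_rem > len(self.data):
            raise ValueError("buffer sized below worst case")
        self.n = lenb_rem              # no reallocation, just bookkeeping
        return self.data               # callers index only data[:self.n]

buf = RecvBuffer(max_len=1000)
buf.use_for(600)    # partition with lenb_rem = 600
buf.use_for(900)    # next partition reuses the same storage
```

The cost is that indexing must be explicit about the live length, which is the "being explicit about indexing" caveat above.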
    ! Set non-blocking receive flag
    do_nonb_local = .false.
    if (present(do_nonb)) do_nonb_local = do_nonb
Could we remove the `do_nonb` argument, and just check for `if(present(request))` here? Then you wouldn't have to check for that later. Is there a benefit to being able to handle a scenario where `do_nonb = .false.` gets passed in?
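The suggested simplification can be sketched like this. It is a hypothetical Python illustration (all names invented): infer non-blocking mode from whether the caller supplied a request slot, analogous to testing `present(request)` on a Fortran optional argument instead of carrying a separate `do_nonb` flag.

```python
# Hypothetical sketch: derive non-blocking mode from whether a request
# slot was supplied, analogous to present(request) in Fortran, so no
# separate do_nonb flag is needed.
def fetch_part(part, request=None):
    do_nonb = request is not None          # mode follows the argument
    if do_nonb:
        request.append(("irecv", part))    # post non-blocking receive
        return None                        # data arrives later, via a wait
    return ("recv", part)                  # blocking receive returns data

reqs = []
fetch_part(3, request=reqs)    # non-blocking path: request recorded
data = fetch_part(4)           # blocking path: data returned directly
```

The downside is that a caller can no longer pass a request slot while forcing blocking behaviour, which is the scenario the question above is probing.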
Conflicts: src/system/system.kathleen.make
There's no performance improvement seen; if anything there's a small degradation (see below). I think I understand why: the problem is the order in which communications are received, not the time they take.
Fixes #265