Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overlap communication with computation in multiply_module #290

Open
wants to merge 14 commits into
base: develop
Choose a base branch
from

Conversation

ilectra
Copy link
Contributor

@ilectra ilectra commented Nov 27, 2023

Fixes #265

@ilectra ilectra added area: main-source Relating to the src/ directory (main Conquest source code) improves: speed Speed-up of code labels Nov 27, 2023
@ilectra ilectra self-assigned this Nov 27, 2023
- Refactor do_comms/prefetch to use flag for non-blocking comms
- Add omp barrier after calculating indices in loop
- Take into account local comms - they do not produce an MPI_request
@tkoskela
Copy link
Contributor

tkoskela commented Dec 5, 2023

From @davidbowler: compute is only being called on kpart [2 : end]. To fix, call compute kernels on kpart -1 then call once after the loop on kpart.

…for only kpart=[2,end-1].

Get rid of not needed variable icall of prefetch, in various places.
Code now produces correct results for MPI nprocs=1 and nthreads=any, but deadlocks for more than 1 MPI procs. Separated the MPI_waits, but was not fixed.
- recv_part should be stored in one sequential list throughout the loop, not two, since it is used for the MPI_Irecv tags that have to match the MPI_Issend ones.
- the periodicity check has to check the previous partition (kpart-1), not kpart-2. Adapt a bunch of stuff accordingly.
Simplify the way the computation is taken care of for the last partition (now inside the loop).
Using \"-check bounds\" when compiling, showed that the index of recv_part did not start at 0 when it was not an allocatable in a function that used it.
@ilectra ilectra marked this pull request as ready for review December 20, 2023 17:04
@ilectra
Copy link
Contributor Author

ilectra commented Dec 20, 2023

I think this can be reviewed now. I'll produce some profiles, to see if we gained anything, when I'm back from the holidays, I don't think I'll have time tomorrow.

Copy link
Contributor

@tkoskela tkoskela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a pass of reviews, sorry about being a bit nitpicky in some places. It's Friday.

I would like to understand the performance better before we merge this in.

!lenbind_rem = size(bind_rem)
ierr = 0
ilen1 = nc_part
!if(3*ilen1+5*ilen2>lenbind_rem) call cq_abort('Get error ',3*ilen1+5*ilen2,lenbind_rem)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
!if(3*ilen1+5*ilen2>lenbind_rem) call cq_abort('Get error ',3*ilen1+5*ilen2,lenbind_rem)

ierr = 0
ilen1 = nc_part
!if(3*ilen1+5*ilen2>lenbind_rem) call cq_abort('Get error ',3*ilen1+5*ilen2,lenbind_rem)
if(ilen3>lenb_rem) call cq_abort('Get error 2 ',ilen3,lenb_rem)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this check should happen inside the if(ilen3.gt.0) clause

Comment on lines +278 to +279
!lenb_rem = size(b_rem)
!lenbind_rem = size(bind_rem)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
!lenb_rem = size(b_rem)
!lenbind_rem = size(bind_rem)

bind_rem,b_rem,lenb_rem,bind,&
a_b_c%istart(ipart,nnode), &
bmat(1)%mx_abs,parts%mx_mem_grp,tag)
end if
end if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing icall here was an error that I reverted in #295 but it still seems to have made its way here. Unless you sorted out the logic?

Comment on lines 255 to 261
! If that previous partition was a periodic one, copy over arrays from previous index
if(.not.new_partition(index_comp)) then
part_array(:,index_comp) = part_array(:,index_rec)
n_cont(index_comp) = n_cont(index_rec)
ilen2(index_comp) = ilen2(index_rec)
b_rem(index_comp) = b_rem(index_rec)
lenb_rem(index_comp) = lenb_rem(index_rec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you looked at how often this happens? I'm slightly concerned about the performance cost of these data copies. Could this be handled differently, for example via a if branch or shudder pointers?

end if
!$omp barrier
end do main_loop
end do main_loop
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
end do main_loop
end do main_loop

logical, intent(in), optional :: do_nonb
integer, intent(out), optional :: request(2)

integer :: ind_part, ipart, nnode, offset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offset seems to be unused. Also I'm a bit confused about what offsets the comment in the docstring refers to here.

In general there's a huge amount of unused variables in this module (by no means an isolated case, they're a plague throughout the code). Not necessarily worth doing in this PR, but whenever we are working on a source file, it's worth compiling with -Wall and removing the unused variables.

new_partition = .true.

! Check if this is a periodic image of the previous partition
if(kpart>1) then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? I thought kpart>1 is guaranteed.

Especially if we start the loop index from 1 and don't need to subtract 1 on line 609, then this can go

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a mix of use of .gt. and >. We should decide which one to use. Going on a bit of a tangent here, but we should discuss coding style with Dave (should have done at the beginning rather 🙈)

Comment on lines +619 to +625
if(allocated(b_rem)) deallocate(b_rem)
if(a_b_c%parts%i_cc2node(ind_part)==myid+1) then
lenb_rem = a_b_c%bmat(ipart)%part_nd_nabs
else
lenb_rem = a_b_c%comms%ilen3rec(ipart,nnode)
end if
allocate(b_rem(lenb_rem))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could get rid of this deallocation/reallocation business by just allocating to the max value and being explicit about indexing. I imagine that would have performance benefits as well. Let's put this up in a separate issue if it looks complicated.


! Set non-blocking receive flag
do_nonb_local = .false.
if (present(do_nonb)) do_nonb_local = do_nonb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove the do_nonb argument, and just check for if(present(request)) here? Then you wouldn't have to check for that later? Is there a benefit of being able to handle a scenario where do_nonb = .false. gets passed in?

@ilectra
Copy link
Contributor Author

ilectra commented Jun 17, 2024

There's no performance improvement seen, if anything there's a small degradation (see below). I think I understand why: the problem is the order communications are received, and not the time they take.
Screenshot from 2024-06-17 12-44-45
Will not merge for now, and instead investigate optimising the order in https://github.com/OrderN/CONQUEST-release/tree/ic-mm-comms-optimise-order . If that works, then we can revisit overlapping comms with computation, for further improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: main-source Relating to the src/ directory (main Conquest source code) improves: speed Speed-up of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants