Overlap communication with computation in multiply_module #290
base: develop
Conversation
- Refactor do_comms/prefetch to use a flag for non-blocking comms
- Add omp barrier after calculating indices in loop
- Take into account local comms - they do not produce an MPI_Request
…ts run but produce wrong results.
From @davidbowler: compute is only being called on kpart [2 : end]. To fix, call the compute kernels on kpart-1, then call once after the loop on kpart.
…for only kpart=[2,end-1]. Remove the unneeded variable icall of prefetch in various places. The code now produces correct results for MPI nprocs=1 and nthreads=any, but deadlocks for more than 1 MPI process. Separating the MPI_Waits did not fix it.
- recv_part should be stored in one sequential list throughout the loop, not two, since it is used for the MPI_Irecv tags that have to match the MPI_Issend ones.
- The periodicity check has to check the previous partition (kpart-1), not kpart-2. Adapt a bunch of stuff accordingly. Simplify the way the computation is handled for the last partition (now inside the loop).
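The pipelined structure these commits describe can be sketched abstractly. The following is a hedged illustration in plain Python, not the actual multiply_module code: `prefetch`, `wait`, and `compute` are hypothetical stand-ins for the MPI_Irecv, MPI_Wait, and kernel calls, showing how the receive for kpart+1 is posted before computing kpart.

```python
# Hypothetical sketch (plain Python, not CONQUEST/Fortran) of the overlap
# pattern: the receive for the next partition is posted before computing
# the current one, so communication for kpart+1 proceeds during compute.
def process_partitions(partitions, prefetch, wait, compute):
    """prefetch() posts a non-blocking receive and returns a request;
    wait() completes that request and returns the partition data."""
    n = len(partitions)
    req = prefetch(partitions[0])                  # fetch the first partition
    for kpart in range(n):
        data = wait(req)                           # data for current partition
        if kpart + 1 < n:
            req = prefetch(partitions[kpart + 1])  # overlap the next receive
        compute(data)                              # compute while it arrives

# Minimal synchronous stand-ins for the MPI calls, to show the ordering:
log = []
process_partitions([10, 20, 30],
                   prefetch=lambda p: ("req", p),
                   wait=lambda r: r[1],
                   compute=log.append)
```

With these stand-ins every partition is computed exactly once, in order, and the compute for the final partition now happens inside the loop rather than after it.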
Compare: c297692 to bd1ef12
…asses locally though....
Using "-check bounds" when compiling showed that the index of recv_part did not start at 0 when it was not an allocatable in a function that used it.
I think this can be reviewed now. I'll produce some profiles to see if we gained anything when I'm back from the holidays; I don't think I'll have time tomorrow.
I did a pass of reviews, sorry about being a bit nitpicky in some places. It's Friday.
I would like to understand the performance better before we merge this in.
    !lenbind_rem = size(bind_rem)
    ierr = 0
    ilen1 = nc_part
    !if(3*ilen1+5*ilen2>lenbind_rem) call cq_abort('Get error ',3*ilen1+5*ilen2,lenbind_rem)
Suggested change: delete the line `!if(3*ilen1+5*ilen2>lenbind_rem) call cq_abort('Get error ',3*ilen1+5*ilen2,lenbind_rem)`
    ierr = 0
    ilen1 = nc_part
    !if(3*ilen1+5*ilen2>lenbind_rem) call cq_abort('Get error ',3*ilen1+5*ilen2,lenbind_rem)
    if(ilen3>lenb_rem) call cq_abort('Get error 2 ',ilen3,lenb_rem)
It seems like this check should happen inside the `if(ilen3.gt.0)` clause.
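The reviewer's point can be sketched in a few lines. This is a hypothetical Python illustration (the function name and interface are illustrative, not the module's actual API): the buffer-size check only matters when a receive will actually be posted, so it belongs inside the `ilen3 > 0` branch.

```python
# Hypothetical sketch: validate the receive buffer size only when a
# receive will actually happen (ilen3 > 0), mirroring moving the
# cq_abort call inside the if(ilen3.gt.0) clause.
def check_and_post_recv(ilen3, lenb_rem):
    if ilen3 > 0:                      # a receive will be posted
        if ilen3 > lenb_rem:           # buffer too small: hard error
            raise RuntimeError(f"Get error 2 {ilen3} {lenb_rem}")
        return True                    # safe to post the receive
    return False                       # nothing to receive; size is moot
```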
    !lenb_rem = size(b_rem)
    !lenbind_rem = size(bind_rem)
Suggested change: delete the lines `!lenb_rem = size(b_rem)` and `!lenbind_rem = size(bind_rem)`
    bind_rem,b_rem,lenb_rem,bind,&
    a_b_c%istart(ipart,nnode), &
    bmat(1)%mx_abs,parts%mx_mem_grp,tag)
    end if
    end if
I think removing `icall` here was an error that I reverted in #295, but it still seems to have made its way here. Unless you sorted out the logic?
src/multiply_module.f90 (Outdated)
    ! If that previous partition was a periodic one, copy over arrays from previous index
    if(.not.new_partition(index_comp)) then
       part_array(:,index_comp) = part_array(:,index_rec)
       n_cont(index_comp) = n_cont(index_rec)
       ilen2(index_comp) = ilen2(index_rec)
       b_rem(index_comp) = b_rem(index_rec)
       lenb_rem(index_comp) = lenb_rem(index_rec)
Have you looked at how often this happens? I'm slightly concerned about the performance cost of these data copies. Could this be handled differently, for example via an if branch or (shudder) pointers?
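One copy-free alternative can be sketched with an indirection table: the periodic image records which slot already holds its data instead of duplicating the part_array/b_rem-style buffers. This is a hypothetical Python illustration only; `store`, `fetch`, `slot_of`, and `buffers` are invented names, not anything in the module.

```python
# Hypothetical sketch: a periodic image shares the slot of its source
# partition via an indirection table, so no buffer copy is made.
slot_of = {}    # partition index -> slot that actually holds its data
buffers = {}    # slot -> data

def store(index, data, periodic_image_of=None):
    if periodic_image_of is not None:
        slot_of[index] = slot_of[periodic_image_of]   # alias, no copy
    else:
        slot_of[index] = index
        buffers[index] = data

def fetch(index):
    return buffers[slot_of[index]]

store(1, [1.0, 2.0])
store(2, None, periodic_image_of=1)    # periodic image aliases slot 1
```

The trade-off is that every reader must go through the indirection, which is roughly what Fortran pointers would buy, without the aliasing hazards being implicit.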
    end if
    !$omp barrier
    end do main_loop
    end do main_loop
Suggested change: remove the duplicated line, keeping a single `end do main_loop`
    logical, intent(in), optional :: do_nonb
    integer, intent(out), optional :: request(2)

    integer :: ind_part, ipart, nnode, offset
`offset` seems to be unused. Also I'm a bit confused about what offsets the comment in the docstring refers to here.
In general there are a huge number of unused variables in this module (by no means an isolated case; they're a plague throughout the code). Not necessarily worth doing in this PR, but whenever we are working on a source file, it's worth compiling with `-Wall` and removing the unused variables.
    new_partition = .true.

    ! Check if this is a periodic image of the previous partition
    if(kpart>1) then
Is this necessary? I thought `kpart>1` is guaranteed. Especially if we start the loop index from 1 and don't need to subtract 1 on line 609, then this can go.
There seems to be a mix of `.gt.` and `>`. We should decide which one to use. Going on a bit of a tangent here, but we should discuss coding style with Dave (we should have done so at the beginning, rather 🙈).
    if(allocated(b_rem)) deallocate(b_rem)
    if(a_b_c%parts%i_cc2node(ind_part)==myid+1) then
       lenb_rem = a_b_c%bmat(ipart)%part_nd_nabs
    else
       lenb_rem = a_b_c%comms%ilen3rec(ipart,nnode)
    end if
    allocate(b_rem(lenb_rem))
I wonder if we could get rid of this deallocation/reallocation business by just allocating to the max value and being explicit about indexing. I imagine that would have performance benefits as well. Let's put this up in a separate issue if it looks complicated.
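The allocate-once idea can be sketched as follows. This is a hypothetical Python illustration, not the module's code: size the buffer for the worst case up front and track the live length per partition, rather than deallocating and reallocating `b_rem` on every iteration.

```python
# Hypothetical sketch: allocate the receive buffer once at the worst-case
# size and track the live length, instead of dealloc/realloc per partition.
class RecvBuffer:
    def __init__(self, max_len):
        self.data = [0.0] * max_len    # allocated once, worst-case size
        self.n = 0                     # live length for current partition

    def use_for(self, lenb_rem):
        if lenb_rem > len(self.data):
            raise ValueError("buffer sized below worst case")
        self.n = lenb_rem              # no reallocation, just bookkeeping
        return self.data               # callers index only data[:self.n]

buf = RecvBuffer(max_len=1000)
buf.use_for(600)    # partition with lenb_rem = 600
buf.use_for(900)    # next partition reuses the same storage
```

The cost is that indexing must be explicit about the live length, which is the "being explicit about indexing" caveat above.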
    ! Set non-blocking receive flag
    do_nonb_local = .false.
    if (present(do_nonb)) do_nonb_local = do_nonb
Could we remove the `do_nonb` argument, and just check for `if(present(request))` here? Then you wouldn't have to check for that later. Is there a benefit to being able to handle a scenario where `do_nonb = .false.` gets passed in?
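The suggested simplification can be sketched like this. It is a hypothetical Python illustration (all names invented): infer non-blocking mode from whether the caller supplied a request slot, analogous to testing `present(request)` on a Fortran optional argument instead of carrying a separate `do_nonb` flag.

```python
# Hypothetical sketch: derive non-blocking mode from whether a request
# slot was supplied, analogous to present(request) in Fortran, so no
# separate do_nonb flag is needed.
def fetch_part(part, request=None):
    do_nonb = request is not None          # mode follows the argument
    if do_nonb:
        request.append(("irecv", part))    # post non-blocking receive
        return None                        # data arrives later, via a wait
    return ("recv", part)                  # blocking receive returns data

reqs = []
fetch_part(3, request=reqs)    # non-blocking path: request recorded
data = fetch_part(4)           # blocking path: data returned directly
```

The downside is that a caller can no longer pass a request slot while forcing blocking behaviour, which is the scenario the question above is probing.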
Conflicts: src/system/system.kathleen.make
There's no performance improvement seen; if anything there's a small degradation (see below). I think I understand why: the problem is the order in which communications are received, not the time they take.
Fixes #265