Benchmarks #17

Open
szaghi opened this issue Sep 5, 2015 · 26 comments

Comments

@szaghi
Member

szaghi commented Sep 5, 2015

Here we can discuss the benchmarks analysis.

It could be useful to compare with the performance of other non-abstract implementations.

Architecture                | Comparison with other non-ACP | Conclusion
Serial                      |                               |
Parallel shared memory      |                               |
Parallel distributed memory |                               |
@szaghi szaghi added this to the Ready for publishing milestone Sep 5, 2015
@szaghi
Member Author

szaghi commented Sep 18, 2015

I have just added the first OpenMP test (v0.0.8); it seems to scale reasonably!

Note that not all parallelizable parts are actually parallelized yet.

More detailed analysis will come soon!

@szaghi
Member Author

szaghi commented Oct 16, 2015

I have made some progress with the OpenMP benchmarks. You can find a script for them under the paper sub-directory. In the paper I placed some figures for strong and weak scaling, which I also upload here. It seems that at 8 cores my workstation runs out of fuel...

[figure: strong scaling]

[figure: weak scaling]

On my workstation (Intel Xeon with 12 physical cores) the performance is not exciting, but the test was done very quickly and there could be some huge mistakes. For example, the "size" of the test could be meaningless (maybe it is too small and I spend more time creating the OpenMP threads than on the actual computations). I am almost sure that the MPI test will perform better. However, I need some help from the HPC gurus (@francescosalvadore, @muellermichel, anyone other than me...): the tricky point with OpenMP is operator overloading, e.g. see the multiply. To take advantage of the automatic lhs reallocation I have to split the operator implementations into something like a serial section and a threaded one:

  ! serial
  select type(opr)
  class is(euler_1D_openmp)
    opr = lhs
  endselect
  ! parallel
  !$OMP PARALLEL DEFAULT(NONE) PRIVATE(i) SHARED(lhs, rhs, opr)
  select type(opr)
  class is(euler_1D_openmp)
    !$OMP DO
    do i=1, lhs%Ni
      opr%U(:, i) = lhs%U(:, i) * rhs
    enddo
  endselect
  !$OMP END PARALLEL

This is because the first "serial" section is itself parallelized: it invokes the assignment operator, where I placed something like:

  !$OMP PARALLEL DEFAULT(NONE) PRIVATE(i) SHARED(lhs, rhs)
  ! select on the polymorphic right-hand side to access its U component
  select type(rhs)
  class is(euler_1D_openmp)
    !$OMP DO
    do i=1, lhs%Ni
      lhs%U(:, i) = rhs%U(:, i)
    enddo
  endselect
  !$OMP END PARALLEL

Because operator overloading is ubiquitous in FOODiE, could my bad OpenMP implementation be the reason for this bad performance? Obviously, we must ensure that the benchmark makes sense before answering this question, but I think your experience can already uncover the culprit...

The complete OpenMP test is here. I would very much appreciate your opinions on it. How would you implement it?

See you soon.

@szaghi
Member Author

szaghi commented Oct 17, 2015

Hi all,
just a remark: I realized that the weak scaling is not computed correctly. The Euler test has a variable time resolution that increases as the space resolution increases, according to the CFL condition... thus in the weak scaling, when the size increases, the number of steps performed increases as well! I have to modify the main program to fix the total number of time steps independently of the resolution, I am sorry.
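A minimal, self-contained sketch of the intended fix (hypothetical names and a dummy CFL-like time step; the real test will call the FOODIE integrators): march a fixed number of steps at every resolution instead of marching to a fixed final time.

  program fixed_steps_sketch
    ! weak scaling: keep the number of time steps identical at every resolution,
    ! even though the CFL-limited Dt shrinks as the grid is refined
    implicit none
    integer, parameter :: steps_max = 100   ! same for every grid size
    integer :: Ni, step
    real    :: Dt, t
    Ni = 1000                               ! cells per thread: grows in the weak-scaling test
    t  = 0.0
    do step = 1, steps_max
      Dt = 0.7/real(Ni)                     ! CFL-like step: smaller on finer grids
      ! ... advance the solution by Dt here (FOODIE integrator call in the real test) ...
      t = t + Dt
    end do
    print '(a,i0,a,f0.6)', 'steps: ', steps_max, '  final time: ', t
  end program fixed_steps_sketch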

For the moment consider only the strong scaling, which indeed is not so bad up to 8 cores.

@milancurcic
Contributor

@szaghi Thank you for making progress with this. I have been having some thoughts about the usefulness of parallel benchmarks for FOODiE. As you state in the manuscript draft, FOODiE is unaware of the parallelism implemented in the user's external program. Other than being a proof of concept, which is useful, what then is the purpose of parallel benchmarks? The parallelism is never implemented in the same dimension as the integration dimension, and is not part of the library itself, so these results do not tell me much about FOODiE.

What do you think?

@szaghi
Member Author

szaghi commented Oct 17, 2015

@milancurcic you are right, my intention is not clear.

The parallel benchmarks are aimed at proving that the ADT and ACP do not decrease or affect the performance when used within a parallel framework. Most of the criticism about OOP and abstraction in the Fortran community focuses on the performance issue: I would like to prove that Damian's theory is right, that a high level of abstraction, if used carefully, does not decrease performance.

As you said, FOODiE is unaware of parallelism, but I want to prove that it can be used in HPC codes without much concern.

I would like these benchmarks (MPI will come soon, but I would also like to try hybrid-fortran and coarrays) to tell you about FOODiE: "Milan, trust me, you can safely use me, I will not devastate your parallel scaling."

@milancurcic
Contributor

👍

@sourceryinstitute

If you are going parallel, I'd recommend getting started with coarray Fortran (CAF) early on and please, please, please don't go down the MPI path. There are so many reasons to choose CAF over MPI in a new project and I fear I'm not going to have time to write them all. If you guys ever think about setting up occasional teleconferences via Skype or Google Hangout, I would be glad to join and could probably provide guidance more efficiently in that forum than online.

I only have a moment and will try to type what I can fast.

First, I'm sure you know the saying, "MPI is the assembly language of parallel programming." I used to interpret that as a joke made at MPI's expense. Then I mentioned that joke to none other than Bill Gropp, who has led the development of MPI since its early days more than 20 years ago, and his response was (paraphrasing): "That's what we intended! We were originally targeting compilers, not applications. The problem is that it took so long for parallel languages to come along that MPI ended up in the hands of application developers." Well, CAF is now the most widely installed parallel language. Anyone who has a recent release of the GCC, Intel, or Cray compiler has CAF in their hands. There is no longer a good reason to write assembly language -- not even for performance, which I'll mention next.

On the OpenCoarrays project, we now have data showing CAF outperforming a code with raw MPI in the source even when OpenCoarrays wraps MPI. See, for example, the first recent article at http://www.opencoarrays.org/publications. Why is this the case? I'm glad you asked! :) It's because one way to do a fair comparison is to compare CAF to the MPI feature set used by every application developer with whom I've had this discussion. The OpenCoarrays CAF implementation exploits the one-sided "put" and "get" communication offered in MPI-3; whereas every application developer I've met is still using MPI's older two-sided send/receive communication. The one-sided communication often outperforms the two-sided communication on hardware with support for remote-direct memory access (RDMA), which includes Infiniband and the proprietary Cray/Intel interconnects. Based on the aforementioned article, it apparently also outperforms MPI in shared memory and when exploiting a GPU.
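As a minimal, self-contained illustration of the one-sided semantics (a toy sketch, not code from any of the projects mentioned here): the bracketed assignment below is a "put" into a neighbouring image's memory, and no matching receive is needed on the destination side.

  program caf_put_sketch
    implicit none
    real    :: halo(10)[*]        ! coarray: one copy on every image
    integer :: me, right
    me    = this_image()
    right = merge(1, me + 1, me == num_images())
    halo(:)[right] = real(me)     ! one-sided "put" to the neighbouring image
    sync all                      ! make sure every put has landed
    print '(a,i0,a,f0.1)', 'image ', me, ' received ', halo(1)
  end program caf_put_sketch

With OpenCoarrays such a program is typically compiled with the caf wrapper script and launched with cafrun.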

I'd say it might even be a good idea to start down the CAF path even before adding OpenMP to your code. I say this partly because CAF can work in either shared or distributed memory, whereas OpenMP only works in memory that is at least logically shared (though on some architectures, it might be physically distributed). I would add OpenMP in a secondary stage of performance tuning.

Also, I'd love to hear about your experience if you take a look at DO CONCURRENT. It can replace some of the simpler loop-level OpenMP directives, but it can also be used in conjunction with OpenMP. Either way, I'd start by fully exploiting Fortran's own capabilities before adding directives that are external to the language. It will greatly aid the portability, clarity, and generality of your code -- partly because it will offload some decisions to the compiler that the compiler might be better qualified to make. For example, sometimes the compiler might decide not to multithread a loop if another technology, such as loop unrolling or vectorization, is better suited to that particular loop. With OpenMP, you're forcing one technology (multithreading) on the compiler to the exclusion of other options.
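As a toy, self-contained sketch of that kind of loop (plain arrays standing in for the U component of the snippets earlier in this thread), DO CONCURRENT asserts that the iterations are independent and leaves the choice of multithreading, vectorization, or unrolling to the compiler:

  program multiply_do_concurrent
    implicit none
    integer, parameter :: Ni = 1000
    real    :: U(3, Ni), prod(3, Ni)
    real    :: rhs
    integer :: i
    call random_number(U)
    rhs = 0.5
    ! no directive needed: the compiler chooses how (and whether) to parallelize
    do concurrent(i = 1:Ni)
      prod(:, i) = U(:, i)*rhs
    end do
    print *, 'sample column: ', prod(:, 1)
  end program multiply_do_concurrent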

Parallelization using CAF is one area where I'd love to help although I'd still probably help in more of an advisory mode than in a coding mode for the near future.

@szaghi
Member Author

szaghi commented Oct 18, 2015

@sourceryinstitute ok, you have convinced me! I will start with CAF. However, I have never done it before... I need help. Can you give me some references (books, papers, examples)? I will surely start with OpenCoarrays, but I have not yet found the time to study it: are there (in OpenCoarrays) references to learn CAF?

Thank you very much for your help!

@szaghi
Member Author

szaghi commented Oct 18, 2015

The OpenCoarrays CAF implementation exploits the one-sided "put" and "get" communication offered in MPI-3; whereas every application developer I've met is still using MPI's older two-sided send/receive communication.

Very, very interesting! I am one of those addicted to the simple send/receive pattern (the simplest approach for my cases). I am very excited to learn and try OpenCoarrays!

I'd say it might even be a good idea to start down the CAF path even before adding OpenMP to your code. I say this partly because CAF can work in either shared or distributed memory, whereas OpenMP only works in memory that is at least logically shared (though on some architectures, it might be physically distributed). I would add OpenMP in a secondary stage of performance tuning.

Indeed, I planned to do a comprehensive test suite for the parallel benchmarks:

  • serial, non-abstract, procedural version without overloaded operators (almost completed, but not yet tested and uploaded to GitHub);
  • OpenMP FOODiE-aware (completed in a few minutes... it was difficult for you to stop me in time :-), tested and uploaded);
  • CAF FOODiE-aware (I must learn CAF and OpenCoarrays before trying to implement it, the timeline is not clear);
  • MPI FOODiE-aware, now frozen until the CAF/OpenCoarrays version is completed;
  • CUDA/hybrid-fortran FOODiE-aware, delayed to the near future.

See you tomorrow.

@sourceryinstitute

Of the two application developers I've talked to who investigated switching from two-sided send/receive to one-sided put/get, one concluded that it would require more time than he could invest and the other attempted to make the switch but found it too complicated. This is the real value of OpenCoarrays: we do the puts and gets for you and save you a lot of the hassle. You get to just use a syntax and semantics that feels like a very natural extension of prior versions of Fortran.

I have a collection of tutorial videos online at http://www.sourceryinstitute.org/videos.html. The total amount of content is 1 hour and is intended more as a rapid overview that takes the viewer from the basic concepts up to writing an object-oriented PDE solver using a "functional programming style", which is how I more often refer to the Abstract Calculus pattern now. I do that because more people are familiar with functional programming than with patterns and far more than are familiar with the Abstract Calculus pattern as described in my book.

@sourceryinstitute

The aforementioned videos pre-date OpenCoarrays and therefore use the Intel compiler. I would like to produce a new video focused on how to use OpenCoarrays, but it requires a significant investment of time. The previous videos were produced and edited by a film professional when I was working at Stanford full-time last year so it will be challenging to match the quality.

Also, the previous videos move through the material quite fast so they are intended more as a teaser to show enough capability to get people interested in studying further. Most modern Fortran books include some coarray material and my book has a small amount of coarray material in Chapter 12. I also frequently teach tutorials at conferences and other locations. In case you're attending the SC15 supercomputing conference next month, I'll co-teach a full-day tutorial there that will focus primarily on CAF: http://sc15.supercomputing.org/schedule/event_detail?evid=tut103.

@szaghi
Member Author

szaghi commented Oct 18, 2015

Of the two application developers I've talked to who investigated switching from two-sided send/receive to one-sided put/get, one concluded that it would require more time than he could invest and the other attempted to make the switch but found it too complicated. This is the real value of OpenCoarrays: we do the puts and gets for you and save you a lot of the hassle. You get to just use a syntax and semantics that feels like a very natural extension of prior versions of Fortran.

I agree, it is a great plus of OpenCoarrays to make this kind of conversion easy: I never tried to move away from send/receive because I feel it is a hard transition which requires a lot of time. I cannot wait for tomorrow's lunch break to start learning CAF and OpenCoarrays!

I have a collection of tutorial videos online at http://www.sourceryinstitute.org/videos.html. The total amount of content is 1 hour and is intended more as a rapid overview that takes the viewer from the basic concepts up to writing an object-oriented PDE solver using a "functional programming style", which is how I more often refer to the Abstract Calculus pattern now. I do that because more people are familiar with functional programming than with patterns and far more than are familiar with the Abstract Calculus pattern as described in my book.

Wonderful! I will try to watch them when my family does not watch me!

Unfortunately, my institute does not allow me to attend conferences that are not focused on hydrodynamics... thus I cannot attend your conference :-(

@szaghi
Member Author

szaghi commented Oct 18, 2015

I just read my first CAF example... it is a quantum leap with respect to MPI! It is extremely beautiful!

I can write my first observation: even in the worst case where, under some special obscure circumstances, some MPI implementations are faster than a hypothetical equivalent CAF (which, from Damian's experience, I guess is a very unrealistic scenario), the conciseness, clearness and Fortranish style of the CAF syntax overcome such shortcomings. Modern Fortran is fantastic!

@rouson

rouson commented Oct 18, 2015

That's music to my ears. I have often thought that once someone sees the clarity and conciseness of CAF, it would be very hard to go back to writing MPI.

There are of course many caveats, but I think they will become less important over time:

  1. If the CAF implementation wraps MPI and a comparison is made to a code that has equivalent MPI in the source, then of course CAF can't win. At some point in the OpenCoarrays project, however, we realized that the scenario of comparing the MPI we generate from CAF to source code with the same MPI calls is not the relevant scenario. The more relevant comparison is between the MPI we generate and the MPI most people write.
  2. Getting one-sided communication to outperform two-sided communication requires
    (a) Hardware support such as RDMA over Infiniband or a proprietary Intel/Cray interconnect and
    (b) An MPI implementation tuned to take advantage of that hardware (this is not yet universal but should become common over time).
    An example of hardware that does not support one-sided communication is Ethernet (there has been some work on supporting it over Ethernet, but it's likely to be expensive, and it's hard for expensive Ethernet to compete with Infiniband). In the case of garden-variety Ethernet, the MPI implementation will likely emulate one-sided communication using two-sided communication, in which case there is no advantage.
  3. It is not necessary that MPI be the communication library that supports CAF. OpenCoarrays also contains a library, no longer maintained but possibly revived in the future, that uses GASNet (http://gasnet.lbl.gov) to support CAF. CAF is a Partitioned Global Address Space (PGAS) language, GASNet is designed to support PGAS languages, and on some systems GASNet has a "conduit" (software) that outperforms MPI.

I guess item 3 isn't really a caveat. It's actually one of the coolest things about CAF. Besides liberating the application developer from embedding raw MPI calls in his or her source code, CAF can also liberate him or her from even having MPI under the hood. Using OpenCoarrays, the exact same CAF source can be linked to MPI or GASNet without modifying even one line of the CAF source (assuming our unmaintained GASNet layer still works). Which communication library gets used simply becomes a build-time decision.

I know one very prominent researcher who would very much like to see C++ replace Fortran. Yet even that researcher says he wishes coarrays would become part of the C++ standard (there are coarray C++ compilers out there, but I've heard no indication that the coarray feature is likely to make its way into the C++ standard any time soon). As far as I know, Fortran remains the only language backed by an international standard that supports distributed-memory parallel programming in the language standard.

@rouson

rouson commented Oct 18, 2015

I see there's some auto-formatting that is changing some of my posts. I just corrected a numbering mistake in the last post. I guess the new version appears online but doesn't get emailed.

@szaghi
Member Author

szaghi commented Oct 18, 2015

@rouson

  1. Yes, you are definitely right. The comparison should be done with respect to a real scenario where a bad programmer like me is not able to write the best MPI implementation; on the contrary, the same bad programmer can take advantage of OpenCoarrays, which ensures a high-quality low-level parallel back-end... really amazing!
  2. You are right, I guess;
  3. Yes, emancipating from the MPI monopoly is a good thing.

All of this is very cool news for me: Fortran rocks!

P.S. The time is ripe for a Navier-Stokes CAF-based solver... why have you, Fanfarillo, Filippone & Co. not done it yet?

@rouson

rouson commented Oct 18, 2015

On Oct 18, 2015, at 1:32 PM, Stefano Zaghi [email protected] wrote:

P.S. The time is ripe for a Navier-Stokes CAF-based solver... why have you, Fanfarillo, Filippone & Co. not done it yet?

We are. :) It will eventually be open-source, but not just yet. We are envisioning a pretty comprehensive framework with a range of options for spatial discretization, including spectral methods, compact finite difference methods, and finite volume methods. The framework should also support a range of physics, including at least Lagrangian particle tracking, fluid dynamics, and magnetohydrodynamics. Several of these capabilities exist in some form, but either need updated software design (using the latest language features) or more robust numerics. If any of you are interested in contributing, let me know and I’ll keep you in mind when we invite others into the project. For the parts of it that are the most mature, it shouldn’t be very far into the future (hopefully months).

Damian

@rouson

rouson commented Oct 18, 2015

P.S. And I think it's quite likely that FOODiE/FOODIE could play a role quite soon. I'm thinking of using it for the Lagrangian particle tracking module. We believe that CAF will shine most on problems that require dynamic load balancing such as particle tracking.

@milancurcic
Contributor

@rouson Damian, your project sounds very interesting and I will definitely be interested in joining, time permitting. Please feel free to reach out when you are ready.

I am slowly reading and digesting all your messages on CAF and I can say I feel I am being won over as well. All parallel aspects of my PhD work were written using MPI, and yes, I too used simple send/receive calls to communicate. I remember writing a simple gather/scatter example program in CAF some time in 2010 or '11, after John Reid's paper came out. If I recall correctly, I could only get it to work with the Intel Fortran compiler + MPICH, and the coarray support seemed lacking in general. So then I put it on the shelf and forgot about it, as it was the time of my PhD when I had to just make things work and move on. Your e-mails convince me that now is a different time from 5 years ago (I am not surprised).

That being said, I am currently working on an ocean wave modeling framework based on the model I developed during my PhD. Existing parallelism is pure MPI, but now I feel it may be time to switch gears in the CAF direction. :)

Thanks,
milan

@szaghi
Member Author

szaghi commented Oct 19, 2015

@rouson I am here! I am absolutely interested, but I doubt I can be of any help to you, I am not at your level; however, I could do some oompa loompa work :-)

My aim, after FOODIE is stable and many solvers have been implemented (embedded RK in particular), is to develop other small, KISS libraries for:

  • multigrid;
  • WENO interpolation (hybridized with compact Padé-like formulae);
  • AMR;
  • mesh overlapping (chimera) for moving grids and complex geometries;
  • immersed boundaries (as an alternative to chimera for some problems);
  • multi-phase models by VOF and/or level set;
  • particle tracking (Lagrangian methods);
  • a lot of other libraries...

All this stuff will be free and collaborative. I hope to interest at least @andreadimascio, who was my mentor 4/5 years ago (now I am not allowed to work with him, but I hope that some virtual connections do not hurt my Institute); he has already faced all the above problems and succeeded... however, as is happening for FOODIE, I hope to learn a lot from all of you.

I will be very happy to study your work, e.g. I could attend any conferences/meetings/lessons at Tor Vergata University.

@giacrossi
Contributor

@rouson Dear Damian, I'm very interested in your project, but if @szaghi says that he isn't at your level, I can assert that I'm far, far away from your experience!!! :-D

I've always used MPI for all parallel aspects of my codes, but CAF seems a very good improvement (although I think that improvement is not the correct word: it's a totally new world!!!).

I'm starting to follow your abstract data type paradigm for my first GitHub project, and I've read all your very, very interesting discussion about FOODIE (the new acronym is very good) and the numerical schemes that can be implemented in the near future.

Like @szaghi, my aim is to develop other small libraries, in particular for WENO interpolation, mesh overlapping and multi-phase models...

I hope to learn a lot from all of you!!!

See you soon!

@szaghi
Member Author

szaghi commented Oct 19, 2015

I have added the 1D Euler OpenMP-enabled test solved without FOODIE.

The results seem good

[figure: strong scaling comparison]
[figure: weak scaling comparison]

For details see the manuscript.

possible issue

@francescosalvadore made me aware that my analysis could be "partial": to verify whether a code scales satisfactorily, the reference serial baseline code should be highly optimized (in particular I have to check whether the compiler was able to vectorize the baseline code). I have used the same -O2 compilation flag (with gfortran 5.2.0) for all codes (serial-FOODIE, OpenMP-FOODIE, serial-without-FOODIE and OpenMP-without-FOODIE) without checking the actual level of optimization the compiler produced. Maybe the serial codes are not well optimized, and thus the scaling results look so good... Anyway, this only slightly affects my aim: I just want to prove that FOODIE does not destroy the scaling, not that this test scales well in an absolute sense...

@milancurcic
Contributor

@rouson @szaghi

If you guys ever think about setting up occasional teleconferences via Skype or Google Hangout, I would be glad to join and could probably provide guidance more efficiently in that forum than online.

Sorry I forgot about this. Stefano, it would be a good idea if we get together on Google Hangouts to pick Damian's brain and discuss the most immediate concerns:

  • Scope and target audience for FOODIE
  • Parallelism and moving forward with CAF
  • Manuscript layout/content/ideas
  • Other issues that may come up

It should be short, no longer than 30-45 minutes. Perhaps we can agree on some time in November? Any day Monday through Thursday, 8AM-6PM EST would work for me.

@szaghi
Member Author

szaghi commented Oct 25, 2015

@milancurcic and all,
yes, you are absolutely right, we can try. However, I am not so sure I will be able to hold a spoken conversation with you, but I can try. Let me know when you are available. See you soon.

@milancurcic
Contributor

@szaghi Great! I am sure we will be able to talk :)

@rouson Damian, given that you are likely the busiest of us here, can you pick one day in November that you think will work for you to meet with us for an hour? Morning hours would be ideal since Stefano is in Italy.

@zbeekman It would be great if you can join as well.

Thank you all.

@szaghi
Member Author

szaghi commented Oct 26, 2015

@milancurcic and all,
It is very embarrassing for me... I should be the freest here, but my director has just made me aware that on 3rd November I must (should) go to Genova. Indeed, if you plan to meet in the morning, I am confident I can be online up to 9:00 am (Rome time). Aside from 3rd November (and 2nd November after 14:00 if I have to travel to Genova), I should be free all month.

See you soon.

P.S. I think that Zaak is now very busy with his Ph.D. defense: maybe November is not a good month to disturb his attention. I think we should cheer for Zaak and wait until after November for his opinions. Obviously, I will be very happy if he can help us earlier, but I do not want to disturb his concentration.
