MVAPICH2 Changelog
------------------
This file briefly describes the changes to the MVAPICH2 software
package. The logs are arranged in the "most recent first" order.
MVAPICH2-2.1rc1 (12/18/2014)
* Features and Enhancements (since 2.1a):
- Based on MPICH-3.1.3
- Flexibility to use internal communication buffers of different size for
improved performance and memory footprint
- Improve communication performance by removing locks from critical path
- Enhanced communication performance for small/medium message sizes
- Support for linking Intel Trace Analyzer and Collector
- Increase the number of connect retry attempts with RDMA_CM
- Automatic detection and tuning for Haswell architecture
* Bug-Fixes (since 2.1a):
- Fix automatic detection of support for atomics
- Fix issue with void pointer arithmetic with PGI
- Fix deadlock in ctxidup MPICH test in PSM channel
- Fix compile warnings
MVAPICH2-2.1a (09/21/2014)
* Features and Enhancements (since 2.0):
- Based on MPICH-3.1.2
- Support for PMI-2 based startup with SLURM
- Enhanced startup performance for Gen2/UD-Hybrid channel
- GPU support for MPI_Scan and MPI_Exscan collective operations
- Optimize creation of 2-level communicator
- Collective optimization for PSM-CH3 channel
- Tuning for IvyBridge architecture
- Add -export-all option to mpirun_rsh
- Support for additional MPI-T performance variables (PVARs)
in the CH3 channel
- Link with libstdc++ when building with GPU support
(required by CUDA 6.5)
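The new -export-all option can be sketched with a launch line like the
following (only the flag name comes from this entry; the process count,
hostfile, and exported variable are placeholders):

```shell
# Forward the caller's environment to every rank via mpirun_rsh.
# OMP_NUM_THREADS is just an example of a variable one might want exported.
OMP_NUM_THREADS=4 mpirun_rsh -export-all -np 8 -hostfile hosts ./app
```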
* Bug-Fixes (since 2.0):
- Fix error in large message (>2GB) transfers in CMA code path
- Fix memory leaks in OFA-IB-CH3 and OFA-IB-Nemesis channels
- Fix issues with optimizations for broadcast and reduce collectives
- Fix hang at finalize with Gen2-Hybrid/UD channel
- Fix issues for collectives with non power-of-two process counts
- Thanks to Evren Yurtesen for identifying the issue
- Make ring startup use HCA selected by user
- Increase counter length for shared-memory collectives
MVAPICH2-2.0 (06/20/2014)
* Features and Enhancements (since 2.0rc2):
- Consider CMA in collective tuning framework
* Bug-Fixes (since 2.0rc2):
- Fix bug when disabling registration cache
- Fix shared memory window bug when shared memory collectives are disabled
- Fix mpirun_rsh bug when running mpmd programs with no arguments
MVAPICH2-2.0rc2 (05/25/2014)
* Features and Enhancements (since 2.0rc1):
- CMA support is now enabled by default
- Optimization of collectives with CMA support
- RMA optimizations for shared memory and atomic operations
- Tuning RGET and Atomics operations
- Tuning RDMA FP-based communication
- MPI-T support for additional performance and control variables
- The --enable-mpit-pvars=yes configuration option will now
enable only MVAPICH2-specific variables
- Large message transfer support for PSM interface
- Optimization of collectives for PSM interface
- Updated to hwloc v1.9
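A build-time sketch of the configuration option mentioned above (the
option name comes from this entry; the install prefix and job count are
illustrative):

```shell
# Configure MVAPICH2 with the MVAPICH2-specific MPI-T performance
# variables enabled, then build and install.
./configure --prefix=/opt/mvapich2-2.0rc2 --enable-mpit-pvars=yes
make -j8 && make install
```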
* Bug-Fixes (since 2.0rc1):
- Fix multicast hang when there is a single process on one node
and more than one process on other nodes
- Fix non-power-of-two usage of scatter-doubling-allgather algorithm
- Fix for bcastzero type hang during finalize
- Enhanced handling of failures in RDMA_CM based
connection establishment
- Fix for a hang in finalize when using RDMA_CM
- Finish receive request when RDMA READ completes in RGET protocol
- Always use direct RDMA when flush is used
- Fix compilation error with --enable-g=all in PSM interface
- Fix warnings and memory leaks
MVAPICH2-2.0rc1 (03/24/2014)
* Features and Enhancements (since 2.0b):
- Based on MPICH-3.1
- Enhanced direct RDMA based designs for MPI_Put and MPI_Get operations in
OFA-IB-CH3 channel
- Optimized communication when using MPI_Win_allocate for OFA-IB-CH3
channel
- MPI-3 RMA support for CH3-PSM channel
- Multi-rail support for UD-Hybrid channel
- Optimized and tuned blocking and non-blocking collectives for OFA-IB-CH3,
OFA-IB-Nemesis, and CH3-PSM channels
- Improved hierarchical job startup performance
- Optimized sub-array data-type processing for GPU-to-GPU communication
- Tuning for Mellanox Connect-IB adapters
- Updated hwloc to version 1.8
- Added options to specify CUDA library paths
- Deprecation of uDAPL-CH3 channel
* Bug-Fixes (since 2.0b):
- Fix issues related to MPI-3 RMA locks
- Fix an issue related to MPI-3 dynamic window
- Fix issues related to MPI_Win_allocate backed by shared memory
- Fix issues related to large message transfers for OFA-IB-CH3 and
OFA-IB-Nemesis channels
- Fix warning in job launch, when using DPM
- Fix an issue related to MPI atomic operations on HCAs without atomics
support
- Fix an issue related to compiler selection (the GNU, Intel, PGI, and
Ekopath compilers are preferred, in that order)
- Thanks to Uday R Bondhugula from IISc for the report
- Fix an issue in message coalescing
- Prevent printing out inter-node runtime parameters for pure intra-node
runs
- Thanks to Jerome Vienne from TACC for the report
- Fix an issue related to ordering of messages for GPU-to-GPU transfers
- Fix a few memory leaks and warnings
MVAPICH2-2.0b (11/08/2013)
* Features and Enhancements (since 2.0a):
- Based on MPICH-3.1b1
- Multi-rail support for GPU communication
- Non-blocking streams in asynchronous CUDA transfers for better overlap
- Initialize GPU resources only when used by MPI transfer
- Extended support for MPI-3 RMA in OFA-IB-CH3, OFA-IWARP-CH3, and
OFA-RoCE-CH3
- Additional MPIT counters and performance variables
- Updated compiler wrappers to remove application dependency on network and
other extra libraries
- Thanks to Adam Moody from LLNL for the suggestion
- Capability to checkpoint CH3 channel using the Hydra process manager
- Optimized support for broadcast, reduce and other collectives
- Tuning for IvyBridge architecture
- Improved launch time for large-scale mpirun_rsh jobs
- Introduced retry mechanism in mpirun_rsh for socket binding
- Updated hwloc to version 1.7.2
* Bug-Fixes (since 2.0a):
- Consider list provided by MV2_IBA_HCA when scanning device list
- Fix issues in Nemesis interface with --with-ch3-rank-bits=32
- Better cleanup of XRC files in corner cases
- Initialize using better defaults for ibv_modify_qp (initial ring)
- Add unconditional check and addition of pthread library
- MPI_Get_library_version updated with proper MVAPICH2 branding
- Thanks to Jerome Vienne from TACC for the report
MVAPICH2-2.0a (08/24/2013)
* Features and Enhancements (since 1.9):
- Based on MPICH-3.0.4
- Dynamic CUDA initialization. Support GPU device selection after MPI_Init
- Support for running on heterogeneous clusters with GPU and non-GPU nodes
- Supporting MPI-3 RMA atomic operations and flush operations with CH3-Gen2
interface
- Exposing internal performance variables to MPI-3 Tools information
interface (MPIT)
- Enhanced MPI_Bcast performance
- Enhanced performance for large message MPI_Scatter and MPI_Gather
- Enhanced intra-node SMP performance
- Tuned SMP eager threshold parameters
- Reduced memory footprint
- Improved job-startup performance
- Warn and continue when ptmalloc fails to initialize
- Enable hierarchical SSH-based startup with Checkpoint-Restart
- Enable the use of Hydra launcher with Checkpoint-Restart
* Bug-Fixes (since 1.9):
- Fix data validation issue with MPI_Bcast
- Thanks to Claudio J. Margulis from University of Iowa for the report
- Fix buffer alignment for large message shared memory transfers
- Fix a bug in One-Sided shared memory backed windows
- Fix a flow-control bug in UD transport
- Thanks to Benjamin M. Auer from NASA for the report
- Fix bugs with MPI-3 RMA in Nemesis IB interface
- Fix issue with very large message (>2GB) MPI_Bcast
- Thanks to Lu Qiyue for the report
- Handle case where $HOME is not set during search for MV2 user config file
- Thanks to Adam Moody from LLNL for the patch
- Fix a hang in connection setup with RDMA-CM
MVAPICH2-1.9 (05/06/2013)
* Features and Enhancements (since 1.9rc1):
- Updated to hwloc v1.7
- Tuned Reduce, AllReduce, Scatter, Reduce-Scatter and
Allgatherv Collectives
* Bug-Fixes (since 1.9rc1):
- Fix cuda context issue with async progress thread
- Thanks to Osuna Escamilla Carlos from env.ethz.ch for the report
- Overwrite pre-existing PSM environment variables
- Thanks to Adam Moody from LLNL for the patch
- Fix several warnings
- Thanks to Adam Moody from LLNL for some of the patches
MVAPICH2-1.9rc1 (04/16/2013)
* Features and Enhancements (since 1.9b):
- Based on MPICH-3.0.3
- Updated SCR to version 1.1.8
- Install utility scripts included with SCR
- Support for automatic detection of path to utilities used by mpirun_rsh
during configuration
- Utilities supported: rsh, ssh, xterm, totalview
- Support for launching jobs on heterogeneous networks with mpirun_rsh
- Tuned Bcast, Reduce, Scatter Collectives
- Tuned MPI performance on Kepler GPUs
- Introduced MV2_RDMA_CM_CONF_FILE_PATH parameter which specifies path to
mv2.conf
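Usage of the new parameter might look like the following (the parameter
name comes from this entry; the path, process count, and hostfile are
placeholders):

```shell
# Point MVAPICH2 at a site-specific mv2.conf instead of the default
# location (illustrative path).
export MV2_RDMA_CM_CONF_FILE_PATH=/opt/site/etc
mpirun_rsh -np 4 -hostfile hosts ./app
```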
* Bug-Fixes (since 1.9b):
- Fix autoconf issue with LiMIC2 source-code
- Thanks to Doug Johnson from OH-TECH for the report
- Fix build errors with --enable-thread-cs=per-object and
--enable-refcount=lock-free
- Thanks to Marcin Zalewski from Indiana University for the report
- Fix MPI_Scatter failure with MPI_IN_PLACE
- Thanks to Mellanox for the report
- Fix MPI_Scatter failure with cyclic host files
- Fix deadlocks in PSM interface for multi-threaded jobs
- Thanks to Marcin Zalewski from Indiana University for the report
- Fix MPI_Bcast failures in SCALAPACK
- Thanks to Jerome Vienne from TACC for the report
- Fix build errors with newer Ekopath compiler
- Fix a bug with shmem collectives in PSM interface
- Fix memory corruption when more entries specified in mv2.conf than the
requested number of rails
- Thanks to Akihiro Nomura from Tokyo Institute of Technology for the
report
- Fix memory corruption with CR configuration in Nemesis interface
MVAPICH2-1.9b (02/28/2013)
* Features and Enhancements (since 1.9a2):
- Based on MPICH-3.0.2
- Support for all MPI-3 features
- Support for single copy intra-node communication using Linux supported
CMA (Cross Memory Attach)
- Provides flexibility for intra-node communication: shared memory,
LiMIC2, and CMA
- Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR)
- Support for application-level checkpointing
- Support for hierarchical system-level checkpointing
- Improved job startup time
- Provided a new runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized
startup on homogeneous clusters
- New version of LiMIC2 (v0.5.6)
- Provides support for unlocked ioctl calls
- Tuned Reduce, Allgather, Reduce_Scatter, Allgatherv collectives
- Introduced option to export environment variables automatically with
mpirun_rsh
- Updated to hwloc v1.6.1
- Provided option to use CUDA library call instead of CUDA driver to check
buffer pointer type
- Thanks to Christian Robert from Sandia for the suggestion
- Improved debug messages and error reporting
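The new MV2_HOMOGENEOUS_CLUSTER variable could be used at launch time
like this (the variable name comes from this entry; process count and
hostfile are placeholders):

```shell
# Skip heterogeneity detection at startup on a cluster known to have
# identical nodes, for faster job launch.
MV2_HOMOGENEOUS_CLUSTER=1 mpirun_rsh -np 128 -hostfile hosts ./app
```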
* Bug-Fixes (since 1.9a2):
- Fix page fault with memory access violation with LiMIC2 exposed by newer
Linux kernels
- Thanks to Karl Schulz from TACC for the report
- Fix a failure when lazy memory registration is disabled and CUDA is
enabled
- Thanks to Jens Glaser from University of Minnesota for the report
- Fix an issue with variable initialization related to DPM support
- Rename a few internal variables to avoid name conflicts with external
applications
- Thanks to Adam Moody from LLNL for the report
- Check for libattr during configuration when Checkpoint/Restart and
Process Migration are requested
- Thanks to John Gilmore from Vastech for the report
- Fix build issue with --disable-cxx
- Set intra-node eager threshold correctly when configured with LiMIC2
- Fix an issue with MV2_DEFAULT_PKEY in partitioned InfiniBand network
- Thanks to Jesper Larsen from FCOO for the report
- Improve makefile rules to use automake macros
- Thanks to Carmelo Ponti from CSCS for the report
- Fix configure error with automake conditionals
- Thanks to Evren Yurtesen from Abo Akademi for the report
- Fix a few memory leaks and warnings
- Properly cleanup shared memory files (used by XRC) when applications fail
MVAPICH2-1.9a2 (11/08/2012)
* Features and Enhancements (since 1.9a):
- Based on MPICH2-1.5
- Initial support for MPI-3:
(Available for all interfaces: OFA-IB-CH3, OFA-IWARP-CH3, OFA-RoCE-CH3,
uDAPL-CH3, OFA-IB-Nemesis, PSM-CH3)
- Nonblocking collective functions available as "MPIX_" functions
(e.g., "MPIX_Ibcast")
- Neighborhood collective routines available as "MPIX_" functions
(e.g., "MPIX_Neighbor_allgather")
- MPI_Comm_split_type function available as an "MPIX_" function
- Support for MPIX_Type_create_hindexed_block
- Nonblocking communicator duplication routine MPIX_Comm_idup (will
only work for single-threaded programs)
- MPIX_Comm_create_group support
- Support for matched probe functionality (e.g., MPIX_Mprobe,
MPIX_Improbe, MPIX_Mrecv, and MPIX_Imrecv),
(Not Available for PSM)
- Support for "Const" (disabled by default)
- Efficient vector, hindexed datatype processing on GPU buffers
- Tuned alltoall, Scatter and Allreduce collectives
- Support for Mellanox Connect-IB HCA
- Adaptive number of registration cache entries based on job size
- Revamped Build system:
- Uses automake instead of simplemake,
- Allows for parallel builds ("make -j8" and similar)
* Bug-Fixes (since 1.9a):
- CPU frequency mismatch warning shown under debug
- Fix issue with MPI_IN_PLACE buffers with CUDA
- Fix ptmalloc initialization issue due to compiler optimization
- Thanks to Kyle Sheumaker from ACT for the report
- Adjustable MAX_NUM_PORTS at build time to support more than two ports
- Fix issue with MPI_Allreduce with MPI_IN_PLACE send buffer
- Fix memleak in MPI_Cancel with PSM interface
- Thanks to Andrew Friedley from LLNL for the report
MVAPICH2-1.9a (09/07/2012)
* Features and Enhancements (since 1.8):
- Support for InfiniBand hardware UD-multicast
- UD-multicast-based designs for collectives
(Bcast, Allreduce and Scatter)
- Enhanced Bcast and Reduce collectives with pt-to-pt communication
- LiMIC-based design for Gather collective
- Improved performance for shared-memory-aware collectives
- Improved intra-node communication performance with GPU buffers
using pipelined design
- Improved inter-node communication performance with GPU buffers
with non-blocking CUDA copies
- Improved small message communication performance with
GPU buffers using CUDA IPC design
- Improved automatic GPU device selection and CUDA context management
- Optimal communication channel selection for different
GPU communication modes (DD, DH and HD) in different
configurations (intra-IOH and inter-IOH)
- Removed libibumad dependency for building the library
- Option for selecting non-default gid-index in a loss-less
fabric setup in RoCE mode
- Option to disable signal handler setup
- Tuned thresholds for various architectures
- Set DAPL-2.0 as the default version for the uDAPL interface
- Updated to hwloc v1.5
- Option to use IP address as a fallback if hostname
cannot be resolved
- Improved error reporting
* Bug-Fixes (since 1.8):
- Fix issue in intra-node knomial bcast
- Handle gethostbyname return values gracefully
- Fix corner case issue in two-level gather code path
- Fix bug in CUDA events/streams pool management
- Fix ptmalloc initialization issue when MALLOC_CHECK_ is
defined in the environment
- Thanks to Mehmet Belgin from Georgia Institute of
Technology for the report
- Fix memory corruption and handle heterogeneous architectures
in gather collective
- Fix issue in detecting the correct HCA type
- Fix issue in ring start-up to select correct HCA when
MV2_IBA_HCA is specified
- Fix SEGFAULT in MPI_Finalize when IB loop-back is used
- Fix memory corruption on nodes with 64 cores
- Thanks to M Xie for the report
- Fix hang in MPI_Finalize with Nemesis interface when
ptmalloc initialization fails
- Thanks to Carson Holt from OICR for the report
- Fix memory corruption in shared memory communication
- Thanks to Craig Tierney from NOAA for the report
and testing the patch
- Fix issue in IB ring start-up selection with mpiexec.hydra
- Fix issue in selecting CUDA run-time variables when running
on single node in SMP only mode
- Fix a few memory leaks and warnings
MVAPICH2-1.8 (04/30/2012)
* Features and Enhancements (since 1.8rc1):
- Introduced a unified run time parameter MV2_USE_ONLY_UD to enable UD only
mode
- Enhanced designs for Alltoall and Allgather collective communication from
GPU device buffers
- Tuned collective communication from GPU device buffers
- Tuned Gather collective
- Introduced a run time parameter MV2_SHOW_CPU_BINDING to show current CPU
bindings
- Updated to hwloc v1.4.1
- Remove dependency on LEX and YACC
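The two new runtime parameters above might be used as follows (the
parameter names come from these entries; everything else is a
placeholder):

```shell
# Run entirely over the UD transport.
MV2_USE_ONLY_UD=1 mpirun_rsh -np 64 -hostfile hosts ./app

# Print the CPU binding of each process at startup.
MV2_SHOW_CPU_BINDING=1 mpirun_rsh -np 16 -hostfile hosts ./app
```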
* Bug-Fixes (since 1.8rc1):
- Fix hang with multiple GPU configuration
- Thanks to Jens Glaser from University of Minnesota for the report
- Fix buffer alignment issues to improve intra-node performance
- Fix a DPM multispawn behavior
- Enhanced error reporting in DPM functionality
- Quote environment variables in job startup to protect from shell
- Fix hang when LIMIC is enabled
- Fix hang in environments with heterogeneous HCAs
- Fix issue when using multiple HCA ports in RDMA_CM mode
- Thanks to Steve Wise from Open Grid Computing for the report
- Fix hang during MPI_Finalize in Nemesis IB netmod
- Fix for a start-up issue in Nemesis with heterogeneous architectures
- Fix a few memory leaks and warnings
MVAPICH2-1.8rc1 (03/22/2012)
* Features & Enhancements (since 1.8a2):
- New design for intra-node communication from GPU Device buffers using
CUDA IPC for better performance and correctness
- Thanks to Joel Scherpelz from NVIDIA for his suggestions
- Enabled shared memory communication for host transfers when CUDA is
enabled
- Optimized and tuned collectives for GPU device buffers
- Enhanced pipelined inter-node device transfers
- Enhanced shared memory design for GPU device transfers for large messages
- Enhanced support for CPU binding with socket and numanode level
granularity
- Support suspend/resume functionality with mpirun_rsh
- Exporting local rank, local size, global rank and global size through
environment variables (both mpirun_rsh and hydra)
- Update to hwloc v1.4
- Checkpoint-Restart support in OFA-IB-Nemesis interface
- Enabling run-through stabilization support to handle process failures in
OFA-IB-Nemesis interface
- Enhancing OFA-IB-Nemesis interface to handle IB errors gracefully
- Performance tuning on various architecture clusters
- Support for Mellanox IB FDR adapter
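The exported local rank lends itself to per-node resource binding in a
wrapper script. A minimal sketch, assuming the launcher exports a
variable named MV2_COMM_WORLD_LOCAL_RANK (the GPU count is a
site-specific placeholder):

```shell
# Map each MPI process to one GPU on its node, based on the local rank
# exported by the launcher (falls back to 0 if the variable is unset).
local_rank=${MV2_COMM_WORLD_LOCAL_RANK:-0}
num_gpus=4
export CUDA_VISIBLE_DEVICES=$((local_rank % num_gpus))
echo "local rank $local_rank bound to GPU $CUDA_VISIBLE_DEVICES"
```

Placed in front of the application (e.g. `mpirun_rsh ... ./wrapper.sh ./app`),
each rank then sees only its assigned device.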
* Bug-Fixes (since 1.8a2):
- Fix a hang issue on InfiniHost SDR/DDR cards
- Thanks to Nirmal Seenu from Fermilab for the report
- Fix an issue with runtime parameter MV2_USE_COALESCE usage
- Fix an issue with LiMIC2 when CUDA is enabled
- Fix an issue with intra-node communication using datatypes and GPU device
buffers
- Fix an issue with Dynamic Process Management when launching processes on
multiple nodes
- Thanks to Rutger Hofman from VU Amsterdam for the report
- Fix build issue in hwloc source with mcmodel=medium flags
- Thanks to Nirmal Seenu from Fermilab for the report
- Fix a build issue in hwloc with --disable-shared or --disable-static
options
- Use portable stdout and stderr redirection
- Thanks to Dr. Axel Philipp from MTU Aero Engines for the patch
- Fix a build issue with PGI 12.2
- Thanks to Thomas Rothrock from U.S. Army SMDC for the patch
- Fix an issue with send message queue in OFA-IB-Nemesis interface
- Fix a process cleanup issue in Hydra when MPI_ABORT is called (upstream
MPICH2 patch)
- Fix an issue with non-contiguous datatypes in MPI_Gather
- Fix a few memory leaks and warnings
MVAPICH2-1.8a2 (02/02/2012)
* Features and Enhancements (since 1.8a1p1):
- Support for collective communication from GPU buffers
- Non-contiguous datatype support in point-to-point and collective
communication from GPU buffers
- Efficient GPU-GPU transfers within a node using CUDA IPC (for CUDA 4.1)
- Alternate synchronization mechanism using CUDA Events for pipelined device
data transfers
- Exporting a process's local rank within the node through an environment
variable
- Adjust shared-memory communication block size at runtime
- Enable XRC by default at configure time
- New shared memory design for enhanced intra-node small message performance
- Tuned inter-node and intra-node performance on different cluster
architectures
- Update to hwloc v1.3.1
- Support for fallback to R3 rendezvous protocol if RGET fails
- SLURM integration with mpiexec.mpirun_rsh to use SLURM allocated hosts
without specifying a hostfile
- Support added to automatically use PBS_NODEFILE in Torque and PBS
environments
- Enable signal-triggered (SIGUSR2) migration
* Bug Fixes (since 1.8a1p1):
- Set process affinity independently of SMP enable/disable to control the
affinity in loopback mode
- Report error and exit if user requests MV2_USE_CUDA=1 in non-cuda
configuration
- Fix for data validation error with GPU buffers
- Updated WRAPPER_CPPFLAGS when using --with-cuda. Users should not have to
explicitly specify CPPFLAGS or LDFLAGS to build applications
- Fix for several compilation warnings
- Report an error message if user requests MV2_USE_XRC=1 in non-XRC
configuration
- Remove debug prints in regular code path with MV2_USE_BLOCKING=1
- Thanks to Vaibhav Dutt for the report
- Handling shared memory collective buffers in a dynamic manner to eliminate
static setting of maximum CPU core count
- Fix for validation issue in MPICH2 strided_get_indexed.c
- Fix a bug in packetized transfers on heterogeneous clusters
- Fix for deadlock between psm_ep_connect and PMGR_COLLECTIVE calls on
QLogic systems
- Thanks to Adam T. Moody for the patch
- Fix a bug in MPI_Allocate_mem when it is called with size 0
- Thanks to Michele De Stefano for reporting this issue
- Create vendor for Open64 compilers and add rpath for unknown compilers
- Thanks to Martin Hilgemen from Dell Inc. for the initial patch
- Fix issue due to overlapping buffers with sprintf
- Thanks to Mark Debbage from QLogic for reporting this issue
- Fallback to using GNU options for unknown f90 compilers
- Fix hang in PMI_Barrier due to incorrect handling of the socket return
values in mpirun_rsh
- Unify the redundant FTB events used to initiate a migration
- Fix memory leaks when mpirun_rsh reads hostfiles
- Fix a bug where library attempts to use in-active rail in multi-rail
scenario
MVAPICH2-1.8a1p1 (11/14/2011)
* Bug Fixes (since 1.8a1)
- Fix for a data validation issue in GPU transfers
- Thanks to Massimiliano Fatica, NVIDIA, for reporting this issue
- Tuned CUDA block size to 256K for better performance
- Enhanced error checking for CUDA library calls
- Fix for mpirun_rsh issue while launching applications on Linux Kernels
(3.x)
MVAPICH2-1.8a1 (11/09/2011)
* Features and Enhancements (since 1.7):
- Support for MPI communication from NVIDIA GPU device memory
- High performance RDMA-based inter-node point-to-point communication
(GPU-GPU, GPU-Host and Host-GPU)
- High performance intra-node point-to-point communication for
multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
- Communication with contiguous datatype
- Reduced memory footprint of the library
- Enhanced one-sided communication design with reduced memory requirement
- Enhancements and tuned collectives (Bcast and Alltoallv)
- Update to hwloc v1.3.0
- Flexible HCA selection with Nemesis interface
- Thanks to Grigori Inozemtsev, Queens University
- Support iWARP interoperability between Intel NE020 and Chelsio T4 Adapters
- RoCE enable environment variable name is changed from MV2_USE_RDMAOE to
MV2_USE_RoCE
* Bug Fixes (since 1.7):
- Fix for a bug in mpirun_rsh while doing process clean-up in abort and
other error scenarios
- Fixes for code compilation warnings
- Fix for memory leaks in RDMA CM code path
MVAPICH2-1.7 (10/14/2011)
* Features and Enhancements (since 1.7rc2):
- Support SHMEM collectives up to 64 cores/node
- Update to hwloc v1.2.2
- Enhancement and tuned collective (GatherV)
* Bug Fixes:
- Fixes for code compilation warnings
- Fix job clean-up issues with mpirun_rsh
- Fix a hang with RDMA CM
MVAPICH2-1.7rc2 (09/19/2011)
* Features and Enhancements (since 1.7rc1):
- Based on MPICH2-1.4.1p1
- Integrated Hybrid (UD-RC/XRC) design to get best performance
on large-scale systems with reduced/constant memory footprint
- Shared memory backed Windows for One-Sided Communication
- Support for truly passive locking for intra-node RMA in shared
memory and LIMIC based windows
- Integrated with Portable Hardware Locality (hwloc v1.2.1)
- Integrated with latest OSU Micro-Benchmarks (3.4)
- Enhancements and tuned collectives (Allreduce and Allgatherv)
- MPI_THREAD_SINGLE provided by default and MPI_THREAD_MULTIPLE as an
option
- Enabling Checkpoint/Restart support in pure SMP mode
- Optimization for QDR cards
- On-demand connection management support with IB CM (RoCE interface)
- Optimization to limit number of RDMA Fast Path connections for very large
clusters (Nemesis interface)
- Multi-core-aware collective support (QLogic PSM interface)
* Bug Fixes:
- Fixes for code compilation warnings
- Compiler preference lists reordered to avoid mixing GCC and Intel
compilers if both are found by configure
- Fix a bug in transferring very large messages (>2GB)
- Thanks to Tibor Pausz from Univ. of Frankfurt for reporting it
- Fix a hang with One-Sided Put operation
- Fix a bug in ptmalloc integration
- Avoid double-free crash with mpispawn
- Avoid crash and print an error message in mpirun_rsh when the hostfile is
empty
- Checking for error codes in PMI design
- Verify programs can link with LiMIC2 at runtime
- Fix for compilation issue when BLCR or FTB installed in non-system paths
- Fix an issue with RDMA-Migration
- Fix for memory leaks
- Fix an issue in supporting RoCE with the second port available on the HCA
- Thanks to Jeffrey Konz from HP for reporting it
- Fix for a hang with passive RMA tests (QLogic PSM interface)
MVAPICH2-1.7rc1 (07/20/2011)
* Features and Enhancements (since 1.7a2)
- Based on MPICH2-1.4
- CH3 shared memory channel for standalone hosts (including laptops)
without any InfiniBand adapters
- HugePage support
- Improved on-demand InfiniBand connection setup
- Optimized Fence synchronization (with and without LIMIC2 support)
- Enhanced mpirun_rsh design to avoid race conditions and support for
improved debug messages
- Optimized design for collectives (Bcast and Reduce)
- Improved performance for medium size messages for QLogic PSM
- Support for Ekopath Compiler
* Bug Fixes
- Fixes in Dynamic Process Management (DPM) support
- Fixes in Checkpoint/Restart and Migration support
- Fix Restart when using automatic checkpoint
- Thanks to Alexandr for reporting this
- Compilation warnings fixes
- Handling very large one-sided transfers using RDMA
- Fixes for memory leaks
- Graceful handling of unknown HCAs
- Better handling of shmem file creation errors
- Fix for a hang in intra-node transfer
- Fix for a build error with --disable-weak-symbols
- Thanks to Peter Willis for reporting this issue
- Fixes for one-sided communication with passive target synchronization
- Proper error reporting when a program is linked with both static and
shared MVAPICH2 libraries
MVAPICH2-1.7a2 (06/03/2011)
* Features and Enhancements (Since 1.7a)
- Improved intra-node shared memory communication performance
- Tuned RDMA Fast Path Buffer size to get better performance
with less memory footprint (CH3 and Nemesis)
- Fast process migration using RDMA
- Automatic inter-node communication parameter tuning
based on platform and adapter detection (Nemesis)
- Automatic intra-node communication parameter tuning
based on platform
- Efficient connection set-up for multi-core systems
- Enhancements for collectives (barrier, gather and allgather)
- Compact and shorthand way to specify blocks of processes on the same
host with mpirun_rsh
- Support for latest stable version of hwloc v1.2
- Improved debug message output in process management and fault tolerance
functionality
- Better handling of process signals and error management in mpispawn
- Performance tuning for pt-to-pt and several collective operations
* Bug fixes
- Fixes for memory leaks
- Fixes in CR/migration
- Better handling of memory allocation and registration failures
- Fixes for compilation warnings
- Fix a bug that disallowed '=' in mpirun_rsh arguments
- Handling of non-contiguous transfer in Nemesis interface
- Bug fix in gather collective when ranks are in cyclic order
- Fix for the ignore_locks bug in MPI-IO with Lustre
MVAPICH2-1.7a (04/19/2011)
* Features and Enhancements
- Based on MPICH2-1.3.2p1
- Integrated with Portable Hardware Locality (hwloc v1.1.1)
- Supporting Large Data transfers (>2GB)
- Integrated with Enhanced LiMIC2 (v0.5.5) to support Intra-node
large message (>2GB) transfers
- Optimized and tuned algorithm for AlltoAll
- Enhanced debugging config options to generate
core files and back-traces
- Support for Chelsio's T4 Adapter
MVAPICH2-1.6 (03/09/2011)
* Features and Enhancements (since 1.6-RC3)
- Improved configure help for MVAPICH2 features
- Updated Hydra launcher with MPICH2-1.3.3 Hydra process manager
- Building and installation of OSU micro benchmarks during default
MVAPICH2 installation
- Hydra is the default mpiexec process manager
* Bug fixes (since 1.6-RC3)
- Fix hang issues in RMA
- Fix memory leaks
- Fix in RDMA_FP
MVAPICH2-1.6-RC3 (02/15/2011)
* Features and Enhancements
- Support for 3D torus topology with appropriate SL settings
- For both CH3 and Nemesis interfaces
- Thanks to Jim Schutt, Marcus Epperson and John Nagle from
Sandia for the initial patch
- Quality of Service (QoS) support with multiple InfiniBand SL
- For both CH3 and Nemesis interfaces
- Configuration file support (similar to the one available in MVAPICH).
Provides a convenient method for handling all runtime variables
through a configuration file.
- Improved job-startup performance on large-scale systems
- Optimization in MPI_Finalize
- Improved pt-to-pt communication performance for small and
medium messages
- Optimized and tuned algorithms for Gather and Scatter collective
operations
- Optimized thresholds for one-sided RMA operations
- User-friendly configuration options to enable/disable various
checkpoint/restart and migration features
- Enabled ROMIO's auto detection scheme for filetypes
on Lustre file system
- Improved error checking for system and BLCR calls in
checkpoint-restart and migration codepath
- Enhanced OSU Micro-benchmarks suite (version 3.3)
* Bug Fixes
- Fix in aggregate ADIO alignment
- Fix for an issue with LiMIC2 header
- XRC connection management
- Fixes in registration cache
- IB card detection with MV2_IBA_HCA runtime option in
multi-rail design
- Fix for a bug in multi-rail design while opening multiple HCAs
- Fixes for multiple memory leaks
- Fix for a bug in mpirun_rsh
- Checks before enabling aggregation and migration
- Fixing the build errors with --disable-cxx
- Thanks to Bright Yang for reporting this issue
- Fixing the build errors related to "pthread_spinlock_t"
seen on RHEL systems
MVAPICH2-1.6-RC2 (12/22/2010)
* Features and Enhancements
- Optimization and enhanced performance for clusters with NVIDIA
GPU adapters (with and without GPUDirect technology)
- Enhanced R3 rendezvous protocol
- For both CH3 and Nemesis interfaces
- Robust RDMA Fast Path setup to avoid memory allocation
failures
- For both CH3 and Nemesis interfaces
- Multiple design enhancements for better performance of
medium sized messages
- Enhancements and optimizations for one-sided Put and Get operations
- Enhancements and tuning of Allgather for small and medium
sized messages
- Optimization of AllReduce
- Enhancements to Multi-rail Design and features including striping
of one-sided messages
- Enhancements to mpirun_rsh job start-up scheme
- Enhanced designs for automatic detection of various
architectures and adapters
* Bug fixes
- Fix a bug in Post-Wait/Start-Complete path for one-sided
operations
- Resolving a hang in mpirun_rsh termination when CR is enabled
- Fixing issue in MPI_Allreduce and Reduce when called with MPI_IN_PLACE
- Thanks to the initial patch by Alexander Alekhin
- Fix for an issue in rail selection for small RMA messages
- Fix for threading related errors with comm_dup
- Fix for alignment issues in RDMA Fast Path
- Fix for extra memcpy in header caching
- Fix to use the correct HCA when the process-to-rail binding
scheme is used in combination with XRC
- Fix for an RMA issue when configured with enable-g=meminit
- Thanks to James Dinan of Argonne for reporting this issue
- Only set FC and F77 if gfortran is executable
MVAPICH2-1.6-RC1 (11/12/2010)
* Features and Enhancements
- Using LiMIC2 for efficient intra-node RMA transfer to avoid extra
memory copies
- Upgraded to LiMIC2 version 0.5.4
- Removing the limitation on number of concurrent windows in RMA
operations
- Support for InfiniBand Quality of Service (QoS) with multiple lanes
- Enhanced support for multi-threaded applications
- Fast Checkpoint-Restart support with aggregation scheme
- Job Pause-Migration-Restart Framework for Pro-active Fault-Tolerance
- Support for new standardized Fault Tolerant Backplane (FTB) Events
for Checkpoint-Restart and Job Pause-Migration-Restart Framework
- Dynamic detection of multiple InfiniBand adapters and using these
by default in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and
OFA-RoCE-CH3 interfaces)
- Support for process-to-rail binding policy (bunch, scatter and
user-defined) in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and
OFA-RoCE-CH3 interfaces)
- Enhanced and optimized algorithms for MPI_Reduce and MPI_AllReduce
operations for small and medium message sizes.
- XRC support with Hydra Process Manager
- Improved usability of process to CPU mapping with support of
delimiters (',' , '-') in CPU listing
- Thanks to Gilles Civario for the initial patch
- Use of gfortran as the default F77 compiler
- Support of Shared-Memory-Nemesis interface on multi-core platforms
requiring intra-node communication only (SMP-only systems, laptops, etc.)
* Bug fixes
- Fix for memory leak in one-sided code with --enable-g=all
--enable-error-messages=all
- Fix for memory leak in getting the context of intra-communicator
- Fix for shmat() return code check
- Fix for issues with inter-communicator collectives in Nemesis
- KNEM patch for osu_bibw issue with KNEM version 0.9.2
- Fix for osu_bibw error with Shared-memory-Nemesis interface
- Fix for Win_test error for one-sided RDMA
- Fix for a hang in collective when thread level is set to multiple
- Fix for intel test errors with rsend, bsend and ssend operations in Nemesis
- Fix for a memory free issue when memory was allocated by scandir
- Fix for a hang in Finalize
- Fix for issue with MPIU_Find_local_and_external when it is called
from MPIDI_CH3I_comm_create
- Fix for handling CPPFLAGS values with spaces
- Dynamic Process Management to work with XRC support
- Fix related to disabling CPU affinity when shared memory is turned off at run time
MVAPICH2-1.5.1 (09/14/10)
* Features and Enhancements
- Significantly reduce memory footprint on some systems by changing the
stack size setting for multi-rail configurations
- Optimization to the number of RDMA Fast Path connections
- Performance improvements in Scatterv and Gatherv collectives for CH3
interface (Thanks to Dan Kokran and Max Suarez of NASA for identifying
the issue)
- Tuning of Broadcast Collective
- Support for tuning of eager thresholds based on both adapter and platform
type
- Environment variables for message sizes can now be expressed in short
form K=Kilobytes and M=Megabytes (e.g. MV2_IBA_EAGER_THRESHOLD=12K)
- Ability to selectively use some or all HCAs using colon separated lists.
e.g. MV2_IBA_HCA=mlx4_0:mlx4_1
- Improved Bunch/Scatter mapping for process binding with HWLOC and SMT
support (Thanks to Dr. Bernd Kallies of ZIB for ideas and suggestions)
- Update to Hydra code from MPICH2-1.3b1
- Auto-detection of various iWARP adapters
- Specifying MV2_USE_IWARP=1 is no longer needed when using iWARP
- Changing automatic eager threshold selection and tuning for iWARP
adapters based on number of nodes in the system instead of the number of
processes
- PSM progress loop optimization for QLogic Adapters (Thanks to Dr.
Avneesh Pant of QLogic for the patch)
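The runtime options described above can be combined in the environment before
launching a job; a minimal sketch (the host names and MPI binary in the
commented launch line are hypothetical placeholders):

```shell
# Message-size variables accept the short forms K (kilobytes) and M (megabytes)
export MV2_IBA_EAGER_THRESHOLD=12K

# Select a subset of HCAs with a colon-separated list
export MV2_IBA_HCA=mlx4_0:mlx4_1

# Hypothetical launch: two processes on two hosts
# mpirun_rsh -np 2 host1 host2 ./a.out

# Show the values the MPI library would pick up
echo "$MV2_IBA_EAGER_THRESHOLD $MV2_IBA_HCA"
```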
* Bug fixes
- Fix memory leak in registration cache with --enable-g=all
- Fix memory leak in operations using datatype modules
- Fix for rdma_cross_connect issue for RDMA CM. The server is prevented
from initiating a connection.
- Don't fail during build if RDMA CM is unavailable
- Various mpirun_rsh bug fixes for CH3, Nemesis and uDAPL interfaces
- ROMIO panfs build fix
- Update panfs for not-so-new ADIO file function pointers
- Shared libraries can be generated with unknown compilers
- Explicitly link against DL library to prevent build error due to DSO link
change in Fedora 13 (introduced with gcc-4.4.3-5.fc13)
- Fix regression that prevents the proper use of our internal HWLOC
component
- Remove spurious debug flags when certain options are selected at build
time
- Error code added for situation when received eager SMP message is larger
than receive buffer
- Fix for Gather and GatherV back-to-back hang problem with LiMIC2
- Fix for packetized send in Nemesis
- Fix related to eager threshold in nemesis ib-netmod
- Fix initialization parameter for Nemesis based on adapter type
- Fix for uDAPL one-sided operations (Thanks to Jakub Fedoruk from Intel
for reporting this)
- Fix an issue with out-of-order message handling for iWARP
- Fixes for memory leak and Shared context Handling in PSM for QLogic
Adapters (Thanks to Dr. Avneesh Pant of QLogic for the patch)
MVAPICH2-1.5 (07/09/10)
* Features and Enhancements (since 1.5-RC2)
- SRQ turned on by default for Nemesis interface
- Performance tuning - adjusted eager thresholds for
variety of architectures, vbuf size based on adapter
types and vbuf pool sizes
- Tuning for Intel iWARP NE020 adapter, thanks to Harry
Cropper of Intel
- Introduction of a retry mechanism for RDMA_CM connection
establishment
* Bug fixes (since 1.5-RC2)
- Fix in build process with hwloc (for some Distros)
- Fix for memory leak (Nemesis interface)
MVAPICH2-1.5-RC2 (06/21/10)
* Features and Enhancements (since 1.5-RC1)
- Support for hwloc library (1.0.1) for defining CPU affinity
- Deprecating the PLPA support for defining CPU affinity
- Efficient CPU affinity policies (bunch and scatter) to
specify CPU affinity per job for modern multi-core platforms
- New flag in mpirun_rsh to execute tasks with different group IDs
- Enhancement to the design of Win_complete for RMA operations
- Flexibility to support variable number of RMA windows
- Support for Intel iWARP NE020 adapter
* Bug fixes (since 1.5-RC1)
- Compilation issue with the ROMIO adio-lustre driver, thanks
to Adam Moody of LLNL for reporting the issue
- Allowing checkpoint-restart for large-scale systems
- Correcting a bug in clear_kvc function. Thanks to T J (Chris) Ward,
IBM Research, for reporting and providing the resolving patch
- Shared lock operations with RMA with scatter process distribution.
Thanks to Pavan Balaji of Argonne for reporting this issue
- Fix a bug during window creation in uDAPL
- Compilation issue with --enable-alloca, thanks to E. Borisch
for reporting and providing the patch
- Improved error message for ibv_poll_cq failures
- Fix an issue that prevented mpirun_rsh from executing programs
found in PATH when invoked without an explicit path
- Fix an issue of mpirun_rsh with Dynamic Process Management (DPM)
- Fix for memory leaks (both CH3 and Nemesis interfaces)
- Updatefiles correctly update LiMIC2
- Several fixes to the registration cache
(CH3, Nemesis and uDAPL interfaces)
- Fix to multi-rail communication
- Fix to Shared Memory communication Progress Engine
- Fix to all-to-all collective for large number of processes
MVAPICH2-1.5-RC1 (05/04/10)
* Features and Enhancements
- MPI 2.2 compliant
- Based on MPICH2-1.2.1p1
- OFA-IB-Nemesis interface design
- OpenFabrics InfiniBand network module support for
MPICH2 Nemesis modular design
- Support for high-performance intra-node shared memory
communication provided by the Nemesis design
- Adaptive RDMA Fastpath with Polling Set for high-performance
inter-node communication
- Shared Receive Queue (SRQ) support with flow control,
uses significantly less memory for MPI library
- Header caching
- Advanced AVL tree-based Resource-aware registration cache
- Memory Hook Support provided by integration with ptmalloc2
library. This provides safe release of memory to the
Operating System and is expected to benefit the memory
usage of applications that heavily use malloc and free
operations.
- Support for TotalView debugger
- Shared Library Support for existing binary MPI application
programs to run
- ROMIO Support for MPI-IO
- Support for additional features (such as hwloc,
hierarchical collectives, one-sided, multithreading, etc.),
as included in the MPICH2 1.2.1p1 Nemesis channel
- Flexible process manager support
- mpirun_rsh to work with any of the eight interfaces
(CH3 and Nemesis channel-based) including OFA-IB-Nemesis,
TCP/IP-CH3 and TCP/IP-Nemesis
- Hydra process manager to work with any of the eight interfaces
(CH3 and Nemesis channel-based) including OFA-IB-CH3,
OFA-iWARP-CH3, OFA-RoCE-CH3 and TCP/IP-CH3
- MPIEXEC_TIMEOUT is honored by mpirun_rsh
* Bug fixes since 1.4.1
- Fix compilation error when configured with
`--enable-thread-funneled'
- Fix MPE functionality, thanks to Anthony Chan for
reporting and providing the resolving patch
- Cleanup after a failure in the init phase is handled better by
mpirun_rsh
- Path determination is correctly handled by mpirun_rsh when DPM is
used
- Shared libraries are correctly built (again)
MVAPICH2-1.4.1
* Enhancements since mvapich2-1.4
- MPMD launch capability to mpirun_rsh
- Portable Hardware Locality (hwloc) support, patch suggested by
Dr. Bernd Kallies <[email protected]>
- Multi-port support for iWARP
- Enhanced iWARP design for scalability to higher process count
- Ring based startup support for RDMAoE
* Bug fixes since mvapich2-1.4
- Fixes for MPE and other profiling tools