{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":164157858,"defaultBranch":"master","name":"aws-ofi-nccl","ownerLogin":"rajachan","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2019-01-04T21:45:31.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/742736?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1726778286.0","currentOid":""},"activityList":{"items":[{"before":null,"after":"aa5349448fe007e8b106a39f9d8fde7f24b577e9","ref":"refs/heads/setopt-unsupp","pushedAt":"2024-09-19T20:38:06.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"util: Use FI_ENOPROTOOPT to check for a provider's support for option\n\nfi_endpoint(3) states that fi_setopt/_getopt calls return FI_ENOPROTOOPT\nwhen a provider does not support a requested option. Certain setopt\noptions also return FI_EOPNOTSUPP if a particular mode forced by setting\nan option is not supported (such as, with FI_OPT_CUDA_API_PERMITTED\nbeing set to false when a provider requires CUDA API to support\nFI_HMEM_CUDA). 
Checking for both as we have the same error\noutcome in either case.\n\nFixes #606\n\nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"util: Use FI_ENOPROTOOPT to check for a provider's support for option"}},{"before":"420a42e373ed9ce7a8524c86446ed9c36e16f674","after":"093376b54839c12c19bb59d56b11c73de978122a","ref":"refs/heads/ci-clang-ver","pushedAt":"2024-09-19T19:46:57.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"ci: Drop the use of cache-apt-pkgs-action\n\nOne less third-party action dependency.\n\nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"ci: Drop the use of cache-apt-pkgs-action"}},{"before":null,"after":"420a42e373ed9ce7a8524c86446ed9c36e16f674","ref":"refs/heads/ci-clang-ver","pushedAt":"2024-09-19T19:41:49.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"ci: Downgrade to clang-18\n\nLLVM is in the middle of a release migration and that's breaking package\ninstalls. 
Reverting to clang-18, which is the current \"stable\" release\nversion.\n\nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"ci: Downgrade to clang-18"}},{"before":null,"after":"03b5c681e4c1e25d647c9049e906cd2cbabe6b90","ref":"refs/heads/drop-cache-action","pushedAt":"2024-09-19T19:05:21.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"Drop the use of cache-apt-pkgs-action\n\nThis is causing failures in unexpected ways.\n\nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"Drop the use of cache-apt-pkgs-action"}},{"before":"ce1654a017077bdb6b900c27b6e7344c44dba8e7","after":null,"ref":"refs/heads/control-qp-rebase","pushedAt":"2024-09-08T05:17:09.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"}},{"before":"cf659bda82f20143332c509f0532450387f14f40","after":"ce1654a017077bdb6b900c27b6e7344c44dba8e7","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-09-07T05:01:31.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see if the match can be found. 
While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":"b16b8f309101e75d6e8835c545314a4dcd5aa481","after":"cf659bda82f20143332c509f0532450387f14f40","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-09-06T18:58:06.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see if the match can be found. While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":"1de991c1e8ae381ab82903a733a715b5ce516f71","after":"b16b8f309101e75d6e8835c545314a4dcd5aa481","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-09-06T06:21:20.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see if the match can be found. 
While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":"c11ee00643b28595e6fdd19a2f9c20defc8c8141","after":"1de991c1e8ae381ab82903a733a715b5ce516f71","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-09-03T16:03:29.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"a-szegel","name":"Seth Zegelstein","path":"/a-szegel","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/97712042?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see if the match can be found. While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":"0f0343c4e673320bbaae57fbcf6a7a8a8af67202","after":"c11ee00643b28595e6fdd19a2f9c20defc8c8141","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-08-29T22:52:49.000Z","pushType":"push","commitsCount":3,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"Merge branch 'master' into control-qp-rebase","shortMessageHtmlLink":"Merge branch 'master' into control-qp-rebase"}},{"before":"dcef637ed70ea8364834065a97061fe279da7960","after":"0f0343c4e673320bbaae57fbcf6a7a8a8af67202","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-08-28T17:11:52.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see 
if the match can be found. While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":"ca7884fdff2a029c3fd5a17cee23f9f9563d2fbe","after":"dcef637ed70ea8364834065a97061fe279da7960","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-08-28T17:00:15.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see if the match can be found. While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":null,"after":"6fb9090a096a8fb17061bc79bdfad2fae3c26580","ref":"refs/heads/checkpprtest","pushedAt":"2024-08-28T15:19:09.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see if the match can be found. 
While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":"6898e99957b37b282dabd74900e24c681807b71b","after":"ca7884fdff2a029c3fd5a17cee23f9f9563d2fbe","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-08-26T04:06:41.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see if the match can be found. While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":"d32118c8ca5b3a1982037b0b95f1d46162d70a7a","after":"6898e99957b37b282dabd74900e24c681807b71b","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-08-26T03:55:23.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see if the match can be found. 
While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":"1026010b6f59fb7c5b73a188588e90a2bd030b6d","after":"d32118c8ca5b3a1982037b0b95f1d46162d70a7a","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-08-26T03:47:12.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see if the match can be found. While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":"37707a980ef106ab0d2c6ae3118b80b5c6514b38","after":"1026010b6f59fb7c5b73a188588e90a2bd030b6d","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-08-26T02:03:15.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see if the match can be found. 
While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":"2c191c970ea518d7d95b95cc6bff7662211b52d8","after":"95f6376cadd99cd369e9a3de34be4374939f8c1d","ref":"refs/heads/master","pushedAt":"2024-08-23T07:47:58.000Z","pushType":"push","commitsCount":135,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":".ci/aws: Add g4dn testing to PR CI\n\nSigned-off-by: Seth Zegelstein ","shortMessageHtmlLink":".ci/aws: Add g4dn testing to PR CI"}},{"before":null,"after":"37707a980ef106ab0d2c6ae3118b80b5c6514b38","ref":"refs/heads/control-qp-rebase","pushedAt":"2024-08-20T07:22:26.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"rdma: Poll the control cq if no match\n\nIf a match isn't found for the current send, poll the control cq\nto see if the match can be found. 
While this extends the current\nsend() call, it potentially lowers the time until data transfer\nstarts.\n\nSigned-off-by: Brian Barrett \nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"rdma: Poll the control cq if no match"}},{"before":"4e721059d6bd7e1343f5cf615adb84e116743e0f","after":null,"ref":"refs/heads/drop-winorderhack","pushedAt":"2024-08-09T16:00:38.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"}},{"before":"93036b6965285ef5482f4aba3c5196589354102b","after":"4e721059d6bd7e1343f5cf615adb84e116743e0f","ref":"refs/heads/drop-winorderhack","pushedAt":"2024-08-08T17:41:18.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"aws: Do not skip the WRITE_IN_ORDER_ALIGNED_128_BYTES check for P5\n\nEFA now reports WRITE_IN_ORDER_ALIGNED_128_BYTES for this platform and\nthe plugin no longer has to override this check to support the\nlow-latency protocols. 
Removing the hack so the protocol support is\ndetermined programmatically for both the send/recv and RDMA protocols.\n\nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"aws: Do not skip the WRITE_IN_ORDER_ALIGNED_128_BYTES check for P5"}},{"before":null,"after":"93036b6965285ef5482f4aba3c5196589354102b","ref":"refs/heads/drop-winorderhack","pushedAt":"2024-08-08T17:38:40.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"rajachan","name":"Raghu Raja","path":"/rajachan","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/742736?s=80&v=4"},"commit":{"message":"aws: Do not skip the WRITE_IN_ORDER_ALIGNED_128_BYTES check for P5\n\nEFA now reports WRITE_IN_ORDER_ALIGNED_128_BYTES for this platform and\nthe plugin no longer has to circumvent that check to determine support\nfor the low-latency protocols.\n\nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"aws: Do not skip the WRITE_IN_ORDER_ALIGNED_128_BYTES check for P5"}},{"before":"b45a27fc2349d68395737b2fec8b41839da683a3","after":"040d71f16814974e2eca4432c5c185e347b9753d","ref":"refs/heads/mrcache","pushedAt":"2024-07-10T16:40:57.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"AmedeoSapio","name":"Amedeo Sapio","path":"/AmedeoSapio","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/10835281?s=80&v=4"},"commit":{"message":"Report global registrations to NCCL\n\nWhen net-plugin returns regIsGlobal=1 to NCCL (as part of net-plugin\ngetProperties() API), it signals to NCCL that registered MRs are global,\nin the sense that they can be used by all communicators. 
In addition, it\nalso signals to NCCL that the net-plugin have a fast MR cache such that\ncalling regMr() on same buffer (address and size), will quickly return a\npreviously globally registered MR on same buffer.\n\nWhen user registers a buffer with NCCL by using ncclCommRegister() API,\nif net-plugin supports regIsGlobal=1, NCCL will register the buffer\nglobally once (On each net device) with regMr() API. When the net\nproxy-thread starts to execute a communication task on a previously\nregistered user buffer, it will call the net-plugin regMr() to quickly\nfetch the previously globally registered MR from the plugin managed MR\ncache.\n\nSince we now have such MR cache in the plugin, we can report registrations\nas global if 1. the MR scope for the libfabric provider is the domain, and\n2. if the plugin is using one domain per process.\n\nAdditionally, we are not reporting registrations as global for the SENDRECV\nprotocol because the SENDRECV protocol currently does not correctly handle\nthe truncated send case (send size > recv size) which NCCL may use when\nregIsGlobal=1.\n\nThis reverts commit d9c416ff5291dabe1264dd0f3307d415815012f0, with\nadditional changes.\n\nSigned-off-by: Amedeo Sapio ","shortMessageHtmlLink":"Report global registrations to NCCL"}},{"before":"f3afc3176318d404b2066d8fc5073a191181a3ae","after":"b45a27fc2349d68395737b2fec8b41839da683a3","ref":"refs/heads/mrcache","pushedAt":"2024-07-09T21:36:23.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"AmedeoSapio","name":"Amedeo Sapio","path":"/AmedeoSapio","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/10835281?s=80&v=4"},"commit":{"message":"sendrecv: add MR cache to SENDRECV protocol\n\nThis commit is making the SENDRECV protocol use the MR cache for memory\nregistrations.\n\nSigned-off-by: Amedeo Sapio ","shortMessageHtmlLink":"sendrecv: add MR cache to SENDRECV 
protocol"}},{"before":"872fdba6da728601d6b2d40f02e1489ea691cd2f","after":"f3afc3176318d404b2066d8fc5073a191181a3ae","ref":"refs/heads/mrcache","pushedAt":"2024-07-09T21:34:18.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"AmedeoSapio","name":"Amedeo Sapio","path":"/AmedeoSapio","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/10835281?s=80&v=4"},"commit":{"message":"sendrecv: add MR cache to SENDRECV protocol\n\nThis commit is making the SENDRECV protocol use the MR cache for memory\nregistrations.\n\nSigned-off-by: Amedeo Sapio ","shortMessageHtmlLink":"sendrecv: add MR cache to SENDRECV protocol"}},{"before":"2a195fb99d3519ebe097078d7054bcbd21e98f12","after":"872fdba6da728601d6b2d40f02e1489ea691cd2f","ref":"refs/heads/mrcache","pushedAt":"2024-07-09T16:34:22.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"AmedeoSapio","name":"Amedeo Sapio","path":"/AmedeoSapio","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/10835281?s=80&v=4"},"commit":{"message":"Report global registrations to NCCL\n\nWhen net-plugin returns regIsGlobal=1 to NCCL (as part of net-plugin\ngetProperties() API), it signals to NCCL that registered MRs are global,\nin the sense that they can be used by all communicators. In addition, it\nalso signals to NCCL that the net-plugin have a fast MR cache such that\ncalling regMr() on same buffer (address and size), will quickly return a\npreviously globally registered MR on same buffer.\n\nWhen user registers a buffer with NCCL by using ncclCommRegister() API,\nif net-plugin supports regIsGlobal=1, NCCL will register the buffer\nglobally once (On each net device) with regMr() API. 
When the net\nproxy-thread starts to execute a communication task on a previously\nregistered user buffer, it will call the net-plugin regMr() to quickly\nfetch the previously globally registered MR from the plugin managed MR\ncache.\n\nSince we now have such MR cache in the plugin, we can report registrations\nas global if 1. the MR scope for the libfabric provider is the domain, and\n2. if the plugin is using one domain per process.\n\nAdditionally, we are not reporting registrations as global for the SENDRECV\nprotocol because the SENDRECV protocol currently does not correctly handle\nthe truncated send case (send size > recv size) which NCCL may use when\nregIsGlobal=1.\n\nThis reverts commit d9c416ff5291dabe1264dd0f3307d415815012f0, with\nadditional changes.\n\nSigned-off-by: Amedeo Sapio ","shortMessageHtmlLink":"Report global registrations to NCCL"}},{"before":"e431f03217cd8aec89f4566d27e603d70bc7f4c7","after":"2a195fb99d3519ebe097078d7054bcbd21e98f12","ref":"refs/heads/mrcache","pushedAt":"2024-07-08T19:02:24.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"aws-nslick","name":"Nicholas Sielicki","path":"/aws-nslick","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/145174695?s=80&v=4"},"commit":{"message":"Report global registrations to NCCL\n\nWhen net-plugin returns regIsGlobal=1 to NCCL (as part of net-plugin\ngetProperties() API), it signals to NCCL that registered MRs are global,\nin the sense that they can be used by all communicators. In addition, it\nalso signals to NCCL that the net-plugin have a fast MR cache such that\ncalling regMr() on same buffer (address and size), will quickly return a\npreviously globally registered MR on same buffer.\n\nWhen user registers a buffer with NCCL by using ncclCommRegister() API,\nif net-plugin supports regIsGlobal=1, NCCL will register the buffer\nglobally once (On each net device) with regMr() API. 
When the net\nproxy-thread starts to execute a communication task on a previously\nregistered user buffer, it will call the net-plugin regMr() to quickly\nfetch the previously globally registered MR from the plugin managed MR\ncache.\n\nSince we now have such MR cache in the plugin, we can report registrations\nas global if 1. the MR scope for the libfabric provider is the domain, and\n2. if the plugin is using one domain per process.\n\nAdditionally, we are not reporting registrations as global for the SENDRECV\nprotocol because the SENDRECV protocol currently does not correctly handle\nthe truncated send case (send size > recv size) which NCCL may use when\nregIsGlobal=1.\n\nThis reverts commit d9c416ff5291dabe1264dd0f3307d415815012f0, with\nadditional changes.\n\nSigned-off-by: Amedeo Sapio ","shortMessageHtmlLink":"Report global registrations to NCCL"}},{"before":"0d236fb6bc8a9233aebf6500077180a4200b3c1b","after":"e431f03217cd8aec89f4566d27e603d70bc7f4c7","ref":"refs/heads/mrcache","pushedAt":"2024-07-06T06:56:03.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"AmedeoSapio","name":"Amedeo Sapio","path":"/AmedeoSapio","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/10835281?s=80&v=4"},"commit":{"message":"Report global registrations to NCCL\n\nWhen net-plugin returns regIsGlobal=1 to NCCL (as part of net-plugin\ngetProperties() API), it signals to NCCL that registered MRs are global,\nin the sense that they can be used by all communicators. In addition, it\nalso signals to NCCL that the net-plugin have a fast MR cache such that\ncalling regMr() on same buffer (address and size), will quickly return a\npreviously globally registered MR on same buffer.\n\nWhen user registers a buffer with NCCL by using ncclCommRegister() API,\nif net-plugin supports regIsGlobal=1, NCCL will register the buffer\nglobally once (On each net device) with regMr() API. 
When the net\nproxy-thread starts to execute a communication task on a previously\nregistered user buffer, it will call the net-plugin regMr() to quickly\nfetch the previously globally registered MR from the plugin managed MR\ncache.\n\nSince we now have such MR cache in the plugin, we can report registrations\nas global if 1. the MR scope for the libfabric provider is the domain, and\n2. if the plugin is using one domain per process.\n\nAdditionally, we are not reporting registrations as global for the SENDRECV\nprotocol because the SENDRECV protocol currently does not correctly handle\nthe truncated send case (send size > recv size) which NCCL may use when\nregIsGlobal=1.\n\nThis reverts commit d9c416ff5291dabe1264dd0f3307d415815012f0, with\nadditional changes.\n\nSigned-off-by: Amedeo Sapio ","shortMessageHtmlLink":"Report global registrations to NCCL"}},{"before":"acf7521517dfe75c7eb86a727fd4c2c3c1d67a56","after":"0d236fb6bc8a9233aebf6500077180a4200b3c1b","ref":"refs/heads/mrcache","pushedAt":"2024-07-01T19:08:55.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rauteric","name":"Eric Raut","path":"/rauteric","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/10216922?s=80&v=4"},"commit":{"message":"transports: use mr in plugin","shortMessageHtmlLink":"transports: use mr in plugin"}},{"before":"701fdb6c807d89f47e48bc353c39d91c88b98a02","after":"acf7521517dfe75c7eb86a727fd4c2c3c1d67a56","ref":"refs/heads/mrcache","pushedAt":"2024-06-25T01:49:20.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"rauteric","name":"Eric Raut","path":"/rauteric","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/10216922?s=80&v=4"},"commit":{"message":"Introduce a memory registration cache for the net plugin\n\nWith user buffer registration capability, when a network plugin reports\nsupport for regIsGlobal, NCCL does maintain a cache of registration\nhandles (originally registered with a loopback communicator). 
At the\ntime of a send, it still calls into the regMr hook of the network plugin\nfor the actual communicator that will be used for the data transfer (in\ncase the net plugin requires communicator-specific state for the\nregistration). With regIsGlobal guarantee, it is possible for NCCL to\nreuse the handle it has cached, but it does not do that today. This\ncommit introduces a MR cache that is similar in design to NCCL's\ninternal cache (with a linear search in the cache to find a registration\nthat fully covers the list of pages of the buffer in question) to avoid\nredundant (and expensive) registrations with the underlying device.\n\nSigned-off-by: Raghu Raja ","shortMessageHtmlLink":"Introduce a memory registration cache for the net plugin"}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAEu0yyBAA","startCursor":null,"endCursor":null}},"title":"Activity · rajachan/aws-ofi-nccl"}