Hi @rong-dai,
I hope this message finds you well. I tried to run the code on a multi-node cluster, but it runs on a single node even though I provide multiple nodes in the configuration 'cifar.sh'. When I run the code, it only uses the node from which I launched it; the other nodes stay idle, judging by their python processes, CPU usage, and memory usage. The code assigns different client indexes: [ 0 1 2 3 4 5 6 7 8 9 10 11] and mask values: [3989642.0, 3988118.0, 3991490.0, 3988348.0, 3992552.0, 3991390.0, 3990810.0, 3991162.0, 3989368.0, 3988814.0, 3989450.0, 2389378.0], but all of them run on the same node. Following the README instructions, I ran the code from the /job/DisPFL/fedml_experiments/standalone/DisPFL directory.
I would appreciate any pointers on what needs to change so that the code runs across multiple nodes.
Results on a 3-node cluster, where only a single node is used for all 3 client indexes: client_indexes- [0 1 2], mask_values- [0.0, 4551738.0, 4550940.0]
Thank you.
Best regards,
Anshita Saxena
How did you reproduce this code? Which operating system is required? Thank you.
Hi @SUNLup, I ran this code on a cluster of 12 servers. However, I saw that the code was running on only a single server.
That single server runs Ubuntu (NAME="Ubuntu", VERSION_ID="22.04") and has an 8-core CPU, model "AMD Ryzen Embedded V1807B with Radeon Vega Gfx". I did not use any GPU to reproduce the results.
I would wholeheartedly appreciate @rong-dai's help in explaining how to run this code in a multi-node distributed setting. The repo uses the 'standalone' keyword, which I assume means the code uses only a single server, with @rong-dai assigning each client_index on whichever server has the data and compute. For example, when I set 3 clients, the code runs the steps iteratively and uses the same server as clients 0, 1, and 2.
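To make the behaviour I am describing concrete, here is a minimal sketch of what a standalone-style simulation loop looks like. This is not the actual DisPFL code: `run_standalone_round` and `toy_local_update` are hypothetical names, and the "training" is a placeholder. It only illustrates my assumption that every client index is processed in turn by one Python process on one host.

```python
import os
import socket


def toy_local_update(client_idx):
    # Hypothetical stand-in for a client's local training step.
    return client_idx * 2


def run_standalone_round(client_indexes, local_update):
    """Simulate one round: each client is handled sequentially by this process."""
    results = []
    for idx in client_indexes:
        results.append({
            "client": idx,
            "host": socket.gethostname(),  # same host for every client
            "pid": os.getpid(),            # same process for every client
            "update": local_update(idx),
        })
    return results


if __name__ == "__main__":
    for record in run_standalone_round([0, 1, 2], toy_local_update):
        print(record)
```

If the real code follows this pattern, every record reports the same host and process ID regardless of how many nodes are listed in the configuration, which would match the idle nodes I observed.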