Code run in the distributed node cluster settings #4

Anshita1Saxena · 2024-04-02T15:16:07Z

I hope this message finds you well. I tried to run the code on a multi-node cluster, but it runs on a single node even when I list multiple nodes in the 'cifar.sh' configuration. When I run the code, it only uses the node I launched it from; the other nodes stay idle, judging by their Python processes, CPU, and memory usage. The code assigns different client indexes: [ 0 1 2 3 4 5 6 7 8 9 10 11] and mask values: [3989642.0, 3988118.0, 3991490.0, 3988348.0, 3992552.0, 3991390.0, 3990810.0, 3991162.0, 3989368.0, 3988814.0, 3989450.0, 2389378.0], yet all of them run on the same node. Following the README instructions, I ran the code from the /job/DisPFL/fedml_experiments/standalone/DisPFL directory.
I would appreciate any pointers on what needs to change.

Results on a 3-node cluster, where only a single node is used for all 3 client indexes: client_indexes: [0 1 2], mask_values: [0.0, 4551738.0, 4550940.0]

Thank You.
Best regards,
Anshita Saxena

SUNLup · 2024-04-04T06:01:57Z

How is this code reproduced? What is the operating system required?
Thank you.

Anshita1Saxena · 2024-04-04T17:28:11Z

> How is this code reproduced? What is the operating system required? thank you

Hi @SUNLup, I ran this code on a cluster of 12 servers. However, I saw that the code was running on only a single server.
That server runs Ubuntu 22.04 (NAME="Ubuntu", VERSION_ID="22.04") and has an 8-core CPU, model "AMD Ryzen Embedded V1807B with Radeon Vega Gfx". I did not use any GPU to reproduce the results.

I would greatly appreciate @rong-dai's help in explaining how to run this code in a multi-node distributed setting. The repo uses the keyword 'standalone', which I assume is because the code uses only a single server, and @rong-dai assigns the client_index based on whichever server has the data and compute. For example, when I configure 3 clients, the code runs the steps iteratively and uses the same server as clients 0, 1, and 2.
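
To illustrate the behaviour I am assuming, here is a minimal sketch of a standalone-style simulation loop. It is not taken from the DisPFL repository: `local_update` and `simulate_standalone` are hypothetical names, and the "training" is a placeholder. It only shows that when clients are iterated inside one Python process, every client reports the same host and process ID, which would match the idle nodes I observed.

```python
import os
import socket


def local_update(client_idx, weights):
    """Hypothetical stand-in for one client's local training step."""
    # Record where this client's work was actually executed.
    return {
        "client": client_idx,
        "host": socket.gethostname(),
        "pid": os.getpid(),
        # Placeholder update; real training would go here.
        "weights": [w + 1.0 for w in weights],
    }


def simulate_standalone(num_clients, weights):
    """Run every client one after another inside the current process."""
    return [local_update(idx, weights) for idx in range(num_clients)]


if __name__ == "__main__":
    # Three simulated clients, as in the 3-client run described above.
    for report in simulate_standalone(3, [0.0, 0.0]):
        print(report["client"], report["host"], report["pid"])
```

If this assumption holds, spreading clients across machines would need a separate launcher that starts one process per node (for example MPI or `torch.distributed`); whether the repository provides one is exactly what I am asking.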

Thank You.
