Code run in the distributed node cluster settings #4

Open
Anshita1Saxena opened this issue Apr 2, 2024 · 2 comments

Anshita1Saxena commented Apr 2, 2024

Hi @rong-dai,

I hope this message finds you well. I tried to run the code on a multi-node cluster, but it runs on a single node even though I specify multiple nodes in the configuration file 'cifar.sh'. When I run the code, it only uses the node from which I launched it; the other nodes stay idle, judging by their Python processes and their CPU and memory usage. The code assigns different client indexes: [ 0 1 2 3 4 5 6 7 8 9 10 11] and mask values: [3989642.0, 3988118.0, 3991490.0, 3988348.0, 3992552.0, 3991390.0, 3990810.0, 3991162.0, 3989368.0, 3988814.0, 3989450.0, 2389378.0], but all of them run on the same node. Following the README instructions, I ran the code from the /job/DisPFL/fedml_experiments/standalone/DisPFL directory.
I would appreciate any pointers on what needs to change.
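
One way to confirm where each client actually runs is to record the host name and process ID next to the client index. The following is only a minimal sketch and is not taken from the DisPFL repository; `describe_client_placement` is a hypothetical helper name, and it assumes it is called from the process that trains the listed clients.

```python
import os
import socket


def describe_client_placement(client_indexes):
    """Return one record per client with the host and PID of the calling process.

    Hypothetical helper: it only reports where *this* process runs, so it
    should be called from whichever process trains the given clients.
    """
    host = socket.gethostname()
    pid = os.getpid()
    return [{"client": int(idx), "host": host, "pid": pid} for idx in client_indexes]


if __name__ == "__main__":
    # If every record shows the same host and PID, all clients share one process.
    for record in describe_client_placement(range(12)):
        print(record)
```

If all 12 records report the same host and PID, the clients are being simulated inside a single process rather than spread over the cluster.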

Results on a 3-node cluster, where only a single node is used for all 3 client indexes: client_indexes: [0 1 2], mask_values: [0.0, 4551738.0, 4550940.0]
[Figure: accuracy_vs_communication_round]

Thank You.
Best regards,
Anshita Saxena


SUNLup commented Apr 4, 2024

How is this code reproduced? What is the operating system required?
Thank you.

@Anshita1Saxena (Author)

> How is this code reproduced? What is the operating system required? Thank you.

Hi @SUNLup, I ran this code on a cluster of 12 servers. However, I saw that it was running on only a single server.
That server runs Ubuntu 22.04 (NAME="Ubuntu", VERSION_ID="22.04") and has an 8-core CPU, model "AMD Ryzen Embedded V1807B with Radeon Vega Gfx". I did not use any GPU to reproduce the results.

I would sincerely appreciate @rong-dai's help in explaining how to run this code in a multi-node distributed setting. The repo uses the keyword 'standalone', which I assume is because the code uses only a single server, with each client_index assigned on whichever server holds the data and compute. For example, when I configure 3 clients, the code runs the steps iteratively and uses the same server as clients 0, 1, and 2.
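
To make the distinction concrete, here is a small sketch of how client indexes could be split across nodes, compared with the standalone case. It is an illustration under assumptions, not the repository's code: `clients_for_rank` and `standalone_schedule` are hypothetical names, and `world_size` and `rank` are assumed to be supplied by a multi-process launcher such as `mpirun` or `torchrun`.

```python
def clients_for_rank(num_clients, world_size, rank):
    """Round-robin assignment: client i is handled by rank i % world_size.

    world_size and rank are assumed to come from an external launcher;
    nothing here starts processes on other nodes.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return [i for i in range(num_clients) if i % world_size == rank]


def standalone_schedule(num_clients):
    """Standalone simulation: one process trains every client in turn."""
    return clients_for_rank(num_clients, world_size=1, rank=0)


if __name__ == "__main__":
    # Standalone: the single process handles clients 0, 1 and 2 itself.
    print("standalone:", standalone_schedule(3))
    # Distributed sketch: 12 clients spread over 3 ranks.
    for rank in range(3):
        print("rank", rank, "->", clients_for_rank(12, world_size=3, rank=rank))
```

With one process, the schedule is simply every client in order, which matches what I observe. A real multi-node run would additionally need one process per node and a way for the clients to exchange models, which this sketch does not cover.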

Thank You.
