New improved modelling for LLM Deepspeed. #230

Open · wants to merge 13 commits into base: main

Conversation

hariharan-devarajan (Collaborator)

The logic is as follows now.

Assume we have 40 layers with a tensor parallelism of 4 and a pipeline parallelism of 8.
Then the checkpoint would have 44 layers (40 + 4 tensor-pipeline layers) spread across the 32 ranks.
Since each pipeline rank spans four consecutive ranks in this case, ranks 0-3 are pipeline rank 0, ranks 4-7 are pipeline rank 1, and so on.

I then expect the layer distribution across pipeline ranks to be as follows, listed as
(pipeline_rank, start_layer_index, end_layer_index), where both start and end are inclusive (a sketch of the computation follows the list).
(0, 0, 5)
(1, 6, 11)
(2, 12, 17)
(3, 18, 23)
(4, 24, 28)
(5, 29, 33)
(6, 34, 38)
(7, 39, 43)

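For reference, here is a minimal sketch of how such a split can be computed; the layer_range helper below is illustrative, not the PR's actual code. Earlier pipeline ranks absorb the remainder, which reproduces the table above.

    def layer_range(total_layers, pipeline_parallelism, pipeline_rank):
        # Inclusive (start, end) layer indices owned by one pipeline rank.
        base, extra = divmod(total_layers, pipeline_parallelism)   # 44 layers, 8 ranks -> base=5, extra=4
        if pipeline_rank < extra:
            start = pipeline_rank * (base + 1)
            return start, start + base              # base+1 layers (6 here)
        start = extra * (base + 1) + (pipeline_rank - extra) * base
        return start, start + base - 1              # base layers (5 here)

    # 40 model layers + 4 extra layers, pipeline parallelism of 8:
    print([(r,) + layer_range(44, 8, r) for r in range(8)])
    # [(0, 0, 5), (1, 6, 11), (2, 12, 17), (3, 18, 23),
    #  (4, 24, 28), (5, 29, 33), (6, 34, 38), (7, 39, 43)]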
Also, a tensor parallelism of 4 means each layer's tensors are divided by four across the tensor-parallel ranks.
So if a layer had (1MB, 1GB) tensors, each rank would store them as 256KB and 256MB tensors.
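A quick check of that arithmetic, assuming binary units (the variable names are only illustrative):

    tensor_parallelism = 4
    layer_tensor_bytes = [1 * 2**20, 1 * 2**30]              # a 1 MB and a 1 GB tensor per layer
    shard_bytes = [b // tensor_parallelism for b in layer_tensor_bytes]
    print(shard_bytes)                                       # [262144, 268435456] -> 256 KB and 256 MB per rank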

wvaske commented Oct 7, 2024

I hit one issue while testing this. If the checkpoint files did not exist, I would see writes after doing the checkpoint and a comm.barrier(). If the checkpoint files DID exist and I was overwriting them, I didn't see this behavior.

I was able to "fix" this by adding an fsync in the pytorch_checkpointing.py file. I'm not sure whether that's the best way to fix it or whether it's a system issue, but it ensures that the checkpoint is written as a blocking operation.

    @dlp.log
    def save_state(self, suffix, state):
        name = self.get_name(suffix)
        with open(name, "wb") as f:
            torch.save(state, f)
            f.flush()               # drain Python's userspace buffer first
            os.fsync(f.fileno())    # then block until the kernel commits the file to disk

hariharan-devarajan (Collaborator, Author) commented Oct 7, 2024 via email

I am hesitant to add the fsync, as it will significantly slow down the system. Can you describe the filesystem you are writing the checkpoints to?

We probably need a flag in dlio_benchmark to enable fsync for some filesystems.
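For illustration, a minimal sketch of what such a flag could look like; the do_fsync option and the simplified class below are hypothetical, not an existing dlio_benchmark interface.

    import os
    import torch

    class PyTorchCheckpointing:
        # Sketch only; the real class lives in pytorch_checkpointing.py.
        def __init__(self, folder, do_fsync=False):
            self.folder = folder
            self.do_fsync = do_fsync        # would be driven by a new benchmark config flag

        def get_name(self, suffix):
            return os.path.join(self.folder, f"checkpoint-{suffix}.pt")

        def save_state(self, suffix, state):
            name = self.get_name(suffix)
            with open(name, "wb") as f:
                torch.save(state, f)
                if self.do_fsync:
                    f.flush()               # drain Python's userspace buffer
                    os.fsync(f.fileno())    # block until the kernel commits the file to disk

This keeps the default path unchanged and lets filesystems that need durable, blocking writes opt in.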

wvaske commented Oct 8, 2024

I'm using XFS with a single local NVMe drive. I'm OK tracking this change in my local branch for now until I can better confirm whether it's a real issue or an artifact of some system configuration.
