New improved modelling for LLM Deepspeed. #230

Open · wants to merge 13 commits into base: main

Conversation

hariharan-devarajan (Collaborator)

The logic is as follows now.

Assume we have 40 layers with a tensor parallelism of 4 and a pipeline parallelism of 8.
Then the checkpoint would have 44 layers (40 + 4 tensor-pipeline layers) spread across the 32 ranks.
Since each pipeline rank spans four consecutive ranks in this case, ranks 0-3 are pipeline rank 0, ranks 4-7 are pipeline rank 1, and so on.

I then expect the layer distribution across pipeline ranks to be as follows, listed as
(pipeline_rank, start_layer_index, end_layer_index), where both start and end are inclusive (a sketch of the computation follows the list).
(0, 0, 5)
(1, 6, 11)
(2, 12, 17)
(3, 18, 23)
(4, 24, 28)
(5, 29, 33)
(6, 34, 38)
(7, 39, 43)

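For reference, here is a minimal sketch of how such a split can be computed; the layer_range helper below is illustrative, not the PR's actual code. Earlier pipeline ranks absorb the remainder, which reproduces the table above.

    def layer_range(total_layers, pipeline_parallelism, pipeline_rank):
        # Inclusive (start, end) layer indices owned by one pipeline rank.
        base, extra = divmod(total_layers, pipeline_parallelism)   # 44 layers, 8 ranks -> base=5, extra=4
        if pipeline_rank < extra:
            start = pipeline_rank * (base + 1)
            return start, start + base              # base+1 layers (6 here)
        start = extra * (base + 1) + (pipeline_rank - extra) * base
        return start, start + base - 1              # base layers (5 here)

    # 40 model layers + 4 extra layers, pipeline parallelism of 8:
    print([(r,) + layer_range(44, 8, r) for r in range(8)])
    # [(0, 0, 5), (1, 6, 11), (2, 12, 17), (3, 18, 23),
    #  (4, 24, 28), (5, 29, 33), (6, 34, 38), (7, 39, 43)]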
Also, a tensor parallelism of 4 means each layer's tensors are divided by four across the tensor-parallel ranks.
So if a layer had (1MB, 1GB) tensors, each rank would store them as 256KB and 256MB tensors.
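A quick check of that arithmetic, assuming binary units (the variable names are only illustrative):

    tensor_parallelism = 4
    layer_tensor_bytes = [1 * 2**20, 1 * 2**30]              # a 1 MB and a 1 GB tensor per layer
    shard_bytes = [b // tensor_parallelism for b in layer_tensor_bytes]
    print(shard_bytes)                                       # [262144, 268435456] -> 256 KB and 256 MB per rank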

wvaske commented Oct 7, 2024

I hit one issue while testing this. If the checkpoint files did not exist, I would see writes after doing the checkpoint and a comm.barrier(). If the checkpoint files DID exist and I was overwriting them, I didn't see this behavior.

I was able to "fix" this by adding an fsync in the pytorch_checkpointing.py file. I'm not sure whether that's the best way to fix it or whether it's a system issue, but it ensures that the checkpoint is written as a blocking operation.

    @dlp.log
    def save_state(self, suffix, state):
        name = self.get_name(suffix)
        with open(name, "wb") as f:
            torch.save(state, f)
            f.flush()               # drain Python's userspace buffer first
            os.fsync(f.fileno())    # then block until the kernel commits the file to disk

hariharan-devarajan (Collaborator, Author) commented Oct 7, 2024 via email

I am hesitant to add the fsync, as it will significantly slow down the system. Can you describe the filesystem you are writing the checkpoints to?

We probably need a flag in dlio_benchmark to enable fsync for some filesystems.
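For illustration, a minimal sketch of what such a flag could look like; the do_fsync option and the simplified class below are hypothetical, not an existing dlio_benchmark interface.

    import os
    import torch

    class PyTorchCheckpointing:
        # Sketch only; the real class lives in pytorch_checkpointing.py.
        def __init__(self, folder, do_fsync=False):
            self.folder = folder
            self.do_fsync = do_fsync        # would be driven by a new benchmark config flag

        def get_name(self, suffix):
            return os.path.join(self.folder, f"checkpoint-{suffix}.pt")

        def save_state(self, suffix, state):
            name = self.get_name(suffix)
            with open(name, "wb") as f:
                torch.save(state, f)
                if self.do_fsync:
                    f.flush()               # drain Python's userspace buffer
                    os.fsync(f.fileno())    # block until the kernel commits the file to disk

This keeps the default path unchanged and lets filesystems that need durable, blocking writes opt in.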

wvaske commented Oct 8, 2024

I'm using XFS with a single local NVMe drive. I'm OK tracking this change in my local branch for now until I can better confirm whether it's a real issue or an artifact of some system configuration.
