[S3 storage_plugin] Seeing No credential issue at random intervals when saving / restoring snapshot from S3. #142

Open
@hbikki

🐛 Describe the bug

When loading a snapshot from S3, we intermittently hit a NoCredentialsError. The failure occurs at random intervals.
The issue looks very similar to aio-libs/aiobotocore#1006.
It did not happen when running <=5 processes (an assumption based on tests with varying process counts), but the error is consistent when running >5 processes.

 Snapshot.take(path=str(save_dir), app_state=app_state)
  • Experimented with adding retries with exponential backoff when restoring the snapshot.
  • Tried different versions of aiobotocore.
  • Verified from the logs that the _credential value is present.
  • Verified from the logs that credentials are available:
    /0 [6]:[2023-05-14 00:49:02,211][aiobotocore.credentials][INFO] - Found credentials from IAM Role:
  • The issue doesn't happen when the credentials are set via the ~/.aws/credentials file or environment variables.
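
For reference, the retry with exponential backoff mentioned in the list above can be sketched roughly as follows. This is a minimal illustration, not the code that was actually tested: retry_with_backoff and its parameters are hypothetical names, and the exception types to retry on are passed in by the caller (for this issue that would presumably be botocore.exceptions.NoCredentialsError).

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    op: Callable[[], Awaitable[T]],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> T:
    """Await op(), retrying on the given exceptions with jittered backoff.

    Hypothetical helper for illustration; not part of torchsnapshot.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await op()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Delay cap doubles each attempt: base, 2*base, 4*base, ...
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter so many processes do not retry in lockstep.
            await asyncio.sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

A call site would wrap the read in a zero-argument coroutine function, e.g. `await retry_with_backoff(lambda: storage.read(read_io=read_io), (NoCredentialsError,))`, where `storage` and `read_io` stand in for the objects seen in the traceback below.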

NOTE:
I don't see the failure after updating the S3 storage_plugin to use a boto3 S3 client or botocore.session.
Testing ran for about 2 hours (~100 checkpoints).
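
A rough sketch of what such a boto3-based read path could look like is below. It is an assumption about the shape of the change, not the actual patch: read_object and _blocking_get are hypothetical names, the client is expected to be something like boto3.client("s3") created by the caller, and the blocking get_object call is moved to the default thread pool so the event loop is not stalled.

```python
import asyncio
from typing import Any, Dict, Optional, Tuple


def _blocking_get(
    client: Any,
    bucket: str,
    key: str,
    byte_range: Optional[Tuple[int, int]],
) -> bytes:
    """Synchronous GetObject; meant to run in a worker thread."""
    kwargs: Dict[str, Any] = {"Bucket": bucket, "Key": key}
    if byte_range is not None:
        start, end = byte_range  # assumption: end is exclusive
        # HTTP Range headers use an inclusive end offset.
        kwargs["Range"] = f"bytes={start}-{end - 1}"
    response = client.get_object(**kwargs)
    return response["Body"].read()


async def read_object(
    client: Any,
    bucket: str,
    key: str,
    byte_range: Optional[Tuple[int, int]] = None,
) -> bytes:
    """Run the blocking S3 read in the default executor.

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None, _blocking_get, client, bucket, key, byte_range
    )
```

With this shape, credential resolution happens inside the synchronous botocore code path rather than through aiobotocore, which matches the observation above that the failure was not seen with boto3.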

Logs:

checkpointing_ddp/0 [3]:Traceback (most recent call last):
checkpointing_ddp/0 [3]:  File "/home/User/torchsnapshot/torchsnapshot/scheduler.py", line 369, in read_buffer
checkpointing_ddp/0 [3]:    await self.storage.read(read_io=read_io)
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-35' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155640>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,590][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [3]:  File "/home/User/torchsnapshot/torchsnapshot/storage_plugins/s3.py", line 60, in read
checkpointing_ddp/0 [3]:    response = await client.get_object(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/client.py", line 354, in _make_api_call
checkpointing_ddp/0 [3]:    http, parsed_response = await self._make_request(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/client.py", line 379, in _make_request
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,610][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [3]:    return await self._endpoint.make_request(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 96, in _send_request
checkpointing_ddp/0 [3]:    request = await self.create_request(request_dict, operation_model)
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 84, in create_request
checkpointing_ddp/0 [0]:task: <Task pending name='Task-36' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155790>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,634][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-37' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155550>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-38' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007c10>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-39' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007ac0>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-40' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f596ea95fa0>()]>>
checkpointing_ddp/0 [3]:    await self._event_emitter.emit(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/hooks.py", line 66, in _emit
checkpointing_ddp/0 [3]:    response = await resolve_awaitable(handler(**kwargs))
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/_helpers.py", line 15, in resolve_awaitable
checkpointing_ddp/0 [3]:    return await obj
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/signers.py", line 24, in handler
checkpointing_ddp/0 [3]:    return await self.sign(operation_name, request)
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/signers.py", line 82, in sign
checkpointing_ddp/0 [3]:    auth.add_auth(request)
checkpointing_ddp/0 [3]:  File "/opt/conda/envs/User/lib/python3.9/site-packages/botocore/auth.py", line 418, in add_auth
checkpointing_ddp/0 [3]:    raise NoCredentialsError()
checkpointing_ddp/0 [3]:botocore.exceptions.NoCredentialsError: Unable to locate credentials


Versions

pytorch = 2.0.0+cu117
torchx-nightly>=2023.3.15
torchsnapshot=0.1.0
