[Data] Remove read task warning if size bytes not set in metadata (#46765)

`Datasource.get_read_tasks()` returns a list of `ReadTask`, where each
`ReadTask` encapsulates a data reading function and its associated
output metadata. If a serialized `ReadTask` is more than 100KB and
`size_bytes` isn't set in the output metadata (which is often the case),
then Ray Data emits a warning per read task:

```
WARNING: the read task size (288451 bytes) is larger than the reported output size of the task (None bytes). This may be a size reporting bug in the datasource being read from.
```

This warning isn't helpful and usually doesn't indicate an actual issue,
so I'm removing it.

Signed-off-by: Balaji Veeramani <[email protected]>
bveeramani authored Jul 26, 2024
1 parent 89a728d commit 8d2b459
Showing 1 changed file with 19 additions and 10 deletions.
29 changes: 19 additions & 10 deletions python/ray/data/_internal/planner/plan_read_op.py
@@ -1,3 +1,4 @@
+import logging
 from typing import Iterable, List

 import ray
@@ -25,22 +26,30 @@
 READ_FILE_MAX_ATTEMPTS = 10
 READ_FILE_RETRY_MAX_BACKOFF_SECONDS = 32

+logger = logging.getLogger(__name__)
+

-# Defensively compute the size of the block as the max size reported by the
-# datasource and the actual read task size. This is to guard against issues
-# with bad metadata reporting.
 def cleaned_metadata(read_task: ReadTask):
     block_meta = read_task.get_metadata()
     task_size = len(cloudpickle.dumps(read_task))
+    if (
+        block_meta.size_bytes is not None
+        and task_size > block_meta.size_bytes
+        and task_size > TASK_SIZE_WARN_THRESHOLD_BYTES
+    ):
+        logger.warning(
+            f"The read task size ({task_size} bytes) is larger "
+            "than the reported output size of the task "
+            f"({block_meta.size_bytes} bytes). This may be a size "
+            "reporting bug in the datasource being read from."
+        )
+
+    # Defensively compute the size of the block as the max size reported by the
+    # datasource and the actual read task size. This is to guard against issues
+    # with bad metadata reporting.
     if block_meta.size_bytes is None or task_size > block_meta.size_bytes:
-        if task_size > TASK_SIZE_WARN_THRESHOLD_BYTES:
-            print(
-                f"WARNING: the read task size ({task_size} bytes) is larger "
-                "than the reported output size of the task "
-                f"({block_meta.size_bytes} bytes). This may be a size "
-                "reporting bug in the datasource being read from."
-            )
         block_meta.size_bytes = task_size

     return block_meta
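
To make the behavior change easier to see, here is a minimal standalone sketch of the new logic that runs without Ray. `FakeBlockMetadata` and `cleaned_metadata_sketch` are hypothetical stand-ins, not Ray APIs, and the threshold value is assumed from the "100KB" figure in the commit message. The sketch takes the serialized task size as a plain argument instead of computing it with `cloudpickle`.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed value for illustration, based on the "100KB" mentioned in the
# commit message; the real constant lives elsewhere in Ray Data.
TASK_SIZE_WARN_THRESHOLD_BYTES = 100_000


@dataclass
class FakeBlockMetadata:
    """Hypothetical stand-in for the metadata reported by a read task."""

    size_bytes: Optional[int]


def cleaned_metadata_sketch(
    block_meta: FakeBlockMetadata, task_size: int
) -> FakeBlockMetadata:
    # Warn only when the datasource actually reported a size, that size is
    # smaller than the serialized task, and the task exceeds the threshold.
    if (
        block_meta.size_bytes is not None
        and task_size > block_meta.size_bytes
        and task_size > TASK_SIZE_WARN_THRESHOLD_BYTES
    ):
        logger.warning(
            f"The read task size ({task_size} bytes) is larger "
            "than the reported output size of the task "
            f"({block_meta.size_bytes} bytes)."
        )

    # The defensive size correction still applies in every case,
    # including when no size was reported at all.
    if block_meta.size_bytes is None or task_size > block_meta.size_bytes:
        block_meta.size_bytes = task_size

    return block_meta


# Size not reported: corrected silently (this case used to warn).
unset = cleaned_metadata_sketch(FakeBlockMetadata(size_bytes=None), 288_451)
# Size reported but implausibly small: corrected, and a warning is logged.
too_small = cleaned_metadata_sketch(FakeBlockMetadata(size_bytes=1_000), 288_451)
# Size reported and larger than the task: left untouched, no warning.
plausible = cleaned_metadata_sketch(FakeBlockMetadata(size_bytes=500_000), 288_451)
```

In all three cases the returned `size_bytes` is the larger of the reported size and the task size; only the second case logs anything.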

