
Need resolve_files() method #301

Closed
volkfox opened this issue Aug 14, 2024 · 11 comments · Fixed by #313
Labels
enhancement New feature or request priority-p2

Comments

@volkfox
Contributor

volkfox commented Aug 14, 2024

Description

Datachain cannot enforce immutability of storage, so it helps to be able to check whether the files are still there. Currently there is no way to verify this, and a chain will simply crash when a file is missing.

The suggestion is to introduce a method

resolve_files(signal = "file")

which checks the file object under the "signal" name and marks the File object as valid or invalid.

This requires introducing a new field of type Boolean or Date, which will mark validity either in a True/False or in a None/"last accessed" fashion.
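A rough sketch of the proposal, using a stand-in dataclass rather than datachain's actual `File` model (all names here are illustrative, and only local paths are handled):

```python
import os
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class File:
    """Stand-in for datachain's File model (illustrative only)."""
    source: str
    path: str
    is_valid: Optional[bool] = None           # proposed Boolean variant
    last_accessed: Optional[datetime] = None  # proposed Date variant


def resolve_files(file: File) -> File:
    """Mark a File record valid or invalid based on whether it still exists."""
    # Only local paths are checked here; a real implementation would
    # dispatch on the storage scheme in file.source (s3://, gs://, ...).
    exists = os.path.exists(os.path.join(file.source, file.path))
    file.is_valid = exists
    file.last_accessed = datetime.now() if exists else None
    return file
```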

@volkfox volkfox added enhancement New feature or request priority-p2 labels Aug 14, 2024
@shcheklein
Member

@volkfox wdyt about implementing this as a regular mapper? Some users might prefer to just check the existence in the UDF when the actual file is passed (or does it crash before we even open it? Then it can be considered a bug).

@volkfox
Contributor Author

volkfox commented Aug 19, 2024

May not work well from the data consistency standpoint.
If the dataset is broken, we need to fail gracefully and tell the user to recheck it instead of silently skipping the gaps.

@shcheklein
Member

May not work well from the data consistency standpoint.

I'm not sure I understand this. Could you elaborate please?

What I mean is that users (from what I understand) can implement this additional signal with a map or gen. I mean this:

This requires introduction of a new field of the type Boolean or Date which will mark validity as True/False or None/"last accessed" fashion.

and then can decide on the next steps (if they want to filter those out or not).

Or do you mean that our UDFs break if there are non existent files referenced in the DB?

@dmpetrov
Member

wdyt about implementing this as a regular mapper?

Yes, it should be a separate mapper/generator, because we will need multiple file operations and it would explode the DC API (we already have issues with multiple json/csv helpers that need to be extracted).

I'd specify the requirement this way:

from datachain.file import resolve_files
...
dc.from_storage("s3:/...", output_name="file").map(file2=resolve_files)

This should populate the following columns:

class File(DataModel):
    source: str = Field(default="")
    path: str
    size: int = Field(default=0)                        # <---
    version: str = Field(default="")                    # <---
    etag: str = Field(default="")                       # <---
    is_latest: bool = Field(default=True)               # <---
    last_modified: datetime = Field(default=TIME_ZERO)  # <---
    location: Optional[Union[dict, list[dict]]] = Field(default=None)  # <---
    vtype: str = Field(default="")

In case of issues with a file, I'd avoid introducing an is_valid signal at this point (it breaks APIs). Just keep the signals specified above empty. Later, we can introduce is_valid if there is a need.
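A minimal sketch of the empty-fields-on-failure behavior described above, again with a simplified stand-in for the `File` model and local-filesystem access only (the etag derivation is a made-up placeholder, not datachain's):

```python
import os
from dataclasses import dataclass
from datetime import datetime, timezone

TIME_ZERO = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass
class File:
    """Simplified stand-in for the model above."""
    source: str = ""
    path: str = ""
    size: int = 0
    etag: str = ""
    is_latest: bool = True
    last_modified: datetime = TIME_ZERO


def resolve_files(file: File) -> File:
    full_path = os.path.join(file.source, file.path)
    try:
        st = os.stat(full_path)
    except OSError:
        # File is gone: keep the metadata fields at their empty defaults
        # instead of raising or adding an is_valid flag.
        return File(source=file.source, path=file.path)
    return File(
        source=file.source,
        path=file.path,
        size=st.st_size,
        etag=f"{st.st_mtime_ns:x}",  # illustrative stand-in for a real etag
        last_modified=datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
    )
```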

@volkfox
Contributor Author

volkfox commented Aug 21, 2024 via email

@volkfox
Contributor Author

volkfox commented Aug 21, 2024

How about this API design:

Use case 1. Verify existing file object:

mutate(file=resolve("file"))

If successful, the File object will have a valid size; otherwise -1.

Use case 2. Create file object from a field:

map(file=resolve(uri="uri"))

The same rules as above apply.

@shcheklein
Member

mutate(file=resolve("file"))

It contradicts mutate, though? mutate now doesn't allow mutating an existing column (we just merged a fix for this, #297). My 2cs: introducing special edge cases into an API like mutate complicates it and complicates the perception.

Use case 2. Create file object from a field:

seems like it's solving a different issue?


Is it possible, with just the current set of methods and syntax, to implement an additional valid/not-valid signal, or to check and populate the size field in the file.size column? What would that look like?

If it's not possible to modify an existing file, can users just generate a new column, name it the way they want, and use it downstream?
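One way to read the separate-column idea: a plain UDF that emits a boolean signal next to the file, leaving the File column untouched. The chain usage below is commented out because the method and parameter names are assumed, not verified against the current datachain API:

```python
import os


def file_ok(source: str, path: str) -> bool:
    """Plain UDF producing a separate boolean signal; nothing is mutated."""
    return os.path.exists(os.path.join(source, path))


# Hypothetical chain usage (names assumed):
#
#   dc.map(file_ok=file_ok, params=["file.source", "file.path"],
#          output=bool)
#   # downstream: filter on file_ok, or report rows where it is False
```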

@dmpetrov
Member

dmpetrov commented Aug 21, 2024

mutate(file=resolve("file"))

Yes, but dc.map(file=resolve_files) should overwrite it (when we support overwriting).

@dmpetrov
Member

mutate now doesn't allow mutating an existing column (we just merged a fix for this, #297). My 2cs: introducing special edge cases into an API like mutate complicates it and complicates the perception.

#297 seems like a workaround to me (explicitly disabling it until we can properly implement it).
It would be great to enable overwriting for both mutate and map/gen/agg. It's kind of syntax sugar that merges 3 commands (rename, remove, rename) into a single one.

@shcheklein
Member

It would be great to enable overwriting for both mutate as well as map/gen/agg.

sounds good.

@dmpetrov
Member

After a live discussion with @volkfox:

It seems we need 2 resolve functions:

First (in this issue),

from datachain.file import resolve
# Signature: def resolve(file: File) -> File: ...

dc.map(file1=resolve)

Second (we can create a separate issue),

from datachain.file import resolve_uri
# Signature: def resolve_uri(uri: str) -> File: ...

dc.map(file=resolve_uri, params="link_to_file")

Both should create a File record with all resolved fields, such as etag.

Additionally, the overwrites go to separate issues #337 & #336
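A sketch of how the two agreed functions could relate, with a stand-in `File` dataclass and local-filesystem resolution only (the URI splitting and every field beyond source/path/size are simplifying assumptions):

```python
import os
from dataclasses import dataclass


@dataclass
class File:
    """Stand-in for datachain's File model (illustrative only)."""
    source: str = ""
    path: str = ""
    size: int = 0


def resolve(file: File) -> File:
    """First function: re-resolve an existing File record against storage."""
    try:
        st = os.stat(os.path.join(file.source, file.path))
        return File(file.source, file.path, st.st_size)
    except OSError:
        return File(file.source, file.path)  # fields stay empty on failure


def resolve_uri(uri: str) -> File:
    """Second function: build a resolved File record from a raw URI string."""
    # Naive split on the last "/"; a real implementation would parse the
    # storage scheme properly.
    source, _, path = uri.rpartition("/")
    return resolve(File(source=source, path=path))
```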
