Need resolve_files() method #301
@volkfox wdyt about implementing this as a regular mapper? Some users might prefer to just check the existence in the UDF when the actual file is passed (or does it crash before we even open it? Then it can be considered a bug).

---

May not work well from the data consistency standpoint.

---

I'm not sure I understand this. Could you elaborate, please? What I mean is that users (from what I understand) can implement this additional signal with a … and then decide on the next steps (whether they want to filter those out or not). Or do you mean that our UDFs break if there are nonexistent files referenced in the DB?
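To make the mapper-style idea concrete, here is a minimal framework-free sketch of the kind of user-side UDF body being discussed — `file_exists` is a hypothetical name, not part of the datachain API:

```python
import os

def file_exists(path: str) -> bool:
    # Hypothetical user-side UDF body: emit an existence signal
    # instead of crashing later when the file is actually opened.
    return os.path.isfile(path)
```

A mapper could attach `file_exists(file.path)` as an extra signal, and the user could then filter on it before any file is read.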
---

Yes, it should be a separate mapper/generator, because we will need multiple file operations and it explodes the DC API (we already have issues with the multiple json/csv methods that need to be extracted). I'd specify the requirement this way:

```python
from datachain.file import resolve_files

...
dc.from_storage("s3:/...", output_name="file").map(file2=resolve_files)
```

This should populate the following columns:

```python
class File(DataModel):
    source: str = Field(default="")
    path: str
    size: int = Field(default=0)  # <---
    version: str = Field(default="")  # <---
    etag: str = Field(default="")  # <---
    is_latest: bool = Field(default=True)  # <---
    last_modified: datetime = Field(default=TIME_ZERO)  # <---
    location: Optional[Union[dict, list[dict]]] = Field(default=None)  # <---
    vtype: str = Field(default="")
```

In case of issues with a file, I'd avoid introducing an `is_valid` signal at this point (it breaks APIs). Just keep the signals specified above empty. Later, we can introduce `is_valid` if there is a need.
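As an illustration of the "keep the signals empty on failure" behavior, here is a hedged, framework-free sketch. `FileRecord` is a simplified stand-in for datachain's `File` model, and the local-filesystem `os.stat` call stands in for real storage metadata; the actual implementation would differ:

```python
import os
from dataclasses import dataclass
from datetime import datetime, timezone

TIME_ZERO = datetime.fromtimestamp(0, tz=timezone.utc)

@dataclass
class FileRecord:  # simplified stand-in for datachain's File model
    path: str
    size: int = 0
    etag: str = ""
    last_modified: datetime = TIME_ZERO

def resolve_files(file: FileRecord) -> FileRecord:
    # On any problem, return a record with the fields left at their
    # empty defaults rather than raising or adding an is_valid signal.
    try:
        st = os.stat(file.path)
    except OSError:
        return FileRecord(path=file.path)
    return FileRecord(
        path=file.path,
        size=st.st_size,
        etag=str(st.st_mtime_ns),  # local stand-in for a storage etag
        last_modified=datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
    )
```

A missing file yields a record with `size == 0` and an empty `etag`, which downstream steps can treat as "unresolved" without any new boolean signal.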
---

Unsure what the intention of the example above is. Create yet another file record? And if I check the file existence 5 times, will I have 5 file records? Create yet another file object, this time with …
---

How about this API design?

Use case 1. Verify an existing file object:

```python
mutate(file=resolve("file"))
```

Use case 2. Create a file object from a field:

```python
map(file=resolve(uri="uri"))
```

The rules are the same as above.
---

It contradicts the mutate, though? It seems like it's solving a different issue? Is it possible, with just the current set of methods and syntax, to implement an additional valid/invalid signal, or to check and populate the size field in the file.size column? How would that look, if it's not possible to modify an existing …
---

Yes, but …
---

#297 seems like a workaround to me (explicitly disabling it until we can properly implement it).
---

Sounds good.
---

After a live discussion with @volkfox, it seems we need two resolve functions.

First (in this issue):

```python
from datachain.file import resolve

# Signature: def resolve(file: File) -> File: ...
dc.map(file1=resolve)
```

Second (we can create a separate issue):

```python
from datachain.file import resolve_uri

# Signature: def resolve_uri(uri: str) -> File: ...
dc.map(file=resolve_uri, params="link_to_file")
```

Both should create a File record with all resolved fields such as … Additionally, the overwrites go to separate issues #337 & #336.
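The pair of signatures can be sketched without the framework as follows. This is a toy local-filesystem illustration, not the datachain implementation: `FileRecord` and `_size_or_zero` are hypothetical stand-ins:

```python
import os
from dataclasses import dataclass

@dataclass
class FileRecord:  # simplified stand-in for datachain's File
    path: str
    size: int = 0

def _size_or_zero(path: str) -> int:
    # Per the spec above: on failure, leave fields empty instead of raising.
    try:
        return os.stat(path).st_size
    except OSError:
        return 0

def resolve(file: FileRecord) -> FileRecord:
    # First function: re-resolve an existing File record.
    return FileRecord(path=file.path, size=_size_or_zero(file.path))

def resolve_uri(uri: str) -> FileRecord:
    # Second function: build a File record from a plain URI/path field.
    return FileRecord(path=uri, size=_size_or_zero(uri))
```

Both paths converge on the same resolved record shape, which is what lets the two issues share one set of "resolved fields".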
---

Description
Datachain cannot enforce immutability of storage, so it helps to be able to check whether the files are still there. Currently there is no way to verify this, and a chain will simply crash on a missing file.

The suggestion is to introduce a resolve_files() method, which checks the file object under the "signal" name and marks the File object as valid or invalid.

This requires introducing a new field of type Boolean or Date, which will mark validity either as True/False or in a None/"last accessed" fashion.
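The two proposed field shapes can be illustrated side by side. These class and field names are hypothetical, purely to show the Boolean vs. Date alternatives, and are not part of the datachain API:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class BoolValidity:
    # Boolean style: True/False validity flag.
    is_valid: bool = False

@dataclass
class DateValidity:
    # Date style: None means "never verified",
    # a datetime records the last successful access.
    last_accessed: Optional[datetime] = None
```

The Date variant carries strictly more information (when the file was last seen), at the cost of a nullable column.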