Add file utils #745
Conversation
sources = ["tests/resources/test.txt"]
files = utils.load_files(sources)
loader = TextLoader(max_tokens=MAX_TOKENS, embedding_driver=MockEmbeddingDriver())
collection = loader.load_collection(list(files.values()))
Wondering if `load_files` should just return a list so it's easier to pass into `load_collection`? Don't think we can load the files concurrently if we change that though.
You could have two functions, something like `load_file_collection` and `load_file_list`. Then you can identify individual files in the response if you need to; otherwise `load_file_list` will be more convenient. The implementation of `load_file_list` could just call `load_file_collection` (obviously).
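For concreteness, the two-function shape described above might look roughly like this (a sketch only; the `_read` helper and the path-based keys are illustrative, since the PR actually keys the result by a hash of the path):

```python
from concurrent import futures


def _read(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


def load_file_collection(paths: list[str]) -> dict[str, bytes]:
    # Load the files concurrently; keyed by path here for illustration
    # (the PR keys by a hash of the path instead).
    with futures.ThreadPoolExecutor() as executor:
        return dict(zip(paths, executor.map(_read, paths)))


def load_file_list(paths: list[str]) -> list[bytes]:
    # Convenience wrapper: the same loads, returned as a plain list that
    # can be passed straight into load_collection.
    return list(load_file_collection(paths).values())
```

Since `executor.map` yields results in input order, `load_file_list` preserves the order of `paths`.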
Another, more heavy-handed alternative could be to allow passing a `dict[str, bytes]` into `load_collection` in addition to `list[bytes]` for each of the relevant loaders. A pro to this option is that you could reuse the original key from the input, so if a client needed to map from a file name to the file contents, they wouldn't need to do two lookups.
Currently:
sources = ["foo.txt"]
content_by_filename_hash = utils.load_files(sources)
artifacts_by_content_hash = loader.load_collection(list(content_by_filename_hash.values()))
foo_content = content_by_filename_hash[hash_from_str("foo.txt")]
foo_artifact = artifacts_by_content_hash[loader.to_key(foo_content)]
With suggestion:
sources = ["foo.txt"]
content_by_filename_hash = utils.load_files(sources)
artifacts_by_filename_hash = loader.load_collection(content_by_filename_hash)
# One less lookup if `load_collection` reuses keys when passed a dict
foo_artifact = artifacts_by_filename_hash[hash_from_str("foo.txt")]
If the input is the same shape as the output of `load_collection`, then it'd be easier to chain loaders together in general; however, I can't think of a use case besides this one.
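A rough sketch of the dict-accepting `load_collection` idea (the `SketchLoader` class and its methods are hypothetical stand-ins, not griptape's actual loader API):

```python
from __future__ import annotations

import hashlib
from typing import Union


class SketchLoader:
    """Hypothetical stand-in for a loader; not griptape's actual API."""

    def to_key(self, content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    def load(self, content: bytes) -> str:
        # Stand-in for real artifact creation (chunking, embedding, etc.).
        return content.decode()

    def load_collection(self, sources: Union[dict[str, bytes], list[bytes]]) -> dict[str, str]:
        if isinstance(sources, dict):
            # Reuse the caller's keys: filename-keyed input yields
            # filename-keyed output, saving the client a second lookup.
            return {key: self.load(content) for key, content in sources.items()}
        # Otherwise derive keys from the content, as today.
        return {self.to_key(content): self.load(content) for content in sources}
```

Because dict input comes back keyed by the caller's own keys, the output of one loader could in principle be fed to another, which is what makes chaining possible.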
Another alternative, which I'm not necessarily advocating for due to the increase in scope: we could make loaders composable and re-introduce the file loader.
Nice work!
Approving since the core use case is solved, but I've added some alternative suggestions for you to consider at your discretion.
Also just learned a new Python syntax, maybe these changes aren't needed?
with open("tests/resources/cities.csv", "r") as cities, open("tests/resources/addresses.csv", "r") as addresses:
    CsvLoader().load_collection([cities.read(), addresses.read()])
griptape/utils/file_utils.py (outdated)
        A dictionary where the keys are a hash of the path and the values are the content of the files.
    """
    futures_executor = futures.ThreadPoolExecutor()
Can we move this into a method parameter with a default value?
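One way the suggestion could look (a sketch; `_read` and the path-based keys are illustrative, and the `None` sentinel avoids constructing an executor at import time):

```python
from __future__ import annotations

from concurrent import futures
from typing import Optional


def _read(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


def load_files(
    paths: list[str],
    futures_executor: Optional[futures.ThreadPoolExecutor] = None,
) -> dict[str, bytes]:
    # Use a None sentinel rather than `ThreadPoolExecutor()` as the default,
    # since a default argument value would be built once at import time.
    if futures_executor is None:
        futures_executor = futures.ThreadPoolExecutor()
    with futures_executor as executor:
        # Keyed by path for brevity; the PR keys by a hash of the path.
        return dict(zip(paths, executor.map(_read, paths)))
```

Note the `with` block shuts the executor down on exit, which could surprise callers passing in a long-lived executor; the real implementation might want to skip shutdown for caller-supplied executors.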
I can, but do we still want this PR even with the discussed syntax above? You can see it live here
Ha! Does it parallelize `open`?
I'm struggling to find a first-party resource that discusses the concurrency, but all mentions of the syntax online seem to suggest it is concurrent.
Even so, it might be nice to provide these utils as a slightly less painful migration path for users.
Ended up adding file utils to docs too
tests/resources/foobar.txt (outdated)
The previously used `test.txt` is generated by other tests and is not a reliable source.
Bump on this question.
"""Load multiple files concurrently and return a dictionary of their content.

Args:
    paths: The paths to the files to load.
Arg doc for `futures_executor` is missing.
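One possible wording for the missing entry (suggested text only, not the PR's final docstring):

```python
def load_files(paths, futures_executor=None):
    """Load multiple files concurrently and return a dictionary of their content.

    Args:
        paths: The paths to the files to load.
        futures_executor: The executor to run the file loads on; a new
            ThreadPoolExecutor is created when one is not provided.
    """
```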
I can't figure out why tests are failing on 3.9 and 3.11. Some sort of file race condition? The file should be there...
Adds file utils to fill the void left by the removal of file-loading logic in Loaders. Specifically relevant in doc examples that use `load_collection`.