Add file utils #745

Merged
merged 6 commits into from
Apr 29, 2024
29 changes: 28 additions & 1 deletion docs/griptape-framework/data/loaders.md
@@ -13,17 +13,24 @@ Inherits from the [TextLoader](../../reference/griptape/loaders/text_loader.md)

```python
from griptape.loaders import PdfLoader
from griptape.utils import load_files, load_file
import urllib.request

urllib.request.urlretrieve("https://arxiv.org/pdf/1706.03762.pdf", "attention.pdf")

# Load a single PDF file
with open("attention.pdf", "rb") as f:
    PdfLoader().load(f.read())
# You can also use the load_file utility function
PdfLoader().load(load_file("attention.pdf"))

urllib.request.urlretrieve("https://arxiv.org/pdf/1706.03762.pdf", "CoT.pdf")

# Load multiple PDF files
with open("attention.pdf", "rb") as attention, open("CoT.pdf", "rb") as cot:
    PdfLoader().load_collection([attention.read(), cot.read()])
# You can also use the load_files utility function
PdfLoader().load_collection(list(load_files(["attention.pdf", "CoT.pdf"]).values()))
```

## Sql Loader
@@ -53,12 +60,19 @@ Can be used to load CSV files into [CsvRowArtifact](../../reference/griptape/art

```python
from griptape.loaders import CsvLoader
from griptape.utils import load_file, load_files

# Load a single CSV file
with open("tests/resources/cities.csv", "r") as f:
    CsvLoader().load(f.read())
# You can also use the load_file utility function
CsvLoader().load(load_file("tests/resources/cities.csv"))

# Load multiple CSV files
with open("tests/resources/cities.csv", "r") as cities, open("tests/resources/addresses.csv", "r") as addresses:
    CsvLoader().load_collection([cities.read(), addresses.read()])
# You can also use the load_files utility function
CsvLoader().load_collection(list(load_files(["tests/resources/cities.csv", "tests/resources/addresses.csv"]).values()))
```


@@ -140,19 +154,32 @@ The Image Loader is used to load an image as an [ImageArtifact](./artifacts.md#i

```python
from griptape.loaders import ImageLoader
from griptape.utils import load_file

# Load an image from disk
with open("tests/resources/mountain.png", "rb") as f:
    disk_image_artifact = ImageLoader().load(f.read())
# You can also use the load_file utility function
ImageLoader().load(load_file("tests/resources/mountain.png"))
```

By default, the Image Loader will load images in their native format, but not all models work on all formats. To normalize the format of Artifacts returned by the Loader, set the `format` field.

```python
from griptape.loaders import ImageLoader
from griptape.utils import load_files, load_file

# Image data in artifact will be in BMP format.
# Load a single image in BMP format
with open("tests/resources/mountain.png", "rb") as f:
    image_artifact_jpeg = ImageLoader(format="bmp").load(f.read())
# You can also use the load_file utility function
ImageLoader(format="bmp").load(load_file("tests/resources/mountain.png"))

# Load multiple images in BMP format
with open("tests/resources/mountain.png", "rb") as mountain, open("tests/resources/cow.png", "rb") as cow:
    ImageLoader().load_collection([mountain.read(), cow.read()])
# You can also use the load_files utility function
ImageLoader().load_collection(list(load_files(["tests/resources/mountain.png", "tests/resources/cow.png"]).values()))
```


6 changes: 4 additions & 2 deletions griptape/utils/__init__.py
@@ -8,8 +8,8 @@
from .futures import execute_futures_dict
from .token_counter import TokenCounter
from .prompt_stack import PromptStack
-from .dict_utils import remove_null_values_in_dict_recursively
-from .dict_utils import dict_merge
+from .dict_utils import remove_null_values_in_dict_recursively, dict_merge
+from .file_utils import load_file, load_files
from .hash import str_to_hash
from .import_utils import import_optional_dependency
from .import_utils import is_dependency_installed
@@ -43,4 +43,6 @@ def minify_json(value: str) -> str:
    "constants",
    "load_artifact_from_memory",
    "deprecation_warn",
    "load_file",
    "load_files",
]
35 changes: 35 additions & 0 deletions griptape/utils/file_utils.py
@@ -0,0 +1,35 @@
import griptape.utils as utils
from concurrent import futures
from typing import Optional


def load_file(path: str) -> bytes:
    """Load a file from the given path and return its content as bytes.

    Args:
        path (str): The path to the file to load.

    Returns:
        The content of the file.
    """
    with open(path, "rb") as f:
        return f.read()


def load_files(paths: list[str], futures_executor: Optional[futures.ThreadPoolExecutor] = None) -> dict[str, bytes]:
    """Load multiple files concurrently and return a dictionary of their content.

    Args:
        paths: The paths to the files to load.
Member

Arg doc for futures_executor is missing.

        futures_executor: The executor to use for concurrent loading. If None, a new ThreadPoolExecutor will be created.

    Returns:
        A dictionary where the keys are a hash of the path and the values are the content of the files.
    """

    if futures_executor is None:
        futures_executor = futures.ThreadPoolExecutor()

    return utils.execute_futures_dict(
        {utils.str_to_hash(str(path)): futures_executor.submit(load_file, path) for path in paths}
    )
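
For readers without the rest of the codebase at hand, the keying behavior of `load_files` can be approximated with the standard library alone. This is a minimal sketch, not the merged code: the local `str_to_hash` is a stand-in for `utils.str_to_hash` (SHA-256 is an assumption), the result gathering stands in for `utils.execute_futures_dict`, and unlike the diff above the sketch owns its executor and closes it with a `with` block.

```python
import hashlib
import os
import tempfile
from concurrent import futures


def str_to_hash(text: str) -> str:
    # Stand-in for griptape's utils.str_to_hash; the SHA-256 choice is an assumption.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def load_file(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


def load_files(paths: list[str]) -> dict[str, bytes]:
    # Submit one read per path, keyed by the hash of the path. A repeated path
    # produces the same key, so only one entry per distinct path survives.
    with futures.ThreadPoolExecutor() as executor:
        pending = {str_to_hash(path): executor.submit(load_file, path) for path in paths}
        # Block until each read finishes, then swap futures for their results.
        return {key: future.result() for key, future in pending.items()}


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.txt")
    second = os.path.join(tmp, "b.txt")
    for path, content in ((first, b"alpha"), (second, b"beta")):
        with open(path, "wb") as f:
            f.write(content)

    # Three sources, two distinct paths: the result has two entries.
    files = load_files([first, first, second])
    first_key = str_to_hash(first)
    second_key = str_to_hash(second)
```

The duplicate-path collapse is the same property the unit test relies on when it passes three sources and asserts `len(files) == 2`.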
26 changes: 26 additions & 0 deletions tests/resources/foobar-many.txt
@@ -0,0 +1,26 @@
foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar

53 changes: 53 additions & 0 deletions tests/unit/utils/test_file_utils.py
@@ -0,0 +1,53 @@
import os
from griptape.loaders import TextLoader
from griptape import utils
from concurrent import futures
from tests.mocks.mock_embedding_driver import MockEmbeddingDriver

MAX_TOKENS = 50


class TestFileUtils:
    def test_load_file(self):
        dirname = os.path.dirname(__file__)
        file = utils.load_file(os.path.join(dirname, "../../resources/foobar-many.txt"))

        assert file.decode("utf-8").startswith("foobar foobar foobar")
        assert len(file.decode("utf-8")) == 4563

    def test_load_files(self):
        dirname = os.path.dirname(__file__)
        sources = ["resources/foobar-many.txt", "resources/foobar-many.txt", "resources/small.png"]
        sources = [os.path.join(dirname, "../../", source) for source in sources]
        files = utils.load_files(sources, futures_executor=futures.ThreadPoolExecutor(max_workers=1))
        assert len(files) == 2

        test_file = files[utils.str_to_hash(sources[0])]
        assert len(test_file) == 4563
        assert test_file.decode("utf-8").startswith("foobar foobar foobar")

        small_file = files[utils.str_to_hash(sources[2])]
        assert len(small_file) == 97
        assert small_file[:8] == b"\x89PNG\r\n\x1a\n"

    def test_load_file_with_loader(self):
        dirname = os.path.dirname(__file__)
        file = utils.load_file(os.path.join(dirname, "../../", "resources/foobar-many.txt"))
        artifacts = TextLoader(max_tokens=MAX_TOKENS, embedding_driver=MockEmbeddingDriver()).load(file)

        assert len(artifacts) == 39
        assert isinstance(artifacts, list)
        assert artifacts[0].value.startswith("foobar foobar foobar")

    def test_load_files_with_loader(self):
        dirname = os.path.dirname(__file__)
        sources = ["resources/foobar-many.txt"]
        sources = [os.path.join(dirname, "../../", source) for source in sources]
        files = utils.load_files(sources)
        loader = TextLoader(max_tokens=MAX_TOKENS, embedding_driver=MockEmbeddingDriver())
        collection = loader.load_collection(list(files.values()))
Member Author

Wondering if load_files should just return a list so it's easier to pass into load_collection? Don't think we can load the files concurrently if we change that though.
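
On the concurrency concern: returning a list does not by itself rule out concurrent loading, since `Executor.map` from the standard library runs the calls on worker threads while yielding results in input order. A minimal sketch, assuming a hypothetical `load_files_ordered` helper that is not part of this PR:

```python
import os
import tempfile
from concurrent import futures


def load_file(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


def load_files_ordered(paths: list[str]) -> list[bytes]:
    # Hypothetical helper, not in the PR. Reads run on worker threads, but
    # Executor.map yields results in the order of `paths`, not completion order.
    with futures.ThreadPoolExecutor() as executor:
        return list(executor.map(load_file, paths))


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, content in (("c.txt", b"3"), ("a.txt", b"1"), ("b.txt", b"2")):
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(content)
        paths.append(path)

    contents = load_files_ordered(paths)
```

The trade-off is that a plain list gives up lookup by path, which the dict return of `load_files` keeps.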

Contributor

@dylanholmes dylanholmes Apr 16, 2024

You could have two functions, something like load_file_collection and load_file_list. Then you can identify individual files in the response if you need to, otherwise the load_file_list will be more convenient. The implementation of load_file_list could just call load_file_collection (obviously).
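
The two-function idea above might look roughly like the following. It is only a sketch: `load_file_collection` and `load_file_list` are the names floated in the comment, not functions in the PR, and the local `str_to_hash` is a stand-in for `utils.str_to_hash` with SHA-256 assumed.

```python
import hashlib
import os
import tempfile
from concurrent import futures


def str_to_hash(text: str) -> str:
    # Stand-in for griptape's utils.str_to_hash; the SHA-256 choice is an assumption.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def load_file(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


def load_file_collection(paths: list[str]) -> dict[str, bytes]:
    # Keyed variant: callers can look up an individual file by the hash of its path.
    with futures.ThreadPoolExecutor() as executor:
        pending = {str_to_hash(path): executor.submit(load_file, path) for path in paths}
        return {key: future.result() for key, future in pending.items()}


def load_file_list(paths: list[str]) -> list[bytes]:
    # Convenience variant: reuse the keyed result, then restore the input
    # order (including repeated paths) by looking each path up again.
    collection = load_file_collection(paths)
    return [collection[str_to_hash(path)] for path in paths]


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.txt")
    second = os.path.join(tmp, "b.txt")
    for path, content in ((first, b"alpha"), (second, b"beta")):
        with open(path, "wb") as f:
            f.write(content)

    as_list = load_file_list([first, second, first])
    as_dict = load_file_collection([first, second, first])
```

With this split, the list form can be passed straight to `load_collection`, while the dict form stays available when individual files need to be identified.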

Another, more heavy-handed alternative could be to allow passing a dict[str, bytes] into load_collection, in addition to list[bytes], for each of the relevant loaders. A pro of this option is that load_collection could reuse the keys passed in with the input, so if the client needed to map from a file name to the loaded artifacts, they wouldn't need to do two lookups.

Currently:

sources = [ 'foo.txt' ]
content_by_filename_hash = utils.load_files(sources)
artifacts_by_content_hash = loader.load_collection(list(content_by_filename_hash.values()))

foo_content = content_by_filename_hash[hash_from_str('foo.txt')]
foo_artifact = artifacts_by_content_hash[loader.to_key(foo_content)]

With suggestion:

sources = [ 'foo.txt' ]
content_by_filename_hash = utils.load_files(sources)
artifacts_by_filename_hash = loader.load_collection(content_by_filename_hash)

# One less lookup if `load_collection` reuses keys when passed a dict
foo_artifact = artifacts_by_filename_hash[hash_from_str('foo.txt')]

If the input is the same shape as the output of load_collection, then it'd be easier to chain loaders together in general; however, I can't think of a use case besides this one.
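
To make the key-reuse suggestion concrete, here is a toy sketch. `SketchTextLoader` is hypothetical and deliberately simplified (it splits text into lines); it is not griptape's `TextLoader`, and its `to_key` hashing is an assumption made only for illustration.

```python
import hashlib
from typing import Union


class SketchTextLoader:
    """Toy stand-in for a loader, used only to illustrate key reuse."""

    def to_key(self, source: bytes) -> str:
        # Content-based key, assumed here for illustration.
        return hashlib.sha256(source).hexdigest()

    def load(self, source: bytes) -> list[str]:
        return source.decode("utf-8").splitlines()

    def load_collection(self, sources: Union[list[bytes], dict[str, bytes]]) -> dict[str, list[str]]:
        if isinstance(sources, dict):
            # Dict input: keep the caller's keys so one lookup reaches the artifacts.
            return {key: self.load(value) for key, value in sources.items()}
        # List input: fall back to keys derived from the content itself.
        return {self.to_key(source): self.load(source) for source in sources}


loader = SketchTextLoader()
content = b"first line\nsecond line"

by_caller_key = loader.load_collection({"foo.txt": content})
by_content_key = loader.load_collection([content])
```

In the dict case the caller's own key leads directly to the artifacts; in the list case the caller still has to recompute `to_key` from the content.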


Another alternative, which I'm not necessarily advocating for due to the increase in scope, would be to make loaders composable and re-introduce a file loader.


        test_file_artifacts = collection[loader.to_key(files[utils.str_to_hash(sources[0])])]
        assert len(test_file_artifacts) == 39
        assert isinstance(test_file_artifacts, list)
        assert test_file_artifacts[0].value.startswith("foobar foobar foobar")