Ls / listdir fails when using a public gateway because a non-CID path is requested #39

Open
mxmlnkn opened this issue Oct 11, 2024 · 2 comments · May be fixed by #40

Comments


mxmlnkn commented Oct 11, 2024

All of these do work:

IPFS_GATEWAY='https://ipfs.io python3' -c "
import fsspec
with fsspec.open('ipfs://QmZ4tDuvesekSs4qM5ZBKpXiZGun7S2CYtEZRB3DYXkjGx', 'r') as f:
    print(f.read())
"  # hello worlds

# Folder with a single file to emulate a named file.
folderCID=bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4

ipfs daemon & sleep 3
ipfs get "$folderCID"
stat -c %s "$folderCID/welcome-to-IPFS.jpg" # 663082

python3 -c "
import fsspec
fs, _ = fsspec.url_to_fs('ipfs://$folderCID')
print(fs.ls('$folderCID'))
" 
# [{'name': 'bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4/welcome-to-IPFS.jpg',
#  'CID': 'bafkreie7ohywtosou76tasm7j63yigtzxe7d5zqus4zu3j6oltvgtibeom', 'type': 'file', 'size': 663082}]

The last command fails when the IPFS_GATEWAY environment variable is set:

IPFS_GATEWAY='https://ipfs.io' python3 -c "
import fsspec; fs, _ = fsspec.url_to_fs('ipfs://$folderCID'); print(fs.ls('$folderCID'))" 

Error:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "~/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "~/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/ipfsspec/async_ipfs.py", line 302, in _ls
    return await self.gateway.ls(path, session, detail=detail)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/ipfsspec/async_ipfs.py", line 148, in ls
    return await asyncio.gather(*(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/ipfsspec/async_ipfs.py", line 73, in info
    self._raise_not_found_for_status(res, path)
  File "~/.local/lib/python3.12/site-packages/ipfsspec/async_ipfs.py", line 162, in _raise_not_found_for_status
    response.raise_for_status()
  File "~/.local/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 1157, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 406, message='Not Acceptable', url='https://trustless-gateway.link/ipfs/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4/welcome-to-IPFS.jpg?format=raw'

That error 406 can be reproduced with a simple wget on the URL shown in the error message.
The URL already looks wrong. Instead of the structured https://trustless-gateway.link/ipfs/<CID>/welcome-to-IPFS.jpg?format=raw, I would have expected it to be simply a CID.
Using the simple file CID shown in the working ls output, i.e., wget 'https://trustless-gateway.link/ipfs/bafkreie7ohywtosou76tasm7j63yigtzxe7d5zqus4zu3j6oltvgtibeom?format=raw' gets me the desired file without an error.
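The difference between the two URL shapes can be checked mechanically. The following is a minimal sketch using only the Python standard library; the helper names `split_gateway_path` and `is_cid_only` are hypothetical and not part of ipfsspec. It assumes gateway URLs of the form `<gateway>/ipfs/<CID>[/<subpath>]`, as in the error message above.

```python
from urllib.parse import urlsplit


def split_gateway_path(url):
    """Return (cid, subpath) for a URL shaped like <gateway>/ipfs/<cid>[/<subpath>]."""
    parts = urlsplit(url).path.split("/")
    # parts looks like ['', 'ipfs', '<cid>', 'sub', 'path', ...]
    if len(parts) < 3 or parts[1] != "ipfs" or not parts[2]:
        raise ValueError(f"not an /ipfs/ gateway URL: {url}")
    return parts[2], "/".join(parts[3:])


def is_cid_only(url):
    """True if the URL addresses a bare CID without any sub-path."""
    _, subpath = split_gateway_path(url)
    return subpath == ""


if __name__ == "__main__":
    failing = ("https://trustless-gateway.link/ipfs/"
               "bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4"
               "/welcome-to-IPFS.jpg?format=raw")
    working = ("https://trustless-gateway.link/ipfs/"
               "bafkreie7ohywtosou76tasm7j63yigtzxe7d5zqus4zu3j6oltvgtibeom?format=raw")
    print(is_cid_only(failing), is_cid_only(working))
```

Applied to the two URLs from this report, the request that returned 406 carries a sub-path, while the one that succeeded addresses a bare CID.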

Btw, does that mean that a simple ls will download all files? Or will it only download the header to determine the file type? Else, this might be running into the same performance problem as the fsspec HTTP backend: fsspec/filesystem_spec#1707

Collaborator

d70-t commented Oct 13, 2024

Thanks for opening this!
I couldn't run the snippet as-is; I guess some of the quotes are misplaced. Anyway, I think I found the source of the problem. But first, let me answer the last comment:

Btw, does that mean that a simple ls will download all files? Or will it only download the header to determine the file type?

Partly, and yes, that's unfortunate, and related to the issue with the HTTP backend. To determine whether a CID refers to a directory or a file, and to determine the exact file size, one has to inspect the root IPLD block behind the CID of the file or directory. To my knowledge, there's no way to request this information from either an IPFS path gateway or a trustless gateway using only a single request to the containing folder. There's also the recommendation for implementing a directory index for an HTTP Gateway, which suggests not showing the actual file size, as it is expensive to compute. Long story short: when using .ls(..., detail=True), we go the extra mile and fetch that information using additional requests. With detail=False, the additional requests are skipped. The additional requests, however, use ?format=raw, which means that only the (one) root IPLD block of that file or folder is returned. Larger files (> 1-2MB) are usually split into multiple IPLD blocks, in which case the other blocks are not fetched. So it's not a full-file download, but it still can be pretty bad.
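As a rough illustration of the two modes, the sketch below wraps the standard fsspec `ls(path, detail=...)` call. The wrapper names `list_names` and `list_with_sizes` are hypothetical, and the claim that `detail=False` avoids the per-entry requests rests on the explanation above rather than on a check of every ipfsspec version. The `name` and `size` keys are the ones visible in the `ls` output earlier in this thread.

```python
def list_names(fs, path):
    """Return entry names only; per the explanation above, no per-entry lookups are needed."""
    return fs.ls(path, detail=False)


def list_with_sizes(fs, path):
    """Return {name: size}; this needs the detailed listing and its extra requests."""
    return {entry["name"]: entry["size"] for entry in fs.ls(path, detail=True)}


if __name__ == "__main__":
    import fsspec  # requires fsspec and ipfsspec to be installed

    cid = "bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4"
    fs, _ = fsspec.url_to_fs(f"ipfs://{cid}")
    print(list_names(fs, cid))
```

If only the names are needed, for example to filter by extension before opening a single file, the cheaper call is probably sufficient.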


Now the error 406 appears at an interesting place. It seems like ipfs.io forwards requests including ?format=raw to trustless-gateway.link.

curl -vL 'https://ipfs.io/ipfs/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4/welcome-to-IPFS.jpg?format=raw' > /dev/null

...
> GET /ipfs/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4/welcome-to-IPFS.jpg?format=raw HTTP/2
> Host: ipfs.io
...
< HTTP/2 301 
...
< location: https://trustless-gateway.link/ipfs/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4/welcome-to-IPFS.jpg?format=raw
...
> GET /ipfs/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4/welcome-to-IPFS.jpg?format=raw HTTP/2
> Host: trustless-gateway.link
...
< HTTP/2 406

I think the redirect is a reasonable choice (it should at least improve the ability to cache things), but it creates an interesting problem:
ipfs.io is a path gateway, which by my interpretation may support ?format=raw requests for all path requests (and the Gateway in kubo indeed does). trustless-gateway.link, however, is a trustless gateway, which must support ?format=raw requests but should not accept paths for ?format=raw, hence the 406. There are likely a few options to fix this issue:

  • do ?format=car requests
    • that's better from a trustless point of view, as this should include all intermediate CIDs and thus should allow verifying the entire path. It does, however, require parsing the CAR and the additional blocks.
  • do requests directly to CIDs only, where ?format=raw is always allowed
    • that may reduce some traffic and may or may not speed up path resolution
    • this may require some refactoring when handling ipns:// links, as e.g. an ls on an IPNS name may then require asking for CIDs through the ipfs:// protocol
  • rethink whether there might be another way to gather the detail information using HEAD requests
  • ask ipfs.io to resolve the path to a CID prior to forwarding to trustless-gateway.link
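The first two options can be sketched as a small request builder. This is only an illustration under stated assumptions, not the ipfsspec implementation: the function name `gateway_request` is hypothetical, and refusing raw requests with a sub-path is a design choice made here to mirror the behaviour observed above. The `Accept` values are the IPLD media types `application/vnd.ipld.raw` and `application/vnd.ipld.car`.

```python
from urllib.parse import quote

# Media types for the two response formats discussed above.
ACCEPT = {
    "raw": "application/vnd.ipld.raw",
    "car": "application/vnd.ipld.car",
}


def gateway_request(gateway, cid, subpath="", fmt="raw"):
    """Build (url, headers) for a gateway request; hypothetical helper.

    Raw block requests are only built for bare CIDs, because a trustless
    gateway may answer 406 when ?format=raw is combined with a path.
    CAR requests may carry a sub-path.
    """
    if fmt not in ACCEPT:
        raise ValueError(f"unsupported format: {fmt}")
    if fmt == "raw" and subpath:
        raise ValueError("raw requests are only built for bare CIDs")
    url = f"{gateway.rstrip('/')}/ipfs/{cid}"
    if subpath:
        url += "/" + quote(subpath.strip("/"))
    return f"{url}?format={fmt}", {"Accept": ACCEPT[fmt]}


if __name__ == "__main__":
    folder = "bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4"
    print(gateway_request("https://trustless-gateway.link", folder,
                          "welcome-to-IPFS.jpg", fmt="car"))
```

Parsing the returned CAR, or resolving a path to its CID before asking for the raw block, is left out here; both are the harder parts of either option.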

d70-t added a commit that referenced this issue Oct 14, 2024
Trustless gateways may not always return raw IPLD blocks when a path is
requested. They should however always return car.

Fixes #39
Author

mxmlnkn commented Oct 17, 2024

So it's not a full-file download, but it still can be pretty bad.

That already sounds much better than I feared. But, I guess some caching at some level should be done to avoid getting this information more than once.
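Since a CID identifies its content by hash, the metadata behind a given CID does not change, so caching by CID should be safe. A minimal sketch of such a cache follows; `make_cached_info` and `fetch_info` are hypothetical names, the sketch is synchronous while ipfsspec itself is asynchronous, and it says nothing about where ipfsspec would best place such a cache.

```python
def make_cached_info(fetch_info):
    """Wrap a fetch function so that each CID is looked up at most once."""
    cache = {}

    def cached_info(cid):
        # Content behind a CID is immutable, so entries never need invalidation.
        if cid not in cache:
            cache[cid] = fetch_info(cid)
        return cache[cid]

    return cached_info


if __name__ == "__main__":
    def slow_lookup(cid):
        print("fetching", cid)  # stands in for a gateway request
        return {"CID": cid, "type": "file"}

    info = make_cached_info(slow_lookup)
    info("bafkreie7ohywtosou76tasm7j63yigtzxe7d5zqus4zu3j6oltvgtibeom")
    info("bafkreie7ohywtosou76tasm7j63yigtzxe7d5zqus4zu3j6oltvgtibeom")  # served from cache
```

This would not help for ipns:// names, which are mutable and would need some form of expiry.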

As for the suggestions for remedying the 406 error, that's a bit over my head.
I'm pretty new to IPFS. Although, I am very intrigued and would like to delve a bit deeper when I find time.
That's partly why I did not reply immediately. But, the last option seems the worst to me. It feels like if this is fixed for ipfs.io, then there can appear any number of other gateways with the same problem, except maybe if there is some kind of standardization... But, if ipfs.io clearly behaves wrongly, then they should be notified.
