Description
I think that the interface regarding path inputs should be more consistent(ly defined).
- While some file systems do accept simple paths such as
/
or/folder
, others do not, e.g., for HTTPFileSystem, I have to specify the whole URL including the port again and again for each access, e.g.,result = fs.ls("http://127.0.0.1:8000")
. - I would have expected that I could use the
name
returned byls
calls to be usable for the nextls
orstat
call, but some file systems, again HTTP (but also WebDAV, maybe all HTTP based ones), may or may not require the name to further be escaped. However, simply escaping all paths also does not work because that would result in 404 errors in other file systems, I believe - Some file systems return full paths in ls, some relative paths. I don't think this is specified sufficiently.
E.g., consider this case where ls
even returns some mix of escaped and unescaped symbols, which seems to be clearly a bug:
Server set up with:
mkdir -p /tmp/served
echo foo > /tmp/served/'#not-a-good-name!?'
ruby -run -e httpd /tmp/served/ --port 8000 --bind-address=127.0.0.1
import pprint
import urllib
from fsspec.implementations.http import HTTPFileSystem as HFS
url = "http://127.0.0.1:8000"
fs = HFS(url)
# What I would have expected to work:
# result = fs.ls("/")
results = fs.ls(url)
pprint.pprint(results)
print()
for result in results:
if '/?' not in result['name']:
# Neither do work
# pprint.pprint(fs.stat(urllib.parse.quote(result['name'])))
pprint.pprint(fs.stat(result['name']))
Output:
[{'name': 'http://127.0.0.1:8000/%23not-a-good-name!?',
'size': None,
'type': 'file'},
{'name': 'http://127.0.0.1:8000/?N=D', 'size': None, 'type': 'file'},
{'name': 'http://127.0.0.1:8000/?S=D', 'size': None, 'type': 'file'},
{'name': 'http://127.0.0.1:8000/?M=D', 'size': None, 'type': 'file'}]
{'ETag': '4013b0-4-670a77e8',
'mimetype': 'application/octet-stream',
'name': 'http://127.0.0.1:8000/%23not-a-good-name%21%3F',
'size': 4,
'type': 'file',
'url': 'http://127.0.0.1:8000/%23not-a-good-name!%3F'}
Traceback (most recent call last):
File "~/.local/lib/python3.12/site-packages/fsspec/implementations/http.py", line 422, in _info
await _file_info(
File "~/.local/lib/python3.12/site-packages/fsspec/implementations/http.py", line 832, in _file_info
r = await session.get(url, allow_redirects=ar, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/.local/lib/python3.12/site-packages/aiohttp/client.py", line 590, in _request
raise err_exc_cls(url)
aiohttp.client_exceptions.InvalidUrlClientError: http://127.0.0.1:8000/%2523not-a-good-name!%3F
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/media/d/Myself/projects/ratarmount/worktrees/1/test-http.py", line 15, in <module>
pprint.pprint(fs.stat(urllib.parse.quote(result['name'])))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/.local/lib/python3.12/site-packages/fsspec/spec.py", line 1605, in stat
return self.info(path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 118, in wrapper
return sync(self.loop, func, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 103, in sync
raise return_result
File "~/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 56, in _runner
result[0] = await coro
^^^^^^^^^^
File "~/.local/lib/python3.12/site-packages/fsspec/implementations/http.py", line 435, in _info
raise FileNotFoundError(url) from exc
FileNotFoundError: http%3A//127.0.0.1%3A8000/%2523not-a-good-name%21%3F
The #
is escaped as %23
but ?
is not escaped leading to the observed FileNotFoundError
.
Using urllib.parse.quote
also does not work because it would escape %
again even after going through the trouble of parsing the whole URL because we do not want to escape the special characters in http://127.0.0.1:8000
only those in the path. Even this would result in %2523not-a-good-name%21%3F
, which also leads to a FileNotFoundError
.
HTTPFileSystem
returning those sorting actions such as ?N=D
is another matter altogether, but because it parses generated HTML listing from arbitrary HTTP servers, I am not surprised that this is not stable and might be impossible to fix for all cases.
Possibly related issues: