-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for data proxy #16
Conversation
dvc_webhdfs/__init__.py
Outdated
# If target data_proxy provided construct the source from host/port | ||
if "data_proxy_target" in self.fs_args: | ||
host = self.fs_args["host"] | ||
port = self.fs_args["port"] | ||
|
||
protocol = "https" if self.fs_args.get("use_https") else "http" | ||
|
||
source_url = f"{protocol}://{host}:{port}/webhdfs/v1" | ||
self.fs_args["data_proxy"] = { | ||
source_url: self.fs_args["data_proxy_target"] | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we need to do this in _prepare_credentials
above instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that some things happened in the _prepare_credentials
, but since this had nothing to do with credentials I did not include it in there. Happy to move it though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is unfortunate, I agree. We use it for transforming our config to fs config. E.g. https://github.com/iterative/dvc-s3/blob/d153738232ac1a6b3920ee7a72b9004f07c193d2/dvc_s3/__init__.py#L131 So let's move it there indeed, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pushed a new version. It looks ok to me (famous last words 😉) but will need to wait till tomorrow till I can give this modified code a test at work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Then waiting for confirmation from your side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a test with this new code and was able to push and pull data from the remote WebHDFS server
Addressing comment of @efiop
@gdiepen Thank you! dvc-webhdfs 2.20 is out |
If the edgenode is behind a High Availability Proxy, the problem is that this proxy might require a rewrite of the original url.
For example, if the hostname is
clusterhost
and the port is 8443, the normal address used by webhdfs API endpoint is:https://clusterhost:8443/webhdfs/v1
.In the case of a HA proxy, this might require the following URL:
https://clusterhost:8443/gateway/cluster/webhdfs/v1
The webhdfs fsspec has support for a
data_proxy
parameter that allows you to either provide a callable or a dictionary with find/replace values.After a quick discussion on the DVC discord, opted to assume the source url for this find/replace dictionary will always be constructed from the url provided, the protocol provided and the hardcoded suffix
/webhdfs/v1
. Therefore, the user will only have to provide the target part of the rewrite (in the example abovehttps://clusterhost:443/gateway/cluster/webhdfs/v1
) to ensure every occurence ofhttps://clusterhost:8443/webhdfs/v1
will be rewritten tohttps://clusterhost:443/gateway/cluster/webhdfs/v1
I am not sure what kind of additional unit tests would be useful for this (since I also don't see any unit tests with regards to the ssl_verify parameter).
I tested this code in combination with a patched version of DVC (add
user
anddata_proxy_target
parameters to the config_schema.json to support the parameters in DVC config) with the version of fsspec that has my recent PR for supporting basic authentication (fsspec/filesystem_spec#1409) and this has been running without any problems.Related PRs: