Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for data proxy #16

Merged
merged 3 commits into from
Nov 9, 2023

Conversation

gdiepen
Copy link
Contributor

@gdiepen gdiepen commented Nov 8, 2023

If the edgenode is behind a High Availability Proxy, the problem is that this proxy might require a rewrite of the original url.

For example, if the hostname is clusterhost and the port is 8443, the normal address used by webhdfs API endpoint is:
https://clusterhost:8443/webhdfs/v1.

In the case of a HA proxy, this might require the following URL:
https://clusterhost:8443/gateway/cluster/webhdfs/v1

The webhdfs fsspec has support for a data_proxy parameter that allows you to either provide a callable or a dictionary with find/replace values.

After a quick discussion on the DVC discord, opted to assume the source url for this find/replace dictionary will always be constructed from the url provided, the protocol provided and the hardcoded suffix /webhdfs/v1. Therefore, the user will only have to provide the target part of the rewrite (in the example above https://clusterhost:443/gateway/cluster/webhdfs/v1) to ensure every occurence of https://clusterhost:8443/webhdfs/v1 will be rewritten to https://clusterhost:443/gateway/cluster/webhdfs/v1

I am not sure what kind of additional unit tests would be useful for this (since I also don't see any unit tests with regards to the ssl_verify parameter).

I tested this code in combination with a patched version of DVC (add user and data_proxy_target parameters to the config_schema.json to support the parameters in DVC config) with the version of fsspec that has my recent PR for supporting basic authentication (fsspec/filesystem_spec#1409) and this has been running without any problems.

Related PRs:

Comment on lines 52 to 63
# If target data_proxy provided construct the source from host/port
if "data_proxy_target" in self.fs_args:
host = self.fs_args["host"]
port = self.fs_args["port"]

protocol = "https" if self.fs_args.get("use_https") else "http"

source_url = f"{protocol}://{host}:{port}/webhdfs/v1"
self.fs_args["data_proxy"] = {
source_url: self.fs_args["data_proxy_target"]
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we need to do this in _prepare_credentials above instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw that some things happened in the _prepare_credentials, but since this had nothing to do with credentials I did not include it in there. Happy to move it though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is unfortunate, I agree. We use it for transforming our config to fs config. E.g. https://github.com/iterative/dvc-s3/blob/d153738232ac1a6b3920ee7a72b9004f07c193d2/dvc_s3/__init__.py#L131 So let's move it there indeed, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a new version. It looks ok to me (famous last words 😉) but will need to wait till tomorrow till I can give this modified code a test at work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Then waiting for confirmation from your side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a test with this new code and was able to push and pull data from the remote WebHDFS server

@efiop efiop merged commit b1b4fe8 into iterative:main Nov 9, 2023
12 checks passed
@efiop
Copy link
Contributor

efiop commented Nov 9, 2023

@gdiepen Thank you! dvc-webhdfs 2.20 is out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants