Enable outgoing ports for IPFS peering #2069
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

---
Thank you for opening this, @d70-t! It was very nice to meet you at the zarr / IPFS meeting yesterday. /cc @rabernat who introduced and invited me to it, and @bollwyvl who is the ardent IPFS fan in the Jupyter community.

Currently, we explicitly allow only a certain set of ports for outgoing network traffic (see `mybinder/values.yaml`, line 49 at commit a988f3e, in mybinder.org-deploy).

However, the question is whether just opening up port 4001 for outgoing traffic is enough, or if incoming connections also need to be allowed. Incoming connections aren't really possible, as we're in an environment with a couple of NATs between the IPFS process and the open internet. So, do you know how we can test this, @d70-t?

---
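For readers unfamiliar with how mybinder.org restricts egress: the allow-list is expressed as Kubernetes NetworkPolicy-style port rules. The snippet below is an illustrative sketch of what adding 4001 would look like, not the verbatim contents of `values.yaml` (the exact keys and surrounding structure in the real chart may differ):

```yaml
# Sketch of a NetworkPolicy egress allow-list for user pods.
# Ports 80/443 stand in for the existing allowed set; the 4001
# entries are the hypothetical addition for IPFS swarm traffic.
egress:
  - ports:
      - protocol: TCP
        port: 80
      - protocol: TCP
        port: 443
      - protocol: TCP
        port: 4001   # IPFS swarm (TCP)
      - protocol: UDP
        port: 4001   # IPFS swarm (QUIC)
```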
#2070 should open the outbound port - one option is to just try that and see if it works.

---
I'm not familiar with IPFS, but it sounds like HTTP gateways are available: https://docs.ipfs.io/how-to/address-ipfs-on-web/#http-gateways

---
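As a concrete illustration of that gateway workaround: an IPFS content path can be rewritten into a plain HTTP URL in either the path-style or subdomain-style form described in those docs. The gateway hosts below (`ipfs.io`, `dweb.link`) are just well-known examples; any public or self-hosted gateway follows the same scheme.

```python
# Sketch: turning an IPFS content path into HTTP gateway URLs.
# Follows the addressing scheme from
# https://docs.ipfs.io/how-to/address-ipfs-on-web/

def path_gateway_url(ipfs_path: str, gateway: str = "https://ipfs.io") -> str:
    """Path-style gateway URL: https://<gateway>/ipfs/<cid>/..."""
    return gateway.rstrip("/") + ipfs_path

def subdomain_gateway_url(ipfs_path: str, gateway_host: str = "dweb.link") -> str:
    """Subdomain-style gateway URL: https://<cid>.ipfs.<host>/...
    (real subdomain gateways want a case-insensitive CIDv1 here and
    normally redirect legacy Qm... CIDs to that form)."""
    _, namespace, cid, *rest = ipfs_path.split("/")
    return f"https://{cid}.{namespace}.{gateway_host}/" + "/".join(rest)

print(path_gateway_url("/ipfs/QmSgvgwxZGaBLqkGyWemEDqikCqU52XxsYLKtdy3vGZ8uq"))
```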
Thanks @yuvipanda and @manics for having a look into this.

For checking, it should be sufficient to start up my testing repo, open a console window, and try whether some basic IPFS commands work. In particular, I'd check whether the node is able to connect to any peers by issuing `ipfs swarm peers`, and whether files can be retrieved, e.g. `ipfs cat /ipfs/QmSgvgwxZGaBLqkGyWemEDqikCqU52XxsYLKtdy3vGZ8uq > spaceship-launch.jpg` should download an image.

For the HTTP gateways: yes, that's a current workaround, but it has several drawbacks. Using HTTP (especially over the network) somewhat defeats the purpose of IPFS, although it is a helpful transitioning technology. The goal of the IPFS protocol is to retrieve data based on its content-id from any server able to provide that dataset. To do so, a node connects to many peers (a configurable number) and asks them for the data, or for advice on where to find it. One particular benefit is that this removes any single point of failure from the data-retrieval step.

When choosing HTTP gateways, there are two options: a public one, or a private / self-hosted / self-operated one. Public gateways can more easily be overloaded and tend to return data far slower than any private / local node would. They also introduce further dependencies on services which explicitly state that one should not rely on their operation. Hosting one's own gateway would speed things up (one could make it prefer one's own data), but operating it is not a trivial task: one has to choose between providing everything (which enables the use of datasets of other colleagues) and providing only one's own data (which saves you from accidentally re-providing illegal content). Neither option is ideal. This problem disappears when accessing the data via the IPFS protocol, because there is no intermediary which re-provides (unknown) data. Instead, the requesting node is pointed directly to the colleague's machine and requests the data from there.

In the case of mybinder.org, I see the benefit in quickly starting up an environment which "just works". In that context, I would argue that having datasets delivered quickly and having the option to experiment with other, similar datasets are both very valuable, which is why I would prefer using the IPFS protocol directly over using HTTP on the "first mile".

---
Wooop woop 🚂 it works like a charm! (I've updated the testing repo to polish things a bit more...)

---
Awesome stuff.

We've got go-ipfs up at conda-forge, so it would be quite easy for folks to try it out by normal means, especially with jupyter-server-proxy. This would get one to a dashboard.
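For context, wiring a local IPFS daemon through jupyter-server-proxy is usually done with a traitlet config along these lines. This is a sketch, not a verified recipe: the `command`, the API port `5001`, and the entry name `"ipfs"` are assumptions about a default go-ipfs install.

```python
# jupyter_server_config.py -- illustrative config fragment, not tested here.
c.ServerProxy.servers = {
    "ipfs": {
        "command": ["ipfs", "daemon"],   # start the daemon alongside Jupyter
        "port": 5001,                    # go-ipfs API/web-UI port (default)
        "launcher_entry": {"title": "IPFS"},
    }
}
```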
Effective use in binder might increase ipfs' value in mamba, where @wolfv had been considering it as a candidate for swappable backends... Unfortunately, there are some information-theory problems (one can't just turn a shasum into an ipfs CID), but they really would be a lovely match.
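To unpack that parenthetical: a bare sha256 digest corresponds to a CID only in the special case of a single raw, unchunked block. In general IPFS chunks a file into a DAG and hashes that, so the whole-file shasum a package index publishes never appears in the CID. A minimal stdlib-only sketch of the raw-block case (CIDv1, raw codec, sha2-256):

```python
import base64
import hashlib

def raw_block_cid_v1(data: bytes) -> str:
    """CIDv1 for a single raw block: multibase 'b' + base32 of
    [version, raw codec, sha2-256 multihash]. Only meaningful for data
    small enough to stay one block (ipfs add --cid-version=1 --raw-leaves
    on a tiny file); larger files are chunked into a DAG, so their
    whole-file sha256 never shows up in the CID."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CIDv1, 0x55 = raw codec, 0x12 = sha2-256, 0x20 = 32-byte digest
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    b32 = base64.b32encode(cid_bytes).decode().lower().rstrip("=")
    return "b" + b32  # 'b' is the multibase prefix for lowercase base32

print(raw_block_cid_v1(b"hello worlds"))  # raw sha2-256 CIDs start "bafkrei..."
```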
As to abuse: I don't think anybody is going to even want to _use_ the ipfs-adjacent filecoin in binder, much less mine with it:

- https://docs.filecoin.io/get-started/lotus/installation/#minimal-requirements
- https://docs.filecoin.io/mine/hardware-requirements/#general-hardware-requirements

---
Thanks for the input @bollwyvl! I'd love to see IPFS used more commonly; that would greatly reduce the amount of convincing necessary to get people involved :-)

I'm not yet very familiar with conda or mamba, but if there are simpler ways of getting an IPFS node up and running than what's shown above, I'd greatly appreciate more input on that, not only for binder but also e.g. for HPC environments.

There's also a related thread at pangeo-forge, where we're thinking about other uses of zarr datasets on top of IPFS. If packages, scripts, and data were served over IPFS (or other content-addressable systems), it might really become possible to have reproducible computing environments at relatively little extra cost. Very exciting...

---
Dear mybinder people,
I'm a happy user of binder and like the possibility to quickly open an interactive notebook, especially for teaching. There has been a field campaign with many participating people and institutions, producing a large number of atmospheric and oceanographic datasets, which we hope to make more visible and usable by creating the How to EUREC4A online book, consisting mainly of notebooks. We want to give users the opportunity to quickly interact with the book using the magnificent binder project.
We also face the issue that our datasets are scattered around the world and, for various reasons, it is not easy to collect all of them in a single place, nor to ensure that datasets are not modified by accident. Furthermore, the currently used servers are unreachable from time to time, which severely degrades the user experience. To address these issues, we are experimenting with IPFS, a content-addressable distributed file system, to hold our data (as zarr). IPFS is a peer-to-peer system which makes it possible for everyone to host copies of the data, which should make data retrieval more reliable and enables us to keep the data distributed (which we like).
So now to the issue: I've been trying to set up a sample repository which runs an IPFS node within a binder environment, such that I would be able to retrieve datasets stored on IPFS from within binder using the native IPFS protocol (which enables the desired redundancy in dataset retrieval). As far as I can tell, the IPFS node was able to start but could not connect to any peers. I assume the reason is that outgoing traffic on port 4001 (TCP and UDP) is not allowed on binder.
What do you think about accessing IPFS content from binder? Would it be possible to open up port 4001 for this purpose? Do you see other possibilities to access IPFS data?
I'm tagging @yuvipanda, as we've briefly talked about this before.