Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Enable HTTP range header support for data requests #61

Closed
wants to merge 3 commits into from

Conversation

pescennius
Copy link
Contributor

Pull Request: HTTP Range Header Support

Change Summary

This pull request introduces support for the HTTP Range header in our project, offering enhanced capabilities for node runners. The primary motivation behind this change is to enable developers to leverage SQLite and DuckDB more seamlessly with Arweave. Both of these technologies rely on range requests, enabling the browser to function as an edge compute platform for file-based databases.

Motivation

The introduction of HTTP Range header support brings several benefits to users:

  1. Empowering Developers: This change opens up exciting opportunities for developers to take full advantage of technologies such as sql.js-httpvfs and DuckDB Wasm. It allows the browser to efficiently manage data retrieval, optimizing the user experience.

  2. Efficient Resource Utilization: With support for the HTTP Range header, users can intelligently manage client applications, reducing egress and conserving bandwidth. This enhancement promotes efficient resource utilization. In the long run, pushing down the data slicing optimizations further could potentially increase some efficiency of the node itself.

How to Verify

To confirm that HTTP Range header support is functioning as expected, you can easily perform a test. Here's an example using the curl command:

  1. Open your terminal.

  2. Execute the following commands with the ar-io folder as the current directory:

    docker build . -t ar-io-core:http_range
    docker run -p 4000:4000 -v ar-io-data:/app/data ar-io-core:http-range
    curl -r 100-199 http://localhost:4000/IjdF4EVfdpTGvLymQjFqbMYOIaBn0Q9oUSz3tMTiQkw/
    

Resources

For more information about the technologies I'm enabling, please refer to the following resources:

@djwhitt
Copy link
Collaborator

djwhitt commented Oct 30, 2023

Hey, @pescennius. Thanks for the PR! This is definitely a feature we'd like to integrate. I think there are a couple things we'll want to change in terms of where the range application to the stream happens in the code, but this is a great start. I'll get some notes on that written up in the next day or two about proposed, and then we can discuss. You, and anyone else interested in this functionality, should be feel comfortable building on it though prior to integration. The API for this is already standardized by itself HTTP, so that won't change.

@djwhitt
Copy link
Collaborator

djwhitt commented Oct 30, 2023

Side note - I love this motivation:

Empowering Developers: This change opens up exciting opportunities for developers to take full advantage of technologies such as sql.js-httpvfs and DuckDB Wasm. It allows the browser to efficiently manage data retrieval, optimizing the user experience.

Shared data sets on Arweave is one of the applications I'm personally most excited about.

@pescennius
Copy link
Contributor Author

@djwhitt awesome thanks for the quick reply. Def open to notes on implementation. Thanks for the opportunity to contribute 😅

Copy link
Collaborator

@djwhitt djwhitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I think we should work on this this as follows:

  1. Address the inline comments.
  2. Add some tests. We don't currently have any for handlers, but they're structured to support testing without too much hassle.
  3. Push some of the range handling down into the ContiguousDataSource implementations so that we can avoid reading all the data from caches (FS and S3 in the future).

Let's focus on 1 and 2 and not worry too much about 3 yet. For 2 (tests) don't worry about getting them complete. I'd just like something minimal in place to help prevent regressions. If you get stuck on any part of that, please reach out. We (ar.io team) will help fill in the gaps as needed.

src/routes/data/handlers.ts Show resolved Hide resolved
@@ -64,6 +65,70 @@ const setDataHeaders = ({
res.header('Content-Length', data.size.toString());
};

const handlePartialDataResponse = (log: Logger, rangeHeader: string, res: Response, data: ContiguousData, dataAttributes: ContiguousDataAttributes | undefined ) => {
const totalSize = data.size;
const parts = rangeHeader.replace(/bytes=/, "").split("-");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably use a range request parsing library here. We've used https://github.com/jshttp/range-parser elsewhere internally and it's worked well.

const chunkSize = end - start + 1;

// Check if the range is valid
if (start >= 0 && end < totalSize && start <= end) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try to handle multiple ranges too since that's part of the spec. https://www.npmjs.com/package/byte-range-stream should work for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the library but my testing with curl doesn't seem to be making the multi range work. I still just get back the content of the first part of the range

@pescennius
Copy link
Contributor Author

@djwhitt I'll handle the smaller requests you've noted in terms of code structure this evening. I agree #3 is a good optimization but a bit out of my depth right now. Do you have any pointers on where to start with a test? Go/Python is my normal backend toolchain so some of this is a bit new for me.

@djwhitt
Copy link
Collaborator

djwhitt commented Nov 6, 2023

@pescennius Sounds good. It's fine to hold off on tests. I'll try to get some basic test scaffolding in place for HTTP handlers this week that you can build off of.

@djwhitt
Copy link
Collaborator

djwhitt commented Nov 6, 2023

Also, re #3 - no worries, I figured that would end up landing on us, but wanted to offer just in case. I think we can merge without it too. It's really just a performance and memory optimization. It doesn't need to hold up the core functionality.

@djwhitt
Copy link
Collaborator

djwhitt commented Nov 10, 2023

@pescennius Finally found some time to add an example test here: 1538561

I'll try to spend some more time on this next week too, adding more tests and possibly getting into the lower level refactoring.

@pescennius
Copy link
Contributor Author

@djwhitt let me know how I can best be helpful from here, including handing things off to you to wrap up.

@djwhitt
Copy link
Collaborator

djwhitt commented Nov 19, 2023

@pescennius No worries. I'll play around with it this week and see if I can make some progress.

@djwhitt
Copy link
Collaborator

djwhitt commented Nov 22, 2023

@pescennius Give this branch a try when you find time and let me know that it works okay for your use case: https://github.com/ar-io/ar-io-node/tree/PE-4908-range-request-integration

It includes a bit of refactoring and improved error handling, but ignores multiple ranges for now (handles them with 200 + full data). I started to add multiple ranges too but realized it would be terribly inefficient till I got range handling integrated into the lower layers of the data stack. I didn't want to hold up integration for that though, so I'll do that in a future branch/PR.

@pescennius
Copy link
Contributor Author

@djwhitt took a look at a branch and the changes look good 👍 . My only thought is that the multi range should raise a 416 instead of a 200. imo its better for users if they know explicitly that their request didn't return what they were expecting rather than return the full payload. That way they don't have to try to detect this case themselves since most languages request libraries won't validate that.

@djwhitt
Copy link
Collaborator

djwhitt commented Nov 27, 2023

@pescennius Thanks for the feedback. I think you're right. I switched it to 416 and merged it develop.

Closing this PR since the integration work happened in #64.

I'll try to get support for multiple ranges added soon too, but realistically it might not land till the new year.

Thanks again for the contribution!

@djwhitt djwhitt closed this Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants