Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Remote Vector Index Build] Add download + indexOuput#write implementation to RemoteIndexBuildStrategy #2554

Merged
merged 1 commit into from
Feb 27, 2025

Conversation

jed326
Copy link
Contributor

@jed326 jed326 commented Feb 24, 2025

Description

As a follow-up to #2550, this PR includes the implementation + testing for downloading a graph file from the remote repository and then writing to an IndexOutput.

The key change here is that in order to manage memory consumption, we use a buffer of size 64kb, the same as used in the IndexOutputWithBuffer class, to buffer the graph file from the remote repository into the IndexOutput, else we will go OOM on the download of larger graph files.

For now, we are only using sequential download for the graph file. Parallel download of the graph file has several difficulties associated with it, both on the download side as well as on the writing to IndexOutput side. For more details on these, see: #2464. We take on parallel blob download as a future improvement.

Related Issues

Relates #2465
Relates #2392
Relates #2391

Check List

  • New functionality includes testing.
    - [ ] New functionality has been documented.
    - [ ] API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
    - [ ] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jed326 jed326 force-pushed the remote-vector-dev-2 branch from 3674aa2 to 6d950eb Compare February 24, 2025 22:27
@jed326 jed326 force-pushed the remote-vector-dev-2 branch 2 times, most recently from bc0f66a to c945c88 Compare February 25, 2025 00:02
@jed326 jed326 force-pushed the remote-vector-dev-2 branch from c945c88 to 6b6c3b1 Compare February 25, 2025 02:39
navneet1v
navneet1v previously approved these changes Feb 26, 2025
@VisibleForTesting
public void readFromRepository(String path, IndexOutputWithBuffer indexOutputWithBuffer) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be public if you are adding @VisibleForTesting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that's an artifact from rebasing. Since we're implementing an interface now, it has to be public. I will make the VectorRepositoryAccessor package private though.

@jed326 jed326 force-pushed the remote-vector-dev-2 branch from e7afbc4 to 6187dfb Compare February 26, 2025 23:26
@jed326 jed326 added the v3.0.0 label Feb 26, 2025
* @see IndexOutputWithBuffer#writeFromStreamWithBuffer(InputStream, byte[])
*/
public void writeFromStreamWithBuffer(InputStream inputStream) throws IOException {
writeFromStreamWithBuffer(inputStream, this.buffer);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Not sure why we need to wrap the method again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove this if you think it would be better, but the idea is that the buffer we use for downloading graph files for remote index builds is not necessarily the same buffer as the one we use to buffer bytes out of native memory.

As a part of #2464 we will tune this buffer size (as well as the other ones used in the upload path), but for now we just use the same buffer as a starting point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous discussion here: #2554 (comment)

@navneet1v navneet1v merged commit 64c63e7 into opensearch-project:main Feb 27, 2025
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants