
Improvements for reference integrity checking for large collection inventories (~>100MB) #751

Merged
4 commits merged into main on Nov 15, 2023

Conversation

@al-niessner (Contributor) commented Nov 3, 2023

🗒️ Summary

Expand the limits so that validate-refs can use all available memory, up to 128 GB, before failing. Also fixed an XPath expression problem.
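
As a rough, hedged illustration (not code from this PR): the heap ceiling the JVM honors is fixed at launch, e.g. with the standard `-Xmx` flag, and can be inspected at runtime (`validate.jar` below is a hypothetical jar name):

```java
// Hypothetical illustration, not code from this PR: inspect the heap ceiling
// the JVM was launched with, e.g. java -Xmx128g -jar validate.jar ...
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("JVM may use up to %.1f GB of heap%n",
                maxBytes / (1024.0 * 1024 * 1024));
    }
}
```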

⚙️ Test Data and/or Report

All automated unit tests pass (see the checks below).

♻️ Related Issues

Closes #748
Closes #750

@al-niessner self-assigned this Nov 3, 2023
@al-niessner requested a review from a team as a code owner Nov 3, 2023 16:35
@al-niessner changed the title from "use 128 GB of memory if available but probably just run out of heap" to "improvements to reference integrity checking" Nov 3, 2023
@jordanpadams (Member)

@al-niessner is there a way we can paginate this into smaller chunks to avoid requiring so much memory?

@jordanpadams (Member)

@al-niessner See comment above ☝️

@jordanpadams (Member) left a review comment

If possible, I would like us to find a way to cut down on the amount of memory required for this type of task. The example that caused the software to break is not going to be the largest data set we have; we need to be able to support much, much larger ones.

@al-niessner (Contributor, Author)

@jordanpadams

The code is not downloading 1,000,000 labels at once, which is what would allow for pagination. It loads one lidvid at a time in a loop. It keeps some bookkeeping in local memory, but not much.
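
A minimal sketch of that loop, with hypothetical names (`fetchLabel`, `checkReferences`) standing in for the real validate internals:

```java
import java.util.List;

// Hypothetical sketch of the per-lidvid loop described above; the names are
// illustrative, not the actual validate internals.
public class RefCheckSketch {
    static String fetchLabel(String lidvid) {
        return "<label for " + lidvid + ">";  // stand-in for one HTTP fetch
    }

    static void checkReferences(String label) {
        // stand-in for the real reference-integrity check
    }

    public static void main(String[] args) {
        List<String> inventory =
                List.of("urn:nasa:pds:x::1.0", "urn:nasa:pds:y::1.0");
        for (String lidvid : inventory) {
            String label = fetchLabel(lidvid);  // one label in memory at a time
            checkReferences(label);             // check, then let it be collected
        }
    }
}
```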

The actual size error in #748 looks like it comes from fetching a very, very large collection that the Java HTTP stack could not load. Again, there is nothing there to paginate.

@jordanpadams (Member)

@al-niessner Sorry for the confusion here. Setting pagination aside, is there any way we can improve performance for very, very large collections so that this operation does not require a huge amount of memory?

Would it be more performant to use the API instead?

@al-niessner (Contributor, Author)

@jordanpadams

Given the restrictions/constraints of the tools selected (Spring, etc.), large memory is what you need. Since the buffer takes an int as its max size, 2 GB is the largest collection that could ever be managed without a tooling change. This one was 187+ MB.
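
For illustration only: if the download goes through Spring's reactive WebClient (an assumption on my part; the comment above only says Spring is among the tools selected), the in-memory buffer cap is an int, so Integer.MAX_VALUE bytes (~2 GB) is the hard ceiling:

```java
import org.springframework.web.reactive.function.client.ExchangeStrategies;
import org.springframework.web.reactive.function.client.WebClient;

// Assumption: illustrating the int-sized buffer cap via Spring WebFlux's
// codec configuration; whether validate uses exactly this path is a guess.
public class BufferCapSketch {
    public static void main(String[] args) {
        WebClient client = WebClient.builder()
                .exchangeStrategies(ExchangeStrategies.builder()
                        .codecs(c -> c.defaultCodecs()
                                // maxInMemorySize takes an int, so roughly
                                // 2 GB (Integer.MAX_VALUE bytes) is the most
                                // a single response body can ever be
                                .maxInMemorySize(Integer.MAX_VALUE))
                        .build())
                .build();
        System.out.println("WebClient configured: " + client);
    }
}
```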

No, there is really no simpler way. When you ask the API for a collection, it may skip returning the big part, which saves the download and gives the appearance that the API could handle this: all that magic keyword stuff where the API decides for you what you need to know. Unfortunately, this tool needs exactly the part (the list of references) that the API usually decides you do not need. And the API might get a bit unhappy if you asked it for all the products in the collection.

The increase in buffer size is not what Java will allocate; it is what Java may allocate. If you ask for a big item, then give Java lots of memory.
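
A minimal illustration of that distinction: Runtime.maxMemory() reports the -Xmx ceiling the JVM may grow to, while totalMemory() reports what it has actually claimed so far:

```java
// Minimal illustration of "may" vs. "will": maxMemory() is the -Xmx ceiling,
// totalMemory() is what the JVM has actually allocated at this moment.
public class MayVsWill {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.printf("May allocate up to: %d MB%n", rt.maxMemory() >> 20);
        System.out.printf("Actually allocated: %d MB%n", rt.totalMemory() >> 20);
    }
}
```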

@jordanpadams (Member)

@al-niessner Hmmmm, copy. OK. We may need to figure out how to impose a max size for those collection lists (>2 GB). Will merge this, and then re-evaluate a new solution.

@jordanpadams changed the title from "improvements to reference integrity checking" to "Improvements for reference integrity checking for large collection inventories (~>100MB)" Nov 15, 2023
@jordanpadams merged commit f2583d8 into main Nov 15, 2023
2 checks passed
@jordanpadams deleted the issue_748 branch November 15, 2023 01:23