
Improvements for reference integrity checking for large collection inventories (~>100MB) #751

Merged
4 commits merged into main on Nov 15, 2023

Conversation

@al-niessner (Contributor) commented Nov 3, 2023

🗒️ Summary

Expand the limits so that validate-refs can use all available memory, up to 128 GB, before failing. Also fixed an XPath expression problem.
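
As a rough, hedged illustration (not code from this PR): the heap ceiling the JVM honors is fixed at launch, e.g. with the standard `-Xmx` flag, and can be inspected at runtime (`validate.jar` below is a hypothetical jar name):

```java
// Hypothetical illustration, not code from this PR: inspect the heap ceiling
// the JVM was launched with, e.g. java -Xmx128g -jar validate.jar ...
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("JVM may use up to %.1f GB of heap%n",
                maxBytes / (1024.0 * 1024 * 1024));
    }
}
```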

⚙️ Test Data and/or Report

All automated unit tests pass (see the checks below).

♻️ Related Issues

Closes #748
Closes #750

@al-niessner self-assigned this Nov 3, 2023
@al-niessner requested a review from a team as a code owner Nov 3, 2023 16:35
@al-niessner changed the title from "use 128 GB of memory if available but probably just run out of heap" to "improvements to reference integrity checking" Nov 3, 2023
@jordanpadams (Member)

@al-niessner is there a way we can paginate this into smaller chunks to avoid requiring so much memory?

@jordanpadams (Member)

@al-niessner See comment above ☝️

@jordanpadams (Member) left a review comment

If possible, I would like us to find a way to cut down on the amount of memory required for this type of task. The example that caused the software to break is not going to be the largest data set we have; we need to be able to support much, much larger ones.

@al-niessner (Contributor, Author)

@jordanpadams

The code is not downloading 1,000,000 labels at once, which is what would allow for pagination. It loads one lidvid at a time in a loop. It keeps some bookkeeping in local memory, but not much.
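
A minimal sketch of that loop, with hypothetical names (`fetchLabel`, `checkReferences`) standing in for the real validate internals:

```java
import java.util.List;

// Hypothetical sketch of the per-lidvid loop described above; the names are
// illustrative, not the actual validate internals.
public class RefCheckSketch {
    static String fetchLabel(String lidvid) {
        return "<label for " + lidvid + ">";  // stand-in for one HTTP fetch
    }

    static void checkReferences(String label) {
        // stand-in for the real reference-integrity check
    }

    public static void main(String[] args) {
        List<String> inventory =
                List.of("urn:nasa:pds:x::1.0", "urn:nasa:pds:y::1.0");
        for (String lidvid : inventory) {
            String label = fetchLabel(lidvid);  // one label in memory at a time
            checkReferences(label);             // check, then let it be collected
        }
    }
}
```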

The actual size error in #748 looks like it comes from fetching a very, very large collection that the Java HTTP stack could not load. Again, there is nothing there to paginate.

@jordanpadams (Member)

@al-niessner Sorry for the confusion here. Setting pagination aside, is there any way we can improve performance for very, very large collections so that this operation does not require a huge amount of memory?

Would it be more performant to use the API instead?

@al-niessner (Contributor, Author)

@jordanpadams

Given the restrictions/constraints of the tools selected (Spring, etc.), large memory is what you need. Since the buffer takes an int as its max size, 2 GB is the largest collection that could ever be managed without a tooling change. This one was 187+ MB.
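
For illustration only: if the download goes through Spring's reactive WebClient (an assumption on my part; the comment above only says Spring is among the tools selected), the in-memory buffer cap is an int, so Integer.MAX_VALUE bytes (~2 GB) is the hard ceiling:

```java
import org.springframework.web.reactive.function.client.ExchangeStrategies;
import org.springframework.web.reactive.function.client.WebClient;

// Assumption: illustrating the int-sized buffer cap via Spring WebFlux's
// codec configuration; whether validate uses exactly this path is a guess.
public class BufferCapSketch {
    public static void main(String[] args) {
        WebClient client = WebClient.builder()
                .exchangeStrategies(ExchangeStrategies.builder()
                        .codecs(c -> c.defaultCodecs()
                                // maxInMemorySize takes an int, so roughly
                                // 2 GB (Integer.MAX_VALUE bytes) is the most
                                // a single response body can ever be
                                .maxInMemorySize(Integer.MAX_VALUE))
                        .build())
                .build();
        System.out.println("WebClient configured: " + client);
    }
}
```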

No, there is really no simpler way. When you ask the API for a collection, it may skip returning the big part, which saves the download and gives the appearance that the API could handle this: all that magic keyword stuff where the API decides for you what you need to know. Unfortunately, this tool needs exactly the part (the list of references) that the API usually decides you do not need. And the API might get a bit unhappy if you asked it for all the products in the collection.

The increase in buffer size is not what Java will allocate; it is what Java may allocate. If you ask for a big item, then give Java lots of memory.
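
A minimal illustration of that distinction: Runtime.maxMemory() reports the -Xmx ceiling the JVM may grow to, while totalMemory() reports what it has actually claimed so far:

```java
// Minimal illustration of "may" vs. "will": maxMemory() is the -Xmx ceiling,
// totalMemory() is what the JVM has actually allocated at this moment.
public class MayVsWill {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.printf("May allocate up to: %d MB%n", rt.maxMemory() >> 20);
        System.out.printf("Actually allocated: %d MB%n", rt.totalMemory() >> 20);
    }
}
```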

@jordanpadams (Member)

@al-niessner Hmmmm, copy. OK. We may need to figure out how to impose a max size for those collection lists (>2 GB). Will merge this, and then re-evaluate a new solution.

@jordanpadams changed the title from "improvements to reference integrity checking" to "Improvements for reference integrity checking for large collection inventories (~>100MB)" Nov 15, 2023
@jordanpadams merged commit f2583d8 into main Nov 15, 2023
2 checks passed
@jordanpadams deleted the issue_748 branch November 15, 2023 01:23