-
Notifications
You must be signed in to change notification settings - Fork 494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster combined query for retrieving datasets via API #9684
Faster combined query for retrieving datasets via API #9684
Conversation
I have run the integration tests locally, and they have passed. I have no idea why the tests failed on jenkins. Can someone check it for me? Thanks! |
Tests passed, it looks OK now. |
I have closed it: it is better to go for a real solution, this one is more a workaround. Also, it is not essential and does require merging... |
Reopened and merged. |
@pdurbin |
@ErykKul I'm not the strongest with databases so when I see |
src/main/java/edu/harvard/iq/dataverse/api/AbstractApiBean.java
Outdated
Show resolved
Hide resolved
@@ -127,7 +127,8 @@ public Dataset findDeep(Object pk) { | |||
.setHint("eclipselink.left-join-fetch", "o.files.dataFileTags") | |||
.setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas") | |||
.setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas.fileCategories") | |||
//.setHint("eclipselink.left-join-fetch", "o.files.guestbookResponses") | |||
.setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas.varGroups") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this part of the optimization here? I would guess this isn't often accessed, so I'm not sure how much of a performance help it would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had seen in in the query log even when it was not used. For large datasets it does make a difference (one query less for each file).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks OK. I requested a couple changes and there are merge conflicts to resolve, but the overall change to get the 'deep' version of a dataset for the GET api/datasets/{id} looks fine. Once this is updated, should be ready to go.
@ErykKul heads up that there are merge conflicts. |
@qqmyers I merged the devlop branch into this PR and done some changes.
I hope I did not break anything... |
👋🏼 @ErykKul I created a large Dataset at https://beta.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/F1MCGB and also published the tools on https://github.com/IQSS/dataverse-sample-data to create this so let me know if that is helpful 😄 |
@jp-tosca I just tested with a dataset with 10000 files, like I did for other PRs. Getting that dataset on the current develop branch took 24 seconds. With this PR it took 8 seconds, 3 times faster! I think this PR aged well (it is from 10 months ago). I will try the |
The files API got worse: for 10 files from a 10000 files dataset it went from 100 ms to 3 seconds. For getting all 10000 files it went from 22 seconds to 25 seconds. It looks like it adds 3 seconds to the result (I guess it is the time needed for find deep to finish?). I will remove the find deep from that method, and leave it only for getting the dataset. |
Is it just me or would some Continuous Benchmarking be good here? |
It looks great! It would be very useful for some of the frequently used calls, like getting datasets. Probably out of scope for this PR. I am not sure why integration tests fail now, but I have seen on Slack that there might be a problem with Jenkins. Other than that, I think that this PR can go to QA? |
I would ask maybe @GPortas or @ekraffmiller to check what method is used by the SPA 😄 |
@qqmyers Only one API call is impacted and it does work faster, regardless of from where it is called. I think that this PR can go to "ready for QA" |
Jenkins failed, I am trying to figure out how to check why (I remember that there was a way to see the build log). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks OK. For some reason, the last run couldn't launch the AWS machine so the IT tests didn't run, so perhaps a rerun should get triggered before this passes QA.
What this PR does / why we need it:
It reduces the number of queries needed for retrieving a dataset via API. Especially for large dataset, but even for smaller datasets when there is large traffic on the server, this makes the usage of resources more efficient.
Which issue(s) this PR closes:
Closes #9683
Observed time for retrieving a dataset with 10000 files went from 1 minute to 35 seconds with this PR. Real life applications would be less spectacular, but still useful.