Fix sorting of dataset drafts and minor versions when sorting by "newest first" #11180

vera · 2025-01-23T10:44:12Z

What this PR does / why we need it:

This PR fixes an issue where draft versions of datasets were sorted using the release timestamp of their most recent major version.
This caused newer drafts to appear incorrectly alongside their corresponding major version, instead of at the top, when sorted by "newest first". This affects the search results page and the "My data" page, both of which are sorted by newest by default.
Sorting now uses the last update timestamp when sorting draft datasets. The sorting behavior of published dataset versions (major and minor) is unchanged.
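
As a rough illustration of the rule described above, here is a minimal sketch. The class name, method name, parameters and dates are all hypothetical and are not taken from the Dataverse indexing code; the sketch only shows which timestamp ends up driving the "newest first" order.

```java
import java.time.Instant;

// Hypothetical sketch: names are illustrative, not Dataverse's actual indexing code.
final class SortDateSketch {

    /**
     * Picks the timestamp used for "newest first" ordering.
     * Drafts use their own last-update time; published versions (major or
     * minor) keep the release time of their most recent major version.
     */
    static Instant sortDate(boolean isDraft, Instant lastUpdateTime, Instant majorVersionReleaseTime) {
        return isDraft ? lastUpdateTime : majorVersionReleaseTime;
    }

    public static void main(String[] args) {
        // Illustrative dates only.
        Instant majorRelease = Instant.parse("2024-09-24T00:00:00Z");
        Instant draftUpdated = Instant.parse("2025-01-29T00:00:00Z");
        System.out.println("draft sorts by:     " + sortDate(true, draftUpdated, majorRelease));
        System.out.println("published sorts by: " + sortDate(false, draftUpdated, majorRelease));
    }
}
```

With this rule, a draft edited after the last major release gets a later sort date than its published counterpart and therefore appears first when sorting by newest.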

See the bug description (with screenshot) in #11178.

Which issue(s) this PR closes:

Closes Bug?: unexpected sorting of results when sorting by "newest first" (Search + My Data) #11178

Special notes for your reviewer:

/

Suggestions on how to test this:

I've added a test that can be run with: mvn test -Dtest="DataRetrieverApiIT#testRetrieveMyDataAsJsonStringSortOrder"

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

/

Is there a release notes update needed for this change?:

I think it would be good to include a release note for this bug fix, since it affects the default sorting of datasets on the search results and "My data" pages. I've added a release note as part of this PR.

Additional documentation:

/

pdurbin · 2025-01-23T11:35:19Z

TODO: Review the sorting rules from https://docs.google.com/document/d/1DWsEqT8KfheKZmMB3n_VhJpl9nIxiUjai_AIQPAjiyA/edit?usp=sharing and update this comment.

cmbz · 2025-01-29T16:11:24Z

2025/01/29: Julian will review and decide if/when to move this back into the Sprint queue.

vera · 2025-01-29T16:19:00Z

If it makes the merging decision easier, we could also leave the sorting of minor versions untouched for now (possibly split a sorting change for minor versions into a second PR). But it would be nice if the sorting of the draft versions could be fixed. As I mentioned in the issue, it's been noted by one of our curators that the sorting of draft versions in review based on the publication date of the most recent major version is confusing.

This bug was discovered by one of our curators who used "My data" to get a list of datasets to be reviewed. A dataset which was just submitted for review today was not to be found on page 1, as expected, but on one of the later pages.

It seems that the dataset was sorted based on the dateSort timestamp which was copied from the latest published version (which was published September 24). This means the draft was sorted as if it was created/submitted last September instead of today.

jggautier · 2025-01-29T16:59:39Z

Thanks for your comment in the GitHub issue @vera.

I've always thought, and I might've heard this from someone years ago, that sorting works the way it has because folks thought that insignificant dataset updates shouldn't make the dataset more "new" than newly published datasets and datasets with significant updates.

This reasoning isn't in the Google Doc that @pdurbin shared. The effects of this decision are discussed a bit in #2607, but the why isn't discussed there either.

@vera what you shared from your curator makes perfect sense to me, too. It sounds like "new" is being thought of in different ways.

And of course it's possible that no one will mind that minor versions start causing datasets to appear at the top when sorting by Newest. I just wasn't sure if this reasoning had been considered here and wanted to make sure that it was before it's changed.

vera · 2025-01-30T10:39:08Z

> I've always thought, and I might've heard this from someone years ago, that sorting works the way it has because folks thought that insignificant dataset updates shouldn't make the dataset more "new" than newly published datasets and datasets with significant updates.

That does make sense. Perhaps that means minor versions should stay sorted as they currently are, but drafts should be sorted according to their own timestamp. I would say that if you are able to see a draft, you are usually either a curator or a contributor, and in both cases it makes sense for a new draft to show up on the top, because you are interested in seeing that a new draft exists, checking what's changed, making further edits, reviewing the draft for publication, etc.

jggautier · 2025-01-30T15:34:00Z

> I would say that if you are able to see a draft, you are usually either a curator or a contributor, and in both cases it makes sense for a new draft to show up on the top, because you are interested in seeing that a new draft exists, checking what's changed, making further edits, reviewing the draft for publication, etc.

It's really helpful for me that you grounded this change in a user story here (and in your GitHub issue). And I think it'll be helpful for evaluating this change of having drafts appear at the top when sorted by "Newest".

And then minor versions should stay sorted as they currently are, as you wondered, until we're able to think about this change by grounding it in other user stories, like if and how it would affect users who are searching for datasets and how it might affect the display of datasets by "Newest" or latest. For example, some repository homepages display the top x number of latest datasets.

@vera, so could this PR be adjusted so that minor versions are sorted as they currently are, while new draft versions show up at the top so that it's easier for curators and other contributors to see that a new draft exists, check what's changed, make further edits and review drafts for publication?

qqmyers · 2025-01-30T22:34:10Z

Still seems odd to me that you'll be able to see your draft and then, when you publish, the new minor version disappears back to the major version's date. I'll also note that we now have the update-current-version option to help keep truly trivial edits from requiring new minor versions.

vera · 2025-02-03T14:56:45Z

I've just pushed three commits, reverting the sorting behavior of non-draft datasets (so it will not be changed by this PR). I've also added a test confirming that minor versions are sorted based on the publication date of their most recent major versions.

jggautier · 2025-02-03T19:20:49Z

Thanks for the heads up @vera!

I think it's good we recorded why things may have been designed the way they were; the effects of relatively newer features, like being able to overwrite minor versions; and concerns to look out for as folks experience this change. Hopefully it helps folks who have questions or are able to review how this affects different types of users.

qqmyers · 2025-02-04T14:49:48Z

@vera - looks like your new test is failing:

vera · 2025-02-04T15:15:55Z

@qqmyers that's strange, I ran the test before pushing and just ran it again to be sure, it's passing for me:

$ mvn test -Dtest="DataRetrieverApiIT#testRetrieveMyDataAsJsonStringSortOrder"
...
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 52.64 s -- in edu.harvard.iq.dataverse.api.DataRetrieverApiIT
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

(edit: I've run it a few times now to make sure it's not a flaky test, but it always passes)

I will try to investigate tomorrow.

vera · 2025-02-05T14:12:10Z

@qqmyers Unfortunately, I haven't been able to reproduce the test failure, so I am unsure how to fix it. (I just pushed a commit renaming some variables but that's unrelated to fixing the test failure.) Could it be some interaction with one of the other tests that is run before it in the CI pipeline? (Although the test should be encapsulated since it is using its own newly created user account, dataverse collection etc. so that shouldn't be a problem, I think.) Or do you have some other idea why the test might be failing?

qqmyers · 2025-02-05T15:37:00Z

I'm not positive, but one guess would be that the order of the returned values isn't fixed, so your dataset may be in [0] or [1] in the array. I recently had to fix another test (in #11081) where the order was random, causing the test to fail intermittently. (In that test, there was already code to verify the entries existed via a path-based lookup, so all I did was delete the lines relying on the array order. In your case you might need something similar using paths if you need to find each dataset versus just verifying that there are two.)
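
To illustrate the order-independent style of check suggested here, below is a minimal sketch in plain Java. The class and method names are hypothetical; in a REST Assured test the ID list would presumably come from the response's JSON path (for example something like `getList("data.items.entity_id", Integer.class)`, which is an assumption and not code from this PR). This style only fits tests that verify which items are returned, not tests whose purpose is the sort order itself.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: compares returned IDs to expected IDs while ignoring order.
final class OrderIndependentCheck {

    /** True when the returned IDs are exactly the expected ones, in any order. */
    static boolean containsExactly(List<Integer> returnedIds, Set<Integer> expectedIds) {
        // The size check rejects duplicates that a plain set comparison would hide.
        return returnedIds.size() == expectedIds.size()
                && new HashSet<>(returnedIds).equals(expectedIds);
    }

    public static void main(String[] args) {
        // Illustrative IDs; the check passes regardless of which one comes first.
        List<Integer> returned = List.of(117, 116);
        System.out.println(containsExactly(returned, Set.of(116, 117)));
    }
}
```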

vera · 2025-02-05T15:52:01Z

Hmm, the order should be fixed in this case. The results on the "my data" page should be ordered according to the dateSort field in Solr. So unless the two datasets are published in the exact same millisecond, they should be ordered in the way the test expects it.

This is the line that the test is failing at according to your screenshot:

        // Expect newest dataset (dataset 2) first
        assertEquals(datasetTwoId, jsonPathTwoPublishedDatasets.getInt("data.items[0].entity_id"));

In the lines leading up to it, dataset 1 is published first, and then dataset 2:

        // Publish dataset 1
        Response publishDatasetOne = UtilIT.publishDatasetViaNativeApi(datasetOneId, "major", superUserApiToken);
        publishDatasetOne.prettyPrint();
        publishDatasetOne.then().assertThat().statusCode(OK.getStatusCode());

        // Publish dataset 2
        Response publishDatasetTwo = UtilIT.publishDatasetViaNativeApi(datasetTwoId, "major", superUserApiToken);
        publishDatasetTwo.prettyPrint();
        publishDatasetTwo.then().assertThat().statusCode(OK.getStatusCode());

        // Request datasets belonging to user
        Response twoPublishedDatasetsResponse = UtilIT.retrieveMyDataAsJsonString(userApiToken, "", new ArrayList<>(Arrays.asList(6L)));
        twoPublishedDatasetsResponse.prettyPrint();
        assertEquals(OK.getStatusCode(), twoPublishedDatasetsResponse.getStatusCode());
        JsonPath jsonPathTwoPublishedDatasets = twoPublishedDatasetsResponse.getBody().jsonPath();
        assertEquals(2, jsonPathTwoPublishedDatasets.getInt("data.total_count"));

So dataset 2 should receive the newer timestamp and be sorted first. I could try adding a sleep in between the publish calls, to ensure they really cannot occur in the same millisecond, although I think that's already highly unlikely...? If it's not that, I'm not sure what the failure reason might be. Could be a bug in the code, although in that case I wonder why the test is passing when I run it.

It's a little hard to debug without being able to look at what's indexed in dateSort in Solr.

qqmyers · 2025-02-05T16:30:25Z

Ah - sorry - looking further into the error response from the test: I think you're getting the wrong value because there is still a draft version in Solr (see below). With the new optimizations, we rely on Solr using soft commits with a 1 second default lag time (#11206 is related). So you may need to wait 1 second for Solr to be up-to-date before making your call.

If adding a delay doesn't work, let me know and I can cut/paste the whole error response for you and/or see if we can get you direct access to Jenkins to dig further.

{
"success": true,
"data": {
    "pagination": {
        "isNecessary": false,
        "numResults": 2,
        "numResultsString": "2",
        "docsPerPage": 10,
        "selectedPageNumber": 1,
        "pageCount": 1,
        "hasPreviousPageNumber": false,
        "previousPageNumber": 1,
        "hasNextPageNumber": false,
        "nextPageNumber": 1,
        "startCardNumber": 1,
        "endCardNumber": 2,
        "startCardNumberString": "1",
        "endCardNumberString": "2",
        "remainingCards": 0,
        "numberNextResults": 0,
        "pageNumberList": [
            1
        ]
    },
    "items": [
        {
            "name": "Darwin's Finches",
            "type": "dataset",
            "url": "https://doi.org/10.5072/FK2/TDGTPV",
            "global_id": "doi:10.5072/FK2/TDGTPV",
            "description": "Darwin's finches (also known as the Galápagos finches) are a group of about fifteen species of passerine birds.",
            "published_at": "2025-02-05T13:51:12Z",
            "publisher": "dvecc60aef",
            "citationHtml": "Finch, Fiona, 2025, \"Darwin's Finches\", <a href=\"https://doi.org/10.5072/FK2/TDGTPV\" target=\"_blank\">https://doi.org/10.5072/FK2/TDGTPV</a>, Root, V1",
            "identifier_of_dataverse": "dvecc60aef",
            "name_of_dataverse": "dvecc60aef",
            "citation": "Finch, Fiona, 2025, \"Darwin's Finches\", https://doi.org/10.5072/FK2/TDGTPV, Root, V1",
            "matches": [
            ],
            "score": 70.50257873535156,
            "entity_id": 116,
            "publicationStatuses": [
                "Published"
            ],
            "storageIdentifier": "file://10.5072/FK2/TDGTPV",
            "subjects": [
                "Medicine, Health and Life Sciences"
            ],
            "fileCount": 0,
            "versionId": 34,
            "versionState": "RELEASED",
            "majorVersion": 1,
            "minorVersion": 0,
            "createdAt": "2025-02-05T13:51:06Z",
            "updatedAt": "2025-02-05T13:51:12Z",
            "contacts": [
                {
                    "name": "Finch, Fiona",
                    "affiliation": ""
                }
            ],
            "api_url": "http://ec2-54-83-149-25.compute-1.amazonaws.com/api/datasets/116",
            "authors": [
                "Finch, Fiona"
            ],
            "publication_statuses": [
                "Published"
            ],
            "is_draft_state": false,
            "is_in_review_state": false,
            "is_unpublished_state": false,
            "is_published": true,
            "is_deaccesioned": false,
            "is_valid": true,
            "date_to_display_on_card": "Feb 5, 2025",
            "parentId": "115",
            "parentName": "dvecc60aef",
            "parent_alias": "dvecc60aef",
            "user_roles": [
                "Contributor"
            ]
        },
        {
            "name": "Darwin's Finches",
            "type": "dataset",
            "url": "https://doi.org/10.5072/FK2/XLRU0K",
            "global_id": "doi:10.5072/FK2/XLRU0K",
            "description": "Darwin's finches (also known as the Galápagos finches) are a group of about fifteen species of passerine birds.",
            "publisher": "dvecc60aef",
            "citationHtml": "Finch, Fiona, 2025, \"Darwin's Finches\", <a href=\"https://doi.org/10.5072/FK2/XLRU0K\" target=\"_blank\">https://doi.org/10.5072/FK2/XLRU0K</a>, Root, DRAFT VERSION",
            "identifier_of_dataverse": "dvecc60aef",
            "name_of_dataverse": "dvecc60aef",
            "citation": "Finch, Fiona, 2025, \"Darwin's Finches\", https://doi.org/10.5072/FK2/XLRU0K, Root, DRAFT VERSION",
            "matches": [
                
            ],
            "score": 70.50257873535156,
            "entity_id": 117,
            "publicationStatuses": [
                "Unpublished",
                "Draft",
                "In Review"
            ],
            "storageIdentifier": "file://10.5072/FK2/XLRU0K",
            "subjects": [
                "Medicine, Health and Life Sciences"
            ],
            "fileCount": 0,
            "versionId": 35,
            "versionState": "RELEASED",
            "createdAt": "2025-02-05T13:51:08Z",
            "updatedAt": "2025-02-05T13:51:15Z",
            "contacts": [
                {
                    "name": "Finch, Fiona",
                    "affiliation": ""
                }
            ],
            "api_url": "http://ec2-54-83-149-25.compute-1.amazonaws.com/api/datasets/117",
            "authors": [
                "Finch, Fiona"
            ],
            "publication_statuses": [
                "Unpublished",
                "Draft",
                "In Review"
            ],
            "is_draft_state": true,
            "is_in_review_state": true,
            "is_unpublished_state": true,
            "is_published": false,
            "is_deaccesioned": false,
            "is_valid": true,
            "date_to_display_on_card": "Feb 5, 2025",
            "parentId": "115",
            "parentName": "dvecc60aef",
            "parent_alias": "dvecc60aef",
            "user_roles": [
                "Contributor"
            ]
        }
    ],
    "total_count": 2,
    "start": 0,
    "search_term": "*:*",
    "dvobject_counts": {
        "dataverses_count": 0,
        "files_count": 0,
        "datasets_count": 2
    },
    "pubstatus_counts": {
        "unpublished_count": 1,
        "draft_count": 1,
        "published_count": 1,
        "in_review_count": 1,
        "deaccessioned_count": 0
    },
    "selected_filters": {
        "publication_statuses": [
            "Published",
            "Unpublished",
            "Draft",
            "In Review",
            "Deaccessioned"
        ],
        "role_names": [
            "Contributor"
        ]
    }
}

}

vera · 2025-02-05T16:57:20Z

Ahh I see, thanks for checking again! I just added some sleeps, let's see if that fixes things.
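
As an alternative to fixed sleeps, a test could poll until the index has caught up. The following is a minimal sketch under that assumption; the class name, method name and parameters are hypothetical and this is not the code that was pushed in this PR. The condition passed in would be whatever check shows that Solr has caught up, for example that the expected published item is returned.

```java
import java.util.function.BooleanSupplier;

// Hypothetical polling helper: waits for a condition instead of sleeping a fixed time.
final class WaitForIndex {

    /**
     * Re-checks the condition until it holds or the timeout elapses.
     * Returns true if the condition became true in time, false otherwise
     * (including when the waiting thread is interrupted).
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "the index is up to date": becomes true after about 300 ms.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 2000, 50);
        System.out.println("index ready: " + ready);
    }
}
```

Polling keeps the test fast when the index updates quickly and still tolerates a slower CI machine, at the cost of a little extra code compared with a plain sleep.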

qqmyers · 2025-02-05T18:55:06Z

That worked - all tests passing. Thanks!

vera added 4 commits January 23, 2025 11:43

test: add failing test in DataRetrieverApiIT for expected sort order of datasets

aee4ae4

fix: use last update time of dataset version during indexing (instead of release time of most recent major version)

8e6b47f

test: extend test in DataRetrieverApiIT for expected sort order of datasets

ef8ab7e

docs: add release note for 11178-bug-fix-sort-by-newest-first

e432024

vera mentioned this pull request Jan 23, 2025

Bug?: unexpected sorting of results when sorting by "newest first" (Search + My Data) #11178

Open

pdurbin modified the milestone: 6.6 Jan 23, 2025

ofahimIQSS added the "Size: 3" label (a percentage of a sprint; 2.1 hours) Jan 28, 2025

cmbz added the "FY25 Sprint 15" label (2025-01-15 - 2025-01-29) Jan 29, 2025

cmbz assigned jggautier Jan 29, 2025

vera added 2 commits February 3, 2025 15:31

fix: revert sorting behaviour of published datasets

8baae03

test: add test for sorting behaviour of minor dataset versions

e181110

docs: update release note for 11178-bug-fix-sort-by-newest-first

bad4df7

test: fix inaccurate variable names in DataRetrieverApiIT

c68e0ba

fix: add missing sleeps to try to fix failing test in DataRetrieverApiIT

761b1af
