Spike: Inventory and prioritize all existing Harvesting related issues #24
Comments
There are a few issues in the Harvard Dataverse Repository GitHub repo about harvesting:
And some support emails saved in the RT system about harvesting issues in the Harvard repository:
(The lists below are a work in progress; I'm actively working on them!)
I do believe that the third item under the "definition of done", "prioritize", was the actual important part of this spike. I also believe that most of the effort of prioritizing what's important can only be done within the dev team; I can't think of how anyone outside of it could be more qualified to make these calls. So I'm going to make such an attempt.
The single most important harvesting issue (OK, maybe not the most important, but seriously, this should be the first step of any meaningful cleanup of our harvesting implementation; it should be fairly easy to wrap up too):
The following issues are important in that fixing them will make harvesting more reliable and robust overall (for example, in the current implementation a single missing metadata export that's supposed to be cached is going to break the entire harvesting run). All of the issues on the list below are defined clearly enough that they are ready to be worked on and fixed without any extra research first. Some of them may be VERY OLD, but they look like something we should fix.
The following 3 issues are basically the same thing: people requesting extra ISO language codes to be added as legitimate controlled vocabulary values (this is just a matter of adding extra values to citation.tsv). These are NOT duplicates; different values are being requested in each of the issues below, but it makes sense to get all 3 out of the way at the same time:
The following issues are about the DDI exporter producing XML that is not valid under the schema. I would consider creating a wider-scope umbrella issue, something like "Make sure our DDI is valid against the schema (and maybe add a real-time validation step to the export?)":
Similarly, the following issues are requests for changes in how we export DC; I believe these need to be reviewed/discussed, perhaps together?
The following issues are proposed changes to the design of the harvesting framework and/or metadata exports. Meaning this is something we probably need to discuss as a team, before we decide that these are good ideas and proceed to implement them. But IMO they are (I opened all of them 😄):
There is of course this issue that was opened for figuring out what needs to be added specifically for the NIH/GREI grant:
The list above is by no means complete. If an issue is not listed, it does not necessarily mean that it's not important. But the ones that are listed above should be a good subset to start with.
The list(s) above should allow us to start working on cleaning up and improving harvesting.
I'm going to go ahead and close this issue as the spike work of identifying and prioritizing is done above. We've started adding a few of these issues onto the board (in Next Sprint) and we'll continue to use this as a reference to add more as that work gets completed. But no need to keep the actual spike open.
Also:
(I may have missed some. I tried to add the "Harvesting" label here and there as needed: https://github.com/IQSS/dataverse/labels/Feature%3A%20Harvesting )
Grooming note:
sizing:
That decision will be reflected in a list in |
Priority Review with Stefano:
Sizing:
Daily
Yeah, I started a thread on Slack about this: https://iqss.slack.com/archives/C010LA04BCG/p1674160762504939
2. Already estimated and/or prioritized: ("prioritized" in this context means have at least been reviewed and deemed important/necessary to be addressed soon, and have been assigned NIH grant labels; and specifically
2a. The following issues have NOT been tagged
Finally, IQSS/dataverse#9309 and its fix PR IQSS/dataverse#9316 (a bug introduced in 5.12.1) have not been formally prioritized or sized, but they have been discussed and mostly approved for inclusion in 5.13. The issue has the label "Size: Queued" on it, but I proposed 10. Because of this, I'm not including it on the list of remaining issues.
3. Finally, some choice candidates for short-term queuing, in roughly prioritized order:
I would petition to prioritize this one:
This one is a good candidate:
The following one is a feature request that came in from an external contributor with a fix PR accompanying it; it may have gotten stuck in review limbo, and we owe it to them to process it quickly:
Same as above; small-ish issue, with an accompanying PR from an external dev. From having taken a look, it's not as trivial as they think. But it should not be difficult to resolve and we owe it to them, etc.
May already be resolved, but if not, def. a good thing to address promptly:
An interesting feature request, very specific and should be easy to address. Can definitely be useful to other instances. Not sure if explicitly "useful" from the point of view of the NIH grant though.
Change in specific harvesting behavior requested; lots of detailed discussion in the issue. Should be ready to be addressed:
This would be a genuinely useful format to add harvesting support for:
Not strictly harvesting-related, but if we were willing to handle it under this umbrella, I would give it high priority:
This is a very recent problem report from a remote installation. Their problem can be translated into a feature request, asking for more generic OAI_DC records (ones without persistent identifiers) to be importable. Could be a useful thing? But I'm somewhat on the fence about how it should be prioritized.
Sprint review.
Oh, this is the one I forgot to mention:
... and maybe I should explicitly list the remaining "bulk" queue; these are all labeled with the "feature: Harvesting" label. Basically, by omitting them, I communicated that these should not be in the next wave of prioritization/scheduling. But I'll post that remaining list, with the comments I was adding in the process.
4. Remaining bulk list; I believe these can/should wait to be addressed, but opinions welcome.
And just because I think these could wait, for the purposes of the immediate planning, doesn't mean that I don't want them closed. We should continue looking at this list. As I mention in some of the comments above, some of these could be fixed/resolved by the open issues we've already prioritized, so we'll need to confirm that and close those accordingly.
Moved to dataverse-pm.
This is in support of:
The first step is to figure out what has already been done by the dataverse team and by the community towards this aim and what still remains to be done.
For example:
And then to prioritize which issues are to be fixed.
Def of done
As completely as is reasonably possible in a 2 week period (sprint):
We need to keep in mind that harvesting from a particular source requires that source to be bug-free. Identify which sources have which bugs, so that the bugs for a particular source can be targeted. ICPSR is one example; Zenodo is another.
More information:
There is a lot packaged into Aim 4
The scope for this issue is Harvesting via the OAI-PMH standard
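For readers less familiar with the protocol, here is a minimal sketch of how an OAI-PMH `ListRecords` request URL is put together. It is illustrative only and is not how the Dataverse harvester is implemented. The verb and argument names (`metadataPrefix`, `set`, `from`, `until`, `resumptionToken`) come from the OAI-PMH specification; the function name and the base URL `https://repo.example.org/oai` are hypothetical.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                           from_date=None, until_date=None,
                           resumption_token=None):
    """Build an OAI-PMH ListRecords URL (hypothetical helper, for illustration)."""
    if resumption_token is not None:
        # Per the OAI-PMH spec, resumptionToken is an exclusive argument:
        # follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and set name, used only to show the URL shape.
first_page = build_list_records_url("https://repo.example.org/oai", set_spec="demo")
next_page = build_list_records_url("https://repo.example.org/oai",
                                   resumption_token="abc/1")
```

A harvesting client would keep requesting pages with the returned `resumptionToken` until the server sends an empty one.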
Aim 4:
Improve harvesting and packaging standards to share metadata and data across repositories
Our proposed project will significantly improve the widely-used Harvard Dataverse repository to better support NIH-funded research.
A critical measure of the GREI program’s success is to standardize the discoverability across generalist repositories.
To help with this, **we propose to improve the existing harvesting functionality in the Dataverse software based on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) standard, and coordinate with other repository packaging standards to share or move metadata and data.**
Dataverse already supports Bags as defined by the Research Data Alliance (RDA) Research Data Repository Interoperability Working Group. Here we propose to improve the support for Bags, test it for NIH-funded datasets, and explore and define the appropriate standard to use to move metadata and data across generalist repositories. This will help with sustainability and succession planning: if one repository can no longer support a specific dataset, the dataset can easily be moved to another repository without losing any information about it.
Additionally, we propose to implement Signposting in the Dataverse software. By adding HTTP Link headers throughout the application, we can more easily support automated metadata and data discovery in the repository, and allow other applications and services to represent the content in the Harvard Dataverse repository more accurately and completely.
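To make the Signposting idea concrete, here is a rough sketch of how a client might read such link headers and group the targets by relation type. It is an assumption-laden illustration, not the Dataverse implementation: the function names and the sample header value are hypothetical, and the parser handles only the simple `<url>; name="value"` form. The relation types `cite-as` and `describedby` are among those used by Signposting.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s";,]+))*)')
_PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s";,]+))')


def parse_link_header(value):
    """Return a list of (target, params) pairs from a Link header value."""
    links = []
    for target, raw_params in _LINK_RE.findall(value):
        params = {}
        for name, quoted, bare in _PARAM_RE.findall(raw_params):
            params[name.lower()] = quoted or bare
        links.append((target, params))
    return links


def links_by_rel(value):
    """Group link targets by relation type (rel may hold several types)."""
    by_rel = {}
    for target, params in parse_link_header(value):
        for rel in params.get("rel", "").split():
            by_rel.setdefault(rel, []).append(target)
    return by_rel


# Hypothetical header value, as a dataset landing page might return it.
sample = ('<https://doi.org/10.1234/EXAMPLE>; rel="cite-as", '
          '<https://repo.example.org/export?format=ddi>; '
          'rel="describedby"; type="application/xml"')
discovered = links_by_rel(sample)
```

With headers like these, a crawler can find the persistent identifier and the metadata exports of a dataset without scraping the landing page HTML.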
Related documents