RFC: Harvesting kit v. 2 #140
Comments
@Dziolas thanks for this proposal. Standardizing is indeed a great idea. The package could depend on the great [...] Regarding validation, this is offered by the [...]
Generally, the steps will probably look like:
Even though knowledge of Invenio could give us access to a bunch of goodies, I suggest having Harvesting Kit provide useful utilities and common operations as a "leaf package". The clients (overlays or custom invenio-modules) then decide the flow of things, pass parameters, and make use of what Harvesting Kit offers.

By using plugins (or contribs), as @Dziolas suggests, we could also share the code to harvest and convert a certain feed. For example:

In your instance overlay (this is how you glue Harvesting Kit and your ingestion workflows):

    # overlay/config.py
    APS_URL = "http://example.org"

    # overlay/tasks: (e.g. scheduled periodically in Celery Beat on the server)
    # this function can be generalised (import_string etc. and celery args)
    # `celery` and `cfg` are assumed to be the overlay's Celery app and config object
    @celery.task
    def harvest_aps(workflow, *args, **kwargs):
        from invenio_workflows.api import start_delayed
        from harvestingkit.contrib.aps import harvest, convert

        last_harvested_date = None  # some way to retrieve last harvested date
        # `from` is a reserved word in Python, so the keyword is `from_date`
        for harvested_record, harvested_files in harvest(
            url=cfg.get("APS_URL"),
            from_date=last_harvested_date,
        ):
            clean_record_dict = convert(harvested_record)
            payload = {
                "files": harvested_files,
                "record": clean_record_dict,
            }
            start_delayed(workflow, data=[payload])
            # now your very own ingestion workflow takes over (processing)

Then in Harvesting Kit:

    # harvestingkit/contrib/aps/__init__.py
    from .getter import harvest
    from .converter import convert

    __all__ = ["harvest", "convert"]

    # harvestingkit/contrib/aps/getter.py
    def harvest(url, from_date, until=None, **kwargs):
        # implement core retrieval logic (could be spread over several files)
        yield record, files

    # harvestingkit/contrib/aps/converter.py
    def convert(some_record):
        # implement conversion logic (could be using classes in common files or functions etc.)
        return cleaned_record

The main body of Harvesting Kit could then be solely for common utilities:
It's a slightly different take on the initial idea, but what do you think?

EDIT: For validation we can pass a jsonschema, or simply do it on the client side in the Celery task or workflow.
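
The remark about generalising the task ("import_string etc.") could be sketched roughly as follows. This is only an illustration, not existing Harvesting Kit code: `load_contrib`, `run_harvest`, the `submit` callback and the `demo_contribs` stand-in package are all hypothetical names, and the only assumption taken from the proposal is that each contrib module exposes `harvest` and `convert`.

```python
import importlib
import sys
import types


def load_contrib(name, package="harvestingkit.contrib"):
    """Import ``<package>.<name>`` and return its (harvest, convert) pair."""
    module = importlib.import_module("{0}.{1}".format(package, name))
    return module.harvest, module.convert


def run_harvest(name, submit, package="harvestingkit.contrib", **params):
    """Harvest a feed, convert each record and pass the payload to ``submit``."""
    harvest, convert = load_contrib(name, package=package)
    count = 0
    for record, files in harvest(**params):
        submit({"record": convert(record), "files": files})
        count += 1
    return count


# Stand-in contrib so the sketch runs without Harvesting Kit installed.
def _demo_harvest(url, from_date=None):
    yield {"title": " Example "}, ["fulltext.xml"]


def _demo_convert(record):
    return {key: value.strip() for key, value in record.items()}


demo = types.ModuleType("demo_contribs.aps")
demo.harvest = _demo_harvest
demo.convert = _demo_convert
sys.modules["demo_contribs"] = types.ModuleType("demo_contribs")
sys.modules["demo_contribs.aps"] = demo

payloads = []
run_harvest("aps", payloads.append, package="demo_contribs",
            url="http://example.org")
```

In an overlay, `submit` would be whatever starts the ingestion workflow, so Harvesting Kit itself stays free of Invenio imports.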
I like @jalavik's idea of keeping harvestingkit separate from Invenio, as it is now. I think that flask-registry, as it is used in Invenio modules, is good too. I will start implementing things and let you know about the progress.
We all know that harvesting kit is not perfect. Here are some problems that it has:
Solution
Other ideas:
harvestingkit elsevier
but parameters that can be passed to scripts need to be standardized (or something)
Some pseudo-code sample:
Let me know what you think about this idea.
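
One possible way to standardize the parameters for a `harvestingkit elsevier` style command is a shared set of options inherited by every source subcommand. The sketch below uses `argparse` from the standard library. The option names (`--from-date`, `--until-date`, `--output-dir`) and the `build_parser` helper are assumptions made for illustration, not something the proposal defines.

```python
import argparse
from datetime import datetime


def parse_date(value):
    """Parse a YYYY-MM-DD string into a date, or raise an argparse error."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError("expected YYYY-MM-DD, got %r" % value)


def build_parser(sources):
    """Build a ``harvestingkit <source>`` parser whose sources share options."""
    # Options defined once here are inherited by every source subcommand.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--from-date", type=parse_date, default=None)
    common.add_argument("--until-date", type=parse_date, default=None)
    common.add_argument("--output-dir", default=".")

    parser = argparse.ArgumentParser(prog="harvestingkit")
    subparsers = parser.add_subparsers(dest="source")
    subparsers.required = True
    for source in sources:
        subparsers.add_parser(source, parents=[common])
    return parser


parser = build_parser(["elsevier", "aps"])
args = parser.parse_args(["elsevier", "--from-date", "2015-01-01"])
```

Because every subcommand inherits the same parent parser, a new source only has to be added to the list to accept the same parameters.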