Port GrandTheatreQuebec Huginn to Ruby #1

Open
1 of 4 tasks
saumier opened this issue Oct 23, 2023 · 9 comments

@saumier
Member

saumier commented Oct 23, 2023

GrandTheatreQuebec already has a Planet. This issue is to remove the crawling that still happens on Huginn. The Huginn workflow has an extra step when crawling each page: it scrapes the HTML of each event page for keywords. The keywords are missing from the JSON-LD, so the workflow adds them to the JSON-LD and then maps them to the GrandTheatreQuebec event type SKOS.

If needed, I can give you access to Huginn.

So I propose working in steps (each step can be loaded into Artsdata for review):

  • normal crawl using Orion to get JSON-LD of each webpage into Artsdata
  • custom scrape to extract keywords from event webpages
  • mapping keywords to GrandTheatreQuebec event type SKOS
  • add specific SPARQL transforms (review Huginn's list of SPARQLs with Gregory, some may not be needed)
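
The keyword mapping step can be sketched in plain Ruby as follows. This is only an illustration under stated assumptions: the keyword labels, the `example.org` concept URIs, the `add_event_types` helper, and the strip/downcase normalization are all hypothetical, and the real mapping lives in the GrandTheatreQuebec event type SKOS.

```ruby
# Hypothetical keyword-to-concept table. The real concepts come from the
# GrandTheatreQuebec event type SKOS, which is not reproduced here.
KEYWORD_TO_EVENT_TYPE = {
  "danse" => "http://example.org/gtq-event-types#Dance",
  "musique" => "http://example.org/gtq-event-types#Music",
}.freeze

# Returns a copy of the JSON-LD event hash with the scraped keywords and the
# concept URIs they map to. Keywords without a mapping are kept as keywords
# but contribute no additionalType.
def add_event_types(event, keywords)
  types = keywords.map { |keyword| KEYWORD_TO_EVENT_TYPE[keyword.strip.downcase] }
                  .compact
                  .uniq
  event.merge("keywords" => keywords, "additionalType" => types)
end

event = { "@type" => "Event", "name" => "Exemple" }
puts add_event_types(event, ["Danse", " musique ", "famille"]).inspect
```

Keeping the table as data rather than code matches the idea of reviewing each step in Artsdata: the mapping can be inspected on its own.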
@saumier saumier changed the title from "Crawl GrandTheatreQuebec Huginn to Planet" to "Move GrandTheatreQuebec Huginn to Planet" Dec 19, 2023
@saumier saumier changed the title from "Move GrandTheatreQuebec Huginn to Planet" to "Port GrandTheatreQuebec Huginn to Ruby" Jan 9, 2024
@saumier saumier transferred this issue from culturecreates/nebula Jan 10, 2024
@saumier saumier removed the status in Artsdata Jan 23, 2024
@saumier saumier moved this to Todo in Artsdata Sep 5, 2024
@saumier
Member Author

saumier commented Sep 5, 2024

@dev-aravind Only a couple of Huginn scenarios left to migrate ;-)

Image

@dev-aravind

@saumier will add the Huginn crawling details here.

@dev-aravind dev-aravind assigned saumier and unassigned dev-aravind Nov 4, 2024
@dev-aravind dev-aravind moved this from In Progress to In Review in Artsdata Nov 4, 2024
@saumier
Member Author

saumier commented Nov 4, 2024

@dev-aravind Here is the agent from Huginn. Instead of a CSS class, it uses "xpath": "//article[@class=\"show\"]//a" to get the list of @href values for the events.

{
  "expected_update_period_in_days": "100",
  "url": [
    "https://grandtheatre.qc.ca/programmation/"
  ],
  "type": "html",
  "mode": "all",
  "extract": {
    "url": {
      "xpath": "//article[@class=\"show\"]//a",
      "value": "concat(\"https://grandtheatre.qc.ca\",@href)"
    }
  },
  "template": {
    "graph_name": "{{graph_name}}"
  }
}
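
For reference while porting, the extraction this agent performs can be sketched in plain Ruby. This is a minimal sketch, not the Huginn implementation: it uses REXML from the standard library on a small, well-formed sample (the markup and the paths inside it are invented for illustration), and a crawl of the live page would likely need a more forgiving HTML parser.

```ruby
require "rexml/document"

# Base URL and XPath are taken from the Huginn agent above.
BASE_URL = "https://grandtheatre.qc.ca"
EVENT_LINK_XPATH = '//article[@class="show"]//a'

# Hypothetical, well-formed stand-in for the programme page.
SAMPLE_PAGE = <<~HTML
  <html>
    <body>
      <article class="show">
        <h2><a href="/programmation/exemple-a/">Exemple A</a></h2>
      </article>
      <article class="news">
        <a href="/nouvelles/exemple/">Not an event</a>
      </article>
      <article class="show">
        <a href="/programmation/exemple-b/">Exemple B</a>
      </article>
    </body>
  </html>
HTML

# Returns absolute event URLs, mirroring concat("https://grandtheatre.qc.ca", @href).
# Links without an href attribute are skipped.
def event_urls(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, EVENT_LINK_XPATH)
              .map { |link| link.attributes["href"] }
              .compact
              .map { |href| BASE_URL + href }
end

puts event_urls(SAMPLE_PAGE)
```

Only links inside `article` elements whose class is exactly `show` are returned, which is the same selection the XPath in the agent makes.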

@saumier saumier assigned dev-aravind and unassigned saumier Nov 4, 2024
@dev-aravind

dev-aravind commented Nov 7, 2024

@saumier The workflow stalled because the replace blank nodes SPARQL takes longer than expected to execute. Before the transformation, the crawler's initial output is a 400,000-line JSON-LD file with 16,000+ xhv:role entities. These roles are top-level blank nodes, so the SPARQL tries to assign a temporary URI to each of them. What do you suggest?

@dev-aravind dev-aravind assigned saumier and unassigned dev-aravind Nov 7, 2024
@dev-aravind dev-aravind moved this from In Progress to In Review in Artsdata Nov 7, 2024
@dev-aravind

Task for @dev-aravind: remove the vocab role types and reorder the SPARQL run so that the replace blank nodes SPARQL runs last.
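
The intended order can be sketched in plain Ruby on a toy list of triples. This is one reading of the task, under assumptions: triples are modelled as arrays of plain strings, the role statements are taken to be triples whose predicate is xhv:role, and `TEMP_BASE` is a hypothetical base for the temporary URIs. The real pipeline does this work with SPARQL.

```ruby
XHV_ROLE = "http://www.w3.org/1999/xhtml/vocab#role"
TEMP_BASE = "http://example.org/temp/" # hypothetical base for temporary URIs

# Step 1: drop the role statements so they never reach the expensive step.
def drop_role_triples(triples)
  triples.reject { |_subject, predicate, _object| predicate == XHV_ROLE }
end

# Replaces a blank node label such as "_:b1" with a temporary URI.
# Terms are plain strings in this sketch, so a literal that happens to start
# with "_:" would also be rewritten; typed RDF terms would avoid that.
def temporary_uri(term)
  term.start_with?("_:") ? TEMP_BASE + term.delete_prefix("_:") : term
end

# Step 2 (run last): replace the blank nodes that remain.
def replace_blank_nodes(triples)
  triples.map do |subject, predicate, object|
    [temporary_uri(subject), predicate, temporary_uri(object)]
  end
end

triples = [
  ["_:b0", XHV_ROLE, "http://www.w3.org/1999/xhtml/vocab#navigation"],
  ["_:b1", "http://schema.org/name", "Exemple"],
  ["http://example.org/event/1", "http://schema.org/location", "_:b1"],
]

cleaned = replace_blank_nodes(drop_role_triples(triples))
cleaned.each { |triple| puts triple.join(" ") }
```

Running the removal first means the blank node replacement only touches nodes that are still wanted in the output.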

@dev-aravind dev-aravind assigned dev-aravind and unassigned saumier Nov 7, 2024
@dev-aravind dev-aravind moved this from In Review to Todo in Artsdata Nov 7, 2024
@saumier
Member Author

saumier commented Nov 8, 2024

@dev-aravind Here is the list of the SPARQLs created for GTQ. Please look at each one and decide whether it can be added to our pipeline for all crawls. Some are already covered in our GitHub Artsdata Pipeline Action (like specific/lavitrine/fix-schemaorg-date-datatype), and some will need to be added (like specific/lavitrine/fix-isni).

The question to ask: Can this SPARQL apply to all data feeds and improve the data quality for everyone? If yes then it should be added to the Action for everyone.

This is an important step and I will want to review each individual SPARQL in the list to check that we are using the best approach for maintainability. We need to make it as easy as we can for someone else to understand what each SPARQL does.

The paths are relative to this folder: https://github.com/culturecreates/sparql-library/tree/master/artsdata/ETL/huginn

  • 1. specific/lavitrine/fix-schemaorg-https-objects
  • 2. specific/lavitrine/fix-wikidata-uri
  • 3. specific/lavitrine/add-artsdata-uri-using-wikidata-bridge
  • 4. specific/lavitrine/fix-schemaorg-date-datatype
  • 5. specific/lavitrine/create-eventseries
  • 6. specific/lavitrine/copy-subevent-data-to-eventseries
  • 7. specific/lavitrine/fix-isni
  • 8. specific/lavitrine/add-artsdata-uri-using-isni-bridge
  • 9. specific/lavitrine/collapse_duplicate_contact_point_blanknodes
  • 10. specific/lavitrine/add-keywords-additional-type-mapping
  • 11. specific/lavitrine/fix-offer-availability

Things to consider in SPARQL

  • Several SPARQLs from the Huginn pipeline use graph <graph_name_placeholder> {... }, which should be removed from the SPARQLs added to the Artsdata pipeline action, unless the graph is inside a federated part using an external endpoint (like service <http://db.artsdata.ca/repositories/artsdata>).
  • The Huginn pipeline runs the SPARQLs in GraphDB with inferencing turned on. The Artsdata pipeline does not (yet) use inferencing, so special attention should be given to checking that inferencing is not needed. For example, with inferencing, select * {?s a schema:Event} also returns all subtypes such as schema:MusicEvent, schema:DanceEvent, etc.; without it, only entities typed explicitly as schema:Event match.
  • Several SPARQLs use federated queries to get Artsdata URIs. The SPARQL specific/lavitrine/add-keywords-additional-type-mapping adds the mapping of keyword to event type (using additionalType) without having to download the mapping file. However, for maintainability, I think this can be replaced by loading the event type mapping file gtq-event-type-mapping.ttl into the graph data uploaded to Artsdata, so everything is in a single data feed. This way the workflow will not depend on the mapping file already being loaded into Artsdata.
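
To make the inferencing point concrete, here is a small Ruby sketch of one way to avoid relying on it: list the wanted types explicitly in a VALUES clause. The subtype list is a partial, hand-written assumption, the sample data is invented, and `events_query` and `matching_subjects` are hypothetical helpers, not part of the pipeline.

```ruby
# Partial, hand-maintained list of schema.org Event subtypes (an assumption of
# this sketch; the full hierarchy is defined by the schema.org vocabulary).
EVENT_SUBTYPES = %w[MusicEvent DanceEvent TheaterEvent ComedyEvent Festival].freeze

# Builds a query that names every wanted type, so the result does not depend
# on the store inferring subclass relationships.
def events_query(subtypes = EVENT_SUBTYPES)
  types = (["Event"] + subtypes).map { |name| "schema:#{name}" }.join(" ")
  <<~SPARQL
    PREFIX schema: <http://schema.org/>
    SELECT * WHERE {
      VALUES ?type { #{types} }
      ?s a ?type .
    }
  SPARQL
end

# Tiny in-memory stand-in for a store without inferencing: subject => type.
SAMPLE_TYPES = {
  "http://example.org/event/1" => "Event",
  "http://example.org/event/2" => "MusicEvent",
  "http://example.org/place/1" => "Place",
}.freeze

# Returns the subjects whose explicit type is one of the wanted types.
def matching_subjects(wanted)
  SAMPLE_TYPES.select { |_subject, type| wanted.include?(type) }.keys
end

puts matching_subjects(["Event"]).inspect                   # exact type only
puts matching_subjects(["Event"] + EVENT_SUBTYPES).inspect  # subtypes listed explicitly
puts events_query
```

The trade-off is that the subtype list has to be maintained by hand, so it is worth checking each SPARQL to see whether it really needs subtypes at all.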

@dev-aravind

dev-aravind commented Nov 13, 2024

@saumier I added a set of SPARQLs to our pipeline and the PR can be found here. Please approve this if you find everything to be okay.

I was not able to run the add-artsdata-uri-using-wikidata-bridge and add-artsdata-uri-using-isni-bridge because of this error:
SERVICE operator not implemented (NotImplementedError). Have you encountered this previously?

@dev-aravind dev-aravind assigned saumier and unassigned dev-aravind Nov 13, 2024
@dev-aravind dev-aravind moved this from In Progress to In Review in Artsdata Nov 13, 2024
@dev-aravind dev-aravind added the question Further information is requested label Nov 13, 2024
@saumier
Member Author

saumier commented Nov 21, 2024

@dev-aravind The NotImplementedError you received occurs because federated SPARQL (the SERVICE operator) is not implemented.

I will move these 2 SPARQLs to the Artsdata pipeline. https://github.com/culturecreates/artsdata-api/issues/20

I reviewed the PR.
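
Until federated queries are supported, one way to keep them out of the local run is to detect the SERVICE clause up front. The Ruby sketch below is a heuristic under stated assumptions: the query names and bodies are placeholders rather than the real files, and the pattern could be fooled by, for example, a variable named ?service followed by an IRI.

```ruby
# Matches a SERVICE clause followed by an IRI, a variable, or a prefixed name.
SERVICE_CLAUSE = /\bservice\s+(?:silent\s+)?(?:<|\?|\$|[A-Za-z][\w-]*:)/i

# True when the query text appears to contain a federated part.
def federated?(sparql)
  sparql.match?(SERVICE_CLAUSE)
end

# Placeholder queries for illustration only.
QUERIES = {
  "federated-example" => <<~SPARQL,
    SELECT ?s ?uri WHERE {
      ?s <http://schema.org/sameAs> ?id .
      SERVICE <http://db.artsdata.ca/repositories/artsdata> {
        ?uri <http://schema.org/sameAs> ?id .
      }
    }
  SPARQL
  "local-example" => <<~SPARQL,
    SELECT ?s WHERE { ?s a <http://schema.org/Event> . }
  SPARQL
}.freeze

# Split query names into those needing a federated endpoint and the rest.
federated, local = QUERIES.partition { |_name, text| federated?(text) }
                          .map { |pairs| pairs.map(&:first) }

puts "run elsewhere: #{federated.inspect}"
puts "run locally:   #{local.inspect}"
```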

@saumier saumier removed the question Further information is requested label Nov 21, 2024
@saumier saumier assigned dev-aravind and unassigned saumier Nov 21, 2024
@saumier saumier moved this from In Review to Todo in Artsdata Nov 22, 2024
@dev-aravind dev-aravind moved this from Todo to In Progress in Artsdata Dec 3, 2024
@dev-aravind

@saumier I updated the tests to run as separate files for each SPARQL in the Artsdata pipeline repository, and updated the GTQ repository to add the workflow that fetches the data, adds the concepts, and pushes the data to Artsdata. You can find the PR here.

@dev-aravind dev-aravind assigned saumier and unassigned dev-aravind Dec 3, 2024
@dev-aravind dev-aravind moved this from In Progress to In Review in Artsdata Dec 3, 2024
@saumier saumier moved this from In Review to Todo in Artsdata Dec 17, 2024
@saumier saumier removed their assignment Dec 17, 2024