Pipeline Action support for multiple index pages #12

Open
saumier opened this issue Feb 8, 2025 · 5 comments
@saumier
Member

saumier commented Feb 8, 2025

@dev-aravind Please design an enhancement to the Artsdata Pipeline Action that would allow multiple page-url and entity-identifier values to be crawled for the same artifact.

The test case can be theplayhouse-ca, which has two sitemaps: https://theplayhouse.ca/fr/sitemap.xml and https://theplayhouse.ca/en/sitemap.xml. The idea is to load all pages from both sitemaps into the artifact orion/theplayhouse-ca.

@dev-aravind
Contributor

@saumier I propose this design. Let me know what you think.

with:
  mode: "fetch-push"
  artifact: "artifact123"
  publisher: "${{ secrets.PUBLISHER_URI_GREGORY }}"
  downloadFile: "file.jsonld"
  page-url: '["www.example.com/events", "www.example.com/evenements"]'
  entity-identifier: '["identifier1", "identifier2"]'
  token: "${{ secrets.GITHUB_TOKEN }}"

@dev-aravind dev-aravind assigned saumier and unassigned dev-aravind Feb 11, 2025
@saumier
Member Author

saumier commented Feb 11, 2025

@dev-aravind I have a question. Do you see one entity-identifier applied per page-url?

In my use case:
page-url: ["https://theplayhouse.ca/fr/sitemap.xml", "https://theplayhouse.ca/en/sitemap.xml"]
entity-identifier: "loc"

I am wondering if we need to have a list of entity-identifier or not.

Perhaps the entity-identifier could apply to all page-urls. Then, when we come across a need for multiple entity-identifiers, we add a check: if entity-identifier is a list, we apply one entity-identifier per page-url?

But I like the consistency of your solution. So let's go ahead with one entity-identifier applied per page-url as you propose.
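
The pairing rule discussed in this comment could be sketched roughly as follows. This is only an illustration in Python, not code from the Artsdata Pipeline Action: the helper name `pair_identifiers` is hypothetical, and the choice to reuse a single entity-identifier for every page-url is an assumption based on the theplayhouse.ca use case above.

```python
from typing import List, Tuple


def pair_identifiers(
    page_urls: List[str], entity_identifiers: List[str]
) -> List[Tuple[str, str]]:
    """Pair each page URL with an entity identifier (hypothetical helper)."""
    if len(entity_identifiers) == 1:
        # Assumption: a single identifier applies to every page URL,
        # which keeps single-URL workflows backwards compatible.
        return [(url, entity_identifiers[0]) for url in page_urls]
    if len(entity_identifiers) != len(page_urls):
        raise ValueError("expected one entity-identifier, or one per page-url")
    # One identifier per page URL, matched by position.
    return list(zip(page_urls, entity_identifiers))


pairs = pair_identifiers(
    [
        "https://theplayhouse.ca/fr/sitemap.xml",
        "https://theplayhouse.ca/en/sitemap.xml",
    ],
    ["loc"],
)
```

Rejecting mismatched list lengths, rather than silently truncating, is a design choice of this sketch; it would surface a misconfigured workflow early.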

Please go ahead with the implementation as a minor version, as long as it remains backwards compatible.

@saumier saumier assigned dev-aravind and unassigned saumier Feb 11, 2025
@dev-aravind dev-aravind moved this from Todo to In Progress in Artsdata Feb 17, 2025
@dev-aravind dev-aravind moved this from In Progress to Todo in Artsdata Feb 18, 2025
@dev-aravind dev-aravind moved this from Todo to In Progress in Artsdata Feb 20, 2025
@dev-aravind
Contributor

@saumier You can test this now using our custom crawl test workflow (use the enhancement/issue-12 branch). Please let me know if you find any issues. If not, we can move on to releasing this.

If you want to test the multiple page crawl, the input should be like:

page-url: "https://theplayhouse.ca/fr/sitemap.xml,https://theplayhouse.ca/en/sitemap.xml"
entity-identifier: 'loc'

Enter both values without the quotes.
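
For readers wondering how a comma-separated input like the one above might be turned into a list, here is a minimal sketch in Python. It is not the action's actual code: `split_input` is a hypothetical helper, and trimming whitespace and dropping empty entries are assumptions made for the example.

```python
from typing import List


def split_input(raw: str) -> List[str]:
    """Split a comma-separated input into trimmed, non-empty values (hypothetical helper)."""
    return [part.strip() for part in raw.split(",") if part.strip()]


page_urls = split_input(
    "https://theplayhouse.ca/fr/sitemap.xml,https://theplayhouse.ca/en/sitemap.xml"
)
entity_identifiers = split_input("loc")
```

A single value with no comma still yields a one-element list, so existing single-URL inputs would keep working under this approach. A URL that itself contains a comma would be split incorrectly, which is a known limitation of this simple format.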

@dev-aravind dev-aravind assigned saumier and unassigned dev-aravind Feb 20, 2025
@dev-aravind dev-aravind moved this from In Progress to In Review in Artsdata Feb 20, 2025
@saumier
Member Author

saumier commented Feb 20, 2025

@dev-aravind I ran a test with theplayhouse.ca and it collected all the pages. Excellent. Please go ahead and release.

@saumier saumier closed this as completed Feb 20, 2025
@github-project-automation github-project-automation bot moved this from In Review to Done in Artsdata Feb 20, 2025
@saumier saumier reopened this Feb 20, 2025
@github-project-automation github-project-automation bot moved this from Done to Todo in Artsdata Feb 20, 2025
@saumier saumier assigned dev-aravind and unassigned saumier Feb 20, 2025
@dev-aravind dev-aravind moved this from Todo to In Progress in Artsdata Feb 21, 2025
@dev-aravind
Contributor

dev-aravind commented Feb 21, 2025

@saumier Done. Let me know if you want to add any workflows in orion to use this feature.

Release notes

@dev-aravind dev-aravind assigned saumier and unassigned dev-aravind Feb 21, 2025
@dev-aravind dev-aravind moved this from In Progress to In Review in Artsdata Feb 21, 2025
Projects
Status: In Review