Create action/workflow for running Harvester periodically #5116

pjquirk · 2020-12-18T18:57:46Z

This adds a workflow that runs an adapted version of "Harvester" to collect the number of hits/repos for a given extension/query.

Description

Keeping track of the usage of a language is a heavily manual process; see for example #4219. This monitoring is done by a tool known as Harvester, which uses the github.com search page to look for the total number of hits across unique repositories.

This workflow uses an action that does the same thing, using the actual REST APIs instead of the browser, which allows us to more easily automate the work. The action expects a JSON file that contains the search terms to process, and generates a markdown file with the results.

Example input:

[
  {
    "pr": 4827,
    "extensions": [
      {
        "extension": "kql"
      },
      {
        "extension": "kusto"
      },
      {
        "extension": "csl",
        "extendedSearch": "where+NOT+xml"
      }
    ]
  }
]

Example output:

Extension	Total Hits	Unique Repositories	PR
kql	1007	66	4827
kusto	65	19	4827
csl	558	80	4827
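To make the input-to-output mapping concrete, here is a minimal sketch of the processing described above. It is not the action's actual code: the `extension:` query format, the function names, and the `fake_hits` data are assumptions for illustration, and the hit records only mimic the `repository.full_name` field found in GitHub code search results. No network calls are made.

```python
import json


def build_query(entry: dict) -> str:
    """Build a search query for one extension entry (assumed query format)."""
    query = f"extension:{entry['extension']}"
    extra = entry.get("extendedSearch")
    if extra:
        # The sample input encodes spaces as '+', so decode them here.
        query += " " + extra.replace("+", " ")
    return query


def summarize(items: list[dict]) -> tuple[int, int]:
    """Return (total hits, unique repositories) for a list of search hits."""
    repos = {item["repository"]["full_name"] for item in items}
    return len(items), len(repos)


def render_markdown(rows: list[tuple[str, int, int, int]]) -> str:
    """Render (extension, hits, repos, pr) rows as a markdown table."""
    lines = [
        "| Extension | Total Hits | Unique Repositories | PR |",
        "| --- | --- | --- | --- |",
    ]
    for ext, hits, repos, pr in rows:
        lines.append(f"| {ext} | {hits} | {repos} | {pr} |")
    return "\n".join(lines)


candidates = json.loads(
    '[{"pr": 4827, "extensions": ['
    '{"extension": "kql"},'
    '{"extension": "csl", "extendedSearch": "where+NOT+xml"}]}]'
)
queries = [build_query(e) for c in candidates for e in c["extensions"]]
# queries == ["extension:kql", "extension:csl where NOT xml"]

# Hypothetical search hits standing in for a real API response.
fake_hits = [
    {"repository": {"full_name": "octo/a"}},
    {"repository": {"full_name": "octo/a"}},
    {"repository": {"full_name": "octo/b"}},
]
hits, repos = summarize(fake_hits)  # (3, 2)
print(render_markdown([("kql", hits, repos, 4827)]))
```

A real run would page through search results for each query before calling `summarize`; that part is omitted because it depends on authentication and rate limits.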

This came out of a discussion in a PR.

Checklist:

I am adding new or changing current functionality
- N/A: I have added or updated the tests for the new or changed functionality.

pjquirk · 2020-12-18T18:59:30Z

.github/workflows/harvest.yml

+    steps:
+      - uses: actions/checkout@v2
+
+      - uses: pjquirk/harvest-action@main


This action should be moved somewhere other than my personal account, to avoid bus-factor issues. Since it is so closely tied to Linguist, you might prefer having it in this repo under a top-level /actions folder.

pjquirk · 2020-12-18T19:07:17Z

The workflow or action could do more as well, such as creating an issue once an extension crosses some threshold of unique repositories.
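A rough sketch of what such a threshold check could look like, using the numbers from the example output. The `HarvestRow` type, the function names, and the threshold of 75 are all hypothetical; actually opening an issue is left out.

```python
from dataclasses import dataclass


@dataclass
class HarvestRow:
    extension: str
    total_hits: int
    unique_repos: int
    pr: int


def over_threshold(rows: list[HarvestRow], threshold: int) -> list[HarvestRow]:
    """Return the rows whose unique repository count has reached the threshold."""
    return [row for row in rows if row.unique_repos >= threshold]


def issue_title(row: HarvestRow, threshold: int) -> str:
    """Build a title for the issue that would be opened (hypothetical wording)."""
    return (
        f"Extension '{row.extension}' reached {row.unique_repos} unique "
        f"repositories (threshold {threshold}, PR #{row.pr})"
    )


rows = [
    HarvestRow("kql", 1007, 66, 4827),
    HarvestRow("kusto", 65, 19, 4827),
    HarvestRow("csl", 558, 80, 4827),
]
THRESHOLD = 75  # illustrative value only, not a project rule
flagged = over_threshold(rows, THRESHOLD)
for row in flagged:
    print(issue_title(row, THRESHOLD))
```

With this sample data only `csl` is flagged.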

Alhadis · 2021-01-05T20:51:41Z

> This adds a workflow that runs an adapted version of "Harvester"

Which parts were adapted, exactly? I don't recognise anything that resembles my own code…

Anyway, you might be interested in adding feedback to Alhadis/Harvester#15, as Harvester is long overdue for a complete rewrite. This will greatly simplify automation because it'll be a proper CLI tool, rather than a throwaway script from 2016 that's now looking incredibly long in the tooth…

Feedback and ideas welcome.

pjquirk · 2021-01-05T21:21:30Z

> Which parts were adapted, exactly? I don't recognise anything that resembles my own code…

I mostly adapted the ideas/algorithm. I didn't copy any code verbatim, since it relied on being run in the browser rather than using the REST APIs.

> Anyway, you might be interested in adding feedback to Alhadis/Harvester#15, as Harvester is long overdue for a complete rewrite. This will greatly simplify automation because it'll be a proper CLI tool, rather than a throwaway script from 2016 that's now looking incredibly long in the tooth…

Good call, I didn't see that issue. I definitely don't have the context for this script that you do, and my implementation is admittedly more focused on bulk extension monitoring than on individual extensions. I went with a rewrite since I didn't think that version was still being maintained, and it was a day-of-learning project built around running the script over dozens of extensions :D

Alhadis · 2021-01-05T22:14:58Z

Reporting extension usage is the primary use-case for the tool, and I plan on having an option to select output formats (markdown for GitHub discussion copy+pasta, null- or tab-separated values for easier CLI wrangling, etc). More ambitious options include mass-downloading of collected URLs and reusing locally-cached results. Having some way of going through search results to identify unrelated formats would greatly simplify the task of verifying a candidate language addition.

Don't hesitate to suggest other functions that relate to extension/format research, since now's the time for that sort of discussion.

> I went with a rewrite since I didn't think that version was still being maintained

It's only updated whenever a change to GitHub's front-end code forces me to update the CSS selectors used to target relevant page elements (which in itself is rather tenuous, given the utter lack of ID attributes and semantic markup 😢). As a sidenote, if I haven't archived a repository, it means it's maintained (even if it's not been updated in a while).

Anyway, enough rambling. I won't derail this thread any further. 👍
^D

lildude · 2021-01-22T17:29:36Z

I've been thinking about this and the one thing that's kinda niggling at me is the fact that we've got to keep updating the candidates.json file, which is going to involve PRs each time.

What would be really cool is if the action automatically ran against PRs that are adding a new language, possibly based on a label, using a field or comment in the PR.

pjquirk · 2021-01-22T19:40:42Z

@lildude I agree that'd be cool, though given existing PRs I don't see a great way to do that. I could take the diff of the languages.yml file and just add the extensions, but that doesn't always work. E.g. see my PR where csl is already associated with XML so I need to add some additional search terms.
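The limitation can be illustrated with a small sketch. It assumes languages.yml has already been parsed into a simplified mapping of language name to extension list (the real file carries many more fields), and the function name and sample data are hypothetical. Extensions that another language already claims are the ones where a plain diff is not enough and extra search terms are needed.

```python
def new_extensions(old: dict[str, list[str]], new: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each newly added extension to the other languages already claiming it."""
    claimed: dict[str, list[str]] = {}
    for lang, exts in old.items():
        for ext in exts:
            claimed.setdefault(ext, []).append(lang)

    added: dict[str, list[str]] = {}
    for lang, exts in new.items():
        for ext in exts:
            if ext not in old.get(lang, []):
                # Keep only languages other than the one being added.
                added[ext] = [other for other in claimed.get(ext, []) if other != lang]
    return added


# Simplified, made-up snapshots of the file before and after a PR.
old = {"XML": [".xml", ".csl"]}
new = {"XML": [".xml", ".csl"], "Kusto": [".csl", ".kql", ".kusto"]}

result = new_extensions(old, new)
# result == {".csl": ["XML"], ".kql": [], ".kusto": []}
needs_extra_terms = [ext for ext, owners in result.items() if owners]
# needs_extra_terms == [".csl"]
```

The diff can tell you that `.csl` is contested, but not which search terms separate the two formats, so that part still has to come from the PR author.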

You could require via template some JSON or other data in the PR body that contains the same data as the candidates file above, and then make the action a check on the PR (I talked about that here), assuming the metrics can be defined. I didn't want to suggest that immediately as that's a bit of a workflow change for your team, but I'd be happy to modify the action to do that.

lildude · 2021-01-23T11:37:43Z

Yeah, I was thinking that maybe parsing a value from the template or a comment would do the trick. We can always go back and add it to already open PRs and manually trigger the workflow.
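One way the "parse a value from the template or a comment" idea could work is sketched below. The `harvest:start` / `harvest:end` HTML comment markers are an invented convention, not something the PR template currently contains, and the function name is hypothetical. The JSON between the markers uses the same entry shape as the `extensions` array in the candidates file.

```python
import json
import re

# Hypothetical markers a PR template could ask contributors to fill in.
MARKER_RE = re.compile(
    r"<!--\s*harvest:start\s*-->(.*?)<!--\s*harvest:end\s*-->",
    re.DOTALL,
)


def extract_candidates(text: str):
    """Return the list of extension entries embedded in a PR body or comment.

    Returns None when the markers are missing or the JSON is invalid.
    """
    match = MARKER_RE.search(text)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, list) else None


body = """Adds a new language.

<!-- harvest:start -->
[{"extension": "kql"}, {"extension": "csl", "extendedSearch": "where+NOT+xml"}]
<!-- harvest:end -->
"""

entries = extract_candidates(body)
# entries[1]["extendedSearch"] == "where+NOT+xml"
```

Because the same function works on comment text, already open PRs could be handled by adding a comment with the markers and triggering the workflow manually.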

Add workflow for harvest

1695632

pjquirk added the Improvement label Dec 18, 2020

pjquirk requested a review from lildude December 18, 2020 18:57

pjquirk requested a review from a team as a code owner December 18, 2020 18:57

pjquirk changed the title ~~Create action for running Harvest periodically~~ Create action/workflow for running Harvester periodically Dec 18, 2020

pjquirk mentioned this pull request Jan 5, 2021

RFC: Reimplementing Harvester as a CLI program Alhadis/Harvester#15

Closed

Pencab referenced this pull request Jan 10, 2022

Add support for CITATION manifests (#5577)

4ce6df6

* Add CITATION.cff as YAML filename * Add CITATION.cff sample * Add CITATION variants to documentation * Add CITATION(S) as plaintext filename Co-authored-by: Colin Seymour <[email protected]>

lildude closed this Mar 13, 2024

github-linguist locked as resolved and limited conversation to collaborators Jun 19, 2024
