Create action/workflow for running Harvester periodically #5116

Closed
wants to merge 1 commit into from

pjquirk
Contributor

@pjquirk pjquirk commented Dec 18, 2020

This adds a workflow that runs an adapted version of "Harvester" to collect the number of hits/repos for a given extension/query.

Description

Keeping track of the usage of a language is a heavily manual process; see, for example, #4219. This monitoring is done by a tool known as Harvester, which uses the github.com search page to look for the total number of hits across unique repositories.

This workflow uses an action that does the same thing, using the actual REST APIs instead of the browser, which allows us to more easily automate the work. The action expects a JSON file that contains the search terms to process, and generates a markdown file with the results.

Example input:

[
  {
    "pr": 4827,
    "extensions": [
      {
        "extension": "kql"
      },
      {
        "extension": "kusto"
      },
      {
        "extension": "csl",
        "extendedSearch": "where+NOT+xml"
      }
    ]
  }
]

Example output:

| Extension | Total Hits | Unique Repositories | PR |
| --- | --- | --- | --- |
| kql | 1007 | 66 | 4827 |
| kusto | 65 | 19 | 4827 |
| csl | 558 | 80 | 4827 |
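
As a rough illustration of the input-to-output flow described above (not the action's actual code), the sketch below turns each entry of the candidates file into a search string, counts hits and unique repositories, and renders a markdown table. Everything here is an assumption made for illustration: the `extension:` query form, the treatment of `+` in `extendedSearch` as a space, the shape of a hit (`{"repository": "owner/name"}`), and the injected `search` callable that stands in for whatever performs the real API requests.

```python
from typing import Callable, Iterable, List, Tuple


def build_query(entry: dict) -> str:
    """Assumed mapping from one "extensions" entry to a search string."""
    query = f"extension:{entry['extension']}"
    extra = entry.get("extendedSearch")
    if extra:
        # Assumption: "+" in extendedSearch stands for a space.
        query += " " + extra.replace("+", " ")
    return query


def summarize(hits: Iterable[dict]) -> Tuple[int, int]:
    """Return (total hits, unique repositories) for the hits passed in.

    A real search API may report its total separately and paginate
    results; this only counts what it is given.
    """
    hits = list(hits)
    repos = {hit["repository"] for hit in hits}
    return len(hits), len(repos)


def render_table(candidates: List[dict],
                 search: Callable[[str], Iterable[dict]]) -> str:
    """Build a markdown table with one row per extension."""
    lines = [
        "| Extension | Total Hits | Unique Repositories | PR |",
        "| --- | --- | --- | --- |",
    ]
    for candidate in candidates:
        for entry in candidate["extensions"]:
            total, unique = summarize(search(build_query(entry)))
            lines.append(
                f"| {entry['extension']} | {total} | {unique} "
                f"| {candidate['pr']} |"
            )
    return "\n".join(lines)
```

Passing `search` in as a parameter keeps the counting and rendering logic testable without network access; a real implementation would supply a function that calls the search API and handles pagination and rate limits.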

This came out of a discussion in a PR.

Checklist:

  • I am adding new or changing current functionality
    • N/A: I have added or updated the tests for the new or changed functionality.

@pjquirk pjquirk requested a review from lildude December 18, 2020 18:57
@pjquirk pjquirk requested a review from a team as a code owner December 18, 2020 18:57
steps:
- uses: actions/checkout@v2

- uses: pjquirk/harvest-action@main
Contributor Author

This workflow should be moved somewhere other than my personal account, to avoid bus-factor issues. Since this is so tied into Linguist, you might prefer having it in this repo under a top-level /actions folder.

@pjquirk pjquirk changed the title Create action for running Harvest periodically Create action/workflow for running Harvester periodically Dec 18, 2020
@pjquirk
Contributor Author

pjquirk commented Dec 18, 2020

The workflow or action could do more as well, such as creating an issue once an extension crosses some threshold of unique repositories.
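
A minimal sketch of that threshold idea, assuming the harvested results are available as a list of rows with hypothetical `extension` and `unique_repos` keys. The threshold value of 50 is a placeholder for illustration only, not a Linguist policy, and the step that would actually open an issue is left out.

```python
from typing import List


def extensions_over_threshold(rows: List[dict], threshold: int) -> List[str]:
    """Return the extensions whose unique-repository count meets the threshold."""
    return [row["extension"] for row in rows if row["unique_repos"] >= threshold]


# Rows taken from the example output in the PR description.
example_rows = [
    {"extension": "kql", "unique_repos": 66},
    {"extension": "kusto", "unique_repos": 19},
    {"extension": "csl", "unique_repos": 80},
]

# With the placeholder threshold of 50, this selects "kql" and "csl".
candidates_for_issue = extensions_over_threshold(example_rows, 50)
```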

@Alhadis
Collaborator

Alhadis commented Jan 5, 2021

> This adds a workflow that runs an adapted version of "Harvester"

Which parts were adapted, exactly? I don't recognise anything that resembles my own code…

Anyway, you might be interested in adding feedback to Alhadis/Harvester#15, as Harvester is long overdue for a complete rewrite. This will greatly simplify automation because it'll be a proper CLI tool, rather than a throwaway script from 2016 that's now looking incredibly long in the tooth…

Feedback and ideas welcome.

@pjquirk
Contributor Author

pjquirk commented Jan 5, 2021

> Which parts were adapted, exactly? I don't recognise anything that resembles my own code…

I mostly adapted the ideas/algorithm; I didn't copy any code verbatim, since it relied on being run in the browser rather than using the REST APIs.

> Anyway, you might be interested in adding feedback to Alhadis/Harvester#15, as Harvester is long overdue for a complete rewrite. This will greatly simplify automation because it'll be a proper CLI tool, rather than a throwaway script from 2016 that's now looking incredibly long in the tooth…

Good call, I didn't see that issue. I definitely don't have the context for this script that you do, and my implementation is admittedly more focused on monitoring extensions in bulk than individually. I went with a rewrite since I didn't think that version was still being maintained, and it was a day-of-learning project built around running the script over dozens of extensions :D

@Alhadis
Collaborator

Alhadis commented Jan 5, 2021

Reporting extension usage is the primary use-case for the tool, and I plan on having an option to select output formats (markdown for GitHub discussion copy+pasta, null- or tab-separated values for easier CLI wrangling, etc). More ambitious options include mass-downloading of collected URLs and reusing locally-cached results. Having some way of going through search results to identify unrelated formats would greatly simplify the task of verifying a candidate language addition.

Don't hesitate to suggest other functions that relate to extension/format research, since now's the time for that sort of discussion.

> I went with a rewrite since I didn't think that version was still being maintained

It's only updated whenever a change to GitHub's front-end code forces me to update the CSS selectors used to target relevant page elements (which in itself is rather tenuous, given the utter lack of ID attributes and semantic markup 😢). As a sidenote, if I haven't archived a repository, it means it's maintained (even if it's not been updated in a while).

Anyway, enough rambling. I won't derail this thread any further. 👍
^D

@lildude
Member

lildude commented Jan 22, 2021

I've been thinking about this and the one thing that's kinda niggling at me is the fact we've got to keep updating the candidates.json file, which is going to involve PRs each time.

What would be really cool is if the action automatically ran against PRs that are adding a new language, possibly based on a label, using a field or comment in the PR.

@pjquirk
Contributor Author

pjquirk commented Jan 22, 2021

@lildude I agree that'd be cool, though given existing PRs I don't see a great way to do that. I could take the diff of the languages.yml file and just add the extensions, but that doesn't always work. E.g. see my PR where csl is already associated with XML so I need to add some additional search terms.
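
For reference, the diff-based approach could look roughly like the sketch below, which pulls newly added quoted entries that start with a dot out of a unified diff of languages.yml. The regular expression and the assumed line layout (`- ".ext"` list items) are illustrative guesses, and the sketch shares the limitation just described: it cannot tell that an extension such as csl also needs extra search terms, nor can it distinguish extensions from dot-prefixed filenames.

```python
import re
from typing import List

# Assumption: added extensions appear in the diff as lines like `+  - ".kql"`.
ADDED_EXTENSION = re.compile(r'^\+\s+-\s+"(\.[^"]+)"\s*$')


def added_extensions(diff_text: str) -> List[str]:
    """Collect extensions (without the leading dot) from added diff lines."""
    found = []
    for line in diff_text.splitlines():
        match = ADDED_EXTENSION.match(line)
        if match:
            found.append(match.group(1).lstrip("."))
    return found
```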

You could require via template some JSON or other data in the PR body that contains the same data as the candidates file above, and then make the action a check on the PR (I talked about that here), assuming the metrics can be defined. I didn't want to suggest that immediately as that's a bit of a workflow change for your team, but I'd be happy to modify the action to do that.
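
One way the PR-body idea might work, sketched under a made-up convention: the PR template asks authors to paste the candidate data as JSON inside an HTML comment that starts with the word `harvester`. The marker name and the function below are hypothetical; nothing in this thread settles on a format.

```python
import json
import re
from typing import Any, Optional

# Hypothetical marker: <!-- harvester [ ...JSON... ] -->
HARVESTER_BLOCK = re.compile(r"<!--\s*harvester\s*(.*?)-->", re.DOTALL)


def extract_candidates(pr_body: str) -> Optional[Any]:
    """Return the parsed JSON from the marker block, or None if absent or invalid."""
    match = HARVESTER_BLOCK.search(pr_body)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```

Returning `None` for both a missing and a malformed block lets a PR check report "no usable search data" without failing on PRs that do not add a language.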

@lildude
Member

lildude commented Jan 23, 2021

Yeah, I was thinking that maybe parsing a value from the template or a comment would do the trick. We can always go back and add it to already open PRs and manually trigger the workflow.

Pencab referenced this pull request Jan 10, 2022
* Add CITATION.cff as YAML filename

* Add CITATION.cff sample

* Add CITATION variants to documentation

* Add CITATION(S) as plaintext filename

Co-authored-by: Colin Seymour
@lildude lildude closed this Mar 13, 2024
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 19, 2024