Index of Clojure libraries available on github.
Analysis:
Web frontends:
The goal of this project is to make the clojure libraries available on github easier to programmatically list and inspect.
Deps.edn can procure dependencies directly from github. However, finding clojure libraries that are available via github can be harder than finding them on clojars. Clojars provides several data endpoints to list available libraries and metadata. Even though similar info is available from github, it's not quite as easy to obtain.
Pre-retrieved data can be found at releases.
Each release includes the following files in .gz or tar.gz format:
- dewey.sqlite3.sql.gz: A sqlite3 dump with all of the analysis data. It can be loaded into a sqlite db with gzcat dewey.sqlite3.sql.gz | sqlite3 dewey.sqlite.
- deps-libs.edn: This is the best place to start if you're using the data. It's a map of library name to library info for all clojure github libraries that have deps.edn files on their default branch.
  - Library info keys:
    - :description: The github repo description.
    - :lib: Lib coordinate that will be recognized by clojure cli tools. https://clojure.org/reference/deps_and_cli#find-versions
    - :topics: The github topics for the repo.
    - :stars: Number of github stars.
    - :url: URL to page on github.
    - :versions: Vector of lib coordinates based on the tags in the git repo.
- deps directory: The deps.edn file for every clojure library that has a deps.edn file on its default branch. The folder structure is deps/<github username>/<github project>/deps.edn.
- all-repos.edn: A vector of all clojure repositories on github that were found (including non deps.edn based projects). Each repository is represented by a map with all of the data returned via the github API https://docs.github.com/en/rest/repos/repos#get-a-repository.
- deps-tags.edn: This is an intermediate file of pairs of github repo information and github tag information.
- analysis.edn.gz: clj-kondo analysis for every repo. The kondo config turns off all linters and includes the :locals, :keywords, :arglists, and :protocol-impls analyses. See dewey indexer and clj-kondo docs. Available as of the 2022-07-25 release.
All the .edn or .edn.gz files can be read using com.phronemophobic.dewey.util/read-edn. For example:
(require 'com.phronemophobic.dewey.util)
(def data (com.phronemophobic.dewey.util/read-edn fname))
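Once loaded, deps-libs.edn is a plain map, so ordinary sequence functions are enough to query it. The following is a minimal sketch, not part of dewey: `sample-libs` is hypothetical stand-in data shaped like the documented library info keys, `libs-with-topic` is a made-up helper name, and it assumes library names are symbols and :topics is a collection of strings.

```clojure
;; Hypothetical sample shaped like the documented library info keys.
;; Real entries would come from reading deps-libs.edn.
(def sample-libs
  {'com.example/alpha {:description "Example UI lib"
                       :stars 120
                       :topics ["ui" "graphics"]
                       :url "https://github.com/example/alpha"}
   'com.example/beta  {:description "Example parser"
                       :stars 45
                       :topics ["parsing"]
                       :url "https://github.com/example/beta"}
   'com.example/gamma {:description "Another UI lib"
                       :stars 7
                       :topics ["ui"]
                       :url "https://github.com/example/gamma"}})

(defn libs-with-topic
  "Returns [lib-name info] pairs whose :topics include `topic`,
  ordered from most to fewest stars."
  [libs topic]
  (->> libs
       (filter (fn [[_ info]] (some #{topic} (:topics info))))
       (sort-by (fn [[_ info]] (:stars info 0)) >)))

;; Names of the libraries tagged with the "ui" topic.
(map first (libs-with-topic sample-libs "ui"))
```

The same function should work on the real data if the assumptions about key and topic types hold.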
clj-kondo analyses for each project are available in the releases under analysis.edn.gz. This file can be quite chonky. For an example of how to process the data, see the stats example.
The file contains a vector of maps, with each map containing the following keys:
- :repo: A string name for the repo, e.g. "phronmophobic/dewey".
- :analyze-instant: The instant that the repo was analyzed, e.g. #inst "2022-12-15T21:09:46.694-00:00".
- :git/sha: The commit hash of the repository commit analyzed, e.g. "69dc62aac32f8a2da0a47aaf1dc662f86ff05760".
- :basis: A single repository can have multiple projects. Basis is the project file used to generate the source paths for clj-kondo to analyze, e.g. "path/to/project.clj" or "path/to/deps.edn".
- :analysis: The clj-kondo analysis.
The file is specially formatted edn so that it can be processed without reading the full contents into memory. The first line is [, the last line is ], and every line in between is a single map.
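The line-per-map layout means the file can be consumed one line at a time. Below is a minimal sketch of that idea using only clojure.edn and the JDK's gzip stream. The names `parse-analysis-line` and `reduce-analysis` are hypothetical and not part of dewey, and the sketch assumes the layout is exactly as described above.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io]
         '[clojure.string :as str])
(import 'java.util.zip.GZIPInputStream)

(defn parse-analysis-line
  "Returns the map on `line`, or nil for blank lines and the
  opening/closing bracket lines."
  [line]
  (let [line (str/trim line)]
    (when-not (contains? #{"" "[" "]"} line)
      ;; Unknown tagged literals are kept as data instead of throwing.
      (edn/read-string {:default tagged-literal} line))))

(defn reduce-analysis
  "Reduces `f` over the maps in the gzipped analysis file at `path`,
  reading one line at a time so the whole file is never in memory."
  [f init path]
  (with-open [rdr (-> path io/input-stream GZIPInputStream. io/reader)]
    (reduce f init (keep parse-analysis-line (line-seq rdr)))))

;; Example: count analyzed projects per repo.
(comment
  (reduce-analysis (fn [acc m] (update acc (:repo m) (fnil inc 0)))
                   {}
                   "analysis.edn.gz"))
```

Using a reduce inside `with-open` keeps the work eager, so the reader is not closed before the lines are consumed.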
To retrieve the data yourself, follow step 0 (authentication, described below) and then run:
# creates releases/yyyy-MM-dd/all-repos.edn
clojure -X:update-clojure-repo-index
# downloads all deps files to releases/yyyy-MM-dd/<user>/<project>/deps.edn
# due to rate limits, takes around 3 hours (mostly sleeping).
clojure -X:download-deps
# downloads tags for each deps.edn clojure library to releases/yyyy-MM-dd/deps-tags.edn
clojure -X:update-tag-index
# creates an index of library name to library metadata in releases/yyyy-MM-dd/deps-libs.edn
clojure -X:update-available-git-libs-index
These commands must be run in order.
Github search is quirky and has certain limitations imposed by rate-limiting. Below is a short synopsis of how Dewey attempts to locate clojure projects on github within the limitations imposed by github's API.
- Authentication
- Find all clojure repositories
- Download all deps.edn files
Dewey uses personal access tokens to make github API requests. You can obtain a personal access token by following these docs.
Once you have obtained your personal access token, save it to an edn file called "secrets.edn" in the project root directory using the following format:
{:github {:user "my-username"
          :token "my-token"}}
Currently, the first step is to paginate through the results of the github repository search language:clojure sorted by stars in descending order. There's a 1,000 result limit for any specific search, so after exhausting the results from language:clojure, we find repositories for specific numbers of stars starting at the star number from the last result. The search queries for these requests look like language:clojure stars:123, language:clojure stars:122, etc.
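The per-star-count queries can be sketched as follows. This is an illustration rather than dewey's actual code: `star-queries` and `search-url` are hypothetical helper names, counting down to zero stars is an assumption, and the URL uses the `q`, `sort`, `order`, `per_page`, and `page` parameters of the github repository search endpoint.

```clojure
(import 'java.net.URLEncoder)

(defn star-queries
  "Returns one search query per star count, from `last-stars`
  (the star count of the last result seen) down to 0."
  [last-stars]
  (for [n (range last-stars -1 -1)]
    (str "language:clojure stars:" n)))

(defn search-url
  "Builds a repository search URL for `query` and a 1-based `page`."
  [query page]
  (str "https://api.github.com/search/repositories"
       "?q=" (URLEncoder/encode query "UTF-8")
       "&sort=stars&order=desc&per_page=100&page=" page))

;; Queries to run after the plain language:clojure search ends at 2 stars.
(star-queries 2)
```

With 100 results per page, the 1,000 result limit corresponds to at most 10 pages per query.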
Once we have a list of clojure github repositories, we can then check each repository for its deps.edn file. Given a repository, the url for the deps.edn file looks like (str "https://raw.githubusercontent.com/" full-name "/" default-branch "/" fname).
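As a small worked example, the url can be derived directly from a repository map. The sketch below assumes the map uses the keyword keys :full_name and :default_branch, mirroring the field names in the github API response. `deps-url` is a hypothetical name and the repository values are made up.

```clojure
(defn deps-url
  "Returns the raw url of `fname` (default deps.edn) on the default
  branch of `repo`, a map with :full_name and :default_branch."
  ([repo] (deps-url repo "deps.edn"))
  ([{:keys [full_name default_branch]} fname]
   (str "https://raw.githubusercontent.com/"
        full_name "/" default_branch "/" fname)))

;; Hypothetical repository map with only the two fields used here.
(deps-url {:full_name "example-user/example-lib"
           :default_branch "main"})
```

A 404 response for the resulting url would then indicate that the repository has no deps.edn at the tip of its default branch.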
- There are some libraries that actually are clojure libraries, but aren't found when searching using language:clojure
- Clojurescript only libraries are not currently targeted
- Only checks tip of default branch.
- Only 1,000 libraries max per star count. At the time of writing, this only matters for star counts less than 5.
I thought just asking github for all the files named deps.edn might work. The roadblocks I ran into were:
- Hitting secondary rate limits after 1-2 requests.
- Receiving only 0-3 results even on successful requests.
These are strategies that I didn't try, but might be good alternatives if the main strategy fails.
As suggested by this stackoverflow answer, you can search by a field. The search API currently limits results to a max of 1000, but if you search a small enough window of time, you can scan through all the libraries.
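A rough sketch of what such a windowed scan could look like is below. It is untested against the live API and makes several assumptions: the field searched is the repository creation date via the `created:` search qualifier, the window size is fixed, and `window-queries` is a hypothetical helper name.

```clojure
(import 'java.time.LocalDate)

(defn window-queries
  "Returns search queries covering [start, end) in consecutive
  windows of `days` days each (the last window may be shorter)."
  [^LocalDate start ^LocalDate end days]
  (->> (iterate #(.plusDays ^LocalDate % days) start)
       (take-while #(.isBefore ^LocalDate % end))
       (map (fn [^LocalDate from]
              (let [upper (.plusDays from days)
                    ;; Clamp the final window to `end`, then make the
                    ;; upper bound inclusive by stepping back one day.
                    to    (.minusDays (if (.isAfter upper end) end upper) 1)]
                (str "language:clojure created:" from ".." to))))))

;; Two 5-day windows covering the first ten days of January 2022.
(window-queries (LocalDate/of 2022 1 1) (LocalDate/of 2022 1 11) 5)
```

Any window that still returns 1,000 results would need to be split further, since results past the limit would be silently missed.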
It's possible that github's GraphQL API might provide opportunities for improvement. However, it doesn't appear to have a way to filter by language or any other means of identifying repositories that are clojure related.
Now that we've bothered to catalog all of the clojure repos on github, there are several interesting projects we can do that use the data:
- Download and run static analysis across repos. Done! See analysis.edn.gz in releases.
- Create a website that combines the clojars data API with dewey's data to make it easier to search for clojure libraries. Done! See web search interface.
- Integrate the data into tools and IDEs
- deps.edn editor that knows the available libraries and versions
- Find example usages for libraries or specific functions (for example)
- Add support for other git hosting sites like gitlab.
Copyright © 2022 Adrian
Distributed under the Eclipse Public License version 1.0.