Commit

Merge branch 'describe-architecture-with-hugo'
Now that the site is converted to be built with Hugo and Pagefind, let's
reflect that status quo in the document describing the site's
architecture.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho committed Nov 21, 2023
2 parents 45e7bc2 + f54d152 commit 58f8ad5
Showing 1 changed file with 35 additions and 111 deletions.
146 changes: 35 additions & 111 deletions ARCHITECTURE.md
@@ -1,138 +1,65 @@
# git-scm.com architecture

This document describes the general setup and architecture that runs the
git-scm.com site. The idea is to document all the moving parts that
_aren't_ checked in to this repository. That may help new people joining
the project to help out, as well as provide some continuity in case the
maintainer is hit by a bus.
git-scm.com site.

## Content

Though the site is a rails app, it can _mostly_ be thought of as serving
static content. It's just that we suck in that static content and
pre-process it using nightly scheduled jobs. We never write anything to
the database on behalf of user requests.
This site is served via GitHub Pages and is a [Hugo](https://gohugo.io/) site
with the search implemented using [Pagefind](https://pagefind.app/).

The content is a mix of:

- actual static content in this repository
- original content from this repository

- community book content brought in from https://github.com/progit;
see the `lib/tasks/book2.rake` file.

- manpages from releases of the git project, imported and formatted
via asciidoctor; see the `lib/tasks/index.rake` task.

To deploy to GitHub Pages, it is necessary to turn off the default setting to
"publish from a branch" and instead change the setting to "publish with a
custom GitHub Actions workflow":
https://docs.github.com/en/pages/getting-started-with-github-pages/configuring-a-publishing-source-for-your-github-pages-site#publishing-with-a-custom-github-actions-workflow

## Heroku
## Non-static parts

The app itself is served by Heroku. The app name is `git-scm` (so you
can visit it directly as https://git-scm.herokuapp.com). The site is
owned by the git-scm.com team. If you want to be involved in managing
uptime/deploys/etc, you'll need a Heroku account and request to be added
to that team.
While the site consists mostly of static content, there are a couple of
parts that are sort of dynamic.

We use a few Heroku add-ons:
The search is implemented client-side, via [Pagefind](https://pagefind.app/).

- Bonsai elasticsearch (see below)
A few scheduled GitHub workflows keep the content up to date:

- Heroku Postgres as the database
- `update-git-version-and-manual-pages` and `update-download-data` (pick
up newly released git versions)

- Heroku Redis for rails caching
- `update-translated-manual-pages` (fetch and format translated manual
pages from the jnavila/git-html-l10n repository)

- Heroku scheduler for cron jobs

The nightly scheduled jobs are:

- `rake downloads` (pick up newly released git versions)

- `rake preindex` (pull in and format manpages for released git
versions)

- `rake remote_genbook2` (pull in and format progit2 book content,
- `update-book` (fetch and format progit2 book content,
including translations)

It should be safe to run any of those jobs more frequently. E.g., if you
know there's a new Git release out, then:

heroku run rake preindex
heroku run rake downloads

will get it on the site without waiting for the nightly run.

Merges to the `main` branch on GitHub auto-deploy to Heroku, so unless
you're doing something tricky you generally shouldn't need to manually
deploy.

Note that some of the formatting of manpages and book content happens
when they are imported by the rake tasks. So after fixing some
formatting and deploying, the rake jobs may need to be re-run with a
special flag to re-import (see the individual tasks for details).


## Cloudflare

We get enough requests that it's easy to overwhelm the single Heroku
dyno. So we have Cloudflare sitting in front of it, aggressively caching
everything. That also should make the site faster to serve to regions
far away from Heroku's servers.
These workflows are also marked as `workflow_dispatch`, i.e. they can be run
manually (e.g. to update the download links just after Git for Windows
published a new release).

The Cloudflare setup is mostly pretty simple:

- they serve DNS for the whole domain (that's where they insert the CDN
magic)

- Cloudflare provides `https://` support to the user. Obviously the
site is totally open and doesn't have any sensitive data, so this is
really more about integrity. The certificate is generated by
Cloudflare (and requires SNI on the browser side).

- the Cloudflare connection to Heroku is passed over TLS; they provide an
"internal" certificate that we ask Heroku to use, so the connection
is secured between the two (again, mostly for integrity)

- the most exotic config is that we use "page rules" to mark the whole
site to be cached aggressively, regardless of any caching headers
sent from Heroku. This is a bit of a hack, but there's very little on
the site that can't be cached (which is perhaps a sign that the rails
setup needs to be tweaked to send more reasonable caching headers,
but this has been simple and effective so far).

There are a few special page rules to lift this caching for cases
where we do server-side logic (e.g.,
https://github.com/git/git-scm.com/issues/1129#issuecomment-363067019),
but the long-term goal is to push that logic onto the client side as
much as possible.

Both domains (c.f., the section on [DNS](#DNS) below) are owned by a
Cloudflare "Team", and membership of that team is required to
administrate the domains. Similar to the Heroku setup, you can ask to
join this team if you wish to help out. The information about the team
setup is in escrow with the Git PLC at Software Freedom Conservancy.
Cloudflare provides the project with enough credits that it doesn't cost
anything (though we're not using very many features, so it's possible
that a free account would be sufficient, too).

## Bonsai Elasticsearch

The search functionality on the site is served by an elasticsearch
cluster. The index can be populated by running `rake search_index`
(manpages) and `rake search_index_book` (book) on Heroku (we only index
the manpages and book). This perhaps should be run nightly, or at least
after pulling in new content, but it currently isn't done automatically.

The elasticsearch cluster is provided by Bonsai via their Heroku plugin.
Our needs are larger than their free tier provides, but we receive
credits from them that provide the service for free.
Merges to the `gh-pages` branch on GitHub auto-deploy to GitHub Pages via the
`deploy` GitHub workflow.

Note that some of the formatting of manual pages and book content happens
when they are imported by the GitHub workflows. Therefore, after fixing some
formatting, these workflows may need the force-rebuild flag to be toggled (see
the individual workflows for details).
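
As a rough illustration of the manual trigger described above, the following sketch builds (but does not send) a `workflow_dispatch` request for GitHub's REST API. Several details are assumptions rather than facts from this repository: the workflow file name `update-book.yml`, the use of `gh-pages` as the ref, the string value `"true"` for the `force-rebuild` input, and the placeholder token. The helper `build_dispatch_request` is hypothetical; check the individual workflow files for the real input names.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_dispatch_request(owner, repo, workflow_file, ref, inputs=None, token="<token>"):
    """Build a workflow_dispatch request without sending it.

    `workflow_file` is the file name of the workflow under .github/workflows/;
    `inputs` is an optional dict of workflow inputs (values sent as strings).
    """
    url = f"{API_ROOT}/repos/{owner}/{repo}/actions/workflows/{workflow_file}/dispatches"
    payload = {"ref": ref}
    if inputs:
        payload["inputs"] = inputs
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",  # placeholder, not a real token
        },
    )


request = build_dispatch_request(
    "git",
    "git-scm.com",
    "update-book.yml",                 # assumed file name for the `update-book` workflow
    ref="gh-pages",                    # assumed: the branch that holds the workflow
    inputs={"force-rebuild": "true"},  # assumed input name and value format
)

# To actually trigger the run (requires a token with suitable permissions):
# urllib.request.urlopen(request)
```

Keeping the construction separate from the send makes it easy to inspect the URL and payload before anything is triggered. The same run can usually be started from the repository's Actions tab instead.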

## DNS

The actual DNS service is provided by Cloudflare (see above). The domain
itself is registered with Gandi, and is owned by the project via
Software Freedom Conservancy. Funds for the registration are provided
from the Git project's Conservancy funds, and both the Git PLC and
Conservancy have credentials to modify the setup.
The actual DNS service is provided by Cloudflare. The domain itself is
registered with Gandi, and is owned by the project via Software Freedom
Conservancy. Funds for the registration are provided from the Git project's
Conservancy funds, and both the Git PLC and Conservancy have credentials to
modify the setup.

Note that we own both git-scm.com and git-scm.org; the latter redirects
to the former.
@@ -144,18 +71,15 @@ The site mostly just runs without intervention:

- code merged to `main` is auto-deployed

- new git versions are detected daily and manpages and download links
- new git versions are detected daily and manual pages and download links
updated

- book updates (including translations) are picked up daily

There are a few tasks that still need to be handled by a human:

- new images added to the book have to be copied manually from
progit/progit2

- new languages for book translations need to be added to
`lib/tasks/book2.rake`
`script/book.rb`

- forced re-imports of content (e.g., a formatting fix to imported
manpages) must be triggered manually
manual pages) must be triggered manually with `force-rebuild` toggled
