
Interlinks filter: support extremely large inventory files #249

Closed
machow opened this issue Aug 22, 2023 · 3 comments

Comments


machow commented Aug 22, 2023

Large inventory files slow the interlinks filter. For sites with many pages (e.g. 100+), this can substantially increase build times.

Background

quartodoc uses this process for handling interlinks:

  • Running quartodoc interlinks downloads inventory files and saves them as JSON (e.g. _inv/python_objects.json)
  • The interlinks lua filter (installation, code) does this for each page being rendered:
    • reads the json files
    • replaces any uses of interlinks syntax with links derived from the json files
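In Python pseudocode, the per-page work looks roughly like the sketch below. (The real filter is Lua, and the "items"/"name"/"uri" field names here are illustrative assumptions, not quartodoc's actual JSON schema.)

```python
import json

def load_inventory(path):
    # Read one _inv/*.json file into a list of entries.
    # The "items"/"name"/"uri" keys are assumptions for illustration.
    with open(path) as f:
        return json.load(f)["items"]

def resolve(entries, target):
    # Mirror the filter's current lookup: a linear scan over every entry.
    for entry in entries:
        if entry["name"] == target:
            return entry["uri"]
    return None
```

Both steps run again for every page rendered, which is where the build-time cost comes from.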

However, this raises the following challenges:

  • io and parsing: the files must be read and parsed every time a page is rendered, and this delay adds up across many files.
  • lookup: we currently loop over every entry when resolving a link. This is inefficient, though likely still fast in human terms.
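One way to address the lookup cost is to build a dictionary once per inventory, so each reference resolves via a hash lookup rather than a scan over ~40,000 entries. A sketch (field names are assumptions, as above):

```python
def index_entries(entries):
    # One pass to build a name -> uri hash map; each subsequent lookup
    # is then O(1). "name" and "uri" are assumed field names.
    return {entry["name"]: entry["uri"] for entry in entries}
```

Usage would be `index = index_entries(entries)` once per inventory, then `index.get("statsmodels.api.OLS")` per reference.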

Inventories like the one for statsmodels.org are ~10MB unzipped and contain ~40,000 entries. A single parse is not a huge time sink, but the cost is multiplied by the number of pages rendered. From what I can tell, the Lua filter takes ~0.75 seconds just to read and parse this file (mostly spent parsing JSON).
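A rough way to reproduce that kind of measurement (sketched in Python; note the ~0.75s figure above is for Quarto's Lua JSON parser, so CPython's json module will be considerably faster on the same file):

```python
import json
import time

def time_json_parse(path):
    # Measure read + parse time for an inventory JSON file.
    t0 = time.perf_counter()
    with open(path) as f:
        json.load(f)
    return time.perf_counter() - t0
```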

Example

Run the following on the files below:

  • quartodoc interlinks (creates _inv folder with inventory files as json)
  • quarto render example.qmd --to gfm --output --

_quarto.yml:

filters:
  - interlinks

# interlinks are slow
interlinks:
  sources:
    python:
      url: https://docs.python.org/3/
    statsmodels:
      url: https://www.statsmodels.org/stable/
#   matplotlib:
#     url: https://matplotlib.org/stable/
#   mizani:
#     url: https://mizani.readthedocs.io/stable/
#   numpy:
#     url: https://numpy.org/doc/stable/
#   scipy:
#     url: https://docs.scipy.org/doc/scipy/
#   pandas:
#     url: https://pandas.pydata.org/pandas-docs/stable/
#   sklearn:
#     url: https://scikit-learn.org/stable/
#   skmisc:
#     url: https://has2k1.github.io/scikit-misc/stable/
#   adjustText:
#     url: https://adjusttext.readthedocs.io/en/latest/
#   patsy:
#     url: https://patsy.readthedocs.io/en/stable/

example.qmd:

---
---

[](`statsmodels.base.distributed_estimation`)

Potential Solutions

In order of complexity:

  • speed up parsing
    • find a way to parse the json files faster, OR
    • use a format that takes advantage of the fact that this data has a fixed structure (e.g. similar to a SQL table)
  • provide some kind of persistence (and only parse once)
    • e.g. run a webserver during site rendering, serve data or answers over a file socket.
    • e.g. at the extreme, Quarto's rendering approach gives us a "we need something Redis-like" problem.
    • (other tools like Sphinx or MkDocs provide these mechanisms out of the box)
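The "fixed structure" idea could be as simple as one tab-separated line per entry, so parsing becomes a single split per line instead of a full JSON parse. A sketch of a hypothetical format (not anything quartodoc actually ships):

```python
def write_flat(entries, path):
    # One "name<TAB>uri" line per entry; relies on every entry having
    # the same two fields (assumed names, for illustration).
    with open(path, "w") as f:
        for e in entries:
            f.write(f"{e['name']}\t{e['uri']}\n")

def read_flat(path):
    # Parsing is a bounded split per line; no JSON parser involved.
    index = {}
    with open(path) as f:
        for line in f:
            name, uri = line.rstrip("\n").split("\t", 1)
            index[name] = uri
    return index
```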
@machow machow added the .epic label Aug 22, 2023

machow commented Aug 22, 2023

cc @has2k1 who raised this as part of has2k1/plotnine#706, pairing w/ Carlos tomorrow on it


machow commented Sep 25, 2023

Setting the fast option speeds up loading of interlinks files. Instead of saving inventories as JSON, it saves the original inventory files as .txt and parses them in Lua (the JSON parsing Quarto provides to Lua filters is very slow).

interlinks:
  fast: true
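If the stored .txt mirrors the decompressed Sphinx objects.inv v2 layout (lines of `name domain:role priority uri dispname`), parsing one line is just a bounded split. A Python sketch of that format (the actual Lua implementation may differ):

```python
def parse_inv_line(line):
    # Sphinx v2 inventory fields: name, domain:role, priority, uri,
    # dispname. dispname may contain spaces, so split at most 4 times.
    name, domain_role, priority, uri, dispname = line.split(" ", 4)
    domain, role = domain_role.split(":", 1)
    return {"name": name, "domain": domain, "role": role,
            "priority": int(priority), "uri": uri, "dispname": dispname}
```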

@machow machow closed this as completed Sep 25, 2023