Profile full monthly generation #52

aelkiss · 2024-07-01T20:47:46Z

Generating a full hathifile takes many, many hours. It's probably worth profiling to see what's so slow and whether we can speed it up. It could be the database queries, or it could be JSON parsing (we could try https://github.com/anilmaurya/fast_jsonparser).

aelkiss · 2024-07-01T20:57:49Z

It looks like it queries the rights database once for every item, but the result is only used for the access profile and the rights update timestamp. We could consider:

- doing the queries in batches, ideally via the rights API
- doing a sort of offline join: get the zephir file sorted by id, dump the rights database also sorted by id, and then iterate through both in step. This has often been the most performant way I can come up with to handle these kinds of issues. We could potentially add an 'export all' option to the rights API to support this.
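A minimal sketch of the offline join described above, assuming both inputs are already sorted by id in the same byte order that Ruby's `String#<` uses (for example, files sorted with `LC_ALL=C sort`). The method names `merge_join` and `next_or_nil` and the `[htid, payload]` pair shape are hypothetical, chosen only for illustration; they are not taken from the hathifiles code.

```ruby
# Returns the next element of an external enumerator, or nil when exhausted.
def next_or_nil(enum)
  enum.next
rescue StopIteration
  nil
end

# Merge-join two streams of [htid, payload] pairs, both sorted ascending by htid.
# Yields htid, record, rights for every zephir record; rights is nil when the
# rights dump has no row for that htid. Each stream is read exactly once.
def merge_join(zephir_records, rights_rows)
  return enum_for(:merge_join, zephir_records, rights_rows) unless block_given?

  rights = rights_rows.each # external enumerator so we can advance it by hand
  current = next_or_nil(rights)

  zephir_records.each do |htid, record|
    # Discard rights rows whose id sorts before the current record.
    current = next_or_nil(rights) while current && current[0] < htid
    match = (current && current[0] == htid) ? current[1] : nil
    yield htid, record, match
  end
end
```

Because neither side is ever held in memory as a whole, this works with lazily read files as well as arrays; the cost is that both dumps must be produced in the same sort order up front.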

moseshll · 2024-07-12T21:00:56Z

Based on a stupid test run locally with a 22k-line upd file, fast_jsonparser screams. Just looking at the raw log timestamps, what takes just over a minute with JSON.parse takes slightly under two seconds with FastJsonparser.parse. That was with all the rights and collection stuff no-opped out. So the evidence is very much in favor of pursuing this in addition to finding a better rights strategy.

Alas, the default behavior of fast_jsonparser is to symbolize keys, which caused the rest of the process to bail out. Hence the ludicrous speed. In practice it does seem to be faster than the built-in JSON parser, so for ongoing experiments I will leave it in place.
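To illustrate the symbolized-keys pitfall without depending on the gem, the sketch below reproduces the same effect with the standard library's `symbolize_names: true` option. The sample record and the `stringify_keys` helper are hypothetical. The fast_jsonparser README appears to document a `symbolize_keys: false` option that would avoid the problem at the source, but that is worth checking against the installed version.

```ruby
require "json"

# Hypothetical one-line record, not real hathifiles data.
line = '{"htid": "test.001", "rights": "pd"}'

string_keyed = JSON.parse(line)                        # keys are Strings
symbol_keyed = JSON.parse(line, symbolize_names: true) # keys are Symbols

# Code written for string keys silently gets nil from a symbol-keyed hash,
# so downstream steps can fail or skip their work.
string_lookup = string_keyed["htid"]
symbol_lookup = symbol_keyed["htid"]

# Shallow conversion back to string keys (top level only).
def stringify_keys(hash)
  hash.transform_keys(&:to_s)
end
```

Here `string_lookup` holds the id while `symbol_lookup` is `nil`, which is the kind of mismatch that would make later processing bail out early.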

Addresses #52 Profile full monthly generation

- Experiment with removing one-by-one rights retrieval in favor of Sequel batch query for multiple htids.
- Add `HathifileWriter` class with limit on number of HTIDs in batch rights query
- Use Ettin Settings for DB connection string instead of `ENV` directly
- Connect to `ht` instead of `ht_rights` so `ht.ht_collections` can be read
- Add `mariadb-client` to both dockerfiles
- Update to `hathifiles_database` 0.3.0
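The batching idea can be sketched roughly as follows. This is not the `HathifileWriter` implementation; the method name `with_batched_rights`, the `"htid"` record key, and the `fetch_rights` callable are all assumptions made for illustration. In a real setup `fetch_rights` might wrap a Sequel dataset filtered with `where(column => htids)`, which Sequel turns into an `IN (...)` clause; the table and column names would depend on the actual schema.

```ruby
# Resolve rights for records in slices, issuing one lookup per slice
# instead of one lookup per record.
#
# records      - enumerable of hashes with an "htid" key (assumed shape)
# fetch_rights - callable taking an array of htids and returning a
#                hash of htid => rights row (hypothetical interface)
# batch_size   - upper limit on the number of htids per lookup
def with_batched_rights(records, fetch_rights, batch_size: 1000)
  records.each_slice(batch_size) do |batch|
    htids = batch.map { |record| record["htid"] }
    rights_by_htid = fetch_rights.call(htids) # one query for the whole slice
    batch.each { |record| yield record, rights_by_htid[record["htid"]] }
  end
end
```

The batch limit keeps the generated `IN` list and the in-memory slice bounded, at the price of holding up to `batch_size` records before any of them can be written.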

moseshll added a commit that referenced this issue Jul 18, 2024

Addresses #52 Profile full monthly generation

7e3ef9d

