Profile full monthly generation #52

Open
aelkiss opened this issue Jul 1, 2024 · 2 comments

Comments

@aelkiss
Member

aelkiss commented Jul 1, 2024

Generating a full hathifile takes many, many hours. It's probably worth profiling to see what's so slow and whether we can speed it up. Could be database queries, could be JSON parsing (could try https://github.com/anilmaurya/fast_jsonparser).

@aelkiss
Member Author

aelkiss commented Jul 1, 2024

It looks like it does a database query for the rights db for every item, but it's only used for the access profile and rights update timestamp. We could consider:

  1. doing the queries in batches, ideally via the rights api
  2. doing a sort of offline join - basically get the zephir file sorted by id, dump the rights database also sorted by id, and then we can iterate through both - this has often been the most performant way I can come up with to handle these kinds of issues. We could potentially add an 'export all' option to the rights API to support this.
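
The offline join in option 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: both inputs are already sorted by HTID using the same byte-wise string ordering, each yields `[htid, payload]` pairs, and the names `merge_join` and `next_or_nil` are hypothetical rather than anything in the hathifiles code.

```ruby
# Hypothetical sketch: merge-join two streams that are both sorted by HTID.
# `records` and `rights` are any enumerables yielding [htid, payload] pairs.
# Both must be sorted with the same ordering Ruby's String#<=> uses.
def merge_join(records, rights)
  return enum_for(:merge_join, records, rights) unless block_given?

  rights_enum = rights.to_enum
  current = next_or_nil(rights_enum)

  records.each do |htid, record|
    # Advance the rights stream until it catches up with this record.
    current = next_or_nil(rights_enum) while current && current[0] < htid

    # A rights row either matches this HTID or belongs to a later one.
    matched = (current && current[0] == htid) ? current[1] : nil
    yield htid, record, matched
  end
end

# Returns the next element, or nil once the stream is exhausted.
def next_or_nil(enum)
  enum.next
rescue StopIteration
  nil
end
```

Each stream is read once and only one rights row is held in memory at a time, which is why this tends to beat per-item queries. Records with no rights row come back with `nil` so the caller can decide how to handle them.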

@moseshll
Contributor

moseshll commented Jul 12, 2024

Based on a stupid test run locally with a 22k-line upd file, fast_jsonparser screams. Just looking at the raw log timestamps, what takes just over a minute with JSON.parse takes slightly under two seconds with FastJsonparser.parse. All the rights and collection stuff no-opped out. So the evidence is very much in favor of pursuing this in addition to finding a better rights strategy.

Alas, the default behavior of fast_jsonparser is to symbolize keys, which caused the rest of the process to bail out. Hence the ludicrous speed. In practice it does seem to be faster than the built-in JSON parser, so for ongoing experiments I will leave it in place.
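
The symbolized-keys problem can be reproduced with the standard library alone. The sketch below uses `JSON.parse` with `symbolize_names: true` as a stand-in for fast_jsonparser's default; the sample record and its field names are made up for illustration. The fast_jsonparser README appears to document a `symbolize_keys: false` option for getting string keys back, but that should be checked against the installed version.

```ruby
require "json"

# Made-up sample line; the field names are illustrative only.
line = '{"htid": "test.123", "rights": "pd"}'

string_keyed = JSON.parse(line)                         # keys are Strings
symbol_keyed = JSON.parse(line, symbolize_names: true)  # keys are Symbols

# Code written against string keys silently gets nil from a symbol-keyed hash.
string_keyed["htid"]  # => "test.123"
symbol_keyed["htid"]  # => nil
symbol_keyed[:htid]   # => "test.123"

# Using fetch instead of [] turns the silent nil into a KeyError,
# which makes this kind of mismatch fail loudly during a benchmark.
def htid_of(record)
  record.fetch("htid")
end
```

A benchmark that finishes suspiciously fast is worth checking against the output line count, since downstream steps that see `nil` may skip most of the work.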

moseshll added a commit that referenced this issue Jul 18, 2024
- Experiment with removing one-by-one rights retrieval in favor of Sequel batch query for multiple htids.
moseshll added a commit that referenced this issue Aug 19, 2024
* Addresses #52 Profile full monthly generation
- Experiment with removing one-by-one rights retrieval in favor of Sequel batch query for multiple htids.
  - Add `HathifileWriter` class with limit on number of HTIDs in batch rights query
- Use Ettin Settings for DB connection string instead of `ENV` directly
  - Connect to `ht` instead of `ht_rights` so `ht.ht_collections` can be read
- Add `mariadb-client` to both dockerfiles
- Update to `hathifiles_database` 0.3.0
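
A rough sketch of the batching idea from the commit message follows. It is not the actual `HathifileWriter`: the class name `BatchedRightsLookup`, its parameters, and the `"htid"` record key are hypothetical, and the database call is abstracted into a block so the sketch runs without a connection. With Sequel, the block body would typically be a single `where(column => htids)` query, which Sequel renders as an `IN (...)` clause.

```ruby
# Hypothetical sketch of a batch-size-limited rights lookup.
class BatchedRightsLookup
  # fetch_batch receives an Array of HTIDs and must return a Hash
  # of { htid => rights_row }. With Sequel this would be one query per batch.
  def initialize(batch_size:, &fetch_batch)
    @batch_size = batch_size
    @fetch_batch = fetch_batch
  end

  # Yields each record together with its rights row (or nil if none was found).
  def each_with_rights(records)
    return enum_for(:each_with_rights, records) unless block_given?

    records.each_slice(@batch_size) do |batch|
      rights = @fetch_batch.call(batch.map { |r| r["htid"] })
      batch.each { |r| yield r, rights[r["htid"]] }
    end
  end
end
```

Capping the batch size keeps the generated `IN` list bounded while still cutting the number of round trips from one per item to one per batch.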