Clarification on lvl. 1 pub servers' "in-memory" publications #104
Hello, thank you for your input!
Just to clarify: by "feed" I assume you do not mean an OPDS feed. Creating an OPDS feed is indeed a time-consuming operation in the current prototype/experimental implementation (i.e. the micro-service in r2-streamer-js, for test purposes only), but this is an opt-in feature that must be explicitly requested (i.e. it is not automatically launched on startup). So, by "feed" I think you refer to the fact that the current r2-streamer-js CLI utility starts up an instance of the server by scanning a designated folder on the filesystem, in order to find publication file paths to serve. Yes, I can imagine that this may be a time-consuming operation when many EPUBs are present.
This is exactly what r2-streamer-js does. Although the designated filesystem folder is scanned on startup, the actual publications are lazy-loaded (i.e. the file loading and EPUB parsing only occur when a request is received). Once a publication is loaded/parsed, the WebPubManifest is ready and is stored in an in-memory runtime cache, in order to avoid costly loading/parsing on subsequent requests. See loadOrGetCachedPublication() in server.ts.
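For readers who want to see the shape of that lazy-load-and-cache flow, here is a minimal TypeScript sketch. The names (scanFolder, parseEpub, Publication) are illustrative placeholders, not the actual r2-streamer-js API; the real logic lives in loadOrGetCachedPublication() in server.ts.

```typescript
import * as fs from "fs";
import * as path from "path";

// Stand-in for a parsed ReadiumWebPubManifest (illustrative only).
interface Publication { manifest: object; }

const registry: string[] = [];                  // file paths discovered at startup (cheap)
const cache = new Map<string, Publication>();   // parsed manifests, populated lazily

// Startup: record publication file paths only; no loading or parsing yet.
function scanFolder(dir: string): void {
  for (const entry of fs.readdirSync(dir)) {
    if (entry.endsWith(".epub")) {
      registry.push(path.join(dir, entry));
    }
  }
}

// Placeholder for the costly load + parse step (ZIP reading, OPF parsing, etc.).
async function parseEpub(filePath: string): Promise<Publication> {
  return { manifest: { source: filePath } };
}

// Equivalent in spirit to loadOrGetCachedPublication(): parse on the first
// request, then serve the cached manifest on every subsequent request.
async function loadOrGetCached(filePath: string): Promise<Publication> {
  let pub = cache.get(filePath);
  if (!pub) {
    pub = await parseEpub(filePath);
    cache.set(filePath, pub);
  }
  return pub;
}
```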
@danielweck Actually, I did mean the OPDS feed, but what you explained was useful anyway. That's why I'm thinking of moving the OPDS feed off the publication server: it would be nice for the server to remain database-less. I'm planning on making my publication server compatible with multi-tenant setups anyway, so it would make sense for each tenant to have their own feed. I'm open to better ideas.
Actually, I came up with a better way to do things: make the features that require a database and search indexing optional and expandable, by using the publication server as a package in a codebase that includes that functionality.
There's no requirement to do that. A smarter approach to this problem would be to define an LRU cache when initializing the server and dynamically fetch/parse packaged publications as you need them. Thanks to the LRU cache, frequently accessed publications would remain in memory, while less frequently accessed publications would eventually get swapped out of it.
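As an illustration of that suggestion, here is a small LRU cache sketch in TypeScript, extending the unbounded cache from the earlier sketch with eviction. parsePublication() is an assumed placeholder for the costly fetch + parse step; a JavaScript Map preserves insertion order, which makes a minimal LRU easy to express.

```typescript
interface Publication { manifest: object; }

class LruCache {
  private entries = new Map<string, Publication>();
  constructor(private readonly capacity: number) {}

  get(key: string): Publication | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Re-insert to mark this entry as the most recently used.
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: string, value: Publication): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // Evict the least recently used entry (first key in iteration order).
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}

declare function parsePublication(path: string): Promise<Publication>; // assumed

const lru = new LruCache(100); // keep at most 100 parsed publications in RAM

async function getPublication(path: string): Promise<Publication> {
  let pub = lru.get(path);
  if (!pub) {
    pub = await parsePublication(path); // fetch/parse only on demand
    lru.set(path, pub);
  }
  return pub;
}
```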
@HadrienGardeur, this quote is an incorrect/misleading statement, so I would like to clarify once again for readers who miss or skip the previous messages.

In the r2-streamer-js implementation, there is an optional OPDS "micro service" which constructs a feed corresponding to all the publications currently registered within the server instance. This OPDS feed is created by a non-blocking process, the first time a well-known HTTP route is explicitly requested (i.e. not at server startup), or whenever the URL is fetched after the OPDS feed has been invalidated (i.e. when a publication is added to / removed from the streamer's internal state).

This experimental/prototype OPDS feature is not part of the Readium2 architecture for the "streamer" component; it is provided specifically in the r2-streamer-js implementation to demonstrate and test the OPDS2 format, which is based on the ReadiumWebPubManifest model. Note that there is also a JSON-Schema validation pass (when pretty-printing the feed) which can itself be quite time-consuming, but once again this does not affect server startup.

Now, regarding the "streamer"'s internal state: there is an in-memory registry that records the paths/URLs where publications can be fetched, and there is a lazy-loading strategy to avoid unnecessarily stressing the server instance at startup. The processing costs related to loading and parsing publications (i.e. computing the actual ReadiumWebPubManifest models) are incurred only on demand, during incoming HTTP requests. The RWPM definitions are stored in an in-memory cache to optimize subsequent requests. The decision to destroy cache entries and to remove publications from the internal registry is an integration concern.

PS: the r2-streamer-js cache of loaded ReadiumWebPubManifest models currently grows indefinitely. There has been a "TODO" comment since day one in the TypeScript source code to implement an LRU (Least Recently Used) caching strategy. As the sole developer contributor (to date) to r2-streamer-js, I have not implemented LRU because it is unnecessary in the context of the Readium "desktop" application (which is the primary official/known integration of r2-streamer-js, as far as I know). I have now filed an issue to track this for developers who need to integrate r2-streamer-js inside long-lived / rarely-restarted server containers: readium/r2-streamer-js#47
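To make the "built on first request, rebuilt after invalidation" behavior concrete, here is a hedged TypeScript sketch of that pattern. The route path, buildOpdsFeed(), and the Express wiring are assumptions for illustration, not the actual r2-streamer-js code.

```typescript
import express from "express";

let cachedFeed: object | undefined; // undefined means "needs (re)building"

// Call this whenever a publication is added to / removed from the registry.
function invalidateFeed(): void {
  cachedFeed = undefined;
}

// Placeholder for the potentially slow feed construction
// (plus the optional JSON-Schema validation pass mentioned above).
async function buildOpdsFeed(): Promise<object> {
  return { metadata: { title: "All publications" }, publications: [] };
}

const app = express();
app.get("/opds2/publications.json", async (_req, res) => {
  if (!cachedFeed) {
    cachedFeed = await buildOpdsFeed(); // built on demand, never at startup
  }
  res.json(cachedFeed);
});
```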
At the moment, I have started expanding what I would call my golang publication server/streamer to be closer to the reference implementations, so that it is more closely compatible with the official spec. Something I've noticed the reference JS streamer doing is loading all publications in memory, as described in the Level 1 spec (https://github.com/readium/architecture/tree/master/server#level-1): "They must have an in-memory representation of the publications that they serve".
Am I correct in assuming that "in-memory" means the server's RAM? (If not, ignore the next paragraph or so.) When starting the JS streamer on a collection of thousands of EPUBs, it takes a long time to start because it is loading up every single publication and creating a feed. I would like to conform to the spec; however, caching metadata for thousands of publications in memory results in slow startup times and large memory usage. In my implementation (sketched below), I have been waiting until a publication is requested for the first time before parsing it, and then caching it for a while (I also haven't had to deal with archives, as I've only been loading exploded publications). However, I understand the need to have publications loaded so that a feed can be generated for them. This seems to call for a proper database of some sort, or directory sorting (like I do).
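The "parse on first request, cache for a while" approach described above amounts to a TTL (time-to-live) cache. Here is a minimal TypeScript sketch of that idea, under stated assumptions: parseExploded() is a hypothetical parser for an exploded (unzipped) publication, and the 10-minute TTL is an arbitrary illustrative value.

```typescript
interface Publication { manifest: object; }

interface CacheEntry { pub: Publication; expiresAt: number; }

const TTL_MS = 10 * 60 * 1000; // keep a parsed publication for 10 minutes
const ttlCache = new Map<string, CacheEntry>();

declare function parseExploded(dir: string): Promise<Publication>; // assumed

async function getOrParse(dir: string): Promise<Publication> {
  const now = Date.now();
  const entry = ttlCache.get(dir);
  if (entry && entry.expiresAt > now) {
    return entry.pub; // cache entry is still fresh
  }
  const pub = await parseExploded(dir); // parse only on demand
  ttlCache.set(dir, { pub, expiresAt: now + TTL_MS });
  return pub;
}
```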
Would it be possible to have a lvl. 1+ compliant publication server that does not "have an in-memory representation of the publications that they serve"?
Edit: the r2 golang streamer does the same thing, loading all publications in memory (although in a non-blocking manner).
Edit 2: Another option I've been considering is moving the generation of the OPDS feed away from being the publication server's direct responsibility.