add a way to reduce the number of relatedness values stored for cache #78

Closed
lfoppiano opened this issue Jun 13, 2018 · 5 comments

@lfoppiano
Collaborator

After the system has been under heavy load for several hours, the cache holding all computed relatedness values reaches the heap memory limit, causing the GC to be triggered too often (high CPU usage).

We should find an efficient way to evict older elements from this cache in order to keep the memory footprint low.

lfoppiano added this to the 0.0.4 milestone Jun 13, 2018
lfoppiano changed the title from "cache system to reduce relatedness information over time" to "add a way to reduce the number of relatedness values stored for cache" Jun 14, 2018
@kermitt2
Owner

Relatedness might disappear in version 0.0.4 in favor of an LSTM-based architecture.

@lfoppiano
Collaborator Author

The issue affects any batch process running for more than a day. A relatively simple cache implementation can be used via Guava: https://github.com/google/guava/wiki/CachesExplained#caches
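
As a rough illustration of the size-bounded eviction discussed here, the sketch below uses only the JDK's `LinkedHashMap` in access order rather than Guava itself. Guava's `CacheBuilder.newBuilder().maximumSize(n)` provides comparable size-based eviction with better concurrency. The class name `LruCache`, the key format, and the capacity are hypothetical and not taken from entity-fishing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a size-bounded LRU cache (not entity-fishing code).
// Once more than maxEntries values are stored, the least recently used
// entry is dropped, so the memory footprint stays bounded.
class LruCache<K, V> {
    private final Map<K, V> entries;

    LruCache(final int maxEntries) {
        // accessOrder = true: iteration order goes from least to most recently used.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after each insertion; returning true evicts the eldest entry.
                return size() > maxEntries;
            }
        };
    }

    // Synchronized because even get() reorders an access-ordered LinkedHashMap.
    synchronized V get(K key) {
        return entries.get(key);
    }

    synchronized void put(K key, V value) {
        entries.put(key, value);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A relatedness cache could then be declared as `new LruCache<String, Double>(100_000)`, with a key built from the two entity identifiers. Both the key scheme and the capacity are assumptions that would need tuning against the available heap.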

@tantikristanti
Collaborator

Regarding this issue, the cache has been implemented in a separate branch called "cacheGuava". The implementation lives in a Relatedness class.

@lfoppiano
Collaborator Author

lfoppiano commented Oct 19, 2018

Implemented in branch: https://github.com/kermitt2/entity-fishing/tree/cacheGuava

Tests using entity-fishing-client-python (branch: multitasking); if there are problems, use goFishing. Run processBatch.py (with 1, 5, and 10 threads) on the following corpora, one after the other:

  • PubMed Central (1943 PDFs)
  • the corpus of PDFs we used for testing (pdf directory)

Tests to run (3 times per test):

  1. master version without cache
  2. master version with cache (branch 'cacheGuava')
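
The test plan above (1, 5, and 10 threads, 3 repetitions each) could be driven by a small timing harness along these lines. This is only a sketch: the class `BatchTimingSketch`, its methods, and the `work` placeholder are hypothetical, and the actual tests were meant to run through the Python client's processBatch.py rather than through Java code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical timing harness for the thread-count / repetition matrix.
class BatchTimingSketch {
    static final int[] THREAD_COUNTS = {1, 5, 10};
    static final int REPETITIONS = 3;

    // Runs `tasks` copies of `work` on a fixed pool and returns elapsed milliseconds.
    static long timeRun(int threads, int tasks, Runnable work) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.submit(work);
        }
        pool.shutdown(); // no new tasks; already submitted ones still run
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    // One long[] of REPETITIONS timings per entry in THREAD_COUNTS.
    static List<long[]> runMatrix(int tasks, Runnable work) {
        List<long[]> results = new ArrayList<>();
        for (int threads : THREAD_COUNTS) {
            long[] timings = new long[REPETITIONS];
            for (int r = 0; r < REPETITIONS; r++) {
                timings[r] = timeRun(threads, tasks, work);
            }
            results.add(timings);
        }
        return results;
    }
}
```

Running the same matrix once against the build without the cache and once against the cached build would give comparable timings for each thread count.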

@tantikristanti
Collaborator

tantikristanti commented Nov 15, 2018

Here are some tests that have been done so far with GoFishing:

[Screenshot: 2018-11-15 at 16:14:12]
