historical_indexer runs out of memory and must start from beginning #2
Comments
I just added a progress cursor to the historical indexer, so if you restart it, it should remember where it was and continue there instead of starting again from the beginning. It's still experimental; it needs to run for at least 5 minutes before the cursor is saved, to prevent skipping fields because of concurrency.
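(Editor's note: a minimal sketch of how such a periodically saved cursor could work, written in Rust since that's the planned rewrite language mentioned later in this thread. The `cursor.txt` file name, the 5-minute constant, and the `index_repo` helper are illustrative assumptions, not the actual implementation.)

```rust
use std::fs;
use std::time::{Duration, Instant};

// Roughly the 5-minute save interval mentioned above (assumed value).
const SAVE_INTERVAL: Duration = Duration::from_secs(300);

fn run_indexer(repos: &[String]) -> std::io::Result<()> {
    // Resume from the last saved cursor if one exists on disk.
    let last_done: Option<String> = fs::read_to_string("cursor.txt")
        .ok()
        .map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty());

    let mut reached_cursor = last_done.is_none();
    let mut last_save = Instant::now();

    for repo in repos {
        if !reached_cursor {
            // Skip everything up to and including the repo named in the cursor.
            if Some(repo) == last_done.as_ref() {
                reached_cursor = true;
            }
            continue;
        }

        index_repo(repo); // download the repo and store its blocks (not shown)

        // Persist the cursor only after the interval has elapsed, so work that
        // finishes out of order under concurrency isn't skipped on restart.
        if last_save.elapsed() >= SAVE_INTERVAL {
            fs::write("cursor.txt", repo)?;
            last_save = Instant::now();
        }
    }
    Ok(())
}

fn index_repo(_repo: &str) {
    // placeholder for the actual download + SurrealDB writes
}
```

Because the cursor is just a plain file in this sketch, hand-editing it would also let you force a particular resume point, which is what the next comment touches on.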
Thanks @redsolver! I tried this out and it does seem to pick back up from the cursor, very cool. I see you're writing the cursor to a file, so I could hypothetically update it manually to whatever was the last thing it saw before getting killed. I still find it consistently crashing on a 60 MB repository download, though, so I think I'll ultimately have to try your historical indexer on a different machine.
I could add a max repo size which skips repos above a specific size, but that would of course cause an incomplete index.
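(Editor's note: such a size cap would only need a small guard before each download. This is a hypothetical sketch; the 50 MB value, the function names, and the idea that the repo size in bytes is known up front, e.g. from the CAR download's Content-Length, are all assumptions.)

```rust
// Hypothetical size cap; the value and names are illustrative only.
const MAX_REPO_BYTES: u64 = 50 * 1024 * 1024;

fn should_index(repo_did: &str, repo_size_bytes: u64) -> bool {
    if repo_size_bytes > MAX_REPO_BYTES {
        // Skipped repos leave a hole in the index, so record them for a later pass.
        eprintln!("skipping {repo_did}: {repo_size_bytes} bytes exceeds cap");
        return false;
    }
    true
}
```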
I don't think that would be useful here ultimately, and the bottleneck seems to be the RAM on this particular machine. I'd be interested to know how much RAM you/others have had success running the script with, so I can try to replicate that.
So recently I re-indexed all the historical repo data on a new server (128 GB RAM), and it's almost impossible to do because there seem to be some memory leaks. I did a lot of manual workarounds and small changes during indexing to get it to work, but the current implementation is pretty broken. So the historical indexer will likely need a rewrite in Rust to work correctly again without constant manual intervention.
The best short-term solution would be to share DB dumps of the entire historical dataset, so that not every user needs to index everything again. At the moment it's 50 GB for all historical data.
I agree, I think sharing checkpoints might work well. Does today's atproto blog post impact how this would work? |
The changes in repository structure might make it easier to sync the historical data, because there will likely be less of it. But for now I'll focus on a robust backup/snapshot solution for my database format, which can then be used to bootstrap new third-party instances quickly.
Hi @redsolver -- making a new issue so we can stay organized. 🚀
My machine:
If I run this script, CPU immediately hits 100% (understandable, since this is a very weak machine) and memory slowly climbs to 100% over the course of ~1 hour before hitting a maximum and my machine killing the PID. It does manage to count all the repos and then start downloading them, and the script works, as I can confirm SurrealDB stores the blocks. I'll get a

Process killed

message in my terminal when my machine kills it due to lack of RAM, and then the memory is released.

If I start again, it starts over from the very beginning, not where it left off. This means that unless I have sufficient RAM, I can't get the whole historical index. Again, that's understandable; this is a super weak machine just for testing. But do you have a recommended spec for running this script, so I can use it?