
Speed up processing: DB insertion and multiprocessing #600

Open
nesnoj opened this issue Jan 22, 2025 · 3 comments
Labels
🚀 feature New feature or request

@nesnoj
Collaborator

nesnoj commented Jan 22, 2025

Successor issue to the recently closed #546.
I merged both topics into one issue since I think they interact - e.g. a specific DB insertion method might not work (efficiently) with multiprocessing. If you disagree, feel free to separate them.

  1. Explore faster methods of writing to the database for
  • sqlite
  • postgres
  2. Add multiprocessing for
  • XML parsing
  • Writing to the DB (if applicable given concurrency -> table locks)

Notes on DB insertion: #546 (comment)
Notes on parallelization: #546 (comment)

Feel free to amend :)
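On point 1 (faster writes), one common baseline for sqlite is batching many rows into `executemany()` calls inside a single transaction instead of committing row by row. A minimal sketch - table and column names are made up for illustration, not taken from the actual schema:

```python
import sqlite3

def bulk_insert(conn, rows, batch_size=10_000):
    """Insert rows in large batches inside a single transaction.

    Compared to one INSERT (and one implicit commit) per row,
    executemany() over batches avoids most per-statement overhead.
    """
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        cur.executemany(
            "INSERT INTO units (id, name) VALUES (?, ?)",
            rows[start:start + batch_size],
        )
    conn.commit()  # one commit for the whole load

# Demo on an in-memory database (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE units (id INTEGER PRIMARY KEY, name TEXT)")
bulk_insert(conn, [(i, f"unit-{i}") for i in range(25_000)])
count = conn.execute("SELECT COUNT(*) FROM units").fetchone()[0]
print(count)  # 25000
```

Pragmas like `journal_mode=WAL` or `synchronous=NORMAL` could be benchmarked on top of this, but whether they help here would need measuring against the real data.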

@nesnoj nesnoj added the 🚀 feature New feature or request label Jan 22, 2025
@FlorianK13
Member

we reached #600 🥇

@AlexandraImbrisca
Contributor

AlexandraImbrisca commented Jan 22, 2025

Hi @nesnoj! Thanks for creating this issue. I'm testing my approach a bit more (different numbers of cores & different operating systems) and then I'll create the PR. I've been developing and testing on macOS & Linux and will continue with Windows. The approach is quite simple: it uses the standard concurrent.futures library with a few options to optimize access to the database.

About writing the data to the Postgres database: would you mind tackling this separately? I haven't had the chance to look into how to optimize writing yet, and the parallelization is database-agnostic right now.
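Without seeing the PR, here is one shape a concurrent.futures setup like the one described could take: workers parse XML chunks in parallel, while all writes stay with a single connection to sidestep sqlite table locks. Element and table names are invented for the demo; a ThreadPoolExecutor keeps the snippet portable, whereas the real CPU-bound parsing would likely use ProcessPoolExecutor under an `if __name__ == "__main__"` guard.

```python
import concurrent.futures
import sqlite3
import xml.etree.ElementTree as ET

def parse_chunk(xml_text):
    """Parse one XML chunk into row tuples (runs in a worker)."""
    root = ET.fromstring(xml_text)
    return [
        (int(e.findtext("Id")), e.findtext("Name"))
        for e in root.iter("Einheit")
    ]

# Illustrative stand-ins for the real export files.
chunks = [
    f"<Einheiten><Einheit><Id>{i}</Id><Name>unit-{i}</Name></Einheit></Einheiten>"
    for i in range(8)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE units (id INTEGER PRIMARY KEY, name TEXT)")

# Workers only parse; all DB writes happen in the main thread, so a
# single connection owns the (lock-prone) sqlite database.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for rows in pool.map(parse_chunk, chunks):
        conn.executemany("INSERT INTO units VALUES (?, ?)", rows)
conn.commit()

n_units = conn.execute("SELECT COUNT(*) FROM units").fetchone()[0]
print(n_units)  # 8
```

The single-writer pattern is what makes this database-agnostic: the executor never touches the connection, so the same structure works whether the sink is sqlite or Postgres.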

@nesnoj
Collaborator Author

nesnoj commented Jan 23, 2025

About writing the data to the Postgres database: would you mind tackling this separately? I haven't had the chance to look into how to optimize writing yet, and the parallelization is database-agnostic right now.

Sure, feel free to create a separate issue if that makes sense to you.
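For that separate Postgres issue, the usual fast path is `COPY ... FROM STDIN` rather than row-wise INSERTs. A minimal sketch of the serialization step only - the `copy_expert` call needs a live connection and is shown as a comment, and the table/column names are hypothetical:

```python
import csv
import io

def rows_to_copy_buffer(rows):
    """Serialize row tuples into an in-memory CSV buffer for COPY FROM STDIN."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf

buf = rows_to_copy_buffer([(1, "a"), (2, "b")])
first_line = buf.readline().strip()
print(first_line)  # 1,a

# With psycopg2 and a live connection, the buffer would be streamed as:
#   with conn.cursor() as cur:
#       cur.copy_expert("COPY units (id, name) FROM STDIN WITH CSV", buf)
#   conn.commit()
```

COPY pushes parsing into the server and avoids per-row round trips, which is why it typically beats even batched INSERTs for bulk loads; actual numbers would need benchmarking against the real data.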
