This repository has been archived by the owner on Jan 21, 2022. It is now read-only.

[Partial] PostgreSQL support for magneticod #214

Merged
merged 1 commit into boramalper:master on Nov 27, 2020

Conversation

skobkin
Contributor

@skobkin skobkin commented Sep 21, 2019

In this PR I implemented a simple database engine for PostgreSQL.
It implements pkg/persistence/interface.go only partially, covering just the magneticod part.

I'm not sure I'll implement the magneticow part at all (I'm not using it). That's why I decided to share my contribution now. Someone else can implement the remaining parts of the interface if needed.
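In practice a partial engine looks roughly like this: a struct implementing the crawler-side methods while stubbing out the web-side ones. A minimal sketch — only AddNewTorrent is confirmed by the stack trace later in this thread; the other names and signatures are illustrative:

```go
package persistence

import (
	"database/sql"
	"fmt"
)

// postgresDatabase implements only the magneticod (crawler) side of the
// persistence interface; the magneticow (web) side is left unimplemented.
type postgresDatabase struct {
	conn *sql.DB
}

// AddNewTorrent persists a freshly discovered torrent (crawler side).
func (db *postgresDatabase) AddNewTorrent(infoHash []byte, name string, files []File) error {
	// INSERT INTO torrents ... and INSERT INTO files ... in one transaction.
	return nil
}

// QueryTorrents belongs to the magneticow side and is stubbed out here.
func (db *postgresDatabase) QueryTorrents() ([]TorrentMetadata, error) {
	return nil, fmt.Errorf("query methods are not implemented for the PostgreSQL engine")
}
```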

@skobkin skobkin marked this pull request as ready for review September 21, 2019 18:08
@kescherCode

I might take some time out of my days to implement the magneticow part, as I do need it.

@skobkin
Contributor Author

skobkin commented Sep 29, 2019

@kescherCode Nice. But you'd better ask @boramalper if he's going to merge it at all :)

@kescherCode

@skobkin That's totally up to him :) I just really would like to get rid of this huge SQLite file ;)

@kescherCode

kescherCode commented Sep 30, 2019

@skobkin I have just set up your branch on my machine and got this error after 96 torrents were successfully added to the DB:

```
Could not add new torrent to the database {
  "infohash": "723e539bffe0cdd22c117ef2e381742d773c86b7",
  "error": "tx.Exec (INSERT INTO files): pq: invalid byte sequence for encoding \"UTF8\": 0xbd",
  "errorVerbose": "pq: invalid byte sequence for encoding \"UTF8\": 0xbd
    tx.Exec (INSERT INTO files)
    github.com/boramalper/magnetico/pkg/persistence.(*postgresDatabase).AddNewTorrent
      /home/kescher/projects/magnetico/pkg/persistence/postgres.go:123
    main.main
      /home/kescher/projects/magnetico/cmd/magneticod/main.go:118
    runtime.main
      /usr/lib/go/src/runtime/proc.go:200
    runtime.goexit
      /usr/lib/go/src/runtime/asm_amd64.s:1337"
}
```

Encoding issues, hooray! Looks like some filenames will trip up Postgres. I wonder if we should use an encoding other than UTF8 in the postgres database?

Edit: Upon further investigation, the torrent in question contains Chinese characters. That should not be a problem for UTF8, however.

@skobkin
Contributor Author

skobkin commented Sep 30, 2019

@kescherCode I just forgot to add a non-Unicode check for files. I'll try to fix it today.

I wonder if we should use an encoding other than UTF8 in the postgres database?

I think it's better to drop non-Unicode torrents because they can cause problems in other applications (besides magneticod/magneticow). There are very few of them.
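The fix presumably amounts to validating strings before the INSERT. A minimal sketch using Go's standard library (the File field name is an assumption):

```go
import (
	"fmt"
	"unicode/utf8"
)

// validateUTF8 rejects torrents whose name or file paths are not valid
// UTF-8, since PostgreSQL's UTF8 encoding refuses such byte sequences.
func validateUTF8(name string, files []File) error {
	if !utf8.ValidString(name) {
		return fmt.Errorf("torrent name is not valid UTF-8")
	}
	for _, f := range files {
		if !utf8.ValidString(f.Path) {
			return fmt.Errorf("file path is not valid UTF-8: %q", f.Path)
		}
	}
	return nil
}
```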

@skobkin
Contributor Author

skobkin commented Sep 30, 2019

@kescherCode

I have just set up your branch on my machine and got this error after 96 torrents were successfully added to the DB

Fixed in 27c5446.

I wonder if we should use an encoding other than UTF8 in the postgres database?

I gave it one more thought and... I still don't think that non-Unicode file names are a good idea. They would probably crash a torrent client or cause filesystem problems.

That's totally up to him :) I just really would like to get rid of this huge SQLite file ;)

Yep. But it's not great to waste your time if it won't be merged and you're not planning to maintain a fork. So it's better to ask @boramalper on Gitter yourself.

@boramalper
Owner

Oh of course I’d merge! I’ll check the PR this week when I have some time. =)

@ad-m
Contributor

ad-m commented Oct 2, 2019

They would probably crash a torrent client or cause filesystem problems.

Magnetico never downloads files, so it never uses a torrent's file paths on the filesystem. If a torrent exists, then somebody is using it, and then there is a filesystem and a client able to handle downloading it.

@kescherCode

Magnetico never downloads files, so it never uses a torrent's file paths on the filesystem. If a torrent exists, then somebody is using it, and then there is a filesystem and a client able to handle downloading it.

I disagree. There are a lot of malformed torrents out there that intentionally try to mess up clients, mainly by anti-piracy activists.

@skobkin
Contributor Author

skobkin commented Oct 2, 2019

@ad-m

If a torrent exists, then somebody is using it, and then there is a filesystem and a client able to handle downloading it.

That's logically incorrect. It's probable, but not necessarily so.

Also, according to BEP_0003, which describes the torrent metadata format:

The name key maps to a UTF-8 encoded string which is the suggested name to save the file (or directory) as. It is purely advisory.

path - A list of UTF-8 encoded strings corresponding to subdirectory names, the last of which is the actual file name (a zero length list is an error case).

I don't think we must support DHT messages that don't follow the same spec as .torrent metadata files.

@skobkin
Contributor Author

skobkin commented Oct 15, 2019

I've just created a second PR for Beanstalk MQ support.

@skobkin
Contributor Author

skobkin commented Oct 15, 2019

I also added schema parameter support to provide the ability to customize the schema in which all tables will be stored.
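For illustration, a database URL carrying such a parameter might look like the following (the parameter name and values are examples, not the exact syntax from the PR):

```
postgres://magnetico:password@localhost:5432/magnetico_db?schema=magneticod
```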

@skobkin
Contributor Author

skobkin commented Oct 20, 2019

Moved my own magnetico-web project to PostgreSQL with this engine.

Compared to SQLite it's faster and more convenient to work with.

Also migrated all of my magnetico-python data to the new database with my migration tool.

If you want to test this engine, you can try my fork for now using the Docker image skobkin/magneticod:0.9.1-extra-engines, which is a fork of the latest magnetico version with PostgreSQL and Beanstalk engine support. UPD: Now I think it'd be better versioned as 0.10.0 according to SemVer...

Usage examples can be found here until this PR is merged.

pkg/persistence/postgres.go (review thread, outdated)
pkg/persistence/postgres.go (review thread)
pkg/persistence/postgres.go (review thread, outdated)
@boramalper
Owner

Thank you for your hard work, sincerely appreciated! I have left some comments, and I would be glad if you could address the first and the third.

@skobkin
Contributor Author

skobkin commented Nov 13, 2019

Thanks for your comments too. I've removed the unused method as you requested. But I'm not sure that using search_path is strictly better. If you say so, I'll change that part too.

@Glandos

Glandos commented Nov 15, 2020

See #218 (comment) for an interesting point of view.

@skobkin
Contributor Author

skobkin commented Nov 15, 2020

@Glandos What's your point?

@Glandos

Glandos commented Nov 15, 2020

I was just bringing attention to a comment in favor of this PR. I personally found the arguments in that comment worth reading.

@skobkin
Contributor Author

skobkin commented Nov 23, 2020

Ok. I did some things here:

  • rebased on master
  • added setting search_path (via connection string) as @boramalper asked
  • lib/pq is now in maintenance mode and its maintainers suggest using jackc/pgx instead, so I replaced lib/pq with jackc/pgx
  • squashed all commits to remove unneeded trials and experiments from the history

If you were using the old version with lib/pq, please check the updated docs, because the database URL format changed slightly (see the sketch below).

I'll rebuild my Docker image and migrate my instance to the new code soon.
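For anyone updating, opening the database through jackc/pgx's database/sql adapter looks roughly like this. A sketch with an illustrative URL: with pgx, unrecognized URL query parameters such as search_path are forwarded to PostgreSQL as runtime parameters.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/jackc/pgx/v4/stdlib" // registers the "pgx" driver name
)

func main() {
	url := "postgres://magnetico:secret@localhost:5432/magnetico?search_path=magneticod"
	db, err := sql.Open("pgx", url)
	if err != nil {
		log.Fatalf("sql.Open: %v", err)
	}
	defer db.Close()

	if err := db.Ping(); err != nil {
		log.Fatalf("db.Ping: %v", err)
	}
}
```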

@boramalper boramalper merged commit 0a497e8 into boramalper:master Nov 27, 2020
@boramalper
Owner

Thanks to @skobkin and everyone else for their contribution, much appreciated! =)

This has now landed in v0.12.0 (also on Docker Hub), so please let us know your experience with it by opening separate issues.

@daemonserj

Could you support PGSQL in magneticow as well?

@boramalper
Owner

If someone is willing to implement it, sure. I rarely have time these days, and even when I do, magnetico is no longer a priority.

@skobkin
Contributor Author

skobkin commented Nov 28, 2020

Could you support PGSQL in magneticow as well?

Not right now. I did what I had time for and what I needed personally. Some people only use magneticod as a crawler for other software.

You can check my magnetico-web project, which I use as an alternative web interface for a Magnetico database (PostgreSQL). It may or may not require some knowledge of PHP to run.
It's intended for running a public search instance accessible by invites.

@Glandos

Glandos commented Dec 3, 2020

I am working on improving my magnetico database merger to support PostgreSQL, especially to migrate from SQLite to PostgreSQL.
I've noticed that this PR specifically rejects invalid UTF-8 names. Is there any rationale behind this? I have some of them in my SQLite dumps, because SQLite doesn't really care. I'd say I roughly have 2 to 4 per thousand.

Is UTF-8 part of the BitTorrent protocol? Or can the name be in any encoding?

@Glandos

Glandos commented Dec 3, 2020

And I've tested with magneticow: it happily returns results with those invalid UTF-8 characters replaced.

@skobkin
Contributor Author

skobkin commented Dec 4, 2020

@Glandos

I've noticed that this PR specifically rejects invalid UTF-8 names.
Is there any rationale behind this?

here.

Is UTF-8 part of the BitTorrent protocol?

See linked comment above. BEP-0003 says:

All strings in a .torrent file that contains text must be UTF-8 encoded.

It doesn't explicitly say that strings should also be UTF-8 outside of a .torrent file (in a magnet link, for example). But I don't see why we should support data that isn't compliant with that spec.

If you want, you can check an alternative approach.

@Glandos

Glandos commented Dec 4, 2020

Sometimes I should try to find the information myself, especially when it's so easy to find.

So thanks for that mention ☺️

I think I'll use replace to handle decoding errors while moving from SQLite. Otherwise, you're right, it may be good to avoid non-conforming torrents.
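In Go terms (whatever language the merger itself uses), the replace strategy is a one-liner since Go 1.13:

```go
import "strings"

// sanitize substitutes U+FFFD for invalid UTF-8 byte sequences so the
// row can be inserted into PostgreSQL instead of being dropped entirely.
func sanitize(raw string) string {
	return strings.ToValidUTF8(raw, "\uFFFD")
}
```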

@skobkin
Contributor Author

skobkin commented Dec 4, 2020

@Glandos

while moving from sqlite

https://gitlab.com/skobkin/magnetico-go-migrator 😄

You can also build it with the newer version of the magnetico persistence package, though.

UPD: Just be aware that it uses AddTorrent, so the torrent date will be replaced by the date of migration if you use it.
UPD2: Uh, sorry. I've just realized that you're moving from new magnetico (SQLite) to new magnetico (PostgreSQL). In that case it's not what you need. But... the database structure is similar, so it could work; I just didn't test that use case.

@Glandos

Glandos commented Dec 4, 2020

I didn't know about this tool 👍 However, it seems that it performs a simple migration from sqlite to postgresql. My tool is designed to merge databases.
Right now it only supports merging two sqlite databases, or merging a sqlite dump into postgresql.

I'm not really proud of the code… It tries to be as fast as possible, given the constraints, but it's far from good code.

@skobkin
Contributor Author

skobkin commented Dec 7, 2020

designed to merge databases

There's not much difference in this case because magnetico's AddTorrent() checks for an existing torrent before insertion. But if you add support for different types of DBs, it would be a good tool for merging all community dumps.
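For what it's worth, the duplicate check can also be pushed into PostgreSQL itself via the UNIQUE constraint on torrents.info_hash. A sketch; the column names are assumptions, not necessarily the actual schema:

```go
// insertTorrent relies on the UNIQUE constraint on torrents.info_hash,
// so re-importing an already-merged dump is a no-op.
func insertTorrent(tx *sql.Tx, infoHash []byte, name string, size int64, discoveredOn int64) error {
	_, err := tx.Exec(
		`INSERT INTO torrents (info_hash, name, total_size, discovered_on)
		 VALUES ($1, $2, $3, $4)
		 ON CONFLICT (info_hash) DO NOTHING`,
		infoHash, name, size, discoveredOn,
	)
	return err
}
```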

@Glandos

Glandos commented Dec 11, 2020

With the issues I'm currently encountering migrating my 42 GB SQLite database to Postgres, I think using AddTorrent() for each of the 14M torrents will be too slow.

My work on it is not over, but basically when importing a massive dump, you need to:

  • drop indices, except the UNIQUE constraint on torrents.info_hash
  • disable WAL: my experience showed a WAL of 93 GB after a third of the aforementioned SQLite database had been imported. I ran out of disk space then 😄
  • insert everything using multi-value inserts (see the sketch after this comment)
  • recreate indices
  • re-enable WAL

With only the multi-value insert phase, I still have an import time of about 24-25 hours.
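For the bulk-insert phase, pgx's COPY support is also worth trying, since the COPY protocol is typically at least as fast as hand-built multi-value INSERTs. A sketch with assumed table and column names:

```go
import (
	"context"

	"github.com/jackc/pgx/v4"
)

// bulkLoad streams a batch of rows into torrents via the COPY protocol.
// Each row is {info_hash, name, total_size, discovered_on}.
func bulkLoad(ctx context.Context, conn *pgx.Conn, rows [][]interface{}) (int64, error) {
	return conn.CopyFrom(
		ctx,
		pgx.Identifier{"torrents"},
		[]string{"info_hash", "name", "total_size", "discovered_on"},
		pgx.CopyFromRows(rows),
	)
}
```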

@skobkin
Contributor Author

skobkin commented Dec 11, 2020

I think using AddTorrent() for each of the 14M torrents will be too slow.

Yes, kind of. I migrated ~12M torrents too, and it took several hours as I remember. The point of this approach is that it's native to Magnetico, uses the same API, and most likely won't cause broken data or anything. If you're sure you know what you're doing, then building a more performant solution is of course the better way.

The second advantage of using magnetico's API is that you can run the daemon while also importing old torrents into the database; it will still check for duplicates. So if you're worried about missing some very important torrent, you can use it.

still have an import time of about 24-25 hours

Huh. Maybe your machine is somewhat CPU- or IO-constrained. But you still only need to do that once, so it shouldn't be a problem regardless of approach :)

BTW, you can also try running a multi-threaded or multi-process import, reading batches from different torrent.id ranges and pushing them to PostgreSQL at the same time. Depending on the IO limits, CPU, and PostgreSQL settings, you may gain something from that.
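A minimal sketch of that idea, partitioning by torrent.id and importing the ranges concurrently (importRange is a hypothetical helper that reads one slice from SQLite and writes it to PostgreSQL):

```go
import "sync"

// parallelImport splits [0, maxID) into equal slices and imports them
// concurrently via goroutines.
func parallelImport(maxID, workers int) {
	var wg sync.WaitGroup
	step := (maxID + workers - 1) / workers
	for w := 0; w < workers; w++ {
		lo, hi := w*step, (w+1)*step
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			importRange(lo, hi) // hypothetical: SELECT ... WHERE id >= lo AND id < hi
		}(lo, hi)
	}
	wg.Wait()
}
```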

@Glandos

Glandos commented Dec 22, 2020

My merger is ready. It can only merge from SQLite into either PostgreSQL or SQLite. It is quite fast at populating a new PostgreSQL database from SQLite (my current use case), but updating a large PostgreSQL database could be improved, e.g. by using a temporary table (see the sketch below).
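The temporary-table idea, sketched with assumed names: COPY the incoming dump into a session-local staging table, then merge it in a single statement so the big table and its indices are only touched once:

```go
// mergeViaTempTable stages the incoming dump in a session-local table,
// then merges it into torrents in one statement.
func mergeViaTempTable(tx *sql.Tx) error {
	if _, err := tx.Exec(
		`CREATE TEMPORARY TABLE torrents_in
		 (LIKE torrents INCLUDING DEFAULTS) ON COMMIT DROP`,
	); err != nil {
		return err
	}
	// ... bulk-load torrents_in here, e.g. via COPY ...
	_, err := tx.Exec(
		`INSERT INTO torrents (info_hash, name, total_size, discovered_on)
		 SELECT info_hash, name, total_size, discovered_on FROM torrents_in
		 ON CONFLICT (info_hash) DO NOTHING`,
	)
	return err
}
```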

Anyway, I have been running magneticod with PostgreSQL for four days, and I don't see any performance improvement when crawling the DHT. Maybe there are some network restrictions out of my control…

@skobkin
Contributor Author

skobkin commented Dec 23, 2020

I don't see any performance improvement when crawling the DHT

That's because you shouldn't see any: SQLite wasn't the bottleneck for the crawler.
