Additional crawlers progress tracker #29
Looking for suggestions on the easiest way to get started with this. 1337x is a pain. Anyone got any bright ideas?
Right, I have an update! In its current form you'll need some dev experience to get this running, so if you are a casual user please be wary. Here's a full EZTV scraper. You'll need to run it on a system that can access the Knight Crawler Postgres db and an instance of RabbitMQ. This can be the same RabbitMQ that Knight Crawler uses or a different temporary one. https://github.com/purple-emily/knight-crawler-scrapers-dirty

It uses Python with Poetry to install the dependencies. If anyone wants a quick guide on how to run it, let me know. Start one producer and one or two consumers and you should be good. This is generally a single-use script, as Knight Crawler already gets the most recent releases from EZTV. You can abort and resume at any time and the script should take care of this for you.

This will add at least 200,000 new torrents from initial runs; final numbers to be confirmed later. This is essentially an alpha release, so use with caution. Back up Postgres before running. It should take between one and two hours to get the data, and there are no confirmed numbers on processing it all. Runs on any system with Python. I have provided a start script for each service.
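For anyone curious about the moving parts before running it, here is a minimal sketch of the producer/consumer split, assuming a local RabbitMQ. The queue name, message shape, and the `scrape_page` placeholder below are illustrative only and do not match the actual names in the repo:

```python
# Minimal producer/consumer sketch using pika (pip install pika).
# Assumes RabbitMQ on localhost; the queue name and message shape are
# invented for illustration, not taken from knight-crawler-scrapers-dirty.
import json
import pika

QUEUE = "eztv_pages"

def produce(page_count: int) -> None:
    """Publish one message per EZTV listing page to be scraped."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    for page in range(1, page_count + 1):
        body = json.dumps({"page": page})
        channel.basic_publish(exchange="", routing_key=QUEUE, body=body)
    connection.close()

def consume() -> None:
    """Pull page messages off the queue and process them one at a time."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)

    def on_message(ch, method, properties, body):
        page = json.loads(body)["page"]
        # scrape_page(page) would fetch and parse the listing page, then
        # write rows to Postgres; it is a placeholder here.
        print(f"would scrape page {page}")
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()
```

Because the work items sit in the queue, you can run one producer and as many consumers as your machine can handle, and killing and restarting a consumer loses nothing that hasn't been acked.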
Taking requests for what everyone would like me to prioritise next. @iPromKnight I don't know if you want to take the logic I have created and convert it to C#. Once we have done a single "full scrape" we don't really have to repeat it. Following the RSS feed gets us all the new releases anyway.
That's only when the database is shared (or importable), right?
Essentially, run the scraper once to get all of the history, and then the RSS feed crawler will keep it up to date.
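As a rough illustration of that steady-state half, here is a sketch of polling an RSS feed and only acting on unseen entries. The feed URL is an example, and `save_if_new` is a hypothetical helper that would insert into Postgres:

```python
# Sketch of the keep-up-to-date loop: poll the RSS feed and store
# anything not seen before. feedparser is a real library
# (pip install feedparser); the persistence step is a placeholder.
import time
import feedparser

FEED_URL = "https://eztv.re/ezrss.xml"  # example feed URL, verify before use

def poll_forever(interval_seconds: int = 900) -> None:
    seen: set[str] = set()
    while True:
        feed = feedparser.parse(FEED_URL)
        for entry in feed.entries:
            # Dedup on the entry link; a real implementation would key on
            # the torrent info hash instead, since links can change.
            key = entry.get("link", "")
            if key and key not in seen:
                seen.add(key)
                # save_if_new(entry) would write to Postgres here.
                print(f"new release: {entry.get('title', key)}")
        time.sleep(interval_seconds)
```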
You already said that in your previous comment, and I get that. What I mean is: where is this scraped history stored? If it's stored in your local database only, then no one else can access it unless they also scrape it themselves. What I'm trying to get at is this: if KC users are expected to run the EZTV scraper themselves to fetch all of the initial history, I think it would make sense to rewrite your POC in C# for consistency. If the DB is somehow shared, then it doesn't matter as much imo.
@sleeyax It's stored in the local database. We don't have any sort of database sharing at the moment; it's definitely something I'd like to see happen, but it's going to be a lot of work. As far as the language is concerned, I don't see a big issue with supporting multiple languages as long as it's pretty much plug and play. All that matters is that the database schema is respected.
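To make "respect the schema" concrete, here is a hedged sketch of an idempotent insert. The table and column names are guesses for illustration, not the actual Knight Crawler schema, so check the real migrations before writing anything:

```python
# Idempotent insert sketch using psycopg2 (pip install psycopg2-binary).
# Table and column names below are assumptions for illustration only.
import psycopg2

def save_torrent(conn, info_hash: str, name: str, category: str) -> None:
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO torrents (info_hash, name, category)
            VALUES (%s, %s, %s)
            ON CONFLICT (info_hash) DO NOTHING
            """,
            (info_hash, name, category),
        )
    conn.commit()

conn = psycopg2.connect("dbname=knightcrawler user=postgres password=postgres")
save_torrent(conn, "abc123...", "Some.Show.S01E01.1080p", "tv")
```

The `ON CONFLICT ... DO NOTHING` is what makes a scraper safe to abort and resume: replaying the same releases just becomes a no-op.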
The problem with having a shared database is that we then become susceptible to DMCA actions. When media takedown requests are issued, they are against the hash for the magnet as well as the content. That's why I've been reluctant to implement anything for that, and rely solely on external sources. I'm toying with the idea of taking the idea from #45 and expanding on it, so that for a preseed action it could get the Cinemeta known IMDb id list and just process lookups in parallel for them using the Helios-compatible provider definitions.
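A rough sketch of what those parallel lookups could look like, assuming a list of IMDb ids and a hypothetical `lookup` coroutine hitting one provider endpoint. Only the bounded fan-out pattern is the point; the URL and response handling are placeholders:

```python
# Bounded-concurrency lookup sketch with aiohttp (pip install aiohttp).
# The endpoint is invented; a real preseed would iterate over the
# Helios-compatible provider definitions instead.
import asyncio
import aiohttp

async def lookup(session: aiohttp.ClientSession, imdb_id: str, sem: asyncio.Semaphore):
    async with sem:
        # Hypothetical provider search endpoint.
        url = f"https://example-provider.invalid/search?imdb={imdb_id}"
        async with session.get(url) as resp:
            return imdb_id, resp.status

async def preseed(imdb_ids: list[str], concurrency: int = 20):
    # The semaphore caps in-flight requests so we don't hammer providers.
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [lookup(session, i, sem) for i in imdb_ids]
        return await asyncio.gather(*tasks, return_exceptions=True)

# asyncio.run(preseed(["tt0944947", "tt0903747"]))
```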
One of the reasons I wanted to redo the consumer in TypeScript and didn't rewrite it in C# was to kind of show that, with a service bus in place, it doesn't matter what tech we write services in 😃
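That language-agnosticism boils down to agreeing on a message contract rather than a runtime. A sketch of what such a contract might look like, with field names invented for illustration rather than taken from the actual bus:

```python
# Any language can publish or consume this as long as the JSON shape is
# agreed on. Field names here are invented, not the real bus contract.
import json
from dataclasses import dataclass, asdict

@dataclass
class IngestedTorrent:
    info_hash: str
    name: str
    source: str  # e.g. "eztv", "nyaa"

message = json.dumps(asdict(IngestedTorrent(
    info_hash="abc123...",
    name="Some.Show.S01E01.1080p",
    source="eztv",
)))
# A C# or TypeScript consumer just deserializes the same JSON.
print(message)
```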
I’m going to do a refactor of the “deep eztv” crawler I’ve written and then try to use it as a framework to make more. nyaa.si has an RSS feed. How easy would it be to add it to the C# scraper?
If it's RSS, really easy, as we have an abstract XML scraper. Just have to derive from that and override the required methods.
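The actual abstract scraper lives in the C# codebase and its class and method names will differ; purely to illustrate the derive-and-override shape being described, here is the pattern sketched in Python:

```python
# Pattern sketch only: the real abstract XML scraper is C#, and this
# just shows the "derive and override" shape in Python terms.
from abc import ABC, abstractmethod
import feedparser

class XmlRssScraper(ABC):
    @property
    @abstractmethod
    def feed_url(self) -> str: ...

    @abstractmethod
    def parse_entry(self, entry) -> dict: ...

    def scrape(self) -> list[dict]:
        # Shared plumbing lives in the base class.
        feed = feedparser.parse(self.feed_url)
        return [self.parse_entry(e) for e in feed.entries]

class NyaaScraper(XmlRssScraper):
    # Each new site only supplies a feed URL and an entry parser.
    feed_url = "https://nyaa.si/?page=rss"

    def parse_entry(self, entry) -> dict:
        return {"name": entry.get("title", ""), "link": entry.get("link", "")}

# torrents = NyaaScraper().scrape()
```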
You think that’s something you could do? Or does anyone else want to offer to do it? I can make that the next deep scraper, as it’s our most requested scraper in Discord.
@purple-emily I can take care of it. Was going to do Torrent9 but I can probably do both. |
@iPromKnight are you not able to make a throwaway account and join Discord, even if it’s just to stick it on mute and never speak in the group context, so me or Gabi can keep in contact?
As per #98, we now support new releases from nyaa.si. Support for scraping old releases is to come.
Did we add that abstract XML scraper by any chance? |
Re-implement scrapers from the upstream repo