This is a basic server that hosts and coordinates Tech & Check's media scraper gems (zorki, forki, and YoutubeArchiver). The scrapers are kept separate from Zenodotus because the IP addresses of traditional hosting servers (where Zenodotus resides) may be blocked by the media sources we're trying to scrape. The Hypatia server can be hosted on a Raspberry Pi in your house or something.
- Make sure you have Ruby 3.1.0 installed (it may work on older versions, but that hasn't been checked)
- Grab this repo
- Run `bundle install` to get all the gems in
- Install Redis and make sure it's running, using `redis-server` if running locally.
- Turn on Sidekiq (which is installed with the gems) by running `bundle exec sidekiq` in the project directory.
- Create `config/application.yml` and add `INSTAGRAM_USER_NAME` and `INSTAGRAM_PASSWORD` to it.
- Run `rails secret` and then add a variable `secret_key_base`, set to the generated value, to `config/application.yml`
- Run `rails db:migrate` to set up the database (this uses SQLite so there's no need for Postgres or MySQL or anything)
- Set up Selenium standalone server
  - Download the "Selenium Server (Grid)" JAR package at https://www.selenium.dev/downloads/
  - Save it to the folder of this package
  - Test that it works by running `java -jar ./selenium-server-4.2.1.jar standalone` (substitute the version you actually downloaded)
  - Download Firefox's geckodriver and Chrome's chromedriver and save both in a PATH-listed folder. Give the user and system execute privileges for both drivers.
  - Ensure that Google Chrome and Mozilla Firefox are installed
- Generate an API key for security purposes in the Rails console
  - `rails c`
  - `Setting.generate_auth_key`
  - Note this key in a password manager or something; you'll need it later. It's currently stored in the database (this should be hashed at some point, but meh for now)
  - `exit`
- Start up the Selenium server with `java -jar ./selenium-server-4.2.1.jar standalone` in a separate CLI pane
- Start up the server with `rails s`

If your auth key gets compromised, just regenerate it using the same steps above.
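
Before starting the server, it can help to confirm that `config/application.yml` contains the values the setup steps ask for. The following is only a sketch: it assumes the file is a flat YAML mapping with the keys at the top level, and the helper name `missing_config_keys` is made up for illustration rather than part of Hypatia.

```ruby
require "yaml"

# Keys the setup steps ask you to add to config/application.yml.
REQUIRED_KEYS = %w[INSTAGRAM_USER_NAME INSTAGRAM_PASSWORD secret_key_base].freeze

# Hypothetical helper: returns the required keys that are missing or blank.
# Assumes the YAML is a flat mapping of key => value at the top level.
def missing_config_keys(yaml_text)
  config = YAML.safe_load(yaml_text)
  config = {} unless config.is_a?(Hash)
  REQUIRED_KEYS.select { |key| config[key].to_s.strip.empty? }
end
```

For example, `missing_config_keys(File.read("config/application.yml"))` returns an empty array when all three values are present.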
This service allows you to pass in an Instagram, Facebook, or YouTube URL, and it'll return a JSON structure with everything you'd want, including the images as Base64-encoded fields.
The only endpoint is `GET /scrape`, which accepts two parameters:
- `url`: the URL of the post to scrape
- `auth_key`: the key generated during setup
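
A request to the endpoint might look roughly like the sketch below, written with Ruby's standard `Net::HTTP`. The base URL is an assumption (`http://localhost:3000` is the default for `rails s`), the helper names `scrape_uri` and `scrape` are invented for this example, and the response is parsed and returned as-is because its exact shape isn't specified here.

```ruby
require "net/http"
require "uri"
require "json"

# Builds the GET /scrape URI with both parameters URL-encoded.
# The base URL is an assumption; adjust it to wherever Hypatia is running.
def scrape_uri(post_url, auth_key, base: "http://localhost:3000")
  uri = URI.join(base, "/scrape")
  uri.query = URI.encode_www_form(url: post_url, auth_key: auth_key)
  uri
end

# Sends the request and returns the parsed JSON body.
# Raises if the server does not answer with a 2xx status.
def scrape(post_url, auth_key, base: "http://localhost:3000")
  response = Net::HTTP.get_response(scrape_uri(post_url, auth_key, base: base))
  raise "Scrape request failed: HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body)
end
```

With the server running, `scrape("https://www.instagram.com/p/abc/", "your-auth-key")` would issue the request; the post URL and key shown are placeholders.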
Legend has it that external data sources and APIs fail every now and then. Hypatia implements retries through Sidekiq to manage these issues. Below, we list some scenarios that hopefully illuminate when Hypatia's retry system will/won't kick in.
1. A scrape job succeeds!
   Sidekiq should dequeue the job so it isn't retried. Hypatia should send a POST request to Zenodotus containing the scraped post data.
2. A scrape job fails with an error that isn't retryable (e.g. an `InvalidUrlError`)
   Sidekiq should dequeue the job so it isn't retried. Hypatia should let Zenodotus know that the scrape request has failed.
3. A scrape job that has failed and been re-queued `max_retries` times fails again with a `RetryableError`
   Sidekiq should dequeue the job so it isn't retried, and Hypatia should let Zenodotus know that the scrape request has failed.
4. A scrape job that has failed and been re-queued `0≤n<max_retries` times fails with a `RetryableError`
   Sidekiq should re-queue the job so it's retried. If the job subsequently succeeds, Hypatia should follow the scenario 1 playbook. If the job subsequently fails, Hypatia should run through the scenario 2 or scenario 3 playbook.
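
The decision these scenarios describe can be summarized in a few lines of plain Ruby. This is a sketch of the rules only, not Hypatia's actual job code: the error classes are empty stand-ins that borrow the names used above, and `MAX_RETRIES`, `next_action`, and the returned symbols are hypothetical.

```ruby
# Stand-in error classes; only the names come from the scenarios above.
class RetryableError < StandardError; end
class InvalidUrlError < StandardError; end

MAX_RETRIES = 3 # hypothetical value for illustration

# Decides what should happen after one attempt at a scrape job.
# error:          nil if the attempt succeeded, otherwise the raised exception
# retries_so_far: how many times the job has already been re-queued
def next_action(error, retries_so_far, max_retries: MAX_RETRIES)
  # Success: dequeue and send the scraped data to Zenodotus.
  return :send_results if error.nil?
  # Non-retryable error: dequeue and report the failure.
  return :report_failure unless error.is_a?(RetryableError)
  # Retryable error, but the retry budget is used up: dequeue and report the failure.
  return :report_failure if retries_so_far >= max_retries

  # Retryable error with retries remaining: put the job back on the queue.
  :requeue
end
```

For instance, `next_action(RetryableError.new, 1)` yields `:requeue`, while the same error after `MAX_RETRIES` re-queues yields `:report_failure`.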
A few of the individual scraper gems implement their own retry logic. This makes tests, which don't engage ActiveJob/Sidekiq right now, more resilient. In the future, we can probably just force tests to use ActiveJob/Sidekiq and move all retry logic to Hypatia.
Follow the same setup steps as above. To run the tests, use `rails t`.
- [ ] Use a hash for the auth key instead of the key itself for comparison
- [ ] Allow the requests to be IP limited
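
As a rough idea of what the first item could look like, the sketch below stores only a SHA-256 digest of the auth key and compares digests in constant time. It is an illustration under assumptions, not a design decision for Hypatia: the helper names are hypothetical, and `OpenSSL.fixed_length_secure_compare` requires the openssl library bundled with Ruby 3.0 or newer.

```ruby
require "digest"
require "openssl"

# Hypothetical helper: the value that would be stored instead of the raw key.
def digest_auth_key(key)
  Digest::SHA256.hexdigest(key)
end

# Hypothetical helper: checks a submitted key against the stored digest.
# Both operands are 64-character hex digests, so the fixed-length
# constant-time comparison is safe to use.
def auth_key_valid?(candidate, stored_digest)
  OpenSSL.fixed_length_secure_compare(digest_auth_key(candidate.to_s), stored_digest)
end
```

With this approach the raw key is shown once at generation time and only its digest is kept, so a leaked database would not directly expose a usable key.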