
Resolving PHP POST Issue in AZ Scraper #53

Open · wants to merge 1 commit into main

Conversation

DGieseke
Collaborator

Problem
The Arizona bill scraper experienced failures when running in a Kubernetes pod, despite functioning correctly in a local Docker environment. Specifically:

  • Issue: The PHPSESSID cookie was not retrieved after sending a POST request to https://www.azleg.gov/azlegwp/setsession.php, resulting in a failure to select the PHP session ID and scrape bills.
  • Environment Discrepancy: While the scraper worked locally within Docker, the issue consistently occurred in the Kubernetes pod. This behavior appears to be linked to IP blocking by Sucuri, the firewall protecting the Arizona Legislature's website.

Root Causes

  • IP Blocking in Kubernetes: Sucuri appears to have flagged repeated requests from the Kubernetes pod's IP, denying access and causing the scraper to fail.
  • Unnecessary Initial GET Request: The scraper performed an initial GET request to retrieve cookies from https://www.azleg.gov/bills/. However, the retrieved cookies were never used in subsequent requests. This redundant request likely drew additional scrutiny from Sucuri, increasing the likelihood of IP blocking in the Kubernetes environment.

Solutions Entertained

  • SSL Certificate Verification
  • Adding Delays Between Requests: Introduced retries and delays to avoid triggering Sucuri's rate-limiting rules. While this reduced the likelihood of blocking, it did not fully resolve the issue.
  • Proxy or IP Rotation: Considered routing traffic through a proxy or rotating IPs to circumvent Sucuri’s IP-based blocking. This solution was ruled out due to complexity.
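The delay-and-retry approach could be sketched like this (the helper name, retry count, and backoff values are illustrative, not taken from the PR):

```python
import time

def fetch_with_retries(do_request, max_retries=3, delay_seconds=5):
    """Retry a request a few times, pausing between attempts so
    repeated hits are less likely to trip a rate limiter."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return do_request()
        except Exception as exc:  # in practice, catch requests.RequestException
            last_error = exc
            # Linear backoff: wait a little longer after each failure.
            time.sleep(delay_seconds * (attempt + 1))
    raise last_error
```

As the PR notes, spacing out requests only reduces the odds of being blocked; it does not address the underlying IP-based flagging.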

Final Fix
The issue was resolved by removing the unnecessary initial GET request to https://www.azleg.gov/bills/. This significantly reduced the number of requests made to the server and avoided triggering Sucuri. While the main fork still makes multiple requests to the bill list page, our fix avoids them. The scraper now:

  • Sends a single POST request to set the session ID.
  • Uses the resulting PHPSESSID cookie directly for subsequent requests.
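The two steps above could look roughly like this with `requests` (the function names are illustrative; only the setsession.php URL and the PHPSESSID cookie name come from the PR):

```python
import requests

SET_SESSION_URL = "https://www.azleg.gov/azlegwp/setsession.php"

def require_phpsessid(cookies):
    """Fail loudly if the expected session cookie is missing."""
    sid = cookies.get("PHPSESSID")
    if not sid:
        raise RuntimeError("PHPSESSID cookie was not set by setsession.php")
    return sid

def get_scraper_session():
    """Send the single POST that sets the session; the Session object
    then attaches the PHPSESSID cookie to every later request."""
    session = requests.Session()
    resp = session.post(SET_SESSION_URL, timeout=30)
    resp.raise_for_status()
    require_phpsessid(session.cookies)
    return session
```

Using one `requests.Session` for the whole run means the cookie jar is shared automatically, so no manual cookie copying is needed.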

@DGieseke DGieseke requested a review from a team as a code owner December 18, 2024 16:57
@Desitrain22

This LGTM! Nice use of the delayed requests.

A few notes:

  • I know Sucuri and other government-site vendors tend to block dispatching cookies from API requests -- in the past, I've run into issues with sec.gov and house.gov creating blockers for attempts to retrieve certificates and/or cookies via API. Just something to keep an eye on should this stop working.

  • As we start to scrape, it might be worth adding error handling to the request -- i.e., distinguishing a rejection due to rate limiting from one due to auth/security issues. PST is a lot lower traffic than EST, but if they start adding more anti-scrape layers of security, it'd be useful for logging and for attempting different types of requests depending on why a request was rejected.
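The suggested error handling could start as simple status-code bucketing (an entirely hypothetical helper, not part of this PR):

```python
def classify_rejection(status_code):
    """Rough bucketing of why a request was rejected, so the scraper
    can log each case and react differently."""
    if status_code == 429:
        return "rate_limited"      # back off and retry later
    if status_code in (401, 403):
        return "auth_or_security"  # likely a Sucuri/firewall block
    if 500 <= status_code < 600:
        return "server_error"      # transient server-side failure
    return "other"
```

A scraper could then, for example, retry with a delay on `rate_limited` but abort and alert on `auth_or_security`.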
