- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Create a `.env` file with the following variables (an example connection string is shown after this list):
  ```
  DATABASE_URL=XXX
  ```
- Initialize the database:
  ```bash
  python -m sam_gov_scraper.reset_db
  ```
- Run the scraper:
  ```bash
  python -m sam_gov_scraper.scraper --max_workers=50
  ```
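Since the project uses SQLAlchemy with PostgreSQL, `DATABASE_URL` is a standard SQLAlchemy connection string. A hypothetical local example, assuming the psycopg2 driver (credentials and database name are placeholders):

```
# .env — example only; substitute your own credentials and database name
DATABASE_URL=postgresql+psycopg2://user:password@localhost:5432/sam_gov
```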
The scraper works by scanning sam.gov search results. We do not use the official API; instead we use the web interface's XHR endpoints. That interface returns at most 10,000 results per query, regardless of page size. To work around the limit, we run an open query (no filters) and restrict it to one day at a time. The scraper is re-entrant, with only minimal redundancy if it is stopped and restarted: it queries the DB for the oldest recorded opportunity (by modified date) and uses that day as the starting date. Opportunity IDs are read from the XHR responses and added to a ThreadedExecutor. A sketch of this loop follows.
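This is a minimal sketch of the day-by-day scan, assuming a SQLAlchemy session and a `modified_date` datetime column; the model, helper, and function names here are illustrative, not the module's actual internals:

```python
from datetime import date, timedelta

from sqlalchemy import func
from sqlalchemy.orm import Session


def search_day(day: date) -> list[str]:
    """Placeholder for the open (no-filter) search XHR restricted to a single
    modified-date day, which keeps each query under the 10,000-result cap."""
    raise NotImplementedError


def resume_date(session: Session, Opportunity) -> date:
    """Oldest modified date already stored, or today for a fresh database."""
    oldest = session.query(func.min(Opportunity.modified_date)).scalar()
    return oldest.date() if oldest else date.today()


def scan(session: Session, Opportunity, submit, earliest: date) -> None:
    """Walk backwards one day at a time, handing each discovered
    opportunity ID to the worker pool via `submit`."""
    day = resume_date(session, Opportunity)
    while day >= earliest:
        for opp_id in search_day(day):
            submit(opp_id)
        day -= timedelta(days=1)
```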
A pool of workers (sized by `--max_workers`) processes the opportunities. Each worker makes two requests per opportunity (sketched after the list below):
1. The XHR to get the opportunity details
2. The XHR to get the attachments/links
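A minimal sketch of the per-opportunity work, assuming Python's standard `concurrent.futures.ThreadPoolExecutor` stands in for the ThreadedExecutor mentioned above. The XHR URLs are placeholders, and the 403 check is an assumption about how permission-denied opportunities surface:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

import requests

# Placeholder URLs: the actual sam.gov XHR endpoints and payloads differ.
DETAILS_URL = "https://sam.gov/<details-xhr>/{opp_id}"          # request #1
ATTACHMENTS_URL = "https://sam.gov/<attachments-xhr>/{opp_id}"  # request #2


def process_opportunity(opp_id: str) -> Optional[dict]:
    """Fetch the details XHR, then the attachments/links XHR, for one ID."""
    details = requests.get(DETAILS_URL.format(opp_id=opp_id), timeout=30)
    if details.status_code == 403:
        # Some opportunities are inaccessible for permission reasons;
        # these are currently skipped rather than recorded.
        return None
    details.raise_for_status()

    attachments = requests.get(ATTACHMENTS_URL.format(opp_id=opp_id), timeout=30)
    attachments.raise_for_status()
    return {"details": details.json(), "attachments": attachments.json()}


def run(opp_ids: list[str], max_workers: int = 50) -> list[dict]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [r for r in pool.map(process_opportunity, opp_ids) if r]
```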
During request #1 we discover that some opportunities are not accessible for permission reasons (see: example). With ~1% of opportunities processed so far, almost 12% of them are missing due to permissions. We currently do not store any information about the opportunities that are missing for this reason.
I've run the scraper with 50 workers. From my laptop it gets about 40 requests per second (RPS). At that rate it will take about 1.5 days to scrape all opportunities. I've seen the occasional 500 error from the server, so I don't want to run it any faster for fear of either breaking the server or prompting admins to block my IP. NOTE: If you run this near US-EAST-1, the same number of workers will produce a much higher RPS (and load on SAM.GOV).
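As a back-of-the-envelope check on those numbers: 40 RPS is roughly 3.5M requests per day, or about 1.7M opportunities per day at two requests each, so the 1.5-day estimate corresponds to scraping on the order of 2.5 million opportunities.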
The DB is a PostgreSQL database, and I've used SQLAlchemy to interact with it. I plan to use Alembic to manage schema upgrades, though it hasn't been necessary yet.
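For orientation, the connection setup might look roughly like the following; loading the `.env` via `python-dotenv` is an assumption about this repo, not a confirmed detail:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv reads the .env file
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

load_dotenv()  # puts DATABASE_URL from .env into the process environment

engine = create_engine(os.environ["DATABASE_URL"], pool_pre_ping=True)
SessionLocal = sessionmaker(bind=engine)

# Usage: open a short-lived session per unit of work.
with SessionLocal() as session:
    pass  # query / insert opportunities here
```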
The schema is relatively simple: links are normalized, as are contractors, and we store only the primary point of contact.
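A simplified sketch of what such a schema could look like as SQLAlchemy declarative models; the table, column, and relationship names are illustrative guesses, not the repo's actual definitions:

```python
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Contractor(Base):
    __tablename__ = "contractors"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)


class Opportunity(Base):
    __tablename__ = "opportunities"
    id = Column(String, primary_key=True)          # sam.gov opportunity ID
    title = Column(String)
    modified_date = Column(DateTime, index=True)   # drives the day-by-day scan
    contractor_id = Column(Integer, ForeignKey("contractors.id"))
    # Only the primary point of contact is stored.
    poc_name = Column(String)
    poc_email = Column(String)

    contractor = relationship("Contractor")
    links = relationship("Link", back_populates="opportunity")


class Link(Base):
    __tablename__ = "links"
    id = Column(Integer, primary_key=True)
    opportunity_id = Column(String, ForeignKey("opportunities.id"))
    url = Column(String)

    opportunity = relationship("Opportunity", back_populates="links")
```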