- REST API
- Web Crawlers (2)
- Parallelization
- NoSQL Database
- ETL
- Pipeline Scheduler
Run the following commands from the top-level directory of the repo to build the Lambda layers and deploy the app.
```sh
docker run --rm -v $PWD:/usr/app node:12 bash -c "cd /usr/app/layers/node/nodejs/ && npm install"
docker run --rm -v $PWD:/usr/app python:3.8 bash -c "pip install -r /usr/app/layers/py/python/requirements.txt --target /usr/app/layers/py/python/ --no-cache-dir"
sam deploy --guided
```
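After the first `--guided` run, deployments can be repeated non-interactively. A minimal sketch, assuming a stack name and region (both placeholders; `--guided` records your actual choices in `samconfig.toml`):

```sh
# Assumed stack name and region; --resolve-s3 lets SAM manage the
# deployment artifact bucket instead of prompting for one.
sam deploy --stack-name upsellx --region us-east-1 \
  --capabilities CAPABILITY_IAM --resolve-s3
```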
- API doc here
- API summary: after submitting a crawling job via the `POST` endpoint, query the company data via the `GET` endpoint. The endpoint accepts an FQDN/hostname.
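As an illustration, the flow from the command line might look like the sketch below. The base URL comes from the API Gateway output of `sam deploy`; the `/company` path and the request body shape are assumptions for illustration, not the documented contract (see the API doc above).

```sh
# Base URL printed in the sam deploy outputs (placeholder value here).
API="https://abc123.execute-api.us-east-1.amazonaws.com/Prod"

# Submit a crawling job for a company by its hostname/FQDN
# (the /company path and body shape are assumed for illustration).
curl -X POST "$API/company" \
  -H "Content-Type: application/json" \
  -d '{"hostname": "example.com"}'

# Later, query the collected company data for the same hostname.
curl "$API/company?hostname=example.com"
```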
- To query data in the temporary bucket, run the Glue crawlers (i.e. `angel-json` and `crunchbase-json`), then go to the Athena console and select the `upsellxtemp` database to query it (see the CLI sketch after this list).
- The scheduler compresses the data into Parquet daily and appends it to the `upsellxsilo` database. To access it through Athena before the scheduled run, invoke the `aws-data-wrangler` Lambda function from the AWS console and refresh the Athena table list.
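For reference, the same steps can be driven from the AWS CLI. This is a sketch under assumptions: the crawler, database, and function names are taken from the bullets above, while the table name and the Athena results bucket are placeholders you would replace with your own.

```sh
# Run the Glue crawlers that catalog the temporary bucket.
aws glue start-crawler --name angel-json
aws glue start-crawler --name crunchbase-json

# Once the crawlers finish, query the upsellxtemp database through Athena.
# The table name and results bucket below are placeholders.
aws athena start-query-execution \
  --query-string "SELECT * FROM angel_json LIMIT 10" \
  --query-execution-context Database=upsellxtemp \
  --result-configuration OutputLocation=s3://YOUR-ATHENA-RESULTS-BUCKET/

# Trigger the Parquet compaction early instead of waiting for the scheduler,
# then list the tables that landed in upsellxsilo.
aws lambda invoke --function-name aws-data-wrangler out.json
aws athena start-query-execution \
  --query-string "SHOW TABLES" \
  --query-execution-context Database=upsellxsilo \
  --result-configuration OutputLocation=s3://YOUR-ATHENA-RESULTS-BUCKET/
```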