Enriching and processing Job Data from Common Crawl and other sources.
This repository explores methods for aggregating information from job ad data. See the introduction to fetching job data from Common Crawl and the related articles for more information about the techniques used.
Requires Python 3.6+. Install the dependencies listed in requirements.txt in a virtual environment:
# Set up a new virtual environment
python -m venv --prompt job-advert-analysis .venv
source .venv/bin/activate
# Install requirements
python -m pip install -r requirements.txt
# Download spaCy model
python -m spacy download en_core_web_lg
To download the Kaggle data you will need to set up Kaggle API credentials and accept the competition rules.
Alternatively you can manually download and unzip the data from Kaggle directly.
If you do not wish to use the Kaggle data sources, remove them from DATASOURCES in 01_fetch_data.py.
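Before running the fetch step, it can help to confirm that Kaggle credentials are visible. The sketch below is not part of this repository; the function name `kaggle_credentials_available` is hypothetical. It checks the KAGGLE_USERNAME and KAGGLE_KEY environment variables and then a kaggle.json file in KAGGLE_CONFIG_DIR or ~/.kaggle, which are the usual locations, and it may not cover every place the Kaggle client looks.

```python
import os
from pathlib import Path


def kaggle_credentials_available(environ=None, home=None):
    """Return True if Kaggle API credentials appear to be configured.

    Hypothetical helper: `environ` and `home` are injectable for testing
    and default to the real environment and home directory.
    """
    environ = os.environ if environ is None else environ

    # Option 1: both credential environment variables are set.
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file exists in the config directory.
    config_dir = environ.get("KAGGLE_CONFIG_DIR")
    if config_dir:
        base = Path(config_dir)
    else:
        base = (Path(home) if home is not None else Path.home()) / ".kaggle"
    return (base / "kaggle.json").is_file()


if __name__ == "__main__":
    print("Kaggle credentials found:", kaggle_credentials_available())
```

If this prints False, either set up the credentials or remove the Kaggle entries from DATASOURCES as described above.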
You can run the whole pipeline using python -m job_pipeline build.
You need a Placeholder server running on port 3000 of localhost for location normalisation. Follow these instructions for a simple way to do this using Docker.
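A quick way to see whether anything is listening on that port before starting the pipeline is a plain TCP connection check. This is a minimal sketch, not code from the repository; `port_is_open` is a hypothetical helper, and it only confirms that the port accepts connections, not that the service behind it is actually Placeholder.

```python
import socket


def port_is_open(host="localhost", port=3000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on localhost:3000")
    else:
        print("Nothing is listening on localhost:3000; start the Placeholder server first")
```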