The classifier is developed on CUDA 11. Higher CUDA versions may support it also.
You can use pip install -r requirements.txt
to install some of the dependencies, and you need to install Detectron2 (we used v0.6) separately.
You can get model from our Hugging Face repo.
The code is designed as assembly line:
data_summarizer.py
: summarize the data info (domain name, image path, extracting text) into a csv file.dataset.py
: build dataset pkl file from the previous csv file.predict.py
: predict the website category by loading dataset pkl into model.
python3 data_summarizer.py --data_dir /path/to/provider/data/dir --output_csv /path/to/your/csv
python3 dataset.py --target_dir /path/to/pkl/target/dir --file /path/to/your/csv
python3 predict.py --model_path /path/to/model/dir --data_pkl /path/to/your/pkl --data_csv /path/to/your/csv --output_csv /path/to/result/csv
An example (you can get some test cases in seconds by following collector usage instructions):
python3 data_summarizer.py --data_dir /tmp/test/screenshots/20240101/ngrok --output_csv /tmp/test.csv
python3 dataset.py --target_dir /tmp --file /tmp/test.csv
python3 predict.py --data_pkl /tmp/pfs_model --data_pkl /tmp/0part.pkl --data_csv /tmp/test.csv --output_csv /tmp/prediction.csv
Tips/Notes:
- The assembly line is designed for large-scale measurement, enabling us to process the dataset efficiently within the assembly line framework, which is more efficient.
- You can power each component of the assembly line by a multiprocessing/parallelism manner, a strategy that we employ to improve the classifier.