The Data Processing Toolkit for LLMs, published by Zhejiang Lab, is a set of tools for collecting and processing the data used to train LLMs. It is designed to address the data preparation challenges that arise across the diverse domains of LLM training, helping researchers prepare data more efficiently and reduce the cost of dataset construction.
The current release includes the following tools:
- Vertical dataset collection tool based on subject classification (Subject_Classifier)
- Web data collection tool based on large-model prompting and search (One_Click_Crawler)
- Self-developed integrated OCR tool (DataPrep4LLM_Algos)
- ES database management tool (Easy_ES)
If you use this toolkit in your research, please cite it as follows:
@misc{ZJ2024DataProcessesToolkit,
author = {Zhejiang Lab},
title = {Data Processing Toolkit for LLMs},
year = {2024},
howpublished = {\url{https://github.com/zhejianglab/Data-Processing-Toolkit-for-LLMs}},
note = {Accessed: 2024-09-14}
}
If you have published research that uses this toolkit, please let us know. We will maintain a list of related publications to facilitate communication among researchers.
If you have any problems using the toolkit, please contact us via email at Zhejiang Lab
© 2024 Research Center for Intelligent Equipment of Zhejiang Lab