Skip to content

Sirius3615/DataCollection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DataCollection

Screenshot of DataCollection repository

DataCollection is a simple Node.js server designed for local hosting that facilitates the collection of data for AI training. It automatically scrapes all text from <p> tags on a given website, cleans the text by removing unnecessary characters such as dots, apostrophes, and brackets, and then saves the clean data to a text.txt file.

This tool was specifically created to streamline data collection for training AI models. I use it to gather data on Croatia's history and general knowledge from Wikipedia.

I encourage contributions and feedback to help improve the codebase. If you have suggestions, please submit a pull request, or contact me via Discord (Ivan.#4912) or email.


Running & Using the server

Clone the repo from https://github.com/Sirius3615/DataCollection.git

Install all the NPM libs:

npm install

Run the server:

npm start

(either using the terminal or the IDE of choice)

Go to the endpoint:

http://localhost:3000/scrape

And scrape away!


License

DataCollection is licensed under the Unlicense, which means it is in the public domain and can be used freely without any licensing requirements.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published