DataCollection

DataCollection is a simple Node.js server, meant to be hosted locally, that collects text data for AI training. It scrapes the text of every <p> tag on a given website, cleans it by removing unwanted characters such as dots, apostrophes, and brackets, and saves the cleaned result to a text.txt file.
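As a rough illustration of the extract-and-clean step described above, here is a minimal, dependency-free sketch. The function names `extractParagraphs` and `cleanText` are hypothetical and are not taken from this repository, and the regex-based extraction is an assumption made to keep the example self-contained; the actual server may well use an HTML parsing library instead.

```javascript
// Hypothetical helper names; the project's real implementation may differ.

// Pull the inner text of every <p>...</p> element out of an HTML string.
// A regex is used only to keep this sketch dependency-free; a real HTML
// parser is more robust on malformed markup.
function extractParagraphs(html) {
  const paragraphs = [];
  const pattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  for (const match of html.matchAll(pattern)) {
    // Drop nested tags such as <a> or <b>, keeping their text.
    paragraphs.push(match[1].replace(/<[^>]+>/g, ""));
  }
  return paragraphs;
}

// Remove dots, apostrophes, and brackets, then collapse whitespace.
function cleanText(text) {
  return text
    .replace(/[.'’()\[\]{}]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Sample input, invented for illustration.
const html =
  "<p>Zagreb is Croatia's capital.</p><p>See <a href='#'>notes</a> [1].</p>";
const cleaned = extractParagraphs(html).map(cleanText);
console.log(cleaned.join("\n"));
```

In a setup like the one described, the cleaned lines would then be written to text.txt, for example with Node's built-in `fs.writeFileSync`.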

This tool was specifically created to streamline data collection for training AI models. I use it to gather data on Croatia's history and general knowledge from Wikipedia.

I encourage contributions and feedback to help improve the codebase. If you have suggestions, please submit a pull request, or contact me via Discord (Ivan.#4912) or email.

Running & Using the server

Clone the repo from https://github.com/Sirius3615/DataCollection.git

Install the npm dependencies:

npm install

Run the server:

npm start

(from a terminal or your IDE of choice)

Go to the endpoint:

http://localhost:3000/scrape

And scrape away!

License

DataCollection is licensed under the Unlicense, which means it is in the public domain and can be used freely without any licensing requirements.
