
Build scraper to continuously update the unstructured data folder with the latest Lucknow data #38

Open
monk1337 opened this issue Mar 10, 2024 · 5 comments

Comments

@monk1337
Member

Right now the unstructured data folder contains limited data. We need scrapers to pull data from different Lucknow websites, so that if we want to add more data in the future, or update the Lucknow database, we can simply run those scraper agents.
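A minimal sketch of what one such scraper agent could look like, using only the Python standard library. The "keep every `<p>` tag" rule is a placeholder assumption; each Lucknow site will need its own extraction rules.

```python
# Sketch of a reusable extraction step for a scraper agent.
# Assumption: we keep the text of every <p> tag; real sites
# will need site-specific selectors.
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def extract_paragraphs(html: str) -> list[str]:
    """Return the non-empty paragraph texts found in an HTML page."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return [p.strip() for p in parser.paragraphs if p.strip()]
```

In use, the HTML would come from something like `urllib.request.urlopen(url).read().decode()` for each site on the target list, and the extracted text would be appended to the unstructured data folder.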

@thePratyakshSoni1
Contributor

I can do it, but I will need a list of websites from which to fetch the data. For example, if there's a blogging site, then whenever we run our scraper, new blogs will be added to the unstructured data.

@AayushSharma-1
Contributor

How about we build this scraper in parts, like someone takes the tourism part, someone takes the hospitals part, and later on, we can combine them to make a fully automated raw data scraper?

@thePratyakshSoni1
Contributor

thePratyakshSoni1 commented Mar 11, 2024

> How about we build this scraper in parts, like someone takes the tourism part, someone takes the hospitals part, and later on, we can combine them to make a fully automated raw data scraper?

That would be nice, but we will still need a list of sites (ones that regularly update data on a specific topic) to target for the latest data.

Or we can have another folder called scrapped inside the Unstructured_data folder, and our program can scrape any data related to Lucknow into it (in different files named by date or something else).
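A quick sketch of that scrapped-folder idea, with one date-stamped file per topic per run. The folder layout, topic names, and file format here are assumptions, just one way to organize it:

```python
# Sketch: store each scraper run's output in a date-stamped file
# per topic, e.g. Unstructured_data/scrapped/tourism_2024-03-11.txt.
# Folder and naming scheme are assumptions, not a fixed convention.
import datetime
import os


def dated_path(topic: str, base: str = "Unstructured_data/scrapped") -> str:
    """Build the file path for today's scrape of a given topic."""
    today = datetime.date.today().isoformat()  # YYYY-MM-DD
    return os.path.join(base, f"{topic}_{today}.txt")


def save_scraped(topic: str, text: str) -> str:
    """Append scraped text to today's file for the topic; return the path."""
    path = dated_path(topic)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        f.write(text + "\n")
    return path
```

Naming files by ISO date keeps them sortable and makes it obvious which run produced which data, so repeated runs never overwrite earlier scrapes.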

@monk1337
Member Author

monk1337 commented Mar 11, 2024

@thePratyakshSoni1 @AayushSharma-1 That's a great idea: each person takes care of one topic and we build the scraper step by step.
@AayushSharma-1 you can go through the old PRs of this repo. Contributors adding unstructured data also mention the source websites/links in the PR description, and we can use those websites to scrape.

@AayushSharma-1
Contributor

Yes, Sure!
