This web scraper was developed to meet the data requirements of the SadedeGel library. It scrapes articles from news websites and stores them as .txt files. It was developed as part of Açık Kaynak Hackathon Programı 2020.
The SadedeGel project is maintained by @globalmaksimum AI team members @dafajon, @askarbozcan, @mccakir and @husnusensoy.
Type | Platforms |
---|---|
🚨 Bug Reports | GitHub Issue Tracker |
🎁 Feature Requests | GitHub Issue Tracker |
- Gets the author URLs of a given news website
- Gets the article URLs of each author
- Scrapes the content of each article and writes it to a .txt file
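The three steps above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: it replaces real HTTP requests with an in-memory map, and every name in it (`PipelineSketch`, `fakeSite`, `authorUrls`, `articleUrls`, `writeArticle`, `run`) is hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical sketch: an in-memory "site" stands in for a real news website.
object PipelineSketch {
  // author URL -> (article URL -> article text)
  val fakeSite: Map[String, Map[String, String]] = Map(
    "/authors/a" -> Map("/articles/1" -> "First article", "/articles/2" -> "Second article"),
    "/authors/b" -> Map("/articles/3" -> "Third article")
  )

  // Step 1: collect the author URLs of the site.
  def authorUrls(): List[String] = fakeSite.keys.toList.sorted

  // Step 2: collect the article URLs of one author.
  def articleUrls(author: String): List[String] =
    fakeSite.getOrElse(author, Map.empty[String, String]).keys.toList.sorted

  // Step 3: write one article to <outDir>/<id>.txt and return the file path.
  def writeArticle(author: String, article: String, outDir: Path): Path = {
    val text = fakeSite(author)(article)
    val file = outDir.resolve(article.split('/').last + ".txt")
    Files.write(file, text.getBytes(StandardCharsets.UTF_8))
  }

  // Runs the three steps in order and returns the files that were written.
  def run(outDir: Path): List[Path] =
    for {
      author  <- authorUrls()
      article <- articleUrls(author)
    } yield writeArticle(author, article, outDir)
}
```

Calling `PipelineSketch.run(Files.createTempDirectory("sketch"))` writes one .txt file per article into the temporary directory.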
You need sbt to build the project.
$ git clone https://github.com/GlobalMaksimum/sadedegel-scraper.git
$ cd sadedegel-scraper
$ sbt assembly
You will get the jar under ./target/scala-[version]/
$ nohup java -jar sadedegel-scraper-assembly-0.3.jar "hurriyet" > hurriyet.out &
Check the hurriyet-[dd-MM-yyyy] directory for the .txt files.
You can add support for additional news sources by extending the NewsWebsite trait.
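To locate the output directory from a script, the name can be rebuilt from the source name and a date. The sketch below assumes the `[dd-MM-yyyy]` part is a calendar date formatted with that pattern; which date the scraper actually uses is not stated here, and `OutputDirSketch` and `dirName` are hypothetical names.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of the scraper.
object OutputDirSketch {
  private val fmt = DateTimeFormatter.ofPattern("dd-MM-yyyy")

  // Builds "<source>-<dd-MM-yyyy>" for the given date.
  def dirName(source: String, date: LocalDate): String =
    s"$source-${date.format(fmt)}"
}
```

For example, `OutputDirSketch.dirName("hurriyet", LocalDate.of(2020, 8, 5))` returns `"hurriyet-05-08-2020"`.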
Example:
import com.sadedegel.ScraperUtils
import com.sadedegel.ScraperUtils.getArticles

class HurriyetScraper extends NewsWebsite {
  val domain = "https://www.hurriyet.com.tr"
  val authorsUrl = "https://www.hurriyet.com.tr/yazarlar/tum-yazarlar/#hurriyetcomtr"

  // Author pages to crawl (hard-coded to a single author in this example).
  override def getAuthorUrls(): List[String] = {
    List("https://www.hurriyet.com.tr/yazarlar/ilber-ortayli/")
  }

  // Collects the articles of each author and hands each one to writeArticlesToFile.
  override def getArticlesOfAuthors(authorUrls: List[String], domain: String): Unit = {
    getArticles(authorUrls, domain, ".highlighted-box.mb20", writeArticlesToFile, "?p=", "")
  }

  // Writes the content of a single article to a .txt file.
  override def writeArticlesToFile(articleUrl: String): Unit = {
    ScraperUtils.writeToFile(articleUrl, List(".article-content.news-text", ".rhd-all-article-detail"),
      "hurriyet")
  }
}
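Judging only from the members overridden in the example, the trait might look roughly like the sketch below. This is a guess at its shape, not the project's actual definition: `NewsWebsiteSketch`, its `run()` helper and `DummySite` are all hypothetical, and the dummy implementation records URLs in memory instead of scraping anything.

```scala
// Hypothetical reconstruction; the real NewsWebsite trait may differ.
trait NewsWebsiteSketch {
  val domain: String
  val authorsUrl: String
  def getAuthorUrls(): List[String]
  def getArticlesOfAuthors(authorUrls: List[String], domain: String): Unit
  def writeArticlesToFile(articleUrl: String): Unit

  // Hypothetical driver: fetch the author URLs, then process their articles.
  def run(): Unit = getArticlesOfAuthors(getAuthorUrls(), domain)
}

// In-memory implementation that only records which articles it would write.
class DummySite extends NewsWebsiteSketch {
  val domain = "https://example.com"
  val authorsUrl = "https://example.com/authors"
  val written = scala.collection.mutable.ListBuffer.empty[String]

  override def getAuthorUrls(): List[String] = List(s"$domain/authors/a")

  // Pretends each author has exactly one article.
  override def getArticlesOfAuthors(authorUrls: List[String], domain: String): Unit =
    authorUrls.foreach(a => writeArticlesToFile(s"$a/article-1"))

  override def writeArticlesToFile(articleUrl: String): Unit = {
    written += articleUrl
    ()
  }
}
```

After `val site = new DummySite; site.run()`, `site.written` holds the single article URL the dummy would have written, which makes it easy to check a new source's URL handling before wiring in real scraping.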