This web scraper was developed to meet the data requirements of the SadedeGel library. It scrapes data from news websites and stores it as .txt files. It was developed as part of Açık Kaynak Hackathon Programı 2020 (Open Source Hackathon Program 2020).
The SadedeGel project is maintained by @globalmaksmum AI team members @dafajon, @askarbozcan, @mccakir and @husnusensoy.
| Type | Platforms |
| --- | --- |
| 🚨 Bug Reports | GitHub Issue Tracker |
| 🎁 Feature Requests | GitHub Issue Tracker |
- Gets the author URLs of a given news website
- Gets the article URLs of each author
- Scrapes the data from each article and writes it to a .txt file
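The three steps above can be sketched as a small pipeline. This is only an illustration: `PipelineSketch` and its functions are hypothetical names, and the "site" is an in-memory map rather than a real website, so nothing is fetched over the network.

```scala
// Minimal sketch of the three steps; every name here is hypothetical.
object PipelineSketch {
  // Step 1: the author page URLs found on the site (here: the map keys).
  def authorUrls(site: Map[String, List[String]]): List[String] =
    site.keys.toList.sorted

  // Step 2: the article URLs published under each author.
  def articleUrls(site: Map[String, List[String]], authors: List[String]): List[String] =
    authors.flatMap(a => site.getOrElse(a, Nil))

  // Step 3: one (file name, text) pair per article. A real scraper would
  // download the page and extract the article body instead of echoing the URL.
  def toTextFiles(articles: List[String]): List[(String, String)] =
    articles.zipWithIndex.map { case (url, i) => (s"article-$i.txt", s"text of $url") }

  def main(args: Array[String]): Unit = {
    val site = Map(
      "/authors/a" -> List("/articles/1", "/articles/2"),
      "/authors/b" -> List("/articles/3")
    )
    val files = toTextFiles(articleUrls(site, authorUrls(site)))
    files.foreach { case (name, text) => println(s"$name: $text") }
  }
}
```

Keeping each step a separate function mirrors the list: a new source only needs to change how author and article URLs are discovered, not how files are written.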
You need sbt to build the project.
$ git clone
$ cd sadedegel-scraper
$ sbt assembly
You will get the jar under ./target/scala-[version]/
$ nohup java -jar sadedegel-scraper-assembly-0.3.jar "hurriyet" > hurriyet.out &
Check the hurriyet-[dd-MM-yyyy] directory for the .txt files.
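If you need to locate that directory from a script, its name can be rebuilt from the source name and a date. The sketch below assumes the suffix is the date of the run formatted as dd-MM-yyyy; `OutputDirSketch` and `outputDirName` are hypothetical helpers, not part of the scraper.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper: rebuilds a directory name such as "hurriyet-05-07-2020".
object OutputDirSketch {
  // dd-MM-yyyy: zero-padded day, zero-padded month, four-digit year.
  private val fmt = DateTimeFormatter.ofPattern("dd-MM-yyyy")

  def outputDirName(source: String, date: LocalDate): String =
    s"$source-${date.format(fmt)}"

  def main(args: Array[String]): Unit =
    // Assumption: the scraper stamps the directory with the date it ran on.
    println(outputDirName("hurriyet", LocalDate.now()))
}
```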
You can add support for additional news sources by extending the NewsWebsite trait.
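As a rough guide to what an implementation has to supply, here is a sketch of the trait's shape. It is an assumption reconstructed only from the members used in the example that follows, so the real NewsWebsite trait may declare more or differ in detail. `NewsWebsiteSketch`, `ExampleScraper`, and the `news.example` URLs are hypothetical.

```scala
object TraitSketch {
  // Assumed shape of the trait, inferred from the overrides in the example;
  // this is not the project's actual definition.
  trait NewsWebsiteSketch {
    def domain: String
    def authorsUrl: String
    def getAuthorUrls(): List[String]
    def getArticlesOfAuthors(authorUrls: List[String], domain: String): Unit
    def writeArticlesToFile(articleUrl: String): Unit
  }

  // Hypothetical in-memory source: it records URLs instead of writing files.
  class ExampleScraper extends NewsWebsiteSketch {
    val domain = "https://news.example"
    val authorsUrl = "https://news.example/authors"
    val written = scala.collection.mutable.ListBuffer.empty[String]

    override def getAuthorUrls(): List[String] = List("/authors/a", "/authors/b")

    override def getArticlesOfAuthors(authorUrls: List[String], domain: String): Unit =
      authorUrls.foreach(a => writeArticlesToFile(s"$domain$a/first-article"))

    override def writeArticlesToFile(articleUrl: String): Unit = {
      written += articleUrl
      ()
    }
  }

  def main(args: Array[String]): Unit = {
    val scraper = new ExampleScraper
    scraper.getArticlesOfAuthors(scraper.getAuthorUrls(), scraper.domain)
    scraper.written.foreach(println)
  }
}
```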
import com.sadedegel.ScraperUtils.getArticles

class HurriyetScraper extends NewsWebsite {
  val domain = ""
  val authorsUrl = ""

  override def getAuthorUrls(): List[String] = {
    ??? // return the author page URLs (body not shown here)
  }

  override def getArticlesOfAuthors(authorUrls: List[String], domain: String): Unit = {
    getArticles(authorUrls, domain, ".highlighted-box.mb20", writeArticlesToFile, "?p=", "")
  }

  override def writeArticlesToFile(articleUrl: String): Unit = {
    ScraperUtils.writeToFile(articleUrl, List("", ".rhd-all-article-detail"),