
A C# library for reading/writing WARC files and scraping websites.

portseif/Shaman.Scraping

 
 

Shaman.Scraping

A library for scraping websites and reading/writing WARC files.

Reading a CDX/WARC file

var items = WarcItem.ReadIndex("path/to/index.cdx");
Stream firstResponseBody = items[0].OpenStream();

WebsiteScraper

A generic scraper implementation, configurable through ShouldScrape, Parallelism, Cookies, and CollectAdditionalLinks.

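using (var scraper = new WebsiteScraper())
{
    // Decides, for each discovered URL, whether it should be scraped.
    scraper.ShouldScrape = (url, prereq) =>
    {
        // Prerequisites are always fetched.
        if (prereq) return true;

        // Otherwise stay on the first URL's host, under /example,
        // and skip URLs that carry query parameters.
        if (
            url.Host == scraper.FirstAddedUrl.Host &&
            url.Path.StartsWith("/example") &&
            url.HasNoQueryParameters()
            ) return true;

        return false;
    };
}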
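The filtering rule in the callback above can be tried out without the library. The following is a minimal sketch, assuming plain System.Uri values: the local functions ShouldScrape and HasNoQueryParameters are hypothetical stand-ins written for this illustration (not the library's own code), the first added URL is passed in explicitly, and System.Uri's AbsolutePath is used where the example above writes url.Path.

```csharp
using System;

// Hypothetical stand-in for the library's HasNoQueryParameters() helper.
// Assumed behavior: true when the URL has no query string at all.
static bool HasNoQueryParameters(Uri url) => string.IsNullOrEmpty(url.Query);

// Same decision logic as the callback above, written as a plain function
// so it can be checked in isolation.
static bool ShouldScrape(Uri url, bool isPrerequisite, Uri firstAddedUrl)
{
    // Prerequisites are always fetched, regardless of host or path.
    if (isPrerequisite) return true;

    return url.Host == firstAddedUrl.Host
        && url.AbsolutePath.StartsWith("/example", StringComparison.Ordinal)
        && HasNoQueryParameters(url);
}

var first = new Uri("https://example.com/example/");

Console.WriteLine(ShouldScrape(new Uri("https://example.com/example/page"), false, first));     // True
Console.WriteLine(ShouldScrape(new Uri("https://example.com/example/page?x=1"), false, first)); // False
Console.WriteLine(ShouldScrape(new Uri("https://other.com/example/page"), false, first));       // False
Console.WriteLine(ShouldScrape(new Uri("https://cdn.other.com/style.css"), true, first));       // True
```

Keeping the rule in a pure function like this makes it straightforward to check edge cases (other hosts, query strings, prerequisites) before wiring it into a real scraper.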

RedditScraper

Scrapes a subreddit.

FacebookScraper

Scrapes the content of a Facebook page or a group.

WikiScraper

Scraper optimized for MediaWiki sites.

Command line arguments

When run as a console app (as opposed to a library), the following parameters are supported:

  • --make-cdx [path-to-folder]: Generates index.cdx
  • --website-cookies <cookies>: Cookies to use
  • --facebook-page <name>: Scrapes a Facebook page
