Scraphead allows scraping the HTML of a URL in order to retrieve OpenGraph, Twitter Card, and other meta information from the HTML head tag.
Scraphead is divided into two modules: core and netty. The core module contains all the logic: the HTML head parsing and the mapping into the OpenGraph and Twitter Card models. The netty module is one of several possible implementations of the web client.
- Non-blocking
- Downloads only the <head/>, not the entire HTML file
- Multiple web client implementations available
- Detects file encoding
- Reads OpenGraph, Twitter Card, and more
- Allows plugins for specific processing (depending on the domain, for example)
- Built for Java 17 and modules
Maven dependencies:

<dependency>
    <groupId>fr.ght1pc9kc</groupId>
    <artifactId>scraphead-core</artifactId>
    <version>${scraphead.version}</version>
</dependency>
<dependency>
    <groupId>fr.ght1pc9kc</groupId>
    <artifactId>scraphead-netty</artifactId>
    <version>${scraphead.version}</version>
</dependency>
With all collectors:
ScrapClient scrapHttpClient = new NettyScrapClient();
HeadScraper scraper = HeadScrapers.builder(scrapHttpClient).build();
scraper.scrap(URI.create("https://blog.ght1pc9kc.fr/2021/server-sent-event-vs-websocket-avec-spring-webflux.html"))
        .map(doWhatEverYouWantWithMeta)
        .subscribe();
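
doWhatEverYouWantWithMeta above is only a placeholder for your own mapping function. A minimal sketch of what it could be, reusing the scraper built above and keeping the reactive pipeline shown in the snippet; the lambda simply logs the scraped metadata and makes no assumption about the fields of the result model:

scraper.scrap(URI.create("https://blog.ght1pc9kc.fr/2021/server-sent-event-vs-websocket-avec-spring-webflux.html"))
        // Placeholder mapping: log the scraped head metadata, then pass it along unchanged
        .map(metas -> {
            System.out.println("Scraped metadata: " + metas);
            return metas;
        })
        .subscribe();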
With only selected collectors:
ScrapClient scrapClient = new NettyScrapClient();
HeadScraper scraper = HeadScrapers.builder(scrapClient)
        .useMetaTitleAndDescr()
        .useOpengraph()
        .build();
scraper.scrap(URI.create("https://blog.ght1pc9kc.fr/2021/server-sent-event-vs-websocket-avec-spring-webflux.html"))
        .map(doWhatEverYouWantWithMeta)
        .subscribe();
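
Both examples consume the result reactively with subscribe(). For a caller that is not reactive, the scraped metadata can also be awaited synchronously; a minimal sketch, assuming scrap() returns a Reactor Mono so that its standard block(Duration) bridge is available:

// Wait up to 10 seconds for the scraped head metadata (blocking bridge, assumes a Reactor Mono)
var metas = scraper.scrap(URI.create("https://blog.ght1pc9kc.fr/2021/server-sent-event-vs-websocket-avec-spring-webflux.html"))
        .block(java.time.Duration.ofSeconds(10));
System.out.println(metas);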