Extracting page titles / URLs from cdxj #11

jakebickford · 2021-04-06T11:40:45Z

Hello!

Apologies if this is a silly question, but I'm wondering if cdxj-indexer has the ability to generate a list of pages (and potentially their titles) from a WARC file? I am thinking of something analogous to the 'Pages' tab in replayweb.page, which lists the captured pages and their titles rather than all of the many digital objects that make them up. I wondered whether there is an HTTP header that cdxj-indexer could use for this, but I don't see anything obvious.

The use case is that it would be great to provide a human-friendly list of pages on our archive catalogue entries. At the moment I've been generating CDXJ files with the default settings, but I can see researchers finding them confusing.

Many thanks,

Jake

edsu · 2022-06-08T16:25:27Z

I don't think this is something cdxj-indexer should support, since it does one thing (generate CDXJ files) and does it pretty well. You might be better off writing a custom utility program that uses warcio directly. Getting a page title requires parsing the HTML responses with something like BeautifulSoup. This seems to work OK:

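  import sys

  import bs4
  from warcio.archiveiterator import ArchiveIterator

  warc_file = sys.argv[1]

  with open(warc_file, 'rb') as stream:
      for record in ArchiveIterator(stream):
          # Only HTTP responses can be captured pages.
          if record.rec_type != 'response' or not record.http_headers:
              continue

          content_type = record.http_headers.get_header('Content-Type', '')
          if 'html' not in content_type:
              continue

          url = record.rec_headers.get_header('WARC-Target-URI')
          # content_stream() undoes chunked transfer and content encoding;
          # raw_stream would hand the parser the bytes as stored.
          html = record.content_stream().read()
          doc = bs4.BeautifulSoup(html, 'html.parser')

          # Not every HTML response has a usable <title>.
          title = doc.title.string if doc.title and doc.title.string else ''
          print(f"{url}\t{title.strip()}")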
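
If pulling in BeautifulSoup is more than you want for this, the title can probably be extracted with the standard library alone. This is only a rough sketch: `TitleParser` and `extract_title` are names made up for illustration, and it assumes the response body decodes as UTF-8 (undecodable bytes are replaced rather than raising).

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Ignore any <title> after the first one has been closed.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace; return None when there is no title text.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html_bytes, encoding="utf-8"):
    """Return the page title from raw HTML bytes, or None if there is none.

    Assumption: the body is UTF-8 unless another encoding is passed in.
    """
    parser = TitleParser()
    parser.feed(html_bytes.decode(encoding, errors="replace"))
    parser.close()
    return parser.title
```

In the loop above this would stand in for the BeautifulSoup call, for example `extract_title(record.content_stream().read())`. It is less forgiving of badly broken markup than BeautifulSoup, so it is worth checking against a few of your own WARCs first.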
