Extracting page titles / URLs from cdxj #11

Open
jakebickford opened this issue Apr 6, 2021 · 1 comment

Comments

@jakebickford

Hello!

Apologies if this is a silly question, but I'm wondering whether cdxj-indexer can generate a list of pages (and potentially their titles) from a WARC file? I am thinking of something analogous to the 'Pages' tab in replayweb.page, which lists the captured pages and their titles rather than all of the many digital objects that make them up. I wondered whether there is an HTTP header that cdxj-indexer could use for this, but I don't see anything obvious.

The use case is that it would be great to provide a human-friendly list of pages on our archive catalogue entries. At the moment I've been generating CDXJ files with the default settings, but I can see researchers finding these confusing.

Many thanks,

Jake

@edsu
Contributor

edsu commented Jun 8, 2022

I don't think this is something cdxj-indexer should support, since it does one thing (generate CDXJ files) and does it pretty well. But you might be better off writing this as a custom utility program that uses warcio directly. Getting a page title will require parsing the responses that are HTML with something like beautifulsoup. Something like this seems to work OK:

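  import sys

  import bs4
  from warcio.archiveiterator import ArchiveIterator

  warc_file = sys.argv[1]

  # Open in binary mode; ArchiveIterator reads gzipped and plain WARCs.
  with open(warc_file, 'rb') as stream:
      for record in ArchiveIterator(stream):
          # Only HTTP response records can be captured pages.
          if record.rec_type != 'response' or not record.http_headers:
              continue

          content_type = record.http_headers.get_header('Content-Type', '')
          if 'html' not in content_type.lower():
              continue

          url = record.rec_headers.get_header('WARC-Target-URI')
          # content_stream() undoes any HTTP content/transfer encoding;
          # raw_stream may hand the parser gzipped or chunked bytes.
          html = record.content_stream().read()
          doc = bs4.BeautifulSoup(html, 'html.parser')
          # Some pages have no <title>, or an empty one.
          title = doc.title.string if doc.title and doc.title.string else ''
          print(f"{url}\t{title.strip()}")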
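
If pulling in beautifulsoup is a problem, the title lookup could probably be done with the standard library's `html.parser` instead. This is only a rough sketch: `TitleParser` and `extract_title` are made-up names (not part of warcio or cdxj-indexer), and decoding everything as UTF-8 with replacement characters is an assumption that will not suit every capture.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element it sees."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Ignore any later <title> elements (e.g. inside inline SVG).
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace; return None when nothing was found.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html_bytes, encoding="utf-8"):
    """Return the page title from raw HTML bytes, or None if there is none."""
    parser = TitleParser()
    parser.feed(html_bytes.decode(encoding, errors="replace"))
    parser.close()
    return parser.title
```

In the loop above, the bytes of an HTML response record would be passed to `extract_title` in place of the `bs4.BeautifulSoup` call. Beautifulsoup is more forgiving of badly broken markup, so this trades some robustness for one less dependency.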
