AttributeError: 'NoneType' object has no attribute 'lower' #40

nvanderperren · 2024-01-29T14:52:34Z

I'm trying to create a WACZ from a warc.gz file. I want it to detect pages and create a full-text index. This is my command: python3 -m wacz create -f vlaamsekunstcollectie.warc.gz -o vlaamsekunstcollectie_be.wacz --detect-pages --text. The warc.gz file is 15 GB.

I get this error multiple times:

AttributeError: 'NoneType' object has no attribute 'lower'
Error parsing HTML
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/extractors.py", line 108, in parse_doc
    bp_parser.feed(input_str)
  File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 658, in feed
    self.end_document()
  File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 461, in end_document
    self.flush_block()
  File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 540, in flush_block
    if self.last_start_tag.lower() == "title":

AttributeError: 'NoneType' object has no attribute 'lower'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/extractors.py", line 114, in parse_doc
    bp_parser.feed(input_str)
  File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 658, in feed
    self.end_document()
  File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 461, in end_document
    self.flush_block()
  File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 540, in flush_block
    if self.last_start_tag.lower() == "title":

The WACZ is created successfully, but it contains no images, even though the images do show up when viewing the warc.gz file.

screenshot of the wacz file:

same page in the warc file:

Not sure if this problem has something to do with an HTML <title> tag, but something is going wrong.
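
For reference, the failing comparison from the traceback can be reproduced in isolation. This is only a minimal sketch, assuming last_start_tag is still None when flush_block runs (for example, text that is not preceded by any start tag). TinyParser and both method names are hypothetical stand-ins, not boilerpy3's actual code, and the guarded variant is just one possible way to avoid the exception.

```python
from typing import Optional


class TinyParser:
    """Hypothetical stand-in for a parser that remembers the last start tag."""

    def __init__(self) -> None:
        # Assumption: this stays None until a start tag has been seen.
        self.last_start_tag: Optional[str] = None
        self.title: Optional[str] = None
        self.text = ""

    def flush_block_unguarded(self) -> None:
        # Same shape as the comparison shown in the traceback.
        if self.last_start_tag.lower() == "title":
            self.title = self.text

    def flush_block_guarded(self) -> None:
        # Treat a missing start tag as "not a title" instead of raising.
        if (self.last_start_tag or "").lower() == "title":
            self.title = self.text


parser = TinyParser()
parser.text = "text with no surrounding tag"

error_message = None
try:
    parser.flush_block_unguarded()
except AttributeError as exc:
    error_message = str(exc)

# The guarded variant simply skips the title assignment.
parser.flush_block_guarded()
print(error_message)
print(parser.title)
```

If that assumption holds, the error would be confined to text extraction for those records, so whether it is related to the missing images is still unclear.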
