I'm trying to create a WACZ from a warc.gz file, with page detection and a full-text index. The warc.gz file is 15 GB. This is my command:
python3 -m wacz create -f vlaamsekunstcollectie.warc.gz -o vlaamsekunstcollectie_be.wacz --detect-pages --text
I get this error multiple times:
AttributeError: 'NoneType' object has no attribute 'lower'
Error parsing HTML
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/extractors.py", line 108, in parse_doc
bp_parser.feed(input_str)
File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 658, in feed
self.end_document()
File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 461, in end_document
self.flush_block()
File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 540, in flush_block
if self.last_start_tag.lower() == "title":
AttributeError: 'NoneType' object has no attribute 'lower'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/extractors.py", line 114, in parse_doc
bp_parser.feed(input_str)
File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 658, in feed
self.end_document()
File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 461, in end_document
self.flush_block()
File "/opt/homebrew/lib/python3.11/site-packages/boilerpy3/parser.py", line 540, in flush_block
if self.last_start_tag.lower() == "title":
Creating the WACZ succeeds, but it shows no images, although the warc.gz file does show them.
[Screenshot: the page in the WACZ file]
[Screenshot: the same page in the WARC file]
I'm not sure whether this problem is related to an HTML <title> tag, but something is going wrong.
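For reference, the traceback suggests that `last_start_tag` is still `None` when `flush_block` runs its comparison against "title". The snippet below is only a minimal sketch of that failure mode and of a None-safe comparison. `SketchParser` and its two methods are hypothetical stand-ins written for illustration, not boilerpy3's actual code, and the sketch says nothing about whether the missing images are related.

```python
from typing import Optional


class SketchParser:
    """Hypothetical stand-in for the parser in the traceback (not boilerpy3's code)."""

    def __init__(self) -> None:
        # Assumption for this sketch: the attribute stays None until a start tag is seen.
        self.last_start_tag: Optional[str] = None

    def flush_block_unsafe(self) -> bool:
        # Same comparison as the line quoted in the traceback.
        return self.last_start_tag.lower() == "title"

    def flush_block_guarded(self) -> bool:
        # None-safe variant: a missing start tag is treated as "not a title".
        return (self.last_start_tag or "").lower() == "title"


parser = SketchParser()

error_message = None
try:
    parser.flush_block_unsafe()
except AttributeError as exc:
    # Calling .lower() on None raises the AttributeError seen in the log.
    error_message = str(exc)

# The guarded comparison does not raise when no start tag was recorded.
guarded_result = parser.flush_block_guarded()

print(error_message)
print(guarded_result)
```

If the cause is indeed an unset start tag, a guard of this kind would avoid the exception, but that is an assumption based only on the traceback shown here.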