This repository has been archived by the owner on Oct 30, 2018. It is now read-only.

Treat content as HTML even if it has junk before the start of the HTML #98

Open: gnp wants to merge 2 commits into base: master

@gnp gnp commented Aug 2, 2015

I've seen pages like this in the wild, for example with <script> elements before the HTML doctype. This fallback let me still run Goose on those pages.
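To illustrate the idea, here is a minimal sketch of detecting HTML that starts after leading junk. It is not the code from this pull request: `HtmlSniffer`, `htmlStart`, and `looksLikeHtml` are hypothetical names, and the marker pattern is an assumption about what counts as the start of the HTML.

```scala
object HtmlSniffer {
  // Assumption: a doctype declaration or an opening <html> tag marks the start of the HTML.
  // The (?i) flag makes the whole pattern case-insensitive.
  private val HtmlMarker = """(?i)<!doctype\s+html|<html[\s>]""".r

  /** Index of the first HTML marker, or None if the content has no marker at all. */
  def htmlStart(content: String): Option[Int] =
    HtmlMarker.findFirstMatchIn(content).map(_.start)

  /** True if an HTML marker appears anywhere, even after leading junk. */
  def looksLikeHtml(content: String): Boolean =
    htmlStart(content).isDefined
}
```

A check that only looks at the very first characters would reject a page that opens with a stray <script> block, while a search like this one would still accept it.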

gnp commented Aug 2, 2015

I have JDK 8 installed, so I had to invoke Maven with JAVA_HOME pointing at a JDK 7 install:

JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/ mvn clean package

Otherwise the compiler complained about JDK class files being "broken".

gnp commented Aug 2, 2015

With some refactoring to split out the parsing from the fetching in HtmlFetcher.scala, it would be possible to write unit tests that exercise the different cases, including the one I built this for.
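A rough sketch of what such a split could look like follows. Every name here (`ByteFetcher`, `HtmlParsing`, `PageLoader`) is hypothetical, since the actual structure of HtmlFetcher.scala is not shown in this thread, and the HTML check is deliberately simplistic.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical abstraction over the network side, so tests can substitute a stub.
trait ByteFetcher {
  def fetch(url: String): Option[Array[Byte]]
}

object HtmlParsing {
  // Pure function with no network access: a unit test can pass bytes in directly.
  // Assumption: content counts as HTML if an <html tag appears anywhere in it.
  def decodeIfHtml(bytes: Array[Byte],
                   charset: Charset = StandardCharsets.UTF_8): Option[String] = {
    val text = new String(bytes, charset)
    if (text.toLowerCase.contains("<html")) Some(text) else None
  }
}

// Glue: fetch first, then hand the bytes to the pure parsing step.
class PageLoader(fetcher: ByteFetcher) {
  def load(url: String): Option[String] =
    fetcher.fetch(url).flatMap(bytes => HtmlParsing.decodeIfHtml(bytes))
}
```

With this shape, the junk-before-doctype case becomes a test that feeds a byte array to `HtmlParsing.decodeIfHtml`, with no HTTP involved.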
