
DOMContentFilter creates REJECTED_IMPORT (com.norconex.importer.response.ImporterResponse@3af2bcdb) #483

Closed
mauromi76 opened this issue Apr 23, 2018 · 4 comments

mauromi76 commented Apr 23, 2018

Hi,
I am trying to filter the HTML source by removing all the DIVs that I don't need (for example disclaimers, modals, etc.).

I read the documentation at https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/filter/impl/DOMContentFilter.html, where this seems easy to implement using the DOMContentFilter, but in practice I'm facing the problem that when I add this filter the crawler skips all the pages.

As I understood it, this filter goes in the <preParseHandlers> section.

so here's an extract of my config:

<preParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"  selector="div#cookie-info" onMatch="exclude" />
    <filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"  selector="div#external-dialog" onMatch="exclude" />
</preParseHandlers>

Can you see anything wrong here? Why does this make all pages get skipped with REJECTED_IMPORT (com.norconex.importer.response.ImporterResponse@3af2bcdb)?

Many thanks in advance for your support.
Mauro

essiembre (Contributor) commented

Filters are meant to exclude matching documents, not parts of their content. For what you want to do, you would have better luck with transformers, such as ReplaceTransformer and StripBetweenTransformer.
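
For example, something along these lines could strip the cookie notice from the markup before parsing. This is only a rough sketch: the start/end regular expressions are assumptions and must match your actual markup, and a regex-based strip can cut too early if the div contains nested closing tags.

<preParseHandlers>
    <!-- Sketch only: removes everything between the two matches, inclusively. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
            inclusive="true" caseSensitive="false">
        <stripBetween>
            <start><![CDATA[<div id="cookie-info"]]></start>
            <end><![CDATA[</div>]]></end>
        </stripBetween>
    </transformer>
</preParseHandlers>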

There are currently no DOM-based transformers, but if you really want to deal with the DOM, you can look at the DOMTagger in case it can be of any help. That class can extract specific DOM elements and store them as metadata. You could then strip the original content with a SubstringTransformer if you do not want to keep it.
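
A rough sketch of the DOMTagger approach, based on your selectors (the target field names here are made up):

<preParseHandlers>
    <!-- Sketch only: copies the matched elements into metadata fields;
         it does not remove them from the content by itself. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
        <dom selector="div#cookie-info"     toField="cookieInfo" />
        <dom selector="div#external-dialog" toField="externalDialog" />
    </tagger>
</preParseHandlers>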

FYI, there is a feature request for a DOMTransformer here: Norconex/importer#62

mauromi76 (Author) commented

Hi Pascal,
Thank you very much for the quick response. I will check the docs for ReplaceTransformer and StripBetweenTransformer to see how I can achieve my goal.

Just one question before closing this topic:
where can I find some info on the difference between <preParseHandlers> and <postParseHandlers>?

I'd like to have a very clear view of the step in the parsing process at which they are applied.

thanks
Mauro

essiembre (Contributor) commented Apr 25, 2018

Pre-parse handlers are invoked before the original files are parsed to extract the plain text out of them. You have to be careful which handlers you use there, since you may end up trying to do text operations on binary files (e.g., PDFs). The <restrictTo> tag available with each handler can help. For example, if you need to operate on the raw XML or HTML markup, you would go with pre-parse handlers.
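
For instance, a pre-parse handler can be restricted to HTML documents with something like this (a sketch only; the replacement values are just placeholders):

<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
    <!-- Only apply this handler to documents whose content type matches the regex. -->
    <restrictTo caseSensitive="false" field="document.contentType">text/html</restrictTo>
    <replace>
        <fromValue>&amp;nbsp;</fromValue>
        <toValue> </toValue>
    </replace>
</transformer>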

Post-parse handlers are invoked after the original document has been parsed, and its content is guaranteed to be plain text at that point (without any formatting). For example, if you want to reject documents that contain the word "potato", regardless of their content type, you would define that under post-parse handlers.
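
A sketch of that "potato" example using RegexContentFilter (the pattern is illustrative only):

<postParseHandlers>
    <!-- Sketch only: rejects any document whose extracted text matches the regex. -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
            onMatch="exclude" caseSensitive="false">
        <regex>.*potato.*</regex>
    </filter>
</postParseHandlers>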

So to over-simplify, you can think of it as:

  1. HTTP GET to download the raw document.
  2. Pre-parse handlers on raw document.
  3. Parse document to extract text + metadata/fields.
  4. Post-parse handlers on extracted text + metadata/fields.
  5. Commit the document (extracted text + metadata/fields), unless rejected for whatever reason.

For a more detailed view of the HTTP Collector execution flow, have a look here.

Clearer?

mauromi76 (Author) commented

Just perfect!
Thank you so much.
I will mark this as closed.

Mauro
