
DOMContentFilter creates REJECTED_IMPORT (com.norconex.importer.response.ImporterResponse@3af2bcdb) #483

Closed
mauromi76 opened this issue Apr 23, 2018 · 4 comments

mauromi76 commented Apr 23, 2018

Hi,
I am trying to filter the HTML source by removing all the DIVs that I don't need (for example disclaimers, modals, etc.).

I read the documentation at https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/filter/impl/DOMContentFilter.html, where this seems easy to implement using the DOMContentFilter, but in practice I'm facing the problem that when I add this filter the crawler skips all the pages.

As I understood it, this filter goes in the <preParseHandlers> section.

so here's an extract of my config:

<preParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"  selector="div#cookie-info" onMatch="exclude" />
    <filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"  selector="div#external-dialog" onMatch="exclude" />
</preParseHandlers>

Can you see anything wrong here? Why does this make all pages get skipped with REJECTED_IMPORT (com.norconex.importer.response.ImporterResponse@3af2bcdb)?

Many thanks in advance for your support.
Mauro

essiembre (Contributor) commented

Filters are meant to exclude matching documents, not parts of their content. For what you want to do, you would have better luck with transformers, such as ReplaceTransformer and StripBetweenTransformer.
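
For example, something along these lines could strip the cookie notice from the markup before parsing. This is only a rough sketch: the start/end regular expressions are assumptions and must match your actual markup, and a regex-based strip can cut too early if the div contains nested closing tags.

<preParseHandlers>
    <!-- Sketch only: removes everything between the two matches, inclusively. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
            inclusive="true" caseSensitive="false">
        <stripBetween>
            <start><![CDATA[<div id="cookie-info"]]></start>
            <end><![CDATA[</div>]]></end>
        </stripBetween>
    </transformer>
</preParseHandlers>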

There are currently no DOM-based transformers, but if you really want to deal with the DOM, you can look at the DOMTagger in case it can be of any help. That class can extract specific DOM elements and store them as metadata. You could then strip the original content with a SubstringTransformer if you do not want to keep it.
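
A rough sketch of the DOMTagger approach, based on your selectors (the target field names here are made up):

<preParseHandlers>
    <!-- Sketch only: copies the matched elements into metadata fields;
         it does not remove them from the content by itself. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
        <dom selector="div#cookie-info"     toField="cookieInfo" />
        <dom selector="div#external-dialog" toField="externalDialog" />
    </tagger>
</preParseHandlers>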

FYI, there is a feature request for a DOMTransformer here: Norconex/importer#62

mauromi76 (Author) commented

Hi Pascal,
Thank you very much for the quick response. I will check the docs for ReplaceTransformer and StripBetweenTransformer to see how I can achieve my goal.

Just one question before closing this topic:
where can I find some info on the difference between <preParseHandlers> and <postParseHandlers>?

I'd like to have a very clear view of the step in the parsing process at which they are applied.

thanks
Mauro

essiembre (Contributor) commented Apr 25, 2018

Pre-parse handlers are invoked before the original files are parsed to extract the plain text out of them. You have to be careful which handlers you use there, since you may end up trying to do text operations on binary files (e.g., PDFs). The <restrictTo> tag available with each handler can help. For example, if you need to operate on the raw XML or HTML markup, you would go with pre-parse handlers.
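
For instance, a pre-parse handler can be restricted to HTML documents with something like this (a sketch only; the replacement values are just placeholders):

<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
    <!-- Only apply this handler to documents whose content type matches the regex. -->
    <restrictTo caseSensitive="false" field="document.contentType">text/html</restrictTo>
    <replace>
        <fromValue>&amp;nbsp;</fromValue>
        <toValue> </toValue>
    </replace>
</transformer>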

Post-parse handlers are invoked after the original document has been parsed, and its content is guaranteed to be plain text at that point (without any formatting). For example, if you want to reject documents that contain the word "potato", regardless of their content type, you would define that under post-parse handlers.
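
A sketch of that "potato" example using RegexContentFilter (the pattern is illustrative only):

<postParseHandlers>
    <!-- Sketch only: rejects any document whose extracted text matches the regex. -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
            onMatch="exclude" caseSensitive="false">
        <regex>.*potato.*</regex>
    </filter>
</postParseHandlers>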

So to over-simplify, you can think of it as:

  1. HTTP GET to download the raw document.
  2. Pre-parse handlers on raw document.
  3. Parse document to extract text + metadata/fields.
  4. Post-parse handlers on extracted text + metadata/fields.
  5. Commit the document (extracted text + metadata/fields), unless rejected for whatever reason.

For a more detailed view of the HTTP Collector execution flow, have a look here.

Clearer?

mauromi76 (Author) commented

Just perfect!
Thank you so much.
I will mark this as closed.

Mauro
