DOMContentFilter creates REJECTED_IMPORT (com.norconex.importer.response.ImporterResponse@3af2bcdb) #483
Comments
Filters are meant to exclude matching documents, not parts of their content. For what you want to do, you would have better luck with transformers. There are currently no DOM-based transformers, but if you really want to work with the DOM, you can look at the DOMTagger in case it can be of any help. That class can extract specific DOM elements and store them as metadata. You could then strip the original content with a transformer.
FYI, there is a feature request for a DOMTransformer here: Norconex/importer#62
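To make the filter/transformer distinction above concrete, here is a minimal toy model in Python. It is not the Norconex API: `Doc`, `exclude_filter`, `strip_transformer`, the sample URL and the regular expression are all hypothetical names invented for illustration. It only shows why an exclude-style filter that matches every page rejects every page, while a transformer keeps the page and removes the matched part.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a crawled document."""
    reference: str
    content: str


# Hypothetical pattern for the unwanted block; DOTALL lets it span lines.
DISCLAIMER = re.compile(r'<div class="disclaimer">.*?</div>', re.DOTALL)


def exclude_filter(doc: Doc, pattern: re.Pattern) -> Optional[Doc]:
    """Filter semantics: a match rejects the whole document."""
    if pattern.search(doc.content):
        return None  # document dropped, nothing is imported
    return doc


def strip_transformer(doc: Doc, pattern: re.Pattern) -> Doc:
    """Transformer semantics: the match is removed, the document is kept."""
    return Doc(doc.reference, pattern.sub("", doc.content))


page = Doc(
    "http://example.com/a",
    '<p>Keep me</p><div class="disclaimer">Legal text</div>',
)

filtered = exclude_filter(page, DISCLAIMER)
transformed = strip_transformer(page, DISCLAIMER)
```

In this sketch `filtered` is `None` because the page contains the matching element, which mirrors every page being skipped when the element appears site-wide, whereas `transformed` still holds the page with only the disclaimer removed.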
Hi Pascal, just one question before closing this topic: I'd like to have a very clear view of the step of the parsing process at which the handlers are invoked. Thanks
Pre-parse handlers are invoked before the original files are parsed to get the plain text out of them. You have to be careful which handler you use there, since you may be trying to do text operations on binary files (e.g. PDFs).
The post-parse handlers are invoked after the original document has been parsed, and its content should be guaranteed to be plain text at that point (without any formatting). For example, if you want to reject documents that have the word "potato" in them, regardless of their content type, you would define that under post-parse handlers. So to over-simplify, you can think of it as:
For a more detailed view of the HTTP Collector execution flow, have a look here. Clearer?
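The ordering described above (pre-parse handlers, then parsing, then post-parse handlers) can be sketched as a toy pipeline. This is only an illustration under stated assumptions, not how the Importer is implemented: `run_import`, `toy_parse`, `reject_potato` and the `"IMPORTED"` status label are hypothetical, and only `REJECTED_IMPORT` comes from the log message in this issue.

```python
import re
from typing import Callable, List, Optional, Tuple

PreHandler = Callable[[bytes], Optional[bytes]]  # sees the raw file
PostHandler = Callable[[str], Optional[str]]     # sees extracted plain text


def toy_parse(raw: bytes) -> str:
    """Hypothetical parser: decode, drop tags, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw.decode("utf-8"))
    return " ".join(text.split())


def reject_potato(text: str) -> Optional[str]:
    """Post-parse filter: reject any document mentioning 'potato'."""
    return None if "potato" in text.lower() else text


def run_import(
    raw: bytes,
    pre_parse: List[PreHandler],
    post_parse: List[PostHandler],
) -> Tuple[str, Optional[str]]:
    # 1. Pre-parse handlers run on the original (possibly binary) content.
    content: Optional[bytes] = raw
    for pre in pre_parse:
        content = pre(content)
        if content is None:
            return ("REJECTED_IMPORT", None)
    # 2. Parsing turns the original format into plain text.
    text: Optional[str] = toy_parse(content)
    # 3. Post-parse handlers run on plain text only.
    for post in post_parse:
        text = post(text)
        if text is None:
            return ("REJECTED_IMPORT", None)
    return ("IMPORTED", text)


rejected = run_import(b"<p>I like potato soup</p>", [], [reject_potato])
accepted = run_import(b"<p>Carrot soup</p>", [], [reject_potato])
```

Because `reject_potato` sits in the post-parse list, it only ever sees plain text, which is why a content-type-independent rule like this belongs there rather than among the pre-parse handlers.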
Just perfect! Mauro
Hi
I am trying to filter the HTML source, removing all the DIVs that I don't need (for example disclaimers, modals, etc.).
I read the doc at https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/filter/impl/DOMContentFilter.html, where this seems easy to implement using the DOMContentFilter, but in practice, when I add this filter, the crawler skips all the pages.
As I understood, this filter goes in the section.
So here's an extract of my config:
Can you see anything wrong here? Why does this make all pages get skipped with REJECTED_IMPORT (com.norconex.importer.response.ImporterResponse@3af2bcdb)?
Many thanks in advance for your support
Mauro