How to extract the content of only certain tags using css selector #76
Comments
Filters are to eliminate documents, not content. For that, you have to use transformers to modify the content, or taggers to create new or modify existing metadata fields. In your case, if you want to rely on DOM matching, I suggest you use the DOMTagger to extract the content you want and store it in a metadata field of your choice. If you really want to modify the content, there is no DOM transformer as of now. You can look at other options, such as using regular expressions with ReplaceTransformer.
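For now, you can do a bit of a hack: store the selector matches in a metadata field, then replace the document content with that field, as in this configuration:
<importer>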
<preParseHandlers>
<!-- filter content by css selector, store in a metadata field -->
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
<dom selector="..." overwrite="false" toField="css_selector_content" />
</tagger>
</preParseHandlers>
<postParseHandlers>
<!-- merge all selector hits to a single value -->
<tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
<singleValue field="css_selector_content" action="mergeWith: " />
</tagger>
<!-- replace document content with the value in the metadata field -->
<transformer class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
<script><![CDATA[
if(metadata['css_selector_content']) {
new_content = metadata['css_selector_content'][0];
} else {
new_content = content;
}
/* return */ new_content;
]]></script>
</transformer>
</postParseHandlers>
</importer>
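To see what the script step does in isolation, here is a minimal standalone JavaScript sketch of the same fallback logic. The `pickContent` function and the sample values are hypothetical and exist only for illustration. In the importer, `metadata` and `content` are supplied by ScriptTransformer, and the multi-valued metadata field is mimicked here with a plain array.

```javascript
// Hypothetical helper mirroring the inline script above: prefer the merged
// selector hits stored in the metadata field, otherwise keep the original content.
function pickContent(metadata, content) {
  var hits = metadata['css_selector_content'];
  if (hits) {
    // After ForceSingleValueTagger has merged the hits, the field holds one value.
    return hits[0];
  }
  return content;
}

// Sample values, made up for illustration.
var fullPage = '<html>full page</html>';
console.log(pickContent({ css_selector_content: ['Article body'] }, fullPage));
console.log(pickContent({}, fullPage));
```

When the selector matches nothing, the field is absent and the document content passes through unchanged, which is why the hack never leaves a document with empty content.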
I am marking this as a feature request to add a DOMTransformer.
I would also like to vote for a DOMTransformer :) It would be very helpful to be able to extract only the content of a specific CSS selector into the "content" field. For example:
<handler class="DOMTransformer">
<dom selector="div#content" onMatch="include" />
</handler>
I just made a snapshot release of the importer lib with a new DOMPreserveTransformer. You can define multiple <dom> elements:
<handler class="DOMPreserveTransformer">
<dom selector="div.something" extract="outerHtml"/>
<dom selector="div.somethingElse" extract="outerHtml"/>
</handler>
It complements the
This is the web page that I want to extract text from: http://www.jornada.unam.mx/2018/02/21/politica/005n1pol
If I use the complex-config.xml file from the examples, I get all the content of the web page. Now I want to extract only the content inside a specific div. That div has the following CSS selector: #article-cont
To achieve this, I'm trying to use the following importer configuration, but I still get all the page content.
What am I doing wrong? What do I have to do?