How to extract the content of only certain tags using css selector #76

rulo4 · 2018-02-23T18:06:14Z

This is the web page I want to extract text from: http://www.jornada.unam.mx/2018/02/21/politica/005n1pol

If I use the complex-config.xml file from the examples, I get the full content of the web page. Now I want to extract only the content inside a specific div, which has the CSS selector #article-cont.

To achieve this, I am trying the following importer configuration, but I still get the whole page content.

<importer>
    <preParseHandlers>
        <filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"
                selector="#article-cont" onMatch="include">
        </filter>
    </preParseHandlers>
</importer>

What am I doing wrong? What should I do instead?

essiembre · 2018-02-24T18:33:43Z

Filters are meant to eliminate whole documents, not parts of their content. To change what is kept, use transformers to modify the content, or taggers to create new metadata fields or modify existing ones. In your case, if you want to rely on DOM matching, I suggest you use the DOMTagger to extract the content you want and store it in a metadata field of your choice.
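As an illustration of the DOMTagger suggestion applied to the selector from the question, a minimal sketch might look like the following. The field name `article_content` is a hypothetical choice, and the `extract="text"` attribute is an assumption about the DOMTagger version in use (check its JavaDoc); the rest mirrors the tagger syntax used elsewhere in this thread.

```xml
<importer>
    <preParseHandlers>
        <!-- Copy whatever matches #article-cont into a metadata field.
             "article_content" is a hypothetical field name. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <!-- extract="text" is assumed to be supported by this version;
                 drop it if the attribute is not recognized. -->
            <dom selector="#article-cont" toField="article_content" extract="text" />
        </tagger>
    </preParseHandlers>
</importer>
```

With this in place the document content itself is unchanged; only the extra metadata field carries the selected text, so the committer or index mapping would need to read from that field.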

If you really want to modify the content, there is no DOM transformer as of now. You can look at other options, such as using regular expressions with ReplaceTransformer.

ronjakoi · 2018-02-26T20:15:37Z

For now, you can do a bit of a hack. You can use DOMTagger to first extract the content you want and then use ScriptTransformer to replace the imported content with the value stored in the metadata tag:

<importer>
    <preParseHandlers>
        <!-- filter content by css selector, store in a metadata field -->
        <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <dom selector="..." overwrite="false" toField="css_selector_content" />
        </tagger>
    </preParseHandlers>

    <postParseHandlers>
        <!-- merge all selector hits to a single value -->
        <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
            <singleValue field="css_selector_content" action="mergeWith: " />
        </tagger>

        <!-- replace document content with the value in the metadata field -->
        <transformer class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
            <script><![CDATA[
                if(metadata['css_selector_content']) {
                    new_content = metadata['css_selector_content'][0];
                } else {
                    new_content = content;
                }
                /* return */ new_content;
            ]]></script>
        </transformer>
    </postParseHandlers>
</importer>

essiembre · 2019-12-22T06:31:41Z

I am marking this as a feature request to add a DOMTransformer.

adesso-thomas-lippitsch · 2022-08-16T11:07:20Z

I would also like to vote for a DOMTransformer :)

It would be very helpful to be able to extract only the content matching a specific CSS selector into the "content" field.
Alternatively, the DOMDeleteTransformer would just need a parameter that inverts its behavior, analogous to the onMatch=[include|exclude] attribute of the ReferenceFilter:

<handler class="DOMTransformer">
    <dom selector="div#content" onMatch="include" />
</handler>

essiembre · 2022-09-08T04:09:59Z

I just made a snapshot release of the importer lib with a new DOMPreserveTransformer.

You can define multiple <dom> selectors, and all matching ones will be "included" (rejecting the rest). Example (see JavaDoc for more options):

<handler class="DOMPreserveTransformer">
  <dom selector="div.something" extract="outerHtml"/>
  <dom selector="div.somethingElse" extract="outerHtml"/>
</handler>

It complements the DOMDeleteTransformer, and you can use both, one after the other, as you see fit.
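For the page in the original question, chaining the two transformers could be sketched roughly as follows. This is an assumption-laden illustration, not a tested configuration: placing the handlers under `<preParseHandlers>` is assumed from the earlier examples in this thread, and the `.share-buttons` selector is purely hypothetical, standing in for any unwanted element nested inside the preserved div.

```xml
<importer>
  <preParseHandlers>
    <!-- Step 1: keep only the article container, discarding the rest of the page. -->
    <handler class="DOMPreserveTransformer">
      <dom selector="#article-cont" extract="outerHtml"/>
    </handler>
    <!-- Step 2: remove unwanted children from what was preserved.
         ".share-buttons" is a hypothetical selector for illustration. -->
    <handler class="DOMDeleteTransformer">
      <dom selector="#article-cont .share-buttons"/>
    </handler>
  </preParseHandlers>
</importer>
```

Using `extract="outerHtml"` in the first step keeps the `#article-cont` element itself in the output, so the second selector can still be scoped to it.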

essiembre added the feature-request label Dec 22, 2019

essiembre added a commit that referenced this issue Sep 8, 2022

New DOMPreserveTransformer. #76

ac26098

essiembre added the resolved label Sep 8, 2022

adesso-thomas-lippitsch mentioned this issue Sep 8, 2022

Inverse of DOMDeleteTransformer possible? Norconex/crawlers#796

Closed
