Crawl https sitemap with http urls #639

SaschaHeyer · 2019-09-09T07:14:44Z

Hi Pascal,

How can I crawl a sitemap that is reachable over https but contains URLs with http? Norconex does not identify any startURLs in that case.

I already tried setting lenient to true:

<sitemapResolverFactory ignore="false" lenient="true" class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory">
      <path></path>
</sitemapResolverFactory>

I also tried setting stayOnProtocol to false:

<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="false">
        #parse("./shared/startUrls-sitemap.xml")
</startURLs>

Any other recommendations?

Best regards
Sascha

essiembre · 2019-09-09T17:12:54Z

I would try stayOnPort="false", given that https runs on port 443.

SaschaHeyer · 2019-09-09T17:38:45Z

No difference with stayOnPort="false".

essiembre · 2019-09-30T04:17:04Z

Have you tried using GenericURLNormalizer to convert those URLs to https with the secureScheme normalization?
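
As a rough sketch of that suggestion, the normalizer could be declared inside the crawler configuration as shown here. This assumes the 2.x XML configuration syntax. Explicitly listing normalizations is expected to replace the default set, so a couple of common ones are listed alongside secureScheme; adjust the list to your needs.

```xml
<!-- Sketch only: assumes Norconex HTTP Collector 2.x configuration syntax. -->
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <!-- secureScheme rewrites http:// URLs to https:// before they are queued. -->
  <!-- The other entries are illustrative; keep whichever normalizations you already rely on. -->
  <normalizations>
    removeFragment, lowerCaseSchemeHost, secureScheme
  </normalizations>
</urlNormalizer>
```

With this in place, the http URLs listed in the sitemap should be converted to https before the stayOnProtocol check matters.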

jetnet · 2020-06-02T11:52:05Z

Same issue here.
secureScheme helps with such sites, but unfortunately it cannot be used globally, because we also have to crawl many unsecured sites.
Any suggestions? Thanks!

essiembre · 2020-06-02T16:34:25Z

That is a tough one because any URL, even within the same site, might support a different scheme: http, https, or both.

We could investigate a new feature that always tries the "preferred" scheme first and, if that fails, falls back to the other. But for some sites, it could pretty much double the number of "hits" on the server.

My preference would be to encourage identifying those sites and handling them separately, in their own crawler config, let's say. That may not always be realistic, but it is preferable when possible.

If you have better options, I would like to hear them.
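
The separate-crawler approach described above might look roughly like the following. This is a sketch under assumptions: it uses the 2.x XML configuration syntax, and the collector id, crawler ids, and example.com / example.org URLs are all hypothetical placeholders.

```xml
<!-- Sketch only: ids and URLs are hypothetical; assumes 2.x configuration syntax. -->
<httpcollector id="example-collector">
  <crawlers>

    <!-- Sites whose https sitemap lists http URLs: force https here only. -->
    <crawler id="https-sitemap-sites">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <sitemap>https://www.example.com/sitemap.xml</sitemap>
      </startURLs>
      <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>
          removeFragment, lowerCaseSchemeHost, secureScheme
        </normalizations>
      </urlNormalizer>
    </crawler>

    <!-- Unsecured sites: no secureScheme, so http URLs are left untouched. -->
    <crawler id="http-sites">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://www.example.org/</url>
      </startURLs>
    </crawler>

  </crawlers>
</httpcollector>
```

Keeping the normalizer scoped to one crawler avoids rewriting URLs for sites that only answer on http, at the cost of maintaining the list of affected sites by hand.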

essiembre added the feature-request label Jun 7, 2020
