
During incremental crawling, the number of "REJECTED_UNMODIFIED" outputs differs from expectations #1105

Open
n-ohbayashi opened this issue Jan 24, 2025 · 2 comments


@n-ohbayashi

Hi.

When performing incremental crawling on two URLs with maxDepth = 5, the REJECTED_UNMODIFIED count in the output is 2.
We expected the number of REJECTED_UNMODIFIED entries to be higher, since the crawler should verify that the links on each page are unchanged down to the depth specified by maxDepth.
Could you please explain the reason for this behavior?
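For reference, the setup described above (two crawlers, maxDepth of 5) might look roughly like the following Norconex HTTP Collector XML sketch. This is only an assumed illustration: the collector id, crawler ids, and URLs are placeholders, and the actual configuration used is not shown in this thread.

```xml
<!-- Hypothetical sketch of the setup under discussion; ids and URLs are placeholders. -->
<httpcollector id="example-collector">
  <crawlers>
    <crawler id="test1">
      <startURLs>
        <url>https://example.com/site1/</url>
      </startURLs>
      <!-- Follow links up to 5 levels deep from the start URL. -->
      <maxDepth>5</maxDepth>
    </crawler>
    <crawler id="test2">
      <startURLs>
        <url>https://example.com/site2/</url>
      </startURLs>
      <maxDepth>5</maxDepth>
    </crawler>
  </crawlers>
</httpcollector>
```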

Thanks.

@ohtwadi
Contributor

ohtwadi commented Jan 27, 2025

Hello,

If REJECTED_UNMODIFIED is 2, then the crawler likely detected changes to the other pages during the incremental run. This should be noted in the logs; you may need to increase the logging level to see further details.

@n-ohbayashi
Author

n-ohbayashi commented Jan 28, 2025

Hello,

Here is an excerpt from the execution logs. If changes had been detected, the importer or committer would have processed the documents, but that does not appear to be the case.
Since two start URLs are configured (one per crawler), we believe that is why REJECTED_UNMODIFIED is 2. (We still expected REJECTED_UNMODIFIED to exceed 2, given the maxDepth.)

2025-01-14 17:01:47.049 [test1] INFO  CrawlDocInfoService - STARTING an incremental crawl from previous 1 valid references.
2025-01-14 17:01:47.050 [test2] INFO  CrawlDocInfoService - STARTING an incremental crawl from previous 1 valid references.
2025-01-14 17:01:47.050 [test1] INFO  CRAWLER_INIT_END - Crawler "test1" initialized successfully.
2025-01-14 17:01:47.050 [test2] INFO  CRAWLER_INIT_END - Crawler "test2" initialized successfully.
2025-01-14 17:01:47.071 [test1] INFO  CRAWLER_RUN_BEGIN - test1
2025-01-14 17:01:47.071 [test2] INFO  CRAWLER_RUN_BEGIN - test2
2025-01-14 17:01:47.078 [test2] INFO  Crawler - JMX support enabled.
2025-01-14 17:01:47.078 [test1] INFO  Crawler - JMX support enabled.
2025-01-14 17:01:47.079 [test2] INFO  HttpCrawler - RobotsTxt support: true
2025-01-14 17:01:47.079 [test1] INFO  HttpCrawler - RobotsTxt support: true
2025-01-14 17:01:47.080 [test2] INFO  HttpCrawler - RobotsMeta support: true
2025-01-14 17:01:47.080 [test1] INFO  HttpCrawler - RobotsMeta support: true
2025-01-14 17:01:47.080 [test1] INFO  HttpCrawler - Sitemap support: false
2025-01-14 17:01:47.080 [test2] INFO  HttpCrawler - Sitemap support: false
2025-01-14 17:01:47.081 [test1] INFO  HttpCrawler - Canonical links support: true
2025-01-14 17:01:47.081 [test2] INFO  HttpCrawler - Canonical links support: true
2025-01-14 17:01:47.658 [test2] INFO  HstsResolver - No Strict-Transport-Security (HSTS) support detected for domain "***".
2025-01-14 17:01:48.002 [test2] INFO  HstsResolver - No Strict-Transport-Security (HSTS) support detected for domain "***".
2025-01-14 17:01:48.069 [test1] INFO  StandardRobotsTxtProvider - No robots.txt found for ***/robots.txt. (302 - 302)
2025-01-14 17:01:48.069 [test2] INFO  StandardRobotsTxtProvider - No robots.txt found for ***/robots.txt. (302 - 302)
2025-01-14 17:01:48.084 [test1] INFO  HttpCrawler - 1 start URLs identified.
2025-01-14 17:01:48.084 [test2] INFO  HttpCrawler - 1 start URLs identified.
2025-01-14 17:01:48.085 [test1] INFO  Crawler - Crawling references...
2025-01-14 17:01:48.085 [test2] INFO  Crawler - Crawling references...
2025-01-14 17:01:48.092 [test2#1] INFO  CRAWLER_RUN_THREAD_BEGIN - Thread[test2#1,5,main]
2025-01-14 17:01:48.092 [test1#1] INFO  CRAWLER_RUN_THREAD_BEGIN - Thread[test1#1,5,main]
2025-01-14 17:01:48.093 [test1#2] INFO  CRAWLER_RUN_THREAD_BEGIN - Thread[test1#2,5,main]
2025-01-14 17:01:48.094 [test2#2] INFO  CRAWLER_RUN_THREAD_BEGIN - Thread[test2#2,5,main]
2025-01-14 17:01:48.094 [test1#3] INFO  CRAWLER_RUN_THREAD_BEGIN - Thread[test1#3,5,main]
2025-01-14 17:01:48.094 [test2#3] INFO  CRAWLER_RUN_THREAD_BEGIN - Thread[test2#3,5,main]
2025-01-14 17:01:48.099 [test1#4] INFO  CRAWLER_RUN_THREAD_BEGIN - Thread[test1#4,5,main]
2025-01-14 17:01:48.100 [test1#5] INFO  CRAWLER_RUN_THREAD_BEGIN - Thread[test1#5,5,main]
2025-01-14 17:01:48.100 [test2#4] INFO  CRAWLER_RUN_THREAD_BEGIN - Thread[test2#4,5,main]
2025-01-14 17:01:48.101 [test2#5] INFO  CRAWLER_RUN_THREAD_BEGIN - Thread[test2#5,5,main]
2025-01-14 17:01:48.165 [test2#2] INFO  CRAWLER_RUN_THREAD_END - Thread[test2#2,5,main]
2025-01-14 17:01:48.166 [test1#2] INFO  CRAWLER_RUN_THREAD_END - Thread[test1#2,5,main]
2025-01-14 17:01:48.165 [test2#1] INFO  CRAWLER_RUN_THREAD_END - Thread[test2#1,5,main]
2025-01-14 17:01:48.166 [test2#3] INFO  CRAWLER_RUN_THREAD_END - Thread[test2#3,5,main]
2025-01-14 17:01:48.167 [test1#3] INFO  CRAWLER_RUN_THREAD_END - Thread[test1#3,5,main]
2025-01-14 17:01:48.166 [test2#4] INFO  CRAWLER_RUN_THREAD_END - Thread[test2#4,5,main]
2025-01-14 17:01:48.166 [test1#1] INFO  CRAWLER_RUN_THREAD_END - Thread[test1#1,5,main]
2025-01-14 17:01:48.166 [test1#4] INFO  CRAWLER_RUN_THREAD_END - Thread[test1#4,5,main]
2025-01-14 17:01:48.166 [test1#5] INFO  CRAWLER_RUN_THREAD_END - Thread[test1#5,5,main]
2025-01-14 17:01:48.165 [test2#5] INFO  CRAWLER_RUN_THREAD_END - Thread[test2#5,5,main]
2025-01-14 17:01:48.168 [test2] INFO  CRAWLER_RUN_END - test2
2025-01-14 17:01:48.168 [test1] INFO  CRAWLER_RUN_END - test1
2025-01-14 17:01:48.251 [test2] INFO  Crawler - Crawler completed.
2025-01-14 17:01:48.262 [test2] INFO  Crawler - Execution Summary:
Total processed:   1
Since (re)start:
  Crawl duration:  1 second
  Avg. throughput: 0.8 processed/seconds
  Event counts:
    CRAWLER_RUN_BEGIN:         2
    CRAWLER_RUN_END:           1
    CRAWLER_RUN_THREAD_BEGIN:  10
    CRAWLER_RUN_THREAD_END:    10
    DOCUMENT_PROCESSED:        2
    DOCUMENT_QUEUED:           2
    REJECTED_UNMODIFIED:       2
2025-01-14 17:01:48.267 [test2] INFO  MVStoreDataStoreEngine - Closing data store engine...
2025-01-14 17:01:48.269 [test2] INFO  MVStoreDataStoreEngine - Compacting data store...
2025-01-14 17:01:48.311 [test2] INFO  MVStoreDataStoreEngine - Data store engine closed.
