
(Docker:nightly) Everything works but: "Crawling job failed: Error: EXDEV: cross-device link not permitted" leads to multiple retries and "failed crawling jobs" #273

Closed
Deathproof76 opened this issue Jul 4, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@Deathproof76

Hello! 🙋‍♂️

I'm running the current nightly from July 2nd and have run into an almost cosmetic bug: fetching, crawling, inference, and everything else seem to be in working order. This hasn't happened with my setup before, and sadly I can't pin down the exact commit that introduced it.

Every time I hoard a new page or recrawl an old one (to try out a new LLM running with ollama), I get this error message from the workers container:

2024-07-04T10:12:15.501Z error: [Crawler][3904] Crawling job failed: Error: EXDEV: cross-device link not permitted, rename '/tmp/43e6a047-5a13-41ad-b6d9-af89eea2f7a1' -> '/data/assets/m886fg03flmkf2e9m9mo5tlk/43e6a047-5a13-41ad-b6d9-af89eea2f7a1/asset.bin'

Although the hoarding itself doesn't appear to have failed, the workers treat it as a failure and retry the job multiple times, which leads to:

2024-07-04T09:44:45.028Z info: [Crawler][3902] Will crawl "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd" for link with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:44:45.029Z info: [Crawler][3902] Attempting to determine the content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd
2024-07-04T09:44:45.512Z info: [Crawler][3902] Content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd is "text/html; charset=utf-8"
2024-07-04T09:44:46.576Z info: [Crawler][3902] Successfully navigated to "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd". Waiting for the page to load ...
2024-07-04T09:44:49.551Z info: [Crawler][3902] Finished waiting for the page to load.
2024-07-04T09:44:49.872Z info: [Crawler][3902] Finished capturing page content and a screenshot. FullPageScreenshot: true
2024-07-04T09:44:49.875Z info: [Crawler][3902] Will attempt to extract metadata from page ...
2024-07-04T09:44:50.002Z info: [Crawler][3902] Will attempt to extract readable content ...
2024-07-04T09:44:50.094Z info: [Crawler][3902] Done extracting readable content.
2024-07-04T09:44:50.096Z info: [Crawler][3902] Done extracting metadata from the page.
2024-07-04T09:44:50.110Z info: [Crawler][3902] Stored the screenshot as assetId: ffa51dc7-2eeb-4e08-a832-87928459f2a4
2024-07-04T09:44:50.110Z info: [Crawler][3902] Downloading image from "https://cdn-thumbnails.huggingface.co/social-thumbnails/models/aTrain-core/distil-whisper-large-v3-de-kd.png"
2024-07-04T09:44:50.216Z info: [Crawler][3902] Downloaded image as assetId: c7d82832-7638-486d-9b3d-c388ff29df43
2024-07-04T09:44:50.234Z info: [Crawler][3902] Will attempt to archive page ...
2024-07-04T09:44:50.238Z info: [inference][5113] Starting an inference job for bookmark with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:44:50.239Z info: [search][11018] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:44:50.295Z info: [search][11018] Completed successfully
2024-07-04T09:44:51.062Z info: [inference][5113] Inferring tag for bookmark "n68vikut13h1d0jo7x2cewfs" used 27 tokens and inferred: Machine Learning,Natural Language Processing,German Language,Whisper,HuggingFace
2024-07-04T09:44:51.080Z info: [inference][5113] Completed successfully
2024-07-04T09:44:51.080Z info: [search][11019] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:44:51.135Z info: [search][11019] Completed successfully
2024-07-04T09:44:58.421Z error: [Crawler][3902] Crawling job failed: Error: EXDEV: cross-device link not permitted, rename '/tmp/08d38192-37d4-4628-8da4-4f1671c44520' -> '/data/assets/m886fg03flmkf2e9m9mo5tlk/08d38192-37d4-4628-8da4-4f1671c44520/asset.bin'
2024-07-04T09:45:00.461Z info: [Crawler][3902] Will crawl "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd" for link with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:00.461Z info: [Crawler][3902] Attempting to determine the content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd
2024-07-04T09:45:00.941Z info: [Crawler][3902] Content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd is "text/html; charset=utf-8"
2024-07-04T09:45:01.784Z info: [Crawler][3902] Successfully navigated to "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd". Waiting for the page to load ...
2024-07-04T09:45:04.787Z info: [Crawler][3902] Finished waiting for the page to load.
2024-07-04T09:45:05.119Z info: [Crawler][3902] Finished capturing page content and a screenshot. FullPageScreenshot: true
2024-07-04T09:45:05.121Z info: [Crawler][3902] Will attempt to extract metadata from page ...
2024-07-04T09:45:05.248Z info: [Crawler][3902] Will attempt to extract readable content ...
2024-07-04T09:45:05.331Z info: [Crawler][3902] Done extracting readable content.
2024-07-04T09:45:05.332Z info: [Crawler][3902] Done extracting metadata from the page.
2024-07-04T09:45:05.333Z info: [Crawler][3902] Stored the screenshot as assetId: 4e014c81-8a79-4877-ae9b-c99287434a13
2024-07-04T09:45:05.333Z info: [Crawler][3902] Downloading image from "https://cdn-thumbnails.huggingface.co/social-thumbnails/models/aTrain-core/distil-whisper-large-v3-de-kd.png"
2024-07-04T09:45:05.439Z info: [Crawler][3902] Downloaded image as assetId: 48ecb83e-a5ec-4d11-9164-1b01a97ffa71
2024-07-04T09:45:05.456Z info: [Crawler][3902] Will attempt to archive page ...
2024-07-04T09:45:05.460Z info: [inference][5114] Starting an inference job for bookmark with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:05.460Z info: [search][11020] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:05.517Z info: [search][11020] Completed successfully
2024-07-04T09:45:06.330Z info: [inference][5114] Inferring tag for bookmark "n68vikut13h1d0jo7x2cewfs" used 27 tokens and inferred: Machine Learning,Natural Language Processing,German Language,Whisper Model,Distillation
2024-07-04T09:45:06.348Z info: [inference][5114] Completed successfully
2024-07-04T09:45:06.348Z info: [search][11021] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:06.403Z info: [search][11021] Completed successfully
2024-07-04T09:45:22.300Z error: [Crawler][3902] Crawling job failed: Error: EXDEV: cross-device link not permitted, rename '/tmp/2aaff50d-f4c7-4fe3-8b04-2c004a83c21a' -> '/data/assets/m886fg03flmkf2e9m9mo5tlk/2aaff50d-f4c7-4fe3-8b04-2c004a83c21a/asset.bin'
2024-07-04T09:45:26.316Z info: [Crawler][3902] Will crawl "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd" for link with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:26.316Z info: [Crawler][3902] Attempting to determine the content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd
2024-07-04T09:45:26.602Z info: [Crawler][3902] Content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd is "text/html; charset=utf-8"
2024-07-04T09:45:27.698Z info: [Crawler][3902] Successfully navigated to "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd". Waiting for the page to load ...
2024-07-04T09:45:30.709Z info: [Crawler][3902] Finished waiting for the page to load.
2024-07-04T09:45:31.065Z info: [Crawler][3902] Finished capturing page content and a screenshot. FullPageScreenshot: true
2024-07-04T09:45:31.068Z info: [Crawler][3902] Will attempt to extract metadata from page ...
2024-07-04T09:45:31.187Z info: [Crawler][3902] Will attempt to extract readable content ...
2024-07-04T09:45:31.271Z info: [Crawler][3902] Done extracting readable content.
2024-07-04T09:45:31.273Z info: [Crawler][3902] Done extracting metadata from the page.
2024-07-04T09:45:31.274Z info: [Crawler][3902] Stored the screenshot as assetId: d2a1689b-b7c1-4129-aed7-be6231d8be83
2024-07-04T09:45:31.274Z info: [Crawler][3902] Downloading image from "https://cdn-thumbnails.huggingface.co/social-thumbnails/models/aTrain-core/distil-whisper-large-v3-de-kd.png"
2024-07-04T09:45:31.386Z info: [Crawler][3902] Downloaded image as assetId: dda5ffc3-34fd-4cff-8a3b-1066cb3b7e2a
2024-07-04T09:45:31.415Z info: [Crawler][3902] Will attempt to archive page ...
2024-07-04T09:45:31.419Z info: [inference][5115] Starting an inference job for bookmark with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:31.419Z info: [search][11022] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:31.475Z info: [search][11022] Completed successfully
2024-07-04T09:45:32.750Z info: [inference][5115] Inferring tag for bookmark "n68vikut13h1d0jo7x2cewfs" used 45 tokens and inferred: Machine Learning,Natural Language Processing,Speech Recognition,German Language,HuggingFace
2024-07-04T09:45:32.765Z info: [inference][5115] Completed successfully
2024-07-04T09:45:32.765Z info: [search][11023] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:32.821Z info: [search][11023] Completed successfully
2024-07-04T09:45:39.336Z error: [Crawler][3902] Crawling job failed: Error: EXDEV: cross-device link not permitted, rename '/tmp/f6c22412-7fb5-44d4-b47c-750b3a81c516' -> '/data/assets/m886fg03flmkf2e9m9mo5tlk/f6c22412-7fb5-44d4-b47c-750b3a81c516/asset.bin'
2024-07-04T09:45:47.359Z info: [Crawler][3902] Will crawl "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd" for link with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:47.359Z info: [Crawler][3902] Attempting to determine the content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd
2024-07-04T09:45:47.629Z info: [Crawler][3902] Content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd is "text/html; charset=utf-8"
2024-07-04T09:45:48.461Z info: [Crawler][3902] Successfully navigated to "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd". Waiting for the page to load ...
2024-07-04T09:45:51.536Z info: [Crawler][3902] Finished waiting for the page to load.
2024-07-04T09:45:51.866Z info: [Crawler][3902] Finished capturing page content and a screenshot. FullPageScreenshot: true
2024-07-04T09:45:51.869Z info: [Crawler][3902] Will attempt to extract metadata from page ...
2024-07-04T09:45:52.010Z info: [Crawler][3902] Will attempt to extract readable content ...
2024-07-04T09:45:52.102Z info: [Crawler][3902] Done extracting readable content.
2024-07-04T09:45:52.103Z info: [Crawler][3902] Done extracting metadata from the page.
2024-07-04T09:45:52.104Z info: [Crawler][3902] Stored the screenshot as assetId: f18e3c0d-6ae7-48ae-af10-9c43eeff1be6
2024-07-04T09:45:52.104Z info: [Crawler][3902] Downloading image from "https://cdn-thumbnails.huggingface.co/social-thumbnails/models/aTrain-core/distil-whisper-large-v3-de-kd.png"
2024-07-04T09:45:52.248Z info: [Crawler][3902] Downloaded image as assetId: b88ba0ce-7cd2-4739-8e59-2a7eba1f4ed7
2024-07-04T09:45:52.255Z info: [Crawler][3902] Will attempt to archive page ...
2024-07-04T09:45:52.260Z info: [inference][5116] Starting an inference job for bookmark with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:52.260Z info: [search][11024] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:52.316Z info: [search][11024] Completed successfully
2024-07-04T09:45:52.873Z info: [inference][5116] Inferring tag for bookmark "n68vikut13h1d0jo7x2cewfs" used 28 tokens and inferred: Machine Learning,Natural Language Processing,German Language,Whisper Model,HuggingFace
2024-07-04T09:45:52.877Z info: [inference][5116] Completed successfully
2024-07-04T09:45:52.878Z info: [search][11025] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:52.934Z info: [search][11025] Completed successfully
2024-07-04T09:46:03.330Z error: [Crawler][3902] Crawling job failed: Error: EXDEV: cross-device link not permitted, rename '/tmp/e0b27956-2f2c-4b5d-a663-4e04b61276bb' -> '/data/assets/m886fg03flmkf2e9m9mo5tlk/e0b27956-2f2c-4b5d-a663-4e04b61276bb/asset.bin'

until the workers give up.

I tried recrawling and reindexing everything, which in the end marked all jobs as failed, even though they had not actually failed:

image

The latest grabs, for example:

image

All the other containers of the stack show no errors. My volumes are mounted directly. I tried mounting /tmp to /dev/shm/hoarder on the system drive, and also to a folder next to the /data folder on the same disk as the other mounts. Neither helped.
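One plausible reason the /tmp remount did not help: on Linux, rename fails with EXDEV whenever source and destination sit on different mount points, even if both mount points are backed by the same disk or filesystem. A minimal sketch of a probe that checks this directly is below. `canRenameBetween` is a hypothetical helper, not part of Hoarder, and the demo uses throwaway directories rather than the real /tmp and /data.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical diagnostic helper, not part of Hoarder.
// Returns false only when the rename is rejected with EXDEV.
function canRenameBetween(fromDir: string, toDir: string): boolean {
  const name = `rename-probe-${process.pid}-${Date.now()}`;
  const src = path.join(fromDir, name);
  const dest = path.join(toDir, name);
  fs.writeFileSync(src, "");
  try {
    fs.renameSync(src, dest);
  } catch (err) {
    fs.unlinkSync(src); // clean up the probe file before reporting
    if ((err as NodeJS.ErrnoException).code === "EXDEV") {
      return false;
    }
    throw err; // any other error is unrelated to the mount layout
  }
  fs.unlinkSync(dest);
  return true;
}

// Demo on two sibling directories, which share a mount point.
const probeRoot = fs.mkdtempSync(path.join(os.tmpdir(), "probe-"));
const dirA = path.join(probeRoot, "a");
const dirB = path.join(probeRoot, "b");
fs.mkdirSync(dirA);
fs.mkdirSync(dirB);
const sameMount = canRenameBetween(dirA, dirB);
console.log(sameMount);
```

Run inside the workers container, `canRenameBetween("/tmp", "/data")` would be expected to return false for the setup described above, since /data is its own bind mount.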

.env

MEILI_MASTER_KEY=6L0m********************************************hCzm
NEXTAUTH_SECRET=/RfD5U******************************************Eqr9OZ+ciD
NEXTAUTH_URL=http://192.168.0.208:3111
CRAWLER_NUM_WORKERS=1
CRAWLER_FULL_PAGE_SCREENSHOT=true
CRAWLER_FULL_PAGE_ARCHIVE=true
CRAWLER_JOB_TIMEOUT_SEC=90
CRAWLER_NAVIGATE_TIMEOUT_SEC=45
MAX_ASSET_SIZE_MB=20
OLLAMA_BASE_URL=http://192.168.0.208:11434
DISABLE_SIGNUPS=true
INFERENCE_LANG=english

compose:

version: "3.8"
services:
  web:
    image: ghcr.io/mohamedbassem/hoarder-web:latest
    restart: unless-stopped
    volumes:
      - /mnt/Dockerspace/hoarder/web:/data
    ports:
      - 3111:3000
    env_file:
      - /mnt/Dockerspace/hoarder/.env
    environment:
      MEILI_ADDR: http://192.168.0.208:7700
      DATA_DIR: /data
      REDIS_HOST: rediss
  rediss:
    image: redis:7.2-alpine
    restart: unless-stopped
    volumes:
      - /mnt/Dockerspace/hoarder/redis:/data
  chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    restart: unless-stopped
    ports:
      - 9222:9222
    command:
      - --no-sandbox
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
      - --enable-features=ConversionMeasurement,AttributionReportingCrossAppWeb
  meilisearch:
    image: getmeili/meilisearch:v1.6
    restart: unless-stopped
    environment:
      MEILI_NO_ANALYTICS: true
    ports:
      - 7700:7700
    env_file:
      - /mnt/Dockerspace/hoarder/.env
    volumes:
      - /mnt/Dockerspace/hoarder/meilisearch:/meili_data
  workers:
    image: ghcr.io/mohamedbassem/hoarder-workers:latest
    restart: unless-stopped
    env_file:
      - /mnt/Dockerspace/hoarder/.env
    volumes:
      - /mnt/Dockerspace/hoarder/web:/data
    environment:
      REDIS_HOST: rediss
      MEILI_ADDR: http://192.168.0.208:7700
      BROWSER_WEB_URL: http://192.168.0.208:9222
      DATA_DIR: /data
      INFERENCE_TEXT_MODEL: gemma-2-9b-it-IQ3_XXS.gguf:latest
      INFERENCE_IMAGE_MODEL: moondream:1.8b-v2-q6_K
    depends_on:
      web:
        condition: service_started

@MohamedBassem added the bug label on Jul 5, 2024
@MohamedBassem
Collaborator

Yeah, the culprit here is CRAWLER_FULL_PAGE_ARCHIVE=true. I intentionally made archiving the last step in the crawler worker so that if it fails, it doesn't impact the crawler's other responsibilities.
I'll need to dig deeper to understand why the rename doesn't work. For now, disabling full page archives should solve your issue.
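
For context on the error itself: EXDEV is what the operating system returns when a rename would cross a filesystem or mount boundary, which matches the '/tmp/...' -> '/data/assets/...' paths in the log. A common way to handle it is to fall back to copy-then-delete. The sketch below illustrates that pattern only; the thread does not show the actual fix, and `moveFileSync` and its injectable `rename` parameter are hypothetical names introduced here so the fallback can be exercised without two real filesystems.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper, not taken from the Hoarder codebase.
// Tries a cheap rename first and falls back to copy + unlink on EXDEV.
function moveFileSync(
  src: string,
  dest: string,
  rename: (from: string, to: string) => void = fs.renameSync,
): "renamed" | "copied" {
  fs.mkdirSync(path.dirname(dest), { recursive: true });
  try {
    rename(src, dest);
    return "renamed";
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code !== "EXDEV") {
      throw err; // only the cross-device case is handled here
    }
    fs.copyFileSync(src, dest);
    fs.unlinkSync(src);
    return "copied";
  }
}

// Demo: simulate the cross-device failure with a stub rename.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "move-demo-"));
const demoSrc = path.join(demoDir, "asset.tmp");
const demoDest = path.join(demoDir, "assets", "asset.bin");
fs.writeFileSync(demoSrc, "archived page");

const failWithExdev = (): void => {
  throw Object.assign(new Error("EXDEV: cross-device link not permitted"), {
    code: "EXDEV",
  });
};
const how = moveFileSync(demoSrc, demoDest, failWithExdev);
console.log(how, fs.readFileSync(demoDest, "utf8")); // prints: copied archived page
```

Unlike a rename, the copy fallback is not atomic, so a reader could observe a partially written destination file while the copy is in progress.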

@MohamedBassem
Collaborator

I managed to repro the problem and I think I have a fix. Perfect timing before the release, thanks for the report!
