error: could not write config file /root/.gitconfig: Device or resource busy - running clearml-agent in docker mode #193

Open
AH-Merii opened this issue Mar 7, 2024 · 3 comments

Comments

@AH-Merii

AH-Merii commented Mar 7, 2024

Description

When clearml-agent executes a task inside a Docker container, any operation that writes to the .gitconfig file fails. Specifically, the command git config --global --replace-all safe.directory '*' fails with the error message could not write config file /root/.gitconfig: Device or resource busy. The failure persists even though manual access, read, and write tests against /root/.gitconfig succeed when run inside the container.

The failure to write to .gitconfig seems to occur only during the execution of automated tasks by clearml-agent, suggesting a possible issue with how file access or locking is managed in the context of Docker containers orchestrated by clearml-agent.
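
One hypothesis that would fit these symptoms (an assumption, not confirmed anywhere in this thread): if /root/.gitconfig is bind-mounted into the container as a single file, in-place writes succeed while replacing the file through a rename fails with EBUSY. Git updates config files by writing a sibling .lock file and renaming it over the target, whereas a typical manual test such as appending with a shell redirect writes in place. The sketch below is a simplified imitation of both write modes so they can be compared inside the container; the function names are illustrative and are not part of Git or clearml-agent.

```python
import errno
import os


def write_in_place(path, text):
    """Append to the existing file, the way a shell `>>` redirect does."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(text)


def write_via_rename(path, text):
    """Write a sibling lock file, then rename it over the target.

    Simplified imitation of a lock-and-rename update; not Git's actual code.
    """
    lock_path = path + ".lock"
    with open(path, "r", encoding="utf-8") as fh:
        current = fh.read()
    with open(lock_path, "w", encoding="utf-8") as fh:
        fh.write(current + text)
    try:
        os.replace(lock_path, path)
    except OSError:
        os.unlink(lock_path)  # do not leave a stale lock file behind
        raise


def probe(path):
    """Try both write modes with empty text and report the outcome of each."""
    result = {}
    for name, fn in (("in_place", write_in_place), ("rename", write_via_rename)):
        try:
            fn(path, "")
            result[name] = "ok"
        except OSError as exc:
            # Map the errno to its symbolic name, e.g. EBUSY or EACCES.
            result[name] = errno.errorcode.get(exc.errno, str(exc.errno))
    return result
```

Calling probe("/root/.gitconfig") from within the task container would show whether only the rename path fails. On an ordinary file both modes should report "ok"; a "rename" entry of EBUSY would support the bind-mount hypothesis.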

Steps to Reproduce

  1. Execute a clearml-agent task within a Docker container that requires Git operations.
  2. The task fails when attempting to globally configure Git to recognize all directories as safe, with the specific command being git config --global --replace-all safe.directory '*'.

Additional Context

  • We have enabled GIT_TRACE=1 for more detailed output on Git operations.
  • The issue appears to be related to the clearml-agent's interaction with the .gitconfig file within Docker containers, particularly concerning file locking or access permissions.
  • Deleting the vcs-cache directory (/root/.clearml/vcs-cache in the logs) allows the task to proceed successfully, suggesting the problem may be linked to the caching mechanism or file access within this cache.
  • This behavior raises concerns about potential issues with file locking, .gitconfig access, or interactions between Docker, the clearml-agent, and Git within the containerized environment.
  • The agent is running on an EC2 instance and we are using environment variables to configure the agent:
export CLEARML_AGENT_GIT_USER=<user_name>
export CLEARML_AGENT_GIT_PASS=<github_pat_our_pat_token>
export CLEARML_EXTRA_PIP_INSTALL_FLAGS="--extra-index-url=https://<aws_account_id>.d.codeartifact.eu-central-1.amazonaws.com/pypi/st-python-packages/simple/"
export CLEARML_API_HOST="https://api.clearml.<address>.com"
export CLEARML_WEB_HOST="https://app.clearml.<address>.com"
export CLEARML_FILES_HOST="https://files.clearml.<address>"
export CLEARML_API_ACCESS_KEY=<access_key>
export CLEARML_API_SECRET_KEY=<secret_key>
export CLEARML_DEFAULT_OUTPUT_URI="s3://our_bucket"
export CLEARML_DOCKER_IMAGE="<aws_account_id>.dkr.ecr.eu-central-1.amazonaws.com/python-secure:3.10-slim"

# Collect all environment variables starting with CLEARML and join them with a comma
CLEARML_ENV_VARS=$(env | grep ^CLEARML | cut -d '=' -f 1 | tr '\n' ',' | sed 's/,$//')
# Set the CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV variable with the collected names
export CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV=$CLEARML_ENV_VARS
export CLEARML_WORKER_NAME=""
export CLEARML_WORKER_ID=""
export CLEARML_AGENT_EXTRA_DOCKER_ARGS=""
  • We pass the PAT to the agent through the CLEARML_AGENT_GIT_PASS environment variable.
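
For reference, the name-collection step in the shell snippet above can be sketched in Python. Two properties of the shell version are worth keeping in mind: it only sees variables exported before that line runs (so CLEARML_WORKER_NAME, CLEARML_WORKER_ID, and CLEARML_AGENT_EXTRA_DOCKER_ARGS are not included), and because it is line-based, a multi-line value containing a line that starts with CLEARML would be picked up by mistake. The helper name below is hypothetical, and sorting the names assumes their order does not matter to the agent.

```python
import os


def clearml_env_names(environ=None, prefix="CLEARML"):
    """Return a comma-separated list of variable names starting with `prefix`."""
    env = os.environ if environ is None else environ
    # Iterating a mapping yields names only, so values are never inspected.
    return ",".join(sorted(name for name in env if name.startswith(prefix)))
```

The returned string is the value one would export as CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV.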

Environment

  • clearml-agent version: 1.7.0
  • Docker image: python:3.10-slim
  • Host OS: Ubuntu 22.04

Error Logs

::: Using Cached environment /root/.clearml/venvs-cache/d99b7ac78c9f00157b7d88b26e395d7e :::
11:27:21.197479 git.c:460               trace: built-in: git config --global --replace-all safe.directory '*'
error: could not write config file /root/.gitconfig: Device or resource busy
Using cached repository in "/root/.clearml/vcs-cache/md-ap-feature-engineering.git.07c9b3f5f387de85ee33f17cae806c1f/md-ap-feature-engineering.git"
11:27:21.200445 git.c:460               trace: built-in: git fetch --all --recurse-submodules
11:27:21.200831 run-command.c:655       trace: run_command: GIT_DIR=.git git remote-https origin https://github.com/silencetherapeutics/md-ap-feature-engineering.git
11:27:21.201988 git.c:750               trace: exec: git-remote-https origin https://github.com/silencetherapeutics/md-ap-feature-engineering.git
11:27:21.202020 run-command.c:655       trace: run_command: git-remote-https origin https://github.com/silencetherapeutics/md-ap-feature-engineering.git
fatal: could not read Username for 'https://github.com': terminal prompts disabled
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 128.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/__main__.py", line 87, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/__main__.py", line 83, in main
    return run_command(parser, args, command_name)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/__main__.py", line 46, in run_command
    return func(**args_dict)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/base.py", line 63, in newfunc
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/worker.py", line 2611, in execute
    directory, vcs, repo_info = self.get_repo_info(
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/worker.py", line 2883, in get_repo_info
    vcs, repo_info = self._get_repo_info(execution, task, venv_folder)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/commands/worker.py", line 2919, in _get_repo_info
    vcs, repo_info = clone_repository_cached(
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 781, in clone_repository_cached
    vcs.pull()
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 599, in pull
    self.call("fetch", "--all", "--recurse-submodules", cwd=self.location)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 659, in call
    return self._git_pass_auth_wrapper(super(Git, self).call, *argv, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 612, in _git_pass_auth_wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 435, in call
    return self._call_subprocess(subprocess.check_call, argv, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/repo.py", line 495, in _call_subprocess
    return command.call_subprocess(func, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/clearml_agent/helper/process.py", line 246, in call_subprocess
    return func(list(self), *args, **kwargs)
  File "/usr/local/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 128
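
Note that the log above contains two separate Git failures: the error: line from git config, after which the agent continues, and the fatal: line from git fetch, which is what raises the CalledProcessError. Whether the first causes the second is not established here. A minimal sketch for pulling both kinds of line out of a saved agent log, assuming the plain-text format shown above (the function name is illustrative):

```python
import re

# Git prefixes its diagnostics with "error: " or "fatal: " at line start.
GIT_MESSAGE = re.compile(r"^(error|fatal): (.*)$")


def git_messages(log_text):
    """Return (severity, message) pairs for Git `error:` and `fatal:` lines."""
    found = []
    for line in log_text.splitlines():
        match = GIT_MESSAGE.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found
```

Listing the messages in order makes it easier to see which one immediately precedes the non-zero exit status.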

Potential Areas for Investigation

  • Interactions between Docker volume mounts (especially for .gitconfig and the vcs-cache directory) and the clearml-agent's file handling.
  • How the clearml-agent manages Git configurations and operations within Docker containers, particularly regarding global settings and cached environments.
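
To check the volume-mount angle, one option is to inspect /proc/self/mountinfo from inside the task container and see whether the paths in question are themselves mount points. The sketch below is a minimal parser for that file, assuming the standard Linux layout (mount root in the fourth field, mount point in the fifth, filesystem type after the " - " separator). It only matches exact mount-point paths and does not decode escaped characters such as \040.

```python
def mount_points(mountinfo_text):
    """Parse /proc/self/mountinfo content into {mount_point: (root, fs_type)}."""
    mounts = {}
    for line in mountinfo_text.splitlines():
        # Optional fields vary in number, so split at the " - " separator first.
        left, sep, right = line.partition(" - ")
        fields = left.split()
        if not sep or len(fields) < 5:
            continue  # skip blank or malformed lines
        root, mount_point = fields[3], fields[4]
        after = right.split()
        fs_type = after[0] if after else ""
        mounts[mount_point] = (root, fs_type)
    return mounts


def is_mounted(path, mountinfo_text):
    """True if `path` is itself a mount point (exact match only)."""
    return path in mount_points(mountinfo_text)


def read_mountinfo(path="/proc/self/mountinfo"):
    """Read the mount table of the current process (Linux only)."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()
```

Inside the container, is_mounted("/root/.gitconfig", read_mountinfo()) returning True would indicate that the file is a single-file mount rather than a regular file in the container's filesystem.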

@jkhenning
Member

Hi @AH-Merii,

Are you running the container as non-root?

@AH-Merii
Author

Hey @jkhenning,

No, the container is running as root.

@jkhenning
Member

Hi @AH-Merii,

Try deleting ~/.gitconfig on the host machine and see if that works.
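
A reversible variant of this suggestion is to move the host file aside instead of deleting it, so it can be restored afterwards. The sketch below is a hypothetical helper (not part of clearml-agent) and assumes it is run on the host as the same user that launches the agent.

```python
import os
import shutil
import time


def set_aside_gitconfig(home=None):
    """Move ~/.gitconfig to a timestamped backup.

    Returns the backup path, or None if there was no file to move.
    """
    home = os.path.expanduser("~") if home is None else home
    source = os.path.join(home, ".gitconfig")
    if not os.path.lexists(source):
        return None  # nothing to move
    backup = "{}.bak-{}".format(source, time.strftime("%Y%m%d-%H%M%S"))
    shutil.move(source, backup)
    return backup
```

After calling set_aside_gitconfig(), restart the agent and re-run the task; moving the backup back to ~/.gitconfig restores the original state.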
