Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] - unit-test conda-store-server (ubuntu) hangs on CI #666

Closed
nkaretnikov opened this issue Nov 18, 2023 · 18 comments
Closed

[BUG] - unit-test conda-store-server (ubuntu) hangs on CI #666

nkaretnikov opened this issue Nov 18, 2023 · 18 comments
Labels
area: container 📦 needs: discussion 💬 This item needs team-level discussion before scoping type: bug 🐛 Something isn't working

Comments

@nkaretnikov
Copy link
Contributor

Describe the bug

Open a no-op PR against current main a5680f8, like in #665.
unit-test conda-store-server (ubuntu) hangs.

Expected behavior

Test passes, no hang.

How to Reproduce the problem?

.

Output

No response

Versions and dependencies used.

No response

Anything else?

No response

@nkaretnikov nkaretnikov added the type: bug 🐛 Something isn't working label Nov 18, 2023
@nkaretnikov nkaretnikov self-assigned this Nov 18, 2023
@nkaretnikov
Copy link
Contributor Author

nkaretnikov commented Nov 18, 2023

Currently looking into this.

  • Test that hangs is test_generate_conda_docker
  • It uses action_generate_conda_docker, which relies on conda-docker
  • In conda-docker, the problematic function is chroot_install
  • I think something is different in the environment now, which changed its behavior
  • I modified the code like this to see the output:
    subprocess.check_output(
        [
            "fakechroot",
            "chroot",
            new_root,
            "/_conda.exe",
            "install",
            "--offline",
            "--file",
            "/opt/conda/pkgs/env.txt",
            "-y",
            "--prefix",
            "/opt/conda",
        ],
        env=env,
        cwd=host_conda_opt,
        # stdout=subprocess.DEVNULL,
        # stderr=subprocess.DEVNULL,
    )
  • On my machine where the test passes, it prints:
conda_package_streaming/package_streaming.py:25: UserWarning: zstandard could not be imported. Running without .conda support.
conda_package_handling/api.py:29: UserWarning: Install zstandard Python bindings for .conda support
Error while loading conda entry point: conda-libmamba-solver (libmamba.so.2: cannot open shared object file: No such file or directory)
  • On CI, it prints a bunch of warnings instead:
[6472] WARNING: file already exists but should not: /_brotli.cpython-39-x86_64-linux-gnu.so
[6472] WARNING: file already exists but should not: /_ruamel_yaml.cpython-39-x86_64-linux-gnu.so
[6472] WARNING: file already exists but should not: /lib-dynload/_asyncio.cpython-39-x86_64-linux-gnu.so
[6472] WARNING: file already exists but should not: /lib-dynload/_bisect.cpython-39-x86_64-linux-gnu.so
[6472] WARNING: file already exists but should not: /lib-dynload/_blake2.cpython-39-x86_64-linux-gnu.so
[6472] WARNING: file already exists but should not: /lib-dynload/_bz2.cpython-39-x86_64-linux-gnu.so
[6472] WARNING: file already exists but should not: /lib-dynload/_codecs_cn.cpython-39-x86_64-linux-gnu.so
[6472] WARNING: file already exists but should not: /lib-dynload/_codecs_hk.cpython-39-x86_64-linux-gnu.so
[6472] WARNING: file already exists but should not: /lib-dynload/_codecs_iso2022.cpython-39-x86_64-linux-gnu.so
  • I also modified action in base.py and removed stdout/stdin redirection
  • On CI, I also inserted a step to be able to SSH into a builder:
    steps:
      # https://github.com/marketplace/actions/debugging-with-tmate
      - uses: actions/checkout@v3
      - name: Setup tmate session
        uses: mxschmitt/action-tmate@v3
        with:
          limit-access-to-actor: true
          timeout-minutes: 10
          detached: true
  • On CI, the repro is:
/usr/share/miniconda3/envs/test/bin/conda run --no-capture-output -p /usr/share/miniconda3/envs/test  /bin/sh -c 'pytest tests -vvv -k test_generate_conda_docker --capture=no'
    # FIXME: this should reall be check_output(), but chroot or fakechroot is
    # giving some weird segfault after the install command completes ¯\_(ツ)_/¯

@nkaretnikov
Copy link
Contributor Author

Reported upstream: conda-incubator/conda-docker#23

@nkaretnikov
Copy link
Contributor Author

When #653 was merged 3 days ago, all tests passed (not skipped). Will diff the test logs to see if anything is different now.

@nkaretnikov
Copy link
Contributor Author

nkaretnikov commented Nov 20, 2023

These messages are printed by PyInstaller:

WARNING: file already exists but should not:

See https://github.com/pyinstaller/pyinstaller/blob/554689c49e773d309a4fc7af61ad135464524c62/bootloader/src/pyi_utils.c#L639

You can set this env var to turn warnings into errors, however, conda-docker won't fail as is because no exit-code checking is done

PYINSTALLER_STRICT_UNPACK_MODE=1

See https://github.com/pyinstaller/pyinstaller/blob/554689c49e773d309a4fc7af61ad135464524c62/bootloader/src/pyi_utils.c#L552

@nkaretnikov
Copy link
Contributor Author

nkaretnikov commented Nov 20, 2023

The above warning repros on a CI builder after re-running the problematic test (to populate the test dir) and doing:

fakechroot chroot /tmp/test /_conda.exe --help

I changed tmpdir to /tmp/test in build_docker_environment_image in conda-docker.

The full command (not the help one) works fine without fakechroot.

Looking at strace, this keeps executing new processes, which is why it hangs:

strace -f --trace=vfork,execve fakechroot chroot /tmp/test /_conda.exe --help 2>&1 | grep -v WARN
[pid 43781] execve("/tmp/test/_conda.exe", ["/_conda.exe", "--help"], 0x56212379b090 /* 138 vars */) = 0
strace: Process 43785 attached
[pid 43785] execve("/tmp/test/_conda.exe", ["/_conda.exe", "--help"], 0x562482135e10 /* 138 vars */) = 0
strace: Process 43789 attached
[pid 43789] execve("/tmp/test/_conda.exe", ["/_conda.exe", "--help"], 0x55b5460f3e10 /* 138 vars */) = 0
strace: Process 43792 attached
...

You can also see this in the process tree.

@nkaretnikov
Copy link
Contributor Author

I can reproduce locally outside of CI:

fakechroot chroot /home/nkaretnikov/.conda/envs/conda-store-dev-old/standalone_conda /conda.exe --version
...
[2174441] WARNING: file already exists but should not: /libmambapy-1.4.1.dist-info/direct_url.json
[2174441] WARNING: file already exists but should not: /libmambapy-1.4.1.dist-info/top_level.txt
conda_package_streaming/package_streaming.py:25: UserWarning: zstandard could not be imported. Running without .conda support.
conda_package_handling/api.py:29: UserWarning: Install zstandard Python bindings for .conda support
Error while loading conda entry point: conda-libmamba-solver (libmamba.so.2: cannot open shared object file: No such file or directory)
conda 23.7.2

So, as mentioned before, locally this fails due to libmamba.so.2 not being found, which means the process doesn't loop like on CI.

Still trying to figure out what causes the loop. I suspect it might have something to do with LD environment variables. fakechroot relies on LD_PRELOAD and LD_LIBRARY_PATH is also set (I copied env and grep to the chroot dir to make them accessible):

fakechroot -s chroot /tmp/test/ /sh -c '/env | /grep LD'
FAKECHROOT_LDLIBPATH=/usr/lib:/lib
LD_LIBRARY_PATH=/tmp/test/usr/lib:/tmp/test/lib:/usr/lib:/lib:/tmp/test/usr/lib:/tmp/test/lib
LD_PRELOAD=libfakechroot.so

PyInstaller, which is used by standalone conda, also uses LD_LIBRARY_PATH. People were having issues related to that in that past. Again, it could be that fakechroot just doesn't work with PyInstaller and we didn't notice because libmamba import was failing.

Could be that the versions matter:

  • local (older): conda 23.7.2
  • CI: conda 23.10.0

I'll try with the same version locally and report back.

@nkaretnikov

This comment was marked as outdated.

@nkaretnikov
Copy link
Contributor Author

I think this might be the cause:

Unlikely; with multiprocessing, the problem is that frameworks tend assume that sys.executable contains python interpreter, so they spawn sub-processes using it in order to run some python command or script. In frozen application, sys.executable points to your frozen executable, so you are spawning an additional copy of your program. Our hooks for multiprocessing monkey-patch multiprocessing.freeze() with custom function that detects such spawned subprocesses (based on sys.argv) and divert the program flow accordingly. That's why multiprocessing works (as long as you call freeze_support), but other frameworks (or custom subbprocess.Popen((sys.executable, ...), ...) do not (unless you explicitly handle the subprocess codepaths in your entry-point script).

pyinstaller/pyinstaller#7165 (comment)

@nkaretnikov
Copy link
Contributor Author

nkaretnikov commented Nov 20, 2023

Yes, it's a PyInstaller/fakechroot-specific problem.

  • I created a Python program that does: print("hello world")
  • Used PyInstaller as pyinstaller --onefile main.py
  • Then fakechroot chroot ~/tmp/piapp /dist/main and it loops with warnings as well.

@nkaretnikov
Copy link
Contributor Author

I've diffed the logs that passed before against the recent ones that fail. I think it might be due to a change in the conda-standalone package that's installed by GitHub Actions:

  + conda-standalone                    [-23.9.0-]{+23.10.0+}  ha770c72_0          conda-forge/linux-64      [-34MB-]{+32MB+}

Again, the cause is the PyInstaller/fakechroot thing. But the above change is what's likely highlighted the problem.

I'm going to try reproducing with an old version of conda-standalone.

@nkaretnikov
Copy link
Contributor Author

On CI, I tried adding the following to conda-store-server/environment-dev.yaml:

- conda-standalone ==23.9.0

As expected, this makes the tests "pass", but only because:

  1. there's an error when executing conda-docker, which makes the subprocess exit
  2. no exit-code checking is done in conda-docker, so an early exit due to an error is indistinguishable from a success
  3. test_generate_conda_docker currently doesn't validate conda-docker's output in any way.

Here's an actual test result from CI after enabling error checking and enabling stdout/stderr printing:

tests/test_actions.py::test_generate_conda_docker[conda_prefix0] INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 48be4072fe58, initial schema
INFO  [alembic.runtime.migration] Running upgrade 48be4072fe58 -> 8d63a091aff8, Add Environment.description
INFO  [alembic.runtime.migration] Running upgrade 8d63a091aff8 -> 5ad723de2abd, Adding CONTAINER_REGISTRY value to enum
INFO  [alembic.runtime.migration] Running upgrade 5ad723de2abd -> 16f65805dc8f, split conda_package into conda_package and conda_package_build
INFO  [alembic.runtime.migration] Running upgrade 16f65805dc8f -> abd7248d5327, adding a settings table
INFO  [alembic.runtime.migration] Running upgrade abd7248d5327 -> b387747ca9b7, role mapping
INFO  [alembic.runtime.migration] Running upgrade b387747ca9b7 -> d78e9889566a, add status_info
WARNI [traitlets] registry https://registry-1.docker.io not configured using registry without authentication
[6138] Error loading Python lib '/_internal/libpython3.9.so.1.0': dlopen: /tmp/tmpp1ab61zw/_internal/libpython3.9.so.1.0: cannot open shared object file: No such file or directory
FAILED

Changes I had to make to get this:

diff --git a/conda-store-server/conda_store_server/action/base.py b/conda-store-server/conda_store_server/action/base.py
index e0c5e7a..41a2aae 100644
--- a/conda-store-server/conda_store_server/action/base.py
+++ b/conda-store-server/conda_store_server/action/base.py
@@ -16,10 +16,10 @@ def action(f: typing.Callable):
         action_context = ActionContext()
         with contextlib.ExitStack() as stack:
             # redirect stdout -> action_context.stdout
-            stack.enter_context(contextlib.redirect_stdout(action_context.stdout))
+            # stack.enter_context(contextlib.redirect_stdout(action_context.stdout))

             # redirect stderr -> action_context.stdout
-            stack.enter_context(contextlib.redirect_stderr(action_context.stdout))
+            # stack.enter_context(contextlib.redirect_stderr(action_context.stdout))

And changed the fakechroot command as follows in /usr/share/miniconda3/envs/test/lib/python3.10/site-packages/conda_docker/conda.py:

    subprocess.check_output(
        [
            "fakechroot",
            "chroot",
            new_root,
            "/_conda.exe",
            "install",
            "--offline",
            "--file",
            "/opt/conda/pkgs/env.txt",
            "-y",
            "--prefix",
            "/opt/conda",
        ],
        env=env,
        cwd=host_conda_opt,
        # stdout=subprocess.DEVNULL,
        # stderr=subprocess.DEVNULL,
    )

@nkaretnikov
Copy link
Contributor Author

nkaretnikov commented Nov 21, 2023

  • I'm going to skip the problematic test test_generate_conda_docker since it's not working anyway, just to unblock other work. I had to disable the docker action as well because that's what actually causing the problem (it's used by other code and can lead to a DoS due to a huge number of subprocesses being spawned)
  • I'll try to fix the issue in conda-docker and update the said test, but as a separate issue/PR. The said test would need to check that there are necessary packages in a docker container.

nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Nov 21, 2023
nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Nov 21, 2023
@nkaretnikov
Copy link
Contributor Author

Jaime says conda-standalone (_conda.exe above from conda-docker) is not production-ready. It shouldn't be used. Jaime suggests to use micromamba instead.

@jaimergp
Copy link
Member

xref conda-incubator/conda-docker#23

@nkaretnikov
Copy link
Contributor Author

Status update: #667 has landed, which disables conda-docker in conda-store, but a fix still needs to be implemented in conda-docker.

@nkaretnikov
Copy link
Contributor Author

Marked as blocked because I need to focus on other tasks for now.

@nkaretnikov nkaretnikov moved this from TODO 📬 to Blocked ⛔️ in conda-store 🐍 Dec 13, 2023
@nkaretnikov nkaretnikov moved this from Blocked ⛔️ to TODO 📬 in conda-store 🐍 Jan 23, 2024
@nkaretnikov nkaretnikov added this to the challenges: Round 2 🚀 milestone Jan 28, 2024
@nkaretnikov nkaretnikov removed their assignment Mar 18, 2024
@nkaretnikov nkaretnikov removed this from the challenges: Round 2 🚀 milestone Mar 18, 2024
@nkaretnikov nkaretnikov added this to the challenges: Round 2 🚀 milestone Mar 18, 2024
@nkaretnikov nkaretnikov added project: challenges needs: discussion 💬 This item needs team-level discussion before scoping labels Mar 18, 2024
@nkaretnikov
Copy link
Contributor Author

Tania said we need to discuss whether we will be supporting Docker artifact generation going forward.

@nkaretnikov nkaretnikov moved this from TODO 📬 to Follow up 📤 in conda-store 🐍 Mar 18, 2024
@trallard trallard removed this from the challenges: Round 2 🚀 milestone Apr 30, 2024
@trallard
Copy link
Collaborator

This is no longer an issue so will close this

@github-project-automation github-project-automation bot moved this from Follow up 📤 to Done 💪🏾 in conda-store 🐍 Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: container 📦 needs: discussion 💬 This item needs team-level discussion before scoping type: bug 🐛 Something isn't working
Projects
Archived in project
Development

No branches or pull requests

3 participants