
Review zipping tool #2215

Closed
tcompa opened this issue Jan 27, 2025 · 9 comments

tcompa commented Jan 27, 2025

Context:

  1. In a specific instance, we recently observed a clear correlation between:
    • The zipping of a somewhat large job folder (22k files, 180M unzipped)
    • A worker timeout (which means the worker did not update its corresponding tmp file)
  2. In the past, we observed a (non-reproducible) error upon zipping a large job folder - ref Failure in zipping large folder (9k images, 900M) #1955.
  3. Our zipping tool is currently based on Python zipfile.
  4. When working on the SSH runner (which also involves creating compressed folders to be sent back and forth), we moved from Python tarfile to a subprocess.run wrap of native tools (that is, tar).
    • We observed that the Python tarfile library was considerably slower than the native tools, and attributed at least part of this difference to the large number of additional os.stat calls. Note that this difference became more evident in our environment (a CEPH filesystem with a cold cache).
    • For the SSH runner, we then moved from tarfile to tar, see e.g. "1596: review use of tar/tarfile in compress_folder.py module" #1641.

We don't have a controlled way of reproducing the issue in point 1, but all the items above suggest that we should also move the current zipfile-based zipping operations to subprocess.run wraps of gzip. In the worst case, we are just improving the performance of one operation. In the best case, we are fixing the issue at point 1.
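
For reference, here is a minimal sketch of what such a subprocess.run-based wrapper could look like (illustrative only, with hypothetical names; the SSH runner already uses a similar tar-based approach, ref #1641):

import shlex
import subprocess


def compress_folder(folder: str, archive_path: str) -> None:
    # Delegate the directory walk and the compression to the native tar tool,
    # rather than Python tarfile/zipfile, avoiding the extra per-file os.stat calls.
    cmd = f"tar czf {archive_path} -C {folder} ."
    subprocess.run(shlex.split(cmd), check=True, capture_output=True)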


tcompa commented Jan 27, 2025

Minor update in terms of finding the right explanation:
The zipping operation is likely blocking. If this is the case, that worker cannot perform other tasks (e.g. sending a heartbeat) for the time it takes to complete the zip operation.
(ref #1507)

Mitigation 1 is clearly "make it faster".
Mitigation 2 (if needed) is to also make it async.
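
For concreteness, a minimal sketch of mitigation 2 (names are illustrative; zip_job_folder stands in for the current blocking implementation): the blocking call is offloaded to a worker thread via asyncio.to_thread, so the event loop stays free to send heartbeats and serve requests.

import asyncio


def zip_job_folder(folder: str) -> None:
    # Placeholder for the current (blocking) zipfile-based implementation.
    ...


async def zip_job_folder_async(folder: str) -> None:
    # asyncio.to_thread (Python >= 3.9) runs the blocking call in a separate
    # thread, without blocking the event loop.
    await asyncio.to_thread(zip_job_folder, folder)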


jluethi commented Jan 27, 2025

The zipping operation is likely blocking

Oh, that's relevant for sure. Yes, making this a non-blocking operation in an async approach would be very valuable, to ensure we don't run into worker-timeout issues when things scale unexpectedly (larger data provenance for some reason, a job with much larger logs than usual, more images being processed, etc.)


tcompa commented Jan 27, 2025

In the current concurrency model (see #2218), we need to be careful about making the zipping function async and non-blocking.

Here are three examples of how to run a bash command:

  1. subprocess.run running from threads --> non-blocking
  2. async function that calls subprocess.run running in event loop --> blocking
  3. asyncio.create_subprocess_exec running in event loop --> non-blocking

We are currently close to situation 2 (we don't run subprocess.run, but other blocking functions), which means the whole zipping operation is blocking.
We will move to 3 right away.
Later, when fixing #2218, we will even simplify things by moving to 1.

from concurrent.futures import ThreadPoolExecutor
import subprocess
import shlex
import time
import asyncio


def function(sleeptime):
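    # Plain subprocess.run: it blocks the calling thread, but when submitted to a
    # ThreadPoolExecutor it does not block the main thread / event loop.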
    subprocess.run(
        shlex.split(f"sleep {sleeptime}"),
        capture_output=True,
        check=True,
    )


async def function_async_subprocess(sleeptime):
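    # subprocess.run inside an async def still blocks the event loop while it runs,
    # so concurrent tasks cannot overlap.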
    subprocess.run(
        shlex.split(f"sleep {sleeptime}"),
        capture_output=True,
        check=True,
    )


async def function_async_create_subprocess_exec(sleeptime):
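    # asyncio.create_subprocess_exec hands control back to the event loop while the
    # child process runs, so concurrent tasks do overlap.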
    proc = await asyncio.create_subprocess_exec(*shlex.split(f"sleep {sleeptime}"))
    stdout, stderr = await proc.communicate()
    print()


async def run_it_all_subprocess(sleeptime):
    task1 = asyncio.create_task(function_async_subprocess(sleeptime))
    task2 = asyncio.create_task(function_async_subprocess(sleeptime))
    await task1
    await task2


async def run_it_all_create_subprocess_exec(sleeptime):
    task1 = asyncio.create_task(function_async_create_subprocess_exec(sleeptime))
    task2 = asyncio.create_task(function_async_create_subprocess_exec(sleeptime))
    await task1
    await task2


if __name__ == "__main__":
    print("BLOCK 1")
    t0 = time.perf_counter()
    sleeptime = 1
    with ThreadPoolExecutor(max_workers=10) as executor:
        fut1 = executor.submit(function, sleeptime)
        fut2 = executor.submit(function, sleeptime)

    t1 = time.perf_counter()
    print(f"{t1-t0}")
    assert abs(t1 - t0 - sleeptime) < 0.1
    print()

    print("BLOCK 2")
    t0 = time.perf_counter()
    asyncio.run(run_it_all_subprocess(sleeptime))
    t1 = time.perf_counter()
    print(f"{t1-t0}")
    print()

    print("BLOCK 3")
    t0 = time.perf_counter()
    asyncio.run(run_it_all_create_subprocess_exec(sleeptime))
    t1 = time.perf_counter()
    print(f"{t1-t0}")
    assert abs(t1 - t0 - sleeptime) < 0.1
    print()


tcompa commented Jan 27, 2025

Question cc @jluethi

Right now we can download the zipped job folders in two ways:

  1. After the job is complete, we can download the zipped version of the final job folder (the one that is also kept on disk).
  2. During job execution, we can only download a partial zipped version of the job folder (e.g. it may include only some images, or only some tasks).

Use case 1 is obviously critical.
Is use case 2 also relevant?

  • If not, we could simplify the implementation a bit by removing the additional branch that has to create a temporary version of the zipped folder.
  • If yes, then the best option is likely to write the partial zip file to a tmp folder before streaming it through the API (see the sketch below).
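
To illustrate the second option, here is a rough sketch of writing the partial zip to a tmp folder and returning it through the API (purely illustrative: the endpoint, paths and helper names are hypothetical, and FastAPI is assumed).

import shutil
import tempfile
from pathlib import Path

from fastapi import APIRouter
from fastapi.responses import FileResponse

router = APIRouter()


@router.get("/job/{job_id}/download/")
def download_partial_job_folder(job_id: int) -> FileResponse:
    # Hypothetical path of the (possibly still running) job folder.
    job_folder = Path(f"/somewhere/job_{job_id}")
    tmp_dir = tempfile.mkdtemp()
    # shutil.make_archive writes <base_name>.zip and returns its path;
    # cleanup of tmp_dir is omitted here for brevity.
    zip_path = shutil.make_archive(
        base_name=str(Path(tmp_dir) / f"job_{job_id}"),
        format="zip",
        root_dir=str(job_folder),
    )
    # FileResponse streams the file without loading it fully into memory.
    return FileResponse(zip_path, filename=f"job_{job_id}.zip")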


jluethi commented Jan 27, 2025

I agree that use case 1 is more critical than use case 2, but use case 2 can sometimes be helpful. There are two things that make me hesitant about the conclusion here:
a. It's convenient to have the option of seeing the logs of what is currently running, to check whether long-running jobs are progressing nicely.
b. We may want to rework how we expose these logs anyway, as part of flexibility + provenance: it would be very convenient to have individual, web-based access to job logs, instead of just the zip download.

For me, a) argues that we may want to refactor so that the partial zip is put in the tmp folder etc. But b) argues that the need to download full zips in intermediate states will eventually go away, so we shouldn't invest too much into it.
Mid-term, I'm fine with dropping support for use case 2, provided we expose individual logs in the web interface in an OK manner.

tcompa changed the title from "Move zipping of job folders from Python zipfile to gzip" to "Review zipping tool" on Jan 28, 2025

tcompa commented Jan 28, 2025

Status update (with @mfranzon, as always..):

  1. We checked that switching to a different implementation (from Python zipfile to the zip command) does not immediately lead to a significant speedup - ref Compare zipping performance of zipfile and zip #2221. This can be explored further, but it's not the top priority now.
  2. We confirmed that the current concurrency scheme was not treating the zipping feature properly, resulting in blocking behavior for the given worker. This provides a reasonable explanation, in terms of worker timeouts, for both a recent event and a past one.
  3. Rather than making the zipping feature itself async, we are cleaning up a long-standing issue with our concurrency scheme for job execution (ref Review concurrency model for job execution #2218). The upcoming version makes it clear that the job-execution functions (starting from the top-level submit_workflow function) all run in a specific thread (or in other threads, if these are opened at lower levels) and never re-use the event loop that handles API requests (a minimal sketch is given after this list). In our understanding, this will prevent the kind of worker timeout we observed recently.
  4. We did not modify or remove the "download-partial-zip-folder" feature, which therefore remains identical to the current version. Warning: within this feature, the zipping must take place on-the-fly when the download-logs API request is handled, meaning that for large folders the endpoint may not be responsive. Note that this is the same behavior we currently have, and it is not due to any of the ongoing updates. It may become less relevant if we expose granular log downloads in a future version.
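
As a rough illustration of point 3 (not the actual fractal-server code; submit_workflow is used here only as a placeholder), the idea is that the job-execution entry point is dispatched to its own thread, so it never runs on the event loop that serves API requests:

import threading


def submit_workflow(job_id: int) -> None:
    # Placeholder for the job-execution entry point (which includes the
    # blocking zipping of the job folder at the end).
    ...


def dispatch_job(job_id: int) -> None:
    # Running the whole job in a dedicated thread means that blocking calls
    # (such as zipping) cannot stall the API event loop anymore.
    threading.Thread(target=submit_workflow, args=(job_id,)).start()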


jluethi commented Jan 28, 2025

Thanks for the summary!

Rather than making the zipping feature itself async, we are cleaning up a long-standing issue with our concurrency scheme for job execution

So, for my understanding: we wouldn't expect any worker timeouts due to zipping anymore with this server version? But we could still trigger them, for the time being, with the download-partial-zip-folder action (not a very common one)?


tcompa commented Jan 28, 2025

We wouldn't expect any worker timeouts due to zipping anymore with this server version?

Agreed.

But we could still trigger them with the download partial zip folder (not a very common action) for the time being?

That's true (and it did not change with this version). As you note, this only applies to downloading logs for an ongoing job (where the zip has to be created on-the-fly).

Ref #2223
