Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

device.queue.read_buffer times out #650

Closed
wpmed92 opened this issue Dec 2, 2024 · 8 comments · Fixed by #651
Closed

device.queue.read_buffer times out #650

wpmed92 opened this issue Dec 2, 2024 · 8 comments · Fixed by #651
Labels
bug Something isn't working

Comments

@wpmed92
Copy link
Contributor

wpmed92 commented Dec 2, 2024

Describe the bug

device.queue.read_buffer times out

To Reproduce

git clone https://github.com/tinygrad/tinygrad.git
cd tinygrad
python3 -m pip install -e .
PYTHONPATH=. WEBGPU=1 python3 examples/stable_diffusion.py --seed 0 --steps 1

Output:

File "/Users/ahmedharmouche/Documents/tinygrad/tinygrad/runtime/ops_webgpu.py", line 56, in _copyout
buffer_data = self.dev.queue.read_buffer(src, 0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ahmedharmouche/Documents/tinygrad/venv/lib/python3.12/site-packages/wgpu/backends/wgpu_native/_api.py", line 3427, in read_buffer
tmp_buffer._map("READ_NOSYNC").sync_wait()
File "/Users/ahmedharmouche/Documents/tinygrad/venv/lib/python3.12/site-packages/wgpu/backends/wgpu_native/_helpers.py", line 268, in sync_wait
return self.finish()
^^^^^^^^^^^^^
File "/Users/ahmedharmouche/Documents/tinygrad/venv/lib/python3.12/site-packages/wgpu/backends/wgpu_native/_helpers.py", line 260, in finish
raise RuntimeError(f"Waiting for {self.title} timed out.")
RuntimeError: Waiting for buffer.map timed out.
zsh: segmentation fault PYTHONPATH=. WEBGPU=1 python3 examples/stable_diffusion.py --seed 0 --steps 1

Observed behavior

When running tinygrad stable_diffusion.py, the buffer read times out when trying to get the output of the decode step. But it is not the buffer reading that takes that long, but to actually run the compute. Manually increasing the timeout from 5.0s here solves it, but in 0.18.1 this just worked (the timeout wasn't there?). Now, we have stable diffusion working on faster machines, but on my local computer, it times out, so I have to manually increase this timeout value, so for now we downgraded to 0.18.1.
Can this timeout be increased/disabled?

Your environment

OS: MacOS Sonoma 14.4.1
Python version: 3.12
wgpu-py version: >=0.19.0
wgpu backend: Metal

@almarklein
Copy link
Member

Hi Ahmed, thanks for this issue!

Ah, when I added the timeout, I thought it was as good idea in case for some reason wgpu-native did not fulfill the promise. I figured 5 s is plenty of time, but I had not considered it could simply be waiting for computations 🤦

I think it makes sense to remove the timeout altogether. Probably should get #631 merged first, because it touches the same code.

@almarklein
Copy link
Member

cc @fyellin

@wpmed92
Copy link
Contributor Author

wpmed92 commented Dec 3, 2024

Thank you for the response @almarklein, once the timeout is removed, we'll bump the version back to latest.

@hmaarrfk
Copy link
Contributor

hmaarrfk commented Dec 9, 2024

I figured 5 s is plenty of time, but I had not considered it could simply be waiting for computations 🤦

This is something we we struggle with too.

I've never used async frameworks but do they "raise errors" or just "die silently"?

@wpmed92
Copy link
Contributor Author

wpmed92 commented Dec 9, 2024

@almarklein thank you for fixing it! Do you have an eta for this to ship?

@Korijn
Copy link
Collaborator

Korijn commented Dec 9, 2024

I've never used async frameworks but do they "raise errors" or just "die silently"?

Raise errors.

@almarklein
Copy link
Member

I've never used async frameworks but do they "raise errors" or just "die silently"?

Longer answer:

Co-routines by themselves can die silently. Something has to take responsibility for checking whether they failed and take appropriate action. From what I understand one of the main points of Trio is to force you to write code that does this. Asyncio detects when a task raises an exception without anyone checking, and will raise an exception if that happens with Task exception was never retrieved. In rendercanvas all tasks a re wrapped in an error-logger thingy.

@almarklein
Copy link
Member

Do you have an eta for this to ship?

Probably tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants