
Opening the shared memory failed, os error 24 #54

Open
meua opened this issue Apr 24, 2023 · 11 comments

Comments

@meua (Contributor) commented Apr 24, 2023

Describe the bug

frame:  (1080, 1920, 4)
img:  (1080, 1920, 3)
output:  [[ 3.6737819  3.6716187  3.6644292 ... 12.203822  12.108404  12.077164 ]
 [ 3.6688473  3.6667528  3.6598594 ... 12.227848  12.146185  12.11989  ]
 [ 3.6578336  3.655894   3.6496665 ... 12.287019  12.240098  12.226307 ]
 ...
 [53.57742   53.672173  53.898468  ... 83.08049   83.24545   83.30725  ]
 [53.411537  53.528187  53.81199   ... 83.325745  83.47475   83.5309   ]
 [53.344387  53.46946   53.775078  ... 83.41604   83.56092   83.615654 ]]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Dora Runtime raised an error.

Caused by:
   0: main task failed
   1: received error event: failed to map shared memory input

      Caused by:
          Opening the shared memory failed, os error 24

      Location:
          apis/rust/node/src/event.rs:64:14

Location:
    binaries/runtime/src/lib.rs:316:34
(dora3.7) jarvis@jia:~/coding/dora_home/dora-drives$ Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Dora Runtime raised an error.

Caused by:
   0: main task failed
   1: failed to send node output
   2: failed to allocate shared memory
   3: Creating the shared memory failed, os error 24

Location:
    apis/rust/node/src/node.rs:169:22

To Reproduce
Steps to reproduce the behavior:

  1. Dora start daemon: dora up
  2. Start a new dataflow: dora start graphs/tutorials/webcam_single_dpt_frame.yaml --attach --hot-reload --name webcam-midas

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots or Video
[screenshot attached to the original issue]

Environments (please complete the following information):

  • System info: Linux jia 5.15.0-69-generic #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Dora version: 0.2.2
@haixuanTao (Collaborator)

Would be great if you could share the code as well. Thanks :)

Specifically: graphs/tutorials/webcam_single_dpt_frame.yaml

@meua (Contributor, Author) commented Apr 26, 2023

Would be great if you could share the code as well. Thanks :)

Specifically: graphs/tutorials/webcam_single_dpt_frame.yaml

Ok, I submitted the related PR

@phil-opp (Collaborator)

Thanks for reporting!

Ok, I submitted the related PR

You're talking about #55, right?

Regarding the error:

Did you see any warnings in the logs? There are some situations where we unmap shared memory regions after a timeout if the receiver did not react as expected. If this happened, you should see a warning in the log output. (@haixuanTao do we have tracing to stdout enabled for Python by default?)

Given that the shared memory allocation failed too, it is more likely that the issue is the number of open files. There is typically a limit on the number of open file handles, which you can query with ulimit -n. We currently allocate each message as a separate shared memory region (which requires a file handle), so it is easy to exhaust this limit if many messages are in transit. To work around this, you can temporarily double the file limit by running ulimit -n 2048; larger values are possible too.
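
For reference, here is a minimal Python sketch (not part of dora) of the same check and workaround from inside a process, using the standard resource module; the limit values are only examples:

    import resource

    # Query the current open-file limits (the soft limit is what `ulimit -n` reports).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open-file limit: soft={soft}, hard={hard}")

    # Raise the soft limit for this process; it cannot exceed the hard limit
    # without elevated privileges.
    new_soft = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))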

To fix this properly, we should reduce the number of allocated shared memory regions and reuse the same region for multiple messages. I opened dora-rs/dora#268 for that.
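
As a generic illustration of the difference (using Python's standard multiprocessing.shared_memory, not dora's actual implementation): every separately created region holds its own file handle, while a single reused region keeps the handle count constant no matter how many messages pass through it.

    from multiprocessing import shared_memory
    import numpy as np

    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # example payload

    # One region per message: each SharedMemory object keeps a file handle
    # open until close()/unlink(), so many in-flight messages can exhaust
    # the `ulimit -n` limit.
    per_message = shared_memory.SharedMemory(create=True, size=frame.nbytes)
    per_message.buf[:frame.nbytes] = frame.tobytes()
    per_message.close()
    per_message.unlink()

    # One reused region: allocate once, overwrite the same buffer for each
    # new message, so only a single file handle is ever held.
    pool = shared_memory.SharedMemory(create=True, size=frame.nbytes)
    for _ in range(100):
        pool.buf[:frame.nbytes] = frame.tobytes()
    pool.close()
    pool.unlink()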

@haixuanTao (Collaborator)

@phil-opp, traces go to stdout with export RUST_LOG=trace; the only case where they don't is when DORA_JAEGER_TRACING is also activated.

@phil-opp (Collaborator)

OK, good. And the default log level is warn, right? Then it sounds like the open file handle limit is the issue.

@haixuanTao (Collaborator)

If the environment variable is empty or not set, or if it contains only invalid directives, a default directive enabling the ERROR level is added.

The default is the same as the Tokio tracing default, which is error. We can change it to warn.

@meua (Contributor, Author) commented Apr 26, 2023

The original trigger for issue #54 is that the bytes data (a numpy array) sent by send_output is relatively large. I have since replaced the sent content following haixuanTao's suggestion, so the code that triggers this problem is no longer present. To reproduce it, the code in dora-drives/operators/single_dpt_op.py needs to be modified as follows:

    # Upsample the model prediction to the input image resolution.
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

    # Full-resolution float32 depth map, sent as raw bytes.
    depth_output = prediction.cpu().numpy()
    print("depth_output: ", depth_output)
    send_output("depth_frame", depth_output.tobytes(), dora_input["metadata"])

The depth_output array is relatively large, which makes this problem more likely to trigger.
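
For a rough sense of scale (assuming the prediction is float32 and matches the img shape printed in the log above), each depth_frame message is about 8 MB:

    import numpy as np

    depth_output = np.zeros((1080, 1920), dtype=np.float32)  # shape from the log
    print(depth_output.nbytes)          # 8294400 bytes
    print(depth_output.nbytes / 2**20)  # ~7.9 MiB per message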

@phil-opp (Collaborator)

The default is the same as the Tokio tracing default, which is error. We can change it to warn.

This would be a good idea in my opinion. We use warnings in dora to log abnormal events that are not yet critical but should still be observed by users.

@phil-opp (Collaborator)

@meua Thanks a lot for the info!

@phil-opp (Collaborator)

What's the status of this? Can we still reproduce the "failed to map shared memory input" error with the latest version?

@meua (Contributor, Author) commented Jun 27, 2023

What's the status of this? Can we still reproduce the "failed to map shared memory input" error with the latest version?

I don't have time to test it right now; I will verify it later when I have a chance.
