Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Realm: better error message when failed to create an accessor #1772

Open
eddy16112 opened this issue Oct 10, 2024 · 8 comments
Open

Realm: better error message when failed to create an accessor #1772

eddy16112 opened this issue Oct 10, 2024 · 8 comments
Labels
Realm Issues pertaining to Realm

Comments

@eddy16112
Copy link
Contributor

When creating an accessor for a region instance whose metadata is not valid, Realm will throw an assertion

const Realm::InstanceLayoutGeneric* Realm::RegionInstance::get_layout() const: Assertion `r_impl->metadata.is_valid() && "instance metadata must be valid before accesses are performed"' failed

However, the assertion does not provide useful information for debugging. There are several cases that could trigger the assertion:

  1. the instance has been already destroyed or redistricted. I am not sure if we want to / can distinguish them.
  2. the allocation of the instance is not ready.
  3. the client forget to call fetch_metadata for remote instances.

Thus, it will be helpful if we can provide a detailed error message/code to tell which case is causing the failure.

@eddy16112 eddy16112 added the Realm Issues pertaining to Realm label Oct 10, 2024
@lightsighter
Copy link
Contributor

I suspect this is going to be one of those things that is really hard to give a good error message. On the owner node at least you can mark a RegionInstanceImpl as being "alive" or not to tell the difference. If you're on a remote node though you probably don't have any way of telling if it is "alive" or not or just hasn't been paged in correctly yet.

@eddy16112
Copy link
Contributor Author

I just realized that we do not use accessor for remote instance any more. The only memory support remote access was global memory from GASNet1, which is almost deprecated. Maybe the IPC memory?

@lightsighter
Copy link
Contributor

I think I can still make a GenericAccessor on a remote node for an instance in the REGDMA_MEM and then the generic accessor will turn the read and write calls into gets and puts using RDMA.

@eddy16112
Copy link
Contributor Author

REGDMA_MEM is just a LocalCPUMemory, and its put_bytes/get_bytes is memcpy, so I do not think it support remote access using accessor. Correct me if I am wrong.

If the REGDMA_MEM supports GenericAccessor. There are some active messages that are used to update the state of metadata of remote instances., so at least we can try to protect the constructor to make sure the instance is alive. Protecting the future read/write could be difficult, we might be able to provide a function for people to query the remote state after they are done with read/write, but I am not sure if it worths a roundtrip message.

@lightsighter
Copy link
Contributor

REGDMA_MEM is just a LocalCPUMemory, and its put_bytes/get_bytes is memcpy, so I do not think it support remote access using accessor.

Huh, that is surprising to me. Maybe that is a regression but I thought we could do puts/gets to instances in REGDMA_MEM from remote nodes. If we can't it seems like a feature we should have at some point (not now).

Protecting the future read/write could be difficult, we might be able to provide a function for people to query the remote state after they are done with read/write, but I am not sure if it worths a roundtrip message.

I would make sure that whatever checking we do incurs near-zero overhead (some branches are fine since they can be predicted away).

@eddy16112
Copy link
Contributor Author

@lightsighter I just tried GenericAccessor on REGDMA_MEM, and it did not work. Because the RemoteMemory such as GASNetEXRemoteMemory or UCPRemoteMemory has no implementation of put_bytes/get_bytes https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_module.cc#L82.

I will not put any effort on this issue for now, but I plan to leave it open just in case we need it in the future.

@lightsighter
Copy link
Contributor

Because the RemoteMemory such as GASNetEXRemoteMemory or UCPRemoteMemory has no implementation of put_bytes/get_bytes https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_module.cc#L82.

Ah, that's ok. The plumbing is there and my interpretation of what should be allowed is reasonable, we just haven't implemented it in the networking backends.

I will not put any effort on this issue for now, but I plan to leave it open just in case we need it in the future.

Sounds good. The code paths are there, just unimplemented until we need them which is fine.

@apryakhin
Copy link
Contributor

apryakhin commented Oct 14, 2024

Huh, that is surprising to me. Maybe that is a regression but I thought we could do puts/gets to instances in REGDMA_MEM from remote nodes. If we can't it seems like a feature we should have at some point (not now)

I was surprised that we don't have it too. I came across that when writing several integration tests and wanting to validate the remote data written. That's not a strong enough justification only for the feature to be pushed but would certainly remove some of the boiler-plate code we have in our integration tests on none perf-critical paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Realm Issues pertaining to Realm
Projects
None yet
Development

No branches or pull requests

3 participants