[TPU] How to write in-place custom ops compatible with torch.compile using pallas? #25173

soodoshll · 2024-11-28T16:16:13Z

Hi all. We are trying to implement an in-place operator in torch_xla using Pallas and make it work with torch.compile. However, we found that XLA inserts extra copy instructions, which defeats input-output aliasing.

In this example, the custom op is a simple elementwise +1 operation. The graph sent to the XLA compiler correctly captures the buffer donation information:

HloModule SyncTensorsGraph.4, buffer_donor={ (0, {}) }, entry_computation_layout={(s32[8192,2048]{1,0:T(8,128)})->(s32[8192,2048]{1,0:T(8,128)})}

ENTRY SyncTensorsGraph.4 {
  p0.1 = s32[8192,2048]{1,0} parameter(0)
  custom-call.2 = s32[8192,2048]{1,0} custom-call(p0.1), custom_call_target="tpu_custom_call", operand_layout_constraints={s32[8192,2048]{1,0}}, ...
  ROOT tuple.3 = (s32[8192,2048]{1,0}) tuple(custom-call.2)
}
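One way to confirm the donation information without reading the whole dump is to pull the relevant attribute out of the HloModule header line. This is only a minimal sketch and not part of torch_xla or XLA: `header_attribute` is a hypothetical helper, and it assumes the header is a single line whose attribute values are brace-delimited, as in the dump above.

```python
def header_attribute(header, name):
    """Return the brace-delimited value of `name=` in an HloModule header.

    Hypothetical helper for illustration. Returns None if the attribute
    is absent or its braces are unbalanced.
    """
    start = header.find(name + "={")
    if start < 0:
        return None
    open_idx = start + len(name) + 1  # index of the opening brace
    depth = 0
    for j in range(open_idx, len(header)):
        if header[j] == "{":
            depth += 1
        elif header[j] == "}":
            depth -= 1
            if depth == 0:
                return header[open_idx:j + 1]
    return None


# Header line copied from the dump above.
HEADER = (
    "HloModule SyncTensorsGraph.4, buffer_donor={ (0, {}) }, "
    "entry_computation_layout={(s32[8192,2048]{1,0:T(8,128)})"
    "->(s32[8192,2048]{1,0:T(8,128)})}"
)

print(header_attribute(HEADER, "buffer_donor"))        # { (0, {}) }
print(header_attribute(HEADER, "input_output_alias"))  # None
```

The same helper can be pointed at `input_output_alias` in later dumps to check whether the donation was turned into an alias.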

Everything looks fine until the pass copy-insertion.after_adding_copies_to_resolve_interference, which inserts copy operations before and after the custom call:

HloModule SyncTensorsGraph.4, input_output_alias={ {0}: (0, {}, may-alias) }, entry_computation_layout={(s32[8192,2048]{1,0:T(8,128)})->(s32[8192,2048]{1,0:T(8,128)})}

ENTRY SyncTensorsGraph.4 {
  p0.1 = s32[8192,2048]{1,0:T(8,128)} parameter(0)
  copy = s32[8192,2048]{1,0:T(8,128)} copy(p0.1)
  custom-call.2 = s32[8192,2048]{1,0:T(8,128)} custom-call(copy), custom_call_target="tpu_custom_call",...
  copy.1 = s32[8192,2048]{1,0:T(8,128)} copy(custom-call.2), control-predecessors={copy}
  ROOT tuple = (s32[8192,2048]{1,0:T(8,128)}) tuple(copy.1)
}

The dump from the previous pass does not contain the copies, and the aliasing analysis there seems to be correct:

HloModule SyncTensorsGraph.4, input_output_alias={ {0}: (0, {}, may-alias) }, entry_computation_layout={(s32[8192,2048]{1,0:T(8,128)})->(s32[8192,2048]{1,0:T(8,128)})}

ENTRY SyncTensorsGraph.4 {
  p0.1 = s32[8192,2048]{1,0:T(8,128)} parameter(0)
  custom-call.2 = s32[8192,2048]{1,0:T(8,128)} custom-call(p0.1), custom_call_target="tpu_custom_call", operand_layout_constraints={s32[8192,2048]{1,0}}, backend_config={"custom_call_config": {"body":...
  ROOT tuple.3 = (s32[8192,2048]{1,0:T(8,128)}) tuple(custom-call.2)
}
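To find which pass introduces the copies, the two dumps can also be compared mechanically by counting opcodes. This is a rough sketch under stated assumptions: `opcode_counts` and `added_opcodes` are hypothetical helpers, the regular expression assumes one instruction per line with a shape that contains no spaces (true for the dumps here, not for HLO in general), and the sample strings are the dumps above with the custom-call attributes left truncated.

```python
import re
from collections import Counter

# Matches "[ROOT] name = shape opcode(" and captures the opcode.
# Assumption: the shape token contains no whitespace.
INSTRUCTION = re.compile(r"^\s*(?:ROOT\s+)?[\w.\-]+\s*=\s*\S+\s+([\w\-]+)\(")


def opcode_counts(hlo_text):
    """Count instructions per opcode in an HLO text dump."""
    counts = Counter()
    for line in hlo_text.splitlines():
        match = INSTRUCTION.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def added_opcodes(before, after):
    """Opcodes that occur more often in `after` than in `before`."""
    return dict(opcode_counts(after) - opcode_counts(before))


BEFORE = """
ENTRY SyncTensorsGraph.4 {
  p0.1 = s32[8192,2048]{1,0:T(8,128)} parameter(0)
  custom-call.2 = s32[8192,2048]{1,0:T(8,128)} custom-call(p0.1), custom_call_target="tpu_custom_call", ...
  ROOT tuple.3 = (s32[8192,2048]{1,0:T(8,128)}) tuple(custom-call.2)
}
"""

AFTER = """
ENTRY SyncTensorsGraph.4 {
  p0.1 = s32[8192,2048]{1,0:T(8,128)} parameter(0)
  copy = s32[8192,2048]{1,0:T(8,128)} copy(p0.1)
  custom-call.2 = s32[8192,2048]{1,0:T(8,128)} custom-call(copy), custom_call_target="tpu_custom_call",...
  copy.1 = s32[8192,2048]{1,0:T(8,128)} copy(custom-call.2), control-predecessors={copy}
  ROOT tuple = (s32[8192,2048]{1,0:T(8,128)}) tuple(copy.1)
}
"""

print(added_opcodes(BEFORE, AFTER))  # {'copy': 2}
```

Running this over consecutive pass dumps narrows the problem down to the first pass whose output gains `copy` instructions.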

How can I remove these copy instructions? Thanks!
