Raise when in-place operations occur on leaves requiring grad #1458
base: main
```diff
@@ -477,15 +477,15 @@ def f(xs, ys, z):
 )
 def test_inplace_to_tensors_with_grad(executor, device, _):
     @torch.no_grad
-    def add_y(x, y):
-        x.add_(y, alpha=0.1)
+    def add_grad(x, y):
+        return x.add_(x.grad)

     @torch.no_grad
-    def add_grad(x, y):
-        x.add_(x.grad, alpha=0.1)
+    def add_y(x, y):
+        return x.add_(y, alpha=0.1)

-    for f in (add_y, add_grad):
-        jitted_f = executor.make_callable(f)
+    for fn in (add_grad, add_y):
+        jitted_f = executor.make_callable(fn)
     x = make_tensor((2, 2), device=device, dtype=torch.float32, requires_grad=True)
     x.grad = make_tensor((2, 2), device=device, dtype=torch.float32)
     y = make_tensor((2, 2), device=device, dtype=torch.float32)
```
```diff
@@ -495,7 +495,7 @@ def add_grad(x, y):
     y_ref = y.clone().detach()

     res = jitted_f(x, y)
-    res_ref = f(x_ref, y_ref)
+    res_ref = fn(x_ref, y_ref)

     torch.testing.assert_close(x, x_ref)
     torch.testing.assert_close(x.grad, x_ref.grad)
```
```diff
@@ -549,7 +549,8 @@ def single_tensor_adam(
     ref_state_steps = [torch.tensor(1, device=device) for _ in range(2)]
     single_tensor_adam(*ref_tensors, state_steps=ref_state_steps)

-    jitted = executor.make_callable(single_tensor_adam)
+    # torch.compile does not support accessing the ContextVariable compile data used in _copy__impl_
+    jitted = executor.make_callable(single_tensor_adam, torch_compile_fullgraph=False)
```
Interesting that:

```python
import torch
from contextvars import ContextVar

_compile_data = ContextVar("compile_data", default=(None, None))

def fn(x):
    _compile_data.get()
    return x + 1

torch.compile(fn, fullgraph=False)(torch.randn(3, 3))  # Works, with a graph break at _compile_data.get()
torch.compile(fn, fullgraph=True)(torch.randn(3, 3))   # Fails
```

What does Thunder's interpreter do? It probably fails.

Thunder just burns the value into the computation trace (if used) without having a corresponding check in the prologue. (Will file an issue for the same.) E.g.:

```python
import torch
import thunder
from contextvars import ContextVar

_compile_data = ContextVar("compile_data", default=1)

def fn(x):
    v = _compile_data.get()
    return x + v

jfn = thunder.jit(fn)
o = jfn(torch.ones(3,))
print(o)  # tensor([2., 2., 2.])

_compile_data.set((2,))
o = jfn(torch.ones(3,))
print(o)  # tensor([2., 2., 2.])

print(thunder.last_prologue_traces(jfn)[-1])
# @torch.no_grad()
# @no_autocast
# def prologue(*args, **kwargs):
#     # args: "Any"
#     check_len(args, 1)
#       # prims.check_len(args, 1)
#     # kwargs: "Any"
#     check_len(kwargs, 0)
#       # prims.check_len(kwargs, 0)
#     x: "cpu f32[3]" = args[0]
#     check_tensor_metadata(x, (3,), 'cpu', torch.float32, False)
#       # prims.check_tensor_shape_and_metadata(x, (3,), 'cpu', torch.float32, False)
#     cache_info: "Any" = thunder._get_cache_info()
#     cache_info_default_dtype: "<class 'torch.dtype'>" = cache_info['default_dtype']
#     check_literal_like(cache_info_default_dtype, torch.float32)
#       # prims.check_literal_like(cache_info_default_dtype, torch.float32)
#     cache_info_default_device: "<class 'torch.device'>" = cache_info['default_device']
#     check_literal_like(cache_info_default_device, torch.device("cpu"))
#       # prims.check_literal_like(cache_info_default_device, torch.device("cpu"))
#     cache_info_is_autocast_enabled: "bool False" = cache_info['is_autocast_enabled']
#     check_number_type_and_value(cache_info_is_autocast_enabled, False)
#       # prims.check_number_type_and_value(cache_info_is_autocast_enabled, False)
#     cache_info_no_grad_sync: "bool False" = cache_info['no_grad_sync']
#     check_number_type_and_value(cache_info_no_grad_sync, False)
#       # prims.check_number_type_and_value(cache_info_no_grad_sync, False)
#     cache_info_alias_tensor_indices: "str" = cache_info['alias_tensor_indices']
#     check_string_value(cache_info_alias_tensor_indices, '')
#       # prims.check_string_value(cache_info_alias_tensor_indices, '')
#     cache_info_is_grad_enabled: "bool True" = cache_info['is_grad_enabled']
#     check_number_type_and_value(cache_info_is_grad_enabled, True)
#       # prims.check_number_type_and_value(cache_info_is_grad_enabled, True)
#     return ((x,), ())

print(thunder.last_traces(jfn)[-1])
# @torch.no_grad()
# @no_autocast
# def computation(x):
#     # x: "cpu f32[3]"
#     t0 = torch.add(x, 1, alpha=1)  # t0: "cpu f32[3]"
#       # t0 = ltorch.add(x, 1, alpha=1)  # t0: "cpu f32[3]"
#       # _ = prims.convert_element_type(1, float)
#       # t0 = prims.add(x, 1.0)  # t0: "cpu f32[3]"
#     return t0
```

Note that the prologue checks only cache_info; nothing guards the burned-in ContextVar value, so the second call still returns `tensor([2., 2., 2.])`.

Issue filed at #1464.
```diff
     params, grads, exp_avgs, exp_avg_sqs = tensors

     jitted(params, grads, exp_avgs, exp_avg_sqs, state_steps)
```
I am wondering if the Symbol `copy_` in `thunder/torch/__init__.py` is a more appropriate location for the check (lightning-thunder/thunder/torch/__init__.py, lines 1961 to 1963 in 60f3ee1).
`a` and `b` are proxies, and it is not clear to me whether a proxy knows that it is a leaf.

They do not. Leaf-ness is only a PyTorch concept, available at runtime inside `_copy__impl`.
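To illustrate the point about leaf-ness being a runtime property (a plain eager-PyTorch sketch, not Thunder code): whether a tensor is a leaf depends on how the concrete tensor was produced, which a trace-time proxy cannot know.

```python
import torch

a = torch.randn(3, requires_grad=True)  # created directly by the user: a leaf
b = a * 2                               # produced by an autograd-tracked op: not a leaf
print(a.is_leaf, b.is_leaf)  # True False
```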
Right, previously I missed that the fix was in `copy_impl`. And since it is happening at runtime, I am wondering if `compile_data` is actually available. A quick test (see the snippet above) shows that it wouldn't be. So we probably need a way to check if this `copy` was called under `no_grad` in the user's code (PyTorch supports in-place updates of leaf tensors under `no_grad`; see comment).
Indeed, compile_data was not available, but now it should be, with the added context manager in `thunder/__init__.py`.
I think this is still incorrect: as discussed in #1486, the value of `compile_data.is_grad_enabled` here would be that of the last updated state, which can lead to incorrectness when used outside of the tracing context. We can see the discrepancy here. So whether the `copy` is in a `no_grad` region needs to be captured at tracing time.
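The "last updated state" hazard can be shown with a small standalone Python sketch (hypothetical names; this is not Thunder's actual machinery). If a recorded op reads a mutable flag only when it later executes, it observes whatever the flag was last set to, not the value in effect when the op was traced; capturing the value into the op at trace time avoids this.

```python
trace = []
state = {"grad_enabled": True}

def record_op_late_read():
    # Bug pattern: the recorded op reads the shared flag at run time.
    trace.append(lambda: state["grad_enabled"])

def record_op_captured():
    # Fix pattern: the tracer captures the flag's current value into the op.
    flag = state["grad_enabled"]
    trace.append(lambda: flag)

record_op_late_read()          # traced while grad_enabled is True
state["grad_enabled"] = False  # e.g. a later no_grad region flips the state
record_op_captured()           # traced while grad_enabled is False

results = [op() for op in trace]
print(results)  # [False, False] -- the first op was traced under True but sees False
```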
Right, this is why I created the other issue. This PR fixes the leaf/grad issue when there is no annotation. When there is an annotation, another approach is required, which may or may not involve using compile data in `_copy__impl`.

As far as I understand, compile data is the medium for passing around data such as whether grad is enabled. But as the other issue points out, compile data reflects the end state of a function call and not the "live" state, at least by the time it reaches `_copy__impl`. So I'm left with these questions: Are there other mechanisms for passing around whether grad is enabled? Where else in the execution is it simultaneously knowable that a (1) leaf tensor (2) requiring grad is being (3) copied while (4) grad is enabled? Is it feasible/desirable to make the compile data more dynamic? Is there a way to context-manage the tensors so that their `requires_grad` flags are set to `False` when the interpreter sees `torch._C._set_grad_enabled(False)` and later restored, obviating the need for compile data in this check? Do you have suggestions for a fix that addresses both issues? Or can we close out this issue and move the discussion to the more involved one?
So, to tackle a leaf tensor requiring grad being copied into while grad is enabled: similar to a previous commit, I think we can update `prims.copy` to take an argument `is_grad_enabled`. With this, `ltorch.copy` will query `cd.is_grad_enabled` and call `prims.copy`, passing this argument along (lightning-thunder/thunder/torch/__init__.py, lines 1984 to 1986 in 9de5434). With these changes, `copy_impl`'s signature will also change to accept `is_grad_enabled`; since it is called at runtime with a real tensor, we can query whether that tensor is a leaf and also whether grad was enabled when that particular copy was traced. Wdyt @beverlylytle?

Though I am curious if there is another approach to this - cc: @IvanYashchuk
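A rough sketch of that proposal (all names and signatures here are my assumptions for illustration, not Thunder's real API): the trace-time wrapper reads the grad mode and threads it into the runtime impl, where leaf-ness is finally queryable on the concrete tensor.

```python
import torch

def copy_impl(copy_to: torch.Tensor, copy_from: torch.Tensor, *, is_grad_enabled: bool):
    # Runs at runtime with real tensors, so is_leaf/requires_grad are queryable,
    # while is_grad_enabled carries the grad mode in effect when the copy was traced.
    if is_grad_enabled and copy_to.is_leaf and copy_to.requires_grad:
        raise RuntimeError("in-place copy into a leaf tensor that requires grad")
    with torch.no_grad():
        return copy_to.copy_(copy_from)

def traced_copy(copy_to, copy_from, *, trace_time_grad_enabled: bool):
    # Stand-in for ltorch.copy: captures the grad mode at trace time and forwards it.
    return copy_impl(copy_to, copy_from, is_grad_enabled=trace_time_grad_enabled)

x = torch.ones(2, requires_grad=True)  # a leaf requiring grad

# A copy traced inside a no_grad region is allowed.
traced_copy(x, torch.zeros(2), trace_time_grad_enabled=False)

# A copy traced with grad enabled raises, mirroring eager PyTorch.
try:
    traced_copy(x, torch.zeros(2), trace_time_grad_enabled=True)
    raised = False
except RuntimeError:
    raised = True
print(raised)  # True
```

The key design point is that the flag is a per-op argument burned in at trace time, rather than a global read at run time, sidestepping the stale-state problem discussed above.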
Let's see what the CI thinks.