Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: make exla build resilient to stale upgrades #1548

Merged
merged 9 commits into from
Oct 28, 2024

Conversation

polvalente
Copy link
Contributor

fixes #1546

@polvalente polvalente requested review from jonatanklosko and josevalim and removed request for jonatanklosko October 25, 2024 02:36
@jonatanklosko
Copy link
Member

Looks good to me!

exla/README.md Outdated

We recommend following the steps below:

1. If the error appeared after upgrading EXLA, ensure that you have the proper dependency versions given by [XLA](https://github.com/elixir-nx/xla). Afterwards, compile with `mix compile --clean-libexla-cache --force` to clean up all cached files.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mix compile is not going to work when compiled as a dependency. We would need to introduce a separate mix compiler or a separate task.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won’t work in the sense that it won’t invoke the code you added in the mix.exs file. mix compile is only about the local project. Maybe an env var with mix deps.compile exla will work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josevalim it may works as mix deps.clean exla --build --clean-libexla-cache?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err mix deps.compile exla --force --clean-libexla-cache

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yeah, env var sounds better either way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we pass options forward in all of these cases. Although we should probably make mix deps.clean for dependencies extensible, so we actually invoke clean in the relevant dep.

exla/mix.exs Outdated
defp cached_make(_) do
defp cached_make(args) do
{parsed, _args, _invalid} =
OptionParser.parse(args, strict: [clean_libexla_cache: :boolean, force: :boolean])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion to make this a single env var called EXLA_FORCE_REBUILD, as explained above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Env var is probably the simpler alternative :)

exla/mix.exs Outdated
@@ -150,14 +167,21 @@ defmodule EXLA.MixProject do
cached_so = Path.join([xla_cache_dir(), "exla", cache_key, "libexla.so"])
cached? = File.exists?(cached_so)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can mostly remove the changes to this function and instead simply write this:

Suggested change
cached? = File.exists?(cached_so)
cached? = System.get_env("XLA_FORCE_REBUILD") == nil and File.exists?(cached_so)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be enough to force a rebuild and write the new build to cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josevalim I changed to using env vars, but kept 2 separate ones because there are many instances in which I wanted to leverage the .o caching while getting rid of the libexla.so file and cache, as well as instances where I wanted to clean-slate everything.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My problem with the above is that, if something is not working, I need to set two env vars to clear everything. So why not a single env var with different values?

  • XLA_FORCE_REBUILD=1 - remove all
  • XLA_FORCE_REBUILD=partial - keep .o

Also, is there any situation where you don't want to cache files to be rewritten? I don't see a reason to skip that as part of these options?

Finally note that we use XLA env vars for pretty much everything, since we have XLA_CACHE_DIR, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with having special values for the env var and having a single one.

I disagree with using XLA as the prefix, though, as this doesn't have anything to do with :xla. XLA_CACHE_DIR being used in EXLA is just an artifact of "let's use the same subdir for both XLA and EXLA". The origin of the variable is in :xla.

Also, is there any situation where you don't want to cache files to be rewritten? I don't see a reason to skip that as part of these options?

As for this part, I don't think we are skipping the rewrite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored the cached? check so that there's less confusion about whether rewrites will happen.
cached? will now be set to false regardless of the file existing if force rebuild is set.

@@ -27,14 +27,13 @@ defmodule EXLA.DeviceMemorySharingTest do
end

@tag :cuda_required
test "ipc handles don't crash the runtime when :local mode is selected" do
assert {:error, ~c"Invalid pointer size for selected mode."} ==
test "invalid ipc handles don't crash the runtime" do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was breaking locally (I guess I hadn't run it on my desktop since the update 😛)

exla/mix.exs Outdated
Comment on lines 141 to 153
false

"0" ->
false

"partial" ->
:partial

"true" ->
true

"1" ->
true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
false
"0" ->
false
"partial" ->
:partial
"true" ->
true
"1" ->
true
:none
"0" ->
:none
"partial" ->
:partial
"true" ->
:full
"1" ->
:full

This should force all the code below to be explicit :)

Copy link
Collaborator

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final nitpick and ship it!

@polvalente polvalente merged commit 5eb444e into main Oct 28, 2024
8 checks passed
@polvalente polvalente deleted the pv-fix/exla-build-for-upgrading branch October 28, 2024 20:50
josevalim added a commit that referenced this pull request Nov 16, 2024
Co-authored-by: José Valim <[email protected]>
Co-authored-by: Jonatan Kłosko <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"module EXLA.NIF is not available" What does this mean?
3 participants