-
Hi JAX team, I'm working on switching our code to JAX (using Flax as the NN library) and I'm amazed by jit and vmap. I'm wondering if there are best practices for when and where to apply these. For example: should jit wrap the outermost function (the whole train step) or inner functions like the loss? And should vmap go around the model's forward pass, around the loss, or somewhere else?
Maybe these things don't matter at all and the jit and vmap magic is super robust against amateurs like me, but otherwise a simple best-practices page could help. Thanks!
-
@GJBoth I'd also love to know the best practices - good questions. Maybe this issue should be moved to GitHub Discussions by the JAX admins (there's a guide for this: https://docs.github.com/en/free-pro-team@latest/discussions/managing-discussions-for-your-community/moderating-discussions#converting-an-issue-to-a-discussion). cc @avital from Flax
-
Hi @GJBoth 👋 I did some research on Flax and Haiku examples to try to find some answers.
Working on cutting edge stuff 🔥 Nice!
It's like magical 🦄 dust wrapped around XLA (...I don't know what this means and how those transforms work).
It looks like it depends on how the train step and loss functions are defined. I included some examples below - supervised classification, RL, generative - from Flax Linen and Haiku code. Some apply @jax.jit directly as a decorator, while others use functools.partial(jax.jit, static_argnums=...).
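In these first two Flax Linen snippets, @jax.jit wraps the entire train_step, with loss_fn defined inside and differentiated via jax.value_and_grad: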
@jax.jit
def train_step(optimizer, batch, z_rng):
def loss_fn(params):
...
grad_fn = jax.value_and_grad(loss_fn, has_aux=True)
...
optimizer = optimizer.apply_gradient(grad)
return optimizer
@jax.jit
def train_step(optimizer, batch, masks, key):
...
def loss_fn(params):
...
return loss, logits
grad_fn = jax.value_and_grad(loss_fn, has_aux=True)
...
return optimizer, metrics
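Other examples apply jit through functools.partial with static_argnums, so that certain arguments are treated as compile-time constants (changing them triggers recompilation):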
@functools.partial(jax.jit, static_argnums=1)
def loss_fn(
...):
...
return PPO_loss + vf_coeff*value_loss - entropy_coeff*entropy
@functools.partial(jax.jit, static_argnums=(0,7))
def train_step(
...):
...
for batch in zip(*trajectories):
grad_fn = jax.value_and_grad(loss_fn)
...
optimizer = optimizer.apply_gradient(grad, learning_rate=lr)
return optimizer, loss
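In this Haiku VAE example, the loss and the update step are each jitted at the top level, with optax applying the parameter updates: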
@jax.jit
def loss_fn(params: hk.Params, rng_key: PRNGKey, batch: Batch) -> jnp.ndarray:
...
outputs: VAEOutput = model.apply(params, rng_key, batch["image"])
log_likelihood = -binary_cross_entropy(batch["image"], outputs.logits)
kl = kl_gaussian(outputs.mean, outputs.stddev**2)
elbo = log_likelihood - kl
return -jnp.mean(elbo)
@jax.jit
def update(
...
) -> Tuple[hk.Params, OptState]:
...
grads = jax.grad(loss_fn)(params, rng_key, batch)
updates, new_opt_state = optimizer.update(grads, opt_state)
new_params = optax.apply_updates(params, updates)
return new_params, new_opt_state
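Here, a jitted method marks self as a static argument, and vmap is used inside the loss helpers to map the rlax losses over axis 1 of the inputs: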
@functools.partial(jax.jit, static_argnums=0)
def update(self, params, opt_state, batch: util.Transition):
"""The actual update function."""
(_, logs), grads = jax.value_and_grad(
self._loss, has_aux=True)(params, batch)
grad_norm_unclipped = optimizers.l2_norm(grads)
...
return params, updated_opt_state, logs
...
def policy_gradient_loss(logits, *args):
...
mean_per_batch = jax.vmap(rlax.policy_gradient_loss, in_axes=1)(logits, *args)
total_loss_per_batch = mean_per_batch * logits.shape[0]
return jnp.sum(total_loss_per_batch)
def entropy_loss(logits, *args):
...
mean_per_batch = jax.vmap(rlax.entropy_loss, in_axes=1)(logits, *args)
total_loss_per_batch = mean_per_batch * logits.shape[0]
return jnp.sum(total_loss_per_batch)
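Model initialization can be jitted too, as in this Flax example: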
def initialized(key, image_size, model):
input_shape = (1, image_size, image_size, 3)
@jax.jit
def init(*args):
return model.init(*args)
variables = init({'params': key}, jnp.ones(input_shape, model.dtype))
model_state, params = variables.pop('params')
return params, model_state
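jit and vmap can also be stacked directly as decorators, here vectorizing a GAE advantage computation over axis 1 of the inputs: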
@jax.jit
@functools.partial(jax.vmap, in_axes=(1, 1, 1, None, None), out_axes=1)
def gae_advantages(
...):
...
for t in reversed(range(len(rewards))):
...
gae = delta + discount * gae_param * terminal_masks[t] * gae
advantages.append(gae)
advantages = advantages[::-1]
return jnp.array(advantages)
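Finally, jitted methods on an agent class, with self (and in one case batch_size) marked as static: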
@functools.partial(jax.jit, static_argnums=(0, 1))
def initial_state(self, batch_size: Optional[int]):
...
return self._initial_state_apply_fn(None, batch_size)
@functools.partial(jax.jit, static_argnums=(0,))
def step(
...
) -> Tuple[AgentOutput, Nest]:
...
action = hk.multinomial(rng_key, net_out.policy_logits, num_samples=1)
...
return AgentOutput(net_out.policy_logits, net_out.value, action), next_state
-
Generally, you can always jit the top-most level (but it's fine to jit inner functions -- that doesn't impact the end result). vmap changes the function signature, so put it where it makes sense -- if you want a forward-pass function that works on batches but is defined on single elements, then you should vmap that function. But I agree we should have a simple page with best practices for jit, vmap, etc. I'll file an issue for that.
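To make that concrete, here is a minimal sketch (a tiny one-layer model; the names predict, loss_fn and train_step are made up for illustration, not taken from the examples above). The forward pass is written for a single example, vmap lifts it to batches, and jit wraps only the top-level train step:

import jax
import jax.numpy as jnp

def predict(params, x):
    # Forward pass defined on a single example x of shape (features,).
    w, b = params
    return jnp.tanh(x @ w + b)

# vmap changes the signature: batched_predict expects x of shape (batch, features).
batched_predict = jax.vmap(predict, in_axes=(None, 0))

def loss_fn(params, xs, ys):
    preds = batched_predict(params, xs)
    return jnp.mean((preds - ys) ** 2)

@jax.jit  # jit the outermost step; everything it calls is compiled together
def train_step(params, xs, ys):
    lr = 1e-3
    grads = jax.grad(loss_fn)(params, xs, ys)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

# Example usage:
# params = (jnp.zeros((3, 2)), jnp.zeros(2))
# params = train_step(params, jnp.ones((8, 3)), jnp.ones((8, 2)))

Jitting loss_fn or predict as well wouldn't change the result; the part that matters is that vmap sits around the function that is defined per example.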