
[HuggingFace] Compilation of ESM model fail for unsupported operator erf #1081

Open
JingyaHuang opened this issue Jan 6, 2025 · 2 comments

@JingyaHuang

Hi team! We observed an issue when compiling the ESM model with Neuron SDK 2.21.0:

Reported stdout: 
DEBUG: needsModular? No. macCnt 494817280
INFO: Switching to single-module compile. PrePartitionPipe skipped.
INFO: Found memory bound graph
2025-01-06 11:49:42.255433: F hilo/hlo_passes/VerifySupportedHloOps.cc:160] ERROR: Unsupported operator: erf
Detailed log
(aws_neuronx_venv_pytorch_2_5) ubuntu@ip-xx-xx-x-xx:~/optimum-neuron$ optimum-cli export neuron --model hf-internal-testing/tiny-random-EsmModel --batch_size 1 --sequence_length 16  --auto_cast matmul --auto_cast_type bf16 tiny_esm/
WARNING:root:MASTER_ADDR environment variable is not set, defaulting to localhost
WARNING:root:Found libneuronpjrt.so. Setting PJRT_DEVICE=NEURON.
/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
  from neuronx_distributed.modules.moe.blockwise import (
/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.
  from neuronx_distributed.modules.moe.blockwise import (
pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████████████| 241k/241k [00:00<00:00, 55.4MB/s]
Some weights of EsmModel were not initialized from the model checkpoint at hf-internal-testing/tiny-random-EsmModel and are newly initialized: ['contact_head.regression.bias', 'contact_head.regression.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████| 277/277 [00:00<00:00, 3.67MB/s]
vocab.txt: 100%|████████████████████████████████████████████████████████████████████████████████| 93.0/93.0 [00:00<00:00, 1.27MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████| 125/125 [00:00<00:00, 1.79MB/s]
***** Compiling tiny-random-EsmModel *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16
Using Neuron: --optlevel 2
model.safetensors: 100%|████████████████████████████████████████████████████████████████████████| 223k/223k [00:00<00:00, 42.3MB/s]
.
Process Process-1:
Traceback (most recent call last):
  File "neuronxcc/driver/CommandDriver.py", line 345, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand
  File "neuronxcc/driver/commands/CompileCommand.py", line 1353, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
  File "neuronxcc/driver/commands/CompileCommand.py", line 1304, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
  File "neuronxcc/driver/commands/CompileCommand.py", line 1324, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
  File "neuronxcc/driver/commands/CompileCommand.py", line 1327, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
  File "neuronxcc/driver/Job.py", line 344, in neuronxcc.driver.Job.SingleInputJob.run
  File "neuronxcc/driver/Job.py", line 370, in neuronxcc.driver.Job.SingleInputJob.runOnState
  File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
  File "neuronxcc/driver/Job.py", line 344, in neuronxcc.driver.Job.SingleInputJob.run
  File "neuronxcc/driver/Job.py", line 370, in neuronxcc.driver.Job.SingleInputJob.runOnState
  File "neuronxcc/driver/jobs/Frontend.py", line 454, in neuronxcc.driver.jobs.Frontend.Frontend.runSingleInput
  File "neuronxcc/driver/jobs/Frontend.py", line 218, in neuronxcc.driver.jobs.Frontend.Frontend.runXLAFrontend
  File "neuronxcc/driver/jobs/Frontend.py", line 190, in neuronxcc.driver.jobs.Frontend.Frontend.runHlo2Tensorizer
neuronxcc.driver.Exceptions.CompilerInvalidInputException: ERROR: Failed command  /opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/neuronxcc/starfish/bin/hlo2penguin --input /tmp/tmpp3mv4kkw/model/graph.hlo --out-dir ./ --output penguin.py --layers-per-module=1 --emit-tensor-level-dropout-ops --emit-tensor-level-rng-ops
------------ 
Reported stdout: 
DEBUG: needsModular? No. macCnt 600064
INFO: Switching to single-module compile. PrePartitionPipe skipped.
INFO: Found memory bound graph
2025-01-06 11:46:54.763940: F hilo/hlo_passes/VerifySupportedHloOps.cc:160] ERROR: Unsupported operator: erf

------------ 
Reported stderr: 
None
------------                       
Import of the HLO graph into the Neuron Compiler has failed.                       
This may be caused by unsupported operators or an internal compiler error.                       
More details can be found in the error message(s) above.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "neuronxcc/driver/CommandDriver.py", line 352, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand_in_process
  File "neuronxcc/driver/CommandDriver.py", line 347, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand
  File "neuronxcc/driver/CommandDriver.py", line 111, in neuronxcc.driver.CommandDriver.handleError
  File "neuronxcc/driver/GlobalState.py", line 102, in neuronxcc.driver.GlobalState.FinalizeGlobalState
  File "neuronxcc/driver/GlobalState.py", line 82, in neuronxcc.driver.GlobalState._GlobalStateImpl.shutdown
  File "/usr/lib/python3.10/shutil.py", line 715, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 713, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/optimum-neuron/neuronxcc-u6vnj9h_'
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/optimum/exporters/neuron/__main__.py", line 781, in <module>
    main()
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/optimum/exporters/neuron/__main__.py", line 751, in main
    main_export(
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/optimum/exporters/neuron/__main__.py", line 634, in main_export
    _, neuron_outputs = export_models(
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/optimum/exporters/neuron/convert.py", line 372, in export_models
    neuron_inputs, neuron_outputs = export(
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/optimum/exporters/neuron/convert.py", line 455, in export
    return export_neuronx(
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/optimum/exporters/neuron/convert.py", line 585, in export_neuronx
    neuron_model = neuronx.trace(
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 589, in trace
    neff_filename, metaneff, flattener, packer, weights = _trace(
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 654, in _trace
    neff_artifacts = generate_neff(
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 506, in generate_neff
    neff_filename = hlo_compile(
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 396, in hlo_compile
    raise RuntimeError(f"neuronx-cc failed with {status}")
RuntimeError: neuronx-cc failed with 1
Traceback (most recent call last):
  File "/opt/aws_neuronx_venv_pytorch_2_5/bin/optimum-cli", line 8, in <module>
    sys.exit(main())
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/optimum/commands/optimum_cli.py", line 208, in main
    service.run()
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib/python3.10/site-packages/optimum/commands/export/neuronx.py", line 305, in run
    subprocess.run(full_command, shell=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m optimum.exporters.neuron --model hf-internal-testing/tiny-random-EsmModel --batch_size 1 --sequence_length 16 --auto_cast matmul --auto_cast_type bf16 tiny_esm/' returned non-zero exit status 1.

To reproduce:

  • Tiny
optimum-cli export neuron --model hf-internal-testing/tiny-random-EsmModel --batch_size 1 --sequence_length 16  --auto_cast matmul --auto_cast_type bf16 tiny_esm/
  • Regular
optimum-cli export neuron --model facebook/esm2_t6_8M_UR50D --batch_size 1 --sequence_length 64  --auto_cast matmul --auto_cast_type bf16 regular_esm/
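Until a Neuron SDK release with erf support ships, one possible workaround (a sketch, not an official fix) is to avoid the exact-erf lowering altogether by using the tanh approximation of GELU before tracing; PyTorch exposes this as `torch.nn.functional.gelu(x, approximate="tanh")`, and transformers ships it as the `gelu_new`/`NewGELUActivation` variant, though whether the ESM config picks it up via `hidden_act` is an assumption that would need checking. The underlying approximation replaces `erf` with `tanh`, and its accuracy can be sanity-checked against the stdlib:

```python
import math

def erf_tanh_approx(y: float) -> float:
    """Tanh-based approximation of erf(y), as used in the
    "tanh" / gelu_new approximation of GELU. Accurate to
    roughly 1e-3 over typical activation ranges."""
    # GELU's tanh form approximates erf(x / sqrt(2)) as
    # tanh(sqrt(2/pi) * (x + 0.044715 * x**3)); substitute x = sqrt(2) * y.
    x = math.sqrt(2.0) * y
    return math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3))

if __name__ == "__main__":
    # Compare against the exact erf from the standard library.
    for y in (0.25, 0.5, 1.0, 2.0):
        print(f"erf({y}) = {math.erf(y):.5f}, approx = {erf_tanh_approx(y):.5f}")
```

If the approximation's small error is acceptable for inference, a graph traced with the tanh variant should contain no `erf` op for the compiler to reject.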
@jeffhataws
Contributor

We have reproduced the issue and are looking into the problem.

@aws-donkrets
Contributor

Hi JingyaHuang, we have verified erf function support and expect it to be part of the next Neuron SDK release.
