[BUG]: Issues with custom loss and Tensorboard logging using multiprocessing mode #757

Closed
ibengtsson opened this issue Dec 2, 2024 · 5 comments · Fixed by #780
Labels
bug Something isn't working

Comments

@ibengtsson

What happened?

Hi!

First of all, thank you for a great package. It's extremely useful in my research and I'm looking forward to being able to cite your work in future papers!

I mainly use PySR in multiprocessing mode, because I have struggled to reach high CPU utilization with multithreading. However, multiprocessing seems to have its limitations, and some features only seem to work with multithreading.

My latest problem came up when I tried to run with the new TensorBoard logging:
Traceback (most recent call last):
  File "/Users/isakbe/Dev/modelling/il-sr/il_sr/scripts/run_sr.py", line 32, in <module>
    main()
  File "/Users/isakbe/Dev/modelling/il-sr/il_sr/scripts/run_sr.py", line 23, in main
    trainer.fit_expression()
  File "/Users/isakbe/Dev/modelling/il-sr/il_sr/scripts/../src/sr_training.py", line 251, in fit_expression
    self.model.fit(
  File "/Users/isakbe/Library/Caches/pypoetry/virtualenvs/il-sr-9TFUWRsR-py3.11/lib/python3.11/site-packages/pysr/sr.py", line 2240, in fit
    self._run(X, y, runtime_params, weights=weights, seed=seed, category=category)
  File "/Users/isakbe/Library/Caches/pypoetry/virtualenvs/il-sr-9TFUWRsR-py3.11/lib/python3.11/site-packages/pysr/sr.py", line 2028, in _run
    out = SymbolicRegression.equation_search(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/isakbe/.julia/packages/PythonCall/Nr75f/src/JlWrap/any.jl", line 258, in __call__
    return self._jl_callmethod($(pyjl_methodnum(pyjlany_call)), args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
juliacall.JuliaError: TaskFailedException
Stacktrace:
[1] wait
@ ./task.jl:352 [inlined]
[2] fetch
@ ./task.jl:372 [inlined]
[3] _main_search_loop!(state::SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}, Distributed.Future, Distributed.RemoteChannel}, datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true, SRLogger{TensorBoardLogger.TBLogger{String, IOStream}}}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, MutationWeights, false, true, nothing, Nothing, 5})
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/44X04/src/SymbolicRegression.jl:833
[4] _equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true, SRLogger{TensorBoardLogger.TBLogger{String, IOStream}}}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, MutationWeights, false, true, nothing, Nothing, 5}, saved_state::Nothing)
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/44X04/src/SymbolicRegression.jl:535
[5] equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}; options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, MutationWeights, false, true, nothing, Nothing, 5}, saved_state::Nothing, runtime_options::Nothing, runtime_options_kws::@kwargs{niterations::Int64, parallelism::String, numprocs::Int64, procs::Nothing, addprocs_function::Nothing, heap_size_hint_in_bytes::Nothing, runtests::Bool, return_state::Bool, run_id::String, verbosity::Int64, logger::SRLogger{TensorBoardLogger.TBLogger{String, IOStream}}, progress::Bool, v_dim_out::Val{1}})
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/44X04/src/SymbolicRegression.jl:525
[6] equation_search
@ ~/.julia/packages/SymbolicRegression/44X04/src/SymbolicRegression.jl:506 [inlined]
[7] #equation_search#20
@ ~/.julia/packages/SymbolicRegression/44X04/src/SymbolicRegression.jl:476 [inlined]
[8] equation_search
@ ~/.julia/packages/SymbolicRegression/44X04/src/SymbolicRegression.jl:422 [inlined]
[9] #equation_search#21
@ ~/.julia/packages/SymbolicRegression/44X04/src/SymbolicRegression.jl:499 [inlined]
[10] pyjlany_call(self::typeof(equation_search), args_::Py, kwargs_::Py)
@ PythonCall.JlWrap ~/.julia/packages/PythonCall/Nr75f/src/JlWrap/any.jl:40
[11] _pyjl_callmethod(f::Any, self_::Ptr{PythonCall.C.PyObject}, args_::Ptr{PythonCall.C.PyObject}, nargs::Int64)
@ PythonCall.JlWrap ~/.julia/packages/PythonCall/Nr75f/src/JlWrap/base.jl:73
[12] _pyjl_callmethod(o::Ptr{PythonCall.C.PyObject}, args::Ptr{PythonCall.C.PyObject})
@ PythonCall.JlWrap.Cjl ~/.julia/packages/PythonCall/Nr75f/src/JlWrap/C.jl:63

nested task error: On worker 2:
KeyError: key TensorBoardLogger [899adc3e-224a-11e9-021f-63837185c80f] not found
Stacktrace:
[1] deserialize_module
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:994
[2] handle_deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:896
[3] deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814
[4] deserialize_datatype
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:1398
[5] handle_deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:867
[6] deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814
[7] deserialize_datatype
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:1423
[8] handle_deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:867
[9] deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814
[10] deserialize_datatype
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:1423
[11] handle_deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:867
[12] deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814
[13] deserialize_datatype
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:1423
[14] handle_deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:867
[15] deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814
[16] handle_deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:874
[17] deserialize
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814 [inlined]
[18] deserialize_msg
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Distributed/src/messages.jl:87
[19] #invokelatest#2
@ ./essentials.jl:892 [inlined]
[20] invokelatest
@ ./essentials.jl:889 [inlined]
[21] message_handler_loop
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:176
[22] process_tcp_streams
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
[23] #103
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
Stacktrace:
[1] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID; kwargs::@kwargs{})
@ Distributed /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:465
[2] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID)
@ Distributed /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:454
[3] remotecall_fetch
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:492 [inlined]
[4] call_on_owner
@ /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:565 [inlined]
[5] fetch(r::Distributed.Future)
@ Distributed /opt/homebrew/Cellar/julia/1.10.4/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:619
[6] (::SymbolicRegression.var"#56#61"{SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}, Distributed.Future, Distributed.RemoteChannel}, Int64, Int64})()
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/44X04/src/SymbolicRegression.jl:810
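
A configuration along the following lines would exercise the code path in the traceback above. This is only a minimal sketch, not the original script: it assumes PySR 1.0's `parallelism`, `procs`, and `logger_spec` parameters and the `TensorBoardLoggerSpec` class, while the data, operators, log directory, and the helper names `build_model` and `run_reproduction` are illustrative assumptions.

```python
# Hypothetical reproduction settings; all values are illustrative assumptions.
SEARCH_KWARGS = {
    "niterations": 5,
    "binary_operators": ["+", "*"],
    "parallelism": "multiprocessing",  # the mode reported as failing
    "procs": 2,                        # number of worker processes
}
LOGGER_KWARGS = {"log_dir": "logs/run", "log_interval": 1}


def build_model():
    """Construct a PySRRegressor with TensorBoard logging enabled."""
    # Imported lazily so the settings can be inspected without PySR installed.
    from pysr import PySRRegressor, TensorBoardLoggerSpec

    return PySRRegressor(
        logger_spec=TensorBoardLoggerSpec(**LOGGER_KWARGS),
        **SEARCH_KWARGS,
    )


def run_reproduction():
    """Fit on small synthetic data; requires PySR and its Julia backend."""
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = X[:, 0] * X[:, 1] + X[:, 0]
    build_model().fit(X, y)
```

If the report is representative, calling `run_reproduction()` on PySR 1.0.0 should fail on the worker with the `KeyError` shown above, while changing `"parallelism"` to `"multithreading"` should avoid it, since no logger is serialized to another process in that mode.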

I have also had problems running custom loss functions in multiprocessing mode, so I thought the errors might be related. Unfortunately, I'm not fluent enough in Julia to get to the bottom of the problems and fix them myself, so I'd be grateful for any help I can get.

My other option would be to use only multithreading, but I've really struggled to make full use of my CPU resources that way. This seems to be a known issue, but if anyone has further suggestions on how to tune the settings for multithreading, I'd be grateful. I mainly run PySR on a distributed SLURM cluster.

Once again, thank you for a great package!

Version

1.0.0

Operating System

macOS

Package Manager

pip

Interface

Script (i.e., python my_script.py)


Extra Info

No response

ibengtsson added the bug label on Dec 2, 2024
@MilesCranmer
Owner

Ah, I know what this is. I'll make a patch later today.

It's basically just TensorBoardLogger not being imported on the worker process. There's an easy (internal) fix in SymbolicRegression.jl that I do for all the other extensions, but I guess not for TBLogger yet.
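
The mechanism behind this kind of error can be illustrated with a Python analogy. This is not the SymbolicRegression.jl code or its fix: it uses `pickle` in place of Julia's `Serialization`, a single process stands in for both the main process and the worker, and the module name `fake_logger_pkg`, the class `Logger`, and the helper `send_to_worker` are all hypothetical. The point is only that a serialized object carries a reference to the module defining its type, so the receiving side must have that module loaded.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a logging package loaded only on the main process.
MODULE_NAME = "fake_logger_pkg"


def make_logger_module():
    """Build an in-memory module that defines a picklable Logger class."""
    module = types.ModuleType(MODULE_NAME)

    class Logger:
        def __init__(self, log_dir):
            self.log_dir = log_dir

    # Make pickle record the class as fake_logger_pkg.Logger.
    Logger.__module__ = MODULE_NAME
    Logger.__qualname__ = "Logger"
    module.Logger = Logger
    return module


def send_to_worker(payload, worker_has_module, module):
    """Simulate deserializing on a worker that may lack the module."""
    if worker_has_module:
        sys.modules[MODULE_NAME] = module
    else:
        sys.modules.pop(MODULE_NAME, None)
    try:
        return pickle.loads(payload)
    finally:
        # Restore the "main process" state afterwards.
        sys.modules[MODULE_NAME] = module


module = make_logger_module()
sys.modules[MODULE_NAME] = module  # the main process has the package loaded
payload = pickle.dumps(module.Logger("logs/run1"))

# Worker without the module: deserialization cannot resolve the type.
try:
    send_to_worker(payload, worker_has_module=False, module=module)
    outcome = "ok"
except ModuleNotFoundError as err:
    outcome = f"failed: {err}"

# Worker with the module loaded: the same payload deserializes fine.
restored = send_to_worker(payload, worker_has_module=True, module=module)
print(outcome)
print(restored.log_dir)
```

The first attempt fails because the receiving side cannot find the module named in the payload, which is analogous to the `KeyError` raised in `deserialize_module` on worker 2. The second attempt succeeds once the module is available, which corresponds to importing the package on each worker before sending the logger.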

@MilesCranmer
Owner

@MilesCranmer
Owner

Also, please share the issues with the custom loss, as I might be able to solve those too.

@ibengtsson
Copy link
Author

Thanks for the quick reply! Will post a separate issue for the custom loss then.

@MilesCranmer
Owner

#780 should fix this, sorry for the delay.
