TFMA on Flink does not seem to be parallelizing work. #170
Comments
Can you please share the eval_config passed to the Evaluator component so we can analyse the root cause of the issue? Thank you!
Hello! Thank you for your answer. Here is the eval_config we used. We set artificially low bounds for testing.
@mdreves, can we dispatch the evaluator across different task managers? Thanks!
Hello, thank you for checking this issue. Did you have time to take a look? Have you identified anything so far? Can I help somehow?
This issue is currently preventing us from using the Evaluator component in production, since it makes the memory requirements on a single Flink TaskManager very large.
Hello, I am running into an issue with the Evaluator component of a TFX pipeline. I use the FlinkRunner for Beam, and the Evaluator component becomes very slow as the data size grows. This seems to be because all the work is done by a single TaskManager.
System information
I am running a TFX pipeline using Python 3.7 and TFX version 1.8.1, which comes with TFMA version tensorflow-model-analysis==0.39.0. I don't have a small example to reproduce the issue, but I can work on one if you think it will help.
Describe the problem
I use the evaluator TFX component as such
With a simple eval_config without any splits, so we only have the eval_split, which is used for evaluation. To run the TFX pipeline we use the FlinkRunner for Beam. The sidecar image is built from tensorflow/tfx:1.8.1.
We run Flink with a parallelism of 10, so 10 TFRecord files are fed into the Evaluator component.
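For readers trying to reproduce a similar setup, here is a minimal sketch of how these Beam options are commonly written as a list of flags (the form TFX accepts as beam_pipeline_args). It is not the configuration from this report: the Flink master address, the EXTERNAL environment type, and the worker endpoint are assumptions chosen for illustration. Only the runner name and the parallelism of 10 come from the description above.

```python
import argparse

# Placeholder values: NOT taken from the report, adjust to your cluster.
FLINK_MASTER = "localhost:8081"      # assumed Flink REST/JobManager address
WORKER_ENDPOINT = "localhost:50000"  # assumed sidecar worker-pool endpoint
PARALLELISM = 10                     # parallelism stated in the report


def build_beam_pipeline_args(flink_master, parallelism, worker_endpoint):
    """Return Beam pipeline options as a list of command-line style flags."""
    return [
        "--runner=FlinkRunner",
        f"--flink_master={flink_master}",
        f"--parallelism={parallelism}",
        # Assumption: the SDK harness runs as a sidecar worker pool.
        "--environment_type=EXTERNAL",
        f"--environment_config={worker_endpoint}",
    ]


def parse_parallelism(args):
    """Read back the --parallelism flag, ignoring every other option."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--parallelism", type=int, default=None)
    known, _unknown = parser.parse_known_args(args)
    return known.parallelism


beam_pipeline_args = build_beam_pipeline_args(
    FLINK_MASTER, PARALLELISM, WORKER_ENDPOINT
)
print(beam_pipeline_args)
print("requested parallelism:", parse_parallelism(beam_pipeline_args))
```

Reading the flag back is only a sanity check that the requested parallelism reaches the options list. It says nothing about how many subtasks Flink actually schedules for each operator.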
From what we could gather, Beam tells Flink to build a single task for the 3 PTransforms:
"ReadFromTFRecordToArrow" | "FlattenExamples" | "ExtractEvaluateAndWriteResults"
Our issue is that this ends up creating a single subtask in Flink, so a single TaskManager does all the work, as you can see in the attached screenshot. The problem therefore seems to be that the Beam workflow is not parallelized.
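The screenshot is not reproduced here, but the same observation can be checked from the Flink REST API, whose job-detail endpoint (GET /jobs/<job id>) lists each job vertex with its parallelism. The following is a minimal sketch under assumptions: the helper names are hypothetical, the REST address and job id must be supplied by the reader, and the sample payload is made up purely to show the expected shape. It is not data from this job.

```python
import json
import urllib.request


def fetch_job_detail(rest_url, job_id):
    """GET <rest_url>/jobs/<job_id> from the Flink REST API and decode the JSON."""
    with urllib.request.urlopen(f"{rest_url}/jobs/{job_id}") as resp:
        return json.load(resp)


def summarize_parallelism(job_detail):
    """Return (vertex name, parallelism) pairs from a job-detail payload."""
    return [
        (vertex.get("name", "?"), vertex.get("parallelism"))
        for vertex in job_detail.get("vertices", [])
    ]


def single_subtask_vertices(job_detail):
    """Names of the vertices that Flink runs with only one subtask."""
    return [name for name, p in summarize_parallelism(job_detail) if p == 1]


# Made-up payload for illustration only; a real one comes from fetch_job_detail().
sample_detail = {
    "jid": "hypothetical-job-id",
    "vertices": [
        {"name": "ReadFromTFRecordToArrow -> FlattenExamples", "parallelism": 1},
        {"name": "SomeOtherStage", "parallelism": 10},
    ],
}

for name, parallelism in summarize_parallelism(sample_detail):
    print(f"{parallelism:>3}  {name}")
print("single-subtask vertices:", single_subtask_vertices(sample_detail))
```

Against a live cluster, one would call fetch_job_detail("http://<flink-rest-host>:<port>", "<job id>") and pass the result to the same helpers. A vertex that covers the three transforms and reports a parallelism of 1 would match the behaviour described above.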
I have two main questions: