
TFMA on Flink does not seem to be parallelizing work. #170

Open
jccarles opened this issue Jan 31, 2023 · 5 comments

Comments

@jccarles

Hello, I am running into an issue with the Evaluator component of a TFX pipeline. I use the FlinkRunner for Beam, and the Evaluator component becomes very slow as the data size scales. It appears that the work is being done by only a single Task Manager.

System information

I am running a TFX pipeline using Python 3.7, TFX version 1.8.1, which ships with tensorflow-model-analysis==0.39.0.
I don't have a small example to reproduce; I can work on one if you think it would help.

Describe the problem

I use the evaluator TFX component as such

evaluator = Evaluator(
    examples=example_gen.outputs[standard_component_specs.EXAMPLES_KEY],
    model=trainer.outputs[standard_component_specs.MODEL_KEY],
    eval_config=eval_config,
)

With a simple eval_config without any splits, so we only have the eval split, which is used for evaluation.

To run the TFX pipeline we use the FlinkRunner for Beam. The sidecar image is built from tensorflow/tfx:1.8.1.

We run Flink with a parallelism of 10, so 10 TFRecord files are fed as input to the Evaluator component.
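For context, a Beam submission to the portable FlinkRunner typically looks roughly like the following (the script name, master address, and image are placeholders for illustration, not our exact values):

```shell
# Hedged sketch: Beam portable FlinkRunner options (all values are placeholders).
python my_tfx_pipeline.py \
  --runner=FlinkRunner \
  --flink_master=localhost:8081 \
  --parallelism=10 \
  --environment_type=DOCKER \
  --environment_config=my-registry/tfx-sidecar:1.8.1
```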

From what we could gather, Beam tells Flink to build a single task for the three PTransforms:

"ReadFromTFRecordToArrow" | "FlattenExamples" | "ExtractEvaluateAndWriteResults"

Our issue is that this ends up creating a single subtask in Flink, so a single task manager does all the work, as you can see in the attached screenshot. The issue therefore seems to be that the Beam workflow is fused into one stage and is not parallelized.

I have two main questions:

  • Is this behavior normal?
  • Is it possible to better dispatch the workload between the different task managers?

[Screenshot from 2023-01-26 15-18-44]

@singhniraj08 singhniraj08 self-assigned this Feb 2, 2023
@singhniraj08

@jccarles,

Can you please share the eval_config passed to the Evaluator component so we can analyse the root cause of the issue? Thank you!

@jccarles
Author

jccarles commented Feb 6, 2023

Hello! Thank you for your answer. Here is the eval_config we used; we set artificially low bounds for testing.

{
  "model_specs": [
    {
      "signature_name": "serving_default",
      "label_key": "label",
      "preprocessing_function_names": [
        "transform_features"
      ]
    }
  ],
  "metrics_specs": [
    {
      "thresholds": {
        "precision": {
          "value_threshold": {
            "lower_bound": 1e-03
          },
          "change_threshold": {
            "absolute": -1e-10,
            "direction": "HIGHER_IS_BETTER"
          }
        }
      }
    }
  ]
}
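For completeness, a JSON config like the one above can be parsed into the `EvalConfig` proto that the Evaluator component consumes via protobuf's `json_format`. This is a sketch, assuming `tensorflow_model_analysis` is importable as `tfma`; it is a config fragment only, with no runtime behavior:

```python
from google.protobuf import json_format
import tensorflow_model_analysis as tfma

# Sketch: parse the JSON config shown above into an EvalConfig proto.
eval_config_json = """
{
  "model_specs": [
    {
      "signature_name": "serving_default",
      "label_key": "label",
      "preprocessing_function_names": ["transform_features"]
    }
  ],
  "metrics_specs": [
    {
      "thresholds": {
        "precision": {
          "value_threshold": {"lower_bound": 1e-03},
          "change_threshold": {
            "absolute": -1e-10,
            "direction": "HIGHER_IS_BETTER"
          }
        }
      }
    }
  ]
}
"""
eval_config = json_format.Parse(eval_config_json, tfma.EvalConfig())
```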

@singhniraj08

@mdreves, can we dispatch the Evaluator work between different task managers? Thanks!

@jccarles
Author

jccarles commented Apr 3, 2023

Hello, thank you for checking this issue. Have you had time to take a look? Have you identified anything so far? Can I help somehow?

@Enzo90910

This issue is currently preventing us from using the Evaluator component in production, since it makes the memory requirements on a single Flink TaskManager prohibitively large.
