
TFMA on Flink does not seem to be parallelizing work. #170

Open
jccarles opened this issue Jan 31, 2023 · 5 comments

Comments

@jccarles

Hello, I am running into an issue with the Evaluator component of a TFX pipeline. I use the FlinkRunner for Beam, and the Evaluator component becomes very slow as the data size scales. It appears that the work is being done by only a single Task Manager.

System information

I am running a TFX pipeline using Python 3.7, TFX version 1.8.1, which ships with tensorflow-model-analysis==0.39.0.
I don't have a small example to reproduce; I can work on one if you think it would help.

Describe the problem

I use the evaluator TFX component as such

evaluator = Evaluator(
    examples=example_gen.outputs[standard_component_specs.EXAMPLES_KEY],
    model=trainer.outputs[standard_component_specs.MODEL_KEY],
    eval_config=eval_config,
)

With a simple eval_config without any splits, so we only have the eval split, which is used for evaluation.

To run the TFX pipeline we use the FlinkRunner for Beam. The sidecar image is built from tensorflow/tfx:1.8.1.

We run Flink with a parallelism of 10, so 10 TFRecord files are fed as input to the Evaluator component.
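For context, a Beam submission to the portable FlinkRunner typically looks roughly like the following (the script name, master address, and image are placeholders for illustration, not our exact values):

```shell
# Hedged sketch: Beam portable FlinkRunner options (all values are placeholders).
python my_tfx_pipeline.py \
  --runner=FlinkRunner \
  --flink_master=localhost:8081 \
  --parallelism=10 \
  --environment_type=DOCKER \
  --environment_config=my-registry/tfx-sidecar:1.8.1
```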

From what we could gather, Beam tells Flink to build a single task for the three PTransforms:

"ReadFromTFRecordToArrow" | "FlattenExamples" | "ExtractEvaluateAndWriteResults"

Our issue is that this ends up creating a single subtask in Flink, so a single task manager does all the work, as you can see in the attached screenshot. The issue therefore seems to be that the Beam workflow is fused into one stage and is not parallelized.

I have two main questions:

  • Is this behavior normal?
  • Is it possible to better dispatch the workload between the different task managers?

[Screenshot from 2023-01-26 15-18-44]

@singhniraj08 singhniraj08 self-assigned this Feb 2, 2023
@singhniraj08

@jccarles,

Can you please share the eval_config passed to the Evaluator component so we can analyse the root cause of the issue? Thank you!

@jccarles
Author

jccarles commented Feb 6, 2023

Hello! Thank you for your answer. Here is the eval_config we used; we set artificially low bounds for testing.

{
  "model_specs": [
    {
      "signature_name": "serving_default",
      "label_key": "label",
      "preprocessing_function_names": [
        "transform_features"
      ]
    }
  ],
  "metrics_specs": [
    {
      "thresholds": {
        "precision": {
          "value_threshold": {
            "lower_bound": 1e-03
          },
          "change_threshold": {
            "absolute": -1e-10,
            "direction": "HIGHER_IS_BETTER"
          }
        }
      }
    }
  ]
}
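For completeness, a JSON config like the one above can be parsed into the `EvalConfig` proto that the Evaluator component consumes via protobuf's `json_format`. This is a sketch, assuming `tensorflow_model_analysis` is importable as `tfma`; it is a config fragment only, with no runtime behavior:

```python
from google.protobuf import json_format
import tensorflow_model_analysis as tfma

# Sketch: parse the JSON config shown above into an EvalConfig proto.
eval_config_json = """
{
  "model_specs": [
    {
      "signature_name": "serving_default",
      "label_key": "label",
      "preprocessing_function_names": ["transform_features"]
    }
  ],
  "metrics_specs": [
    {
      "thresholds": {
        "precision": {
          "value_threshold": {"lower_bound": 1e-03},
          "change_threshold": {
            "absolute": -1e-10,
            "direction": "HIGHER_IS_BETTER"
          }
        }
      }
    }
  ]
}
"""
eval_config = json_format.Parse(eval_config_json, tfma.EvalConfig())
```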

@singhniraj08

@mdreves, can we dispatch the Evaluator work between different task managers? Thanks!

@jccarles
Author

jccarles commented Apr 3, 2023

Hello, thank you for checking this issue. Have you had time to take a look? Have you identified anything so far? Can I help somehow?

@Enzo90910

This issue is currently preventing us from using the Evaluator component in production, since it makes the memory requirements on a single Flink TaskManager prohibitively large.
