Transform and BulkInferrer dynamic batch size grows too large, causing OOM on 16GB GPU #5777
Comments
A possible solution might be to expose the max-batch-size setting on the BulkInferrer Inference spec proto and pass it all the way through. If I had a way of fixing the max batch size to 256, it should work.
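To make the ask concrete, a cap like that at the Beam level would look roughly like the sketch below. This uses Apache Beam's `BatchElements` transform directly; it is not an existing BulkInferrer or inference-spec option, and the element type and cap value are placeholders.

```python
import apache_beam as beam

# Sketch: cap Beam's adaptive batching so batches never exceed 256 elements.
# BulkInferrer does not expose this today; the cap shown here is the
# max_batch_size argument of Beam's own BatchElements transform.
with beam.Pipeline() as p:
    _ = (
        p
        | "CreateExamples" >> beam.Create(range(10_000))  # stand-in for serialized tf.Examples
        | "Batch" >> beam.BatchElements(min_batch_size=32, max_batch_size=256)
        | "RunModel" >> beam.Map(len)  # stand-in for per-batch model inference
    )
```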
@IzakMaraisTAL, this issue looks like a feature request. Thank you for bringing this up! @lego0901, please have a look at this feature request to expose a max-batch-size setting on the BulkInferrer Inference spec proto. Thanks!
Ack. Thanks for your request and for the very thorough investigation!
I am repeatedly getting this same bug with the Transform component too.
This is using the same dataset (2.5M images) and preprocessing layer (creating embeddings by passing the images through Xception). Splitting the dataset into 12 parts (180k examples each) and running a separate Transform for each resulted in 11 of the 12 Transforms passing and one failing with a similar OOM problem. But this workaround is very manual and makes further processing more difficult. I don't agree with this issue being classified as a feature request. TFX is a scalable stream-processing framework for ML. If it fails due to an increase in dataset size and incorrect usage of (or an underlying bug in) Beam, that is still a bug. The configurable maximum batch size bug-fix suggested above will need to be exposed to the Transform component too. An alternative fix would be for Beam itself to take the available GPU memory into account when determining how much to increase its batch size.
After trying various workarounds like this (none of which worked 100%), I am now on CPU instead of GPU as the only reliable option. This increases the cost of the transform from $100 to $300.
My bad. I will bump this issue up on our side and try to figure out a solution. Sorry for the inconvenience.
In TFX 1.13 we introduced a new batching mode that tries to deserialize data in batches of ~100 MB. It can be enabled with the `tfxio_use_byte_size_batching` flag. Could you try updating to 1.13 and setting the flag to True?
That sounds promising, thank you. I would very much like to upgrade, but unfortunately I am blocked by tensorflow/recommenders#671. Once that is resolved, I will give feedback.
Depending on how exactly you use Transform and BulkInferrer, you may also be able to set the data source (TFXIO) batch size. Or, if you use the instance-dict format with Transform, you can also set it through the transform context.
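For the instance-dict route mentioned here, a minimal sketch of setting the batch size through the transform context when driving tf.Transform's Beam implementation directly (the toy data, feature spec, and the value 256 are placeholders):

```python
import tempfile

import apache_beam as beam
import tensorflow as tf
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils


def preprocessing_fn(inputs):
    # Identity transform; a real pipeline would embed images/text here.
    return {"x_copied": inputs["x"]}


raw_data = [{"x": 1.0}, {"x": 2.0}]
raw_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({"x": tf.io.FixedLenFeature([], tf.float32)})
)

with beam.Pipeline() as p:
    # desired_batch_size caps how many instances tf.Transform feeds to the
    # TF graph at once, instead of relying purely on adaptive batching.
    with tft_beam.Context(temp_dir=tempfile.mkdtemp(), desired_batch_size=256):
        (transformed_data, _), _ = (
            (p | beam.Create(raw_data), raw_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
        )
```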
Thanks for the tips. I would like to apply them to the Transform component.
I instantiate a TFX Transform component as described in its documentation and provide it in the list of components passed to the pipeline class. The input to the Transform component is a channel of serialized examples. I'm not sure how one would leverage tfxio there.
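For reference, the wiring described above looks roughly like the sketch below; the upstream components, paths, and module file are placeholders rather than the actual pipeline code, and nothing in this constructor accepts a TFXIO source or a tft_beam.Context:

```python
from tfx import v1 as tfx

# Hypothetical paths; stand-ins for the real dataset and preprocessing module.
DATA_ROOT = "gs://my-bucket/tfrecords"
MODULE_FILE = "gs://my-bucket/preprocessing.py"

example_gen = tfx.components.ImportExampleGen(input_base=DATA_ROOT)
statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs["examples"])
schema_gen = tfx.components.SchemaGen(statistics=statistics_gen.outputs["statistics"])

# The Transform constructor takes channels and a module file, but exposes
# no batch-size, TFXIO, or transform-context parameters.
transform = tfx.components.Transform(
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    module_file=MODULE_FILE,
)

components = [example_gen, statistics_gen, schema_gen, transform]
```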
The TFX Transform component constructor does not expose the transform context. I can see the
Yes, you're right, the component itself does not expose the parameter. Even if we were to add it, it would only be available in an even later TFX version than the byte-size-based batching. So, unfortunately, updating and using the flag seems like the only option.
You could try creating a custom component based on Transform that overrides the parameter, but that may be pretty involved.
Thanks for the confirmation, will let you know once we test 1.13 after a compatible ScaNN release has been made.
A new release of ScaNN is available, but it looks like they skipped TensorFlow 2.12 altogether and went from 2.11 to 2.13. I will wait for a future release of TFX that depends on TensorFlow 2.13 to test the changes.
I have upgraded to TFX 1.14.0. This will be tested in the next scheduled run of the pipeline at the start of November.
The upgrade to TFX 1.14.0 was held back by #6386. I am now applying the workaround mentioned there and should have results after the next scheduled run at the start of Feb.
Unfortunately the fix does not work. The Transform component running TFX 1.14.0 ran out of memory on the 16GB GPU in exactly the same way as described previously.
I see now that for the failed TFX 1.14.0 run, I did not set the new flag as requested above. I will investigate how to set the global flag.
I added the flag in the Transform component's Beam pipeline args: `tfx.components.Transform(<args>).with_beam_pipeline_args([<other args>, "--tfxio_use_byte_size_batching"])`. In a test using the local TFX runner I could confirm that the flag value of True is propagated to my code, which I checked with `print("tfxio_use_byte_size_batching value", flags.FLAGS.get_flag_value("tfxio_use_byte_size_batching", False))`. Is this correct, or is there a better way to set the flag?
When running the TFX pipeline with the full dataset on Vertex AI, delegating the Transform to Dataflow on a 16GB GPU, I no longer get the error message described above, but the Dataflow job still fails after a series of resource allocation errors that try to allocate > 20GB. Here is the first one I could find:
As mentioned above, the Transform component sometimes passes if the example count is reduced, so I suspect the problem is still tied to dynamic batch size growth in some way. Here is the log: downloaded-logs-20240213-071807.json.zip. In case it is useful, here is the source code for my
since the OOM happens when applying the model and setting the
It's ugly, but it should help if the problem is in the produced batch size, until we have a better solution.
Interesting, will try that out and report back.
The above suggestion did not work. I see we also set memory growth on the GPU devices. Applied to the Transform component:

```python
def preprocessing_fn(inputs):
    # Force tf.Transform to use a small fixed batch size.
    tft_beam.Context.get_desired_batch_size = lambda _: 100
    # Let the GPU allocator grow on demand instead of pre-allocating.
    gpu_devices = tf.config.experimental.list_physical_devices("GPU")
    for device in gpu_devices:
        try:
            tf.config.experimental.set_memory_growth(device, True)
        except Exception as e:
            print(f'Ignoring: \n"{e}" \nCannot set memory growth.')
    ...
```

From the Dataflow worker logs:
UPDATE: removing
System information
Describe the current behavior
I am using the TFX BulkInferrer to apply a model with an Xception and BERT transform layer to a dataset of 2.5 million Examples with image and text features. After running for 7h of processing on Dataflow, an OOM error is triggered.
We see the error happens on the GPU (`device:GPU:0`) in the Xception model (node `xception/block2_sepconv1/separable_conv2d`) when trying to process large batches (`shape[512,...` and `shape[448,...`). That is a tensor with 1.8 billion floating-point values; at 32-bit precision (4 bytes) that comes to 1.8e9 * 4 bytes ≈ 7.3 GB. A single allocation attempt of that size could fail on a GPU with 16GB.
Describe the expected behavior
The Beam BatchElements batching algorithm should constrain the dynamic batch size to less than 512 or 448 in order to fit into the 16GB of GPU RAM. The OOM happens on the "train" split (80% of the data) after hours of processing. On the smaller "eval" split (10%) the BulkInferrer succeeds. From the Dataflow metrics, the batchsize_MAX was 256.
Standalone code to reproduce the issue
The issue is data-dependent. It is a basic BulkInferrer with imported examples and an imported model. Relevant Beam args:
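For readers unfamiliar with that setup, a rough sketch of a "basic BulkInferrer with imported examples and an imported model" is below; the URIs and component IDs are placeholders, not the actual reproduction code, and the Beam args are omitted:

```python
from tfx import v1 as tfx

# Placeholders; the real pipeline points at its own artifacts.
EXAMPLES_ROOT = "gs://my-bucket/serialized-examples"
MODEL_URI = "gs://my-bucket/exported-model"

# Imported examples (serialized tf.Examples already on disk).
example_gen = tfx.components.ImportExampleGen(input_base=EXAMPLES_ROOT)

# Imported (pre-trained) model rather than one trained in this pipeline.
model_importer = tfx.dsl.Importer(
    source_uri=MODEL_URI,
    artifact_type=tfx.types.standard_artifacts.Model,
).with_id("model_importer")

bulk_inferrer = tfx.components.BulkInferrer(
    examples=example_gen.outputs["examples"],
    model=model_importer.outputs["result"],
    data_spec=tfx.proto.DataSpec(),
    model_spec=tfx.proto.ModelSpec(),
)
```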
Other info / logs
Here are the logs.
Using a bottom-up search of all the Python virtual-env source files, I searched for the function names in the failed step name highlighted on the Dataflow job graph: `RunInference[train]/RunInference/RunInferenceImpl/BulkInference/BatchElements/ParDo(_GlobalWindowsBatchingDoFn)`: `ml/inference/base.py::RunInference`