Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-34076] Implement TTL-based caching for BigQuery table definitions #34135

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Strikerrx01
Copy link

@Strikerrx01 Strikerrx01 commented Mar 1, 2025

[BEAM-34076] Implement TTL-based caching for BigQuery table definitions

Description

This change implements a thread-safe caching mechanism for BigQuery table definitions
in the BigQueryWrapper class to address issue BEAM-34076. The implementation uses
a TTL (Time-To-Live) caching strategy to reduce BigQuery API calls while maintaining
data freshness.

Motivation

Previously, get_table() was called independently by each worker, leading to potential
BigQuery quota issues when multiple workers access the same table definitions. This
implementation significantly reduces API calls by reusing cached table definitions
when appropriate.

Modifications

  • Added cachetools.TTLCache with configurable cache parameters
  • Implemented thread-safety using threading.RLock
  • Added methods to configure TTL and cache size limits
  • Added cache invalidation mechanisms (global and selective clearing)
  • Ensured proper cache behavior when changing TTL settings
  • Preserved existing API behavior for backward compatibility

Tests

Added comprehensive unit tests that verify:

  • Correct caching behavior for repeated table lookups
  • Cache entry expiration based on TTL
  • Thread-safety in concurrent environments
  • Proper cache invalidation when configuration changes
  • Selective cache clearing by project, dataset, or table

Documentation

Added method documentation for all new public methods and internal implementation
details to ensure maintainability and clarity.

Fixes #34076

Please Review

@apache/beam-maintainers

Copy link
Contributor

github-actions bot commented Mar 1, 2025

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch from b31af48 to cfff9ef Compare March 2, 2025 10:22
@Strikerrx01 Strikerrx01 closed this Mar 2, 2025
@Strikerrx01 Strikerrx01 deleted the bigquery-table-cache branch March 2, 2025 10:24
@Strikerrx01 Strikerrx01 restored the bigquery-table-cache branch March 2, 2025 10:26
@Strikerrx01 Strikerrx01 reopened this Mar 2, 2025
@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch from cfff9ef to 6cc2629 Compare March 2, 2025 10:32
@liferoad liferoad requested a review from ahmedabu98 March 2, 2025 13:22
@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch 2 times, most recently from f80bd52 to 890670a Compare March 2, 2025 16:39
- Fix field_type.upper() redundant call in beam_row_from_dict
- Fix temp dataset handling logic in BigQueryWrapper.__init__
- Add comprehensive test coverage for table definition caching
- Add proper thread safety with RLock
- Add documentation and comments

Fixes apache#34076
@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch from 890670a to 8c5d460 Compare March 4, 2025 11:07
@Strikerrx01
Copy link
Author

Strikerrx01 commented Mar 4, 2025

R: @stankiewicz Thanks for the review. I've addressed your feedback:

  1. Removed redundant field_type.upper() call since the type is already uppercase
  2. Fixed temp dataset handling logic to preserve original behavior with temp_dataset_id or (BigQueryWrapper.TEMP_DATASET + self._temporary_table_suffix)
  3. Added comprehensive test coverage for caching functionality in bigquery_tools_test.py

Could you please review the updated code in bigquery_tools.py and bigquery_tools_test.py? The new tests verify caching behavior, TTL validation, and thread safety.

Please let me know if you'd like me to explain anything further.

Copy link
Contributor

github-actions bot commented Mar 4, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

- Reorder methods for better readability (get_query_location before get_table)
- Use _get_temp_dataset() instead of inlining logic
- Add comprehensive test coverage for temp dataset handling
- Fix method organization as per review feedback

Part of apache#34076
@Strikerrx01 Strikerrx01 requested a review from stankiewicz March 4, 2025 12:09
@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch from c496455 to 7128d38 Compare March 9, 2025 05:38
@Strikerrx01 Strikerrx01 closed this Mar 9, 2025
@Strikerrx01 Strikerrx01 force-pushed the bigquery-table-cache branch from 7128d38 to ac1c788 Compare March 9, 2025 05:48
@Strikerrx01 Strikerrx01 reopened this Mar 9, 2025
@derrickaw
Copy link
Collaborator

Hi @Strikerrx01, friendly ping to look at finishing this pr. Thanks!

@chamikaramj
Copy link
Contributor

@Strikerrx01 any updates ?

@Strikerrx01
Copy link
Author

@Strikerrx01 any updates ?

@chamikaramj I apologize for the lack of updates on this PR over the past week. I've been dealing with a medical emergency that prevented me from working on this. I'm now back and have just pushed an update with the formatting changes as requested. Thank you for your patience, and I'm ready to address any remaining feedback to get this PR merged.

@Strikerrx01 Strikerrx01 changed the title [Python] Add caching for BigQuery table definitions [BEAM-34076] Implement TTL-based caching for BigQuery table definitions Mar 29, 2025
@stankiewicz
Copy link
Contributor

there are formatting issues -- https://cwiki.apache.org/confluence/display/BEAM/Python+Tips#PythonTips-LintandFormattingChecks and run yapf, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request]: Cache the BigQuery table definition instead of calling tables.get() from every worker
6 participants