
[Task]: Improve Enrichment docs #33012

Open
2 of 17 tasks
damccorm opened this issue Nov 4, 2024 · 10 comments

Comments

@damccorm
Contributor

damccorm commented Nov 4, 2024

What needs to happen?

There are a few targeted fixes needed for the Enrichment docs:

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@damccorm
Contributor Author

damccorm commented Nov 4, 2024

@claudevdm this would be good to pick up at some point when you have space (don't drop other things, just when this fits in nicely)

@Vishesh-Tripathi

Hello sir, I would like to work on this issue. Please assign it to me.

@liferoad
Contributor

liferoad commented Jan 6, 2025

Thanks! Please check https://beam.apache.org/contribute/:

Comment “.take-issue” on the issue you'd like to work on. This will cause the issue to be assigned to you.

@Vishesh-Tripathi

.take-issue

@Vishesh-Tripathi

Hello sir, I am updating the file by adding this section. Is it correct?

BigQuery Support

The enrichment transform supports integration with BigQuery to dynamically enrich data using BigQuery datasets. By leveraging BigQuery as an external data source, users can execute efficient lookups for data enrichment directly in their Apache Beam pipelines.

To use BigQuery for enrichment:

  • Configure your BigQuery table as the data source for the enrichment process.
  • Ensure your pipeline has the appropriate credentials and permissions to access the BigQuery dataset.
  • Specify the query to extract the data to be used for enrichment.

This integration is particularly beneficial for use cases that require augmenting real-time streaming data with information stored in BigQuery.
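
For a concrete picture, a minimal sketch of a BigQuery-backed enrichment pipeline is shown below. It assumes Beam's built-in BigQueryEnrichmentHandler; the project, table, and column names are placeholders, and the parameter names should be verified against the current API reference.

import apache_beam as beam
from apache_beam.transforms.enrichment import Enrichment
from apache_beam.transforms.enrichment_handlers.bigquery import BigQueryEnrichmentHandler

# Sketch only: verify parameter names against the current handler signature.
# 'my-project' and the table/column names are placeholders.
bigquery_handler = BigQueryEnrichmentHandler(
    project='my-project',
    table_name='my-project.my_dataset.product_details',
    row_restriction_template="id = '{}'",
    fields=['id'],
    column_names=['id', 'name', 'quantity'],
)

with beam.Pipeline() as p:
    enriched = (
        p
        | 'Create' >> beam.Create([beam.Row(id='1'), beam.Row(id='2')])
        | 'Enrich' >> Enrichment(bigquery_handler))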


Batching

To optimize requests to external services, the enrichment transform uses batching. Instead of performing a lookup for each individual element, the transform groups multiple elements into a batch and performs a single lookup for the entire batch.

Advantages of Batching:

  • Improved Throughput: Reduces the number of network calls.
  • Lower Latency: Fewer round trips to the external service.
  • Cost Optimization: Minimizes API call costs when working with paid external services.

Users can configure the batch size by specifying parameters in their pipeline setup. Adjusting the batch size can help fine-tune the balance between throughput and latency.
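
Batching of this kind in Beam is typically built on the element-batching utilities; the standalone sketch below illustrates the idea with beam.BatchElements, where min_batch_size and max_batch_size control the throughput/latency trade-off. It simulates the external lookup rather than calling a real service and is not the enrichment transform itself.

import apache_beam as beam

# Standalone illustration of batching: elements are grouped before a
# (simulated) external lookup, so one call can serve many elements.
def lookup_batch(batch):
    # In a real enrichment pipeline this would be a single RPC for the whole batch.
    return [{'element': e, 'enriched': True} for e in batch]

with beam.Pipeline() as p:
    _ = (
        p
        | 'Create' >> beam.Create(range(25))
        | 'Batch' >> beam.BatchElements(min_batch_size=5, max_batch_size=10)
        | 'Lookup' >> beam.FlatMap(lookup_batch)
        | 'Print' >> beam.Map(print))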


Caching with with_redis_cache

For frequently used enrichment data, caching can significantly improve performance by reducing repeated calls to the remote service. Apache Beam's with_redis_cache method allows you to integrate a Redis cache into the enrichment pipeline.

Benefits of Caching:

  • Reduced Latency: Fetches enrichment data from the cache instead of making network calls.
  • Improved Resilience: Minimizes the impact of network outages or service downtimes.
  • Scalability: Handles large volumes of enrichment requests efficiently.

To enable caching:

  1. Set up a Redis instance accessible by your pipeline.
  2. Use the with_redis_cache method to configure the cache in your enrichment transform.
  3. Specify the time-to-live (TTL) for cache entries to ensure data freshness.

Example:

from apache_beam.transforms.enrichment import Enrichment

# Enrichment pipeline with a Redis cache. with_redis_cache is a method on the
# Enrichment transform; my_enrichment_handler, redis_host and redis_port are placeholders.
enriched_data = (
    input_data
    | 'Enrich with Cache' >> Enrichment(my_enrichment_handler).with_redis_cache(redis_host, redis_port))

@Vishesh-Tripathi

Vishesh-Tripathi commented Jan 7, 2025

And I am adding this section to explain cross-joins. Please tell me if there is any mistake or anything that needs updating.

What is a Cross-Join?

A cross-join is a Cartesian product operation where each row from one table is combined with every row from another table. It is useful when we want to create all possible combinations of two datasets.

Example:

  • Table A:

    A1  A2
    1   X
    2   Y

  • Table B:

    B1  B2
    10  P
    20  Q

Result of Cross-Join:

A1  A2  B1  B2
1   X   10  P
1   X   20  Q
2   Y   10  P
2   Y   20  Q

Cross-joins can be computationally expensive for large datasets, so use them judiciously.
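
As a plain-Python illustration of the Cartesian-product behaviour, using the example rows above (this is independent of the enrichment transform itself):

from itertools import product

table_a = [{'A1': 1, 'A2': 'X'}, {'A1': 2, 'A2': 'Y'}]
table_b = [{'B1': 10, 'B2': 'P'}, {'B1': 20, 'B2': 'Q'}]

# Cross-join: pair every row of table_a with every row of table_b (2 x 2 = 4 rows).
cross_joined = [{**a, **b} for a, b in product(table_a, table_b)]

for row in cross_joined:
    print(row)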

@Vishesh-Tripathi

@damccorm sir, where can I find the correct URL to replace the broken one?

@damccorm
Contributor Author

damccorm commented Jan 9, 2025

I think that resource may have been deleted, but we could probably link to

class BigTableEnrichmentHandler(EnrichmentSourceHandler[beam.Row, beam.Row]):
as an example of how to build a handler.
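
For illustration, a custom handler would subclass EnrichmentSourceHandler roughly along these lines; the exact contract (notably __call__ returning a (request, response) pair of beam.Row values) is assumed from the BigTableEnrichmentHandler example and should be checked against the Beam source.

import apache_beam as beam
from apache_beam.transforms.enrichment import EnrichmentSourceHandler

# Illustrative sketch of a custom handler; verify the base-class contract
# against the BigTableEnrichmentHandler source linked above.
class InMemoryLookupHandler(EnrichmentSourceHandler[beam.Row, beam.Row]):
    def __init__(self, lookup_table):
        # lookup_table: dict mapping an id to a dict of enrichment fields (placeholder).
        self._lookup_table = lookup_table

    def __call__(self, request: beam.Row, *args, **kwargs):
        # Fetch enrichment fields for the incoming element and return the
        # original request together with the response row.
        key = getattr(request, 'id', None)
        response = beam.Row(**self._lookup_table.get(key, {}))
        return request, response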

The other changes look reasonable to me at a high level, it is probably easier to just go ahead and open a PR when you have a chance though, that will make it a bit easier to see the difference and review it.

Thanks for doing this!

@mohamedawnallah
Contributor

Hi @Vishesh-Tripathi, are you currently working on this issue? I’d like to take it on otherwise. 🙏

cc: @damccorm, @liferoad

@Vishesh-Tripathi

Yes, I am working on the issue.

Vishesh-Tripathi added a commit to Vishesh-Tripathi/beam that referenced this issue Jan 16, 2025