
[Task]: Improve Enrichment docs #33012

Open
2 of 17 tasks
damccorm opened this issue Nov 4, 2024 · 10 comments

Comments

@damccorm
Contributor

damccorm commented Nov 4, 2024

What needs to happen?

There are a few targeted fixes needed for the Enrichment docs:

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@damccorm
Contributor Author

damccorm commented Nov 4, 2024

@claudevdm this would be good to pick up at some point when you have space (don't drop other things, just when this fits in nicely)

@Vishesh-Tripathi

Hello sir, I would like to work on this issue. Please assign it to me.

@liferoad
Contributor

liferoad commented Jan 6, 2025

Thanks! Please check https://beam.apache.org/contribute/:

Comment “.take-issue” on the issue you'd like to work on. This will cause the issue to be assigned to you.

@Vishesh-Tripathi

.take-issue

@Vishesh-Tripathi

Hello sir, I am updating the file by adding this section. Is it correct?

BigQuery Support

The enrichment transform supports integration with BigQuery to dynamically enrich data using BigQuery datasets. By leveraging BigQuery as an external data source, users can execute efficient lookups for data enrichment directly in their Apache Beam pipelines.

To use BigQuery for enrichment:

  • Configure your BigQuery table as the data source for the enrichment process.
  • Ensure your pipeline has the appropriate credentials and permissions to access the BigQuery dataset.
  • Specify the query to extract the data to be used for enrichment.

This integration is particularly beneficial for use cases that require augmenting real-time streaming data with information stored in BigQuery.
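
For a concrete picture, a minimal sketch of a BigQuery-backed enrichment pipeline is shown below. It assumes Beam's built-in BigQueryEnrichmentHandler; the project, table, and column names are placeholders, and the parameter names should be verified against the current API reference.

import apache_beam as beam
from apache_beam.transforms.enrichment import Enrichment
from apache_beam.transforms.enrichment_handlers.bigquery import BigQueryEnrichmentHandler

# Sketch only: verify parameter names against the current handler signature.
# 'my-project' and the table/column names are placeholders.
bigquery_handler = BigQueryEnrichmentHandler(
    project='my-project',
    table_name='my-project.my_dataset.product_details',
    row_restriction_template="id = '{}'",
    fields=['id'],
    column_names=['id', 'name', 'quantity'],
)

with beam.Pipeline() as p:
    enriched = (
        p
        | 'Create' >> beam.Create([beam.Row(id='1'), beam.Row(id='2')])
        | 'Enrich' >> Enrichment(bigquery_handler))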


Batching

To optimize requests to external services, the enrichment transform uses batching. Instead of performing a lookup for each individual element, the transform groups multiple elements into a batch and performs a single lookup for the entire batch.

Advantages of Batching:

  • Improved Throughput: Reduces the number of network calls.
  • Lower Latency: Fewer round trips to the external service.
  • Cost Optimization: Minimizes API call costs when working with paid external services.

Users can configure the batch size by specifying parameters in their pipeline setup. Adjusting the batch size can help fine-tune the balance between throughput and latency.
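
Batching of this kind in Beam is typically built on the element-batching utilities; the standalone sketch below illustrates the idea with beam.BatchElements, where min_batch_size and max_batch_size control the throughput/latency trade-off. It simulates the external lookup rather than calling a real service and is not the enrichment transform itself.

import apache_beam as beam

# Standalone illustration of batching: elements are grouped before a
# (simulated) external lookup, so one call can serve many elements.
def lookup_batch(batch):
    # In a real enrichment pipeline this would be a single RPC for the whole batch.
    return [{'element': e, 'enriched': True} for e in batch]

with beam.Pipeline() as p:
    _ = (
        p
        | 'Create' >> beam.Create(range(25))
        | 'Batch' >> beam.BatchElements(min_batch_size=5, max_batch_size=10)
        | 'Lookup' >> beam.FlatMap(lookup_batch)
        | 'Print' >> beam.Map(print))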


Caching with with_redis_cache

For frequently used enrichment data, caching can significantly improve performance by reducing repeated calls to the remote service. Apache Beam's with_redis_cache method allows you to integrate a Redis cache into the enrichment pipeline.

Benefits of Caching:

  • Reduced Latency: Fetches enrichment data from the cache instead of making network calls.
  • Improved Resilience: Minimizes the impact of network outages or service downtimes.
  • Scalability: Handles large volumes of enrichment requests efficiently.

To enable caching:

  1. Set up a Redis instance accessible by your pipeline.
  2. Use the with_redis_cache method to configure the cache in your enrichment transform.
  3. Specify the time-to-live (TTL) for cache entries to ensure data freshness.

Example:

from apache_beam.transforms.enrichment import Enrichment

# Enrichment pipeline with a Redis cache. with_redis_cache is a method on the
# Enrichment transform; my_enrichment_handler, redis_host and redis_port are placeholders.
enriched_data = (
    input_data
    | 'Enrich with Cache' >> Enrichment(my_enrichment_handler).with_redis_cache(redis_host, redis_port))

@Vishesh-Tripathi

Vishesh-Tripathi commented Jan 7, 2025

And I am adding this section to explain cross-joins. Please tell me if there is any mistake or anything that needs updating.

What is a Cross-Join?

A cross-join is a Cartesian product operation where each row from one table is combined with every row from another table. It is useful when we want to create all possible combinations of two datasets.

Example:

  • Table A:

    A1  A2
    1   X
    2   Y

  • Table B:

    B1  B2
    10  P
    20  Q

Result of Cross-Join:

A1  A2  B1  B2
1   X   10  P
1   X   20  Q
2   Y   10  P
2   Y   20  Q

Cross-joins can be computationally expensive for large datasets, so use them judiciously.
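
As a plain-Python illustration of the Cartesian-product behaviour, using the example rows above (this is independent of the enrichment transform itself):

from itertools import product

table_a = [{'A1': 1, 'A2': 'X'}, {'A1': 2, 'A2': 'Y'}]
table_b = [{'B1': 10, 'B2': 'P'}, {'B1': 20, 'B2': 'Q'}]

# Cross-join: pair every row of table_a with every row of table_b (2 x 2 = 4 rows).
cross_joined = [{**a, **b} for a, b in product(table_a, table_b)]

for row in cross_joined:
    print(row)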

@Vishesh-Tripathi

@damccorm sir, where can I find the correct URL to replace the broken one?

@damccorm
Contributor Author

damccorm commented Jan 9, 2025

I think that resource may have been deleted, but we could probably link to

class BigTableEnrichmentHandler(EnrichmentSourceHandler[beam.Row, beam.Row]):
as an example of how to build a handler.
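
For illustration, a custom handler would subclass EnrichmentSourceHandler roughly along these lines; the exact contract (notably __call__ returning a (request, response) pair of beam.Row values) is assumed from the BigTableEnrichmentHandler example and should be checked against the Beam source.

import apache_beam as beam
from apache_beam.transforms.enrichment import EnrichmentSourceHandler

# Illustrative sketch of a custom handler; verify the base-class contract
# against the BigTableEnrichmentHandler source linked above.
class InMemoryLookupHandler(EnrichmentSourceHandler[beam.Row, beam.Row]):
    def __init__(self, lookup_table):
        # lookup_table: dict mapping an id to a dict of enrichment fields (placeholder).
        self._lookup_table = lookup_table

    def __call__(self, request: beam.Row, *args, **kwargs):
        # Fetch enrichment fields for the incoming element and return the
        # original request together with the response row.
        key = getattr(request, 'id', None)
        response = beam.Row(**self._lookup_table.get(key, {}))
        return request, response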

The other changes look reasonable to me at a high level, it is probably easier to just go ahead and open a PR when you have a chance though, that will make it a bit easier to see the difference and review it.

Thanks for doing this!

@mohamedawnallah
Contributor

Hi @Vishesh-Tripathi, are you currently working on this issue? I’d like to take it on otherwise. 🙏

cc: @damccorm, @liferoad

@Vishesh-Tripathi

Yes, I am working on the issue.

Vishesh-Tripathi added a commit to Vishesh-Tripathi/beam that referenced this issue Jan 16, 2025