
Source Zuora: issues with high volume events (10m+ rows using live) #25319

Closed
jetvp opened this issue Apr 19, 2023 · 8 comments

Comments

@jetvp

jetvp commented Apr 19, 2023

Environment

  • Airbyte version: 0.44.0
  • OS Version / Instance: AWS EKS
  • Deployment: Kubernetes
  • Source Connector and version: Zuora 0.1.3
  • Destination Connector and version: Snowflake 0.4.61
  • Step where error happened: Sync job for Zuora on high volume events (i.e. revenue, invoice, or usage items)

Current Behavior

It appears that the queries are causing full table scans rather than being limited to rows in the set date ranges (possibly because the queries do not use the recommended date formats, or because of the order by statement; https://knowledgecenter.zuora.com/Zuora_Central_Platform/Query/Export_ZOQL/G_Dates_and_Datetimes). As a result, any dataset with more than 10m rows will fail regardless of the date limits set (using live; unlimited will be deprecated by Zuora soon).
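For concreteness, here is the query the connector generates today (copied from the log below) next to a hedged sketch of the datetime-literal style described on that Zuora dates page. The "recommended" form is an assumption on my part and should be verified against the linked documentation rather than taken from this example:

```python
# Illustration only; the "recommended" form below is an assumption based on the
# linked Zuora "Dates and Datetimes" page and should be verified there.

# Query the connector currently generates (taken from the failure log below):
generated = (
    "select * from revenueevent "
    "where updateddate >= TIMESTAMP '2023-03-01 00:00:00.000000 +00:00' "
    "and updateddate <= TIMESTAMP '2023-03-01 00:00:00.864000 +00:00' "
    "order by updateddate asc"
)

# Assumed datetime-literal style per the Zuora docs (ISO 8601 with offset):
recommended = (
    "select * from revenueevent "
    "where updateddate >= '2023-03-01T00:00:00+00:00' "
    "and updateddate <= '2023-03-01T00:00:00.864+00:00'"
)
```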

Additionally, the Zuora source connector uses date windows to reduce the number of events extracted from the Zuora API per query and stay under the API limits. However, because Zuora calculates revenue items and invoice items during billing runs, these streams will very often have large volumes inside a single window, since the records are generated in very quick bursts (even with sub-second window sizes). This is seen in airbyte/airbyte-integrations/connectors/source-zuora/source_zuora/source.py at line 231, where the query function iterates over date windows.
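As a rough illustration of why fixed windows break down on bursty streams, the slicing logic has roughly this shape (a simplified sketch; names and details are not taken from source.py):

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def date_windows(start: datetime, end: datetime, window: timedelta) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive [slice_start, slice_end] windows between start and end."""
    cursor = start
    while cursor < end:
        yield cursor, min(cursor + window, end)
        cursor += window


# Every window issues one query. If a billing run wrote millions of revenue
# events inside a single window, that one query still has to cover them all,
# no matter how small the window is made.
for slice_start, slice_end in date_windows(
    datetime(2023, 3, 1), datetime(2023, 3, 2), timedelta(seconds=1)
):
    pass  # query(f"... where updateddate >= '{slice_start}' and updateddate <= '{slice_end}'")
```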

To better handle streams with a high number of events, the connector should look at using the AQuA API.

Expected Behavior

The query should not cause Zuora to read more than 10m rows when the date range only selects a small number of them. At present this makes any window size fail.

Additionally, as the unlimited data source for Zuora is about to be removed, the connector should look at using the recommended AQuA API to support high volume events (it supports stateful, incremental updates):
https://knowledgecenter.zuora.com/Zuora_Central_Platform/API/AB_Aggregate_Query_API
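For reference, a minimal sketch of what a stateful AQuA submission could look like, assuming the batch-query request shape and polling endpoint described on the page above; field names, the base URL, and the stateful semantics are assumptions and should be checked against that documentation before use:

```python
import time

import requests  # assumed HTTP client; the connector may use a different one

BASE = "https://rest.zuora.com"  # tenant base URL is an assumption
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credentials

# Per the linked docs, supplying partner/project is what switches AQuA into
# stateful mode, so later runs return incremental results instead of a full export.
payload = {
    "format": "csv",
    "version": "1.2",              # assumed stateful-capable API version
    "name": "airbyte-revenueevent",
    "partner": "airbyte",          # assumed identifiers
    "project": "revenueevent-sync",
    "queries": [
        {
            "name": "revenueevent",
            "type": "zoqlexport",
            "query": "select * from RevenueEvent",
        }
    ],
}

job = requests.post(f"{BASE}/v1/batch-query/", json=payload, headers=HEADERS).json()

# Poll the job until it completes, then download the produced file(s).
# Polling path is an assumption based on the AQuA docs; verify before relying on it.
while True:
    status = requests.get(f"{BASE}/v1/batch-query/jobs/{job['id']}", headers=HEADERS).json()
    if status.get("status") == "completed":
        break
    time.sleep(30)
```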

Alternatively, it could query from the last known event or the start date (as done currently), but with a row limit equal to the maximum number of rows allowed here, instead of a date window. The subsequent query or stream slice would then use the last retrieved timestamp as the starting point for the next query.
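A minimal sketch of that cursor-plus-limit approach, assuming the query endpoint accepts a LIMIT clause and that updateddate can serve as the cursor; the helper names here are hypothetical, not part of the connector:

```python
from typing import Dict, Iterator, List

ROW_LIMIT = 1_000_000  # assumed; must stay below Zuora's per-query input-row cap


def read_incremental(run_query, start_cursor: str) -> Iterator[Dict]:
    """Page through revenueevent by updateddate, ROW_LIMIT rows at a time.

    `run_query` is a hypothetical callable that executes a query and returns
    rows ordered by updateddate ascending.
    """
    cursor = start_cursor
    while True:
        rows: List[Dict] = run_query(
            "select * from revenueevent "
            f"where updateddate > TIMESTAMP '{cursor}' "
            f"order by updateddate asc limit {ROW_LIMIT}"
        )
        if not rows:
            break
        yield from rows
        # The last retrieved timestamp becomes the cursor for the next slice.
        # Note: a strict '>' can skip rows sharing the final timestamp of a full
        # page; a real implementation would need a tiebreaker (e.g. the record id).
        cursor = rows[-1]["updateddate"]
        if len(rows) < ROW_LIMIT:
            break
```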

Logs

2023-04-19 14:06:10 source > Query failed (#20230419_140036_49258_z969q): Input Rows for revenueevent exceeded limit (10000000), QUERY: select * from revenueevent where updateddate >= TIMESTAMP '2023-03-01 00:00:00.000000 +00:00' and updateddate <= TIMESTAMP '2023-03-01 00:00:00.864000 +00:00' order by updateddate asc

Steps to Reproduce

  1. Create a Zuora sync including a stream with more than 10m events
  2. Set a small date window that matches fewer than 10m events
  3. Synchronise the event stream using the live connection (unlimited will no longer be an option); the sync fails because the query still reads more than 10m rows during execution
@jetvp added the needs-triage and type/bug labels Apr 19, 2023
@jetvp jetvp changed the title Zuora issues with high volume events that are variable quantity over time (i.e. generated by billing runs) Zuora issues with high volume events (10m+ rows using live) Apr 20, 2023
@jetvp
Author

jetvp commented Apr 20, 2023

Just corrected the above to note that the current date window query appears to be scanning the full stream of rows on the Zuora side (hence breaching the API limit of 10m rows) even when the number of rows in the date window is small. This will be a blocking issue, as the unlimited connection will be deprecated by Zuora at the end of April, meaning these streams will fail.

Possible reasons are either the non-standard date format or the order by statement.

@jetvp
Author

jetvp commented Apr 20, 2023

Looks like the date issue with rows is connected to pull request #24460.

@frans-k

frans-k commented Apr 24, 2023

@jetvp Zuora uses Trino and claims it's a limitation (I'd call it a bug) in Trino that causes this issue. The generated SQL in my pull request works, I just haven't managed to run the updated connector locally, even as a manually built, custom connector.

@jetvp
Author

jetvp commented Apr 24, 2023

Thanks @frans-k!

I also tried building it myself, but hit a similar issue. Not sure yet whether this is related to another change in the code base.

@jetvp
Author

jetvp commented Apr 26, 2023

@frans-k, I've posted a comment on your pull request that fixes another bug in the connector that was causing authentication to fail. Possibly another change introduced it.

You might be able to get past the prior test errors with it, as they were not a sign of bad credentials.

@igrankova igrankova changed the title Zuora issues with high volume events (10m+ rows using live) Source Zuora: issues with high volume events (10m+ rows using live) Jun 6, 2023
@jrolom added the help-welcome label Jun 23, 2023
@bmcgilli bmcgilli moved this to To Be Claimed in Hacktoberfest 2023 Sep 28, 2023
@topefolorunso
Collaborator

I think this issue is being fixed here - #24460. Please confirm @frans-k @jetvp

@marcosmarxm
Member

Zuora was archived and won't receive new updates.

@github-project-automation github-project-automation bot moved this from To Be Claimed to Done in Hacktoberfest 2023 Apr 12, 2024
@jetvp
Author

jetvp commented May 29, 2024

Hi @marcosmarxm,

Unfortunately there is no support for Zuora via the connector builder, since Zuora produces query outputs that then need to be fetched separately. What changes need to be made to update the connector?
