-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[connector builder] BUG: Incremental substream state is appended rather than replaced (and start_time
is not updated)
#33854
Comments
start_time
is not updated
start_time
is not updatedstart_time
is not updated)
grooming notes:
|
We've had a lot of discussion around how to best tackle this because of how it impacts the Python CDK and the low-code CDK interface methods. Specifically the Temporarily moving this out of the current sprint, but the plan will be to pick this back up once a few other pieces of work get wrapped up. I'm hoping we can find a way to introduce this change w/o the interface change to fix the issue at hand even if the ideal fix can come later or in a follow up |
@brianjlai +1 for this issue. Ran into it for the |
end of sprint update: PR is in progress. should be wrapped up soon. |
Problem
Topic
Connector builder
Relevant information
I built a connector using the low-code connector framework (attached - uploaded as .txt since Github won't accept .yml).
I tried to create an incremental substream. My substream is
https://api.alchemer.com/v5/survey/<survey_id>/surveyresponse
, where I iterate through the surveys to pull the survey responses for each survey. Each response has asubmitted_at
field. The API allows filtering based on this field. So, for each survey, I would like to pull only the responses that were submitted since the last sync (and I will iterate through all surveys on each sync).However, the connector deployed from the connector builder
.yml
attached extracts and loads all survey responses for each survey on every sync.The
state
below is the value after running a daily sync for several days.It appears that Airbyte is not updating the state, but rather appending the state each sync for the substream, and thus keeps using the user-supplied
start_time
in the API request, rather than setting thestart_time
to be that of the last record from the previous sync.alchemer_testing_copy.txt
Implementation Notes
Note: Alchemer's api is only accessible to enterprise customers.
The working hypothesis for why this is happening is as follows:
id
and slice it belongs toparent_slice
.parent_id
andparent_slice
). The end date of the previous state and the current running sync will never match up and this causes the sync to end up doing a full refresh for each parent record and leads to us appending a new state entry instead of overwriting because the partition keys don't matchOptions Discussed:
Option 1: Make
SubstreamPartitionRouter
emit partition and cursor values in separate fieldsstream_slices()
. It returns these back to PerPartitionCursor as an array of objects for each record where each object has aparent_id
set to the record ID and aparent_slice
set to the partition/cursor value.Option 2: SubstreamPartitionRouter should not emit
parent_slice
if parent stream is incrementalWe should spike this proactively to verify that an incremental substream works for a parent stream:
Acceptance Criteria
SubstreamPartitionRouter
level, due to the complexity, we should also at minimum also test this at thePerPartitionRouter
that uses aSubstreamPartitionRouter
source-greenhouse
which supports an incremental parent stream. But it currently uses custom components, so the test manifest might need to be rewritten unfortunatelyThe text was updated successfully, but these errors were encountered: