-
Notifications
You must be signed in to change notification settings - Fork 30
feat: cache properties from endpoint #808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. Testing This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@maxi297/cache_properties_from_endpoint#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch maxi297/cache_properties_from_endpointHelpful ResourcesPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
📝 WalkthroughWalkthroughPropertiesFromEndpoint now caches computed property names and returns them as a List[str]; query-property APIs were tightened to accept List[str] and removed stream_slice parameters; call sites and tests updated to match new signatures and verify caching and type coercion. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant PropertiesFromEndpoint
participant Retriever
Caller->>PropertiesFromEndpoint: get_properties_from_endpoint()
alt cached not set
PropertiesFromEndpoint->>Retriever: read_records(stream_slice = {"partition": {}, "cursor_slice": {}})
Retriever-->>PropertiesFromEndpoint: iterable records
rect rgb(220,240,220)
PropertiesFromEndpoint->>PropertiesFromEndpoint: for each record -> _get_property -> build List[str]
PropertiesFromEndpoint->>PropertiesFromEndpoint: store List[str] in _cached_properties
end
PropertiesFromEndpoint-->>Caller: return List[str]
else cached
PropertiesFromEndpoint-->>Caller: return cached List[str]
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Should cached properties ever be invalidated (e.g., if endpoint values can change per-run or per-slice), or is it acceptable to assume endpoint properties remain stable for the connector's lifetime — wdyt? Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
unit_tests/sources/declarative/requesters/query_properties/test_properties_from_endpoint.py (1)
1-1: Don't forget the formatting fix!The pipeline is reporting formatting issues for this file too. Could you run
ruff formatto fix them, wdyt?#!/bin/bash cd unit_tests/sources/declarative/requesters/query_properties ruff format test_properties_from_endpoint.py
🧹 Nitpick comments (1)
unit_tests/sources/declarative/requesters/query_properties/test_property_chunking.py (1)
108-128: Test logic is solid—just needs a formatting touch-up.The test is excellent for validating the non-mutation guarantee. However, verification confirms the pipeline was right: the file is missing a final newline at the end.
Could you run
ruff format unit_tests/sources/declarative/requesters/query_properties/test_property_chunking.pyto add it, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py(2 hunks)airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py(1 hunks)airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py(1 hunks)unit_tests/sources/declarative/requesters/query_properties/test_properties_from_endpoint.py(3 hunks)unit_tests/sources/declarative/requesters/query_properties/test_property_chunking.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
unit_tests/sources/declarative/requesters/query_properties/test_property_chunking.py (1)
airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (3)
PropertyChunking(25-71)PropertyLimitType(14-21)get_request_property_chunks(42-68)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (3)
airbyte_cdk/sources/declarative/interpolation/interpolated_string.py (1)
InterpolatedString(13-79)airbyte_cdk/sources/types.py (1)
StreamSlice(75-169)airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)
read_records(512-554)
unit_tests/sources/declarative/requesters/query_properties/test_properties_from_endpoint.py (3)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (2)
get_properties_from_endpoint(34-37)PropertiesFromEndpoint(15-44)airbyte_cdk/sources/types.py (5)
StreamSlice(75-169)cursor_slice(107-112)partition(99-104)Record(21-72)data(35-36)airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)
read_records(512-554)
🪛 GitHub Actions: Linters
unit_tests/sources/declarative/requesters/query_properties/test_property_chunking.py
[error] 1-1: Ruff format check failed. 3 files would be reformatted. Run 'ruff format' to fix code style issues in this file.
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py
[error] 1-1: Ruff format check failed. 3 files would be reformatted. Run 'ruff format' to fix code style issues in this file.
unit_tests/sources/declarative/requesters/query_properties/test_properties_from_endpoint.py
[error] 1-1: Ruff format check failed. 3 files would be reformatted. Run 'ruff format' to fix code style issues in this file.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-shopify
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Manifest Server Docker Image Build
🔇 Additional comments (6)
airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (1)
42-44: LGTM! Type tightening aligns with implementation.The change from
Iterable[str]toList[str]matches the actual usage pattern—the function converts to a list immediately on line 46 anyway. This makes the contract more explicit.airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (1)
37-41: LGTM! Type annotation correctly tightened.The local
fieldsvariable type now accurately reflects thatget_properties_from_endpointreturnsList[str], making the typing more precise.airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (1)
39-44: Nice helper method for property extraction!The
_get_propertymethod cleanly handles path evaluation and type coercion to string. The logic for handling both string andInterpolatedStringnodes looks solid.unit_tests/sources/declarative/requesters/query_properties/test_properties_from_endpoint.py (3)
47-50: LGTM! Test updated for new return type.Removing the
list()wrapper is correct sinceget_properties_from_endpointnow returnsList[str]directly.
136-156: Good caching test, but consider testing differentstream_slicevalues?This test verifies that the retriever is called only once, which is great! However, it uses the same
stream_slicefor all three calls. Given my concern about the cache not accounting for different slices (see my comment onproperties_from_endpoint.py), it might be valuable to add a test that uses differentstream_slicevalues to verify the expected behavior, wdyt?
158-177: Excellent test for type coercion!This test ensures that integer property values are correctly converted to strings, which is important for the consistent
List[str]return type. Nice edge case coverage!
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py
Outdated
Show resolved
Hide resolved
| def get_properties_from_endpoint(self, stream_slice: Optional[StreamSlice]) -> List[str]: | ||
| if self._cached_properties is None: | ||
| self._cached_properties = list(map(self._get_property, self.retriever.read_records(records_schema={}, stream_slice=stream_slice))) # type: ignore # extracted will be a MutableMapping, given input data structure | ||
| return self._cached_properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Critical: Cache doesn't account for stream_slice parameter.
The caching logic assumes all calls return the same properties, but stream_slice is passed to retriever.read_records. If different stream_slice values produce different properties, subsequent calls with different slices will incorrectly return the first cached result.
Could you verify whether stream_slice affects the properties returned? If it does, the cache should be keyed by the slice. If not, perhaps the parameter could be removed or documented as unused, wdyt?
Run this script to check if any retriever implementations use stream_slice when reading property records:
🏁 Script executed:
#!/bin/bash
# Search for retriever implementations that might use stream_slice in read_records
# Look for read_records method implementations and usage patterns
ast-grep --pattern $'def read_records($$$, stream_slice$$$):
$$$'
# Also check if PropertiesFromEndpoint is used in contexts where stream_slice varies
rg -n -A5 -B5 'PropertiesFromEndpoint' --type=pyLength of output: 42887
Now let me search for test cases and actual usage to confirm whether stream_slice variations would cause issues:
Based on my verification, the original review comment raises a valid and critical concern. Here's what I found:
Stream slices typically return a list of slices each with a different hourly timestamp to be used when creating requests, and the stream slice description (the input dict) is passed as input to each stream's read_stream method so the connector can use that to make queries as needed. The codebase demonstrates this through its factory comments that explicitly state the properties endpoint is called for every slice.
The factory code's comment—"properties endpoint will be called for every slice"—directly contradicts the current implementation. While there is HTTP-level caching on the retriever, the _cached_properties field on PropertiesFromEndpoint is keyed only by initialization, not by stream_slice. This means if the endpoint's response depends on the stream_slice parameter (which is clearly possible given it's passed as a parameter), you'll get incorrect results after the first slice.
Cache doesn't account for stream_slice parameter.
The _cached_properties field caches properties globally on the PropertiesFromEndpoint instance without considering the stream_slice parameter. Since stream_slice is passed to retriever.read_records() and can vary across multiple stream partitions (different dates, IDs, etc.), different slices may return different properties—but the cache will return the first result regardless.
If the properties endpoint accepts stream_slice as a filter (e.g., per-date properties), this will cause incorrect data to be used for all slices after the first one. Consider whether the cache should be keyed by stream_slice, or if the parameter is unused and could be removed or documented, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brianjlai this comment make sense to me but I'm also confused as to why the stream slice would be useful in extracting properties. Do you have more context on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment only makes sense if we allow for the stream_slice or whatever to be a part of the interpolation context for the API request we make to fetch the properties. It doesn't really feel like we have a good use case for needing stream slice to be in the interpolation context because schemas are 1:1 with the stream and therefore global. I can't think of an api that would have different properties or schemas for different date ranges or parent records.
We would also lose a bit of optimization if we keyed by stream slice too since all date ranges would have to make at least 1 HTTP request to cache versus using the global properties on the initial request.
We could just:
- Not include the
stream_slicecontext in the HttpRequester component since we control what we paste into the - Raise an error if the properties from endpoint attempts to use the stream slice context
I feel like we won't run into this issue very often if ever since stream slice doesn't make sense to perform schema discovery. but this extra work would just be to protect against it. I'm not worried though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok! So I think the step forward is to remove the stream_slice from the properties from endpoint stuff. I'll look at it a bit more then, stay tuned!
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py
Show resolved
Hide resolved
|
/autofix
|
| if self._cached_properties is None: | ||
| self._cached_properties = list( | ||
| map( | ||
| self._get_property, # type: ignore # SimpleRetriever and AsyncRetriever only returns Record. Should we change the return type of Retriever.read_records? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous # type: ignore was catching both the return type of Retriever.read_records and the Mapping being actually a MutableMapping
| ] | ||
| yield dpath.get(property_obj, path, default=[]) # type: ignore # extracted will be a MutableMapping, given input data structure | ||
| def get_properties_from_endpoint(self, stream_slice: Optional[StreamSlice]) -> List[str]: | ||
| if self._cached_properties is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concern to be resolved as if the cache should be per StreamSlice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py(1 hunks)airbyte_cdk/sources/declarative/retrievers/simple_retriever.py(0 hunks)
💤 Files with no reviewable changes (1)
- airbyte_cdk/sources/declarative/retrievers/simple_retriever.py
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (3)
airbyte_cdk/sources/declarative/interpolation/interpolated_string.py (1)
InterpolatedString(13-79)airbyte_cdk/sources/types.py (1)
StreamSlice(75-169)airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)
read_records(512-553)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-shopify
- GitHub Check: Check: destination-motherduck
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Manifest Server Docker Image Build
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py
Show resolved
Hide resolved
PyTest Results (Full)3 808 tests 3 796 ✅ 11m 15s ⏱️ Results for commit d22170d. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a few notes, but I think this change makes sense and nothing to block. Let me know how you feel about the notes I mentioned.
| def get_properties_from_endpoint(self, stream_slice: Optional[StreamSlice]) -> List[str]: | ||
| if self._cached_properties is None: | ||
| self._cached_properties = list(map(self._get_property, self.retriever.read_records(records_schema={}, stream_slice=stream_slice))) # type: ignore # extracted will be a MutableMapping, given input data structure | ||
| return self._cached_properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment only makes sense if we allow for the stream_slice or whatever to be a part of the interpolation context for the API request we make to fetch the properties. It doesn't really feel like we have a good use case for needing stream slice to be in the interpolation context because schemas are 1:1 with the stream and therefore global. I can't think of an api that would have different properties or schemas for different date ranges or parent records.
We would also lose a bit of optimization if we keyed by stream slice too since all date ranges would have to make at least 1 HTTP request to cache versus using the global properties on the initial request.
We could just:
- Not include the
stream_slicecontext in the HttpRequester component since we control what we paste into the - Raise an error if the properties from endpoint attempts to use the stream slice context
I feel like we won't run into this issue very often if ever since stream slice doesn't make sense to perform schema discovery. but this extra work would just be to protect against it. I'm not worried though
| for node in self._property_field_path | ||
| ] | ||
| yield dpath.get(property_obj, path, default=[]) # type: ignore # extracted will be a MutableMapping, given input data structure | ||
| def get_properties_from_endpoint(self, stream_slice: Optional[StreamSlice]) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one risk that this does pose because we cache is if we have a crazy large number of properties and we run out of memory. That was why I had originally made this use an iterable so that we yield properties as they come in.
Granted it feels pretty necessary in order to allow caching, but it does also help make the argument for only maintaining a global set of property mapping. Otherwise if we had a large set of properties and we had to store it for every slice that might be more concerning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other risk I've thought of is if someone removes a property during the sync. Then, it might fail because we ask for a property that HubSpot doesn't know about. That being said, I assume this happens rarely and if it's too much of a pain, we can have a refresh on properties maybe later...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed the change. I know declarative_component_schema.yaml reference the SimpleRetriever component which allows for interpolation on StreamSlices but I don't have an easy way to remove that. Do you think we should improve the description of PropertiesFromEndpoint to clarify this or we don't think people would do anything in the context of stream slice for properties retrieval?
|
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (1)
43-48: Missing property path returns literal string"[]"instead of empty string.When
dpath.getdoesn't find the property path, it returns the default value[](empty list), whichstr()then converts to the literal string"[]". This means properties missing from the endpoint response will appear as the string"[]"in the property list rather than an empty string or being filtered out.Is this the intended behavior, or should missing properties yield an empty string (using
default="") or be skipped entirely, wdyt?If you'd prefer empty strings for missing properties, apply this diff:
- return str(dpath.get(property_obj, path, default=[])) # type: ignore # extracted will be a MutableMapping, given input data structure + result = dpath.get(property_obj, path, default="") + return str(result) if result else "" # type: ignore # extracted will be a MutableMapping, given input data structureAlternatively, if missing properties should be skipped, you could filter them out in
get_properties_from_endpointinstead.
🧹 Nitpick comments (1)
unit_tests/sources/declarative/requesters/query_properties/test_query_properties.py (1)
88-90: Mock return type should match the actual implementation.The mock returns an iterator (
iter([...])) butget_properties_from_endpointnow returns a concreteList[str]. While Python's duck typing makes this work in practice, the mock should match the actual type for accuracy and to catch potential type-related issues, wdyt?Apply this diff to align the mock with the actual return type:
- properties_from_endpoint_mock.get_properties_from_endpoint.return_value = iter( - ["alice", "clover", "dio", "k", "luna", "phi", "quark", "sigma", "tenmyouji"] - ) + properties_from_endpoint_mock.get_properties_from_endpoint.return_value = [ + "alice", "clover", "dio", "k", "luna", "phi", "quark", "sigma", "tenmyouji" + ]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py(1 hunks)airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py(1 hunks)airbyte_cdk/sources/declarative/retrievers/simple_retriever.py(1 hunks)unit_tests/sources/declarative/requesters/query_properties/test_properties_from_endpoint.py(3 hunks)unit_tests/sources/declarative/requesters/query_properties/test_query_properties.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- unit_tests/sources/declarative/requesters/query_properties/test_properties_from_endpoint.py
- airbyte_cdk/sources/declarative/retrievers/simple_retriever.py
🧰 Additional context used
🧬 Code graph analysis (3)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (3)
airbyte_cdk/sources/declarative/interpolation/interpolated_string.py (1)
InterpolatedString(13-79)airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)
read_records(510-551)airbyte_cdk/sources/types.py (3)
StreamSlice(75-169)partition(99-104)cursor_slice(107-112)
unit_tests/sources/declarative/requesters/query_properties/test_query_properties.py (2)
airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (1)
get_request_property_chunks(28-46)airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (1)
get_request_property_chunks(42-68)
airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (3)
airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (1)
get_request_property_chunks(42-68)airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (2)
PropertiesFromEndpoint(14-48)get_properties_from_endpoint(33-41)airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
PropertiesFromEndpoint(2771-2782)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: source-shopify
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
- GitHub Check: Manifest Server Docker Image Build
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (Fast)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (2)
25-26: LGTM! Cache field appropriately initialized.The
_cached_propertiesfield is properly typed asOptional[List[str]]and initialized toNone, enabling the lazy-load pattern implemented below.
33-41: LGTM! Caching implementation is sound.The caching logic correctly checks for
Noneon first call, populates the cache viamap, and returns the cached result on subsequent calls. This ensures properties are fetched once per instance, as intended.
airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (1)
10-10: Minor: Consider removing unusedStreamSliceimport?After removing the
stream_sliceparameter fromget_properties_from_endpoint, theStreamSliceimport on line 10 appears to be unused. Would you like to clean it up to keep imports minimal, wdyt?#!/bin/bash # Verify if StreamSlice is used anywhere in the file rg -n "StreamSlice" airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py | grep -v "^10:"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml(1 hunks)airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (2)
airbyte_cdk/sources/declarative/interpolation/interpolated_string.py (1)
InterpolatedString(13-79)airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)
read_records(512-553)
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
3328-3328: LGTM! Clear documentation of the stream_slice limitation.The note about stream_slices not being interpolatable from this retriever is a helpful clarification that aligns with the API changes in
properties_from_endpoint.py. This will prevent users from attempting to use stream_slice context where it's not supported.airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (3)
25-26: Good addition of the cache field.The
_cached_propertiesfield withOptional[List[str]]type hint andNoneinitialization is a clean way to implement lazy caching. The private naming convention is appropriate since this is an internal implementation detail.
33-41: Clean caching implementation for property retrieval.The caching logic is straightforward and effective:
- On first call, retrieves records and maps them through
_get_property- Subsequent calls return the cached list
- Using
stream_slice=Nonealigns with the discussions about properties being global to the streamOne question: Line 37's
# type: ignorecomment mentions that the return type ofRetriever.read_recordsmight need updating. Is this something that should be addressed at the interface level, or is the type: ignore acceptable here, wdyt?
43-48: Clarify the intended behavior for missing property fields.The tests all cover records with complete data—none test scenarios where the property field path is missing from a record. When
dpath.getdoesn't find the path, it returns[](the default), whichstr()converts to the literal string"[]". This means incomplete records would contribute"[]"to the cached properties list.The gap in test coverage makes it unclear whether this behavior is intentional. Should records with missing property fields contribute empty strings (
"") instead, or is the"[]"behavior by design? Could you verify this and add a test case for missing properties to clarify the expected behavior, wdyt?
What
https://github.com/airbytehq/airbyte-internal-issues/issues/14929
How
Adding a cache within PropertiesFromEndpoint. Note that with this solution, every instance of PropertiesFromEndpoint will have a different cache so it may be that the same stream as a child/main stream vs as a parent stream have different properties from endpoint if a field is added between the read of those streams. I don't see a case where this would be problematic though.
Summary by CodeRabbit
Performance Improvements
Bug Fixes
Improvements
Tests
Documentation