Skip to content

Conversation

brianjlai
Copy link
Contributor

@brianjlai brianjlai commented Oct 14, 2025

What

An optimization that we've been looking into for some of our certified connectors that require multiple requests to fetch the entire set of properties for each record. Some examples are:

  • source-linkedin-ads - Analytics streams have a static list of almost 100 fields that requires multiple requests to get the complete record.
  • source-hubspot - Various CRM search and other streams have a dynamic list of fields to fetch and often will require multiple requests to get each complete record.

The issue is that if a customer were to only select a few few fields when configuring their catalog, we still make the extra requests to get the complete record even if many of those fields are ultimately not emitted in the response. This can lead to a lot of wasted requests.

This change filters out schema properties that were de-selected by users which should allow for fewer API requests for customers who do not need to sync every column of a stream.

How

There's two pieces to focus on here:

  1. Loading the configured catalog
  • We now take the incoming configured_catalog during a read and we load that into the ModelToComponentFactory and structure it as a mapping from stream names to configured streams.
  • When building each stream's simple retriever, we pass in the configured stream. And we now have access to the json_schema with the list of selected fields
  • One edge case to consider is that for CHECK/DISCOVER operations, we don't have a catalog so we have to assume that there are three types of property sets:
    • None for check/read : Fetch all properties which is the default case.
    • ['selected_1, selected_2]: Fetch only selected values
    • []: all values de-selected which should not fetch any properties
  • That is the main edge case here that we treat None and [] as different behaviors.
  1. Handling a stream catalog whose json_schema was transformed
  • This was something unique to Hubspot (though generalizable to any connector that transforms the json schema) because if the json_schema transforms some of the fields, the schema we receive during a read will not match what we need to request.
  • Our attempt to work around that in a very limited scope is to allow the connector developer to specify a new component JsonSchemaPropertySelector that takes the arbitrary json_schema that comes in the catalog. We then treat this like a single record which allows us to reuse the existing record transformation logic. For Hubspot we need to rename all properties fields to remove the properties_ prefix. For other sources, we can reuse this logic to remove unneeded fields like id, created_at etc.

Ruled out approaches

  • Make a breaking change to hubspot so that we don't transform the json schema. This introduces a lot of overhead and the other problem is that if we don't prefix all the customer properties with properties_, we might end up with a collision at the top level which would be very bad
  • Allowing for a declarative component that points to where in the json_schema the list of properties to fetch are located. This would've worked for Hubspot since all properties are under the top level properties object. However, the current platform does not allow column selection for nested objects, only at the top level

Summary by CodeRabbit

  • New Features

    • Property selection: filter and transform which stream properties are requested based on configured stream schemas.
    • Per-stream configuration propagation so selectors use the configured catalog during requests.
    • Property chunking now respects configured property sets for optimized request batching.
  • Tests

    • Expanded unit tests covering property selectors, transformations, chunking, and configured-catalog scenarios.

@github-actions
Copy link

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@brian/property_chunking_only_fetch_enabled_fields#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch brian/property_chunking_only_fetch_enabled_fields

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

📝 Edit this welcome message.

@github-actions github-actions bot added the enhancement New feature or request label Oct 14, 2025
@github-actions
Copy link

github-actions bot commented Oct 14, 2025

PyTest Results (Fast)

3 813 tests  +11   3 801 ✅ +11   6m 16s ⏱️ -27s
    1 suites ± 0      12 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit afc67fd. ± Comparison against base commit 20ae208.

This pull request removes 1 and adds 12 tests. Note that renamed tests count towards both.
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_property_chunks_static_list_with_chunking
unit_tests.sources.declarative.parsers.test_model_to_component_factory ‑ test_create_component_with_configured_catalog
unit_tests.sources.declarative.requesters.query_properties.property_selector.test_json_schema_property_selector ‑ test_select_properties[configured_stream_is_none]
unit_tests.sources.declarative.requesters.query_properties.property_selector.test_json_schema_property_selector ‑ test_select_properties[configured_stream_no_properties_key_in_json_schema]
unit_tests.sources.declarative.requesters.query_properties.property_selector.test_json_schema_property_selector ‑ test_select_properties[test_select_properties_with_transformations]
unit_tests.sources.declarative.requesters.query_properties.test_property_chunking ‑ test_get_request_property_chunks_empty_configured_properties
unit_tests.sources.declarative.requesters.query_properties.test_property_chunking ‑ test_get_request_property_chunks_none_configured_properties
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_no_property_chunking_always_include_properties
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_no_property_chunking_selected_properties_always_include_properties
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_property_chunks_static_list_with_chunking_property_selection
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_property_chunks_with_configured_catalog_dynamic_endpoint
…

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Oct 14, 2025

PyTest Results (Full)

3 816 tests   3 804 ✅  11m 14s ⏱️
    1 suites     12 💤
    1 files        0 ❌

Results for commit afc67fd.

♻️ This comment has been updated with latest results.

@brianjlai brianjlai marked this pull request as ready for review October 20, 2025 03:31
@brianjlai brianjlai requested a review from maxi297 October 20, 2025 03:31
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 20, 2025

📝 Walkthrough

Walkthrough

Adds a JsonSchemaPropertySelector and integrates per-stream configured catalog propagation through ModelToComponentFactory; QueryProperties and PropertyChunking are updated to use selectors to filter requested properties. Tests and factory wiring updated to support selector creation and transformations.

Changes

Cohort / File(s) Summary
Schema & Models
airbyte_cdk/sources/declarative/declarative_component_schema.yaml, airbyte_cdk/sources/declarative/models/declarative_component_schema.py
Adds JsonSchemaPropertySelector schema and model; adds optional property_selector field to QueryProperties.
Factory & Source Wiring
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py, airbyte_cdk/sources/declarative/concurrent_declarative_source.py
Threads configured_catalog into ModelToComponentFactory, builds stream-name → configured-stream map, and adds handler to instantiate JsonSchemaPropertySelector; propagates stream_name when creating query properties.
Property Selector Abstractions
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/property_selector.py, airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py, airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/__init__.py
Introduces PropertySelector abstract class and concrete JsonSchemaPropertySelector which reads configured stream schema, applies transformations, and returns selected property names.
Property Chunking & QueryProperties
airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py, airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py
PropertyChunking.get_request_property_chunks gains configured_properties param and skips fields not in that set; QueryProperties accepts property_selector and uses it to compute configured_properties before chunking or static handling.
Retriever Minor
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py
Stylistic trailing-comma change when calling get_request_property_chunks.
Tests
unit_tests/connector_builder/test_property_chunking.py, unit_tests/sources/declarative/parsers/test_model_to_component_factory.py, unit_tests/sources/declarative/requesters/query_properties/property_selector/*, unit_tests/sources/declarative/requesters/query_properties/test_property_chunking.py, unit_tests/sources/declarative/requesters/query_properties/test_query_properties.py, unit_tests/sources/declarative/retrievers/test_simple_retriever.py
Adds/updates tests covering configured-catalog mapping, JsonSchemaPropertySelector behavior (including transformations), chunking with configured_properties, and integration scenarios for static/dynamic endpoints.

Sequence Diagram(s)

sequenceDiagram
    participant Source as ConcurrentDeclarativeSource
    participant Factory as ModelToComponentFactory
    participant QueryProps as QueryProperties
    participant Selector as JsonSchemaPropertySelector
    participant Transform as Transformation
    participant Chunking as PropertyChunking

    Note over Source,Factory: Factory is constructed with configured_catalog
    Source->>Factory: instantiate(configured_catalog)
    Factory->>Factory: build stream_name -> configured_stream map

    Factory->>QueryProps: create_query_properties(model, stream_name)
    QueryProps->>Factory: request property_selector creation
    Factory->>Selector: create_json_schema_property_selector(model, stream_name)

    Selector->>Selector: load configured_stream.json_schema.properties
    alt transformations present
        Selector->>Transform: apply each transformation
        Transform-->>Selector: transformed properties
    end
    Selector-->>QueryProps: property_selector ready

    QueryProps->>Selector: select()
    Selector-->>QueryProps: configured_properties (Set[str] or None)

    QueryProps->>Chunking: get_request_property_chunks(property_fields, always_include, configured_properties)
    Chunking-->>QueryProps: property chunks
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • maxi297
  • darynaishchenko
  • pnilan

Would you like me to add a short checklist of things to verify during review (e.g., stream mapping edge cases, transformation ordering, and performance of chunking with selectors)? wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title accurately reflects the primary business objective of the changeset: implementing property selection to only fetch properties included in the configured catalog's stream schema. While the title could be more technically precise about the mechanism (introducing JsonSchemaPropertySelector and property selection infrastructure through ModelToComponentFactory), it clearly communicates the main optimization goal. A teammate scanning the repository history would understand that this PR addresses property-fetching optimization. However, the phrasing "Add configured catalog to SimpleRetriever" is slightly imprecise—the catalog is threaded through ModelToComponentFactory rather than directly modified into SimpleRetriever (which receives minimal changes).
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch brian/property_chunking_only_fetch_enabled_fields

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (1)

48-53: Respect configured property selection when not chunking

When property_limit is None we still return the raw property_fields, so any configured_properties passed in are ignored. Streams that rely on the selector without chunking will continue to request every field, which defeats the point of this optimization. Could we filter the iterable before building the single chunk so we only send the selected set (plus any always-include fields) in this branch as well? wdyt?

-        single_property_chunk = list(property_fields)
+        filtered_fields = [
+            field
+            for field in property_fields
+            if configured_properties is None or field in configured_properties
+        ]
+        single_property_chunk = filtered_fields
🧹 Nitpick comments (4)
unit_tests/sources/declarative/requesters/query_properties/property_selector/test_json_schema_property_selector.py (1)

41-94: Consider verifying transformation application more explicitly?

The test comprehensively covers the three catalog states (transformations, None, empty). However, in the first test case (lines 44-54), we're validating the final output but not explicitly checking that the transformations were applied. For instance, we could verify that "properties_" was indeed removed from "properties_statistics" → "statistics".

This might help catch bugs where transformations are silently failing but the test passes due to coincidental matching. Wdyt?

airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (1)

60-63: Consider simplifying the yield statement?

The logic correctly filters fields when configured_properties is not None. However, line 61 uses yield from [[field for field in fields if field in configured_properties]], which is functionally equivalent to yield [field for field in fields if field in configured_properties].

The yield from with a single-element list is a bit unusual. Would direct yield be clearer? E.g.:

-            if configured_properties is not None:
-                yield from [[field for field in fields if field in configured_properties]]
-            else:
-                yield list(fields)
+            if configured_properties is not None:
+                yield [field for field in fields if field in configured_properties]
+            else:
+                yield list(fields)

This maintains the same behavior while being slightly more idiomatic. Wdyt?

unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1)

4566-4571: Consider testing the property_selector's select() method behavior?

The test validates that the property_selector is correctly instantiated with the expected transformation. However, it doesn't call select() to verify that the transformation is actually applied to the configured stream's properties. Would it be valuable to add an assertion that calls property_selector.select() and verifies the returned property names are transformed as expected (e.g., "properties_field" → "field")? wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

679-679: Thread configured_catalog through factory - well done!

The implementation correctly threads the configured_catalog through the factory and builds the stream name mapping. The static method is clean and efficient.

One consideration: if the configured_catalog contains multiple streams with the same name, the dictionary comprehension (line 817) would silently keep only the last one. While this is probably prevented by platform constraints, would it be worth adding a validation check to fail fast if duplicate stream names are detected? Something like:

if configured_catalog:
    stream_names = [stream.stream.name for stream in configured_catalog.streams]
    if len(stream_names) != len(set(stream_names)):
        raise ValueError("Configured catalog contains duplicate stream names")
    return {stream.stream.name: stream for stream in configured_catalog.streams}

This would make debugging easier if the platform constraint is ever violated. Wdyt?

Also applies to: 690-692, 812-820

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 041c201 and 901a313.

📒 Files selected for processing (17)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1 hunks)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (4 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (12 hunks)
  • airbyte_cdk/sources/declarative/requesters/query_properties/json_schema_property_selector.py (1 hunks)
  • airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (3 hunks)
  • airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/__init__.py (1 hunks)
  • airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1 hunks)
  • airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/property_selector.py (1 hunks)
  • airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (3 hunks)
  • airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1 hunks)
  • unit_tests/connector_builder/test_property_chunking.py (1 hunks)
  • unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (7 hunks)
  • unit_tests/sources/declarative/requesters/query_properties/property_selector/__init__.py (1 hunks)
  • unit_tests/sources/declarative/requesters/query_properties/property_selector/test_json_schema_property_selector.py (1 hunks)
  • unit_tests/sources/declarative/requesters/query_properties/test_property_chunking.py (3 hunks)
  • unit_tests/sources/declarative/requesters/query_properties/test_query_properties.py (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-12-11T16:34:46.319Z
Learnt from: pnilan
PR: airbytehq/airbyte-python-cdk#0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
🧬 Code graph analysis (12)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/__init__.py (2)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1)
  • JsonSchemaPropertySelector (16-52)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/property_selector.py (1)
  • PropertySelector (9-24)
unit_tests/sources/declarative/requesters/query_properties/test_property_chunking.py (4)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2)
  • PropertyChunking (1040-1057)
  • PropertyLimitType (1035-1037)
airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (3)
  • PropertyChunking (25-76)
  • PropertyLimitType (14-21)
  • get_request_property_chunks (42-73)
airbyte_cdk/sources/declarative/requesters/query_properties/strategies/group_by_key.py (1)
  • GroupByKey (13-33)
airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (1)
  • get_request_property_chunks (33-63)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/property_selector.py (2)
airbyte_cdk/sources/declarative/requesters/query_properties/json_schema_property_selector.py (1)
  • select (26-38)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1)
  • select (30-52)
unit_tests/sources/declarative/requesters/query_properties/test_query_properties.py (5)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1)
  • JsonSchemaPropertySelector (16-52)
unit_tests/sources/declarative/requesters/query_properties/property_selector/test_json_schema_property_selector.py (1)
  • _create_configured_airbyte_stream (24-38)
airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (2)
  • get_request_property_chunks (33-63)
  • QueryProperties (18-63)
airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (2)
  • get_request_property_chunks (42-73)
  • PropertyLimitType (14-21)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (2)
  • PropertiesFromEndpoint (14-40)
  • get_properties_from_endpoint (31-40)
unit_tests/sources/declarative/requesters/query_properties/property_selector/test_json_schema_property_selector.py (5)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2)
  • JsonSchemaPropertySelector (2034-2054)
  • RemoveFields (1079-1096)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (2)
  • JsonSchemaPropertySelector (16-52)
  • select (30-52)
airbyte_cdk/sources/declarative/transformations/transformation.py (1)
  • RecordTransformation (13-37)
airbyte_cdk/sources/declarative/transformations/keys_replace_transformation.py (1)
  • KeysReplaceTransformation (14-61)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/property_selector.py (1)
  • select (16-24)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (3)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/property_selector.py (2)
  • PropertySelector (9-24)
  • select (16-24)
airbyte_cdk/sources/declarative/transformations/transformation.py (1)
  • RecordTransformation (13-37)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (17)
  • Config (136-137)
  • Config (150-151)
  • Config (164-165)
  • Config (178-179)
  • Config (192-193)
  • Config (206-207)
  • Config (220-221)
  • Config (234-235)
  • Config (248-249)
  • Config (262-263)
  • Config (276-277)
  • Config (292-293)
  • Config (306-307)
  • Config (320-321)
  • Config (354-355)
  • Config (378-379)
  • JsonSchemaPropertySelector (2034-2054)
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2)
unit_tests/connector_builder/test_connector_builder_handler.py (1)
  • configured_catalog (468-471)
unit_tests/sources/test_source.py (1)
  • catalog (70-93)
airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (4)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (1)
  • PropertiesFromEndpoint (14-40)
airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (2)
  • PropertyChunking (25-76)
  • get_request_property_chunks (42-73)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/property_selector.py (2)
  • PropertySelector (9-24)
  • select (16-24)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1)
  • select (30-52)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (6)
airbyte_cdk/sources/declarative/requesters/query_properties/json_schema_property_selector.py (1)
  • JsonSchemaPropertySelector (13-38)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1)
  • JsonSchemaPropertySelector (16-52)
airbyte_cdk/sources/declarative/transformations/add_fields.py (1)
  • AddFields (37-148)
airbyte_cdk/sources/declarative/transformations/remove_fields.py (1)
  • RemoveFields (17-75)
airbyte_cdk/sources/declarative/transformations/flatten_fields.py (1)
  • FlattenFields (13-52)
airbyte_cdk/sources/declarative/transformations/dpath_flatten_fields.py (1)
  • DpathFlattenFields (30-100)
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (3)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
  • JsonSchemaPropertySelector (2034-2054)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1)
  • JsonSchemaPropertySelector (16-52)
airbyte_cdk/sources/declarative/transformations/keys_replace_transformation.py (1)
  • KeysReplaceTransformation (14-61)
airbyte_cdk/sources/declarative/requesters/query_properties/json_schema_property_selector.py (3)
airbyte_cdk/sources/declarative/transformations/transformation.py (1)
  • RecordTransformation (13-37)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (17)
  • Config (136-137)
  • Config (150-151)
  • Config (164-165)
  • Config (178-179)
  • Config (192-193)
  • Config (206-207)
  • Config (220-221)
  • Config (234-235)
  • Config (248-249)
  • Config (262-263)
  • Config (276-277)
  • Config (292-293)
  • Config (306-307)
  • Config (320-321)
  • Config (354-355)
  • Config (378-379)
  • JsonSchemaPropertySelector (2034-2054)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (2)
  • JsonSchemaPropertySelector (16-52)
  • select (30-52)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (18)
  • JsonSchemaPropertySelector (2034-2054)
  • Config (136-137)
  • Config (150-151)
  • Config (164-165)
  • Config (178-179)
  • Config (192-193)
  • Config (206-207)
  • Config (220-221)
  • Config (234-235)
  • Config (248-249)
  • Config (262-263)
  • Config (276-277)
  • Config (292-293)
  • Config (306-307)
  • Config (320-321)
  • Config (354-355)
  • Config (378-379)
  • QueryProperties (2810-2832)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1)
  • JsonSchemaPropertySelector (16-52)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Check: source-shopify
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (16)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1-2: Note: Auto-generated file

Based on learnings, this file is auto-generated from declarative_component_schema.yaml and should be ignored in the review order. The schema changes look consistent with the manual implementations in the PR.

unit_tests/sources/declarative/requesters/query_properties/property_selector/test_json_schema_property_selector.py (1)

24-38: LGTM! Clean helper function

The _create_configured_airbyte_stream helper provides a nice abstraction for test setup. The dynamic property generation using a dictionary comprehension is elegant.

unit_tests/sources/declarative/requesters/query_properties/test_property_chunking.py (3)

75-90: LGTM! Correct parameter handling

The addition of configured_properties = set(property_fields) before converting property_fields to an iterator is the right approach. This ensures all properties are configured by default in these tests, maintaining backward compatibility.


98-119: Good coverage of empty configured_properties

This test validates the important edge case where configured_properties is an empty set, ensuring only always_include_properties are included in chunks. This behavior aligns with the intent that disabled properties should be filtered out.


121-144: LGTM! Validates the None vs empty set distinction

This test correctly validates that configured_properties=None results in no filtering (all properties are chunked), which differs from an empty set. This distinction is important for non-READ operations like CHECK and DISCOVER where there's no configured catalog.

airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/property_selector.py (1)

8-24: LGTM! Clean abstract interface

The PropertySelector ABC follows Python best practices with clear documentation. The Optional[Set[str]] return type correctly captures the three states: None (no filtering), empty set (no properties), or a populated set (selected properties).

airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (2)

29-29: LGTM! Property selector integration

The property_selector field is correctly typed and positioned. Its optional nature allows backward compatibility for manifests that don't specify column selection.


45-45: LGTM! Correct configured_properties derivation

The logic correctly derives configured_properties by calling property_selector.select() when available, otherwise defaulting to None (no filtering).

unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1)

246-266: LGTM - Clean test for configured catalog mapping!

This test nicely validates that the ModelToComponentFactory correctly maps stream names to their configured streams when initialized with a configured_catalog. The setup is clear and the assertion directly checks the expected behavior.

unit_tests/sources/declarative/requesters/query_properties/test_query_properties.py (2)

31-46: LGTM - Nice helper function!

This helper cleanly creates a ConfiguredAirbyteStream for testing purposes. The pattern of generating properties from a list is elegant and reusable.


48-93: Verify test expectations align with property_selector behavior

In this test, the property list includes "remove_me" (line 66), and the configured stream also includes it (line 50). The RemoveFields transformation should remove it from the configured properties, which would then filter it out during chunking.

The expected result shows 2 chunks without "remove_me" (lines 90-92), which makes sense. However, could you verify this test actually passes? I have concerns about the JsonSchemaPropertySelector.select() implementation (see my comments on that file) that might cause this filtering to not work as expected.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5)

29-29: LGTM! Imports are clean and necessary.

The new imports for ConfiguredAirbyteStream, ConfiguredAirbyteCatalog, JsonSchemaPropertySelectorModel, and JsonSchemaPropertySelector are all used appropriately throughout the file to support the property selection feature.

Also applies to: 46-46, 319-321, 509-511


749-749: LGTM! Model-to-constructor mapping updated correctly.

The JsonSchemaPropertySelectorModel is properly wired to its constructor method, following the established pattern in the factory.


3047-3069: LGTM! JsonSchemaPropertySelector factory method is well-implemented.

The new create_json_schema_property_selector method follows the established factory pattern and correctly:

  • Retrieves the configured_stream from the mapping (handling None gracefully)
  • Creates transformations by recursively processing transformation models
  • Passes all necessary parameters to the JsonSchemaPropertySelector constructor

The handling of configured_stream being None is intentional and correct for CHECK/DISCOVER operations or when a stream isn't in the configured catalog, as documented in the relevant code snippets.


3294-3294: LGTM! All call sites correctly pass stream_name.

All invocations of create_query_properties within this file have been updated to pass the required stream_name parameter. The name variable is available in the method context and represents the stream name appropriately.

Also applies to: 3320-3320, 3326-3326


3013-3045: Implementation is correct—all call sites properly updated.

The changes look solid. You've correctly threaded stream_name through to the property_selector creation (line 3032 in your diff) where it's needed, while appropriately omitting it from property_list and property_chunking which don't require stream context.

I verified that:

  • Direct call sites within this file are updated (lines 3318-3322 and 3324-3327 both pass stream_name)
  • QueryPropertiesModel only appears in this file—no external usages found
  • The nested dispatch for property_selector correctly passes stream_name, enabling stream-aware component behavior

Since this is an internal factory method and all callers are accounted for, the breaking change concern doesn't apply here. The implementation is ready.

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall makes sense. Just need a bit more context

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (8)
airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (4)

42-44: Docstring param is stale (configured_stream).

The method no longer accepts configured_stream; the doc should reflect the property_selector-based flow. Can we update the docstring accordingly, wdyt?

Apply this minimal fix:

-        :param configured_stream: The customer configured stream being synced which is needed to identify which
-        record fields to query for and emit.
+        The configured stream context is provided via `property_selector` when present.

45-45: Avoid recomputing and side effects: cache selected properties.

select() may be non‑trivial (and can mutate internal state in current impl). Can we memoize once per instance to keep results stable across slices, wdyt?

Apply this change:

-        configured_properties = self.property_selector.select() if self.property_selector else None
+        if self.property_selector:
+            if not hasattr(self, "_configured_properties_cache"):
+                self._configured_properties_cache = self.property_selector.select()
+            configured_properties = self._configured_properties_cache
+        else:
+            configured_properties = None

60-65: Simplify redundant conditional.

Inner if configured_properties is not None is redundant inside the same guard. Can we trim it for clarity, wdyt?

-            if configured_properties is not None:
-                all_fields = (
-                    [field for field in fields if field in configured_properties]
-                    if configured_properties is not None
-                    else list(fields)
-                )
+            if configured_properties is not None:
+                all_fields = [field for field in fields if field in configured_properties]
             else:
                 all_fields = list(fields)

6-6: Optional: remove unused import.

ConfiguredAirbyteStream isn’t referenced here. Shall we drop it to keep imports tight, wdyt?

unit_tests/sources/declarative/requesters/query_properties/test_query_properties.py (2)

152-189: Add a test for “empty configured properties.”

To lock in semantics, could we add a case where the configured catalog has zero properties selected and assert the produced chunks (empty vs none) and downstream behavior, wdyt?


191-236: Edge case: no property_limit ordering.

When property_limit is unset, PropertyChunking currently appends always_include_properties at the end of the chunk, while the no‑chunking path prepends it. Could we add a test to pin desired ordering and align implementations later, wdyt?

airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (2)

25-25: Minor: default_factory can be list directly.

Using default_factory=list is slightly cleaner than a lambda. Want to switch, wdyt?

-    properties_transformations: List[RecordTransformation] = field(default_factory=lambda: [])
+    properties_transformations: List[RecordTransformation] = field(default_factory=list)

27-29: Optional: _parameters is unused.

We store _parameters but don’t reference it. Should we drop it or add a TODO explaining upcoming use, wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b872d6 and 19f4cf7.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1 hunks)
  • airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (3 hunks)
  • unit_tests/sources/declarative/requesters/query_properties/test_query_properties.py (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (5)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (1)
  • PropertiesFromEndpoint (14-40)
airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (2)
  • PropertyChunking (25-76)
  • get_request_property_chunks (42-73)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/property_selector.py (2)
  • PropertySelector (9-24)
  • select (16-24)
airbyte_cdk/sources/types.py (1)
  • StreamSlice (75-169)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1)
  • select (30-52)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (3)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/property_selector.py (2)
  • PropertySelector (9-24)
  • select (16-24)
airbyte_cdk/sources/declarative/transformations/transformation.py (1)
  • RecordTransformation (13-37)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (17)
  • Config (136-137)
  • Config (150-151)
  • Config (164-165)
  • Config (178-179)
  • Config (192-193)
  • Config (206-207)
  • Config (220-221)
  • Config (234-235)
  • Config (248-249)
  • Config (262-263)
  • Config (276-277)
  • Config (292-293)
  • Config (306-307)
  • Config (320-321)
  • Config (354-355)
  • Config (378-379)
  • JsonSchemaPropertySelector (2034-2054)
unit_tests/sources/declarative/requesters/query_properties/test_query_properties.py (5)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1)
  • JsonSchemaPropertySelector (16-52)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (5)
  • JsonSchemaPropertySelector (2034-2054)
  • RemoveFields (1079-1096)
  • QueryProperties (2810-2832)
  • PropertyLimitType (1035-1037)
  • PropertiesFromEndpoint (2796-2807)
airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (2)
  • get_request_property_chunks (33-72)
  • QueryProperties (18-72)
airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (2)
  • get_request_property_chunks (42-73)
  • PropertyLimitType (14-21)
airbyte_cdk/sources/declarative/requesters/query_properties/properties_from_endpoint.py (2)
  • PropertiesFromEndpoint (14-40)
  • get_properties_from_endpoint (31-40)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (1)

53-58: Confirm “empty selection” semantics.

If configured_properties is an empty set, chunking yields a single empty chunk (unless always_include_properties). Is that intended (vs. yielding no chunks / no request)? Could this produce a request with an empty fields param in some retrievers? Should we short‑circuit to yield nothing when both configured_properties and always_include_properties are empty, wdyt?

unit_tests/sources/declarative/requesters/query_properties/test_query_properties.py (2)

31-46: Helper looks good.

Factory for ConfiguredAirbyteStream is clear and reusable. Nice.


48-93: Great coverage for static list + selector + chunking.

Assertions match the new selection behavior; this is valuable.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1)

27-28: Is self._parameters needed?

I notice that parameters is stored in self._parameters but never referenced elsewhere in the class. If it's not required for any base class protocol or future use, we could simplify by removing this line, wdyt?

-    def __post_init__(self, parameters: Mapping[str, Any]) -> None:
-        self._parameters = parameters
+    def __post_init__(self, parameters: Mapping[str, Any]) -> None:
+        pass
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 19f4cf7 and afc67fd.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (3)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/property_selector.py (2)
  • PropertySelector (9-24)
  • select (16-24)
airbyte_cdk/sources/declarative/transformations/transformation.py (1)
  • RecordTransformation (13-37)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (17)
  • Config (136-137)
  • Config (150-151)
  • Config (164-165)
  • Config (178-179)
  • Config (192-193)
  • Config (206-207)
  • Config (220-221)
  • Config (234-235)
  • Config (248-249)
  • Config (262-263)
  • Config (276-277)
  • Config (292-293)
  • Config (306-307)
  • Config (320-321)
  • Config (354-355)
  • Config (378-379)
  • JsonSchemaPropertySelector (2034-2054)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/requesters/query_properties/property_selector/json_schema_property_selector.py (1)

45-54: Nice work on the deep copy!

The use of copy.deepcopy() correctly addresses the mutation concern from the previous review. The transformation logic looks solid—applying transformations to the copied schema and returning the property keys maintains the integrity of the original configured catalog.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants