-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: (CDK) (Connector Builder) - refactor the MessageGrouper
> TestRead
#332
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughThis pull request refactors the code for test reading and message grouping. The main changes include updating dataclass import statements and replacing the Changes
Sequence Diagram(s)sequenceDiagram
participant TS as Test Suite
participant TR as TestReader
participant DS as DeclarativeSource
participant UH as Utility Helpers
TS->>TR: run_test_read(source, config, catalog, state, limit)
TR->>DS: _read_stream(...)
DS-->>TR: Iterator of AirbyteMessages
TR->>UH: Invoke helper functions (_categorise_groups, schema inference, etc.)
UH-->>TR: Processed message groups
TR-->>TS: Returns StreamRead object with slices, log messages, and auxiliary requests
Possibly related PRs
Suggested labels
Suggested reviewers
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (2)
airbyte_cdk/connector_builder/test_reader/helpers.py (1)
556-582
: 🛠️ Refactor suggestionPotential optimization for frequent record fields.
_handle_record_message
accumulates both schema and datetime formats on every record. If a stream is large, we might slow performance. Any interest in offering a caching approach or a sampling strategy (e.g., only inferring every N records)? Wdyt?unit_tests/sources/declarative/decoders/test_decoders_memory_usage.py (1)
19-31
: 🛠️ Refactor suggestionConsider consolidating duplicate fixtures, wdyt?
This fixture is identical to the one in
test_json_decoder.py
. Consider moving it to a shared test utilities module to avoid duplication:# test_utils.py @pytest.fixture(name="large_events_response") def large_event_response_fixture(): # ... shared implementation ...Then import and use it in both test files.
🧹 Nitpick comments (9)
airbyte_cdk/connector_builder/test_reader/message_grouper.py (2)
40-82
: Docstring clarity check.This docstring is very thorough, but would you like to clarify in the docstring how partial slices/pages are handled if the loop ends early (e.g., due to record limit)? It might help future readers. Wdyt?
118-124
: Consolidate duplicate slice parsing logic?We call
_parse_slice_description(message.log.message)
at lines 119 and 124. Would you consider extracting that repeated call into a small helper to avoid duplication? Wdyt?- yield _processed_slice() - current_slice_descriptor = _parse_slice_description(message.log.message) # type: ignore + yield _processed_slice() + current_slice_descriptor = _parse_current_slice_descriptor(message)airbyte_cdk/connector_builder/test_reader/reader.py (4)
71-80
: Constructor param naming.The constructor uses
_max_pages_per_slice
,_max_slices
,_max_record_limit
. Any interest in clarifying them as well in the docstring or inline comment? For example, clarifying if they apply to all streams, or only a single stream read. Wdyt?
301-336
: Typo in function name.The function
_get_infered_schema
has a small spelling mismatch ("infered" vs. "inferred"). Would you consider renaming it to_get_inferred_schema
for clarity? Wdyt?-def _get_infered_schema( +def _get_inferred_schema(
239-300
: Add test coverage for unknown message group types.In
_categorise_groups
, an unknown message group type raises aValueError
. Would you consider adding a unit test to confirm that scenario is handled gracefully and the error is raised as expected? Wdyt?
410-442
: Validate multiple limits more explicitly.
_has_reached_limit
checks slices, pages, and record counts. Is it worth adding a short docstring note or logging line to clarify which limit was triggered if the function returns True? For large codebases, that might simplify debugging. Wdyt?airbyte_cdk/connector_builder/test_reader/helpers.py (2)
174-200
: Consider merging_close_page
and_close_current_page
.Currently,
_close_page
and_close_current_page
compose similar operations. Would you like to merge them into a single function that closes a page and returns reset objects, avoiding duplication? Wdyt?
474-521
: Refine approach for unknown log messages.In
_handle_log_message
, only HTTP-based logs are processed in detail, while everything else is appended as a raw log. Would you consider capturing more metadata for non-HTTP logs (e.g., partial parsing) to aid debugging? Wdyt?unit_tests/sources/declarative/decoders/test_json_decoder.py (1)
51-63
: Consider extracting magic numbers into constants, wdyt?The fixture uses magic numbers for the file size. Consider extracting these into named constants for better maintainability:
+LINES_IN_RESPONSE = 2_000_000 # ≈ 58 MB of response +TEST_EMAIL = "[email protected]" @pytest.mark.slow @pytest.fixture(name="large_events_response") def large_event_response_fixture(): - data = {"email": "[email protected]"} - jsonl_string = f"{json.dumps(data)}\n" - lines_in_response = 2_000_000 # ≈ 58 MB of response + data = {"email": TEST_EMAIL} + jsonl_string = f"{json.dumps(data)}\n"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
airbyte_cdk/connector_builder/connector_builder_handler.py
(3 hunks)airbyte_cdk/connector_builder/message_grouper.py
(0 hunks)airbyte_cdk/connector_builder/test_reader/__init__.py
(1 hunks)airbyte_cdk/connector_builder/test_reader/helpers.py
(1 hunks)airbyte_cdk/connector_builder/test_reader/message_grouper.py
(1 hunks)airbyte_cdk/connector_builder/test_reader/reader.py
(1 hunks)airbyte_cdk/connector_builder/test_reader/types.py
(1 hunks)unit_tests/connector_builder/test_connector_builder_handler.py
(2 hunks)unit_tests/connector_builder/test_message_grouper.py
(32 hunks)unit_tests/sources/declarative/decoders/test_decoders_memory_usage.py
(1 hunks)unit_tests/sources/declarative/decoders/test_json_decoder.py
(1 hunks)
💤 Files with no reviewable changes (1)
- airbyte_cdk/connector_builder/message_grouper.py
✅ Files skipped from review due to trivial changes (1)
- airbyte_cdk/connector_builder/test_reader/init.py
🔇 Additional comments (8)
airbyte_cdk/connector_builder/test_reader/message_grouper.py (2)
106-116
: Consider verifying the page's partial content.Inside the
while records_count < limit
loop, we close the page unconditionally once_need_to_close_page
is true. Would you like to add an extra check to handle a scenario where the page might have zero records? Or is that an intentional design choice? Wdyt?
157-166
: Ensure final slice is yielded if no records are processed.In the
else
block, we only finalize the page and slice ifcurrent_page_request or current_page_response or current_page_records
is truthy. Could a slice with zero records and no HTTP logs be missed? Would you like to confirm that scenario? Wdyt?airbyte_cdk/connector_builder/test_reader/helpers.py (1)
263-283
: Graceful error handling for slice JSON parse.
_parse_slice_description
could potentially fail if the log message is malformed or missing the prefix. Do you want to catch JSON errors here to yield a more explanatory log message or fallback? Wdyt?airbyte_cdk/connector_builder/test_reader/types.py (1)
1-76
: Great job on the type definitions and documentation!The type aliases are well-organized, comprehensive, and thoroughly documented. The docstrings provide clear explanations of each type's purpose and structure.
airbyte_cdk/connector_builder/connector_builder_handler.py (1)
6-9
: Clean refactoring of imports and TestReader integration!The changes improve code organization through better import structure and the transition to TestReader is clean and well-implemented.
Also applies to: 77-86
unit_tests/connector_builder/test_message_grouper.py (2)
19-20
: LGTM! The imports look good.The imports are correctly updated to use the new
TestReader
class and its helper functions.
148-148
: LGTM! The test cases are correctly updated.The test cases are properly updated to use
TestReader
instead ofMessageGrouper
, and the method calls are changed fromget_message_groups
torun_test_read
. The test coverage is maintained.Also applies to: 206-207, 290-292, 343-343, 446-449, 504-506, 617-619, 661-663, 689-691, 708-709, 733-734, 763-764, 791-792, 808-809, 822-823, 848-850, 877-879
unit_tests/connector_builder/test_connector_builder_handler.py (1)
553-553
: LGTM! The mock patch paths are correctly updated.The mock patch paths are properly updated to use
TestReader.run_test_read
instead ofMessageGrouper.get_message_groups
. The test coverage is maintained.Also applies to: 1172-1172
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (8)
airbyte_cdk/connector_builder/test_reader/helpers.py (5)
65-83
: Consider removing nested keys with__
prefixes as well?Currently,
clean_config
only deletes top-level keys that start with__
. Would you like to remove nested keys with__
prefixes too, ensuring thorough cleanup of internal or meta-data fields? wdyt?
86-128
: Default to an empty dictionary for headers?
create_request_from_log_message
setsheaders
to whateverrequest.get("headers")
provides. For safety, especially if headers can be absent, would you consider defaulting it to an empty dict to avoidNoneType
issues? wdyt?
151-173
: Log malformed JSON errors for visibility?
parse_json
silently returnsNone
onJSONDecodeError
. Would you consider logging a debug/info message to indicate that JSON parsing failed (for easier troubleshooting)? wdyt?
198-228
: Fix docstring reference to_is_page_http_request
?The docstring in
should_close_page
references_is_page_http_request
, but the actual function isis_page_http_request
. Want to update it for clarity? wdyt?
434-464
: Return a more descriptive structure than(None, None)
?
handle_current_page
returns(None, None)
to clear references. Would you consider returning a small named container (e.g., a tuple with fields) or an explicit indicator to improve readability? wdyt?airbyte_cdk/connector_builder/test_reader/message_grouper.py (1)
133-146
: Ensure final slice is yielded on manual break.In the
else:
block after thewhile
, we only finalize pages if the loop finishes naturally. If we decide to break early in the future, we might skip yielding. Is that acceptable or do we need a fallback ensuring the final slice is yielded? wdyt?unit_tests/connector_builder/test_message_grouper.py (2)
206-207
: Refactored usage ofTestReader
in test functions.All these line changes consistently replace
MessageGrouper
usage withTestReader
calls. This aligns with the new code structure.
Would you like to renameconnector_builder_handler
totest_reader_handler
in tests for clarity? wdyt?Also applies to: 290-292, 314-314, 343-344, 422-423, 458-459, 504-507
647-671
: General test flow and structure.The series of tests confirm boundary conditions (maximum slices, pages, no slices, error streams, multiple control messages, etc.). Everything appears logically consistent.
Would you like to unify repeated test scaffolding (e.g., source set up) with a fixture or helper? wdyt?Also applies to: 673-699, 701-719, 722-772, 774-800, 802-817, 819-888
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
airbyte_cdk/connector_builder/test_reader/helpers.py
(1 hunks)airbyte_cdk/connector_builder/test_reader/message_grouper.py
(1 hunks)airbyte_cdk/connector_builder/test_reader/reader.py
(1 hunks)unit_tests/connector_builder/test_message_grouper.py
(32 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/connector_builder/test_reader/reader.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (4)
airbyte_cdk/connector_builder/test_reader/message_grouper.py (1)
78-78
: Confirm processing logic forlimit == 0
.The loop condition is
while records_count < limit and (message := next(messages, None)):
. Iflimit
is 0, the loop never runs. Is that intended behavior, or should we handle that differently? wdyt?unit_tests/connector_builder/test_message_grouper.py (3)
19-20
: Imports look correct.The new import references
TestReader
fromairbyte_cdk.connector_builder.test_reader
. This aligns with the recent refactor. All good here!
148-219
: Use@patch
consistently for the test?The test at lines 148-219 correctly patches
AirbyteEntrypoint.read
. All logic appears valid. Perhaps confirm if we need additional patches or mocks for external network calls? wdyt?
592-645
: Spot check large test coverage.The test from lines 592-645 adds multiple slices verifying correct grouping. The scenario thoroughly tests partial slices, multiple slices, and the final state message. Looks solid!
What
Resolving:
How
The pull request refactors the
airbyte_cdk/connector_builder/message_grouper.py
module by replacing theMessageGrouper
class with the newTestReader
class. Key changes include:Imports and Class Replacements:
airbyte_cdk/connector_builder/test_reader
dir holds theTestRead
and related logic implementation:reader
holds the logic of how we actually make a test read.message_grouper
holds the logic of how we group the emitted source messages.helpers
holds a bunch of functions to re-use inreader
andmessage_grouper
types
holds the specific type-collection used to mark the output from certain methods used inreader
andmessage_grouper
dataclass
and field from thedataclasses
.Unit Tests Updates:
MessageGrouper
toTestReader
.New Files Added:
test_reader
, including:__init__.py
,helpers.py
,message_grouper.py
,reader.py
, andtypes.py
.Test Markers:
Added
@pytest.mark.slow
to some test cases to indicate they may take longer to run, when running locally.These changes aim to improve the code's readability and maintainability by introducing the TestReader class, which has enhanced functionality and updated testing procedures.
User Impact
No impact is expected; the logic and behavior remain the same.
Local Testing using this guide:
Passed:
Summary by CodeRabbit
New Features
TestReader
for performing test reads and validating configurations.Refactor
Tests