Skip to content

Conversation

@aaronsteers
Copy link
Contributor

@aaronsteers aaronsteers commented Oct 21, 2025

feat: Add metadata models package with dynamic schema download

Summary

This PR adds a new airbyte_cdk.metadata_models package containing auto-generated Pydantic models for validating connector metadata.yaml files. The models are generated from JSON Schema YAML files maintained in the airbytehq/airbyte repository.

Key changes:

  • Added airbyte_cdk/metadata_models/ package with generated Pydantic models (~3400 lines)
  • Extended bin/generate_component_manifest_files.py to download schemas from GitHub on-demand during build
  • Models use pydantic.v1 compatibility layer for consistency with existing declarative component models
  • Includes comprehensive README with usage examples
  • Added py.typed marker for type hint compliance

Usage example:

from airbyte_cdk.metadata_models import ConnectorMetadataDefinitionV0
import yaml

metadata = ConnectorMetadataDefinitionV0(**yaml.safe_load(metadata_yaml))

Review & Testing Checklist for Human

This is a yellow risk PR - the implementation is straightforward but has network dependencies and generated code that need verification:

  • Test end-to-end validation: Try importing and validating a real connector's metadata.yaml file to ensure the generated models work correctly
  • Verify build in CI: Run poetry run poe build in CI to ensure it works in that environment (may hit GitHub rate limits)
  • Consider GitHub authentication: The current implementation makes unauthenticated requests to GitHub API, which hit rate limits during development. Consider adding authentication or caching mechanism for production use

Notes

Requested by: AJ Steers ([email protected]) / @aaronsteers
Devin session: https://app.devin.ai/sessions/9b487ed33f5842c087a1de30d33db888

Known limitations:

  • GitHub rate limiting: Build process downloads schemas from GitHub without authentication, which can hit rate limits if run frequently (we encountered this during testing). Consider adding caching or GitHub token authentication for CI.
  • Network dependency: Build requires network access to download schemas. No fallback mechanism if GitHub is unavailable.
  • Generated code testing: The generated models have only been tested with a basic import. Comprehensive validation testing with real metadata.yaml files is recommended.

Why this approach:

  • Leverages existing datamodel-code-generator infrastructure used for declarative component models
  • Keeps schemas in airbytehq/airbyte repo as the single source of truth
  • No need for separate package management or CI/CD workflows
  • Models automatically stay in sync with schema updates during builds

Summary by CodeRabbit

  • New Features

    • Connector metadata models added for validation and testing, including a consolidated machine-readable metadata schema.
    • Model generation workflow enhanced to produce a single metadata models file for easier consumption.
  • Documentation

    • Added guide describing connector metadata models, usage examples, validation, and regeneration workflow.

- Add airbyte_cdk.metadata_models package with auto-generated Pydantic models
- Models are generated from JSON schemas in airbytehq/airbyte repository
- Schemas are downloaded on-demand during build process (no submodules)
- Uses pydantic.v1 compatibility layer for consistency with declarative models
- Includes comprehensive README with usage examples
- Adds py.typed marker for type hint compliance

The metadata models enable validation of connector metadata.yaml files
using the same schemas maintained in the main Airbyte repository.

Co-Authored-By: AJ Steers <[email protected]>
@Copilot Copilot AI review requested due to automatic review settings October 21, 2025 00:39
@devin-ai-integration
Copy link
Contributor

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin Can you research to see if anyone has started or worked on a json schema file or other validation mechanism (pydantic?) for metadata.yaml files?
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1760999875625349?thread_ts=1760999875.625349

Quote of conversation (https://airbytehq-team.slack.com/archives/C02U9R3AF37/p1760999798826039?thread_ts=1760978308.303779&cid=C02U9R3AF37):
> From AJ Steers
> What would give me more confidence on the above personally would be to have a pydantic model and/or JSON schema we could use to validate metadata.yaml files. Then we could assert via static analysis of all files that the new config is "valid" across hundreds of connectors very quickly.
> Posted on October 20, 2025 at 10:36 PM

@devin-ai-integration
Copy link
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new airbyte_cdk.metadata_models package containing auto-generated Pydantic models for validating connector metadata.yaml files. The models are dynamically generated during build by downloading JSON schemas from the airbytehq/airbyte repository.

Key changes:

  • New metadata_models package with auto-generated Pydantic validation models
  • Build script extension to download schemas from GitHub during build process
  • Comprehensive README with usage examples and documentation

Reviewed Changes

Copilot reviewed 3 out of 35 changed files in this pull request and generated no comments.

File Description
airbyte_cdk/metadata_models/py.typed Marker file for PEP 561 type hint compliance
airbyte_cdk/metadata_models/init.py Package initialization re-exporting generated models
airbyte_cdk/metadata_models/README.md Documentation covering usage, available models, and regeneration process

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@github-actions github-actions bot added the enhancement New feature or request label Oct 21, 2025
@github-actions
Copy link

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1760999875-add-metadata-models#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1760999875-add-metadata-models

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

📝 Edit this welcome message.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 21, 2025

📝 Walkthrough

Walkthrough

Adds infrastructure to download external metadata YAML schemas and generate Python models alongside existing declarative component model generation; refactors the main orchestration to run a two-phase flow and introduces new helper functions, constants, and a connector_metadata test package.

Changes

Cohort / File(s) Change Summary
Metadata schema download & codegen orchestration
bin/generate_component_manifest_files.py
Added imports (json, os, tempfile, httpx, yaml). Introduced constants METADATA_SCHEMAS_GITHUB_URL, METADATA_SCHEMAS_RAW_URL_BASE, LOCAL_METADATA_OUTPUT_DIR_PATH. Added get_all_yaml_files_from_dir(), download_metadata_schemas(), generate_models_from_schemas(), post_process_metadata_models(), consolidate_yaml_schemas_to_json(), generate_metadata_models_single_file(). Updated generate_init_module_content() signature to accept yaml_files: list[str]. Rewrote async main() to orchestrate declarative component generation and remote metadata schema download + generation; removed get_all_yaml_files_without_ext.
Connector metadata test package
airbyte_cdk/test/models/connector_metadata/__init__.py, airbyte_cdk/test/models/connector_metadata/py.typed, airbyte_cdk/test/models/connector_metadata/README.md
New package: __init__.py re-exports ConnectorMetadataDefinitionV0 and ConnectorTestSuiteOptions; added py.typed marker; added README describing schema origin, usage, validation, regeneration steps, and outputs.
Test models parent exports
airbyte_cdk/test/models/__init__.py
Imported and re-exported ConnectorMetadataDefinitionV0 and ConnectorTestSuiteOptions in package __all__.
Formatting-only edits
airbyte_cdk/sources/declarative/models/declarative_component_schema.py
Whitespace/line-wrapping reflow of existing Field definitions and schemas; no semantic changes.

Sequence Diagram(s)

sequenceDiagram
    participant Main
    participant Phase1 as Declarative Components
    participant Phase2 as Metadata Models
    participant GitHub as GitHub Schemas
    participant Codegen as Datamodel-Codegen

    Main->>Phase1: get_all_yaml_files_from_dir()
    Phase1->>Codegen: generate models from YAML list
    Codegen-->>Phase1: generated models
    Phase1->>Phase1: post_process_codegen
    Phase1-->>Main: phase1 complete

    Main->>Phase2: download_metadata_schemas(temp_dir)
    Phase2->>GitHub: fetch YAMLs (raw URLs, token if present)
    GitHub-->>Phase2: YAML files
    Phase2->>Phase2: write to temp_dir
    Phase2-->>Main: return yaml_stems

    Main->>Phase2: generate_models_from_schemas(..., metadata_models=True)
    Phase2->>Codegen: generate metadata single-file models
    Codegen-->>Phase2: generated models
    Phase2->>Phase2: post_process_metadata_models
    Phase2-->>Main: phase2 complete

    Main->>Main: consolidate_yaml_schemas_to_json() & finish
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Would you agree this estimate looks right?

Possibly related PRs

Suggested reviewers

  • maxi297
  • pnilan
  • aldogonzalez8

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 54.55% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "feat: Add metadata models package with dynamic schema download" directly and clearly reflects the core changes in this pull request. The PR introduces a new airbyte_cdk.metadata_models package with auto-generated Pydantic models for connector metadata validation, and significantly extends bin/generate_component_manifest_files.py with new functions to download YAML schemas from GitHub during the build process. The title captures both the primary deliverable (metadata models package) and the key implementation mechanism (dynamic schema download), making it specific enough for a teammate to understand the main change when scanning commit history without being overly verbose.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch devin/1760999875-add-metadata-models

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 62902f6 and 933d478.

⛔ Files ignored due to path filters (1)
  • airbyte_cdk/test/models/connector_metadata/generated/__init__.py is excluded by !**/generated/**
📒 Files selected for processing (4)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (13 hunks)
  • airbyte_cdk/test/models/connector_metadata/README.md (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/__init__.py (1 hunks)
  • bin/generate_component_manifest_files.py (3 hunks)
✅ Files skipped from review due to trivial changes (2)
  • airbyte_cdk/test/models/connector_metadata/init.py
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
airbyte_cdk/test/models/connector_metadata/README.md

8-8: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build
🔇 Additional comments (4)
bin/generate_component_manifest_files.py (4)

166-182: LGTM!

The metadata models post-processing function is clean and focused. It appropriately uses the pydantic.v1 compatibility layer to align with existing declarative component models, as mentioned in the PR objectives.


276-320: LGTM!

The single-file generation approach is clean and aligns with the PR objectives. Generating all metadata models into one file simplifies imports and is appropriate given the consolidated nature of the schemas.


33-41: LGTM!

The refactoring to introduce get_all_yaml_files_from_dir as a parameterized version while keeping the original function for existing code paths is a clean approach. The updated generate_init_module_content signature makes the dependency on yaml_files explicit, which improves testability.


46-52: No changes needed—the code follows GitHub's current recommended authentication format.

GitHub's current documentation recommends Authorization: Bearer <YOUR_TOKEN>, which is exactly what the code uses at line 52. The older token prefix is still supported but considered the legacy convention. The implementation is correct as written.

Likely an incorrect or invalid review comment.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (4)
airbyte_cdk/metadata_models/__init__.py (1)

1-1: Consider adding an explicit __all__ for API clarity, wdyt?

While wildcard imports are acceptable for package __init__.py files, defining an explicit __all__ list would make the public API surface more obvious and give you finer control over what gets exported. This is especially useful for generated code where you might want to filter out internal types or helpers.

Example approach:

+__all__ = [
+    "ConnectorMetadataDefinitionV0",
+    # Add other public models as needed
+]
+
 from .generated import *

Alternatively, you could generate the __all__ list in generated.py itself during the build process.

airbyte_cdk/metadata_models/README.md (1)

37-47: Second example assumes undefined metadata_dict, wdyt?

The second usage example references metadata_dict on line 40, but this variable isn't defined in that code block. While readers might understand it comes from the first example, making this example standalone (or explicitly noting it continues from the first) would improve clarity.

Consider either:

  1. Making it self-contained:
from airbyte_cdk.metadata_models import ConnectorMetadataDefinitionV0
import yaml

metadata_dict = yaml.safe_load(Path("path/to/metadata.yaml").read_text())
metadata = ConnectorMetadataDefinitionV0(**metadata_dict)

# Access fields with full type safety
print(f"Connector: {metadata.data.name}")
# ...
  1. Or adding a comment linking to the first example:
# Continuing from the validation example above...
from airbyte_cdk.metadata_models import ConnectorMetadataDefinitionV0

metadata = ConnectorMetadataDefinitionV0(**metadata_dict)
# ...
bin/generate_component_manifest_files.py (2)

26-27: Consider consolidating the YAML file discovery functions, wdyt?

Both get_all_yaml_files_without_ext() and get_all_yaml_files_from_dir() perform the same operation, with the new one being more parameterized. The old function is still used on line 237 for declarative models.

You could simplify by using the new parameterized version everywhere:

-def get_all_yaml_files_without_ext() -> list[str]:
-    return [Path(f).stem for f in glob(f"{LOCAL_YAML_DIR_PATH}/*.yaml")]
-
-
 def get_all_yaml_files_from_dir(directory: str) -> list[str]:
     return [Path(f).stem for f in glob(f"{directory}/*.yaml")]

Then update line 237:

-        declarative_yaml_files = get_all_yaml_files_without_ext()
+        declarative_yaml_files = get_all_yaml_files_from_dir(LOCAL_YAML_DIR_PATH)

Also applies to: 30-32


160-179: Consider extracting common post-processing logic, wdyt?

There's code duplication between post_process_metadata_models and post_process_codegen (lines 136-157). Both functions:

  • Create /generated_post_processed directory
  • Iterate through .py files
  • Replace pydantic imports
  • Write to new files

You could extract the common pattern:

async def apply_post_processing(
    codegen_container: dagger.Container,
    transform_fn: callable[[str], str]
) -> dagger.Container:
    """Apply a transformation function to all generated Python files."""
    codegen_container = codegen_container.with_exec(
        ["mkdir", "/generated_post_processed"], use_entrypoint=True
    )
    for generated_file in await codegen_container.directory("/generated").entries():
        if generated_file.endswith(".py"):
            original_content = await codegen_container.file(
                f"/generated/{generated_file}"
            ).contents()
            
            post_processed_content = transform_fn(original_content)
            
            codegen_container = codegen_container.with_new_file(
                f"/generated_post_processed/{generated_file}", 
                contents=post_processed_content
            )
    return codegen_container

async def post_process_metadata_models(codegen_container: dagger.Container):
    """Post-process metadata models to use pydantic.v1 compatibility layer."""
    def transform(content: str) -> str:
        return content.replace("from pydantic", "from pydantic.v1")
    return await apply_post_processing(codegen_container, transform)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b747fe and ff01ea0.

⛔ Files ignored due to path filters (31)
  • airbyte_cdk/metadata_models/generated/ActorDefinitionResourceRequirements.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/AirbyteInternal.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/AllowedHosts.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorBreakingChanges.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorBuildOptions.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorIPCOptions.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorMetadataDefinitionV0.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorMetrics.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorPackageInfo.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorRegistryDestinationDefinition.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorRegistryReleases.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorRegistrySourceDefinition.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorRegistryV0.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorReleases.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorTestSuiteOptions.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/GeneratedFields.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/GitInfo.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/JobType.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/NormalizationDestinationDefinitionConfig.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/RegistryOverrides.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ReleaseStage.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/RemoteRegistries.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ResourceRequirements.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/RolloutConfiguration.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/Secret.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/SecretStore.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/SourceFileInfo.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/SuggestedStreams.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/SupportLevel.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/TestConnections.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/__init__.py is excluded by !**/generated/**
📒 Files selected for processing (4)
  • airbyte_cdk/metadata_models/README.md (1 hunks)
  • airbyte_cdk/metadata_models/__init__.py (1 hunks)
  • airbyte_cdk/metadata_models/py.typed (1 hunks)
  • bin/generate_component_manifest_files.py (3 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
airbyte_cdk/metadata_models/README.md

8-8: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (3)
airbyte_cdk/metadata_models/py.typed (1)

1-1: Looks good! Quick packaging question though.

The py.typed marker is correctly placed and will signal to type checkers that the package is fully typed. Just to confirm—have you verified that py.typed will be included in the package distribution? This is sometimes missed if the build configuration (pyproject.toml, setup.py, or MANIFEST.in) doesn't explicitly include package marker files. Wdyt?

bin/generate_component_manifest_files.py (2)

5-5: LGTM on the new imports!

The addition of tempfile for temporary schema storage and httpx for async HTTP requests are appropriate choices for this functionality.

Also applies to: 11-11


224-231: Clarify mutual exclusivity of post_process and metadata_models flags, wdyt?

The conditional logic here suggests post_process and metadata_models should be mutually exclusive, but the function signature allows both to be True. If both are True, the elif on line 227 means metadata_models takes precedence and post_process is silently ignored.

Consider either:

  1. Adding validation to ensure only one is True:
 ) -> None:
     """Generate Pydantic models from YAML schemas using datamodel-codegen."""
+    if post_process and metadata_models:
+        raise ValueError("post_process and metadata_models cannot both be True")
+    
     init_module_content = generate_init_module_content(yaml_files)
  1. Or simplifying to a single enum parameter:
async def generate_models_from_schemas(
    dagger_client: dagger.Client,
    yaml_dir_path: str,
    output_dir_path: str,
    yaml_files: list[str],
    processing_mode: Literal["none", "declarative", "metadata"] = "none",
) -> None:

This would make the intent clearer and prevent misuse.

Comment on lines +8 to +10
```
airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src/
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add language specifier to fenced code block, wdyt?

The fenced code block showing the repository path should have a language specifier for proper syntax highlighting and markdown lint compliance.

Apply this diff:

-```
+```text
 airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src/

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

8-8: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

In airbyte_cdk/metadata_models/README.md around lines 8 to 10, the fenced code
block lacks a language specifier; update the opening fence from totext
so the block becomes a plain-text fenced code block (i.e., replace the existing
fence with ```text and keep the content and closing fence unchanged).


</details>

<!-- This is an auto-generated comment by CodeRabbit -->

Comment on lines 17 to 19
METADATA_SCHEMAS_GITHUB_URL = "https://api.github.com/repos/airbytehq/airbyte/contents/airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src"
METADATA_SCHEMAS_RAW_URL_BASE = "https://raw.githubusercontent.com/airbytehq/airbyte/master/airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src"
LOCAL_METADATA_OUTPUT_DIR_PATH = "airbyte_cdk/metadata_models/generated"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Consider pinning to a stable ref instead of master, wdyt?

The URLs point to the master branch, which means builds will always fetch the latest schemas. If the upstream repository introduces breaking changes or the master branch becomes temporarily unstable, CDK builds will fail.

Consider one of these approaches:

  1. Pin to a specific commit SHA or tag for reproducible builds
  2. Add a configuration mechanism to specify the ref (defaulting to a known-good SHA)
  3. Implement a fallback to cached/vendored schemas if download fails

This would improve build reliability and reproducibility.

Example with pinning:

-METADATA_SCHEMAS_GITHUB_URL = "https://api.github.com/repos/airbytehq/airbyte/contents/airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src"
-METADATA_SCHEMAS_RAW_URL_BASE = "https://raw.githubusercontent.com/airbytehq/airbyte/master/airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src"
+# Pin to a specific commit for reproducible builds
+METADATA_SCHEMAS_REF = "a1b2c3d4e5f6"  # Update this when schemas change
+METADATA_SCHEMAS_GITHUB_URL = f"https://api.github.com/repos/airbytehq/airbyte/contents/airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src?ref={METADATA_SCHEMAS_REF}"
+METADATA_SCHEMAS_RAW_URL_BASE = f"https://raw.githubusercontent.com/airbytehq/airbyte/{METADATA_SCHEMAS_REF}/airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
METADATA_SCHEMAS_GITHUB_URL = "https://api.github.com/repos/airbytehq/airbyte/contents/airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src"
METADATA_SCHEMAS_RAW_URL_BASE = "https://raw.githubusercontent.com/airbytehq/airbyte/master/airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src"
LOCAL_METADATA_OUTPUT_DIR_PATH = "airbyte_cdk/metadata_models/generated"
# Pin to a specific commit for reproducible builds
METADATA_SCHEMAS_REF = "a1b2c3d4e5f6" # Update this when schemas change
METADATA_SCHEMAS_GITHUB_URL = f"https://api.github.com/repos/airbytehq/airbyte/contents/airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src?ref={METADATA_SCHEMAS_REF}"
METADATA_SCHEMAS_RAW_URL_BASE = f"https://raw.githubusercontent.com/airbytehq/airbyte/{METADATA_SCHEMAS_REF}/airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src"
LOCAL_METADATA_OUTPUT_DIR_PATH = "airbyte_cdk/metadata_models/generated"

Comment on lines 41 to 77
async def download_metadata_schemas(temp_dir: Path) -> list[str]:
"""Download metadata schema YAML files from GitHub to a temporary directory."""
headers = {
"User-Agent": "airbyte-python-cdk-build",
"Accept": "application/vnd.github.v3+json",
}

async with httpx.AsyncClient(headers=headers, timeout=30.0) as client:
try:
response = await client.get(METADATA_SCHEMAS_GITHUB_URL)
response.raise_for_status()
files_info = response.json()
except httpx.HTTPStatusError as e:
if e.response.status_code == 403:
print(
"Warning: GitHub API rate limit exceeded. Using cached schemas if available.",
file=sys.stderr,
)
raise
raise

yaml_files = []
for file_info in files_info:
if file_info["name"].endswith(".yaml"):
file_name = file_info["name"]
file_url = f"{METADATA_SCHEMAS_RAW_URL_BASE}/{file_name}"

print(f"Downloading {file_name}...", file=sys.stderr)
file_response = await client.get(file_url)
file_response.raise_for_status()

file_path = temp_dir / file_name
file_path.write_text(file_response.text)
yaml_files.append(Path(file_name).stem)

return yaml_files

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Rate limit and reliability concerns in schema download, wdyt?

The function has several reliability issues that could cause builds to fail:

  1. Rate limit handling is misleading: Lines 54-60 catch 403 (rate limit) and print "Using cached schemas if available" but then re-raise the exception, causing the build to fail anyway. There's no actual fallback to cached schemas.

  2. No GitHub authentication: Unauthenticated requests have a much lower rate limit (60/hour vs 5000/hour). Builds could easily hit this limit, especially in CI environments.

  3. No fallback mechanism: If GitHub is down or rate-limited, builds will always fail. Consider vendoring a copy of the schemas as a fallback.

  4. No retry logic: Transient network errors will cause immediate build failure.

Suggested improvements:

 async def download_metadata_schemas(temp_dir: Path) -> list[str]:
     """Download metadata schema YAML files from GitHub to a temporary directory."""
+    # Try to use GitHub token for higher rate limits
+    github_token = os.environ.get("GITHUB_TOKEN")
     headers = {
         "User-Agent": "airbyte-python-cdk-build",
         "Accept": "application/vnd.github.v3+json",
     }
+    if github_token:
+        headers["Authorization"] = f"token {github_token}"
     
-    async with httpx.AsyncClient(headers=headers, timeout=30.0) as client:
+    async with httpx.AsyncClient(headers=headers, timeout=30.0, max_retries=3) as client:
         try:
             response = await client.get(METADATA_SCHEMAS_GITHUB_URL)
             response.raise_for_status()
             files_info = response.json()
         except httpx.HTTPStatusError as e:
             if e.response.status_code == 403:
-                print(
-                    "Warning: GitHub API rate limit exceeded. Using cached schemas if available.",
-                    file=sys.stderr,
-                )
-                raise
+                print("Error: GitHub API rate limit exceeded.", file=sys.stderr)
+                print("Set GITHUB_TOKEN environment variable for higher rate limits.", file=sys.stderr)
+                # TODO: Fall back to vendored schemas here
+                raise
             raise

Would you like me to open an issue to track adding a proper fallback mechanism?

Committable suggestion skipped: line range outside the PR's diff.

@github-actions
Copy link

github-actions bot commented Oct 21, 2025

PyTest Results (Fast)

1 tests   - 3 801   0 ✅  - 3 790   2s ⏱️ - 6m 32s
1 suites ±    0   0 💤  -    12 
1 files   ±    0   0 ❌ ±    0   1 🔥 +1 

For more details on these errors, see this check.

Results for commit 933d478. ± Comparison against base commit 6b747fe.

This pull request removes 3802 and adds 1 tests. Note that renamed tests count towards both.
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config0-expected_calls0]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config1-expected_calls1]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config2-expected_calls2]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config3-expected_calls3]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config4-expected_calls4]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config5-expected_calls5]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config6-expected_calls6]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config7-expected_calls7]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config8-expected_calls8]
unit_tests.cli.airbyte_cdk.test_secrets.TestFetch ‑ test_fetch_with_all_failed_secrets
…
unit_tests.cli.airbyte_cdk.test_secret_masks

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Oct 21, 2025

PyTest Results (Full)

18 tests    0 ✅  14s ⏱️
 1 suites   0 💤
 1 files     0 ❌  18 🔥

For more details on these errors, see this check.

Results for commit 933d478.

♻️ This comment has been updated with latest results.

Copy link
Contributor Author

@aaronsteers aaronsteers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin, let's move the generated models module until this module instead: airbyte_cdk.test.models.connector_metadata

…etadata

- Moved from airbyte_cdk.metadata_models.generated to airbyte_cdk.test.models.connector_metadata
- Updated build script to output to new location
- Updated README with new import paths

Co-Authored-By: AJ Steers <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

♻️ Duplicate comments (13)
airbyte_cdk/test/models/connector_metadata/README.md (1)

8-10: Address the markdown lint issue.

A previous review flagged this fenced code block for missing a language specifier. While this is a minor formatting issue, it's good practice to fix for markdown compliance.

Apply this diff:

-```
+```text
 airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src/

</blockquote></details>
<details>
<summary>airbyte_cdk/test/models/connector_metadata/ReleaseStage.py (1)</summary><blockquote>

`9-13`: **Same duplication issue as SupportLevel, wdyt?**

The `ReleaseStage` enum is also defined identically in at least 5 other generated files (ConnectorMetadataDefinitionV0.py, ConnectorRegistryDestinationDefinition.py, ConnectorRegistryReleases.py, ConnectorRegistrySourceDefinition.py, ConnectorRegistryV0.py), causing the same type identity and import conflicts described in the SupportLevel comment.




This is part of the same systemic code generation issue where shared types are being duplicated rather than imported.

</blockquote></details>
<details>
<summary>airbyte_cdk/test/models/connector_metadata/SourceFileInfo.py (1)</summary><blockquote>

`11-16`: **Same duplication issue affects SourceFileInfo, wdyt?**

The `SourceFileInfo` model is defined identically in at least 6 other files (ConnectorMetadataDefinitionV0.py, ConnectorRegistryDestinationDefinition.py, ConnectorRegistryReleases.py, ConnectorRegistrySourceDefinition.py, ConnectorRegistryV0.py, GeneratedFields.py). This is part of the same systemic code generation issue.




Since this is a Pydantic model rather than just an enum, the type conflicts are even more severe—instances won't validate against models from different modules.

</blockquote></details>
<details>
<summary>airbyte_cdk/test/models/connector_metadata/ResourceRequirements.py (1)</summary><blockquote>

`11-18`: **ResourceRequirements also duplicated across 5+ files, wdyt?**

The `ResourceRequirements` model is defined identically in at least 5 other files (ActorDefinitionResourceRequirements.py, ConnectorMetadataDefinitionV0.py, ConnectorRegistryDestinationDefinition.py, ConnectorRegistryReleases.py, RegistryOverrides.py), part of the same code generation issue.

</blockquote></details>
<details>
<summary>airbyte_cdk/test/models/connector_metadata/TestConnections.py (1)</summary><blockquote>

`9-14`: **TestConnections duplicated in 2 files, wdyt?**

The `TestConnections` model is defined identically in ConnectorMetadataDefinitionV0.py (lines 56-61) and ConnectorTestSuiteOptions.py (lines 31-36), part of the same code generation issue.

</blockquote></details>
<details>
<summary>airbyte_cdk/test/models/connector_metadata/RemoteRegistries.py (1)</summary><blockquote>

`11-23`: **Both PyPi and RemoteRegistries duplicated, wdyt?**

Both `PyPi` (lines 11-16) and `RemoteRegistries` (lines 19-23) are defined identically in ConnectorMetadataDefinitionV0.py (PyPi at lines 203-208, RemoteRegistries at lines 322-326), part of the same code generation issue.

</blockquote></details>
<details>
<summary>airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1)</summary><blockquote>

`13-32`: **GitInfo duplication already flagged.**

This `GitInfo` definition is identical to the one in `GitInfo.py`. See the earlier comment on that file regarding the broader duplication pattern.

</blockquote></details>
<details>
<summary>airbyte_cdk/test/models/connector_metadata/GitInfo.py (1)</summary><blockquote>

`12-31`: **Duplicate model definition - part of broader pattern.**

The `GitInfo` model is duplicated across at least 6 files in this package (ConnectorMetadataDefinitionV0.py, ConnectorRegistryDestinationDefinition.py, ConnectorRegistryReleases.py, ConnectorRegistrySourceDefinition.py, ConnectorRegistryV0.py, and GeneratedFields.py). This is the most widely duplicated model in the codebase.

Consider this as part of the broader duplication issue flagged in Secret.py. The generation script should be configured to create these shared models once and use imports. Wdyt?

</blockquote></details>
<details>
<summary>airbyte_cdk/test/models/connector_metadata/RolloutConfiguration.py (1)</summary><blockquote>

`11-29`: **Model structure looks good; duplication noted.**

The `RolloutConfiguration` model has well-chosen defaults and appropriate validation constraints (percentage ranges, minimum delay). However, it's duplicated in ConnectorMetadataDefinitionV0.py, ConnectorRegistryReleases.py, and ConnectorRegistrySourceDefinition.py.

This is part of the broader duplication pattern. Wdyt about consolidating these shared models?

</blockquote></details>
<details>
<summary>airbyte_cdk/test/models/connector_metadata/AirbyteInternal.py (1)</summary><blockquote>

`29-39`: **Model structure is appropriate; duplication noted.**

The `AirbyteInternal` model correctly uses `Extra.allow` (unlike most other models in this package that use `Extra.forbid`), which is appropriate for internal/extensible metadata. The defaults are sensible (isEnterprise=False, requireVersionIncrementsInPullRequests=True).

However, this model along with the `Sl` and `Ql` enums is duplicated in ConnectorMetadataDefinitionV0.py, ConnectorRegistryDestinationDefinition.py, and ConnectorRegistrySourceDefinition.py—part of the broader duplication pattern.

</blockquote></details>
<details>
<summary>airbyte_cdk/test/models/connector_metadata/ConnectorBreakingChanges.py (1)</summary><blockquote>

`13-70`: **Breaking change models are well-structured; duplication noted.**

The breaking change models use appropriate Pydantic v1 patterns:
- `const=True` for type discrimination in `StreamBreakingChangeScope`
- `__root__` wrappers for dict-based schemas
- `date` and `AnyUrl` types for proper validation

However, all five models are duplicated in ConnectorMetadataDefinitionV0.py—part of the broader duplication pattern discussed in earlier files.

</blockquote></details>
<details>
<summary>airbyte_cdk/test/models/connector_metadata/ConnectorTestSuiteOptions.py (1)</summary><blockquote>

`19-49`: **Duplicate definitions: Should import from Secret.py instead.**

`SecretStore` (lines 19-29) and `Secret` (lines 40-49) are identical to the models defined in `Secret.py` within the same PR. Since both files are in the same package (`airbyte_cdk/test/models/connector_metadata/`), `ConnectorTestSuiteOptions.py` should import these models rather than redefining them:

```python
from airbyte_cdk.test.models.connector_metadata.Secret import Secret, SecretStore

This would eliminate the duplication and ensure consistency. Wdyt?

airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (1)

22-48: Resource requirements models are well-structured; duplication noted.

The models provide good flexibility with global defaults and per-job overrides. The JobType enum comprehensively covers connector operations.

However, all four models are duplicated in ConnectorMetadataDefinitionV0.py—part of the broader duplication pattern.

🧹 Nitpick comments (22)
airbyte_cdk/test/models/connector_metadata/ConnectorMetrics.py (1)

12-15: Consider stronger typing for metrics fields, wdyt?

The ConnectorMetrics class uses Optional[Any] for all three fields (all, cloud, oss), which sacrifices type safety. This might be intentional for flexible metric schemas, but if the metrics structure is known, consider using more specific types (e.g., Optional[ConnectorMetric] or a typed dictionary).

That said, the flexibility might be needed if metrics schemas vary. What's the rationale here?

bin/generate_component_manifest_files.py (1)

160-179: Consider extracting the shared post-processing logic, wdyt?

The post_process_metadata_models function has nearly identical logic to post_process_codegen (lines 136-157), both iterating through generated files and applying string replacements. The main difference is that post_process_metadata_models only replaces pydantic imports while post_process_codegen does additional replacements.

Consider extracting the common pattern into a shared helper:

+async def post_process_generated_files(
+    codegen_container: dagger.Container,
+    replacements: list[tuple[str, str]],
+    apply_deprecation_fix: bool = False
+) -> dagger.Container:
+    """Apply string replacements to generated Python files."""
+    codegen_container = codegen_container.with_exec(
+        ["mkdir", "/generated_post_processed"], use_entrypoint=True
+    )
+    for generated_file in await codegen_container.directory("/generated").entries():
+        if generated_file.endswith(".py"):
+            original_content = await codegen_container.file(
+                f"/generated/{generated_file}"
+            ).contents()
+            
+            post_processed_content = original_content
+            for old, new in replacements:
+                post_processed_content = post_processed_content.replace(old, new)
+            
+            if apply_deprecation_fix:
+                post_processed_content = replace_base_model_for_classes_with_deprecated_fields(
+                    post_processed_content
+                )
+            
+            codegen_container = codegen_container.with_new_file(
+                f"/generated_post_processed/{generated_file}", contents=post_processed_content
+            )
+    return codegen_container

 async def post_process_codegen(codegen_container: dagger.Container):
-    codegen_container = codegen_container.with_exec(
-        ["mkdir", "/generated_post_processed"], use_entrypoint=True
-    )
-    for generated_file in await codegen_container.directory("/generated").entries():
-        if generated_file.endswith(".py"):
-            original_content = await codegen_container.file(
-                f"/generated/{generated_file}"
-            ).contents()
-            post_processed_content = original_content.replace(
-                " _parameters:", " parameters:"
-            ).replace("from pydantic", "from pydantic.v1")
-
-            post_processed_content = replace_base_model_for_classes_with_deprecated_fields(
-                post_processed_content
-            )
-
-            codegen_container = codegen_container.with_new_file(
-                f"/generated_post_processed/{generated_file}", contents=post_processed_content
-            )
-    return codegen_container
+    return await post_process_generated_files(
+        codegen_container,
+        replacements=[(" _parameters:", " parameters:"), ("from pydantic", "from pydantic.v1")],
+        apply_deprecation_fix=True
+    )

 async def post_process_metadata_models(codegen_container: dagger.Container):
     """Post-process metadata models to use pydantic.v1 compatibility layer."""
-    codegen_container = codegen_container.with_exec(
-        ["mkdir", "/generated_post_processed"], use_entrypoint=True
-    )
-    for generated_file in await codegen_container.directory("/generated").entries():
-        if generated_file.endswith(".py"):
-            original_content = await codegen_container.file(
-                f"/generated/{generated_file}"
-            ).contents()
-            
-            post_processed_content = original_content.replace(
-                "from pydantic", "from pydantic.v1"
-            )
-            
-            codegen_container = codegen_container.with_new_file(
-                f"/generated_post_processed/{generated_file}", contents=post_processed_content
-            )
-    return codegen_container
+    return await post_process_generated_files(
+        codegen_container,
+        replacements=[("from pydantic", "from pydantic.v1")],
+        apply_deprecation_fix=False
+    )
airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (1)

12-19: Consider adding validation for resource requirement formats.

The ResourceRequirements fields are Optional[str] with no format validation. While this flexibility allows Kubernetes-style values like "100m" or "1Gi", it also accepts invalid values like "abc" or "100xyz" that would fail at runtime.

Since these are auto-generated models, would it make sense to add format validation in the upstream YAML schema (e.g., regex patterns or custom validators)? Wdyt?

airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1)

49-58: Consider consolidating identical enums.

Usage and SyncSuccessRate have identical members (low, medium, high). If these represent the same conceptual scale, they could be consolidated into a single enum like MetricLevel to reduce duplication. However, keeping them separate provides better semantic clarity if they represent distinct dimensions.

Wdyt—worth consolidating or better to keep separate for clarity?

airbyte_cdk/test/models/connector_metadata/RegistryOverrides.py (1)

12-18: Should Extra be forbid on simple leaf models to catch typos?

AllowedHosts/SuggestedStreams/NormalizationDestinationDefinitionConfig use Extra.allow, which can mask misspelled fields on these small shapes. Would switching to Extra.forbid for these specific leaves help catch config mistakes, or is “additionalProperties” intentionally allowed in the upstream schema, wdyt?

Also applies to: 22-37

airbyte_cdk/test/models/connector_metadata/ConnectorReleases.py (2)

43-44: Prefer Literal for const fields for clearer runtime checks

Using Any with const works but is looser than necessary. Shall we make this a Literal for better type clarity, wdyt?

-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, List, Optional, Literal
@@ class StreamBreakingChangeScope(BaseModel):
-    scopeType: Any = Field("stream", const=True)
+    scopeType: Literal["stream"] = "stream"

17-32: Add cross-field check: initialPercentage ≤ maxPercentage

A simple validator will prevent inconsistent rollout configs. Should we add it, wdyt?

 class RolloutConfiguration(BaseModel):
@@
     advanceDelayMinutes: Optional[conint(ge=10)] = Field(
         10,
         description="The number of minutes to wait before advancing the rollout percentage.",
     )
+
+    from pydantic.v1 import root_validator
+
+    @root_validator
+    def _validate_rollout_bounds(cls, values):
+        init = values.get("initialPercentage")
+        maxp = values.get("maxPercentage")
+        if init is not None and maxp is not None and init > maxp:
+            raise ValueError("initialPercentage must be ≤ maxPercentage")
+        return values
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)

298-301: Tighten URL/time/number types and add a small integrity check

Would you be open to these stricter types and a light data check, wdyt?

  • Use AnyUrl where fields are URLs:
-    documentationUrl: str
+    documentationUrl: AnyUrl
@@
-    sbomUrl: Optional[str] = Field(None, description="URL to the SBOM file")
+    sbomUrl: Optional[AnyUrl] = Field(None, description="URL to the SBOM file")
  • Use datetime for timestamps in SourceFileInfo:
-    metadata_last_modified: Optional[str] = None
-    registry_entry_generated_at: Optional[str] = None
+    metadata_last_modified: Optional[datetime] = None
+    registry_entry_generated_at: Optional[datetime] = None
  • Enforce positive values for maxSecondsBetweenMessages:
-from typing import Any, Dict, List, Optional, Union
+from typing import Any, Dict, List, Optional, Union
@@
-    maxSecondsBetweenMessages: Optional[int] = Field(
+    maxSecondsBetweenMessages: Optional[conint(ge=1)] = Field(
         None,
         description="Number of seconds allowed between 2 airbyte protocol messages. The source will timeout if this delay is reach",
     )
  • Ensure unique jobType in jobSpecific (same pattern as suggested elsewhere):
 class ActorDefinitionResourceRequirements(BaseModel):
@@
     jobSpecific: Optional[List[JobTypeResourceLimit]] = None
+    from pydantic.v1 import validator
+    @validator("jobSpecific")
+    def _unique_job_types(cls, v):
+        if not v:
+            return v
+        seen = set()
+        for item in v:
+            if item.jobType in seen:
+                raise ValueError(f"duplicate jobType '{item.jobType.value}' in jobSpecific")
+            seen.add(item.jobType)
+        return v

Also applies to: 331-334, 240-241, 165-170, 186-188, 326-329, 247-252

airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)

303-304: Optional tightenings: URLs, datetimes, and rollout bounds

Would you consider these small improvements, wdyt?

  • Use AnyUrl for documentationUrl and sbomUrl:
-    documentationUrl: str
+    documentationUrl: AnyUrl
@@
-    sbomUrl: Optional[str] = Field(None, description="URL to the SBOM file")
+    sbomUrl: Optional[AnyUrl] = Field(None, description="URL to the SBOM file")
  • Use datetime for SourceFileInfo timestamps:
-    metadata_last_modified: Optional[str] = None
-    registry_entry_generated_at: Optional[str] = None
+    metadata_last_modified: Optional[datetime] = None
+    registry_entry_generated_at: Optional[datetime] = None
  • Add rollout bound check:
 class RolloutConfiguration(BaseModel):
@@
     advanceDelayMinutes: Optional[conint(ge=10)] = Field(
         10,
         description="The number of minutes to wait before advancing the rollout percentage.",
     )
+    from pydantic.v1 import root_validator
+    @root_validator
+    def _validate_rollout_bounds(cls, values):
+        init, maxp = values.get("initialPercentage"), values.get("maxPercentage")
+        if init is not None and maxp is not None and init > maxp:
+            raise ValueError("initialPercentage must be ≤ maxPercentage")
+        return values

Also applies to: 395-399, 240-241, 183-189, 79-94

airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)

236-241: Minor hardening: URLs, datetimes, Literal scope type, rollout bound, uniqueness

Would you consider these small improvements, wdyt?

  • Use AnyUrl for sbomUrl:
-    sbomUrl: Optional[str] = Field(None, description="URL to the SBOM file")
+    sbomUrl: Optional[AnyUrl] = Field(None, description="URL to the SBOM file")
  • Use datetime for SourceFileInfo timestamps:
-    metadata_last_modified: Optional[str] = None
-    registry_entry_generated_at: Optional[str] = None
+    metadata_last_modified: Optional[datetime] = None
+    registry_entry_generated_at: Optional[datetime] = None
  • Prefer Literal for scopeType:
-from typing import Any, Dict, List, Optional, Union
+from typing import Any, Dict, List, Optional, Union, Literal
@@ class StreamBreakingChangeScope(BaseModel):
-    scopeType: Any = Field("stream", const=True)
+    scopeType: Literal["stream"] = "stream"
  • Add rollout bounds validator:
 class RolloutConfiguration(BaseModel):
@@
     advanceDelayMinutes: Optional[conint(ge=10)] = Field(
         10,
         description="The number of minutes to wait before advancing the rollout percentage.",
     )
+    from pydantic.v1 import root_validator
+    @root_validator
+    def _validate_rollout_bounds(cls, values):
+        init, maxp = values.get("initialPercentage"), values.get("maxPercentage")
+        if init is not None and maxp is not None and init > maxp:
+            raise ValueError("initialPercentage must be ≤ maxPercentage")
+        return values
  • Ensure unique jobType in jobSpecific:
 class ActorDefinitionResourceRequirements(BaseModel):
@@
     jobSpecific: Optional[List[JobTypeResourceLimit]] = None
+    from pydantic.v1 import validator
+    @validator("jobSpecific")
+    def _unique_job_types(cls, v):
+        if not v:
+            return v
+        seen = set()
+        for item in v:
+            if item.jobType in seen:
+                raise ValueError(f"duplicate jobType '{item.jobType.value}' in jobSpecific")
+            seen.add(item.jobType)
+        return v

Also applies to: 165-170, 146-161, 221-225, 35-37, 18-32

airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (6)

1-3: Reproducibility: embed generator + schema provenance?

Since builds fetch schemas at build time, could we stamp the generator version, source schema URL, and commit SHA in this header to make artifacts reproducible and debuggable, wdyt?


101-111: Use Literal for const field to tighten validation and typing

scopeType is effectively a constant. Shall we use Literal["stream"] instead of Any+const for stronger type-checking and clearer errors, wdyt?

Apply this diff:

-from typing import Any, Dict, List, Optional, Union
+from typing import Any, Dict, List, Optional, Union, Literal
@@
-class StreamBreakingChangeScope(BaseModel):
+class StreamBreakingChangeScope(BaseModel):
@@
-    scopeType: Any = Field("stream", const=True)
+    scopeType: Literal["stream"] = Field("stream", const=True)

236-241: URL typing: prefer AnyUrl for sbomUrl

sbomUrl looks like a URL. Would you switch to AnyUrl for stricter validation, wdyt?

-    sbomUrl: Optional[str] = Field(None, description="URL to the SBOM file")
+    sbomUrl: Optional[AnyUrl] = Field(None, description="URL to the SBOM file")

323-327: Semver: add a regex constraint on protocolVersion

If protocolVersion is semver, shall we add a constraint to avoid accidental non-semver strings, wdyt?

-    protocolVersion: Optional[str] = Field(
+    protocolVersion: Optional[constr(regex=r'^\d+\.\d+\.\d+(?:-[0-9A-Za-z-.]+)?$')] = Field(
         None, description="the Airbyte Protocol version supported by the connector"
     )

294-301: <!-- [scratchpad_update]
[observations_updated]

  • Web search confirms AnyUrl is a valid Pydantic type from the networks module
  • AnyUrl provides URL validation that catches malformed values at parse-time
  • Pydantic v2 behavior: AnyUrl appends trailing slash; may affect downstream usage
  • The suggestion to use AnyUrl is technically sound and recommended by Pydantic docs
  • Need to verify: (1) Pydantic version used in this codebase, (2) whether AnyUrl is imported

[pending]

  • Check Pydantic import and version in ConnectorRegistryDestinationDefinition.py
  • Verify if changing to AnyUrl would break any dependent code (due to type differences)
  • Determine if this needs coordination across all 8 instances found
    [scratchpad_end] --><!-- [scratchpad_update_2]
    [key_finding]
  • Airbyte CDK 2.0 (released mid-2024) uses Pydantic V2
  • The files being reviewed are test/models/connector_metadata files
  • These are likely part of the updated Pydantic V2 stack
  • AnyUrl is available in Pydantic V2

[critical_observation]

  • Need to check if these files are already using Pydantic V2 imports
  • Suggested change (str -> AnyUrl) is valid for Pydantic V2
  • The pattern exists across 8 locations in 4 files
  • This represents a systemic opportunity for coordinated refactoring

[pending]

  • Verify imports in target file to confirm Pydantic V2 usage
  • Check if AnyUrl is already imported
  • Determine scope of refactor (all 8 instances or just the target file?)
    [scratchpad_end] -->documentationUrl/iconUrl: widen refactor scope—apply across all registry models

The suggestion to use AnyUrl for validation is sound. However, the verification found this pattern repeated in 8 locations across 4 files:

  • ConnectorRegistryV0.py (lines 303, 395)
  • ConnectorRegistrySourceDefinition.py (lines 298, 389)
  • ConnectorRegistryReleases.py (lines 335, 389)
  • ConnectorRegistryDestinationDefinition.py (lines 298, 390)

Rather than fixing just the target file, consider coordinating this into a single refactor pass across all registry models to maintain consistency. With CDK 2.0, Pydantic V2 is available, so AnyUrl is a valid choice. Would updating all four files at once work for your workflow, or should we proceed incrementally?


436-439: Consider adding locals() and including SourceDefinition for robustness and consistency

Your suggestion is spot on. The file defines ConnectorRegistrySourceDefinition at line 382, yet it's not receiving an update_forward_refs() call while the other three registry classes are. Passing locals() also aligns with Pydantic best practices for ensuring forward references resolve against all types in the current module scope.

The refactored version would look like:

_ns = locals()
ConnectorRegistryDestinationDefinition.update_forward_refs(**_ns)
ConnectorRegistryReleases.update_forward_refs(**_ns)
ConnectorReleaseCandidates.update_forward_refs(**_ns)
VersionReleaseCandidate.update_forward_refs(**_ns)
ConnectorRegistrySourceDefinition.update_forward_refs(**_ns)

This makes the handling symmetric and more resilient. Worth adopting?

airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)

1-3: Reproducibility: capture generator + schema commit in header

To help future debugging (and rate-limit workarounds), could we include the datamodel-codegen version, source schema URL(s), and pinned commit SHA here, wdyt?


166-172: Use Literal for const field

Same as the other module: would you switch scopeType to Literal["stream"] for stricter typing, wdyt?

-    scopeType: Any = Field("stream", const=True)
+    scopeType: Literal["stream"] = Field("stream", const=True)

330-335: URL typing: sbomUrl could be AnyUrl

Align with other URL fields to validate early. Shall we switch, wdyt?

-    sbomUrl: Optional[str] = Field(None, description="URL to the SBOM file")
+    sbomUrl: Optional[AnyUrl] = Field(None, description="URL to the SBOM file")

397-404: connectorSubtype override: align type with enum while keeping flexibility

For RegistryOverrides.connectorSubtype, would you type it as Optional[Union[str, ConnectorSubtype]] so we validate known values but still accept future strings, wdyt?

-    connectorSubtype: Optional[str] = None
+    connectorSubtype: Optional[Union[str, ConnectorSubtype]] = None

441-488: Standardize documentationUrl type across metadata and registry definitions

The inconsistency is confirmed: ConnectorMetadataDefinitionV0.py uses AnyUrl for documentationUrl (line 452), while all registry definition files (ConnectorRegistry*) use str (6 instances across multiple files). This divergence could introduce parsing surprises when the same field is handled in different contexts. Would standardizing on AnyUrl across both codegen flows make sense for your use cases?


471-474: Consider refactoring tags field to use default_factory=list with non-optional type

The current implementation uses a mutable default argument with an Optional type, which is a Python anti-pattern. The suggested change to List[str] = Field(default_factory=list, ...) aligns with Pydantic best practices and eliminates the mutable default concern. A broader search of the codebase didn't surface other similar patterns, so this appears to be an isolated instance. Wdyt about making this change?

-    tags: Optional[List[str]] = Field(
-        [],
-        description="An array of tags that describe the connector. E.g: language:python, keyword:rds, etc.",
-    )
+    tags: List[str] = Field(
+        default_factory=list,
+        description="An array of tags that describe the connector. E.g: language:python, keyword:rds, etc.",
+    )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff01ea0 and 63930c6.

📒 Files selected for processing (34)
  • airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/AirbyteInternal.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/AllowedHosts.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorBreakingChanges.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorBuildOptions.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorIPCOptions.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorMetrics.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorPackageInfo.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorReleases.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorTestSuiteOptions.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/GitInfo.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/JobType.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/NormalizationDestinationDefinitionConfig.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/README.md (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/RegistryOverrides.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ReleaseStage.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/RemoteRegistries.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ResourceRequirements.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/RolloutConfiguration.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/Secret.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/SecretStore.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/SourceFileInfo.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/SuggestedStreams.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/SupportLevel.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/TestConnections.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/__init__.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/py.typed (1 hunks)
  • bin/generate_component_manifest_files.py (3 hunks)
✅ Files skipped from review due to trivial changes (3)
  • airbyte_cdk/test/models/connector_metadata/py.typed
  • airbyte_cdk/test/models/connector_metadata/init.py
  • airbyte_cdk/test/models/connector_metadata/JobType.py
🧰 Additional context used
🧬 Code graph analysis (21)
airbyte_cdk/test/models/connector_metadata/SupportLevel.py (5)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (1)
  • SupportLevel (72-75)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • SupportLevel (21-24)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • SupportLevel (66-69)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • SupportLevel (28-31)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • SupportLevel (21-24)
airbyte_cdk/test/models/connector_metadata/AllowedHosts.py (6)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (7)
  • AllowedHosts (78-85)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • AllowedHosts (65-72)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • AllowedHosts (92-99)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • AllowedHosts (54-61)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • AllowedHosts (65-72)
airbyte_cdk/test/models/connector_metadata/RegistryOverrides.py (1)
  • AllowedHosts (12-19)
airbyte_cdk/test/models/connector_metadata/ResourceRequirements.py (5)
airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (4)
  • ResourceRequirements (12-19)
  • Config (13-14)
  • Config (33-34)
  • Config (41-42)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)
  • ResourceRequirements (116-123)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • ResourceRequirements (27-34)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • ResourceRequirements (72-79)
airbyte_cdk/test/models/connector_metadata/RegistryOverrides.py (1)
  • ResourceRequirements (50-57)
airbyte_cdk/test/models/connector_metadata/Secret.py (3)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (7)
  • SecretStore (44-54)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Secret (296-305)
airbyte_cdk/test/models/connector_metadata/ConnectorTestSuiteOptions.py (2)
  • SecretStore (19-29)
  • Secret (40-49)
airbyte_cdk/test/models/connector_metadata/SecretStore.py (1)
  • SecretStore (11-21)
airbyte_cdk/test/models/connector_metadata/ConnectorIPCOptions.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (11)
  • SupportedSerializationEnum (269-272)
  • SupportedTransportEnum (275-277)
  • DataChannel (280-286)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
  • Config (117-118)
  • ConnectorIPCOptions (289-293)
airbyte_cdk/test/models/connector_metadata/NormalizationDestinationDefinitionConfig.py (2)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)
  • NormalizationDestinationDefinitionConfig (88-103)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • NormalizationDestinationDefinitionConfig (47-62)
airbyte_cdk/test/models/connector_metadata/ConnectorBreakingChanges.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (13)
  • DeadlineAction (157-159)
  • StreamBreakingChangeScope (162-171)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
  • Config (117-118)
  • Config (137-138)
  • BreakingChangeScope (316-320)
  • VersionBreakingChange (362-384)
  • ConnectorBreakingChanges (406-414)
airbyte_cdk/test/models/connector_metadata/RegistryOverrides.py (4)
airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (7)
  • Config (13-14)
  • Config (33-34)
  • Config (41-42)
  • ResourceRequirements (12-19)
  • JobType (22-29)
  • JobTypeResourceLimit (32-37)
  • ActorDefinitionResourceRequirements (40-48)
airbyte_cdk/test/models/connector_metadata/NormalizationDestinationDefinitionConfig.py (1)
  • NormalizationDestinationDefinitionConfig (9-24)
airbyte_cdk/test/models/connector_metadata/ResourceRequirements.py (1)
  • ResourceRequirements (11-18)
airbyte_cdk/test/models/connector_metadata/JobType.py (1)
  • JobType (9-16)
airbyte_cdk/test/models/connector_metadata/RemoteRegistries.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (7)
  • PyPi (204-209)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • RemoteRegistries (323-327)
airbyte_cdk/test/models/connector_metadata/ConnectorPackageInfo.py (4)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • ConnectorPackageInfo (217-218)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • ConnectorPackageInfo (199-200)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • ConnectorPackageInfo (217-218)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • ConnectorPackageInfo (217-218)
airbyte_cdk/test/models/connector_metadata/TestConnections.py (2)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)
  • TestConnections (57-62)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/ConnectorTestSuiteOptions.py (1)
  • TestConnections (32-37)
airbyte_cdk/test/models/connector_metadata/ConnectorBuildOptions.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (7)
  • ConnectorBuildOptions (30-34)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
airbyte_cdk/test/models/connector_metadata/SourceFileInfo.py (6)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (1)
  • SourceFileInfo (234-239)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • SourceFileInfo (182-187)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • SourceFileInfo (164-169)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • SourceFileInfo (182-187)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • SourceFileInfo (182-187)
airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1)
  • SourceFileInfo (35-40)
airbyte_cdk/test/models/connector_metadata/GitInfo.py (6)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)
  • GitInfo (212-231)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • GitInfo (160-179)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • GitInfo (142-161)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • GitInfo (160-179)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • GitInfo (160-179)
airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1)
  • GitInfo (13-32)
airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (12)
  • ResourceRequirements (116-123)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
  • Config (117-118)
  • Config (137-138)
  • JobType (126-133)
  • JobTypeResourceLimit (308-313)
  • ActorDefinitionResourceRequirements (351-359)
airbyte_cdk/test/models/connector_metadata/ReleaseStage.py (5)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (1)
  • ReleaseStage (65-69)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • ReleaseStage (14-18)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • ReleaseStage (59-63)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • ReleaseStage (21-25)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • ReleaseStage (14-18)
airbyte_cdk/test/models/connector_metadata/RolloutConfiguration.py (3)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)
  • RolloutConfiguration (136-154)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • RolloutConfiguration (14-32)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • RolloutConfiguration (74-92)
airbyte_cdk/test/models/connector_metadata/ConnectorMetrics.py (2)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (9)
  • ConnectorMetrics (242-245)
  • Usage (248-251)
  • SyncSuccessRate (254-257)
  • ConnectorMetric (260-266)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (4)
  • ConnectorMetrics (43-46)
  • Usage (49-52)
  • SyncSuccessRate (55-58)
  • ConnectorMetric (61-67)
airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (12)
  • GitInfo (212-231)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • SourceFileInfo (234-239)
  • ConnectorMetrics (242-245)
  • Usage (248-251)
  • SyncSuccessRate (254-257)
  • ConnectorMetric (260-266)
  • GeneratedFields (330-334)
airbyte_cdk/test/models/connector_metadata/AirbyteInternal.py (3)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (9)
  • Sl (174-178)
  • Ql (181-188)
  • AirbyteInternal (191-201)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (3)
  • Sl (130-134)
  • Ql (137-144)
  • AirbyteInternal (147-157)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (3)
  • Sl (130-134)
  • Ql (137-144)
  • AirbyteInternal (147-157)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (11)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (28)
  • RolloutConfiguration (74-92)
  • DeadlineAction (95-97)
  • StreamBreakingChangeScope (100-109)
  • SourceType (14-18)
  • ReleaseStage (21-25)
  • SupportLevel (28-31)
  • ResourceRequirements (34-41)
  • JobType (44-51)
  • AllowedHosts (54-61)
  • SuggestedStreams (64-71)
  • Sl (130-134)
  • Ql (137-144)
  • AirbyteInternal (147-157)
  • GitInfo (160-179)
  • SourceFileInfo (182-187)
  • ConnectorMetrics (190-193)
  • ConnectorMetric (208-214)
  • ConnectorPackageInfo (217-218)
  • NormalizationDestinationDefinitionConfig (112-127)
  • BreakingChangeScope (229-233)
  • JobTypeResourceLimit (221-226)
  • GeneratedFields (236-240)
  • VersionBreakingChange (254-276)
  • ActorDefinitionResourceRequirements (243-251)
  • ConnectorBreakingChanges (279-287)
  • ConnectorRegistryReleases (344-354)
  • ConnectorRegistrySourceDefinition (290-341)
  • ConnectorRegistryDestinationDefinition (381-433)
airbyte_cdk/test/models/connector_metadata/RolloutConfiguration.py (1)
  • RolloutConfiguration (11-29)
airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (7)
  • Config (13-14)
  • Config (33-34)
  • Config (41-42)
  • ResourceRequirements (12-19)
  • JobType (22-29)
  • JobTypeResourceLimit (32-37)
  • ActorDefinitionResourceRequirements (40-48)
airbyte_cdk/test/models/connector_metadata/AirbyteInternal.py (4)
  • Config (30-31)
  • Sl (12-16)
  • Ql (19-26)
  • AirbyteInternal (29-39)
airbyte_cdk/test/models/connector_metadata/AllowedHosts.py (2)
  • Config (12-13)
  • AllowedHosts (11-18)
airbyte_cdk/test/models/connector_metadata/ConnectorBreakingChanges.py (8)
  • Config (19-20)
  • Config (38-39)
  • Config (63-64)
  • DeadlineAction (13-15)
  • StreamBreakingChangeScope (18-27)
  • BreakingChangeScope (30-34)
  • VersionBreakingChange (37-59)
  • ConnectorBreakingChanges (62-70)
airbyte_cdk/test/models/connector_metadata/SuggestedStreams.py (1)
  • SuggestedStreams (11-18)
airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (5)
  • GitInfo (13-32)
  • SourceFileInfo (35-40)
  • ConnectorMetrics (43-46)
  • ConnectorMetric (61-67)
  • GeneratedFields (70-74)
airbyte_cdk/test/models/connector_metadata/GitInfo.py (1)
  • GitInfo (12-31)
airbyte_cdk/test/models/connector_metadata/ConnectorPackageInfo.py (1)
  • ConnectorPackageInfo (11-12)
airbyte_cdk/test/models/connector_metadata/NormalizationDestinationDefinitionConfig.py (1)
  • NormalizationDestinationDefinitionConfig (9-24)
🪛 markdownlint-cli2 (0.18.1)
airbyte_cdk/test/models/connector_metadata/README.md

8-8: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (15)
airbyte_cdk/test/models/connector_metadata/README.md (2)

1-79: The documentation structure and examples are excellent!

The README provides clear usage examples, describes the model generation workflow, and properly documents the schema source. The validation examples and field access patterns will be helpful for users.


21-21: The import path in the README is correct—no changes needed.

The script confirms the package exists at airbyte_cdk/test/models/connector_metadata/, which matches exactly what the README imports on line 21. The filesystem structure validates the documentation.

The original concern about a path mismatch appears to have been based on PR objectives that differ from the actual implementation. The README accurately reflects the real package structure.

Likely an incorrect or invalid review comment.

airbyte_cdk/test/models/connector_metadata/SecretStore.py (1)

11-21: The model structure looks good!

The use of Extra.forbid is appropriate for security-sensitive secret store configuration, preventing accidental inclusion of unvalidated fields. The Literal["GSM"] type for the secret store ensures type safety.

Quick check: Is the restriction to only GSM (Google Secret Manager) intentional for the current scope? Just want to confirm this isn't limiting support for other secret stores (AWS Secrets Manager, Azure Key Vault, etc.) that might be needed.

airbyte_cdk/test/models/connector_metadata/ConnectorMetrics.py (1)

30-36: Nice design: Union types provide flexibility!

The use of Union[str, Usage] and Union[str, SyncSuccessRate] is a good pattern—it allows both string literals and typed enums, providing backwards compatibility while enabling type-safe code when using the enums directly.

airbyte_cdk/test/models/connector_metadata/ConnectorIPCOptions.py (2)

12-20: Good IPC serialization and transport options!

The enum choices make sense:

  • Serialization: JSONL (human-readable), PROTOBUF (efficient), FLATBUFFERS (zero-copy)
  • Transport: STDIO (standard streams), SOCKET (network/IPC sockets)

This provides a solid foundation for connector IPC configuration.


23-36: Strict validation is appropriate for IPC config.

Using Extra.forbid in both DataChannel and ConnectorIPCOptions is the right choice for IPC configuration—it prevents misconfiguration from typos or unsupported fields. Making all DataChannel fields required ensures complete IPC specifications.

bin/generate_component_manifest_files.py (1)

181-232: Nice refactoring to extract the common generation flow!

The generate_models_from_schemas function nicely consolidates the common logic for generating models from YAML schemas, making it reusable for both declarative components and metadata models. The post_process and metadata_models flags provide clear control over the different post-processing paths.

airbyte_cdk/test/models/connector_metadata/SuggestedStreams.py (1)

11-18: Model structure looks good!

The SuggestedStreams model is well-defined with a clear description explaining the semantics of missing vs empty arrays. The use of Extra.allow provides forward compatibility for additional fields.

airbyte_cdk/test/models/connector_metadata/ConnectorTestSuiteOptions.py (1)

52-63: ConnectorTestSuiteOptions model structure looks good.

The model appropriately aggregates test suite configuration with required suite field and optional lists of secrets and test connections. The use of Extra.forbid is appropriate for a configuration model.

airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (2)

43-46: Flexible typing with Any fields.

The ConnectorMetrics model uses Any for the all, cloud, and oss fields, which sacrifices type safety for schema flexibility. This is likely intentional to accommodate varying metric structures across environments, but it means invalid data won't be caught at validation time.

If metric schemas are known and stable, consider using more specific types. Otherwise, this trade-off seems reasonable for extensibility.


70-74: GeneratedFields aggregation looks good.

The model appropriately aggregates various generated metadata fields (git info, source file info, metrics, SBOM URL). The structure makes sense for generated/computed connector metadata.

airbyte_cdk/test/models/connector_metadata/Secret.py (1)

11-33: Confirmed: Systematic duplication of models across auto-generated files. Consider centralizing generation configuration.

The duplication you identified is real and extends far beyond SecretStore and SecretGitInfo appears in 8 files, RolloutConfiguration in 7, and the same pattern repeats across the generated models. This suggests the code generator is inlining these shared schemas rather than generating them once and importing them.

Your suggestion about configuring datamodel-codegen to generate shared models is the right direction. Since these are auto-generated files, the fix should be in the generation configuration or source schema (not by modifying the generated output directly).

Could you clarify: Is the generation configuration set up to inline these common models intentionally, or would this be worth investigating as part of the code generation pipeline improvement?

airbyte_cdk/test/models/connector_metadata/ConnectorBreakingChanges.py (1)

66-66: Regex pattern is intentional, not a limitation.

The regex ^\d+\.\d+\.\d+$ reflects a deliberate versioning strategy rather than an oversight. The project's publish workflow checks version format with the same pattern: versions matching the strict MAJOR.MINOR.PATCH format are marked as stable releases (IS_PRERELEASE=false), while versions with additional identifiers are handled separately as pre-releases. Breaking changes in this schema are scoped to production releases only, not pre-releases. Since the file is auto-generated from a YAML schema (datamodel-codegen), the pattern is maintained by the build process rather than manual editing anyway.

No further action needed here—this is working as designed.

Likely an incorrect or invalid review comment.

airbyte_cdk/test/models/connector_metadata/ConnectorReleases.py (1)

87-91: Confirm semver rule: stable versions only for breakingChanges?

Regex ^\d+.\d+.\d+$ excludes pre-releases/build metadata. Is this intentional (i.e., breaking changes only tracked on stable releases), or should RCs be allowed here too, wdyt?

airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (1)

65-76: <!-- [scratchpad_update]
[task verify_duplication 1/1]
[observations added]

  • Web search confirms datamodel-codegen has --reuse-model flag (reuses models when module has same content)
  • Duplication is widespread across 8 distinct files + multiple redefinitions in ConnectorRegistry*.py
  • Dedicated enum modules exist (ReleaseStage.py, SupportLevel.py, JobType.py) but redefinitions occur in 8 other places
  • Review comment mentioned using datamodel-codegen's model reuse feature—this is a real capability

[analysis added]

  1. The original review comment's suggestion about datamodel-codegen's model reuse is technically valid
  2. The extent of duplication is more severe than what's in ConnectorMetadataDefinitionV0.py alone—it's a systemic issue
  3. The concern about "mypy friction" and drift is valid given 8 redundant definitions
  4. Standalone enum modules appear to exist but are not being imported by dependent files

[decision]
The review comment is essentially correct. The duplication is verified. This should be marked as since the suggestion requires team decision on implementation strategy (datamodel-codegen reuse vs manual consolidation).
[scratchpad_end] -->

Verified: Extensive duplication of ReleaseStage/SupportLevel/JobType enums across the connector_metadata module

Your concern is spot-on. The script confirms these enums are redefined across 8+ files instead of being reused:

  • ReleaseStage: 6 definitions (standalone module + ConnectorMetadataDefinitionV0, ConnectorRegistryV0, ConnectorRegistryDestinationDefinition, ConnectorRegistryReleases, ConnectorRegistrySourceDefinition)
  • SupportLevel: 6 definitions (same pattern)
  • JobType: 8 definitions (including ActorDefinitionResourceRequirements and RegistryOverrides)

The original suggestion to enable datamodel-codegen's --reuse-model flag to reuse models when a module has the model with the same content, or consolidate into a single shared module, remains a solid approach to eliminate drift and type-checking friction.

Consider whether the code generation pipeline should enforce the --reuse-model option or whether refactoring toward explicit imports from the dedicated enum modules (ReleaseStage.py, SupportLevel.py, JobType.py) fits your architecture better—which path appeals to you?

Copy link
Contributor Author

@aaronsteers aaronsteers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin, Great, that's much better, I think. Now, let's once again put the module until a 'generated' submodule breadcrumb. And import into airbyte_cdk/tests/models/init.py:

  • the top level manifest file model (I don't know the name)
  • the test suites definition model

This gives us most of what we would need to import for our use cases. Imports through the "generated" breadcrumb will still be possible but we can reduce surface area to maintain by only importing the models we need to call or instantiate directly.

…mports

- Moved all generated models to connector_metadata/generated/
- Added __init__.py in connector_metadata to export key models
- Updated airbyte_cdk/test/models/__init__.py to import:
  - ConnectorMetadataDefinitionV0 (top-level manifest model)
  - ConnectorTestSuiteOptions (test suites definition model)
- Updated README with new import paths and examples
- Updated build script to output to new location

This reduces the API surface area while keeping all models accessible
through the generated submodule.

Co-Authored-By: AJ Steers <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (4)
bin/generate_component_manifest_files.py (3)

17-18: Master branch dependency affects build reproducibility.

These URLs point to the master branch, meaning every build fetches the latest schemas. If upstream introduces breaking changes, CDK builds will fail unpredictably.

This concern was previously raised. The recommendation was to pin to a specific commit SHA or tag for reproducible builds.


41-77: Multiple reliability concerns remain unaddressed.

Previous reviews identified several critical issues:

  1. Misleading error message (lines 55-59): Claims "Using cached schemas if available" but immediately re-raises, causing build failure with no actual fallback
  2. No GitHub authentication: Unauthenticated requests hit 60/hour rate limit vs. 5000/hour with auth
  3. No retry logic: Transient network errors cause immediate build failure
  4. No fallback mechanism: If GitHub is unavailable, builds always fail

These concerns were raised in earlier reviews and would significantly impact build reliability.


246-257: Missing error handling for schema download.

If download_metadata_schemas fails at line 249 (due to rate limits, network issues, etc.), users will see a cryptic stack trace without actionable guidance.

A previous review suggested wrapping this call in try/except to catch specific exceptions and provide helpful error messages (e.g., "Set GITHUB_TOKEN environment variable to avoid rate limits" for 403 errors, "Check network connectivity" for network errors).

airbyte_cdk/test/models/connector_metadata/README.md (1)

8-10: Add language specifier to fenced code block.

The code block showing the repository path should specify a language (e.g., text) for proper syntax highlighting and markdown lint compliance.

This was flagged in a previous review.

🧹 Nitpick comments (1)
bin/generate_component_manifest_files.py (1)

224-231: Clarify mutually exclusive flags, wdyt?

The if/elif/else structure means post_process and metadata_models are mutually exclusive—if both are True, only post-processing runs and metadata post-processing is skipped.

Is this intentional? If so, consider adding a docstring note or an assertion at the function start to make the constraint explicit, or using an enum parameter instead of two booleans.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63930c6 and 62902f6.

⛔ Files ignored due to path filters (31)
  • airbyte_cdk/test/models/connector_metadata/generated/ActorDefinitionResourceRequirements.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/AirbyteInternal.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/AllowedHosts.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorBreakingChanges.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorBuildOptions.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorIPCOptions.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorMetadataDefinitionV0.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorMetrics.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorPackageInfo.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorRegistryDestinationDefinition.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorRegistryReleases.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorRegistrySourceDefinition.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorRegistryV0.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorReleases.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorTestSuiteOptions.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/GeneratedFields.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/GitInfo.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/JobType.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/NormalizationDestinationDefinitionConfig.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/RegistryOverrides.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ReleaseStage.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/RemoteRegistries.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ResourceRequirements.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/RolloutConfiguration.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/Secret.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/SecretStore.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/SourceFileInfo.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/SuggestedStreams.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/SupportLevel.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/TestConnections.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/__init__.py is excluded by !**/generated/**
📒 Files selected for processing (4)
  • airbyte_cdk/test/models/__init__.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/README.md (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/__init__.py (1 hunks)
  • bin/generate_component_manifest_files.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
airbyte_cdk/test/models/__init__.py (2)
airbyte_cdk/test/models/connector_metadata/generated/ConnectorMetadataDefinitionV0.py (2)
  • ConnectorMetadataDefinitionV0 (490-495)
  • ConnectorTestSuiteOptions (337-348)
airbyte_cdk/test/models/connector_metadata/generated/ConnectorTestSuiteOptions.py (1)
  • ConnectorTestSuiteOptions (52-63)
airbyte_cdk/test/models/connector_metadata/__init__.py (2)
airbyte_cdk/test/models/connector_metadata/generated/ConnectorMetadataDefinitionV0.py (2)
  • ConnectorMetadataDefinitionV0 (490-495)
  • ConnectorTestSuiteOptions (337-348)
airbyte_cdk/test/models/connector_metadata/generated/ConnectorTestSuiteOptions.py (1)
  • ConnectorTestSuiteOptions (52-63)
🪛 markdownlint-cli2 (0.18.1)
airbyte_cdk/test/models/connector_metadata/README.md

8-8: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (7)
bin/generate_component_manifest_files.py (3)

30-32: Good refactoring for reusability!

Parameterizing the directory path makes this function reusable for both local and downloaded schemas.


34-38: LGTM!

Accepting yaml_files as a parameter improves testability and makes the function more flexible for different use cases.


160-179: LGTM!

The post-processing logic correctly applies the pydantic.v1 compatibility layer to generated metadata models, consistent with the approach used for declarative component models.

airbyte_cdk/test/models/__init__.py (2)

4-7: LGTM!

The imports correctly expose the new metadata models from the connector_metadata package.


11-16: LGTM!

The __all__ list is correctly updated to include the new metadata models, maintaining alphabetical ordering.

airbyte_cdk/test/models/connector_metadata/README.md (1)

1-92: Excellent documentation!

The README is comprehensive and well-structured, with clear usage examples covering validation, field access, and model imports. The examples accurately reflect the package structure.

airbyte_cdk/test/models/connector_metadata/__init__.py (1)

1-7: LGTM!

The package initialization correctly re-exports the primary metadata models with proper __all__ definition, making them accessible via the connector_metadata namespace.

devin-ai-integration bot and others added 2 commits October 21, 2025 01:37
…tput

- Modified build script to generate all models into a single models.py file
- Added consolidated JSON schema generation (metadata_schema.json)
- Updated imports to reference generated.models module
- Removed 30+ individual model files in favor of single-file approach
- Updated README to document new structure and outputs
- Added GitHub token support to avoid rate limiting

Co-Authored-By: AJ Steers <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant