Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 New Destination: Datastax Astra #34058

Merged
merged 22 commits into from
Jan 25, 2024
Merged

Conversation

anomnaco
Copy link
Contributor

@anomnaco anomnaco commented Jan 9, 2024

New destination for vector database Datastax Astra via the new JSON API. Requires python cdk to include embedder and other vector db utilities included in the cdk.

@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues area/documentation Improvements or additions to documentation community connectors/destination/astra labels Jan 9, 2024
Copy link
Contributor

github-actions bot commented Jan 9, 2024

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

Copy link

vercel bot commented Jan 9, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 25, 2024 8:59am

Copy link
Contributor

@flash1293 flash1293 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, this PR looks great already! I left some higher level comments but there is nothing big in there. Let me know once I can take another look

logging.getLogger("airbyte"), invalid_config)
assert outcome.status == Status.FAILED

def test_write(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this test to run successfully in CI, there needs to be an Airbyte controlled astra account. It seems like there is no free tier to easily set that up, so we need to check how to make that work in a robust fashion.

If that's difficult to do I'm also happy to skip this test in CI so it can be used to test locally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Astra does have a free tier. Confirmed with Datastax that it gives access to vector search and the JSON API so this should be doable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @flash1293 - let me know if you need anymore help with this. I think we resolved this by turning off auto-hibernation for your org.

@@ -0,0 +1,64 @@
# Astra Destination
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is shown in the Airbyte UI when users set up the connector, so it's not directed at a technical audience developing / interacting with the python code directly.

Check out

This page guides you through the process of setting up the [Chroma](https://docs.trychroma.com/?lang=py) destination connector.
for example for how these normally look.

It should guide the user through creating the db in the astra UI and so on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

from setuptools import find_packages, setup

MAIN_REQUIREMENTS = [
"airbyte-cdk==0.55.1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using airbyte-cdk[vector-db-based]==0.57.0 here makes sure all transitive dependencies like langchain and openai are installed properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

self.client.create_collection(self.config.collection)

def pre_sync(self, catalog: ConfiguredAirbyteCatalog):
self._create_collection()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the pre_sync hook here is creating the collection automatically.

However, in the normal user flow it works like this:

  • User fills out form in Airbyte UI
  • Airbyte platform runs check - if it succeeds, the config is saved
  • Sync can be started (calling pre_sync and index)

So in this case if the collection isn't created, a user won't be able to create the destination in the UI. It looks like we should create the collection in the check method as well as the pre_sync to make sure it works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

self.embedding_dim = embedding_dim
self.similarity_function = similarity_function

self.request_base_url = f"https://{self.astra_id}-{self.astra_region}.apps.astra.datastax.com/api/json/v1/{self.keyspace_name}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen in the UI that there is a handy "API Endpoint" field that can be copied with a single click. Rather than requiring the Airbyte user to copy paste id, region and so on manually, what do you think about just having a single "endpoint" field in the config and using that?

The pattern and pattern_descriptor fields in the config can be used to properly document this for the user (and to enforce a proper format right in the UI). See this example in Google drive for folder URLs which is pretty similar:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@flash1293 flash1293 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Played around with it and it works pretty well, great job!

Had some smaller comments left.

Also, I tried to run the unit tests and they are failing for me, could you take a look at that?

...,
title="AstraDB Endpoint",
description="AstraDB Endpoint",
pattern="^https:\/\/([a-z]|[0-9]){8}-([a-z]|[0-9]){4}-([a-z]|[0-9]){4}-([a-z]|[0-9]){4}-([a-z]|[0-9]){12}-(af|il|ap|ca|eu|me|sa|us|cn|us-gov|us-iso|us-isob)-(central|north|(north(?:east|west))|south|south(?:east|west)|east|west)([0-9]{1})\\.apps\\.astra\\.datastax\\.com",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work for my test-db: https://551e73d7-3904-4cc8-844e-20b5ff3404f9-us-east-2.apps.astra.datastax.com

Maybe it should be relaxed a bit? If it's wrong, the check will fail anyways, so this is mostly to catch early if the user is copy/pasting something completely wrong in there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

title="AstraDB Endpoint",
description="AstraDB Endpoint",
pattern="^https:\/\/([a-z]|[0-9]){8}-([a-z]|[0-9]){4}-([a-z]|[0-9]){4}-([a-z]|[0-9]){4}-([a-z]|[0-9]){12}-(af|il|ap|ca|eu|me|sa|us|cn|us-gov|us-iso|us-isob)-(central|north|(north(?:east|west))|south|south(?:east|west)|east|west)([0-9]{1})\\.apps\\.astra\\.datastax\\.com",
examples=["us-east1"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example should be the full value that needs to be entered by the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

def check(self, logger: AirbyteLogger, config: Mapping[str, Any]) -> AirbyteConnectionStatus:
parsed_config = ConfigModel.parse_obj(config)
self._init_indexer(parsed_config)
self.indexer._create_collection()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in the indexers check method instead. Exceptions thrown from here won't resolve in a nice error message because they are not handled (the indexer check method already has a try-catch clause to print a nice error message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved

- Click Create Database.
-- You are redirected to your new database’s Overview screen. Your database starts in Pending status before transitioning to Initializing. You’ll receive a notification once your database is initialized.

#### Setting up a Vector Collection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this section with the new logic of the connector auto-creating the collection? If there's a good reason to still do it manually in some cases we can leave it in with a note that the connector can also take care of it, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed


## Supported Sync Modes

Full Refresh Sync
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connector supports all sync modes (the vector DB CDK takes care of this if you implement index and delete on the indexer properly which you did. You can use the same table as here:

| Full Refresh Sync | Yes | |

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should be in the integration_tests folder (don't forget to adjust the reference from acceptance-test-config.yml)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved

@hc33brackles
Copy link

Thanks for this feedback @flash1293 -- we are close to resolving all of these issues and updating the branch with the correction. I'll check back in on Monday - there's still a final issue we are working on resolving.

@anomnaco
Copy link
Contributor Author

Moved create collection back to destination.py because it was causing issues with integration and unit tests.

@hc33brackles
Copy link

@flash1293 - we are done with your most recent comments as well. Please let me know what the next steps are!

Copy link
Contributor

@flash1293 flash1293 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hc33brackles and @anomnaco , apologies, I was off yesterday and couldn't give it another look.

This is starting to look very good! I updated a few small things:

  • Apply formatting
  • Add integration test secret to our CI
  • Update spec for integration test
  • Prepare the icon in our platform repo so it will show properly

CI is currently running here: #34442

Let's see whether it works. I need to do this on a separate PR for permission reasons. Could you give me write access to the destination-astra on the Anant fork so I can push the same things over there (or do it yourself)? This will make it easier to wrap this work up.

Left two smaller comments in the PR


def check(self, logger: AirbyteLogger, config: Mapping[str, Any]) -> AirbyteConnectionStatus:
parsed_config = ConfigModel.parse_obj(config)
self._init_indexer(parsed_config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On https://github.com/airbytehq/airbyte/pull/34058/files#r1459806142 you wrote you moved the create_collection call, but it seems like some merge foo brought it back (it's also not in the indexer). Could you double check please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its on the next line here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anomnaco yes, it shouldn’t be here in the destination class, it should be in the indexer class check method so it’s properly wrapped in a try/except block and a meaningful error message is produced on check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That caused issues with the unit and integration tests

Copy link

@hc33brackles hc33brackles Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @flash1293 - I've been chatting with Obi about this offline. He was testing with this in the indexer class check method and got failures with the unit tests with it there. Do you need us to look deeper into getting this to work in the way that you suggested or is it acceptable here? We can obviously dig deeper and get this to work in the other location if we need to. Would adding more meaningful error messaging with the function where it is now work?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are working on resolving this now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The important point to me is that the user gets an actionable error message on check as part of a well-formed status message and the connector doesn't crash as this won't produce a nice output in the UI.

astra_db_endpoint: str = Field(
...,
title="AstraDB Endpoint",
airbyte_secret=True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we keep the endpoint non-secret? This is causing an issue in the UI when editing an existing configuration:
Screenshot 2024-01-23 at 18 07 47

I think nobody ran into this problem yet. If something is marked as secret, two things happen:

  • it's not shown in the UI in cleartext
  • It is stored in a separate secret manager, not in the regular Airbyte DB with the rest of the configuration

I might miss something, but it seems like the endpoint isn't sensitive in this sense and can be stored with the rest of the configuration (the app token definitely is and should be declared a secret)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@flash1293
Copy link
Contributor

@anomnaco @hc33brackles About the problem that's still left - maybe the create_collection call needs to be mocked out in the unit test (plus validation it was called as expected)? https://github.com/airbytehq/airbyte/pull/34058/files#diff-ad0e480b627023296a1ae9ff8e684e73cf34f8337430d07d549b65ae583158acR157

Copy link
Contributor

@flash1293 flash1293 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to fix the unit tests as the create_collection call needs to be mocked out now, but tested again and everything seems to work well, LGTM.

We have a CI issue right now, as soon as that's resolved I'm going to merge

@flash1293 flash1293 merged commit ff12ddf into airbytehq:master Jan 25, 2024
35 of 40 checks passed
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 21, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation community connectors/destination/astra
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

5 participants