-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🎉 New Destination: Datastax Astra #34058
Conversation
Before Merging a Connector Pull RequestWow! What a great pull request you have here! 🎉 To merge this PR, ensure the following has been done/considered for each connector added or updated:
If the checklist is complete, but the CI check is failing,
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, this PR looks great already! I left some higher level comments but there is nothing big in there. Let me know once I can take another look
logging.getLogger("airbyte"), invalid_config) | ||
assert outcome.status == Status.FAILED | ||
|
||
def test_write(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for this test to run successfully in CI, there needs to be an Airbyte controlled astra account. It seems like there is no free tier to easily set that up, so we need to check how to make that work in a robust fashion.
If that's difficult to do I'm also happy to skip this test in CI so it can be used to test locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Astra does have a free tier. Confirmed with Datastax that it gives access to vector search and the JSON API so this should be doable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @flash1293 - let me know if you need anymore help with this. I think we resolved this by turning off auto-hibernation for your org.
@@ -0,0 +1,64 @@ | |||
# Astra Destination |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is shown in the Airbyte UI when users set up the connector, so it's not directed at a technical audience developing / interacting with the python code directly.
Check out
airbyte/docs/integrations/destinations/chroma.md
Lines 2 to 3 in 6dff143
This page guides you through the process of setting up the [Chroma](https://docs.trychroma.com/?lang=py) destination connector. | |
It should guide the user through creating the db in the astra UI and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
from setuptools import find_packages, setup | ||
|
||
MAIN_REQUIREMENTS = [ | ||
"airbyte-cdk==0.55.1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using airbyte-cdk[vector-db-based]==0.57.0
here makes sure all transitive dependencies like langchain and openai are installed properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
self.client.create_collection(self.config.collection) | ||
|
||
def pre_sync(self, catalog: ConfiguredAirbyteCatalog): | ||
self._create_collection() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the pre_sync hook here is creating the collection automatically.
However, in the normal user flow it works like this:
- User fills out form in Airbyte UI
- Airbyte platform runs
check
- if it succeeds, the config is saved - Sync can be started (calling
pre_sync
andindex
)
So in this case if the collection isn't created, a user won't be able to create the destination in the UI. It looks like we should create the collection in the check method as well as the pre_sync to make sure it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
self.embedding_dim = embedding_dim | ||
self.similarity_function = similarity_function | ||
|
||
self.request_base_url = f"https://{self.astra_id}-{self.astra_region}.apps.astra.datastax.com/api/json/v1/{self.keyspace_name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen in the UI that there is a handy "API Endpoint" field that can be copied with a single click. Rather than requiring the Airbyte user to copy paste id, region and so on manually, what do you think about just having a single "endpoint" field in the config and using that?
The pattern
and pattern_descriptor
fields in the config can be used to properly document this for the user (and to enforce a proper format right in the UI). See this example in Google drive for folder URLs which is pretty similar:
airbyte/airbyte-integrations/connectors/source-google-drive/source_google_drive/spec.py
Line 54 in 5324da8
folder_url: str = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Played around with it and it works pretty well, great job!
Had some smaller comments left.
Also, I tried to run the unit tests and they are failing for me, could you take a look at that?
..., | ||
title="AstraDB Endpoint", | ||
description="AstraDB Endpoint", | ||
pattern="^https:\/\/([a-z]|[0-9]){8}-([a-z]|[0-9]){4}-([a-z]|[0-9]){4}-([a-z]|[0-9]){4}-([a-z]|[0-9]){12}-(af|il|ap|ca|eu|me|sa|us|cn|us-gov|us-iso|us-isob)-(central|north|(north(?:east|west))|south|south(?:east|west)|east|west)([0-9]{1})\\.apps\\.astra\\.datastax\\.com", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work for my test-db: https://551e73d7-3904-4cc8-844e-20b5ff3404f9-us-east-2.apps.astra.datastax.com
Maybe it should be relaxed a bit? If it's wrong, the check will fail anyways, so this is mostly to catch early if the user is copy/pasting something completely wrong in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
title="AstraDB Endpoint", | ||
description="AstraDB Endpoint", | ||
pattern="^https:\/\/([a-z]|[0-9]){8}-([a-z]|[0-9]){4}-([a-z]|[0-9]){4}-([a-z]|[0-9]){4}-([a-z]|[0-9]){12}-(af|il|ap|ca|eu|me|sa|us|cn|us-gov|us-iso|us-isob)-(central|north|(north(?:east|west))|south|south(?:east|west)|east|west)([0-9]{1})\\.apps\\.astra\\.datastax\\.com", | ||
examples=["us-east1"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example should be the full value that needs to be entered by the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
def check(self, logger: AirbyteLogger, config: Mapping[str, Any]) -> AirbyteConnectionStatus: | ||
parsed_config = ConfigModel.parse_obj(config) | ||
self._init_indexer(parsed_config) | ||
self.indexer._create_collection() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in the indexers check
method instead. Exceptions thrown from here won't resolve in a nice error message because they are not handled (the indexer check
method already has a try-catch clause to print a nice error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved
- Click Create Database. | ||
-- You are redirected to your new database’s Overview screen. Your database starts in Pending status before transitioning to Initializing. You’ll receive a notification once your database is initialized. | ||
|
||
#### Setting up a Vector Collection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this section with the new logic of the connector auto-creating the collection? If there's a good reason to still do it manually in some cases we can leave it in with a note that the connector can also take care of it, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
|
||
## Supported Sync Modes | ||
|
||
Full Refresh Sync |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The connector supports all sync modes (the vector DB CDK takes care of this if you implement index
and delete
on the indexer properly which you did. You can use the same table as here:
| Full Refresh Sync | Yes | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should be in the integration_tests folder (don't forget to adjust the reference from acceptance-test-config.yml
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved
Thanks for this feedback @flash1293 -- we are close to resolving all of these issues and updating the branch with the correction. I'll check back in on Monday - there's still a final issue we are working on resolving. |
Moved create collection back to destination.py because it was causing issues with integration and unit tests. |
@flash1293 - we are done with your most recent comments as well. Please let me know what the next steps are! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @hc33brackles and @anomnaco , apologies, I was off yesterday and couldn't give it another look.
This is starting to look very good! I updated a few small things:
- Apply formatting
- Add integration test secret to our CI
- Update spec for integration test
- Prepare the icon in our platform repo so it will show properly
CI is currently running here: #34442
Let's see whether it works. I need to do this on a separate PR for permission reasons. Could you give me write access to the destination-astra
on the Anant
fork so I can push the same things over there (or do it yourself)? This will make it easier to wrap this work up.
Left two smaller comments in the PR
|
||
def check(self, logger: AirbyteLogger, config: Mapping[str, Any]) -> AirbyteConnectionStatus: | ||
parsed_config = ConfigModel.parse_obj(config) | ||
self._init_indexer(parsed_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On https://github.com/airbytehq/airbyte/pull/34058/files#r1459806142 you wrote you moved the create_collection
call, but it seems like some merge foo brought it back (it's also not in the indexer). Could you double check please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its on the next line here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anomnaco yes, it shouldn’t be here in the destination class, it should be in the indexer class check method so it’s properly wrapped in a try/except block and a meaningful error message is produced on check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That caused issues with the unit and integration tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @flash1293 - I've been chatting with Obi about this offline. He was testing with this in the indexer class check method and got failures with the unit tests with it there. Do you need us to look deeper into getting this to work in the way that you suggested or is it acceptable here? We can obviously dig deeper and get this to work in the other location if we need to. Would adding more meaningful error messaging with the function where it is now work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are working on resolving this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The important point to me is that the user gets an actionable error message on check as part of a well-formed status message and the connector doesn't crash as this won't produce a nice output in the UI.
astra_db_endpoint: str = Field( | ||
..., | ||
title="AstraDB Endpoint", | ||
airbyte_secret=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we keep the endpoint non-secret? This is causing an issue in the UI when editing an existing configuration:
I think nobody ran into this problem yet. If something is marked as secret, two things happen:
- it's not shown in the UI in cleartext
- It is stored in a separate secret manager, not in the regular Airbyte DB with the rest of the configuration
I might miss something, but it seems like the endpoint isn't sensitive in this sense and can be stored with the rest of the configuration (the app token definitely is and should be declared a secret)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@anomnaco @hc33brackles About the problem that's still left - maybe the create_collection call needs to be mocked out in the unit test (plus validation it was called as expected)? https://github.com/airbytehq/airbyte/pull/34058/files#diff-ad0e480b627023296a1ae9ff8e684e73cf34f8337430d07d549b65ae583158acR157 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to fix the unit tests as the create_collection call needs to be mocked out now, but tested again and everything seems to work well, LGTM.
We have a CI issue right now, as soon as that's resolved I'm going to merge
Co-authored-by: Joe Reuter <[email protected]>
Co-authored-by: Joe Reuter <[email protected]>
Co-authored-by: Joe Reuter <[email protected]>
New destination for vector database Datastax Astra via the new JSON API. Requires python cdk to include embedder and other vector db utilities included in the cdk.