
Add adapter wrapper for MLflow/CrateDB, based on monkeypatching #5

Merged
merged 18 commits into main on Sep 12, 2023

Conversation

amotl (Member) commented on Sep 7, 2023

About

We already evaluated how to make MLflow work with CrateDB on a fork of MLflow, see crate-workbench/mlflow@master...cratedb. This patch brings in the same amalgamations, but uses monkeypatching instead.

The idea behind this repository and its patching style is to create a shippable package, because we can reasonably anticipate that the corresponding changes will never make it into upstream MLflow.

Software Tests

Tests for conducting basic database conversations are already included with this patch. More thorough tests from MLflow will be added in subsequent patches, see GH-6 and GH-10.

/cc @andnig, @ckurze

@amotl amotl marked this pull request as ready for review September 7, 2023 16:17
seut (Member) left a comment:

Nice! Added a few suggestions.

README.md Outdated
docker run --rm -it --publish=4200:4200 --publish=5432:5432 \
--env=CRATE_HEAP_SIZE=4g crate \
-Cdiscovery.type=single-node \
-Ccluster.routing.allocation.disk.threshold_enabled=false
seut (Member):

I don't think it's a good idea to advertise disabling this threshold in examples. If disabled, writing data into CrateDB can lead to a machine crash due to out-of-disk-space issues.

amotl (Member Author):

Yeah, true. I will remove it. It slipped in because I run it this way locally, since my disk is apparently always full, or pretends to be.

amotl (Member Author):

Fixed with 9b02b1a.


def patch_sqlalchemy_inspector(engine: sa.Engine):
    """
    When using `get_table_names()`, make sure the correct schema name gets used.
seut (Member) commented on Sep 8, 2023:

Isn't adjusting the search_path a better way to set a default schema to use?

amotl (Member Author) replied on Sep 8, 2023:

Thanks for spotting this. I observed some flaws here, but I will revisit it to find out why the search_path may not be honored at this spot. I think the other parts of the application are already using that technique well.
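The `search_path` technique under discussion can be sketched with a standard SQLAlchemy connection event. Assumptions in this sketch: a PostgreSQL-wire-compatible server (CrateDB speaks that protocol) and a made-up schema name `mlflow`; a SQLite URL stands in so the snippet is self-contained:

```python
import sqlalchemy as sa

# Stand-in URL so the sketch is self-contained; in practice this would be
# the CrateDB connection URL.
engine = sa.create_engine("sqlite://")


@sa.event.listens_for(engine, "connect")
def set_search_path(dbapi_connection, connection_record):
    # Runs once for every new DBAPI connection entering the pool. The SET
    # statement is PostgreSQL/CrateDB syntax and would fail on SQLite; it is
    # shown here only to illustrate the technique.
    cursor = dbapi_connection.cursor()
    cursor.execute('SET search_path TO "mlflow"')  # hypothetical schema name
    cursor.close()
```

With the default schema resolved on the connection, callers no longer need to pass an explicit schema to reflection APIs.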

amotl (Member Author):

Indeed, patching that function was unnecessary; it was removed again with 186c9e4.

amotl (Member Author):

Hi again. We needed to bring back this patch with e8acfc2. #12 (commits) has more details.

@@ -0,0 +1,137 @@
CREATE TABLE IF NOT EXISTS {schema_prefix}"datasets" (
amotl (Member Author):

Good idea with the search_path; I think the code in this repository already uses that technique partly, through SQLAlchemy.

Following from that, if we use that technique here again, we would not need to interpolate the schema name (here: {schema_prefix}) into the SQL DDL statements at all, so the code could be simplified?

seut (Member):

> Concluding that, I may also think, that when using that here again, we would not need to populate the schema name (here: {schema_prefix}) into the SQL DDL statements at all, so the code could be simplified?

Yes right.
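The simplification agreed on here can be illustrated briefly: with `search_path` set per connection, DDL no longer needs the `{schema_prefix}` interpolation at all. The statement texts below are illustrative stand-ins, not the repository's actual DDL, and the schema name `mlflow` is made up:

```python
# Illustrative only: contrasts prefix interpolation with search_path-based
# schema resolution. The schema name "mlflow" is a made-up example.
SCHEMA = "mlflow"

# Before: every DDL statement needs the schema prefix interpolated.
ddl_template = 'CREATE TABLE IF NOT EXISTS {schema_prefix}"datasets" (...)'
with_prefix = ddl_template.format(schema_prefix=f'"{SCHEMA}".')

# After: with search_path set on the connection, the statement is static.
without_prefix = 'CREATE TABLE IF NOT EXISTS "datasets" (...)'

print(with_prefix)   # CREATE TABLE IF NOT EXISTS "mlflow"."datasets" (...)
print(without_prefix)
```

Dropping the interpolation keeps the DDL files static and moves schema selection entirely into connection setup.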

amotl (Member Author):

Improved with 3488790.

@amotl amotl merged commit 8ff918b into main Sep 12, 2023
1 check passed