Refactor Rag notebook #504

Merged
merged 23 commits into from
Apr 2, 2024

Conversation

richardsliu
Collaborator

@richardsliu richardsliu commented Mar 29, 2024

Refactored the RAG notebook to be more modular and documented.

  • Moved the Cloud SQL code into the notebook instead of running it on a Ray cluster
  • Executed remote Ray tasks instead of submitting a job and busy-waiting for it to finish
  • Modified the Cloud SQL code to use bulk insert
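
As a rough illustration of the bulk-insert change, here is a minimal sketch. It is not the notebook's code: `sqlite3` from the Python standard library stands in for Cloud SQL, and the `documents` table, its columns, and the `insert_rows_bulk` helper are hypothetical names chosen for the example.

```python
import sqlite3


def insert_rows_bulk(conn, rows):
    """Insert all (id, content) rows with one executemany call in one transaction."""
    # Using the connection as a context manager commits on success
    # and rolls back if an exception is raised.
    with conn:
        conn.executemany(
            "INSERT INTO documents (id, content) VALUES (?, ?)",
            rows,
        )


# In-memory database as a stand-in for the real Cloud SQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT)")

# Hypothetical data: 1000 small text chunks.
rows = [(i, f"chunk {i}") for i in range(1000)]
insert_rows_bulk(conn, rows)

count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(count)
```

The point of the pattern is that the rows go to the database in one batched call and one transaction, rather than one statement and one commit per row.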

Fixes #425

Preview: https://github.com/GoogleCloudPlatform/ai-on-gke/blob/rag-notebook/applications/rag/example_notebooks/rag-kaggle-ray-sql-refactored.ipynb

"outputs": [],
"source": [
"!pip install ray[default]==2.9.3 kaggle==1.6.6\n",
"!pip install langchain==0.1.10 ray==2.9.3 datasets sentence-transformers\n",
Collaborator

I think there were some issues in the past with some of these packages taking way too long to install, specifically sentence-transformers. Are we going to bake the dependencies into the Jupyter image?

Collaborator

+1

I think we don't need to bake all these. langchain and sentence-transformers aren't used except by the Ray job.

ray and kaggle are pretty quick. I'm unsure about datasets and the Cloud SQL ones; I imagine they're not too bad, but please verify. If it takes more than 30s, I vote we make a custom Jupyter image.

Collaborator Author

I've tested and it turns out the notebook does need to install langchain and sentence-transformers. I can see if we could use a custom image here to skip these pip installs.

Collaborator Author

Changed to a custom image with dependencies baked in. Removed this section from the notebook.

"ray.init(\n",
" address=\"ray://ray-cluster-kuberay-head-svc:10001\",\n",
" runtime_env={\n",
" \"pip\": [ \n",
Collaborator

I think you can delete all of these. The Ray image comes with all of them preinstalled. We need to bump langchain, though; I think @chiayi is tracking that.

Collaborator Author

I think Ray will skip pip installing these if the image is already on the node. When I tested this, this step usually finishes in a few seconds.
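
For readers following this thread, the pattern under discussion (a Ray client connection with a `runtime_env` pip list, followed by remote tasks whose results are collected with `ray.get` rather than by polling a submitted job) can be sketched roughly as below. This is an illustration under assumptions, not the notebook's code. The address is the one shown in the diff excerpt, the package list is illustrative because the real list is truncated in the excerpt, and `build_runtime_env`, `run_remote_tasks`, and `count_words` are hypothetical names.

```python
def build_runtime_env(pip_packages):
    """Return a Ray runtime_env dict asking workers to pip-install the given packages."""
    return {"pip": list(pip_packages)}


def run_remote_tasks(chunks, address, pip_packages):
    """Connect to a Ray cluster and process each chunk as a remote task."""
    # Imported lazily so the helper above can be used where Ray is not installed.
    import ray

    ray.init(address=address, runtime_env=build_runtime_env(pip_packages))

    @ray.remote
    def count_words(text):
        # Placeholder work; the real notebook would do its own processing here.
        return len(text.split())

    # .remote() returns object refs immediately; ray.get blocks until all
    # results are available, so no manual status polling is needed.
    refs = [count_words.remote(chunk) for chunk in chunks]
    return ray.get(refs)


# Illustrative values only (assumptions, not the notebook's actual settings).
ADDRESS = "ray://ray-cluster-kuberay-head-svc:10001"
PACKAGES = ["langchain==0.1.10", "sentence-transformers"]
runtime_env = build_runtime_env(PACKAGES)
```

Calling `run_remote_tasks(["a b c", "d e"], ADDRESS, PACKAGES)` requires a reachable Ray cluster at that address, so it is left out of the snippet.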

Collaborator

@andrewsykim andrewsykim left a comment

Overall LGTM, this is a great improvement!

I think the following issues need to be addressed if we're going to cherry-pick this into release-1.1:

@richardsliu
Collaborator Author

  • Dependency install time and whether a custom image is warranted.

The notebook needs to install sentence-transformers and langchain. I'll see if we can use a custom image to reduce the install time.

This is now fixed.

  • If we choose to keep both versions of the notebook, please add the markdown descriptions in the other notebook as well

I prefer to keep this PR limited to changes relevant for the new notebook. Adding markdown to the older notebook wouldn't be too interesting since all the code is executed in one cell.

@andrewsykim
Collaborator

I prefer to keep this PR limited to changes relevant for the new notebook. Adding markdown to the older notebook wouldn't be too interesting since all the code is executed in one cell.

Should we simplify and remove the old notebook or keep both? With the changes in this PR, do both notebooks work?

@richardsliu
Collaborator Author

I prefer to keep this PR limited to changes relevant for the new notebook. Adding markdown to the older notebook wouldn't be too interesting since all the code is executed in one cell.

Should we simplify and remove the old notebook or keep both? With the changes in this PR, do both notebooks work?

I fixed some minor issues in the old notebook and added a header. We should probably keep it as a backup.

@richardsliu
Collaborator Author

/gcbrun

Collaborator

@imreddy13 imreddy13 left a comment

Thank you!

@imreddy13
Collaborator

/gcbrun

@richardsliu richardsliu merged commit 75331ab into main Apr 2, 2024
7 of 8 checks passed
ryanaoleary pushed a commit that referenced this pull request Apr 3, 2024
* move mysql stuff to jupyter

* new notebook

* fix notebook

* fix notebook, add markdown

* use bulk insert

* revert

* change persist data

* terraform fmt

* remove sql params from notebook

* default empty values

* rename

* parameterize notebook image

* remove pip installs from notebook

* use custom notebook image

* terraform fmt

* replace jupyter notebook tag

* add notebook version to jupyterhub app

* merge cells

* add dummy value for secret volume

* fix old notebook
kfswain pushed a commit that referenced this pull request Apr 15, 2024
brandonroyal pushed a commit to brandonroyal/ai-on-gke that referenced this pull request Jul 15, 2024
Development

Successfully merging this pull request may close these issues.

Use Ray interactive client in the example notebooks
3 participants