From 3a37a3b04861c6b47fb153b43876754ce630a4e9 Mon Sep 17 00:00:00 2001
From: augray
Date: Tue, 1 Oct 2024 13:40:24 -0700
Subject: [PATCH 1/5] Add initial notebook

---
 docs/docs/examples/cookbooks/airtrain.ipynb | 495 ++++++++++++++++++++
 1 file changed, 495 insertions(+)
 create mode 100644 docs/docs/examples/cookbooks/airtrain.ipynb

diff --git a/docs/docs/examples/cookbooks/airtrain.ipynb b/docs/docs/examples/cookbooks/airtrain.ipynb
new file mode 100644
index 0000000000000..96e04724a00a2
--- /dev/null
+++ b/docs/docs/examples/cookbooks/airtrain.ipynb
@@ -0,0 +1,495 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# AirtrainAI Cookbook\n",
+    "\n",
+    "[Airtrain](https://www.airtrain.ai/) is a tool supporting unstructured/low-structured text datasets. It allows automated clustering, document classification, and more.\n",
+    "\n",
+    "This cookbook showcases how to ingest and transform/enrich data with Llama Index and then upload the data to Airtrain for further processing and exploration."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Installation & Setup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install some libraries we'll use for our examples. These\n",
+    "# are not required to use Airtrain with Llama Index, and are just\n",
+    "# there to help us illustrate use.\n",
+    "%pip install llama-index-embeddings-openai==0.2.4\n",
+    "%pip install llama-index-readers-web==0.2.2\n",
+    "%pip install llama-index-readers-github==0.2.0\n",
+    "\n",
+    "# Install Airtrain SDK with Llama Index integration\n",
+    "%pip install airtrain-py[llama-index]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Running async code in a notebook requires using nest_asyncio, and we will\n",
+    "# use some async examples. So we will set up nest_asyncio here. Outside\n",
+    "# an async context or outside a notebook, this step is not required.\n",
+    "import nest_asyncio\n",
+    "nest_asyncio.apply()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### API Key Setup\n",
+    "\n",
+    "Set up the API keys that will be required to run the examples that follow.\n",
+    "The GitHub API token and OpenAI API key are only required for the example\n",
+    "'Usage with Readers/Embeddings/Splitters'. Instructions for getting a GitHub\n",
+    "access token are\n",
+    "[here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens),\n",
+    "while an OpenAI API key can be obtained\n",
+    "[here](https://platform.openai.com/api-keys).\n",
+    "\n",
+    "To obtain your Airtrain API Key:\n",
+    "- Create an Airtrain account by visiting [here](https://app.airtrain.ai/api/auth/login)\n",
+    "- View \"Settings\" in the lower left, then go to \"Billing\" to sign up for a pro account or start a trial\n",
+    "- Copy your API key from the \"Airtrain API Key\" tab in \"Billing\"\n",
+    "\n",
+    "Note that the Airtrain trial only allows ONE dataset at a time. As this notebook creates many, you may need\n",
+    "to delete datasets in the Airtrain UI as you go along to make space for new ones."
+ ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "os.environ[\"GITHUB_TOKEN\"] = \"\"\n", + "os.environ[\"OPENAI_API_KEY\"] = \"\"\n", + "\n", + "os.environ[\"AIRTRAIN_API_KEY\"] = \"\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example 1: Usage with Readers/Embeddings/Splitters\n", + "\n", + "Some of the core abstractions in Llama Index are [Documents and Nodes](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/).\n", + "Airtrain's Llama Index integration allows you to create an Airtrain dataset using any iterable collection of either of these, via the\n", + "`upload_from_llama_nodes` function.\n", + "\n", + "To illustrate the flexibility of this, we'll do both:\n", + "1. Create a dataset directly of documents. In this case whole pages from the [Sematic](https://docs.sematic.dev/) docs.\n", + "2. Use OpenAI embeddings and the `SemanticSplitterNodeParser` to split those documents into nodes, and create a dataset from those." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "import airtrain as at\n", + "from llama_index.core.node_parser import SemanticSplitterNodeParser\n", + "from llama_index.embeddings.openai import OpenAIEmbedding\n", + "from llama_index.readers.github import GithubRepositoryReader, GithubClient\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The next step is to set up our reader. In this case we're using the GitHub reader, but that's just for illustrative purposes. Airtrain can ingest documents no matter what reader they came from originally." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "github_token = os.environ.get(\"GITHUB_TOKEN\")\n", + "github_client = GithubClient(github_token=github_token, verbose=True)\n", + "reader = GithubRepositoryReader(\n", + " github_client=github_client,\n", + " owner=\"sematic-ai\",\n", + " repo=\"sematic\",\n", + " use_parser=False,\n", + " verbose=False,\n", + " filter_directories=(\n", + " [\"docs\"],\n", + " GithubRepositoryReader.FilterType.INCLUDE,\n", + " ),\n", + " filter_file_extensions=(\n", + " [\n", + " \".md\",\n", + " ],\n", + " GithubRepositoryReader.FilterType.INCLUDE,\n", + " ),\n", + ")\n", + "read_kwargs = dict(branch=\"main\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Read the documents with the reader" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "documents = reader.load_data(**read_kwargs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create dataset directly from documents\n", + "\n", + "You can create an Airtrain dataset directly from these documents without doing any further\n", + "processing. In this case, Airtrain will automatically embed the documents for you before\n", + "generating further insights. Each row in the dataset will represent an entire markdown\n", + "document. 
Airtrain will automatically provide insights like semantic clustering of your\n", + "documents, allowing you to browse through the documents by looking at ones that cover similar\n", + "topics or uncovering subsets of documents that you might want to remove.\n", + "\n", + "Though additional processing beyond basic document retrieval is not *required*, it is\n", + "*allowed*. You can enrich the documents with metadata, filter them, or manipulate them\n", + "in any way you like before uploading to Airtrain." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Uploaded 42 rows to 'Sematic Docs Dataset: Whole Documents'. View at: https://app.airtrain.ai/dataset/7fd09dca-81b9-42b8-acc9-01ce08302b16\n" + ] + } + ], + "source": [ + "result = at.upload_from_llama_nodes(\n", + " documents,\n", + " name=\"Sematic Docs Dataset: Whole Documents\",\n", + ")\n", + "print(f\"Uploaded {result.size} rows to '{result.name}'. View at: {result.url}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create dataset after splitting and embedding\n", + "\n", + "If you wish to view a dataset oriented towards nodes within documents rather than whole documents, you can do that as well.\n", + "Airtrain will automatically create insights like a 2d PCA projection of your embedding vectors, so you can visually explore\n", + "the embedding space from which your RAG nodes will be retrieved. You can also click on individual rows and look at the ones\n", + "that are nearest to it in the full n-dimensional embedding space, to drill down further. Automated clusters and other insights\n", + "will also be generated to enrich and aid your exploration.\n", + "\n", + "Here we'll use OpenAI embeddings and a `SemanticSplitterNodeParser` splitter, but you can use any other Llama Index tooling you\n", + "like to process your nodes before uploading to Airtrain. You can even skip embedding them yourself entirely, in which case\n", + "Airtrain will embed the nodes for you." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "embed_model = OpenAIEmbedding()\n", + "splitter = SemanticSplitterNodeParser(\n", + " buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model\n", + ")\n", + "nodes = splitter.get_nodes_from_documents(documents)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "⚠️ **Note** ⚠️: If you are on an Airtrain trial and already created a whole-document dataset, you will need to delete it before uploading a new dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Uploaded 137 rows to Sematic Docs, split + embedded. View at: https://app.airtrain.ai/dataset/ebec9bcc-6ed8-4165-a0de-29bef740c70b\n" + ] + } + ], + "source": [ + "result = at.upload_from_llama_nodes(\n", + " nodes,\n", + " name=\"Sematic Docs, split + embedded\",\n", + ")\n", + "print(f\"Uploaded {result.size} rows to {result.name}. 
View at: {result.url}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example 2: Using the [Workflow](https://docs.llamaindex.ai/en/stable/module_guides/workflow/#workflows) API\n", + "\n", + "Since documents and nodes are the core abstractions the Airtrain integration works with, and these abstractions are\n", + "shared in Llama Index's workflows API, you can also use Airtrain as part of a broader workflow. Here we will illustrate\n", + "usage by scraping a few [Hacker News](https://news.ycombinator.com/) comment threads, but again you are not restricted\n", + "to web scraping workflows; any workflow producing documents or nodes will do." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "\n", + "from llama_index.core.schema import Node\n", + "from llama_index.core.workflow import (\n", + " Context,\n", + " Event,\n", + " StartEvent,\n", + " StopEvent,\n", + " Workflow,\n", + " step,\n", + ")\n", + "from llama_index.readers.web import AsyncWebPageReader\n", + "\n", + "from airtrain import DatasetMetadata, upload_from_llama_nodes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Specify the comment threads we'll be scraping from. The particular ones in this example were on or near the front page on September 30th, 2024. If\n", + "you wish to ingest from pages besides Hacker News, be aware that some sites have their content rendered client-side, in which case you might\n", + "want to use a reader like the `WholeSiteReader`, which uses a headless Chrome driver to render the page before returning the documents. For here\n", + "we'll use a page with server-side rendered HTML for simplicity." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "URLS = [\n", + " \"https://news.ycombinator.com/item?id=41694044\",\n", + " \"https://news.ycombinator.com/item?id=41696046\",\n", + " \"https://news.ycombinator.com/item?id=41693087\",\n", + " \"https://news.ycombinator.com/item?id=41695756\",\n", + " \"https://news.ycombinator.com/item?id=41666269\",\n", + " \"https://news.ycombinator.com/item?id=41697137\",\n", + " \"https://news.ycombinator.com/item?id=41695840\",\n", + " \"https://news.ycombinator.com/item?id=41694712\",\n", + " \"https://news.ycombinator.com/item?id=41690302\",\n", + " \"https://news.ycombinator.com/item?id=41695076\",\n", + " \"https://news.ycombinator.com/item?id=41669747\",\n", + " \"https://news.ycombinator.com/item?id=41694504\",\n", + " \"https://news.ycombinator.com/item?id=41697032\",\n", + " \"https://news.ycombinator.com/item?id=41694025\",\n", + " \"https://news.ycombinator.com/item?id=41652935\",\n", + " \"https://news.ycombinator.com/item?id=41693979\",\n", + " \"https://news.ycombinator.com/item?id=41696236\",\n", + " \"https://news.ycombinator.com/item?id=41696434\",\n", + " \"https://news.ycombinator.com/item?id=41688469\",\n", + " \"https://news.ycombinator.com/item?id=41646782\",\n", + " \"https://news.ycombinator.com/item?id=41689332\",\n", + " \"https://news.ycombinator.com/item?id=41688018\",\n", + " \"https://news.ycombinator.com/item?id=41668896\",\n", + " \"https://news.ycombinator.com/item?id=41690087\",\n", + " \"https://news.ycombinator.com/item?id=41679497\",\n", + " \"https://news.ycombinator.com/item?id=41687739\",\n", + " \"https://news.ycombinator.com/item?id=41686722\",\n", + " \"https://news.ycombinator.com/item?id=41689138\",\n", + " 
\"https://news.ycombinator.com/item?id=41691530\",\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we'll define some basic events, as this events are the standard way to pass data between steps in Llama Index workflows." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "class CompletedDocumentRetrievalEvent(Event):\n", + " name: str\n", + " documents: list[Node]\n", + "\n", + "class AirtrainDocumentDatasetEvent(Event):\n", + " metadata: DatasetMetadata" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After that we'll define the workflow itself. In our case, this will have one step to ingest the documents from the web, one to ingest them to Airtrain, and one to wrap up the workflow." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "class IngestToAirtrainWorkflow(Workflow):\n", + " @step\n", + " async def ingest_documents(\n", + " self, ctx: Context, ev: StartEvent\n", + " ) -> CompletedDocumentRetrievalEvent | None:\n", + " if not ev.get(\"urls\"):\n", + " return None\n", + " reader = AsyncWebPageReader(html_to_text=True)\n", + " documents = await reader.aload_data(urls=ev.get(\"urls\"))\n", + " return CompletedDocumentRetrievalEvent(name=ev.get(\"name\"), documents=documents)\n", + "\n", + " @step\n", + " async def ingest_documents_to_airtrain(\n", + " self, ctx: Context, ev: CompletedDocumentRetrievalEvent\n", + " ) -> AirtrainDocumentDatasetEvent | None:\n", + " if not isinstance(ev, CompletedDocumentRetrievalEvent):\n", + " return None\n", + "\n", + " dataset_meta = upload_from_llama_nodes(ev.documents, name=ev.name)\n", + " return AirtrainDocumentDatasetEvent(metadata=dataset_meta)\n", + "\n", + " @step\n", + " async def complete_workflow(\n", + " self, ctx: Context, ev: AirtrainDocumentDatasetEvent\n", + " ) -> None | StopEvent:\n", + " if not isinstance(ev, AirtrainDocumentDatasetEvent):\n", + " return None\n", + " return StopEvent(result=ev.metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since the workflow API treats async code as a first-class citizen, we'll define an async `main` to drive the workflow." + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "async def main() -> None:\n", + " workflow = IngestToAirtrainWorkflow()\n", + " result = await workflow.run(\n", + " name=\"My HN Discussions Dataset\", urls=URLS,\n", + " )\n", + " print(f\"Uploaded {result.size} rows to {result.name}. View at: {result.url}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we'll execute the async main using an asyncio event loop.\n", + "\n", + "⚠️ **Note** ⚠️: If you are on an Airtrain trial and already ran examples above,\n", + "you will need to delete the resulting dataset(s) before uploading a new one." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "error fetching page from https://news.ycombinator.com/item?id=41666269\n", + "error fetching page from https://news.ycombinator.com/item?id=41697137\n", + "error fetching page from https://news.ycombinator.com/item?id=41695076\n", + "error fetching page from https://news.ycombinator.com/item?id=41697032\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Uploaded 25 rows to My HN Discussions Dataset. View at: https://app.airtrain.ai/dataset/51f491c3-06fe-4da8-aba5-b18f7fa0d167\n" + ] + } + ], + "source": [ + "asyncio.run(main()) # actually run the main & the workflow" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 7af171d3ee2731a2c5c82fa622ad7e98c39265aa Mon Sep 17 00:00:00 2001 From: augray Date: Tue, 1 Oct 2024 13:44:23 -0700 Subject: [PATCH 2/5] Add 'Open in Collab' badge --- docs/docs/examples/cookbooks/airtrain.ipynb | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/docs/examples/cookbooks/airtrain.ipynb b/docs/docs/examples/cookbooks/airtrain.ipynb index 96e04724a00a2..2bd04657a1962 100644 --- a/docs/docs/examples/cookbooks/airtrain.ipynb +++ b/docs/docs/examples/cookbooks/airtrain.ipynb @@ -1,5 +1,12 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Open" + ] + }, { "attachments": {}, "cell_type": "markdown", From 91efb14a946b85c38d1f9e63eee89384ee9f58b4 Mon Sep 17 00:00:00 2001 From: augray Date: Tue, 1 Oct 2024 14:00:56 -0700 Subject: [PATCH 3/5] Add to mkdocs --- docs/mkdocs.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 3846121b55ea4..c5852adf353df 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -134,6 +134,7 @@ nav: - Cookbooks: - ./examples/cookbooks/GraphRAG_v1.ipynb - ./examples/cookbooks/GraphRAG_v2.ipynb + - ./examples/cookbooks/airtrain.ipynb - ./examples/cookbooks/anthropic_haiku.ipynb - ./examples/cookbooks/cleanlab_tlm_rag.ipynb - ./examples/cookbooks/codestral.ipynb From cf44815bf40ddb6536848475dc121e7fd39b2490 Mon Sep 17 00:00:00 2001 From: augray Date: Wed, 2 Oct 2024 06:57:46 -0700 Subject: [PATCH 4/5] Lint --- docs/docs/examples/cookbooks/airtrain.ipynb | 49 ++++++++++++--------- 1 file changed, 27 insertions(+), 22 deletions(-) diff --git a/docs/docs/examples/cookbooks/airtrain.ipynb b/docs/docs/examples/cookbooks/airtrain.ipynb index 2bd04657a1962..847ff5979641e 100644 --- a/docs/docs/examples/cookbooks/airtrain.ipynb +++ b/docs/docs/examples/cookbooks/airtrain.ipynb @@ -45,7 +45,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -53,6 +53,7 @@ "# use some async examples. So we will set up nest_asyncio here. 
Outside\n", "# an async context or outside a notebook, this step is not required.\n", "import nest_asyncio\n", + "\n", "nest_asyncio.apply()" ] }, @@ -81,7 +82,7 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -111,7 +112,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -120,7 +121,7 @@ "import airtrain as at\n", "from llama_index.core.node_parser import SemanticSplitterNodeParser\n", "from llama_index.embeddings.openai import OpenAIEmbedding\n", - "from llama_index.readers.github import GithubRepositoryReader, GithubClient\n" + "from llama_index.readers.github import GithubRepositoryReader, GithubClient" ] }, { @@ -132,11 +133,10 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "\n", "github_token = os.environ.get(\"GITHUB_TOKEN\")\n", "github_client = GithubClient(github_token=github_token, verbose=True)\n", "reader = GithubRepositoryReader(\n", @@ -168,7 +168,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -195,7 +195,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -211,7 +211,7 @@ " documents,\n", " name=\"Sematic Docs Dataset: Whole Documents\",\n", ")\n", - "print(f\"Uploaded {result.size} rows to '{result.name}'. View at: {result.url}\")\n" + "print(f\"Uploaded {result.size} rows to '{result.name}'. View at: {result.url}\")" ] }, { @@ -233,7 +233,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -253,7 +253,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -286,7 +286,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -318,7 +318,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -364,7 +364,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -372,6 +372,7 @@ " name: str\n", " documents: list[Node]\n", "\n", + "\n", "class AirtrainDocumentDatasetEvent(Event):\n", " metadata: DatasetMetadata" ] @@ -385,7 +386,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -398,7 +399,9 @@ " return None\n", " reader = AsyncWebPageReader(html_to_text=True)\n", " documents = await reader.aload_data(urls=ev.get(\"urls\"))\n", - " return CompletedDocumentRetrievalEvent(name=ev.get(\"name\"), documents=documents)\n", + " return CompletedDocumentRetrievalEvent(\n", + " name=ev.get(\"name\"), documents=documents\n", + " )\n", "\n", " @step\n", " async def ingest_documents_to_airtrain(\n", @@ -428,16 +431,19 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "async def main() -> None:\n", " workflow = IngestToAirtrainWorkflow()\n", " result = await workflow.run(\n", - " name=\"My HN Discussions Dataset\", urls=URLS,\n", + " name=\"My HN Discussions Dataset\",\n", + " urls=URLS,\n", " )\n", - " print(f\"Uploaded {result.size} rows to {result.name}. 
View at: {result.url}\")\n" + " print(\n", + " f\"Uploaded {result.size} rows to {result.name}. View at: {result.url}\"\n", + " )" ] }, { @@ -452,7 +458,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -493,8 +499,7 @@ "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5" + "pygments_lexer": "ipython3" } }, "nbformat": 4, From 675baa3d5582c317dfa538495ed9f10b443f0a57 Mon Sep 17 00:00:00 2001 From: augray Date: Wed, 2 Oct 2024 13:22:13 -0700 Subject: [PATCH 5/5] review comments --- docs/docs/examples/cookbooks/airtrain.ipynb | 48 ++++++++------------- 1 file changed, 19 insertions(+), 29 deletions(-) diff --git a/docs/docs/examples/cookbooks/airtrain.ipynb b/docs/docs/examples/cookbooks/airtrain.ipynb index 847ff5979641e..07e363142fa00 100644 --- a/docs/docs/examples/cookbooks/airtrain.ipynb +++ b/docs/docs/examples/cookbooks/airtrain.ipynb @@ -16,7 +16,7 @@ "\n", "[Airtrain](https://www.airtrain.ai/) is a tool supporting unstructured/low-structured text datasets. It allows automated clustering, document classification, and more.\n", "\n", - "This cookbook showcases how to ingest and transform/enrich data with Llama Index and then upload the data to Airtrain for further processing and exploration." + "This cookbook showcases how to ingest and transform/enrich data with LlamaIndex and then upload the data to Airtrain for further processing and exploration." ] }, { @@ -33,13 +33,13 @@ "outputs": [], "source": [ "# Install some libraries we'll use for our examples. These\n", - "# are not required to use Airtrain with Llama Index, and are just\n", + "# are not required to use Airtrain with LlamaIndex, and are just\n", "# there to help us illustrate use.\n", "%pip install llama-index-embeddings-openai==0.2.4\n", "%pip install llama-index-readers-web==0.2.2\n", "%pip install llama-index-readers-github==0.2.0\n", "\n", - "# Install Airtrain SDK with Llama Index integration\n", + "# Install Airtrain SDK with LlamaIndex integration\n", "%pip install airtrain-py[llama-index]" ] }, @@ -101,8 +101,8 @@ "source": [ "## Example 1: Usage with Readers/Embeddings/Splitters\n", "\n", - "Some of the core abstractions in Llama Index are [Documents and Nodes](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/).\n", - "Airtrain's Llama Index integration allows you to create an Airtrain dataset using any iterable collection of either of these, via the\n", + "Some of the core abstractions in LlamaIndex are [Documents and Nodes](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/).\n", + "Airtrain's LlamaIndex integration allows you to create an Airtrain dataset using any iterable collection of either of these, via the\n", "`upload_from_llama_nodes` function.\n", "\n", "To illustrate the flexibility of this, we'll do both:\n", @@ -226,7 +226,7 @@ "that are nearest to it in the full n-dimensional embedding space, to drill down further. Automated clusters and other insights\n", "will also be generated to enrich and aid your exploration.\n", "\n", - "Here we'll use OpenAI embeddings and a `SemanticSplitterNodeParser` splitter, but you can use any other Llama Index tooling you\n", + "Here we'll use OpenAI embeddings and a `SemanticSplitterNodeParser` splitter, but you can use any other LlamaIndex tooling you\n", "like to process your nodes before uploading to Airtrain. 
You can even skip embedding them yourself entirely, in which case\n", "Airtrain will embed the nodes for you." ] @@ -279,7 +279,7 @@ "## Example 2: Using the [Workflow](https://docs.llamaindex.ai/en/stable/module_guides/workflow/#workflows) API\n", "\n", "Since documents and nodes are the core abstractions the Airtrain integration works with, and these abstractions are\n", - "shared in Llama Index's workflows API, you can also use Airtrain as part of a broader workflow. Here we will illustrate\n", + "shared in LlamaIndex's workflows API, you can also use Airtrain as part of a broader workflow. Here we will illustrate\n", "usage by scraping a few [Hacker News](https://news.ycombinator.com/) comment threads, but again you are not restricted\n", "to web scraping workflows; any workflow producing documents or nodes will do." ] @@ -359,7 +359,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Next we'll define some basic events, as this events are the standard way to pass data between steps in Llama Index workflows." + "Next we'll define a basic event, as events are the standard way to pass data between steps in LlamaIndex workflows." ] }, { @@ -370,11 +370,7 @@ "source": [ "class CompletedDocumentRetrievalEvent(Event):\n", " name: str\n", - " documents: list[Node]\n", - "\n", - "\n", - "class AirtrainDocumentDatasetEvent(Event):\n", - " metadata: DatasetMetadata" + " documents: list[Node]" ] }, { @@ -406,20 +402,9 @@ " @step\n", " async def ingest_documents_to_airtrain(\n", " self, ctx: Context, ev: CompletedDocumentRetrievalEvent\n", - " ) -> AirtrainDocumentDatasetEvent | None:\n", - " if not isinstance(ev, CompletedDocumentRetrievalEvent):\n", - " return None\n", - "\n", + " ) -> StopEvent | None:\n", " dataset_meta = upload_from_llama_nodes(ev.documents, name=ev.name)\n", - " return AirtrainDocumentDatasetEvent(metadata=dataset_meta)\n", - "\n", - " @step\n", - " async def complete_workflow(\n", - " self, ctx: Context, ev: AirtrainDocumentDatasetEvent\n", - " ) -> None | StopEvent:\n", - " if not isinstance(ev, AirtrainDocumentDatasetEvent):\n", - " return None\n", - " return StopEvent(result=ev.metadata)" + " return StopEvent(result=dataset_meta)" ] }, { @@ -465,17 +450,22 @@ "name": "stderr", "output_type": "stream", "text": [ + "error fetching page from https://news.ycombinator.com/item?id=41693087\n", "error fetching page from https://news.ycombinator.com/item?id=41666269\n", "error fetching page from https://news.ycombinator.com/item?id=41697137\n", - "error fetching page from https://news.ycombinator.com/item?id=41695076\n", - "error fetching page from https://news.ycombinator.com/item?id=41697032\n" + "error fetching page from https://news.ycombinator.com/item?id=41697032\n", + "error fetching page from https://news.ycombinator.com/item?id=41652935\n", + "error fetching page from https://news.ycombinator.com/item?id=41696434\n", + "error fetching page from https://news.ycombinator.com/item?id=41688469\n", + "error fetching page from https://news.ycombinator.com/item?id=41646782\n", + "error fetching page from https://news.ycombinator.com/item?id=41668896\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "Uploaded 25 rows to My HN Discussions Dataset. View at: https://app.airtrain.ai/dataset/51f491c3-06fe-4da8-aba5-b18f7fa0d167\n" + "Uploaded 20 rows to My HN Discussions Dataset. View at: https://app.airtrain.ai/dataset/bd330f0a-6ff1-4e51-9fe2-9900a1a42308\n" ] } ],