NVIDIA · ChrisJar · Nov 22, 2024 · Oct 3, 2024 · Oct 7, 2024 · Oct 9, 2024
@@ -10,282 +10,130 @@
   },
   {
    "cell_type": "markdown",
-   "id": "12e78be0-9abe-4d1a-8aaa-61230b059792",
+   "id": "91ece9e3-155a-44f4-81e5-2f9492c62a2f",
    "metadata": {},
    "source": [
-    "This cookbook shows how to use LangChain to query the table and text extraction results of nv-ingest's pdf extraction tools"
+    "This notebook shows how to perform RAG on the table, chart, and text extraction results of NV-Ingest's pdf extraction tools using LangChain"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "cdfad056-baef-448e-b5b3-4544e6b06472",
+   "id": "c6905d11-0ec3-43c8-961b-24cb52e36bfe",
    "metadata": {},
    "source": [
-    "To start we'll need to make sure we have some dependencies installed"
-   ]
-  },
-  {
-   "cell_type": "raw",
-   "id": "2dcc96cf-f8e1-4baa-b1ca-b6521d23dec8",
-   "metadata": {},
-   "source": [
-    "pip install langchain langchain_community langchain_chroma langchain-openai"
+    "**Note:** In order to run this notebook, you'll need to have the NV-Ingest microservice running along with all of the other included microservices. To do this, make sure all of the services are uncommented in the file: [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) and follow the [quickstart guide](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart) to start everything up. You'll also need to have the NV-Ingest python client installed as demonstrated [here](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#step-2-installing-python-dependencies)."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "2487f1eb-eea8-46f1-9ab0-0fd3bea77bd5",
+   "id": "81014734-f765-48fc-8fc2-4c19f5f28eae",
    "metadata": {},
    "source": [
-    "Then, we'll use nv-ingest to parse an example pdf that contains text, tables, charts, and images. We'll need to make sure to have the nv-ingest microservice up and running at localhost:7670 along with the supporting NIMs. To do this, follow the nv-ingest [quickstart guide](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart). Once the microservice is ready we can create a job with the nv-ingest python client"
+    "To start, make sure LangChain and pymilvus are installed and up to date"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
-   "id": "32017922-9b9c-48b9-86ab-6319377fcce8",
+   "execution_count": null,
+   "id": "bacbe052-4429-4c0a-8b1e-309ac55ad8fb",
    "metadata": {},
    "outputs": [],
    "source": [
-    "from nv_ingest_client.client import NvIngestClient\n",
-    "from nv_ingest_client.primitives import JobSpec\n",
-    "from nv_ingest_client.primitives.tasks import ExtractTask\n",
-    "from nv_ingest_client.primitives.tasks import SplitTask\n",
-    "from nv_ingest_client.util.file_processing.extract import extract_file_content\n",
-    "import logging, time\n",
-    "\n",
-    "logger = logging.getLogger(\"nv_ingest_client\")\n",
-    "\n",
-    "file_name = \"../data/multimodal_test.pdf\"\n",
-    "file_content, file_type = extract_file_content(file_name)\n",
-    "\n",
-    "job_spec = JobSpec(\n",
-    "    document_type=file_type,\n",
-    "    payload=file_content,\n",
-    "    source_id=file_name,\n",
-    "    source_name=file_name,\n",
-    "    extended_options={\"tracing_options\": {\"trace\": True, \"ts_send\": time.time_ns()}},\n",
-    ")"
+    "pip install -qU langchain langchain_community langchain-nvidia-ai-endpoints langchain_milvus pymilvus"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "5cedc329-e9a1-428e-9fa1-2aee965c2533",
+   "id": "d888ba26-04cf-4577-81a3-5bcd537fc2f6",
    "metadata": {},
    "source": [
-    "And then we can and submit a task to extract the text and tables from the example pdf"
+    "Then, we'll use NV-Ingest's Ingestor interface to extract the tables and charts from a test pdf, embed them, and upload them to our Milvus vector database (VDB)"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
-   "id": "8bcd6b2a-c832-4685-bd6e-53b60a107e28",
+   "execution_count": 1,
+   "id": "32017922-9b9c-48b9-86ab-6319377fcce8",
    "metadata": {},
    "outputs": [],
    "source": [
-    "extract_task = ExtractTask(\n",
-    "    document_type=file_type,\n",
-    "    extract_text=True,\n",
-    "    extract_images=False,\n",
-    "    extract_tables=True,\n",
-    ")\n",
-    "\n",
+    "from nv_ingest_client.client import Ingestor\n",
     "\n",
-    "job_spec.add_task(extract_task)\n",
-    "client = NvIngestClient()\n",
-    "job_id = client.add_job(job_spec)\n",
-    "\n",
-    "client.submit_job(job_id, \"morpheus_task_queue\")\n",
+    "ingestor = (\n",
+    "    Ingestor(message_client_hostname=\"localhost\")\n",
+    "    .files(\"../data/multimodal_test.pdf\")\n",
+    "    .extract(\n",
+    "        extract_text=False,\n",
+    "        extract_tables=True,\n",
+    "        extract_images=False,\n",
+    "    ).embed(\n",
+    "        text=False,\n",
+    "        tables=True,\n",
+    "    ).vdb_upload()\n",
+    ")\n",
     "\n",
-    "result = client.fetch_job_result(job_id, timeout=60)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "id": "bf097db1-9699-442e-88b7-7b67a8a2fecf",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "{'document_type': 'text',\n",
-       " 'metadata': {'content': 'TestingDocument\\r\\nA sample document with headings and placeholder text\\r\\nIntroduction\\r\\nThis is a placeholder document that can be used for any purpose. It contains some \\r\\nheadings and some placeholder text to fill the space. The text is not important and contains \\r\\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \\r\\nthat we can use to confirm Ingest is working as expected.\\r\\nTable 1\\r\\nThis table describes some animals, and some activities they might be doing in specific \\r\\nlocations.\\r\\nAnimal Activity Place\\r\\nGira@e Driving a car At the beach\\r\\nLion Putting on sunscreen At the park\\r\\nCat Jumping onto a laptop In a home o@ice\\r\\nDog Chasing a squirrel In the front yard\\r\\nChart 1\\r\\nThis chart shows some gadgets, and some very fictitious costs. Section One\\r\\nThis is the first section of the document. It has some more placeholder text to show how \\r\\nthe document looks like. The text is not meant to be meaningful or informative, but rather to \\r\\ndemonstrate the layout and formatting of the document.\\r\\n• This is the first bullet point\\r\\n• This is the second bullet point\\r\\n• This is the third bullet point\\r\\nSection Two\\r\\nThis is the second section of the document. It is more of the same as we’ve seen in the rest \\r\\nof the document. The content is meaningless, but the intent is to create a very simple \\r\\nsmoke test to ensure extraction is working as intended. This will be used in CI as time goes \\r\\non to ensure that changes we make to the library do not negatively impact our accuracy.\\r\\nTable 2\\r\\nThis table shows some popular colors that cars might come in.\\r\\nCar Color1 Color2 Color3\\r\\nCoupe White Silver Flat Gray\\r\\nSedan White Metallic Gray Matte Gray\\r\\nMinivan Gray Beige Black\\r\\nTruck Dark Gray Titanium Gray Charcoal\\r\\nConvertible Light Gray Graphite Slate Gray\\r\\nPicture\\r\\nBelow, is a high-quality picture of some shapes. Chart 2\\r\\nThis chart shows some average frequency ranges for speaker drivers.\\r\\nConclusion\\r\\nThis is the conclusion of the document. It has some more placeholder text, but the most \\r\\nimportant thing is that this is the conclusion. As we end this document, we should have \\r\\nbeen able to extract 2 tables, 2 charts, and some text including 3 bullet points.',\n",
-       "  'content_metadata': {'description': 'Unstructured text from PDF document.',\n",
-       "   'hierarchy': {'block': -1,\n",
-       "    'line': -1,\n",
-       "    'nearby_objects': {'images': {'bbox': [], 'content': []},\n",
-       "     'structured': {'bbox': [], 'content': []},\n",
-       "     'text': {'bbox': [], 'content': []}},\n",
-       "    'page': -1,\n",
-       "    'page_count': 3,\n",
-       "    'span': -1},\n",
-       "   'page_number': -1,\n",
-       "   'subtype': '',\n",
-       "   'type': 'text'},\n",
-       "  'debug_metadata': None,\n",
-       "  'embedding': None,\n",
-       "  'error_metadata': None,\n",
-       "  'image_metadata': None,\n",
-       "  'info_message_metadata': None,\n",
-       "  'raise_on_failure': False,\n",
-       "  'source_metadata': {'access_level': 1,\n",
-       "   'collection_id': '',\n",
-       "   'date_created': '2024-09-04T15:55:15.103673',\n",
-       "   'last_modified': '2024-09-04T15:55:15.103335',\n",
-       "   'partition_id': -1,\n",
-       "   'source_id': '../data/multimodal_test.pdf',\n",
-       "   'source_location': '',\n",
-       "   'source_name': '../data/multimodal_test.pdf',\n",
-       "   'source_type': 'PDF',\n",
-       "   'summary': ''},\n",
-       "  'table_metadata': None,\n",
-       "  'text_metadata': {'keywords': '',\n",
-       "   'language': 'en',\n",
-       "   'summary': '',\n",
-       "   'text_location': [-1, -1, -1, -1],\n",
-       "   'text_type': 'document'}}}"
-      ]
-     },
-     "execution_count": 3,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "result[0][0][0]"
+    "results = ingestor.ingest()"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "72d85eb5-2d66-4d6d-95fe-f69b2c6c5489",
-   "metadata": {},
-   "source": [
-    "Now, we have the extraction results in the nv-ingest metadata format which we'll grab the extracted content from and load into Langchain documents"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "id": "b0c6d966-9a38-4b1b-b5f2-cbe214d1718d",
+   "id": "02131711-31bf-4536-81b7-8c464c7473e3",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "from langchain_core.documents import Document\n",
-    "\n",
-    "texts = []\n",
-    "tables = []\n",
-    "for element in result[0][0]:\n",
-    "    if element['document_type'] == 'text':\n",
-    "        texts.append(Document(element['metadata']['content']))\n",
-    "    elif element['document_type'] == 'structured':\n",
-    "        tables.append(Document(element['metadata']['table_metadata']['table_content']))"
+    "Now, the text, table, and chart content is extracted and stored in the Milvus VDB along with the embeddings. Next we'll connect LangChain to Milvus and create a vector store so that we can query our extraction results"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
-   "id": "7bb4b918-9029-4ffc-badd-dddc4022a750",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[Document(page_content='TestingDocument\\r\\nA sample document with headings and placeholder text\\r\\nIntroduction\\r\\nThis is a placeholder document that can be used for any purpose. It contains some \\r\\nheadings and some placeholder text to fill the space. The text is not important and contains \\r\\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \\r\\nthat we can use to confirm Ingest is working as expected.\\r\\nTable 1\\r\\nThis table describes some animals, and some activities they might be doing in specific \\r\\nlocations.\\r\\nAnimal Activity Place\\r\\nGira@e Driving a car At the beach\\r\\nLion Putting on sunscreen At the park\\r\\nCat Jumping onto a laptop In a home o@ice\\r\\nDog Chasing a squirrel In the front yard\\r\\nChart 1\\r\\nThis chart shows some gadgets, and some very fictitious costs. Section One\\r\\nThis is the first section of the document. It has some more placeholder text to show how \\r\\nthe document looks like. The text is not meant to be meaningful or informative, but rather to \\r\\ndemonstrate the layout and formatting of the document.\\r\\n• This is the first bullet point\\r\\n• This is the second bullet point\\r\\n• This is the third bullet point\\r\\nSection Two\\r\\nThis is the second section of the document. It is more of the same as we’ve seen in the rest \\r\\nof the document. The content is meaningless, but the intent is to create a very simple \\r\\nsmoke test to ensure extraction is working as intended. This will be used in CI as time goes \\r\\non to ensure that changes we make to the library do not negatively impact our accuracy.\\r\\nTable 2\\r\\nThis table shows some popular colors that cars might come in.\\r\\nCar Color1 Color2 Color3\\r\\nCoupe White Silver Flat Gray\\r\\nSedan White Metallic Gray Matte Gray\\r\\nMinivan Gray Beige Black\\r\\nTruck Dark Gray Titanium Gray Charcoal\\r\\nConvertible Light Gray Graphite Slate Gray\\r\\nPicture\\r\\nBelow, is a high-quality picture of some shapes. Chart 2\\r\\nThis chart shows some average frequency ranges for speaker drivers.\\r\\nConclusion\\r\\nThis is the conclusion of the document. It has some more placeholder text, but the most \\r\\nimportant thing is that this is the conclusion. As we end this document, we should have \\r\\nbeen able to extract 2 tables, 2 charts, and some text including 3 bullet points.')]"
-      ]
-     },
-     "execution_count": 5,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "texts"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "id": "307fdf5d-587b-448b-8b79-e4fa88fb5b0c",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[Document(page_content='locations. Animal Activity Place Giraffe Driving a car At the beach Lion Putting on sunscreen At the park Cat Jumping onto a laptop In a home office Dog Chasing a squirrel In the front yard'),\n",
-       " Document(page_content='This chart shows some gadgets, and some very fictitious costs. >\\\\n7938.758 ext. Print & Maroon Bookshelf Fine Art Poems Collection dla Cemicon Diamtháhn | Gadgets and their cost\\nSollywood for Coasters | 19875.075     t158.281 \\n Hammer | 19871.55 \\n Powerdrill | 12044.625 \\n Bluetooth speaker | 7598.07 \\n Minifridge | 9916.305 \\n Premium desk'),\n",
-       " Document(page_content='This table shows some popular colors that cars might come in. Car Color1 Color2 Color3 Coupe White Silver Flat Gray Sedan White Metallic Gray Matte Gray Minivan Gray Beige Black Truck Dark Gray Titanium Gray Charcoal Convertible Light Gray Graphite Slate Gray'),\n",
-       " Document(page_content='This chart shows some average frequency ranges for speaker drivers TITLE | Chart 2 \\n Frequency Range Start (Hz) | Frequency Range Start (Hz) | Frequency Range End (Hz) \\n Twitter | 12800 | 12700 \\n Midrange | 13900 | 13000 \\n Midwoofer | 9600 | 13000 \\n Subwoofer | 0.00 | 13000')]"
-      ]
-     },
-     "execution_count": 6,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "tables"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "15410af0-fa5c-4aed-b554-f2165c6f482e",
-   "metadata": {},
-   "source": [
-    "Next, we'll set our OpenAI API key and create a vector store to embed and store our text and table documents using OpenAI's embedding model"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 13,
    "id": "53957974-c688-4521-8c61-09f2649d5d53",
    "metadata": {},
    "outputs": [],
    "source": [
-    "import os\n",
-    "from langchain_chroma import Chroma\n",
-    "from langchain_openai import OpenAIEmbeddings\n",
+    "from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings\n",
+    "from langchain_milvus import Milvus\n",
     "\n",
-    "# TODO: Add your OpenAI API key here\n",
-    "os.environ[\"OPENAI_API_KEY\"] = \"<YOUR_OPENAI_API_KEY>\"\n",
+    "embedding_function = NVIDIAEmbeddings(base_url=\"http://localhost:8012/v1\")\n",
     "\n",
-    "vectorstore = Chroma.from_documents(documents=(texts+tables), embedding=OpenAIEmbeddings())"
+    "vectorstore = Milvus(\n",
+    "    embedding_function=embedding_function,\n",
+    "    collection_name=\"nv_ingest_collection\",\n",
+    "    primary_field = \"pk\",\n",
+    "    vector_field = \"vector\",\n",
+    "    text_field=\"text\",\n",
+    "    connection_args={\"uri\": \"http://localhost:19530\"},\n",
+    ")\n",
+    "retriever = vectorstore.as_retriever()"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "12d14f63-31b8-4d72-a029-a54328ff1c5b",
+   "id": "b87111b5-e5a8-45a0-9663-2ae6d9ea2ab6",
    "metadata": {},
    "source": [
-    "Then, we'll create a retriever from our vector score that will allow us to retrieve our documents by semantic similarity and an llm to synthesize the final answer from the retrieved documents"
+    "Then, we'll create an RAG chain using [llama-3.1-405b-instruct](https://build.nvidia.com/meta/llama-3_1-405b-instruct) that we can use to query our pdf in natural language"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
-   "id": "a12f53f7-2407-4890-9586-82956cfccf13",
+   "execution_count": 3,
+   "id": "b4c9e109-395c-40e2-a1a5-e0c0ef217e24",
    "metadata": {},
    "outputs": [],
    "source": [
-    "from langchain_openai import ChatOpenAI\n",
+    "import os \n",
+    "from langchain_nvidia_ai_endpoints import ChatNVIDIA\n",
     "\n",
-    "retriever = vectorstore.as_retriever()\n",
+    "# TODO: Add your NVIDIA API key\n",
+    "os.environ[\"NVIDIA_API_KEY\"] = \"[YOUR NVIDIA API KEY HERE]\"\n",
     "\n",
-    "llm = ChatOpenAI(model=\"gpt-4o-mini\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b87111b5-e5a8-45a0-9663-2ae6d9ea2ab6",
-   "metadata": {},
-   "source": [
-    "Finally, we'll create an RAG chain that we can use to query our pdf in natural language"
+    "llm = ChatNVIDIA(model=\"meta/llama-3.1-405b-instruct\")"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
-   "id": "d16331bb-95dd-46d7-8b71-9d966a8ef3ba",
+   "execution_count": 17,
+   "id": "77fd17f8-eac0-4457-b6fb-6e5c8ce90c84",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -313,9 +161,17 @@
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "cc2ee8fb-a154-46c9-9181-29a035fdcfbb",
+   "metadata": {},
+   "source": [
+    "And now we can ask our pdf questions"
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 16,
    "id": "b547a19a-9ada-4a40-a246-6d7bc4d24482",
    "metadata": {},
    "outputs": [
@@ -325,7 +181,7 @@
        "'The dog is chasing a squirrel in the front yard.'"
       ]
      },
-     "execution_count": 10,
+     "execution_count": 16,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -337,7 +193,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "dcbcfcb7-c6c6-4112-86c4-39e6975d325b",
+   "id": "b5b3f079-65a6-4d32-a190-1df96925c5c7",
    "metadata": {},
    "outputs": [],
    "source": []
@@ -359,7 +215,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.14"
+   "version": "3.10.15"
   }
  },
  "nbformat": 4,