Merge pull request #24 from lamalab-org/beyond_images

kjappelbaum authored Jun 4, 2024
2 parents a16e4c6 + f865078 commit 21dcac4

Showing 12 changed files with 303 additions and 96 deletions.
261 changes: 168 additions & 93 deletions beyond_text/beyond_images.ipynb
@@ -27,16 +27,29 @@
"source": [
"Text-only LLMs tend to have problems with analysing and understanding complex structures such as tables, plots and images included in scientific articles. Since, especially in chemistry and materials science, important information about chemical compounds is contained in such structures, one needs a different approach for them. Vision language models (VLMs) are a natural fit since they can analyse images alongside text. There are several open- and closed-source VLMs available, e.g. [Vision models from OpenAI](https://platform.openai.com/docs/guides/vision), [Claude models](https://docs.anthropic.com/en/docs/vision) and [DeepSeek-VL](https://github.com/deepseek-ai/DeepSeek-VL). As an example, the extraction of information from images with [GPT-4o](https://platform.openai.com/docs/models/gpt-4o) is shown:\n",
"\n",
"First one has to convert the file into images."
],
"metadata": {
"collapsed": false
},
"id": "f3a5e72798406fa1"
},
{
"cell_type": "markdown",
"source": [
"::: {.callout-note}\n",
"\n",
"The PDF file used here was obtained in [Section 1](../obtaining_data/data_mining.ipynb).\n",
":::"
],
"metadata": {
"collapsed": false
},
"id": "4d590539057c1dc3"
},
{
"cell_type": "code",
"execution_count": 3,
"outputs": [],
"source": [
"from pdf2image import convert_from_path\n",
@@ -49,8 +62,8 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-06-03T10:42:11.479846Z",
"start_time": "2024-06-03T10:42:10.275394Z"
}
},
"id": "b886d18ec7e86797"
@@ -67,7 +80,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"outputs": [
{
"name": "stderr",
@@ -177,8 +190,8 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-06-03T10:42:19.110545Z",
"start_time": "2024-06-03T10:42:11.483622Z"
}
},
"id": "19b39a0896010c5f"
@@ -193,9 +206,23 @@
},
"id": "fd238c06473113b7"
},
{
"cell_type": "markdown",
"source": [
"::: {.callout-tip}\n",
"## Prompt\n",
"\n",
"This is a very simple example prompt. One should optimize and engineer the prompt before using it. For that, one could use a tool like [DSPy](https://github.com/stanfordnlp/dspy).\n",
":::"
],
"metadata": {
"collapsed": false
},
"id": "dc0b4943b2bf26a4"
},
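As a sketch of what a slightly more engineered prompt could look like (this exact wording and the key list are illustrative choices, not taken from the notebook):

```python
# Hypothetical refined extraction prompt; the key list is an illustrative choice,
# not the one used in the notebook's hidden prompt cell.
prompt_text = (
    "You will receive page images from a chemistry article. "
    "Report every Buchwald-Hartwig coupling you find as a JSON object with the keys "
    "'Catalyst', 'Ligand', 'Base', 'Solvent', 'Temperature', 'Time' and 'Yield'. "
    "Use null for values that are not stated and copy all numbers verbatim."
)
print(prompt_text)
```

Spelling out the expected keys and the handling of missing values tends to make the JSON output easier to parse downstream.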
{
"cell_type": "code",
"execution_count": 5,
"outputs": [],
"source": [
"# the text prompt for the model call gets defined\n",
@@ -225,134 +252,182 @@
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-06-03T10:42:19.111092Z",
"start_time": "2024-06-03T10:42:19.104520Z"
}
},
"id": "f95a25d2099f080d"
},
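The notebook's code for attaching the page images to the prompt is collapsed in this diff view. For reference, a user message for a vision model is typically assembled as below; this is a sketch, and the function name `build_vision_messages` and its raw-bytes interface are our own illustration, not the notebook's code:

```python
import base64

def build_vision_messages(text, page_images):
    """Pack a text instruction and PNG page images (raw bytes) into one
    user message in the OpenAI vision chat format."""
    content = [{"type": "text", "text": text}]
    for png_bytes in page_images:
        # images are sent inline as base64-encoded data URLs
        b64 = base64.b64encode(png_bytes).decode("utf-8")
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        )
    return [{"role": "user", "content": content}]

messages = build_vision_messages("Extract the reaction conditions.", [b"\x89PNG..."])
```

Each page image becomes one `image_url` content part, so a multi-page PDF turns into one message with several images.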
{
"cell_type": "markdown",
"source": [
"To call the actual model, one could use [LiteLLM](https://github.com/BerriAI/litellm) instead of directly using a single API like the OpenAI API. This makes it easy to switch between models from different providers."
],
"metadata": {
"collapsed": false
},
"id": "21daa806c07c25b5"
},
{
"cell_type": "markdown",
"source": [
"::: {.callout-important}\n",
"## API-Key\n",
"\n",
"One has to provide their own API key in the `.env` file.\n",
":::"
],
"metadata": {
"collapsed": false
},
"id": "9c11896b70b6aa6a"
},
{
"cell_type": "code",
"execution_count": 7,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Output: Here is the extracted information about Buchwald-Hartwig reactions from the provided images:\n",
"\n",
"```json\n",
"{\n",
" \"Buchwald-Hartwig Reactions\": {\n",
" \"Key Step\": \"Cross-coupling reaction of an α-amino-BODIPY and the respective halide.\",\n",
" \"Conditions\": [\n",
" {\n",
" \"Reagents\": [\n",
" \"Pd(OAc)2\",\n",
" \"(±)-BINAP\",\n",
" \"Cs2CO3\",\n",
" \"PhMe\"\n",
" ],\n",
" \"Temperature\": \"80 °C\",\n",
" \"Time\": \"1-5 h\",\n",
" \"Yield\": \"up to 68%\"\n",
" }\n",
" ],\n",
" \"Monomers\": [\n",
" {\n",
" \"Type\": \"α-chlorinated or α-amino-BODIPYs\",\n",
" \"Reagents\": [\n",
" \"Pd(OAc)2\",\n",
" \"(±)-BINAP\",\n",
" \"Cs2CO3\",\n",
" \"PhMe\"\n",
" ],\n",
" \"Temperature\": \"80 °C\",\n",
" \"Time\": \"1-5 h\",\n",
" \"Yield\": \"up to 68%\"\n",
" }\n",
" ],\n",
" \"Functionalized Monomers\": [\n",
" {\n",
" \"Type\": \"Br-Ar-mono-Br or Br-Ar-di\",\n",
" \"Reagents\": [\n",
" \"Pd(OAc)2\",\n",
" \"(±)-BINAP\",\n",
" \"Cs2CO3\",\n",
" \"PhMe\"\n",
" ],\n",
" \"Temperature\": \"80 °C\",\n",
" \"Time\": \"1-5 h\",\n",
" \"Yield\": \"44% for Br-Ar-mono-Br, 45% of starting material recovered\"\n",
" }\n",
" ],\n",
" \"Procedure\": \"Stirring slow addition of Br-Ar-mono-NH2 to a heated solution of the remaining reagents.\",\n",
" \"Selectivity\": \"Maintained excess of Br-Ar-mono-Br to avoid further oligomerization.\"\n",
" }\n",
"}\n",
"```\n",
"Input tokens used: 6704 Output tokens used: 387\n"
]
}
],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"from litellm import completion\n",
"\n",
"# Define the function that calls the model through LiteLLM;\n",
"# the temperature is set to 0 since the output should be as reproducible as possible,\n",
"# and gpt-4o is used since it is the cheapest and fastest OpenAI vision model\n",
"def call_litellm(prompt, model=\"gpt-4o\", temperature: float = 0.0, **kwargs):\n",
"    \"\"\"Call a model through LiteLLM.\n",
"\n",
"    Args:\n",
"        prompt: Prompt to send to the model\n",
"        model (str, optional): Name of the model. Defaults to \"gpt-4o\".\n",
"        temperature (float, optional): Inference temperature. Defaults to 0.\n",
"\n",
"    Returns:\n",
"        tuple: message content, number of input tokens, number of output tokens\n",
"    \"\"\"\n",
"    messages = [\n",
"        {\n",
"            \"role\": \"system\",\n",
"            \"content\": (\n",
"                \"You are a scientific assistant, extracting important information about reaction conditions \"\n",
"                \"out of PDFs in valid JSON format. Extract just data which you are 100% confident about the \"\n",
"                \"accuracy. Keep the entries short without details. Be careful with numbers.\"\n",
"            ),\n",
"        },\n",
"        {\"role\": \"user\", \"content\": prompt},\n",
"    ]\n",
"\n",
"    response = completion(\n",
"        model=model,\n",
"        messages=messages,\n",
"        temperature=temperature,\n",
"        **kwargs,\n",
"    )\n",
"\n",
"    # The input and output tokens are reported in order to track the cost of the API calls\n",
"    message_content = response['choices'][0]['message']['content']\n",
"    input_tokens = response['usage']['prompt_tokens']\n",
"    output_tokens = response['usage']['completion_tokens']\n",
"    return message_content, input_tokens, output_tokens\n",
"\n",
"# Load the OpenAI API key from the environment file\n",
"dotenv_path = '../.env'\n",
"load_dotenv(dotenv_path)\n",
"api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"\n",
"# Set the API key for LiteLLM\n",
"os.environ[\"OPENAI_API_KEY\"] = api_key\n",
"\n",
"# Call the model and print the output and the token usage\n",
"output, input_tokens, output_tokens = call_litellm(prompt=prompt)\n",
"print('Output: ', output)\n",
"print('Input tokens used:', input_tokens, 'Output tokens used:', output_tokens)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-06-03T10:56:02.715950Z",
"start_time": "2024-06-03T10:55:46.810446Z"
}
},
"id": "3d37e685fef815ab"
},
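Since the input and output tokens are reported to track costs, one can convert them into an approximate price per call. The rates below are placeholders, not current OpenAI pricing; check the provider's pricing page before relying on such numbers:

```python
# Placeholder per-token rates in USD per 1M tokens; these are ASSUMED values,
# NOT current OpenAI pricing — look up the real rates before using them.
PRICE_PER_1M_INPUT = 5.00
PRICE_PER_1M_OUTPUT = 15.00

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Approximate USD cost of one API call from its token counts."""
    return (input_tokens * PRICE_PER_1M_INPUT + output_tokens * PRICE_PER_1M_OUTPUT) / 1e6

# Token counts from the call above
print(round(call_cost(6704, 387), 4))  # → 0.0393
```

Logging this per call makes it easy to estimate the total cost of processing a whole corpus of PDFs before scaling up.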
{
"cell_type": "markdown",
"source": [
"::: {.callout-tip}\n",
"\n",
"To get only the JSON part of the output, one could use a regular expression to extract this content.\n",
":::"
],
"metadata": {
"collapsed": false
},
"id": "b4ca2e31fcdbed75"
},
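A minimal sketch of that extraction; the function name `extract_json_block` is our own, and it handles both fenced and bare JSON replies:

```python
import json
import re

def extract_json_block(text):
    """Return the first JSON object in a model reply, whether it sits inside
    a markdown json fence or stands bare in the text."""
    fenced = re.search(r"```json\s*(.*?)\s*```", text, re.DOTALL)
    if fenced is not None:
        return json.loads(fenced.group(1))
    bare = re.search(r"\{.*\}", text, re.DOTALL)
    if bare is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(bare.group(0))

reply = 'Here is the data:\n```json\n{"Yield": "up to 68%"}\n```'
print(extract_json_block(reply))  # → {'Yield': 'up to 68%'}
```

Parsing the extracted string with `json.loads` also acts as a validity check: a malformed reply raises an exception instead of silently corrupting the database.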
{
"cell_type": "markdown",
"source": [
"Since no experimental section is provided in the article, the model just extracted general information about the reactions. It failed to extract the data provided in the reaction schemes. To extract this information, one should use the tools presented in the [agentic section](link to agentic section).\n",
"\n",
"Now one could use this structured output to build up a database of Buchwald-Hartwig coupling reactions."
],
"metadata": {
