From 4e0985c2f4735a9b5745b8dbe214708a3a747056 Mon Sep 17 00:00:00 2001 From: Liam Thompson Date: Fri, 4 Aug 2023 13:50:09 +0200 Subject: [PATCH 1/2] Add notebook to test action --- elasticsearch-basics-indexing.ipynb | 491 ++++++++++++++++++++++++++++ 1 file changed, 491 insertions(+) create mode 100644 elasticsearch-basics-indexing.ipynb diff --git a/elasticsearch-basics-indexing.ipynb b/elasticsearch-basics-indexing.ipynb new file mode 100644 index 0000000..b81c7a8 --- /dev/null +++ b/elasticsearch-basics-indexing.ipynb @@ -0,0 +1,491 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "87773ce7", + "metadata": {}, + "source": [ + "# Elasticsearch basics: indexing\n", + "\n", + "This notebook contains a basic introduction to indexing documents into Elasticsearch, using the Python client.\n", + "This is an interactive notebook, so you can run the code and experiment with it!\n", + "\n", + "Run this notebook:\n", + "\n", + "- Locally using [jupyter](https://docs.jupyter.org/en/latest/install.html)\n", + "- Online using [Google Colab](https://colab.research.google.com/?hl=en)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "d0f7dafa", + "metadata": {}, + "source": [ + "## 🧰 Requirements\n", + "\n", + "For this example, you will need:\n", + "\n", + "- Python 3.6 or later\n", + "- An Elastic deployment\n", + " - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?elektra=en-ess-sign-up-page))\n", + "- The [Elastic Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/installation.html)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "9cf3cbb5", + "metadata": {}, + "source": [ + "## Create Elastic Cloud deployment\n", + "\n", + "If you don't have an Elastic Cloud deployment, sign up 
[here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.\n", + "\n", + "- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page\n", + " - Select **Create deployment**" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "d7076a7a", + "metadata": {}, + "source": [ + "## Install packages and import modules\n", + "\n", + "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", + "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n", + "\n", + "First we need to `pip` install the following packages:\n", + "\n", + "- `elasticsearch`\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "6e237928", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: elasticsearch in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (8.8.0)\n", + "Requirement already satisfied: elastic-transport<9,>=8 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elasticsearch) (8.4.0)\n", + "Requirement already satisfied: certifi in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elastic-transport<9,>=8->elasticsearch) (2023.5.7)\n", + "Requirement already satisfied: urllib3<2,>=1.26.2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elastic-transport<9,>=8->elasticsearch) (1.26.16)\n", + "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n", + "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n" + ] + } + ], + "source": [ + "!pip install elasticsearch" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "cccf5bf5", + "metadata": {}, + "source": [ + "Next we need to import the modules we need." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "8ed40603",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from elasticsearch import Elasticsearch, helpers\n",
+ "from urllib.request import urlopen\n",
+ "import getpass\n",
+ "import json\n",
+ "from datetime import datetime"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "38679016",
+ "metadata": {},
+ "source": [
+ "## Initialize the Elasticsearch client\n",
+ "\n",
+ "Now we can instantiate the Elasticsearch client.\n",
+ "First we prompt the user for their password and Cloud ID.\n",
+ "\n",
+ "🔐 NOTE: `getpass` enables us to securely prompt for credentials without echoing them to the terminal or hard-coding them in the notebook.\n",
+ "\n",
+ "Then we create a `client` object: an instance of the `Elasticsearch` class."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "145a1222",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Found in the 'Manage Deployment' page\n",
+ "CLOUD_ID = getpass.getpass('Enter Elastic Cloud ID: ')\n",
+ "\n",
+ "# Password for the 'elastic' user generated by Elasticsearch\n",
+ "ELASTIC_PASSWORD = getpass.getpass('Enter Elastic password: ')\n",
+ "\n",
+ "# Create the client instance\n",
+ "client = Elasticsearch(\n",
+ " cloud_id=CLOUD_ID,\n",
+ " basic_auth=(\"elastic\", ELASTIC_PASSWORD)\n",
+ ")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "555fbc67",
+ "metadata": {},
+ "source": [
+ "Confirm that the client has connected with this test."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "92afc4a9",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'name': 'instance-0000000001', 'cluster_name': '9dd1e5c0b0d64796b8cf0746cf63d734', 'cluster_uuid': 'VeYvw6JhQcC3P-Q1-L9P_w', 'version': {'number': '8.9.0-SNAPSHOT', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'ac7d79178c3e57c935358453331efe9e9cc5104d', 'build_date': '2023-06-21T09:08:25.219504984Z', 'build_snapshot': True, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0', 'transport_version': '8500019'}, 'tagline': 'You Know, for Search'}\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(client.info())"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "d0a03898",
+ "metadata": {},
+ "source": [
+ "Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.\n",
+ "\n",
+ "Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#auth-apikey to learn how to connect using API keys."
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "b5508979", + "metadata": {}, + "source": [ + "## Indexing a single document\n", + "\n", + "Let's start by indexing a single document.\n", + "To index a document, you need to specify three pieces of information:\n", + "- the Elasticsearch `index` to index the document into\n", + "- the document's `id` (optional) - If you don't specify an id, Elasticsearch will generate a random one for you\n", + "- the document itself (here we store this as a Python dictionary named `doc`)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "6fd10c6d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document: created\n" + ] + } + ], + "source": [ + "doc = {\n", + " 'author': 'john_smith',\n", + " 'text': \"This is a lovely document, but it's a bit short.\",\n", + " 'timestamp': datetime.now(),\n", + "}\n", + "resp = client.index(index=\"test-index\", id=1, document=doc)\n", + "print(\"Document: \" + resp[\"result\"])" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "658f0f4e", + "metadata": {}, + "source": [ + "### Updating a document\n", + "\n", + "If you index a document with an id that already exists, Elasticsearch will update the existing document." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "ac6995f8",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Document: updated\n"
+ ]
+ }
+ ],
+ "source": [
+ "doc = {\n",
+ " 'author': 'john_smith',\n",
+ " 'text': \"This is a lovely document, and now it's a little bit longer which is great.\",\n",
+ " 'timestamp': datetime.now(),\n",
+ "}\n",
+ "resp = client.index(index=\"test-index\", id=1, document=doc)\n",
+ "print(\"Document: \" + resp[\"result\"])"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "0a498c7d",
+ "metadata": {},
+ "source": [
+ "### Deleting a document\n",
+ "\n",
+ "You can delete a document by specifying its `index` and `id` in the `delete()` method:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "09fed5f3",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Document: deleted\n"
+ ]
+ }
+ ],
+ "source": [
+ "resp = client.delete(index=\"test-index\", id=1)\n",
+ "print(\"Document: \" + resp[\"result\"])"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "d532a317",
+ "metadata": {},
+ "source": [
+ "## Indexing with the bulk API\n",
+ "\n",
+ "You can also index multiple documents at once using the [bulk API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html).\n",
+ "We recommend using the bulk API where possible, as it batches multiple operations into a single request for better throughput.\n",
+ "\n",
+ "Here's an example of indexing multiple documents using the bulk API.\n",
+ "We have some test data in a `json` file at this [URL](https://raw.githubusercontent.com/leemthompo/notebook-tests/main/12-movies.json).\n",
+ "Let's load that into our Elastic deployment.\n",
+ "First we'll create an index named `movies` to store that data."
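As background, the bulk REST endpoint itself accepts newline-delimited JSON: one action/metadata line followed by one document source line, per document. `helpers.bulk` builds this for you, but a pure-Python sketch of the wire format (with made-up sample documents, not from the movies dataset) can help demystify what is sent:

```python
import json

# Helper-style actions, like the ones helpers.bulk consumes.
# The sample documents here are made up for illustration.
actions = [
    {"_index": "movies", "_source": {"title": "Example Film", "released": 1999}},
    {"_index": "movies", "_source": {"title": "Another Film", "released": 2004}},
]

# The bulk wire format: an action line, then the document source,
# repeated per document, with a trailing newline at the end.
lines = []
for action in actions:
    lines.append(json.dumps({"index": {"_index": action["_index"]}}))
    lines.append(json.dumps(action["_source"]))
bulk_body = "\n".join(lines) + "\n"

print(bulk_body)
```

Batching like this is why the bulk API beats one `index()` call per document: the request overhead is paid once per batch instead of once per document.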
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "7fdf272e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'movies'})" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client.indices.create(\n", + " index=\"movies\",\n", + " mappings= {\n", + " \"properties\": {\n", + " \"genre\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"keyScene\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"plot\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"released\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"runtime\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"title\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " }\n", + " }\n", + "})" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "2cc775d1", + "metadata": {}, + "source": [ + "Let's upload the JSON data.\n", + "The dataset provides information on twelve films.\n", + "Each film's entry includes its title, runtime, plot summary, a key scene, genre classification, and release year." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "97b037c9",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Done indexing documents into index!\n"
+ ]
+ }
+ ],
+ "source": [
+ "import json\n",
+ "from urllib.request import urlopen\n",
+ "from urllib.error import URLError\n",
+ "\n",
+ "url = \"https://raw.githubusercontent.com/leemthompo/notebook-tests/main/12-movies.json\"\n",
+ "\n",
+ "try:\n",
+ " # Send a request to the URL and get the response\n",
+ " response = urlopen(url)\n",
+ "\n",
+ " # Load the response data into a JSON object\n",
+ " data_json = json.loads(response.read())\n",
+ "\n",
+ " def create_index_body(doc):\n",
+ " \"\"\" Generate the body for an Elasticsearch document. \"\"\"\n",
+ " return {\n",
+ " \"_index\": \"movies\",\n",
+ " \"_source\": doc,\n",
+ " }\n",
+ "\n",
+ " # Prepare the documents to be indexed\n",
+ " documents = [create_index_body(doc) for doc in data_json]\n",
+ "\n",
+ " try:\n",
+ " # Use helpers.bulk to index\n",
+ " helpers.bulk(client, documents)\n",
+ " print(\"Done indexing documents into index!\")\n",
+ " except helpers.BulkIndexError as es_e:\n",
+ " print(f\"Elasticsearch error: {es_e}\")\n",
+ " except Exception as e:\n",
+ " print(f\"Unknown error occurred during indexing: {e}\")\n",
+ "\n",
+ "except URLError as url_e:\n",
+ " print(f\"Error fetching data from URL: {url_e}\")\n",
+ "except json.JSONDecodeError as json_e:\n",
+ " print(f\"Error decoding JSON data: {json_e}\")\n",
+ "except Exception as e:\n",
+ " print(f\"Unknown error occurred: {e}\")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "5f9823ba",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Successfully deleted index: movies\n"
+ ]
+ }
+ ],
+ "source": [
+ "## Delete index \n",
+ "\n",
+ "index_name = 'movies'\n",
+ "\n",
+ "# Delete the index\n",
+ "try:\n",
+ "
 client.indices.delete(index=index_name)\n",
+ " print(f'Successfully deleted index: {index_name}')\n",
+ "except Exception as e:\n",
+ " print(f'Error deleting index: {index_name}, error: {str(e)}')"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
From 772a859a41b8544076e59e7f7b597dd8c2ca69ec Mon Sep 17 00:00:00 2001
From: Liam Thompson
Date: Fri, 4 Aug 2023 13:56:02 +0200
Subject: [PATCH 2/2] Add notebook, test action

---
 transcriptions-elasticsearch.ipynb | 586 +++++++++++++++++++++++++++++
 1 file changed, 586 insertions(+)
 create mode 100644 transcriptions-elasticsearch.ipynb

diff --git a/transcriptions-elasticsearch.ipynb b/transcriptions-elasticsearch.ipynb
new file mode 100644
index 0000000..bd2d1f1
--- /dev/null
+++ b/transcriptions-elasticsearch.ipynb
@@ -0,0 +1,586 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "87773ce7",
+ "metadata": {},
+ "source": [
+ "# Tutorial: Search audio transcriptions with Elasticsearch\n",
+ "\n",
+ "## What problem are we solving?\n",
+ "\n",
+ "Your organization likely has a lot of unstructured data, such as audio from recorded meetings, which is difficult to search.\n",
+ "Tools like Zoom and Teams have audio transcription features today, but they have two major limitations:\n",
+ "\n",
+ "- They are not very accurate, especially for technical terms and non-native English accents.\n",
+ "- They are not easily searchable outside of the meeting platform.\n",
+ "\n",
+ "This tutorial will show you how to use a state-of-the-art AI model to generate accurate transcriptions from audio files and sync them to an 
Elasticsearch index.\n",
+ "You'll be able to scale this approach up to keep track of all your organization's audio data, and search it from a single place.\n",
+ "This is a powerful way to make an important part of your organization's knowledge base more accessible.\n",
+ "You'll be able to use this tutorial as a blueprint for building search experiences for other types of unstructured data, such as images, video, and text.\n",
+ "\n",
+ "## What you'll learn\n",
+ "\n",
+ "This tutorial will walk you through the following steps:\n",
+ "\n",
+ "1. Generate transcriptions from an audio file using the OpenAI [Whisper](https://openai.com/blog/whisper/) model [API](https://platform.openai.com/docs/api-reference/audio) in Python.\n",
+ "2. Sync the transcriptions to an Elasticsearch index, using the official [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#auth-apikey).\n",
+ "3. Query the index to retrieve transcriptions, using a hybrid search (vector-based semantic search + keyword search) strategy.\n",
+ "4. Use an Elastic Search UI to easily search the transcriptions.\n",
+ "5. 
🎁 **Bonus**: We'll show you how to summarize your transcription results using the Hugging Face [BART model](https://huggingface.co/transformers/model_doc/bart.html#bartsummarizationpipeline).\n",
+ "Use this to get a quick overview of the contents of your audio files, and to find the most relevant ones.\n",
+ "We can update the documents that contain transcriptions in the Elasticsearch index with a `summary` field, making the summaries searchable too.\n",
+ "\n",
+ "🏃🏽‍♀️ Run this notebook:\n",
+ "\n",
+ "- Locally using [jupyter](https://docs.jupyter.org/en/latest/install.html)\n",
+ "- Online using [Google Colab](https://colab.research.google.com/?hl=en)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "d0f7dafa",
+ "metadata": {},
+ "source": [
+ "## 🧰 Requirements\n",
+ "\n",
+ "For this example, you will need:\n",
+ "\n",
+ "- Python 3.6 or later\n",
+ "- An Elastic deployment\n",
+ " - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?elektra=en-ess-sign-up-page))\n",
+ "- The [Elastic Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/installation.html)\n",
+ "- The [OpenAI Python client](https://github.com/openai/openai-python)\n",
+ "- An OpenAI API key\n",
+ " - You can get one by [signing up for the OpenAI API](https://beta.openai.com/)\n",
+ "- (_Optional for bonus section_) The [`huggingface_hub` library](https://huggingface.co/docs/huggingface_hub/quick-start)\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "9cf3cbb5",
+ "metadata": {},
+ "source": [
+ "## Create Elastic Cloud deployment\n",
+ "\n",
+ "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.\n",
+ "\n",
+ "- Go to the [Create 
deployment](https://cloud.elastic.co/deployments/create) page\n",
+ " - Select **Create deployment**"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "d7076a7a",
+ "metadata": {},
+ "source": [
+ "## Install packages and import modules\n",
+ "\n",
+ "To get started, we'll need to connect to our Elastic deployment using the Python client.\n",
+ "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n",
+ "\n",
+ "First we need to `pip` install the following packages:\n",
+ "\n",
+ "- `elasticsearch`\n",
+ "- `openai`\n",
+ "- `huggingface-hub`\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "6e237928",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: elasticsearch in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (8.8.0)\n",
+ "Requirement already satisfied: elastic-transport<9,>=8 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elasticsearch) (8.4.0)\n",
+ "Requirement already satisfied: urllib3<2,>=1.26.2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elastic-transport<9,>=8->elasticsearch) (1.26.16)\n",
+ "Requirement already satisfied: certifi in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elastic-transport<9,>=8->elasticsearch) (2023.5.7)\n",
+ "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n",
+ "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n",
+ "Collecting openai\n",
+ " Downloading openai-0.27.8-py3-none-any.whl (73 kB)\n",
+ "\u001b[K |████████████████████████████████| 73 kB 6.2 MB/s eta 0:00:01\n",
+ "\u001b[?25hCollecting aiohttp\n",
+ " Downloading aiohttp-3.8.4-cp39-cp39-macosx_11_0_arm64.whl (338 kB)\n",
+ "\u001b[K |████████████████████████████████| 338 kB 28.9 MB/s 
eta 0:00:01\n", + "\u001b[?25hRequirement already satisfied: requests>=2.20 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from openai) (2.31.0)\n", + "Requirement already satisfied: tqdm in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from openai) (4.65.0)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests>=2.20->openai) (3.1.0)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests>=2.20->openai) (1.26.16)\n", + "Requirement already satisfied: idna<4,>=2.5 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests>=2.20->openai) (3.4)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests>=2.20->openai) (2023.5.7)\n", + "Collecting yarl<2.0,>=1.0\n", + " Downloading yarl-1.9.2-cp39-cp39-macosx_11_0_arm64.whl (62 kB)\n", + "\u001b[K |████████████████████████████████| 62 kB 5.6 MB/s eta 0:00:01\n", + "\u001b[?25hCollecting aiosignal>=1.1.2\n", + " Using cached aiosignal-1.3.1-py3-none-any.whl (7.6 kB)\n", + "Collecting frozenlist>=1.1.1\n", + " Downloading frozenlist-1.3.3-cp39-cp39-macosx_11_0_arm64.whl (35 kB)\n", + "Collecting attrs>=17.3.0\n", + " Using cached attrs-23.1.0-py3-none-any.whl (61 kB)\n", + "Collecting multidict<7.0,>=4.5\n", + " Downloading multidict-6.0.4-cp39-cp39-macosx_11_0_arm64.whl (29 kB)\n", + "Collecting async-timeout<5.0,>=4.0.0a3\n", + " Using cached async_timeout-4.0.2-py3-none-any.whl (5.8 kB)\n", + "Installing collected packages: multidict, frozenlist, yarl, attrs, async-timeout, aiosignal, aiohttp, openai\n", + "Successfully installed aiohttp-3.8.4 aiosignal-1.3.1 async-timeout-4.0.2 attrs-23.1.0 frozenlist-1.3.3 multidict-6.0.4 openai-0.27.8 yarl-1.9.2\n", + 
"\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n", + "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n", + "Requirement already satisfied: huggingface-hub in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (0.15.1)\n", + "Requirement already satisfied: filelock in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (3.12.2)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (4.6.3)\n", + "Requirement already satisfied: fsspec in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (2023.6.0)\n", + "Requirement already satisfied: packaging>=20.9 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (23.1)\n", + "Requirement already satisfied: tqdm>=4.42.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (4.65.0)\n", + "Requirement already satisfied: pyyaml>=5.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (6.0)\n", + "Requirement already satisfied: requests in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (2.31.0)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub) (2023.5.7)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub) (1.26.16)\n", + "Requirement already satisfied: idna<4,>=2.5 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub) (3.4)\n", + "Requirement 
already satisfied: charset-normalizer<4,>=2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub) (3.1.0)\n", + "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n", + "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n" + ] + } + ], + "source": [ + "!pip install elasticsearch\n", + "!pip install --upgrade openai\n", + "!pip install huggingface-hub" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "cccf5bf5", + "metadata": {}, + "source": [ + "Next we need to import the modules we need." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "8ed40603", + "metadata": {}, + "outputs": [], + "source": [ + "from elasticsearch import Elasticsearch, helpers\n", + "import openai\n", + "import huggingface_hub # optional for step 5\n", + "from urllib.request import urlopen\n", + "import getpass\n", + "import requests" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "4c25c2a9", + "metadata": {}, + "source": [ + "## Transcribe audio file(s)\n", + "\n", + "We need some sample audio files to transcribe.\n", + "We're going to use a podcast interview with Brian Kernighan available in MP3 format at this [URL](https://op3.dev/e/https://cdn.changelog.com/uploads/podcast/484/the-changelog-484.mp3). \n", + "The interview is about 96 minutes long.\n", + "First let's download the file and save it locally.\n", + "In your organization you might have audio files stored in a cloud storage bucket, or in a database.\n", + "You can adapt the code below to read the audio file from your storage system." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "af1eef70",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Downloading file into /Users/liamthompson/notebook-tests\n"
+ ]
+ }
+ ],
+ "source": [
+ "import os # use this to get the current user's current working directory\n",
+ "\n",
+ "url = \"https://op3.dev/e/https://cdn.changelog.com/uploads/podcast/484/the-changelog-484.mp3\"\n",
+ "\n",
+ "# Download the file using the URL with the requests library\n",
+ "# File will be saved in the current working directory\n",
+ "\n",
+ "pwd = os.getcwd()\n",
+ "\n",
+ "r = requests.get(url)\n",
+ "with open(\"kernighan.mp3\", \"wb\") as file:\n",
+ " file.write(r.content)\n",
+ "print(f\"Downloading file into {pwd}\")\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "91962d3b",
+ "metadata": {},
+ "source": [
+ "## Transcribe the audio file\n",
+ "\n",
+ "Now that we've got our sample audio file, let's transcribe it using the OpenAI API.\n",
+ "We'll use the [Whisper](https://openai.com/blog/whisper/) model.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "1e03999a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "openai.api_key = getpass.getpass(\"Enter your OpenAI API key: \")\n",
+ "\n",
+ "audio_file = open(\"kernighan.mp3\", \"rb\") # change this to the path of your audio file\n",
+ "\n",
+ "transcription = openai.Audio.transcribe(\"whisper-1\", audio_file)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "e1ab9c20",
+ "metadata": {},
+ "source": [
+ "Let's see what our transcription looks like:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "e2185211",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "<class 'openai.openai_object.OpenAIObject'>\n"
+ ]
+ }
+ ],
+ "source": [
+ 
"print(type(transcription))\n",
+ "\n",
+ "# save the transcription to a file\n",
+ "\n",
+ "with open(\"kernighan-transcription.json\", \"w\") as file:\n",
+ " file.write(str(transcription))"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "91b466d6",
+ "metadata": {},
+ "source": [
+ "## Connect Elasticsearch client\n",
+ "\n",
+ "Cool, we have our transcription!\n",
+ "Let's connect our Elasticsearch Python client to our Elastic deployment, so we can sync the transcription to an index."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "38679016",
+ "metadata": {},
+ "source": [
+ "## Initialize the Elasticsearch client\n",
+ "\n",
+ "Now we can instantiate the Elasticsearch client.\n",
+ "First we prompt the user for their password and Cloud ID.\n",
+ "\n",
+ "🔐 NOTE: `getpass` enables us to securely prompt for credentials without echoing them to the terminal or hard-coding them in the notebook.\n",
+ "\n",
+ "Then we create a `client` object: an instance of the `Elasticsearch` class."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "145a1222",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Found in the 'Manage Deployment' page\n",
+ "CLOUD_ID = getpass.getpass('Enter Elastic Cloud ID: ')\n",
+ "\n",
+ "# Password for the 'elastic' user generated by Elasticsearch\n",
+ "ELASTIC_PASSWORD = getpass.getpass('Enter Elastic password: ')\n",
+ "\n",
+ "# Create the client instance\n",
+ "client = Elasticsearch(\n",
+ " cloud_id=CLOUD_ID,\n",
+ " basic_auth=(\"elastic\", ELASTIC_PASSWORD)\n",
+ ")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "555fbc67",
+ "metadata": {},
+ "source": [
+ "Confirm that the client has connected with this test."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "92afc4a9",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'name': 'instance-0000000000', 'cluster_name': '9dd1e5c0b0d64796b8cf0746cf63d734', 'cluster_uuid': 'VeYvw6JhQcC3P-Q1-L9P_w', 'version': {'number': '8.9.0-SNAPSHOT', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'ac7d79178c3e57c935358453331efe9e9cc5104d', 'build_date': '2023-06-21T09:08:25.219504984Z', 'build_snapshot': True, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0', 'transport_version': '8500019'}, 'tagline': 'You Know, for Search'}\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(client.info())"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "d0a03898",
+ "metadata": {},
+ "source": [
+ "Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.\n",
+ "\n",
+ "Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#auth-apikey to learn how to connect using API keys."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "86945aaf",
+ "metadata": {},
+ "source": [
+ "## Index the transcription into Elasticsearch\n",
+ "\n",
+ "Now we can create an index to store our transcriptions and index our first document."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "c59aa463",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/var/folders/vz/v2f6_x6s0kg51j2vbm5rlhww0000gn/T/ipykernel_71565/2436590366.py:3: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. Instead use the 'document' parameter. 
See https://github.com/elastic/elasticsearch-py/issues/1698 for more information\n", + " client.index(index=\"transcriptions\", id=1, body=str(transcription))\n" + ] + }, + { + "data": { + "text/plain": [ + "ObjectApiResponse({'_index': 'transcriptions', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 0, '_primary_term': 1})" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# client.indices.create(index=\"transcriptions\", ignore=400)\n", + "\n", + "client.index(index=\"transcriptions\", id=1, body=str(transcription))" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "944cff74", + "metadata": {}, + "source": [ + "## Aside: Pretty printing Elasticsearch responses\n", + "\n", + "Your API calls will return hard-to-read nested JSON.\n", + "We'll create a little function called `pretty_response` to return nice, human-readable outputs from our examples." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "21c2a6fc", + "metadata": {}, + "outputs": [], + "source": [ + "def insert_newlines(string, every=64):\n", + " return '\\n'.join(string[i:i+every] for i in range(0, len(string), every))\n", + "\n", + "def pretty_response(response):\n", + " for hit in response['hits']['hits']:\n", + " text = hit['_source']['text']\n", + " highlight = hit['highlight']['text']\n", + " pretty_output = (f\"\\nText: {text} \\n\\nHighlight: {highlight}\")\n", + " print(insert_newlines(pretty_output))\n", + "\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "f160acd0", + "metadata": {}, + "source": [ + "## Query the index\n", + "\n", + "Now we can query the index to search our transcription.\n", + "Let's start with a simple keyword search.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "5ea91bfb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Text: You know, is he a standout in terms of just like once in \n", + "a generation kind of a software developer or Are there a lot of \n", + "people that you've seen that have been just as good as he was bu\n", + "t he happened to have that Nugget, you know, he had to be the ri\n", + "ght place the right time with the right idea and the right peopl\n", + "e. I think He's a singularity.
I have never seen anybody else wh\n", + "o's in the same league as him You know, I've certainly met a lot\n", + " of programmers who are very good Yeah, and you know some of my \n", + "students sure the people I worked with at Bell Labs very good Bu\n", + "t I can is in a different universe entirely as far as I can tell\n", + " and it's a combination of a bunch of things I mean just being a\n", + "ble to write code very quickly that works Very very well done co\n", + "de but also this insight into solving the right problem in a The\n", + " right way and just doing that repeatedly over all kinds of diff\n", + "erent domains I've never seen anybody remotely like that in any \n", + "setting at all he You know one night he and Joe Condon and I we \n", + "had gotten a new typesetter at Bell Labs It was basically a devi\n", + "ce controlled by a very small computer inside a computer automat\n", + "ion naked mini if you wish to know I'll be no just a generic kin\n", + "d of mediocre 16-bit Computer and it came the typesetter came wi\n", + "th really awful software And so you couldn't figure out what was\n", + " going on. And of course, you didn't get source code You just go\n", + "t more at something that ran and so Ken and Joe and I were puzzl\n", + "ing over what to do with this thing And I late afternoon. 
I said\n", + " I'm going home for dinner I'll be back in a while and I came ba\n", + "ck at sort of seven or eight o'clock at night and Ken had writte\n", + "n a Disassembler for this thing so that he could see what the as\n", + "sembly language was so that he could then start to write well, o\n", + "f course now you write the assembler and then you And you know t\n", + "hat kind of thing where in a couple of hours He had built a fund\n", + "amental tool that was then our first toehold and understanding m\n", + "achine now, you know Writing a disassembler is not rocket scienc\n", + "e But on the other hand to put it together that quickly and accu\n", + "rately on the basis of very little information Now this is befor\n", + "e the internet when you couldn't just sort of go and Google for \n", + "what's the opcode set of this machine? You had to find manuals a\n", + "nd it's always kind of thing So now off scale and he could just \n", + "kept doing that over such a wide domain of things I mean we thin\n", + "k of Unix, but he did all this work on the chess machine where h\n", + "e had the first Master level chess computer. 
That was his softwa\n", + "re and he wrote a lot of the CAD tools that made it go as well A\n", + "nd you know He built a thing that was like the Sony Walkman with\n", + " an mp3 like encoding before anybody else did because he talked \n", + "to the people who knew how to do speech coding down the hall is \n", + "on and on and on and you've said before that Programming is not \n", + "just a science but also an art Which leads me to believe that fo\n", + "r some reason Ken was blessed with this art side of the of the s\n", + "cience So you can know how to program you can know how to progra\n", + "m well with less bugs But to be able to apply the thinking to a \n", + "problem set in the ways you described Ken What do you think you \n", + "know without describing his you know for lack of better terms ge\n", + "nius What do you think helped him have that mindset? Like what h\n", + "ow did he begin to solve a problem? Do you think? You know, I ac\n", + "tually don't know I suspect part of it is that he had just been \n", + "interested in all kinds of things And you know I didn't meet him\n", + " until he and I arrived he arrived at labs a couple years before\n", + " I did and Then we were in the same group for many years, but hi\n", + "s background I think originally was electrical engineering He wa\n", + "s much more of a hardware person. In fact than a software person\n", + " originally And perhaps that gave him a different perspective on\n", + " how things work or at least a broader Perspective. 
I don't know\n", + " about let's say his mathematical background But for example, yo\n", + "u mentioned this art and science he built a regular expression r\n", + "ecognizer, which is \n", + "\n", + "Highlight: ['You know, is he a standout in\n", + " terms of just like once in a generation kind of a soft\n", + "ware developer or']\n" + ] + } + ], + "source": [ + "response = client.search(index=\"transcriptions\",\n", + " query= {\n", + " \"match\": {\n", + " \"text\": \"generation\"\n", + " }\n", + " },\n", + " highlight={\n", + " \"fields\": {\n", + " \"text\": {}\n", + " }\n", + " })\n", + "pretty_response(response)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}