From 4e0985c2f4735a9b5745b8dbe214708a3a747056 Mon Sep 17 00:00:00 2001 From: Liam Thompson Date: Fri, 4 Aug 2023 13:50:09 +0200 Subject: [PATCH 1/2] Add notebook to test action --- elasticsearch-basics-indexing.ipynb | 491 ++++++++++++++++++++++++++++ 1 file changed, 491 insertions(+) create mode 100644 elasticsearch-basics-indexing.ipynb diff --git a/elasticsearch-basics-indexing.ipynb b/elasticsearch-basics-indexing.ipynb new file mode 100644 index 0000000..b81c7a8 --- /dev/null +++ b/elasticsearch-basics-indexing.ipynb @@ -0,0 +1,491 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "87773ce7", + "metadata": {}, + "source": [ + "# Elasticsearch basics: indexing\n", + "\n", + "This notebook contains a basic introduction to indexing documents into Elasticsearch, using the Python client.\n", + "This is an interactive notebook, so you can run the code and experiment with it!\n", + "\n", + "Run this notebook:\n", + "\n", + "- Locally using [jupyter](https://docs.jupyter.org/en/latest/install.html)\n", + "- Online using [Google Colab](https://colab.research.google.com/?hl=en)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "d0f7dafa", + "metadata": {}, + "source": [ + "## 🧰 Requirements\n", + "\n", + "For this example, you will need:\n", + "\n", + "- Python 3.6 or later\n", + "- An Elastic deployment\n", + " - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?elektra=en-ess-sign-up-page))\n", + "- The [Elastic Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/installation.html)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "9cf3cbb5", + "metadata": {}, + "source": [ + "## Create Elastic Cloud deployment\n", + "\n", + "If you don't have an Elastic Cloud deployment, sign up 
[here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.\n", + "\n", + "- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page\n", + " - Select **Create deployment**" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "d7076a7a", + "metadata": {}, + "source": [ + "## Install packages and import modules\n", + "\n", + "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", + "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n", + "\n", + "First we need to `pip` install the following packages:\n", + "\n", + "- `elasticsearch`\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "6e237928", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: elasticsearch in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (8.8.0)\n", + "Requirement already satisfied: elastic-transport<9,>=8 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elasticsearch) (8.4.0)\n", + "Requirement already satisfied: certifi in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elastic-transport<9,>=8->elasticsearch) (2023.5.7)\n", + "Requirement already satisfied: urllib3<2,>=1.26.2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elastic-transport<9,>=8->elasticsearch) (1.26.16)\n", + "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n", + "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n" + ] + } + ], + "source": [ + "!pip install elasticsearch" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "cccf5bf5", + "metadata": {}, + "source": [ + "Next we need to import the modules we need." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "8ed40603",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from elasticsearch import Elasticsearch, helpers\n",
+ "from urllib.request import urlopen\n",
+ "import getpass\n",
+ "import json\n",
+ "from datetime import datetime"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "38679016",
+ "metadata": {},
+ "source": [
+ "## Initialize the Elasticsearch client\n",
+ "\n",
+ "Now we can instantiate the Elasticsearch client.\n",
+ "First we prompt the user for their password and Cloud ID.\n",
+ "\n",
+ "🔐 NOTE: `getpass` enables us to securely prompt for credentials without echoing them to the terminal or hard-coding them in the notebook.\n",
+ "\n",
+ "Then we create a `client` object: an instance of the `Elasticsearch` class."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "145a1222",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Found in the 'Manage Deployment' page\n",
+ "CLOUD_ID = getpass.getpass('Enter Elastic Cloud ID: ')\n",
+ "\n",
+ "# Password for the 'elastic' user generated by Elasticsearch\n",
+ "ELASTIC_PASSWORD = getpass.getpass('Enter Elastic password: ')\n",
+ "\n",
+ "# Create the client instance\n",
+ "client = Elasticsearch(\n",
+ " cloud_id=CLOUD_ID,\n",
+ " basic_auth=(\"elastic\", ELASTIC_PASSWORD)\n",
+ ")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "555fbc67",
+ "metadata": {},
+ "source": [
+ "Confirm that the client has connected with this test."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "92afc4a9",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'name': 'instance-0000000001', 'cluster_name': '9dd1e5c0b0d64796b8cf0746cf63d734', 'cluster_uuid': 'VeYvw6JhQcC3P-Q1-L9P_w', 'version': {'number': '8.9.0-SNAPSHOT', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'ac7d79178c3e57c935358453331efe9e9cc5104d', 'build_date': '2023-06-21T09:08:25.219504984Z', 'build_snapshot': True, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0', 'transport_version': '8500019'}, 'tagline': 'You Know, for Search'}\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(client.info())"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "d0a03898",
+ "metadata": {},
+ "source": [
+ "Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.\n",
+ "\n",
+ "Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#auth-apikey to learn how to connect using API keys."
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "b5508979", + "metadata": {}, + "source": [ + "## Indexing a single document\n", + "\n", + "Let's start by indexing a single document.\n", + "To index a document, you need to specify three pieces of information:\n", + "- the Elasticsearch `index` to index the document into\n", + "- the document's `id` (optional) - If you don't specify an id, Elasticsearch will generate a random one for you\n", + "- the document itself (here we store this as a Python dictionary named `doc`)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "6fd10c6d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document: created\n" + ] + } + ], + "source": [ + "doc = {\n", + " 'author': 'john_smith',\n", + " 'text': \"This is a lovely document, but it's a bit short.\",\n", + " 'timestamp': datetime.now(),\n", + "}\n", + "resp = client.index(index=\"test-index\", id=1, document=doc)\n", + "print(\"Document: \" + resp[\"result\"])" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "658f0f4e", + "metadata": {}, + "source": [ + "### Updating a document\n", + "\n", + "If you index a document with an id that already exists, Elasticsearch will update the existing document." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "ac6995f8",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Document: updated\n"
+ ]
+ }
+ ],
+ "source": [
+ "doc = {\n",
+ " 'author': 'john_smith',\n",
+ " 'text': \"This is a lovely document, and now it's a little bit longer which is great.\",\n",
+ " 'timestamp': datetime.now(),\n",
+ "}\n",
+ "resp = client.index(index=\"test-index\", id=1, document=doc)\n",
+ "print(\"Document: \" + resp[\"result\"])"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "0a498c7d",
+ "metadata": {},
+ "source": [
+ "### Deleting a document\n",
+ "\n",
+ "You can delete a document by specifying its `index` and `id` in the `delete()` method:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "09fed5f3",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Document: deleted\n"
+ ]
+ }
+ ],
+ "source": [
+ "resp = client.delete(index=\"test-index\", id=1)\n",
+ "print(\"Document: \" + resp[\"result\"])"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "d532a317",
+ "metadata": {},
+ "source": [
+ "## Indexing with the bulk API\n",
+ "\n",
+ "You can also index multiple documents at once using the [bulk API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html).\n",
+ "We recommend using the bulk API where possible, as it batches multiple operations into a single request for better throughput.\n",
+ "\n",
+ "Here's an example of indexing multiple documents using the bulk API.\n",
+ "We have some test data in a `json` file at this [URL](https://raw.githubusercontent.com/leemthompo/notebook-tests/main/12-movies.json).\n",
+ "Let's load that into our Elastic deployment.\n",
+ "First we'll create an index named `movies` to store that data."
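As background, the bulk REST endpoint itself accepts newline-delimited JSON: one action/metadata line followed by one document source line, per document. `helpers.bulk` builds this for you, but a pure-Python sketch of the wire format (with made-up sample documents, not from the movies dataset) can help demystify what is sent:

```python
import json

# Helper-style actions, like the ones helpers.bulk consumes.
# The sample documents here are made up for illustration.
actions = [
    {"_index": "movies", "_source": {"title": "Example Film", "released": 1999}},
    {"_index": "movies", "_source": {"title": "Another Film", "released": 2004}},
]

# The bulk wire format: an action line, then the document source,
# repeated per document, with a trailing newline at the end.
lines = []
for action in actions:
    lines.append(json.dumps({"index": {"_index": action["_index"]}}))
    lines.append(json.dumps(action["_source"]))
bulk_body = "\n".join(lines) + "\n"

print(bulk_body)
```

Batching like this is why the bulk API beats one `index()` call per document: the request overhead is paid once per batch instead of once per document.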
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "7fdf272e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'movies'})" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client.indices.create(\n", + " index=\"movies\",\n", + " mappings= {\n", + " \"properties\": {\n", + " \"genre\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"keyScene\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"plot\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " },\n", + " \"released\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"runtime\": {\n", + " \"type\": \"integer\"\n", + " },\n", + " \"title\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\n", + " \"keyword\": {\n", + " \"type\": \"keyword\",\n", + " \"ignore_above\": 256\n", + " }\n", + " }\n", + " }\n", + " }\n", + "})" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "2cc775d1", + "metadata": {}, + "source": [ + "Let's upload the JSON data.\n", + "The dataset provides information on twelve films.\n", + "Each film's entry includes its title, runtime, plot summary, a key scene, genre classification, and release year." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "97b037c9",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Done indexing documents into index!\n"
+ ]
+ }
+ ],
+ "source": [
+ "import json\n",
+ "from urllib.request import urlopen\n",
+ "from urllib.error import URLError\n",
+ "\n",
+ "url = \"https://raw.githubusercontent.com/leemthompo/notebook-tests/main/12-movies.json\"\n",
+ "\n",
+ "try:\n",
+ " # Send a request to the URL and get the response\n",
+ " response = urlopen(url)\n",
+ "\n",
+ " # Load the response data into a JSON object\n",
+ " data_json = json.loads(response.read())\n",
+ "\n",
+ " def create_index_body(doc):\n",
+ " \"\"\" Generate the body for an Elasticsearch document. \"\"\"\n",
+ " return {\n",
+ " \"_index\": \"movies\",\n",
+ " \"_source\": doc,\n",
+ " }\n",
+ "\n",
+ " # Prepare the documents to be indexed\n",
+ " documents = [create_index_body(doc) for doc in data_json]\n",
+ "\n",
+ " try:\n",
+ " # Use helpers.bulk to index\n",
+ " helpers.bulk(client, documents)\n",
+ " print(\"Done indexing documents into index!\")\n",
+ " except helpers.BulkIndexError as es_e:\n",
+ " print(f\"Elasticsearch error: {es_e}\")\n",
+ " except Exception as e:\n",
+ " print(f\"Unknown error occurred during indexing: {e}\")\n",
+ "\n",
+ "except URLError as url_e:\n",
+ " print(f\"Error fetching data from URL: {url_e}\")\n",
+ "except json.JSONDecodeError as json_e:\n",
+ " print(f\"Error decoding JSON data: {json_e}\")\n",
+ "except Exception as e:\n",
+ " print(f\"Unknown error occurred: {e}\")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "5f9823ba",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Successfully deleted index: movies\n"
+ ]
+ }
+ ],
+ "source": [
+ "## Delete index \n",
+ "\n",
+ "index_name = 'movies'\n",
+ "\n",
+ "# Delete the index\n",
+ "try:\n",
+ "
 client.indices.delete(index=index_name)\n",
+ " print(f'Successfully deleted index: {index_name}')\n",
+ "except Exception as e:\n",
+ " print(f'Error deleting index: {index_name}, error: {str(e)}')"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
From 772a859a41b8544076e59e7f7b597dd8c2ca69ec Mon Sep 17 00:00:00 2001
From: Liam Thompson
Date: Fri, 4 Aug 2023 13:56:02 +0200
Subject: [PATCH 2/2] Add notebook, test action

---
 transcriptions-elasticsearch.ipynb | 586 +++++++++++++++++++++++++++++
 1 file changed, 586 insertions(+)
 create mode 100644 transcriptions-elasticsearch.ipynb

diff --git a/transcriptions-elasticsearch.ipynb b/transcriptions-elasticsearch.ipynb
new file mode 100644
index 0000000..bd2d1f1
--- /dev/null
+++ b/transcriptions-elasticsearch.ipynb
@@ -0,0 +1,586 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "87773ce7",
+ "metadata": {},
+ "source": [
+ "# Tutorial: Search audio transcriptions with Elasticsearch\n",
+ "\n",
+ "## What problem are we solving?\n",
+ "\n",
+ "Your organization likely has a lot of unstructured data, such as audio from recorded meetings, which is difficult to search.\n",
+ "Tools like Zoom and Teams have audio transcription features today, but they have two major limitations:\n",
+ "\n",
+ "- They are not very accurate, especially for technical terms and non-native English accents.\n",
+ "- They are not easily searchable outside of the meeting platform.\n",
+ "\n",
+ "This tutorial will show you how to use a state-of-the-art AI model to generate accurate transcriptions from audio files and sync them to an 
Elasticsearch index.\n",
+ "You'll be able to scale this approach up to keep track of all your organization's audio data, and search it from a single place.\n",
+ "This is a powerful way to make an important part of your organization's knowledge base more accessible.\n",
+ "You'll be able to use this tutorial as a blueprint for building search experiences for other types of unstructured data, such as images, video, and text.\n",
+ "\n",
+ "## What you'll learn\n",
+ "\n",
+ "This tutorial will walk you through the following steps:\n",
+ "\n",
+ "1. Generate transcriptions from an audio file using the OpenAI [Whisper](https://openai.com/blog/whisper/) model [API](https://platform.openai.com/docs/api-reference/audio) in Python.\n",
+ "2. Sync the transcriptions to an Elasticsearch index, using the official [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#auth-apikey).\n",
+ "3. Query the index to retrieve transcriptions, using a hybrid search (vector-based semantic search + keyword search) strategy.\n",
+ "4. Use an Elastic Search UI to easily search the transcriptions.\n",
+ "5. 
🎁 **Bonus**: We'll show you how to summarize your transcription results using the Hugging Face [BART model](https://huggingface.co/transformers/model_doc/bart.html#bartsummarizationpipeline).\n",
+ "Use this to get a quick overview of the contents of your audio files, and to find the most relevant ones.\n",
+ "We can update the documents that contain transcriptions in the Elasticsearch index with a `summary` field, making the summaries searchable too.\n",
+ "\n",
+ "🏃🏽‍♀️ Run this notebook:\n",
+ "\n",
+ "- Locally using [jupyter](https://docs.jupyter.org/en/latest/install.html)\n",
+ "- Online using [Google Colab](https://colab.research.google.com/?hl=en)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "d0f7dafa",
+ "metadata": {},
+ "source": [
+ "## 🧰 Requirements\n",
+ "\n",
+ "For this example, you will need:\n",
+ "\n",
+ "- Python 3.6 or later\n",
+ "- An Elastic deployment\n",
+ " - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?elektra=en-ess-sign-up-page))\n",
+ "- The [Elastic Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/installation.html)\n",
+ "- The [OpenAI Python client](https://github.com/openai/openai-python)\n",
+ "- An OpenAI API key\n",
+ " - You can get one by [signing up for the OpenAI API](https://beta.openai.com/)\n",
+ "- (_Optional for bonus section_) The [`huggingface_hub` library](https://huggingface.co/docs/huggingface_hub/quick-start)\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "9cf3cbb5",
+ "metadata": {},
+ "source": [
+ "## Create Elastic Cloud deployment\n",
+ "\n",
+ "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.\n",
+ "\n",
+ "- Go to the [Create 
deployment](https://cloud.elastic.co/deployments/create) page\n",
+ " - Select **Create deployment**"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "d7076a7a",
+ "metadata": {},
+ "source": [
+ "## Install packages and import modules\n",
+ "\n",
+ "To get started, we'll need to connect to our Elastic deployment using the Python client.\n",
+ "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n",
+ "\n",
+ "First we need to `pip` install the following packages:\n",
+ "\n",
+ "- `elasticsearch`\n",
+ "- `openai`\n",
+ "- `huggingface-hub`\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "6e237928",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: elasticsearch in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (8.8.0)\n",
+ "Requirement already satisfied: elastic-transport<9,>=8 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elasticsearch) (8.4.0)\n",
+ "Requirement already satisfied: urllib3<2,>=1.26.2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elastic-transport<9,>=8->elasticsearch) (1.26.16)\n",
+ "Requirement already satisfied: certifi in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from elastic-transport<9,>=8->elasticsearch) (2023.5.7)\n",
+ "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n",
+ "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n",
+ "Collecting openai\n",
+ " Downloading openai-0.27.8-py3-none-any.whl (73 kB)\n",
+ "\u001b[K |████████████████████████████████| 73 kB 6.2 MB/s eta 0:00:01\n",
+ "\u001b[?25hCollecting aiohttp\n",
+ " Downloading aiohttp-3.8.4-cp39-cp39-macosx_11_0_arm64.whl (338 kB)\n",
+ "\u001b[K |████████████████████████████████| 338 kB 28.9 MB/s 
eta 0:00:01\n", + "\u001b[?25hRequirement already satisfied: requests>=2.20 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from openai) (2.31.0)\n", + "Requirement already satisfied: tqdm in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from openai) (4.65.0)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests>=2.20->openai) (3.1.0)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests>=2.20->openai) (1.26.16)\n", + "Requirement already satisfied: idna<4,>=2.5 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests>=2.20->openai) (3.4)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests>=2.20->openai) (2023.5.7)\n", + "Collecting yarl<2.0,>=1.0\n", + " Downloading yarl-1.9.2-cp39-cp39-macosx_11_0_arm64.whl (62 kB)\n", + "\u001b[K |████████████████████████████████| 62 kB 5.6 MB/s eta 0:00:01\n", + "\u001b[?25hCollecting aiosignal>=1.1.2\n", + " Using cached aiosignal-1.3.1-py3-none-any.whl (7.6 kB)\n", + "Collecting frozenlist>=1.1.1\n", + " Downloading frozenlist-1.3.3-cp39-cp39-macosx_11_0_arm64.whl (35 kB)\n", + "Collecting attrs>=17.3.0\n", + " Using cached attrs-23.1.0-py3-none-any.whl (61 kB)\n", + "Collecting multidict<7.0,>=4.5\n", + " Downloading multidict-6.0.4-cp39-cp39-macosx_11_0_arm64.whl (29 kB)\n", + "Collecting async-timeout<5.0,>=4.0.0a3\n", + " Using cached async_timeout-4.0.2-py3-none-any.whl (5.8 kB)\n", + "Installing collected packages: multidict, frozenlist, yarl, attrs, async-timeout, aiosignal, aiohttp, openai\n", + "Successfully installed aiohttp-3.8.4 aiosignal-1.3.1 async-timeout-4.0.2 attrs-23.1.0 frozenlist-1.3.3 multidict-6.0.4 openai-0.27.8 yarl-1.9.2\n", + 
"\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n", + "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n", + "Requirement already satisfied: huggingface-hub in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (0.15.1)\n", + "Requirement already satisfied: filelock in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (3.12.2)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (4.6.3)\n", + "Requirement already satisfied: fsspec in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (2023.6.0)\n", + "Requirement already satisfied: packaging>=20.9 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (23.1)\n", + "Requirement already satisfied: tqdm>=4.42.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (4.65.0)\n", + "Requirement already satisfied: pyyaml>=5.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (6.0)\n", + "Requirement already satisfied: requests in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from huggingface-hub) (2.31.0)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub) (2023.5.7)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub) (1.26.16)\n", + "Requirement already satisfied: idna<4,>=2.5 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub) (3.4)\n", + "Requirement 
already satisfied: charset-normalizer<4,>=2 in /Users/liamthompson/.pyenv/versions/3.9.7/lib/python3.9/site-packages (from requests->huggingface-hub) (3.1.0)\n", + "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.\n", + "You should consider upgrading via the '/Users/liamthompson/.pyenv/versions/3.9.7/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n" + ] + } + ], + "source": [ + "!pip install elasticsearch\n", + "!pip install --upgrade openai\n", + "!pip install huggingface-hub" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "cccf5bf5", + "metadata": {}, + "source": [ + "Next we need to import the modules we need." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "8ed40603", + "metadata": {}, + "outputs": [], + "source": [ + "from elasticsearch import Elasticsearch, helpers\n", + "import openai\n", + "import huggingface_hub # optional for step 5\n", + "from urllib.request import urlopen\n", + "import getpass\n", + "import requests" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "4c25c2a9", + "metadata": {}, + "source": [ + "## Transcribe audio file(s)\n", + "\n", + "We need some sample audio files to transcribe.\n", + "We're going to use a podcast interview with Brian Kernighan available in MP3 format at this [URL](https://op3.dev/e/https://cdn.changelog.com/uploads/podcast/484/the-changelog-484.mp3). \n", + "The interview is about 96 minutes long.\n", + "First let's download the file and save it locally.\n", + "In your organization you might have audio files stored in a cloud storage bucket, or in a database.\n", + "You can adapt the code below to read the audio file from your storage system." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "af1eef70",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Downloading file into /Users/liamthompson/notebook-tests\n"
+ ]
+ }
+ ],
+ "source": [
+ "import os # use this to get the current user's current working directory\n",
+ "\n",
+ "url = \"https://op3.dev/e/https://cdn.changelog.com/uploads/podcast/484/the-changelog-484.mp3\"\n",
+ "\n",
+ "# Download the file using the URL with the requests library\n",
+ "# File will be saved in the current working directory\n",
+ "\n",
+ "pwd = os.getcwd()\n",
+ "\n",
+ "r = requests.get(url)\n",
+ "with open(\"kernighan.mp3\", \"wb\") as file:\n",
+ " file.write(r.content)\n",
+ "print(f\"Downloading file into {pwd}\")\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "91962d3b",
+ "metadata": {},
+ "source": [
+ "## Transcribe the audio file\n",
+ "\n",
+ "Now that we've got our sample audio file, let's transcribe it using the OpenAI API.\n",
+ "We'll use the [Whisper](https://openai.com/blog/whisper/) model.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "1e03999a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "openai.api_key = getpass.getpass(\"Enter your OpenAI API key: \")\n",
+ "\n",
+ "audio_file = open(\"kernighan.mp3\", \"rb\") # change this to the path of your audio file\n",
+ "\n",
+ "transcription = openai.Audio.transcribe(\"whisper-1\", audio_file)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "e1ab9c20",
+ "metadata": {},
+ "source": [
+ "Let's see what our transcription looks like:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "e2185211",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "<class 'openai.openai_object.OpenAIObject'>\n"
+ ]
+ }
+ ],
+ "source": [
+ 
"print(type(transcription))\n",
+ "\n",
+ "# save the transcription to a file\n",
+ "\n",
+ "with open(\"kernighan-transcription.json\", \"w\") as file:\n",
+ " file.write(str(transcription))"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "91b466d6",
+ "metadata": {},
+ "source": [
+ "## Connect Elasticsearch client\n",
+ "\n",
+ "Cool, we have our transcription!\n",
+ "Let's connect our Elasticsearch Python client to our Elastic deployment, so we can sync the transcription to an index."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "38679016",
+ "metadata": {},
+ "source": [
+ "## Initialize the Elasticsearch client\n",
+ "\n",
+ "Now we can instantiate the Elasticsearch client.\n",
+ "First we prompt the user for their password and Cloud ID.\n",
+ "\n",
+ "🔐 NOTE: `getpass` enables us to securely prompt for credentials without echoing them to the terminal or hard-coding them in the notebook.\n",
+ "\n",
+ "Then we create a `client` object: an instance of the `Elasticsearch` class."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "145a1222",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Found in the 'Manage Deployment' page\n",
+ "CLOUD_ID = getpass.getpass('Enter Elastic Cloud ID: ')\n",
+ "\n",
+ "# Password for the 'elastic' user generated by Elasticsearch\n",
+ "ELASTIC_PASSWORD = getpass.getpass('Enter Elastic password: ')\n",
+ "\n",
+ "# Create the client instance\n",
+ "client = Elasticsearch(\n",
+ " cloud_id=CLOUD_ID,\n",
+ " basic_auth=(\"elastic\", ELASTIC_PASSWORD)\n",
+ ")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "555fbc67",
+ "metadata": {},
+ "source": [
+ "Confirm that the client has connected with this test."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "92afc4a9",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'name': 'instance-0000000000', 'cluster_name': '9dd1e5c0b0d64796b8cf0746cf63d734', 'cluster_uuid': 'VeYvw6JhQcC3P-Q1-L9P_w', 'version': {'number': '8.9.0-SNAPSHOT', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'ac7d79178c3e57c935358453331efe9e9cc5104d', 'build_date': '2023-06-21T09:08:25.219504984Z', 'build_snapshot': True, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0', 'transport_version': '8500019'}, 'tagline': 'You Know, for Search'}\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(client.info())"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "d0a03898",
+ "metadata": {},
+ "source": [
+ "Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.\n",
+ "\n",
+ "Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#auth-apikey to learn how to connect using API keys."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "86945aaf",
+ "metadata": {},
+ "source": [
+ "## Index the transcription into Elasticsearch\n",
+ "\n",
+ "Now we can create an index to store our transcriptions and index our first document."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "c59aa463",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/var/folders/vz/v2f6_x6s0kg51j2vbm5rlhww0000gn/T/ipykernel_71565/2436590366.py:3: DeprecationWarning: The 'body' parameter is deprecated and will be removed in a future version. Instead use the 'document' parameter. 
See https://github.com/elastic/elasticsearch-py/issues/1698 for more information\n", + " client.index(index=\"transcriptions\", id=1, body=str(transcription))\n" + ] + }, + { + "data": { + "text/plain": [ + "ObjectApiResponse({'_index': 'transcriptions', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 0, '_primary_term': 1})" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# client.indices.create(index=\"transcriptions\", ignore=400)\n", + "\n", + "client.index(index=\"transcriptions\", id=1, body=str(transcription))" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "944cff74", + "metadata": {}, + "source": [ + "## Aside: Pretty printing Elasticsearch responses\n", + "\n", + "Your API calls will return hard-to-read nested JSON.\n", + "We'll create a little function called `pretty_response` to return nice, human-readable outputs from our examples." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "21c2a6fc", + "metadata": {}, + "outputs": [], + "source": [ + "def insert_newlines(string, every=64):\n", + " return '\\n'.join(string[i:i+every] for i in range(0, len(string), every))\n", + "\n", + "def pretty_response(response):\n", + " for hit in response['hits']['hits']:\n", + " text = hit['_source']['text']\n", + " highlight = hit['highlight']['text']\n", + " pretty_output = (f\"\\nText: {text} \\n\\nHighlight: {highlight}\")\n", + " print(insert_newlines(pretty_output))\n", + "\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "f160acd0", + "metadata": {}, + "source": [ + "## Query the index\n", + "\n", + "Now we can query the index to search our transcription.\n", + "Let's start with a simple keyword search.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "5ea91bfb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Text: You know, is he a standout in terms of just like once in \n", + "a generation kind of a software developer or Are there a lot of \n", + "people that you've seen that have been just as good as he was bu\n", + "t he happened to have that Nugget, you know, he had to be the ri\n", + "ght place the right time with the right idea and the right peopl\n", + "e. I think He's a singularity.
I have never seen anybody else wh\n", + "o's in the same league as him You know, I've certainly met a lot\n", + " of programmers who are very good Yeah, and you know some of my \n", + "students sure the people I worked with at Bell Labs very good Bu\n", + "t I can is in a different universe entirely as far as I can tell\n", + " and it's a combination of a bunch of things I mean just being a\n", + "ble to write code very quickly that works Very very well done co\n", + "de but also this insight into solving the right problem in a The\n", + " right way and just doing that repeatedly over all kinds of diff\n", + "erent domains I've never seen anybody remotely like that in any \n", + "setting at all he You know one night he and Joe Condon and I we \n", + "had gotten a new typesetter at Bell Labs It was basically a devi\n", + "ce controlled by a very small computer inside a computer automat\n", + "ion naked mini if you wish to know I'll be no just a generic kin\n", + "d of mediocre 16-bit Computer and it came the typesetter came wi\n", + "th really awful software And so you couldn't figure out what was\n", + " going on. And of course, you didn't get source code You just go\n", + "t more at something that ran and so Ken and Joe and I were puzzl\n", + "ing over what to do with this thing And I late afternoon. 
I said\n", + " I'm going home for dinner I'll be back in a while and I came ba\n", + "ck at sort of seven or eight o'clock at night and Ken had writte\n", + "n a Disassembler for this thing so that he could see what the as\n", + "sembly language was so that he could then start to write well, o\n", + "f course now you write the assembler and then you And you know t\n", + "hat kind of thing where in a couple of hours He had built a fund\n", + "amental tool that was then our first toehold and understanding m\n", + "achine now, you know Writing a disassembler is not rocket scienc\n", + "e But on the other hand to put it together that quickly and accu\n", + "rately on the basis of very little information Now this is befor\n", + "e the internet when you couldn't just sort of go and Google for \n", + "what's the opcode set of this machine? You had to find manuals a\n", + "nd it's always kind of thing So now off scale and he could just \n", + "kept doing that over such a wide domain of things I mean we thin\n", + "k of Unix, but he did all this work on the chess machine where h\n", + "e had the first Master level chess computer. 
That was his softwa\n", + "re and he wrote a lot of the CAD tools that made it go as well A\n", + "nd you know He built a thing that was like the Sony Walkman with\n", + " an mp3 like encoding before anybody else did because he talked \n", + "to the people who knew how to do speech coding down the hall is \n", + "on and on and on and you've said before that Programming is not \n", + "just a science but also an art Which leads me to believe that fo\n", + "r some reason Ken was blessed with this art side of the of the s\n", + "cience So you can know how to program you can know how to progra\n", + "m well with less bugs But to be able to apply the thinking to a \n", + "problem set in the ways you described Ken What do you think you \n", + "know without describing his you know for lack of better terms ge\n", + "nius What do you think helped him have that mindset? Like what h\n", + "ow did he begin to solve a problem? Do you think? You know, I ac\n", + "tually don't know I suspect part of it is that he had just been \n", + "interested in all kinds of things And you know I didn't meet him\n", + " until he and I arrived he arrived at labs a couple years before\n", + " I did and Then we were in the same group for many years, but hi\n", + "s background I think originally was electrical engineering He wa\n", + "s much more of a hardware person. In fact than a software person\n", + " originally And perhaps that gave him a different perspective on\n", + " how things work or at least a broader Perspective. 
I don't know\n", + " about let's say his mathematical background But for example, yo\n", + "u mentioned this art and science he built a regular expression r\n", + "ecognizer, which is \n", + "\n", + "Highlight: ['You know, is he a standout in\n", + " terms of just like once in a generation kind of a soft\n", + "ware developer or']\n" + ] + } + ], + "source": [ + "response = client.search(index=\"transcriptions\",\n", + " query= {\n", + " \"match\": {\n", + " \"text\": \"generation\"\n", + " }\n", + " },\n", + " highlight={\n", + " \"fields\": {\n", + " \"text\": {}\n", + " }\n", + " })\n", + "pretty_response(response)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}