Add media description feature using Azure Content Understanding (Azure-Samples#2195)

* First pass
* CU kinda working
* CU integration
* Better splitting
* Add Bicep
* Rm unneeded figures
* Remove en-us from URLs
* Fix URLs
* Remote figures output JSON
* Update matrix comments
* Make mypy happy
* Add same errors to file strategy
* Add pymupdf to skip modules for mypy
* Output the endpoint from Bicep
* 100 percent coverage for mediadescriber.py
* Tests added for PDFParser
* Fix that tuple type
* Add pricing link
* Fix content read issue
pamelafox · Dec 9, 2024 · 0bb3f95
1 parent e90920f

Showing 36 changed files with 962 additions and 65 deletions.
diff --git a/.azdo/pipelines/azure-dev.yml b/.azdo/pipelines/azure-dev.yml
@@ -120,6 +120,7 @@ steps:
       DEPLOYMENT_TARGET: $(DEPLOYMENT_TARGET)
       AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: $(AZURE_CONTAINER_APPS_WORKLOAD_PROFILE)
       USE_CHAT_HISTORY_BROWSER: $(USE_CHAT_HISTORY_BROWSER)
+      USE_MEDIA_DESCRIBER_AZURE_CU: $(USE_MEDIA_DESCRIBER_AZURE_CU)
   - task: AzureCLI@2
     displayName: Deploy Application
     inputs:

diff --git a/.github/workflows/azure-dev.yml b/.github/workflows/azure-dev.yml
@@ -13,7 +13,7 @@ on:
 # To configure required secrets for connecting to Azure, simply run `azd pipeline config`
 
 # Set up permissions for deploying with secretless Azure federated credentials
-# https://learn.microsoft.com/en-us/azure/developer/github/connect-from-azure?tabs=azure-portal%2Clinux#set-up-azure-login-with-openid-connect-authentication
+# https://learn.microsoft.com/azure/developer/github/connect-from-azure?tabs=azure-portal%2Clinux#set-up-azure-login-with-openid-connect-authentication
 permissions:
   id-token: write
   contents: read
@@ -103,6 +103,7 @@ jobs:
       DEPLOYMENT_TARGET: ${{ vars.DEPLOYMENT_TARGET }}
       AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: ${{ vars.AZURE_CONTAINER_APPS_WORKLOAD_PROFILE }}
       USE_CHAT_HISTORY_BROWSER: ${{ vars.USE_CHAT_HISTORY_BROWSER }}
+      USE_MEDIA_DESCRIBER_AZURE_CU: ${{ vars.USE_MEDIA_DESCRIBER_AZURE_CU }}
     steps:
       - name: Checkout
         uses: actions/checkout@v4

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -122,6 +122,8 @@ If you followed the steps above to install the pre-commit hooks, then you can ju
 
 When adding new azd environment variables, please remember to update:
 
+1. [main.parameters.json](./infra/main.parameters.json)
+1. [appEnvVariables in main.bicep](./infra/main.bicep)
 1. App Service's [azure.yaml](./azure.yaml)
 1. [ADO pipeline](.azdo/pipelines/azure-dev.yml).
 1. [Github workflows](.github/workflows/azure-dev.yml)

diff --git a/README.md b/README.md
@@ -91,7 +91,9 @@ However, you can try the [Azure pricing calculator](https://azure.com/e/e3490de2
 - Azure AI Document Intelligence: SO (Standard) tier using pre-built layout. Pricing per document page, sample documents have 261 pages total. [Pricing](https://azure.microsoft.com/pricing/details/form-recognizer/)
 - Azure AI Search: Basic tier, 1 replica, free level of semantic search. Pricing per hour. [Pricing](https://azure.microsoft.com/pricing/details/search/)
 - Azure Blob Storage: Standard tier with ZRS (Zone-redundant storage). Pricing per storage and read operations. [Pricing](https://azure.microsoft.com/pricing/details/storage/blobs/)
-- Azure Cosmos DB: Serverless tier. Pricing per request unit and storage. [Pricing](https://azure.microsoft.com/pricing/details/cosmos-db/)
+- Azure Cosmos DB: Only provisioned if you enabled [chat history with Cosmos DB](docs/deploy_features.md#enabling-persistent-chat-history-with-azure-cosmos-db). Serverless tier. Pricing per request unit and storage. [Pricing](https://azure.microsoft.com/pricing/details/cosmos-db/)
+- Azure AI Vision: Only provisioned if you enabled [GPT-4 with vision](docs/gpt4v.md). Pricing per 1K transactions. [Pricing](https://azure.microsoft.com/pricing/details/cognitive-services/computer-vision/)
+- Azure AI Content Understanding: Only provisioned if you enabled [media description](docs/deploy_features.md#enabling-media-description-with-azure-content-understanding). Pricing per 1K images. [Pricing](https://azure.microsoft.com/pricing/details/content-understanding/)
 - Azure Monitor: Pay-as-you-go tier. Costs based on data ingested. [Pricing](https://azure.microsoft.com/pricing/details/monitor/)
 
 To reduce costs, you can switch to free SKUs for various services, but those SKUs have limitations.

diff --git a/app/backend/gunicorn.conf.py b/app/backend/gunicorn.conf.py
@@ -7,7 +7,7 @@
 bind = "0.0.0.0"
 
 timeout = 230
-# https://learn.microsoft.com/en-us/troubleshoot/azure/app-service/web-apps-performance-faqs#why-does-my-request-time-out-after-230-seconds
+# https://learn.microsoft.com/troubleshoot/azure/app-service/web-apps-performance-faqs#why-does-my-request-time-out-after-230-seconds
 
 num_cpus = multiprocessing.cpu_count()
 if os.getenv("WEBSITE_SKU") == "LinuxFree":

diff --git a/app/backend/prepdocs.py b/app/backend/prepdocs.py
@@ -7,6 +7,7 @@
 from azure.core.credentials import AzureKeyCredential
 from azure.core.credentials_async import AsyncTokenCredential
 from azure.identity.aio import AzureDeveloperCliCredential, get_bearer_token_provider
+from rich.logging import RichHandler
 
 from load_azd_env import load_azd_env
 from prepdocslib.blobmanager import BlobManager
@@ -158,8 +159,10 @@ def setup_file_processors(
     local_pdf_parser: bool = False,
     local_html_parser: bool = False,
     search_images: bool = False,
+    use_content_understanding: bool = False,
+    content_understanding_endpoint: Union[str, None] = None,
 ):
-    sentence_text_splitter = SentenceTextSplitter(has_image_embeddings=search_images)
+    sentence_text_splitter = SentenceTextSplitter()
 
     doc_int_parser: Optional[DocumentAnalysisParser] = None
     # check if Azure Document Intelligence credentials are provided
@@ -170,6 +173,8 @@ def setup_file_processors(
         doc_int_parser = DocumentAnalysisParser(
             endpoint=f"https://{document_intelligence_service}.cognitiveservices.azure.com/",
             credential=documentintelligence_creds,
+            use_content_understanding=use_content_understanding,
+            content_understanding_endpoint=content_understanding_endpoint,
         )
 
     pdf_parser: Optional[Parser] = None
@@ -294,10 +299,10 @@ async def main(strategy: Strategy, setup_index: bool = True):
     args = parser.parse_args()
 
     if args.verbose:
-        logging.basicConfig(format="%(message)s")
+        logging.basicConfig(format="%(message)s", datefmt="[%X]", handlers=[RichHandler(rich_tracebacks=True)])
         # We only set the level to INFO for our logger,
         # to avoid seeing the noisy INFO level logs from the Azure SDKs
-        logger.setLevel(logging.INFO)
+        logger.setLevel(logging.DEBUG)
 
     load_azd_env()
 
@@ -309,6 +314,7 @@ async def main(strategy: Strategy, setup_index: bool = True):
     use_gptvision = os.getenv("USE_GPT4V", "").lower() == "true"
     use_acls = os.getenv("AZURE_ADLS_GEN2_STORAGE_ACCOUNT") is not None
     dont_use_vectors = os.getenv("USE_VECTORS", "").lower() == "false"
+    use_content_understanding = os.getenv("USE_MEDIA_DESCRIBER_AZURE_CU", "").lower() == "true"
 
     # Use the current user identity to connect to Azure services. See infra/main.bicep for role assignments.
     if tenant_id := os.getenv("AZURE_TENANT_ID"):
@@ -406,6 +412,8 @@ async def main(strategy: Strategy, setup_index: bool = True):
             local_pdf_parser=os.getenv("USE_LOCAL_PDF_PARSER") == "true",
             local_html_parser=os.getenv("USE_LOCAL_HTML_PARSER") == "true",
             search_images=use_gptvision,
+            use_content_understanding=use_content_understanding,
+            content_understanding_endpoint=os.getenv("AZURE_CONTENTUNDERSTANDING_ENDPOINT"),
         )
         image_embeddings_service = setup_image_embeddings_service(
             azure_credential=azd_credential,
@@ -424,6 +432,8 @@ async def main(strategy: Strategy, setup_index: bool = True):
             search_analyzer_name=os.getenv("AZURE_SEARCH_ANALYZER_NAME"),
             use_acls=use_acls,
             category=args.category,
+            use_content_understanding=use_content_understanding,
+            content_understanding_endpoint=os.getenv("AZURE_CONTENTUNDERSTANDING_ENDPOINT"),
         )
 
     loop.run_until_complete(main(ingestion_strategy, setup_index=not args.remove and not args.removeall))
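
Note on the flag parsing in the hunks above: USE_MEDIA_DESCRIBER_AZURE_CU is lowercased before comparison, so "True" and "TRUE" also enable the feature, while an unset variable leaves it off. A minimal sketch of that check, assuming a hypothetical helper named env_flag (not part of this commit) and a plain dict standing in for the process environment:

```python
import os
from typing import Mapping


def env_flag(name: str, environ: Mapping[str, str] = os.environ) -> bool:
    # Hypothetical helper: mirrors os.getenv(name, "").lower() == "true" from the diff.
    # Unset or empty values are treated as False.
    return environ.get(name, "").lower() == "true"


# A plain dict stands in for the real environment so the sketch has no side effects.
fake_env = {"USE_MEDIA_DESCRIBER_AZURE_CU": "True", "USE_VECTORS": "false"}

use_content_understanding = env_flag("USE_MEDIA_DESCRIBER_AZURE_CU", fake_env)
use_missing_flag = env_flag("SOME_UNSET_FLAG", fake_env)
```

By contrast, the USE_LOCAL_PDF_PARSER and USE_LOCAL_HTML_PARSER checks in the same file compare against "true" without lowercasing, so those two remain case-sensitive.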

diff --git a/app/backend/prepdocslib/blobmanager.py b/app/backend/prepdocslib/blobmanager.py
@@ -171,7 +171,7 @@ def sourcepage_from_file_page(cls, filename, page=0) -> str:
 
     @classmethod
     def blob_image_name_from_file_page(cls, filename, page=0) -> str:
-        return os.path.splitext(os.path.basename(filename))[0] + f"-{page}" + ".png"
+        return os.path.splitext(os.path.basename(filename))[0] + f"-{page+1}" + ".png"
 
     @classmethod
     def blob_name_from_file_name(cls, filename) -> str:
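
Note on the change above: the image blob suffix moves from the 0-indexed page number to page + 1, so the first page of a document now maps to a "-1.png" blob. A minimal standalone sketch of the new behavior, written as a free function rather than the classmethod in the diff, with a hypothetical sample file name:

```python
import os


def blob_image_name_from_file_page(filename: str, page: int = 0) -> str:
    # Mirrors the changed line: drop the directory and extension,
    # then append a 1-based page suffix.
    return os.path.splitext(os.path.basename(filename))[0] + f"-{page + 1}" + ".png"


# "data/Benefit_Options.pdf" is an illustrative path, not taken from the commit.
first_page_name = blob_image_name_from_file_page("data/Benefit_Options.pdf", page=0)
third_page_name = blob_image_name_from_file_page("data/Benefit_Options.pdf", page=2)
```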

diff --git a/app/backend/prepdocslib/filestrategy.py b/app/backend/prepdocslib/filestrategy.py
@@ -1,10 +1,13 @@
 import logging
 from typing import List, Optional
 
+from azure.core.credentials import AzureKeyCredential
+
 from .blobmanager import BlobManager
 from .embeddings import ImageEmbeddings, OpenAIEmbeddings
 from .fileprocessor import FileProcessor
 from .listfilestrategy import File, ListFileStrategy
+from .mediadescriber import ContentUnderstandingDescriber
 from .searchmanager import SearchManager, Section
 from .strategy import DocumentAction, SearchInfo, Strategy
 
@@ -50,6 +53,8 @@ def __init__(
         search_analyzer_name: Optional[str] = None,
         use_acls: bool = False,
         category: Optional[str] = None,
+        use_content_understanding: bool = False,
+        content_understanding_endpoint: Optional[str] = None,
     ):
         self.list_file_strategy = list_file_strategy
         self.blob_manager = blob_manager
@@ -61,6 +66,8 @@ def __init__(
         self.search_info = search_info
         self.use_acls = use_acls
         self.category = category
+        self.use_content_understanding = use_content_understanding
+        self.content_understanding_endpoint = content_understanding_endpoint
 
     async def setup(self):
         search_manager = SearchManager(
@@ -73,6 +80,16 @@ async def setup(self):
         )
         await search_manager.create_index()
 
+        if self.use_content_understanding:
+            if self.content_understanding_endpoint is None:
+                raise ValueError("Content Understanding is enabled but no endpoint was provided")
+            if isinstance(self.search_info.credential, AzureKeyCredential):
+                raise ValueError(
+                    "AzureKeyCredential is not supported for Content Understanding, use keyless auth instead"
+                )
+            cu_manager = ContentUnderstandingDescriber(self.content_understanding_endpoint, self.search_info.credential)
+            await cu_manager.create_analyzer()
+
     async def run(self):
         search_manager = SearchManager(
             self.search_info, self.search_analyzer_name, self.use_acls, False, self.embeddings

diff --git a/app/backend/prepdocslib/mediadescriber.py b/app/backend/prepdocslib/mediadescriber.py
@@ -0,0 +1,107 @@
+import logging
+from abc import ABC
+
+import aiohttp
+from azure.core.credentials_async import AsyncTokenCredential
+from azure.identity.aio import get_bearer_token_provider
+from rich.progress import Progress
+from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed
+
+logger = logging.getLogger("scripts")
+
+
+class MediaDescriber(ABC):
+
+    async def describe_image(self, image_bytes) -> str:
+        raise NotImplementedError  # pragma: no cover
+
+
+class ContentUnderstandingDescriber:
+    CU_API_VERSION = "2024-12-01-preview"
+
+    analyzer_schema = {
+        "analyzerId": "image_analyzer",
+        "name": "Image understanding",
+        "description": "Extract detailed structured information from images extracted from documents.",
+        "baseAnalyzerId": "prebuilt-image",
+        "scenario": "image",
+        "config": {"returnDetails": False},
+        "fieldSchema": {
+            "name": "ImageInformation",
+            "descriptions": "Description of image.",
+            "fields": {
+                "Description": {
+                    "type": "string",
+                    "description": "Description of the image. If the image has a title, start with the title. Include a 2-sentence summary. If the image is a chart, diagram, or table, include the underlying data in an HTML table tag, with accurate numbers. If the image is a chart, describe any axis or legends. The only allowed HTML tags are the table/thead/tr/td/tbody tags.",
+                },
+            },
+        },
+    }
+
+    def __init__(self, endpoint: str, credential: AsyncTokenCredential):
+        self.endpoint = endpoint
+        self.credential = credential
+
+    async def poll_api(self, session, poll_url, headers):
+
+        @retry(stop=stop_after_attempt(60), wait=wait_fixed(2), retry=retry_if_exception_type(ValueError))
+        async def poll():
+            async with session.get(poll_url, headers=headers) as response:
+                response.raise_for_status()
+                response_json = await response.json()
+                if response_json["status"] == "Failed":
+                    raise Exception("Failed")
+                if response_json["status"] == "Running":
+                    raise ValueError("Running")
+                return response_json
+
+        return await poll()
+
+    async def create_analyzer(self):
+        logger.info("Creating analyzer '%s'...", self.analyzer_schema["analyzerId"])
+
+        token_provider = get_bearer_token_provider(self.credential, "https://cognitiveservices.azure.com/.default")
+        token = await token_provider()
+        headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
+        params = {"api-version": self.CU_API_VERSION}
+        analyzer_id = self.analyzer_schema["analyzerId"]
+        cu_endpoint = f"{self.endpoint}/contentunderstanding/analyzers/{analyzer_id}"
+        async with aiohttp.ClientSession() as session:
+            async with session.put(
+                url=cu_endpoint, params=params, headers=headers, json=self.analyzer_schema
+            ) as response:
+                if response.status == 409:
+                    logger.info("Analyzer '%s' already exists.", analyzer_id)
+                    return
+                elif response.status != 201:
+                    data = await response.text()
+                    raise Exception("Error creating analyzer", data)
+                else:
+                    poll_url = response.headers.get("Operation-Location")
+
+            with Progress() as progress:
+                progress.add_task("Creating analyzer...", total=None, start=False)
+                await self.poll_api(session, poll_url, headers)
+
+    async def describe_image(self, image_bytes: bytes) -> str:
+        logger.info("Sending image to Azure Content Understanding service...")
+        async with aiohttp.ClientSession() as session:
+            token = await self.credential.get_token("https://cognitiveservices.azure.com/.default")
+            headers = {"Authorization": "Bearer " + token.token}
+            params = {"api-version": self.CU_API_VERSION}
+            analyzer_name = self.analyzer_schema["analyzerId"]
+            async with session.post(
+                url=f"{self.endpoint}/contentunderstanding/analyzers/{analyzer_name}:analyze",
+                params=params,
+                headers=headers,
+                data=image_bytes,
+            ) as response:
+                response.raise_for_status()
+                poll_url = response.headers["Operation-Location"]
+
+                with Progress() as progress:
+                    progress.add_task("Processing...", total=None, start=False)
+                    results = await self.poll_api(session, poll_url, headers)
+
+                fields = results["result"]["contents"][0]["fields"]
+                return fields["Description"]["valueString"]
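
Note on poll_api above: it retries while the operation reports "Running" (up to 60 attempts, 2 seconds apart) and stops immediately on "Failed". The sketch below shows the same control flow with a plain asyncio loop instead of tenacity, so it runs with the standard library only. The fetch_status callable is a hypothetical stand-in for the GET on the Operation-Location URL, and the "Succeeded" status value is an assumption, since the diff only checks for "Failed" and "Running".

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[Dict[str, Any]]],
    max_attempts: int = 60,
    delay_seconds: float = 2.0,
) -> Dict[str, Any]:
    # fetch_status is a hypothetical stand-in for the HTTP GET on the poll URL.
    for attempt in range(max_attempts):
        body = await fetch_status()
        if body["status"] == "Failed":
            # A failed operation is terminal, so do not retry.
            raise RuntimeError("Operation failed")
        if body["status"] != "Running":
            return body
        if attempt < max_attempts - 1:
            await asyncio.sleep(delay_seconds)
    raise TimeoutError(f"Operation still running after {max_attempts} attempts")


async def _demo() -> Dict[str, Any]:
    # Fake service: "Running" twice, then a final payload shaped like the one
    # describe_image reads. The "Succeeded" value is assumed for illustration.
    responses = iter(
        [
            {"status": "Running"},
            {"status": "Running"},
            {
                "status": "Succeeded",
                "result": {"contents": [{"fields": {"Description": {"valueString": "A bar chart."}}}]},
            },
        ]
    )

    async def fetch_status() -> Dict[str, Any]:
        return next(responses)

    return await poll_until_done(fetch_status, delay_seconds=0)


final = asyncio.run(_demo())
description = final["result"]["contents"][0]["fields"]["Description"]["valueString"]
```

One difference from the commit's version: when attempts run out, tenacity raises its own RetryError, whereas this sketch raises TimeoutError.
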
diff --git a/app/backend/prepdocslib/page.py b/app/backend/prepdocslib/page.py
@@ -3,7 +3,7 @@ class Page:
     A single page from a document
 
     Attributes:
-        page_num (int): Page number
+        page_num (int): Page number (0-indexed)
         offset (int): If the text of the entire Document was concatenated into a single string, the index of the first character on the page. For example, if page 1 had the text "hello" and page 2 had the text "world", the offset of page 2 is 5 ("hellow")
         text (str): The text of the page
     """
@@ -17,6 +17,10 @@ def __init__(self, page_num: int, offset: int, text: str):
 class SplitPage:
     """
     A section of a page that has been split into a smaller chunk.
+
+    Attributes:
+        page_num (int): Page number (0-indexed)
+        text (str): The text of the section
     """
 
     def __init__(self, page_num: int, text: str):