Merge branch 'main' into dependabot/pip/pydantic-2.10.6
demetrio.marino committed Feb 7, 2025
2 parents 2afb238 + 41cf3dc commit 2e26493
Showing 24 changed files with 279 additions and 206 deletions.
2 changes: 1 addition & 1 deletion .mia-template/README.md
@@ -3,7 +3,7 @@
[![Python
version](https://img.shields.io/badge/python-v3.12.3-blue)](.coverage/html/index.html)
[![FastAPI
version](https://img.shields.io/badge/fastapi-v0.112.1-blue)](.coverage/html/index.html)
version](https://img.shields.io/badge/fastapi-v0.115.6-blue)](.coverage/html/index.html)

---

18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## Unreleased

### Fixed

- Version `0.5.2` introduced an error in handling `mdx` files for embedding generation via the `generateFromFile` API. This has been fixed.
- Fixed several typos related to the `aggregateMaxTokenNumber` configurable parameter.

### Changed

- Updated documentation related to the Aggregate Max Token Number and custom prompts (both system and user prompts)

## 0.5.2 - 2025-01-29


### Fixed

- At service startup, if the Vector Search collection does not exist, it is automatically created
- Added support for the `mdx` file extension in embedding generation
- Files uploaded for embedding generation are validated using either the content type or the file extension

## 0.5.1 - 2024-12-20

## 0.5.0 - 2024-12-19
2 changes: 1 addition & 1 deletion Dockerfile
@@ -17,7 +17,7 @@ LABEL maintainer="%CUSTOM_PLUGIN_CREATOR_USERNAME%" \
name="ai-rag-template" \
description="%CUSTOM_PLUGIN_SERVICE_DESCRIPTION%" \
eu.mia-platform.url="https://www.mia-platform.eu" \
eu.mia-platform.version="0.5.1"
eu.mia-platform.version="0.5.2"

USER python

35 changes: 31 additions & 4 deletions README.md
@@ -3,7 +3,7 @@
[![Python
version](https://img.shields.io/badge/python-v3.12.3-blue)](.coverage/html/index.html)
[![FastAPI
version](https://img.shields.io/badge/fastapi-v0.112.1-blue)](.coverage/html/index.html)
version](https://img.shields.io/badge/fastapi-v0.115.6-blue)](.coverage/html/index.html)

---

@@ -130,7 +130,7 @@ The `/embeddings/generateFromFile` endpoint is a HTTP POST method that takes as
The file must be of format:

- a text file (`.txt`)
- a markdown file (`.md`)
- a markdown file (`.md` or `.mdx`)
- a PDF file (`.pdf`)
- a zip file (formats available: `.zip`, `.tar`, `.gz`) containing files of the same formats as above (folders and other files will be skipped).
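For instance, the endpoint can be called with a standard multipart upload. The following is a minimal sketch assuming the service is reachable at `http://localhost:3000`; host, port, and file name are hypothetical:

```python
import requests

# Hypothetical host/port: point this at your running ai-rag-template instance.
URL = "http://localhost:3000/embeddings/generateFromFile"

# Any supported format works: .txt, .md, .mdx, .pdf, .zip, .tar, .gz
with open("guide.mdx", "rb") as f:
    response = requests.post(URL, files={"file": ("guide.mdx", f, "text/markdown")})

response.raise_for_status()
print(response.json())  # embedding generation continues as a background task
```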

@@ -310,8 +310,9 @@ Description of configuration parameters:
| Vector Store Text Key | Name of the field used to save the raw document (or chunk of document). |
| Vector Store Max. Documents To Retrieve | Maximum number of documents to retrieve from the Vector Store. |
| Vector Store Min. Score Distance | Minimum distance beyond which retrieved documents from the Vector Store are discarded. |
| Chain RAG System Prompts File Path | ath to the file containing system prompts for the RAG model. If omitted, the application will use a standard system prompt. |
| Chain RAG User Prompts File Path | Path to the file containing user prompts for the RAG model. If omitted, the application will use a standard system prompt. |
| Chain Aggregate Max Token Number | Maximum number of tokens, extracted from the documents retrieved from the Vector Store, to be included in the prompt (1 token is approximately 4 characters). Default is `2000`. |
| Chain RAG System Prompts File Path | Path to the file containing system prompts for the RAG model. If omitted, the application will use a standard system prompt. More details in the [dedicated paragraph](#configure-your-own-system-and-user-prompts). |
| Chain RAG User Prompts File Path | Path to the file containing user prompts for the RAG model. If omitted, the application will use a standard system prompt. More details in the [dedicated paragraph](#configure-your-own-system-and-user-prompts). |

### Supported LLM providers

@@ -379,6 +380,32 @@ Currently, the supported Embeddings providers are:
| `url` | URL of the Azure OpenAI service to call. |
| `apiVersion` | API version of the Azure OpenAI service. |

### Configure your own system and user prompts

The application sends the LLM a prompt composed of a _system prompt_ and a _user prompt_:

- the _system prompt_ is a message that provides instructions to the LLM on how to respond to the user's input.
- the _user prompt_ is a message that contains the user's input.

A default version of these prompts is included in the application, but you can supply your own to make the LLM behave in a more specific way, such as acting as a generic assistant in any field or as an expert in the specific field covered by the embedding documents you are using.

Both prompts are optional. To use your own, create a text file with the content of the prompt and specify the path to the file in the configuration, at `chain.rag.systemPromptsFilePath` and `chain.rag.userPromptsFilePath` respectively.

Moreover, the _system prompt_ must include the following placeholders:

- `{chat_history}`: placeholder that will be replaced by the chat history, i.e. the list of messages exchanged between the user and the chatbot up to that point (received via the `chat_history` property in the body of the [`/chat/completions` endpoint](#chat-endpoint-chatcompletions))
- `{output_text}`: placeholder that will be replaced by the text extracted from the embedding documents

> **Note**
>
> The application already includes context text explaining what the chat history and the output text are, so you don't need to add your own explanation to the system prompt.

Also, the _user prompt_ must include the following placeholder:

- `{query}`: placeholder that will be replaced by the user's input (received via the `chat_query` property from the body of the [`/chat/completions` endpoint](#chat-endpoint-chatcompletions))

Generally speaking, it is best to tailor the _system prompt_ to the needs of your application, specifying what type of information the chatbot should provide and the tone and style of its responses. The _user prompt_ can usually be omitted, unless you need instructions or constraints that apply to each individual question.
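As an illustration, a custom _system prompt_ file could look like the following. This is a minimal sketch: the instructions are invented example wording, while the two placeholders are the required ones listed above.

```text
You are an assistant specialized in the product documentation provided below.
Answer using only that information, and say so explicitly when it does not
cover the question. Keep a concise, professional tone.

{chat_history}

{output_text}
```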

## Local Development

- Before getting started, make sure you have the following information:
31 changes: 29 additions & 2 deletions docs/10_Overview_And_Usage.md
@@ -103,8 +103,9 @@ Description of configuration parameters:
| Vector Store Text Key | Name of the field used to save the raw document (or chunk of document). |
| Vector Store Max. Documents To Retrieve | Maximum number of documents to retrieve from the Vector Store. |
| Vector Store Min. Score Distance | Minimum distance beyond which retrieved documents from the Vector Store are discarded. |
| Chain RAG System Prompts File Path | ath to the file containing system prompts for the RAG model. If omitted, the application will use a standard system prompt. |
| Chain RAG User Prompts File Path | Path to the file containing user prompts for the RAG model. If omitted, the application will use a standard system prompt. |
| Chain Aggregate Max Token Number | Maximum number of tokens, extracted from the documents retrieved from the Vector Store, to be included in the prompt (1 token is approximately 4 characters). Default is `2000`. |
| Chain RAG System Prompts File Path | Path to the file containing system prompts for the RAG model. If omitted, the application will use a standard system prompt. More details in the [dedicated paragraph](#configure-your-own-system-and-user-prompts). |
| Chain RAG User Prompts File Path | Path to the file containing user prompts for the RAG model. If omitted, the application will use a standard system prompt. More details in the [dedicated paragraph](#configure-your-own-system-and-user-prompts). |
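To make these parameters concrete, below is a minimal sketch of the corresponding configuration fragment. The nested JSON layout is an assumption inferred from the dotted paths documented on this page (`chain.rag.systemPromptsFilePath`, `chain.rag.userPromptsFilePath`) and from the `aggregateMaxTokenNumber` field the service reads; the file paths are hypothetical:

```json
{
  "chain": {
    "aggregateMaxTokenNumber": 2000,
    "rag": {
      "systemPromptsFilePath": "/app/prompts/system_prompt.txt",
      "userPromptsFilePath": "/app/prompts/user_prompt.txt"
    }
  }
}
```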

### Supported LLM providers

@@ -172,6 +173,32 @@ Currently, the supported Embeddings providers are:
| `url` | URL of the Azure OpenAI service to call. |
| `apiVersion` | API version of the Azure OpenAI service. |

### Configure your own system and user prompts

The application sends the LLM a prompt composed of a _system prompt_ and a _user prompt_:

- the _system prompt_ is a message that provides instructions to the LLM on how to respond to the user's input.
- the _user prompt_ is a message that contains the user's input.

A default version of these prompts is included in the application, but you can supply your own to make the LLM behave in a more specific way, such as acting as a generic assistant in any field or as an expert in the specific field covered by the embedding documents you are using.

Both prompts are optional. To use your own, create a text file with the content of the prompt and specify the path to the file in the configuration, at `chain.rag.systemPromptsFilePath` and `chain.rag.userPromptsFilePath` respectively.

Moreover, the _system prompt_ must include the following placeholders:

- `{chat_history}`: placeholder that will be replaced by the chat history, i.e. the list of messages exchanged between the user and the chatbot up to that point (received via the `chat_history` property in the body of the [`/chat/completions` endpoint](#chat-endpoint-chatcompletions))
- `{output_text}`: placeholder that will be replaced by the text extracted from the embedding documents

> **Note**
>
> The application already includes context text explaining what the chat history and the output text are, so you don't need to add your own explanation to the system prompt.

Also, the _user prompt_ must include the following placeholder:

- `{query}`: placeholder that will be replaced by the user's input (received via the `chat_query` property from the body of the [`/chat/completions` endpoint](#chat-endpoint-chatcompletions))

Generally speaking, it is best to tailor the _system prompt_ to the needs of your application, specifying what type of information the chatbot should provide and the tone and style of its responses. The _user prompt_ can usually be omitted, unless you need instructions or constraints that apply to each individual question.
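For example, a custom _user prompt_ file can be as small as a single line. The following is an illustrative sketch, where `{query}` is the required placeholder described above:

```text
Answer in the same language as the question. Question: {query}
```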

### Create a Vector Index

:::info
2 changes: 1 addition & 1 deletion docs/20_APIs.md
@@ -114,7 +114,7 @@ The `/embeddings/generateFromFile` endpoint is a HTTP POST method that takes as
The file must be of format:

- a text file (`.txt`)
- a markdown file (`.md`)
- a markdown file (`.md`, `.mdx`)
- a PDF file (`.pdf`)
- a zip file (formats available: `.zip`, `.tar`, `.gz`) containing files of the same formats as above (folders and other files will be skipped).

14 changes: 7 additions & 7 deletions requirements.txt
@@ -21,7 +21,7 @@ coverage==7.5.0
cycler==0.12.1
cyclonedx-python-lib==7.6.2
dataclasses-json==0.6.4
datamodel-code-generator==0.26.3
datamodel-code-generator==0.26.5
defusedxml==0.7.1
dill==0.3.8
distlib==0.3.8
@@ -55,11 +55,11 @@ jsonpointer==3.0.0
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
langchain==0.3.12
langchain==0.3.17
langchain-community==0.3.12
langchain-core==0.3.25
langchain-core==0.3.33
langchain-experimental==0.3.3
langchain-openai==0.2.12
langchain-openai==0.3.3
langchain-text-splitters==0.3.3
langsmith==0.1.147
license-expression==30.4.0
@@ -124,7 +124,7 @@ soupsieve==2.6
SQLAlchemy==2.0.29
starlette==0.40.0
stevedore==5.4.0
tavily-python==0.3.3
tavily-python==0.5.0
tenacity==8.2.3
testcontainers==4.4.0
tiktoken==0.8.0
@@ -135,8 +135,8 @@ tqdm==4.66.3
typing-inspect==0.9.0
typing_extensions==4.12.2
urllib3==2.2.2
uvicorn==0.29.0
virtualenv==20.26.6
virtualenv==20.29.1
uvicorn==0.34.0
webencodings==0.5.1
wrapt==1.16.0
yarl==1.18.3
8 changes: 4 additions & 4 deletions src/api/controllers/embeddings/embeddings_handler.py
@@ -4,9 +4,10 @@
from zipfile import BadZipFile
from fastapi import APIRouter, BackgroundTasks, File, HTTPException, Request, UploadFile, status

from src.application.embeddings.file_parser.errors import InvalidFileError
from src.api.schemas.status_ok_schema import StatusOkResponseSchema
from src.application.embeddings.embedding_generator import EmbeddingGenerator
from src.application.embeddings.file_parser import FileParser
from src.application.embeddings.file_parser.file_parser import FileParser
from src.api.schemas.embeddings_schemas import GenerateEmbeddingsInputSchema, GenerateStatusOutputSchema
from src.constants import SUPPORTED_CONTENT_TYPES_TUPLE
from src.context import AppContext
@@ -145,9 +146,6 @@ def generate_embeddings_from_file(request: Request, background_tasks: Background
file (UploadFile): The file received.
background_tasks (BackgroundTasks): The background tasks object.
"""

if file.content_type not in SUPPORTED_CONTENT_TYPES_TUPLE:
raise HTTPException(status_code=400, detail=f"Application does not support this file type (content type: {file.content_type}).")

request_context: AppContext = request.state.app_context
request_context.logger.info(f"Generate embeddings request received for file {file.filename} (content type: {file.content_type})")
@@ -157,6 +155,8 @@ def generate_embeddings_from_file(request: Request, background_tasks: Background
docs = list(file_parser.extract_documents_from_file(file))
except (BadZipFile, BadGzipFile, TarError) as ex:
raise HTTPException(status_code=400, detail="The file uploaded is not a valid archive file.") from ex
except InvalidFileError as ex:
raise HTTPException(status_code=400, detail=str(ex)) from ex
except Exception as ex:
raise HTTPException(status_code=500, detail=f"Error parsing file: {str(ex)}") from ex

2 changes: 1 addition & 1 deletion src/app.py
@@ -22,7 +22,7 @@ def create_app(context: AppContext) -> FastAPI:
openapi_url="/documentation/json",
redoc_url=None,
title="ai-rag-template",
version="0.5.1"
version="0.5.2"
)

app.add_middleware(AppContextMiddleware, app_context=context)
7 changes: 4 additions & 3 deletions src/application/assistance/chains/combine_docs_chain.py
@@ -10,7 +10,7 @@
class AggregateDocsChunksChain(BaseCombineDocumentsChain):

context: AppContext
aggreate_max_token_number: int = 2000
aggregate_max_token_number: int = 2000
"""The maximum token length of the combined documents, if exceeded a warning will be logged."""
tokenizer_model_name: str = "gpt-3.5-turbo"
"""The language model to use for tokenization."""
@@ -26,19 +26,20 @@ def combine_docs(self, docs: List[Document], **kwargs: Any) -> Tuple[str | dict]
docs)
if limit_exceeded:
self.context.logger.warning(
f"Combined text length exceeded {self.aggreate_max_token_number} tokens"
f"Combined text length exceeded {self.aggregate_max_token_number} tokens"
)
self.context.logger.debug(
f"Combined text length: {token_count} tokens")
return combined_text, {}

def _aggregate_docs_until_token_limit(self, docs):
print(self.aggregate_max_token_number)
combined_text = ''
token_count = 0
limit_exceeded = False
for doc in docs:
new_tokens = self.tokenizer.encode(doc.page_content)
if token_count + len(new_tokens) > self.aggreate_max_token_number:
if token_count + len(new_tokens) > self.aggregate_max_token_number:
limit_exceeded = True
break
combined_text += f"\n\n{doc.page_content}"
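To illustrate the token budget this chain enforces (and the "1 token is approximately 4 characters" rule of thumb from the configuration table), here is a standalone sketch using `tiktoken`, which is already pinned in `requirements.txt`; the sample text is invented:

```python
import tiktoken

# Same default tokenizer family as the chain (tokenizer_model_name = "gpt-3.5-turbo").
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

sample = "Retrieved documentation chunk. " * 200
tokens = encoding.encode(sample)
print(f"{len(sample)} characters -> {len(tokens)} tokens")  # roughly 4 characters per token
```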
2 changes: 1 addition & 1 deletion src/application/assistance/service.py
@@ -92,7 +92,7 @@ def _init_documentation_aggregator(self):
return AggregateDocsChunksChain(
context=self.app_context,
tokenizer_model_name=tokenizer_config.name,
aggreate_max_token_number=chain_config.aggregateMaxTokenNumber
aggregate_max_token_number=chain_config.aggregateMaxTokenNumber
)

def _build_prompt(self) -> AssistantPromptTemplate:
src/application/embeddings/file_parser/errors.py
@@ -1,8 +1,8 @@

from src.constants import SUPPORTED_EXT_TUPLE
from src.constants import SUPPORTED_EXT_TUPLE, SUPPORTED_CONTENT_TYPES_TUPLE


class InvalidFileExtensionError(Exception):
class InvalidFileError(Exception):
"""
Exception raised when a file does not have the expected extension.
@@ -13,5 +13,5 @@ class InvalidFileExtensionError(Exception):
- archive files (.zip files, *.tar files or *.gz files) that includes only the above extensions.
"""
def __init__(self, filename):
self.message = f"The file {filename} cannot be processed. Supported extensions are: {", ".join(SUPPORTED_EXT_TUPLE)}."
self.message = f"The file {filename} cannot be processed. File must include one of the specific ContentType: {", ".join(SUPPORTED_CONTENT_TYPES_TUPLE)}. Otherwise can have the following extensions: {", ".join(SUPPORTED_EXT_TUPLE)}."
super().__init__(self.message)
src/application/embeddings/file_parser/file_parser.py
@@ -9,14 +9,15 @@
from pymupdf import Document
from fastapi import File, UploadFile

from src.application.embeddings.file_parser.errors import InvalidFileError
from src.application.embeddings.file_parser.get_file_type import FileType, get_file_type
from src.constants import (
MD_EXTENSION,
MDX_EXTENSION,
PDF_EXTENSION,
SUPPORTED_CONTENT_TYPES_TUPLE,
SUPPORTED_EXT_IN_COMPRESSED_FILE_TUPLE,
TEXT_EXTENSION,
)
from src.application.embeddings.errors import InvalidFileExtensionError


class FileParser:
@@ -62,13 +63,14 @@ def _convert_pdf_to_str(self, file: UploadFile) -> Generator[str, None, None]:

def _convert_file_to_str(self, file: IO[bytes], file_name: str) -> Generator[str, None, None]:
file_content = file.read()
file_extension = file_name.split('.')[-1]

if file_name.endswith(PDF_EXTENSION):
if file_extension == PDF_EXTENSION:
doc = Document(stream=file_content)
yield from self._convert_from_doc_to_str(doc)
if file_name.endswith(TEXT_EXTENSION):
elif file_extension == TEXT_EXTENSION:
yield self._convert_bytes_to_str(file_content)
if file_name.endswith(MD_EXTENSION):
elif file_extension in (MD_EXTENSION, MDX_EXTENSION):
yield self._convert_bytes_to_str(file_content)

def _extract_documents_from_zip_file(self, file: UploadFile = File(...)) -> Generator[str, None, None]:
@@ -176,32 +178,35 @@ def extract_documents_from_file(self, file: UploadFile = File(...)) -> Generator
Generator[str, None, None]: A generator that yields strings of text content
Raises:
InvalidFileExtensionError: If the file extension is not supported
InvalidFileError: If the file extension is not supported
BadZipFile: If the zip file is corrupted or invalid
TarError: If the tar file is corrupted or invalid
BadGzipFile: If the gzip file is corrupted or invalid
Exception: For general processing errors
"""
self.logger.info(f"Extracting documents from file {file.filename}")
if file.content_type not in SUPPORTED_CONTENT_TYPES_TUPLE:
raise InvalidFileExtensionError(filename=file.filename)

file_type = get_file_type(file)

if file_type is None:
raise InvalidFileError(filename=file.filename)

result: list[str] | Generator[str, None, None] = []

match file.content_type:
case "text/plain" | "text/markdown":
match file_type:
case FileType.TEXT:
result = [self._convert_text_to_str(file)]
case "application/pdf":
case FileType.PDF:
result = self._convert_pdf_to_str(file)
case "application/zip":
case FileType.ZIP:
result = self._extract_documents_from_zip_file(file)
case "application/x-tar":
case FileType.TAR:
result = self._extract_documents_from_tar_file(file)
case "application/gzip":
case FileType.GZIP:
result = self._extract_documents_from_gzip_file(file)
case _:
raise InvalidFileExtensionError(filename=file.filename)
raise InvalidFileError(filename=file.filename)

self.logger.info(f"Completed documents extraction from file {file.filename}")
yield from result
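The diff does not show `get_file_type` itself. The sketch below is a hypothetical reconstruction, consistent with the changelog entry ("validated either from the content-type or the file extension") and with the content types and `FileType` members visible above; the real implementation may differ in every detail:

```python
# Hypothetical reconstruction of src/application/embeddings/file_parser/get_file_type.py.
from enum import Enum

from fastapi import UploadFile


class FileType(Enum):
    TEXT = "text"
    PDF = "pdf"
    ZIP = "zip"
    TAR = "tar"
    GZIP = "gzip"


_CONTENT_TYPE_MAP = {
    "text/plain": FileType.TEXT,
    "text/markdown": FileType.TEXT,
    "application/pdf": FileType.PDF,
    "application/zip": FileType.ZIP,
    "application/x-tar": FileType.TAR,
    "application/gzip": FileType.GZIP,
}

_EXTENSION_MAP = {
    "txt": FileType.TEXT,
    "md": FileType.TEXT,
    "mdx": FileType.TEXT,
    "pdf": FileType.PDF,
    "zip": FileType.ZIP,
    "tar": FileType.TAR,
    "gz": FileType.GZIP,
}


def get_file_type(file: UploadFile) -> FileType | None:
    """Resolve the file type from the declared content type, falling back to the extension."""
    if file.content_type in _CONTENT_TYPE_MAP:
        return _CONTENT_TYPE_MAP[file.content_type]
    extension = (file.filename or "").rsplit(".", 1)[-1].lower()
    return _EXTENSION_MAP.get(extension)
```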