The indexer is a component in a Retrieval Augmented Generation (RAG) application. One example of such an application is a chatbot that answers questions from Confluence. The indexer creates a vector index of Confluence pages. The chatbot uses this vector index to find the pages most relevant to a question.
You don't necessarily need anything else to build a working chatbot. For example, once Confluence is indexed to Azure AI Search, you can chat with it using the Azure AI Playground.
This application is intended to be used as a batch run. It should work out by itself which assets need to be updated. Currently, it supports Azure AI Search, but it should be relatively easy to add other vector databases.
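The idea of a batch run that only touches changed assets can be sketched as follows. This is a minimal illustration, not the indexer's actual code: the `Page` record, the `pages_to_update` helper, and the way the last index time is stored are all assumptions made for the example. The `full_reindex` flag mirrors the AZURE_SEARCH_FULL_REINDEX setting.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass
class Page:
    # Hypothetical minimal page record; the real indexer's model may differ.
    page_id: str
    last_modified: datetime


def pages_to_update(
    pages: Iterable[Page],
    last_indexed: Optional[datetime],
    full_reindex: bool = False,
) -> List[Page]:
    """Return the pages that need (re)indexing in this batch run."""
    if full_reindex or last_indexed is None:
        # First run or forced reindex: every page is processed.
        return list(pages)
    # Otherwise only pages modified after the previous run are processed.
    return [p for p in pages if p.last_modified > last_indexed]


pages = [
    Page("1", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    Page("2", datetime(2024, 3, 1, tzinfo=timezone.utc)),
]
changed = pages_to_update(pages, last_indexed=datetime(2024, 2, 1, tzinfo=timezone.utc))
print([p.page_id for p in changed])
```

With the sample data, only the page modified after the previous run is selected.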
You need to set plenty of environment variables to make this work. See the .env.example file for a list of them.
It has been tested to work with both the Cloud and Server versions of Confluence. It should in theory also work with the Data Center version (it uses the Confluence Python SDK)🤞.
You can run it with docker:
docker run --env-file .env ghcr.io/piizei/confluence-vector-indexer:latest
Check .env.example for example values. The configuration (environment) values are:
Name | Description | Default |
---|---|---|
AZURE_SEARCH_ENDPOINT | The URL of Azure AI Search | |
AZURE_SEARCH_KEY | Admin key for search. If not specified, managed identity should be used. | |
AZURE_SEARCH_API_VERSION | Version of the Azure AI Search API (2023-11-01 or later, starting from rel-1.0) | 2023-11-01 |
AZURE_SEARCH_CONFLUENCE_INDEX | Index name to be created for Confluence. | confluence |
AZURE_SEARCH_EMBEDDING_MODEL | The deployment name in Azure OpenAI, or the model name (usually text-embedding-ada-002) | text-embedding-ada-002 |
AZURE_SEARCH_FULL_REINDEX | (true, false) Reindex every page (normally only the pages that changed since the last index run) | false |
OPENAI_API_KEY | Key to the OpenAI service (no managed identity support as of now) | |
OPENAI_API_VERSION | The API version (2023-05-15 for example) | |
OPENAI_API_TYPE | azure or none; none is not tested. | |
OPENAI_API_BASE | Full URL of the Azure OpenAI service (https://myazureopenai.openai.azure.com/) | |
CONFLUENCE_URL | URL of your Confluence (Cloud or Server) instance. | |
CONFLUENCE_USER_NAME | Username of your indexer user | |
CONFLUENCE_PASSWORD | API token created in Confluence (not the login password) | |
CONFLUENCE_SPACE_FILTER | Comma-separated list of Confluence spaces (without whitespace) that will be indexed. | |
CONFLUENCE_TEST_SPACE | A space against which the integration test runs (your personal space for example) | |
LOG_LEVEL | One of DEBUG, INFO, WARNING | WARNING |
CONFLUENCE_AUTH_METHOD | One of PASSWORD, TOKEN (*) | PASSWORD |
INDEX_ATTACHMENTS | Also index attachments (see attachment indexing for more info) | false |
(*) The value of the CONFLUENCE_PASSWORD variable is also used as the token. If CONFLUENCE_AUTH_METHOD is set to PASSWORD, the indexer uses BASIC authentication; if it is set to TOKEN, it sends the password (that is, the token) as a Bearer token. This is functionality of the Confluence Python SDK.
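To make the difference between the two authentication methods concrete, the sketch below builds the HTTP Authorization header each one results in. It is an illustration only: in the indexer this is handled by the Confluence Python SDK, and the `build_auth_header` helper is a hypothetical name introduced for this example.

```python
import base64


def build_auth_header(method: str, username: str, secret: str) -> dict:
    """Build an Authorization header for PASSWORD (Basic) or TOKEN (Bearer)."""
    method = method.upper()
    if method == "PASSWORD":
        # Basic auth: base64 of "username:secret".
        raw = f"{username}:{secret}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    if method == "TOKEN":
        # Bearer auth: the secret is sent as-is; the username is not used.
        return {"Authorization": f"Bearer {secret}"}
    raise ValueError(f"Unsupported CONFLUENCE_AUTH_METHOD: {method}")


print(build_auth_header("PASSWORD", "user", "pw"))
print(build_auth_header("TOKEN", "user", "my-token"))
```

In both cases the secret comes from CONFLUENCE_PASSWORD; only the way it is sent changes.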
You can add custom headers to the requests to Confluence by adding CONFLUENCE_EXTRA_HEADER_KEY_XXX and CONFLUENCE_EXTRA_HEADER_VALUE_XXX variables, where XXX is the number of the custom header-value pair. This is useful if, for example, you want to use Cloudflare Service Tokens to connect to an on-prem Confluence server.
Example of using Cloudflare Service Tokens:
Name | Value |
---|---|
CONFLUENCE_EXTRA_HEADER_KEY_1 | CF-Access-Client-Id |
CONFLUENCE_EXTRA_HEADER_KEY_2 | CF-Access-Client-Secret |
CONFLUENCE_EXTRA_HEADER_VALUE_1 | 123.access |
CONFLUENCE_EXTRA_HEADER_VALUE_2 | abc123qwertysecret |
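The numbered variables above pair up into ordinary HTTP headers. The following sketch shows one way such pairs could be collected from the environment; the `extra_headers` function is a hypothetical helper written for this example, and the indexer's own parsing may differ.

```python
import os
import re
from typing import Mapping, Optional

KEY_PATTERN = re.compile(r"^CONFLUENCE_EXTRA_HEADER_KEY_(\d+)$")


def extra_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pair CONFLUENCE_EXTRA_HEADER_KEY_n with CONFLUENCE_EXTRA_HEADER_VALUE_n."""
    env = os.environ if env is None else env
    headers = {}
    for name, header_name in env.items():
        match = KEY_PATTERN.match(name)
        if not match:
            continue
        # Look up the value variable that carries the same number.
        value = env.get(f"CONFLUENCE_EXTRA_HEADER_VALUE_{match.group(1)}")
        if value is not None:
            headers[header_name] = value
    return headers


example_env = {
    "CONFLUENCE_EXTRA_HEADER_KEY_1": "CF-Access-Client-Id",
    "CONFLUENCE_EXTRA_HEADER_KEY_2": "CF-Access-Client-Secret",
    "CONFLUENCE_EXTRA_HEADER_VALUE_1": "123.access",
    "CONFLUENCE_EXTRA_HEADER_VALUE_2": "abc123qwertysecret",
}
print(extra_headers(example_env))
```

A key without a matching value variable is skipped in this sketch rather than sent with an empty value.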
Attachment indexing is not enabled by default. You can enable it by setting INDEX_ATTACHMENTS to true. The supported document types vary by document indexer. The default implementation is Azure Document Intelligence, which supports PDFs, images, Office files (docx, xlsx, pptx), and HTML. The Azure AI Document Intelligence region must support the preview API version 2023-10-31-preview (at the time of writing: East US, West US2, West Europe).
The Git tags match the Docker container tags. The releases are not guaranteed to be backward compatible. An example of a breaking change is the update of the AI Search API version from preview to GA (rel-0.6 to rel-1.0). The indexed fields are compatible (but more may be added). This means the chat application using the index should not break, but you would need to reindex Confluence. If your current version works, there is no need to update.
- poetry
- For Azure AI Search, you need to have an Azure account and access to Azure OpenAI
Set CONFLUENCE_TEST_SPACE to your personal space (or another space suitable for testing) and then run
poetry run pytest
To add your own vector database, implement the same interface as the Azure AI Search implementation and add it to the search.py file.
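As a rough illustration of what "implementing the same interface" could look like, the sketch below defines an abstract base class and a toy in-memory backend. The class names and the `upsert`/`delete` methods are assumptions made for this example; check the actual interface in search.py before writing a real backend.

```python
from abc import ABC, abstractmethod
from typing import Dict


class VectorIndex(ABC):
    """Hypothetical interface; the real one in search.py may differ."""

    @abstractmethod
    def upsert(self, doc_id: str, text: str, metadata: Dict[str, str]) -> None:
        """Insert or update a document."""

    @abstractmethod
    def delete(self, doc_id: str) -> None:
        """Remove a document from the index."""


class InMemoryIndex(VectorIndex):
    """Toy backend that stores documents in a dict, for illustration only."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, str]] = {}

    def upsert(self, doc_id: str, text: str, metadata: Dict[str, str]) -> None:
        self.docs[doc_id] = {"text": text, **metadata}

    def delete(self, doc_id: str) -> None:
        # Deleting a missing document is treated as a no-op.
        self.docs.pop(doc_id, None)


index = InMemoryIndex()
index.upsert("page-1", "Hello Confluence", {"space": "DEMO"})
index.delete("page-1")
print(len(index.docs))
```

Keeping the backend behind a small abstract class means the batch logic does not need to know which vector database it is writing to.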
- Figure out how to track which pages were removed from Confluence so that they can be removed from the index
- Azure AI Search skill
- Add more vector databases / search indexes
- Figure out how to handle dependencies to various search engines