Create vector indexes for Confluence

Quick intro

What can I use this for?

The indexer is a component in a Retrieval Augmented Generation (RAG) application. One example of such an application is a chatbot that can answer questions from Confluence. The indexer creates a vector index of Confluence pages. The chatbot uses this vector index to find the pages most relevant to the question.
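To make "find the most relevant pages" concrete, here is a minimal sketch of vector retrieval using brute-force cosine similarity. It is an illustration only: the function names, the toy three-dimensional vectors, and the page titles are all made up for this example, and a real search service such as Azure AI Search uses its own indexing and ranking rather than this loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_pages(question_vector, index, k=2):
    """Return the titles of the k pages whose vectors are closest to the question."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(question_vector, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:k]]


# Toy 3-dimensional "embeddings" (hypothetical); real ones come from an embedding model.
toy_index = {
    "VPN setup": [0.9, 0.1, 0.0],
    "Holiday policy": [0.0, 0.2, 0.9],
    "Laptop ordering": [0.7, 0.6, 0.1],
}
print(top_pages([1.0, 0.2, 0.0], toy_index))
```

The question vector would be produced by the same embedding model that was used for the pages, so that both live in the same vector space.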

You don't necessarily need anything else to build a working chatbot. For example, once Confluence is indexed to Azure AI Search, you can chat with it using the Azure AI Playground.

Usage

This application is intended to be run as a batch job. It should work out by itself which assets need to be updated. Currently, it supports Azure AI Search, but it should be relatively easy to add other vector databases.
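One way a batch run can decide what needs updating is to compare each page's last-modified time with the time it was last indexed. The sketch below shows that idea only; it is not taken from the indexer's code, and the function name, the two dictionaries, and the page ids are hypothetical.

```python
from datetime import datetime, timezone


def pages_to_update(pages, indexed_at, full_reindex=False):
    """Return ids of pages that are new or changed since they were last indexed.

    pages:      {page_id: last-modified datetime reported by the wiki}
    indexed_at: {page_id: datetime of the last successful indexing}
    """
    if full_reindex:
        # Ignore the stored state and process everything.
        return sorted(pages)
    return sorted(
        page_id
        for page_id, modified in pages.items()
        if page_id not in indexed_at or modified > indexed_at[page_id]
    )


# Hypothetical state: page 101 changed after indexing, 102 did not, 103 is new.
wiki = {
    "101": datetime(2024, 3, 10, tzinfo=timezone.utc),
    "102": datetime(2024, 2, 1, tzinfo=timezone.utc),
    "103": datetime(2024, 3, 12, tzinfo=timezone.utc),
}
state = {
    "101": datetime(2024, 3, 1, tzinfo=timezone.utc),
    "102": datetime(2024, 3, 1, tzinfo=timezone.utc),
}
print(pages_to_update(wiki, state))  # prints ['101', '103']
```

The `full_reindex` flag plays the same role as a "reindex everything" switch: it bypasses the comparison and returns every page.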

You need to set plenty of environment variables to make this work. See the .env.example file for a list of them.

It has been tested to work with both the Cloud and Server versions of Confluence. In theory it should also work with the Data Center version (it uses the Confluence Python SDK)🤞.

Running

You can run it with Docker:

docker run --env-file .env ghcr.io/piizei/confluence-vector-indexer:latest

Configuration

Check .env.example for example values. The table below lists the configuration (environment) variables.

Name	Description	Default
AZURE_SEARCH_ENDPOINT	The URL of Azure AI Search
AZURE_SEARCH_KEY	Admin key for search. If not specified, the indexer should use managed identity.
AZURE_SEARCH_API_VERSION	Version of the Azure AI Search API (2023-11-01 or later as of rel-1.0)	2023-11-01
AZURE_SEARCH_CONFLUENCE_INDEX	Name of the index to be created for Confluence	confluence
AZURE_SEARCH_EMBEDDING_MODEL	The deployment name in Azure OpenAI or the model name, usually text-embedding-ada-002	text-embedding-ada-002
AZURE_SEARCH_FULL_REINDEX	(true, false) Reindex every page (normally only the pages that changed since the last indexing run)	false
OPENAI_API_KEY	Key to the OpenAI service (no managed identity support for now)
OPENAI_API_VERSION	The API version (2023-05-15 for example)
OPENAI_API_TYPE	azure or none; none is not tested.
OPENAI_API_BASE	Full URL of the Azure OpenAI service (https://myazureopenai.openai.azure.com/)
CONFLUENCE_URL	URL of your Confluence Cloud instance
CONFLUENCE_USER_NAME	Username of your indexer user
CONFLUENCE_PASSWORD	API token created in Confluence (not the login password)
CONFLUENCE_SPACE_FILTER	Comma-separated list (without whitespace) of the Confluence spaces to index
CONFLUENCE_TEST_SPACE	A space against which the integration test runs (your personal space for example)
LOG_LEVEL	One of DEBUG, INFO, WARNING	WARNING
CONFLUENCE_AUTH_METHOD	One of PASSWORD, TOKEN(*)	PASSWORD
INDEX_ATTACHMENTS	Also index attachments (see Attachment indexing for more info)	false

(*) The value of the CONFLUENCE_PASSWORD variable is also used as the token. If CONFLUENCE_AUTH_METHOD is set to PASSWORD, it uses Basic authentication; if it is set to TOKEN, it sends the password (that is, the token) as a Bearer token. This is functionality of the Confluence Python SDK.
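The difference between the two authentication methods comes down to the Authorization header that ends up on each request. The sketch below builds that header by hand to show the two shapes. It is an illustration under assumptions: the function name is hypothetical, and in the indexer itself the header is produced by the Confluence Python SDK rather than by code like this.

```python
import base64


def authorization_header(auth_method, user_name, secret):
    """Build the Authorization header for PASSWORD (Basic) or TOKEN (Bearer)."""
    method = auth_method.upper()
    if method == "PASSWORD":
        # Basic auth: base64 of "user:secret".
        raw = f"{user_name}:{secret}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    if method == "TOKEN":
        # Bearer auth: the secret is sent as-is; the user name is not used.
        return {"Authorization": f"Bearer {secret}"}
    raise ValueError(f"Unsupported auth method: {auth_method!r}")


print(authorization_header("TOKEN", "indexer-bot", "example-token"))
```

In both cases the secret comes from the same variable, which is why CONFLUENCE_PASSWORD holds the token when TOKEN is selected.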

Very special configurations

You can add custom headers to the requests sent to Confluence by adding CONFLUENCE_EXTRA_HEADER_KEY_N and CONFLUENCE_EXTRA_HEADER_VALUE_N variables, where N is the number of the custom header-value pair (as in the example below). This is useful if, for example, you want to use Cloudflare Service Tokens to connect to an on-prem Confluence server.

Example of using Cloudflare Service Tokens:

Name	Value
CONFLUENCE_EXTRA_HEADER_KEY_1	CF-Access-Client-Id
CONFLUENCE_EXTRA_HEADER_KEY_2	CF-Access-Client-Secret
CONFLUENCE_EXTRA_HEADER_VALUE_1	123.access
CONFLUENCE_EXTRA_HEADER_VALUE_2	abc123qwertysecret
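The numbered key and value variables in the table pair up by their numeric suffix. A minimal sketch of how such pairs could be collected into a header dictionary follows; the function name and the regular expression are assumptions made for this example, not the indexer's actual parsing code.

```python
import os
import re

_KEY_PATTERN = re.compile(r"^CONFLUENCE_EXTRA_HEADER_KEY_(\d+)$")


def extra_headers(environ=None):
    """Pair CONFLUENCE_EXTRA_HEADER_KEY_n with CONFLUENCE_EXTRA_HEADER_VALUE_n."""
    env = os.environ if environ is None else environ
    headers = {}
    for name, header_name in env.items():
        match = _KEY_PATTERN.match(name)
        if not match:
            continue
        value = env.get(f"CONFLUENCE_EXTRA_HEADER_VALUE_{match.group(1)}")
        if value is not None:  # skip keys that have no matching value
            headers[header_name] = value
    return headers


# The Cloudflare example from the table, passed in as a plain dict.
example_env = {
    "CONFLUENCE_EXTRA_HEADER_KEY_1": "CF-Access-Client-Id",
    "CONFLUENCE_EXTRA_HEADER_KEY_2": "CF-Access-Client-Secret",
    "CONFLUENCE_EXTRA_HEADER_VALUE_1": "123.access",
    "CONFLUENCE_EXTRA_HEADER_VALUE_2": "abc123qwertysecret",
}
print(extra_headers(example_env))
```

Passing a dictionary instead of reading os.environ directly keeps the function easy to test without touching the real environment.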

Attachment indexing

Attachment indexing is not enabled by default. You can enable it by setting INDEX_ATTACHMENTS to true. The supported document types vary by document indexer. The default implementation is Azure Document Intelligence, which supports PDFs, images, Office files (docx, xlsx, pptx), and HTML. The Azure AI Document Intelligence region must support the preview API version 2023-10-31-preview (at the time of writing: East US, West US2, West Europe).
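As a rough illustration of how an attachment could be screened before being sent to a document indexer, the sketch below checks the file extension against the types listed above. The function name and the extension set are assumptions: in particular, the image extensions (.png, .jpg, .jpeg) are only examples, since the text above says "images" without naming formats.

```python
from pathlib import PurePosixPath

# PDF, Office, and HTML come from the list above; image extensions are assumed.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".pptx", ".html",
    ".png", ".jpg", ".jpeg",
}


def is_indexable(filename, index_attachments):
    """True if attachment indexing is on and the file type looks supported."""
    if not index_attachments:
        return False
    return PurePosixPath(filename).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_indexable("Quarterly Report.PDF", index_attachments=True))
print(is_indexable("notes.txt", index_attachments=True))
```

A different document indexer would need a different extension set, which is why the set is kept as a separate constant.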

Updates & Upgrades

The Git tags match the Docker container tags. The releases are not guaranteed to be backward compatible. An example of a breaking change is the update of the AI Search API version from preview to GA (rel-0.6 to rel-1.0). The indexed fields are compatible (but more may be added). This means the chat application using the index should not break, but you would need to reindex Confluence. If it works, there is no need to update.

DEV

Prerequisites

poetry
For Azure AI Search, you need to have an Azure account and access to Azure OpenAI

Testing

Set CONFLUENCE_TEST_SPACE to your personal space (or another space that is equally suitable for testing) and then run poetry run pytest

Extending

To add your own vector database, just implement the same interface as the Azure AI Search, and add it to the search.py file.
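The actual interface in search.py is not reproduced here, so the following is only a sketch of what "implementing the same interface" could look like. Every name in it (VectorStore, upsert, delete, InMemoryStore, sync) is hypothetical; check the Azure AI Search implementation in the repository for the real method names and signatures.

```python
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical minimal contract a vector database backend would fulfil."""

    def upsert(self, page_id: str, vector: list[float], text: str) -> None: ...

    def delete(self, page_id: str) -> None: ...


class InMemoryStore:
    """Toy backend that keeps everything in a dict; useful for tests."""

    def __init__(self) -> None:
        self.documents: dict[str, tuple[list[float], str]] = {}

    def upsert(self, page_id: str, vector: list[float], text: str) -> None:
        self.documents[page_id] = (vector, text)

    def delete(self, page_id: str) -> None:
        self.documents.pop(page_id, None)  # deleting a missing page is a no-op


def sync(store: VectorStore, changed: dict, removed: list) -> None:
    """Write changed pages to the store, then drop removed ones."""
    for page_id, (vector, text) in changed.items():
        store.upsert(page_id, vector, text)
    for page_id in removed:
        store.delete(page_id)


store = InMemoryStore()
sync(store, {"1": ([0.1, 0.2], "Hello"), "2": ([0.3, 0.4], "World")}, removed=[])
sync(store, {}, removed=["2"])
print(sorted(store.documents))
```

Because sync depends only on the protocol, a new backend can be swapped in without changing the calling code.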

TODO

Figure out how to track which removed pages have to be removed from the index
Azure AI search skill
Add more vector databases / search indexes
Figure out how to handle dependencies to various search engines
