Commit 3647bc2

robodev-r2d2, NewDev16, and a-klos authored
feat: add cql support for confluence extractor (#153)
This pull request introduces support for Confluence Query Language (CQL) filtering in the document extraction workflow, allowing users to specify either a `space_key` or a `cql` query to select Confluence pages for processing. The frontend and backend have been updated to handle these new parameters, including user interface improvements and stricter validation to ensure that at least one filtering option is provided. Documentation and tests have also been added to clarify and verify the new behavior.

**Backend changes for Confluence extraction:**

* The backend now accepts either a `space_key` or a `cql` parameter for Confluence extraction, with validation to require at least one; empty values for these parameters are ignored, and an error is raised if both are missing.
* The `ConfluenceParameters` model and extraction logic were updated to support the optional `cql` parameter and propagate it to the loader.
* The extraction process always sets `content_format` to `VIEW` for consistency.
* Tests were added to verify CQL support and validation logic for required parameters.

**Frontend changes for Confluence configuration:**

* The Confluence upload UI now includes fields for both `spaceKey` and `cql` (both optional), updates the payload sent to the backend, and clarifies placeholders and descriptions.
* Localization strings were updated to describe the new CQL filtering feature and improve user guidance.

**Documentation update:**

* The README was updated to document the new Confluence extraction parameters and their behavior.

---------

Co-authored-by: Andreas Klos <[email protected]>
Co-authored-by: Andreas Klos <[email protected]>
1 parent 707ddaf commit 3647bc2

8 files changed: 132 additions and 19 deletions

libs/README.md

Lines changed: 2 additions & 0 deletions
@@ -321,6 +321,8 @@ The following types of information can be extracted:
 - `TABLE`: data in tabular form found in the document
 - `IMAGE`: image found in the document
 
+For Confluence sources, provide the instance `url` and API `token` and include either a `space_key` or a `cql` filter (empty values are ignored). Optional flags such as `include_attachments`, `keep_markdown_format`, and `keep_newlines` mirror the parameters supported by LangChain's `ConfluenceLoader`.
+
 For sitemap sources, additional parameters can be provided, e.g.:
 
 - `web_path`: The URL of the XML sitemap to crawl
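
As an illustration of the README paragraph added above, the following sketch builds the key/value list for a Confluence source. The helper name `build_confluence_kwargs` and the example URL, token, and CQL string are assumptions made for illustration; only the parameter names (`url`, `token`, `space_key`, `cql`, `max_pages`) come from this commit.

```python
from typing import Optional


def build_confluence_kwargs(
    url: str,
    token: str,
    space_key: Optional[str] = None,
    cql: Optional[str] = None,
    max_pages: Optional[int] = None,
) -> list[dict[str, str]]:
    """Build key/value pairs for a Confluence source (hypothetical helper)."""
    pairs = [
        {"key": "url", "value": url.strip()},
        {"key": "token", "value": token},
    ]
    # Blank filters are left out, mirroring "empty values are ignored".
    for key, raw in (("space_key", space_key), ("cql", cql)):
        value = (raw or "").strip()
        if value:
            pairs.append({"key": key, "value": value})
    if max_pages is not None:
        pairs.append({"key": "max_pages", "value": str(max_pages)})
    return pairs


pairs = build_confluence_kwargs(
    url="https://example.atlassian.net",
    token="my-token",
    space_key="   ",  # whitespace only, so it is dropped
    cql="type=page",
)
print(pairs)
```

Sending only the filters that are actually set keeps the request consistent with the backend rule that at least one of `space_key` or `cql` must be present.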

libs/extractor-api-lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py

Lines changed: 25 additions & 5 deletions
@@ -10,6 +10,8 @@
 from extractor_api_lib.impl.mapper.confluence_langchain_document2information_piece import (
     ConfluenceLangchainDocument2InformationPiece,
 )
+from langchain_community.document_loaders.confluence import ContentFormat
+
 
 logger = logging.getLogger(__name__)
 
@@ -54,11 +56,28 @@ async def aextract_content(
             A list of information pieces extracted from Confluence.
         """
         # Convert list of key value pairs to dict
-        confluence_loader_parameters = {
-            x.key: int(x.value) if x.value.isdigit() else x.value for x in extraction_parameters.kwargs
-        }
-        if not confluence_loader_parameters.get("max_pages") or isinstance(
-            confluence_loader_parameters.get("max_pages"), str
+        confluence_loader_parameters = {}
+        for key_value in extraction_parameters.kwargs or []:
+            if key_value is None or key_value.key is None:
+                continue
+
+            value = key_value.value
+            if isinstance(value, str):
+                value = value.strip()
+                if not value and key_value.key in {"space_key", "cql"}:
+                    # Skip empty optional parameters
+                    continue
+                if value.isdigit():
+                    value = int(value)
+
+            confluence_loader_parameters[key_value.key] = value
+
+        if "cql" not in confluence_loader_parameters and "space_key" not in confluence_loader_parameters:
+            raise ValueError("Either 'space_key' or 'cql' must be provided for Confluence extraction.")
+        if (
+            "max_pages" in confluence_loader_parameters
+            and not confluence_loader_parameters.get("max_pages")
+            or isinstance(confluence_loader_parameters.get("max_pages"), str)
         ):
             logging.warning(
                 "max_pages parameter is not set or invalid discarding it. ConfluenceLoader will use default value."
@@ -67,6 +86,7 @@ async def aextract_content(
         # Drop the document_name parameter as it is not used by the ConfluenceLoader
         if "document_name" in confluence_loader_parameters:
             confluence_loader_parameters.pop("document_name", None)
+        confluence_loader_parameters["content_format"] = ContentFormat.VIEW
         document_loader = ConfluenceLoader(**confluence_loader_parameters)
         documents = document_loader.load()
         return [self._mapper.map_document2informationpiece(x, extraction_parameters.document_name) for x in documents]
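
The parameter handling in this file can be exercised in isolation. The sketch below restates the new loop as a standalone function over plain `(key, value)` tuples. The function name `normalize_confluence_kwargs` and the tuple input are assumptions made for illustration, not part of the commit, and the `max_pages` check is left out.

```python
from typing import Any, Iterable, Optional, Tuple


def normalize_confluence_kwargs(
    pairs: Optional[Iterable[Tuple[Optional[str], Any]]],
) -> dict:
    """Hypothetical standalone version of the loop in `aextract_content`."""
    params: dict = {}
    for key, value in pairs or []:
        if key is None:
            continue
        if isinstance(value, str):
            value = value.strip()
            # Blank filters count as "not provided".
            if not value and key in {"space_key", "cql"}:
                continue
            # Purely numeric strings become integers (e.g. max_pages).
            if value.isdigit():
                value = int(value)
        params[key] = value

    if "cql" not in params and "space_key" not in params:
        raise ValueError("Either 'space_key' or 'cql' must be provided for Confluence extraction.")
    return params


params = normalize_confluence_kwargs(
    [
        ("url", "https://example.atlassian.net"),
        ("space_key", "  "),  # whitespace only, so it is skipped
        ("cql", "type=page"),
        ("max_pages", "10"),
    ]
)
print(params)

try:
    normalize_confluence_kwargs([("url", "https://example.atlassian.net"), ("cql", "")])
except ValueError as err:
    print(err)
```

Note that the numeric conversion applies to every key, so a space key consisting only of digits would also reach the loader as an integer.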

libs/extractor-api-lib/src/extractor_api_lib/models/confluence_parameters.py

Lines changed: 7 additions & 1 deletion
@@ -33,7 +33,11 @@ class ConfluenceParameters(BaseModel):
 
     url: StrictStr = Field(description="url of the confluence space.")
     token: StrictStr = Field(description="api key to access confluence.")
-    space_key: StrictStr = Field(description="the space key of the confluence pages.")
+    space_key: Optional[StrictStr] = Field(default=None, description="the space key of the confluence pages.")
+    cql: Optional[StrictStr] = Field(
+        default=None,
+        description="Optional Confluence Query Language (CQL) expression used to filter pages.",
+    )
     include_attachments: Optional[StrictBool] = Field(
         default=False,
         description="whether to include file attachments (e.g., images, documents) in the parsed content. Default is `false`.",
@@ -55,6 +59,7 @@ class ConfluenceParameters(BaseModel):
         "url",
         "token",
         "space_key",
+        "cql",
         "include_attachments",
         "keep_markdown_format",
         "keep_newlines",
@@ -120,6 +125,7 @@ def from_dict(cls, obj: Dict) -> Self:
                 "url": obj.get("url"),
                 "token": obj.get("token"),
                 "space_key": obj.get("space_key"),
+                "cql": obj.get("cql"),
                 "include_attachments": (
                     obj.get("include_attachments") if obj.get("include_attachments") is not None else False
                 ),
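
To make the effect of the model change concrete, here is a minimal stand-in that uses a stdlib dataclass instead of the pydantic model. The class name `ConfluenceParametersSketch` is hypothetical and the sketch performs no type validation; it only shows that `space_key` and `cql` now default to `None` when absent from the input dictionary.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ConfluenceParametersSketch:
    """Stdlib stand-in for the pydantic model; not the real class."""

    url: str
    token: str
    space_key: Optional[str] = None
    cql: Optional[str] = None
    include_attachments: bool = False

    @classmethod
    def from_dict(cls, obj: Dict[str, Any]) -> "ConfluenceParametersSketch":
        include = obj.get("include_attachments")
        return cls(
            url=obj["url"],  # treated as required in this sketch
            token=obj["token"],  # treated as required in this sketch
            space_key=obj.get("space_key"),  # missing key -> None
            cql=obj.get("cql"),  # missing key -> None
            include_attachments=include if include is not None else False,
        )


params = ConfluenceParametersSketch.from_dict(
    {"url": "https://example.atlassian.net", "token": "my-token", "cql": "type=page"}
)
print(params.space_key, params.cql, params.include_attachments)
```

Because both filters are optional at the model level, the "at least one of `space_key` or `cql`" rule is enforced in the extractor rather than in the model.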
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+"""Tests for the ConfluenceExtractor."""
+
+import pytest
+from unittest.mock import MagicMock, patch
+from langchain_core.documents import Document as LangchainDocument
+
+from extractor_api_lib.impl.extractors.confluence_extractor import ConfluenceExtractor
+from extractor_api_lib.models.extraction_parameters import ExtractionParameters
+from extractor_api_lib.models.key_value_pair import KeyValuePair
+from extractor_api_lib.models.dataclasses.internal_information_piece import InternalInformationPiece
+from extractor_api_lib.impl.types.content_type import ContentType
+
+
+@pytest.fixture
+def confluence_mapper():
+    """Return a mapper mock that produces predictable information pieces."""
+    mapper = MagicMock()
+    mapper.map_document2informationpiece.return_value = InternalInformationPiece(
+        type=ContentType.TEXT,
+        metadata={"document": "doc", "id": "id", "related": []},
+        page_content="content",
+    )
+    return mapper
+
+
+@pytest.mark.asyncio
+@patch("extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceLoader")
+async def test_aextract_content_supports_cql(mock_loader_cls, confluence_mapper):
+    """Ensure the extractor forwards the CQL parameter to the loader."""
+    extractor = ConfluenceExtractor(mapper=confluence_mapper)
+    extraction_parameters = ExtractionParameters(
+        document_name="confluence_doc",
+        source_type="confluence",
+        kwargs=[
+            KeyValuePair(key="url", value="https://example.atlassian.net"),
+            KeyValuePair(key="token", value="token"),
+            KeyValuePair(key="cql", value="type=page"),
+        ],
+    )
+
+    mock_loader_instance = MagicMock()
+    mock_loader_instance.load.return_value = [LangchainDocument(page_content="content", metadata={"title": "Doc"})]
+    mock_loader_cls.return_value = mock_loader_instance
+
+    results = await extractor.aextract_content(extraction_parameters)
+
+    assert len(results) == 1
+    confluence_mapper.map_document2informationpiece.assert_called_once()
+    loader_kwargs = mock_loader_cls.call_args.kwargs
+    assert loader_kwargs["cql"] == "type=page"
+    assert "space_key" not in loader_kwargs
+
+
+@pytest.mark.asyncio
+async def test_aextract_content_requires_space_key_or_cql(confluence_mapper):
+    """The extractor must receive either a space key or a CQL expression."""
+    extractor = ConfluenceExtractor(mapper=confluence_mapper)
+    extraction_parameters = ExtractionParameters(
+        document_name="confluence_doc",
+        source_type="confluence",
+        kwargs=[
+            KeyValuePair(key="url", value="https://example.atlassian.net"),
+            KeyValuePair(key="token", value="token"),
+        ],
+    )
+
+    with pytest.raises(ValueError, match="Either 'space_key' or 'cql' must be provided for Confluence extraction."):
+        await extractor.aextract_content(extraction_parameters)
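
The first test relies on a mocked loader class recording the keyword arguments it was constructed with. The sketch below shows that mechanism with the standard library only. The function `load_documents` is a hypothetical stand-in for the extractor, and the mock is passed in directly instead of being installed with `patch`.

```python
from unittest.mock import MagicMock


def load_documents(loader_cls, **kwargs):
    """Hypothetical stand-in: build a loader from kwargs and load its documents."""
    return loader_cls(**kwargs).load()


# The mocked class returns a mocked instance whose load() yields one document.
mock_loader_cls = MagicMock()
mock_loader_cls.return_value.load.return_value = ["doc"]

docs = load_documents(mock_loader_cls, url="https://example.atlassian.net", cql="type=page")

# call_args.kwargs holds the keyword arguments of the most recent call.
loader_kwargs = mock_loader_cls.call_args.kwargs
print(docs, loader_kwargs)
```

In the second test, `pytest.raises(..., match=...)` treats its argument as a regular expression searched against the error message, so the trailing `.` in the expected text matches any character rather than only a literal period.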

services/frontend/libs/admin-app/data-access/document.api.ts

Lines changed: 15 additions & 5 deletions
@@ -9,7 +9,8 @@ axios.defaults.auth = {
 
 // confluence configuration interface
 export interface ConfluenceConfig {
-  spaceKey: string;
+  spaceKey?: string;
+  cql?: string;
   token: string;
   url: string;
   maxPages?: number;
@@ -55,11 +56,20 @@ export class DocumentAPI {
   static async loadConfluence(config: ConfluenceConfig): Promise<void> {
     try {
       // convert config to list of key/value items for backend
-      const payload = [
-        { key: 'url', value: config.url },
+      const payload: { key: string; value: string }[] = [
+        { key: 'url', value: config.url.trim() },
         { key: 'token', value: config.token },
-        { key: 'space_key', value: config.spaceKey },
-      ] as { key: string; value: string }[];
+      ];
+
+      const spaceKey = config.spaceKey?.trim();
+      if (spaceKey) {
+        payload.push({ key: 'space_key', value: spaceKey });
+      }
+
+      const cql = config.cql?.trim();
+      if (cql) {
+        payload.push({ key: 'cql', value: cql });
+      }
 
       if (typeof config.maxPages === 'number') {
         payload.push({ key: 'max_pages', value: String(config.maxPages) });

services/frontend/libs/admin-app/feature-document/DocumentUploadContainer.vue

Lines changed: 9 additions & 4 deletions
@@ -22,6 +22,7 @@ const spaceKey = ref('');
 const confluenceToken = ref('');
 const confluenceUrl = ref('');
 const maxPages = ref<number>();
+const confluenceCql = ref('');
 
 // sitemap configuration refs
 const sitemapName = ref('');
@@ -75,7 +76,8 @@ const handleConfluenceUpload = () => {
     spaceKey: spaceKey.value,
     token: confluenceToken.value,
     url: confluenceUrl.value,
-    maxPages: maxPages.value
+    maxPages: maxPages.value,
+    cql: confluenceCql.value,
   });
 }
 
@@ -182,13 +184,16 @@ const getErrorMessage = (errorType: string) => {
     <label for="confluenceName" class="sr-only"> Confluence Name</label>
     <input v-model="confluenceName" type="text" placeholder="Name" class="input input-bordered w-full" />
     <label for="spaceKey" class="sr-only">Space key</label>
-    <input v-model="spaceKey" type="text" placeholder="Space key" class="input input-bordered w-full" />
+    <input v-model="spaceKey" type="text" placeholder="Space key (optional)" class="input input-bordered w-full" />
+    <label for="confluenceCql" class="sr-only">CQL</label>
+    <input v-model="confluenceCql" type="text" placeholder="CQL query (optional)" class="input input-bordered w-full" />
     <label for="confluenceToken" class="sr-only">Token</label>
     <input v-model="confluenceToken" type="password" placeholder="Token" class="input input-bordered w-full" />
     <label for="maxPages" class="sr-only">Max pages</label>
-    <input v-model.number="maxPages" type="number" placeholder="Max number of pages" class="input input-bordered w-full" />
+    <input v-model.number="maxPages" type="number" placeholder="Max number of pages (optional)" class="input input-bordered w-full" />
   </div>
-  <p class="text-xs opacity-50 mb-4">{{ t('documents.confluenceLoadDescription') }}</p>
+  <p class="text-xs opacity-50">{{ t('documents.confluenceLoadDescription') }}</p>
+  <p class="text-xs opacity-50 mb-4">{{ t('documents.confluenceQueryHint') }}</p>
   <button class="btn btn-sm btn-accent" @click="handleConfluenceUpload">
     {{ t('documents.loadConfluence') }}
   </button>

services/frontend/libs/i18n/admin/de.json

Lines changed: 3 additions & 2 deletions
@@ -10,8 +10,9 @@
     "uploadingDocument": "Wird hochgeladen...",
     "fileUpload": "Datei-Upload",
     "confluenceUpload": "Confluence",
-    "confluenceLoadTitle": "Confluence-Seiten laden",
-    "confluenceLoadDescription": "Klicken Sie auf den Button unten, um Seiten aus Confluence zu laden",
+    "confluenceLoadTitle": "Confluence-Inhalte laden",
+    "confluenceLoadDescription": "Geben Sie Ihre Confluence-Zugangsdaten an und wählen Sie einen Space-Key oder einen CQL-Filter",
+    "confluenceQueryHint": "Lassen Sie die Felder leer, um den gesamten Bereich zu laden, oder geben Sie einen Confluence Query Language (CQL) Ausdruck zum Filtern der Seiten ein",
     "loadConfluence": "Laden starten",
     "fileTypeNotAllowedTitle": "Dateityp nicht erlaubt",
     "fileTypeNotAllowedDescription": "Nur PDF-, DOCX-, PPTX- und XML-Dateien sind erlaubt",

services/frontend/libs/i18n/admin/en.json

Lines changed: 3 additions & 2 deletions
@@ -12,8 +12,9 @@
     "fileTypeNotAllowedDescription": "Only PDF, DOCX, PPTX, and XML files are allowed",
     "fileUpload": "File Upload",
     "confluenceUpload": "Confluence",
-    "confluenceLoadTitle": "Load all Confluence pages from a space",
-    "confluenceLoadDescription": "Click the button below to load pages from Confluence",
+    "confluenceLoadTitle": "Load Confluence content",
+    "confluenceLoadDescription": "Provide your Confluence credentials and choose a space key or CQL filter",
+    "confluenceQueryHint": "Leave fields blank to load the whole space or supply a Confluence Query Language (CQL) expression to filter pages",
     "loadConfluence": "Load Confluence",
     "sitemapUpload": "Sitemap",
     "sitemapLoadTitle": "Load content from a sitemap",
